Chinese NLP Example: LSTM with Attention Model for Chinese Medical Report Prediction, Part 1

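Natural language processing for Chinese is not as convenient as for English, and you run into all kinds of problems. The main tasks: remove errors already present in the data, build dictionaries that map Chinese characters to numbers, replace the special characters found in Chinese text, and bring every text to the same length.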

Part 1 covers how to preprocess the Chinese text before it goes into the model. Without further ado, let's get started.

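The data looks like this: 15,997 observations, and the goal is to use description to predict conclusion. Every description given as input has a corresponding conclusion as output. (The header row did not paste cleanly: the first column, 0, 1, 2, ..., is the DataFrame index, not the id.)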

	id	description	conclusion
0	6002920	双肺未见明显实质性病变，心影大小形态正常。双侧膈面尚清，双侧肋膈角锐利。	双肺、心、双膈未见明显异常。
1	6003323	双肺未见明显实质性病变，心影大小形态正常。双侧膈面尚清，双侧肋膈角锐利。	双肺、心、双膈未见明显异常。
2	7462283	胸廓对称，双肺野透亮度可，肺纹理清晰，走行自然，双肺野内未见异常密度影，双肺门影不大。心影大...	两肺、心、膈未见异常。
3	7943475	双肺野透亮度可，双肺野内未见异常密度影，双肺门影不大。心影大小形态正常。双侧膈面光整，肋膈角锐利。	双肺、心、膈未见明显异常。
4	29169834	双肺纹理增强，未见明显实质性病变，双侧肺门未见异常。心影大小形态正常。双侧膈面光整，肋膈角锐利。	双肺纹理增强。

1 Read and load data

1.1 Save description and conclusion as separate txt files for later reading

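# df is the DataFrame previewed above
desc = df[['description']]
con = df[['conclusion']]
# one record per line; the column name becomes the first line of each file
desc.to_csv('descri.txt', sep=' ', index=False)
con.to_csv('conclu.txt', sep=' ', index=False)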

1.2 read txt data

# read description txt
filename = "descri.txt"
with open(filename, encoding='utf-8') as f:
    raw_text = f.read()
lines_of_text = raw_text.split('\n')
print(lines_of_text[:5])

['description', '双肺未见明显实质性病变，心影大小形态正常。双侧膈面尚清，双侧肋膈角锐利。', '双肺未见明显实质性病变，心影大小形态正常。双侧膈面尚清，双侧肋膈角锐利。', '胸廓对称，双肺野透亮度可，肺纹理清晰，走行自然，双肺野内未见异常密度影，双肺门影不大。心影大小形态正常。双侧膈面光整，肋膈角锐利。', '双肺野透亮度可，双肺野内未见异常密度影，双肺门影不大。心影大小形态正常。双侧膈面光整，肋膈角锐利。']

# read conclusion txt
filename2 = 'conclu.txt'
with open(filename2, encoding='utf-8') as f:
    raw_text = f.read()
lines_of_target = raw_text.split('\n')
print(lines_of_target[:10])

['conclusion', '双肺、心、双膈未见明显异常。', '双肺、心、双膈未见明显异常。', '两肺、心、膈未见异常。', '双肺、心、膈未见明显异常。', '双肺纹理增强。', '双肺纹理增强。', '双肺纹理增强。', '双肺纹理增强，必要时进一步检查。', '双肺纹理增强；左下肺条片灶，建议进一步检查；双肋膈角钝。']
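The read step can also be wrapped in a small helper. The following is only a sketch: `read_lines` and the demo file are hypothetical names, and it assumes the files are UTF-8 encoded. It uses `splitlines()`, so the newline at the end of the file does not produce an empty last element.

```python
import os
import tempfile


def read_lines(path, encoding="utf-8"):
    """Return the lines of a text file without their line terminators."""
    with open(path, encoding=encoding) as f:
        return f.read().splitlines()


# Demo on a throwaway file that mimics the layout of descri.txt.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("description\n双肺纹理增强。\n心影大小形态正常。\n")
    demo_lines = read_lines(demo_path)

print(demo_lines)
```

Passing the encoding explicitly keeps the result independent of the platform default.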

2 Clean Data

2.1 Remove empty lines and the header. Only the code for the input (description) is shown here; the output (conclusion) is processed the same way, only the variable names differ.

# remove empty lines and the header
lines_of_text = [line for line in lines_of_text if len(line) > 0]
lines_of_text = lines_of_text[1:]
# check the number of lines (the data turned out to contain no empty records)
print(len(lines_of_text))

15997

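2.2 Build the dictionaries: every Chinese character is represented by a number, with each unique character mapped to a unique number. These mappings are needed later in the model. This function is applied to both the input and the output.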

# create dicts converting Chinese characters to numbers and back
def create_lookup_tables(input_data):
    vocab = set(input_data)

    # character -> number
    vocab_to_int = {word: idx for idx, word in enumerate(vocab)}

    # number -> character (same set, so the same order as above)
    int_to_vocab = dict(enumerate(vocab))

    return vocab_to_int, int_to_vocab
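To see what the two tables do, here is a small sketch on a toy string. `create_sorted_lookup_tables` is a hypothetical variant, not the function used in this post: it sorts the vocabulary so the ids are the same on every run, which a plain set does not guarantee.

```python
def create_sorted_lookup_tables(input_data):
    """Map each unique symbol to an integer id and back.

    Sorting is an addition for reproducibility; the ids themselves
    carry no meaning.
    """
    vocab = sorted(set(input_data))
    vocab_to_int = {word: idx for idx, word in enumerate(vocab)}
    int_to_vocab = dict(enumerate(vocab))
    return vocab_to_int, int_to_vocab


sample = list("双肺纹理增强双肺")
toy_vocab_to_int, toy_int_to_vocab = create_sorted_lookup_tables(sample)

# Encode the characters, then decode them again as a round-trip check.
encoded = [toy_vocab_to_int[ch] for ch in sample]
decoded = "".join(toy_int_to_vocab[i] for i in encoded)
print(len(toy_vocab_to_int), encoded, decoded)
```

Repeated characters get the same id, and decoding the ids gives back the original string.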

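2.3 With the Chinese characters handled, the special Chinese punctuation marks and symbols still need attention. Each symbol is first replaced by a letter, so that when the function from 2.2 runs, these punctuation marks get their own numbers too. The symbol list depends on the data and was built by hand; these are the ones that occurred in my data.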

def token_lookup():
    symbols = ['。', '，', '“', '”', '；', '！', '？', '（', '）', '——', '\n', '+', '*', ':']

    tokens = ['P', 'C', 'Q', 'T', 'S', 'E', 'M', 'I', 'O', 'A', 'D', 'J', 'K', 'L']

    return dict(zip(symbols, tokens))
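Because the placeholders are ordinary Latin letters, it is worth checking that none of them already occurs in the raw text (a report that mentions "CT", for instance, would collide with C and T). The sketch below uses a shortened stand-in for the table above; `replace_symbols` and `find_collisions` are hypothetical helpers.

```python
# Shortened stand-in for the symbol table returned by token_lookup().
toy_token_dict = {"。": "P", "，": "C", "”": "T", "\n": "D"}


def replace_symbols(text, token_dict):
    """Replace every symbol in text with its placeholder letter."""
    for symbol, token in token_dict.items():
        text = text.replace(symbol, token)
    return text


def find_collisions(text, token_dict):
    """Return the placeholder letters that already occur in the raw text."""
    return sorted(set(token_dict.values()) & set(text))


sample = "双肺纹理增强，未见明显实质性病变。\n心影大小形态正常。"
print(find_collisions(sample, toy_token_dict))
print(replace_symbols(sample, toy_token_dict))
```

Running `find_collisions` on the whole corpus before replacing anything shows whether the chosen letters are safe for that data.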

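2.4 Apply 2.2 and 2.3 to turn the Chinese text into numbers and get the mapping dictionaries, while keeping the original line breaks. The goal of this project is one output conclusion per line of description, so the line structure has to survive. Look closely at 2.3: the line break \n is represented by the letter D (D must not already occur in the data, otherwise lines would be split in the wrong places). Every D therefore marks the end of one report, and the numbers before it become a list of their own. The result is a list containing 15997 lists. Without this step the data would be one huge list with no line structure, and prediction would be impossible.

The text is built with '\n'.join(...), so the last line has no D after it. It is appended separately after the loop; otherwise it would be dropped.

Finally, the result is saved as a pickle.

The output needs the same processing. To avoid repetition it is not shown here. The input and the output get different mappings, which is fine as long as numbers and characters correspond one to one within each. The output is saved as prepro.p, which shows up later.

import pickle

def preprocess_and_save_data(text, token_lookup, create_lookup_tables):
    token_dict = token_lookup()
    # replace punctuation with tokens
    for key, token in token_dict.items():
        text = text.replace(key, token)
    text = list(text)

    vocab_to_int, int_to_vocab = create_lookup_tables(text)
    int_text = [vocab_to_int[word] for word in text]

    # split the flat id list back into one list per line at every 'D' (line break)
    newline_id = vocab_to_int['D']
    result_text = []
    start = 0
    for i, idx in enumerate(int_text):
        if idx == newline_id:
            result_text.append(int_text[start:i])
            start = i + 1  # skip the line-break token itself
    # the last line has no trailing line break, so add it after the loop
    result_text.append(int_text[start:])
    print(result_text[-1])
    print(len(result_text))

    # persist the data with pickle
    with open('preprocess.p', 'wb') as f:
        pickle.dump((result_text, vocab_to_int, int_to_vocab, token_dict), f)

preprocess_and_save_data('\n'.join(lines_of_text), token_lookup, create_lookup_tables)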
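An alternative that avoids the separator letter altogether is to encode each line on its own, so the line structure is never lost in the first place. This is a sketch of that idea, not the approach used in this post; `encode_lines` and the toy inputs are hypothetical.

```python
def encode_lines(lines, token_dict):
    """Encode each line separately so no line-break token is needed."""
    cleaned = []
    for line in lines:
        for symbol, token in token_dict.items():
            line = line.replace(symbol, token)
        cleaned.append(line)

    # Build one shared vocabulary over all lines.
    vocab = sorted(set("".join(cleaned)))
    vocab_to_int = {ch: idx for idx, ch in enumerate(vocab)}
    int_to_vocab = dict(enumerate(vocab))

    encoded = [[vocab_to_int[ch] for ch in line] for line in cleaned]
    return encoded, vocab_to_int, int_to_vocab


toy_lines = ["双肺纹理增强。", "心影大小形态正常。"]
toy_tokens = {"。": "P", "，": "C"}  # shortened stand-in for token_lookup()
encoded_lines, toy_v2i, toy_i2v = encode_lines(toy_lines, toy_tokens)
print(len(encoded_lines), [len(line) for line in encoded_lines])
```

Each input line yields exactly one list of ids, so nothing depends on a letter being absent from the data or on the length of the last line.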

And with that, everything is saved.

An original Chinese sentence now looks like this:

[406, 320, 381, 107, 289, 355, 630, 481, 594, 89, 517, 489, 199, 12, 263, 516, 362]

Everything has turned into numbers. Pretty magical, isn't it?

3 Padding or Truncate input and output

3.1 Load the pickled data

import pickle

def load_preprocess():
    with open('preprocess.p', mode='rb') as f:
        return pickle.load(f)

def load_preprocess2():
    with open('prepro.p', mode='rb') as f:
        return pickle.load(f)

# input (description) data and dictionaries
result_desc, vocab1_to_int, int_to_vocab1, token_dict = load_preprocess()
# output (conclusion) data and dictionaries
result_target, vocab2_to_int, int_to_vocab2, token_dict = load_preprocess2()

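vocab1_to_int: the input dictionary; keys are characters, values are numbers

vocab2_to_int: the output dictionary; keys are characters, values are numbers

int_to_vocab1: the input dictionary; keys are numbers, values are characters

int_to_vocab2: the output dictionary; keys are numbers, values are characters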
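To give an idea of the shape of these dictionaries, here is a sketch with made-up toy values; the real contents come from preprocess.p and prepro.p, and the ids there will differ. `peek` and `are_inverse` are hypothetical helpers.

```python
from itertools import islice

# Toy stand-ins; the characters and ids are invented for illustration.
toy_vocab_to_int = {"双": 0, "肺": 1, "P": 2}
toy_int_to_vocab = {0: "双", 1: "肺", 2: "P"}


def peek(d, n=3):
    """Return the first n items of a dict as a list of pairs."""
    return list(islice(d.items(), n))


def are_inverse(forward, backward):
    """True if backward maps every value of forward back to its key."""
    return len(forward) == len(backward) and all(
        backward.get(idx) == ch for ch, idx in forward.items()
    )


print(peek(toy_vocab_to_int))
print(peek(toy_int_to_vocab))
print(are_inverse(toy_vocab_to_int, toy_int_to_vocab))
```

The same inverse check can be run on vocab1_to_int with int_to_vocab1 and on vocab2_to_int with int_to_vocab2 after loading.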

Note that the input and output vocabularies are not identical: the input has 617 unique characters and the output has 645.

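3.2 Add the character '毕' to the dictionaries

The idea is simple: to turn a list of length 20 into a list of length 30, you keep appending 0 until the length is 30. '毕' plays the same role as that 0, so it must not already occur in the data. Because it never occurs in the data, it is also missing from the dictionaries, so '毕' and its number have to be added to them. In the input dictionary '毕' gets the number 617, in the output dictionary 645. Both are appended at the end.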

int_to_vocab1[617]='毕'
int_to_vocab2[645]='毕'
vocab2_to_int['毕'] = 645
vocab1_to_int['毕']= 617

print(len(vocab1_to_int))
print(len(vocab2_to_int))
print(len(int_to_vocab1))
print(len(int_to_vocab2))

618
646
618
646

With that, '毕' is in place: the vocabulary sizes went from 617 and 645 to 618 and 646, one extra character each.
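The ids 617 and 645 are simply the sizes of the two dictionaries before the new entry, so they can be computed instead of typed in. A minimal sketch, assuming the existing ids run from 0 to len - 1 without gaps (which is what enumerate produces); `add_pad_token` is a hypothetical helper.

```python
def add_pad_token(vocab_to_int, int_to_vocab, pad_char="毕"):
    """Register pad_char under the next free id and return that id."""
    if pad_char in vocab_to_int:
        raise ValueError("pad character already occurs in the vocabulary")
    pad_id = len(vocab_to_int)  # next free id when ids are 0..len-1
    vocab_to_int[pad_char] = pad_id
    int_to_vocab[pad_id] = pad_char
    return pad_id


toy_v2i = {"双": 0, "肺": 1}
toy_i2v = {0: "双", 1: "肺"}
pad_id = add_pad_token(toy_v2i, toy_i2v)
print(pad_id, len(toy_v2i), len(toy_i2v))
```

The check at the top also catches the case where the padding character does occur in the data after all.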

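3.3 Why add '毕'? Because the sentences need to be padded or truncated.

Input sentences range from 20 to 88 characters and output sentences from 10 to 50. This step pads or truncates every input sentence to length 80 and every output sentence to length 40. More concretely, an input sentence shorter than 80 characters is filled with 617, the number for '毕', until it reaches 80; an output sentence is filled with 645 until it reaches 40. Anything longer than 80 or 40 is truncated to that length. Two functions do the work, applied with my favorite tool, the list comprehension.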

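# truncate l to n items, or pad it to n items with the id of '毕'
def trp_target(l, n):
    return l[:n] + [645]*(n-len(l))  # 645: '毕' in the output dictionary
def trp_desc(l, n):
    return l[:n] + [617]*(n-len(l))  # 617: '毕' in the input dictionary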

desc=[trp_desc(item,80) for item in result_desc]
target=[trp_target(item,40) for item in result_target]

print(len(desc))
print(len(target))
print(len(desc[3]))
print(len(target[3]))
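The two functions differ only in the padding id, so they can be folded into one. A sketch with a hypothetical `pad_or_truncate` and toy values: the slice handles sequences that are too long, and multiplying a list by a negative number yields an empty list, so no padding is added in that case.

```python
def pad_or_truncate(seq, length, pad_id):
    """Return seq cut or right-padded with pad_id to exactly `length` items."""
    return seq[:length] + [pad_id] * (length - len(seq))


short_seq = [5, 8, 2]
long_seq = list(range(10))
padded = pad_or_truncate(short_seq, 6, pad_id=99)
truncated = pad_or_truncate(long_seq, 6, pad_id=99)
print(padded)
print(truncated)
```

With the values from this post, the calls would be pad_or_truncate(item, 80, 617) for the input and pad_or_truncate(item, 40, 645) for the output.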

The data then looks like this:

input:

[405, 75, 108, 282, 221, 614, 435, 53, 93, 315, 375, 108, 282, 507, 13, 596, 29, 295, 563, 322, 119, 545, 545, 606, 459, 213, 435, 605, 617,617,617,617,617,617,617,617...]

output:

[406, 320, 381, 107, 289, 355, 630, 481, 594, 89, 517, 489, 199, 12, 263, 516, 362,645,645,645,645....]

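After this processing, once the model has run and the numbers are converted back to Chinese, simply delete every '毕'. The output is not affected.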
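That decoding step could look roughly like this. It is a sketch with a hypothetical `decode` function and toy dictionaries: it drops the padding character and turns the placeholder letters back into punctuation, which only works if those letters never occurred in the original text.

```python
def decode(ids, int_to_vocab, token_dict, pad_char="毕"):
    """Turn a padded id sequence back into text with its punctuation."""
    chars = [int_to_vocab[i] for i in ids]
    text = "".join(ch for ch in chars if ch != pad_char)
    # token_dict maps symbol -> letter, so replace in the other direction.
    for symbol, token in token_dict.items():
        text = text.replace(token, symbol)
    return text


toy_i2v = {0: "双", 1: "肺", 2: "P", 3: "毕"}
toy_tokens = {"。": "P"}  # shortened stand-in for token_lookup()
restored = decode([0, 1, 2, 3, 3], toy_i2v, toy_tokens)
print(restored)
```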

Input and output are now both numbers. Each list is one sentence, every input has length 80 and every output has length 40. The data preprocessing part is done.
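Before moving on to the model, a quick sanity check on the final lists can catch slips early. The sketch below is an optional extra with a hypothetical `check_batch` helper and toy data; it verifies that every sequence has the expected length and only contains ids that exist in the dictionary.

```python
def check_batch(sequences, expected_len, vocab_size):
    """Raise ValueError on a wrong length or an id outside the vocabulary."""
    for row, seq in enumerate(sequences):
        if len(seq) != expected_len:
            raise ValueError(
                f"row {row}: length {len(seq)}, expected {expected_len}"
            )
        if any(not 0 <= idx < vocab_size for idx in seq):
            raise ValueError(f"row {row}: id outside [0, {vocab_size})")
    return True


toy_batch = [[0, 1, 2, 3], [2, 2, 3, 3]]
print(check_batch(toy_batch, expected_len=4, vocab_size=4))
```

For the data in this post, the calls would be check_batch(desc, 80, 618) and check_batch(target, 40, 646).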

Next comes the model. I meant to write it all in one go, but I'm tired, so that will have to wait for Part 2...
