Chinese NLP Example: LSTM with Attention Model Applied to Chinese Medical Report Prediction, Part 1

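Natural language processing for Chinese is not as convenient as for English, and all sorts of problems come up. Broadly, besides removing errors already present in the data, you need to build dictionaries between Chinese characters and numbers, replace the special characters in the Chinese text, and process the text so that every sequence has the same length, and so on.

Part 1 covers how to preprocess Chinese text before building the model. Enough talk, let's get started.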

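The data looks like this: 15,997 observations, and the goal is to use description to predict conclusion. Every description input has a corresponding conclusion output. The header got misaligned when it was copied over: the first column is the row index and has no name.

	id	description	conclusion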
0	6002920	雙肺未見明顯實質性病變，心影大小形態正常。雙側膈面尚清，雙側肋膈角銳利。	雙肺、心、雙膈未見明顯異常。
1	6003323	雙肺未見明顯實質性病變，心影大小形態正常。雙側膈面尚清，雙側肋膈角銳利。	雙肺、心、雙膈未見明顯異常。
2	7462283	胸廓對稱，雙肺野透亮度可，肺紋理清晰，走行自然，雙肺野內未見異常密度影，雙肺門影不大。心影大...	兩肺、心、膈未見異常。
3	7943475	雙肺野透亮度可，雙肺野內未見異常密度影，雙肺門影不大。心影大小形態正常。雙側膈面光整，肋膈角銳利。	雙肺、心、膈未見明顯異常。
4	29169834	雙肺紋理增強，未見明顯實質性病變，雙側肺門未見異常。心影大小形態正常。雙側膈面光整，肋膈角銳利。	雙肺紋理增強。

1. Read and load data

1.1 Save description and conclusion as separate txt files so they can be read back later

desc=df[['description']]
con=df[['conclusion']]
desc.to_csv('descri.txt',sep=' ',index=False)
con.to_csv('conclu.txt',sep=' ',index=False)

1.2 read txt data

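# read description txt (explicit encoding so the Chinese text loads the same on every platform)
filename = "descri.txt"
with open(filename, encoding="utf-8") as f:
    raw_text = f.read()
lines_of_text = raw_text.split('\n')
print(lines_of_text[:5])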

['description', '雙肺未見明顯實質性病變，心影大小形態正常。雙側膈面尚清，雙側肋膈角銳利。', '雙肺未見明顯實質性病變，心影大小形態正常。雙側膈面尚清，雙側肋膈角銳利。', '胸廓對稱，雙肺野透亮度可，肺紋理清晰，走行自然，雙肺野內未見異常密度影，雙肺門影不大。心影大小形態正常。雙側膈面光整，肋膈角銳利。', '雙肺野透亮度可，雙肺野內未見異常密度影，雙肺門影不大。心影大小形態正常。雙側膈面光整，肋膈角銳利。']

# read conclusion text
filename2 = 'conclu.txt'
with open(filename2, encoding="utf-8") as f:
    raw_text = f.read()
lines_of_target = raw_text.split('\n')
print(lines_of_target[:10])

['conclusion', '雙肺、心、雙膈未見明顯異常。', '雙肺、心、雙膈未見明顯異常。', '兩肺、心、膈未見異常。', '雙肺、心、膈未見明顯異常。', '雙肺紋理增強。', '雙肺紋理增強。', '雙肺紋理增強。', '雙肺紋理增強，必要時進一步檢查。', '雙肺紋理增強；左下肺條片竈，建議進一步檢查；雙肋膈角鈍。']

2 Clean Data

2.1 Remove empty lines and the header. Only the code for the input (description) is shown here; the output (conclusion) is processed the same way, just with different variable names.

# remove empty lines and the header
lines_of_text = [line for line in lines_of_text if len(line) > 0]
lines_of_text = lines_of_text[1:]
# check the number of lines; it should match the 15997 observations
print(len(lines_of_text))

15997

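2.2 Build the dictionaries so that every Chinese character is represented by a number: each unique character maps to a unique number. These mappings are needed later in the model. This function is applied to both the input and the output.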

# create dicts converting Chinese characters to numbers and back
def create_lookup_tables(input_data):
    vocab = set(input_data)

    # mapping from character to number
    vocab_to_int = {word: idx for idx, word in enumerate(vocab)}

    # mapping from number to character
    int_to_vocab = dict(enumerate(vocab))

    return vocab_to_int, int_to_vocab
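
A Python set of strings has no stable iteration order across interpreter runs, so the ids assigned above can differ each time the script is re-run. Here is a minimal sketch of a reproducible variant, assuming that sorted order is acceptable for the vocabulary. The name create_sorted_lookup_tables is hypothetical and not part of the original code.

```python
def create_sorted_lookup_tables(input_data):
    """Hypothetical variant of create_lookup_tables with a stable id order."""
    # Sorting the unique characters makes the mapping identical on every run.
    vocab = sorted(set(input_data))
    vocab_to_int = {word: idx for idx, word in enumerate(vocab)}
    int_to_vocab = {idx: word for word, idx in vocab_to_int.items()}
    return vocab_to_int, int_to_vocab


# Round trip on a short sample: characters -> ids -> characters.
sample = list("雙肺紋理增強")
vocab_to_int, int_to_vocab = create_sorted_lookup_tables(sample)
encoded = [vocab_to_int[ch] for ch in sample]
decoded = "".join(int_to_vocab[i] for i in encoded)
print(encoded)
print(decoded)
```

The two dictionaries are still exact inverses of each other, which is all the later steps rely on.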

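2.3 After the Chinese characters, the special Chinese punctuation marks and symbols need handling too. Each symbol is first replaced by a letter, so that when the function from 2.2 runs, the punctuation also gets its own number. The symbol list depends on the data and is built by hand; these are the ones that appeared in my data.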

def token_lookup():
    symbols = ['。', '，', '“', "”", '；', '！', '？', '（', '）', '——', '\n', '+', '*', ':']

    tokens = ["P", "C", "Q", "T", "S", "E", "M", "I", "O", "A", "D",'J','K','L']

    return dict(zip(symbols, tokens))
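
Since the placeholders are ordinary letters, any of them that already occurs in the reports would later be indistinguishable from punctuation. A quick check along these lines can catch that before encoding. This is only a sketch: find_placeholder_collisions is a hypothetical helper, and the two sample strings are made up for illustration.

```python
def find_placeholder_collisions(raw_text, token_dict):
    """Return the placeholder letters that already occur in the raw text."""
    return sorted(tok for tok in set(token_dict.values()) if tok in raw_text)


# Small stand-in for token_lookup(), with only two symbols for illustration.
example_tokens = {'。': 'P', '，': 'C'}

# Made-up sentence containing the Latin letters "CT": 'C' collides.
print(find_placeholder_collisions('胸部CT未見異常。', example_tokens))  # ['C']
# No Latin letters in this one, so nothing collides.
print(find_placeholder_collisions('雙肺紋理增強。', example_tokens))  # []
```

If the list is not empty, one option is to pick placeholder characters that cannot appear in the data.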

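2.4 Apply 2.2 and 2.3 to turn the Chinese text into numbers and get the mapping dictionaries, while keeping the original line breaks. The goal of this project is for one line of description to map to one output conclusion, so the line breaks must be preserved. Look closely at 2.3: the newline \n is represented by the letter D (D must not already occur in the data, or the lines will be split incorrectly). So every D marks the end of a line, and the numbers before it form a separate list. The result is a list containing 15997 lists. Without this step the data would become one huge list with no line breaks, and prediction would be impossible.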

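The last line has no separator after it, so it is appended once the loop finishes; otherwise it would be left out and the result would stop at the separator before the last line. This also avoids hard-coding the length of the last line, which is 71 here.

Finally, the result is saved as a pickle.

The output needs the same processing. To avoid repetition it is not shown here. The mappings generated for input and output are different, which does not matter as long as numbers and Chinese characters map one to one. The output is saved as prepro.p, as you will see later.

import pickle

def preprocess_and_save_data(text, token_lookup, create_lookup_tables):
    token_dict = token_lookup()
    # replace punctuation marks with their tokens
    for key, token in token_dict.items():
        text = text.replace(key, token)
    text = list(text)

    vocab_to_int, int_to_vocab = create_lookup_tables(text)
    int_text = [vocab_to_int[word] for word in text]

    # split the id sequence at every newline token, dropping the token itself
    newline_id = vocab_to_int[token_dict['\n']]
    result_text = []
    current = []
    for idx in int_text:
        if idx == newline_id:
            result_text.append(current)
            current = []
        else:
            current.append(idx)
    # the last line has no trailing separator, so add it here
    result_text.append(current)
    print(result_text[-1])
    print(len(result_text))

    # persist the data with pickle
    with open('preprocess.p', 'wb') as f:
        pickle.dump((result_text, vocab_to_int, int_to_vocab, token_dict), f)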

preprocess_and_save_data('\n'.join(lines_of_text), token_lookup, create_lookup_tables)

That's everything saved.

An original Chinese sentence now looks like this:

[406, 320, 381, 107, 289, 355, 630, 481, 594, 89, 517, 489, 199, 12, 263, 516, 362]

Everything has turned into numbers. Pretty magical, isn't it?
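
To sanity-check the encoding, the numbers can be mapped back to text by reversing both dictionaries. The sketch below uses a tiny made-up vocabulary rather than the real one, and decode is a hypothetical helper. It assumes the placeholder letters do not occur in the data for any other reason.

```python
def decode(int_seq, int_to_vocab, token_dict):
    """Map ids back to characters, then placeholder letters back to symbols."""
    token_to_symbol = {token: symbol for symbol, token in token_dict.items()}
    chars = [int_to_vocab[i] for i in int_seq]
    return "".join(token_to_symbol.get(ch, ch) for ch in chars)


# Toy dictionaries for illustration only; the real ones come from the pickle.
toy_token_dict = {'。': 'P', '，': 'C'}
toy_int_to_vocab = {0: '雙', 1: '肺', 2: '正', 3: '常', 4: 'P'}

print(decode([0, 1, 2, 3, 4], toy_int_to_vocab, toy_token_dict))  # 雙肺正常。
```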

3 Padding or Truncate input and output

3.1 Load the pickle data

def load_preprocess():
    return pickle.load(open('preprocess.p', mode='rb'))
def load_preprocess2():
    return pickle.load(open('prepro.p', mode='rb'))

result_desc, vocab1_to_int, int_to_vocab1, token_dict = load_preprocess()
result_target,vocab2_to_int,int_to_vocab2,token_dict = load_preprocess2()

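vocab1_to_int: the input dictionary; keys are Chinese characters, values are numbers

vocab2_to_int: the output dictionary; keys are Chinese characters, values are numbers

int_to_vocab1: the input dictionary; keys are numbers, values are Chinese characters

int_to_vocab2: the output dictionary; keys are numbers, values are Chinese characters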

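Note that the input and output vocabularies are not identical: for example, the input has 617 unique characters and the output has 645.

3.2 Add the character 畢 to the dictionaries.

It's simple: if you want to turn a list of length 20 into a list of length 30, you keep appending 0 until the length is 30. The character 畢 plays the same role. It must not already appear in the data. Since the model has never seen 畢 in the data, 畢 and its number have to be added to the dictionaries before it can be used. In the input dictionary 畢 maps to 617, and in the output dictionary to 645; both are appended at the end.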

int_to_vocab1[617]='畢'
int_to_vocab2[645]='畢'
vocab2_to_int['畢'] = 645
vocab1_to_int['畢']= 617

print(len(vocab1_to_int))
print(len(vocab2_to_int))
print(len(int_to_vocab1))
print(len(int_to_vocab2))

618
646
618
646

That's it. The vocabulary sizes go from 617 and 645 to 618 and 646, because one character was added.
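
The ids 617 and 645 are specific to this data. A small helper can derive the padding id from the dictionary size instead. This is a sketch only, assuming the existing ids run from 0 to len-1 without gaps, which is what enumerate produces in 2.2. The name add_pad_token is hypothetical.

```python
def add_pad_token(vocab_to_int, int_to_vocab, pad_char='畢'):
    """Append pad_char to both dictionaries and return its new id."""
    if pad_char in vocab_to_int:
        raise ValueError("padding character already occurs in the data")
    # With contiguous ids 0..n-1, the next free id is n.
    pad_id = len(vocab_to_int)
    vocab_to_int[pad_char] = pad_id
    int_to_vocab[pad_id] = pad_char
    return pad_id


# Toy dictionaries for illustration only.
toy_vocab_to_int = {'雙': 0, '肺': 1}
toy_int_to_vocab = {0: '雙', 1: '肺'}
pad_id = add_pad_token(toy_vocab_to_int, toy_int_to_vocab)
print(pad_id, len(toy_vocab_to_int))  # 2 3
```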

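3.3 Why add 畢? Because I want to pad or truncate the sentences.

Input sentences range from 20 to 88 characters in length, and output sentences from 10 to 50. This step pads or truncates every input sentence to 80 and every output sentence to 40. More specifically, input sentences shorter than 80 characters are filled with 617, the number for 畢, until they reach 80; output sentences are filled with 645 until they reach 40. Sentences longer than 80 or 40 are truncated to 80 or 40. Here are two functions, used with my favorite tool, the list comprehension.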

def trp_target(l, n):
    return l[:n] + [645]*(n-len(l))
def trp_desc(l, n):
    return l[:n] + [617]*(n-len(l))

desc=[trp_desc(item,80) for item in result_desc]
target=[trp_target(item,40) for item in result_target]

print(len(desc))
print(len(target))
print(len(desc[3]))
print(len(target[3]))

The data now looks like this:

input:

[405, 75, 108, 282, 221, 614, 435, 53, 93, 315, 375, 108, 282, 507, 13, 596, 29, 295, 563, 322, 119, 545, 545, 606, 459, 213, 435, 605, 617,617,617,617,617,617,617,617...]

output:

[406, 320, 381, 107, 289, 355, 630, 481, 594, 89, 517, 489, 199, 12, 263, 516, 362,645,645,645,645....]

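After this processing, once the model has run and the numbers are converted back to Chinese, just delete every 畢. It does not affect the output.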
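
Padding, truncating and later stripping the padding can be sketched with the padding id passed in as a parameter rather than hard-coded. The function names and the short id lists here are illustrative assumptions, not the original code.

```python
def pad_or_truncate(seq, length, pad_id):
    """Cut seq to `length`, or fill it with pad_id up to `length`."""
    # A negative repeat count gives an empty list, so long inputs are only cut.
    return seq[:length] + [pad_id] * (length - len(seq))


def strip_padding(seq, pad_id):
    """Drop every padding id from a sequence, e.g. from a model prediction."""
    return [i for i in seq if i != pad_id]


padded = pad_or_truncate([5, 8, 2], 6, pad_id=645)
print(padded)                                        # [5, 8, 2, 645, 645, 645]
print(pad_or_truncate([5, 8, 2, 9], 2, pad_id=645))  # [5, 8]
print(strip_padding(padded, 645))                    # [5, 8, 2]
```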

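Both input and output are now numbers. Each list is one sentence; every input has length 80 and every output has length 40. The data preprocessing part is done.

Next comes the model. I wanted to write it all in one go, but I'm tired, so it will wait for Part 2...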
