A Summary of Chinese Natural Language Preprocessing


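1. Text data preparation

# Read every file in the list and return the texts and their labels
def filelist_contents_labels(filelist):
    contents = []
    labels = []
    for file in filelist:
        with open(file, "r", encoding="utf-8") as f:
            for row in f.read().splitlines():
                sentence = row.split('\t')   # tab-separated: label first, text last
                contents.append(sentence[-1])
                # 'other' is the negative class (0); every other label becomes 1
                if sentence[0] == 'other':
                    labels.append(0)
                else:
                    labels.append(1)
    return contents, labels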
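
To make the expected input format concrete, here is a small sketch of the rule applied to each line: the first tab-separated field is the label, the last is the text, and `other` maps to 0. The sample rows, the `spam` label, and the helper name `parse_row` are invented for illustration and do not come from the original data.

```python
# Hypothetical sample rows in the form "<label>\t<text>"; contents are invented.
sample_rows = [
    "other\t今天天氣不錯",
    "spam\t點擊鏈接領取獎品",
]

def parse_row(row):
    """Split one tab-separated row into (text, binary_label)."""
    fields = row.split("\t")
    text = fields[-1]                          # the text is the last field
    label = 0 if fields[0] == "other" else 1   # 'other' -> 0, anything else -> 1
    return text, label

parsed = [parse_row(row) for row in sample_rows]
contents = [text for text, _ in parsed]
labels = [label for _, label in parsed]
print(contents, labels)
```

Any label other than `other` collapses into class 1, so the result is always a binary labelling.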

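2. Converting between full-width and half-width characters

Inconsistent use of full-width and half-width characters leads to inconsistent information extraction, so the text should be normalized to one form. Chinese characters are always full-width; only Latin letters, digits, and symbols come in both forms. A letter or digit that occupies the width of one Chinese character is full-width, and one that occupies half that width is half-width. Punctuation also differs between Chinese and English input modes and between full-width and half-width forms.

Regular case (excluding the space): full-width characters occupy Unicode code points 65281 to 65374 (hexadecimal 0xFF01 to 0xFF5E), and the corresponding half-width characters occupy 33 to 126 (0x21 to 0x7E). Each pair therefore differs by a constant offset of 65248 (0xFEE0).

Special case: the space. The full-width space is 12288 (0x3000) and the half-width space is 32 (0x20).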
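
Because the offset between the two ranges is constant, the whole mapping also fits in a translation table. The following is a minimal sketch of that idea using `str.translate`; the names `OFFSET` and `FULL_TO_HALF` are chosen here for illustration.

```python
OFFSET = 0xFF01 - 0x21   # 65248, the constant gap between the two ranges

# Map every full-width code point in 0xFF01..0xFF5E to its half-width counterpart,
# then add the special case for the full-width space.
FULL_TO_HALF = {code: code - OFFSET for code in range(0xFF01, 0xFF5F)}
FULL_TO_HALF[0x3000] = 0x20

text = "ＡＢＣ\u3000１２３！"
converted = text.translate(FULL_TO_HALF)
print(converted)  # ABC 123!
```

Characters that are not in the table, such as Chinese characters, pass through `translate` unchanged.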

# Full-width to half-width
def full_to_half(sentence):      # input: one sentence (str)
    change_sentence = ""
    for word in sentence:
        inside_code = ord(word)
        if inside_code == 12288:               # the full-width space maps directly to 32
            inside_code = 32
        elif 65281 <= inside_code <= 65374:    # other full-width characters: subtract the fixed offset
            inside_code -= 65248
        change_sentence += chr(inside_code)
    return change_sentence

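ord() is the counterpart of chr(): it takes a single character (a string of length 1) and returns its Unicode code point as an integer, which for ASCII characters is the ASCII value. Passing a string whose length is not 1 raises a TypeError. In Python 2 the counterparts were chr() for 8-bit strings and unichr() for Unicode objects; Python 3, which the code here requires, has only chr().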
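
A short check of this behaviour in Python 3, using the full-width capital A as the sample character:

```python
# ord() and chr() are inverses on single characters.
code = ord("Ａ")           # full-width capital A, U+FF21
print(code, hex(code))     # 65313 0xff21
assert chr(code) == "Ａ"

# A string whose length is not 1 is rejected with TypeError.
error_message = None
try:
    ord("ＡＢ")
except TypeError as exc:
    error_message = str(exc)
    print("TypeError:", error_message)
```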

# Half-width to full-width
def half_to_full(sentence):      # input: one sentence (str)
    change_sentence = ""
    for word in sentence:
        inside_code = ord(word)
        if inside_code == 32:               # the half-width space maps directly to 12288
            inside_code = 12288
        elif 33 <= inside_code <= 126:      # other half-width characters: add the fixed offset
            inside_code += 65248
        change_sentence += chr(inside_code)
    return change_sentence

3. Converting Chinese numerals to Arabic digits

# Replace each Chinese numeral character with the matching Arabic digit
def big2small_num(sentence):
    numlist = {"一":"1","二":"2","三":"3","四":"4","五":"5","六":"6","七":"7","八":"8","九":"9","零":"0"}
    for item in numlist:
        sentence = sentence.replace(item, numlist[item])
    return sentence
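
The substitution works one character at a time, which is worth seeing on a few inputs. The sketch below expresses the same mapping with `str.maketrans` (the names `CN_DIGITS` and `cn_digits_to_arabic` are illustrative) and shows two consequences: unit characters such as 十 are not interpreted, and numerals inside ordinary words are rewritten too.

```python
# One-to-one character mapping from Chinese numerals to Arabic digits.
CN_DIGITS = str.maketrans("零一二三四五六七八九", "0123456789")

def cn_digits_to_arabic(sentence):
    """Replace each Chinese numeral character with its digit, character by character."""
    return sentence.translate(CN_DIGITS)

print(cn_digits_to_arabic("二零二四年"))  # 2024年
print(cn_digits_to_arabic("十五個"))      # 十5個  (十 is left untouched)
print(cn_digits_to_arabic("一起"))        # 1起   (the word 一起 is altered)
```

If the corpus contains many such words or multi-character numbers, this step probably needs a more careful rule than plain replacement.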

4. Converting uppercase letters to lowercase

# Convert uppercase letters to lowercase
def upper2lower(sentence):
    new_sentence = sentence.lower()
    return new_sentence

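5. Removing emoticons (keeping only Chinese characters, English letters, and digits)

Use regular expressions.

import re

# Remove emoticon tags and keep only Chinese characters, English letters, and digits
def clear_character(sentence):
    pattern1 = r'\[.*?\]'                                 # bracketed emoticon tags such as [...]
    # A caret negates the class only in first position, so it appears once
    pattern2 = re.compile(r'[^\u4e00-\u9fa5a-zA-Z0-9]')
    line1 = re.sub(pattern1, '', sentence)
    line2 = re.sub(pattern2, '', line1)
    new_sentence = ''.join(line2.split())  # remove whitespace
    return new_sentence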
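
One detail of character classes matters for this step: a caret negates the class only when it is the first character, and anywhere else it is a literal `^`. The sketch below shows that effect and then a compact version of the cleaning step; `keep_cjk_alnum` and the sample sentence are illustrative, not taken from the original data.

```python
import re

# A caret after the first position is a literal, so '^' survives this substitution.
print(re.sub(r"[^a-z^0-9]", "", "a^1!"))   # a^1
print(re.sub(r"[^a-z0-9]", "", "a^1!"))    # a1

BRACKET_TAG = re.compile(r"\[.*?\]")                      # e.g. "[微笑]"
NOT_CJK_ALNUM = re.compile(r"[^\u4e00-\u9fa5a-zA-Z0-9]")  # everything to be dropped

def keep_cjk_alnum(sentence):
    """Drop bracketed tags, then everything except Chinese characters, letters, and digits."""
    without_tags = BRACKET_TAG.sub("", sentence)
    return NOT_CJK_ALNUM.sub("", without_tags)

print(keep_cjk_alnum("今天[微笑]天氣 good!! 123^_^"))  # 今天天氣good123
```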

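6. Removing everything except Chinese characters

import re

# Remove letters, digits, emoticon tags, and all other non-Chinese characters.
# Note: this function has the same name as the one in section 5; rename one of
# them if both are kept in the same module.
def clear_character(sentence):
    pattern1 = '[a-zA-Z0-9]'
    pattern2 = r'\[.*?\]'
    pattern3 = re.compile(r'[^\s1234567890:：\u4e00-\u9fa5]+')
    pattern4 = '[’!"#$%&\'()*+,-./:：;<=>?@[\\]^_`{|}~]+'   # includes the full-width colon kept by pattern3
    line1 = re.sub(pattern1, '', sentence)   # remove English letters and digits
    line2 = re.sub(pattern2, '', line1)      # remove emoticon tags
    line3 = re.sub(pattern3, '', line2)      # remove other characters
    line4 = re.sub(pattern4, '', line3)      # remove leftover colons and other punctuation
    new_sentence = ''.join(line4.split())    # remove whitespace
    return new_sentence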
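
When only Chinese characters should remain, two substitutions should give the same result for typical input: drop the bracketed tags, then drop every character outside U+4E00 to U+9FA5. This is a simplified sketch rather than the article's function, and `keep_chinese_only` is an illustrative name. Note that the range covers the common ideographs only; characters in the CJK extension blocks would be removed as well.

```python
import re

def keep_chinese_only(sentence):
    """Drop bracketed tags, then every character outside U+4E00..U+9FA5."""
    without_tags = re.sub(r"\[.*?\]", "", sentence)
    return re.sub(r"[^\u4e00-\u9fa5]", "", without_tags)

print(keep_chinese_only("回覆：Hello[哈哈]，明天10:30見！"))  # 回覆明天見
```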

7. Chinese word segmentation

This article uses jieba for word segmentation.

8. Filtering Chinese stop words

import jieba

# Remove stop words; return the list of texts with stop words removed
def clean_stopwords(contents):
    contents_list = []
    # Load the stop-word list (one word per line) into a set for fast lookup
    with open('data/stopwords.txt', encoding="utf-8") as f:
        stopwords_list = {line.rstrip() for line in f}
    for row in contents:      # filter each text in turn
        words_list = jieba.lcut(row)
        words = [w for w in words_list if w not in stopwords_list]
        sentence = ''.join(words)   # rejoin the remaining words into a sentence
        contents_list.append(sentence)
    return contents_list
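
The filtering logic itself does not depend on the segmenter, which makes it easy to try out without a stop-word file. The sketch below takes the tokenizer and the stop-word set as parameters. `remove_stopwords` and `whitespace_tokenize` are hypothetical names, and the whitespace tokenizer is only a stand-in for `jieba.lcut` so that the example runs with the standard library alone.

```python
def remove_stopwords(sentences, stopwords, tokenize):
    """Drop tokens found in `stopwords` and rejoin the rest without separators."""
    cleaned = []
    for sentence in sentences:
        kept = [token for token in tokenize(sentence) if token not in stopwords]
        cleaned.append("".join(kept))
    return cleaned

# Stand-in tokenizer for the demo: the sample input is already space-separated.
def whitespace_tokenize(sentence):
    return sentence.split()

stopwords = {"的", "了"}
result = remove_stopwords(["我 買 了 新 的 書"], stopwords, whitespace_tokenize)
print(result)  # ['我買新書']
```

Keeping the stop words in a set makes each membership test constant-time on average, which matters once the list holds thousands of entries.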

9. Writing the cleaned data to a CSV file

import pandas as pd

# Write the cleaned texts and their labels to a .csv file
def after_clean2csv(contents, labels):  # input: list of texts and list of labels
    columns = ['contents', 'labels']
    save_file = pd.DataFrame(columns=columns, data=list(zip(contents, labels)))
    save_file.to_csv('data/clean_data.csv', index=False, encoding="utf-8")
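
For readers without pandas, the same two-column layout can be produced with the standard `csv` module. This is an alternative sketch, not the approach used above. It writes to an in-memory buffer so that it runs anywhere, and the sample texts are invented.

```python
import csv
import io

contents = ["今天天氣不錯", "點擊鏈接領取獎品"]   # invented sample texts
labels = [0, 1]

# The buffer stands in for a real file opened with
# open('data/clean_data.csv', 'w', newline='', encoding='utf-8').
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["contents", "labels"])     # header row
writer.writerows(zip(contents, labels))     # one row per (text, label) pair

# Read the data back to check the layout.
buffer.seek(0)
rows = list(csv.reader(buffer))
print(rows)
```

Values read back from a CSV file are strings, so the labels need to be converted with `int` before they are used for training.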


