Text Preprocessing

Preface

Approach

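The data was scraped by a crawler and written to Excel spreadsheets, so preprocessing falls into two parts. The first reads the data from Excel, strips all HTML tags from the text, and writes it out as a txt file with one review per line. The second removes stopwords and segments the text into words separated by single spaces, the format needed to train a word2vec model.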
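To make the target format concrete, here is a minimal sketch of reading such lines back into lists of tokens, which is the shape that word2vec implementations such as gensim accept as training sentences. The function name `read_sentences` and the sample lines are hypothetical and not part of the original scripts.

```python
def read_sentences(lines):
    """Turn 'word word word ' lines into lists of tokens, skipping empty lines."""
    sentences = []
    for line in lines:
        # split() with no argument also discards the trailing space and newline.
        tokens = line.split()
        if tokens:
            sentences.append(tokens)
    return sentences


if __name__ == "__main__":
    # Made-up lines in the "one review per line, one space per word" format.
    sample_lines = ["故事 感人 \n", "\n", "人物 鮮明 結局 意外 \n"]
    for sentence in read_sentences(sample_lines):
        print(sentence)
```

Any iterable of strings works as input, so an open file object can be passed directly.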

Code and Explanation

pre_process_format.py

import os
import re

from openpyxl import load_workbook


def read_from_xlsx(path):
    """Write the reviews of every book in one workbook to per-book txt files."""
    wb = load_workbook(path)
    ws = wb[wb.sheetnames[0]]
    # Row 1 is the header; each later row holds one book, with its name in
    # column 1 and its review text from column 7 onward.
    for row in range(2, ws.max_row + 1):
        book_name = ws.cell(row, 1).value
        out_path = os.path.join("書評", "format", book_name + ".txt")
        with open(out_path, 'w', encoding='utf-8') as book_file:
            contents = []
            for col in range(7, ws.max_column + 1):
                content = ws.cell(row, col).value
                # Empty cells come back as None; skip them.
                if content is None:
                    continue
                # Replace HTML tags, then line breaks, with a space.
                content = re.sub("<[^>]+>", " ", str(content))
                content = content.replace("\n", " ")
                contents.append(content + '\n')
            book_file.writelines(contents)


if __name__ == '__main__':
    xlsx_base = os.path.join("書評", "xlsx")
    for xlsx in os.listdir(xlsx_base):
        read_from_xlsx(os.path.join(xlsx_base, xlsx))

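The code uses the openpyxl package, mainly to read the data. The first row of each sheet is the header, and the review text starts at column 7 of row 2, so the row and column ranges are set accordingly.

For each cell, the regular expression "<[^>]+>" replaces every HTML tag with a space, then every line break is replaced with a space, and a newline is appended when the text is written out.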
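The cleaning step can be tried in isolation on a single string. This is a minimal sketch; the helper name `clean_cell` and the sample text are hypothetical and not part of the original script.

```python
import re

# Same pattern as in the script: "<", one or more non-">" characters, ">".
TAG_PATTERN = re.compile(r"<[^>]+>")


def clean_cell(content):
    """Replace HTML tags and line breaks with spaces."""
    content = TAG_PATTERN.sub(" ", str(content))
    return content.replace("\n", " ")


if __name__ == "__main__":
    sample = "<p>Great book</p>\n<br/>Worth reading"
    print(repr(clean_cell(sample)))
```

Tags are replaced with a space rather than deleted so that words on either side of a tag do not run together. The result can therefore contain runs of several spaces.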

pre_process_segment.py

import os
import time

import jieba.posseg as pseg


def load_stopwords(path):
    """Read one stopword per line into a set."""
    with open(path, 'r', encoding='utf-8') as f:
        return set(f.read().splitlines())


def seg_book(book_base, book_name, outfile_path, stopwords):
    """Segment one book's reviews into space-separated words, line by line."""
    in_path = os.path.join(book_base, book_name)
    out_path = os.path.join(outfile_path, "seg_" + book_name)
    with open(in_path, 'r', encoding='utf-8') as infile, \
            open(out_path, 'w', encoding='utf-8') as outfile:
        for line in infile:
            for word, flag in pseg.cut(line.strip()):
                # Tokens flagged 'x' are punctuation and other non-words.
                if flag.startswith('x'):
                    continue
                if word in stopwords:
                    continue
                outfile.write(word + ' ')
            outfile.write('\n')


if __name__ == '__main__':
    # Build the combined stopword set once, not once per word.
    stopwords = (load_stopwords(os.path.join("util", "stopwords_csdn.txt"))
                 | load_stopwords(os.path.join("util", "stopwords_google.txt")))
    start = time.time()
    infile_base = os.path.join("書評", "format")
    for book in os.listdir(infile_base):
        print(book + " segmenting...")
        seg_book(infile_base, book, os.path.join("書評", "seg"), stopwords)
    end = time.time()
    print("Total time: %d seconds" % (end - start))

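This script removes stopwords from the raw data and segments it into words. It uses jieba for segmentation and drops punctuation. The stopword lists are the Google English stopwords and a Chinese list provided by a CSDN blog.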
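The filtering rule itself does not depend on jieba and can be sketched on hand-written input. In the example below the function name `filter_tokens` is hypothetical, and the (word, flag) pairs are made up to stand in for segmenter output; they are not real jieba results.

```python
def filter_tokens(tagged_words, stopwords):
    """Keep words that are neither flagged 'x' nor in the stopword set."""
    return [word for word, flag in tagged_words
            if not flag.startswith('x') and word not in stopwords]


if __name__ == "__main__":
    # Illustrative (word, flag) pairs, written by hand for this example.
    tagged = [("這", "r"), ("本", "q"), ("書", "n"), ("很", "d"),
              ("好", "a"), ("，", "x"), ("the", "eng")]
    stopwords = {"這", "很", "the"}
    print(" ".join(filter_tokens(tagged, stopwords)))
```

Merging the Chinese and English lists into one set up front keeps each lookup to a single membership test.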

Later I will consider using tf-idf to remove stopwords dynamically.

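Summary

After training the word2vec model, though, many of the results turned out to be unsatisfactory. For example, the character “中” was not removed. When it appears on its own it means the English "in", so it belongs in the stopword list.

So a stopword list should not simply be taken from someone else as is. A more reasonable approach, found on Zhihu: remove the very common stopwords first, then remove the rest with tf-idf or manual screening.
Because we train one model per book to iteratively extract review tags, manual screening for every book is not feasible, so a tf-idf pass will come later. Getting a quick prototype working is the top priority.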
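The post does not describe how the tf-idf pass would work, so the following is only one possible reading: treat each review as a document, compute the inverse document frequency (IDF) of every word, and flag words with very low IDF as stopword candidates. The function names, the threshold value, and the sample reviews are all assumptions for illustration.

```python
import math
from collections import Counter


def idf_scores(documents):
    """Return log(N / df) for each word; documents are lists of tokens."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        # Count each word at most once per document.
        doc_freq.update(set(tokens))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def stopword_candidates(documents, max_idf):
    """Words whose IDF is at most max_idf, i.e. words found in most documents."""
    return {word for word, score in idf_scores(documents).items()
            if score <= max_idf}


if __name__ == "__main__":
    # Made-up segmented reviews; "中" appears in every one of them.
    reviews = [
        ["中", "故事", "感人"],
        ["中", "人物", "鮮明"],
        ["中", "故事", "結局"],
        ["中", "文筆"],
    ]
    # 0.3 is an arbitrary threshold chosen for this toy data.
    print(stopword_candidates(reviews, max_idf=0.3))
```

A word that occurs in every review gets an IDF of 0, so it is flagged at any non-negative threshold. A usable threshold for real data would have to be tuned per book.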
