(Revised) Text Classification with Machine Learning (Training Set, Dataset, and All Code Included)

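This post refines my earlier one: I have improved the file handling and added more comments, so that readers working on NLP and text classification can get the code running faster.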

Original post: https://blog.csdn.net/qq_28626909/article/details/80382029

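The earlier post (linked above) already covers TF-IDF, naive Bayes, word segmentation, and stop words in detail, so I will not repeat that here. This post is about getting the code to run. I wrote the earlier code when I was still a beginner, so much of it was hard to follow. Below are the main questions readers asked at the time, together with the fixes made in this post.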

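1. The dat files cannot be viewed. Fix: the code now also generates detailed txt files that you can open directly.

2. The contents of the generated files are unclear. Fix: the same detailed txt files, which you can open directly.

3. File paths had to be changed (my earlier code had no comments). Fix: every absolute path is replaced with a relative path and commented, so the code runs right after download.

4. Some readers ran into environment problems. Fix: the end of this post lists the most common problems and their solutions.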

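Files (the folder is named CSDN; the screenshot below shows its contents):

Most readers use PyCharm as their IDE, so I will demonstrate running the code in PyCharm.

Move the folder into PyCharm.

Every path in this Python file is relative, so you should be able to run it without changing any path, as long as the files stay together.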

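    datapath = "./data/"  # raw training data
    stopWord_path = "./stop/stopword.txt"  # stop-word list
    test_path = "./test/"  # raw test data
    '''
    The three paths above must already exist; everything below is generated by running the code.
    The dat files exist for fast loading and the txt files for inspection, so open the txt files
    to see the segmentation, term matrices, and vocabulary. The dat files cannot be opened the normal way.
    '''
    test_split_dat_path = "./test_set.dat"  # segmented test set (dat)
    testspace_dat_path = "./testspace.dat"  # test-set TF-IDF space (dat)
    train_dat_path = "./train_set.dat"  # segmented training set, saved as a binary file
    tfidfspace_dat_path = "./tfidfspace.dat"  # training TF-IDF space (dat)
    '''
    The four paths above are dat files used for storage; do not open them.
    '''
    test_split_path = './split/test_split/'  # segmented test files
    split_datapath = "./split/split_data/"  # segmented training files
    '''
    The two paths above hold the segmented files; open them after the run to study the output.
    '''
    tfidfspace_path = "./tfidfspace.txt"  # training TF-IDF space saved as txt for viewing
    tfidfspace_arr_path = "./tfidfspace_arr.txt"  # training TF-IDF matrix saved as txt
    tfidfspace_vocabulary_path = "./tfidfspace_vocabulary.txt"  # vocabulary saved as txt
    testSpace_path = "./testSpace.txt"  # test-set space info
    testSpace_arr_path = "./testSpace_arr.txt"  # test-set TF-IDF matrix
    trainbunch_vocabulary_path = "./trainbunch_vocabulary.txt"  # training vocabulary reused for the test set
    tfidfspace_out_arr_path = "./tfidfspace_out_arr.txt"  # training matrix as loaded by the classifier
    tfidfspace_out_word_path = "./tfidfspace_out_word.txt"  # training space in text form
    testspace_out_arr_path = "./testspace_out_arr.txt"  # test-set space as loaded by the classifier
    testspace_out_word_apth = "./testspace_out_word.txt"  # test-set labels
    '''
    The ten files above are txt dumps of the dat contents. Browse them freely; they are
    valuable NLP (natural language processing) study material.
    '''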
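
Because every path is relative, it resolves against the directory the script is run from. The following is a minimal sketch, not part of the original script, of a check that the three required inputs exist before running. The helper name `missing_inputs` is hypothetical.

```python
from pathlib import Path


def missing_inputs(base_dir, required=("data", "stop/stopword.txt", "test")):
    """Return the required input paths that do not exist under base_dir."""
    base = Path(base_dir)
    return [rel for rel in required if not (base / rel).exists()]


if __name__ == "__main__":
    # "." is the current working directory, which the relative paths resolve against.
    missing = missing_inputs(".")
    if missing:
        print("Missing inputs:", ", ".join(missing))
    else:
        print("All input paths found.")
```

If anything is reported missing, the usual cause is running the script from a different working directory than the CSDN folder.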

The path definitions above document each file in reasonable detail. The complete code follows:

#!D:/workplace/python
# -*- coding: utf-8 -*-
# @File  : TFIDF_naive_bayes_wy.py
# @Author: WangYe
# @Date  : 2019/5/29
# @Software: PyCharm
# Text classification with machine learning (training set, dataset, and all code included)
# Blog post: https://blog.csdn.net/qq_28626909/article/details/80382029
import jieba
from numpy import *  # only needed for the commented-out shape() calls below
import pickle  # object persistence
import os
from sklearn.feature_extraction.text import TfidfTransformer  # TF-IDF transformer class
from sklearn.feature_extraction.text import TfidfVectorizer  # TF-IDF vectorizer class
from sklearn.utils import Bunch  # sklearn.datasets.base was removed in scikit-learn 0.24
from sklearn.naive_bayes import MultinomialNB  # multinomial naive Bayes


def readFile(path):
    # Some documents have encoding problems, so errors='ignore' skips undecodable bytes
    with open(path, 'r', errors='ignore') as file:
        content = file.read()
    return content


def saveFile(path, result):
    with open(path, 'w', errors='ignore') as file:
        file.write(result)


def segText(inputPath, resultPath):
    fatherLists = os.listdir(inputPath)  # top-level directory
    for eachDir in fatherLists:  # iterate over the category folders
        eachPath = inputPath + eachDir + "/"  # path of the category folder, used to list its files
        each_resultPath = resultPath + eachDir + "/"  # directory for the segmented output
        if not os.path.exists(each_resultPath):
            os.makedirs(each_resultPath)
        childLists = os.listdir(eachPath)  # files inside the category folder
        for eachFile in childLists:  # iterate over the files
            eachPathFile = eachPath + eachFile  # full path of the file
            #  print(eachFile)
            content = readFile(eachPathFile)  # read the content with the helper above
            result = (str(content)).replace("\r\n", "").strip()  # drop Windows line breaks and surrounding whitespace

            cutResult = jieba.cut(result)  # segment in the default mode
            saveFile(each_resultPath + eachFile, " ".join(cutResult))  # save the tokens joined by spaces


def bunchSave(inputFile, outputFile):
    catelist = os.listdir(inputFile)
    bunch = Bunch(target_name=[], label=[], filenames=[], contents=[])
    bunch.target_name.extend(catelist)  # store the category names in the Bunch
    for eachDir in catelist:
        eachPath = inputFile + eachDir + "/"
        fileList = os.listdir(eachPath)
        for eachFile in fileList:  # every file in the category folder
            fullName = eachPath + eachFile  # full path of the file
            bunch.label.append(eachDir)  # category label of this file
            bunch.filenames.append(fullName)  # path of this file
            bunch.contents.append(readFile(fullName).strip())  # segmented text of this file
    with open(outputFile, 'wb') as file_obj:  # pickle requires binary mode
        pickle.dump(bunch, file_obj)
        # pickle.dump(obj, file, protocol=None) serializes obj into the already opened file.
        # obj: the object to serialize.
        # file: a file object opened for binary writing (not a file name).
        # protocol: the pickle protocol; if omitted, pickle.DEFAULT_PROTOCOL is used.
        # A negative value or pickle.HIGHEST_PROTOCOL selects the highest available version.


def readBunch(path):
    with open(path, 'rb') as file:
        bunch = pickle.load(file)
        # pickle.load(file) deserializes one object from the opened file.
    return bunch


def writeBunch(path, bunchFile):
    with open(path, 'wb') as file:
        pickle.dump(bunchFile, file)


def getStopWord(inputFile):
    stopWordList = readFile(inputFile).splitlines()  # one stop word per line
    return stopWordList


def getTFIDFMat(inputPath, stopWordList, outputPath,
                tfidfspace_path, tfidfspace_arr_path, tfidfspace_vocabulary_path):  # build the TF-IDF matrix
    bunch = readBunch(inputPath)
    tfidfspace = Bunch(target_name=bunch.target_name, label=bunch.label, filenames=bunch.filenames, tdm=[],
                       vocabulary={})
    # Dump the TF-IDF space (matrix not yet filled in) to txt for inspection
    tfidfspace_out = str(tfidfspace)
    saveFile(tfidfspace_path, tfidfspace_out)
    # Initialize the vector space model
    vectorizer = TfidfVectorizer(stop_words=stopWordList, sublinear_tf=True, max_df=0.5)
    # Convert the texts to a TF-IDF weighted matrix (fitted once, then reused for the txt dump)
    tfidfspace.tdm = vectorizer.fit_transform(bunch.contents)
    tfidfspace_arr = str(tfidfspace.tdm)
    saveFile(tfidfspace_arr_path, tfidfspace_arr)
    tfidfspace.vocabulary = vectorizer.vocabulary_  # term -> column index mapping
    tfidfspace_vocabulary = str(vectorizer.vocabulary_)
    saveFile(tfidfspace_vocabulary_path, tfidfspace_vocabulary)
    # Persist the TF-IDF space
    writeBunch(outputPath, tfidfspace)


def getTestSpace(testSetPath, trainSpacePath, stopWordList, testSpacePath,
                 testSpace_path,testSpace_arr_path,trainbunch_vocabulary_path):
    bunch = readBunch(testSetPath)
    # Build the TF-IDF vector space for the test set
    testSpace = Bunch(target_name=bunch.target_name, label=bunch.label, filenames=bunch.filenames, tdm=[],
                      vocabulary={})
    # Dump the test space (matrix not yet filled in) to txt for inspection
    testSpace_out = str(testSpace)
    saveFile(testSpace_path, testSpace_out)
    # Load the training vocabulary
    trainbunch = readBunch(trainSpacePath)
    # Initialize the vector space model with the training vocabulary
    vectorizer = TfidfVectorizer(stop_words=stopWordList, sublinear_tf=True, max_df=0.5,
                                 vocabulary=trainbunch.vocabulary)
    # Note: fit_transform re-estimates the IDF weights on the test set;
    # only the vocabulary is shared with the training set.
    testSpace.tdm = vectorizer.fit_transform(bunch.contents)
    testSpace.vocabulary = trainbunch.vocabulary
    testSpace_arr = str(testSpace.tdm)
    trainbunch_vocabulary = str(trainbunch.vocabulary)
    saveFile(testSpace_arr_path, testSpace_arr)
    saveFile(trainbunch_vocabulary_path, trainbunch_vocabulary)
    # Persist the test space
    writeBunch(testSpacePath, testSpace)


def bayesAlgorithm(trainPath, testPath,tfidfspace_out_arr_path,
                   tfidfspace_out_word_path,testspace_out_arr_path,
                   testspace_out_word_apth):
    trainSet = readBunch(trainPath)
    testSet = readBunch(testPath)
    clf = MultinomialNB(alpha=0.001).fit(trainSet.tdm, trainSet.label)
    # alpha is the additive (Laplace/Lidstone) smoothing parameter; smaller values smooth less
    # print(shape(trainSet.tdm))  # shape of the training matrix
    # print(shape(testSet.tdm))
    # Dump the dat contents to txt
    tfidfspace_out_arr = str(trainSet.tdm)
    tfidfspace_out_word = str(trainSet)
    saveFile(tfidfspace_out_arr_path, tfidfspace_out_arr)  # training set in matrix form
    saveFile(tfidfspace_out_word_path, tfidfspace_out_word)  # training set in text form

    testspace_out_arr = str(testSet)
    testspace_out_word = str(testSet.label)
    saveFile(testspace_out_arr_path, testspace_out_arr)
    saveFile(testspace_out_word_apth, testspace_out_word)

    # End of txt dumps
    predicted = clf.predict(testSet.tdm)
    total = len(predicted)
    rate = 0  # number of misclassified files
    for flabel, fileName, expct_cate in zip(testSet.label, testSet.filenames, predicted):
        if flabel != expct_cate:
            rate += 1
            print(fileName, ": actual category:", flabel, "--> predicted category:", expct_cate)
    print("error rate:", float(rate) * 100 / float(total), "%")



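# Segmentation: segText takes the input directory first and the output directory second

if __name__ == '__main__':
    datapath = "./data/"  # raw training data
    stopWord_path = "./stop/stopword.txt"  # stop-word list
    test_path = "./test/"  # raw test data
    '''
    The three paths above must already exist; everything below is generated by running the code.
    The dat files exist for fast loading and the txt files for inspection, so open the txt files
    to see the segmentation, term matrices, and vocabulary. The dat files cannot be opened the normal way.
    '''
    test_split_dat_path = "./test_set.dat"  # segmented test set (dat)
    testspace_dat_path = "./testspace.dat"  # test-set TF-IDF space (dat)
    train_dat_path = "./train_set.dat"  # segmented training set, saved as a binary file
    tfidfspace_dat_path = "./tfidfspace.dat"  # training TF-IDF space (dat)
    '''
    The four paths above are dat files used for storage; do not open them.
    '''
    test_split_path = './split/test_split/'  # segmented test files
    split_datapath = "./split/split_data/"  # segmented training files
    '''
    The two paths above hold the segmented files; open them after the run to study the output.
    '''
    tfidfspace_path = "./tfidfspace.txt"  # training TF-IDF space saved as txt for viewing
    tfidfspace_arr_path = "./tfidfspace_arr.txt"  # training TF-IDF matrix saved as txt
    tfidfspace_vocabulary_path = "./tfidfspace_vocabulary.txt"  # vocabulary saved as txt
    testSpace_path = "./testSpace.txt"  # test-set space info
    testSpace_arr_path = "./testSpace_arr.txt"  # test-set TF-IDF matrix
    trainbunch_vocabulary_path = "./trainbunch_vocabulary.txt"  # training vocabulary reused for the test set
    tfidfspace_out_arr_path = "./tfidfspace_out_arr.txt"  # training matrix as loaded by the classifier
    tfidfspace_out_word_path = "./tfidfspace_out_word.txt"  # training space in text form
    testspace_out_arr_path = "./testspace_out_arr.txt"  # test-set space as loaded by the classifier
    testspace_out_word_apth = "./testspace_out_word.txt"  # test-set labels
    '''
    The ten files above are txt dumps of the dat contents. Browse them freely; they are
    valuable NLP (natural language processing) study material.
    '''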

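    # Process the training set
    segText(datapath,  # read the raw data
            split_datapath)  # write the segmented files
    bunchSave(split_datapath,  # read the segmented files
              train_dat_path)  # write the segmented training set as a dat file
    stopWordList = getStopWord(stopWord_path)  # load the stop-word list
    getTFIDFMat(train_dat_path,  # read the segmented training set
                stopWordList,  # stop-word list
                tfidfspace_dat_path,  # write the training TF-IDF space as a dat file
                tfidfspace_path,  # write the TF-IDF space info as txt
                tfidfspace_arr_path,  # write the TF-IDF matrix as txt
                tfidfspace_vocabulary_path)  # write the vocabulary as txt
    '''
    The test-set calls below take essentially the same arguments as the training-set calls above.
    '''
    # Process the test set
    segText(test_path,
            test_split_path)  # read the test files and write the segmented files
    bunchSave(test_split_path,
              test_split_dat_path)  # write the segmented test set as a dat file
    getTestSpace(test_split_dat_path,
                 tfidfspace_dat_path,
                 stopWordList,
                 testspace_dat_path,
                 testSpace_path,
                 testSpace_arr_path,
                 trainbunch_vocabulary_path)  # inputs: segmented test set, training space, stop words; outputs: test space (dat and txt)
    bayesAlgorithm(tfidfspace_dat_path,
                   testspace_dat_path,
                   tfidfspace_out_arr_path,
                   tfidfspace_out_word_path,
                   testspace_out_arr_path,
                   testspace_out_word_apth)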
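
The following is a minimal sketch of the same pipeline on made-up toy data, assuming scikit-learn is installed. It differs from `getTestSpace` above in one respect: that function calls `fit_transform` on the test set, whereas this sketch calls `transform`, so the test documents reuse the IDF weights learned from the training documents. The documents, labels, and variable names are hypothetical, and `max_df` and the stop-word list are left out because the toy corpus is too small for them to matter.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy documents, already segmented into space-separated tokens (made up for illustration).
train_docs = [
    "football match goal team",
    "team win match score",
    "stock market price bank",
    "bank loan interest market",
]
train_labels = ["sports", "sports", "finance", "finance"]
test_docs = ["goal score team", "bank stock loan"]
test_labels = ["sports", "finance"]

# Learn the vocabulary and IDF weights from the training documents only.
vectorizer = TfidfVectorizer(sublinear_tf=True)
X_train = vectorizer.fit_transform(train_docs)
# transform() reuses the training vocabulary and IDF weights for the test set.
X_test = vectorizer.transform(test_docs)

clf = MultinomialNB(alpha=0.001).fit(X_train, train_labels)
predicted = clf.predict(X_test)

# Same error-rate calculation as bayesAlgorithm: misclassified / total, in percent.
errors = sum(1 for actual, guess in zip(test_labels, predicted) if actual != guess)
error_rate = 100.0 * errors / len(test_labels)
print("predicted:", list(predicted))
print("error rate: %.1f %%" % error_rate)
```

Fitting only on the training data keeps the test set from influencing the feature weights, which is the usual convention when reporting an error rate.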

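Now run the code:

The program's output is unchanged, but many files are generated:

The split folder holds the segmented files for the training and test sets.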

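The remaining dat files cannot be opened directly, but I have written each one out as a corresponding txt file. Every file is annotated in the path comments above, so look up the one you need. They are very good NLP study material. Here are two, picked at random: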
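
The dat files are pickled Python objects, so they can also be inspected from Python rather than from a text editor. The following is a minimal sketch using only the standard library. The helpers `load_pickle` and `summarize` are hypothetical, and a plain dict stands in for the `Bunch` so that the example runs on its own. Loading the real dat files additionally requires scikit-learn to be installed, because they contain `Bunch` objects.

```python
import pickle
import tempfile
from pathlib import Path


def load_pickle(path):
    """Load one pickled object; only use this on files you created yourself."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def summarize(mapping):
    """Map each key to the type name and length (if any) of its value."""
    summary = {}
    for key, value in mapping.items():
        try:
            size = len(value)
        except TypeError:  # e.g. sparse matrices do not support len()
            size = None
        summary[key] = (type(value).__name__, size)
    return summary


# Stand-in for a Bunch: a plain dict with the same field names the script uses.
demo = {"target_name": ["C1", "C2"], "label": ["C1"], "filenames": ["a.txt"], "contents": ["w1 w2"]}
demo_path = Path(tempfile.mkdtemp()) / "demo_set.dat"
with open(demo_path, "wb") as fh:
    pickle.dump(demo, fh)

print(summarize(load_pickle(demo_path)))
```

Since `Bunch` is a dict subclass, the same `summarize` call should work on an object loaded from `./train_set.dat`.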

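The first screenshot shows the term matrix with the TF-IDF values already computed. The second shows the vocabulary file; note that scikit-learn's `vocabulary_` maps each term to its column index in that matrix, not to an occurrence count. For details, see the original post linked at the top.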
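
To read the matrix file and the vocabulary file together, the column index of each matrix entry has to be mapped back to a term. The following is a minimal sketch, assuming the vocabulary maps each term to its column index as scikit-learn's `vocabulary_` does. The helper names and the sample values are hypothetical.

```python
def invert_vocabulary(vocabulary):
    """Turn a term -> column index mapping into column index -> term."""
    return {index: term for term, index in vocabulary.items()}


def label_entries(entries, vocabulary):
    """Attach the term to each (row, column, weight) triple of a sparse matrix."""
    index_to_term = invert_vocabulary(vocabulary)
    return [(row, index_to_term[col], weight) for row, col, weight in entries]


# Hypothetical values, shaped like the vocabulary and matrix txt files.
vocabulary = {"market": 0, "match": 1, "team": 2}
entries = [(0, 2, 0.8), (0, 1, 0.6), (1, 0, 1.0)]

for row, term, weight in label_entries(entries, vocabulary):
    print("document %d: %s = %.2f" % (row, term, weight))
```

Each row index corresponds to one document, in the order the files were read into the `Bunch`.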

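(If a file shows garbled characters when opened, switch the encoding to GBK. Notepad converts automatically, so there is nothing to do there; in PyCharm you have to select the encoding manually, as shown below.)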
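
The same encoding issue can come up when reading the txt files from Python. The following is a minimal sketch, using only the standard library, that tries UTF-8 first and falls back to GBK. The helper name `read_text_with_fallback` and the sample file are hypothetical.

```python
import os
import tempfile


def read_text_with_fallback(path, encodings=("utf-8", "gbk")):
    """Try each encoding in order; return (text, encoding) for the first that decodes."""
    with open(path, "rb") as fh:
        raw = fh.read()
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of %r could decode %s" % (encodings, path))


# Write a small GBK-encoded sample: four Chinese characters meaning "text classification".
sample = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(sample, "wb") as fh:
    fh.write("\u6587\u672c\u5206\u7c7b".encode("gbk"))

text, used = read_text_with_fallback(sample)
print("decoded with:", used, "| characters:", len(text))
```

The order matters: GBK accepts many byte sequences that are really UTF-8, so UTF-8 is tried first.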

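Finally, since many readers may be beginners or new to the field, I am including some common problems here. When I started out, I also had someone more experienced helping me.

Below are the bug screenshots that readers sent me over WeChat:

This kind of error means a package is missing. Install it from a terminal with: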

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple some-package

Replace some-package with the missing module. For the code shown above, that means jieba, numpy, and scikit-learn in turn.
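
Before installing anything, it can help to see which modules are actually missing. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical. Note that the scikit-learn package is imported under the name `sklearn`.

```python
import importlib.util


def missing_modules(import_names):
    """Return the names that cannot be found on the current interpreter's path."""
    return [name for name in import_names if importlib.util.find_spec(name) is None]


# Import names used by the script; the scikit-learn package is imported as "sklearn".
required = ["jieba", "numpy", "sklearn"]
missing = missing_modules(required)
print("missing:", missing if missing else "none")
```

Run it with the same interpreter that PyCharm uses for the project, since each interpreter has its own set of installed packages.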

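Someone will surely ask where the terminal is. There are two ways to open one:

1. On Windows, press Win + R, type cmd, and paste the command above (the current directory does not matter).

On Linux, type the command directly in a terminal.

2. In PyCharm, click here:

Then type the command and press Enter.

There are many other ways, of course; these two are simply the most beginner-friendly.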

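Some readers also found that PyCharm reports a missing environment, and wondered how that could be when Python or Anaconda is already installed.

This post from another blog may help: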

https://blog.csdn.net/weixin_41923961/article/details/86584683
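
One possible cause, offered here as an assumption rather than a diagnosis, is that the PyCharm project uses a different interpreter from the one the packages were installed into. The following is a minimal sketch that prints which interpreter is running; the helper name `interpreter_info` is hypothetical.

```python
import sys


def interpreter_info():
    """Describe the Python interpreter that is running this script."""
    return {
        "executable": sys.executable,
        "version": sys.version.split()[0],
        "prefix": sys.prefix,
    }


for key, value in interpreter_info().items():
    print("%s: %s" % (key, value))
```

Running this once inside PyCharm and once from the terminal shows whether the two are using the same Python installation.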

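Download link for the regular study files and code (input files only; the output files are generated when you run the code; recommended for learning):

Link: https://pan.baidu.com/s/1IW6kMev17sjyPFdizsS13g
Extraction code: ap7m

Lastly, some readers are handing this in as coursework and are in a hurry, as in "I fail the course if I do not submit by tomorrow". So here is one more link with the data files I have already generated, which you could submit directly. I do not recommend it, though, given how much effort I put into this post to show you how to run the code yourself.

Download link for readers with a deadline tomorrow, containing the generated files, the code, and run screenshots without watermarks (strongly not recommended; there is nothing to learn from it):

Link: https://pan.baidu.com/s/1arv3b-poyMUFxz3dcaSm5g
Extraction code: ofa2

Because so many people ask questions in the comments, here is my personal WeChat: wy1119744330. When adding me, please mention: CSDN博客

I will do my best to answer your questions. Thank you all.

Finally, the original post link once more: https://blog.csdn.net/qq_28626909/article/details/80382029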
