Implementation of the News Classification System

1 Development Tools and Platform

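This work uses Python as its main development language. As a concise yet powerful scripting language, Python integrates a large number of third-party frameworks for data analysis and algorithm processing, which makes development considerably more convenient.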

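The complete set of development tools is shown in the figure:

Figure 1 Summary of the development tools used for the classification system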

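For the database, MongoDB is used to store the crawled news. Because MongoDB is a non-relational database, the crawled news only needs to be converted into key-value form to be stored. On the server side, sqlite3 serves as the storage database and holds the news ranking information displayed on the web page. SQLite support ships with Python through the standard-library sqlite3 module, so no separate database environment has to be configured: connecting the server framework to it is enough to set up a simple local database.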
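As a rough illustration of the storage roles described above, the sketch below represents one crawled news item as a key-value record and stores it in a local SQLite database through the standard-library sqlite3 module. The field names, the table name and the sample values are hypothetical, and an in-memory database stands in for a real file. The MongoDB side is left out because it needs a running server.

```python
import sqlite3

# A crawled news item as a key-value record (field names are assumptions).
news_item = {
    "title": "example title",
    "url": "https://example.com/news/1",
    "comments": 12,
}

# ":memory:" keeps the sketch self-contained; a real deployment would pass a file path.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE news (title TEXT, url TEXT PRIMARY KEY, comments INTEGER)"
)
# Named placeholders let the dict be inserted directly.
conn.execute(
    "INSERT INTO news (title, url, comments) VALUES (:title, :url, :comments)",
    news_item,
)
conn.commit()

# Read the items back ordered by comment count, e.g. for a ranking page.
rows = conn.execute(
    "SELECT title, comments FROM news ORDER BY comments DESC"
).fetchall()
print(rows)
```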

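For preprocessing the news text, Python's Pandas is used to read the information stored in MongoDB. Pandas' DataFrame structure builds on the data-processing routines provided by NumPy and SciPy, turning table operations into something close to matrix operations in linear algebra: once the row and column attributes are specified, the computation can be applied directly. Compared with the explicit row-and-column loops typically written in C or Java, Python's data-analysis frameworks give more readable code, at the cost of some additional running time. The processed data is plotted with Matplotlib.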

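Similarly, for Chinese word segmentation and model fitting, the third-party library jieba is used as the segmentation tool, the classifier module is built on the machine-learning algorithms provided by scikit-learn, and the deep-learning models are built with Keras (with TensorFlow as the backend). For feature extraction, the word2vec algorithm is run through Gensim.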

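Finally, for the server, a news publishing website was built with Django. The front end uses Ajax requests to communicate with the back end. When a news item is received, the back end loads the locally saved classifier PKL file through scikit-learn and uses it to classify the item.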

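The experiments were run on a machine with the following configuration:

·CPU: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz (3601 MHz)

·Memory: 8.00 GB (2400 MHz)

·Graphics: Intel(R) HD Graphics 630 (1024 MB) (this project runs its computations on the CPU)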

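2 Crawler Module Implementation

As shown in the figure, the crawler module consists of the following five steps:

Figure 1 Overall workflow of the crawler module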

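Step 1: Use Python's built-in urllib library to send HTTP requests to Sina News and retrieve the data behind its API. The response is a JSON-encoded batch of news records containing the title, publication time, link, number of comments, source and similar fields.

Step 2: Use Python's built-in json library to parse the JSON data and extract the required fields: title, publication time, number of comments and link.

Step 3: Asynchronously visit the news links parsed in the previous step and crawl the corresponding article content.

Step 4: Store all the data in MongoDB, where it serves as the dataset for the later machine-learning models.

Step 5: Use the multiprocessing library to start a process pool and the threading library to start multiple threads, then loop over the result pages until all the required information has been fetched.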
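Step 2 can be sketched with the standard-library json module as follows. The payload layout and the field names (`result`, `data`, `ctime`, `comment_total`) are assumptions made for illustration only; the real Sina News response may be structured differently, and no HTTP request is sent here.

```python
import json

# Hypothetical API response; the real payload and its field names may differ.
payload = '''
{"result": {"data": [
  {"title": "news A", "ctime": "1525000000", "url": "https://example.com/a", "comment_total": 5},
  {"title": "news B", "ctime": "1525000600", "url": "https://example.com/b", "comment_total": 0}
]}}
'''

def parse_news_list(raw):
    """Extract the fields needed downstream from one JSON page."""
    records = json.loads(raw)["result"]["data"]
    return [
        {
            "title": r["title"],
            "time": int(r["ctime"]),
            "url": r["url"],
            "comments": r.get("comment_total", 0),
        }
        for r in records
    ]

items = parse_news_list(payload)
print(items[0]["title"], len(items))
```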

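Crawling through the API is efficient because the crawler reads the JSON news data served to the web page directly. Combined with multithreading and multiprocessing, the crawler module can collect on the order of a million news records in about half an hour. The crawled information exported from the database is shown in Figure 2:

Figure 2 Exported crawler results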
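The concurrent page loop of Step 5 might look roughly like the sketch below. It swaps in `concurrent.futures.ThreadPoolExecutor` for the raw threading and multiprocessing calls named in the text, and `fetch_page` is a hypothetical stand-in that returns fake records instead of making a network request.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_page(page):
    """Stand-in for an HTTP request; returns fake records for the given page."""
    return [{"page": page, "title": f"news {page}-{i}"} for i in range(3)]

def crawl(pages, workers=8):
    """Fetch all pages concurrently and flatten the results in page order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order even though the calls run concurrently
        results = pool.map(fetch_page, pages)
        return [record for page_records in results for record in page_records]

records = crawl(range(1, 6))
print(len(records))
```

Threads suit this kind of I/O-bound work; a process pool would only be needed on top of this if parsing became CPU-bound.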

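3 Preprocessing Module Implementation

The news stored in the database is loaded into memory with pandas as a DataFrame for preprocessing. Some of the crawled records have no article content. After these are removed from the 1.02 million crawled records, 480,000 usable records remain.

These 480,000 records were grouped by category, giving 15 news classes in total. The per-category counts are shown in the table:

Labels 14 and 15 contain only a few hundred records each, labels 8 and 11 contain none at all, and the overall distribution is very uneven. Taking this into account, we keep the remaining 11 labels and randomly sample 2,000 news records from each.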
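The balancing step can be sketched as below: labels with too few records are dropped and a fixed number of records is drawn at random from each remaining label. The function name, the `min_count` threshold and the toy data are illustrative assumptions, not the project's actual code.

```python
import random
from collections import defaultdict

def sample_per_label(records, per_label, min_count, seed=0):
    """Keep labels with at least min_count records and sample per_label from each."""
    by_label = defaultdict(list)
    for label, text in records:
        by_label[label].append(text)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled = {}
    for label, texts in by_label.items():
        if len(texts) < min_count:
            continue  # drop under-represented labels
        k = min(per_label, len(texts))
        sampled[label] = rng.sample(texts, k)
    return sampled

# Toy data: label 1 has 50 items, label 2 has 30, label 14 has only 3.
toy = (
    [(1, f"a{i}") for i in range(50)]
    + [(2, f"b{i}") for i in range(30)]
    + [(14, f"c{i}") for i in range(3)]
)
balanced = sample_per_label(toy, per_label=20, min_count=10)
print({label: len(texts) for label, texts in balanced.items()})
```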

The categories are encoded as follows:

Cars	Finance	IT	Health	Sports	Travel	Education	Military	Culture	Entertainment	Fashion
1	2	3	4	5	6	7	8	9	10	11

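After cleaning, we computed the length of each news text and found that most lengths fall between 0 and 100. We therefore chose 100 as the truncation length for the raw news text.

(x: sentence length, y: number of news items)

Content beyond the first 100 characters is discarded and only the first 100 characters are used as the training sample. This makes the length distribution more uniform and also reduces the feature dimensionality. Next, the news text is processed with the jieba segmentation tool.

Before jieba is applied, a regular expression reduces the original text to sentences consisting only of Chinese characters, which removes the influence of punctuation and separators on the word-frequency statistics.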
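A minimal sketch of the truncation and the Chinese-only filter is given below. The character range and the order of the two operations (truncate first, then filter) are assumptions; the original pattern is not shown in the text.

```python
import re

MAX_LEN = 100  # truncation length chosen in the text

# Matches every character outside the basic CJK Unified Ideographs block.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]")

def clean_text(text, max_len=MAX_LEN):
    """Truncate to max_len characters, then drop everything that is not a Chinese character."""
    return NON_CHINESE.sub("", text[:max_len])

print(clean_text("佛山市，2018年3月：大橋維修!"))
```

Digits are removed along with punctuation, which is consistent with fragments such as "年月日" that appear in the segmented samples.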

The text after jieba segmentation looks like this:

Label	Segmentation example
6	幸福時刻全家福旅遊照片年月日陝西省翠花山全家福旅遊照片年月日 .....
1	佛山市佛陳大橋下週三起實施全封閉維修廣佛都市網訊佛山日報已經 ......
5	圖文長沙站預選賽頒獎儀式鯊魚球杆樑寶忠月日英倫汽車喬氏杯決戰亨德利 ......
11	銀曼專注生活細節探尋美麗真諦愛美的你是否曾希望每天都與大自然保持 .......
....	........
6	社區小時熱貼推薦歡迎來到河南心情時時舒展歡迎加入河南版友羣

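After segmentation, stop words need to be removed. The figure shows the word-frequency statistics (top 10) before stop-word removal:

Before stop-word removal, the documents contain a large number of words such as ‘的’, ‘在’ and ‘是’, which contribute very little to classification. Removing the stop words gives the result shown in the figure.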
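The effect of stop-word removal on the frequency ranking can be reproduced on a toy scale with `collections.Counter`. The three-word stop list and the sample sentences are illustrative assumptions; the texts are taken to be already segmented into whitespace-separated tokens.

```python
from collections import Counter

# Tiny illustrative stop-word list; a real one would be loaded from a file.
STOP_WORDS = {"的", "在", "是"}

def top_words(docs, n, stop_words=frozenset()):
    """Count whitespace-separated tokens over all docs, skipping stop words."""
    counts = Counter(
        token
        for doc in docs
        for token in doc.split()
        if token not in stop_words
    )
    return counts.most_common(n)

docs = ["今天 的 比賽 是 決賽", "比賽 在 長沙 舉行", "長沙 的 比賽"]
print(top_words(docs, 2))               # stop words still counted
print(top_words(docs, 2, STOP_WORDS))   # stop words removed
```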

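4 Classifier Module Implementation

1 CNN

The CNN model uses the third-party Python library Keras, which provides a convenient high-level interface to TensorFlow. It lowers the learning curve: with Keras a neural network can be assembled quickly, without spending a great deal of time learning TensorFlow's lower-level programming model.

The configuration parameters are:

a maximum length of 100 for each news item,

a word-vector dimension of 200,

20% of the data as the test set,

and 16% of the data as the validation set.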
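The split ratios translate into slice boundaries as in the sketch below, which follows the same arithmetic as the scripts in the appendix. The document count used here (ten classes of 2,000 records plus 1,924 culture records) is taken from the dataset description and assumes no other records are present.

```python
VALIDATION_SPLIT = 0.16
TEST_SPLIT = 0.2

def split_points(n_docs, val_split=VALIDATION_SPLIT, test_split=TEST_SPLIT):
    """Return (p1, p2): train is [0, p1), validation is [p1, p2), test is [p2, n_docs)."""
    p1 = int(n_docs * (1 - val_split - test_split))
    p2 = int(n_docs * (1 - test_split))
    return p1, p2

n_docs = 10 * 2000 + 1924  # assumed total number of documents
p1, p2 = split_points(n_docs)
print("train:", p1, "val:", p2 - p1, "test:", n_docs - p2)
```

Because the slices are taken in file order, the data should be shuffled beforehand if the files are grouped by class.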

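a. Steps for training the CNN model without word2vec:

·Use Tokenizer to extract features from all the text, converting each news text into a sequence of word indices.

·Split the data into training, validation and test sets according to the configured ratios.

·Use an embedding layer to map each index sequence to dense, lower-dimensional vectors, producing a 100*200 two-dimensional matrix for each news item.

·Add one convolutional layer and one pooling layer to shorten the sequence, flatten the 2-D output to 1-D with a Flatten layer, and finally use two Dense layers to shrink the vector to length 11, matching the 11 classes of the news dataset.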
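To make the shape changes in these steps concrete, the following sketch computes the sequence length after each layer. The filter count (250), kernel size (3) and pooling size (3) are taken from the CNN script in the appendix; the helper functions themselves are illustrative and assume 'valid' padding throughout.

```python
def conv1d_out_len(length, kernel_size, stride=1):
    """Output length of a 1-D convolution with 'valid' padding."""
    return (length - kernel_size) // stride + 1

def pool1d_out_len(length, pool_size):
    """Output length of non-overlapping max pooling (stride equal to pool_size)."""
    return (length - pool_size) // pool_size + 1

seq_len, embed_dim = 100, 200           # embedding output: 100 x 200
filters, kernel_size, pool = 250, 3, 3  # layer sizes from the appendix script

after_conv = conv1d_out_len(seq_len, kernel_size)  # positions left after the convolution
after_pool = pool1d_out_len(after_conv, pool)      # positions left after pooling
flat = after_pool * filters                        # length of the flattened vector
print(after_conv, after_pool, flat)
```

The flattened vector is then reduced by the two Dense layers, first to 200 units and then to the number of classes.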

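The experimental results are:

Training-set accuracy: 0.8652

Test-set accuracy: 0.81450513

Time: 96s

With a simply constructed neural network, classification over the 11 news categories reaches about 81% accuracy on the test set. The higher training-set accuracy of about 86.5%, however, indicates a certain degree of overfitting. Overall, training takes a relatively long time and efficiency is low.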

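b. Steps for the CNN model with word2vec:

·Replace the 13.12 million trainable parameters of the embedding layer with a pretrained word2vec model. After the replacement the embedding matrix is 65604 x 200, where 65604 is the number of words in the vocabulary.

·The remaining steps are the same as above.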
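The parameter count quoted above, and the way a pretrained embedding matrix is assembled, can be sketched in plain Python as follows. The toy `word_index` and `vectors` dictionaries are hypothetical stand-ins for the tokenizer's word index and the word2vec model; the appendix scripts do the same thing with NumPy and Gensim.

```python
vocab_size, dim = 65604, 200
print(vocab_size * dim)  # number of weights in the embedding matrix

def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the pretrained vector of the word with index i; unknown words stay all-zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]  # row 0 is reserved for padding
    missing = 0
    for word, i in word_index.items():
        if word in vectors:
            matrix[i] = list(vectors[word])
        else:
            missing += 1
    return matrix, missing

# Toy stand-ins for the tokenizer's word index and the pretrained vectors.
word_index = {"新聞": 1, "體育": 2, "罕見詞": 3}
vectors = {"新聞": [0.1, 0.2], "體育": [0.3, 0.4]}
matrix, missing = build_embedding_matrix(word_index, vectors, dim=2)
print(matrix, missing)
```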

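Experimental results:

Training-set accuracy: 0.8629

Test-set accuracy: 0.8262257

Time: 77s

The model has the same shape as before, overfitting is reduced, and test accuracy rises from about 81% to about 82.6%. This suggests that word2vec, which captures semantic relationships between words, can improve both the accuracy and the running time of the model to some extent.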

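2 LSTM

LSTM is a deep-learning model that is comparatively well suited to text classification. The LSTM network is built with Keras in the same way as above.

a. Training the LSTM model without word2vec

The steps and parameters for building the LSTM network are the same as for the CNN and are not repeated here.

Training-set accuracy: 0.7899

Test-set accuracy: 0.7539733

Time: 161.9s

Because the training set is relatively small, the LSTM does not show its usual advantage in natural-language processing. In addition, training the LSTM takes about 1.7 times as long as the CNN (161.9s versus 96s), which is somewhat inefficient.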

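b. LSTM model with word2vec:

Training-set accuracy: 0.8746

Test-set accuracy: 0.821892816

Time: 162s

This indirectly shows how strongly the amount of text affects the LSTM: the large number of parameters contributed by the pretrained word2vec model effectively enlarges the corpus available to the network, and accuracy improves as a result.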

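3 Naive Bayes

The Bayesian models are implemented with Python's machine-learning library scikit-learn. For a traditional machine-learning model there is no need to arrange the word vectors into a two-dimensional matrix and train on text as if it were an image. Each processed document is represented as a 65604*1 vector.

a. Building the naive Bayes model with TF-IDF:

TF-IDF is used for feature extraction. CountVectorizer first builds the vocabulary, which is the set of all words that appear in the documents.

The training set and the test set are then each fitted against this vocabulary:

Because the dataset was split in advance, care must be taken during TF-IDF feature extraction that each word maps to the same position in the vocabulary. The vocabulary built from the whole corpus should be used to vectorize both the training set and the test set. If the two sets were fitted separately, the same index would correspond to different words in the training and test sets, causing large errors. Finally, a naive Bayes model is fitted and its accuracy is checked on the test set.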
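The index-mismatch problem can be seen in a small pure-Python sketch. The helper functions below are illustrative stand-ins for what CountVectorizer does with a fixed vocabulary, and the sample texts are made up.

```python
def build_vocabulary(texts):
    """Map every distinct whitespace-separated token to a fixed column index."""
    tokens = sorted({tok for text in texts for tok in text.split()})
    return {tok: i for i, tok in enumerate(tokens)}

def count_vector(text, vocabulary):
    """Bag-of-words counts for one text under the given vocabulary."""
    vec = [0] * len(vocabulary)
    for tok in text.split():
        if tok in vocabulary:
            vec[vocabulary[tok]] += 1
    return vec

train = ["股票 上漲", "球隊 獲勝"]
test = ["股票 下跌"]

shared = build_vocabulary(train + test)
# Fitting separately gives the same word a different column in each split.
separate_train, separate_test = build_vocabulary(train), build_vocabulary(test)
print(shared["股票"], separate_train["股票"], separate_test["股票"])
print(count_vector(test[0], shared))
```

With the shared vocabulary, training and test vectors have the same length and the same column always refers to the same word.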

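This simple Bayesian model is slightly more accurate than the LSTM, and it runs in under 2s, roughly one ninetieth of the LSTM's running time.

b. Training the Bayesian model with word2vec:

The configuration is: a maximum article length of 100, a context window size N of 4, multi-core CPU acceleration, skip-gram for feature extraction, and 10 training iterations.

Text matrix after word2vec processing:

Each word now corresponds to a large number of weights, and the matrices produced by word2vec are continuous-valued. Ordinary (multinomial) naive Bayes cannot handle such continuous vectors, so Gaussian naive Bayes is used here, under the assumption that the features follow a normal distribution.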
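A minimal sketch of the two ideas involved is given below: a document is represented by the mean of its word vectors, and each class scores it with a per-feature normal density. The two-dimensional "word vectors" and the class means and variances are toy values chosen for illustration, not trained parameters.

```python
import math

def mean_vector(words, vectors, dim):
    """Average the vectors of the words that have one; all-zero if none do."""
    total = [0.0] * dim
    found = 0
    for w in words:
        if w in vectors:
            total = [t + v for t, v in zip(total, vectors[w])]
            found += 1
    return [t / found for t in total] if found else total

def gaussian_log_likelihood(x, means, variances):
    """log p(x | class) under independent per-feature normal distributions."""
    return sum(
        -0.5 * math.log(2 * math.pi * var) - (xi - mu) ** 2 / (2 * var)
        for xi, mu, var in zip(x, means, variances)
    )

vectors = {"球隊": [1.0, 0.0], "比賽": [0.0, 1.0]}  # toy 2-D word vectors
doc = mean_vector(["球隊", "比賽", "未知"], vectors, dim=2)
print(doc)
# The class whose mean lies closer to the document gets the higher likelihood.
sports = gaussian_log_likelihood(doc, means=[0.5, 0.5], variances=[1.0, 1.0])
finance = gaussian_log_likelihood(doc, means=[-2.0, 3.0], variances=[1.0, 1.0])
print(sports > finance)
```

Averaging discards word order and blurs distinctive words, which is one plausible reason why this representation can underperform sparse TF-IDF features with a naive Bayes classifier.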

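The training results are:

The accuracy is surprisingly low, and the running time is 22 times that of the TF-IDF naive Bayes model. Intuitively, naive Bayes already achieves fairly high accuracy, and word2vec compensates for its word-independence assumption, so the accuracy would be expected to improve. The next chapter, on experimental results, returns to this question.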

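4 SVM

The SVM follows the same approach as naive Bayes. Note that the SVM in this project uses a linear kernel, chosen after comparison with Gaussian and polynomial kernels.

a. Building the SVM model with TF-IDF:

Feature construction works the same way as before and is not explained again; see the naive Bayes section above for details.

Accuracy: 84.4%. This is slightly below naive Bayes, but the overall result is still good.

b. Building the SVM model with word2vec:

Accuracy is only 77.08%, although this is still much better than word2vec with Gaussian naive Bayes. With the same parameters, the running time reaches 134.9s, which is very inefficient.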

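5 System Interface Implementation

The system interface is developed with Python's Django web framework, with Python's built-in sqlite3 as the database. The front end uses JavaScript and CSS to build a fairly minimal UI. The home page provides a publish button.

Clicking the publish button (1) opens the dialog shown in the figure:

In the dialog, a URL can be entered in the link field (2). Because the system was developed for Sina News, only Sina News links are currently supported. After adding a link, clicking "get title" (3) automatically crawls the corresponding news title and summary (4), and the user then clicks the publish button (5). The back end receives the title and summary, segments them, and maps the words onto the vocabulary and TF-IDF values saved during feature extraction to produce a feature vector. This vector is passed to the naive Bayes model trained offline, which computes the probability of each category and thereby predicts the class.

As the figure above shows, the Bayesian classifier achieves high accuracy and classifies the news well.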

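Chapter 5 Analysis of Experimental Results

5.1 System evaluation metrics (ROC, AUC, training time)

a. ROC, AUC

Before introducing the evaluation metrics, four concepts need to be defined for a given news category:

·True Positive (TP): the number of news items that belong to the category and are predicted as that category.

·False Positive (FP): the number of news items that are predicted as the category but actually belong to a different one;

·True Negative (TN): the number of news items that are predicted as not belonging to the category and indeed do not belong to it;

·False Negative (FN): the number of news items that are predicted as not belonging to the category but actually do belong to it;

From these we can compute the precision and the recall.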

		Predicted class
True class		Finance	Non-finance
	Finance	170 (TP)	300 (FN)
	Non-finance	30 (FP)	1700 (TN)

In the table above, TP is 170, FN is 300, FP is 30 and TN is 1700.

Therefore precision = 170/(170+30) = 85% and recall = 170/(170+300) = 36.17%.
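The same arithmetic, written as a small helper using the counts from the finance example (TP = 170, FP = 30, FN = 300):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# Counts from the finance example table.
precision, recall = precision_recall(tp=170, fp=30, fn=300)
print(f"precision={precision:.2%} recall={recall:.2%}")
```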

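The ROC curve plots the true positive rate (the recall, TP/(TP+FN)) against the false positive rate (FP/(FP+TN)) as the decision threshold varies. The closer the ROC curve lies to the upper-left corner, the better the classifier. AUC is the area under the ROC curve; the larger the AUC, the better the classifier.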
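For a single class treated as positive against all others, a ROC curve in the standard sense (true positive rate against false positive rate) and its AUC can be computed as in the sketch below. This is a simplified illustration with made-up scores that assumes all scores are distinct; the appendix scripts use scikit-learn's `roc_curve` and `auc` instead.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) lists obtained by lowering the threshold one score at a time.

    labels are 1 (positive) or 0 (negative). Tied scores are not grouped,
    so this sketch assumes all scores are distinct.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), reverse=True)
    tp = fp = 0
    fpr, tpr = [0.0], [0.0]
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        fpr.append(fp / neg)
        tpr.append(tp / pos)
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
        for i in range(1, len(fpr))
    )

labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
fpr, tpr = roc_points(labels, scores)
print(round(auc(fpr, tpr), 4))
```

The AUC equals the fraction of positive/negative pairs in which the positive item receives the higher score, which is 8 out of 9 pairs for this toy data.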

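5.2 Dataset used for model fitting

So that every category is represented evenly, 2,000 news records were randomly sampled from each category for model fitting. Only 1,924 records were crawled for the culture category, so all 1,924 were used as its sample.

The category mapping is:

Cars	Finance	IT	Health	Sports	Travel	Education	Military	Culture	Entertainment	Fashion
1	2	3	4	5	6	7	8	9	10	11

For the deep-learning models (CNN and LSTM), 20% of the data is used as the test set and 16% as the validation set.

For the traditional machine-learning models, 80% of the data is used as the training set and 20% as the validation set.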

5.3 Evaluation of the classification algorithms

Algorithm	Accuracy	Training time
Naive Bayes + TF-IDF	0.85430157	<2s
SVM + TF-IDF	0.84435707	45s
CNN + word2vec	0.8262257	77s
LSTM + word2vec	0.821892816	172s
CNN + Tokenizer	0.8150513	96s
SVM + word2vec	0.77081406	23s
LSTM + Tokenizer	0.7539733	161.9s
Naive Bayes + word2vec	0.5790934	13s

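ROC curves for naive Bayes + TF-IDF:

ROC curves for SVM + TF-IDF:

ROC curves for CNN + word2vec:

ROC curves for LSTM + word2vec:

In terms of training time, naive Bayes with TF-IDF needs only about 2s to fit a model on more than 14,000 records. Comparing the ROC plots of the different models, the naive Bayes ROC curve is overall the closest to the upper-left corner, and naive Bayes also has the largest area under the ROC curve (AUC), indicating the best classification performance.

We therefore conclude that naive Bayes with TF-IDF is the most suitable model for this project's news classifier. It reaches an accuracy of 0.85430157 and is fast enough to classify large volumes of news.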

Source code:

The word vectors used by the deep-learning models were trained on a Wikipedia corpus.

CNN:

#coding:utf-8
import sys
import keras
import matplotlib.pyplot as plt




VECTOR_DIR = 'vectors.bin'


MAX_SEQUENCE_LENGTH = 100
EMBEDDING_DIM = 200
VALIDATION_SPLIT = 0.16
TEST_SPLIT = 0.2




print ('(1) load texts...')
train_texts = open('train_contents.txt',encoding='utf-8').read().split('\n')
train_labels = open('train_labels.txt',encoding='utf-8').read().split('\n')
test_texts = open('test_contents.txt',encoding='utf-8').read().split('\n')
test_labels = open('test_labels.txt',encoding='utf-8').read().split('\n')
all_texts = train_texts + test_texts
all_labels = train_labels + test_labels






print ('(2) doc to var...')
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
import numpy as np


tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_texts)
sequences = tokenizer.texts_to_sequences(all_texts)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
labels = to_categorical(np.asarray(all_labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)




print ('(3) split data set...')
# split the data into training set, validation set, and test set
p1 = int(len(data)*(1-VALIDATION_SPLIT-TEST_SPLIT))
p2 = int(len(data)*(1-TEST_SPLIT))
x_train = data[:p1]
y_train = labels[:p1]
x_val = data[p1:p2]
y_val = labels[p1:p2]
x_test = data[p2:]
y_test = labels[p2:]
print ('train docs: '+str(len(x_train)))
print ('val docs: '+str(len(x_val)))
print ('test docs: '+str(len(x_test)))




print ('(5) training model...')
from keras.layers import Dense, Input, Flatten, Dropout
from keras.layers import Conv1D, MaxPooling1D, Embedding, GlobalMaxPooling1D
from keras.models import Sequential




model = Sequential()
model.add(Embedding(len(word_index) + 1, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH))
model.add(Dropout(0.2))
model.add(Conv1D(250, 3, padding='valid', activation='relu', strides=1))
model.add(MaxPooling1D(3))
model.add(Flatten())
model.add(Dense(EMBEDDING_DIM, activation='relu'))
model.add(Dense(labels.shape[1], activation='softmax'))
model.summary()
#plot_model(model, to_file='model.png',show_shapes=True)


model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])
print (model.metrics_names)
# fit() returns the training history; the original passed an undefined `history` as a callback
history = model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=2, batch_size=128)
#model.save('cnn.h5')


print ('(6) testing model...')
print (model.evaluate(x_test, y_test))






import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve,auc
import numpy as np
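from numpy import interp  # scipy.interp is deprecated and absent from recent SciPy releases
from itertools import cycle  # used by the per-class plotting loop below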


y_score  = model.predict(x_test)
lw = 2
n_classes = 11
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])


# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])




# Compute macro-average ROC curve and ROC area


# First aggregate all false positive rates
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))


# Then interpolate all ROC curves at this points
mean_tpr = np.zeros_like(all_fpr)
for i in range(n_classes):
    mean_tpr += interp(all_fpr, fpr[i], tpr[i])


# Finally average it and compute AUC
mean_tpr /= n_classes


fpr["macro"] = all_fpr
tpr["macro"] = mean_tpr
roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])






# Plot all ROC curves
plt.figure()
plt.plot(fpr["micro"], tpr["micro"],
         label='micro-average ROC curve (area = {0:0.2f})'
               ''.format(roc_auc["micro"]),
         color='deeppink', linestyle=':', linewidth=4)


plt.plot(fpr["macro"], tpr["macro"],
         label='macro-average ROC curve (area = {0:0.2f})'
               ''.format(roc_auc["macro"]),
         color='navy', linestyle=':', linewidth=4)


colors = cycle(['aqua', 'darkorange', 'cornflowerblue'])
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=lw,
             label='ROC curve of class {0} (area = {1:0.2f})'
             ''.format(i, roc_auc[i]))


plt.plot([0, 1], [0, 1], 'k--', lw=lw)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Some extension of Receiver operating characteristic to multi-class')
plt.legend(loc="lower right")
plt.show()

CNN+word2vec

#coding:utf-8
import sys
import keras




VECTOR_DIR = 'vectors.bin'


MAX_SEQUENCE_LENGTH = 100
EMBEDDING_DIM = 128
VALIDATION_SPLIT = 0.16
TEST_SPLIT = 0.2




print ('(1) load texts...')
train_texts = open('train_contents.txt',encoding='utf-8').read().split('\n')
train_labels = open('train_labels.txt',encoding='utf-8').read().split('\n')
test_texts = open('test_contents.txt',encoding='utf-8').read().split('\n')
test_labels = open('test_labels.txt',encoding='utf-8').read().split('\n')
all_texts = train_texts + test_texts
all_labels = train_labels + test_labels




print ('(2) doc to var...')
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
import numpy as np


tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_texts)
sequences = tokenizer.texts_to_sequences(all_texts)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
labels = to_categorical(np.asarray(all_labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)




print ('(3) split data set...')
# split the data into training set, validation set, and test set
p1 = int(len(data)*(1-VALIDATION_SPLIT-TEST_SPLIT))
p2 = int(len(data)*(1-TEST_SPLIT))
x_train = data[:p1]
y_train = labels[:p1]
x_val = data[p1:p2]
y_val = labels[p1:p2]
x_test = data[p2:]
y_test = labels[p2:]
print ('train docs: '+str(len(x_train)))
print ('val docs: '+str(len(x_val)))
print ('test docs: '+str(len(x_test)))




print ('(4) load word2vec as embedding...')


import gensim
from keras.utils import plot_model
w2v_model = gensim.models.KeyedVectors.load_word2vec_format(VECTOR_DIR, binary=True)
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
not_in_model = 0
in_model = 0
for word, i in word_index.items(): 
    if word in w2v_model:
        in_model += 1
        embedding_matrix[i] = np.asarray(w2v_model[word], dtype='float32')
    else:
        not_in_model += 1
print (str(not_in_model)+' words not in w2v model')
from keras.layers import Embedding
embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)




print ('(5) training model...')
from keras.layers import Dense, Input, Flatten, Dropout
from keras.layers import Conv1D, MaxPooling1D, Embedding, GlobalMaxPooling1D
from keras.models import Sequential


model = Sequential()
model.add(embedding_layer)
model.add(Dropout(0.2))
model.add(Conv1D(250, 3, padding='valid', activation='relu', strides=1))
model.add(MaxPooling1D(3))
model.add(Flatten())
model.add(Dense(EMBEDDING_DIM, activation='relu'))
model.add(Dense(labels.shape[1], activation='softmax'))
model.summary()
#plot_model(model, to_file='model.png',show_shapes=True)


model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])
print( model.metrics_names)
# validate on the validation split so that the test set stays unseen until evaluation
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=2, batch_size=128)
model.save('word_vector_cnn.h5')


print ('(6) testing model...')
print (model.evaluate(x_test, y_test))


        


import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve,auc
import numpy as np
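from numpy import interp  # scipy.interp is deprecated and absent from recent SciPy releases
from itertools import cycle  # used by the per-class plotting loop below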


y_score  = model.predict(x_test)
lw = 2
n_classes = 11
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])


# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])




# Compute macro-average ROC curve and ROC area


# First aggregate all false positive rates
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))


# Then interpolate all ROC curves at this points
mean_tpr = np.zeros_like(all_fpr)
for i in range(n_classes):
    mean_tpr += interp(all_fpr, fpr[i], tpr[i])


# Finally average it and compute AUC
mean_tpr /= n_classes


fpr["macro"] = all_fpr
tpr["macro"] = mean_tpr
roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])






# Plot all ROC curves
plt.figure()
plt.plot(fpr["micro"], tpr["micro"],
         label='micro-average ROC curve (area = {0:0.2f})'
               ''.format(roc_auc["micro"]),
         color='deeppink', linestyle=':', linewidth=4)


plt.plot(fpr["macro"], tpr["macro"],
         label='macro-average ROC curve (area = {0:0.2f})'
               ''.format(roc_auc["macro"]),
         color='navy', linestyle=':', linewidth=4)


colors = cycle(['aqua', 'darkorange', 'cornflowerblue'])
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=lw,
             label='ROC curve of class {0} (area = {1:0.2f})'
             ''.format(i, roc_auc[i]))


plt.plot([0, 1], [0, 1], 'k--', lw=lw)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Some extension of Receiver operating characteristic to multi-class')
plt.legend(loc="lower right")
plt.show()

LSTM

#coding:utf-8


import keras
import matplotlib.pyplot as plt
VECTOR_DIR = 'vectors.bin'


MAX_SEQUENCE_LENGTH = 100
EMBEDDING_DIM = 200
VALIDATION_SPLIT = 0.16
TEST_SPLIT = 0.2


print ('(1) load texts...')
train_texts = open('train_contents.txt',encoding='utf-8').read().split('\n')
train_labels = open('train_labels.txt',encoding='utf-8').read().split('\n')
test_texts = open('test_contents.txt',encoding='utf-8').read().split('\n')
test_labels = open('test_labels.txt',encoding='utf-8').read().split('\n')
all_texts = train_texts + test_texts
all_labels = train_labels + test_labels




print ('(2) doc to var...')
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
import numpy as np


tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_texts)
sequences = tokenizer.texts_to_sequences(all_texts)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
labels = to_categorical(np.asarray(all_labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)




print ('(3) split data set...')
p1 = int(len(data)*(1-VALIDATION_SPLIT-TEST_SPLIT))
p2 = int(len(data)*(1-TEST_SPLIT))
x_train = data[:p1]
y_train = labels[:p1]
x_val = data[p1:p2]
y_val = labels[p1:p2]
x_test = data[p2:]
y_test = labels[p2:]
print ('train docs: '+str(len(x_train)))
print ('val docs: '+str(len(x_val)))
print ('test docs: '+str(len(x_test)))


from keras.layers import Dense, Input, Flatten, Dropout
from keras.layers import LSTM, Embedding
from keras.models import Sequential


model = Sequential()
model.add(Embedding(len(word_index) + 1, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH))
model.add(LSTM(200, dropout=0.2, recurrent_dropout=0.2))
model.add(Dropout(0.2))
model.add(Dense(labels.shape[1], activation='softmax'))
model.summary()


model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])


history = model.fit(x_train, y_train,validation_data=(x_val, y_val), epochs=2, batch_size=128)
#model.save('lstm.h5')


print (model.evaluate(x_test, y_test))




import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve,auc
import numpy as np
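from numpy import interp  # scipy.interp is deprecated and absent from recent SciPy releases
from itertools import cycle  # used by the per-class plotting loop below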


y_score  = model.predict(x_test)
lw = 2
n_classes = 11
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])


# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])




# Compute macro-average ROC curve and ROC area


# First aggregate all false positive rates
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))


# Then interpolate all ROC curves at this points
mean_tpr = np.zeros_like(all_fpr)
for i in range(n_classes):
    mean_tpr += interp(all_fpr, fpr[i], tpr[i])


# Finally average it and compute AUC
mean_tpr /= n_classes


fpr["macro"] = all_fpr
tpr["macro"] = mean_tpr
roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])






# Plot all ROC curves
plt.figure()
plt.plot(fpr["micro"], tpr["micro"],
         label='micro-average ROC curve (area = {0:0.2f})'
               ''.format(roc_auc["micro"]),
         color='deeppink', linestyle=':', linewidth=4)


plt.plot(fpr["macro"], tpr["macro"],
         label='macro-average ROC curve (area = {0:0.2f})'
               ''.format(roc_auc["macro"]),
         color='navy', linestyle=':', linewidth=4)


colors = cycle(['aqua', 'darkorange', 'cornflowerblue'])
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=lw,
             label='ROC curve of class {0} (area = {1:0.2f})'
             ''.format(i, roc_auc[i]))


plt.plot([0, 1], [0, 1], 'k--', lw=lw)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Some extension of Receiver operating characteristic to multi-class')
plt.legend(loc="lower right")
plt.show()

LSTM+word2vec

#coding:utf-8
import sys
import keras




VECTOR_DIR = 'vectors.bin'


MAX_SEQUENCE_LENGTH = 100
EMBEDDING_DIM = 128
VALIDATION_SPLIT = 0.16
TEST_SPLIT = 0.2




print ('(1) load texts...')
train_texts = open('train_contents.txt',encoding='utf-8').read().split('\n')
train_labels = open('train_labels.txt',encoding='utf-8').read().split('\n')
test_texts = open('test_contents.txt',encoding='utf-8').read().split('\n')
test_labels = open('test_labels.txt',encoding='utf-8').read().split('\n')
all_texts = train_texts + test_texts
all_labels = train_labels + test_labels




print ('(2) doc to var...')
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
import numpy as np


tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_texts)
sequences = tokenizer.texts_to_sequences(all_texts)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
labels = to_categorical(np.asarray(all_labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)




print ('(3) split data set...')
p1 = int(len(data)*(1-VALIDATION_SPLIT-TEST_SPLIT))
p2 = int(len(data)*(1-TEST_SPLIT))
x_train = data[:p1]
y_train = labels[:p1]
x_val = data[p1:p2]
y_val = labels[p1:p2]
x_test = data[p2:]
y_test = labels[p2:]
print ('train docs: '+str(len(x_train)))
print ('val docs: '+str(len(x_val)))
print ('test docs: '+str(len(x_test)))




print ('(4) load word2vec as embedding...')
import gensim
from keras.utils import plot_model
w2v_model = gensim.models.KeyedVectors.load_word2vec_format(VECTOR_DIR, binary=True)
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
not_in_model = 0
in_model = 0
for word, i in word_index.items(): 
    if word in w2v_model:
        in_model += 1
        embedding_matrix[i] = np.asarray(w2v_model[word], dtype='float32')
    else:
        not_in_model += 1
print (str(not_in_model)+' words not in w2v model')
from keras.layers import Embedding
embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)




print ('(5) training model...')
from keras.layers import Dense, Input, Flatten, Dropout
from keras.layers import LSTM, Embedding
from keras.models import Sequential


model = Sequential()
model.add(embedding_layer)
model.add(LSTM(200, dropout=0.2, recurrent_dropout=0.2))
model.add(Dropout(0.2))
model.add(Dense(labels.shape[1], activation='softmax'))
model.summary()
#plot_model(model, to_file='model.png',show_shapes=True)


model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])
print (model.metrics_names)
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=2, batch_size=128)
#model.save('word_vector_lstm.h5')


print ('(6) testing model...')
print (model.evaluate(x_test, y_test))


        
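# Plot the ROC curves
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve,auc
import numpy as np
from numpy import interp  # scipy.interp is deprecated and absent from recent SciPy releases
from itertools import cycle  # used by the per-class plotting loop below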


y_score  = model.predict(x_test)
lw = 2
n_classes = 11
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])


# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])




# Compute macro-average ROC curve and ROC area


# First aggregate all false positive rates
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))


# Then interpolate all ROC curves at this points
mean_tpr = np.zeros_like(all_fpr)
for i in range(n_classes):
    mean_tpr += interp(all_fpr, fpr[i], tpr[i])


# Finally average it and compute AUC
mean_tpr /= n_classes


fpr["macro"] = all_fpr
tpr["macro"] = mean_tpr
roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])






# Plot all ROC curves
plt.figure()
plt.plot(fpr["micro"], tpr["micro"],
         label='micro-average ROC curve (area = {0:0.2f})'
               ''.format(roc_auc["micro"]),
         color='deeppink', linestyle=':', linewidth=4)


plt.plot(fpr["macro"], tpr["macro"],
         label='macro-average ROC curve (area = {0:0.2f})'
               ''.format(roc_auc["macro"]),
         color='navy', linestyle=':', linewidth=4)


colors = cycle(['aqua', 'darkorange', 'cornflowerblue'])
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=lw,
             label='ROC curve of class {0} (area = {1:0.2f})'
             ''.format(i, roc_auc[i]))


plt.plot([0, 1], [0, 1], 'k--', lw=lw)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Some extension of Receiver operating characteristic to multi-class')
plt.legend(loc="lower right")
plt.show()

Naive Bayes

#coding:utf-8


from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer   
from sklearn.naive_bayes import MultinomialNB  
from sklearn import metrics


train_texts = open('train_contents.txt',encoding='utf-8').read().split('\n')
train_labels = open('train_labels.txt',encoding='utf-8').read().split('\n')
test_texts = open('test_contents.txt',encoding='utf-8').read().split('\n')
test_labels = open('test_labels.txt',encoding='utf-8').read().split('\n')
all_text = train_texts + test_texts




count_v0= CountVectorizer();  
counts_all = count_v0.fit_transform(all_text);
count_v1= CountVectorizer(vocabulary=count_v0.vocabulary_);  
counts_train = count_v1.fit_transform(train_texts);   
print ("the shape of train is "+repr(counts_train.shape) ) 
count_v2 = CountVectorizer(vocabulary=count_v0.vocabulary_);  
counts_test = count_v2.fit_transform(test_texts);  
print ("the shape of test is "+repr(counts_test.shape) ) 
 


# Fit the IDF weights on the training counts only, then reuse them for the test counts
tfidftransformer = TfidfTransformer()
train_data = tfidftransformer.fit(counts_train).transform(counts_train)
test_data = tfidftransformer.transform(counts_test)


x_train = train_data
y_train = train_labels
x_test = test_data
y_test = test_labels




clf = MultinomialNB(alpha = 0.01)   
clf.fit(x_train, y_train);  
preds = clf.predict(x_test);
num = 0
preds = preds.tolist()
for i,pred in enumerate(preds):
    if int(pred) == int(y_test[i]):
        num += 1
print ('precision_score:' + str(float(num) / len(preds)))




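import matplotlib.pyplot as plt
from itertools import cycle  # used by the per-class plotting loop below
from sklearn.metrics import roc_curve,auc
from sklearn.preprocessing import label_binarize
import numpy as np
from numpy import interp  # scipy.interp is deprecated and absent from recent SciPy releases


# roc_curve needs per-class scores and one-hot labels rather than predicted label strings
y_score = clf.predict_proba(x_test)
y_test = label_binarize(y_test, classes=clf.classes_)
lw = 2
n_classes = len(clf.classes_)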
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])


# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])




# Compute macro-average ROC curve and ROC area


# First aggregate all false positive rates
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))


# Then interpolate all ROC curves at this points
mean_tpr = np.zeros_like(all_fpr)
for i in range(n_classes):
    mean_tpr += interp(all_fpr, fpr[i], tpr[i])


# Finally average it and compute AUC
mean_tpr /= n_classes


fpr["macro"] = all_fpr
tpr["macro"] = mean_tpr
roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])






# Plot all ROC curves
plt.figure()
plt.plot(fpr["micro"], tpr["micro"],
         label='micro-average ROC curve (area = {0:0.2f})'
               ''.format(roc_auc["micro"]),
         color='deeppink', linestyle=':', linewidth=4)


plt.plot(fpr["macro"], tpr["macro"],
         label='macro-average ROC curve (area = {0:0.2f})'
               ''.format(roc_auc["macro"]),
         color='navy', linestyle=':', linewidth=4)


colors = cycle(['aqua', 'darkorange', 'cornflowerblue'])
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=lw,
             label='ROC curve of class {0} (area = {1:0.2f})'
             ''.format(i, roc_auc[i]))


plt.plot([0, 1], [0, 1], 'k--', lw=lw)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Some extension of Receiver operating characteristic to multi-class')
plt.legend(loc="lower right")
plt.show()

Gaussian naive Bayes + word2vec

#coding:utf-8
from sklearn.naive_bayes import MultinomialNB 
from sklearn.preprocessing import scale
from sklearn.naive_bayes import GaussianNB




VECTOR_DIR = 'vectors.bin'


MAX_SEQUENCE_LENGTH = 100
EMBEDDING_DIM = 128
TEST_SPLIT = 0.2




print ('(1) load texts...')
train_docs = open('train_contents.txt',encoding = 'utf-8').read().split('\n')
train_labels = open('train_labels.txt',encoding = 'utf-8').read().split('\n')
test_docs = open('test_contents.txt',encoding = 'utf-8').read().split('\n')
test_labels = open('test_labels.txt',encoding = 'utf-8').read().split('\n')


print ('(2) doc to var...')
import gensim
import numpy as np
w2v_model = gensim.models.KeyedVectors.load_word2vec_format(VECTOR_DIR, binary=True)


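def buildWordVector(text, size):
    '''
        Represent a document by the average of the word vectors of all its words.
    '''
    vec = np.zeros(size).reshape((1, size))
    count = 0
    # assumes the text is already segmented into whitespace-separated words;
    # iterating over the raw string would visit single characters instead
    for word in text.split():
        try:
            vec += w2v_model[word].reshape((1, size))
            count += 1
        except KeyError:
            # word not in the word2vec vocabulary
            continue
    if count != 0:
        vec /= count
    return vec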
    
# Build the averaged word vector of every document and standardize the features
x_train1 = np.concatenate([buildWordVector(x, 128) for x in train_docs])
print ("the shape of train is "+repr(x_train1.shape) ) 
x_train = scale(x_train1)
x_test1 = np.concatenate([buildWordVector(x, 128) for x in test_docs])
print ("the shape of test is "+repr(x_test1.shape) ) 
x_test = scale(x_test1)
y_train = train_labels
y_test = test_labels


clf = GaussianNB()  
clf.fit(x_train, y_train);  
preds = clf.predict(x_test);
num = 0
preds = preds.tolist()
for i,pred in enumerate(preds):
    if int(pred) == int(y_test[i]):
        num += 1
print ('precision_score:' + str(float(num) / len(preds)))

SVM

#coding:utf-8
import sys
VECTOR_DIR = 'vectors.bin'


MAX_SEQUENCE_LENGTH = 100
EMBEDDING_DIM = 200
TEST_SPLIT = 0.2




print ('(1) load texts...')
train_texts = open('train_contents.txt',encoding='utf-8').read().split('\n')
train_labels = open('train_labels.txt',encoding='utf-8').read().split('\n')
test_texts = open('test_contents.txt',encoding='utf-8').read().split('\n')
test_labels = open('test_labels.txt',encoding='utf-8').read().split('\n')
all_text = train_texts + test_texts


print ('(2) doc to var...')
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer   
count_v0= CountVectorizer();  
counts_all = count_v0.fit_transform(all_text);
count_v1= CountVectorizer(vocabulary=count_v0.vocabulary_);  
counts_train = count_v1.fit_transform(train_texts);   
print ("the shape of train is "+repr(counts_train.shape) ) 
count_v2 = CountVectorizer(vocabulary=count_v0.vocabulary_);  
counts_test = count_v2.fit_transform(test_texts);  
print ("the shape of test is "+repr(counts_test.shape) ) 
  
# Fit the IDF weights on the training counts only, then reuse them for the test counts
tfidftransformer = TfidfTransformer()
train_data = tfidftransformer.fit(counts_train).transform(counts_train)
test_data = tfidftransformer.transform(counts_test)


x_train = train_data
y_train = train_labels
x_test = test_data
y_test = test_labels


print('(3) SVM...')
from sklearn.svm import SVC
svclf = SVC(kernel='linear')
svclf.fit(x_train, y_train)
preds = svclf.predict(x_test).tolist()
# Count exact label matches; this ratio is the overall accuracy, not precision
num = 0
for i, pred in enumerate(preds):
    if int(pred) == int(y_test[i]):
        num += 1
print('accuracy:' + str(float(num) / len(preds)))








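import matplotlib.pyplot as plt
from itertools import cycle
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize
import numpy as np
from numpy import interp  # scipy.interp was only a deprecated alias of numpy.interp


# roc_curve needs one indicator column per class and continuous scores,
# so binarize the labels and use the SVM decision values instead of predict()
y_test_bin = label_binarize(y_test, classes=svclf.classes_)
y_score = svclf.decision_function(x_test)
lw = 2
n_classes = 11
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test_bin[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])


# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test_bin.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])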




# Compute macro-average ROC curve and ROC area


# First aggregate all false positive rates
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))


# Then interpolate all ROC curves at these points
mean_tpr = np.zeros_like(all_fpr)
for i in range(n_classes):
    mean_tpr += interp(all_fpr, fpr[i], tpr[i])


# Finally average it and compute AUC
mean_tpr /= n_classes


fpr["macro"] = all_fpr
tpr["macro"] = mean_tpr
roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])






# Plot all ROC curves
plt.figure()
plt.plot(fpr["micro"], tpr["micro"],
         label='micro-average ROC curve (area = {0:0.2f})'
               ''.format(roc_auc["micro"]),
         color='deeppink', linestyle=':', linewidth=4)


plt.plot(fpr["macro"], tpr["macro"],
         label='macro-average ROC curve (area = {0:0.2f})'
               ''.format(roc_auc["macro"]),
         color='navy', linestyle=':', linewidth=4)


colors = cycle(['aqua', 'darkorange', 'cornflowerblue'])
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=lw,
             label='ROC curve of class {0} (area = {1:0.2f})'
             ''.format(i, roc_auc[i]))


plt.plot([0, 1], [0, 1], 'k--', lw=lw)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Some extension of Receiver operating characteristic to multi-class')
plt.legend(loc="lower right")
plt.show()
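
For the TF-IDF features used by this SVM, it matters which documents the IDF weights are learned from: learning them on the training split alone and reusing them for the test split keeps both splits in the same feature space. The following is a minimal pure-Python sketch of that idea, not the scikit-learn implementation. The function names `fit_idf` and `transform_doc` and the toy documents are assumptions made for illustration. It uses the smoothed formula ln((1 + n) / (1 + df)) + 1 with L2 normalisation, which is the default documented for `TfidfTransformer`.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Learn smoothed IDF weights from the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        # Each document contributes at most once per term
        doc_freq.update(set(doc.split()))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1.0
            for term, df in doc_freq.items()}


def transform_doc(doc, idf):
    """Map one document to an L2-normalised TF-IDF dict; unseen terms are dropped."""
    counts = Counter(t for t in doc.split() if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


# Toy, already segmented documents (hypothetical data)
train = ["stock rise finance", "team match sports", "stock fall finance"]
test = ["stock match newword"]

idf = fit_idf(train)                      # weights come from the training split
train_vecs = [transform_doc(d, idf) for d in train]
test_vec = transform_doc(test[0], idf)    # the same weights are reused here
print(test_vec)
```

Terms that occur in many training documents ("stock") receive a lower weight than rarer ones ("match"), and a term never seen in training ("newword") simply has no column.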

SVM+word2vec

#coding:utf-8


from sklearn.preprocessing import scale






VECTOR_DIR = 'vectors.bin'


MAX_SEQUENCE_LENGTH = 100
EMBEDDING_DIM = 200
TEST_SPLIT = 0.2




print ('(1) load texts...')
train_docs = open('train_contents.txt',encoding = 'utf-8').read().split('\n')
train_labels = open('train_labels.txt',encoding = 'utf-8').read().split('\n')
test_docs = open('test_contents.txt',encoding = 'utf-8').read().split('\n')
test_labels = open('test_labels.txt',encoding = 'utf-8').read().split('\n')


print ('(2) doc to var...')
import gensim
import numpy as np
w2v_model = gensim.models.KeyedVectors.load_word2vec_format(VECTOR_DIR, binary=True)


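def buildWordVector(text, size):
    '''
        Represent a document by the average of the word vectors of its tokens.
        Assumes `text` is one line of space-separated tokens; words missing
        from the word2vec vocabulary are skipped.
    '''
    vec = np.zeros((1, size))
    count = 0
    for word in text.split():
        try:
            vec += w2v_model[word].reshape((1, size))
            count += 1
        except KeyError:
            continue
    if count != 0:
        vec /= count
    return vec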
    


# Build the averaged word vector for every document, then standardize the features
x_train1 = np.concatenate([buildWordVector(x, 128) for x in train_docs])
x_train = scale(x_train1)
x_test1 = np.concatenate([buildWordVector(x, 128) for x in test_docs])
x_test = scale(x_test1)
y_train = train_labels
y_test = test_labels


print('(3) SVM...')
from sklearn.svm import SVC
svclf = SVC(kernel='linear')
svclf.fit(x_train, y_train)
preds = svclf.predict(x_test).tolist()
# Count exact label matches; this ratio is the overall accuracy, not precision
num = 0
for i, pred in enumerate(preds):
    if int(pred) == int(y_test[i]):
        num += 1
print('accuracy:' + str(float(num) / len(preds)))










        


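import matplotlib.pyplot as plt
from itertools import cycle
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize
import numpy as np
from numpy import interp  # scipy.interp was only a deprecated alias of numpy.interp


# roc_curve needs one indicator column per class and continuous scores,
# so binarize the labels and use the SVM decision values instead of predict()
y_test_bin = label_binarize(y_test, classes=svclf.classes_)
y_score = svclf.decision_function(x_test)
lw = 2
n_classes = 11
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test_bin[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])


# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test_bin.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])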




# Compute macro-average ROC curve and ROC area


# First aggregate all false positive rates
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))


# Then interpolate all ROC curves at these points
mean_tpr = np.zeros_like(all_fpr)
for i in range(n_classes):
    mean_tpr += interp(all_fpr, fpr[i], tpr[i])


# Finally average it and compute AUC
mean_tpr /= n_classes


fpr["macro"] = all_fpr
tpr["macro"] = mean_tpr
roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])






# Plot all ROC curves
plt.figure()
plt.plot(fpr["micro"], tpr["micro"],
         label='micro-average ROC curve (area = {0:0.2f})'
               ''.format(roc_auc["micro"]),
         color='deeppink', linestyle=':', linewidth=4)


plt.plot(fpr["macro"], tpr["macro"],
         label='macro-average ROC curve (area = {0:0.2f})'
               ''.format(roc_auc["macro"]),
         color='navy', linestyle=':', linewidth=4)


colors = cycle(['aqua', 'darkorange', 'cornflowerblue'])
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=lw,
             label='ROC curve of class {0} (area = {1:0.2f})'
             ''.format(i, roc_auc[i]))


plt.plot([0, 1], [0, 1], 'k--', lw=lw)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Some extension of Receiver operating characteristic to multi-class')
plt.legend(loc="lower right")
plt.show()
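
The per-class curves plotted above treat each class in turn as the positive class and all other classes as negative. As a rough cross-check that needs no plotting, the sketch below computes the matching one-vs-rest AUC from its rank interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The function names `binary_auc` and `one_vs_rest_auc` and the toy scores are illustrative assumptions, and the macro value is a plain mean of the per-class AUCs, so it need not match the interpolated macro curve exactly.

```python
def binary_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(y_true, score_rows, classes):
    """Per-class AUC where column j of score_rows scores class classes[j]."""
    result = {}
    for j, cls in enumerate(classes):
        labels = [1 if t == cls else 0 for t in y_true]
        scores = [row[j] for row in score_rows]
        result[cls] = binary_auc(labels, scores)
    return result


# Toy example (hypothetical): 3 classes, 4 samples, one score per class
classes = ["1", "2", "3"]
y_true = ["1", "2", "3", "1"]
score_rows = [
    [0.80, 0.10, 0.10],
    [0.20, 0.70, 0.10],
    [0.30, 0.30, 0.40],
    [0.25, 0.50, 0.25],
]
aucs = one_vs_rest_auc(y_true, score_rows, classes)
macro_auc = sum(aucs.values()) / len(aucs)
print(aucs, macro_auc)
```

The pairwise loop is quadratic in the number of samples, so it suits small checks like this one rather than the full test set.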
