1、First, download the 20 Newsgroups dataset
The dataset is available at http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz. Extracting the archive yields two directories, 20news-bydate-test and 20news-bydate-train; merge the contents of both into a single 20news-all directory.
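The merge in step 1 can be scripted; a minimal sketch using only the standard library (the directory names come from the archive, everything else is illustrative, and the demo at the bottom runs on a tiny fake tree rather than the real dataset):

```python
import shutil
import tempfile
from pathlib import Path

def merge_dirs(sources, dest):
    """Copy each category subdirectory from every source into dest,
    merging when the same category appears in more than one source."""
    dest = Path(dest)
    dest.mkdir(exist_ok=True)
    for src in sources:
        for category in Path(src).iterdir():
            target = dest / category.name
            target.mkdir(exist_ok=True)
            for doc in category.iterdir():
                shutil.copy2(doc, target / doc.name)

# For the real dataset you would call:
#   merge_dirs(["20news-bydate-train", "20news-bydate-test"], "20news-all")
# Tiny self-contained demo with a fake train/test tree:
root = Path(tempfile.mkdtemp())
for split in ("train", "test"):
    d = root / split / "sci.space"
    d.mkdir(parents=True)
    (d / f"{split}_doc.txt").write_text("hello")
merge_dirs([root / "train", root / "test"], root / "all")
merged = sorted(p.name for p in (root / "all" / "sci.space").iterdir())
print(merged)  # ['test_doc.txt', 'train_doc.txt']
```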
2、Convert the dataset to SequenceFile format so Mahout can operate on it
Running Mahout locally requires pointing it at a local installation:
export MAHOUT_HOME=MAHOUT_DIR // MAHOUT_DIR is the local Mahout installation directory
export MAHOUT_LOCAL=$MAHOUT_HOME
This step should run locally rather than on the cluster: Hadoop is good at processing large files but poor at handling many small ones, and this step serializes the many small message files into a single large sequence file.
$ mahout seqdirectory -i ${WORK_DIR}/20news-all -o ${WORK_DIR}/20news-seq -ow
// -i: input file directory
// -o: output file directory
// -ow: overwrite existing output
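seqdirectory writes each document as a (key = file path, value = file text) record into one sequence file. The on-disk SequenceFile format is Hadoop-specific, but the idea of packing many small files into one length-prefixed record stream can be sketched in plain Python (the record layout below is illustrative, not the actual SequenceFile binary format):

```python
import struct

def pack_records(records):
    """Serialize (key, value) string pairs into one length-prefixed blob,
    mimicking how seqdirectory folds many small files into one big file."""
    out = bytearray()
    for key, value in records:
        for field in (key, value):
            data = field.encode("utf-8")
            out += struct.pack(">I", len(data)) + data
    return bytes(out)

def unpack_records(blob):
    """Inverse of pack_records: recover the (key, value) pairs."""
    records, pos = [], 0
    while pos < len(blob):
        fields = []
        for _ in range(2):
            (n,) = struct.unpack_from(">I", blob, pos)
            pos += 4
            fields.append(blob[pos:pos + n].decode("utf-8"))
            pos += n
        records.append(tuple(fields))
    return records

docs = [("/20news-all/sci.space/101", "NASA launches..."),
        ("/20news-all/rec.autos/202", "New engine design...")]
blob = pack_records(docs)
print(unpack_records(blob) == docs)  # round-trips losslessly
```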
Once the run finishes, remove the local MAHOUT_LOCAL setting:
export -n MAHOUT_LOCAL // un-export MAHOUT_LOCAL so later commands run on Hadoop
3、Upload 20news-seq to HDFS
hadoop fs -put ${WORK_DIR}/20news-seq ${HDFS_DIR}/
4、Convert and preprocess the dataset into <Text,VectorWritable> format
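The seq2sparse step below tokenizes each document and, with -wt tfidf, weights each term by term frequency times inverse document frequency. A minimal sketch of that weighting (Mahout's actual tokenization, smoothing, and normalization options differ):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per doc,
    using tf * log(N / df) as an illustrative TF-IDF weighting."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

docs = [["space", "nasa", "launch"],
        ["car", "engine", "launch"],
        ["space", "orbit"]]
vecs = tfidf_vectors(docs)
# "launch" appears in 2 of 3 docs, so it is down-weighted relative to
# "nasa", which appears in only 1 of 3.
print(vecs[0]["nasa"] > vecs[0]["launch"])  # True
```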
$ mahout seq2sparse -i ${HDFS_DIR}/20news-seq -o ${HDFS_DIR}/20news-vectors -lnorm -nv -wt tfidf
// -lnorm: (optional) log-normalize the output vectors
// -nv: (optional) emit NamedVectors
// -wt: the kind of weight to use, TF or TFIDF (default: TFIDF)
5、Split the dataset into a training set and a test set
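The split step below randomly holds out --randomSelectionPct percent of the vectors as test data. The selection logic amounts to a shuffled 80/20 partition (the seed here is illustrative; mahout split does not expose one):

```python
import random

def split_dataset(items, test_pct=20, seed=42):
    """Randomly hold out test_pct percent of items as a test set,
    returning (train, test)."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = len(shuffled) * test_pct // 100
    return shuffled[n_test:], shuffled[:n_test]

train, test = split_dataset(range(100), test_pct=20)
print(len(train), len(test))  # 80 20
```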
$ mahout split -i ${HDFS_DIR}/20news-vectors/tfidf-vectors --trainingOutput ${HDFS_DIR}/20news-train-vectors --testOutput ${HDFS_DIR}/20news-test-vectors --randomSelectionPct 20 --overwrite --sequenceFiles -xm sequential
// --trainingOutput: training-data output directory
// --testOutput: test-data output directory
// --randomSelectionPct: percentage of items randomly held out as test data (here 20%)
// --overwrite: overwrite existing output
// --sequenceFiles: set when the input files are sequence files (default: false)
// -xm sequential: run sequentially instead of as a MapReduce job
6、Train the naive Bayes classification model
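The trainnb step below estimates, for each label, smoothed term log-likelihoods and class priors from the training vectors (the -c flag trains against the complement of each class instead). A minimal multinomial naive Bayes sketch on toy counts, not Mahout's implementation:

```python
import math
from collections import defaultdict

def train_nb(docs):
    """docs: list of (label, {term: count}). Returns log class priors and
    log P(term | label) with add-one smoothing."""
    term_counts = defaultdict(lambda: defaultdict(int))
    label_counts = defaultdict(int)
    vocab = set()
    for label, terms in docs:
        label_counts[label] += 1
        for term, count in terms.items():
            term_counts[label][term] += count
            vocab.add(term)
    n_docs = sum(label_counts.values())
    priors = {l: math.log(c / n_docs) for l, c in label_counts.items()}
    loglik = {}
    for label, counts in term_counts.items():
        total = sum(counts.values())
        loglik[label] = {t: math.log((counts[t] + 1) / (total + len(vocab)))
                         for t in vocab}
    return priors, loglik

def classify(priors, loglik, terms):
    """Pick the label maximizing log prior + sum of term log-likelihoods."""
    scores = {l: priors[l] + sum(c * loglik[l][t]
                                 for t, c in terms.items() if t in loglik[l])
              for l in priors}
    return max(scores, key=scores.get)

docs = [("sci.space", {"orbit": 3, "nasa": 2}),
        ("rec.autos", {"engine": 3, "wheel": 2})]
priors, loglik = train_nb(docs)
print(classify(priors, loglik, {"orbit": 1}))  # sci.space
```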
$ mahout trainnb -i ${WORK_DIR}/20news-train-vectors -el -o ${WORK_DIR}/model -li ${WORK_DIR}/labelindex -ow -c
// -el: extract the labels from the input
// -o: output path for the trained model
// -li: path in which to store the label index
// -ow: overwrite existing output
// -c: train a complementary naive Bayes model
7、Test the naive Bayes classification model
$ mahout testnb -i ${WORK_DIR}/20news-test-vectors -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/20news-testing -c
// -m: path to the model built during training
// -l: path to the label index
// -c: test the complementary model
8、Results
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances : 2107 91.2121%
Incorrectly Classified Instances : 203 8.7879%
Total Classified Instances : 2310
=======================================================
Confusion Matrix
-------------------------------------------------------
a b c d e f g h i j k l m n o p q r s t <--Classified as
91 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 4 0 | 97 a = alt.atheism
0 104 0 2 1 3 3 0 0 0 0 0 1 0 1 0 0 0 0 0 | 115 b = comp.graphics
0 14 84 14 4 8 2 0 0 0 0 0 1 0 0 0 0 0 1 1 | 129 c = comp.os.ms-windows.misc
0 2 2 121 4 0 0 1 0 0 0 0 3 0 0 0 0 0 0 0 | 133 d = comp.sys.ibm.pc.hardware
0 4 0 2 122 0 4 0 0 0 0 0 1 0 0 0 0 0 0 0 | 133 e = comp.sys.mac.hardware
0 6 1 3 2 109 1 0 0 0 0 0 0 0 1 0 0 0 0 0 | 123 f = comp.windows.x
0 1 0 4 1 1 93 4 2 0 0 1 3 0 0 0 0 0 0 0 | 110 g = misc.forsale
0 0 0 0 0 0 1 101 5 0 0 0 3 0 1 0 0 0 0 0 | 111 h = rec.autos
0 0 0 0 1 0 0 3 132 0 0 0 0 0 0 0 0 0 0 0 | 136 i = rec.motorcycles
0 0 0 1 0 0 0 0 1 119 1 0 0 0 0 0 0 0 0 0 | 122 j = rec.sport.baseball
0 0 0 0 0 0 1 1 0 0 125 0 0 0 0 0 0 0 0 0 | 127 k = rec.sport.hockey
0 2 0 0 0 2 0 0 0 0 0 117 0 0 1 0 0 0 0 1 | 123 l = sci.crypt
0 2 0 1 3 0 2 1 0 0 0 0 112 1 0 0 0 0 0 0 | 122 m = sci.electronics
0 0 0 1 0 0 1 0 1 0 0 1 2 100 2 0 0 0 0 2 | 110 n = sci.med
0 1 0 0 0 0 0 0 0 0 0 0 0 0 112 0 0 0 0 0 | 113 o = sci.space
1 1 0 0 0 0 1 0 0 0 0 0 0 0 2 96 0 0 2 1 | 104 p = soc.religion.christian
1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 110 0 0 2 | 115 q = talk.politics.mideast
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 111 0 0 | 113 r = talk.politics.guns
9 0 0 0 0 0 0 0 1 1 1 0 0 0 0 4 1 0 66 0 | 83 s = talk.religion.misc
0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 3 2 3 82 | 91 t = talk.politics.misc
=======================================================
Statistics
-------------------------------------------------------
Kappa 0.8648
Accuracy 91.2121%
Reliability 86.7628%
Reliability (standard deviation) 0.2128
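The summary figures follow directly from the counts above; a quick arithmetic check of the reported accuracy, plus the per-class recall for alt.atheism read off the first confusion-matrix row (91 of 97 correct):

```python
# Overall accuracy from the summary counts.
correct, total = 2107, 2310
accuracy = 100.0 * correct / total
print(f"accuracy = {accuracy:.4f}%")  # matches the reported 91.2121%

# Per-class recall for alt.atheism (row a of the confusion matrix).
recall_atheism = 100.0 * 91 / 97
print(f"alt.atheism recall = {recall_atheism:.2f}%")
```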