Mahout Naive Bayes: A Chinese News Classification Example


2020-02-22 19:05

Reposted from: http://www.cnblogs.com/panweishadow/p/4320720.html

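I. Introduction

For an introduction to Mahout, see: http://mahout.apache.org/

For background on Naive Bayes, see here:

Mahout implements the Naive Bayes classification algorithm. Here I use it to classify Chinese news text.

The official site provides a classification example that uses the 20 newsgroups data (http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz), about 85 MB in total.

Compared with English text, Chinese text needs only one extra step: word segmentation. I use the corpus from Sogou Labs, about 300 MB in total. See: http://www.sogou.com/labs/resources.html?v=1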

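II. Detailed Steps

1. Write a small word-segmentation program using the IK toolkit. Separate the tokens with spaces and collect all the news into a single text file, one article per line.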
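
The "one article per line" layout from step 1 can be sketched in shell. This is only an illustration, not the author's program: it assumes the IK segmentation has already been run and has written one space-separated file per article. The function name merge_articles and the *.txt naming are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: collapse each already-segmented article file in a
# directory into a single line and append it to one output file.
merge_articles() {
  local src_dir="$1" out_file="$2" f line
  : > "$out_file"                       # create or truncate the output file
  for f in "$src_dir"/*.txt; do
    [ -e "$f" ] || continue             # the glob matched nothing
    # Replace CR, LF and TAB with spaces, then squeeze runs of spaces.
    line=$(tr '\r\n\t' '   ' < "$f" | tr -s ' ')
    line="${line# }"                    # trim one leading space
    line="${line% }"                    # trim one trailing space
    [ -n "$line" ] || continue          # skip empty articles
    printf '%s\n' "$line" >> "$out_file"
  done
}
```

For example, merge_articles ./segmented ./news-all.txt would write one line per article found under ./segmented (both paths are placeholders).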

2. Upload the data to HDFS. Because of the data volume, this took several hours in my own test.

user@hadoop:~/workspace$ hadoop dfs -cp /share/data/Mahout_examples_Data_Set/20news-all .
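
The command above uses -cp, which copies between paths the file system can already see. If the segmented corpus is on the local disk instead, the usual upload command is hadoop fs -put. The sketch below wraps it in a small function. The function name upload_to_hdfs and the HADOOP_BIN override are assumptions added for illustration, not part of the original post.

```shell
#!/usr/bin/env bash
# Sketch: upload a local file or directory to HDFS with `hadoop fs -put`.
# HADOOP_BIN is a hypothetical override so the function can be dry-run
# (for example HADOOP_BIN=echo) on a machine without Hadoop.
upload_to_hdfs() {
  local local_path="$1" hdfs_dest="$2"
  if [ ! -e "$local_path" ]; then
    echo "local path not found: $local_path" >&2
    return 1
  fi
  "${HADOOP_BIN:-hadoop}" fs -put "$local_path" "$hdfs_dest"
}
```

A call such as upload_to_hdfs ./news-all.txt news-all would place the file under the current user's HDFS home directory (both names are placeholders).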

3. Create sequence files from the 20newsgroups data

user@hadoop:~/workspace$ mahout seqdirectory -i 20news-all -o 20news-seq

4. Convert the sequence files into vectors

user@hadoop:~/workspace$ mahout seq2sparse -i ./20news-seq -o ./20news-vectors -lnorm -nv -wt tfidf
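
To make the tfidf weighting requested by -wt tfidf more concrete, here is a minimal sketch of the textbook formula tf * ln(N / df). It runs on a plain text file with one space-separated article per line. It is an illustration only: Mahout's actual weighting and normalization may differ, and the function name tfidf is hypothetical.

```shell
#!/usr/bin/env bash
# Textbook TF-IDF over a file with one space-separated article per line.
# Output: article number, term, weight (tab-separated, unordered per article).
tfidf() {
  # The same file is read twice: pass 1 counts document frequencies,
  # pass 2 emits tf * ln(N / df) for every term of every article.
  awk '
    NR == FNR {
      split("", seen)                   # portable way to clear an array
      for (i = 1; i <= NF; i++)
        if (!($i in seen)) { seen[$i] = 1; df[$i]++ }
      n++                               # total number of articles
      next
    }
    {
      split("", tf)
      for (i = 1; i <= NF; i++) tf[$i]++
      for (t in tf)
        printf "%d\t%s\t%.4f\n", FNR, t, tf[t] * log(n / df[t])
    }
  ' "$1" "$1"
}
```

A term that occurs in every article gets weight 0 under this formula, which is why very common words contribute little to the vectors.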

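5. Split the vector dataset into training data and test data with a random 40-60 split (--randomSelectionPct 40 selects about 40% of the vectors as test data)

user@hadoop:~/workspace$ mahout split -i ./20news-vectors/tfidf-vectors --trainingOutput ./20news-train-vectors --testOutput ./20news-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential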
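
The idea behind a random percentage split can be sketched on plain text lines. This is not Mahout's implementation and does not read vector files. It only shows how each record is independently sent to the test set with a given probability, so the resulting sizes are approximate. The function name random_split and its seed parameter are assumptions.

```shell
#!/usr/bin/env bash
# Illustration only: send each input line to the test file with probability
# pct/100, otherwise to the training file.
random_split() {
  local input="$1" pct="$2" train_out="$3" test_out="$4" seed="${5:-42}"
  : > "$train_out"                      # make sure both outputs exist
  : > "$test_out"
  awk -v pct="$pct" -v seed="$seed" \
      -v train_file="$train_out" -v test_file="$test_out" '
    BEGIN { srand(seed) }               # fixed seed for a repeatable split
    {
      if (rand() * 100 < pct) print > test_file
      else                    print > train_file
    }
  ' "$input"
}
```

For example, random_split news-all.txt 40 train.txt test.txt puts roughly 40% of the articles into test.txt (file names are placeholders).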

6. Train the Naive Bayes model

user@hadoop:~/workspace$ mahout trainnb -i ./20news-train-vectors -el -o ./model -li ./labelindex -ow -c

7. Test the Naive Bayes model on the training set

user@hadoop:~/workspace$ mahout testnb -i ./20news-train-vectors -m ./model -l ./labelindex -ow -o 20news-testing -c

8. Evaluate the model's classification performance on the test set

user@hadoop:~/workspace$ mahout testnb -i ./20news-test-vectors -m ./model -l ./labelindex -ow -o ./20news-testing -c
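
Steps 3 to 8 can be chained in one script so that a failure in any step stops the run. The sketch below reuses the Mahout commands and flags exactly as listed above. The run wrapper, the DRY_RUN switch, and the work-directory argument are assumptions added for illustration.

```shell
#!/usr/bin/env bash
# With DRY_RUN=1 the commands are printed instead of executed, which is
# useful for checking the pipeline on a machine without Mahout.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Steps 3-8 from the article, relative to a working directory (default ".").
classify_pipeline() {
  local work="${1:-.}"
  run mahout seqdirectory -i "$work/20news-all" -o "$work/20news-seq" &&
  run mahout seq2sparse -i "$work/20news-seq" -o "$work/20news-vectors" -lnorm -nv -wt tfidf &&
  run mahout split -i "$work/20news-vectors/tfidf-vectors" \
      --trainingOutput "$work/20news-train-vectors" \
      --testOutput "$work/20news-test-vectors" \
      --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential &&
  run mahout trainnb -i "$work/20news-train-vectors" -el -o "$work/model" -li "$work/labelindex" -ow -c &&
  run mahout testnb -i "$work/20news-train-vectors" -m "$work/model" -l "$work/labelindex" -ow -o "$work/20news-testing" -c &&
  run mahout testnb -i "$work/20news-test-vectors" -m "$work/model" -l "$work/labelindex" -ow -o "$work/20news-testing" -c
}
```

Setting DRY_RUN=1 before calling classify_pipeline . prints the six commands for review. Without it, the function runs them in order.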

References: http://openresearch.baidu.com/activitybulletin/448.jhtml;jsessionid=28BD4187550DCA6F8AD6FEA4DCCA2480
