Mahout - Traditional Naive Bayes Classification

Naive Bayes

Naive Bayes is an algorithm for classifying objects into categories, usually two. It is one of the most common learning algorithms in spam filters. Despite its simplicity and rather naive assumptions, it has proven to work surprisingly well in practice.

Before the algorithm can be applied, the objects to be classified must be represented by numerical features. For e-mail spam, each feature might indicate whether a specific word is present in or absent from the message to classify. The algorithm has two phases: learning and application.
During learning, the algorithm is given a set of feature vectors, each labeled with the class of the object it represents. From these it estimates which combinations of features appear with high probability in spam messages. Given this information, during application one can easily compute the probability that a new message is spam or not.
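The two phases can be illustrated with a minimal sketch in plain Python. This is not Mahout's implementation: the function names `train` and `predict_proba`, the three example words, the toy data, and the use of add-one (Laplace) smoothing are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def train(vectors, labels):
    """Learning phase: estimate class priors and per-feature presence rates."""
    n_features = len(vectors[0])
    class_counts = defaultdict(int)
    feature_counts = defaultdict(lambda: [0] * n_features)
    for vec, label in zip(vectors, labels):
        class_counts[label] += 1
        for i, present in enumerate(vec):
            feature_counts[label][i] += present
    total = len(labels)
    model = {}
    for label, count in class_counts.items():
        prior = count / total
        # Add-one smoothing (an assumption of this sketch) keeps every
        # probability strictly between 0 and 1, so log() below is safe.
        probs = [(feature_counts[label][i] + 1) / (count + 2)
                 for i in range(n_features)]
        model[label] = (prior, probs)
    return model


def predict_proba(model, vec):
    """Application phase: probability of each class for a new feature vector."""
    log_scores = {}
    for label, (prior, probs) in model.items():
        score = math.log(prior)
        for present, p in zip(vec, probs):
            # Each feature contributes independently: P(present) or P(absent).
            score += math.log(p if present else 1.0 - p)
        log_scores[label] = score
    # Normalize in log space so the class probabilities sum to 1.
    top = max(log_scores.values())
    unnormalized = {k: math.exp(v - top) for k, v in log_scores.items()}
    z = sum(unnormalized.values())
    return {k: v / z for k, v in unnormalized.items()}


# Hypothetical toy data: each feature marks presence (1) or absence (0)
# of the words "offer", "meeting", "free" in a message.
vectors = [[1, 0, 1], [1, 0, 0], [0, 1, 0], [0, 1, 1]]
labels = ["spam", "spam", "ham", "ham"]

model = train(vectors, labels)
probs = predict_proba(model, [1, 0, 1])
print(probs)
```

On this toy data, a message containing "offer" and "free" but not "meeting" is assigned a much higher probability of being spam than ham. Working with sums of logarithms rather than products of probabilities is a common design choice, since it avoids numerical underflow when there are many features.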

The algorithm makes several assumptions that do not hold for most datasets but make the computations easier. Probably the worst is that all features of an object are considered independent of one another, given the class. In practice this means that having already found the phrase "Statue of Liberty" in a text does not influence the probability of also seeing the phrase "New York".
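The independence assumption can be written as a formula. The notation here is an assumption of this note, not taken from the Mahout documentation: $c$ is a class (for example spam) and $x_1, \dots, x_n$ are the feature values of one object.

```latex
% Bayes' rule, with the joint likelihood factorized under the
% (naive) assumption that features are independent given the class:
P(c \mid x_1, \dots, x_n)
  \;\propto\; P(c)\, P(x_1, \dots, x_n \mid c)
  \;\approx\; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

The factorization is what makes learning cheap: instead of estimating one probability for every combination of features, only one probability per feature and class is needed.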

Strategy for a parallel Naive Bayes

See https://issues.apache.org/jira/browse/MAHOUT-9.

Examples

20Newsgroups - Example code showing how to train and use the Naive Bayes classifier on the 20 Newsgroups data, available at http://people.csail.mit.edu/jrennie/20Newsgroups/
