[Machine Learning in Action, Part 2]: Classification Based on Probability Theory in C++ -- the Naive Bayes Classifier

Naive Bayes is a classic and widely used classification algorithm in machine learning. The sections below work through it step by step.

1. Conditional Probability:

      Conditional probability is an important and practical concept in probability theory. It concerns the probability that event B occurs given that event A has already occurred.

        (1) Conditional probability. Definition: let A and B be two events with P(A) > 0. Then P(B|A) = P(AB) / P(A) is called the conditional probability of B given that A has occurred.

        (2) Multiplication theorem: if P(A) > 0, then P(AB) = P(B|A) * P(A); this is called the multiplication formula.

      (3) The law of total probability and Bayes' formula

         Partition of a sample space. Definition: let S be the sample space of experiment E and let B1, B2, ..., Bn be a group of events of E. If:

         (i) BiBj = ∅ for all i ≠ j (i, j = 1, 2, ..., n), i.e. the events are pairwise mutually exclusive;

         (ii) B1 ∪ B2 ∪ ... ∪ Bn = S,

         then B1, B2, ..., Bn is called a partition of the sample space S. If B1, B2, ..., Bn is a partition of S, then in every trial exactly one of the events B1, B2, ..., Bn occurs.

         Theorem: let S be the sample space of experiment E, A an event of E, and B1, B2, ..., Bn a partition of S with P(Bi) > 0 (i = 1, 2, ..., n). Then:

         P(A) = P(A|B1)*P(B1) + P(A|B2)*P(B2) + ... + P(A|Bn)*P(Bn).  This is called the law of total probability.

         In many practical problems P(A) is hard to obtain directly, but it is easy to find a partition B1, B2, ..., Bn of S for which P(Bi) and P(A|Bi) are either known or easy to compute; P(A) can then be obtained from the law of total probability.

         Theorem: let S be the sample space of experiment E, A an event of E, and B1, B2, ..., Bn a partition of S with P(A) > 0 and P(Bi) > 0 (i = 1, 2, ..., n). Then:

                        P(Bi|A) = P(A|Bi)*P(Bi) / [P(A|B1)*P(B1) + P(A|B2)*P(B2) + ... + P(A|Bn)*P(Bn)],  i = 1, 2, ..., n.  This is called Bayes' formula.
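
To make these two formulas concrete, here is a minimal sketch with made-up numbers: a two-event partition with priors P(B1) = 0.6, P(B2) = 0.4 and likelihoods P(A|B1) = 0.1, P(A|B2) = 0.3. It first applies the law of total probability and then Bayes' formula.

#include <cstdio>

int main()
{
    double prior[2]      = {0.6, 0.4};   // P(B1), P(B2): a partition of S (made-up values)
    double likelihood[2] = {0.1, 0.3};   // P(A|B1), P(A|B2) (made-up values)

    // Law of total probability: P(A) = P(A|B1)*P(B1) + P(A|B2)*P(B2)
    double p_a = 0.0;
    for (int i = 0; i < 2; i++)
        p_a += likelihood[i] * prior[i];
    printf("P(A) = %f\n", p_a);          // 0.1*0.6 + 0.3*0.4 = 0.18

    // Bayes' formula: P(Bi|A) = P(A|Bi)*P(Bi) / P(A)
    for (int i = 0; i < 2; i++)
        printf("P(B%d|A) = %f\n", i + 1, likelihood[i] * prior[i] / p_a);
    return 0;
}

Running it gives P(A) = 0.18, P(B1|A) ≈ 0.33 and P(B2|A) ≈ 0.67: observing A shifts belief from B1 toward B2.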


2. Classification Methods Based on Bayesian Decision Theory:

Advantages: effective even with small amounts of data; handles multi-class problems.

Disadvantages: sensitive to how the input data are prepared.

Data type: nominal values.

Naive Bayes is part of Bayesian decision theory, so before discussing naive Bayes classification it helps to understand Bayesian decision theory. The method is called "naive" because the whole formalization makes only the most primitive and simplest assumptions.

Suppose we have a dataset made up of two classes of data, distributed as in the figure below.

(Figure: two probability distributions with known parameters; the parameters determine the shape of each distribution.)

Now let p1(x,y) denote the probability that the data point (x,y) belongs to class 1 (the class drawn as circles in the figure) and p2(x,y) the probability that (x,y) belongs to class 2 (the class drawn as triangles). For a new data point (x,y), its class is decided by the following rule:

·If p1(x,y) > p2(x,y), the class is 1.

·If p1(x,y) < p2(x,y), the class is 2.

In other words, we choose the class with the higher probability. This is the core idea of Bayesian decision theory: choose the decision with the highest probability.


3. The General Approach to Naive Bayes:

(1) Collect data: any method can be used.

(2) Prepare data: numeric or Boolean values are required.

(3) Analyze data: with many features, plotting individual features is not very informative; histograms work better.

(4) Train the algorithm: compute the conditional probabilities of the independent features.

(5) Test the algorithm: measure the error rate.

(6) Use the algorithm: a common naive Bayes application is document classification, but a naive Bayes classifier can be used in any classification setting, not only text.


4. Text Classification:

To obtain features from text, the text must first be split up. How, exactly? The features here are tokens from the text; a token is any combination of characters. Tokens can be thought of as words, but non-word tokens such as URLs, IP addresses or any other strings can be used as well.

Take the message board of an online community as an example. To keep the community healthy, abusive posts must be blocked, so we build a fast filter: if a message uses negative or insulting language, it is flagged as inappropriate. Filtering this kind of content is a very common requirement. For this problem we define two classes, abusive and non-abusive, labeled 1 and 0 respectively.

The data structures and the algorithm are implemented below in C++.

4.1 Preparing the data: building word vectors from text

We treat a piece of text as a word vector (token vector); that is, each sentence is converted into a vector. We gather all the words that appear across all documents, decide which of them make up the vocabulary (the word set we care about), and then convert every document into a vector over that vocabulary. Why do this? Keep reading.

Listing 4-1: word-list-to-vector conversion functions:

/*
 * code list 4-1 : transfer func from docs list to vocabulary list
 * */

#include<iostream>
#include<map>
#include<set>
#include<vector>
#include<algorithm>
#include<numeric>
#include<cstring>
#include<stdio.h>
#include<cstdlib>
using namespace std;

string posting_list[6][10]={
	{"my","dog","has","flea","problems","help","please","null"},
	{"maybe","not","take","him","to","dog","park","stupid","null"},
	{"my","dalmation","is","so","cute","I","love","him","null"},
	{"stop","posting","stupid","worthless","garbage","null"},
	{"mr","licks","ate","my","steak","how","to","stop","him","null"},
	{"quit","buying","worthless","dog","food","stupid","null"}
};
int class_vec[6] = {0,1,0,1,0,1};   //1 is abusive ,0 not

class NaiveBayes
{
	private:
		vector< vector<string> > list_of_posts;  //token vectors, one per document
		vector<int> list_classes;
		map<string,int>  my_vocab_list;  //vocabulary: word -> index
		int *return_vec;

	public:
		NaiveBayes()
		{ 
            //posting_list --> list_of_posts 
			vector<string> vec;
			for(int i=0;i<6;i++)
			{
				vec.clear();
				for(int j=0;posting_list[i][j]!="null";j++)
				{
					vec.push_back( posting_list[i][j] );
				}
				list_of_posts.push_back( vec );
			}
            
            //class_vec --> list_classes
			for(int i=0;i<sizeof(class_vec)/sizeof(class_vec[0]);i++)
			{
				list_classes.push_back( class_vec[i] );
			}
		}

		void create_vocab_list()
		{
			vector< vector<string> > :: iterator it = list_of_posts.begin();
			int index = 1;
			while( it!=list_of_posts.end() )
			{
				vector<string> vec = *it;

				vector<string> :: iterator tmp_it = vec.begin();

				while( tmp_it!=vec.end() )
				{
					if( my_vocab_list[*tmp_it] == 0 )
					{
						my_vocab_list[*tmp_it] = index++; //index is the location of the vocabulary
					}
					tmp_it++;
				}
				it++;
			}
			
		   map<string,int>::const_iterator itt = my_vocab_list.begin();
		   while( itt!=my_vocab_list.end() )
		   {
		   cout<<itt->first<<" "<<itt->second<<"   ";
		   itt++;
		   }
			 
		}//create_vocab_list

		//set-of-words model: convert document idx into a 0/1 word vector
		void set_of_words_to_vec(int idx)
		{
			cout<<"set of words to vec begin the document id is : "<<idx<<endl;
			int len = my_vocab_list.size()+1;
			return_vec = new int[ len ](); //note: new int[len]() value-initializes every element to zero, unlike plain new int[len]
			fill(return_vec,return_vec+len,0);
			for(int i=0;i<len;i++)
				cout<<return_vec[i]<<" ";
			for( int i=0;posting_list[idx][i]!="null";i++ )
			{
				int pos = my_vocab_list[ posting_list[idx][i] ];
				if( pos != 0 )
				{
					return_vec[pos] = 1;
				}
			}
			cout<<endl;
		}//set_of_words_to_vec
	
		void print()
		{
			cout<<"print the return_vec begin :"<<endl;
			int len = my_vocab_list.size()+1;
			cout<<"len = "<<len<<endl;
			for(int i=0;i<len;i++)
			{
				cout<<return_vec[i]<<" ";
			}
			cout<<endl;
			delete [] return_vec;
		}//print()
};

int main()
{
	NaiveBayes nb;
	nb.create_vocab_list();
	nb.set_of_words_to_vec(5);
	nb.print();
	system("pause") ;
	return 0;
}

Analysis:

·NaiveBayes(): the constructor does two things. First, it initializes list_of_posts, the collection of token-split documents, by converting posting_list into list_of_posts; each element of list_of_posts is one document, and these documents come from a Dalmatian owners' message board. Second, it initializes the private member list_classes from class_vec; this is the set of class labels, with two classes, abusive and non-abusive. The labels were assigned by hand and are used to train the program so that it can detect abusive posts automatically.

·create_vocab_list(): builds a list of all distinct words that appear across all documents. It is stored in the private member map<string,int> my_vocab_list, where the key is the word and the value is the word's position (index) in my_vocab_list.

·set_of_words_to_vec(int idx): the input is the index idx of a document; the output return_vec is the word vector of that document, whose elements are 1 or 0 depending on whether each vocabulary word appears in the input document. The function first obtains the vector length from the vocabulary size and sets every element to 0 with STL fill; it then walks through every word of document idx, looks up its position pos in my_vocab_list, and sets the corresponding element of return_vec to 1. In effect the vocabulary (the set of words we want to check) is the input, and once a document (a message from the Dalmatian site) is given, it is converted into a word vector.

·print(): prints the resulting document vector return_vec.


Result:



4.2 Training the algorithm: computing probabilities from word vectors

The previous section showed how to convert a set of words into a set of numbers; now let us see how to use those numbers to compute probabilities. At this point we know whether a word appears in a document and we know the document's class. Classifying a document that has been converted into a word vector W amounts to computing the probability of class Ci given W:

p(Ci | W) = p( W | Ci ) * p( Ci ) / p( W );   W: the word vector to be classified;

We use this formula to compute the value for each class and then compare the two probabilities. How are the terms computed?

p(Ci): first, the prior p(Ci) is obtained by dividing the number of documents of class i (abusive or non-abusive messages) by the total number of documents.

p(W|Ci): this is where the naive Bayes assumption comes in. Expanding the vector W into individual features, the probability can be written p(W0, W1, W2, ..., Wn | Ci). Assuming that all words are mutually independent (the conditional independence assumption), it can be computed as P(W0|Ci)P(W1|Ci)P(W2|Ci)...P(Wn|Ci), which greatly simplifies the computation.
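
Before looking at the full training routine, here is a small self-contained sketch of this factorization; the per-word probabilities and the prior are made-up values, not the output of the code below.

#include <cstdio>

int main()
{
    // Hypothetical per-word conditional probabilities P(Wj|C1) for the
    // three words that appear in some document (made-up values).
    double p_word_given_c1[3] = {0.05, 0.10, 0.02};
    double p_c1 = 0.5;   // hypothetical prior P(C1)

    // Conditional independence assumption: P(W|C1) = P(W0|C1)*P(W1|C1)*P(W2|C1)
    double p_w_given_c1 = 1.0;
    for (int j = 0; j < 3; j++)
        p_w_given_c1 *= p_word_given_c1[j];

    // P(C1|W) is proportional to P(W|C1)*P(C1); P(W) is the same for every
    // class, so it can be dropped when classes are only being compared.
    printf("P(W|C1)*P(C1) = %g\n", p_w_given_c1 * p_c1);
    return 0;
}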

Pseudocode for the training function:

Count the number of documents in each class
For each training document:
	For each class:
		If a token appears in the document --> increment the count for that token
		Increment the count of total tokens
For each class:
	For each token:
		Divide the token's count by the total token count to get its conditional probability
Return the conditional probability vector of each class

Listing 4-2: the naive Bayes classifier training function:

/*
 * code list 4-1 : transfer func from docs list to vocabulary list
 * add code list 4-2 : training func on Naive Bayes Classifier
 * */


#include<iostream>
#include<map>
#include<set>
#include<vector>
#include<algorithm>
#include<numeric>
#include<cstring>
#include<stdio.h>
#include<cstdlib>
using namespace std;


string posting_list[6][10]={
	{"my","dog","has","flea","problems","help","please","null"},
	{"maybe","not","take","him","to","dog","park","stupid","null"},
	{"my","dalmation","is","so","cute","I","love","him","null"},
	{"stop","posting","stupid","worthless","garbage","null"},
	{"mr","licks","ate","my","steak","how","to","stop","him","null"},
	{"quit","buying","worthless","dog","food","stupid","null"}
};
int class_vec[6] = {0,1,0,1,0,1};   //1 is abusive ,0 not


class NaiveBayes
{
	private:
		vector< vector<string> > list_of_posts;
		vector<int> list_classes;
		map<string,int>  my_vocab_list;
		int *return_vec;
		vector< vector<int> > train_mat;


	public:
		NaiveBayes()
		{
			vector<string> vec;
			for(int i=0;i<6;i++)
			{
				vec.clear();
				for(int j=0;posting_list[i][j]!="null";j++)
				{
					vec.push_back( posting_list[i][j] );
				}
				list_of_posts.push_back( vec );
			}

			for(int i=0;i<sizeof(class_vec)/sizeof(class_vec[0]);i++)
			{
				list_classes.push_back( class_vec[i] );
			}

		}

		void create_vocab_list()
		{
			vector< vector<string> > :: iterator it = list_of_posts.begin();
			int index = 1;
			while( it!=list_of_posts.end() )
			{
				//vector<string> vec( *it.begin(),*it.end() );
				vector<string> vec = *it;

				vector<string> :: iterator tmp_it = vec.begin();

				while( tmp_it!=vec.end() )
				{
					//cout<<*tmp_it<<" ";
					if( my_vocab_list[*tmp_it] == 0 )
					{
						my_vocab_list[*tmp_it] = index++; //index is the location of the vocabulary
					}
					tmp_it++;
				}
				it++;
			}

		}//create_vocab_list

		//set-of-words model: convert document idx into a 0/1 word vector
		void set_of_words_to_vec(int idx)
		{
			cout<<"set of words to vec begin the document id is : "<<idx<<endl;
			int len = my_vocab_list.size()+1;
			return_vec = new int[ len ](); //note: new int[len]() value-initializes every element to zero, unlike plain new int[len]
			fill(return_vec,return_vec+len,0);
			for(int i=0;i<len;i++)
				cout<<return_vec[i]<<" ";
			for( int i=0;posting_list[idx][i]!="null";i++ )
			{
				//cout<<posting_list[idx][i]<<" ";
				int pos = my_vocab_list[ posting_list[idx][i] ];
				if( pos != 0 )
				{
					return_vec[pos] = 1;
				}
			}
			cout<<endl;
		}//set_of_words_to_vec

		void get_train_matrix()
		{
			cout<<"get train matrix begin : "<<endl;
			train_mat.clear();
			for(int i=0;i<6;i++)
			{
				set_of_words_to_vec(i);
				vector<int> vec( return_vec , return_vec + my_vocab_list.size()+1 );
				train_mat.push_back(vec);
				delete []return_vec;
			}
		}//get train matrix

		void print()
		{
			cout<<"print the train matrix begin : "<<endl;
			vector< vector<int> > :: iterator it = train_mat.begin();
			while(it!=train_mat.end())
			{
				vector<int> vec = *it;
				vector<int> :: iterator itt = vec.begin();
				while( itt!=vec.end())
				{
					cout<<*itt<<" ";
					itt++;
				}
				cout<<endl;
				it++;
			}

		}//print()

		void train_NB0()
		{
			int num_train_docs = train_mat.size();//sizeof(posting_lists)/sizeof(posting_lists[0]);
			cout<<"num_train_docs = "<<num_train_docs<<endl;
			int num_words = train_mat[0].size() - 1 ;
			/* calculate the sum of the abusive class labels */
			int sum = accumulate(list_classes.begin(),list_classes.end(),0); //C++ STL accumulate() 
			cout<<"sum = "<<sum<<endl;
			float p_abusive = (float)sum/(float)num_train_docs;
			cout<<"p_abusive = "<<p_abusive<<endl;

			vector<float> p0vect(train_mat[0].size(),0); //the frequency of each word in non-abusive docs
			vector<float> p1vect(train_mat[0].size(),0); //the frequency of each word in abusive docs
			printf("p0vect.size() = %d , p1vect.size() = %d\n",(int)p0vect.size(),(int)p1vect.size());
			float p0Denom = 0.0; //the total number of words in non-abusive docs
			float p1Denom = 0.0; //the total number of words in abusive docs

			/* calculate the p0num,p1num,p0Denom,p1Denom */
			for(int i=0;i<list_classes.size();i++)
			{
				if(list_classes[i] == 1)  //abusive doc
				{
					for(int j=0;j<p1vect.size();j++)
					{
						p1vect[j] += train_mat[i][j];
						if(train_mat[i][j]==1)			
							p1Denom++;
					}
				}
				else   //non-abusive doc
				{
					for(int j=0;j<p0vect.size();j++)
					{
						p0vect[j] += train_mat[i][j];
						if(train_mat[i][j]==1)			
							p0Denom++;
					}
				}
			}
			
			for(int i=0;i<p1vect.size();i++)
			{
				p0vect[i] = p0vect[i]/p0Denom;
				p1vect[i] = p1vect[i]/p1Denom;
			}
			
			cout<<"print the p0vect values : ";
			for(int i=0;i<p0vect.size();i++)
				cout<<p0vect[i]<<" ";
			cout<<"\nprint the p1vect values : ";
			for(int i=0;i<p1vect.size();i++)
				cout<<p1vect[i]<<" ";
			cout<<endl;
		}


};

int main()
{
	NaiveBayes nb;
	nb.create_vocab_list();
	nb.get_train_matrix();
	nb.print();
	nb.train_NB0();
	system("pause") ;
	return 0;
}

Analysis:

Compared with Listing 4-1, Listing 4-2 adds the private member variable train_mat and the public member function train_NB0, where:

train_mat: the document matrix, a matrix of 0/1 word vectors; each row is one document converted into a 0/1 array of the same length as my_vocab_list.

train_NB0: the naive Bayes classifier training function.

First it computes the probability that a document is abusive (class = 1), p_abusive, i.e. P(1). Since this is a two-class problem, P(0) can then be obtained as 1 - P(1). For a problem with more than two classes, the code needs a small modification (a sketch is given after this analysis).

To compute P(Wi | C0) and P(Wi | C1), the numerator variables p0vect/p1vect and the denominator variables p0Denom/p1Denom are initialized first. The for loop then walks through every document in the training matrix train_mat. Each time a word appears in a document, the count at that word's position in p0vect or p1vect is incremented, and the total word count of that class, p0Denom or p1Denom, is incremented as well. The same computation is done for both classes. Finally, each element is divided by the total word count of its class.
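
To illustrate the multi-class remark above, here is a minimal sketch (hypothetical labels, not part of the listings) of how the per-class priors P(Ci) could be computed when there are more than two labels:

#include <cstdio>
#include <map>
using namespace std;

int main()
{
    // Hypothetical labels for 8 documents drawn from 3 classes (0, 1 and 2).
    int class_vec[8] = {0, 1, 2, 1, 0, 2, 1, 0};
    int num_docs = sizeof(class_vec) / sizeof(class_vec[0]);

    // Count the documents in each class; the prior is count / total.
    map<int, int> count_per_class;
    for (int i = 0; i < num_docs; i++)
        count_per_class[class_vec[i]]++;

    map<int, int>::const_iterator it = count_per_class.begin();
    while (it != count_per_class.end())
    {
        printf("P(C%d) = %f\n", it->first, (float)it->second / (float)num_docs);
        ++it;
    }
    return 0;
}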


Result:



4.3 Testing the algorithm: modifying the classifier for real-world conditions

When the Bayes classifier is used to classify a document, several probabilities are multiplied to obtain the probability that the document belongs to a class, i.e. p(W0|1)p(W1|1)p(W2|1)...

Problem 1: if any one of those probabilities is 0, the final product is also 0. To reduce this effect, all word occurrence counts are initialized to 1 and the denominators are initialized to 2 (Laplace smoothing):

p0vect.resize(train_mat[0].size(),1);//the frequency of each word in non-abusive docs
p1vect.resize(train_mat[0].size(),1);//the frequency of each word in abusive docs
float p0Denom = 2.0; //the total number of words in non-abusive docs
float p1Denom = 2.0; //the total number of words in abusive docs

Problem 2: underflow. This is caused by multiplying many very small numbers. When the product p(W0|Ci)p(W1|Ci)p(W2|Ci)...p(Wn|Ci) is computed, most factors are very small, so the program underflows or produces an incorrect answer. Since ln(a*b) = ln(a) + ln(b), taking logarithms avoids underflow and floating-point rounding errors, and nothing is lost by doing so: the natural logarithm is monotonically increasing, so it does not change which class ends up with the larger value.

p0vect[i] = log(p0vect[i]/p0Denom);
p1vect[i] = log(p1vect[i]/p1Denom);
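
To see why the logarithm matters, here is a small standalone check (illustrative only, not part of the listings): the plain product of many small probabilities collapses to zero in floating point, while the sum of their logarithms stays finite.

#include <cstdio>
#include <cmath>

int main()
{
    float product = 1.0f;   // plain product of 500 small probabilities
    double log_sum = 0.0;   // sum of their natural logarithms
    for (int i = 0; i < 500; i++)
    {
        product *= 0.01f;       // underflows to 0 long before the loop ends
        log_sum += log(0.01);   // ln(a*b) = ln(a) + ln(b), so this stays finite
    }
    printf("plain product = %g\n", product);   // prints 0
    printf("sum of logs   = %g\n", log_sum);   // about -2302.59
    return 0;
}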
With everything in place, we can now build the complete classifier.


Listing 4-3: the naive Bayes classification function:

/*
 * code list 4-1 : transfer func from docs list to vocabulary list
 * code list 4-2 : training func on Naive Bayes Classifier
 * add code list 4-3 : naive bayes classify function
 * */

#include<iostream>
#include<map>
#include<set>
#include<cmath>
#include<vector>
#include<algorithm>
#include<numeric>
#include<cstring>
#include<stdio.h>
#include<cstdlib>
using namespace std;


string posting_list[6][10]={
	{"my","dog","has","flea","problems","help","please","null"},
	{"maybe","not","take","him","to","dog","park","stupid","null"},
	{"my","dalmation","is","so","cute","I","love","him","null"},
	{"stop","posting","stupid","worthless","garbage","null"},
	{"mr","licks","ate","my","steak","how","to","stop","him","null"},
	{"quit","buying","worthless","dog","food","stupid","null"}
};
int class_vec[6] = {0,1,0,1,0,1};   //1 is abusive ,0 not


class NaiveBayes
{
	private:
		vector< vector<string> > list_of_posts;
		vector<int> list_classes;
		map<string,int>  my_vocab_list;
		int *return_vec;
		vector< vector<int> > train_mat;
		vector<float> p0vect;
		vector<float> p1vect;
		float p_abusive;


	public:
		NaiveBayes()
		{
			vector<string> vec;
			for(int i=0;i<6;i++)
			{
				vec.clear();
				for(int j=0;posting_list[i][j]!="null";j++)
				{
					vec.push_back( posting_list[i][j] );
				}
				list_of_posts.push_back( vec );
			}

			for(int i=0;i<sizeof(class_vec)/sizeof(class_vec[0]);i++)
			{
				list_classes.push_back( class_vec[i] );
			}

		}

		void create_vocab_list()
		{
			vector< vector<string> > :: iterator it = list_of_posts.begin();
			int index = 1;
			while( it!=list_of_posts.end() )
			{
				//vector<string> vec( *it.begin(),*it.end() );
				vector<string> vec = *it;

				vector<string> :: iterator tmp_it = vec.begin();

				while( tmp_it!=vec.end() )
				{
					//cout<<*tmp_it<<" ";
					if( my_vocab_list[*tmp_it] == 0 )
					{
						my_vocab_list[*tmp_it] = index++; //index is the location of the vocabulary
					}
					tmp_it++;
				}
				it++;
			}

		}//create_vocab_list

		//set-of-words model: convert document idx into a 0/1 word vector
		void set_of_words_to_vec(int idx)
		{
			cout<<"set of words to vec begin the document id is : "<<idx<<endl;
			int len = my_vocab_list.size()+1;
			return_vec = new int[ len ](); //note: new int[len]() value-initializes every element to zero, unlike plain new int[len]
			fill(return_vec,return_vec+len,0);
			for(int i=0;i<len;i++)
				cout<<return_vec[i]<<" ";
			for( int i=0;posting_list[idx][i]!="null";i++ )
			{
				//cout<<posting_list[idx][i]<<" ";
				int pos = my_vocab_list[ posting_list[idx][i] ];
				if( pos != 0 )
				{
					return_vec[pos] = 1;
				}
			}
			cout<<endl;
		}//set_of_words_to_vec

		void get_train_matrix()
		{
			cout<<"get train matrix begin : "<<endl;
			train_mat.clear();
			for(int i=0;i<6;i++)
			{
				set_of_words_to_vec(i);
				vector<int> vec( return_vec , return_vec + my_vocab_list.size()+1 );
				train_mat.push_back(vec);
				delete []return_vec;
			}
		}//get train matrix

		void print()
		{
			cout<<"print the train matrix begin : "<<endl;
			vector< vector<int> > :: iterator it = train_mat.begin();
			while(it!=train_mat.end())
			{
				vector<int> vec = *it;
				vector<int> :: iterator itt = vec.begin();
				while( itt!=vec.end())
				{
					cout<<*itt<<" ";
					itt++;
				}
				cout<<endl;
				it++;
			}

		}//print()

		void train_NB0()
		{
			int num_train_docs = train_mat.size();//sizeof(posting_lists)/sizeof(posting_lists[0]);
			cout<<"num_train_docs = "<<num_train_docs<<endl;
			int num_words = train_mat[0].size() - 1 ;
			/* calculate the sum of the abusive class labels */
			int sum = accumulate(list_classes.begin(),list_classes.end(),0);
			
			cout<<"sum = "<<sum<<endl;
			//float p_abusive = (float)sum/(float)num_train_docs;
			p_abusive =  (float)sum/(float)num_train_docs;
			cout<<"p_abusive = "<<p_abusive<<endl;

			//vector<float> p0vect(train_mat[0].size(),1); //the frequency of each word in non-abusive docs
			p0vect.resize(train_mat[0].size(),1);
			//vector<float> p1vect(train_mat[0].size(),1); //the frequency of each word in abusive docs
			p1vect.resize(train_mat[0].size(),1);
			printf("p0vect.size() = %d , p1vect.size() = %d\n",(int)p0vect.size(),(int)p1vect.size());
			float p0Denom = 2.0; //the total number of words in non-abusive docs
			float p1Denom = 2.0; //the total number of words in abusive docs

			/* calculate the p0num,p1num,p0Denom,p1Denom */
			for(int i=0;i<list_classes.size();i++)
			{
				if(list_classes[i] == 1)  //abusive doc
				{
					for(int j=0;j<p1vect.size();j++)
					{
						p1vect[j] += train_mat[i][j];
						if(train_mat[i][j]==1)			
							p1Denom++;
					}
				}
				else   //non-abusive doc
				{
					for(int j=0;j<p0vect.size();j++)
					{
						p0vect[j] += train_mat[i][j];
						if(train_mat[i][j]==1)			
							p0Denom++;
					}
				}
			}
			
			for(int i=0;i<p1vect.size();i++)
			{
				p0vect[i] = log(p0vect[i]/p0Denom);
				p1vect[i] = log(p1vect[i]/p1Denom);
			}
			
			cout<<"print the p0vect values : "<<endl;
			for(int i=0;i<p0vect.size();i++)
				cout<<p0vect[i]<<" ";
			cout<<"\nprint the p1vect values : "<<endl;
			for(int i=0;i<p1vect.size();i++)
				cout<<p1vect[i]<<" ";
			cout<<endl;
		}

		int classify_NB( string *doc_to_classify )
		{
			return_vec = new int[ my_vocab_list.size()+1 ]();
			for(int i=0;doc_to_classify[i]!="null";i++)
			{
				int pos = my_vocab_list[ doc_to_classify[i] ];
				if( pos!=0 )
				{
					return_vec[ pos ] = 1;
				}
			}//for

			for(int i=0;i<my_vocab_list.size()+1;i++)
				cout<<return_vec[i]<<" ";
			cout<<endl;
			float p1 = inner_product( p1vect.begin()+1,p1vect.end(),return_vec+1,0.0f ) + log(p_abusive); //0.0f so the sum accumulates as float, not int
			float p0 = inner_product( p0vect.begin()+1,p0vect.end(),return_vec+1,0.0f ) + log(1-p_abusive);
			delete [] return_vec; //release the word vector allocated above

			cout<<"p1 = "<<p1<<endl;
			cout<<"p0 = "<<p0<<endl;

			if( p1>p0 )
			{
				return 1;
			}
			else
			{
				return 0;
			}
		}

};

int main()
{
	NaiveBayes nb;
	nb.create_vocab_list();
	//nb.set_of_words_to_vec(5);
	nb.get_train_matrix();
	nb.print();
	nb.train_NB0();

	string doc1_to_classify[] = {"love","my","dalmation","null"}; 
	string doc2_to_classify[] = {"stupid","garbage","null"};
    cout<<"doc1 classified as : "<<nb.classify_NB( doc1_to_classify )<<endl;
    cout<<"doc2 classified as : "<<nb.classify_NB( doc2_to_classify )<<endl;
	return 0;
}


Result:

We can see that doc1 {"love","my","dalmation","null"} is classified as class 0 (not abusive) and doc2 {"stupid","garbage","null"} as class 1 (abusive). Both classifications are correct!

5. Example: Filtering Spam E-mail with Naive Bayes:

In this example we look at one of the best-known applications of naive Bayes: filtering spam e-mail. First, here is how the general framework is applied to the problem:

(1) Collect data: text files are provided.

(2) Prepare data: parse the text files into token vectors.

(3) Analyze data: inspect the tokens to make sure parsing was done correctly.

(4) Train the algorithm: use the train_NB0() function built earlier.

(5) Test the algorithm: use classify_NB() and build a new test function that computes the error rate over a document set.

(6) Use the algorithm: build a complete program that classifies a group of documents and prints the misclassified ones to the screen.

5.1 Splitting the text:

First we write a Python script, textParse.py, that parses all the e-mail files. Normal (ham) messages live under /email/ham/ and spam messages under /email/spam/; after parsing, each ham file is written to /email/hamParse/ and each spam file to /email/spamParse/. The e-mail data set is shared here: http://yunpan.cn/Q4fXnTtGudGA9 .

textParse.py:

#!/usr/bin/env python

def textParse(bigString):
	import re
	listOfTokens = re.split(r'\W*',bigString)
	return [tok.lower() for tok in listOfTokens if len(tok) > 2 ]

def spamTest():
	for i in range(1,26):
		wordList = textParse( open('./email/ham/%d.txt' % i).read() )
		fp = open( './email/hamParse/%d.dat' % i , 'w')
		for item in wordList:
			fp.write(item+' ')
		fp.close()
		wordList = textParse( open('./email/spam/%d.txt' % i).read() )
		fp = open( './email/spamParse/%d.dat' % i , 'w')
		for item in wordList:
			fp.write(item+' ')
		fp.close()

spamTest()

Analysis: the Python code above reads each text file, splits it into tokens, lowercases every token, keeps only the tokens longer than two characters, and writes them to the output text files. Text parsing can be a fairly involved process, so adapt it to your own situation.
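
If you would rather keep this step in C++ as well, the sketch below is a rough counterpart of textParse() under the same assumptions: split on non-alphanumeric characters, lowercase every token, and keep only tokens longer than two characters (the input path is just an example).

#include <cctype>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>
using namespace std;

// Split on non-alphanumeric characters, lowercase each token,
// and keep only tokens longer than two characters.
vector<string> text_parse(const string &big_string)
{
    vector<string> tokens;
    string cur;
    for (size_t i = 0; i <= big_string.size(); i++)
    {
        if (i < big_string.size() && isalnum((unsigned char)big_string[i]))
        {
            cur += (char)tolower((unsigned char)big_string[i]);
        }
        else
        {
            if (cur.size() > 2)
                tokens.push_back(cur);
            cur.clear();
        }
    }
    return tokens;
}

int main()
{
    // Example: parse ./email/ham/1.txt and print its tokens.
    ifstream fin("./email/ham/1.txt");
    if (!fin)
    {
        cerr << "fail to open the file ./email/ham/1.txt" << endl;
        return 1;
    }
    string content((istreambuf_iterator<char>(fin)), istreambuf_iterator<char>());
    vector<string> tokens = text_parse(content);
    for (size_t i = 0; i < tokens.size(); i++)
        cout << tokens[i] << " ";
    cout << endl;
    return 0;
}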


5.2 Testing the algorithm: hold-out cross-validation with naive Bayes

The complete code, NB3.cc:

/*
 * code list 4-1 : transfer func from docs list to vocabulary list
 * code list 4-2 : training func on Naive Bayes Classifier
 * code list 4-3 : naive bayes classify function
 * add code list 4-4 : naive bayes bag-of-word model
 * add code list 4-5 : text parse : textParse.py and spam email test function : get_error_rate()
 * */

#include<iostream>
#include<map>
#include<set>
#include<cmath>
#include<vector>
#include<algorithm>
#include<numeric>
#include<cstring>
#include<stdio.h>
#include<cstdlib>
#include<fstream>
#include<stdlib.h>
#include<unistd.h>
#include<string.h>
using namespace std;

class NaiveBayes
{
	private:
		vector< vector<string> > list_of_docs;
		vector<int> list_classes;
		map<string,int>  my_vocab_list;
		int *return_vec;
		vector< vector<int> > train_mat;
		vector<float> p0vect;
		vector<float> p1vect;
		float p_abusive;
		ifstream fin;
		ofstream fout;
		int test_data_num;

	public:
		NaiveBayes()
		{
			cout<<"please input the num of test data which should be less than 24 : "<<endl;
			cin>>test_data_num;
			vector<string> vec;
			string word;
			string filename;
			char buf[3];
			string buf_str;
			for(int i=test_data_num+1;i<=25;i++)
			{
				sprintf(buf,"%d",i);  //convert digit to string
				vec.clear();
				buf_str = buf;
				filename = "./email/hamParse/"+buf_str+".dat";
				//cout<<"filename : "<<filename<<endl;
				fin.open( filename.c_str() );
				if(!fin)
				{
					cerr<<"open the file "<<filename<<" error"<<endl;
					exit(1);
				}
				while(fin>>word)
				{
					vec.push_back(word);
				}
				list_of_docs.push_back( vec );
				list_classes.push_back(0);
				filename.clear();
				fin.close();
			}

			for(int i=test_data_num+1;i<=25;i++)
			{
				sprintf(buf,"%d",i);
				vec.clear();
				buf_str = buf;
				filename =	"./email/spamParse/"+buf_str+".dat";
				//cout<<"filename : "<<filename<<endl;
				fin.open( filename.c_str() );
				if(!fin)
				{
					cerr<<"open the file "<<filename<<" error"<<endl;
				}
				while(fin>>word)
				{
					vec.push_back(word);
				}
				list_of_docs.push_back( vec );
				list_classes.push_back(1);
				filename.clear();
				fin.close();
			}

		}

		~NaiveBayes()
		{
			fin.close();
			fout.close();
			list_of_docs.clear();
			list_classes.clear();
			my_vocab_list.clear();
			train_mat.clear();
			//delete [] return_vec;
			p0vect.clear();
			p1vect.clear();
		}


		void create_vocab_list()
		{
			vector< vector<string> > :: iterator it = list_of_docs.begin();
			int index = 1;
			while( it!=list_of_docs.end() )
			{
				//vector<string> vec( *it.begin(),*it.end() );
				vector<string> vec = *it;

				vector<string> :: iterator tmp_it = vec.begin();

				while( tmp_it!=vec.end() )
				{
					//cout<<*tmp_it<<" ";
					if( my_vocab_list[*tmp_it] == 0 )
					{
						my_vocab_list[*tmp_it] = index++; //index is the location of the vocabulary
					}
					tmp_it++;
				}
				it++;
			}
	
		}//create_vocab_list

		//bag-of-words model: count how many times each vocabulary word appears in document idx
		void beg_of_words_to_vec(int idx)
		{
			//cout<<"set of words to vec begin the document id is : "<<idx<<endl;
			int len = my_vocab_list.size()+1;
			return_vec = new int[ len ](); //note: new int[len]() value-initializes every element to zero, unlike plain new int[len]
			fill(return_vec,return_vec+len,0);
			vector< vector<string> >:: iterator it = list_of_docs.begin() + idx - 1  ;
			vector<string> vec  = *it;
			vector<string> :: iterator itt = vec.begin();
			int pos = 0 ;
			while( itt!=vec.end() )
			{
	//			cout<<*itt<<" ";
				pos = my_vocab_list[ *itt ];
				if(pos!=0)
				{
					return_vec[pos] += 1;
				}
				itt++;
			}
		}//beg_of_words_to_vec

		void get_train_matrix()
		{
			cout<<"get train matrix begin : "<<endl;
			train_mat.clear();
			for(int i=1;i<=list_of_docs.size();i++)
			{
				beg_of_words_to_vec(i);
				vector<int> vec( return_vec , return_vec + my_vocab_list.size()+1 );
				train_mat.push_back(vec);
				delete []return_vec;
			}
		}//get train matrix

		void print()
		{
			cout<<"print the train matrix begin : "<<endl;
			vector< vector<int> > :: iterator it = train_mat.begin();
			while(it!=train_mat.end())
			{
				vector<int> vec = *it;
				vector<int> :: iterator itt = vec.begin();
				while( itt!=vec.end())
				{
					cout<<*itt<<" ";
					itt++;
				}
				cout<<endl;
				it++;
			}

		}//print()

		void train_NB0()
		{
			int num_train_docs = train_mat.size();//sizeof(docs_lists)/sizeof(docs_lists[0]);
			cout<<"num_train_docs = "<<num_train_docs<<endl;
			int num_words = train_mat[0].size() - 1 ;
			/* calculate the sum of the abusive class labels */
			int sum = accumulate(list_classes.begin(),list_classes.end(),0);
			cout<<"sum = "<<sum<<endl;
			//float p_abusive = (float)sum/(float)num_train_docs;
			p_abusive =  (float)sum/(float)num_train_docs;
			cout<<"p_abusive = "<<p_abusive<<endl;

			//vector<float> p0vect(train_mat[0].size(),1); //the frequency of each word in non-abusive docs
			p0vect.resize(train_mat[0].size(),1);
			//vector<float> p1vect(train_mat[0].size(),1); //the frequency of each word in abusive docs
			p1vect.resize(train_mat[0].size(),1);
			printf("p0vect.size() = %d , p1vect.size() = %d\n",(int)p0vect.size(),(int)p1vect.size());
			float p0Denom = 2.0; //the total number of words in non-abusive docs
			float p1Denom = 2.0; //the total number of words in abusive docs

			/* calculate the p0num,p1num,p0Denom,p1Denom */
			for(int i=0;i<list_classes.size();i++)
			{
				if(list_classes[i] == 1)  //abusive doc
				{
					for(int j=0;j<p1vect.size();j++)
					{
						p1vect[j] += train_mat[i][j];
						p1Denom += train_mat[i][j]; //count every occurrence (bag-of-words), not only positions equal to 1
					}
				}
				else   //non-abusive doc
				{
					for(int j=0;j<p0vect.size();j++)
					{
						p0vect[j] += train_mat[i][j];
						p0Denom += train_mat[i][j]; //count every occurrence (bag-of-words), not only positions equal to 1
					}
				}
			}

			for(int i=0;i<p1vect.size();i++)
			{
				p0vect[i] = log(p0vect[i]/p0Denom);
				p1vect[i] = log(p1vect[i]/p1Denom);
			}

			cout<<endl;
		}

		int classify_NB(const char  *filename )
		{
			return_vec = new int[ my_vocab_list.size()+1 ]();
			
			fin.open(filename);
			if(!fin)
			{
				cerr<<"fail to open the file "<<filename<<endl;
				exit(1);
			}
			string word;
			while(fin>>word)
			{
				int pos = my_vocab_list[ word ];
				if( pos!=0 )
				{
					return_vec[ pos ] += 1;
				}
			}
			fin.close();

			cout<<endl;
			float p1 = inner_product( p1vect.begin()+1,p1vect.end(),return_vec+1,0.0f ) + log(p_abusive); //0.0f so the sum accumulates as float, not int
			float p0 = inner_product( p0vect.begin()+1,p0vect.end(),return_vec+1,0.0f ) + log(1-p_abusive);
			delete [] return_vec; //release the word vector allocated above

			cout<<"p1 = "<<p1<<"  "<<"p0 = "<<p0<<endl;

			if( p1>p0 )
			{
				return 1;
			}
			else
			{
				return 0;
			}
		}
	
		void get_error_rate()
		{
			string filename ;
			char buf[3];
			string buf_str;
			int error_count = 0;
			for(int i=1;i<=test_data_num;i++)	
			{
				sprintf(buf,"%d",i);
				buf_str = buf;
				filename = "./email/hamParse/"+buf_str+".dat";
				if( classify_NB( filename.c_str() ) != 0 )
				{
					error_count++;
				}
				
				filename = "./email/spamParse/"+buf_str+".dat";
				if( classify_NB( filename.c_str() ) != 1 )
				{
					error_count++;
				}
			}		
			cout<<"the error rate is : "<<(float)error_count/(float)(2*test_data_num)<<endl;

		}
};

int main()
{
	NaiveBayes nb;
	nb.create_vocab_list();
	//nb.beg_of_words_to_vec(5);
	//nb.beg_of_words_to_vec(30);
	nb.get_train_matrix();
	//nb.print();
	nb.train_NB0();

	char  doc1_to_classify[] = "./email/hamParse/1.dat";
	char  doc2_to_classify[] = "./email/spamParse/1.dat";
	cout<<"doc1 classified as : "<<nb.classify_NB( doc1_to_classify )<<endl;
	cout<<"doc2 classified as : "<<nb.classify_NB( doc2_to_classify )<<endl;
	
	nb.get_error_rate();
	return 0;
}

makefile:

target:
	./textParse.py
	g++ NB3.cc
	./a.out

clean:
	rm ./email/spamParse/*  ./email/hamParse/*   a.out

Compared with the earlier listings, the code adds a get_error_rate() function that measures the classifier's error rate. There are 25 text files under each of email/ham and email/spam. A member variable test_data_num is defined: messages 1 through test_data_num under ham/spam are used as the test set and messages test_data_num+1 through 25 as the training set (here the split is fixed rather than random, for simplicity; a sketch of a random split is given below). Randomly selecting one portion of the data for training and holding out the remainder for testing is called hold-out cross-validation. The constructor therefore initializes list_of_docs with documents test_data_num+1 through 25; create_vocab_list() and get_train_matrix() then produce train_mat, the training function train_NB0() produces p0vect and p1vect, classify_NB() classifies individual documents, and get_error_rate() measures the classification error rate.
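
As a minimal sketch of what a genuinely random hold-out split could look like (a hypothetical helper, not part of NB3.cc): shuffle the indices 1..25 and take the first test_data_num of them as the test set.

#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <vector>
using namespace std;

int main()
{
    const int num_docs = 25;      // ham (or spam) messages are numbered 1..25
    const int test_data_num = 7;  // hypothetical hold-out size

    // Collect the document indices 1..25.
    vector<int> indices;
    for (int i = 1; i <= num_docs; i++)
        indices.push_back(i);

    // Fisher-Yates shuffle using rand().
    srand((unsigned)time(0));
    for (int i = num_docs - 1; i > 0; i--)
    {
        int j = rand() % (i + 1);
        int tmp = indices[i];
        indices[i] = indices[j];
        indices[j] = tmp;
    }

    // The first test_data_num shuffled indices form the test set,
    // the remaining ones the training set.
    printf("test set :");
    for (int i = 0; i < test_data_num; i++)
        printf(" %d", indices[i]);
    printf("\ntrain set:");
    for (int i = test_data_num; i < num_docs; i++)
        printf(" %d", indices[i]);
    printf("\n");
    return 0;
}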


Below is a run with test_data_num = 7:



The error rate is around 7%. In testing, the error rate drops as test_data_num grows, falling to 4% at test_data_num = 12, and then rises again as test_data_num increases further.


If you have any questions, feel free to leave a comment. Thanks!


Please credit the source: http://blog.csdn.net/lavorange/article/details/17841383



References:
Probability and Mathematical Statistics, 4th ed., Zhejiang University edition
Machine Learning in Action, Chinese edition

