Naive Bayesian for Text Classification (MLE, Gaussian Naive Bayesian)

The Naive Bayesian classifier is a common baseline for text classification problems.

A spam email example: we count the frequency of the words that occur in spam and normal emails.

If an email contains words such as advertisement, purchase, or link, we might consider it spam.

But sometimes these words also appear in normal emails, so the problem is more complicated than simple keyword matching.

There are two steps in the Naive Bayesian approach:

1) Training

Count each word in the vocabulary and estimate the conditional probability of each word given the spam/normal class, for example:

p(advertisement | spam),  p(advertisement | normal)

2) Prediction

 

Training:

Suppose the training set contains 24 normal emails and 12 spam emails, each consisting of 10 words (so 240 words in total in normal emails and 120 in spam emails).

p(purchase | normal) = 3 / (24 * 10) = 1/80

p(purchase | spam) = 7 / (12 * 10) = 7/120

p(item | normal) = 4 / 240 = 1/60

p(item | spam) = 4 / 120 = 1/30

p(not | normal) = 4 / 240 = 1/60

p(not | spam) = 3 / 120 = 1/40

p(advertisement | normal) = 5 / 240 = 1/48

p(advertisement | spam) = 4 / 120 = 1/30

p(this | normal) = 3 / 240 = 1/80

p(this | spam) = 0 / 120 = 0

 

Prior Probability

Probability that an email is normal: 24 / 36 = 2 / 3

Probability that an email is spam: 12 / 36 = 1 / 3
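A minimal Python sketch that reproduces the estimates above (the corpus layout of 24 normal and 12 spam emails with 10 words each, and the per-class word counts, are taken from the example):

# toy corpus from the example: 24 normal emails, 12 spam emails, 10 words each
n_normal, n_spam, words_per_email = 24, 12, 10
total_normal_words = n_normal * words_per_email   # 240
total_spam_words = n_spam * words_per_email       # 120

# word counts per class, copied from the example above
counts_normal = {'purchase': 3, 'item': 4, 'not': 4, 'advertisement': 5, 'this': 3}
counts_spam   = {'purchase': 7, 'item': 4, 'not': 3, 'advertisement': 4, 'this': 0}

# maximum likelihood estimates p(word | class)
p_word_given_normal = {w: c / total_normal_words for w, c in counts_normal.items()}
p_word_given_spam   = {w: c / total_spam_words for w, c in counts_spam.items()}

# prior probabilities p(class)
p_normal = n_normal / (n_normal + n_spam)   # 24/36 = 2/3
p_spam   = n_spam / (n_normal + n_spam)     # 12/36 = 1/3

print(p_word_given_normal['purchase'], p_word_given_spam['purchase'])   # 0.0125 (1/80), 0.0583... (7/120)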

 

We need to calculate the conditional probability of spam/normal given the content of the email: P(spam | content) and P(normal | content).

Bayes' Theorem:

P(Y | X) = P(X | Y) * P(Y) / P(X)

P(X | Y): likelihood

P(Y): prior

P(X): normalization constant

P(Y | X): posterior

In our case, P(spam | content) = P(content | spam) * P(spam) / P(content).

Prediction:

Conditional independence assumption:  P(x, y | z) = P(x | z) * P(y | z). Applied to an email, the words are assumed independent given the class, so P(content | spam) = P(w1 | spam) * P(w2 | spam) * ...
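As a worked example (the exact wording of the test email is assumed here; only words that appear in the vocabulary above contribute), suppose the email contains the words "this", "purchase", "item", and "advertisement":

P(spam | email) ∝ P(spam) * P(this | spam) * P(purchase | spam) * P(item | spam) * P(advertisement | spam)
                = 1/3 * 0 * 7/120 * 1/30 * 1/30 = 0

P(normal | email) ∝ P(normal) * P(this | normal) * P(purchase | normal) * P(item | normal) * P(advertisement | normal)
                  = 2/3 * 1/80 * 1/80 * 1/60 * 1/48 > 0

So any email containing the word "this" would always be classified as normal, no matter what else it contains.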

 

But the result is abnormal because P(this | spam) = 0: a single word that never appeared in the spam training data makes the spam posterior exactly zero.

We need to apply some smoothing.

Add-one smoothing:
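The standard add-one (Laplace) smoothing formula, with V denoting the vocabulary size:

p(w | c) = (count(w, c) + 1) / (total word count in class c + V)

For example, p(this | spam) becomes (0 + 1) / (120 + V) > 0, so a single unseen word no longer forces the whole product to zero.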

 

Another problem: the product of many small probabilities can underflow to zero in floating point.

To avoid underflow we can work with logarithms:

log(p1 * p2 * p3) = logp1 + log p2 + log p3
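A minimal sketch of scoring in log space (the numbers here are purely illustrative):

import numpy as np

probs = np.array([1/3, 1/30, 1/40])   # prior and per-word likelihoods (illustrative)
naive_product = np.prod(probs)        # a direct product can underflow for long documents
log_score = np.sum(np.log(probs))     # stable: log(p1 * p2 * p3) = log p1 + log p2 + log p3
print(naive_product, np.exp(log_score))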

 

Naive Bayesian example in Python:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# read spam.csv
df = pd.read_csv("spam.csv", encoding = 'latin')
df.head()

 

 	v1 	v2 	Unnamed: 2 	Unnamed: 3 	Unnamed: 4
0 	ham 	Go until jurong point, crazy.. Available only ... 	NaN 	NaN 	NaN
1 	ham 	Ok lar... Joking wif u oni... 	NaN 	NaN 	NaN
2 	spam 	Free entry in 2 a wkly comp to win FA Cup fina... 	NaN 	NaN 	NaN
3 	ham 	U dun say so early hor... U c already then say... 	NaN 	NaN 	NaN
4 	ham 	Nah I don't think he goes to usf, he lives aro... 	NaN 	NaN 	NaN

 

Rename some columns:

# rename  the column of v1 and v2
df.rename(columns = {'v1' : 'Label', 'v2' : 'Text'}, inplace = True)
df.head()
 	Label 	Text 	Unnamed: 2 	Unnamed: 3 	Unnamed: 4
0 	ham 	Go until jurong point, crazy.. Available only ... 	NaN 	NaN 	NaN
1 	ham 	Ok lar... Joking wif u oni... 	NaN 	NaN 	NaN
2 	spam 	Free entry in 2 a wkly comp to win FA Cup fina... 	NaN 	NaN 	NaN
3 	ham 	U dun say so early hor... U c already then say... 	NaN 	NaN 	NaN
4 	ham 	Nah I don't think he goes to usf, he lives aro... 	NaN 	NaN 	NaN

Map the Label column to a number:

# map 'ham' and 'spam' to 0 and 1
df['numLabel'] = df['Label'].map({'ham' : 0, 'spam' : 1})
df.head()

 


	Label 	Text 	Unnamed: 2 	Unnamed: 3 	Unnamed: 4 	numLabel
0 	ham 	Go until jurong point, crazy.. Available only ... 	NaN 	NaN 	NaN 	0
1 	ham 	Ok lar... Joking wif u oni... 	NaN 	NaN 	NaN 	0
2 	spam 	Free entry in 2 a wkly comp to win FA Cup fina... 	NaN 	NaN 	NaN 	1
3 	ham 	U dun say so early hor... U c already then say... 	NaN 	NaN 	NaN 	0
4 	ham 	Nah I don't think he goes to usf, he lives aro... 	NaN 	NaN 	NaN 	0

Count the number of spam/ham emails

# count number of ham and spam
print ('# of ham : ', len(df[df.numLabel == 0]), ' # of spam: ', len(df[df.numLabel == 1]))
print ('# of total samples: ', len(df))
# of ham :  4825  # of spam:  747
# of total samples:  5572

Plot a histogram of the text lengths:

# compute the length of each text and plot a histogram
text_lengths = [len(df.loc[i, 'Text']) for i in range(len(df))]
plt.hist(text_lengths, 100, facecolor = 'blue', alpha = 0.5)
plt.xlim([0, 200])
plt.show()

 

 

# import CountVectorizer for the bag-of-words representation
from sklearn.feature_extraction.text import CountVectorizer

# construct word-count vectors (based on the frequency of each word)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df.Text)
y = df.numLabel

# split the data into train and test data set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 100)
print('# of samples in the train data set: ', X_train.shape[0], '# of samples in test data set: ', X_test.shape[0])

Output:

# of samples in the train data set:  4457 # of samples in test data set: 1115
# use the Naive Bayesian for model training
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

clf = MultinomialNB(alpha = 1.0, fit_prior = True)   # alpha = 1.0 is add-one (Laplace) smoothing
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("accuracy on test data: ", accuracy_score(y_test, y_pred))
accuracy on test data:  0.97847533632287
# print confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred, labels = [0, 1])
array([[956,  14],
       [ 10, 135]])

 

Summary:

Maximum likelihood estimation for the parameters of the Naive Bayesian model:

Unconstrained optimization problem

Constrained optimization problem

 

Maximum likelihood estimation for Naive Bayesian

We introduce the parameters θ and π in our objective function.

π is the K x 1 vector of prior probabilities, one entry per class.

θ is the V x K matrix of word probabilities, with one row per word and one column per class, that is θij = p(wi | yj), i = 1, ..., V, j = 1, ..., K,

where V is the size of the vocabulary and K is the number of classes.

Construct the Lagrangian with a multiplier and solve for π:
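A sketch of the standard derivation for π (assuming the usual Naive Bayesian log-likelihood, where Nk is the number of training documents in class k and N is the total number of documents):

maximize  Σk Nk * log(πk)   subject to  Σk πk = 1

L(π, λ) = Σk Nk * log(πk) + λ * (1 - Σk πk)

∂L/∂πk = Nk / πk - λ = 0   =>   πk = Nk / λ

Summing over k and using the constraint Σk πk = 1 gives λ = N, so

πk = Nk / N

i.e. the prior of each class is its relative frequency in the training data (24/36 and 12/36 in the example above).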

 

Solve for θ:
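Similarly, a sketch for θ (where nij is the count of word wi over all documents of class j):

maximize  Σi nij * log(θij)   subject to  Σi θij = 1 for each class j

L(θ, λj) = Σi nij * log(θij) + λj * (1 - Σi θij)

∂L/∂θij = nij / θij - λj = 0   =>   θij = nij / λj

The constraint gives λj = Σi nij, so

θij = nij / Σi nij

i.e. p(wi | yj) is the count of wi in class j divided by the total word count of class j, which is exactly the estimate used in the training example above (add-one smoothing modifies this to (nij + 1) / (Σi nij + V)).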

 

Gaussian Naive Bayesian for continuous random variables

We can use a Gaussian distribution to model such a random variable.

The Gaussian distribution has convenient properties: the sum of independent Gaussian random variables is again Gaussian, and the product of two Gaussian densities is (up to normalization) also Gaussian;

the conditional distribution of jointly Gaussian random variables is also Gaussian.

Central limit theorem

In probability theory, the central limit theorem (CLT) establishes that, in some situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution (informally a "bell curve") even if the original variables themselves are not normally distributed. The theorem is a key concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions can be applicable to many problems involving other types of distributions.

Example:

Procedure:

1) For each class c, we take all of the samples xi that belong to c and fit a Gaussian distribution.

We fit an independent Gaussian distribution for each class.

2) Then, for any new sample xi, we can evaluate P(xi | y = c) under each fitted Gaussian and apply Bayes' rule as before.

Example:

Suppose there are two features, age and income, which are continuous random variables.

We fit a Gaussian distribution to each of them for every class, as in the sketch below.
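A minimal sketch using scikit-learn's GaussianNB; the age/income values and labels below are invented purely for illustration:

import numpy as np
from sklearn.naive_bayes import GaussianNB

# made-up continuous features [age, income] and binary labels
X = np.array([[25, 30000], [47, 85000], [35, 62000],
              [52, 110000], [23, 27000], [40, 72000]])
y = np.array([0, 1, 0, 1, 0, 1])

clf = GaussianNB()            # fits one Gaussian per feature and per class
clf.fit(X, y)
print(clf.predict([[30, 50000]]))        # predicted class for a new sample
print(clf.predict_proba([[30, 50000]]))  # per-class posterior probabilities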

In the real world, if we have many continuous features we usually do not choose the Naive Bayesian model; we choose logistic regression, XGBoost, etc. instead.

But the Naive Bayesian classifier remains a solid baseline for text classification.

 

A Python implementation of the Naive Bayesian classifier

Reference: https://github.com/sesiria/ML/blob/master/Lib/NaiveBayesian.py

# author sesiria 2019
# a simple Naive Bayesian classifier implementation
import numpy as np

# **********************definition of the Naive Bayesian***************************
class NaiveBayesianClassifier:
    def __init__(self):
        pass
    
    # currently we only support numeric labels.
    def fit(self, data, label):
        classes = np.unique(label)
        nWords = data.shape[1]
        # matrix to store the probability for each word in each category.
        self.paramMatrix = np.zeros([nWords, len(classes)], dtype = np.float64)
        self.priorVector = np.zeros(len(classes), dtype = np.float64)
        self.labels = [] # class label

        for i in range(len(classes)):
            c = classes[i]
            nCurrentSize = len(label[label == c])
            # build category hashtable
            self.labels.append(c)
            # we calculate the priorVector
            self.priorVector[i] = nCurrentSize / len(label)
            # calculate the paramMatrix with add-one (Laplace) smoothing:
            # (word count in class + 1) / (total word count in class + vocabulary size)
            classWordCount = np.sum(data[label == c, :], axis = 0)
            count = (classWordCount + 1) / (np.sum(classWordCount) + nWords)
            self.paramMatrix[:, i] = count            

    def predict(self, test):
        if (len(test.shape) == 1):
            return self.getCategory(test)

        predictions = np.zeros(test.shape[0])
        for i in range(test.shape[0]):
            predictions[i] = self.getCategory(test[i, :])
        return predictions

    def getCategory(self, test):
        assert test.shape[0] == self.paramMatrix.shape[0]
        p = np.zeros(len(self.labels))
        for idx in range(len(self.labels)):
            # we use the log trick to avoid underflow, and include the log prior
            p[idx] = np.log(self.priorVector[idx]) + np.sum(np.log(self.paramMatrix[:, idx]) * test)

        return self.labels[np.argmax(p)]
        
# **************************unit test function.************************************
def sanity_check():
    X = np.array([[1, 1, 1, 0, 0, 0, 0, 0, 0],
                  [0, 0, 0, 1, 1, 0, 0, 0, 0],
                  [1, 0, 1, 0, 0, 1, 0, 0, 0],
                  [0, 0, 0, 0, 0, 0, 1, 0, 0],
                  [0, 0, 1, 0, 0, 0, 0, 1, 1],
                  [0, 0, 1, 1, 0, 0, 0, 0, 0]
                ])
    Y = np.array([1, 1, 1, 0, 0, 0])
    X_test = np.array([[1, 0, 0, 1, 2, 0, 1, 0, 0],
                        [1, 0, 0, 0, 0, 0, 1, 1, 0]
                       ])
    clf = NaiveBayesianClassifier()
    clf.fit(X, Y)
    result = clf.predict(X_test)
    print(result)

if __name__ == '__main__':
    sanity_check()

 
