Naive Bayesian for Text Classification (MLE, Gaussian Naive Bayesian)

The Naive Bayesian classifier is a common baseline for text classification problems.

Take spam email detection as an example. We count how often each word occurs in spam emails and in normal emails.

If an email contains words such as "ad", "purchase", or "link", we may consider it spam.

But these words sometimes appear in normal emails too, so a few keywords alone cannot decide the class.

There are two steps in Naive Bayesian classification:

1) Training

Count each word in the vocabulary, and estimate the conditional probability of each word given the class (spam or normal), for example:

p(advertisement | spam), p(advertisement | normal)

2) Prediction

Training:

p(purchase | normal) = 3 / (24 * 10) = 1 / 80

p(purchase | spam) = 7 / (12 * 10) = 7 / 120

p(item | normal) = 4 / 240 = 1 / 60

p(item | spam) = 4 / 120 = 1 / 30

p(not | normal) = 4 / 240 = 1 / 60

p(not | spam) = 3 / 120 = 1 / 40

p(advertisement | normal) = 5 / 240 = 1 / 48

p(advertisement | spam) = 4 / 120 = 1 / 30

p(this | normal) = 3 / 240 = 1 / 80

p(this | spam) = 0 / 120 = 0
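The estimates above are simply word counts divided by the total number of words in each class. Here is a minimal sketch that reproduces them, assuming (as the denominators 24 * 10 and 12 * 10 suggest) 24 normal and 12 spam emails of 10 words each. The word keys and variable names are illustrative.

```python
from fractions import Fraction

# Word counts per class, taken from the worked example.
# Assumption: 24 normal and 12 spam emails, 10 words each.
counts = {
    "normal": {"purchase": 3, "item": 4, "not": 4, "advertisement": 5, "this": 3},
    "spam":   {"purchase": 7, "item": 4, "not": 3, "advertisement": 4, "this": 0},
}
total_words = {"normal": 24 * 10, "spam": 12 * 10}

# Maximum likelihood estimate: p(word | class) = count / total words in class
likelihood = {
    cls: {w: Fraction(c, total_words[cls]) for w, c in words.items()}
    for cls, words in counts.items()
}

for cls, probs in likelihood.items():
    for word, p in probs.items():
        print(f"p({word} | {cls}) = {p}")
```

Exact fractions are used only so the printed values can be compared with the hand calculation; real implementations work with floating-point numbers.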

Prior probability

Probability that an email is normal: 24 / 36 = 2 / 3

Probability that an email is spam: 12 / 36 = 1 / 3

We need to calculate the conditional probability of spam/normal given the content of the email: P(spam | content) and P(normal | content).

Bayes' Theorem

P(Y | X) = P(X | Y) * P(Y) / P(X)

P(X | Y): likelihood

P(Y): prior

P(X): normalization

P(Y | X): posterior
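As a quick check of the theorem with the numbers from the training step, the posterior probability of spam given the single word "purchase" can be computed directly. This is only a sketch; the variable names are illustrative.

```python
from fractions import Fraction

# Priors and likelihoods from the worked example
prior = {"normal": Fraction(2, 3), "spam": Fraction(1, 3)}
likelihood_purchase = {"normal": Fraction(1, 80), "spam": Fraction(7, 120)}

# P(X): normalization term, summed over both classes
evidence = sum(likelihood_purchase[c] * prior[c] for c in prior)

# P(Y | X) = P(X | Y) * P(Y) / P(X)
posterior = {c: likelihood_purchase[c] * prior[c] / evidence for c in prior}
print(posterior["spam"], posterior["normal"])  # 7/10 3/10
```

Even though spam is the rarer class a priori, seeing "purchase" raises its posterior to 7/10 because the word is much more frequent in spam.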

Prediction:

Conditional independence P(x, y | z) = P(x | z) * P(y | z)
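Under this assumption the likelihood of a whole email is the product of its per-word likelihoods. A small sketch of the prediction step, using the probabilities from the training example and a hypothetical three-word email (the email itself is an assumption made for illustration):

```python
from fractions import Fraction
from math import prod

prior = {"normal": Fraction(2, 3), "spam": Fraction(1, 3)}
likelihood = {
    "normal": {"purchase": Fraction(1, 80), "item": Fraction(1, 60), "not": Fraction(1, 60)},
    "spam":   {"purchase": Fraction(7, 120), "item": Fraction(1, 30), "not": Fraction(1, 40)},
}

email = ["purchase", "item", "not"]  # hypothetical email, for illustration only

# Naive assumption: words are conditionally independent given the class,
# so the joint likelihood is the product of the per-word likelihoods.
score = {c: prior[c] * prod(likelihood[c][w] for w in email) for c in prior}

# Normalize the scores to obtain posteriors, then pick the larger one
total = sum(score.values())
posterior = {c: s / total for c, s in score.items()}
prediction = max(score, key=score.get)
print(prediction, posterior["spam"])  # spam 7/8
```

The normalization term is the same for both classes, so comparing the unnormalized scores is enough to choose the class.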

But the result breaks down when P(this | spam) = 0: a single zero factor makes the whole product zero, whatever the other words are.

We need to apply smoothing.

Add-one smoothing:
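Add-one (Laplace) smoothing adds 1 to every word count and adds the vocabulary size V to the denominator, so that p(w | c) = (count(w, c) + 1) / (total words in c + V). The sketch below applies it to the word "this", which never occurs in spam. The worked example does not state V, so the value used here is purely hypothetical.

```python
from fractions import Fraction

def add_one_smoothing(count, total_words, vocab_size):
    """p(w | c) = (count + 1) / (total_words + vocab_size)."""
    return Fraction(count + 1, total_words + vocab_size)

VOCAB_SIZE = 1000  # hypothetical: the example does not state the vocabulary size

# "this" never occurs in the 120 words of spam, so the unsmoothed estimate is 0
unsmoothed = Fraction(0, 120)
smoothed = add_one_smoothing(0, 120, VOCAB_SIZE)
print(unsmoothed, smoothed)  # 0 1/1120
```

The smoothed estimate is small but non-zero, so one unseen word can no longer wipe out the whole product. In scikit-learn's MultinomialNB this corresponds to the additive smoothing parameter alpha = 1.0.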

A problem: multiplying many small probabilities can underflow.

To avoid underflow, we can take logarithms:

log(p1 * p2 * p3) = log p1 + log p2 + log p3
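A short demonstration of why the logarithm helps, using hypothetical word probabilities (400 words with probability 0.001 each, chosen only to trigger the effect):

```python
import math

# 400 word probabilities of 1e-3 each (hypothetical values, for illustration)
probs = [1e-3] * 400

# Direct product: the true value 1e-1200 is far below the smallest
# positive float64, so the result underflows to exactly 0.0
product = 1.0
for p in probs:
    product *= p

# Sum of logs: stays a finite, comparable number
log_score = sum(math.log(p) for p in probs)

print(product, log_score)
```

Because the logarithm is monotonic, the class with the largest sum of logs is also the class with the largest product, so predictions are unchanged.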

Naive Bayesian example in Python:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# read spam.csv
df = pd.read_csv("spam.csv", encoding = 'latin')
df.head()

 	v1 	v2 	Unnamed: 2 	Unnamed: 3 	Unnamed: 4
0 	ham 	Go until jurong point, crazy.. Available only ... 	NaN 	NaN 	NaN
1 	ham 	Ok lar... Joking wif u oni... 	NaN 	NaN 	NaN
2 	spam 	Free entry in 2 a wkly comp to win FA Cup fina... 	NaN 	NaN 	NaN
3 	ham 	U dun say so early hor... U c already then say... 	NaN 	NaN 	NaN
4 	ham 	Nah I don't think he goes to usf, he lives aro... 	NaN 	NaN 	NaN

Rename some columns:

# rename the columns v1 and v2
df.rename(columns = {'v1' : 'Label', 'v2' : 'Text'}, inplace = True)
df.head()

 	Label 	Text 	Unnamed: 2 	Unnamed: 3 	Unnamed: 4
0 	ham 	Go until jurong point, crazy.. Available only ... 	NaN 	NaN 	NaN
1 	ham 	Ok lar... Joking wif u oni... 	NaN 	NaN 	NaN
2 	spam 	Free entry in 2 a wkly comp to win FA Cup fina... 	NaN 	NaN 	NaN
3 	ham 	U dun say so early hor... U c already then say... 	NaN 	NaN 	NaN
4 	ham 	Nah I don't think he goes to usf, he lives aro... 	NaN 	NaN 	NaN

Map the labels to numbers:

# map 'ham' and 'spam' to 0 and 1
df['numLabel'] = df['Label'].map({'ham' : 0, 'spam' : 1})
df.head()


	Label 	Text 	Unnamed: 2 	Unnamed: 3 	Unnamed: 4 	numLabel
0 	ham 	Go until jurong point, crazy.. Available only ... 	NaN 	NaN 	NaN 	0
1 	ham 	Ok lar... Joking wif u oni... 	NaN 	NaN 	NaN 	0
2 	spam 	Free entry in 2 a wkly comp to win FA Cup fina... 	NaN 	NaN 	NaN 	1
3 	ham 	U dun say so early hor... U c already then say... 	NaN 	NaN 	NaN 	0
4 	ham 	Nah I don't think he goes to usf, he lives aro... 	NaN 	NaN 	NaN 	0

Count the number of spam/ham emails

# count number of ham and spam
print ('# of ham : ', len(df[df.numLabel == 0]), ' # of spam: ', len(df[df.numLabel == 1]))
print ('# of total samples: ', len(df))

# of ham :  4825  # of spam:  747
# of total samples:  5572

Plot the histogram for text length:

# compute the length of each text, and plot a histogram
text_lengths = [len(df.loc[i, 'Text']) for i in range(len(df))]
plt.hist(text_lengths, 100, facecolor = 'blue', alpha = 0.5)
plt.xlim([0, 200])
plt.show()

# import the bag-of-words vectorizer
from sklearn.feature_extraction.text import CountVectorizer

# construct word-count vectors (based on the frequency of each word)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df.Text)
y = df.numLabel

# split the data into train and test data set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 100)
print('# of samples in the train data set: ', X_train.shape[0], '# of samples in test data set: ', X_test.shape[0])

Output:

# of samples in the train data set:  4457 # of samples in test data set: 1115

# use the Naive Bayesian for model training
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

clf = MultinomialNB(alpha = 1.0, fit_prior = True)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("accuracy on test data: ", accuracy_score(y_test, y_pred))

accuracy on test data:  0.97847533632287

# print confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred, labels = [0, 1])

array([[956,  14],
       [ 10, 135]])
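Because the data set is imbalanced (far more ham than spam), accuracy alone can be misleading. Precision and recall for the spam class can be read off the confusion matrix above. This sketch hard-codes the matrix from the output, with rows as true labels and columns as predicted labels in the order [ham, spam]:

```python
import numpy as np

# Confusion matrix from the output above:
# rows are true labels, columns are predictions, in the order [ham, spam]
cm = np.array([[956, 14],
               [10, 135]])

tn, fp, fn, tp = cm.ravel()
accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the messages flagged as spam, how many are spam
recall = tp / (tp + fn)      # of the spam messages, how many were caught

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

The accuracy computed this way matches the value printed by accuracy_score, while precision (135 / 149) and recall (135 / 145) show that the spam class is handled somewhat less well than the overall figure suggests.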

Summary:

Maximum likelihood estimation for the parameters of Naive Bayesian:

Unconstrained optimization problem

Constrained optimization problem

Maximum likelihood estimation for Naive Bayesian

We introduce the parameters θ and π into our objective function.

π is the K x 1 vector of prior probabilities, one per class.

θ is the V x K matrix of word probabilities, with one row per word and one column per class, that is, θij = p(wi | yj), i = 1,...,V, j = 1,...,K,

where V is the size of the vocabulary and K is the number of classes.

Construct the Lagrangian with Lagrange multipliers and solve for π

Solve for θ
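For reference, the standard closed-form solution of this constrained maximization is πj = Nj / N (the fraction of documents in class j) and θij = nij / Σi' ni'j (the count of word i in class j divided by the total word count of class j). The sketch below computes both on a small hypothetical count matrix and checks that the constraints hold; the data and variable names are assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy data: 4 documents, vocabulary of V = 3 words, K = 2 classes
X = np.array([[2, 1, 0],
              [1, 0, 1],
              [0, 3, 1],
              [0, 1, 2]])
y = np.array([0, 0, 1, 1])

K = len(np.unique(y))
V = X.shape[1]

# pi_j = N_j / N  (fraction of documents in class j)
pi = np.array([np.mean(y == j) for j in range(K)])

# theta_ij = n_ij / sum_i' n_i'j  (V x K matrix, one column per class)
theta = np.zeros((V, K))
for j in range(K):
    word_counts = X[y == j].sum(axis=0)
    theta[:, j] = word_counts / word_counts.sum()

print(pi)      # sums to 1
print(theta)   # each column sums to 1
```

The two constraints enforced by the Lagrange multipliers are exactly these normalizations: π sums to 1 over classes, and each column of θ sums to 1 over the vocabulary.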

Gaussian Naive Bayesian for continuous random variables

We can use the Gaussian distribution to model such a random variable.

The Gaussian distribution has convenient closure properties: the sum of two independent Gaussian random variables is Gaussian, and the product of two Gaussian densities is proportional to a Gaussian density.

The conditional distribution of jointly Gaussian random variables is also Gaussian.

Central limit theorem

In probability theory, the central limit theorem (CLT) establishes that, in some situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution (informally a "bell curve") even if the original variables themselves are not normally distributed. The theorem is a key concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions can be applicable to many problems involving other types of distributions.

Example:

Procedure:

1) For each class c, we select all samples xi belonging to c and fit a Gaussian distribution to them.

We fit an independent Gaussian distribution for each class.

2) Then we can evaluate the likelihood P(xi | y = c) for any sample xi.

Example:

There are two features, age and income, which are continuous random variables.

We choose Gaussian distributions to fit these features.
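The fitting procedure can be sketched with NumPy for two continuous features such as age and income. The data, function names, and the small variance floor are all assumptions made for illustration, not part of the original example.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Fit one independent Gaussian per feature and per class."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = {
            "prior": len(Xc) / len(X),
            "mean": Xc.mean(axis=0),
            "var": Xc.var(axis=0) + 1e-9,  # small constant avoids division by zero
        }
    return params

def predict_gaussian_nb(params, x):
    """Return the class with the largest log posterior for one sample x."""
    scores = {}
    for c, p in params.items():
        # log of the Gaussian density, summed over features (naive assumption)
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * p["var"])
                                + (x - p["mean"]) ** 2 / p["var"])
        scores[c] = np.log(p["prior"]) + log_lik
    return max(scores, key=scores.get)

# Hypothetical data: columns are age (years) and income (thousands)
X = np.array([[25, 30], [30, 35], [28, 32],
              [50, 90], [55, 100], [48, 85]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

params = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(params, np.array([27.0, 31.0])))
print(predict_gaussian_nb(params, np.array([52.0, 95.0])))
```

Each class keeps only a mean and a variance per feature, which is why the model is cheap to fit; scikit-learn's GaussianNB follows the same idea.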

In the real world, if we have many continuous features, we usually do not choose the Naive Bayesian model; we choose logistic regression, XGBoost, etc.

But Naive Bayesian remains a baseline for text classification.

A Python implementation of Naive Bayesian

Reference: https://github.com/sesiria/ML/blob/master/Lib/NaiveBayesian.py

# author sesiria 2019
# a simple Naive Bayesian classifier implementation
import numpy as np

# **********************definition of the Naive Bayesian***************************
class NaiveBayesianClassifier:
    def __init__(self):
        pass
    
    # currently we only support numeric labels.
    def fit(self, data, label):
        classes = np.unique(label)
        nWords = data.shape[1]
        # matrix to store the probability for each word in each category.
        self.paramMatrix = np.zeros([nWords, len(classes)], dtype = np.float64)
        self.priorVector = np.zeros(len(classes), dtype = np.float64)
        self.labels = [] # class label

        for i in range(len(classes)):
            c = classes[i]
            nCurrentSize = len(label[label == c])
            # record the class label
            self.labels.append(c)
            # we calculate the priorVector
            self.priorVector[i] = nCurrentSize / len(label)
            # calculate the paramMatrix with add-one smoothing:
            # (count of word in class + 1) / (total words in class + vocabulary size)
            count = np.sum(data[label == c, :], axis = 0) + 1
            count = count / (np.sum(data[label == c, :]) + nWords)
            self.paramMatrix[:, i] = count

    def predict(self, test):
        if (len(test.shape) == 1):
            return self.getCategory(test)

        predictions = np.zeros(test.shape[0])
        for i in range(test.shape[0]):
            predictions[i] = self.getCategory(test[i, :])
        return predictions

    def getCategory(self, test):
        assert test.shape[0] == self.paramMatrix.shape[0]
        p = np.zeros(len(self.labels))
        for idx in range(len(self.labels)):
            # we use the log trick to avoid the underflow,
            # and add the log prior of the class to the log likelihood
            p[idx] = np.log(self.priorVector[idx]) + np.sum(np.log(self.paramMatrix[:, idx]) * test)

        return self.labels[np.argmax(p)]
        
# **************************unit test function.************************************
def sanity_check():
    X = np.array([[1, 1, 1, 0, 0, 0, 0, 0, 0],
                  [0, 0, 0, 1, 1, 0, 0, 0, 0],
                  [1, 0, 1, 0, 0, 1, 0, 0, 0],
                  [0, 0, 0, 0, 0, 0, 1, 0, 0],
                  [0, 0, 1, 0, 0, 0, 0, 1, 1],
                  [0, 0, 1, 1, 0, 0, 0, 0, 0]
                ])
    Y = np.array([1, 1, 1, 0, 0, 0])
    X_test = np.array([[1, 0, 0, 1, 2, 0, 1, 0, 0],
                        [1, 0, 0, 0, 0, 0, 1, 1, 0]
                       ])
    clf = NaiveBayesianClassifier()
    clf.fit(X, Y)
    result = clf.predict(X_test)
    print(result)

if __name__ == '__main__':
    sanity_check()
