Naive Bayes is a standard baseline for text classification problems.
Take spam filtering as an example. We count how often each word occurs in spam and in normal emails.
If an email contains words such as "ad", "purchase", or "link", we could consider it spam.
But these words sometimes also appear in normal emails, so a simple keyword rule is not enough.
There are two steps in Naive Bayes:
1) Training
Count each word in the vocabulary, and estimate the conditional probability of each word given a spam / normal email, e.g.
p(advertisement | spam), p(advertisement | normal)
2) Prediction
Training:
p(purchase | normal) = 3 / (24 * 10) = 1 / 80
p(purchase | spam) = 7 / (12 * 10) = 7 / 120
p(item | normal) = 4 / 240 = 1 / 60
p(item | spam) = 4 / 120 = 1 / 30
p(not | normal) = 4 / 240 = 1 / 60
p(not | spam) = 3 / 120 = 1 / 40
p(advertisement | normal) = 5 / 240 = 1 / 48
p(advertisement | spam) = 4 / 120 = 1 / 30
p(this | normal) = 3 / 240 = 1 / 80
p(this | spam) = 0 / 120 = 0
Prior probability:
P(normal) = 24 / 36 = 2 / 3 (fraction of normal emails among all emails)
P(spam) = 12 / 36 = 1 / 3 (fraction of spam emails among all emails)
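The training step above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not a full trainer: the dictionary `word_counts`, the function name `train`, and the assumption that every email contributes exactly 10 words (read off the denominators 24 * 10 and 12 * 10) are mine.

```python
from fractions import Fraction

# Toy corpus sizes taken from the figures above.
N_NORMAL, N_SPAM = 24, 12
WORDS_PER_EMAIL = 10  # assumption: every email contributes 10 words

# Word occurrence counts per class: (count in normal, count in spam).
word_counts = {
    "purchase": (3, 7),
    "item": (4, 4),
    "not": (4, 3),
    "advertisement": (5, 4),
    "this": (3, 0),
}

def train(counts, n_normal, n_spam, words_per_email):
    """Return (likelihoods, priors) estimated by relative frequency."""
    total_normal = n_normal * words_per_email
    total_spam = n_spam * words_per_email
    likelihoods = {
        w: {"normal": Fraction(c_n, total_normal), "spam": Fraction(c_s, total_spam)}
        for w, (c_n, c_s) in counts.items()
    }
    n_all = n_normal + n_spam
    priors = {"normal": Fraction(n_normal, n_all), "spam": Fraction(n_spam, n_all)}
    return likelihoods, priors

likelihoods, priors = train(word_counts, N_NORMAL, N_SPAM, WORDS_PER_EMAIL)
print(likelihoods["purchase"]["spam"])  # 7/120
print(priors["normal"])                 # 2/3
```

Using `Fraction` keeps the results exact, so they can be compared directly with the hand-computed values.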
We need to compute the conditional probability of spam / normal given the content of the email: P(spam | content) and P(normal | content).
Bayes' theorem: P(Y | X) = P(X | Y) * P(Y) / P(X)
P(X | Y): likelihood
P(Y): prior
P(X): normalization
P(Y | X): posterior
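As a small worked example of Bayes' theorem, the sketch below computes the posterior probability of spam given that a single word, "purchase", was observed, using the toy likelihoods and priors from the training step. The function name `posterior_spam` is hypothetical, and the evidence term is expanded over the two classes.

```python
from fractions import Fraction

def posterior_spam(lik_spam, lik_normal, prior_spam, prior_normal):
    """P(spam | word) via Bayes' theorem; the denominator is P(word)."""
    evidence = lik_spam * prior_spam + lik_normal * prior_normal
    return lik_spam * prior_spam / evidence

p = posterior_spam(
    lik_spam=Fraction(7, 120),   # p(purchase | spam)
    lik_normal=Fraction(1, 80),  # p(purchase | normal)
    prior_spam=Fraction(1, 3),
    prior_normal=Fraction(2, 3),
)
print(p)  # 7/10
```

Even though spam is the less likely class a priori, seeing "purchase" raises the posterior probability of spam to 7/10 in this toy setting.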
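Prediction:
Conditional independence: P(x, y | z) = P(x | z) * P(y | z). Under this assumption, P(content | class) is the product of P(word | class) over the words in the email.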
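The prediction rule can be sketched as follows, using the toy probabilities from the training step. The message is a hypothetical example of mine, not one from the original notes, and `score` is an illustrative helper that returns the unnormalized posterior (prior times the product of per-word likelihoods).

```python
from fractions import Fraction
from math import prod

likelihoods = {
    "purchase": {"normal": Fraction(1, 80), "spam": Fraction(7, 120)},
    "item": {"normal": Fraction(1, 60), "spam": Fraction(1, 30)},
    "this": {"normal": Fraction(1, 80), "spam": Fraction(0)},
}
priors = {"normal": Fraction(2, 3), "spam": Fraction(1, 3)}

def score(words, label):
    """Unnormalized posterior: prior times the product of word likelihoods."""
    return priors[label] * prod(likelihoods[w][label] for w in words)

message = ["purchase", "this", "item"]  # hypothetical message
print(score(message, "normal"))  # 1/576000
print(score(message, "spam"))    # 0
```

The class with the larger score is the prediction. Here the spam score is exactly zero because one factor in the product is zero.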
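But the result degenerates because P(this | spam) = 0: a single unseen word forces the whole product to zero.
We need to apply smoothing.
Add-one smoothing: P(w | c) = (count(w, c) + 1) / (N_c + V), where N_c is the total number of words in class c and V is the vocabulary size.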
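A minimal sketch of add-one (Laplace) smoothing on the toy counts follows. The vocabulary size is not given in the notes, so `V = 1000` is purely a hypothetical value, and `add_one` is an illustrative helper name.

```python
from fractions import Fraction

def add_one(count, total_words_in_class, vocab_size):
    """Add-one smoothed estimate of P(word | class)."""
    return Fraction(count + 1, total_words_in_class + vocab_size)

V = 1000  # hypothetical vocabulary size, not from the original notes

p_this_spam = add_one(0, 120, V)      # word never seen in spam
p_purchase_spam = add_one(7, 120, V)  # word seen 7 times in spam
print(p_this_spam)      # 1/1120
print(p_purchase_spam)  # 1/140
```

The unseen word now gets a small but non-zero probability, so it no longer wipes out the whole product.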
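A problem: multiplying many small probabilities can underflow to zero in floating point.
To avoid underflow, we can take the logarithm:
log(p1 * p2 * p3) = log p1 + log p2 + log p3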
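The underflow problem is easy to reproduce. In the sketch below, the list of 100 identical word probabilities is a made-up example; the point is only that the direct product collapses to 0.0 in double precision while the sum of logarithms stays finite.

```python
import math

probs = [1e-5] * 100  # 100 hypothetical word probabilities

# Direct product: 1e-500 is far below the smallest positive double.
product = 1.0
for p in probs:
    product *= p

# Sum of logs: the same quantity on a log scale.
log_sum = sum(math.log(p) for p in probs)

print(product)  # 0.0
print(log_sum)  # about -1151.29
```

Since the logarithm is monotonic, comparing log scores across classes gives the same prediction as comparing the products.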
Naive Bayes example in Python:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# read spam.csv
df = pd.read_csv("spam.csv", encoding = 'latin')
df.head()
v1 v2 Unnamed: 2 Unnamed: 3 Unnamed: 4
0 ham Go until jurong point, crazy.. Available only ... NaN NaN NaN
1 ham Ok lar... Joking wif u oni... NaN NaN NaN
2 spam Free entry in 2 a wkly comp to win FA Cup fina... NaN NaN NaN
3 ham U dun say so early hor... U c already then say... NaN NaN NaN
4 ham Nah I don't think he goes to usf, he lives aro... NaN NaN NaN
Rename some columns:
# rename the column of v1 and v2
df.rename(columns = {'v1' : 'Label', 'v2' : 'Text'}, inplace = True)
df.head()
Label Text Unnamed: 2 Unnamed: 3 Unnamed: 4
0 ham Go until jurong point, crazy.. Available only ... NaN NaN NaN
1 ham Ok lar... Joking wif u oni... NaN NaN NaN
2 spam Free entry in 2 a wkly comp to win FA Cup fina... NaN NaN NaN
3 ham U dun say so early hor... U c already then say... NaN NaN NaN
4 ham Nah I don't think he goes to usf, he lives aro... NaN NaN NaN
Map the labels to numbers:
# map 'ham' and 'spam' to 0 and 1
df['numLabel'] = df['Label'].map({'ham' : 0, 'spam' : 1})
df.head()
Label Text Unnamed: 2 Unnamed: 3 Unnamed: 4 numLabel
0 ham Go until jurong point, crazy.. Available only ... NaN NaN NaN 0
1 ham Ok lar... Joking wif u oni... NaN NaN NaN 0
2 spam Free entry in 2 a wkly comp to win FA Cup fina... NaN NaN NaN 1
3 ham U dun say so early hor... U c already then say... NaN NaN NaN 0
4 ham Nah I don't think he goes to usf, he lives aro... NaN NaN NaN 0
Count the number of spam/ham emails
# count number of ham and spam
print ('# of ham : ', len(df[df.numLabel == 0]), ' # of spam: ', len(df[df.numLabel == 1]))
print ('# of total samples: ', len(df))
# of ham : 4825 # of spam: 747
# of total samples: 5572
Plot the histogram for text length:
# count the length of text, and plot a histogram
text_lengths = [len(df.loc[i, 'Text']) for i in range(len(df))]
plt.hist(text_lengths, 100, facecolor = 'blue', alpha = 0.5)
plt.xlim([0, 200])
plt.show()
# import the bag-of-words vectorizer
from sklearn.feature_extraction.text import CountVectorizer
# build word-count vectors (based on the frequency of each word)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df.Text)
y = df.numLabel
# split the data into train and test data set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 100)
print('# of samples in the train data set: ', X_train.shape[0], '# of samples in test data set: ', X_test.shape[0])
Output:
# of samples in the train data set: 4457 # of samples in test data set: 1115
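To make the word-count vectors above less of a black box, here is a small pure-Python sketch of the bag-of-words idea. It is not how `CountVectorizer` is implemented and its tokenization differs (for example, the library's default pattern drops single-character tokens); `bag_of_words` and the two sample texts are hypothetical.

```python
import re
from collections import Counter

def bag_of_words(texts):
    """Return (vocabulary, count matrix) for a list of strings."""
    # Lowercase and split on anything that is not a letter or digit.
    tokenized = [re.findall(r"[a-z0-9]+", t.lower()) for t in texts]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        matrix.append([counts[w] for w in vocab])
    return vocab, matrix

vocab, X_toy = bag_of_words(["Free entry free prize", "Ok see you"])
print(vocab)  # ['entry', 'free', 'ok', 'prize', 'see', 'you']
print(X_toy)  # [[1, 2, 0, 1, 0, 0], [0, 0, 1, 0, 1, 1]]
```

Each row is one document and each column is one vocabulary word, which is the layout that the Naive Bayes model consumes.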
# train a multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
clf = MultinomialNB(alpha = 1.0, fit_prior = True)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("accuracy on test data: ", accuracy_score(y_test, y_pred))
accuracy on test data: 0.97847533632287
# print confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred, labels = [0, 1])
array([[956, 14],
[ 10, 135]])
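The confusion matrix above says more than the accuracy alone. With `labels = [0, 1]`, rows are true labels and columns are predicted labels, so the four cells are TN, FP, FN, TP. The sketch below derives accuracy, precision, and recall for the spam class from those four numbers; the variable names are mine.

```python
# Cells of the confusion matrix printed above (ham = 0, spam = 1).
tn, fp = 956, 14   # true ham:  predicted ham, predicted spam
fn, tp = 10, 135   # true spam: predicted ham, predicted spam

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # of emails flagged as spam, how many are spam
recall = tp / (tp + fn)     # of real spam emails, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

The accuracy computed this way agrees with the value reported by `accuracy_score`, while precision and recall show how the errors split between false alarms and missed spam.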
Summary:
Maximum likelihood estimation for the parameters of Naive Bayes:
Unconstrained optimization problem
Constrained optimization problem
Maximum likelihood estimation for Naive Bayes
We introduce the parameters θ and π into the objective function.
π is the K x 1 vector of prior probabilities, one entry per class.
θ is the V x K matrix of word probabilities, with one row per word and one column per class, that is, θij = p(wi | yj), i = 1,...,V, j = 1,...,K,
where V is the size of the vocabulary and K is the number of classes.
Construct the Lagrangian and solve for π
Solve for θ
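The derivation itself is not written out in these notes, so here is one standard way to carry it out. The symbols N (number of training documents), N_j (number of documents in class j), n_ij (total count of word i in documents of class j), and the multipliers λ and μ_j are notation I am assuming; θ and π are as defined above.

```latex
% Log-likelihood of the training set
\ell(\theta, \pi) = \sum_{j=1}^{K} N_j \log \pi_j
                  + \sum_{j=1}^{K} \sum_{i=1}^{V} n_{ij} \log \theta_{ij}

% Constraints: \sum_{j} \pi_j = 1, and \sum_{i} \theta_{ij} = 1 for every class j
\mathcal{L} = \ell(\theta, \pi)
            + \lambda \Big( 1 - \sum_{j=1}^{K} \pi_j \Big)
            + \sum_{j=1}^{K} \mu_j \Big( 1 - \sum_{i=1}^{V} \theta_{ij} \Big)

% Solve for \pi
\frac{\partial \mathcal{L}}{\partial \pi_j} = \frac{N_j}{\pi_j} - \lambda = 0
\;\Rightarrow\; \pi_j = \frac{N_j}{\lambda},
\qquad \sum_{j} \pi_j = 1 \;\Rightarrow\; \lambda = N,
\qquad \pi_j = \frac{N_j}{N}

% Solve for \theta
\frac{\partial \mathcal{L}}{\partial \theta_{ij}} = \frac{n_{ij}}{\theta_{ij}} - \mu_j = 0
\;\Rightarrow\; \theta_{ij} = \frac{n_{ij}}{\mu_j},
\qquad \sum_{i} \theta_{ij} = 1 \;\Rightarrow\; \mu_j = \sum_{i'=1}^{V} n_{i'j},
\qquad \theta_{ij} = \frac{n_{ij}}{\sum_{i'=1}^{V} n_{i'j}}
```

Both solutions are simple relative frequencies, which matches the counting done by hand in the training step at the start of these notes.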
Gaussian Naive Bayes for continuous random variables
We can use a Gaussian distribution to model such a random variable.
The Gaussian distribution has convenient closure properties: the sum of two independent Gaussian random variables is Gaussian, the product of two Gaussian densities is proportional to a Gaussian density,
and the conditional distribution of jointly Gaussian variables is also Gaussian.
Central limit theorem
In probability theory, the central limit theorem (CLT) establishes that, in some situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution (informally a "bell curve") even if the original variables themselves are not normally distributed. The theorem is a key concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions can be applicable to many problems involving other types of distributions.
Example:
Procedure:
1) For each class c, we select all samples xi that belong to c and fit a Gaussian distribution to them.
We fit an independent Gaussian distribution for each class.
2) Then we can evaluate P(xi | y = c) for any sample xi and use it for prediction.
Example:
There are two features, age and income, which are continuous random variables.
We choose Gaussian distributions to fit them.
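The two-step procedure for the age and income example can be sketched in plain Python. This is a minimal illustration under my own assumptions, not a reference implementation: the data set is made up, the function names are hypothetical, and a tiny constant is added to each variance to avoid division by zero.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Per class: prior, plus the mean and variance of every feature."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9  # floor avoids zero variance
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model

def log_gaussian(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict(model, x):
    """Pick the class with the highest log prior plus summed log densities."""
    scores = {
        label: math.log(prior)
        + sum(log_gaussian(v, m, s) for v, m, s in zip(x, means, variances))
        for label, (prior, means, variances) in model.items()
    }
    return max(scores, key=scores.get)

# Hypothetical data: (age, income in thousands), two classes.
X = [[25, 30], [30, 35], [28, 32], [50, 90], [55, 100], [48, 85]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(X, y)
print(predict(model, [27, 31]))  # 0
print(predict(model, [52, 95]))  # 1
```

Working with log densities here serves the same purpose as the log trick for word probabilities: the per-feature terms are added instead of multiplied.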
In practice, if we have many continuous features, we usually do not choose Naive Bayes; we choose logistic regression, XGBoost, etc.
But Naive Bayes remains a baseline for text classification.
A Python implementation of Naive Bayes
Reference: https://github.com/sesiria/ML/blob/master/Lib/NaiveBayesian.py
# author sesiria 2019
# a simple Naive Bayesian classifier implementation
import numpy as np


# ********************** definition of the Naive Bayesian ***************************
class NaiveBayesianClassifier:
    def __init__(self):
        pass

    # currently only numeric labels are supported.
    def fit(self, data, label):
        classes = np.unique(label)
        nWords = data.shape[1]
        # matrix to store the probability of each word in each category.
        self.paramMatrix = np.zeros([nWords, len(classes)], dtype = np.float64)
        self.priorVector = np.zeros(len(classes), dtype = np.float64)
        self.labels = []  # class labels, in the same order as the matrix columns
        for i in range(len(classes)):
            c = classes[i]
            nCurrentSize = len(label[label == c])
            # record the class label for column i
            self.labels.append(c)
            # calculate the prior probability of the class
            self.priorVector[i] = nCurrentSize / len(label)
            # calculate the paramMatrix with add-one smoothing:
            # (word count + 1) / (total words in class + vocabulary size)
            count = np.sum(data[label == c, :], axis = 0) + 1
            # np.sum(count) equals total words in class + nWords after the +1,
            # so each column of paramMatrix sums to 1
            count = count / np.sum(count)
            self.paramMatrix[:, i] = count

    def predict(self, test):
        # a single sample is passed as a 1-D vector
        if (len(test.shape) == 1):
            return self.getCategory(test)
        predictions = np.zeros(test.shape[0])
        for i in range(test.shape[0]):
            predictions[i] = self.getCategory(test[i, :])
        return predictions

    def getCategory(self, test):
        assert test.shape[0] == self.paramMatrix.shape[0]
        p = np.zeros(len(self.labels))
        for idx in range(len(self.labels)):
            # we use the log trick to avoid underflow:
            # log prior + sum of (word count * log word probability)
            p[idx] = np.log(self.priorVector[idx]) + np.sum(np.log(self.paramMatrix[:, idx]) * test)
        return self.labels[np.argmax(p)]


# ************************** unit test function ************************************
def sanity_check():
    X = np.array([[1, 1, 1, 0, 0, 0, 0, 0, 0],
                  [0, 0, 0, 1, 1, 0, 0, 0, 0],
                  [1, 0, 1, 0, 0, 1, 0, 0, 0],
                  [0, 0, 0, 0, 0, 0, 1, 0, 0],
                  [0, 0, 1, 0, 0, 0, 0, 1, 1],
                  [0, 0, 1, 1, 0, 0, 0, 0, 0]
                  ])
    Y = np.array([1, 1, 1, 0, 0, 0])
    X_test = np.array([[1, 0, 0, 1, 2, 0, 1, 0, 0],
                       [1, 0, 0, 0, 0, 0, 1, 1, 0]
                       ])
    clf = NaiveBayesianClassifier()
    clf.fit(X, Y)
    result = clf.predict(X_test)
    print(result)


if __name__ == '__main__':
    sanity_check()