ML Primer 3.0 -- Hand-Written Naive Bayes (Naïve Bayes)
Introduction to Naive Bayes
Bayesian classification is the general name for a family of classification algorithms built on a probabilistic framework, all of which use Bayes' theorem at their core to solve classification problems; naive Bayes is the simplest classification method in the Bayesian family.
Naive Bayes has been studied extensively since the 1950s. It was introduced to the text information retrieval community under a different name in the early 1960s and remains a popular baseline method for text categorization, the problem of judging which category a document belongs to (e.g. spam vs. legitimate, sports or politics, and so on) with word frequencies as the features. With appropriate pre-processing, it is competitive in this domain with more advanced methods, including support vector machines. It also finds application in automatic medical diagnosis. (Baidu Baike)
Principles of Naive Bayes
Algorithm Idea
The basic idea of naive Bayes is: first, assume that the components of the dataset's feature vectors are mutually independent (the source of "naive"); on that assumption, for each instance to be classified, compute the probability of every class conditioned on that instance's feature vector, and choose the class with the largest probability as the classification result.
Bayes' Theorem
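The formula that belongs here is the standard form of Bayes' theorem: for a class $c$ and a feature vector $x = (x_1, \dots, x_n)$,

```latex
P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)}
```

Since $P(x)$ is the same for every class, only the numerator matters when comparing classes.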
Assumption: Features Are Mutually Independent
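The "naive" independence assumption states that, conditioned on the class, the features are mutually independent:

```latex
P(x_1, \dots, x_n \mid c) = \prod_{i=1}^{n} P(x_i \mid c)
```

so the predicted class is $\hat{c} = \arg\max_{c} P(c) \prod_{i=1}^{n} P(x_i \mid c)$.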
Laplacian Smoothing
In the naive Bayes computation above, some probability estimate may turn out to equal 0, which forces the probability of the whole instance to 0. To avoid this, a simple remedy is introduced: Laplacian smoothing (also called Laplace correction). The idea is just to add 1 to the numerator and add the number of possible outcomes to the denominator. In the formulas below, $a(x)$ is the value of object $x$ on attribute $a$, and $d(x)$ is the class of $x$.

Laplacian smoothing for the class prior:

$$P^L(c_k) = \frac{|\{x : d(x) = c_k\}| + 1}{m + K}$$

where $K$ is the number of decision classes and $m$ is the size of the training set. Likewise, for a conditional probability:

$$P^L(a(x) = v \mid c_k) = \frac{|\{x : a(x) = v,\ d(x) = c_k\}| + 1}{|\{x : d(x) = c_k\}| + V_a}$$

where $V_a$ is the number of distinct values of attribute $a$.
Worked example:
Here is a very detailed and clear example:
the "should she marry" decision (from the Zhihu article 帶你理解樸素貝葉斯分類算法 by 憶臻).
Hand-Written Naive Bayes Classifier:
Dataset
The dataset used here is mushroom.csv; see my GitHub for details.
Functions
Func1: readNominalData(paraFilename) loads the dataset
import numpy as np

def readNominalData(paraFilename):
    '''
    Read nominal data from paraFilename.
    :param paraFilename: path of the data file (CSV with a header line)
    :return: resultNames (feature names), resultData (one list of strings per instance)
    '''
    resultData = []
    tempFile = open(paraFilename)
    # The first line holds the feature names
    tempLine = tempFile.readline().replace('\n', '')
    tempNames = np.array(tempLine.split(','))
    resultNames = [tempValue for tempValue in tempNames]
    # Every remaining line is one instance
    tempLine = tempFile.readline().replace('\n', '')
    while tempLine != '':
        tempValues = np.array(tempLine.split(','))
        tempArray = [tempValue for tempValue in tempValues]
        resultData.append(tempArray)
        tempLine = tempFile.readline().replace('\n', '')
    tempFile.close()
    return resultNames, resultData
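As a quick sanity check, the same parsing logic can be exercised on an in-memory string (a hypothetical two-instance dataset, not the mushroom data):

```python
# Hypothetical CSV content in the format the loader expects: header + rows.
csv_text = "color,size,class\nred,big,p\ngreen,small,e\n"
lines = csv_text.strip().split("\n")
names = lines[0].split(",")                  # feature names from the header line
data = [ln.split(",") for ln in lines[1:]]   # one string list per instance
print(names)  # ['color', 'size', 'class']
print(data)   # [['red', 'big', 'p'], ['green', 'small', 'e']]
```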
Func2: obtainFeaturesValues(paraDataset) builds the matrix of feature values
def obtainFeaturesValues(paraDataset):
    '''
    Build a matrix of the possible values of every feature in the dataset.
    :param paraDataset: the current dataset
    :return: the resulting matrix (one list of unique values per feature)
    '''
    resultMatrix = []
    for i in range(len(paraDataset[0])):
        featureValues = [example[i] for example in paraDataset]  # obtain all values of every feature
        uniqueValues = set(featureValues)
        currentValues = [tempValue for tempValue in uniqueValues]
        resultMatrix.append(currentValues)
    return resultMatrix
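On a toy dataset the result looks like this (the sketch uses sorted() so the order is deterministic; obtainFeaturesValues itself iterates a set, so its value order may differ between runs):

```python
# Toy dataset in the same row format readNominalData returns;
# the last column is the class label.
dataset = [
    ["sunny", "hot", "no"],
    ["rainy", "mild", "yes"],
    ["sunny", "mild", "yes"],
]
# Same idea as obtainFeaturesValues: collect the unique values of each column.
valuesMatrix = [sorted(set(row[i] for row in dataset)) for i in range(len(dataset[0]))]
print(valuesMatrix)  # [['rainy', 'sunny'], ['hot', 'mild'], ['no', 'yes']]
```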
Func3: calculateClassCounts(paraData, paraValuesMatrix) counts the instances of each class
def calculateClassCounts(paraData, paraValuesMatrix):
    '''
    Count the number of instances of each class.
    :param paraData: the dataset
    :param paraValuesMatrix: the feature-value matrix
    :return: a dict mapping each class label to its count
    '''
    classCount = {}
    tempNumInstances = len(paraData)
    for i in range(tempNumInstances):
        tempClass = paraData[i][-1]
        if tempClass not in classCount.keys():
            classCount[tempClass] = 0
        classCount[tempClass] += 1
    # Note: the original wrapped classCount in np.array(), but calling
    # np.array() on a dict yields a useless 0-d object array, so the
    # counts dict is returned directly.
    return classCount
Func4: calculateClassDistributionLaplacian(paraData, paraValuesMatrix) computes the class priors with Laplacian smoothing
def calculateClassDistributionLaplacian(paraData, paraValuesMatrix):
    '''
    Compute the probability of each class, with Laplacian smoothing.
    :param paraData: the dataset
    :param paraValuesMatrix: the feature-value matrix
    :return: the smoothed probability of each class
    '''
    classCount = {}
    tempNumInstances = len(paraData)
    tempNumClasses = len(paraValuesMatrix[-1])
    for i in range(tempNumInstances):
        tempClass = paraData[i][-1]
        if tempClass not in classCount.keys():
            classCount[tempClass] = 0
        classCount[tempClass] += 1
    resultClassDistribution = []
    for tempValue in paraValuesMatrix[-1]:
        # Laplacian smoothing for the prior: (count + 1) / (m + K)
        resultClassDistribution.append((classCount[tempValue] + 1.0) / (tempNumInstances + tempNumClasses))
    print("tempNumClasses", tempNumClasses)
    return resultClassDistribution
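Worked by hand on a tiny class column: with m = 5 instances and K = 2 classes, the smoothed priors are (count + 1) / (m + K):

```python
labels = ["p", "p", "e", "p", "e"]   # toy class column, m = 5
classes = sorted(set(labels))        # ['e', 'p'], so K = 2
m, K = len(labels), len(classes)
# Laplacian smoothing: 'e' occurs 2 times -> 3/7, 'p' occurs 3 times -> 4/7
priors = [(labels.count(c) + 1.0) / (m + K) for c in classes]
print(priors)  # [0.42857142857142855, 0.5714285714285714]
```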
Func5: calculateConditionalDistributionLaplacian(paraData, paraValuesMatrix, paraMappings) computes the Laplacian-smoothed conditional probabilities
def calculateConditionalDistributionLaplacian(paraData, paraValuesMatrix, paraMappings):
    '''
    Compute the conditional probabilities with Laplacian smoothing.
    :param paraData: the dataset
    :param paraValuesMatrix: the matrix of possible values per feature
    :param paraMappings: per-feature mappings from value to index
    :return: the conditional probability of every feature value given each class
    '''
    tempNumInstances = len(paraData)
    tempNumConditions = len(paraData[0]) - 1
    tempNumClasses = len(paraValuesMatrix[-1])
    # Step 1. Allocate space
    tempCountCubic = []
    resultDistributionsLaplacianCubic = []
    for i in range(tempNumClasses):
        tempMatrix = []
        tempMatrix2 = []
        # Over all conditions
        for j in range(tempNumConditions):
            # Over all values
            tempNumValues = len(paraValuesMatrix[j])
            tempArray = [0.0] * tempNumValues
            tempArray2 = [0.0] * tempNumValues
            tempMatrix.append(tempArray)
            tempMatrix2.append(tempArray2)
        tempCountCubic.append(tempMatrix)
        resultDistributionsLaplacianCubic.append(tempMatrix2)
    # Step 2. Scan the dataset
    for i in range(tempNumInstances):
        tempClass = paraData[i][-1]
        tempIntClass = paraMappings[tempNumConditions][tempClass]  # get the class index
        # Count, for each class, how many instances take each value of each feature
        # (e.g. given class p, x instances have value a, x1 have value b, ...)
        for j in range(tempNumConditions):
            tempValue = paraData[i][j]
            tempIntValue = paraMappings[j][tempValue]  # get the index of this feature value
            tempCountCubic[tempIntClass][j][tempIntValue] += 1
    # Step 3. Compute the real probabilities with Laplacian smoothing
    tempClassCounts = [0] * tempNumClasses
    for i in range(tempNumInstances):
        tempValue = paraData[i][-1]
        tempIntValue = paraMappings[tempNumConditions][tempValue]
        tempClassCounts[tempIntValue] += 1
    for i in range(tempNumClasses):
        for j in range(tempNumConditions):
            # The denominator adds the number of distinct values of feature j.
            # (The original added tempNumClasses here, which mis-smooths whenever
            # a feature has a different number of values than there are classes.)
            tempNumValues = len(paraValuesMatrix[j])
            for k in range(len(tempCountCubic[i][j])):
                resultDistributionsLaplacianCubic[i][j][k] = (tempCountCubic[i][j][k] + 1) / (tempClassCounts[i] + tempNumValues)
    return resultDistributionsLaplacianCubic
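A hand-computed instance of the smoothed conditional probability, using a toy (color, class) list; V is the number of distinct values of the feature:

```python
# Laplace-smoothed conditional P(feature = v | class = c):
# (count(v, c) + 1) / (count(c) + V)
data = [("red", "p"), ("red", "p"), ("green", "p"), ("green", "e")]
values = ["green", "red"]                                      # V = 2 colors
n_p = sum(1 for _, c in data if c == "p")                      # 3 instances of class p
n_red_p = sum(1 for v, c in data if v == "red" and c == "p")   # 2 of them are red
p_red_given_p = (n_red_p + 1) / (n_p + len(values))
print(p_red_given_p)  # (2 + 1) / (3 + 2) = 0.6
```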
Func6: nbClassify(paraTestData, paraValueMatrix, paraClassValues, paraMappings, paraClassDistribution, paraDistributionCubic) classifies the test data
def nbClassify(paraTestData, paraValueMatrix, paraClassValues, paraMappings, paraClassDistribution, paraDistributionCubic):
    '''
    Classify the test data and return the accuracy.
    :param paraTestData: the test set
    :param paraValueMatrix: the feature-value matrix
    :param paraClassValues: the class counts (unused here)
    :param paraMappings: per-feature mappings from value to index
    :param paraClassDistribution: the smoothed class priors
    :param paraDistributionCubic: the smoothed conditional probabilities
    :return: the classification accuracy
    '''
    tempCorrect = 0.0
    tempNumInstances = len(paraTestData)
    tempNumConditions = len(paraTestData[0]) - 1
    tempNumClasses = len(paraValueMatrix[-1])
    for featureVector in paraTestData:
        tempActualLabel = paraMappings[tempNumConditions][featureVector[-1]]
        tempBiggest = -float('inf')
        tempBest = -1
        for i in range(tempNumClasses):
            # Work in log space to avoid underflow when multiplying many probabilities
            tempPro = np.log(paraClassDistribution[i])
            for j in range(tempNumConditions):
                tempValue = featureVector[j]
                tempIntValue = paraMappings[j][tempValue]
                tempPro += np.log(paraDistributionCubic[i][j][tempIntValue])
            if tempBiggest < tempPro:
                tempBiggest = tempPro
                tempBest = i
        if tempBest == tempActualLabel:
            tempCorrect += 1
    return tempCorrect / tempNumInstances
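The STNBTest driver below also calls a calculateMappings helper that is not shown in this post (it is in the GitHub repository). Judging from how paraMappings[j][value] is used above, a minimal reconstruction, my assumption rather than the original code, would map each feature value to its index in the values matrix:

```python
def calculateMappings(paraValuesMatrix):
    # Hypothetical reconstruction: one dict per feature, mapping each
    # possible value to its index in that feature's value list.
    resultMappings = []
    for featureValues in paraValuesMatrix:
        tempMapping = {value: index for index, value in enumerate(featureValues)}
        resultMappings.append(tempMapping)
    return resultMappings

valuesMatrix = [["green", "red"], ["e", "p"]]
mappings = calculateMappings(valuesMatrix)
print(mappings[0]["red"], mappings[1]["e"])  # 1 0
```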
Func7: STNBTest(paraFileName) runs the test
def STNBTest(paraFileName):
    featureNames, dataSet = readNominalData(paraFileName)
    print("Feature Names = ", featureNames)
    valuesMatrix = obtainFeaturesValues(dataSet)
    tempMappings = calculateMappings(valuesMatrix)
    classSumValues = calculateClassCounts(dataSet, valuesMatrix)
    classDistribution = calculateClassDistributionLaplacian(dataSet, valuesMatrix)
    print("classDistribution = ", classDistribution)
    conditionalDistributions = calculateConditionalDistributionLaplacian(dataSet, valuesMatrix, tempMappings)
    tempAccuracy = nbClassify(dataSet, valuesMatrix, classSumValues, tempMappings, classDistribution, conditionalDistributions)
    print("The accuracy of NB classifier is {}".format(tempAccuracy))
Results
Complete code and dataset
Pros and Cons of the Algorithm
Pros:
(1) The algorithm logic is simple and easy to implement.
(2) Classification has low time and space overhead.
Cons:
Naive Bayes assumes that the attributes are mutually independent, an assumption that rarely holds in practice; when there are many attributes, or the attributes are strongly correlated, classification quality suffers.