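2 k-Nearest Neighbors & 3 Decision Trees
4 Classifying with Probability Theory: Naive Bayes
4.5 Classifying Text with Python
4.5.1 Prepare: Making Word Vectors from Text
We treat a text as a vector of words (tokens); that is, each sentence is converted into a vector. Create a new file, bayesCopy.py: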
import numpy as np

# Create a small sample data set
def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]  # 1 is abusive, 0 not
    return postingList, classVec

# Build a list of the unique words that appear across all documents
def createVocabList(dataSet):
    vocabSet = set()
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union of the two sets
    return list(vocabSet)

# Mark which vocabulary words occur in the input and return the document vector
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec
Test results:
from mechineLearning.Ch04 import bayesCopy
listOPosts, listClasses = bayesCopy.loadDataSet()
myVocabList = bayesCopy.createVocabList(listOPosts)
myVocabList
['quit', 'flea', 'my', 'him', 'worthless', 'ate', 'garbage', 'buying', 'dog', 'licks', 'has', 'please', 'steak', 'cute', 'not', 'mr', 'maybe', 'stop', 'dalmation', 'is', 'food', 'park', 'so', 'posting', 'how', 'take', 'problems', 'to', 'stupid', 'love', 'help', 'I']
bayesCopy.setOfWords2Vec(myVocabList, listOPosts[0])
[0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0]
bayesCopy.setOfWords2Vec(myVocabList, listOPosts[3])
[0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0]
4.5.2 Train: Computing Frequencies from Word Vectors
Naive Bayes classifier training function:
# Naive Bayes classifier training function
# (requires "import numpy as np" at the top of bayesCopy.py)
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    # prior probability of class 1 (abusive)
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    # per-word counts for each class
    p0Num = np.zeros(numWords); p1Num = np.zeros(numWords)
    # total word counts for each class
    p0Denom = 0.0; p1Denom = 0.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    print(p1Denom, p0Denom)
    p1Vect = p1Num / p1Denom
    p0Vect = p0Num / p0Denom
    return p0Vect, p1Vect, pAbusive
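The function above computes p(w|c): with the class fixed, it counts how often each word occurs and divides by the total number of words seen in that class.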
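To make the p(w|c) calculation concrete, here is a minimal sketch on a toy matrix. The data, the labels and the helper name word_probs are all made up for illustration and are not part of bayesCopy.py.

```python
import numpy as np

# Toy data (hypothetical): 4 documents over a 3-word vocabulary,
# each row is a set-of-words vector.
train_matrix = np.array([[1, 0, 1],
                         [0, 1, 1],
                         [1, 1, 0],
                         [0, 1, 0]])
train_category = np.array([0, 1, 0, 1])  # hypothetical class labels

def word_probs(matrix, labels, cls):
    """Relative frequency of each word among the documents of one class."""
    rows = matrix[labels == cls]
    return rows.sum(axis=0) / rows.sum()

p_w_given_c0 = word_probs(train_matrix, train_category, 0)
p_w_given_c1 = word_probs(train_matrix, train_category, 1)
print(p_w_given_c0)  # class 0: word counts [2, 1, 1] out of 4 words
print(p_w_given_c1)  # class 1: word counts [0, 2, 1] out of 3 words
```

Note that the first word never occurs in class 1, so its estimated probability there is exactly zero.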
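4.5.3 Test: Modifying the Classifier for Real-World Conditions
Two practical problems need handling. First, if the denominator or any single factor p(w_i|c) is zero, the whole product of probabilities becomes zero. Second, multiplying many small probabilities causes underflow. Both are addressed by changing the corresponding lines in trainNB0():
# Start the counts at nonzero values so that one zero count cannot make the whole probability zero
p0Num = np.ones(numWords); p1Num = np.ones(numWords)
# Note: these starting values (1.0 and 2.0) are what give the "21.0 25.0" printed in the test run;
# starting both denominators at 2.0 would smooth the two classes symmetrically
p0Denom = 1.0; p1Denom = 2.0
# Take logarithms so that the classifier can add log-probabilities instead of multiplying probabilities
p1Vect = np.log(p1Num/p1Denom)
p0Vect = np.log(p0Num/p0Denom)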
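The underflow problem is easy to reproduce. The following sketch uses made-up numbers (1000 factors of 1e-5 each) to show that the direct product collapses to zero while the sum of logarithms stays finite.

```python
import numpy as np

# Hypothetical example: 1000 word probabilities of 1e-5 each.
probs = np.full(1000, 1e-5)

product = np.prod(probs)          # true value is 1e-5000, far below float64 range
log_sum = np.sum(np.log(probs))   # the same quantity in log space

print(product)   # 0.0 (underflow)
print(log_sum)   # a large negative but finite number
```

Because the logarithm is monotonic, comparing the log-scores of the two classes gives the same decision as comparing the products would.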
Naive Bayes classification function:
# Naive Bayes classification function
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # p0Vec and p1Vec hold log-probabilities, so the sums below are log-likelihoods
    p1 = sum(vec2Classify * p1Vec) + np.log(pClass1)
    # the prior of class 0 is 1 - pClass1
    p0 = sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

def testingNB():
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(np.array(trainMat), np.array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))
    testEntry = ['my', 'stop']
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))
Test results:
bayesCopy.testingNB()
21.0 25.0
['love', 'my', 'dalmation'] classified as: 0
['stupid', 'garbage'] classified as: 1
['my', 'stop'] classified as: 0
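The decision rule compares log p(c) plus the sum of log p(w|c) over the words present, once per class. The sketch below is a self-contained illustration with invented probabilities; classify_log_nb is a hypothetical name. It also shows that the class prior can decide the outcome when the word evidence is balanced.

```python
import numpy as np

def classify_log_nb(vec, log_p0, log_p1, p_class1):
    """Return 1 if class 1 has the higher log-posterior score, else 0."""
    score1 = np.sum(vec * log_p1) + np.log(p_class1)
    score0 = np.sum(vec * log_p0) + np.log(1.0 - p_class1)
    return 1 if score1 > score0 else 0

# Hypothetical per-word probabilities for a 3-word vocabulary.
log_p0 = np.log(np.array([0.6, 0.3, 0.1]))
log_p1 = np.log(np.array([0.1, 0.3, 0.6]))

print(classify_log_nb(np.array([1, 0, 0]), log_p0, log_p1, 0.5))  # word typical of class 0
print(classify_log_nb(np.array([0, 0, 1]), log_p0, log_p1, 0.5))  # word typical of class 1
# The middle word is equally likely in both classes, so only the prior matters:
print(classify_log_nb(np.array([0, 1, 0]), log_p0, log_p1, 0.8))
print(classify_log_nb(np.array([0, 1, 0]), log_p0, log_p1, 0.2))
```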
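4.5.4 Prepare: The Bag-of-Words Document Model
So far each feature records only whether a word occurs in the document; this is the set-of-words model.
If a word appears more than once in a document, that may carry information that mere presence or absence cannot express. Counting the occurrences instead gives the bag-of-words model.
To support it, the earlier function setOfWords2Vec() needs a small modification:
# Bag-of-words model
def bigOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            # count every occurrence instead of just setting the entry to 1
            returnVec[vocabList.index(word)] += 1
    return returnVec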
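The difference between the two models only shows up when a word repeats. A small sketch with a made-up vocabulary and document, using collections.Counter rather than the functions from bayesCopy.py:

```python
from collections import Counter

vocab = ['dog', 'stupid', 'my']        # hypothetical vocabulary
doc = ['stupid', 'dog', 'stupid']      # 'stupid' appears twice

# Set-of-words: 1 if the word occurs at all, else 0.
set_vec = [1 if word in doc else 0 for word in vocab]

# Bag-of-words: number of times each vocabulary word occurs.
counts = Counter(doc)
bag_vec = [counts[word] for word in vocab]

print(set_vec)  # [1, 1, 0]
print(bag_vec)  # [1, 2, 0]
```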
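4.6 Example: Filtering Spam Email with Naive Bayes
4.6.1 Prepare: Tokenizing Text
# Parse a string into a list of lowercase tokens
def textParse(bigString):
    import re
    # split on runs of non-word characters; r'\W*' can match the empty string,
    # which on Python 3.7+ splits the text between every character
    listOfTokens = re.split(r'\W+', bigString)
    # drop short tokens (two characters or fewer) and empty strings
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]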
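A quick check of this kind of tokenizer on a sample sentence. This is a standalone sketch: text_parse is a hypothetical copy that splits on r'\W+' (one or more non-word characters), and the sentence is made up.

```python
import re

def text_parse(big_string):
    """Split on runs of non-word characters, lowercase, keep tokens longer than 2."""
    tokens = re.split(r'\W+', big_string)
    return [tok.lower() for tok in tokens if len(tok) > 2]

sample = 'This book is the best book on Python or M.L. I have ever laid eyes upon.'
tokens = text_parse(sample)
print(tokens)
# ['this', 'book', 'the', 'best', 'book', 'python', 'have', 'ever', 'laid', 'eyes', 'upon']
```

The length filter removes short words such as "is" and "on" as well as the empty string left after the final full stop.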
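4.6.2 Test: Cross-Validation with Naive Bayes
Notes:
- The source file Ch04/email/ham/24.txt contains characters that the system default encoding cannot decode, so the program fails with: UnicodeDecodeError: 'gbk' codec can't decode byte 0xbd in position 198: illegal multibyte sequence. Opening the file in a text editor (PyCharm, for example) and deleting the offending characters resolves it.
- Handling the error TypeError: 'range' object doesn't support item deletion. Cause: in Python 3.x, range() returns a range object rather than a list, and items cannot be deleted from a range object. Changing the line to trainingSet = list(range(50)) fixes it.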
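As an alternative to editing the file by hand, the file can be opened with an explicit encoding and an error policy. The sketch below is only an illustration under assumptions: it writes a throwaway temporary file containing a stray 0xbd byte (the byte from the error message) rather than touching the real email files, and it assumes UTF-8 is an acceptable encoding for them. The second half shows the range fix.

```python
import os
import tempfile

# Create a temporary file with one byte that is not valid UTF-8.
raw = b'price \xbd off today'
with tempfile.NamedTemporaryFile(delete=False, suffix='.txt') as tmp:
    tmp.write(raw)
    path = tmp.name

# errors='ignore' silently drops bytes that cannot be decoded.
with open(path, encoding='utf-8', errors='ignore') as f:
    text = f.read()
os.remove(path)
print(repr(text))

# A range object does not support item deletion; a list does.
training_set = list(range(50))
del training_set[3]
print(len(training_set))
```

Dropping bytes loses a little information, so this is a convenience for experiments rather than a general recommendation.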
# Complete spam test function
def spamTest():
    docList = []; classList = []; fullTest = []
    for i in range(1, 26):
        # ham (legitimate) emails are labelled 0
        wordList = textParse(open(r'mechineLearning/Ch04/email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullTest.extend(wordList)
        classList.append(0)
        # spam emails are labelled 1, so that pSpam below really is the spam prior
        wordList = textParse(open(r'mechineLearning/Ch04/email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullTest.extend(wordList)
        classList.append(1)
    vocabList = createVocabList(docList)
    # hold out 10 randomly chosen emails for testing
    trainingSet = list(range(50)); testSet = []
    for i in range(10):
        randIndex = int(np.random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del trainingSet[randIndex]
    trainMat = []; trainClasses = []
    for docIndex in trainingSet:
        trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(np.array(trainMat), np.array(trainClasses))
    errorCount = 0
    for docIndex in testSet:
        wordVector = setOfWords2Vec(vocabList, docList[docIndex])
        if classifyNB(np.array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
            print("classification error", docList[docIndex])
    print('the error rate is: ', float(errorCount) / len(testSet))
Test results:
Because the ten test emails are chosen at random, the results differ from run to run.
bayesCopy.spamTest()
classification error ['benoit', 'mandelbrot', '1924', '2010', 'benoit', 'mandelbrot', '1924', '2010', 'wilmott', 'team', 'benoit', 'mandelbrot', 'the', 'mathematician', 'the', 'father', 'fractal', 'mathematics', 'and', 'advocate', 'more', 'sophisticated', 'modelling', 'quantitative', 'finance', 'died', '14th', 'october', '2010', 'aged', 'wilmott', 'magazine', 'has', 'often', 'featured', 'mandelbrot', 'his', 'ideas', 'and', 'the', 'work', 'others', 'inspired', 'his', 'fundamental', 'insights', 'you', 'must', 'logged', 'view', 'these', 'articles', 'from', 'past', 'issues', 'wilmott', 'magazine']
the error rate is: 0.1
bayesCopy.spamTest()
the error rate is: 0.0
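The random hold-out step can also be written with the standard library's random.sample. This is a sketch only; holdout_split and its seed parameter are hypothetical and not part of bayesCopy.py. A fixed seed makes a run repeatable, which helps when comparing error rates.

```python
import random

def holdout_split(n_docs, n_test, seed=None):
    """Randomly pick n_test document indices for testing; the rest are for training."""
    rng = random.Random(seed)
    test = rng.sample(range(n_docs), n_test)   # sampling without replacement
    test_lookup = set(test)
    train = [i for i in range(n_docs) if i not in test_lookup]
    return train, test

train, test = holdout_split(50, 10, seed=0)
print(len(train), len(test))  # 40 10
```

Averaging the error rate over several such splits gives a steadier estimate than a single run.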
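4.7 Using a Naive Bayes Classifier to Reveal Regional Attitudes from Ads
5 Logistic Regression
5.2 Using Optimization to Find the Best Regression Coefficients
5.2.1 Gradient Ascent
The main idea of gradient ascent: to find the maximum of a function, the best way is to move along the direction of the function's gradient.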
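As a minimal illustration of the idea, here is gradient ascent on a one-dimensional function chosen only for this example, f(x) = -(x - 3)^2 + 5, whose maximum is at x = 3. The step size and iteration count are arbitrary assumptions.

```python
def f_gradient(x):
    """Derivative of f(x) = -(x - 3)**2 + 5."""
    return -2.0 * (x - 3.0)

x = 0.0          # starting point
alpha = 0.1      # step size
for _ in range(200):
    # move a small step in the direction in which f increases
    x = x + alpha * f_gradient(x)

print(x)  # very close to 3.0, the maximiser of f
```

Gradient descent is the same procedure with the sign of the step reversed, used when the goal is a minimum.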
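5.2.2 Train: Using Gradient Ascent to Find the Best Parameters
Gradient ascent optimization for Logistic regression:
import numpy as np

# Read the data from the text file
def loadDataSet():
    dataMat = []; labelMat = []
    fr = open('mechineLearning/Ch05/testSet.txt')
    for line in fr.readlines():
        lineArr = line.strip().split()
        # prepend 1.0 so that weights[0] acts as the intercept
        dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
        labelMat.append(int(lineArr[2]))
    fr.close()
    return dataMat, labelMat

def sigmoid(inX):
    return 1.0 / (1 + np.exp(-inX))

# Find the best parameters
def gradAscent(dataMatIn, classLabels):
    # np.asmatrix is used because the np.mat alias is not available in NumPy 2.0 and later
    dataMatrix = np.asmatrix(dataMatIn)
    labelsMat = np.asmatrix(classLabels).transpose()
    m, n = np.shape(dataMatrix)
    # alpha is the step size: how far to move on each update
    alpha = 0.001
    maxCycles = 500
    # parameter vector
    weights = np.ones((n, 1))
    for k in range(maxCycles):
        h = sigmoid(dataMatrix * weights)
        error = (labelsMat - h)
        # dataMatrix.transpose() * error is the gradient of the log-likelihood
        # (not of a squared loss), so this is one gradient-ascent step
        weights = weights + alpha * dataMatrix.transpose() * error
    return weights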
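Where the update weights = weights + alpha * dataMatrix.transpose() * error comes from:
It is one gradient-ascent step on the log-likelihood of the logistic model. The gradient of that log-likelihood with respect to weights is dataMatrix.transpose() * (labelsMat - h), which is exactly dataMatrix.transpose() * error.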
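The derivation can be sketched as follows, assuming the usual Bernoulli log-likelihood for Logistic regression. The symbols X, y, h, w and \alpha stand for dataMatrix, labelsMat, h, weights and alpha in the code, and the third line uses the identity \sigma'(z) = \sigma(z)(1 - \sigma(z)).

```latex
\begin{aligned}
h_i &= \sigma(w^{\top} x_i) = \frac{1}{1 + e^{-w^{\top} x_i}}, \\
\ell(w) &= \sum_{i=1}^{m} \bigl[\, y_i \log h_i + (1 - y_i) \log (1 - h_i) \,\bigr], \\
\frac{\partial \ell}{\partial w} &= \sum_{i=1}^{m} (y_i - h_i)\, x_i = X^{\top} (y - h), \\
w &\leftarrow w + \alpha\, X^{\top} (y - h).
\end{aligned}
```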
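In optimization, the objective that gradient descent (or ascent) works on is something we choose ourselves. Any objective is acceptable as long as its optimum corresponds to the parameters we are looking for.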
Test results, run here in the PyCharm interactive console:
from mechineLearning.Ch05 import logResgresCopy
dataArr, labelMat = logResgresCopy.loadDataSet()
logResgresCopy.gradAscent(dataArr, labelMat)
matrix([[ 4.12414349],
[ 0.48007329],
[-0.6168482 ]])
5.2.3 Analyze: Plotting the Decision Boundary
Function that plots the data set together with the best-fit line from Logistic regression:
# Plot the decision boundary
def plotBestFit(weights):
    import matplotlib.pyplot as plt
    dataMat, labelMat = loadDataSet()
    dataArr = np.array(dataMat)
    n = np.shape(dataArr)[0]
    xcord1 = []; ycord1 = []   # points with label 1
    xcord2 = []; ycord2 = []   # points with label 0
    for i in range(n):
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i, 1])
            ycord1.append(dataArr[i, 2])
        else:
            xcord2.append(dataArr[i, 1])
            ycord2.append(dataArr[i, 2])
    fig = plt.figure(num=1)
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')
    ax.scatter(xcord2, ycord2, s=30, c='green')
    x = np.arange(-3.0, 3.0, 0.1)
    weights = np.array(weights)
    # the boundary is where 0 = w0 + w1*x1 + w2*x2; solve for x2
    y = (-weights[0] - weights[1] * x) / weights[2]
    ax.plot(x, y)
    plt.xlabel('X1'); plt.ylabel('X2')
    plt.show()
Test results:
weights = logResgresCopy.gradAscent(dataArr, labelMat)
logResgresCopy.plotBestFit(weights.getA())
**Note:** getA() is used here to convert the matrix to an array. Otherwise x and y end up with different dimensions and the line cannot be plotted.
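The shape mismatch can be seen without plotting anything. In this sketch the weights are made-up values stored in a NumPy matrix, as gradAscent() returns them. Indexing a matrix always yields another 2-D matrix, so the boundary expression comes out as a 1 x N row instead of a length-N vector.

```python
import numpy as np

# Hypothetical weights in the same 3 x 1 matrix form that gradAscent() returns.
w_mat = np.asmatrix([[4.0], [0.5], [-0.6]])
x = np.arange(-3.0, 3.0, 0.1)

# With a matrix, every intermediate result stays two-dimensional.
y_from_matrix = (-w_mat[0] - w_mat[1] * x) / w_mat[2]

# After getA() the weights are a plain ndarray and y has the same shape as x.
w_arr = w_mat.getA()
y_from_array = (-w_arr[0] - w_arr[1] * x) / w_arr[2]

print(x.shape, y_from_matrix.shape, y_from_array.shape)
```

Both versions hold the same numbers; only the array version has the one-dimensional shape that matches x.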
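5.2.4 Train: Stochastic Gradient Ascent
Gradient ascent has to go through the entire data set every time it updates the regression coefficients, so with many samples and features its computational cost becomes too high.
An improvement is to update the regression coefficients using only one sample point at a time; this is called stochastic gradient ascent.
# Stochastic gradient ascent (dataMatrix must be a NumPy array)
def stocGradAscent0(dataMatrix, classLabels):
    m, n = np.shape(dataMatrix)
    alpha = 0.01
    weights = np.ones(n)
    for i in range(m):
        # h and error are scalars here: only sample i is used for this update
        h = sigmoid(sum(dataMatrix[i] * weights))
        error = classLabels[i] - h
        weights = weights + alpha * error * dataMatrix[i]
    return weights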
Improved stochastic gradient ascent:
# Improved stochastic gradient ascent
def stocGradAscent1(dataMatrix, classLabels, numIter=150):
    m, n = np.shape(dataMatrix)
    weights = np.ones(n)  # initialize to all ones
    for j in range(numIter):
        dataIndex = list(range(m))
        for i in range(m):
            # alpha decreases with iteration but does not go to 0 because of the constant
            alpha = 4 / (1.0 + j + i) + 0.0001
            # pick a random position among the samples not yet used in this pass
            randIndex = int(np.random.uniform(0, len(dataIndex)))
            # map that position back to the actual sample index
            sampleIndex = dataIndex[randIndex]
            h = sigmoid(sum(dataMatrix[sampleIndex] * weights))
            error = classLabels[sampleIndex] - h
            weights = weights + alpha * error * dataMatrix[sampleIndex]
            del dataIndex[randIndex]
    return weights
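The same idea, visiting every sample once per pass in random order with a decaying step size, can be sketched with a random permutation. Everything here is an assumption made for illustration: the function name sgd_ascent_permuted, the seeds, and the synthetic two-cluster data set, which replaces testSet.txt so that the block runs on its own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_ascent_permuted(data, labels, num_iter=50, seed=0):
    """Stochastic gradient ascent; each pass visits all samples once in random order."""
    rng = np.random.default_rng(seed)
    m, n = data.shape
    weights = np.ones(n)
    for j in range(num_iter):
        for i, idx in enumerate(rng.permutation(m)):
            alpha = 4 / (1.0 + j + i) + 0.0001   # decaying step size
            h = sigmoid(np.dot(data[idx], weights))
            weights = weights + alpha * (labels[idx] - h) * data[idx]
    return weights

# Synthetic, well-separated data: two Gaussian clusters (hypothetical).
rng = np.random.default_rng(1)
x_pos = rng.normal(loc=2.0, scale=0.5, size=(50, 2))
x_neg = rng.normal(loc=-2.0, scale=0.5, size=(50, 2))
data = np.hstack([np.ones((100, 1)), np.vstack([x_pos, x_neg])])  # leading 1.0 = intercept
labels = np.array([1.0] * 50 + [0.0] * 50)

weights = sgd_ascent_permuted(data, labels)
predictions = (sigmoid(data @ weights) > 0.5).astype(float)
accuracy = (predictions == labels).mean()
print(weights, accuracy)
```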
5.3 Example: Predicting Horse Fatalities from Colic
5.3.2 Test: Classifying with Logistic Regression
Logistic regression classification functions:
# Classify with Logistic regression
def classifyVector(inX, weights):
    prob = sigmoid(sum(inX * weights))
    if prob > 0.5:
        return 1.0
    else:
        return 0.0

def colicTest():
    frTrain = open('mechineLearning/Ch05/horseColicTraining.txt')
    frTest = open('mechineLearning/Ch05/horseColicTest.txt')
    trainingSet = []
    trainingLabels = []
    for line in frTrain.readlines():
        currLine = line.strip().split('\t')
        # columns 0-20 are the features, column 21 is the class label
        lineArr = []
        for i in range(21):
            lineArr.append(float(currLine[i]))
        trainingSet.append(lineArr)
        trainingLabels.append(float(currLine[21]))
    trainWeights = stocGradAscent1(np.array(trainingSet), trainingLabels, 1000)
    errorCount = 0
    numTestVec = 0.0
    for line in frTest.readlines():
        numTestVec += 1.0
        currLine = line.strip().split('\t')
        lineArr = []
        for i in range(21):
            lineArr.append(float(currLine[i]))
        if int(classifyVector(np.array(lineArr), trainWeights)) != int(currLine[21]):
            errorCount += 1
    frTrain.close(); frTest.close()
    errorRate = (float(errorCount) / numTestVec)
    print("the error rate of this test is: %f" % errorRate)
    return errorRate

# Run colicTest() several times and report the average error rate
def multiTest():
    numTests = 10
    errorSum = 0.0
    for k in range(numTests):
        errorSum += colicTest()
    print("after %d iterations the average error rate is: %f"
          % (numTests, errorSum / float(numTests)))
Test results:
logResgresCopy.multiTest()
E:\Users\Administrator\PycharmProjects\TestCases\mechineLearning\Ch05\logResgresCopy.py:17: RuntimeWarning: overflow encountered in exp
return 1.0/(1+np.exp(-inX))
the error rate of this test is: 0.328358
the error rate of this test is: 0.373134
the error rate of this test is: 0.328358
the error rate of this test is: 0.432836
the error rate of this test is: 0.298507
the error rate of this test is: 0.283582
the error rate of this test is: 0.328358
the error rate of this test is: 0.373134
the error rate of this test is: 0.402985
the error rate of this test is: 0.283582
after 10 iterations the average error rate is: 0.343284
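The RuntimeWarning in the output comes from np.exp(-inX) overflowing when inX is a large negative number. One way to avoid it, shown here only as a sketch (stable_sigmoid is a hypothetical name, not part of logResgresCopy.py), is to compute the sigmoid through np.logaddexp, which evaluates log(1 + exp(-z)) without forming the huge intermediate value.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid computed as exp(-log(1 + exp(-z))), which avoids overflow in exp."""
    return np.exp(-np.logaddexp(0.0, -np.asarray(z, dtype=float)))

# over='raise' turns any overflow into an exception, so this line proves none occurs.
with np.errstate(over='raise'):
    values = stable_sigmoid(np.array([-1000.0, 0.0, 1000.0]))

print(values)  # [0.  0.5 1. ]
```

If SciPy is available, scipy.special.expit provides the same function.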