Simple text classification on Spark (Python)

The dataset is 20_newsgroups, which I split 7:3 into a training set and a test set.
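The post assumes the split has already been done and uploaded to HDFS. A minimal sketch of how the 7:3 per-topic split might be done locally before uploading (the local path, output directories, and random seed are assumptions, not part of the original setup):

import os, random, shutil

srcDir = "20_newsgroups"   # assumed local copy of the dataset
random.seed(42)            # assumed seed, only for reproducibility

for topic in os.listdir(srcDir):
    topicDir = os.path.join(srcDir, topic)
    if not os.path.isdir(topicDir):
        continue
    files = os.listdir(topicDir)
    random.shuffle(files)
    cut = int(len(files) * 0.7)  # 7:3 split point
    for subset, names in [("train", files[:cut]), ("test", files[cut:])]:
        outDir = os.path.join(subset, topic)
        if not os.path.exists(outDir):
            os.makedirs(outDir)
        for name in names:
            shutil.copy(os.path.join(topicDir, name), outDir)

The resulting train/ and test/ directories can then be uploaded with hdfs dfs -put to the paths used below.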

The overall workflow is as follows:

(Figure: basic steps of text classification)
Each document in the dataset is represented as a TF-IDF vector; the training-set TF-IDF vectors are used to train a model, the test-set TF-IDF vectors are used for classification, and finally the test accuracy is computed.


Initialization

# Set the training and test set paths.
trainPath = "hdfs:///user/yy/20_newsgroups/train/*"
testPath = "hdfs:///user/yy/20_newsgroups/test/*"

# For classification, newsgroup topics have to be converted to numbers; labelsDict maps each topic to a number
labelsDict = {'alt.atheism':0, 'comp.graphics':1, 'comp.os.ms-windows.misc':2,\
              'comp.sys.ibm.pc.hardware':3, 'comp.sys.mac.hardware':4, 'comp.windows.x':5,\
              'misc.forsale':6, 'rec.autos':7, 'rec.motorcycles':8, 'rec.sport.baseball':9,\
              'rec.sport.hockey':10, 'sci.crypt':11, 'sci.electronics':12, 'sci.med':13,\
              'sci.space':14, 'soc.religion.christian':15, 'talk.politics.guns':16,\
              'talk.politics.mideast':17, 'talk.politics.misc':18, 'talk.religion.misc':19}

# keyTolabels maps the numbers back to topics, mainly for readability
keyTolabels = {0:'alt.atheism', 1:'comp.graphics', 2:'comp.os.ms-windows.misc',\
              3:'comp.sys.ibm.pc.hardware', 4:'comp.sys.mac.hardware', 5:'comp.windows.x',\
              6:'misc.forsale', 7:'rec.autos', 8:'rec.motorcycles', 9:'rec.sport.baseball',\
              10:'rec.sport.hockey', 11:'sci.crypt', 12:'sci.electronics', 13:'sci.med',\
              14:'sci.space', 15:'soc.religion.christian', 16:'talk.politics.guns',\
              17:'talk.politics.mideast', 18:'talk.politics.misc', 19:'talk.religion.misc'}

Preprocessing function

This function tokenizes the documents, removes stop words, applies stemming, and performs synonym replacement. It requires the third-party NLP library nltk, which of course has to be installed on every node. The basic preprocessing steps are:
(Figure: preprocessing steps)
The synonym replacement here is very crude: it simply takes the first lemma from the word's first synset. This can introduce ambiguity, because a word has different synsets for its different senses, and taking only the first synset restricts the word to its first sense.
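Besides installing the nltk package itself, the stopwords and wordnet corpora used by the function below also have to be available on every node. A minimal sketch, assuming the default nltk download location is readable by the Spark workers:

import nltk
nltk.download('stopwords')  # used via nltk.corpus.stopwords
nltk.download('wordnet')    # used via nltk.corpus.wordnet for synsets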

def tokenlize(doc):
    import nltk, re
    from nltk.corpus import stopwords
    from nltk.corpus import wordnet

    r = re.compile(r'[\w]+') # split tokens on non-alphanumeric characters
    my_stopwords = nltk.corpus.stopwords.words('english')
    porter = nltk.PorterStemmer()

    newdoc = []
    for word in nltk.regexp_tokenize(doc, r): # tokenization
        newWord = porter.stem(word.lower()) # stemming
        if newWord in my_stopwords: # stop-word removal
            continue
        tokenSynsets = wordnet.synsets(newWord)
        newdoc.append(newWord if tokenSynsets == [] else tokenSynsets[0].lemma_names()[0]) # synonym replacement
    return newdoc

Import the training set

trainTokens = sc.wholeTextFiles(trainPath)\
                .map(lambda (fileName, doc): doc)\
                .map(lambda doc: tokenlize(doc))

Build the word-hashing table and the TF-IDF model

Both the training and test sets need to use this hash table. Its dimension is chosen from the number of distinct words, usually a power of 2, so the number of distinct words should be counted during the initial data exploration.
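A rough way to get that count, reusing the trainTokens RDD built in the previous step (numTerms is just an illustrative name):

# count the distinct tokens in the training set
numTerms = trainTokens.flatMap(lambda tokens: tokens).distinct().count()
print numTerms  # then pick the smallest power of 2 above this value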

from pyspark.mllib.feature import HashingTF
hasingTF = HashingTF(2 ** 16)

# Map each training document to a TF vector
trainTf = hasingTF.transform(trainTokens)
trainTf.cache()

# Build the IDF model; both the training and test sets use it
from pyspark.mllib.feature import IDF
idf = IDF().fit(trainTf)

# Convert each training TF vector into a TF-IDF vector
trainTfidf = idf.transform(trainTf)
trainTfidf.cache()

Label the training set

# Label the training set to produce the final usable training set; each sample has to be wrapped in a LabeledPoint
from pyspark.mllib.regression import LabeledPoint
trainLabels = sc.wholeTextFiles(trainPath)\
                .map(lambda (path, doc): path.split('/')[-2])
train = trainLabels.zip(trainTfidf)\
                   .map(lambda (topic, vector): LabeledPoint(labelsDict[topic], vector))
train.cache()

Import the test set

# Import the test set and preprocess it
testTokens = sc.wholeTextFiles(testPath)\
               .map(lambda (fileName, doc): doc)\
               .map(lambda doc: tokenlize(doc))

Convert the test set to TF-IDF vectors

# Map each test document to a TF vector, using the same hash mapping hasingTF as the training set
from pyspark.mllib.feature import HashingTF
testTf = hasingTF.transform(testTokens)

# Convert each test TF vector into a TF-IDF vector, using the same IDF model idf as the training set
from pyspark.mllib.feature import IDF
testTfidf = idf.transform(testTf)

Label the test set

# Label the test set to produce the final test set used for evaluation
from pyspark.mllib.regression import LabeledPoint
testLabels = sc.wholeTextFiles(testPath)\
               .map(lambda (path, doc): path.split('/')[-2])

test = testLabels.zip(testTfidf)\
                 .map(lambda (topic, vector): LabeledPoint(labelsDict[topic], vector))
testCount = test.count()

Train a naive Bayes model and compute its accuracy

from pyspark.mllib.classification import NaiveBayes
model = NaiveBayes.train(train, 0.1)

# Compute the test accuracy
predictionAndLabel = test.map(lambda p: (model.predict(p.features), p.label))
accuracy = 1.0 * predictionAndLabel.filter(lambda x: x[0] == x[1]).count() / testCount
print accuracy
0.803298634582

Train a multinomial logistic regression model and compute its accuracy

from pyspark.mllib.classification import LogisticRegressionWithLBFGS
lrModel = LogisticRegressionWithLBFGS.train(train, iterations=10, numClasses=20)

# Compute the test accuracy
predictionAndLabel = test.map(lambda p: (lrModel.predict(p.features), p.label))
accuracy = 1.0 * predictionAndLabel.filter(lambda x: x[0] == x[1]).count() / testCount
print accuracy
0.812897120454

If you are interested, grab any newsgroup post and run it through the model yourself to get a more intuitive feel for the result.

aTestText = """
Path: cantaloupe.srv.cs.cmu.edu!rochester!udel!bogus.sura.net!howland.reston.ans.net!ira.uka.de!math.fu-berlin.de!cs.tu-berlin.de!ossip
From: [email protected] (Ossip Kaehr)
Newsgroups: comp.sys.mac.hardware
Subject: SE/30 8bit card does not work with 20mb..
Date: 21 Apr 1993 23:22:22 GMT
Organization: Technical University of Berlin, Germany
Lines: 27
Message-ID: <[email protected]>
NNTP-Posting-Host: trillian.cs.tu-berlin.de
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Summary: HELP!
Keywords: SE/30 MODE32 System7 PDS

Hello!

I have a SE/30 and a Generation Systems 8bit PDS card for a 17"
screen.
It worked great until I upgraded from 5 to 20 mb ram.
Now with Sys7.1 and MODE32 or 32enabler it does not boot..

a tech support person said the card does not support these 32bit
fixes.

BUT: when pressing the shift key while booting (when the ext. monitor
goes black after having been grey) the system  SOMETIMES boots properly!!
and then works ok with the 20mb and full graphics.

WHAT's HAPPENING???

Thanks a lot for any advice!!!
please answer by mail.

Ossip Kaehr
[email protected]
voice: +49.30.6226317
-- 
 __   --------------------------------------------------------------   __
/_/\  Ossip Kaehr   Hermannstrasse 32  D-1000 Berlin 44  Germany  /\_\
\_\/  Tel. +49.30.6223910 or 6218814     EMail [email protected]  \/_/
      --------------------------------------------------------------

"""
testTf = hasingTF.transform(tokenlize(aTestText)) # preprocess, then convert to a TF vector
testTfidf = idf.transform(testTf) # convert to a TF-IDF vector
print keyTolabels[lrModel.predict(testTfidf)] # predict and print the result
'comp.sys.mac.hardware'

Summary: converting documents to TF-IDF vectors on Spark

# Build the hash table used to map all words
from pyspark.mllib.feature import HashingTF
hasingTF = HashingTF(2 ** 16) # the dimension needs to exceed the total number of distinct words

# Map the documents to TF vectors; trainTokens here is an RDD
trainTf = hasingTF.transform(trainTokens)
testTf = hasingTF.transform(testTokens)

# Build the IDF model; both the training and test sets use it
from pyspark.mllib.feature import IDF
idf = IDF().fit(trainTf)

# Convert the TF vectors into TF-IDF vectors
trainTfidf = idf.transform(trainTf)
testTfidf = idf.transform(testTf)

Related reading

https://en.wikipedia.org/wiki/Tf%E2%80%93idf
https://en.wikipedia.org/wiki/Natural_Language_Toolkit
