Using Spark MLlib

Original

汪喵行

2019-10-26 11:01

Spark MLlib

Intro

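MapReduce is a poor fit for machine learning: it reads from and writes to disk on every pass, and that overhead makes the many iterations machine learning needs expensive.

MLlib only contains parallel algorithms that run well on a cluster. Algorithms that cannot be executed in parallel are therefore not included in MLlib.

Packages: spark.mllib is built on RDDs; spark.ml is built on DataFrames.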

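Machine learning pipelines

DataFrame / Transformer / Estimator

1. Define each stage of the pipeline (PipelineStage), which can be a transformer or an estimator;

(The transformers and estimators are arranged in order to build the pipeline.)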

pipeline = Pipeline(stages = [stage1,stage2,stage3])

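The pipeline itself can be regarded as an estimator. Running its fit method produces a PipelineModel, which is a transformer. This pipeline model is what gets applied to the test data.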
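To make the estimator/transformer contract concrete, here is a toy pure-Python sketch. It does not use Spark at all; every class name in it (Lowercase, VocabEstimator, VocabModel, ToyPipeline, ToyPipelineModel) is hypothetical and only mimics the fit/transform behaviour described above.

```python
# Toy illustration of the estimator/transformer contract (NOT the Spark API).
# All class names here are hypothetical.

class Lowercase:
    """A transformer: maps data to data, with nothing to learn."""
    def transform(self, rows):
        return [r.lower() for r in rows]

class VocabEstimator:
    """An estimator: fit() learns something and returns a transformer."""
    def fit(self, rows):
        vocab = sorted({w for r in rows for w in r.split()})
        return VocabModel(vocab)

class VocabModel:
    """The fitted model is itself a transformer."""
    def __init__(self, vocab):
        self.index = {w: i for i, w in enumerate(vocab)}
    def transform(self, rows):
        # Words not seen during fit are dropped in this sketch.
        return [[self.index[w] for w in r.split() if w in self.index]
                for r in rows]

class ToyPipeline:
    """An estimator made of ordered stages; fit() returns a pipeline model."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # estimator: fit it first
                stage = stage.fit(rows)
            fitted.append(stage)
            rows = stage.transform(rows)   # output feeds the next stage
        return ToyPipelineModel(fitted)

class ToyPipelineModel:
    """The fitted pipeline: a transformer that chains the fitted stages."""
    def __init__(self, stages):
        self.stages = stages
    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

toy_model = ToyPipeline([Lowercase(), VocabEstimator()]).fit(
    ["A b Spark", "c d Hadoop"])
toy_result = toy_model.transform(["a SPARK x"])
print(toy_result)
```

The point of the sketch is the shape of the calls: fit is called once on training data and returns an object that only exposes transform, which is then reused on new data.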

from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local').appName('WordCount').getOrCreate()

Example:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF,Tokenizer

# prepare the training data: (id, text, label)
training = spark.createDataFrame([ (0,'a b c d spark',1.0), (1,'c d spark',1.0),(2,'c d hadoop',0.0)],['id','text','label'])

2. Build the stages

token = Tokenizer(inputCol='text', outputCol='words')
# getOutputCol (capital O) returns the tokenizer's output column name
hashingTF = HashingTF(inputCol=token.getOutputCol(), outputCol='features')
lr = LogisticRegression(maxIter=10, regParam=0.001)

3. Arrange the pipeline stages in order and create the pipeline

pipeline = Pipeline(stages=[token, hashingTF, lr])
# the Pipeline itself is an estimator
model = pipeline.fit(training)
# fit returns a PipelineModel (a transformer) that can be reused later

4. Build the test data

# build the test set (no label column)
test = spark.createDataFrame([(4, 'a b c spark'), (5, 'c spark'), (6, 'c hadoop')], ['id', 'text'])

5. Test

prediction = model.transform(test)
selected = prediction.select('id', 'text', 'probability', 'prediction')

TF-IDF feature extraction

Each sentence is treated as one document.

from pyspark.ml.feature import HashingTF, IDF, Tokenizer
# build the DataFrame
sentence = spark.createDataFrame([(0, "I heard about Spark"), (0, "Java is GOOD"), (1, "Logistic regression models are neat")]).toDF('label', 'sentence')

1. Tokenize

token = Tokenizer(inputCol = 'sentence',outputCol = 'words')
wordsData = token.transform(sentence)

2. Use HashingTF's transform method to hash the words into feature vectors

hashingTF = HashingTF(inputCol ='words',outputCol = 'rawFeatures',numFeatures = 2000)
featureData = hashingTF.transform(wordsData)
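What the hashing step does can be sketched in plain Python. This is only an illustration of the hashing trick, not Spark's implementation: the helper name hashing_tf is hypothetical, and zlib.crc32 is a stand-in hash chosen because it is deterministic, so the indices will not match what Spark produces.

```python
import zlib

def hashing_tf(words, num_features):
    """Map a list of words to a sparse {index: count} term-frequency vector."""
    vec = {}
    for w in words:
        # crc32 is a stand-in; Spark uses a different hash function,
        # so these indices will not match Spark's output.
        idx = zlib.crc32(w.encode("utf-8")) % num_features
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec

tf = hashing_tf("i heard about spark spark".split(), num_features=2000)
print(tf)
```

No vocabulary has to be built or stored: every word goes straight to a bucket. The cost is that two different words can collide in the same bucket, which becomes more likely as num_features shrinks.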

3. Use IDF to rescale the weights

idf  = IDF(inputCol = 'rawFeatures',outputCol = 'features')
idfModel = idf.fit(featureData)

4. Apply the fitted IDF model to rescale the features

newdata = idfModel.transform(featureData)
newdata.select('features','label').show(truncate=False)
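The fit/transform split of the IDF step can be sketched in plain Python. The sketch assumes the smoothed formula given in the Spark MLlib documentation, idf(t) = log((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) is the number of documents containing term t. The helper names fit_idf and apply_idf and the toy data are hypothetical.

```python
import math

def fit_idf(docs):
    """docs: list of {term_index: count}. Returns {term_index: idf weight}."""
    m = len(docs)
    df = {}
    for doc in docs:
        for idx in doc:            # a document counts at most once per term
            df[idx] = df.get(idx, 0) + 1
    return {idx: math.log((m + 1) / (n + 1)) for idx, n in df.items()}

def apply_idf(doc, idf):
    """Rescale raw term counts by the learned idf weights."""
    return {idx: count * idf[idx] for idx, count in doc.items()}

# Term 0 occurs in every document; terms 1 and 2 occur in one document each.
docs = [{0: 1.0, 1: 1.0}, {0: 2.0, 2: 1.0}, {0: 1.0}]
idf = fit_idf(docs)
weighted = [apply_idf(d, idf) for d in docs]
print(idf)
print(weighted)
```

A term that appears in every document gets weight log(1) = 0, so it contributes nothing to the final features, while rarer terms keep a positive weight.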

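Feature transformation

StringIndexer:

Labels are indexed by frequency: the most frequent label gets index 0.

Import the required class

from pyspark.ml.feature import StringIndexer
# df is assumed to be a DataFrame with a 'category' column (not defined in this post)
# build the indexer
indexer = StringIndexer(inputCol='category', outputCol='categoryIndex')
# fit learns the label-to-index mapping
model = indexer.fit(df)
# apply the mapping to the data
indexed = model.transform(df)
indexed.show()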
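The frequency-based ordering can be illustrated without Spark. The helper fit_string_indexer and the sample values are hypothetical, and breaking ties alphabetically is an assumption of this sketch.

```python
from collections import Counter

def fit_string_indexer(values):
    """Return {label: index}, with the most frequent label at index 0."""
    counts = Counter(values)
    # Most frequent first; ties broken alphabetically (assumption of this sketch).
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: float(i) for i, label in enumerate(ordered)}

categories = ["a", "b", "c", "a", "a", "c"]
mapping = fit_string_indexer(categories)
indexed_values = [mapping[v] for v in categories]
print(mapping)
print(indexed_values)
```

Here "a" occurs three times, "c" twice and "b" once, so they receive indices 0, 1 and 2 in that order.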

IndexToString

Usually used together with StringIndexer: after training, it converts the indexed values back to the original string labels.
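Conceptually this is a reverse lookup into the list of labels learned during indexing. A minimal pure-Python sketch, with a hypothetical label list and hypothetical predictions:

```python
# Hypothetical labels as an indexer would have ordered them: position == index.
labels = ["a", "c", "b"]

# Hypothetical numeric predictions coming out of a model.
predicted = [0.0, 2.0, 1.0]

# Reverse lookup: index -> original string label.
restored = [labels[int(i)] for i in predicted]
print(restored)
```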

VectorIndexer

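1. Build the indexer, set the input and output columns, and fit the model

from pyspark.ml.feature import VectorIndexer
# df is assumed to be a DataFrame with a vector column named 'features'
indexer = VectorIndexer(inputCol='features', outputCol='indexed', maxCategories=2)
# a feature with more than 2 distinct values is not treated as categorical
indexerModel = indexer.fit(df)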

2. Use the categoryMaps member to get the features that were treated as categorical, along with their value mappings

categoricalFeatures = indexerModel.categoryMaps.keys()

3. Apply the model to the original data

indexed = indexerModel.transform(df)
indexed.show()
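The decision rule behind maxCategories can be sketched in plain Python. The helpers fit_vector_indexer and transform_rows and the sample rows are hypothetical. Assigning category indices in ascending order of value is an assumption of this sketch, and Spark's actual ordering may differ in edge cases.

```python
def fit_vector_indexer(rows, max_categories):
    """Return {feature_index: {value: category_index}} for categorical features."""
    category_maps = {}
    for j in range(len(rows[0])):
        distinct = sorted({row[j] for row in rows})
        # Few distinct values -> treat the feature as categorical.
        if len(distinct) <= max_categories:
            category_maps[j] = {v: i for i, v in enumerate(distinct)}
    return category_maps

def transform_rows(rows, category_maps):
    """Re-encode categorical features; leave continuous ones untouched."""
    return [[category_maps[j][v] if j in category_maps else v
             for j, v in enumerate(row)] for row in rows]

rows = [[2.0, 1.0, 1.0],
        [2.0, 3.0, 1.0],
        [7.0, 5.0, 1.0]]
maps = fit_vector_indexer(rows, max_categories=2)
transformed = transform_rows(rows, maps)
print(sorted(maps.keys()))
print(transformed)
```

Feature 1 has three distinct values, which exceeds max_categories=2, so it is left as a continuous feature; features 0 and 2 are re-encoded as category indices.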
