ML Pipelines

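Preface: This section introduces the concept of ML Pipelines. ML Pipelines provide a uniform set of high-level APIs, built on top of DataFrames, that help users create and tune practical machine learning pipelines.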

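Main concepts in Pipelines

MLlib standardizes the APIs for machine learning algorithms, which makes it easier to combine multiple algorithms into a single pipeline, or workflow. The pipeline concept is mostly inspired by the scikit-learn project. This section covers the key concepts introduced by the Pipelines API.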

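DataFrame: This ML API uses the DataFrame from Spark SQL as an ML dataset, which can hold a variety of data types. For example, a DataFrame could have different columns storing text, feature vectors, true labels, and predictions.

Transformer: A Transformer is an algorithm that can transform one DataFrame into another DataFrame. For example, an ML model is a Transformer that transforms a DataFrame with features into a DataFrame with predictions.

Estimator: An Estimator is an algorithm that can be fit on a DataFrame to produce a Transformer. For example, a learning algorithm is an Estimator that trains on a DataFrame and produces a model.

Pipeline: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.

Parameter: All Transformers and Estimators share a common API for specifying parameters.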

DataFrame

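Machine learning can be applied to a wide variety of data types, such as vectors, text, images, and structured data. This API adopts the DataFrame from Spark SQL in order to support that variety.
A DataFrame supports many basic and structured types, and can be created either implicitly or explicitly from a regular RDD. Columns in a DataFrame are named; the code examples in this article use names such as "text", "features", and "label".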

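Pipeline components

Transformers
A Transformer is an abstraction that includes both feature transformers and learned models. Technically, a Transformer implements a transform() method, which converts one DataFrame into another, generally by appending one or more columns.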

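Estimators
An Estimator abstracts the concept of a learning algorithm, or of any algorithm that fits or trains on data. Technically, an Estimator implements a fit() method, which accepts a DataFrame and produces a Model, which is a Transformer. For example, a learning algorithm such as LogisticRegression is an Estimator, and calling fit() trains a LogisticRegressionModel, which is a Model and hence a Transformer.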

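Properties of pipeline components
Transformer.transform() and Estimator.fit() are both stateless. In the future, stateful algorithms may be supported via alternative concepts.
Each instance of a Transformer or Estimator has a unique ID, which is useful in specifying parameters (discussed below).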

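Pipeline
In machine learning, it is common to run a sequence of algorithms to process and learn from data. For example, a simple text document processing workflow might include the following stages:
1. Split each document's text into words.
2. Convert each document's words into a numerical feature vector.
3. Learn a prediction model using the feature vectors and labels.
MLlib represents such a workflow as a Pipeline, which consists of a sequence of PipelineStages (Transformers and Estimators) to be run in a specific order. This simple workflow serves as the running example for the rest of this section.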

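How it works

A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. The stages are run in order, and the input DataFrame is transformed as it passes through each stage. For Transformer stages, the transform() method is called on the DataFrame. For Estimator stages, the fit() method is called to produce a Transformer (which becomes part of the PipelineModel, or fitted Pipeline), and that Transformer's transform() method is then called on the DataFrame.
The figure below illustrates this for the training-time usage of a Pipeline.

In the figure above, the top row represents a Pipeline with three stages. The first two (Tokenizer and HashingTF) are Transformers (blue), and the third (LogisticRegression) is an Estimator (red). The bottom row represents data flowing through the pipeline, where cylinders indicate DataFrames. Pipeline.fit() is called on the original DataFrame, which contains raw text documents and labels. Tokenizer.transform() splits the raw text documents into words and adds a new column with those words to the DataFrame. HashingTF.transform() converts the words column into feature vectors and adds a new column with those vectors to the DataFrame. Now, since LogisticRegression is an Estimator, the Pipeline first calls LogisticRegression.fit() to produce a LogisticRegressionModel. If the Pipeline had more stages, it would call the LogisticRegressionModel's transform() method on the DataFrame before passing the DataFrame to the next stage. Pipelines and PipelineModels help to ensure that training and test data go through identical feature processing steps.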
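
The fitting logic described above can be sketched in plain Python. This is a toy model of the control flow only, not Spark's implementation: rows are plain dictionaries instead of a DataFrame, and every class and function name below (WhitespaceTokenizer, MeanLengthClassifier, MeanLengthModel, fit_pipeline, transform_pipeline) is hypothetical.

```python
class Transformer:
    """A stage that maps rows to rows, usually by adding a key (column)."""

    def transform(self, rows):
        raise NotImplementedError


class Estimator:
    """A stage that learns from rows and returns a Transformer."""

    def fit(self, rows):
        raise NotImplementedError


class WhitespaceTokenizer(Transformer):
    def transform(self, rows):
        return [{**row, "words": row["text"].split()} for row in rows]


class MeanLengthClassifier(Estimator):
    """Toy learner: remembers the mean document length (it ignores labels)."""

    def fit(self, rows):
        mean_length = sum(len(row["words"]) for row in rows) / len(rows)
        return MeanLengthModel(mean_length)


class MeanLengthModel(Transformer):
    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        return [
            {**row, "prediction": float(len(row["words"]) > self.threshold)}
            for row in rows
        ]


def fit_pipeline(stages, rows):
    """Run the stages in order and return the fitted Transformers."""
    fitted = []
    for index, stage in enumerate(stages):
        if isinstance(stage, Estimator):
            stage = stage.fit(rows)  # an Estimator is replaced by its model
        fitted.append(stage)
        if index < len(stages) - 1:
            rows = stage.transform(rows)  # feed the next stage
    return fitted


def transform_pipeline(fitted_stages, rows):
    """Apply every fitted stage in order, as a fitted pipeline would."""
    for stage in fitted_stages:
        rows = stage.transform(rows)
    return rows


training = [
    {"text": "a b c d e spark", "label": 1.0},
    {"text": "b d", "label": 0.0},
    {"text": "spark f g h", "label": 1.0},
    {"text": "hadoop mapreduce", "label": 0.0},
]
test = [{"text": "spark i j k"}, {"text": "l m n"}]

fitted = fit_pipeline([WhitespaceTokenizer(), MeanLengthClassifier()], training)
predictions = [row["prediction"] for row in transform_pipeline(fitted, test)]
print(predictions)  # [1.0, 0.0]
```

Because the test rows pass through the same fitted stages that processed the training rows, both datasets receive identical feature processing, which is the guarantee the text attributes to Pipelines and PipelineModels.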

Example: Estimator, Transformer, and Param

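from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression
from pyspark.sql import SparkSession

# Obtain a SparkSession (reuses the active one if it already exists).
spark = SparkSession.builder.getOrCreate()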

# Prepare training data from a list of (label, features) tuples.
training = spark.createDataFrame([
    (1.0, Vectors.dense([0.0, 1.1, 0.1])),
    (0.0, Vectors.dense([2.0, 1.0, -1.0])),
    (0.0, Vectors.dense([2.0, 1.3, 1.0])),
    (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"])

# Create a LogisticRegression instance. This instance is an Estimator.
lr = LogisticRegression(maxIter=10, regParam=0.01)
# Print out the parameters, documentation, and any default values.
print("LogisticRegression parameters:\n" + lr.explainParams() + "\n")

# Learn a LogisticRegression model. This uses the parameters stored in lr.
model1 = lr.fit(training)

# Since model1 is a Model (i.e., a transformer produced by an Estimator),
# we can view the parameters it used during fit().
# This prints the parameter (name: value) pairs, where names are unique IDs for this
# LogisticRegression instance.
print("Model 1 was fit using parameters: ")
print(model1.extractParamMap())

# We may alternatively specify parameters using a Python dictionary as a paramMap
paramMap = {lr.maxIter: 20}
paramMap[lr.maxIter] = 30  # Specify 1 Param, overwriting the original maxIter.
paramMap.update({lr.regParam: 0.1, lr.threshold: 0.55})  # Specify multiple Params.

# You can combine paramMaps, which are python dictionaries.
paramMap2 = {lr.probabilityCol: "myProbability"}  # Change output column name
paramMapCombined = paramMap.copy()
paramMapCombined.update(paramMap2)

# Now learn a new model using the paramMapCombined parameters.
# paramMapCombined overrides all parameters set earlier via lr.set* methods.
model2 = lr.fit(training, paramMapCombined)
print("Model 2 was fit using parameters: ")
print(model2.extractParamMap())

# Prepare test data
test = spark.createDataFrame([
    (1.0, Vectors.dense([-1.0, 1.5, 1.3])),
    (0.0, Vectors.dense([3.0, 2.0, -0.1])),
    (1.0, Vectors.dense([0.0, 2.2, -1.5]))], ["label", "features"])

# Make predictions on test data using the Transformer.transform() method.
# LogisticRegression.transform will only use the 'features' column.
# Note that model2.transform() outputs a "myProbability" column instead of the usual
# 'probability' column since we renamed the lr.probabilityCol parameter previously.
prediction = model2.transform(test)
selected = prediction.select("features", "label", "myProbability", "prediction")
for row in selected.collect():
    print(row)

Example: Pipeline

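from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql import SparkSession

# Obtain a SparkSession (reuses the active one if it already exists).
spark = SparkSession.builder.getOrCreate()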

# Prepare training documents from a list of (id, text, label) tuples.
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0)], ["id", "text", "label"])

# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# Fit the pipeline to training documents.
model = pipeline.fit(training)

# Prepare test documents, which are unlabeled (id, text) tuples.
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "mapreduce spark"),
    (7, "apache hadoop")], ["id", "text"])

# Make predictions on test documents and print columns of interest.
prediction = model.transform(test)
selected = prediction.select("id", "text", "prediction")
for row in selected.collect():
    print(row)
