Spark Pipeline: study notes on how it works

Concepts

MLlib standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline, or workflow. This section covers the key concepts introduced by the Pipelines API, where the pipeline concept is mostly inspired by the scikit-learn project.

• DataFrame: This ML API uses DataFrame from Spark SQL as an ML dataset, which can hold a variety of data types. E.g., a DataFrame could have different columns storing text, feature vectors, true labels, and predictions.
• Transformer: A Transformer is an algorithm which can transform one DataFrame into another DataFrame. E.g., an ML model is a Transformer which transforms a DataFrame with features into a DataFrame with predictions.
• Estimator: An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a DataFrame and produces a model.
• Pipeline: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.
• Parameter: All Transformers and Estimators now share a common API for specifying parameters.

DataFrame

Machine learning can be applied to a wide variety of data types, such as vectors, text, images, and structured data. This API adopts the DataFrame from Spark SQL in order to support a variety of data types.
DataFrame supports many basic and structured types; see the Spark SQL datatype reference for a list of supported types. In addition to the types listed in the Spark SQL guide, DataFrame can use ML Vector types.
A DataFrame can be created either implicitly or explicitly from a regular RDD. See the code examples below and the Spark SQL programming guide for examples.
Columns in a DataFrame are named. The code examples below use names such as “text,” “features,” and “label.”
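As a minimal PySpark sketch (the app name and data values are illustrative; the column names are the ones used throughout this section), a DataFrame can be created explicitly from local data, and the same call also accepts a regular RDD of tuples:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pipeline-notes").getOrCreate()

    # Explicit creation from local data; spark.createDataFrame also accepts
    # an RDD of tuples, and rdd.toDF(...) builds one implicitly.
    training = spark.createDataFrame(
        [(0, "a b c d e spark", 1.0), (1, "b d", 0.0)],
        ["id", "text", "label"],
    )
    training.printSchema()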
Pipeline components

Transformers

A Transformer is an abstraction that includes feature transformers and learned models. Technically, a Transformer implements a method transform(), which converts one DataFrame into another, generally by appending one or more columns. For example:
• A feature transformer might take a DataFrame, read a column (e.g., text), map it into a new column (e.g., feature vectors), and output a new DataFrame with the mapped column appended.
• A learning model might take a DataFrame, read the column containing feature vectors, predict the label for each feature vector, and output a new DataFrame with predicted labels appended as a column.
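A minimal sketch of a feature Transformer, reusing the training DataFrame from the sketch above: Tokenizer reads the "text" column and returns a new DataFrame with a "words" column appended.

    from pyspark.ml.feature import Tokenizer

    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    tokenized = tokenizer.transform(training)  # appends "words"; training is unchanged
    tokenized.select("text", "words").show(truncate=False)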

Estimators

An Estimator abstracts the concept of a learning algorithm or any algorithm that fits or trains on data. Technically, an Estimator implements a method fit(), which accepts a DataFrame and produces a Model, which is a Transformer. For example, a learning algorithm such as LogisticRegression is an Estimator, and calling fit() trains a LogisticRegressionModel, which is a Model and hence a Transformer.
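A sketch of the Estimator contract, continuing the running example (the regularization values are illustrative): fit() on a featurized DataFrame returns a LogisticRegressionModel, which is itself a Transformer.

    from pyspark.ml.feature import HashingTF
    from pyspark.ml.classification import LogisticRegression

    hashingTF = HashingTF(inputCol="words", outputCol="features")
    featurized = hashingTF.transform(tokenized)

    lr = LogisticRegression(maxIter=10, regParam=0.001)
    model = lr.fit(featurized)    # Estimator.fit() -> Model (a Transformer)
    model.transform(featurized).select("label", "prediction").show()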

Properties of pipeline components

Transformer.transform()s and Estimator.fit()s are both stateless. In the future, stateful algorithms may be supported via alternative concepts.
Each instance of a Transformer or Estimator has a unique ID, which is useful in specifying parameters (discussed below).
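For instance, each instance's uid can be inspected directly (the hexadecimal suffixes below are examples; they vary per run):

    print(tokenizer.uid)  # e.g. Tokenizer_ac51bf3b7b0f
    print(lr.uid)         # e.g. LogisticRegression_3f19c8e2d441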

Pipeline

In machine learning, it is common to run a sequence of algorithms to process and learn from data. E.g., a simple text document processing workflow might include several stages:
• Split each document’s text into words.
• Convert each document’s words into a numerical feature vector.
• Learn a prediction model using the feature vectors and labels.
MLlib represents such a workflow as a Pipeline, which consists of a sequence of PipelineStages (Transformers and Estimators) to be run in a specific order. We will use this simple workflow as a running example in this section.
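Continuing the running example, the three stages above map onto PipelineStages like this (a sketch; tokenizer, hashingTF, and lr are the instances defined earlier):

    from pyspark.ml import Pipeline

    # Stage order matches the workflow: split text -> hash to vectors -> fit LR.
    pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])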

How it works

A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage. For Transformer stages, the transform() method is called on the DataFrame. For Estimator stages, the fit() method is called to produce a Transformer (which becomes part of the PipelineModel, or fitted Pipeline), and that Transformer’s transform() method is called on the DataFrame.
We illustrate this for the simple text document workflow. The figure below is for the training time usage of a Pipeline.

[Figure: training-time usage of a Pipeline — Tokenizer and HashingTF (Transformers) followed by LogisticRegression (Estimator)]

Above, the top row represents a Pipeline with three stages. The first two (Tokenizer and HashingTF) are Transformers (blue), and the third (LogisticRegression) is an Estimator (red). The bottom row represents data flowing through the pipeline, where cylinders indicate DataFrames. The Pipeline.fit() method is called on the original DataFrame, which has raw text documents and labels. The Tokenizer.transform() method splits the raw text documents into words, adding a new column with words to the DataFrame. The HashingTF.transform() method converts the words column into feature vectors, adding a new column with those vectors to the DataFrame. Now, since LogisticRegression is an Estimator, the Pipeline first calls LogisticRegression.fit() to produce a LogisticRegressionModel. If the Pipeline had more stages, it would call the LogisticRegressionModel’s transform() method on the DataFrame before passing the DataFrame to the next stage.
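In code, the whole training-time flow in the figure collapses to a single call (a sketch using the running example):

    # fit() runs transform() on the Transformer stages and fit() on the
    # Estimator stage, returning a fitted PipelineModel.
    pipelineModel = pipeline.fit(training)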
A Pipeline is an Estimator. Thus, after a Pipeline’s fit() method runs, it produces a PipelineModel, which is a Transformer. This PipelineModel is used at test time; the figure below illustrates this usage.
[Figure: test-time usage of the fitted PipelineModel]
In the figure above, the PipelineModel has the same number of stages as the original Pipeline, but all Estimators in the original Pipeline have become Transformers. When the PipelineModel’s transform() method is called on a test dataset, the data are passed through the fitted pipeline in order. Each stage’s transform() method updates the dataset and passes it to the next stage.
Pipelines and PipelineModels help to ensure that training and test data go through identical feature processing steps.
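A test-time sketch (the documents are illustrative): the fitted PipelineModel applies the same stages, in order, to unseen data; no "label" column is needed for prediction.

    test = spark.createDataFrame(
        [(2, "spark hadoop spark"), (3, "mapreduce mllib")],
        ["id", "text"],
    )
    pipelineModel.transform(test).select("id", "text", "prediction").show()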

DAG

DAG Pipelines: A Pipeline’s stages are specified as an ordered array. The examples given here are all for linear Pipelines, i.e., Pipelines in which each stage uses data produced by the previous stage. It is possible to create non-linear Pipelines as long as the data flow graph forms a Directed Acyclic Graph (DAG). This graph is currently specified implicitly based on the input and output column names of each stage (generally specified as parameters). If the Pipeline forms a DAG, then the stages must be specified in topological order.
Runtime checking: Since Pipelines can operate on DataFrames with varied types, they cannot use compile-time type checking. Pipelines and PipelineModels instead do runtime checking before actually running the Pipeline. This type checking is done using the DataFrame schema, a description of the data types of columns in the DataFrame.
Unique Pipeline stages: A Pipeline’s stages should be unique instances. E.g., the same instance myHashingTF should not be inserted into the Pipeline twice since Pipeline stages must have unique IDs. However, different instances myHashingTF1 and myHashingTF2 (both of type HashingTF) can be put into the same Pipeline since different instances will be created with different IDs.
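A sketch of the uniqueness rule (instance and column names are illustrative): two distinct HashingTF instances may share a Pipeline because each gets its own ID, whereas inserting one instance twice is rejected.

    myHashingTF1 = HashingTF(inputCol="words", outputCol="features1")
    myHashingTF2 = HashingTF(inputCol="words", outputCol="features2")

    # OK: distinct instances, distinct IDs.
    twoHashPipeline = Pipeline(stages=[tokenizer, myHashingTF1, myHashingTF2])
    # Not OK: the same instance twice would mean duplicate stage IDs.
    # badPipeline = Pipeline(stages=[tokenizer, myHashingTF1, myHashingTF1])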

Parameters

MLlib Estimators and Transformers use a uniform API for specifying parameters.
A Param is a named parameter with self-contained documentation. A ParamMap is a set of (parameter, value) pairs.
There are two main ways to pass parameters to an algorithm:
1. Set parameters for an instance. E.g., if lr is an instance of LogisticRegression, one could call lr.setMaxIter(10) to make lr.fit() use at most 10 iterations. This API resembles the API used in the spark.mllib package.
2. Pass a ParamMap to fit() or transform(). Any parameters in the ParamMap will override parameters previously specified via setter methods.
Parameters belong to specific instances of Estimators and Transformers. For example, if we have two LogisticRegression instances lr1 and lr2, then we can build a ParamMap with both maxIter parameters specified: ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20). This is useful if there are two algorithms with the maxIter parameter in a Pipeline.
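A sketch of both styles in PySpark, where a ParamMap is written as a plain dict of {Param: value} (the values are illustrative):

    lr.setMaxIter(10)                       # 1. setter on the instance
    paramMap = {lr.maxIter: 30, lr.regParam: 0.01}
    model2 = lr.fit(featurized, paramMap)   # 2. ParamMap overrides the setter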

Saving and Loading Pipelines

It is often worth saving a model or a pipeline to disk for later use. In Spark 1.6, model import/export functionality was added to the Pipeline API. Most basic transformers are supported, as are some of the more basic ML models. Please refer to the algorithm’s API documentation to see whether saving and loading is supported.
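A save/load sketch for the fitted pipeline from the running example (the path is illustrative; overwrite() clears any previous copy):

    from pyspark.ml import PipelineModel

    pipelineModel.write().overwrite().save("/tmp/spark-pipeline-model")
    sameModel = PipelineModel.load("/tmp/spark-pipeline-model")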
