Learning the Transformer-Related Classes in PySpark's ML Library

import findspark
findspark.init()

from pyspark import SparkContext
sc = SparkContext.getOrCreate()

from pyspark.sql.session import SparkSession
spark = SparkSession.builder.appName('EXAMPLE').getOrCreate()
from pyspark.ml.feature import *

1 Binarizer

Binarizer is the binarization transformer provided by ML. Binarization involves the parameters inputCol (input column), outputCol (output column), and threshold. Input feature values greater than the threshold are mapped to 1.0, and values less than or equal to the threshold are mapped to 0.0. Binarizer supports both Vector and Double column types.

df = spark.createDataFrame(((0.5,), (2.1,))).toDF('values')
df.show()

binarizer = Binarizer(threshold=1.0, inputCol="values", outputCol="features")
dff = binarizer.transform(df)
dff.show()
dff.select('features').show()
+------+
|values|
+------+
|   0.5|
|   2.1|
+------+

+------+--------+
|values|features|
+------+--------+
|   0.5|     0.0|
|   2.1|     1.0|
+------+--------+

+--------+
|features|
+--------+
|     0.0|
|     1.0|
+--------+
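
As noted above, Binarizer also accepts Vector columns, in which case each element is compared against the threshold independently. A minimal sketch of that case, assuming the installed Spark version supports Vector input as the description states (the column names vec/binvec and the sample values are made up):

from pyspark.ml.linalg import Vectors
vdf = spark.createDataFrame([(Vectors.dense([0.5, 1.0, 2.1]),)], ["vec"])
# every element > 1.0 becomes 1.0, the rest become 0.0
vbin = Binarizer(threshold=1.0, inputCol="vec", outputCol="binvec")
vbin.transform(vdf).show(truncate=False)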

2 Bucketizer

Bucketizer maps a column of continuous features to a column of feature buckets, where the buckets are specified by the user via splits: n+1 split points produce n buckets. Each bucket covers the range [x, y) defined by two consecutive split points x and y, except for the last bucket, which also includes y. The splits must be strictly increasing. Values outside the specified splits are treated as errors.

values = [(0.1,), (0.4,), (1.2,), (1.5,), (float("nan"),), (float("nan"),)]
df = spark.createDataFrame(values, ['values'])
df.show()
bucketizer = Bucketizer(splits = [-float("inf"), 0.5, 1.4, float("inf")], inputCol="values", outputCol="buckets")
dff = bucketizer.setHandleInvalid("keep").transform(df)
dff.show()
bucketizer2 = Bucketizer(splits = [-float("inf"), 0.5, 1.4, float("inf")], inputCol="values", outputCol="buckets2")
dfff = bucketizer2.setHandleInvalid("skip").transform(df)
dfff.show()
+------+
|values|
+------+
|   0.1|
|   0.4|
|   1.2|
|   1.5|
|   NaN|
|   NaN|
+------+

+------+-------+
|values|buckets|
+------+-------+
|   0.1|    0.0|
|   0.4|    0.0|
|   1.2|    1.0|
|   1.5|    2.0|
|   NaN|    3.0|
|   NaN|    3.0|
+------+-------+

+------+--------+
|values|buckets2|
+------+--------+
|   0.1|     0.0|
|   0.4|     0.0|
|   1.2|     1.0|
|   1.5|     2.0|
+------+--------+

3 ChiSqSelector

ChiSqSelector(self, numTopFeatures=50, featuresCol="features", outputCol=None, labelCol="label", selectorType="numTopFeatures", percentile=0.1, fpr=0.05, fdr=0.05, fwe=0.05)

ChiSqSelector implements chi-squared feature selection. It operates on labeled data with categorical features. ChiSqSelector uses the chi-squared test of independence to pick the features that the class label depends on most strongly, which amounts to selecting the features with the most predictive power.

Chi-squared feature selection selects categorical features for predicting a categorical label. The selector supports several selection methods: numTopFeatures, percentile, fpr, fdr, and fwe. numTopFeatures picks the top num features with the most predictive power according to the chi-squared test. percentile is similar, but picks a fraction of the features rather than a fixed number. fpr picks all features whose p-value is below a threshold, thereby controlling the false positive rate of the selection. fdr uses the Benjamini-Hochberg procedure to pick all features whose false discovery rate is below a threshold. fwe picks all features whose p-value is below the threshold divided by the number of features, thereby controlling the family-wise error rate.

from pyspark.ml.linalg import Vectors
df = spark.createDataFrame([(Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0),(Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0),(Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0)],["features", "label"])
selector = ChiSqSelector(numTopFeatures=2, outputCol="selectedFeatures")
model = selector.fit(df)
model.transform(df).show()
+------------------+-----+----------------+
|          features|label|selectedFeatures|
+------------------+-----+----------------+
|[0.0,0.0,18.0,1.0]|  1.0|      [18.0,1.0]|
|[0.0,1.0,12.0,0.0]|  0.0|      [12.0,0.0]|
|[1.0,0.0,15.0,0.1]|  0.0|      [15.0,0.1]|
+------------------+-----+----------------+
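
The other selection methods listed above are chosen via selectorType. A minimal sketch of the percentile mode on the same df, assuming selectorType and percentile are available in the installed Spark version (they were added after the original numTopFeatures mode); the 0.5 value is arbitrary:

# keep the best 50% of the features instead of a fixed number
selector_pct = ChiSqSelector(selectorType="percentile", percentile=0.5, outputCol="selectedFeatures")
selector_pct.fit(df).transform(df).show()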

4 CountVectorizer

CountVectorizer extracts a vocabulary from a collection of documents and produces a CountVectorizerModel. The algorithm converts the text into sparse numeric vectors of token counts (term-frequency vectors). CountVectorizer puts the most frequent terms at the front of the vocabulary.

df = spark.createDataFrame([(0, ['a', 'b', 'c']), (1, ['a', 'a', 'a', 'c']), (3, ['b', 'a', 'a', 'c', 'b', 'b'])]).toDF('labels', 'features')
df2 = spark.createDataFrame([(0, ['aa', 'ba', 'ca']), (1, ['aa', 'aa', 'av', 'cv']), (3, ['br', 'at', 'aq', 'cq', 'ba', 'ba'])]).toDF('labels', 'features')
cv = CountVectorizer(inputCol="features", outputCol="vectors")
model = cv.fit(df)
dff = model.transform(df)
dff.show(truncate=False)
cv2 = CountVectorizer(inputCol="features", outputCol="vectors")
model = cv2.fit(df2)
dff2 = model.transform(df2)
dff2.show(truncate=False)
+------+------------------+-------------------------+
|labels|features          |vectors                  |
+------+------------------+-------------------------+
|0     |[a, b, c]         |(3,[0,1,2],[1.0,1.0,1.0])|
|1     |[a, a, a, c]      |(3,[0,2],[3.0,1.0])      |
|3     |[b, a, a, c, b, b]|(3,[0,1,2],[2.0,3.0,1.0])|
+------+------------------+-------------------------+

+------+------------------------+-------------------------------------+
|labels|features                |vectors                              |
+------+------------------------+-------------------------------------+
|0     |[aa, ba, ca]            |(9,[0,1,2],[1.0,1.0,1.0])            |
|1     |[aa, aa, av, cv]        |(9,[1,4,6],[2.0,1.0,1.0])            |
|3     |[br, at, aq, cq, ba, ba]|(9,[0,3,5,7,8],[2.0,1.0,1.0,1.0,1.0])|
+------+------------------------+-------------------------------------+
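
The frequency-ordered vocabulary mentioned above can be read off the fitted CountVectorizerModel, and vocabSize/minDF can be used to cap it. A small sketch re-fitting on the first df (the parameter values are arbitrary):

cv_small = CountVectorizer(inputCol="features", outputCol="vectors", vocabSize=2, minDF=1.0)
model_small = cv_small.fit(df)
print(model_small.vocabulary)          # most frequent terms first, truncated to vocabSize
model_small.transform(df).show(truncate=False)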

5 DCT

The Discrete Cosine Transform converts an N-dimensional real-valued sequence in the time domain into an N-dimensional real-valued sequence in the frequency domain (somewhat like the discrete Fourier transform). The DCT class in ML implements the DCT-II and scales the result by 1/√2 so that the representing matrix of the transform is unitary; the output sequence has the same length as the input, and input and output elements correspond one to one.

df = spark.createDataFrame([(Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0),(Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0),(Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0)],["features", "label"])
dct = DCT(inverse=False, inputCol="features", outputCol="resultVec")
dff = dct.transform(df)
dff.show(truncate=False)
+------------------+-----+----------------------------------------------------------------------------+
|features          |label|resultVec                                                                   |
+------------------+-----+----------------------------------------------------------------------------+
|[0.0,0.0,18.0,1.0]|1.0  |[9.5,-5.524046383753962,-8.500000000000002,11.488468633814291]              |
|[0.0,1.0,12.0,0.0]|0.0  |[6.500000000000001,-2.9765785508040836,-6.500000000000002,7.186096306820071]|
|[1.0,0.0,15.0,0.1]|0.0  |[8.05,-3.471017416902109,-6.95,10.042760481638615]                          |
+------------------+-----+----------------------------------------------------------------------------+
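
Because the DCT-II used here is invertible, setting inverse=True applies the inverse transform; running it on resultVec should recover the original features. A minimal sketch chained on dff from above (the column name origVec is made up):

idct = DCT(inverse=True, inputCol="resultVec", outputCol="origVec")
idct.transform(dff).select("features", "origVec").show(truncate=False)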

6 ElementwiseProduct

Element-wise product: multiplies each input vector element by element with a given weight vector (scalingVec), i.e. the Hadamard product.

df = spark.createDataFrame([(Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0),(Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0),(Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0)],["features", "label"])
ep = ElementwiseProduct(scalingVec=Vectors.dense([1.0, 2.0, 3.0, 4.0]), inputCol="features", outputCol="eprod")
dff = ep.transform(df)
dff.show()
+------------------+-----+------------------+
|          features|label|             eprod|
+------------------+-----+------------------+
|[0.0,0.0,18.0,1.0]|  1.0|[0.0,0.0,54.0,4.0]|
|[0.0,1.0,12.0,0.0]|  0.0|[0.0,2.0,36.0,0.0]|
|[1.0,0.0,15.0,0.1]|  0.0|[1.0,0.0,45.0,0.4]|
+------------------+-----+------------------+

7 HashingTF

HashingTF converts a set of terms into a (term-frequency) feature vector of a given length. In text processing, the "set of terms" is a bag of feature words. HashingTF uses the hashing trick: each raw feature is mapped by a hash function to an index in a low-dimensional vector (the hash function used here is MurmurHash3), and the term frequencies (TF) are then computed on the mapped, low-dimensional vector. This avoids building a full term-to-index map, which would be very expensive to compute for a large corpus. The downside of this dimensionality reduction is the possibility of hash collisions, where different raw features are hashed to the same value (f(x1) = f(x2)). To lower the probability of collisions we can increase the dimension of the feature vector, i.e. the number of buckets in the hash table. Since a simple modulo of the hash value is used to map it to a column index, a power of two should be used as the feature dimension; otherwise the features will not be mapped evenly onto the indices. The default feature dimension is 2^18 = 262,144. An optional binary toggle parameter controls the term-frequency counts: when set to true, all non-zero counts are set to 1, which is useful for discrete probabilistic models that work with binary counts.

df = spark.createDataFrame([(["a", "b", "c", 'd', 'a', 'd'],)], ["words"])
hashingTF = HashingTF(numFeatures=7, inputCol="words", outputCol="features")
hashingTF.transform(df).show(truncate=False)
+------------------+-------------------------------+
|words             |features                       |
+------------------+-------------------------------+
|[a, b, c, d, a, d]|(7,[0,1,2,3],[1.0,2.0,1.0,2.0])|
+------------------+-------------------------------+
df = spark.createDataFrame([([0,1,2,3,4,5], ['a','c','e','b','b'])]).toDF("id", "words")
#indexer = StringIndexer().setInputCol("words").setOutputCol("wordsIndex").fit(df)
#indexed = indexer.transform(df)
#indexed.show()
hashingTF = HashingTF(numFeatures=10, inputCol="words", outputCol="features")
hashingTF.transform(df).collect()
[Row(id=[0, 1, 2, 3, 4, 5], words=['a', 'c', 'e', 'b', 'b'], features=SparseVector(10, {0: 1.0, 1: 2.0, 2: 1.0, 8: 1.0}))]
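
The binary toggle mentioned in the description caps every term count at 1.0. A minimal sketch on the same df, assuming the binary parameter is available in the installed Spark version:

binTF = HashingTF(numFeatures=10, inputCol="words", outputCol="binFeatures", binary=True)
# the count 2.0 for the repeated token 'b' becomes 1.0
binTF.transform(df).collect()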

8 IDF

IDF (inverse document frequency): IDF is an Estimator that is fit on a dataset to produce an IDFModel (different document frequencies lead to different weights). The IDFModel takes feature vectors (usually produced by HashingTF or CountVectorizer) and scales each feature logarithmically. Intuitively, the more documents a term appears in, the lower its weight (it is down-weighted).

from pyspark.ml.linalg import DenseVector
df = spark.createDataFrame([(DenseVector([1.0, 2.0]),),(DenseVector([0.0, 1.0]),), (DenseVector([3.0, 0.2]),)], ["tf"])
idf = IDF(minDocFreq=3, inputCol="tf", outputCol="idf")
model = idf.fit(df)
model.transform(df).show()
+---------+---------+
|       tf|      idf|
+---------+---------+
|[1.0,2.0]|[0.0,0.0]|
|[0.0,1.0]|[0.0,0.0]|
|[3.0,0.2]|[0.0,0.0]|
+---------+---------+
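
The all-zero idf column above is expected: term 0 appears in only 2 of the 3 documents, so minDocFreq=3 filters it out, and term 1 appears in all 3 documents, so log((numDocs+1)/(df+1)) = log(4/4) = 0. A sketch with the default minDocFreq=0 that yields a non-zero weight for term 0:

idf0 = IDF(minDocFreq=0, inputCol="tf", outputCol="idf0")
# term 0 (present in 2 of 3 docs) now gets weight log(4/3) ≈ 0.29; term 1 still gets 0
idf0.fit(df).transform(df).show()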

9 StringIndexer

StringIndexer (string-to-index transformation) encodes a column of string labels into a column of label indices. The indices lie in the range [0, numLabels), where numLabels is the number of distinct labels, and are ordered by label frequency, so the most frequent label gets index 0. If the input column is numeric, it is first cast to string and then indexed. When a downstream stage of the pipeline (e.g. an Estimator or Transformer) needs to use the indexed labels, the input column of that stage must be set to the name of the indexed column, usually via setInputCol.

df = spark.createDataFrame(((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"))).toDF("id", "category")
si = StringIndexer(inputCol='category', outputCol='out')
model = si.fit(df)
td = model.transform(df)
td.show()
sorted(set([(i[0], i[1]) for i in td.select(td.id, td.out).collect()]),key=lambda x: x[0])
+---+--------+---+
| id|category|out|
+---+--------+---+
|  0|       a|0.0|
|  1|       b|2.0|
|  2|       c|1.0|
|  3|       a|0.0|
|  4|       a|0.0|
|  5|       c|1.0|
+---+--------+---+

[(0, 0.0), (1, 2.0), (2, 1.0), (3, 0.0), (4, 0.0), (5, 1.0)]
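
The frequency-based ordering described above can be checked on the fitted model: labels lists the original labels in index order, most frequent first.

print(model.labels)   # ['a', 'c', 'b'] here: 'a' occurs 3 times, 'c' twice, 'b' once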

10 IndexToString

IndexToString maps a column of label indices back to the original string labels. It is usually used together with StringIndexer: first convert the string labels to indices with StringIndexer and train the model, then convert the predicted indices back to the original string labels. You can of course also supply your own list of labels.

inverter = IndexToString(inputCol="out", outputCol="label")
itd = inverter.transform(td)
sorted(set([(i[0], str(i[1])) for i in itd.select(itd.id, itd.label).collect()]),key=lambda x: x[0])
[(0, 'a'), (1, 'b'), (2, 'c'), (3, 'a'), (4, 'a'), (5, 'c')]
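
As noted above, you can also pass your own labels instead of relying on the metadata written by StringIndexer. A minimal sketch (the label strings are arbitrary):

inverter2 = IndexToString(inputCol="out", outputCol="label2", labels=["most_common", "second", "third"])
# index 0.0 maps to the first entry of labels, 1.0 to the second, and so on
inverter2.transform(td).select("id", "out", "label2").show()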

11 MaxAbsScaler

Rescales each feature by dividing all of its values by that feature's maximum absolute value.

#df = spark.createDataFrame(([1,2] ,[2,3])).toDF('features1','features2' )
df = spark.createDataFrame([(DenseVector([1.0, 2.0]),),(DenseVector([0.0, 1.0]),), (DenseVector([3.0, 0.2]),)], ["ff"])
mas = MaxAbsScaler(inputCol='ff', outputCol='out')
model = mas.fit(df)
model.transform(df).show(truncate=False)
+---------+------------------------+
|ff       |out                     |
+---------+------------------------+
|[1.0,2.0]|[0.3333333333333333,1.0]|
|[0.0,1.0]|[0.0,0.5]               |
|[3.0,0.2]|[1.0,0.1]               |
+---------+------------------------+

12 MinMaxScaler

Normalization via min-max scaling: each feature is rescaled to the range [0, 1] by default.

df = spark.createDataFrame([(DenseVector([1.0, 2.0]),),(DenseVector([0.0, 1.0]),), (DenseVector([3.0, 0.2]),)], ["ff"])
mms = MinMaxScaler(inputCol='ff', outputCol='out')
model = mms.fit(df)
model.transform(df).show(truncate=False)
+---------+------------------------+
|ff       |out                     |
+---------+------------------------+
|[1.0,2.0]|[0.3333333333333333,1.0]|
|[0.0,1.0]|[0.0,0.4444444444444445]|
|[3.0,0.2]|[1.0,0.0]               |
+---------+------------------------+
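
The target range does not have to be [0, 1]; the min and max parameters rescale to any interval. A minimal sketch rescaling the same df to [-1, 1]:

mms2 = MinMaxScaler(min=-1.0, max=1.0, inputCol='ff', outputCol='out2')
mms2.fit(df).transform(df).show(truncate=False)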

13 NGram

An N-gram model assumes that each word in a language depends only on the N-1 words preceding it. The main variants are the bigram and the trigram: a bigram assumes the next word depends on the single word before it, while a trigram assumes it depends on the two words before it. In Spark ML this is implemented by the NGram class; setN(2) produces bigrams and setN(3) produces trigrams.

wordDataFrame = spark.createDataFrame(((0, ["Hi", "I", "heard", "about", "Spark"]),(1, ["I", "wish", "Java", "could", "use", "case", "classes"]),(2,["Logistic", "regression", "models", "are", "neat"]))).toDF("id", "words")
ng = NGram(n=2, inputCol='words', outputCol='grams')
ng.transform(wordDataFrame).show(truncate=False)
+---+------------------------------------------+------------------------------------------------------------------+
|id |words                                     |grams                                                             |
+---+------------------------------------------+------------------------------------------------------------------+
|0  |[Hi, I, heard, about, Spark]              |[Hi I, I heard, heard about, about Spark]                         |
|1  |[I, wish, Java, could, use, case, classes]|[I wish, wish Java, Java could, could use, use case, case classes]|
|2  |[Logistic, regression, models, are, neat] |[Logistic regression, regression models, models are, are neat]    |
+---+------------------------------------------+------------------------------------------------------------------+
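
Setting n=3 gives trigrams, as mentioned above; a minimal sketch on the same wordDataFrame:

ng3 = NGram(n=3, inputCol='words', outputCol='grams3')
ng3.transform(wordDataFrame).show(truncate=False)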

14 Normalizer

Vector normalization: each row vector is scaled to unit p-norm (here p = 2, the Euclidean norm).

df = spark.createDataFrame([(DenseVector([1.0, 2.0]),),(DenseVector([0.0, 1.0]),), (DenseVector([3.0, 0.2]),)], ["ff"])
nm = Normalizer(p=2.0,inputCol='ff', outputCol='out')
nm.transform(df).show(truncate=False)
+---------+---------------------------------------+
|ff       |out                                    |
+---------+---------------------------------------+
|[1.0,2.0]|[0.4472135954999579,0.8944271909999159]|
|[0.0,1.0]|[0.0,1.0]                              |
|[3.0,0.2]|[0.997785157856609,0.06651901052377394]|
+---------+---------------------------------------+
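
Other norms work the same way; with p=1.0 each vector is divided by the sum of its absolute values instead of by its Euclidean length. A sketch on the same df:

nm1 = Normalizer(p=1.0, inputCol='ff', outputCol='out1')
nm1.transform(df).show(truncate=False)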

15 OneHotEncoder

One-hot encoding maps a column of label indices to a column of binary vectors with at most one non-zero entry. This encoding lets algorithms that expect continuous features, such as logistic regression, make use of categorical features.

df = spark.createDataFrame(((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"))).toDF("id", "category")
stringIndexer = StringIndexer(inputCol="category", outputCol="indexed")
model = stringIndexer.fit(df)
td = model.transform(df)
ohe = OneHotEncoder(inputCol='indexed', outputCol='onehot')
ohe.transform(td).show()
+---+--------+-------+-------------+
| id|category|indexed|       onehot|
+---+--------+-------+-------------+
|  0|       a|    0.0|(2,[0],[1.0])|
|  1|       b|    2.0|    (2,[],[])|
|  2|       c|    1.0|(2,[1],[1.0])|
|  3|       a|    0.0|(2,[0],[1.0])|
|  4|       a|    0.0|(2,[0],[1.0])|
|  5|       c|    1.0|(2,[1],[1.0])|
+---+--------+-------+-------------+
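
Note that OneHotEncoder drops the last category by default (dropLast=True), which is why index 2.0 ('b') is encoded as the empty vector (2,[],[]) above. Keeping all categories is a one-parameter change; a minimal sketch:

ohe_full = OneHotEncoder(inputCol='indexed', outputCol='onehot_full', dropLast=False)
# now 'b' (index 2.0) gets its own slot: (3,[2],[1.0])
ohe_full.transform(td).show()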

16 PCA

Dimensionality reduction: PCA projects the feature vectors onto the top k principal components.

from pyspark.ml.linalg import Vectors
data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),(Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),(Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = spark.createDataFrame(data,["features"])
pca = PCA(k=2, inputCol="features", outputCol="pca_features")
model = pca.fit(df)
model.transform(df).show(truncate=False)
+---------------------+----------------------------------------+
|features             |pca_features                            |
+---------------------+----------------------------------------+
|(5,[1,3],[1.0,7.0])  |[1.6485728230883807,-4.013282700516296] |
|[2.0,0.0,3.0,4.0,5.0]|[-4.645104331781534,-1.1167972663619026]|
|[4.0,0.0,0.0,6.0,7.0]|[-6.428880535676489,-5.337951427775355] |
+---------------------+----------------------------------------+
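
The fitted PCAModel also reports how much variance each of the k components explains, which helps when choosing k. A minimal sketch reusing model from above:

print(model.explainedVariance)   # DenseVector of length k with the explained-variance ratios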

17 PolynomialExpansion

Polynomial expansion maps the n-dimensional original features into a polynomial feature space of a given degree.

from pyspark.ml.linalg import Vectors
data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),(Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),(Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = spark.createDataFrame(data,["features"])
pe = PolynomialExpansion(degree=2, inputCol="features", outputCol="expanded")
pe.transform(df).show(truncate=False)
+---------------------+---------------------------------------------------------------------------------------+
|features             |expanded                                                                               |
+---------------------+---------------------------------------------------------------------------------------+
|(5,[1,3],[1.0,7.0])  |(20,[2,4,9,11,13],[1.0,1.0,7.0,7.0,49.0])                                              |
|[2.0,0.0,3.0,4.0,5.0]|[2.0,4.0,0.0,0.0,0.0,3.0,6.0,0.0,9.0,4.0,8.0,0.0,12.0,16.0,5.0,10.0,0.0,15.0,20.0,25.0]|
|[4.0,0.0,0.0,6.0,7.0]|[4.0,16.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,24.0,0.0,0.0,36.0,7.0,28.0,0.0,0.0,42.0,49.0]|
+---------------------+---------------------------------------------------------------------------------------+

18 QuantileDiscretizer

Quantile-based discretization of continuous data: sometimes we do not want to specify the bucket boundaries ourselves and instead let Spark bin the data for us.

from pyspark.ml.linalg import Vectors
data = [[2.5],[1.6],[4.0]]
df = spark.createDataFrame(data,["features"])
qds = QuantileDiscretizer(numBuckets=2,inputCol="features", outputCol="buckets", relativeError=0.01, handleInvalid="error")
qds.fit(df).transform(df).show(truncate=False)
+--------+-------+
|features|buckets|
+--------+-------+
|2.5     |1.0    |
|1.6     |0.0    |
|4.0     |1.0    |
+--------+-------+

19 RegexTokenizer

RegexTokenizer splits text into tokens based on a regular expression; the default pattern is '\s+' (split on whitespace), and the resulting tokens are lower-cased by default.

df = spark.createDataFrame([("A B  c",), ("dsd fdds    eee     efsf",)], ["text"])
reTokenizer = RegexTokenizer(inputCol="text", outputCol="words")
reTokenizer.transform(df).show(truncate=False)
+------------------------+----------------------+
|text                    |words                 |
+------------------------+----------------------+
|A B  c                  |[a, b, c]             |
|dsd fdds    eee     efsf|[dsd, fdds, eee, efsf]|
+------------------------+----------------------+
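
The pattern does not have to be whitespace: the pattern parameter accepts any regular expression (and gaps=False makes the regex match the tokens themselves rather than the separators). A sketch splitting on commas; the sample string is made up:

csv_df = spark.createDataFrame([("a,b, C",)], ["text"])
commaTok = RegexTokenizer(inputCol="text", outputCol="words", pattern=",\\s*")
# tokens are also lower-cased by default (toLowercase=True)
commaTok.transform(csv_df).show(truncate=False)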

20 RFormula

Suppose there are two columns a and b used as features, and y is the response variable.

y ~ a + b describes the linear model y ~ w0 + w1 * a + w2 * b, where w0 is the intercept.

y ~ a + b + a:b - 1 describes the linear model y ~ w1 * a + w2 * b + w3 * a * b (the -1 removes the intercept, so there is no w0 in the model; a:b multiplies features a and b to produce a new interaction feature).

In other words, these simple symbols let us express linear models. RFormula produces a vector column for the features and a double or string column for the label.
Just like fitting a linear model with a formula in R, string-typed features are one-hot encoded and numeric features are cast to double. If the label column is of string type, it is first converted to a double-valued index with StringIndexer. If the label column does not exist in the DataFrame, it is generated from the response variable specified in the formula.

df = spark.createDataFrame([(1.0, 1.0, "a"),(0.0, 2.0, "b"),(0.0, 0.0, "a")], ["y", "x", "s"])
rf = RFormula(formula="y ~ x + s")
model = rf.fit(df)
model.transform(df).show()
+---+---+---+---------+-----+
|  y|  x|  s| features|label|
+---+---+---+---------+-----+
|1.0|1.0|  a|[1.0,1.0]|  1.0|
|0.0|2.0|  b|[2.0,0.0]|  0.0|
|0.0|0.0|  a|[0.0,1.0]|  0.0|
+---+---+---+---------+-----+
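
The interaction operator described above can be written directly in the formula string. A sketch on the same df, assuming the ':' operator is supported by the installed Spark version (interaction of the numeric x with the one-hot encoded s):

rf2 = RFormula(formula="y ~ x + s + x:s")
rf2.fit(df).transform(df).select("features", "label").show(truncate=False)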

21 StandardScaler

Standardization: rescales features to unit standard deviation and, optionally, zero mean. Note that withStd defaults to True while withMean defaults to False, so by default the features are only scaled, not centered.

from pyspark.ml.linalg import Vectors
data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),(Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),(Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = spark.createDataFrame(data,["features"])
ss = StandardScaler(inputCol="features", outputCol="sss").fit(df)
ss.transform(df).show(truncate=False)
+---------------------+-----------------------------------------------------------------+
|features             |sss                                                              |
+---------------------+-----------------------------------------------------------------+
|(5,[1,3],[1.0,7.0])  |(5,[1,3],[1.7320508075688774,4.58257569495584])                  |
|[2.0,0.0,3.0,4.0,5.0]|[1.0,0.0,1.7320508075688776,2.6186146828319083,1.386750490563073]|
|[4.0,0.0,0.0,6.0,7.0]|[2.0,0.0,0.0,3.9279220242478625,1.941450686788302]               |
+---------------------+-----------------------------------------------------------------+
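
Because withMean defaults to False, the output above is only scaled to unit standard deviation, not centered. Centering as well needs withMean=True, which densifies the data, so it is best done on dense vectors. A sketch on a small dense-only frame (the sample values are made up):

dense_df = spark.createDataFrame([(Vectors.dense([2.0, 0.0, 3.0]),), (Vectors.dense([4.0, 2.0, 1.0]),)], ["features"])
ss2 = StandardScaler(withMean=True, withStd=True, inputCol="features", outputCol="scaled").fit(dense_df)
ss2.transform(dense_df).show(truncate=False)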

22 StopWordsRemover

StopWordsRemover simply removes all stop words: every token coming in through inputCol is checked, and the stop words are dropped from the result written to outputCol.

df1 = spark.createDataFrame([(0, ["I", "saw", "the", "red", "balloon"]),(1, ["Mary", "had", "a", "little", "lamb"])]).toDF("id", "text")
df2 = spark.createDataFrame([(["a", "b", "c"],)], ["text"])
swr = StopWordsRemover(inputCol="text", outputCol="words", )
swr.transform(df1).show(truncate=False)
swr.transform(df2).show(truncate=False)

+---+----------------------------+--------------------+
|id |text                        |words               |
+---+----------------------------+--------------------+
|0  |[I, saw, the, red, balloon] |[saw, red, balloon] |
|1  |[Mary, had, a, little, lamb]|[Mary, little, lamb]|
+---+----------------------------+--------------------+

+---------+------+
|text     |words |
+---------+------+
|[a, b, c]|[b, c]|
+---------+------+
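
The stop-word list itself is configurable: stopWords accepts any list, and the static StopWordsRemover.loadDefaultStopWords(language) returns the built-in list for a given language. A minimal sketch with a custom list on df2 (the list content is arbitrary):

custom = StopWordsRemover(inputCol="text", outputCol="words2", stopWords=["a", "b"])
custom.transform(df2).show(truncate=False)
print(StopWordsRemover.loadDefaultStopWords("english")[:10])   # first few built-in English stop words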

23 Tokenizer

Converts the string to lowercase and splits it into words on whitespace.

df = spark.createDataFrame([("A b c",), ("DSVF SDF fds DW fes FDSsdfds", )], ["text"])
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tokenizer.transform(df).show(truncate=False)
+----------------------------+-----------------------------------+
|text                        |words                              |
+----------------------------+-----------------------------------+
|A b c                       |[a, b, c]                          |
|DSVF SDF fds DW fes FDSsdfds|[dsvf, sdf, fds, dw, fes, fdssdfds]|
+----------------------------+-----------------------------------+

24 VectorAssembler

Extracting feature columns from the source data is a fairly typical, general-purpose step, because raw datasets often contain non-feature columns such as ID and Description. To make feature input convenient for the model that follows, several columns need to be converted into one feature vector under a single name; the VectorAssembler class performs this task. VectorAssembler is a transformer that combines multiple columns into a single vector column.

df = spark.createDataFrame([(1, 0, 3), [3,4,2], [43, 2,3]], ["a", "b", "c"])
vecAssembler = VectorAssembler(inputCols=["a", "b", "c"], outputCol="features")
vecAssembler.transform(df).show()
+---+---+---+--------------+
|  a|  b|  c|      features|
+---+---+---+--------------+
|  1|  0|  3| [1.0,0.0,3.0]|
|  3|  4|  2| [3.0,4.0,2.0]|
| 43|  2|  3|[43.0,2.0,3.0]|
+---+---+---+--------------+

25 VectorIndexer

Its main purpose is to improve the performance of classifiers such as decision trees and random forests. VectorIndexer indexes the categorical (discrete-valued) features inside a dataset's feature vectors. It automatically decides which features are categorical and re-encodes their values: given a threshold maxCategories, a feature with at most maxCategories distinct values is re-indexed to 0..K-1 (K <= maxCategories); a feature with more distinct values than maxCategories is treated as continuous and left unchanged.

from pyspark.ml.linalg import Vectors
df = spark.createDataFrame([(Vectors.dense([4.0, 0.0, 1.0, 2.0, 3.0,0.0, 5.0, 0., 6.0, 8.0]),),(Vectors.dense([4.0, 9.0, 22.0, 2.0, 22.0,0.0, 33.0, 12., 6.0, 55.0]),),(Vectors.dense([3.0, 0.0, 7.0, 6.0, 3.0,4.0, 5.0, 13., 6.0, 11.0]),),(Vectors.dense([0.0, 1.0, 7., 0.0, 1.0,9.0, 1.0, 0., 0.0, 1.0]),), (Vectors.dense([3.0,0.0, 2.0, 1.0, 2.0,0.0, 1.0, 0., 0.0, 1.0]),)], ["a"])
indexer = VectorIndexer(maxCategories=3, inputCol="a", outputCol="indexed")
model = indexer.fit(df)
model.transform(df).show(truncate=False)
+----------------------------------------------+--------------------------------------------+
|a                                             |indexed                                     |
+----------------------------------------------+--------------------------------------------+
|[4.0,0.0,1.0,2.0,3.0,0.0,5.0,0.0,6.0,8.0]     |[2.0,0.0,1.0,2.0,3.0,0.0,1.0,0.0,1.0,8.0]   |
|[4.0,9.0,22.0,2.0,22.0,0.0,33.0,12.0,6.0,55.0]|[2.0,2.0,22.0,2.0,22.0,0.0,2.0,1.0,1.0,55.0]|
|[3.0,0.0,7.0,6.0,3.0,4.0,5.0,13.0,6.0,11.0]   |[1.0,0.0,7.0,6.0,3.0,1.0,1.0,2.0,1.0,11.0]  |
|[0.0,1.0,7.0,0.0,1.0,9.0,1.0,0.0,0.0,1.0]     |[0.0,1.0,7.0,0.0,1.0,2.0,0.0,0.0,0.0,1.0]   |
|[3.0,0.0,2.0,1.0,2.0,0.0,1.0,0.0,0.0,1.0]     |[1.0,0.0,2.0,1.0,2.0,0.0,0.0,0.0,0.0,1.0]   |
+----------------------------------------------+--------------------------------------------+
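
Which features were treated as categorical, and how their values were re-mapped, can be read off the fitted model: categoryMaps maps feature index -> {original value: new index}. A minimal sketch reusing model from above:

print(model.categoryMaps)   # only features with at most maxCategories distinct values appear here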

26 VectorSlicer

VectorSlicer is a transformer that takes a feature vector as input and outputs a sub-vector of the original features. It accepts a vector column together with a set of indices and produces a new vector column containing the values selected by those indices. Two kinds of indices are accepted:

1. Integer indices, giving the positions of the features within the vector: setIndices()

2. String indices, giving the names of the features within the vector: this requires the vector column to have an AttributeGroup, because the names are matched against the Attribute metadata.

You can use integer indices, string names, or both at the same time. At least one feature must be selected, and the same feature may not be selected twice (an integer index and a name must not refer to the same feature). Also note that name-based selection raises an error if a name cannot be found in the metadata. The output vector lists the features selected by integer index first (in the given order), followed by those selected by name (in the given order).

df = spark.createDataFrame([(Vectors.dense([-2.0, 2.3, 0.0, 0.0, 1.0]),),(Vectors.dense([0.0, 0.0, 0.0, 0.0, 0.0]),),(Vectors.dense([0.6, -1.1, -3.0, 4.5, 3.3]),)], ["features"])
vs = VectorSlicer(inputCol="features", outputCol="sliced", indices=[1, 4])
vs.transform(df).show()
+--------------------+----------+
|            features|    sliced|
+--------------------+----------+
|[-2.0,2.3,0.0,0.0...| [2.3,1.0]|
|[0.0,0.0,0.0,0.0,...| [0.0,0.0]|
|[0.6,-1.1,-3.0,4....|[-1.1,3.3]|
+--------------------+----------+

27 Word2Vec

sent = ("a b " * 100 + "a c " * 10).split(" ")
doc = spark.createDataFrame([(sent,), (sent,)], ["sentence"])
doc.show()
word2Vec = Word2Vec(vectorSize=5, seed=42, inputCol="sentence", outputCol="model")
model = word2Vec.fit(doc)
model.getVectors().show(truncate=False)
from pyspark.sql.functions import format_number as fmt
model.findSynonyms("a", 2).select("word", fmt("similarity", 3).alias("similarity")).show()
model.transform(doc).show()
+--------------------+
|            sentence|
+--------------------+
|[a, b, a, b, a, b...|
|[a, b, a, b, a, b...|
+--------------------+

+----+-----------------------------------------------------------------------------------------------------+
|word|vector                                                                                               |
+----+-----------------------------------------------------------------------------------------------------+
|a   |[0.0946177989244461,-0.4951631426811218,0.06406556069850922,-0.37930983304977417,0.21593928337097168]|
|b   |[1.1547421216964722,-0.593326210975647,-0.8721810579299927,0.4669361710548401,0.551497220993042]     |
|c   |[-0.3794820010662079,0.34077689051628113,0.06388652324676514,0.0352821946144104,-0.24136029183864594]|
+----+-----------------------------------------------------------------------------------------------------+

+----+----------+
|word|similarity|
+----+----------+
|   b|     0.251|
|   c|    -0.698|
+----+----------+

+--------------------+--------------------+
|            sentence|               model|
+--------------------+--------------------+
|[a, b, a, b, a, b...|[0.55243144814784...|
|[a, b, a, b, a, b...|[0.55243144814784...|
+--------------------+--------------------+

