Tokenizer
Algorithm overview:
Tokenization is the process of breaking text up into individual terms (usually words). The example below shows how to split sentences into sequences of words.
RegexTokenizer allows more advanced tokenization based on regular expressions. By default, the parameter "pattern" is the delimiter used to split the input text. Alternatively, users can set the parameter "gaps" to false, indicating that the regex "pattern" denotes the "tokens" rather than the gaps between them; all matching occurrences are then returned as the tokenization result.
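The two RegexTokenizer modes can be sketched in plain Python with the `re` module. This is an illustration of the splitting logic only, not the Spark API:

```python
import re

sentence = "Logistic,regression,models,are,neat"

# Default mode: the pattern "\W" is treated as the delimiter (the "gaps").
tokens_by_gaps = [t for t in re.split(r"\W", sentence) if t]

# gaps=False mode: the pattern "\w+" matches the tokens themselves.
tokens_by_match = re.findall(r"\w+", sentence)

print(tokens_by_gaps)   # ['Logistic', 'regression', 'models', 'are', 'neat']
print(tokens_by_match)  # same tokens, found by matching them directly
```

Both modes produce the same tokens here; they differ when the delimiter and token patterns are not complements of each other.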
Example usage:
Scala:
- import org.apache.spark.ml.feature.{RegexTokenizer, Tokenizer}
- val sentenceDataFrame = spark.createDataFrame(Seq(
- (0, "Hi I heard about Spark"),
- (1, "I wish Java could use case classes"),
- (2, "Logistic,regression,models,are,neat")
- )).toDF("label", "sentence")
- val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
- val regexTokenizer = new RegexTokenizer()
- .setInputCol("sentence")
- .setOutputCol("words")
- .setPattern("\\W") // alternatively .setPattern("\\w+").setGaps(false)
- val tokenized = tokenizer.transform(sentenceDataFrame)
- tokenized.select("words", "label").take(3).foreach(println)
- val regexTokenized = regexTokenizer.transform(sentenceDataFrame)
- regexTokenized.select("words", "label").take(3).foreach(println)
Java:
- import java.util.Arrays;
- import java.util.List;
- import org.apache.spark.ml.feature.RegexTokenizer;
- import org.apache.spark.ml.feature.Tokenizer;
- import org.apache.spark.sql.Dataset;
- import org.apache.spark.sql.Row;
- import org.apache.spark.sql.RowFactory;
- import org.apache.spark.sql.types.DataTypes;
- import org.apache.spark.sql.types.Metadata;
- import org.apache.spark.sql.types.StructField;
- import org.apache.spark.sql.types.StructType;
- List<Row> data = Arrays.asList(
- RowFactory.create(0, "Hi I heard about Spark"),
- RowFactory.create(1, "I wish Java could use case classes"),
- RowFactory.create(2, "Logistic,regression,models,are,neat")
- );
- StructType schema = new StructType(new StructField[]{
- new StructField("label", DataTypes.IntegerType, false, Metadata.empty()),
- new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
- });
- Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema);
- Tokenizer tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words");
- Dataset<Row> wordsDataFrame = tokenizer.transform(sentenceDataFrame);
- for (Row r : wordsDataFrame.select("words", "label").takeAsList(3)) {
- java.util.List<String> words = r.getList(0);
- for (String word : words) System.out.print(word + " ");
- System.out.println();
- }
- RegexTokenizer regexTokenizer = new RegexTokenizer()
- .setInputCol("sentence")
- .setOutputCol("words")
- .setPattern("\\W"); // alternatively .setPattern("\\w+").setGaps(false);
Python:
- from pyspark.ml.feature import Tokenizer, RegexTokenizer
- sentenceDataFrame = spark.createDataFrame([
- (0, "Hi I heard about Spark"),
- (1, "I wish Java could use case classes"),
- (2, "Logistic,regression,models,are,neat")
- ], ["label", "sentence"])
- tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
- wordsDataFrame = tokenizer.transform(sentenceDataFrame)
- for words_label in wordsDataFrame.select("words", "label").take(3):
- print(words_label)
- regexTokenizer = RegexTokenizer(inputCol="sentence", outputCol="words", pattern="\\W")
- # alternatively, pattern="\\w+", gaps=False
StopWordsRemover
Algorithm overview:
Stop words are words that appear frequently in a corpus but carry little meaning; they are usually excluded from the algorithm's input.
StopWordsRemover takes a sequence of strings as input (e.g. the output of a Tokenizer) and drops all stop words from it. The list of stop words is specified by the stopWords parameter. Default stop word lists for some languages are available by calling StopWordsRemover.loadDefaultStopWords(language). The boolean parameter caseSensitive indicates whether the matching is case sensitive (false by default).
Example:
Suppose we have a DataFrame with columns id and raw:
 id | raw
----|------------------------------
 0  | [I, saw, the, red, baloon]
 1  | [Mary, had, a, little, lamb]
Applying StopWordsRemover with raw as the input column and filtered as the output column, we obtain the following:
 id | raw                          | filtered
----|------------------------------|----------------------
 0  | [I, saw, the, red, baloon]   | [saw, red, baloon]
 1  | [Mary, had, a, little, lamb] | [Mary, little, lamb]
Here, "I", "the", "had", and "a" have been removed.
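The filtering above can be sketched in plain Python. This is a sketch of the logic, not the Spark API, and the stop word set below is a toy subset standing in for the default English list:

```python
stop_words = {"i", "the", "had", "a"}  # toy subset of an English stop word list

def remove_stop_words(tokens, case_sensitive=False):
    """Drop tokens found in the stop word list (case-insensitive by default)."""
    if case_sensitive:
        return [t for t in tokens if t not in stop_words]
    return [t for t in tokens if t.lower() not in stop_words]

print(remove_stop_words(["I", "saw", "the", "red", "baloon"]))    # ['saw', 'red', 'baloon']
print(remove_stop_words(["Mary", "had", "a", "little", "lamb"]))  # ['Mary', 'little', 'lamb']
```

With case_sensitive=False, "I" is removed because its lowercase form "i" is in the list, matching the example above.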
Example usage:
Scala:
- import org.apache.spark.ml.feature.StopWordsRemover
- val remover = new StopWordsRemover()
- .setInputCol("raw")
- .setOutputCol("filtered")
- val dataSet = spark.createDataFrame(Seq(
- (0, Seq("I", "saw", "the", "red", "baloon")),
- (1, Seq("Mary", "had", "a", "little", "lamb"))
- )).toDF("id", "raw")
- remover.transform(dataSet).show()
Java:
- import java.util.Arrays;
- import java.util.List;
- import org.apache.spark.ml.feature.StopWordsRemover;
- import org.apache.spark.sql.Dataset;
- import org.apache.spark.sql.Row;
- import org.apache.spark.sql.RowFactory;
- import org.apache.spark.sql.types.DataTypes;
- import org.apache.spark.sql.types.Metadata;
- import org.apache.spark.sql.types.StructField;
- import org.apache.spark.sql.types.StructType;
- StopWordsRemover remover = new StopWordsRemover()
- .setInputCol("raw")
- .setOutputCol("filtered");
- List<Row> data = Arrays.asList(
- RowFactory.create(Arrays.asList("I", "saw", "the", "red", "baloon")),
- RowFactory.create(Arrays.asList("Mary", "had", "a", "little", "lamb"))
- );
- StructType schema = new StructType(new StructField[]{
- new StructField(
- "raw", DataTypes.createArrayType(DataTypes.StringType), false, Metadata.empty())
- });
- Dataset<Row> dataset = spark.createDataFrame(data, schema);
- remover.transform(dataset).show();
Python:
- from pyspark.ml.feature import StopWordsRemover
- sentenceData = spark.createDataFrame([
- (0, ["I", "saw", "the", "red", "baloon"]),
- (1, ["Mary", "had", "a", "little", "lamb"])
- ], ["label", "raw"])
- remover = StopWordsRemover(inputCol="raw", outputCol="filtered")
- remover.transform(sentenceData).show(truncate=False)
n-gram
Algorithm overview:
An n-gram is a sequence of n tokens (typically words) for some integer n. The NGram class can be used to transform input features into n-grams.
NGram takes a sequence of strings as input (e.g. the output of a Tokenizer). The parameter n determines the number of terms in each n-gram. The output consists of a sequence of n-grams, where each n-gram is a string of n consecutive words separated by spaces. If the input sequence contains fewer than n strings, no output is produced.
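The construction just described can be sketched in a few lines of plain Python (the logic only, not the Spark API):

```python
def ngrams(tokens, n=2):
    """Join every n consecutive tokens with spaces; empty if fewer than n tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(["Hi", "I", "heard", "about", "Spark"]))
# ['Hi I', 'I heard', 'heard about', 'about Spark']
print(ngrams(["Hi"], n=2))  # [] -- fewer than n tokens yields no output
```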
Example usage:
Scala:
- import org.apache.spark.ml.feature.NGram
- val wordDataFrame = spark.createDataFrame(Seq(
- (0, Array("Hi", "I", "heard", "about", "Spark")),
- (1, Array("I", "wish", "Java", "could", "use", "case", "classes")),
- (2, Array("Logistic", "regression", "models", "are", "neat"))
- )).toDF("label", "words")
- val ngram = new NGram().setInputCol("words").setOutputCol("ngrams")
- val ngramDataFrame = ngram.transform(wordDataFrame)
- ngramDataFrame.take(3).map(_.getAs[Stream[String]]("ngrams").toList).foreach(println)
Java:
- import java.util.Arrays;
- import java.util.List;
- import org.apache.spark.ml.feature.NGram;
- import org.apache.spark.sql.Row;
- import org.apache.spark.sql.RowFactory;
- import org.apache.spark.sql.types.DataTypes;
- import org.apache.spark.sql.types.Metadata;
- import org.apache.spark.sql.types.StructField;
- import org.apache.spark.sql.types.StructType;
- List<Row> data = Arrays.asList(
- RowFactory.create(0.0, Arrays.asList("Hi", "I", "heard", "about", "Spark")),
- RowFactory.create(1.0, Arrays.asList("I", "wish", "Java", "could", "use", "case", "classes")),
- RowFactory.create(2.0, Arrays.asList("Logistic", "regression", "models", "are", "neat"))
- );
- StructType schema = new StructType(new StructField[]{
- new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
- new StructField(
- "words", DataTypes.createArrayType(DataTypes.StringType), false, Metadata.empty())
- });
- Dataset<Row> wordDataFrame = spark.createDataFrame(data, schema);
- NGram ngramTransformer = new NGram().setInputCol("words").setOutputCol("ngrams");
- Dataset<Row> ngramDataFrame = ngramTransformer.transform(wordDataFrame);
- for (Row r : ngramDataFrame.select("ngrams", "label").takeAsList(3)) {
- java.util.List<String> ngrams = r.getList(0);
- for (String ngram : ngrams) System.out.print(ngram + " --- ");
- System.out.println();
- }
Python:
- from pyspark.ml.feature import NGram
- wordDataFrame = spark.createDataFrame([
- (0, ["Hi", "I", "heard", "about", "Spark"]),
- (1, ["I", "wish", "Java", "could", "use", "case", "classes"]),
- (2, ["Logistic", "regression", "models", "are", "neat"])
- ], ["label", "words"])
- ngram = NGram(inputCol="words", outputCol="ngrams")
- ngramDataFrame = ngram.transform(wordDataFrame)
- for ngrams_label in ngramDataFrame.select("ngrams", "label").take(3):
- print(ngrams_label)
Binarizer
Algorithm overview:
Binarization is the process of thresholding continuous numerical features into 0/1 binary features.
Binarizer takes the parameters inputCol, outputCol, and threshold. Feature values greater than the threshold are mapped to 1.0; values less than or equal to the threshold are mapped to 0.0.
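The thresholding rule is simple enough to state directly in plain Python (a sketch of the logic, not the Spark API):

```python
def binarize(values, threshold=0.5):
    """Map values strictly above the threshold to 1.0, the rest (<= threshold) to 0.0."""
    return [1.0 if v > threshold else 0.0 for v in values]

print(binarize([0.1, 0.8, 0.2]))  # [0.0, 1.0, 0.0]
```

Note the asymmetry: a value exactly equal to the threshold maps to 0.0.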
Example usage:
Scala:
- import org.apache.spark.ml.feature.Binarizer
- val data = Array((0, 0.1), (1, 0.8), (2, 0.2))
- val dataFrame = spark.createDataFrame(data).toDF("label", "feature")
- val binarizer: Binarizer = new Binarizer()
- .setInputCol("feature")
- .setOutputCol("binarized_feature")
- .setThreshold(0.5)
- val binarizedDataFrame = binarizer.transform(dataFrame)
- val binarizedFeatures = binarizedDataFrame.select("binarized_feature")
- binarizedFeatures.collect().foreach(println)
Java:
- import java.util.Arrays;
- import java.util.List;
- import org.apache.spark.ml.feature.Binarizer;
- import org.apache.spark.sql.Row;
- import org.apache.spark.sql.RowFactory;
- import org.apache.spark.sql.types.DataTypes;
- import org.apache.spark.sql.types.Metadata;
- import org.apache.spark.sql.types.StructField;
- import org.apache.spark.sql.types.StructType;
- List<Row> data = Arrays.asList(
- RowFactory.create(0, 0.1),
- RowFactory.create(1, 0.8),
- RowFactory.create(2, 0.2)
- );
- StructType schema = new StructType(new StructField[]{
- new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
- new StructField("feature", DataTypes.DoubleType, false, Metadata.empty())
- });
- Dataset<Row> continuousDataFrame = spark.createDataFrame(data, schema);
- Binarizer binarizer = new Binarizer()
- .setInputCol("feature")
- .setOutputCol("binarized_feature")
- .setThreshold(0.5);
- Dataset<Row> binarizedDataFrame = binarizer.transform(continuousDataFrame);
- Dataset<Row> binarizedFeatures = binarizedDataFrame.select("binarized_feature");
- for (Row r : binarizedFeatures.collectAsList()) {
- Double binarized_value = r.getDouble(0);
- System.out.println(binarized_value);
- }
Python:
- from pyspark.ml.feature import Binarizer
- continuousDataFrame = spark.createDataFrame([
- (0, 0.1),
- (1, 0.8),
- (2, 0.2)
- ], ["label", "feature"])
- binarizer = Binarizer(threshold=0.5, inputCol="feature", outputCol="binarized_feature")
- binarizedDataFrame = binarizer.transform(continuousDataFrame)
- binarizedFeatures = binarizedDataFrame.select("binarized_feature")
- for binarized_feature, in binarizedFeatures.collect():
- print(binarized_feature)
PCA
Algorithm overview:
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of possibly correlated variables into a set of linearly uncorrelated variables called principal components. PCA can therefore be used to reduce the dimensionality of a variable set. The example below shows how to project 5-dimensional feature vectors into 3-dimensional principal component vectors.
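The underlying computation can be sketched with numpy: center the data, take the top-k right singular vectors, and project onto them. This is a sketch of the standard SVD-based procedure under the usual centering convention, not the Spark estimator, and the data below mirrors the example rows:

```python
import numpy as np

def pca_project(X, k):
    """Project rows of X onto the top-k principal components (SVD of centered data)."""
    Xc = X - X.mean(axis=0)                 # center each column
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                    # coordinates in the top-k component basis

X = np.array([[0.0, 1.0, 0.0, 7.0, 0.0],
              [2.0, 0.0, 3.0, 4.0, 5.0],
              [4.0, 0.0, 0.0, 6.0, 7.0]])
Z = pca_project(X, k=3)
print(Z.shape)  # (3, 3): three 5-dimensional rows reduced to 3 dimensions
```

With only three rows, the centered data has rank at most 2, so the third projected coordinate is (numerically) zero for every row.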
Example usage:
Scala:
- import org.apache.spark.ml.feature.PCA
- import org.apache.spark.ml.linalg.Vectors
- val data = Array(
- Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
- Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
- Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
- )
- val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
- val pca = new PCA()
- .setInputCol("features")
- .setOutputCol("pcaFeatures")
- .setK(3)
- .fit(df)
- val pcaDF = pca.transform(df)
- val result = pcaDF.select("pcaFeatures")
- result.show()
Java:
- import java.util.Arrays;
- import java.util.List;
- import org.apache.spark.ml.feature.PCA;
- import org.apache.spark.ml.feature.PCAModel;
- import org.apache.spark.ml.linalg.VectorUDT;
- import org.apache.spark.ml.linalg.Vectors;
- import org.apache.spark.sql.Dataset;
- import org.apache.spark.sql.Row;
- import org.apache.spark.sql.RowFactory;
- import org.apache.spark.sql.types.Metadata;
- import org.apache.spark.sql.types.StructField;
- import org.apache.spark.sql.types.StructType;
- List<Row> data = Arrays.asList(
- RowFactory.create(Vectors.sparse(5, new int[]{1, 3}, new double[]{1.0, 7.0})),
- RowFactory.create(Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0)),
- RowFactory.create(Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0))
- );
- StructType schema = new StructType(new StructField[]{
- new StructField("features", new VectorUDT(), false, Metadata.empty()),
- });
- Dataset<Row> df = spark.createDataFrame(data, schema);
- PCAModel pca = new PCA()
- .setInputCol("features")
- .setOutputCol("pcaFeatures")
- .setK(3)
- .fit(df);
- Dataset<Row> result = pca.transform(df).select("pcaFeatures");
- result.show();
Python:
- from pyspark.ml.feature import PCA
- from pyspark.ml.linalg import Vectors
- data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
- (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
- (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
- df = spark.createDataFrame(data, ["features"])
- pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
- model = pca.fit(df)
- result = model.transform(df).select("pcaFeatures")
- result.show(truncate=False)
PolynomialExpansion
Algorithm overview:
Polynomial expansion expands the original features into a polynomial space formed by the n-degree combinations of the original dimensions. The example below shows how to expand a feature vector into a 3-degree polynomial space.
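The expansion can be sketched in plain Python: for each degree from 1 up to n, take every multiset of features of that size and multiply it out. This illustrates which terms are produced (the ordering here is arbitrary and need not match Spark's output order):

```python
from itertools import combinations_with_replacement
from math import prod

def poly_expand(features, degree=3):
    """All products of 1..degree features with repetition: x, y, x^2, xy, ..., y^3."""
    out = []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(range(len(features)), d):
            out.append(prod(features[i] for i in combo))
    return out

expanded = poly_expand([-2.0, 2.3], degree=3)
print(len(expanded))  # 9 terms: x, y, x^2, xy, y^2, x^3, x^2*y, x*y^2, y^3
```

For 2 input features and degree 3 this yields 9 expanded features, matching the dimensionality of Spark's output in the example below.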
Example usage:
Scala:
- import org.apache.spark.ml.feature.PolynomialExpansion
- import org.apache.spark.ml.linalg.Vectors
- val data = Array(
- Vectors.dense(-2.0, 2.3),
- Vectors.dense(0.0, 0.0),
- Vectors.dense(0.6, -1.1)
- )
- val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
- val polynomialExpansion = new PolynomialExpansion()
- .setInputCol("features")
- .setOutputCol("polyFeatures")
- .setDegree(3)
- val polyDF = polynomialExpansion.transform(df)
- polyDF.select("polyFeatures").take(3).foreach(println)
Java:
- import java.util.Arrays;
- import java.util.List;
- import org.apache.spark.ml.feature.PolynomialExpansion;
- import org.apache.spark.ml.linalg.VectorUDT;
- import org.apache.spark.ml.linalg.Vectors;
- import org.apache.spark.sql.Dataset;
- import org.apache.spark.sql.Row;
- import org.apache.spark.sql.RowFactory;
- import org.apache.spark.sql.types.Metadata;
- import org.apache.spark.sql.types.StructField;
- import org.apache.spark.sql.types.StructType;
- PolynomialExpansion polyExpansion = new PolynomialExpansion()
- .setInputCol("features")
- .setOutputCol("polyFeatures")
- .setDegree(3);
- List<Row> data = Arrays.asList(
- RowFactory.create(Vectors.dense(-2.0, 2.3)),
- RowFactory.create(Vectors.dense(0.0, 0.0)),
- RowFactory.create(Vectors.dense(0.6, -1.1))
- );
- StructType schema = new StructType(new StructField[]{
- new StructField("features", new VectorUDT(), false, Metadata.empty()),
- });
- Dataset<Row> df = spark.createDataFrame(data, schema);
- Dataset<Row> polyDF = polyExpansion.transform(df);
- List<Row> rows = polyDF.select("polyFeatures").takeAsList(3);
- for (Row r : rows) {
- System.out.println(r.get(0));
- }
Python:
- from pyspark.ml.feature import PolynomialExpansion
- from pyspark.ml.linalg import Vectors
- df = spark\
- .createDataFrame([(Vectors.dense([-2.0, 2.3]),),
- (Vectors.dense([0.0, 0.0]),),
- (Vectors.dense([0.6, -1.1]),)],
- ["features"])
- px = PolynomialExpansion(degree=3, inputCol="features", outputCol="polyFeatures")
- polyDF = px.transform(df)
- for expanded in polyDF.select("polyFeatures").take(3):
- print(expanded)
Discrete Cosine Transform (DCT)
Algorithm overview:
The Discrete Cosine Transform (DCT) is a Fourier-related transform similar to the discrete Fourier transform (DFT), but using only real numbers. A DCT is equivalent to a DFT of roughly twice the length operating on a real-even function (since the Fourier transform of a real and even function is itself real and even). DCTs are widely used in signal and image processing, in particular for lossy compression of signals and images (both still and moving).
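The transform itself can be written down directly. The sketch below implements the DCT-II with orthonormal scaling (the common normalization; it illustrates the math, not the Spark transformer's exact output convention):

```python
from math import cos, pi, sqrt

def dct2(x):
    """Orthonormal DCT-II of a real sequence -- real arithmetic only, unlike the DFT."""
    N = len(x)
    out = []
    for k in range(N):
        s = sum(x[n] * cos(pi * (n + 0.5) * k / N) for n in range(N))
        scale = sqrt(1.0 / N) if k == 0 else sqrt(2.0 / N)
        out.append(scale * s)
    return out

# A constant signal puts all of its energy in the k = 0 (DC) coefficient:
print(dct2([1.0, 1.0, 1.0, 1.0]))
```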
Example usage:
Scala:
- import org.apache.spark.ml.feature.DCT
- import org.apache.spark.ml.linalg.Vectors
- val data = Seq(
- Vectors.dense(0.0, 1.0, -2.0, 3.0),
- Vectors.dense(-1.0, 2.0, 4.0, -7.0),
- Vectors.dense(14.0, -2.0, -5.0, 1.0))
- val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
- val dct = new DCT()
- .setInputCol("features")
- .setOutputCol("featuresDCT")
- .setInverse(false)
- val dctDf = dct.transform(df)
- dctDf.select("featuresDCT").show(3)
Java:
- import java.util.Arrays;
- import java.util.List;
- import org.apache.spark.ml.feature.DCT;
- import org.apache.spark.ml.linalg.VectorUDT;
- import org.apache.spark.ml.linalg.Vectors;
- import org.apache.spark.sql.Row;
- import org.apache.spark.sql.RowFactory;
- import org.apache.spark.sql.types.Metadata;
- import org.apache.spark.sql.types.StructField;
- import org.apache.spark.sql.types.StructType;
- List<Row> data = Arrays.asList(
- RowFactory.create(Vectors.dense(0.0, 1.0, -2.0, 3.0)),
- RowFactory.create(Vectors.dense(-1.0, 2.0, 4.0, -7.0)),
- RowFactory.create(Vectors.dense(14.0, -2.0, -5.0, 1.0))
- );
- StructType schema = new StructType(new StructField[]{
- new StructField("features", new VectorUDT(), false, Metadata.empty()),
- });
- Dataset<Row> df = spark.createDataFrame(data, schema);
- DCT dct = new DCT()
- .setInputCol("features")
- .setOutputCol("featuresDCT")
- .setInverse(false);
- Dataset<Row> dctDf = dct.transform(df);
- dctDf.select("featuresDCT").show(3);
Python:
- from pyspark.ml.feature import DCT
- from pyspark.ml.linalg import Vectors
- df = spark.createDataFrame([
- (Vectors.dense([0.0, 1.0, -2.0, 3.0]),),
- (Vectors.dense([-1.0, 2.0, 4.0, -7.0]),),
- (Vectors.dense([14.0, -2.0, -5.0, 1.0]),)], ["features"])
- dct = DCT(inverse=False, inputCol="features", outputCol="featuresDCT")
- dctDf = dct.transform(df)
- for dcts in dctDf.select("featuresDCT").take(3):
- print(dcts)
StringIndexer
Algorithm overview:
StringIndexer encodes a column of string labels into a column of label indices. The indices are in [0, numLabels), ordered by label frequency, so the most frequent label gets index 0. If the input column is numeric, it is first cast to string and the string values are then indexed. When downstream pipeline components need to use this string-indexed label, you must set the component's input column to the string-indexed column name.
Example:
Suppose we have a DataFrame with columns id and category:
id | category
----|----------
0 | a
1 | b
2 | c
3 | a
4 | a
5 | c
category is a string column with 3 distinct values. Applying StringIndexer with category as the input column and categoryIndex as the output column gives:
 id | category | categoryIndex
----|----------|---------------
 0  | a        | 0.0
 1  | b        | 2.0
 2  | c        | 1.0
 3  | a        | 0.0
 4  | a        | 0.0
 5  | c        | 1.0
Additionally, when transforming new data that contains labels not seen during fitting, StringIndexer will either throw an exception (the default behavior) or skip the rows containing the unseen labels.
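The frequency-ordered indexing can be sketched in plain Python. This illustrates the fitting logic on the example data above, not the Spark API; the alphabetical tie-break is an assumption for determinism:

```python
from collections import Counter

def fit_string_indexer(labels):
    """Assign index 0 to the most frequent label, 1 to the next, and so on."""
    freq = Counter(labels)
    # Sort by descending frequency; ties broken alphabetically (an assumption here).
    ordered = sorted(freq, key=lambda lab: (-freq[lab], lab))
    return {lab: float(i) for i, lab in enumerate(ordered)}

labels = ["a", "b", "c", "a", "a", "c"]
index = fit_string_indexer(labels)
print([index[lab] for lab in labels])  # [0.0, 2.0, 1.0, 0.0, 0.0, 1.0]
```

"a" occurs 3 times and gets 0.0, "c" twice and gets 1.0, "b" once and gets 2.0, matching the table above.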
Example usage:
Scala:
- import org.apache.spark.ml.feature.StringIndexer
- val df = spark.createDataFrame(
- Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"))
- ).toDF("id", "category")
- val indexer = new StringIndexer()
- .setInputCol("category")
- .setOutputCol("categoryIndex")
- val indexed = indexer.fit(df).transform(df)
- indexed.show()
Java:
- import java.util.Arrays;
- import java.util.List;
- import org.apache.spark.ml.feature.StringIndexer;
- import org.apache.spark.sql.Dataset;
- import org.apache.spark.sql.Row;
- import org.apache.spark.sql.RowFactory;
- import org.apache.spark.sql.types.StructField;
- import org.apache.spark.sql.types.StructType;
- import static org.apache.spark.sql.types.DataTypes.*;
- List<Row> data = Arrays.asList(
- RowFactory.create(0, "a"),
- RowFactory.create(1, "b"),
- RowFactory.create(2, "c"),
- RowFactory.create(3, "a"),
- RowFactory.create(4, "a"),
- RowFactory.create(5, "c")
- );
- StructType schema = new StructType(new StructField[]{
- createStructField("id", IntegerType, false),
- createStructField("category", StringType, false)
- });
- Dataset<Row> df = spark.createDataFrame(data, schema);
- StringIndexer indexer = new StringIndexer()
- .setInputCol("category")
- .setOutputCol("categoryIndex");
- Dataset<Row> indexed = indexer.fit(df).transform(df);
- indexed.show();
Python:
- from pyspark.ml.feature import StringIndexer
- df = spark.createDataFrame(
- [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
- ["id", "category"])
- indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
- indexed = indexer.fit(df).transform(df)
- indexed.show()
IndexToString
Algorithm overview:
Symmetrically to StringIndexer, IndexToString maps a column of label indices back to the original string labels. A common use case is to produce indices from labels with StringIndexer, train a model with those indices, and then retrieve the original label strings from the prediction column with IndexToString.
Example:
Suppose we have a DataFrame with columns id and categoryIndex:
id | categoryIndex
----|---------------
0 | 0.0
1 | 2.0
2 | 1.0
3 | 0.0
4 | 0.0
5 | 1.0
Applying IndexToString with categoryIndex as the input column and originalCategory as the output column, we can retrieve the original labels:
 id | categoryIndex | originalCategory
----|---------------|------------------
 0  | 0.0           | a
 1  | 2.0           | b
 2  | 1.0           | c
 3  | 0.0           | a
 4  | 0.0           | a
 5  | 1.0           | c
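The reverse mapping is just a lookup into the label list ordered by index. A plain-Python sketch using the mapping from the StringIndexer example (not the Spark API):

```python
# Position i holds the label that was assigned index i: a -> 0.0, c -> 1.0, b -> 2.0.
labels_by_index = ["a", "c", "b"]

def index_to_string(indices):
    """Map each numeric index back to its original string label."""
    return [labels_by_index[int(i)] for i in indices]

print(index_to_string([0.0, 2.0, 1.0, 0.0, 0.0, 1.0]))
# ['a', 'b', 'c', 'a', 'a', 'c']
```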
Example usage:
Scala:
- import org.apache.spark.ml.feature.{IndexToString, StringIndexer}
- val df = spark.createDataFrame(Seq(
- (0, "a"),
- (1, "b"),
- (2, "c"),
- (3, "a"),
- (4, "a"),
- (5, "c")
- )).toDF("id", "category")
- val indexer = new StringIndexer()
- .setInputCol("category")
- .setOutputCol("categoryIndex")
- .fit(df)
- val indexed = indexer.transform(df)
- val converter = new IndexToString()
- .setInputCol("categoryIndex")
- .setOutputCol("originalCategory")
- val converted = converter.transform(indexed)
- converted.select("id", "originalCategory").show()
Java:
- import java.util.Arrays;
- import java.util.List;
- import org.apache.spark.ml.feature.IndexToString;
- import org.apache.spark.ml.feature.StringIndexer;
- import org.apache.spark.ml.feature.StringIndexerModel;
- import org.apache.spark.sql.Row;
- import org.apache.spark.sql.RowFactory;
- import org.apache.spark.sql.types.DataTypes;
- import org.apache.spark.sql.types.Metadata;
- import org.apache.spark.sql.types.StructField;
- import org.apache.spark.sql.types.StructType;
- List<Row> data = Arrays.asList(
- RowFactory.create(0, "a"),
- RowFactory.create(1, "b"),
- RowFactory.create(2, "c"),
- RowFactory.create(3, "a"),
- RowFactory.create(4, "a"),
- RowFactory.create(5, "c")
- );
- StructType schema = new StructType(new StructField[]{
- new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
- new StructField("category", DataTypes.StringType, false, Metadata.empty())
- });
- Dataset<Row> df = spark.createDataFrame(data, schema);
- StringIndexerModel indexer = new StringIndexer()
- .setInputCol("category")
- .setOutputCol("categoryIndex")
- .fit(df);
- Dataset<Row> indexed = indexer.transform(df);
- IndexToString converter = new IndexToString()
- .setInputCol("categoryIndex")
- .setOutputCol("originalCategory");
- Dataset<Row> converted = converter.transform(indexed);
- converted.select("id", "originalCategory").show();
Python:
- from pyspark.ml.feature import IndexToString, StringIndexer
- df = spark.createDataFrame(
- [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
- ["id", "category"])
- stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
- model = stringIndexer.fit(df)
- indexed = model.transform(df)
- converter = IndexToString(inputCol="categoryIndex", outputCol="originalCategory")
- converted = converter.transform(indexed)
- converted.select("id", "originalCategory").show()
OneHotEncoder
Algorithm overview:
One-hot encoding maps a column of label indices to a column of binary vectors, each with at most a single one-value. This encoding allows algorithms that expect continuous features, such as logistic regression, to make use of categorical features.
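The encoding can be sketched in plain Python. Note the dropLast behavior, which the Python example below sets explicitly: by default the last category is dropped, so it is encoded as the all-zeros vector (a sketch of the logic, not the Spark API):

```python
def one_hot(index, num_categories, drop_last=True):
    """Binary vector with a single 1.0 at `index`; optionally drop the last category."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if int(index) < size:          # the dropped last category encodes as all zeros
        vec[int(index)] = 1.0
    return vec

# 3 categories (a -> 0, c -> 1, b -> 2); with drop_last the last index is all zeros:
print(one_hot(0, 3))                  # [1.0, 0.0]
print(one_hot(2, 3))                  # [0.0, 0.0]
print(one_hot(2, 3, drop_last=False)) # [0.0, 0.0, 1.0]
```

Dropping the last category keeps the vectors linearly independent of the intercept term, which some linear models require.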
Example usage:
Scala:
- import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
- val df = spark.createDataFrame(Seq(
- (0, "a"),
- (1, "b"),
- (2, "c"),
- (3, "a"),
- (4, "a"),
- (5, "c")
- )).toDF("id", "category")
- val indexer = new StringIndexer()
- .setInputCol("category")
- .setOutputCol("categoryIndex")
- .fit(df)
- val indexed = indexer.transform(df)
- val encoder = new OneHotEncoder()
- .setInputCol("categoryIndex")
- .setOutputCol("categoryVec")
- val encoded = encoder.transform(indexed)
- encoded.select("id", "categoryVec").show()
Java:
- import java.util.Arrays;
- import java.util.List;
- import org.apache.spark.ml.feature.OneHotEncoder;
- import org.apache.spark.ml.feature.StringIndexer;
- import org.apache.spark.ml.feature.StringIndexerModel;
- import org.apache.spark.sql.Dataset;
- import org.apache.spark.sql.Row;
- import org.apache.spark.sql.RowFactory;
- import org.apache.spark.sql.types.DataTypes;
- import org.apache.spark.sql.types.Metadata;
- import org.apache.spark.sql.types.StructField;
- import org.apache.spark.sql.types.StructType;
- List<Row> data = Arrays.asList(
- RowFactory.create(0, "a"),
- RowFactory.create(1, "b"),
- RowFactory.create(2, "c"),
- RowFactory.create(3, "a"),
- RowFactory.create(4, "a"),
- RowFactory.create(5, "c")
- );
- StructType schema = new StructType(new StructField[]{
- new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
- new StructField("category", DataTypes.StringType, false, Metadata.empty())
- });
- Dataset<Row> df = spark.createDataFrame(data, schema);
- StringIndexerModel indexer = new StringIndexer()
- .setInputCol("category")
- .setOutputCol("categoryIndex")
- .fit(df);
- Dataset<Row> indexed = indexer.transform(df);
- OneHotEncoder encoder = new OneHotEncoder()
- .setInputCol("categoryIndex")
- .setOutputCol("categoryVec");
- Dataset<Row> encoded = encoder.transform(indexed);
- encoded.select("id", "categoryVec").show();
Python:
- from pyspark.ml.feature import OneHotEncoder, StringIndexer
- df = spark.createDataFrame([
- (0, "a"),
- (1, "b"),
- (2, "c"),
- (3, "a"),
- (4, "a"),
- (5, "c")
- ], ["id", "category"])
- stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
- model = stringIndexer.fit(df)
- indexed = model.transform(df)
- encoder = OneHotEncoder(dropLast=False, inputCol="categoryIndex", outputCol="categoryVec")
- encoded = encoder.transform(indexed)
- encoded.select("id", "categoryVec").show()
Source: https://blog.csdn.net/liulingyuan6/article/details/53397780?locationNum=3&fps=1