Spark - Learning RegexTokenizer and StopWordsRemover

Stop words are words that should be excluded from the input, usually because they occur very frequently and carry little meaning.
StopWordsRemover takes as input a sequence of strings that has already been tokenized by a Tokenizer or RegexTokenizer; the list of words to drop is specified by its stopWords parameter. If no list is supplied, Spark falls back to its built-in English list.
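
Before the full demo, here is a minimal configuration sketch for a custom stop word list. It uses only MLlib's public StopWordsRemover API (loadDefaultStopWords, setStopWords, setCaseSensitive); the extra word "spark" and the column names are purely illustrative:

import java.util.Arrays;

import org.apache.spark.ml.feature.StopWordsRemover;

public class CustomStopWordsSketch {

    public static void main(String[] args) {
        // Start from the built-in English list that StopWordsRemover ships with.
        String[] defaults = StopWordsRemover.loadDefaultStopWords("english");

        // Append one domain-specific word; "spark" is purely illustrative.
        String[] custom = Arrays.copyOf(defaults, defaults.length + 1);
        custom[defaults.length] = "spark";

        StopWordsRemover remover = new StopWordsRemover()
                .setInputCol("words")       // assumes an upstream tokenized column
                .setOutputCol("filtered")
                .setCaseSensitive(false)    // the default: matching ignores case
                .setStopWords(custom);

        System.out.println("stop words in use: " + remover.getStopWords().length);
    }
}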

import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.RegexTokenizer;
import org.apache.spark.ml.feature.StopWordsRemover;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import scala.collection.mutable.WrappedArray;

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

public class StopWordsRemoverDemo {

    public static void main(String[] args) {
        // Spin up a local SparkSession for the demo.
        final SparkSession spark = SparkSession.builder()
                .master("local")
                .appName("StopWordsRemoverDemo")
                .getOrCreate();

        // Sample sentences; the commas without surrounding spaces give the
        // regex tokenizer something non-trivial to split on.
        final List<Row> data = Arrays.asList(
                RowFactory.create(0, "Tokenization,is the process of enchanting words,from the raw text"),
                RowFactory.create(1, "If you want,to have more advance tokenization,RegexTokenizer,\n" +
                        "is a good option"),
                RowFactory.create(2, "Here,will provide a sample example on how to tockenize sentences"),
                RowFactory.create(3, "This way,you can find all matching occurrences")
        );

        // Schema for the input DataFrame: an integer id plus the raw sentence.
        final StructType schema = new StructType(new StructField[]{
                new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
                new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
        });
        final Dataset<Row> df = spark.createDataFrame(data, schema);

        // Split on runs of non-word characters; gaps=true means the pattern
        // describes the separators between tokens, not the tokens themselves.
        final RegexTokenizer tokenizer = new RegexTokenizer()
                .setInputCol("sentence")
                .setOutputCol("words")
                .setPattern("\\W+")
                .setGaps(true);

        // Register a small UDF that reports how many tokens each row produced.
        spark.udf().register(
                "countTokens",
                (WrappedArray<?> words) -> words.size(),
                DataTypes.IntegerType);

        // Tokenize the sentences and append a token-count column.
        final Dataset<Row> regexTokenized = tokenizer.transform(df)
                .select("id", "sentence", "words")
                .withColumn("tokens", callUDF("countTokens", col("words")));

        // Filter the default English stop words out of the token column.
        final StopWordsRemover remover = new StopWordsRemover()
                .setInputCol("words")
                .setOutputCol("filtered");

        remover.transform(regexTokenized)
                .select("id", "filtered")
                .show(false);
        spark.stop();
    }
}

The output is:

+---+-----------------------------------------------------------+
|id |filtered                                                   |
+---+-----------------------------------------------------------+
|0  |[tokenization, process, enchanting, words, raw, text]      |
|1  |[want, advance, tokenization, regextokenizer, good, option]|
|2  |[provide, sample, example, tockenize, sentences]           |
|3  |[way, find, matching, occurrences]                         |
+---+-----------------------------------------------------------+
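
A note on the tokenizer settings above: with setGaps(true), the pattern "\\W+" describes the separators to split on. RegexTokenizer can also run in token-matching mode, where the pattern describes the tokens to keep. The following sketch uses only RegexTokenizer's public API and should produce the same tokens for this input, since every run of non-word characters here delimits exactly one run of word characters:

import org.apache.spark.ml.feature.RegexTokenizer;

public class TokenMatchingSketch {

    public static void main(String[] args) {
        // gaps=false flips the meaning of the pattern: "\\w+" now matches
        // the tokens to keep instead of the separators to split on.
        final RegexTokenizer matcher = new RegexTokenizer()
                .setInputCol("sentence")
                .setOutputCol("words")
                .setPattern("\\w+")
                .setGaps(false)
                .setToLowercase(true); // the default; tokens come out lowercased

        System.out.println("gaps = " + matcher.getGaps());
    }
}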