PySpark feature hashing with FeatureHasher

from pyspark.sql import SparkSession
from pyspark.ml.feature import FeatureHasher

spark = SparkSession \
    .builder \
    .appName("FeatureHasherExample") \
    .getOrCreate()

dataset = spark.createDataFrame([
    (2.2, True, "1", "foo"),
    (3.3, False, "2", "bar"),
    (4.4, False, "3", "baz"),
    (5.5, False, "4", "foo")
], ["real", "bool", "stringNum", "string"])

# numFeatures: number of hash buckets (default 262144)
# categoricalCols: numeric columns to force-treat as categorical (None here)
hasher = FeatureHasher(numFeatures=262144,
                       inputCols=["real", "bool", "stringNum", "string"],
                       outputCol="features",
                       categoricalCols=None)

featurized = hasher.transform(dataset)

# Each input row is encoded as a sparse vector of length 262144 (the number
# of hash buckets). For example, the first row becomes
# (262144,[174475,247670,257907,262126],[2.2,1.0,1.0,1.0]), meaning the
# values at those four indices are [2.2, 1.0, 1.0, 1.0] and every other
# index is 0.
featurized.show(truncate=False)
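To make the mechanism concrete, here is a minimal pure-Python sketch of the hashing trick that FeatureHasher applies: numeric columns are hashed by column name and keep their value, while boolean/string columns are hashed as "column=value" with weight 1.0 (one-hot style). Spark internally uses MurmurHash3; this sketch substitutes hashlib.md5 purely for illustration, so the indices it produces will not match Spark's.

```python
import hashlib

def hash_features(row, num_features=262144):
    """Map a dict of {column: value} to a sparse {index: value} vector.

    Illustrative stand-in for FeatureHasher; not the actual Spark hash.
    """
    vec = {}
    for col, val in row.items():
        if isinstance(val, (int, float)) and not isinstance(val, bool):
            # Numeric column: index derived from the column name,
            # the cell value is kept as the weight.
            key, weight = col, float(val)
        else:
            # Boolean/string column: index derived from "col=value",
            # weight fixed at 1.0 (one-hot style).
            key, weight = f"{col}={val}", 1.0
        idx = int(hashlib.md5(key.encode()).hexdigest(), 16) % num_features
        # Collisions (different keys, same bucket) are summed.
        vec[idx] = vec.get(idx, 0.0) + weight
    return vec

# Same as the first row of the Spark example above.
sparse = hash_features({"real": 2.2, "bool": True, "stringNum": "1", "string": "foo"})
print(sorted(sparse.items()))
```

Note that because the output dimension is fixed, unrelated features can collide into the same bucket; with 262144 buckets and a handful of features this is rare, which is why hashing is a cheap, stateless alternative to building an explicit vocabulary.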