An error occurred when using VectorAssembler from pyspark:
vectorAssembler = ft.VectorAssembler(inputCols=['cust_sex','cust_age'],outputCol='features')
Check the input data types:
df1.printSchema()
The fields listed in inputCols turn out to be of type string, while VectorAssembler only accepts numeric types (float or int).
So convert the types first:
df1 = df1.withColumn('device_number', df1.device_number.astype("int"))
df1 = df1.withColumn('cust_sex', df1.cust_sex.astype("int"))
df1 = df1.withColumn('cust_age', df1.cust_age.astype("int"))
Then run:
ft.VectorAssembler(inputCols=['cust_sex','cust_age'],outputCol='features',handleInvalid='keep').transform(df1).show()
This succeeds. Note also that if the source columns contain null, handleInvalid must be set to 'keep' or "skip"; otherwise (with the default 'error') the job fails with:
Caused by: org.apache.spark.SparkException: Encountered null while assembling a row with handleInvalid = "error". Consider removing nulls from dataset or using handleInvalid = "keep" or "skip".