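1. While the feature columns have not yet been merged into a single "features" column, the tool for combining them is VectorAssembler, not StringIndexer. StringIndexer's inputCol takes one column name (a string), so passing a list of columns fails.
Offending statement: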
indexer2 = StringIndexer(inputCol=new_columns_names[1:], outputCol='features')
Error:
TypeError: Invalid param value given for param "inputCol". Could not convert <class 'list'> to string type
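2. When all of your feature columns hold continuous values, do not use StringIndexer or VectorIndexer. VectorAssembler alone is enough: merge all feature columns into one outputCol named "features".
Offending statements:
# None of the code below is needed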
old_columns_names = data.columns
new_columns_names = [name + '-new' for name in old_columns_names]
for i in range(len(old_columns_names)):
    indexer = StringIndexer(inputCol=old_columns_names[i], outputCol=new_columns_names[i])
    # or: indexer = VectorIndexer(inputCol=old_columns_names[i], outputCol=new_columns_names[i], maxCategories=5)
    data = indexer.fit(data).transform(data)
Error: (this message can also mean that your maxBins really is set too low and needs to be increased)
'requirement failed: DecisionTree requires maxBins (= 100) to be at least as large as the number of values in each categorical feature, but categorical feature 18 has 7815 values. Considering remove this and other categorical features with a large number of values, or add more training examples.'
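In other words, feature 18 was treated as a categorical feature with 7815 distinct values, which is more than maxBins (= 100). Indexing a continuous column turns each distinct value into its own category, which is how the count gets that large.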
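3. In Spark ML, make sure the label column and the assembled features column are where the model expects them. In my case that meant putting the label in column 0 and the assembled features in column 1. Note that spark.ml estimators pick these columns by name (labelCol and featuresCol, which default to "label" and "features"), so the error below means the column being read as the label contained non-integer values, for example a continuous feature.
Error: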
'requirement failed: Classifier found max label value = 1277.4 but requires integers in range [0, ... 2147483647)'