Some PySpark errors

2020-04-20 14:30

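1. When the feature columns have not yet been combined into a single "features" column, the tool that combines them is VectorAssembler, not StringIndexer. StringIndexer's inputCol takes one column name, not a list.

Failing statement: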

indexer2 = StringIndexer(inputCol=new_columns_names[1:], outputCol='features')

Error:

TypeError: Invalid param value given for param "inputCol". Could not convert <class 'list'> to string type

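2. When all of your feature columns hold continuous values, do not use StringIndexer or VectorIndexer. VectorAssembler alone is enough: it merges all the feature columns into the outputCol, "features".

Failing statements: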

# None of the following is needed
old_columns_names = data.columns
new_columns_names = [name + '-new' for name in old_columns_names]
for i in range(len(old_columns_names)):
    indexer = StringIndexer(inputCol=old_columns_names[i], outputCol=new_columns_names[i])
    # or: indexer = VectorIndexer(inputCol=old_columns_names[i], outputCol=new_columns_names[i], maxCategories=5)
    data = indexer.fit(data).transform(data)

Error (the same message can also mean that your maxBins really is too small and needs to be raised):

'requirement failed: DecisionTree requires maxBins (= 100) to be at least as large as the number of values in each categorical feature, but categorical feature 18 has 7815 values. Consider removing this and other categorical features with a large number of values, or add more training examples.'

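3. In Spark ML, make sure the label column comes first (index 0) and the assembled features column second (index 1). The error below means that the column read as the label contained a non-integer value (1277.4), which can happen when a continuous feature column is picked up in the label's place.

Error: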

'requirement failed: Classifier found max label value = 1277.4 but requires integers in range [0, ... 2147483647)'
