【PySpark】Converting a list to a DataFrame raises: TypeError: not supported type: class numpy.float64

DataFrames come up constantly in PySpark. This post walks through a data-type pitfall you can hit when converting a plain Python list into a DataFrame.

Suppose we have the following list:

[(22.31670676205784, 15.00427254361571, 14.274554462639939, -48.011495169271186)]

Normally you would write:

#!/usr/bin/python
# -*- coding: utf-8 -*-
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("yarn") \
    .appName('create_df_test2') \
    .enableHiveSupport() \
    .getOrCreate()

re = [(22.31670676205784, 15.00427254361571, 14.274554462639939, -48.011495169271186)]
print(re)
print(type(re))

df_re = spark.createDataFrame(re, ['r1', 'r2', 'r3', 'r'])

The values in re print like ordinary floats, but they are in fact numpy.float64 objects (typically the output of a NumPy computation), so calling createDataFrame directly fails with an error like: TypeError: not supported type: <class 'numpy.float64'>
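The confusion is easy to understand. A minimal check (assuming NumPy is installed) shows that numpy.float64 prints exactly like, and is even a subclass of, Python's built-in float, yet its concrete type is different, and the concrete type is what PySpark's schema inference looks at:

```python
import numpy as np

v = np.float64(22.31670676205784)
print(v)                     # renders exactly like a plain float
print(type(v))               # <class 'numpy.float64'>
print(isinstance(v, float))  # True: numpy.float64 subclasses float
print(type(v) is float)      # False: the concrete type differs
```

So an `isinstance` check would pass, but schema inference based on the exact type does not, which is why the conversion below is needed.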

The fix is to cast the values explicitly:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("yarn") \
    .appName('create_df_test2') \
    .enableHiveSupport() \
    .getOrCreate()

re = [(22.31670676205784, 15.00427254361571, 14.274554462639939, -48.011495169271186)]
print(re)
print(type(re))

# Cast every numpy.float64 to a native Python float before building the DataFrame
df_re = spark.createDataFrame([(float(tup[0]), float(tup[1]), float(tup[2]), float(tup[3])) for tup in re],
                              ['r1', 'r2', 'r3', 'r'])

This produces the DataFrame as expected.
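If the tuples are wide, or their width varies, spelling out float(tup[0]), float(tup[1]), … gets tedious. A generic cast with map does the same job for any tuple width; here is a sketch (the numpy.float64 elements are simulated locally so the snippet runs without Spark):

```python
import numpy as np

# Simulate a list whose elements came out of a NumPy computation,
# so each value is a numpy.float64 rather than a native float
re = [tuple(np.float64(x) for x in
            (22.31670676205784, 15.00427254361571,
             14.274554462639939, -48.011495169271186))]

# Cast every element to a native Python float, regardless of tuple width
cleaned = [tuple(map(float, tup)) for tup in re]

print(type(cleaned[0][0]))  # <class 'float'>
# cleaned can now be passed to spark.createDataFrame(cleaned, ['r1', 'r2', 'r3', 'r'])
```

The cast is lossless: numpy.float64 and Python's float are both IEEE 754 doubles, so the values are unchanged.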
