1. Create a DataFrame from a list of tuples:
df = sc.parallelize([(1, 2.0), (2, 3.0)]).toDF(["x", "y"])
df2 = spark.createDataFrame([(101, 1, 16), (102, 2, 13)], ['ID', 'A', 'B'])
The resulting Spark DataFrames:
df.show()
+---+---+
| x| y|
+---+---+
| 1|2.0|
| 2|3.0|
+---+---+
df2.show()
+---+---+---+
| ID| A| B|
+---+---+---+
|101| 1| 16|
|102| 2| 13|
+---+---+---+
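If you want to control the column types instead of relying on inference, you can also pass an explicit schema. A minimal sketch, assuming the same spark session as above (df3 and the schema variable are illustrative names):
from pyspark.sql.types import StructType, StructField, IntegerType

# Explicit schema: column names plus types (nullable=True)
schema = StructType([
    StructField("ID", IntegerType(), True),
    StructField("A", IntegerType(), True),
    StructField("B", IntegerType(), True),
])
df3 = spark.createDataFrame([(101, 1, 16), (102, 2, 13)], schema)
df3.printSchema()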
2. Create a Spark DataFrame from a list of dicts:
Given data like the following:
mylist = [
    {"type_activity_id": 1, "type_activity_name": "xxx"},
    {"type_activity_id": 2, "type_activity_name": "yyy"},
    {"type_activity_id": 3, "type_activity_name": "zzz"}
]
There are two approaches.
Approach 1 is the simplest; it still works, but inferring the schema from a list of dicts is deprecated:
df = spark.createDataFrame(mylist)
df.show()
+----------------+------------------+
|type_activity_id|type_activity_name|
+----------------+------------------+
| 1| xxx|
| 2| yyy|
| 3| zzz|
+----------------+------------------+
If you are worried that it will stop working in a future version, use approach 2:
from pyspark.sql import Row
df2 = spark.createDataFrame(Row(**x) for x in mylist)
df2.show(truncate=False)
+----------------+------------------+
|type_activity_id|type_activity_name|
+----------------+------------------+
|1 |xxx |
|2 |yyy |
|3 |zzz |
+----------------+------------------+
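A third option is to convert each dict to a tuple and pass an explicit schema, which avoids the deprecation warning entirely and keeps the column order under your control. A sketch, assuming the same spark session and mylist as above (df3 is an illustrative name):
from pyspark.sql.types import StructType, StructField, LongType, StringType

schema = StructType([
    StructField("type_activity_id", LongType(), True),
    StructField("type_activity_name", StringType(), True),
])
# Pull the values out of each dict in a fixed order, then apply the schema
df3 = spark.createDataFrame(
    [(d["type_activity_id"], d["type_activity_name"]) for d in mylist],
    schema,
)
df3.show(truncate=False)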
3. Convert a Pandas DataFrame to a Spark DataFrame
When PySpark runs on Python 2.7, converting a pandas DataFrame that contains Chinese text directly can produce garbled characters. You can try the following workaround.
# Python 2.7 only: force UTF-8 as the default encoding
import sys
reload(sys)
sys.setdefaultencoding('utf8')

def pandas_to_spark(pandas_df):
    # Decode byte-string (object dtype) columns to unicode so Chinese text
    # survives the conversion; leave numeric columns untouched.
    for column in pandas_df:
        if pandas_df[column].dtype == object:
            pandas_df[column] = pandas_df[column].str.decode('utf-8')
    return spark.createDataFrame(pandas_df)
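On Python 3 with newer Spark versions the encoding hack above is unnecessary: string columns containing Chinese text convert cleanly. A sketch, assuming Spark 3.x (on Spark 2.3+ the config key is spark.sql.execution.arrow.enabled instead); enabling Arrow is optional and only speeds up the conversion:
import pandas as pd

# Optional: use Arrow for a faster pandas -> Spark conversion
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

pdf = pd.DataFrame({"ID": [101, 102], "name": [u"你好", u"世界"]})
sdf = spark.createDataFrame(pdf)
sdf.show()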