1. Create a DataFrame from a list of tuples:
df = sc.parallelize([(1, 2.0), (2, 3.0)]).toDF(["x", "y"])
df2 = spark.createDataFrame([(101, 1, 16), (102, 2, 13)], ['ID', 'A', 'B'])
The resulting Spark DataFrames:
df.show()
+---+---+
| x| y|
+---+---+
| 1|2.0|
| 2|3.0|
+---+---+
df2.show()
+---+---+---+
| ID| A| B|
+---+---+---+
|101| 1| 16|
|102| 2| 13|
+---+---+---+
2. Create a Spark DataFrame from a list of dicts:
Given data like the following:
mylist = [
    {"type_activity_id": 1, "type_activity_name": "xxx"},
    {"type_activity_id": 2, "type_activity_name": "yyy"},
    {"type_activity_id": 3, "type_activity_name": "zzz"}
]
there are two approaches.
Approach 1 is the simplest; it still works for now, but inferring the schema from a list of dicts is deprecated:
df = spark.createDataFrame(mylist)
df.show()
+----------------+------------------+
|type_activity_id|type_activity_name|
+----------------+------------------+
| 1| xxx|
| 2| yyy|
| 3| zzz|
+----------------+------------------+
If you are worried it will stop working in a future release, use approach 2:
from pyspark.sql import Row
df2 = spark.createDataFrame(Row(**x) for x in mylist)
df2.show(truncate=False)
+----------------+------------------+
|type_activity_id|type_activity_name|
+----------------+------------------+
|1 |xxx |
|2 |yyy |
|3 |zzz |
+----------------+------------------+
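One caveat with approach 2: Row(**x) builds a Row from each dict's keys, so if the dicts do not all share the same key set, the inferred schema can mismatch across rows. A minimal pure-Python normalization sketch (the helper name normalize_records is mine, not a pyspark API):

```python
def normalize_records(records):
    # Row(**d) with inconsistent key sets can lead to schema mismatches,
    # so collect the union of all keys and fill the gaps with None
    keys = sorted({k for d in records for k in d})
    return [{k: d.get(k) for k in keys} for d in records]

mixed = [
    {"type_activity_id": 1},
    {"type_activity_id": 2, "type_activity_name": "yyy"},
]
print(normalize_records(mixed))
```

The normalized dicts can then be fed through Row(**d) exactly as in the example above.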
3. Convert a Pandas DataFrame to a Spark DataFrame
Under Python 2.7 (the Python version used by pyspark here), converting a DataFrame that contains Chinese text directly produces mojibake. The following workaround can be tried:
# Python 2 only: reload(sys) and setdefaultencoding do not exist in Python 3
import sys
reload(sys)
sys.setdefaultencoding('utf8')

def pandas_to_spark(pandas_df):
    for column in pandas_df:
        # only object (byte-string) columns need decoding; numeric columns pass through
        if pandas_df[column].dtype == object:
            pandas_df[column] = pandas_df[column].str.decode('utf-8')
    return spark.createDataFrame(pandas_df)
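Under Python 3 (used by current pyspark releases), setdefaultencoding no longer exists and pandas stores text as unicode str, so spark.createDataFrame(pandas_df) usually works directly. If a column still holds raw bytes, decoding each value before the conversion is enough; a minimal sketch (the helper name to_unicode is an assumption, not a pandas/pyspark API):

```python
def to_unicode(value, encoding="utf-8"):
    # bytes -> str; anything already decoded (or non-text) passes through
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# applied column-wise before handing the frame to Spark, e.g.:
# pandas_df["name"] = pandas_df["name"].map(to_unicode)
print(to_unicode(b"\xe4\xb8\xad\xe6\x96\x87"))  # these are the UTF-8 bytes of two Chinese characters
```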