PySpark Cheat Sheet: Creating a Spark DataFrame

1. Create a DataFrame from a list of tuples:

df = sc.parallelize([(1, 2.0), (2, 3.0)]).toDF(["x", "y"])
df2 = spark.createDataFrame([(101, 1, 16), (102, 2, 13)], ['ID', 'A', 'B'])

The resulting Spark DataFrames:

df.show()
+---+---+                                                                       
|  x|  y|
+---+---+
|  1|2.0|
|  2|3.0|
+---+---+
df2.show()
+---+---+---+
| ID|  A|  B|
+---+---+---+
|101|  1| 16|
|102|  2| 13|
+---+---+---+

2. Create a Spark DataFrame from a list of dicts:

Given data like the following:

mylist = [
  {"type_activity_id":1,"type_activity_name":"xxx"},
  {"type_activity_id":2,"type_activity_name":"yyy"},
  {"type_activity_id":3,"type_activity_name":"zzz"}
]

There are two approaches:

Method 1 is simple and still works, but inferring a schema from a list of dicts is deprecated:

df = spark.createDataFrame(mylist)
df.show()
+----------------+------------------+
|type_activity_id|type_activity_name|
+----------------+------------------+
|               1|               xxx|
|               2|               yyy|
|               3|               zzz|
+----------------+------------------+

If you are worried it may stop working in a future release, use method 2:

from pyspark.sql import Row
df2 = spark.createDataFrame(Row(**x) for x in mylist)
df2.show(truncate=False)
+----------------+------------------+
|type_activity_id|type_activity_name|
+----------------+------------------+
|1               |xxx               |
|2               |yyy               |
|3               |zzz               |
+----------------+------------------+

See the Stack Overflow answer (not the accepted one) on this question.

3. Converting a Pandas DataFrame to a Spark DataFrame

Originally the Python version under pyspark was 2.7, and converting data containing Chinese text directly produced mojibake. The following workaround can help.

# Python 2 only: reset the default encoding so byte strings decode as UTF-8
import sys
reload(sys)
sys.setdefaultencoding('utf8')

def pandas_to_spark(pandas_df):
    # Decode each byte-string column to unicode before handing it to Spark;
    # this assumes every column holds strings
    for column in pandas_df:
        pandas_df[column] = pandas_df[column].str.decode('utf-8')
    return spark.createDataFrame(pandas_df)
