1. pandas DataFrame
Reference notebook:
https://nbviewer.jupyter.org/github/lonngxiang/spark_sql_exmple/blob/master/log_pandas.ipynb
2. PySpark SQL DataFrame
Reference:
https://nbviewer.jupyter.org/github/lonngxiang/spark_sql_exmple/blob/master/log_imooc1.ipynb
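Additional notes:
agg: aggregate functions operate on columns.
foreach on a SQL DataFrame operates row by row:
# foreach returns None and discards the lambda's result, so do the work inside it
df.foreach(lambda row: print(row.age))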
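Assigning a new column on a SQL DataFrame, similar to pandas apply, requires a udf:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def fc(a):
    return a + 1

# Assuming age is an integer column: the declared return type must match what fc
# returns. With StringType() the integer result can come back as null.
fc_udf = udf(fc, IntegerType())
data = df_rdd2.withColumn('value2', fc_udf('age'))
data.show()  # show() prints the table itself and returns None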
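For comparison, the pandas side of the analogy might look like the sketch below. The sample frame and its column names are assumptions chosen to mirror the Spark snippet. In pandas, a plain Python function can be passed to apply directly, with no udf wrapper and no declared return type.

```python
import pandas as pd

# Hypothetical sample data mirroring the 'age' column used in the Spark example
df = pd.DataFrame({"name": ["Ann", "Bob"], "age": [30, 25]})

# Series.apply calls the function on each value and returns a new Series,
# which is assigned as the new column
df["value2"] = df["age"].apply(lambda a: a + 1)

print(df)
```

For simple arithmetic like this, the vectorized form df["age"] + 1 gives the same result in pandas, just as the column expression df_rdd2['age'] + 1 would avoid the udf on the Spark side.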