Creating an RDD
1. From an in-memory collection: sc.parallelize(xxx)
e.g. val testrdd = sc.parallelize(Seq((1, Array("1.0"), 3), (2, Array("2.0"), 6), (3, Array("3.0"), 7), (1, Array("3.0"), 7)))
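parallelize also accepts an optional numSlices argument that controls how many partitions the local collection is split into. A minimal sketch, assuming an active SparkContext named sc:

```scala
// Distribute a local range across 4 partitions
val nums = sc.parallelize(1 to 100, 4)

println(nums.getNumPartitions) // 4
println(nums.sum())            // sum of 1..100
```

Choosing a sensible partition count matters once the RDD feeds into wide operations such as joins or groupBy.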
2. From a file
Reading a text file:
import org.apache.spark.sql.Row  // Row lives in the Spark SQL package
val citylevel = sc.textFile(HDFS_PATH)
  .map(_.split(","))
  .map(attributes => Row(attributes(0).trim, attributes(1).trim))
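sc.textFile also takes a minPartitions hint, and take lets you inspect a few parsed rows without collecting the whole file to the driver. A sketch, assuming HDFS_PATH points at a CSV of comma-separated lines:

```scala
import org.apache.spark.sql.Row

// Ask for at least 8 input partitions when reading the file
val lines = sc.textFile(HDFS_PATH, 8)

// Peek at the first parsed rows; take(5) only pulls 5 elements to the driver
lines.map(_.split(","))
  .map(attributes => Row(attributes(0).trim, attributes(1).trim))
  .take(5)
  .foreach(println)
```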
Creating a DataFrame
1. From an in-memory collection
import spark.implicits._  // required for the toDF method
val test_df = Seq((1, Array("1.0"), 3), (2, Array("2.0"), 6), (3, Array("3.0"), 7), (1, Array("3.0"), 7)).toDF("imei", "feature", "id")
2. From an RDD: rdd.toDF(xxx)
e.g.:
import spark.implicits._
val testrdd = sc.parallelize(Seq((1, Array("1.0"), 3), (2, Array("2.0"), 6), (3, Array("3.0"), 7), (1, Array("3.0"), 7)))
val testDF = testrdd.toDF("id", "score", "imei")
3. From a file
(1) Parquet files: val parquetFileDF = spark.read.parquet(HDFS_PATH)
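Parquet files carry their own schema, so no column names need to be supplied on read, and writing back out is the mirror operation. A sketch, where OUT_PATH is a hypothetical writable HDFS directory:

```scala
// Schema and column names come from the Parquet metadata itself
val parquetFileDF = spark.read.parquet(HDFS_PATH)
parquetFileDF.printSchema()

// Writing is symmetrical; "overwrite" replaces any existing output
// (OUT_PATH is an assumed path, not from the original notes)
parquetFileDF.write.mode("overwrite").parquet(OUT_PATH)
```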
(2) Text files: first create an RDD from the file, then convert the RDD to a DataFrame
import spark.implicits._
val citylevel = sc.textFile(HDFS_PATH)
  .map(_.split(","))
  .map(attributes => (attributes(0).trim, attributes(1).trim))  // tuples, not Row: toDF needs an encoder to infer the schema
val cityDF = citylevel.toDF("cityid", "citylevel")
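If the RDD already contains Row objects (as in the RDD example above), or you need explicit column types, spark.createDataFrame with a StructType schema is the usual alternative to toDF. A sketch, assuming the same two-column CSV at HDFS_PATH:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Explicit schema: two nullable string columns
val schema = StructType(Seq(
  StructField("cityid", StringType, nullable = true),
  StructField("citylevel", StringType, nullable = true)
))

val rowRDD = sc.textFile(HDFS_PATH)
  .map(_.split(","))
  .map(attributes => Row(attributes(0).trim, attributes(1).trim))

// createDataFrame accepts an RDD[Row] plus a schema, which plain toDF cannot
val cityDF = spark.createDataFrame(rowRDD, schema)
```

This route keeps full control over column names, types, and nullability, at the cost of writing the schema by hand.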