Two Ways to Convert an RDD to a DataFrame

Copyright notice: feel free to repost without asking permission, but please include a link to the original post, thanks (づ ̄3 ̄)づ╭❤~
https://blog.csdn.net/xiaoduan_/article/details/79809225


Inferring the Schema Using Reflection

The first method uses reflection to infer the schema of an RDD that contains specific types of objects. This reflection-based approach leads to more concise code and works well when you already know the schema while writing your Spark application.

In other words, you already know the type of every field of the RDD before the job runs.

Here is the example from the official documentation:

// For implicit conversions from RDDs to DataFrames
import spark.implicits._

// The case class that defines the schema; the official docs define it as:
case class Person(name: String, age: Long)

// Create an RDD of Person objects from a text file, convert it to a DataFrame
val peopleDF = spark.sparkContext
  .textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
  .toDF()
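
Once converted, the DataFrame can be registered as a temporary view and queried with SQL; this is how the official example continues:

// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people")

// SQL statements can be run by using the sql methods provided by Spark
val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19")
teenagersDF.show()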

Here is my own implementation:

package com.anthony.spark

import org.apache.spark.sql.SparkSession

/**
  * @ Description:
  * @ Date: Created in 20:44 2018/3/29
  * @ Author: Anthony_Duan
  */
object DataFrameCase {

  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder().appName("DataFrameCase").master("local[2]").getOrCreate()

    val rdd = spark.sparkContext.textFile("file:///Users/duanjiaxing/data/student.data")

    // Note: the implicit conversions must be imported
    import spark.implicits._
    val studentDF = rdd.map(_.split("\\|")).map(line => Student(line(0).toInt, line(1), line(2), line(3))).toDF()

    studentDF.show()

    spark.stop()

  }
  case class Student(id: Int, name: String, phone: String, email: String)

}
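
For reference, the split("\\|") above assumes a pipe-delimited input file. A hypothetical line of student.data might look like this (the values are invented for illustration):

1|Anthony|13800000000|anthony@example.com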

Programmatically Specifying the Schema

When case classes cannot be defined ahead of time (for example, the structure of records is encoded in a string, or a text dataset will be parsed and fields will be projected differently for different users), a DataFrame can be created programmatically with three steps. This lets you construct Datasets when the columns and their types are not known until runtime.

In other words, the types of many of the fields are only known at runtime.

This method takes three steps:
1. Create an RDD of Rows from the original RDD.
2. Create the schema, represented by a StructType, matching the structure of the Rows in the RDD created in step 1.
3. Apply the schema to the RDD of Rows via the createDataFrame method provided by SparkSession.

The example from the official documentation:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Create an RDD
val peopleRDD = spark.sparkContext.textFile("examples/src/main/resources/people.txt")

// The schema is encoded in a string
val schemaString = "name age"

// Generate the schema based on the string of schema
val fields = schemaString.split(" ")
  .map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)

// Convert records of the RDD (people) to Rows
val rowRDD = peopleRDD
  .map(_.split(","))
  .map(attributes => Row(attributes(0), attributes(1).trim))

// Apply the schema to the RDD
val peopleDF = spark.createDataFrame(rowRDD, schema)
peopleDF.show()
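
As an aside, newer Spark versions can also build the same schema from a DDL string rather than assembling StructField objects by hand. A minimal sketch, assuming StructType.fromDDL is available in your Spark version:

import org.apache.spark.sql.types.StructType

// Build the schema from a DDL-style column list instead of StructField objects
val schemaFromDDL = StructType.fromDDL("name STRING, age STRING")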

My own example below demonstrates both conversion methods.
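
Both methods read infos.txt, which I assume to be a comma-delimited file with the columns id, name, and age; a hypothetical line (values invented for illustration):

1,zhangsan,35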

package com.anthony.spark

import org.apache.spark.sql.types.{StringType, IntegerType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}

/**
  * Interoperating between DataFrames and RDDs
  */
object DataFrameRDDApp {

  def main(args: Array[String]) {

    val spark = SparkSession.builder().appName("DataFrameRDDApp").master("local[2]").getOrCreate()

    inferReflection(spark)

    program(spark)

    spark.stop()
  }
  /**
    * Convert an RDD to a DataFrame programmatically
    * @param spark
    */
  def program(spark: SparkSession): Unit = {
    // RDD ==> DataFrame
    val rdd = spark.sparkContext.textFile("file:///Users/duanjiaxing/data/infos.txt")

    val infoRDD = rdd.map(_.split(",")).map(line => Row(line(0).toInt, line(1), line(2).toInt))

    val structType = StructType(Array(StructField("id", IntegerType, nullable = true),
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true)))

    val infoDF = spark.createDataFrame(infoRDD, structType)
    infoDF.printSchema()
    infoDF.show()


    // Operate on the DataFrame via its API
    infoDF.filter(infoDF.col("age") > 30).show

    // Operate via SQL
    infoDF.createOrReplaceTempView("infos")
    spark.sql("select * from infos where age > 30").show()
  }


  /**
    * Convert an RDD to a DataFrame using reflection
    * @param spark
    */
  def inferReflection(spark: SparkSession) {
    // RDD ==> DataFrame
    val rdd = spark.sparkContext.textFile("file:///Users/duanjiaxing/data/infos.txt")

    // Note: the implicit conversions must be imported
    import spark.implicits._
    val infoDF = rdd.map(_.split(",")).map(line => Info(line(0).toInt, line(1), line(2).toInt)).toDF()

    infoDF.show()

    infoDF.filter(infoDF.col("age") > 30).show

    infoDF.createOrReplaceTempView("infos")
    spark.sql("select * from infos where age > 30").show()
  }

  case class Info(id: Int, name: String, age: Int)

}
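
Since this example is about interoperating in both directions, it is worth noting that going back is a one-liner: every DataFrame exposes its underlying RDD[Row] through .rdd. A minimal sketch, assuming infoDF is in scope as inside the methods above:

// DataFrame ==> RDD: .rdd returns the underlying RDD[Row]
val backToRDD = infoDF.rdd
backToRDD.map(row => row.getAs[String]("name")).collect().foreach(println)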