Reading External Files with Spark DataFrame and Parsing the Data Format

A Spark DataFrame is really just a special case of Dataset, and the DataFrame API applies a lot of optimisation to the SQL execution process. A DataFrame is now about as convenient to use as Pandas in Python, so this post records the process of reading an external file into a DataFrame and parsing it.

type DataFrame = Dataset[Row]

Spark's CSV reader accepts a number of options, for example "inferSchema" controls whether column types are inferred automatically and "header" controls whether the first line is treated as a header. All of the available options and their meanings are listed below; a short sketch after the list shows how the less common ones are passed.

Parameter (default): description

sep (default: ,): sets a single character as a separator for each field and value
encoding (default: UTF-8): decodes the CSV files by the given encoding type
quote (default: "): sets the single character used for escaping quoted values where the separator can be part of the value. If you would like to turn off quotations, you need to set not null but an empty string. This behaviour is different from com.databricks.spark.csv
escape (default: \): sets the single character used for escaping quotes inside an already quoted value
comment (default: disabled): sets the single character used for skipping lines beginning with this character
header (default: false): uses the first line as names of columns
inferSchema (default: false): infers the input schema automatically from data. It requires one extra pass over the data
ignoreLeadingWhiteSpace (default: false): defines whether or not leading whitespaces from values being read should be skipped
ignoreTrailingWhiteSpace (default: false): defines whether or not trailing whitespaces from values being read should be skipped
nullValue (default: empty string): sets the string representation of a null value. Since 2.0.1, this applies to all supported types including the string type
nanValue (default: NaN): sets the string representation of a "non-number" value
positiveInf (default: Inf): sets the string representation of a positive infinity value
negativeInf (default: -Inf): sets the string representation of a negative infinity value
dateFormat (default: yyyy-MM-dd): sets the string that indicates a date format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to date type
timestampFormat (default: yyyy-MM-dd'T'HH:mm:ss.SSSZZ): sets the string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to timestamp type
maxColumns (default: 20480): defines a hard limit of how many columns a record can have
maxCharsPerColumn (default: -1): defines the maximum number of characters allowed for any given value being read; -1 means unlimited length
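Most of these options can be left at their defaults, and they are all passed the same way as sep and header. As a quick sketch only (the file name other_sample.csv, the "NULL" marker and the yyyy/MM/dd date format are assumptions made up for illustration, not part of the sample below):

    // Sketch: passing a few of the less common CSV options.
    // File name, null marker and date format here are hypothetical.
    val otherDF = spark.read.format("csv")
      .option("sep", ",")
      .option("header", "true")
      .option("inferSchema", "true")
      .option("nullValue", "NULL")          // treat the literal string NULL as null
      .option("nanValue", "NaN")            // string that represents a non-number value
      .option("dateFormat", "yyyy/MM/dd")   // custom date format, SimpleDateFormat style
      .load("C:\\Users\\xxx\\Desktop\\other_sample.csv")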

Suppose our sample.csv contains the following data:

item_id,month,total,distinct
2638,201801,1684,142
4120,201801,93,24
949976,201801,46,5
457,201801,4051,98
871603,201801,167,28
317120,201801,61,2

The code to read it:

    val csvDF = spark.read.format("csv")
      .option("sep", ",")
      .option("inferSchema", "true")
      .option("header", "true")
      .load("C:\\Users\\xxx\\Desktop\\sample.csv")
    csvDF.printSchema()

The schema is printed as follows:

root
 |-- item_id: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- total: integer (nullable = true)
 |-- distinct: integer (nullable = true)

If we want to change the type of total and distinct to double, we use withColumn(colName: String, col: Column) and import the types supported by Spark SQL with import org.apache.spark.sql.types._ (IntegerType, DoubleType, StringType, and so on). Because col() is a SQL column function, we also need import org.apache.spark.sql.functions._

    import org.apache.spark.sql.types._
    import org.apache.spark.sql.functions._
    val csvDF = spark.read.format("csv")
      .option("sep", ",")
      .option("inferSchema", "true")
      .option("header", "true")
      .load("C:\\Users\\xxx\\Desktop\\sample.csv")
      .withColumn("total", col("total").cast(DoubleType))        // cast total to double
      .withColumn("distinct", col("distinct").cast(DoubleType))  // cast distinct to double
    csvDF.printSchema()

The schema now becomes:

root
 |-- item_id: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- total: double (nullable = true)
 |-- distinct: double (nullable = true)

The automatically inferred types do not always meet our needs, and in that case a conversion is required. There are two ways to do it: the first is to convert the DataFrame back into an RDD and process it there; the second is to convert directly inside the DataFrame. The timing difference between the two is compared below. We add a case class Person to parse the rows of the DataFrame; Person's field names must match the column names in the data exactly, otherwise you get an error like:

Exception in thread "main" org.apache.spark.sql.AnalysisException:
 cannot resolve '`id`' given input columns: [item_id, month, total, distinct];

How many columns you actually need depends on your data, but the field names have to stay consistent with the column names.
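If it is the case class that cannot be changed, another option is to rename the offending column on the DataFrame side. A minimal sketch, assuming the hypothetical mismatch in the error above where the case class declares id while the data has item_id:

    // Sketch only: rename the DataFrame column so it matches the case class field name.
    val renamedDF = csvDF.withColumnRenamed("item_id", "id")

With the names matching, the full comparison program looks like this: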

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object DataFrameTest {

  case class Person(item_id: String, month: String, total: Double, distinct: Double) {
    def info(): String = {
      "item_id:" + this.item_id + " total:" + total
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local[*]")
      .appName("Spark SQL basic example")
      .config("spark.some.config.option", "some-value")
      .getOrCreate()

    val startTime = System.currentTimeMillis()

    import spark.implicits._
    import org.apache.spark.sql.functions._
    val csvDF = spark.read.format("csv")
      .option("sep", ",")
      .option("inferSchema", "true")
      .option("header", "true")
      .load("C:\\Users\\xxx\\Desktop\\sample_1.csv")
      .withColumn("total", col("total").cast(DoubleType))
      .withColumn("distinct", col("distinct").cast(DoubleType))
    csvDF.printSchema()
    csvDF.as[Person].map(p => p.info())  // Method 1: convert directly on the Dataset
    //    csvDF.rdd.map(x => Person(x.get(0).toString, x.get(1).toString, x.getDouble(2), x.getDouble(3)).info())  // Method 2: convert back to an RDD first

    val endTime = System.currentTimeMillis()
    println("Run" + (endTime - startTime) + "ms")
  }
}

Comparing the two approaches on 2,000,000 rows, method 1 (converting directly inside the DataFrame) took 2290 ms while method 2 (going through the RDD) took 3411 ms, so converting back to an RDD carries a noticeable cost and method 1 is recommended. Note that because the case class conversion relies on Spark's implicit encoders, import spark.implicits._ is required.
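As a side note, when the inferred types keep needing to be cast, you can also skip inferSchema and declare the schema up front with StructType, which avoids the extra pass over the data that inferSchema requires. This is just a sketch of that alternative, not part of the timing comparison above:

    import org.apache.spark.sql.types._

    // Declare the column types explicitly instead of inferring them
    val sampleSchema = StructType(Seq(
      StructField("item_id", StringType, nullable = true),
      StructField("month", StringType, nullable = true),
      StructField("total", DoubleType, nullable = true),
      StructField("distinct", DoubleType, nullable = true)
    ))

    val typedDF = spark.read.format("csv")
      .option("sep", ",")
      .option("header", "true")
      .schema(sampleSchema)   // no inferSchema, so no extra pass over the data
      .load("C:\\Users\\xxx\\Desktop\\sample.csv")
    typedDF.printSchema()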
