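1 Overview

Spark SQL is Spark's module for structured data processing. Structured data can be processed through SQL statements and through the Dataset API.

1.1 SQL

One use of Spark SQL is to run SQL queries; it can also read data from an existing Hive warehouse. Running a SQL statement from within a program returns a Dataset/DataFrame. You can also run SQL through the spark-sql command line or over the JDBC/ODBC server.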

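1.2 Datasets and DataFrames

A Dataset is a distributed collection of data, an interface added in Spark 1.6. A Dataset is typed, for example Dataset<Person>, so operations on its records can use calls such as person.getName(), which keeps them type safe. A DataFrame can be built from structured data files, Hive tables, external databases, or existing RDDs. A DataFrame is untyped: it is simply Dataset<Row>, and its records are accessed through calls such as row.getString(0). That access is not type safe; for example, it fails at runtime if the first field of the row cannot be read as a String. Both carry a schema, which defines the structure of each row: the column names and types, much like the column definitions of a table. The difference is that a DataFrame's columns are only checked against the schema at runtime, whereas a Dataset's element type is checked at compile time.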
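To make the distinction concrete, here is a minimal sketch in Scala, where the two types are written Dataset[Person] and Dataset[Row]. The Person case class, the sample row, and the local[*] master are assumptions made purely for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Local session for illustration only; master and app name are assumptions
val spark = SparkSession.builder().master("local[*]").appName("typed-vs-untyped").getOrCreate()
import spark.implicits._

// Hypothetical record type used only in this sketch
case class Person(name: String, age: Long)

val ds: Dataset[Person] = Seq(Person("Andy", 32L)).toDS()
val df: DataFrame = ds.toDF() // DataFrame is an alias for Dataset[Row]

// Typed access: the compiler checks that `name` exists and is a String
val typedNames: Array[String] = ds.map(_.name).collect()

// Untyped access: column position and type are only checked at runtime
val untypedNames: Array[String] = df.map(row => row.getString(0)).collect()

// Both views carry the same schema
val sameSchema: Boolean = ds.schema == df.schema
```

Calling row.getString(1) on the DataFrame would still compile, but would fail when the job runs because the second column holds a Long.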

2 Getting Started

2.1 Entry point

The entry point to Spark SQL is the SparkSession class, created as follows:

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

// For implicit conversions like converting RDDs to DataFrames (needed if you want to call toDF on an RDD)
import spark.implicits._

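Since Spark 2.0, SparkSession has built-in support for writing HiveQL queries, using Hive UDFs, and reading data from Hive tables. You do not need an existing Hive installation to use these features.

2.2 Creating DataFrames

A DataFrame can be created from an existing RDD, a Hive table, or another Spark data source. The following creates one from a JSON file: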

val df = spark.read.json("examples/src/main/resources/people.json")

// Displays the content of the DataFrame to stdout
df.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

2.3 Untyped Dataset operations

Examples follow (these are simply DataFrame operations):

// This import is needed to use the $-notation
import spark.implicits._
// Print the schema in a tree format
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)

// Select only the "name" column
df.select("name").show()
// +-------+
// |   name|
// +-------+
// |Michael|
// |   Andy|
// | Justin|
// +-------+

// Select everybody, but increment the age by 1
df.select($"name", $"age" + 1).show()
// +-------+---------+
// |   name|(age + 1)|
// +-------+---------+
// |Michael|     null|
// |   Andy|       31|
// | Justin|       20|
// +-------+---------+

// Select people older than 21
df.filter($"age" > 21).show()
// +---+----+
// |age|name|
// +---+----+
// | 30|Andy|
// +---+----+

// Count people by age
df.groupBy("age").count().show()
// +----+-----+
// | age|count|
// +----+-----+
// |  19|    1|
// |null|    1|
// |  30|    1|
// +----+-----+

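2.4 Running SQL queries programmatically

For example:

df.createOrReplaceTempView("people") // register a temporary view; it is session-scoped and disappears when the session ends

val sqlDF = spark.sql("SELECT * FROM people") // run SQL queries against the temporary view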
sqlDF.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

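The sql function returns a DataFrame as the query result.

2.5 Global temporary views

A temporary view is session-scoped: it disappears when the session that created it ends. If you want a temporary view that is shared across sessions (one application can have several SparkSessions, whose contexts and resources are isolated from each other) and that stays alive until the application exits, create a global temporary view. A global temporary view is stored in the global_temp database and must be referenced through that database, as follows: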

// Register the DataFrame as a global temporary view
df.createGlobalTempView("people")

// Global temporary view is tied to a system preserved database `global_temp`
spark.sql("SELECT * FROM global_temp.people").show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

// Global temporary view is cross-session
spark.newSession().sql("SELECT * FROM global_temp.people").show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

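2.6 Creating Datasets

A Dataset is similar to an RDD, but instead of Java serialization or Kryo it uses a specialized Encoder to serialize objects. Encoders also turn objects into bytes, but they use a format that lets Spark perform many operations such as filtering, sorting, and hashing without deserializing the bytes back into objects.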
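As a small illustration of what an Encoder knows about a type, the sketch below builds one explicitly and inspects the schema it maps objects to. The Person case class is an assumption for this sketch; normally the encoder is supplied implicitly by import spark.implicits._.

```scala
import org.apache.spark.sql.{Encoder, Encoders}

// Hypothetical record type used only in this sketch
case class Person(name: String, age: Long)

// Build the encoder explicitly instead of relying on spark.implicits._
val personEncoder: Encoder[Person] = Encoders.product[Person]

// The encoder maps each case class member to a column with a Spark SQL type
println(personEncoder.schema.treeString)
val fieldNames: Array[String] = personEncoder.schema.fieldNames
```

Because the encoder knows the column layout, Spark can work on individual fields of the serialized data rather than on whole deserialized objects.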

Usage example:

case class Person(name: String, age: Long)

// Encoders are created for case classes (here, one is created for the Person case class)
val caseClassDS = Seq(Person("Andy", 32)).toDS()
caseClassDS.show()
// +----+---+
// |name|age|
// +----+---+
// |Andy| 32|
// +----+---+

// Encoders for most common types are automatically provided by importing spark.implicits._
val primitiveDS = Seq(1, 2, 3).toDS()
primitiveDS.map(_ + 1).collect() // Returns: Array(2, 3, 4)

// DataFrames can be converted to a Dataset by providing a class. Mapping will be done by name
// (people.json contains the fields name and age, which match the member names of the case class one to one)
val path = "examples/src/main/resources/people.json"
val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

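2.7 Interoperating with RDDs

There are two ways to convert RDDs into Datasets:

Inferring the schema using reflection: applies when the schema is already known when you write the program.

Specifying the schema programmatically: applies when the schema is not known until runtime.

2.7.1 Inferring the schema using reflection

Spark SQL can automatically convert an RDD of case class objects into a DataFrame. The case class carries the schema of the table: the names of its members become the column names. Case classes can also be nested or contain complex types such as Seqs or Arrays. The RDD can therefore be converted to a DataFrame, registered as a table, and queried with SQL, as in the following example:

// For implicit conversions from RDDs to DataFrames (this import is required before toDF can convert an RDD to a DataFrame)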
import spark.implicits._

// Create an RDD of Person objects from a text file, convert it to a Dataframe
val peopleDF = spark.sparkContext
  .textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
  .toDF()
// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people")

// SQL statements can be run by using the sql methods provided by Spark
val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19")

// The columns of a row in the result can be accessed by field index
teenagersDF.map(teenager => "Name: " + teenager(0)).show()
// +------------+
// |       value|
// +------------+
// |Name: Justin|
// +------------+

// or by field name
teenagersDF.map(teenager => "Name: " + teenager.getAs[String]("name")).show()
// +------------+
// |       value|
// +------------+
// |Name: Justin|
// +------------+

// No pre-defined encoders for Dataset[Map[K,V]], define explicitly (without this, the map call below fails)
implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String, Any]]
// Primitive types and case classes can be also defined as
// implicit val stringIntMapEncoder: Encoder[Map[String, Any]] = ExpressionEncoder()

// row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
teenagersDF.map(teenager => teenager.getValuesMap[Any](List("name", "age"))).collect()
// Array(Map("name" -> "Justin", "age" -> 19))

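2.7.2 Specifying the schema programmatically

In some cases a case class cannot be defined ahead of time. For example, the schema is not known before the program runs: it arrives as a string through a runtime argument or some other channel and is read dynamically, or it differs from user to user. (I once built an application that processed data from several interfaces; to handle their different schemas, each interface's schema was passed to the program as a parameter and parsed dynamically.) The steps are:

1. Create an RDD<Row> from the original RDD (the layout of the data, such as the number of fields, is known).

2. Create the StructType that matches the RDD<Row> from step 1 (parse the field names, field types, and so on dynamically); this is the schema.

3. Apply the schema to the RDD<Row> through SparkSession's createDataFrame method to create the DataFrame.

Example: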

import org.apache.spark.sql.types._

// Create an RDD (the original, plain RDD)
val peopleRDD = spark.sparkContext.textFile("examples/src/main/resources/people.txt")

// The schema is encoded in a string (the field names are hard-coded here; normally they come from a parameter or another source)
val schemaString = "name age"

// Generate the schema based on the string of schema
val fields = schemaString.split(" ")
  .map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)

// Convert records of the RDD (people) to Rows (this creates the RDD<Row>)
val rowRDD = peopleRDD
  .map(_.split(","))
  .map(attributes => Row(attributes(0), attributes(1).trim))

// Apply the schema to the RDD (this creates the DataFrame)
val peopleDF = spark.createDataFrame(rowRDD, schema)

// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")

// SQL can be run over a temporary view created using DataFrames
val results = spark.sql("SELECT name FROM people")

// The results of SQL queries are DataFrames and support all the normal RDD operations
// The columns of a row in the result can be accessed by field index or by field name
results.map(attributes => "Name: " + attributes(0)).show()
// +-------------+
// |        value|
// +-------------+
// |Name: Michael|
// |   Name: Andy|
// | Name: Justin|
// +-------------+
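The example above types every column as StringType. As a hedged variation, the sketch below parses a schema description that also carries type names, so that age becomes an integer column. The "name:type" description format, the supported type names, and the inline sample records are assumptions made for illustration; they are not part of Spark.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.max
import org.apache.spark.sql.types._

// Local session for illustration only
val spark = SparkSession.builder().master("local[*]").appName("typed-schema").getOrCreate()

// Hypothetical schema description, e.g. passed in as a program argument
val schemaString = "name:string,age:int"

// Step 2 of the list above: build the StructType dynamically
val fields = schemaString.split(",").map { spec =>
  val Array(fieldName, typeName) = spec.split(":")
  val dataType: DataType = typeName match {
    case "int"  => IntegerType
    case "long" => LongType
    case _      => StringType
  }
  StructField(fieldName, dataType, nullable = true)
}
val schema = StructType(fields)

// Step 1: build the RDD of Rows; values must match the declared column types
val rowRDD = spark.sparkContext
  .parallelize(Seq("Michael, 29", "Andy, 30", "Justin, 19"))
  .map(_.split(","))
  .map(a => Row(a(0), a(1).trim.toInt))

// Step 3: apply the schema
val typedPeopleDF = spark.createDataFrame(rowRDD, schema)
val maxAge: Int = typedPeopleDF.agg(max("age")).head().getInt(0)
```

With a numeric age column, aggregations and comparisons work on numbers instead of on strings.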

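2.8 Aggregations

DataFrames provide built-in aggregations such as count(), countDistinct(), avg(), max(), and min(). Spark SQL also provides type-safe versions of some of these for Datasets. In addition, users can define their own aggregate functions.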
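A minimal sketch of the built-in aggregations, using the functions from org.apache.spark.sql.functions. The inline rows mirror the employees data used later in this section; the column aliases are arbitrary names chosen for this sketch.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, count, countDistinct, max, min}

// Local session for illustration only
val spark = SparkSession.builder().master("local[*]").appName("builtin-aggregations").getOrCreate()
import spark.implicits._

val employees = Seq(("Michael", 3000L), ("Andy", 4500L), ("Justin", 3500L), ("Berta", 4000L))
  .toDF("name", "salary")

// Several aggregations computed in one pass over the DataFrame
val summary = employees.agg(
  count("name").as("n"),
  countDistinct("name").as("distinct_names"),
  avg("salary").as("avg_salary"),
  max("salary").as("max_salary"),
  min("salary").as("min_salary"))
summary.show()

val summaryRow = summary.head()
```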

2.8.1 Untyped user-defined aggregate functions (UDAF)

Implemented by extending UserDefinedAggregateFunction, as shown below:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.types._

object MyAverage extends UserDefinedAggregateFunction {
  // Data types of input arguments of this aggregate function
  def inputSchema: StructType = StructType(StructField("inputColumn", LongType) :: Nil)
  // Data types of values in the aggregation buffer
  def bufferSchema: StructType = {
    StructType(StructField("sum", LongType) :: StructField("count", LongType) :: Nil)
  }
  // The data type of the returned value
  def dataType: DataType = DoubleType
  // Whether this function always returns the same output on the identical input
  def deterministic: Boolean = true
  // Initializes the given aggregation buffer. The buffer itself is a `Row` that in addition to
  // standard methods like retrieving a value at an index (e.g., get(), getBoolean()), provides
  // the opportunity to update its values. Note that arrays and maps inside the buffer are still
  // immutable.
  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
    buffer(1) = 0L
  }
  // Updates the given aggregation buffer `buffer` with new input data from `input`
  // (for example, a task starts with all buffer values at 0 and processes the records of its
  // partition one by one; `input` is the record currently being processed)
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getLong(0) + input.getLong(0)
      buffer(1) = buffer.getLong(1) + 1
    }
  }
  // Merges two aggregation buffers and stores the updated buffer values back to `buffer1`
  // (each partition ends up with its own buffer; the buffers of all tasks must be merged to get the final value)
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
  }
  // Calculates the final result
  def evaluate(buffer: Row): Double = buffer.getLong(0).toDouble / buffer.getLong(1)
}

// Register the function to access it (a user-defined aggregate function must be registered before it is used in SQL)
spark.udf.register("myAverage", MyAverage)

val df = spark.read.json("examples/src/main/resources/employees.json")
df.createOrReplaceTempView("employees")
df.show()
// +-------+------+
// |   name|salary|
// +-------+------+
// |Michael|  3000|
// |   Andy|  4500|
// | Justin|  3500|
// |  Berta|  4000|
// +-------+------+

val result = spark.sql("SELECT myAverage(salary) as average_salary FROM employees")
result.show()
// +--------------+
// |average_salary|
// +--------------+
// |        3750.0|
// +--------------+

2.8.2 Type-safe user-defined aggregate functions (UDAF)

Implemented by extending the Aggregator abstract class, as follows:

import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

case class Employee(name: String, salary: Long)
case class Average(var sum: Long, var count: Long)
// The type parameters are given directly: Employee is the type of the records being aggregated,
// Average is the user-defined buffer type used during aggregation, and Double is the type of the final result
object MyAverage extends Aggregator[Employee, Average, Double] {
  // A zero value for this aggregation. Should satisfy the property that any b + zero = b
  def zero: Average = Average(0L, 0L)
  // Combine two values to produce a new value. For performance, the function may modify `buffer`
  // and return it instead of constructing a new object (plays the role of update in the untyped version above)
  def reduce(buffer: Average, employee: Employee): Average = {
    buffer.sum += employee.salary
    buffer.count += 1
    buffer
  }
  // Merge two intermediate values (plays the role of merge in the untyped version above)
  def merge(b1: Average, b2: Average): Average = {
    b1.sum += b2.sum
    b1.count += b2.count
    b1
  }
  // Transform the output of the reduction
  def finish(reduction: Average): Double = reduction.sum.toDouble / reduction.count
  // Specifies the Encoder for the intermediate value type (a typed Dataset needs an explicit encoder for the Average buffer)
  def bufferEncoder: Encoder[Average] = Encoders.product
  // Specifies the Encoder for the final output value type
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

val ds = spark.read.json("examples/src/main/resources/employees.json").as[Employee]
ds.show()
// +-------+------+
// |   name|salary|
// +-------+------+
// |Michael|  3000|
// |   Andy|  4500|
// | Justin|  3500|
// |  Berta|  4000|
// +-------+------+

// Convert the function to a `TypedColumn` and give it a name
val averageSalary = MyAverage.toColumn.name("average_salary")
val result = ds.select(averageSalary)
result.show()
// The aggregated column carries the name assigned above
// +--------------+
// |average_salary|
// +--------------+
// |        3750.0|
// +--------------+

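3 Data Sources

Spark SQL supports a variety of data sources through the DataFrame interface. A DataFrame can be transformed with the usual data processing operations and can also be registered as a temporary view, after which SQL queries can be run over its data through that view.

3.1 Generic Load/Save functions

The default data source is Parquet (it can be changed through spark.sql.sources.default), for example: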

val usersDF = spark.read.load("examples/src/main/resources/users.parquet")
usersDF.select("name", "favorite_color").write.save("namesAndFavColors.parquet")

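3.1.1 Manually specifying options

You can also specify the data source manually, together with any extra options to pass to it. Data sources should be specified by their fully qualified name, but built-in sources can use their short names (json, parquet, jdbc, orc, libsvm, csv, text).

To load a JSON file: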

val peopleDF = spark.read.format("json").load("examples/src/main/resources/people.json")
peopleDF.select("name", "age").write.format("parquet").save("namesAndAges.parquet")

To load a CSV file:

val peopleDFCsv = spark.read.format("csv")
  .option("sep", ";") // field separator
  .option("inferSchema", "true") // enable schema inference
  .option("header", "true") // the first line is a header with column names, not data
  .load("examples/src/main/resources/people.csv")

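3.1.2 Running SQL on files directly

You can run SQL directly on a file instead of first loading it into a DataFrame and then querying it, as follows:

val sqlDF = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")

The parquet prefix specifies the file format, which is why the file can be queried directly. For JSON the form should be json.`file name`; I have not tested this yet.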
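A small sketch of the JSON variant follows. It writes a temporary JSON Lines file first so that it is self-contained; the file contents and the temporary location are assumptions for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Local session for illustration only
val spark = SparkSession.builder().master("local[*]").appName("sql-on-json-file").getOrCreate()

// Hypothetical sample data: one complete JSON object per line
val jsonPath = Files.createTempFile("people", ".json")
val jsonLines = Seq("""{"name":"Andy","age":30}""", """{"name":"Justin","age":19}""")
Files.write(jsonPath, jsonLines.mkString("\n").getBytes("UTF-8"))

// The json prefix names the format, the back-quoted part is the path
val jsonSqlDF = spark.sql(s"SELECT name FROM json.`${jsonPath.toString}` WHERE age > 21")
val namesOver21: Array[String] = jsonSqlDF.collect().map(_.getString(0))
```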

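3.1.3 Save modes

A save operation can take a SaveMode. Note that these save modes do not use any locking and are not atomic.

SaveMode.ErrorIfExists (the default): when saving a DataFrame, an exception is thrown if data already exists.

SaveMode.Append: the contents of the DataFrame are appended to the existing data.

SaveMode.Overwrite: existing data is deleted first, then the new data is written.

SaveMode.Ignore: if data already exists, the save is skipped and the existing data is left unchanged.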
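The four modes can be tried out with a sketch like the following, which writes the same two-row DataFrame repeatedly to one temporary Parquet path. The sample rows and the temporary directory are assumptions for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{SaveMode, SparkSession}

// Local session for illustration only
val spark = SparkSession.builder().master("local[*]").appName("save-modes").getOrCreate()
import spark.implicits._

// A path inside a fresh temporary directory, so it does not exist yet
val outDir = Files.createTempDirectory("savemode-demo").resolve("people.parquet").toString
val people = Seq(("Andy", 30), ("Justin", 19)).toDF("name", "age")

people.write.mode(SaveMode.ErrorIfExists).parquet(outDir) // succeeds: nothing exists at the path yet
people.write.mode(SaveMode.Append).parquet(outDir)        // adds the two rows again
people.write.mode(SaveMode.Ignore).parquet(outDir)        // data exists, so this write is skipped
val rowsAfterIgnore = spark.read.parquet(outDir).count()

people.write.mode(SaveMode.Overwrite).parquet(outDir)     // replaces everything at the path
val rowsAfterOverwrite = spark.read.parquet(outDir).count()
```

Running the ErrorIfExists write a second time against the same path would throw an exception instead.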

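3.1.4 Saving to persistent tables (not temporary views)

DataFrames can be saved to the Hive warehouse as persistent tables with saveAsTable. Note that an existing Hive installation is not required: Spark creates a local Hive metastore for you using Derby. The persisted table still exists after the program exits and is run again, and it can be loaded into a DataFrame with SparkSession's table method.

For file-based data sources, you can specify a custom path when saving a persistent table (df.write.option("path", "/some/path").saveAsTable("t")). When the table is dropped, the path and its data remain, which shows that a table with an explicit path is treated as an external table. If no path is specified, the data is placed under the warehouse directory, and that default path is removed when the table is dropped (a managed table).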
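The external-table behaviour described above can be checked with a sketch like this one. The table name people_ext, the sample rows, and the temporary path are assumptions for illustration; the session here is created without Hive support, so the table only lives in the session's in-memory catalog.

```scala
import java.io.File
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Local session for illustration only (no Hive support enabled)
val spark = SparkSession.builder().master("local[*]").appName("save-as-table").getOrCreate()
import spark.implicits._

// Custom location for the table data
val tablePath = Files.createTempDirectory("ext-table-demo").resolve("people").toString
val people = Seq(("Andy", 30), ("Justin", 19)).toDF("name", "age")

// Specifying a path makes this an external table
people.write.option("path", tablePath).saveAsTable("people_ext")

// Load the table back by name
val reloadedCount = spark.table("people_ext").count()

// Dropping an external table removes the catalog entry but keeps the files
spark.sql("DROP TABLE people_ext")
val filesRemain = new File(tablePath).exists()
```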

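Starting with Spark 2.1, persistent tables keep per-partition metadata in the Hive metastore. This brings several benefits:

1. A query that touches only some partitions can return just the necessary partitions, because the metastore locates them directly; discovering all partitions on the first query to the table is no longer needed.

2. Hive DDLs such as ALTER TABLE PARTITION ... SET LOCATION are available.

Note that partition information is not gathered by default for external tables (those created with a path). To sync the partition information in the metastore, run MSCK REPAIR TABLE.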

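3.1.5 Bucketing, sorting, and partitioning

For file-based data sources, the output can also be bucketed, sorted, or partitioned. Bucketing and sorting apply only to persistent tables (saveAsTable).

For example: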

peopleDF.write.bucketBy(42, "name").sortBy("age").saveAsTable("people_bucketed")

Partitioning can be used with both save and saveAsTable:

usersDF.write.partitionBy("favorite_color").format("parquet").save("namesPartByColor.parquet")

A single table can use both partitioning and bucketing:

usersDF
  .write
  .partitionBy("favorite_color")
  .bucketBy(42, "name")
  .saveAsTable("users_partitioned_bucketed")

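partitionBy creates one directory per distinct value, so when the partition column has high cardinality the number of partitions becomes very large and partitioning is of limited use. Bucketing, by contrast, fixes the number of buckets, so the overhead stays constant no matter how many distinct values the data has.

3.2 Parquet files

Parquet is a file format supported by many data processing systems. Spark SQL supports reading and writing Parquet files and automatically preserves the schema of the original data. When writing Parquet files, all columns are automatically converted to nullable for compatibility reasons.

3.2.1 Loading data programmatically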

// Encoders for most common types are automatically provided by importing spark.implicits._
import spark.implicits._

val peopleDF = spark.read.json("examples/src/main/resources/people.json")

// DataFrames can be saved as Parquet files, maintaining the schema information
peopleDF.write.parquet("people.parquet")

// Read in the parquet file created above
// Parquet files are self-describing so the schema is preserved
// The result of loading a Parquet file is also a DataFrame
val parquetFileDF = spark.read.parquet("people.parquet")

// Parquet files can also be used to create a temporary view and then used in SQL statements
parquetFileDF.createOrReplaceTempView("parquetFile")
val namesDF = spark.sql("SELECT name FROM parquetFile WHERE age BETWEEN 13 AND 19")
namesDF.map(attributes => "Name: " + attributes(0)).show()
// +------------+
// |       value|
// +------------+
// |Name: Justin|
// +------------+

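3.2.2 Partition discovery

Table partitioning is a common optimization in systems such as Hive. In a partitioned table, data is stored in different directories according to the partition values. The built-in file sources (Text/CSV/JSON/ORC/Parquet) can discover and infer partition information automatically. For example, here is the directory layout of a dataset whose partition columns are gender and country: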

path
└── to
    └── table
        ├── gender=male
        │   ├── ...
        │   │
        │   ├── country=US
        │   │   └── data.parquet
        │   ├── country=CN
        │   │   └── data.parquet
        │   └── ...
        └── gender=female
            ├── ...
            │
            ├── country=US
            │   └── data.parquet
            ├── country=CN
            │   └── data.parquet
            └── ...

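When this data is read with SparkSession.read.parquet or SparkSession.read.load, passing path/to/table, Spark SQL automatically extracts the partition information from the paths under path/to/table, and the schema of the resulting DataFrame becomes: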

root
|-- name: string (nullable = true)
|-- age: long (nullable = true)
|-- gender: string (nullable = true)
|-- country: string (nullable = true)

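Note that the data types of the partition columns are inferred automatically. Numeric, date, timestamp, and string types are currently supported. If automatic type inference of partition columns is not wanted, it can be controlled with spark.sql.sources.partitionColumnTypeInference.enabled; when it is set to false, partition columns are typed as string.

Since Spark 1.6.0, partition discovery only finds partitions under the given paths by default. In the example above, if the user passes path/to/table/gender=male to the read function, gender is not treated as a partition column. To keep it, set basePath: with the option basePath set to path/to/table/, gender becomes a partition column again.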
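The effect of basePath can be seen in a sketch such as the following, which first writes a small partitioned table to a temporary directory. The sample rows and the temporary location are assumptions for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Local session for illustration only
val spark = SparkSession.builder().master("local[*]").appName("base-path").getOrCreate()
import spark.implicits._

// Write a table partitioned by gender and country
val base = Files.createTempDirectory("partition-demo").resolve("table").toString
Seq(("Andy", 30, "male", "US"), ("Berta", 28, "female", "CN"))
  .toDF("name", "age", "gender", "country")
  .write.partitionBy("gender", "country").parquet(base)

// Reading a sub-directory: gender is not discovered as a partition column
val withoutBase = spark.read.parquet(s"$base/gender=male")

// Same sub-directory with basePath: gender is a partition column again
val withBase = spark.read.option("basePath", base).parquet(s"$base/gender=male")

val genderWithoutBase = withoutBase.columns.contains("gender")
val genderWithBase = withBase.columns.contains("gender")
```

In both cases country is still discovered, because its directories lie below the path that was passed to the reader.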

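3.2.3 Schema merging

A schema can evolve: it starts out simple and gradually gains more columns. When Parquet files have different but mutually compatible schemas, the Parquet data source can detect this and merge the schemas of those files automatically.

Schema merging is a relatively expensive operation, so it has been turned off by default since Spark 1.5.0. It can be enabled in either of two ways:

1. Set the mergeSchema option to true when reading Parquet files.

2. Set the global SQL option spark.sql.parquet.mergeSchema to true.

Example: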

// This is used to implicitly convert an RDD to a DataFrame.
import spark.implicits._

// Create a simple DataFrame, store into a partition directory
val squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i => (i, i * i)).toDF("value", "square")
squaresDF.write.parquet("data/test_table/key=1") // stored in a partition directory; key is the partition column

// Create another DataFrame in a new partition directory,
// adding a new column and dropping an existing column
val cubesDF = spark.sparkContext.makeRDD(6 to 10).map(i => (i, i * i * i)).toDF("value", "cube")
cubesDF.write.parquet("data/test_table/key=2")

// Read the partitioned table
val mergedDF = spark.read.option("mergeSchema", "true").parquet("data/test_table")
mergedDF.printSchema()

// (the value column must have compatible types in both files; here both are integers)
// The final schema consists of all 3 columns in the Parquet files together
// with the partitioning column appeared in the partition directory paths
// root
//  |-- value: int (nullable = true)
//  |-- square: int (nullable = true)
//  |-- cube: int (nullable = true)
//  |-- key: int (nullable = true)

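3.2.4 Hive metastore Parquet table conversion

To be written (I have not understood this part yet).

3.2.5 Metadata refreshing

Spark SQL caches Parquet metadata, so the converted metadata produced by Hive metastore Parquet table conversion is cached as well. If such a table is updated by Hive or another external tool, it must be refreshed manually to keep the metadata consistent: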

// spark is an existing SparkSession
spark.catalog.refreshTable("my_table")

3.2.6 Configuration

Parquet options can be set with the setConf method on SparkSession or with a SET key=value command in SQL.
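A minimal sketch of both ways of setting an option. In the Scala API the programmatic setter is reached through spark.conf.set (spark.sqlContext.setConf is the older equivalent); the two option values are chosen only for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only
val spark = SparkSession.builder().master("local[*]").appName("parquet-conf").getOrCreate()

// Programmatic: set an option on the session's runtime configuration
spark.conf.set("spark.sql.parquet.mergeSchema", "true")

// SQL: the same mechanism through a SET command
spark.sql("SET spark.sql.parquet.compression.codec=snappy")

val mergeSchemaValue = spark.conf.get("spark.sql.parquet.mergeSchema")
val codecValue = spark.conf.get("spark.sql.parquet.compression.codec")
```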

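For the full list of options, see the official documentation.

3.3 ORC files

In short, ORC files are supported. I have not studied or used them; I will look into this when I need it.

3.4 JSON datasets

Spark SQL can automatically infer the schema of a JSON dataset and load it as a Dataset[Row], using SparkSession.read.json(). The input can be a Dataset[String] or a JSON file.

Note that the JSON file expected here is not a typical JSON file: each line must contain a separate, self-contained, valid JSON object.

For a multi-line JSON file, set multiLine to true.

A typical JSON file looks like this: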

{    "people": [
        {
            "name": "aboutyun",
            "age": "4"
        },
        {
            "name": "baidu",
            "age": "5"
        }
    ]
}

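A file in this form, read with the default options, does not yield the expected schema, and running an action on the result fails.

According to the official documentation, multiLine must be set for such a file.

With multiLine set, the file is parsed into a single column people of type array<struct<age:string,name:string>>, which still does not match what we want: a DataFrame with the two columns age and name. For that, the JSON file has to look like this: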

[
    {
        "name": "aboutyun",
        "age": "4"
    },
    {
        "name": "baidu",
        "age": "5"
    }
]

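Read with multiLine set, this file produces the two columns we want.

Without multiLine, however, a multi-line JSON file like this cannot be parsed: each JSON object must be written on a single line, and if the file holds several JSON objects, each must be complete and separated from the others. A file in which every line is one complete JSON object can be parsed.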
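Both cases can be sketched as follows: a multi-line file holding a top-level array read with multiLine, and one-object-per-line input read with the defaults. The temporary file and the use of a Dataset[String] for the line-based input are assumptions for illustration; the records are the ones from the example above.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Local session for illustration only
val spark = SparkSession.builder().master("local[*]").appName("json-multiline").getOrCreate()
import spark.implicits._

// Case 1: a multi-line file whose top level is an array of objects
val arrayFile = Files.createTempFile("people-array", ".json")
val arrayJson = Seq(
  "[",
  """  {"name": "aboutyun", "age": "4"},""",
  """  {"name": "baidu", "age": "5"}""",
  "]").mkString("\n")
Files.write(arrayFile, arrayJson.getBytes("UTF-8"))

val multiLineDF = spark.read.option("multiLine", true).json(arrayFile.toString)
multiLineDF.printSchema()

// Case 2: one complete JSON object per line, no option needed
val lines = Seq("""{"name":"aboutyun","age":"4"}""", """{"name":"baidu","age":"5"}""").toDS()
val linesDF = spark.read.json(lines)
```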

Usage example:

// Primitive types (Int, String, etc) and Product types (case classes) encoders are
// supported by importing this when creating a Dataset.
import spark.implicits._

// A JSON dataset is pointed to by path.
// The path can be either a single text file or a directory storing text files
val path = "examples/src/main/resources/people.json"
val peopleDF = spark.read.json(path)

// The inferred schema can be visualized using the printSchema() method
peopleDF.printSchema()
// root
//  |-- age: long (nullable = true)
//  |-- name: string (nullable = true)

// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")

// SQL statements can be run by using the sql methods provided by spark
val teenagerNamesDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teenagerNamesDF.show()
// +------+
// |  name|
// +------+
// |Justin|
// +------+

// Alternatively, a DataFrame can be created for a JSON dataset represented by
// a Dataset[String] storing one JSON object per string
val otherPeopleDataset = spark.createDataset(
  """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val otherPeople = spark.read.json(otherPeopleDataset)
otherPeople.show()
// +---------------+----+
// |        address|name|
// +---------------+----+
// |[Columbus,Ohio]| Yin|
// +---------------+----+

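3.5 Hive tables

Spark SQL supports reading and writing data stored in Hive. However, Hive has a large number of dependencies, and they are not included in the default Spark distribution. If the Hive dependencies can be found on the classpath, Spark loads them automatically. Note that these dependencies must also be present on all worker nodes, because the workers need to serialize and deserialize the data when they access data stored in Hive.

To configure Hive, place hive-site.xml (Hive configuration), core-site.xml (security configuration), and hdfs-site.xml (HDFS configuration) in Spark's conf directory. A SparkSession initialized afterwards automatically picks up the configuration of the existing Hive warehouse from hive-site.xml, so operations issued through the sql function run against the Hive warehouse you already have installed. Without these copied configuration files, Spark creates a local warehouse using Derby.

If hive-site.xml has not been copied, Spark automatically creates metastore_db in the current directory and creates the warehouse directory configured by spark.sql.warehouse.dir. (Note that since Spark 2.0.0 the hive.metastore.warehouse.dir property in hive-site.xml is deprecated; use spark.sql.warehouse.dir to specify the default location of databases in the warehouse.) The user who starts Spark needs permissions on these directories.

Usage example (verified to work in spark-shell):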

import java.io.File

import org.apache.spark.sql.{Row, SaveMode, SparkSession}

case class Record(key: Int, value: String)

// warehouseLocation points to the default location for managed databases and tables
val warehouseLocation = new File("spark-warehouse").getAbsolutePath

val spark = SparkSession
  .builder()
  .appName("Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()

import spark.implicits._
import spark.sql

sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")
sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
sql("SELECT * FROM src").show()
// +---+-------+
// |key|  value|
// +---+-------+
// |238|val_238|
// | 86| val_86|
// |311|val_311|
// ...

// Aggregation queries are also supported.
sql("SELECT COUNT(*) FROM src").show()
// +--------+
// |count(1)|
// +--------+
// |    500 |
// +--------+

// The results of SQL queries are themselves DataFrames and support all normal functions.
val sqlDF = sql("SELECT key, value FROM src WHERE key < 10 ORDER BY key")

// The items in DataFrames are of type Row, which allows you to access each column by ordinal.
val stringsDS = sqlDF.map {
  case Row(key: Int, value: String) => s"Key: $key, Value: $value"
}
stringsDS.show()
// +--------------------+
// |               value|
// +--------------------+
// |Key: 0, Value: val_0|
// |Key: 0, Value: val_0|
// |Key: 0, Value: val_0|
// ...

// You can also use DataFrames to create temporary views within a SparkSession.
val recordsDF = spark.createDataFrame((1 to 100).map(i => Record(i, s"val_$i")))
recordsDF.createOrReplaceTempView("records")

// Queries can then join DataFrame data with data stored in Hive.
sql("SELECT * FROM records r JOIN src s ON r.key = s.key").show()
// +---+------+---+------+
// |key| value|key| value|
// +---+------+---+------+
// |  2| val_2|  2| val_2|
// |  4| val_4|  4| val_4|
// |  5| val_5|  5| val_5|
// ...

// Create a Hive managed Parquet table, with HQL syntax instead of the Spark SQL native syntax
// `USING hive`
sql("CREATE TABLE hive_records(key int, value string) STORED AS PARQUET")
// Save DataFrame to the Hive managed table
val df = spark.table("src")
df.write.mode(SaveMode.Overwrite).saveAsTable("hive_records")
// After insertion, the Hive managed table has data now
sql("SELECT * FROM hive_records").show()
// +---+-------+
// |key|  value|
// +---+-------+
// |238|val_238|
// | 86| val_86|
// |311|val_311|
// ...

// Prepare a Parquet data directory
val dataDir = "/tmp/parquet_data"
spark.range(10).write.parquet(dataDir)
// Create a Hive external Parquet table
sql(s"CREATE EXTERNAL TABLE hive_ints(key int) STORED AS PARQUET LOCATION '$dataDir'")
// The Hive external table should already have data
sql("SELECT * FROM hive_ints").show()
// +---+
// |key|
// +---+
// |  0|
// |  1|
// |  2|
// ...

// Turn on flag for Hive Dynamic Partitioning
spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
// Create a Hive partitioned table using DataFrame API
df.write.partitionBy("key").format("hive").saveAsTable("hive_part_tbl")
// Partitioned column `key` will be moved to the end of the schema.
sql("SELECT * FROM hive_part_tbl").show()
// +-------+---+
// |  value|key|
// +-------+---+
// |val_238|238|
// | val_86| 86|
// |val_311|311|
// ...

spark.stop()

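3.5.1 Specifying the storage format for Hive tables

When you create a Hive table, you need to define how the table reads and writes data from and to the file system, that is, the "input format" and "output format". You also need to define how the table deserializes data into rows and serializes rows into data, that is, the "serde". These can be specified through options, for example CREATE TABLE src(id int) USING hive OPTIONS(fileFormat 'parquet'). By default, table files are read as plain text. Note that Hive storage handlers are not yet supported when creating a table; if you need one, create the table on the Hive side and then read it with Spark SQL.

For the individual format options, see the official documentation.

3.5.2 Interacting with different versions of the Hive metastore

One of the most important pieces of Spark SQL's Hive support is the interaction with the Hive metastore, which lets Spark SQL access the metadata of Hive tables. Since Spark 1.4.0, the version of the Hive metastore to use can be set through configuration. (This Spark version supports Hive 1.2.1 by default.)

(Not every Hive version is supported, and an unsupported version leads to all kinds of compatibility problems, so check the official documentation for the supported Hive versions when needed. Supported Hive versions: 0.12.0 through 1.2.1.)

3.6 JDBC

Spark SQL can also use other databases as a data source through JDBC. To use this feature, first supply the JDBC driver class and its jar, for example when starting from the command line: spark-shell --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9.4.1207.jar. Then specify the connection properties, such as user and password; for the available options, see the official documentation.

Example:

import java.util.Properties

// Note: JDBC loading and saving can be achieved via either the load/save or jdbc methods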
// Loading data from a JDBC source
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .load()

val connectionProperties = new Properties()
connectionProperties.put("user", "username")
connectionProperties.put("password", "password")
val jdbcDF2 = spark.read
  .jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)
// Specifying the custom data types of the read schema
connectionProperties.put("customSchema", "id DECIMAL(38, 0), name STRING")
val jdbcDF3 = spark.read
  .jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)

// Saving data to a JDBC source
jdbcDF.write
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .save()

jdbcDF2.write
  .jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)

// Specifying create table column data types on write
jdbcDF.write
  .option("createTableColumnTypes", "name CHAR(64), comments VARCHAR(1024)")
  .jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)

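4 Performance tuning

4.1 Caching data in memory

A table can be cached with spark.catalog.cacheTable("tableName") or dataFrame.cache(). In-memory caching options can be set with the setConf method on SparkSession or with SET key=value in SQL. For the individual parameters, see the official documentation.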
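A short sketch of caching and uncaching a view, together with one caching option. The view name people, the sample rows, and the batch size value are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only
val spark = SparkSession.builder().master("local[*]").appName("cache-table").getOrCreate()
import spark.implicits._

Seq(("Andy", 30), ("Justin", 19)).toDF("name", "age").createOrReplaceTempView("people")

// Optional tuning: number of rows per in-memory columnar batch
spark.conf.set("spark.sql.inMemoryColumnarStorage.batchSize", "5000")

// Mark the view for caching; the data is materialized by the first action
spark.catalog.cacheTable("people")
val cachedNow = spark.catalog.isCached("people")
spark.sql("SELECT count(*) FROM people").show()

// Release the memory when the data is no longer needed
spark.catalog.uncacheTable("people")
val cachedAfter = spark.catalog.isCached("people")
```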

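4.2 Other configuration options

There are further options that can improve query performance; see the official documentation.

4.3 Broadcast hint for SQL queries

This means a table can be broadcast when it takes part in a join, as follows: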

import org.apache.spark.sql.functions.broadcast
broadcast(spark.table("src")).join(spark.table("records"), "key").show()

I did not fully understand the details; I will look into them when I need this.
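One way to see what the hint does is to compare physical plans with and without it. The sketch below is an illustration under assumptions: the two small DataFrames are made up, and automatic broadcasting is switched off through spark.sql.autoBroadcastJoinThreshold so that only the explicit hint can cause a broadcast join.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Local session for illustration only
val spark = SparkSession.builder().master("local[*]").appName("broadcast-hint").getOrCreate()
import spark.implicits._

// Disable size-based automatic broadcasting so the hint is the only trigger
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

val large = spark.range(0, 1000).toDF("key")
val small = Seq((1L, "a"), (2L, "b")).toDF("key", "value")

// Without the hint the planner picks a shuffle-based join
val plainPlan = large.join(small, "key").queryExecution.executedPlan.toString

// With the hint the small side is shipped to every executor
val hintedPlan = large.join(broadcast(small), "key").queryExecution.executedPlan.toString

println(hintedPlan)
```

Broadcasting avoids shuffling the large side of the join, which is why it is usually applied to the smaller table.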

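5 Distributed SQL engine

Spark SQL can act as a distributed SQL engine through its JDBC/ODBC server and command-line interface, so end users or other applications can run SQL on it directly without writing a Spark SQL program.

5.1 Thrift JDBC/ODBC server

./sbin/start-thriftserver.sh starts the JDBC/ODBC server (by default on localhost:10000). Hive properties can be passed with --hiveconf (if Hive is already configured this is usually unnecessary, unless you need to change some settings). The default localhost:10000 can also be changed, through environment variables: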

export HIVE_SERVER2_THRIFT_PORT=<listening-port>
export HIVE_SERVER2_THRIFT_BIND_HOST=<listening-host>
./sbin/start-thriftserver.sh \
  --master <master-uri> \
  ...

or through Hive configuration properties:

./sbin/start-thriftserver.sh \
  --hiveconf hive.server2.thrift.port=<listening-port> \
  --hiveconf hive.server2.thrift.bind.host=<listening-host> \
  --master <master-uri>
  ...

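You can then test the connection with the beeline that ships with Spark or Hive (it is just a client for testing the connection).

For example, after starting the server with ./spark/sbin/start-thriftserver.sh --master yarn, open the Hadoop web UI on port 8080.

There you can see that the JDBC/ODBC server runs on YARN. The Spark UI on port 4040 then shows the monitoring page, including the execution details of each SQL statement.

The JDBC server also supports sending Thrift RPC messages over HTTP. Configure this in hive-site.xml in the conf directory (the default transport is TCP):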

hive.server2.transport.mode - Set this to value: http
hive.server2.thrift.http.port - HTTP port number to listen on; default is 10001
hive.server2.http.endpoint - HTTP endpoint; default is cliservice

Then test with beeline, as follows:

beeline> !connect jdbc:hive2://<host>:<port>/<database>?hive.server2.transport.mode=http;hive.server2.thrift.http.path=<http_endpoint>

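5.2 Spark SQL CLI

A tool for entering SQL on the command line and running the queries. Note that the Spark SQL CLI cannot connect to the Thrift JDBC server, so it is generally used as a testing tool; when SQL queries over large data need to be served, the Thrift JDBC server is the usual choice.

Run it with ./bin/spark-sql (use --help to see the usage; --master can be specified).

For Hive configuration, see the Hive data source section above.

When run in YARN mode, the job can be monitored on YARN, and the Spark UI on port 4040 shows the queries.

6 PySpark

I do not know Python, so I have no need for this yet; I will look into it when I do.

7 Migration guide

To be studied when needed.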

Spark 2.3.0 Spark SQL, Datasets, and DataFrames study notes
