1. Overview
Spark SQL is a Spark module for structured data processing. The interfaces Spark SQL provides give Spark more information about the structure of the data and of the computations being performed, and Spark SQL uses that information to perform extensive optimizations. Spark SQL offers two ways to work with structured data: SQL and the Dataset API.
2. SQL
Spark SQL can execute SQL queries directly, and it can also read data from an existing Hive installation (the configuration for this is covered in the Hive Tables section below). Spark SQL can additionally be used interactively through the command line or over JDBC/ODBC.
3. Datasets and DataFrames
A Dataset is a distributed collection of data, added as a new interface in Spark 1.6. A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.). Datasets are similar to RDDs; however, instead of Java serialization or Kryo, they use a specialized Encoder to serialize objects for processing and for transmission over the network. While both encoders and standard serialization are responsible for turning objects into bytes, encoders are generated dynamically as code and allow Spark to perform many operations such as filtering, sorting, and hashing without deserializing the bytes back into objects.
A DataFrame is a Dataset organized into named columns (in Java, a Dataset<Row>). It is conceptually equivalent to a table in a relational database, but with richer optimizations under the hood. A DataFrame can be constructed from a wide array of sources: structured data files, tables in Hive, external databases, or existing RDDs.
4. Getting Started
4.1 Starting Point: SparkSession
Since Spark 2.0, SparkSession is the entry point to all Spark SQL functionality. An instance can be created via SparkSession.builder():
import org.apache.spark.sql.SparkSession;
SparkSession spark = SparkSession
.builder()
.appName("Java Spark SQL basic example")
.config("spark.some.config.option", "some-value")
.getOrCreate();
In Spark 2.0, SparkSession provides built-in support for Hive features, including writing queries in HiveQL, accessing Hive UDFs, and reading data from Hive tables. Using these features does not require an existing Hive installation.
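As a sketch of how this looks in practice (the application name and query here are placeholders, not from the original), Hive support is enabled on the builder before getOrCreate():

```java
import org.apache.spark.sql.SparkSession;

public class HiveSessionSketch {
  public static void main(String[] args) {
    // enableHiveSupport() activates HiveQL syntax, Hive UDFs,
    // and access to Hive tables. Without an existing Hive
    // installation, Spark falls back to a local metastore and
    // a warehouse directory in the current working directory.
    SparkSession spark = SparkSession
      .builder()
      .appName("Hive-enabled example")
      .enableHiveSupport()
      .getOrCreate();

    // HiveQL can now be used directly
    spark.sql("SHOW TABLES").show();
  }
}
```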
4.2 Creating Datasets
import java.util.Arrays;
import java.util.Collections;
import java.io.Serializable;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
public static class Person implements Serializable {
private String name;
private int age;
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}
public int getAge() {
return age;
}
public void setAge(int age) {
this.age = age;
}
}
// 1. Create a Dataset from a JavaBean object
// Create an instance of a Bean class
Person person = new Person();
person.setName("Andy");
person.setAge(32);
// Encoders are created for Java beans
// Serialization goes through the Encoder rather than Java serialization
Encoder<Person> personEncoder = Encoders.bean(Person.class);
Dataset<Person> javaBeanDS = spark.createDataset(
Collections.singletonList(person),
personEncoder
);
javaBeanDS.show();
// +---+----+
// |age|name|
// +---+----+
// | 32|Andy|
// +---+----+
// 2. Create a Dataset from a basic type
// Encoders for most common types are provided in class Encoders
Encoder<Integer> integerEncoder = Encoders.INT();
Dataset<Integer> primitiveDS = spark.createDataset(Arrays.asList(1, 2, 3), integerEncoder);
Dataset<Integer> transformedDS = primitiveDS.map(new MapFunction<Integer, Integer>() {
@Override
public Integer call(Integer value) throws Exception {
return value + 1;
}
}, integerEncoder);
transformedDS.collect(); // Returns [2, 3, 4]
// 3. Create a Dataset from a file
// DataFrames can be converted to a Dataset by providing a class. Mapping based on name
String path = "examples/src/main/resources/people.json";
Dataset<Person> peopleDS = spark.read().json(path).as(personEncoder);
peopleDS.show();
// +----+-------+
// | age| name|
// +----+-------+
// |null|Michael|
// | 30| Andy|
// | 19| Justin|
// +----+-------+
4.3 Creating DataFrames
With a SparkSession, applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources.
Creating a DataFrame from a JSON file:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
Dataset<Row> df = spark.read().json("examples/src/main/resources/people.json");
// Displays the content of the DataFrame to stdout
df.show();
// +----+-------+
// | age| name|
// +----+-------+
// |null|Michael|
// | 30| Andy|
// | 19| Justin|
// +----+-------+
DataFrame Operations
// col("...") is preferable to df.col("...")
import static org.apache.spark.sql.functions.col;
// Print the schema in a tree format
df.printSchema();
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
// Select only the "name" column
df.select("name").show();
// +-------+
// | name|
// +-------+
// |Michael|
// | Andy|
// | Justin|
// +-------+
// Select everybody, but increment the age by 1
df.select(col("name"), col("age").plus(1)).show();
// +-------+---------+
// | name|(age + 1)|
// +-------+---------+
// |Michael| null|
// | Andy| 31|
// | Justin| 20|
// +-------+---------+
// Select people older than 21
df.filter(col("age").gt(21)).show();
// +---+----+
// |age|name|
// +---+----+
// | 30|Andy|
// +---+----+
// Count people by age
df.groupBy("age").count().show();
// +----+-----+
// | age|count|
// +----+-----+
// | 19| 1|
// |null| 1|
// | 30| 1|
// +----+-----+
Running SQL Queries
The sql function on a SparkSession lets applications run SQL queries programmatically.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
// Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people");
Dataset<Row> sqlDF = spark.sql("SELECT * FROM people");
sqlDF.show();
// +----+-------+
// | age| name|
// +----+-------+
// |null|Michael|
// | 30| Andy|
// | 19| Justin|
// +----+-------+
5. Interoperating with RDDs
Spark SQL provides two different methods for converting existing RDDs into Datasets.
- Reflection: the schema of an RDD containing objects of a specific type is inferred via reflection. This approach works when the schema is known up front, and it leads to more concise code.
- Programmatic interface: a schema is constructed explicitly and then applied to an existing RDD. This is more verbose, but it allows constructing Datasets when the columns and their types are not known until runtime.
Reflection
Spark SQL supports automatically converting an RDD of JavaBeans into a DataFrame. The bean's information, obtained via reflection, defines the schema of the DataFrame to be created. Currently, Spark SQL does not support JavaBeans that contain Map fields; nested JavaBeans and List or Array fields are supported.
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
// Create an RDD of Person objects from a text file
JavaRDD<Person> peopleRDD = spark.read()
.textFile("examples/src/main/resources/people.txt")
.javaRDD()
.map(new Function<String, Person>() {
@Override
public Person call(String line) throws Exception {
String[] parts = line.split(",");
Person person = new Person();
person.setName(parts[0]);
person.setAge(Integer.parseInt(parts[1].trim()));
return person;
}
});
// Apply a schema to an RDD of JavaBeans to get a DataFrame
Dataset<Row> peopleDF = spark.createDataFrame(peopleRDD, Person.class);
// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people");
// SQL statements can be run by using the sql methods provided by spark
Dataset<Row> teenagersDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19");
// The columns of a row in the result can be accessed by field index
Encoder<String> stringEncoder = Encoders.STRING();
Dataset<String> teenagerNamesByIndexDF = teenagersDF.map(new MapFunction<Row, String>() {
@Override
public String call(Row row) throws Exception {
return "Name: " + row.getString(0);
}
}, stringEncoder);
teenagerNamesByIndexDF.show();
// +------------+
// | value|
// +------------+
// |Name: Justin|
// +------------+
// or by field name
Dataset<String> teenagerNamesByFieldDF = teenagersDF.map(new MapFunction<Row, String>() {
@Override
public String call(Row row) throws Exception {
return "Name: " + row.<String>getAs("name");
}
}, stringEncoder);
teenagerNamesByFieldDF.show();
// +------------+
// | value|
// +------------+
// |Name: Justin|
// +------------+
Programmatic Interface
When JavaBean classes cannot be defined ahead of time (for example, when the structure of records is encoded in a string, or a text dataset will be parsed and projected differently at runtime), a Dataset<Row> can be created programmatically in three steps:
- Create an RDD of Rows from the original RDD
- Create a StructType instance representing the schema that matches the structure of the Rows created in step 1
- Apply the schema to the RDD of Rows via the createDataFrame method provided by SparkSession
Example:
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
// Create an RDD
JavaRDD<String> peopleRDD = spark.sparkContext()
.textFile("examples/src/main/resources/people.txt", 1)
.toJavaRDD();
// The schema is encoded in a string
String schemaString = "name age";
// Generate the schema based on the string of schema
List<StructField> fields = new ArrayList<>();
for (String fieldName : schemaString.split(" ")) {
StructField field = DataTypes.createStructField(fieldName, DataTypes.StringType, true);
fields.add(field);
}
// Create the schema
StructType schema = DataTypes.createStructType(fields);
// Convert records of the RDD (people) to Rows
JavaRDD<Row> rowRDD = peopleRDD.map(new Function<String, Row>() {
@Override
public Row call(String record) throws Exception {
String[] attributes = record.split(",");
return RowFactory.create(attributes[0], attributes[1].trim());
}
});
// Apply the schema to the RDD
Dataset<Row> peopleDataFrame = spark.createDataFrame(rowRDD, schema);
// Creates a temporary view using the DataFrame
peopleDataFrame.createOrReplaceTempView("people");
// SQL can be run over a temporary view created using DataFrames
Dataset<Row> results = spark.sql("SELECT name FROM people");
// The results of SQL queries are DataFrames and support all the normal RDD operations
// The columns of a row in the result can be accessed by field index or by field name
Dataset<String> namesDS = results.map(new MapFunction<Row, String>() {
@Override
public String call(Row row) throws Exception {
return "Name: " + row.getString(0);
}
}, Encoders.STRING());
namesDS.show();
// +-------------+
// | value|
// +-------------+
// |Name: Michael|
// | Name: Andy|
// | Name: Justin|
// +-------------+
6. Data Sources
Through the DataFrame abstraction, Spark SQL can operate on many different types of data sources. A DataFrame can be manipulated with relational transformations and can also be registered as a temporary view; registering a DataFrame as a temporary view lets you run SQL queries over its data.
6.1 Loading and Saving Data
Dataset<Row> usersDF = spark.read().load("examples/src/main/resources/users.parquet");
usersDF.select("name", "favorite_color").write().save("namesAndFavColors.parquet");
In its simplest form, the default data source (parquet, configurable via spark.sql.sources.default) is used for all operations.
Dataset<Row> peopleDF =
spark.read().format("json").load("examples/src/main/resources/people.json");
peopleDF.select("name", "age").write().format("parquet").save("namesAndAges.parquet");
Data source options can also be set manually. Data sources are specified by their fully qualified name, though built-in sources can use short names (json, parquet, jdbc). A DataFrame loaded from any data source can be converted to and saved in another format.
Dataset<Row> sqlDF =
spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`");
SQL can also be run directly on files.
6.2 Save Modes
A save operation can optionally take a save mode, which specifies what to do when data already exists at the target.
Scala/Java | Any Language | Meaning
---|---|---
SaveMode.ErrorIfExists (default) | "error" (default) | When saving a DataFrame to a data source, if data already exists, an exception is thrown.
SaveMode.Append | "append" | The contents of the DataFrame are appended to the existing data.
SaveMode.Overwrite | "overwrite" | Existing data is deleted before the new data is written.
SaveMode.Ignore | "ignore" | The save leaves existing data unchanged and writes nothing, similar to CREATE TABLE IF NOT EXISTS in SQL.
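For illustration (the output paths here are hypothetical, not from the original), a save mode can be selected either with the SaveMode enum or with its string name:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

Dataset<Row> df = spark.read().json("examples/src/main/resources/people.json");

// Overwrite: any existing data at the target is deleted first
df.write().mode(SaveMode.Overwrite).parquet("people_overwrite.parquet");

// The string form works as well; "append" adds to existing data
df.write().mode("append").parquet("people_append.parquet");
```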
7. Parquet Files
Parquet is a columnar storage format. The format itself is a topic of its own; readers interested in the details should consult the Parquet documentation. Spark SQL provides read and write support for Parquet files.
Example:
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
Dataset<Row> peopleDF = spark.read().json("examples/src/main/resources/people.json");
// DataFrames can be saved as Parquet files, maintaining the schema information
peopleDF.write().parquet("people.parquet");
// Read in the Parquet file created above.
// Parquet files are self-describing so the schema is preserved
// The result of loading a parquet file is also a DataFrame
Dataset<Row> parquetFileDF = spark.read().parquet("people.parquet");
// Parquet files can also be used to create a temporary view and then used in SQL statements
parquetFileDF.createOrReplaceTempView("parquetFile");
Dataset<Row> namesDF = spark.sql("SELECT name FROM parquetFile WHERE age BETWEEN 13 AND 19");
Dataset<String> namesDS = namesDF.map(new MapFunction<Row, String>() {
public String call(Row row) {
return "Name: " + row.getString(0);
}
}, Encoders.STRING());
namesDS.show();
// +------------+
// | value|
// +------------+
// |Name: Justin|
// +------------+
8. Partition Discovery
Table partitioning is a common optimization technique, used for example by Hive. In a partitioned table, data is stored in different directories according to the values of the partition columns. The Parquet data source can now discover partition information automatically. As an example, consider user data partitioned by gender and country into the following directory structure:
path
└── to
└── table
├── gender=male
│ ├── ...
│ │
│ ├── country=US
│ │ └── data.parquet
│ ├── country=CN
│ │ └── data.parquet
│ └── ...
└── gender=female
├── ...
│
├── country=US
│ └── data.parquet
├── country=CN
│ └── data.parquet
└── ...
Passing path/to/table to either SparkSession.read().parquet() or SparkSession.read().load() makes Spark SQL automatically extract the partition information from the path. The schema of the returned DataFrame is:
root
|-- name: string (nullable = true)
|-- age: long (nullable = true)
 |-- gender: string (nullable = true) // partition column
 |-- country: string (nullable = true) // partition column
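As a sketch (the path mirrors the hypothetical layout above), reading the table root is enough to make the partition columns appear in the schema, and filters on them let Spark scan only the matching directories:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Reading the root directory discovers gender and country as columns
Dataset<Row> partitioned = spark.read().parquet("path/to/table");
partitioned.printSchema();

// Filters on partition columns prune non-matching partition directories
partitioned.filter("gender = 'male' AND country = 'US'").show();
```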
9. JSON Datasets
Spark SQL can automatically infer the schema of a JSON dataset and load it as a Dataset<Row>. This conversion is done via SparkSession.read().json().
Note the required input format: each complete JSON object must sit on a single line. A pretty-printed, multi-line JSON file is not a valid input and will not load correctly.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
// A JSON dataset is pointed to by path.
// The path can be either a single text file or a directory storing text files
Dataset<Row> people = spark.read().json("examples/src/main/resources/people.json");
// The inferred schema can be visualized using the printSchema() method
people.printSchema();
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
// Creates a temporary view using the DataFrame
people.createOrReplaceTempView("people");
// SQL statements can be run by using the sql methods provided by spark
Dataset<Row> namesDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19");
namesDF.show();
// +------+
// | name|
// +------+
// |Justin|
// +------+
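In newer Spark versions (2.2+), a JSON Dataset can also be created from an in-memory Dataset<String>, where each string holds one complete JSON object on a single line, matching the one-object-per-line rule above (the sample record is illustrative):

```java
import java.util.Collections;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// Each string must contain one complete JSON object
Dataset<String> jsonDS = spark.createDataset(
    Collections.singletonList(
        "{\"name\":\"Yin\",\"address\":{\"city\":\"Columbus\",\"state\":\"Ohio\"}}"),
    Encoders.STRING());

// The schema (name, nested address) is inferred from the strings
Dataset<Row> anotherPeople = spark.read().json(jsonDS);
anotherPeople.show();
```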
10. Hive Tables
Spark SQL also supports reading and writing data stored in Hive. To enable Hive support, additional configuration is required: place hive-site.xml, core-site.xml, and hdfs-site.xml in the conf/ directory.
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
public static class Record implements Serializable {
private int key;
private String value;
public int getKey() {
return key;
}
public void setKey(int key) {
this.key = key;
}
public String getValue() {
return value;
}
public void setValue(String value) {
this.value = value;
}
}
// warehouseLocation points to the default location for managed databases and tables
String warehouseLocation = "file:" + System.getProperty("user.dir") + "/spark-warehouse";
SparkSession spark = SparkSession
.builder()
.appName("Java Spark Hive Example")
.config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport()
.getOrCreate();
spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)");
spark.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src");
// Queries are expressed in HiveQL
spark.sql("SELECT * FROM src").show();
// +---+-------+
// |key| value|
// +---+-------+
// |238|val_238|
// | 86| val_86|
// |311|val_311|
// ...
// Aggregation queries are also supported.
spark.sql("SELECT COUNT(*) FROM src").show();
// +--------+
// |count(1)|
// +--------+
// | 500 |
// +--------+
// The results of SQL queries are themselves DataFrames and support all normal functions.
Dataset<Row> sqlDF = spark.sql("SELECT key, value FROM src WHERE key < 10 ORDER BY key");
// The items in DataFrames are of type Row, which lets you access each column by ordinal.
Dataset<String> stringsDS = sqlDF.map(new MapFunction<Row, String>() {
@Override
public String call(Row row) throws Exception {
return "Key: " + row.get(0) + ", Value: " + row.get(1);
}
}, Encoders.STRING());
stringsDS.show();
// +--------------------+
// | value|
// +--------------------+
// |Key: 0, Value: val_0|
// |Key: 0, Value: val_0|
// |Key: 0, Value: val_0|
// ...
// You can also use DataFrames to create temporary views within a SparkSession.
List<Record> records = new ArrayList<>();
for (int key = 1; key < 100; key++) {
Record record = new Record();
record.setKey(key);
record.setValue("val_" + key);
records.add(record);
}
Dataset<Row> recordsDF = spark.createDataFrame(records, Record.class);
recordsDF.createOrReplaceTempView("records");
// Queries can then join DataFrames data with data stored in Hive.
spark.sql("SELECT * FROM records r JOIN src s ON r.key = s.key").show();
// +---+------+---+------+
// |key| value|key| value|
// +---+------+---+------+
// | 2| val_2| 2| val_2|
// | 2| val_2| 2| val_2|
// | 4| val_4| 4| val_4|
// ...