1、Data Sources
Spark SQL supports operating on a variety of data sources through the DataFrame interface. A DataFrame can be manipulated with relational transformations and can also be used to create a temporary view. Registering a DataFrame as a temporary view allows you to run SQL queries over its data. This section describes the general methods for loading and saving data with the Spark data sources, and then covers the specific options available for the built-in data sources.
The data source API first appeared in Spark 1.2: Spark SQL exposes a set of interfaces for connecting external data sources that developers can implement, for example to read MySQL, Hive, HDFS, HBase, and so on, with support for many formats such as JSON, Parquet, Avro, and CSV. We can build arbitrary external data sources that plug into Spark SQL, and then operate on them through the external data source API.
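The generic load/save API described above can be sketched as a round trip: write a DataFrame out with `format(...).save(...)`, then read it back with `format(...).load(...)`. This is a minimal self-contained sketch, assuming a local Spark session and a temporary output directory (both hypothetical, not from the post):

```scala
import org.apache.spark.sql.SparkSession

object GenericLoadSave {
  // Round-trip a small DataFrame through the generic format/save + format/load API.
  def roundTrip(): Long = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("GenericLoadSave")
      .getOrCreate()
    import spark.implicits._

    // a throwaway output path; save() requires that the target not exist yet
    val out = java.nio.file.Files.createTempDirectory("ds_demo").toString + "/people_json"

    Seq((1, "zhangsan"), (2, "lisi")).toDF("id", "name")
      .write.format("json").save(out)                   // generic write: format + save

    val n = spark.read.format("json").load(out).count() // generic read: format + load
    spark.stop()
    n
  }

  def main(args: Array[String]): Unit = println(roundTrip())
}
```

The same pair of calls works for any built-in source by swapping the format string, which is what the sections below demonstrate one source at a time.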
2、Reading JSON Files
Earlier, we read a JSON file like this:
scala> val df=spark.read.json("file:///home/hadoop/data/stu1.json")
df: org.apache.spark.sql.DataFrame = [email: string, id: string ... 2 more fields]
scala> df.show
+--------+---+--------+-----------+
| email| id| name| phone|
+--------+---+--------+-----------+
|1@qq.com| 1|zhangsan|13721442689|
|2@qq.com| 2| lisi|13721442687|
|3@qq.com| 3| wangwu|13721442688|
|4@qq.com| 4|xiaoming|13721442686|
|5@qq.com| 5|xiaowang|13721442685|
+--------+---+--------+-----------+
The standard way to write it is as follows:
specify the file format with format, then load the path with load.
scala> val df=spark.read.format("json").load("file:///home/hadoop/data/stu1.json")
df: org.apache.spark.sql.DataFrame = [email: string, id: string ... 2 more fields]
scala> df.show
+--------+---+--------+-----------+
| email| id| name| phone|
+--------+---+--------+-----------+
|1@qq.com| 1|zhangsan|13721442689|
|2@qq.com| 2| lisi|13721442687|
|3@qq.com| 3| wangwu|13721442688|
|4@qq.com| 4|xiaoming|13721442686|
|5@qq.com| 5|xiaowang|13721442685|
+--------+---+--------+-----------+
scala> df.printSchema
root
|-- email: string (nullable = true)
|-- id: string (nullable = true)
|-- name: string (nullable = true)
|-- phone: string (nullable = true)
scala> df.createOrReplaceTempView("student")
scala> spark.sql("select * from student").show
19/08/07 06:47:25 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
+--------+---+--------+-----------+
| email| id| name| phone|
+--------+---+--------+-----------+
|1@qq.com| 1|zhangsan|13721442689|
|2@qq.com| 2| lisi|13721442687|
|3@qq.com| 3| wangwu|13721442688|
|4@qq.com| 4|xiaoming|13721442686|
|5@qq.com| 5|xiaowang|13721442685|
+--------+---+--------+-----------+
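Note in the printSchema output above that every column, including id, was inferred as string. When you want fixed column types, you can pass an explicit schema to the reader instead of relying on inference. A minimal sketch, assuming a hypothetical temp file with one record in the same shape as stu1.json:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object JsonWithSchema {
  // Read a JSON file with an explicit schema instead of letting Spark infer one.
  def readWithSchema(): String = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("JsonWithSchema")
      .getOrCreate()

    // one sample record, written to a temp file (hypothetical stand-in for stu1.json)
    val path = java.nio.file.Files.createTempFile("stu", ".json")
    java.nio.file.Files.write(path,
      """{"id":1,"name":"zhangsan","email":"1@qq.com","phone":"13721442689"}""".getBytes("UTF-8"))

    val schema = StructType(Seq(
      StructField("id", IntegerType),   // declared int, not the inferred string
      StructField("name", StringType),
      StructField("email", StringType),
      StructField("phone", StringType)))

    val df = spark.read.format("json").schema(schema).load(path.toString)
    val idType = df.schema("id").dataType.typeName
    spark.stop()
    idType
  }
}
```

With the schema supplied, id comes back as an integer column rather than the string you get from inference.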
3、Reading Parquet Data
//Convert the JSON data to Parquet format and save it to this path
//Snappy compression is used by default
scala> df.write.format("parquet").save("file:///home/hadoop/data/parquet_data")
[hadoop@vm01 data]$ cd parquet_data/
[hadoop@vm01 parquet_data]$ ll
total 4
-rw-r--r--. 1 hadoop hadoop 1105 Aug 7 07:13 part-00000-add54ffe-b938-4793-8919-6f74543648a0-c000.snappy.parquet
-rw-r--r--. 1 hadoop hadoop 0 Aug 7 07:13 _SUCCESS
Read it back:
scala> spark.read.format("parquet").load("file:///home/hadoop/data/parquet_data").show
+--------+---+--------+-----------+
| email| id| name| phone|
+--------+---+--------+-----------+
|1@qq.com| 1|zhangsan|13721442689|
|2@qq.com| 2| lisi|13721442687|
|3@qq.com| 3| wangwu|13721442688|
|4@qq.com| 4|xiaoming|13721442686|
|5@qq.com| 5|xiaowang|13721442685|
+--------+---+--------+-----------+
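As noted above, Parquet output is Snappy-compressed by default (hence the .snappy.parquet file name). The writer also accepts a compression option to switch codecs, and a save mode to control what happens when the path already exists. A small sketch, assuming a hypothetical temp directory and sample data:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object ParquetCompression {
  // Write Parquet with an explicit codec and overwrite mode, then read it back.
  def writeGzip(): Long = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("ParquetCompression")
      .getOrCreate()
    import spark.implicits._

    val out = java.nio.file.Files.createTempDirectory("pq_demo").toString + "/data"
    val df = Seq((1, "a"), (2, "b")).toDF("id", "name")

    df.write
      .format("parquet")
      .option("compression", "gzip")  // snappy is the default codec
      .mode(SaveMode.Overwrite)       // replace the path instead of failing if it exists
      .save(out)

    val n = spark.read.format("parquet").load(out).count()
    spark.stop()
    n
  }
}
```

Without a mode, a second save to the same path throws an AnalysisException, which is why SaveMode.Overwrite (or Append) is worth knowing early.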
4、Reading Data from Hive
To read data from Hive, Hive and Spark must be integrated first.
For how to integrate them, see this post: https://blog.csdn.net/greenplum_xiaofan/article/details/98578504
Note that when developing in IDEA, you need to add .enableHiveSupport()
to enable Hive support:
val spark = SparkSession.builder()
  .master("local")
  .appName("DataSoureceApp")
  .enableHiveSupport() // enable Hive support
  .getOrCreate()
scala> spark.sql("show tables").show(false)
19/08/07 07:20:14 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
|default |worker |false |
+--------+---------+-----------+
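Besides spark.sql, a table like worker above can also be read directly with spark.table, which resolves a name the same way "select * from ..." does. With enableHiveSupport() it resolves Hive tables; in the sketch below a temporary view stands in for the Hive table so it runs without a metastore (the data and view name are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

object TableQuery {
  // spark.table resolves a table name just like a FROM clause does.
  // Against a Hive-enabled session this would hit the metastore; here a
  // temp view stands in so the sketch runs standalone.
  def queryWorker(): Long = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("TableQuery")
      .getOrCreate()
    import spark.implicits._

    Seq((1, "tom"), (2, "jerry")).toDF("id", "name").createOrReplaceTempView("worker")

    val n = spark.table("worker").count()
    spark.stop()
    n
  }
}
```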
5、Reading Data from MySQL
Official docs: http://spark.apache.org/docs/2.4.2/sql-data-sources-jdbc.html
When developing in IDEA, you need to add the mysql-connector-java-5.1.47.jar
dependency.
Locate your MySQL JAR and add it to the project; once it appears under the project libraries, it has been added successfully.
Then restart IDEA.
package com.ruozedata.spark

import org.apache.spark.sql.SparkSession

object DataSoureceApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("DataSoureceApp")
      .getOrCreate()

    // Loading data from a JDBC source
    val jdbcDF = spark.read.format("jdbc")
      .option("url", "jdbc:mysql://192.168.137.130:3306?useSSL=true")
      // .option("driver", "com.mysql.jdbc.Driver") // required when running in spark-shell
      .option("dbtable", "ruoze_d6.tbls")
      .option("user", "root")
      .option("password", "syncdb123!")
      .load()
    jdbcDF.show(false)

    spark.stop()
  }
}
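The read above pulls the whole table through a single connection. For larger tables, the JDBC source can also read in parallel if you give it a numeric column and a value range to split on. This is a sketch only, not runnable as-is: it assumes the MySQL instance and credentials from the code above are reachable, and that ruoze_d6.tbls has a numeric TBL_ID column to partition on (the bounds 1 to 100 are made-up illustration values):

```scala
import java.util.Properties

import org.apache.spark.sql.SparkSession

object JdbcPartitionedRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("JdbcPartitionedRead")
      .getOrCreate()

    val props = new Properties()
    props.put("user", "root")
    props.put("password", "syncdb123!")
    props.put("driver", "com.mysql.jdbc.Driver") // needed when running in spark-shell

    // Parallel read: rows are split into 4 partitions by TBL_ID range,
    // so Spark opens 4 connections instead of 1.
    val jdbcDF = spark.read.jdbc(
      "jdbc:mysql://192.168.137.130:3306/ruoze_d6?useSSL=true",
      "tbls",
      columnName = "TBL_ID",
      lowerBound = 1L,
      upperBound = 100L,
      numPartitions = 4,
      props)

    jdbcDF.show(false)
    spark.stop()
  }
}
```

The bounds only decide how the partition ranges are cut; rows outside them are still read, just all into the first or last partition.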