Spark SQL (5): Spark SQL Data Sources

Spark SQL can work with a wide variety of data sources.

1. Using load (the read function) and save (the write function)

The default data source is the Parquet file, a columnar storage format.

Loading with load:

// When reading a Parquet file there is no need to specify a format, because Parquet is the default.
scala> val userDF = spark.read.load("/usr/local/tmp_files/users.parquet")
userDF: org.apache.spark.sql.DataFrame = [name: string, favorite_color: string ... 1 more field]

scala> userDF.printSchema
root
 |-- name: string (nullable = true)
 |-- favorite_color: string (nullable = true)
 |-- favorite_numbers: array (nullable = true)
 |    |-- element: integer (containsNull = true)

scala> userDF.show
+------+--------------+----------------+
|  name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa|          null|  [3, 9, 15, 20]|
|   Ben|           red|              []|
+------+--------------+----------------+
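
For reference, the call above is equivalent to naming the format explicitly; a minimal sketch, reusing the same users.parquet file:

val userDF2 = spark.read.format("parquet").load("/usr/local/tmp_files/users.parquet")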

// When reading a JSON file, specify the format or use read.json, because the default format is Parquet.
scala> val testResult = spark.read.json("/usr/local/tmp_files/emp.json")
testResult: org.apache.spark.sql.DataFrame = [comm: string, deptno: bigint ... 6 more fields]

scala> val testResult = spark.read.format("json").load("/usr/local/tmp_files/emp.json")
testResult: org.apache.spark.sql.DataFrame = [comm: string, deptno: bigint ... 6 more fields]

scala> val testResult = spark.read.load("/usr/local/tmp_files/emp.json") // this throws an error
 
// You do not have to point at a specific file; file names are often unpredictable, so you can read the directory that contains the files instead.
scala> val testResult = spark.read.load("/usr/local/tmp_files/parquet/part-00000-77d38cbb-ec43-439a-94e2-9a0382035552.snappy.parquet")
testResult: org.apache.spark.sql.DataFrame = [name: string, favorite_color: string]

scala> testResult.show
+------+--------------+
|  name|favorite_color|
+------+--------------+
|Alyssa|          null|
|   Ben|           red|
+------+--------------+

scala> val testResult = spark.read.load("/usr/local/tmp_files/parquet")
testResult: org.apache.spark.sql.DataFrame = [name: string, favorite_color: string]

scala> testResult.show
+------+--------------+
|  name|favorite_color|
+------+--------------+
|Alyssa|          null|
|   Ben|           red|
+------+--------------+

Saving with save:

write.save("path")
write.mode("overwrite").save("path")
write.mode("overwrite").parquet("path")
write.saveAsTable("table_name")

Save Modes

  • A SaveMode can be supplied when saving; it defines how existing data at the target is handled. Note that these save modes do not use any locking and are not atomic. In addition, when Overwrite is used, the existing data is deleted before the new data is written out.
  • The available modes are ErrorIfExists (the default), Append, Overwrite and Ignore; a minimal sketch of each follows.
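
These modes can also be passed as the SaveMode enum rather than a string; a minimal sketch, assuming userDF from above and a hypothetical output path /usr/local/tmp_files/out:

import org.apache.spark.sql.SaveMode

userDF.select("name").write.mode(SaveMode.ErrorIfExists).save("/usr/local/tmp_files/out")  // fail if the path already exists (the default mode)
userDF.select("name").write.mode(SaveMode.Append).save("/usr/local/tmp_files/out")         // add files alongside the existing data
userDF.select("name").write.mode(SaveMode.Overwrite).save("/usr/local/tmp_files/out")      // delete the old data first, then write
userDF.select("name").write.mode(SaveMode.Ignore).save("/usr/local/tmp_files/out")         // silently skip the write if the path exists
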
// When calling save you can specify the save mode, for example append or overwrite.
scala> userDF.select("name").write.save("/usr/local/tmp_files/parquet")

// If the target path already exists, a plain write fails with an "already exists" error; use "overwrite" instead.
scala> userDF.select("name").write.mode("overwrite").save("/usr/local/tmp_files/parquet")

// Save the result as a table
scala> userDF.select("name").write.saveAsTable("table0821")

scala> spark.sql("select * from table0821").show
+------+
|  name|
+------+
|Alyssa|
|   Ben|
+------+

After closing the Spark Shell and restarting it:

scala> userDF.show
<console>:24: error: not found: value userDF
       userDF.show
       ^

scala> spark.sql("select * from table0821").show
+------+
|  name|
+------+
|Alyssa|
|   Ben|
+------+

Why the DataFrame is gone but table0821 can still be read:
The directory from which the Spark Shell was started contains a spark-warehouse folder, and saveAsTable stores its data in that folder. If the Spark Shell is started from a different path, the data cannot be read. For example, starting it from the Spark home directory with # ./bin/spark-shell --master spark://node3:7077 cannot access the data, because there is no spark-warehouse folder under the Spark home directory.
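
To see exactly where saveAsTable is writing, check the warehouse location of the current session (a minimal sketch; the location can also be pinned at startup with --conf spark.sql.warehouse.dir=...):

// Print the warehouse directory in use; by default it resolves to spark-warehouse under the current working directory
spark.conf.get("spark.sql.warehouse.dir")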

2. Parquet Files

A Parquet file is a columnar storage file and the default data source of Spark SQL.

Conceptually, a Parquet file is just an ordinary file.
Parquet is a columnar file format, and columnar storage has the following key properties:

  • Data that does not match the query can be skipped, so only the required data is read, reducing I/O.
  • Compression and encoding reduce disk usage. Because all values in a column share the same type, more efficient encodings (such as Run Length Encoding and Delta Encoding) can be used to save even more space.
  • Only the needed columns are read, and vectorized operations are supported, which gives better scan performance.
  • Parquet is the default data source of Spark SQL, configurable through spark.sql.sources.default (see the sketch after this list).
  • When writing Parquet files, all columns are automatically converted to nullable for compatibility reasons.
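
A minimal sketch of switching the default source at runtime (reusing emp.json from the examples above):

spark.conf.set("spark.sql.sources.default", "json")
val empJsonDF = spark.read.load("/usr/local/tmp_files/emp.json")   // now resolved as JSON, no format() needed
spark.conf.set("spark.sql.sources.default", "parquet")             // switch back to the usual default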

2.1 Converting other files into Parquet files

This is very simple: call the save function, and Parquet is the default output format.
The idea: read the data in, then write it back out. The result is a Parquet file.

scala> val empDF = spark.read.json("/usr/local/tmp_files/emp.json")
empDF: org.apache.spark.sql.DataFrame = [comm: string, deptno: bigint ... 6 more fields]

scala> empDF.show
+----+------+-----+------+----------+---------+----+----+
|comm|deptno|empno| ename|  hiredate|      job| mgr| sal|
+----+------+-----+------+----------+---------+----+----+
|    |    20| 7369| SMITH|1980/12/17|    CLERK|7902| 800|
| 300|    30| 7499| ALLEN| 1981/2/20| SALESMAN|7698|1600|
| 500|    30| 7521|  WARD| 1981/2/22| SALESMAN|7698|1250|
|    |    20| 7566| JONES|  1981/4/2|  MANAGER|7839|2975|
|1400|    30| 7654|MARTIN| 1981/9/28| SALESMAN|7698|1250|
|    |    30| 7698| BLAKE|  1981/5/1|  MANAGER|7839|2850|
|    |    10| 7782| CLARK|  1981/6/9|  MANAGER|7839|2450|
|    |    20| 7788| SCOTT| 1987/4/19|  ANALYST|7566|3000|
|    |    10| 7839|  KING|1981/11/17|PRESIDENT|    |5000|
|   0|    30| 7844|TURNER|  1981/9/8| SALESMAN|7698|1500|
|    |    20| 7876| ADAMS| 1987/5/23|    CLERK|7788|1100|
|    |    30| 7900| JAMES| 1981/12/3|    CLERK|7698| 950|
|    |    20| 7902|  FORD| 1981/12/3|  ANALYST|7566|3000|
|    |    10| 7934|MILLER| 1982/1/23|    CLERK|7782|1300|
+----+------+-----+------+----------+---------+----+----+

// Two ways to write
scala> empDF.write.mode("overwrite").save("/usr/local/tmp_files/parquet") 
scala> empDF.write.mode("overwrite").parquet("/usr/local/tmp_files/parquet")

// Read it back
scala> val emp1 = spark.read.parquet("/usr/local/tmp_files/parquet")
emp1: org.apache.spark.sql.DataFrame = [comm: string, deptno: bigint ... 6 more fields]

scala> emp1.createOrReplaceTempView("emp1")

scala> spark
res4: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@71e6bac0

scala> spark.sql("select * from emp1")
res5: org.apache.spark.sql.DataFrame = [comm: string, deptno: bigint ... 6 more fields]

scala> spark.sql("select * from emp1").show
+----+------+-----+------+----------+---------+----+----+
|comm|deptno|empno| ename|  hiredate|      job| mgr| sal|
+----+------+-----+------+----------+---------+----+----+
|    |    20| 7369| SMITH|1980/12/17|    CLERK|7902| 800|
| 300|    30| 7499| ALLEN| 1981/2/20| SALESMAN|7698|1600|
| 500|    30| 7521|  WARD| 1981/2/22| SALESMAN|7698|1250|
|    |    20| 7566| JONES|  1981/4/2|  MANAGER|7839|2975|
|1400|    30| 7654|MARTIN| 1981/9/28| SALESMAN|7698|1250|
|    |    30| 7698| BLAKE|  1981/5/1|  MANAGER|7839|2850|
|    |    10| 7782| CLARK|  1981/6/9|  MANAGER|7839|2450|
|    |    20| 7788| SCOTT| 1987/4/19|  ANALYST|7566|3000|
|    |    10| 7839|  KING|1981/11/17|PRESIDENT|    |5000|
|   0|    30| 7844|TURNER|  1981/9/8| SALESMAN|7698|1500|
|    |    20| 7876| ADAMS| 1987/5/23|    CLERK|7788|1100|
|    |    30| 7900| JAMES| 1981/12/3|    CLERK|7698| 950|
|    |    20| 7902|  FORD| 1981/12/3|  ANALYST|7566|3000|
|    |    10| 7934|MILLER| 1982/1/23|    CLERK|7782|1300|
+----+------+-----+------+----------+---------+----+----+

2.2 Schema merging

At the start of a project the tables are simple and so are their schemas. As the project grows the tables become more complex, and new columns are gradually added.
In this way users end up with multiple Parquet files whose schemas differ but remain mutually compatible, and Spark SQL can merge them when reading:

spark.read.option("mergeSchema", true).parquet("path")

scala> val df1 = sc.makeRDD(1 to 5).map(i => (i,i*2)).toDF("single","double")
df1: org.apache.spark.sql.DataFrame = [single: int, double: int]

scala> df1.write.parquet("/usr/local/tmp_files/test_table/key=1")

scala> df1.show
+------+------+
|single|double|
+------+------+
|     1|     2|
|     2|     4|
|     3|     6|
|     4|     8|
|     5|    10|
+------+------+
 
scala> val df2 = sc.makeRDD(6 to 10).map(i=>(i,i*3)).toDF("single","triple")
df2: org.apache.spark.sql.DataFrame = [single: int, triple: int]

scala> df2.show
+------+------+
|single|triple|
+------+------+
|     6|    18|
|     7|    21|
|     8|    24|
|     9|    27|
|    10|    30|
+------+------+


scala> df2.write.parquet("/usr/local/tmp_files/test_table/key=2")
 
// Merge the schemas
scala> val df3 = spark.read.option("mergeSchema",true).parquet("/usr/local/tmp_files/test_table")
df3: org.apache.spark.sql.DataFrame = [single: int, double: int ... 2 more fields]

scala> df3.printSchema
root
 |-- single: integer (nullable = true)
 |-- double: integer (nullable = true)
 |-- triple: integer (nullable = true)
 |-- key: integer (nullable = true)

scala> df3.show
+------+------+------+---+
|single|double|triple|key|
+------+------+------+---+
|     8|  null|    24|  2|
|     9|  null|    27|  2|
|    10|  null|    30|  2|
|     3|     6|  null|  1|
|     4|     8|  null|  1|
|     5|    10|  null|  1|
|     6|  null|    18|  2|
|     7|  null|    21|  2|
|     1|     2|  null|  1|
|     2|     4|  null|  1|
+------+------+------+---+
 
// Test: the partition directory can use a name other than key and the merge still works:
scala> val df1 = sc.makeRDD(1 to 5).map(i => (i,i*2)).toDF("single","double")
df1: org.apache.spark.sql.DataFrame = [single: int, double: int]

scala> df1.write.parquet("/usr/local/tmp_files/test_table/hehe=1")

scala> val df2 = sc.makeRDD(6 to 10).map(i=>(i,i*3)).toDF("single","triple")
df2: org.apache.spark.sql.DataFrame = [single: int, triple: int]

scala> df2.write.parquet("/usr/local/tmp_files/test_table/hehe=2")

scala> val df3 = spark.read.option("mergeSchema",true).parquet("/usr/local/tmp_files/test_table")
df3: org.apache.spark.sql.DataFrame = [single: int, double: int ... 2 more fields]

scala> df3.printSchema
root
 |-- single: integer (nullable = true)
 |-- double: integer (nullable = true)
 |-- triple: integer (nullable = true)
 |-- hehe: integer (nullable = true)
  
// Test: using two different partition column names at the same time fails; the files cannot be merged:
scala> df1.write.parquet("/usr/local/tmp_files/test_table/hehe=1")
                                                                     
scala> df2.write.parquet("/usr/local/tmp_files/test_table/key=2")

scala> val df3 = spark.read.option("mergeSchema",true).parquet("/usr/local/tmp_files/test_table")
java.lang.AssertionError: assertion failed: Conflicting partition column names detected:

    Partition column name list #0: key
    Partition column name list #1: hehe
   
// After renaming the directories so the partition column names match, the merge works again:
# mv key\=2/ hehe\=2/
 
scala> val df3 = spark.read.option("mergeSchema",true).parquet("/usr/local/tmp_files/test_table")
df3: org.apache.spark.sql.DataFrame = [single: int, double: int ... 2 more fields]

scala> df3.show
+------+------+------+----+
|single|double|triple|hehe|
+------+------+------+----+
|     8|  null|    24|   2|
|     9|  null|    27|   2|
|    10|  null|    30|   2|
|     3|     6|  null|   1|
|     4|     8|  null|   1|
|     5|    10|  null|   1|
|     6|  null|    18|   2|
|     7|  null|    21|   2|
|     1|     2|  null|   1|
|     2|     4|  null|   1|
+------+------+------+----+

3. JSON Files

Note that the JSON files here are not in the usual pretty-printed JSON layout. Every line of the file must contain one separate, self-contained, valid JSON object; describing a single JSON object across multiple lines makes the read fail.

emp.json

{"empno":7369,"ename":"SMITH","job":"CLERK","mgr":"7902","hiredate":"1980/12/17","sal":800,"comm":"","deptno":20}
{"empno":7499,"ename":"ALLEN","job":"SALESMAN","mgr":"7698","hiredate":"1981/2/20","sal":1600,"comm":"300","deptno":30}

Reading people.json:

scala> val peopleDF = spark.read.json("/usr/local/tmp_files/people.json")
peopleDF: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> peopleDF.show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

scala> peopleDF.createOrReplaceTempView("people")

scala> spark.sql("select name from people where age=19").show
+------+
|  name|
+------+
|Justin|
+------+
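
If a JSON object really does span several lines (for example a pretty-printed file), Spark 2.2 and later can still read it with the multiLine option; a minimal sketch, assuming a hypothetical pretty-printed file people_pretty.json:

val prettyDF = spark.read.option("multiLine", true).json("/usr/local/tmp_files/people_pretty.json")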

4. JDBC

Relational databases are accessed through JDBC, for example data stored in MySQL. The spark-shell command line must include the driver jar via --jars and --driver-class-path in order to connect to MySQL.

./spark-shell --master spark://node3:7077 --jars /usr/local/tmp_files/mysql-connector-java-8.0.11.jar --driver-class-path /usr/local/tmp_files/mysql-connector-java-8.0.11.jar
./spark-shell --jars /opt/top_resources/test/mysql-connector-java-5.1.38.jar --driver-class-path /opt/top_resources/test/mysql-connector-java-5.1.38.jar

4.1 Option 1: read.format("jdbc")

scala> val mysqlDF = spark.read.format("jdbc").option("url","jdbc:mysql://topnpl200:3306/topdb_dev?serverTimezone=UTC&characterEncoding=utf-8").option("user","root").option("password","TOPtop123456").option("driver","com.mysql.jdbc.Driver").option("dbtable","test_table").load  
scala> mysqlDF.show

For Oracle, use:
.option("url", "jdbc:oracle:thin:@topnpl200:1521/orcl")
.option("driver", "oracle.jdbc.OracleDriver")

4.2 Option 2: defining a Properties object

scala> import java.util.Properties
scala> val mysqlProps = new Properties()
scala> mysqlProps.setProperty("user","root")
scala> mysqlProps.setProperty("password","TOPtop123456")
scala> mysqlProps.setProperty("driver","com.mysql.jdbc.Driver")
scala> val mysqlDF1 = spark.read.jdbc("jdbc:mysql://topnpl200:3306/topdb_dev?serverTimezone=UTC&characterEncoding=utf-8","test_table",mysqlProps)
scala> mysqlDF1.show
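
Writing a DataFrame back through JDBC uses the same url/table/properties trio; a minimal sketch that reuses mysqlProps and writes to a hypothetical table test_table_copy:

mysqlDF1.write.mode("append").jdbc("jdbc:mysql://topnpl200:3306/topdb_dev?serverTimezone=UTC&characterEncoding=utf-8", "test_table_copy", mysqlProps)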

5. Hive

Data stored in Hive can be read into Spark SQL and processed there. This is a fairly common pattern.

5.1 Configuring Spark SQL to support Hive

  1. Set up Hive.

  2. Copy the Hive and Hadoop configuration files into the Spark configuration:
    hive-site.xml from Hive
    core-site.xml and hdfs-site.xml from Hadoop
    into the SPARK_HOME/conf directory.
    The Hive deployment here uses several Hive clients and a single Hive server, and only the server connects to MySQL. The advantage is that the MySQL connection details are known only to the Hive server and are never exposed to the Hive clients.

    <!-- Hive server-side configuration -->
    <configuration> 
      <property> 
        <name>hive.metastore.warehouse.dir</name>  
        <value>/user/yibo/hive/warehouse</value> 
      </property>  
      <property> 
        <name>javax.jdo.option.ConnectionURL</name>  
        <value>jdbc:mysql://192.168.109.1:3306/hive?serverTimezone=UTC</value> 
      </property>  
      <property> 
        <name>javax.jdo.option.ConnectionDriverName</name>  
        <value>com.mysql.jdbc.Driver</value> 
      </property>  
      <property> 
        <name>javax.jdo.option.ConnectionUserName</name>  
        <value>root</value> 
      </property>  
      <property> 
        <name>javax.jdo.option.ConnectionPassword</name>  
        <value>123456</value> 
      </property>  
      <property> 
        <name>hive.querylog.location</name>  
        <value>/data/hive/iotmp</value> 
      </property>  
      <property> 
        <name>hive.server2.logging.operation.log.location</name>  
        <value>/data/hive/operation_logs</value> 
      </property>  
      <property> 
        <name>datanucleus.readOnlyDatastore</name>  
        <value>false</value> 
      </property>  
      <property> 
        <name>datanucleus.fixedDatastore</name>  
        <value>false</value> 
      </property>  
      <property> 
        <name>datanucleus.autoCreateSchema</name>  
        <value>true</value> 
      </property>  
      <property> 
        <name>datanucleus.autoCreateTables</name>  
        <value>true</value> 
      </property>  
      <property> 
        <name>datanucleus.autoCreateColumns</name>  
        <value>true</value> 
      </property> 
      <property> 
        <name>datanucleus.schema.autoCreateAll</name>  
        <value>true</value> 
      </property>
    </configuration>
    
    <!-- Hive client-side configuration -->
    <configuration> 
      <property> 
        <name>hive.metastore.warehouse.dir</name>  
        <value>/user/yibo/hive/warehouse</value> 
      </property>  
      <property> 
        <name>hive.metastore.local</name>  
        <value>false</value> 
      </property>  
      <property> 
        <name>hive.metastore.uris</name>  
        <value>thrift://192.168.109.132:9083</value> 
      </property> 
    </configuration>
    
  3. Start Spark and the related components.

    # Start HDFS and YARN
    start-all
    # If the Hive server and client are separated, start the metastore service on the hive-server side:
    ./hive --service metastore
    # Start the Spark cluster
    start-all
    # Start the Spark Shell and include the MySQL driver (the Hive metadata is stored in MySQL)
    ./spark-shell --master spark://node3:7077 --jars /usr/local/tmp_files/mysql-connector-java-8.0.11.jar
    

5.2 Using Spark SQL to work with Hive

// Show the tables in Hive
scala> spark.sql("show tables").show
+--------+-----------+-----------+
|database|  tableName|isTemporary|
+--------+-----------+-----------+
| default|emp_default|      false|
+--------+-----------+-----------+

scala> spark.sql("select * from company.emp limit 10").show
+-----+------+---------+----+----------+----+----+------+
|empno| ename|      job| mgr|  hiredate| sal|comm|deptno|
+-----+------+---------+----+----------+----+----+------+
| 7369| SMITH|    CLERK|7902|1980/12/17| 800|   0|    20|
| 7499| ALLEN| SALESMAN|7698| 1981/2/20|1600| 300|    30|
| 7521|  WARD| SALESMAN|7698| 1981/2/22|1250| 500|    30|
| 7566| JONES|  MANAGER|7839|  1981/4/2|2975|   0|    20|
| 7654|MARTIN| SALESMAN|7698| 1981/9/28|1250|1400|    30|
| 7698| BLAKE|  MANAGER|7839|  1981/5/1|2850|   0|    30|
| 7782| CLARK|  MANAGER|7839|  1981/6/9|2450|   0|    10|
| 7788| SCOTT|  ANALYST|7566| 1987/4/19|3000|   0|    20|
| 7839|  KING|PRESIDENT|7839|1981/11/17|5000|   0|    10|
| 7844|TURNER| SALESMAN|7698|  1981/9/8|1500|   0|    30|
+-----+------+---------+----+----------+----+----+------+

// Create a Hive table through Spark SQL
scala> spark.sql("create table company.emp_0823( empno Int, ename String, job String, mgr String, hiredate String, sal Int, comm String, deptno Int ) row format delimited fields terminated by ','")
res2: org.apache.spark.sql.DataFrame = []

scala> spark.sql("load data local inpath '/usr/local/tmp_files/emp.csv' overwrite into table company.emp_0823")
19/08/23 05:14:50 ERROR KeyProviderCache: Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !!
res3: org.apache.spark.sql.DataFrame = []

scala> spark.sql("load data local inpath '/usr/local/tmp_files/emp.csv' into table company.emp_0823")
res4: org.apache.spark.sql.DataFrame = []

scala> spark.sql("select * from company.emp_0823 limit 10").show
+-----+------+---------+----+----------+----+----+------+
|empno| ename|      job| mgr|  hiredate| sal|comm|deptno|
+-----+------+---------+----+----------+----+----+------+
| 7369| SMITH|    CLERK|7902|1980/12/17| 800|   0|    20|
| 7499| ALLEN| SALESMAN|7698| 1981/2/20|1600| 300|    30|
| 7521|  WARD| SALESMAN|7698| 1981/2/22|1250| 500|    30|
| 7566| JONES|  MANAGER|7839|  1981/4/2|2975|   0|    20|
| 7654|MARTIN| SALESMAN|7698| 1981/9/28|1250|1400|    30|
| 7698| BLAKE|  MANAGER|7839|  1981/5/1|2850|   0|    30|
| 7782| CLARK|  MANAGER|7839|  1981/6/9|2450|   0|    10|
| 7788| SCOTT|  ANALYST|7566| 1987/4/19|3000|   0|    20|
| 7839|  KING|PRESIDENT|7839|1981/11/17|5000|   0|    10|
| 7844|TURNER| SALESMAN|7698|  1981/9/8|1500|   0|    30|
+-----+------+---------+----+----------+----+----+------+
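
Query results can also be written back into Hive with saveAsTable; a minimal sketch, targeting a hypothetical table company.emp_backup:

spark.sql("select * from company.emp_0823").write.mode("overwrite").saveAsTable("company.emp_backup")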