Java-Spark Series 5: Introduction to Spark SQL

Contents: 1. Overview of Spark SQL · 2. Spark SQL Data Abstractions · 3. Operating on Databases with Spark SQL · References

1. Overview of Spark SQL

1.1 Origins of Spark SQL

Hive is the de facto data-warehouse standard in today's big-data world.


Hive's SQL model closely resembles that of an RDBMS, which makes it easy to pick up. Hive's main weakness is that it runs on MapReduce underneath, so execution is slow.

In the Spark 0.x era, Shark was introduced. Shark was tightly coupled to Hive — much of its underlying machinery still depended on Hive — but it rewrote three modules (memory management, physical planning, and execution) to run on Spark's in-memory compute model, delivering many-fold performance gains over Hive.

Shark was retired in the Spark 1.x era. At the Spark Summit on July 1, 2014, Databricks announced it was ending development of Shark to focus on Spark SQL.

After Shark was discontinued, two branches emerged:

1). Hive on Spark
Maintained by the Hive community; its source code lives in Hive.

2). Spark SQL (Spark on Hive)
Maintained by the Spark community; its source code lives in Spark. It supports multiple data sources and many optimization techniques, and is far more extensible.

Spark SQL's source code lives in Spark and has accumulated a great deal of additional optimization code. If your priority is staying close to an existing Hive stack — for example, ad-hoc data analysis against established Hive jobs — Hive on Spark is a reasonable choice; if your priority is performance — for example, scheduled production reports — Spark SQL is the better fit.

1.2 Spark SQL's Characteristics, as Seen from Code

Let's compare word count implemented with the Spark RDD API, the DataFrame API, and SQL:
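Since the original comparison figure is not reproduced here, the following is a minimal Java sketch of word count in the three styles (the input path and local master are illustrative placeholders, not from the original):

```java
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;
import static org.apache.spark.sql.functions.*;

public class WordCountThreeWays {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("WordCountThreeWays").master("local[*]").getOrCreate();

        String path = "file:///tmp/words.txt";  // illustrative input path

        // 1) RDD style: functional transformations, no schema.
        JavaRDD<String> lines = spark.read().textFile(path).javaRDD();
        lines.flatMap(l -> Arrays.asList(l.split("\\s+")).iterator())
             .mapToPair(w -> new Tuple2<>(w, 1))
             .reduceByKey(Integer::sum)
             .collect()
             .forEach(t -> System.out.println(t._1 + ": " + t._2));

        // 2) DataFrame style: relational operators over named columns.
        Dataset<Row> words = spark.read().text(path)
                .select(explode(split(col("value"), "\\s+")).as("word"));
        words.groupBy("word").count().show();

        // 3) SQL style: reads just like relational-database SQL.
        words.createOrReplaceTempView("words");
        spark.sql("SELECT word, COUNT(*) AS cnt FROM words GROUP BY word").show();

        spark.stop();
    }
}
```

All three produce the same word counts; running this requires a Spark runtime on the classpath.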


From this comparison, the Spark SQL version reads just like SQL in a relational database. The comparison also highlights Spark SQL's key characteristics:
1). Integrated
Running Spark programs through Spark SQL or the DataFrame API is simpler and faster.

Under the hood, both Spark SQL and the DataFrame API are ultimately executed as RDD operations.

2). Uniform data access
DataFrames and SQL provide a common way to access a variety of data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC. You can even join data across these sources.

3). Hive integration
Run SQL or HiveQL queries against existing data warehouses.

4). Standard connectivity
Server mode provides industry-standard JDBC and ODBC connectivity for business-intelligence tools.
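Concretely, this "server mode" is Spark's Thrift JDBC/ODBC server. A minimal sketch of starting it and connecting (paths assume a standard Spark distribution; the host and port shown are the defaults, not taken from the original):

```shell
# Start the Thrift JDBC/ODBC server; it accepts the same options as spark-submit.
$SPARK_HOME/sbin/start-thriftserver.sh --master local[2]

# Connect with the bundled beeline client over standard JDBC.
$SPARK_HOME/bin/beeline -u jdbc:hive2://localhost:10000
```

Any JDBC/ODBC-capable BI tool can connect the same way.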

1.3 Spark SQL from the Angle of Execution Speed

Benchmark results (not reproduced here) show:
1). RDD operations in Python are more than twice as slow as in Java/Scala.
2). DataFrame performance is nearly identical no matter which language is used.

So why are RDDs in Python so slow?
A Python RDD program runs across two different execution engines, paying for context switches and data transfer between the Python worker and the JVM — hence the 2x-plus slowdown compared to Scala.

Spark SQL also executes as RDDs underneath, but the intermediate execution plan is optimized inside Spark's optimization engine before execution, which is why DataFrame performance is nearly the same in every language.

2. Spark SQL Data Abstractions

Spark SQL provides two new abstractions: DataFrame and Dataset.

A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that combines the benefits of RDDs (strong typing, the ability to use powerful lambda functions) with the benefits of Spark SQL's optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated with functional transformations (map, flatMap, filter, and so on). The Dataset API is available in Scala and Java. Python does not support the Dataset API, but thanks to Python's dynamic nature many of its benefits are already available (for example, you can naturally access a row's fields by name: row.columnName). The situation in R is similar.

A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources: structured data files, tables in Hive, external databases, or existing RDDs. The DataFrame API is available in Scala, Java, Python, and R. In Scala and Java, a DataFrame is represented by a Dataset of Rows: in the Scala API, DataFrame is simply a type alias of Dataset[Row], while in the Java API users use Dataset&lt;Row&gt; to represent a DataFrame.

2.1 DataFrame

DataFrame's predecessor was SchemaRDD, renamed to DataFrame in Spark 1.3. It no longer inherits from RDD; instead it implements most of RDD's functionality itself.

Like an RDD, a DataFrame is a distributed dataset:
1). A DataFrame can be viewed as a distributed collection of Row objects. It carries detailed, column-oriented schema information, which enables optimization: a DataFrame not only offers more operators than an RDD, its execution plan can also be optimized.

2). A DataFrame resembles a two-dimensional table in a traditional database: besides the data, it also records the structure of the data, i.e. the schema.

3). DataFrames also support nested data types (struct, array, and map).

4). The DataFrame API offers a set of high-level relational operations that are better optimized and easier to use than the functional RDD API.

5). DataFrame's drawback is the lack of compile-time type checking, which can lead to runtime errors.

2.2 Dataset

Dataset is a new interface added in Spark 1.6. Compared with an RDD, it carries more descriptive information and is conceptually equivalent to a two-dimensional table in a relational database. Compared with a DataFrame, it retains type information: it is strongly typed and provides compile-time checks.

Calling a Dataset method generates a logical plan; Spark's optimizer then optimizes it and produces a final physical plan, which is submitted to the cluster for execution.

Dataset subsumes DataFrame's functionality. The two were unified in Spark 2.0: DataFrame is represented as Dataset[Row], i.e. a particular instantiation of Dataset.
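In Java, this unification means a DataFrame is literally typed as Dataset&lt;Row&gt;. A small sketch contrasting the typed and untyped views (the Person bean and values are illustrative, not from the original):

```java
import java.io.Serializable;
import java.util.Arrays;

import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TypedVsUntyped {
    // Illustrative bean class; its field names become column names.
    public static class Person implements Serializable {
        private String name;
        private int age;
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public int getAge() { return age; }
        public void setAge(int age) { this.age = age; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("TypedVsUntyped").master("local[*]").getOrCreate();

        Person alice = new Person();
        alice.setName("Alice");
        alice.setAge(30);

        // Strongly typed Dataset<Person>: getAge() is checked at compile time.
        Dataset<Person> people =
                spark.createDataset(Arrays.asList(alice), Encoders.bean(Person.class));
        Dataset<Person> adults =
                people.filter((FilterFunction<Person>) p -> p.getAge() >= 18);
        adults.show();

        // Untyped DataFrame = Dataset<Row>: a misspelled column name here
        // ("agee" instead of "age") would fail only at runtime.
        Dataset<Row> df = people.toDF();
        df.filter("age >= 18").show();

        spark.stop();
    }
}
```

The typed filter fails to compile if Person has no getAge(); the string filter on the DataFrame is only validated when the query is analyzed at runtime.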

3. Operating on Databases with Spark SQL

3.1 Operating on Hive with Spark SQL

The entry point to all functionality in Spark is the SparkSession class. To create a basic SparkSession, just use SparkSession.builder():

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession
  .builder()
  .appName("Java Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate();

The complete example code can be found at "examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java" in the Spark repo.

SparkSession in Spark 2.0 provides built-in support for Hive features, including writing queries in HiveQL, accessing Hive UDFs, and reading data from Hive tables. You do not need an existing Hive setup to use these features.
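To have the SparkSession talk to a Hive metastore, Hive support must be enabled on the builder. A minimal sketch (the query shown is illustrative):

```java
import org.apache.spark.sql.SparkSession;

public class HiveSupportExample {
    public static void main(String[] args) {
        // enableHiveSupport() wires in the metastore configured by hive-site.xml;
        // without an existing Hive setup, Spark falls back to a local metastore.
        SparkSession spark = SparkSession.builder()
                .appName("HiveSupportExample")
                .enableHiveSupport()
                .getOrCreate();

        spark.sql("SHOW DATABASES").show();  // plain HiveQL works directly
        spark.stop();
    }
}
```

On a CDH node like the one in the test logs below, hive-site.xml is picked up automatically from /etc/hive/conf, so spark.sql() queries resolve against the cluster's Hive tables.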

3.1.1 Creating DataFrames

With a SparkSession, applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources.

Here is an example that builds a DataFrame from the contents of a text file:
Code:

package org.example;

import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;


public class SparkSQLTest1 {
    public static void main(String[] args){
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkSQLTest1")
                .config("spark.some.config.option", "some-value")
                .getOrCreate();

        Dataset<Row> df = spark.read().text("file:///home/pyspark/idcard.txt");
        df.show();
        
        spark.stop();
    }
}

Test run:

[root@hp2 javaspark]# spark-submit \
>   --class org.example.SparkSQLTest1 \
>   --master local[2] \
>   /home/javaspark/SparkStudy-1.0-SNAPSHOT.jar
21/08/10 14:41:54 INFO spark.SparkContext: Running Spark version 2.4.0-cdh6.3.1
21/08/10 14:41:54 INFO logging.DriverLogger: Added a local log appender at: /tmp/spark-7bdf3b0f-d374-4b47-b78f-bdea057849cf/__driver_logs__/driver.log
21/08/10 14:41:54 INFO spark.SparkContext: Submitted application: SparkSQLTest1
21/08/10 14:41:54 INFO spark.SecurityManager: Changing view acls to: root
21/08/10 14:41:54 INFO spark.SecurityManager: Changing modify acls to: root
21/08/10 14:41:54 INFO spark.SecurityManager: Changing view acls groups to: 
21/08/10 14:41:54 INFO spark.SecurityManager: Changing modify acls groups to: 
21/08/10 14:41:54 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
21/08/10 14:41:55 INFO util.Utils: Successfully started service 'sparkDriver' on port 36352.
21/08/10 14:41:55 INFO spark.SparkEnv: Registering MapOutputTracker
21/08/10 14:41:55 INFO spark.SparkEnv: Registering BlockManagerMaster
21/08/10 14:41:55 INFO storage.BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
21/08/10 14:41:55 INFO storage.BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
21/08/10 14:41:55 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-59f07625-d934-48f5-a4d8-30f14fba5274
21/08/10 14:41:55 INFO memory.MemoryStore: MemoryStore started with capacity 366.3 MB
21/08/10 14:41:55 INFO spark.SparkEnv: Registering OutputCommitCoordinator
21/08/10 14:41:55 INFO util.log: Logging initialized @1548ms
21/08/10 14:41:55 INFO server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: 2018-09-05T05:11:46+08:00, git hash: 3ce520221d0240229c862b122d2b06c12a625732
21/08/10 14:41:55 INFO server.Server: Started @1623ms
21/08/10 14:41:55 INFO server.AbstractConnector: Started [email protected]{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
21/08/10 14:41:55 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
21/08/10 14:41:55 INFO handler.ContextHandler: Started [email protected]{/jobs,null,AVAILABLE,@Spark}
21/08/10 14:41:55 INFO handler.ContextHandler: Started [email protected]{/jobs/json,null,AVAILABLE,@Spark}
21/08/10 14:41:55 INFO handler.ContextHandler: Started [email protected]{/jobs/job,null,AVAILABLE,@Spark}
21/08/10 14:41:55 INFO handler.ContextHandler: Started [email protected]{/jobs/job/json,null,AVAILABLE,@Spark}
21/08/10 14:41:55 INFO handler.ContextHandler: Started [email protected]{/stages,null,AVAILABLE,@Spark}
21/08/10 14:41:55 INFO handler.ContextHandler: Started [email protected]{/stages/json,null,AVAILABLE,@Spark}
21/08/10 14:41:55 INFO handler.ContextHandler: Started [email protected]{/stages/stage,null,AVAILABLE,@Spark}
21/08/10 14:41:55 INFO handler.ContextHandler: Started [email protected]{/stages/stage/json,null,AVAILABLE,@Spark}
21/08/10 14:41:55 INFO handler.ContextHandler: Started [email protected]{/stages/pool,null,AVAILABLE,@Spark}
21/08/10 14:41:55 INFO handler.ContextHandler: Started [email protected]{/stages/pool/json,null,AVAILABLE,@Spark}
21/08/10 14:41:55 INFO handler.ContextHandler: Started [email protected]{/storage,null,AVAILABLE,@Spark}
21/08/10 14:41:55 INFO handler.ContextHandler: Started [email protected]{/storage/json,null,AVAILABLE,@Spark}
21/08/10 14:41:55 INFO handler.ContextHandler: Started [email protected]{/storage/rdd,null,AVAILABLE,@Spark}
21/08/10 14:41:55 INFO handler.ContextHandler: Started [email protected]{/storage/rdd/json,null,AVAILABLE,@Spark}
21/08/10 14:41:55 INFO handler.ContextHandler: Started [email protected]{/environment,null,AVAILABLE,@Spark}
21/08/10 14:41:55 INFO handler.ContextHandler: Started [email protected]{/environment/json,null,AVAILABLE,@Spark}
21/08/10 14:41:55 INFO handler.ContextHandler: Started [email protected]{/executors,null,AVAILABLE,@Spark}
21/08/10 14:41:55 INFO handler.ContextHandler: Started [email protected]{/executors/json,null,AVAILABLE,@Spark}
21/08/10 14:41:55 INFO handler.ContextHandler: Started [email protected]{/executors/threadDump,null,AVAILABLE,@Spark}
21/08/10 14:41:55 INFO handler.ContextHandler: Started [email protected]{/executors/threadDump/json,null,AVAILABLE,@Spark}
21/08/10 14:41:55 INFO handler.ContextHandler: Started [email protected]{/static,null,AVAILABLE,@Spark}
21/08/10 14:41:55 INFO handler.ContextHandler: Started [email protected]{/,null,AVAILABLE,@Spark}
21/08/10 14:41:55 INFO handler.ContextHandler: Started [email protected]{/api,null,AVAILABLE,@Spark}
21/08/10 14:41:55 INFO handler.ContextHandler: Started [email protected]{/jobs/job/kill,null,AVAILABLE,@Spark}
21/08/10 14:41:55 INFO handler.ContextHandler: Started [email protected]{/stages/stage/kill,null,AVAILABLE,@Spark}
21/08/10 14:41:55 INFO ui.SparkUI: Bound SparkUI to 0.0.0.0, and started at http://hp2:4040
21/08/10 14:41:55 INFO spark.SparkContext: Added JAR file:/home/javaspark/SparkStudy-1.0-SNAPSHOT.jar at spark://hp2:36352/jars/SparkStudy-1.0-SNAPSHOT.jar with timestamp 1628577715383
21/08/10 14:41:55 INFO executor.Executor: Starting executor ID driver on host localhost
21/08/10 14:41:55 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40459.
21/08/10 14:41:55 INFO netty.NettyBlockTransferService: Server created on hp2:40459
21/08/10 14:41:55 INFO storage.BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
21/08/10 14:41:55 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(driver, hp2, 40459, None)
21/08/10 14:41:55 INFO storage.BlockManagerMasterEndpoint: Registering block manager hp2:40459 with 366.3 MB RAM, BlockManagerId(driver, hp2, 40459, None)
21/08/10 14:41:55 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(driver, hp2, 40459, None)
21/08/10 14:41:55 INFO storage.BlockManager: external shuffle service port = 7337
21/08/10 14:41:55 INFO storage.BlockManager: Initialized BlockManager: BlockManagerId(driver, hp2, 40459, None)
21/08/10 14:41:55 INFO handler.ContextHandler: Started [email protected]{/metrics/json,null,AVAILABLE,@Spark}
21/08/10 14:41:56 INFO scheduler.EventLoggingListener: Logging events to hdfs://nameservice1/user/spark/applicationHistory/local-1628577715422
21/08/10 14:41:56 INFO spark.SparkContext: Registered listener com.cloudera.spark.lineage.NavigatorAppListener
21/08/10 14:41:56 INFO logging.DriverLogger$DfsAsyncWriter: Started driver log file sync to: /user/spark/driverLogs/local-1628577715422_driver.log
21/08/10 14:41:56 INFO internal.SharedState: loading hive config file: file:/etc/hive/conf.cloudera.hive/hive-site.xml
21/08/10 14:41:56 INFO internal.SharedState: spark.sql.warehouse.dir is not set, but hive.metastore.warehouse.dir is set. Setting spark.sql.warehouse.dir to the value of hive.metastore.warehouse.dir ('/user/hive/warehouse').
21/08/10 14:41:56 INFO internal.SharedState: Warehouse path is '/user/hive/warehouse'.
21/08/10 14:41:56 INFO handler.ContextHandler: Started [email protected]{/SQL,null,AVAILABLE,@Spark}
21/08/10 14:41:56 INFO handler.ContextHandler: Started [email protected]{/SQL/json,null,AVAILABLE,@Spark}
21/08/10 14:41:56 INFO handler.ContextHandler: Started [email protected]{/SQL/execution,null,AVAILABLE,@Spark}
21/08/10 14:41:56 INFO handler.ContextHandler: Started [email protected]{/SQL/execution/json,null,AVAILABLE,@Spark}
21/08/10 14:41:56 INFO handler.ContextHandler: Started [email protected]{/static/sql,null,AVAILABLE,@Spark}
21/08/10 14:41:57 INFO state.StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
21/08/10 14:41:58 INFO datasources.FileSourceStrategy: Pruning directories with: 
21/08/10 14:41:58 INFO datasources.FileSourceStrategy: Post-Scan Filters: 
21/08/10 14:41:58 INFO datasources.FileSourceStrategy: Output Data Schema: struct<value: string>
21/08/10 14:41:58 INFO execution.FileSourceScanExec: Pushed Filters: 
21/08/10 14:41:58 INFO codegen.CodeGenerator: Code generated in 185.453035 ms
21/08/10 14:41:59 INFO codegen.CodeGenerator: Code generated in 11.048328 ms
21/08/10 14:41:59 INFO memory.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 342.0 KB, free 366.0 MB)
21/08/10 14:41:59 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 32.3 KB, free 365.9 MB)
21/08/10 14:41:59 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on hp2:40459 (size: 32.3 KB, free: 366.3 MB)
21/08/10 14:41:59 INFO spark.SparkContext: Created broadcast 0 from show at SparkSQLTest1.java:17
21/08/10 14:41:59 INFO execution.FileSourceScanExec: Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes.
21/08/10 14:41:59 INFO spark.SparkContext: Starting job: show at SparkSQLTest1.java:17
21/08/10 14:41:59 INFO scheduler.DAGScheduler: Got job 0 (show at SparkSQLTest1.java:17) with 1 output partitions
21/08/10 14:41:59 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (show at SparkSQLTest1.java:17)
21/08/10 14:41:59 INFO scheduler.DAGScheduler: Parents of final stage: List()
21/08/10 14:41:59 INFO scheduler.DAGScheduler: Missing parents: List()
21/08/10 14:41:59 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[3] at show at SparkSQLTest1.java:17), which has no missing parents
21/08/10 14:41:59 INFO memory.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 7.4 KB, free 365.9 MB)
21/08/10 14:41:59 INFO memory.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 3.9 KB, free 365.9 MB)
21/08/10 14:41:59 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on hp2:40459 (size: 3.9 KB, free: 366.3 MB)
21/08/10 14:41:59 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1164
21/08/10 14:41:59 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[3] at show at SparkSQLTest1.java:17) (first 15 tasks are for partitions Vector(0))
21/08/10 14:41:59 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
21/08/10 14:41:59 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 8309 bytes)
21/08/10 14:41:59 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
21/08/10 14:41:59 INFO executor.Executor: Fetching spark://hp2:36352/jars/SparkStudy-1.0-SNAPSHOT.jar with timestamp 1628577715383
21/08/10 14:41:59 INFO client.TransportClientFactory: Successfully created connection to hp2/10.31.1.124:36352 after 35 ms (0 ms spent in bootstraps)
21/08/10 14:41:59 INFO util.Utils: Fetching spark://hp2:36352/jars/SparkStudy-1.0-SNAPSHOT.jar to /tmp/spark-7bdf3b0f-d374-4b47-b78f-bdea057849cf/userFiles-ddafbf49-ca6a-4c28-83fe-9e4ff2765c5e/fetchFileTemp442678599817678402.tmp
21/08/10 14:41:59 INFO executor.Executor: Adding file:/tmp/spark-7bdf3b0f-d374-4b47-b78f-bdea057849cf/userFiles-ddafbf49-ca6a-4c28-83fe-9e4ff2765c5e/SparkStudy-1.0-SNAPSHOT.jar to class loader
21/08/10 14:41:59 INFO datasources.FileScanRDD: Reading File path: file:///home/pyspark/idcard.txt, range: 0-209, partition values: [empty row]
21/08/10 14:41:59 INFO codegen.CodeGenerator: Code generated in 9.418721 ms
21/08/10 14:41:59 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 1441 bytes result sent to driver
21/08/10 14:41:59 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 256 ms on localhost (executor driver) (1/1)
21/08/10 14:41:59 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
21/08/10 14:41:59 INFO scheduler.DAGScheduler: ResultStage 0 (show at SparkSQLTest1.java:17) finished in 0.366 s
21/08/10 14:41:59 INFO scheduler.DAGScheduler: Job 0 finished: show at SparkSQLTest1.java:17, took 0.425680 s
21/08/10 14:41:59 INFO conf.HiveConf: Found configuration file file:/etc/hive/conf.cloudera.hive/hive-site.xml
21/08/10 14:41:59 INFO hive.HiveUtils: Initializing HiveMetastoreConnection version 2.1 using Spark classes.
21/08/10 14:42:00 INFO conf.HiveConf: Found configuration file file:/etc/hive/conf.cloudera.hive/hive-site.xml
21/08/10 14:42:00 INFO session.SessionState: Created HDFS directory: /tmp/hive/root/1be0271b-776b-42c6-9660-d8f73c14ef2f
21/08/10 14:42:00 INFO session.SessionState: Created local directory: /tmp/root/1be0271b-776b-42c6-9660-d8f73c14ef2f
21/08/10 14:42:00 INFO session.SessionState: Created HDFS directory: /tmp/hive/root/1be0271b-776b-42c6-9660-d8f73c14ef2f/_tmp_space.db
21/08/10 14:42:00 INFO client.HiveClientImpl: Warehouse location for Hive client (version 2.1.1) is /user/hive/warehouse
21/08/10 14:42:00 INFO spark.ContextCleaner: Cleaned accumulator 15
21/08/10 14:42:00 INFO spark.ContextCleaner: Cleaned accumulator 16
21/08/10 14:42:00 INFO spark.ContextCleaner: Cleaned accumulator 23
21/08/10 14:42:00 INFO spark.ContextCleaner: Cleaned accumulator 20
21/08/10 14:42:00 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on hp2:40459 in memory (size: 3.9 KB, free: 366.3 MB)
21/08/10 14:42:00 INFO spark.ContextCleaner: Cleaned accumulator 12
21/08/10 14:42:00 INFO spark.ContextCleaner: Cleaned accumulator 6
21/08/10 14:42:00 INFO spark.ContextCleaner: Cleaned accumulator 22
21/08/10 14:42:00 INFO spark.ContextCleaner: Cleaned accumulator 10
21/08/10 14:42:00 INFO spark.ContextCleaner: Cleaned accumulator 5
21/08/10 14:42:00 INFO spark.ContextCleaner: Cleaned accumulator 19
21/08/10 14:42:00 INFO spark.ContextCleaner: Cleaned accumulator 28
21/08/10 14:42:00 INFO spark.ContextCleaner: Cleaned accumulator 26
21/08/10 14:42:00 INFO spark.ContextCleaner: Cleaned accumulator 27
21/08/10 14:42:00 INFO spark.ContextCleaner: Cleaned accumulator 7
21/08/10 14:42:00 INFO spark.ContextCleaner: Cleaned accumulator 11
21/08/10 14:42:00 INFO spark.ContextCleaner: Cleaned accumulator 17
21/08/10 14:42:00 INFO spark.ContextCleaner: Cleaned accumulator 24
21/08/10 14:42:00 INFO spark.ContextCleaner: Cleaned accumulator 29
21/08/10 14:42:00 INFO spark.ContextCleaner: Cleaned accumulator 30
21/08/10 14:42:00 INFO spark.ContextCleaner: Cleaned accumulator 18
21/08/10 14:42:00 INFO spark.ContextCleaner: Cleaned accumulator 25
21/08/10 14:42:00 INFO spark.ContextCleaner: Cleaned accumulator 14
21/08/10 14:42:00 INFO spark.ContextCleaner: Cleaned accumulator 13
21/08/10 14:42:00 INFO spark.ContextCleaner: Cleaned accumulator 8
21/08/10 14:42:00 INFO spark.ContextCleaner: Cleaned accumulator 21
21/08/10 14:42:00 INFO spark.ContextCleaner: Cleaned accumulator 9
21/08/10 14:42:00 INFO hive.metastore: HMS client filtering is enabled.
21/08/10 14:42:00 INFO hive.metastore: Trying to connect to metastore with URI thrift://hp1:9083
21/08/10 14:42:00 INFO hive.metastore: Opened a connection to metastore, current connections: 1
21/08/10 14:42:00 INFO hive.metastore: Connected to metastore.
21/08/10 14:42:01 INFO metadata.Hive: Registering function getdegree myUdf.getDegree
+------------------+
|             value|
+------------------+
|440528*******63016|
|350525*******60813|
|120102*******10789|
|452123*******30416|
|440301*******22322|
|441421*******54614|
|440301*******55416|
|232721*******40630|
|362204*******88412|
|430281*******91015|
|420117*******88355|
+------------------+

21/08/10 14:42:01 INFO spark.SparkContext: Invoking stop() from shutdown hook
21/08/10 14:42:01 INFO server.AbstractConnector: Stopped [email protected]{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
21/08/10 14:42:01 INFO ui.SparkUI: Stopped Spark web UI at http://hp2:4040
21/08/10 14:42:01 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
21/08/10 14:42:01 INFO memory.MemoryStore: MemoryStore cleared
21/08/10 14:42:01 INFO storage.BlockManager: BlockManager stopped
21/08/10 14:42:01 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
21/08/10 14:42:01 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
21/08/10 14:42:01 INFO spark.SparkContext: Successfully stopped SparkContext
21/08/10 14:42:01 INFO util.ShutdownHookManager: Shutdown hook called
21/08/10 14:42:01 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-45aa512d-c4a1-44f4-8152-2273f1e78bda
21/08/10 14:42:01 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-7bdf3b0f-d374-4b47-b78f-bdea057849cf
[root@hp2 javaspark]# 

3.1.2 Running SQL Queries Programmatically

Code:

package org.example;

import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class SparkSQLTest2 {
    public static void main(String[] args){
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkSQLTest2")
                .config("spark.some.config.option", "some-value")
                .getOrCreate();

        Dataset<Row> sqlDF = spark.sql("SELECT * FROM test.ods_fact_sale limit 100");
        sqlDF.show();
        
        spark.stop();
    }

}

Test run:

[14:49:40] [root@hp2 javaspark]# spark-submit \
[14:49:40] >   --class org.example.SparkSQLTest2 \
[14:49:40] >   --master local[2] \
[14:49:41] >   /home/javaspark/SparkStudy-1.0-SNAPSHOT.jar
[14:49:42] 21/08/10 14:49:46 INFO spark.SparkContext: Running Spark version 2.4.0-cdh6.3.1
[14:49:42] 21/08/10 14:49:46 INFO logging.DriverLogger: Added a local log appender at: /tmp/spark-7c425e78-967b-4dbc-9552-1917c8d38b2f/__driver_logs__/driver.log
[14:49:42] 21/08/10 14:49:46 INFO spark.SparkContext: Submitted application: SparkSQLTest2
[14:49:42] 21/08/10 14:49:46 INFO spark.SecurityManager: Changing view acls to: root
[14:49:42] 21/08/10 14:49:46 INFO spark.SecurityManager: Changing modify acls to: root
[14:49:42] 21/08/10 14:49:46 INFO spark.SecurityManager: Changing view acls groups to: 
[14:49:42] 21/08/10 14:49:46 INFO spark.SecurityManager: Changing modify acls groups to: 
....snip....
21/08/10 14:49:56 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 237) in 53 ms on localhost (executor driver) (1/1)
21/08/10 14:49:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
21/08/10 14:49:56 INFO scheduler.DAGScheduler: ResultStage 1 (show at SparkSQLTest2.java:16) finished in 0.064 s
21/08/10 14:49:56 INFO scheduler.DAGScheduler: Job 0 finished: show at SparkSQLTest2.java:16, took 3.417211 s
+---+--------------------+---------+---------+
| id|           sale_date|prod_name|sale_nums|
+---+--------------------+---------+---------+
|  1|2011-08-16 00:00:...|    PROD4|       28|
|  2|2011-11-06 00:00:...|    PROD6|       19|
|  3|2011-04-25 00:00:...|    PROD8|       29|
|  4|2011-09-12 00:00:...|    PROD2|       88|
|  5|2011-05-15 00:00:...|    PROD5|       76|
|  6|2011-02-23 00:00:...|    PROD6|       64|
|  7|2012-09-26 00:00:...|    PROD2|       38|
|  8|2012-02-14 00:00:...|    PROD6|       45|
|  9|2010-04-22 00:00:...|    PROD8|       57|
| 10|2010-10-31 00:00:...|    PROD5|       65|
| 11|2010-10-24 00:00:...|    PROD2|       33|
| 12|2011-02-11 00:00:...|    PROD9|       27|
| 13|2012-07-10 00:00:...|    PROD8|       48|
| 14|2011-02-23 00:00:...|    PROD6|       46|
| 15|2010-08-10 00:00:...|    PROD4|       50|
| 16|2011-05-02 00:00:...|    PROD6|       22|
| 17|2012-07-20 00:00:...|    PROD2|       56|
| 18|2012-07-12 00:00:...|    PROD9|       57|
| 19|2011-11-18 00:00:...|    PROD6|       58|
| 20|2010-04-22 00:00:...|    PROD4|        7|
+---+--------------------+---------+---------+
only showing top 20 rows

21/08/10 14:49:56 INFO spark.SparkContext: Invoking stop() from shutdown hook
21/08/10 14:49:56 INFO server.AbstractConnector: Stopped [email protected]{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
21/08/10 14:49:56 INFO ui.SparkUI: Stopped Spark web UI at http://hp2:4040
21/08/10 14:49:56 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
21/08/10 14:49:56 INFO memory.MemoryStore: MemoryStore cleared
21/08/10 14:49:56 INFO storage.BlockManager: BlockManager stopped
21/08/10 14:49:56 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
21/08/10 14:49:56 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
21/08/10 14:49:56 INFO spark.SparkContext: Successfully stopped SparkContext
21/08/10 14:49:56 INFO util.ShutdownHookManager: Shutdown hook called
21/08/10 14:49:56 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-0d720459-d9a4-4bb5-b9ce-521f524c87c8
21/08/10 14:49:56 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-7c425e78-967b-4dbc-9552-1917c8d38b2f
[root@hp2 javaspark]# 

3.2 Operating on MySQL with Spark SQL

Spark SQL can operate not only on Hive but also on a remote MySQL database.

Code:

package org.example;

import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class SparkSQLTest3 {
    public static void main(String[] args){
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkSQLTest3")
                .config("spark.some.config.option", "some-value")
                .getOrCreate();

        Dataset<Row> jdbcDF = spark.read()
                .format("jdbc")
                .option("url", "jdbc:mysql://10.31.1.123:3306/test")
                .option("dbtable", "(SELECT * FROM EMP) tmp")
                .option("user", "root")
                .option("password", "abc123")
                .load();

        jdbcDF.printSchema();
        jdbcDF.show();

        spark.stop();
    }

}
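By default, a JDBC read like the one above pulls the whole table through a single connection. For large tables, Spark's standard JDBC partitioning options split the read across parallel connections. A sketch reusing the same connection settings (the bounds and partition count are illustrative assumptions):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSQLPartitionedRead {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("SparkSQLPartitionedRead")
                .getOrCreate();

        Dataset<Row> jdbcDF = spark.read()
                .format("jdbc")
                .option("url", "jdbc:mysql://10.31.1.123:3306/test")
                .option("dbtable", "ods_fact_sale")
                .option("user", "root")
                .option("password", "abc123")
                .option("partitionColumn", "id")  // numeric column to split on
                .option("lowerBound", "1")        // bounds only size the splits,
                .option("upperBound", "1000000")  // they do not filter rows
                .option("numPartitions", "8")     // up to 8 parallel connections
                .load();

        System.out.println(jdbcDF.rdd().getNumPartitions());
        spark.stop();
    }
}
```

Note that lowerBound/upperBound only determine the partition stride; rows outside the range still land in the first or last partition.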

Test run:

[root@hp2 javaspark]# spark-submit \
>   --class org.example.SparkSQLTest3 \
>   --master local[2] \
>   /home/javaspark/SparkStudy-1.0-SNAPSHOT.jar
21/08/10 15:04:01 INFO spark.SparkContext: Running Spark version 2.4.0-cdh6.3.1
21/08/10 15:04:01 INFO logging.DriverLogger: Added a local log appender at: /tmp/spark-5fdcedf0-eeb8-4089-961f-42a1048761c6/__driver_logs__/driver.log
21/08/10 15:04:01 INFO spark.SparkContext: Submitted application: SparkSQLTest3
21/08/10 15:04:01 INFO spark.SecurityManager: Changing view acls to: root
21/08/10 15:04:01 INFO spark.SecurityManager: Changing modify acls to: root
21/08/10 15:04:01 INFO spark.SecurityManager: Changing view acls groups to: 
21/08/10 15:04:01 INFO spark.SecurityManager: Changing modify acls groups to: 
21/08/10 15:04:01 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
21/08/10 15:04:02 INFO util.Utils: Successfully started service 'sparkDriver' on port 44266.
21/08/10 15:04:02 INFO spark.SparkEnv: Registering MapOutputTracker
21/08/10 15:04:02 INFO spark.SparkEnv: Registering BlockManagerMaster
21/08/10 15:04:02 INFO storage.BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
21/08/10 15:04:02 INFO storage.BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
21/08/10 15:04:02 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-1781c08b-9d73-427d-9bc0-2e7e092a2f16
21/08/10 15:04:02 INFO memory.MemoryStore: MemoryStore started with capacity 366.3 MB
21/08/10 15:04:02 INFO spark.SparkEnv: Registering OutputCommitCoordinator
21/08/10 15:04:02 INFO util.log: Logging initialized @1556ms
21/08/10 15:04:02 INFO server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: 2018-09-05T05:11:46+08:00, git hash: 3ce520221d0240229c862b122d2b06c12a625732
21/08/10 15:04:02 INFO server.Server: Started @1633ms
21/08/10 15:04:02 INFO server.AbstractConnector: Started [email protected]{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
21/08/10 15:04:02 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
21/08/10 15:04:02 INFO handler.ContextHandler: Started [email protected]{/jobs,null,AVAILABLE,@Spark}
21/08/10 15:04:02 INFO handler.ContextHandler: Started [email protected]{/jobs/json,null,AVAILABLE,@Spark}
21/08/10 15:04:02 INFO handler.ContextHandler: Started [email protected]{/jobs/job,null,AVAILABLE,@Spark}
21/08/10 15:04:02 INFO handler.ContextHandler: Started [email protected]{/jobs/job/json,null,AVAILABLE,@Spark}
21/08/10 15:04:02 INFO handler.ContextHandler: Started [email protected]{/stages,null,AVAILABLE,@Spark}
21/08/10 15:04:02 INFO handler.ContextHandler: Started [email protected]{/stages/json,null,AVAILABLE,@Spark}
21/08/10 15:04:02 INFO handler.ContextHandler: Started [email protected]{/stages/stage,null,AVAILABLE,@Spark}
21/08/10 15:04:02 INFO handler.ContextHandler: Started [email protected]{/stages/stage/json,null,AVAILABLE,@Spark}
21/08/10 15:04:02 INFO handler.ContextHandler: Started [email protected]{/stages/pool,null,AVAILABLE,@Spark}
21/08/10 15:04:02 INFO handler.ContextHandler: Started [email protected]{/stages/pool/json,null,AVAILABLE,@Spark}
21/08/10 15:04:02 INFO handler.ContextHandler: Started [email protected]{/storage,null,AVAILABLE,@Spark}
21/08/10 15:04:02 INFO handler.ContextHandler: Started [email protected]{/storage/json,null,AVAILABLE,@Spark}
21/08/10 15:04:02 INFO handler.ContextHandler: Started [email protected]{/storage/rdd,null,AVAILABLE,@Spark}
21/08/10 15:04:02 INFO handler.ContextHandler: Started [email protected]{/storage/rdd/json,null,AVAILABLE,@Spark}
21/08/10 15:04:02 INFO handler.ContextHandler: Started [email protected]{/environment,null,AVAILABLE,@Spark}
21/08/10 15:04:02 INFO handler.ContextHandler: Started [email protected]{/environment/json,null,AVAILABLE,@Spark}
21/08/10 15:04:02 INFO handler.ContextHandler: Started [email protected]{/executors,null,AVAILABLE,@Spark}
21/08/10 15:04:02 INFO handler.ContextHandler: Started [email protected]{/executors/json,null,AVAILABLE,@Spark}
21/08/10 15:04:02 INFO handler.ContextHandler: Started [email protected]{/executors/threadDump,null,AVAILABLE,@Spark}
21/08/10 15:04:02 INFO handler.ContextHandler: Started [email protected]{/executors/threadDump/json,null,AVAILABLE,@Spark}
21/08/10 15:04:02 INFO handler.ContextHandler: Started [email protected]{/static,null,AVAILABLE,@Spark}
21/08/10 15:04:02 INFO handler.ContextHandler: Started [email protected]{/,null,AVAILABLE,@Spark}
21/08/10 15:04:02 INFO handler.ContextHandler: Started [email protected]{/api,null,AVAILABLE,@Spark}
21/08/10 15:04:02 INFO handler.ContextHandler: Started [email protected]{/jobs/job/kill,null,AVAILABLE,@Spark}
21/08/10 15:04:02 INFO handler.ContextHandler: Started [email protected]{/stages/stage/kill,null,AVAILABLE,@Spark}
21/08/10 15:04:02 INFO ui.SparkUI: Bound SparkUI to 0.0.0.0, and started at http://hp2:4040
21/08/10 15:04:02 INFO spark.SparkContext: Added JAR file:/home/javaspark/SparkStudy-1.0-SNAPSHOT.jar at spark://hp2:44266/jars/SparkStudy-1.0-SNAPSHOT.jar with timestamp 1628579042508
21/08/10 15:04:02 INFO executor.Executor: Starting executor ID driver on host localhost
21/08/10 15:04:02 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 43617.
21/08/10 15:04:02 INFO netty.NettyBlockTransferService: Server created on hp2:43617
21/08/10 15:04:02 INFO storage.BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
21/08/10 15:04:02 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(driver, hp2, 43617, None)
21/08/10 15:04:02 INFO storage.BlockManagerMasterEndpoint: Registering block manager hp2:43617 with 366.3 MB RAM, BlockManagerId(driver, hp2, 43617, None)
21/08/10 15:04:02 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(driver, hp2, 43617, None)
21/08/10 15:04:02 INFO storage.BlockManager: external shuffle service port = 7337
21/08/10 15:04:02 INFO storage.BlockManager: Initialized BlockManager: BlockManagerId(driver, hp2, 43617, None)
21/08/10 15:04:02 INFO handler.ContextHandler: Started [email protected]{/metrics/json,null,AVAILABLE,@Spark}
21/08/10 15:04:03 INFO scheduler.EventLoggingListener: Logging events to hdfs://nameservice1/user/spark/applicationHistory/local-1628579042554
21/08/10 15:04:03 INFO spark.SparkContext: Registered listener com.cloudera.spark.lineage.NavigatorAppListener
21/08/10 15:04:03 INFO logging.DriverLogger$DfsAsyncWriter: Started driver log file sync to: /user/spark/driverLogs/local-1628579042554_driver.log
21/08/10 15:04:03 INFO internal.SharedState: loading hive config file: file:/etc/hive/conf.cloudera.hive/hive-site.xml
21/08/10 15:04:03 INFO internal.SharedState: spark.sql.warehouse.dir is not set, but hive.metastore.warehouse.dir is set. Setting spark.sql.warehouse.dir to the value of hive.metastore.warehouse.dir ('/user/hive/warehouse').
21/08/10 15:04:03 INFO internal.SharedState: Warehouse path is '/user/hive/warehouse'.
21/08/10 15:04:03 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@…{/SQL,null,AVAILABLE,@Spark}
21/08/10 15:04:03 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@…{/SQL/json,null,AVAILABLE,@Spark}
21/08/10 15:04:03 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@…{/SQL/execution,null,AVAILABLE,@Spark}
21/08/10 15:04:03 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@…{/SQL/execution/json,null,AVAILABLE,@Spark}
21/08/10 15:04:03 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@…{/static/sql,null,AVAILABLE,@Spark}
21/08/10 15:04:04 INFO state.StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
root
 |-- empno: integer (nullable = true)
 |-- ename: string (nullable = true)
 |-- job: string (nullable = true)
 |-- mgr: integer (nullable = true)
 |-- hiredate: date (nullable = true)
 |-- sal: decimal(7,2) (nullable = true)
 |-- comm: decimal(7,2) (nullable = true)
 |-- deptno: integer (nullable = true)

21/08/10 15:04:06 INFO codegen.CodeGenerator: Code generated in 194.297923 ms
21/08/10 15:04:06 INFO codegen.CodeGenerator: Code generated in 32.1523 ms
21/08/10 15:04:06 INFO spark.SparkContext: Starting job: show at SparkSQLTest3.java:24
21/08/10 15:04:06 INFO scheduler.DAGScheduler: Got job 0 (show at SparkSQLTest3.java:24) with 1 output partitions
21/08/10 15:04:06 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (show at SparkSQLTest3.java:24)
21/08/10 15:04:06 INFO scheduler.DAGScheduler: Parents of final stage: List()
21/08/10 15:04:06 INFO scheduler.DAGScheduler: Missing parents: List()
21/08/10 15:04:06 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[3] at show at SparkSQLTest3.java:24), which has no missing parents
21/08/10 15:04:06 INFO memory.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 10.9 KB, free 366.3 MB)
21/08/10 15:04:06 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 5.1 KB, free 366.3 MB)
21/08/10 15:04:06 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on hp2:43617 (size: 5.1 KB, free: 366.3 MB)
21/08/10 15:04:06 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1164
21/08/10 15:04:06 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[3] at show at SparkSQLTest3.java:24) (first 15 tasks are for partitions Vector(0))
21/08/10 15:04:06 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
21/08/10 15:04:07 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 7690 bytes)
21/08/10 15:04:07 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
21/08/10 15:04:07 INFO executor.Executor: Fetching spark://hp2:44266/jars/SparkStudy-1.0-SNAPSHOT.jar with timestamp 1628579042508
21/08/10 15:04:07 INFO client.TransportClientFactory: Successfully created connection to hp2/10.31.1.124:44266 after 37 ms (0 ms spent in bootstraps)
21/08/10 15:04:07 INFO util.Utils: Fetching spark://hp2:44266/jars/SparkStudy-1.0-SNAPSHOT.jar to /tmp/spark-5fdcedf0-eeb8-4089-961f-42a1048761c6/userFiles-651b34b1-fee3-4d6b-b332-9246d1f6e35b/fetchFileTemp859077269082454607.tmp
21/08/10 15:04:07 INFO executor.Executor: Adding file:/tmp/spark-5fdcedf0-eeb8-4089-961f-42a1048761c6/userFiles-651b34b1-fee3-4d6b-b332-9246d1f6e35b/SparkStudy-1.0-SNAPSHOT.jar to class loader
21/08/10 15:04:07 INFO jdbc.JDBCRDD: closed connection
21/08/10 15:04:07 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 1884 bytes result sent to driver
21/08/10 15:04:07 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 252 ms on localhost (executor driver) (1/1)
21/08/10 15:04:07 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
21/08/10 15:04:07 INFO scheduler.DAGScheduler: ResultStage 0 (show at SparkSQLTest3.java:24) finished in 0.577 s
21/08/10 15:04:07 INFO scheduler.DAGScheduler: Job 0 finished: show at SparkSQLTest3.java:24, took 0.632211 s
21/08/10 15:04:07 INFO conf.HiveConf: Found configuration file file:/etc/hive/conf.cloudera.hive/hive-site.xml
21/08/10 15:04:07 INFO hive.HiveUtils: Initializing HiveMetastoreConnection version 2.1 using Spark classes.
21/08/10 15:04:07 INFO conf.HiveConf: Found configuration file file:/etc/hive/conf.cloudera.hive/hive-site.xml
21/08/10 15:04:07 INFO session.SessionState: Created HDFS directory: /tmp/hive/root/75efdbe5-01fb-4f39-b04a-82982a3b5298
21/08/10 15:04:07 INFO session.SessionState: Created local directory: /tmp/root/75efdbe5-01fb-4f39-b04a-82982a3b5298
21/08/10 15:04:07 INFO session.SessionState: Created HDFS directory: /tmp/hive/root/75efdbe5-01fb-4f39-b04a-82982a3b5298/_tmp_space.db
21/08/10 15:04:07 INFO client.HiveClientImpl: Warehouse location for Hive client (version 2.1.1) is /user/hive/warehouse
21/08/10 15:04:08 INFO hive.metastore: HMS client filtering is enabled.
21/08/10 15:04:08 INFO hive.metastore: Trying to connect to metastore with URI thrift://hp1:9083
21/08/10 15:04:08 INFO hive.metastore: Opened a connection to metastore, current connections: 1
21/08/10 15:04:08 INFO hive.metastore: Connected to metastore.
21/08/10 15:04:08 INFO metadata.Hive: Registering function getdegree myUdf.getDegree
+-----+------+---------+----+----------+-------+-------+------+
|empno| ename|      job| mgr|  hiredate|    sal|   comm|deptno|
+-----+------+---------+----+----------+-------+-------+------+
| 7369| SMITH|    CLERK|7902|1980-12-17| 800.00|   null|    20|
| 7499| ALLEN| SALESMAN|7698|1981-02-20|1600.00| 300.00|    30|
| 7521|  WARD| SALESMAN|7698|1981-02-22|1250.00| 500.00|    30|
| 7566| JONES|  MANAGER|7839|1981-04-02|2975.00|   null|    20|
| 7654|MARTIN| SALESMAN|7698|1981-09-28|1250.00|1400.00|    30|
| 7698| BLAKE|  MANAGER|7839|1981-05-01|2850.00|   null|    30|
| 7782| CLARK|  MANAGER|7839|1981-06-09|2450.00|   null|    10|
| 7788| SCOTT|  ANALYST|7566|1987-06-13|3000.00|   null|    20|
| 7839|  KING|PRESIDENT|null|1981-11-17|5000.00|   null|    10|
| 7844|TURNER| SALESMAN|7698|1981-09-08|1500.00|   0.00|    30|
| 7876| ADAMS|    CLERK|7788|1987-06-13|1100.00|   null|    20|
| 7900| JAMES|    CLERK|7698|1981-12-03| 950.00|   null|    30|
| 7902|  FORD|  ANALYST|7566|1981-12-03|3000.00|   null|    20|
| 7934|MILLER|    CLERK|7782|1982-01-23|1300.00|   null|    10|
+-----+------+---------+----+----------+-------+-------+------+

21/08/10 15:04:08 INFO server.AbstractConnector: Stopped Spark@…{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
21/08/10 15:04:08 INFO ui.SparkUI: Stopped Spark web UI at http://hp2:4040
21/08/10 15:04:08 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
21/08/10 15:04:08 INFO memory.MemoryStore: MemoryStore cleared
21/08/10 15:04:08 INFO storage.BlockManager: BlockManager stopped
21/08/10 15:04:08 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
21/08/10 15:04:08 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
21/08/10 15:04:08 INFO spark.SparkContext: Successfully stopped SparkContext
21/08/10 15:04:08 INFO util.ShutdownHookManager: Shutdown hook called
21/08/10 15:04:08 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-5fdcedf0-eeb8-4089-961f-42a1048761c6
21/08/10 15:04:08 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-6b5f511b-1bb4-420c-b26f-737dc764fea1
[root@hp2 javaspark]# 

References:

1.http://spark.apache.org/docs/2.4.2/sql-getting-started.html
2.https://www.jianshu.com/p/ad6dc9467a6b
3.https://blog.csdn.net/u011412768/article/details/93426353
4.https://blog.csdn.net/luoganttcc/article/details/88791460
