Reading Data from Hive with Spark SQL

Spark SQL's main goal is to let users run SQL on Spark. Its data sources can be RDDs as well as external sources such as Parquet, Hive, and JSON. One branch of Spark SQL is Spark on Hive: it reuses Hive's HQL parsing, logical-plan translation, and plan optimization, so you can roughly think of it as swapping only the physical execution plan from MapReduce jobs to Spark jobs. This article shows how to read data from an existing Hive deployment through Spark SQL.

Note, however, that the pre-built Spark assembly package does not support Hive. If you want to use Hive from Spark, you have to rebuild it with the -Phive option added, like this:

$ ./make-distribution.sh --tgz -Phadoop-2.2 -Pyarn -DskipTests -Dhadoop.version=2.2.0 -Phive

After the build finishes, three additional jars appear under SPARK_HOME/lib: datanucleus-api-jdo-3.2.6.jar, datanucleus-core-3.2.10.jar, and datanucleus-rdbms-3.2.9.jar, all of which Hive needs. The steps are described below.

1. Preparing the Environment

To let Spark connect to Hive's existing data warehouse, copy Hive's hive-site.xml into Spark's conf directory; through this configuration file Spark can locate Hive's metadata as well as where the data is stored.
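For example, a one-liner assuming HIVE_HOME and SPARK_HOME are set (adjust the paths to your layout):

cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf/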

If Hive's metadata is stored in MySQL, you also need the MySQL driver on hand, e.g. mysql-connector-java-5.1.22-bin.jar.

2. Starting spark-shell

With the environment ready, we will use spark-shell, for convenience, to show how Spark SQL reads data from Hive. Start it with the following command:

$ bin/spark-shell --master yarn-client --jars lib/mysql-connector-java-5.1.22-bin.jar

....

15/08/27 18:21:25 INFO repl.SparkILoop: Created spark context..

Spark context available as sc.

....

15/08/27 18:21:30 INFO repl.SparkILoop: Created sql context (with Hive support)..

SQL context available as sqlContext.

When spark-shell starts, it first requests resources from the ResourceManager and then initializes the SparkContext and SQLContext instances. The sqlContext object here is actually an instance of HiveContext, and it is the entry point into Spark SQL; a standalone application would build the same objects by hand, as in the sketch below. Next, let's read the data in Hive.
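A minimal standalone sketch against the Spark 1.x API; the object name and the SHOW TABLES query are illustrative only:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveReadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HiveReadSketch"))
    // What spark-shell exposes as sqlContext: a HiveContext built over sc
    val sqlContext = new HiveContext(sc)
    sqlContext.sql("SHOW TABLES").show()
    sc.stop()
  }
}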

scala> sqlContext.sql("CREATE EXTERNAL  TABLE IF NOT EXISTS ewaplog (key STRING, value STRING)

STORED AS INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat' OUTPUTFORMAT

'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION '/user/iteblog/ewaplog' ")

 

res0: org.apache.spark.sql.DataFrame = [result: string]

 

scala> sqlContext.sql("LOAD DATA LOCAL INPATH '/data/test.lzo' INTO TABLE ewaplog")

res1: org.apache.spark.sql.DataFrame = [result: string]

 

scala> sqlContext.sql("FROM ewaplog SELECT key, value").collect().foreach(println)

[12,wyp]

[23,ry]

[12,wyp]

[23,ry]

We first created the ewaplog table, then loaded data into it, and finally queried it. As you can see, all the SQL is identical to what you would run in Hive; it simply executes on Spark. By default the HiveQL parser is used when executing SQL. You can switch to Spark SQL's built-in SQL parser, sql, via the spark.sql.dialect option, but the HiveQL parser is recommended in most cases: it supports more syntax, including Hive UDFs.
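For example, in the spark-shell session above the dialect can be switched with setConf (this applies to Spark 1.x, where HiveContext defaults to hiveql):

scala> sqlContext.setConf("spark.sql.dialect", "sql")    // built-in Spark SQL parser
scala> sqlContext.setConf("spark.sql.dialect", "hiveql") // back to the HiveQL parser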

If you hit an error like the following while the HiveContext is being created:

 

15/11/20 16:20:07 WARN metadata.Hive: Failed to access metastore. This class should not accessed in runtime.

org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

    at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236)

    at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)

    at org.apache.hadoop.hive.ql.metadata.Hive.<clinit>(Hive.java:166)

    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)

    at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:171)

    at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162)

    at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)

    at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:167)

    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)

    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)

    at org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)

    at $line4.$read$$iwC$$iwC.<init>(<console>:9)

    at $line4.$read$$iwC.<init>(<console>:18)

    at $line4.$read.<init>(<console>:20)

    at $line4.$read$.<init>(<console>:24)

    at $line4.$read$.<clinit>(<console>)

    at $line4.$eval$.<init>(<console>:7)

    at $line4.$eval$.<clinit>(<console>)

    at $line4.$eval.$print(<console>)

    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

    at java.lang.reflect.Method.invoke(Method.java:606)

    at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)

    at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)

    at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)

    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)

    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)

    at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)

    at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)

    at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)

    at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:132)

    at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:124)

    at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324)

    at org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:124)

    at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:64)

    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:974)

    at org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:159)

    at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:64)

    at org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:108)

    at org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:64)

    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:991)

    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)

    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)

    at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)

    at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)

    at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)

    at org.apache.spark.repl.Main$.main(Main.scala:31)

    at org.apache.spark.repl.Main.main(Main.scala)

    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

    at java.lang.reflect.Method.invoke(Method.java:606)

    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)

    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)

    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)

    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)

    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

    at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1523)

    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)

    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)

    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)

    at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)

    at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)

    at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)

    ... 59 more

Caused by: java.lang.reflect.InvocationTargetException

    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)

    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)

    at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)

    ... 65 more

Caused by: MetaException(message:Version information not found in metastore. )

    at org.apache.hadoop.hive.metastore.ObjectStore.checkSchema(ObjectStore.java:6664)

    at org.apache.hadoop.hive.metastore.ObjectStore.verifySchema(ObjectStore.java:6645)

    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

    at java.lang.reflect.Method.invoke(Method.java:606)

    at org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:114)

    at com.sun.proxy.$Proxy15.verifySchema(Unknown Source)

    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:572)

    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:620)

    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:461)

    at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:66)

    at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72)

    at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)

    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:199)

    at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)

    ... 70 more

15/11/20 16:20:07 INFO metastore.HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore

15/11/20 16:20:07 INFO metastore.ObjectStore: ObjectStore, initialize called

15/11/20 16:20:07 INFO metastore.MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY

15/11/20 16:20:07 INFO metastore.ObjectStore: Initialized ObjectStore

java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)

    at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:171)

    at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162)

    at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)

    at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:167)

    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)

    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)

    at org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)

    at $iwC$$iwC.<init>(<console>:9)

    at $iwC.<init>(<console>:18)

    at <init>(<console>:20)

    at .<init>(<console>:24)

    at .<clinit>(<console>)

    at .<init>(<console>:7)

    at .<clinit>(<console>)

    at $print(<console>)

    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

    at java.lang.reflect.Method.invoke(Method.java:606)

    at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)

    at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)

    at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)

    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)

    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)

    at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)

    at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)

    at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)

    at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:132)

    at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:124)

    at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324)

    at org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:124)

    at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:64)

    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:974)

    at org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:159)

    at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:64)

    at org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:108)

    at org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:64)

    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:991)

    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)

    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)

    at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)

    at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)

    at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)

    at org.apache.spark.repl.Main$.main(Main.scala:31)

    at org.apache.spark.repl.Main.main(Main.scala)

    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

    at java.lang.reflect.Method.invoke(Method.java:606)

    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)

    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)

    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)

    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)

    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

    at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1523)

    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)

    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)

    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)

    at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)

    at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)

    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)

    ... 56 more

Caused by: java.lang.reflect.InvocationTargetException

    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)

    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)

    at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)

    ... 62 more

Caused by: MetaException(message:Version information not found in metastore. )

    at org.apache.hadoop.hive.metastore.ObjectStore.checkSchema(ObjectStore.java:6664)

    at org.apache.hadoop.hive.metastore.ObjectStore.verifySchema(ObjectStore.java:6645)

    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

    at java.lang.reflect.Method.invoke(Method.java:606)

    at org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:114)

    at com.sun.proxy.$Proxy15.verifySchema(Unknown Source)

    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:572)

    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:620)

    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:461)

    at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:66)

    at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72)

    at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)

    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:199)

    at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)

    ... 67 more

 

check whether your Hadoop cluster can reach the MySQL metastore. A telltale sign is the "Using direct SQL, underlying DB is DERBY" line above: it usually means Spark did not find hive-site.xml and fell back to an embedded Derby metastore, whose schema verification then fails with "Version information not found in metastore".
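A hive-site.xml for a MySQL-backed metastore typically contains entries like the following (host, database name, user, and password below are placeholders). Setting hive.metastore.schema.verification to false is a common workaround for the version-check failure above; initializing the schema with Hive's schematool is the cleaner fix:

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://mysql-host:3306/hive?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive_password</value>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
</configuration>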

 

1 Scenario

In practice, I ran into the following scenario:

Log data is written to HDFS; the operations team ETLs the HDFS data and loads it into Hive; Spark is then used to analyze and process the logs, with Spark deployed in Spark on Yarn mode.

Hive environment

  • A working Hive environment must be configured, because the hive-site.xml file has to be submitted together with the Spark job; only then can the job recognize the metadata of the existing Hive environment.
  • So in the Spark on Yarn deployment mode, all that is actually needed is Hive's configuration file, which lets the HiveContext read the metadata stored in MySQL and the Hive table data stored on HDFS.

2 Writing and Packaging the Program

As a test case, the code here is fairly simple:

package cn.xpleaf.spark.scala.sql.p2

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}

/**
  * @author xpleaf
  */
object _01HiveContextOps {

    def main(args: Array[String]): Unit = {
        Logger.getLogger("org.apache.spark").setLevel(Level.OFF)
        val conf = new SparkConf()
//            .setMaster("local[2]")
            .setAppName(s"${_01HiveContextOps.getClass.getSimpleName}")

        val sc = new SparkContext(conf)
        val hiveContext = new HiveContext(sc)

        hiveContext.sql("show databases").show()

        hiveContext.sql("use mydb1")

        // Create the teacher_info table
        val sql1 =
            """create table teacher_info(
              |name string,
              |height double)
              |row format delimited
              |fields terminated by ','""".stripMargin
        hiveContext.sql(sql1)

        // Create the teacher_basic table
        val sql2 =
            """create table teacher_basic(
              |name string,
              |age int,
              |married boolean,
              |children int)
              |row format delimited
              |fields terminated by ','""".stripMargin
        hiveContext.sql(sql2)

        // Load data into the tables
        hiveContext.sql("load data inpath 'hdfs://ns1/data/hive/teacher_info.txt' into table teacher_info")
        hiveContext.sql("load data inpath 'hdfs://ns1/data/hive/teacher_basic.txt' into table teacher_basic")

        // Step two: join the two tables
        val sql3 =
            """select
              |b.name,
              |b.age,
              |if(b.married,'married','single') as married,
              |b.children,
              |i.height
              |from teacher_info i
              |inner join teacher_basic b on i.name=b.name""".stripMargin
        val joinDF: DataFrame = hiveContext.sql(sql3)

        val joinRDD = joinDF.rdd
        joinRDD.collect().foreach(println)

        // Save the join result as a Hive table
        joinDF.write.saveAsTable("teacher")

        sc.stop()
    }

}

As you can see, it simply creates tables in Hive, loads data, joins the two tables, and saves the result into a Hive table.

After writing the code, package it; note that the dependencies do not need to be bundled into the jar (a build sketch follows). Then upload the jar to the target environment.
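Whether the original project used Maven or sbt is not shown; as an illustration, here is a minimal build.sbt under the assumption of Spark 1.6.2 (the version visible in the submission logs below), with the Spark artifacts marked provided so they stay out of the application jar:

name := "spark-process"
version := "1.0-SNAPSHOT"
scalaVersion := "2.10.6"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.2" % "provided",
  "org.apache.spark" %% "spark-hive" % "1.6.2" % "provided"
)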

3 Deployment

Write a submit script, as follows:

[hadoop@hadoop01 jars]$ cat spark-submit-yarn.sh 
/home/hadoop/app/spark/bin/spark-submit \
--class $2 \
--master yarn \
--deploy-mode cluster \
--executor-memory 1G \
--num-executors 1 \
--files $SPARK_HOME/conf/hive-site.xml \
--jars $SPARK_HOME/lib/mysql-connector-java-5.1.39.jar,$SPARK_HOME/lib/datanucleus-api-jdo-3.2.6.jar,$SPARK_HOME/lib/datanucleus-core-3.2.10.jar,$SPARK_HOME/lib/datanucleus-rdbms-3.2.9.jar \
$1

Note the two crucial options here, --files and --jars, explained below:

--files $HIVE_HOME/conf/hive-site.xml
// adds Hive's configuration file to the classpath of the Driver and Executors
--jars $HIVE_HOME/lib/mysql-connector-java-5.1.39.jar,….
// adds the jars Hive depends on to the classpath of the Driver and Executors

Now run the script to submit the job to Yarn:

[hadoop@hadoop01 jars]$ ./spark-submit-yarn.sh spark-process-1.0-SNAPSHOT.jar cn.xpleaf.spark.scala.sql.p2._01HiveContextOps

4 Viewing the Results

Note that if you want to monitor the execution, you need to configure the history servers (MapReduce's JobHistoryServer and Spark's history server); see my earlier article on that topic.

4.1 Yarn UI

(screenshot: the application as seen on the Yarn web UI)

4.2 Spark UI

(screenshot: the job as seen on the Spark web UI)

5 Problems and Solutions

1. User class threw exception: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

Remember that our deployment mode is Spark on Yarn, and the Yarn nodes carry neither the Spark nor the Hive dependencies, so when submitting the job you must explicitly list the jar dependencies to upload:

--jars $SPARK_HOME/lib/mysql-connector-java-5.1.39.jar,$SPARK_HOME/lib/datanucleus-api-jdo-3.2.6.jar,$SPARK_HOME/lib/datanucleus-core-3.2.10.jar,$SPARK_HOME/lib/datanucleus-rdbms-3.2.9.jar \

You can actually see this happen if you watch the console output during submission:


18/10/09 10:57:44 INFO yarn.Client: Uploading resource file:/home/hadoop/app/spark/lib/spark-assembly-1.6.2-hadoop2.6.0.jar -> hdfs://ns1/user/hadoop/.sparkStaging/application_1538989570769_0023/spark-assembly-1.6.2-hadoop2.6.0.jar
18/10/09 10:57:47 INFO yarn.Client: Uploading resource file:/home/hadoop/jars/spark-process-1.0-SNAPSHOT.jar -> hdfs://ns1/user/hadoop/.sparkStaging/application_1538989570769_0023/spark-process-1.0-SNAPSHOT.jar
18/10/09 10:57:47 INFO yarn.Client: Uploading resource file:/home/hadoop/app/spark/lib/mysql-connector-java-5.1.39.jar -> hdfs://ns1/user/hadoop/.sparkStaging/application_1538989570769_0023/mysql-connector-java-5.1.39.jar
18/10/09 10:57:47 INFO yarn.Client: Uploading resource file:/home/hadoop/app/spark/lib/datanucleus-api-jdo-3.2.6.jar -> hdfs://ns1/user/hadoop/.sparkStaging/application_1538989570769_0023/datanucleus-api-jdo-3.2.6.jar
18/10/09 10:57:47 INFO yarn.Client: Uploading resource file:/home/hadoop/app/spark/lib/datanucleus-core-3.2.10.jar -> hdfs://ns1/user/hadoop/.sparkStaging/application_1538989570769_0023/datanucleus-core-3.2.10.jar
18/10/09 10:57:47 INFO yarn.Client: Uploading resource file:/home/hadoop/app/spark/lib/datanucleus-rdbms-3.2.9.jar -> hdfs://ns1/user/hadoop/.sparkStaging/application_1538989570769_0023/datanucleus-rdbms-3.2.9.jar
18/10/09 10:57:47 INFO yarn.Client: Uploading resource file:/home/hadoop/app/spark/conf/hive-site.xml -> hdfs://ns1/user/hadoop/.sparkStaging/application_1538989570769_0023/hive-site.xml
18/10/09 10:57:47 INFO yarn.Client: Uploading resource file:/tmp/spark-6f582e5c-3eef-4646-b8c7-0719877434d8/__spark_conf__103916311924336720.zip -> hdfs://ns1/user/hadoop/.sparkStaging/application_1538989570769_0023/__spark_conf__103916311924336720.zip

The output shows that the Spark-related jars are uploaded to the Yarn environment, i.e. onto HDFS, before the job itself is executed.

2. User class threw exception: org.apache.spark.sql.execution.QueryExecutionException: FAILED: SemanticException [Error 10072]: Database does not exist: mydb1

mydb1 does not exist, which means the job never read the metadata of the existing Hive environment; that happens because hive-site.xml was not submitted along with the job. The fix is to include it:

--files $SPARK_HOME/conf/hive-site.xml \

https://www.jianshu.com/p/8ea827872e7e
