Hive on Spark: Getting Started

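Translated from the official wiki:
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
This post is only the translation; the concrete integration steps will follow later. It mainly covers an overview, the Spark parameter settings, and how to handle common problems. It says little about setting up the environment.
As for the Issues section, there are too many error messages to format cleanly. See the original page for the full error text.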
Spark Installation
Configuring YARN
Configuring Hive
Configuring Spark
Issues
Recommended Configuration
Design Documents
Hive on Spark has been part of Hive since the Hive 1.1 release. It is under active development on the spark branch, which is periodically merged into master. For details, see HIVE-7292 (https://issues.apache.org/jira/browse/HIVE-7292).
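Spark Installation
Install Spark by following http://spark.apache.org/docs/latest/running-on-yarn.html (or https://spark.apache.org/docs/latest/spark-standalone.html if you want to run Spark in standalone mode). Hive on Spark supports Spark on YARN mode by default. For the installation, pay particular attention to the following points:
1. Install Spark (either download a pre-built Spark or build it yourself from source). (Translator's note: the build appears to be managed by Maven.)
Install or build a compatible version. The spark.version property in Hive's pom.xml defines the Spark version that Hive was built against.
Once Spark is built, locate spark-assembly-*.jar.
Note that you must use a Spark build that does not include the Hive jars, which means building Spark without the Hive dependency. If you use Parquet tables, enabling the parquet-provided profile is also recommended; otherwise there may be conflicts in the Parquet dependency. To leave the Hive jars out, build the Spark distribution with the following command: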

./make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"

2. Start the Spark cluster (both standalone and Spark on YARN are supported).
Keep a note of the Spark master URL. It can be found in the web UI of the Spark master.

Configuring YARN
Use the fair scheduler rather than the capacity scheduler:
yarn.resourcemanager.scheduler.class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
Configuring Hive
1. There are several ways to add the Spark dependency to Hive:
a. Set the spark.home property to point to the Spark installation directory:

hive> set spark.home=/location/to/sparkHome;

b. Define the SPARK_HOME environment variable before starting the Hive client or HiveServer2:

export SPARK_HOME=/usr/lib/spark....

c. Copy the spark-assembly jar into the HIVE_HOME/lib directory.
2. Configure Hive to use Spark as its execution engine:

hive> set hive.execution.engine=spark;

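See the Spark section of Hive Configuration Properties for the other properties that configure Hive and the remote Spark driver.
3. Configure the Spark application settings for Hive. For details, see http://spark.apache.org/docs/latest/configuration.html. You can either put these properties in a spark-defaults.conf file on the Hive classpath, or set them in Hive's hive-site.xml. For example (set here on the command line, but they can equally be written to hive-site.xml):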

hive> set spark.master=<Spark Master URL>;

hive> set spark.eventLog.enabled=true;

hive> set spark.eventLog.dir=<Spark event log folder (must exist)>;

hive> set spark.executor.memory=512m;

hive> set spark.serializer=org.apache.spark.serializer.KryoSerializer;

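A short explanation of some of the configuration properties:
spark.executor.memory: the amount of memory each executor process uses.
spark.executor.cores: the number of CPU cores each executor uses.
spark.yarn.executor.memoryOverhead: the amount of off-heap memory allocated per executor when running Spark on YARN. It accounts for things like VM overheads. In addition to the executor's own memory, the container in which the executor runs needs some extra memory for system processes, and that is what this overhead covers.
spark.executor.instances: the number of executors assigned to each application.
spark.driver.memory: the amount of memory assigned to the remote Spark context. We recommend 4g.
spark.yarn.driver.memoryOverhead: we recommend 400 (MB).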
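As mentioned above, these properties can also live in a spark-defaults.conf file on the Hive classpath. The following Python sketch only illustrates the file layout (one whitespace-separated key and value per line). The master URL and the event log folder are hypothetical placeholders, not values from the original page, and the helper name render_spark_defaults is made up for this example.

```python
# Sketch: render Spark properties as spark-defaults.conf text.
# Assumptions: the master URL and event log folder below are placeholders;
# replace them with the values of your own cluster.

def render_spark_defaults(props):
    """Return spark-defaults.conf text with one 'key value' pair per line."""
    width = max(len(key) for key in props)
    lines = [f"{key.ljust(width)}  {value}" for key, value in props.items()]
    return "\n".join(lines) + "\n"


props = {
    "spark.master": "spark://master-host:7077",   # hypothetical URL
    "spark.eventLog.enabled": "true",
    "spark.eventLog.dir": "/tmp/spark-events",    # hypothetical folder, must exist
    "spark.executor.memory": "512m",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
}

conf_text = render_spark_defaults(props)
print(conf_text)
```

Writing conf_text to a file named spark-defaults.conf in a directory on the Hive classpath should have the same effect as the set commands shown earlier.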
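Configuring Spark
Choosing a suitable executor memory size works better than simply making it as large as possible. Several things need to be considered:
1. More executor memory allows more queries to be optimized (for example, by converting them to map joins).
2. On the other hand, more executor memory becomes unwieldy from a GC perspective.
3. Some experiments show that the HDFS client does not handle concurrent writers well, so executors with too many cores may run into race conditions.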
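When running Spark on YARN, we recommend setting spark.executor.cores to 5, 6, or 7, depending on which value divides the core count of a typical node most evenly. For example, if yarn.nodemanager.resource.cpu-vcores (the total number of cores on one machine) is 19, then 6 is the better choice. All executors must have the same number of cores, so with 5 only 3 executors fit and 4 cores are wasted, with 7 only 2 executors fit and 5 cores are wasted, while with 6 you get 3 executors and only 1 core is wasted. If the node has 20 cores, then 5 is the better choice, since you get 4 executors and no core is wasted.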
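The arithmetic behind this recommendation can be sketched in a few lines of Python. This is only an illustration: the function names core_layout and best_executor_cores are made up here, and the tie-breaking rule (prefer the smaller value) is an assumption of this sketch, not something the original page states.

```python
# Sketch: compare candidate values of spark.executor.cores for one node.

def core_layout(node_vcores, cores_per_executor):
    """Return (executors that fit on the node, cores left unused)."""
    executors = node_vcores // cores_per_executor
    wasted = node_vcores - executors * cores_per_executor
    return executors, wasted


def best_executor_cores(node_vcores, candidates=(5, 6, 7)):
    """Pick the candidate that wastes the fewest cores (smaller value on ties)."""
    return min(candidates, key=lambda c: (core_layout(node_vcores, c)[1], c))


for vcores in (19, 20):
    for cores in (5, 6, 7):
        executors, wasted = core_layout(vcores, cores)
        print(f"vcores={vcores} cores={cores}: {executors} executors, {wasted} wasted")
    print(f"vcores={vcores}: choose {best_executor_cores(vcores)}")
```

For 19 vcores this selects 6, and for 20 vcores it selects 5, matching the two cases discussed above.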
For spark.executor.memory, we recommend computing yarn.nodemanager.resource.memory-mb * (spark.executor.cores / yarn.nodemanager.resource.cpu-vcores) and then splitting the result between spark.executor.memory and spark.yarn.executor.memoryOverhead. Based on our experience, we recommend setting spark.yarn.executor.memoryOverhead to about 15-20% of that total.
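After you have decided how much memory each executor gets, you need to decide how many executors to allocate to queries. Spark dynamic executor allocation will be supported in the GA release, but the beta supports only static resource allocation. Based on the memory of each node and the values of spark.executor.memory and spark.yarn.executor.memoryOverhead, choose the number of executor instances and set it through spark.executor.instances.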
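Now a real-world example. Assume 10 nodes, each with 64GB of memory and 12 virtual cores, so yarn.nodemanager.resource.cpu-vcores=12. One node acts as the master and the other 9 are slaves. We set spark.executor.cores to 6. Suppose the allocatable memory, yarn.nodemanager.resource.memory-mb, is 50GB. The memory for each executor is then computed as 50GB * (6/12) = 25GB. We assign 20% of it to spark.yarn.executor.memoryOverhead (5GB) and 80% to spark.executor.memory (20GB).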
On this 9-node cluster, each machine runs 2 executors, so spark.executor.instances can be set anywhere between 2 and 2 * 9 = 18. A value of 18 uses the entire cluster.
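The worked example can be reproduced with a small Python sketch. It works in integer megabytes and treats 50GB as 50 * 1024 MB. The function name size_executors and the key max_executor_instances are made up for this illustration; the returned instance count is only the upper bound, and any value from 2 up to it is possible.

```python
# Sketch: derive executor settings from node resources, following the rule
# memory-mb * (executor cores / node vcores), split into heap and overhead.

def size_executors(node_memory_mb, node_vcores, executor_cores,
                   overhead_percent, worker_nodes):
    """Return suggested settings; all memory figures are in megabytes."""
    # Share of the node's YARN memory available to one executor container.
    container_mb = node_memory_mb * executor_cores // node_vcores
    overhead_mb = container_mb * overhead_percent // 100
    heap_mb = container_mb - overhead_mb
    executors_per_node = node_vcores // executor_cores
    return {
        "spark.executor.cores": executor_cores,
        "spark.executor.memory": f"{heap_mb}m",
        "spark.yarn.executor.memoryOverhead": overhead_mb,
        "max_executor_instances": executors_per_node * worker_nodes,
    }


settings = size_executors(node_memory_mb=50 * 1024, node_vcores=12,
                          executor_cores=6, overhead_percent=20,
                          worker_nodes=9)
for name, value in settings.items():
    print(f"{name} = {value}")
```

With the numbers from the example this yields 20480m of executor memory, 5120 MB of overhead, and at most 18 executor instances.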
Issues

Issue	Cause	Resolution
Error: Could not find or load main class org.apache.spark.deploy.SparkSubmit	The Spark dependency is not set correctly.	Add the Spark jar dependency to Hive, see above.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5.0:0 had a not serializable result: java.io.NotSerializableException: org.apache.hadoop.io.BytesWritable	The Spark serializer is not set to Kryo.	Set spark.serializer to org.apache.spark.serializer.KryoSerializer, see above.
Terminal initialization failed; falling back to unsupported java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected	Hive already ships the jline2 jar, but jline 0.94 exists in the Hadoop lib directory.	1. Delete jline from the Hadoop lib directory. 2. export HADOOP_USER_CLASSPATH_FIRST=true 3. If the error occurs during mvn test, run mvn clean install first and then run the tests.
Spark executor gets killed all the time and Spark keeps retrying the failed stage; you may find similar information in the YARN nodemanager log. WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=217989,containerID=container_1421717252700_0716_01_50767235] is running beyond physical memory limits. Current usage: 43.1 GB of 43 GB physical memory used; 43.9 GB of 90.3 GB virtual memory used. Killing container.	With Spark on YARN, the nodemanager kills a Spark executor when it uses more memory than spark.executor.memory + spark.yarn.executor.memoryOverhead.	Increase spark.yarn.executor.memoryOverhead so that the limit is not exceeded.
Running a query fails with: FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask	Happens on Mac. This is a general Snappy issue on Mac.	Run the following command before starting Hive or HiveServer2: export HADOOP_OPTS="-Dorg.xerial.snappy.tempdir=/tmp -Dorg.xerial.snappy.lib.name=libsnappyjava.jnilib $HADOOP_OPTS"
Stack trace: ExitCodeException exitCode=1: …/launch_container.sh: line 27: $PWD:$PWD/__spark__.jar:$HADOOP_CONF_DIR.../usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure:$PWD/__app__.jar:$PWD/*: bad substitution	The key mapreduce.application.classpath in /etc/hadoop/conf/mapred-site.xml contains a variable that is invalid in bash.	Remove :/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar from mapreduce.application.classpath in /etc/hadoop/conf/mapred-site.xml.
Exception in thread "Driver" scala.MatchError: java.lang.NoClassDefFoundError: org/apache/hadoop/mapreduce/TaskAttemptContext (of class java.lang.NoClassDefFoundError)	MR is not on the YARN classpath.	Change /hdp/apps/${hdp.version}/mapreduce/mapreduce.tar.gz#mr-framework to /hdp/apps/2.2.0.0-2041/mapreduce/mapreduce.tar.gz#mr-framework

Recommended Configuration

# see HIVE-9153
mapreduce.input.fileinputformat.split.maxsize=750000000
hive.vectorized.execution.enabled=true

hive.cbo.enable=true
hive.optimize.reducededuplication.min.reducer=4
hive.optimize.reducededuplication=true
hive.orc.splits.include.file.footer=false
hive.merge.mapfiles=true
hive.merge.sparkfiles=false
hive.merge.smallfiles.avgsize=16000000
hive.merge.size.per.task=256000000
hive.merge.orcfile.stripe.level=true
hive.auto.convert.join=true
hive.auto.convert.join.noconditionaltask=true
hive.auto.convert.join.noconditionaltask.size=894435328
hive.optimize.bucketmapjoin.sortedmerge=false
hive.map.aggr.hash.percentmemory=0.5
hive.map.aggr=true
hive.optimize.sort.dynamic.partition=false
hive.stats.autogather=true
hive.stats.fetch.column.stats=true
hive.vectorized.execution.reduce.enabled=false
hive.vectorized.groupby.checkinterval=4096
hive.vectorized.groupby.flush.percent=0.1
hive.compute.query.using.stats=true
hive.limit.pushdown.memory.usage=0.4
hive.optimize.index.filter=true
hive.exec.reducers.bytes.per.reducer=67108864
hive.smbjoin.cache.rows=10000
hive.exec.orc.default.stripe.size=67108864
hive.fetch.task.conversion=more
hive.fetch.task.conversion.threshold=1073741824
hive.fetch.task.aggr=false
mapreduce.input.fileinputformat.list-status.num-threads=5
spark.kryo.referenceTracking=false
spark.kryo.classesToRegister=org.apache.hadoop.hive.ql.io.HiveKey,org.apache.hadoop.io.BytesWritable,org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch
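If you prefer to keep these settings in hive-site.xml rather than typing them in the CLI, each key=value pair becomes a property element with a name and a value. The following Python sketch shows that mapping for a few of the settings listed above. The helper name to_hive_site_xml is made up for this example, and the output is a fragment for illustration, not a complete hive-site.xml.

```python
# Sketch: convert key=value settings into hive-site.xml <property> elements.
import xml.etree.ElementTree as ET


def to_hive_site_xml(props):
    """Return a <configuration> document with one <property> per setting."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


recommended = {
    "hive.vectorized.execution.enabled": "true",
    "hive.cbo.enable": "true",
    "hive.merge.sparkfiles": "false",
    "spark.kryo.referenceTracking": "false",
}

xml_text = to_hive_site_xml(recommended)
print(xml_text)
```

The remaining settings from the list can be added to the dictionary in the same way.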