Compatibility with Hadoop and Hive

By default, the official Spark 3.0 release supports Hadoop 2.7 and Hive 1.2 as its minimum versions. Our platform runs CDH 5.13, which ships hadoop-2.6.0 and hive-1.1.0, so we tried building Spark 3.0 ourselves.
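
To make the version gap concrete, here is a minimal sketch that compares the CDH 5.13 versions with the minimums quoted above. The helper name version_ge is made up for illustration, and it assumes a sort that supports -V (GNU coreutils does).

```shell
#!/usr/bin/env bash
# version_ge A B: succeed when version A >= version B.
# Hypothetical helper; relies on `sort -V` for version ordering.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Versions shipped with CDH 5.13 versus the minimums quoted in the text.
for pair in "Hadoop 2.6.0 2.7" "Hive 1.1.0 1.2"; do
  set -- $pair   # split into: component, shipped version, minimum version
  if version_ge "$2" "$3"; then
    echo "$1 $2 meets the minimum $3"
  else
    echo "$1 $2 is below the minimum $3"
  fi
done
```

Both components come out below the minimum, which is why a custom build is needed.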

Build environment: Maven 3.6.3, Java 8, Scala 2.12

Pre-building Hive

Hive 1.1.0 is so old that many of its dependencies are incompatible with Spark 3.0, so it has to be rebuilt.
Building the hive-exec module: mvn clean package install -DskipTests -pl ql -am -Phadoop-2

Code changes:

diff --git a/pom.xml b/pom.xml
index 5d14dc4..889b960 100644
--- a/pom.xml
+++ b/pom.xml
@@ -248,6 +248,14 @@
          <enabled>false</enabled>
        </snapshots>
     </repository>
+    <repository>
+      <id>spring</id>
+      <name>Spring repo</name>
+      <url>https://repo.spring.io/plugins-release/</url>
+      <releases>
+        <enabled>true</enabled>
+      </releases>
+    </repository>
   </repositories>

   <!-- Hadoop dependency management is done at the bottom under profiles -->
@@ -982,6 +990,9 @@
   <profiles>
     <profile>
       <id>thriftif</id>
+      <properties>
+        <thrift.home>/usr/local/opt/[email protected]</thrift.home>
+      </properties>
       <build>
         <plugins>
           <plugin>
diff --git a/ql/pom.xml b/ql/pom.xml
index 0c5e91f..101ef11 100644
--- a/ql/pom.xml
+++ b/ql/pom.xml
@@ -736,10 +736,7 @@
                   <include>org.apache.hive:hive-exec</include>
                   <include>org.apache.hive:hive-serde</include>
                   <include>com.esotericsoftware.kryo:kryo</include>
-                  <include>com.twitter:parquet-hadoop-bundle</include>
-                  <include>org.apache.thrift:libthrift</include>
                   <include>commons-lang:commons-lang</include>
-                  <include>org.apache.commons:commons-lang3</include>
                   <include>org.jodd:jodd-core</include>
                   <include>org.json:json</include>
                   <include>org.apache.avro:avro</include>
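
After rebuilding, it may be worth confirming that the packages dropped from the shade includes are really gone from the hive-exec jar. The sketch below is one way to check, not part of the original build. The function names and the jar path in the example are hypothetical, and it assumes unzip is installed. Only the thrift and commons-lang3 packages are checked here.

```shell
#!/usr/bin/env bash
# bundles_package PREFIX: read a jar listing (one entry per line) on stdin and
# succeed when any entry lives under the given package directory.
bundles_package() {
  grep -q "^$1/"
}

# check_hive_exec JAR: report whether packages removed from the shade includes
# (libthrift, commons-lang3) are still present in the jar. Requires unzip.
check_hive_exec() {
  local jar="$1" prefix
  for prefix in org/apache/thrift org/apache/commons/lang3; do
    if unzip -Z1 "$jar" | bundles_package "$prefix"; then
      echo "still bundled: $prefix"
    else
      echo "not bundled:   $prefix"
    fi
  done
}

# Example (hypothetical path; adjust to the actual build output):
# check_hive_exec ql/target/hive-exec-1.1.0-cdh5.13.0.jar
```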

Building Spark

# Apache
git clone [email protected]:apache/spark.git
cd spark
git checkout v3.0.0

# Leyan version: mainly covers the Spark/Hive compatibility changes
git clone [email protected]:HDP/spark.git
cd spark
git checkout -b v3.0.0_cloudera origin/v3.0.0_cloudera

./dev/make-distribution.sh --name cloudera --tgz -DskipTests -Phive -Phive-thriftserver -Pyarn -Pcdhhive

# Update the local Maven repository
mvn clean install -DskipTests=true -Phive -Phive-thriftserver -Pyarn -Pcdhhive

# deploy

rm -rf /root/spark-3.0.0-bin-cloudera
tar -zxvf spark-3.0.0-bin-cloudera.tgz
rm -rf /root/spark-3.0.0-bin-cloudera/conf
ln -s /etc/spark3/conf /root/spark-3.0.0-bin-cloudera/conf
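
The deploy steps above can be wrapped into a small reusable function. This is only a sketch: the name deploy_spark is made up, and it assumes the tarball unpacks into a directory named after the tarball, which is how make-distribution.sh lays it out.

```shell
#!/usr/bin/env bash
# deploy_spark TARBALL INSTALL_ROOT CONF_DIR
# Unpacks the distribution under INSTALL_ROOT and replaces its conf directory
# with a symlink to CONF_DIR. Hypothetical helper mirroring the manual steps.
deploy_spark() {
  local tarball="$1" install_root="$2" conf_dir="$3"
  local name
  name="$(basename "$tarball" .tgz)"
  local target="$install_root/$name"

  # Refuse to touch anything unless both inputs exist.
  [ -f "$tarball" ] || { echo "missing tarball: $tarball" >&2; return 1; }
  [ -d "$conf_dir" ] || { echo "missing conf dir: $conf_dir" >&2; return 1; }

  rm -rf "$target"                         # drop the previous installation
  tar -zxf "$tarball" -C "$install_root"   # unpack the new distribution
  rm -rf "$target/conf"                    # discard the bundled conf
  ln -s "$conf_dir" "$target/conf"         # point conf at the shared config
}

# Example with the paths used in the manual steps:
# deploy_spark spark-3.0.0-bin-cloudera.tgz /root /etc/spark3/conf
```

Checking the inputs before the first rm -rf keeps a mistyped path from wiping an existing installation.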

Tips

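Earlier versions could be built with Maven's parallel (multi-CPU) mode. That no longer works: a parallel build now deadlocks during compilation.
Do not specify both package and install in the same mvn command, otherwise the build runs into conflicts.
Template build command: mvn clean install -DskipTests -Phive -Phive-thriftserver -Pyarn -Pcdhhive. The modules to build and the build target can be customized.
Spark 3.0 still needs some patching before it can be used here. The yarn module needs small changes: mvn clean install -DskipTests=true -pl resource-managers/yarn -am -Phive -Phive-thriftserver -Pyarn -Pcdhhive
Once all Spark 3.0 artifacts are installed in the local repository, you can go on to build the above-board project.
Remove Spark 3.0's support for the newer Hive versions.
After switching to the CDH Hive version, we found that the commons jar shaded into that Hive build was too old, so we repackaged it.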
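
The first two tips can be enforced with a small guard in front of the Spark build. This is a hypothetical sketch (check_mvn_args is not an existing tool). It rejects any -T / --threads option, including a thread count of 1, and it rejects argument lists that name both the package and install phases.

```shell
#!/usr/bin/env bash
# check_mvn_args ARGS...: fail when the arguments request a parallel build
# (-T / --threads) or name both the package and install phases.
check_mvn_args() {
  local arg has_package=0 has_install=0
  for arg in "$@"; do
    case "$arg" in
      -T|-T*|--threads|--threads=*)
        echo "parallel build requested: $arg" >&2
        return 1 ;;
      package) has_package=1 ;;
      install) has_install=1 ;;
    esac
  done
  if [ "$has_package" -eq 1 ] && [ "$has_install" -eq 1 ]; then
    echo "both package and install were given" >&2
    return 1
  fi
}

# Usage: validate first, then run the build.
# args=(clean install -DskipTests -Phive -Phive-thriftserver -Pyarn -Pcdhhive)
# check_mvn_args "${args[@]}" && mvn "${args[@]}"
```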
