Spark 2.4.2 download:
Official address: https://archive.apache.org/dist/spark/spark-2.4.2/spark-2.4.2.tgz
Documentation on building Spark from source (see the official guide):
http://spark.apache.org/docs/latest/building-spark.html
Prerequisites for building Spark from source:

| Software | Version |
| -------- | ------- |
| Hadoop   | 2.6.0-cdh5.7.0 |
| Scala    | 2.11.12 |
| Maven    | 3.6.1 |
| JDK      | 1.8.0_45 |
Build and configuration:

1. Unpack the Spark source:

```shell
[hadoop@hadoop001 software]$ ll spark-2.4.2.tgz
-rw-r--r--. 1 hadoop hadoop 16165557 Apr 28 04:41 spark-2.4.2.tgz
[hadoop@hadoop001 software]$ tar -zxvf spark-2.4.2.tgz
[hadoop@hadoop001 software]$ cd spark-2.4.2
```
2. Hard-code the version numbers in dev/make-distribution.sh. Otherwise the script detects them itself by invoking Maven, which is quite time-consuming.

GitHub location of the make-distribution.sh script:
https://github.com/apache/spark/blob/master/dev/make-distribution.sh
```shell
[hadoop@hadoop001 spark-2.4.2]$ vim dev/make-distribution.sh
```

Replace this detection block:

```shell
VERSION=$("$MVN" help:evaluate -Dexpression=project.version $@ 2>/dev/null\
    | grep -v "INFO"\
    | grep -v "WARNING"\
    | tail -n 1)
SCALA_VERSION=$("$MVN" help:evaluate -Dexpression=scala.binary.version $@ 2>/dev/null\
    | grep -v "INFO"\
    | grep -v "WARNING"\
    | tail -n 1)
SPARK_HADOOP_VERSION=$("$MVN" help:evaluate -Dexpression=hadoop.version $@ 2>/dev/null\
    | grep -v "INFO"\
    | grep -v "WARNING"\
    | tail -n 1)
SPARK_HIVE=$("$MVN" help:evaluate -Dexpression=project.activeProfiles -pl sql/hive $@ 2>/dev/null\
    | grep -v "INFO"\
    | grep -v "WARNING"\
    | fgrep --count "<id>hive</id>";\
    # Reset exit status to 0, otherwise the script stops here if the last grep finds nothing
    # because we use "set -o pipefail"
    echo -n)
```

with the hard-coded values:

```shell
VERSION=2.4.2
SCALA_VERSION=2.11
SPARK_HADOOP_VERSION=2.6.0-cdh5.7.0
SPARK_HIVE=1
```
3. Edit pom.xml

To build against CDH, you must add the Cloudera repository:

```shell
[hadoop@hadoop614 spark-2.4.2]$ vim pom.xml
```

```xml
<repositories>
  <repository>
    <id>cloudera</id>
    <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
  </repository>
</repositories>
```
4. Build command

Looking at pom.xml, you can see that if the Hadoop and YARN versions are not specified on the command line, the build falls back to the default Hadoop/YARN versions declared in pom.xml.

```shell
[hadoop@hadoop001 spark-2.4.2]$ pwd
/home/hadoop/software/spark-2.4.2
[hadoop@hadoop614 spark-2.4.2]$ ./dev/make-distribution.sh --name 2.6.0-cdh5.7.0 --tgz \
  -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.7.0 -Phive -Phive-thriftserver -Pyarn -Pkubernetes
```

- `--name`: adds `2.6.0-cdh5.7.0` to the name of the resulting package, so you can tell at a glance which Hadoop version it supports
- `-Phadoop-2.6`: activates the hadoop-2.6 Maven profile (`-P` selects a profile)
- `-Dhadoop.version=2.6.0-cdh5.7.0`: sets a Maven property with `-D`, pinning the exact Hadoop version; if omitted, the default from pom.xml is used
- `-Phive`: builds with Hive support
- `-Phive-thriftserver`: builds the Thrift JDBC/ODBC server
- `-Pyarn`: builds with YARN support; the YARN version follows hadoop.version, so to change it use `-Dhadoop.version` as well
In addition, MAVEN_OPTS must be set before building, otherwise compilation fails with out-of-memory errors. The official documentation explains:
Building Apache Spark

Apache Maven

The Maven-based build is the build of reference for Apache Spark. Building Spark using Maven requires Maven 3.5.4 and Java 8. Note that support for Java 7 was removed as of Spark 2.2.0.

Setting up Maven's Memory Usage

You'll need to configure Maven to use more memory than usual by setting `MAVEN_OPTS`:

```shell
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
```

(The `ReservedCodeCacheSize` setting is optional but recommended.) If you don't add these parameters to `MAVEN_OPTS`, you may see errors and warnings like the following:

```
[INFO] Compiling 203 Scala sources and 9 Java sources to /Users/me/Development/spark/core/target/scala-2.12/classes...
[ERROR] Java heap space -> [Help 1]
```

You can fix these problems by setting the `MAVEN_OPTS` variable as discussed before.

Note:

- If using `build/mvn` with no `MAVEN_OPTS` set, the script will automatically add the above options to the `MAVEN_OPTS` environment variable.
- The `test` phase of the Spark build will automatically add these options to `MAVEN_OPTS`, even when not using `build/mvn`.
Unpack and deploy

1. Unpack the built distribution:

```shell
[hadoop@hadoop001 spark-2.4.2]$ ll spark-2.4.2-bin-2.6.0-cdh5.7.0.tgz
-rw-rw-r--. 1 hadoop hadoop 231193116 Apr 28 06:32 spark-2.4.2-bin-2.6.0-cdh5.7.0.tgz
[hadoop@hadoop001 spark-2.4.2]$ pwd
/home/hadoop/software/spark-2.4.2
[hadoop@hadoop001 spark-2.4.2]$ tar -zxvf spark-2.4.2-bin-2.6.0-cdh5.7.0.tgz -C ~/app
[hadoop@hadoop001 spark-2.4.2]$ cd ~/app
[hadoop@hadoop001 app]$ ls -ld spark-2.4.2-bin-2.6.0-cdh5.7.0/
drwxrwxr-x. 11 hadoop hadoop 4096 Apr 28 06:31 spark-2.4.2-bin-2.6.0-cdh5.7.0/
```
2. Configure environment variables:

```shell
[hadoop@hadoop001 app]$ vim ~/.bash_profile
# add:
export SPARK_HOME=/home/hadoop/app/spark-2.4.2-bin-2.6.0-cdh5.7.0
export PATH=${SPARK_HOME}/bin:$PATH
[hadoop@hadoop001 app]$ source ~/.bash_profile
```
Start Spark:

```shell
[hadoop@hadoop001 spark-2.4.2]$ ./spark-shell
19/04/28 06:44:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://hadoop614:4040
Spark context available as 'sc' (master = local[*], app id = local-1556405067469).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.2
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_45)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
```
- `master`: the mode Spark is running in
- `local[*]`: run locally, with one worker thread per CPU core
Reference: <https://skygzx.github.io/2019/04/28/Spark%E7%BC%96%E8%AF%91hadoop-2.6.0-cdh2.7.0/>