Compiling Spark against Hadoop CDH

Spark 2.4.2 download:

Official archive: https://archive.apache.org/dist/spark/spark-2.4.2/spark-2.4.2.tgz

Documentation for building Spark from source (see the official guide):

http://spark.apache.org/docs/latest/building-spark.html

Prerequisites for building Spark from source

Software   Version
Hadoop     2.6.0-cdh5.7.0
Scala      2.11.12
Maven      3.6.1
JDK        1.8.0_45
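Before starting, it is worth confirming that the build machine actually has these versions. A quick sanity check (a sketch, assuming all four tools are already on the PATH):

java -version      # expect java version "1.8.0_45"
mvn -version       # expect Apache Maven 3.6.1
scala -version     # expect Scala 2.11.12
hadoop version     # expect Hadoop 2.6.0-cdh5.7.0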

Build and configuration:

1. Extract the Spark source:


[hadoop@hadoop001 software]$ ll spark-2.4.2.tgz

 

-rw-r--r--. 1 hadoop hadoop 16165557 4月  28 04:41 spark-2.4.2.tgz

 

[hadoop@hadoop001 software]$ tar -zxvf spark-2.4.2.tgz

 

[hadoop@hadoop001 software]$ cd spark-2.4.2

2. Hard-code the version numbers in dev/make-distribution.sh, so that the build does not have to resolve them itself, which is time-consuming.

The make-distribution.sh script on GitHub:

https://github.com/apache/spark/blob/master/dev/make-distribution.sh


[hadoop@hadoop001 spark-2.4.2]$ vim dev/make-distribution.sh

# Replace this block:

VERSION=$("$MVN" help:evaluate -Dexpression=project.version $@ 2>/dev/null\
    | grep -v "INFO"\
    | grep -v "WARNING"\
    | tail -n 1)
SCALA_VERSION=$("$MVN" help:evaluate -Dexpression=scala.binary.version $@ 2>/dev/null\
    | grep -v "INFO"\
    | grep -v "WARNING"\
    | tail -n 1)
SPARK_HADOOP_VERSION=$("$MVN" help:evaluate -Dexpression=hadoop.version $@ 2>/dev/null\
    | grep -v "INFO"\
    | grep -v "WARNING"\
    | tail -n 1)
SPARK_HIVE=$("$MVN" help:evaluate -Dexpression=project.activeProfiles -pl sql/hive $@ 2>/dev/null\
    | grep -v "INFO"\
    | grep -v "WARNING"\
    | fgrep --count "<id>hive</id>";\
    # Reset exit status to 0, otherwise the script stops here if the last grep finds nothing\
    # because we use "set -o pipefail"
    echo -n)

 

# with:

VERSION=2.4.2

SCALA_VERSION=2.11

SPARK_HADOOP_VERSION=2.6.0-cdh5.7.0

SPARK_HIVE=1
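These hard-coded values should agree with what Maven would have resolved. A quick cross-check against the source tree (a sketch; run from the spark-2.4.2 root):

grep -m 1 '<version>' pom.xml                 # expect <version>2.4.2</version>
grep -m 1 '<scala.binary.version>' pom.xml    # expect 2.11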

3. Edit pom.xml

To build against CDH, you must add the Cloudera repository:


[hadoop@hadoop614 spark-2.4.2]$ vim pom.xml

 

<repositories>
  <repository>
    <id>cloudera</id>
    <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
  </repository>
</repositories>
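Before kicking off the long build, you can check that the CDH artifacts are reachable from this repository. A minimal probe (a sketch, assuming curl is available; the path follows the standard Maven repository layout):

curl -sI https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/hadoop/hadoop-client/2.6.0-cdh5.7.0/hadoop-client-2.6.0-cdh5.7.0.pom | head -n 1
# a 200 status here means Maven will be able to pull the CDH Hadoop dependencies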

4. Build command

Looking at pom.xml, you can see that if you do not explicitly pin the Hadoop and YARN versions at build time, the build falls back to the default versions declared there.
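A quick way to see those defaults before overriding them (a sketch; in the Spark 2.4.x root pom, yarn.version is defined as ${hadoop.version}, so YARN follows Hadoop unless overridden):

grep -m 1 '<hadoop.version>' pom.xml   # the default Apache Hadoop version
grep -m 1 '<yarn.version>' pom.xml     # ${hadoop.version}: YARN follows Hadoop by default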

 


[hadoop@hadoop001 spark-2.4.2]$ pwd

/home/hadoop/software/spark-2.4.2

[hadoop@hadoop614 spark-2.4.2]$ ./dev/make-distribution.sh --name 2.6.0-cdh5.7.0 --tgz -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.7.0 -Phive -Phive-thriftserver -Pyarn -Pkubernetes

 

 

--name: appends 2.6.0-cdh5.7.0 to the name of the resulting package, so you can tell at a glance which Hadoop version it supports

-Phadoop-2.6: selects the hadoop-2.6 profile (profiles are chosen with -P)

-Dhadoop.version=2.6.0-cdh5.7.0: sets the hadoop.version property with -D to pin the exact Hadoop build; if omitted, the default version from pom.xml is used

-Phive: enables Hive support

-Phive-thriftserver: enables the Thrift JDBC/ODBC server

-Pyarn: enables YARN support; its version follows hadoop.version, so to change it use -Dhadoop.version

In addition, MAVEN_OPTS must be set before compiling, or compilation will fail with out-of-memory errors. The official documentation explains:

Building Apache Spark

Apache Maven

The Maven-based build is the build of reference for Apache Spark. Building Spark using Maven requires Maven 3.5.4 and Java 8. Note that support for Java 7 was removed as of Spark 2.2.0.

Setting up Maven’s Memory Usage

You’ll need to configure Maven to use more memory than usual by setting MAVEN_OPTS:

export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"

(The ReservedCodeCacheSize setting is optional but recommended.) If you don’t add these parameters to MAVEN_OPTS, you may see errors and warnings like the following:

[INFO] Compiling 203 Scala sources and 9 Java sources to /Users/me/Development/spark/core/target/scala-2.12/classes...
[ERROR] Java heap space -> [Help 1]

You can fix these problems by setting the MAVEN_OPTS variable as discussed before.

Note:

  • If using build/mvn with no MAVEN_OPTS set, the script will automatically add the above options to the MAVEN_OPTS environment variable.
  • The test phase of the Spark build will automatically add these options to MAVEN_OPTS, even when not using build/mvn.
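Putting it together, a complete build session might look like this (a sketch that simply combines the memory settings with the command and profiles explained above):

export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
cd /home/hadoop/software/spark-2.4.2
./dev/make-distribution.sh --name 2.6.0-cdh5.7.0 --tgz \
  -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.7.0 \
  -Phive -Phive-thriftserver -Pyarn -Pkubernetes
# on success, spark-2.4.2-bin-2.6.0-cdh5.7.0.tgz appears in the source root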

 

Extract and deploy

1. Extract


[hadoop@hadoop001 spark-2.4.2]$ ll spark-2.4.2-bin-2.6.0-cdh5.7.0.tgz

-rw-rw-r--. 1 hadoop hadoop 231193116 4月  28 06:32 spark-2.4.2-bin-2.6.0-cdh5.7.0.tgz

[hadoop@hadoop001 spark-2.4.2]$ pwd

/home/hadoop/software/spark-2.4.2

[hadoop@hadoop001 spark-2.4.2]$ tar -zxvf spark-2.4.2-bin-2.6.0-cdh5.7.0.tgz -C ~/app

[hadoop@hadoop001 spark-2.4.2]$ cd ~/app

[hadoop@hadoop001 app]$ ls -ld spark-2.4.2-bin-2.6.0-cdh5.7.0/

drwxrwxr-x. 11 hadoop hadoop 4096 4月  28 06:31 spark-2.4.2-bin-2.6.0-cdh5.7.0/

2. Configure environment variables


[hadoop@hadoop001 app]$ vim ~/.bash_profile

 

export SPARK_HOME=/home/hadoop/app/spark-2.4.2-bin-2.6.0-cdh5.7.0

export PATH=${SPARK_HOME}/bin:$PATH

 

[hadoop@hadoop001 app]$ source ~/.bash_profile
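To confirm the shell now resolves the new build (a quick optional check):

which spark-shell       # should point into /home/hadoop/app/spark-2.4.2-bin-2.6.0-cdh5.7.0/bin
spark-shell --version   # prints the version banner for Spark 2.4.2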

Start Spark


[hadoop@hadoop001 spark-2.4.2]$ spark-shell

19/04/28 06:44:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties

Setting default log level to "WARN".

To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

Spark context Web UI available at http://hadoop614:4040

Spark context available as 'sc' (master = local[*], app id = local-1556405067469).

Spark session available as 'spark'.

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.2
      /_/

 

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_45)

Type in expressions to have them evaluated.

Type :help for more information.

 

scala>

master: the mode the shell runs in

local[*]: run locally, using as many worker threads as there are cores
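Beyond the interactive banner, a short non-interactive smoke test confirms the build can actually run a job (a sketch; the expression is fed to spark-shell over stdin, and the sum of 1..100 is 5050.0):

echo 'println(sc.parallelize(1 to 100).sum)' | spark-shell --master 'local[*]' 2>/dev/null
# 5050.0 should appear in the output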

 

Reference: <https://skygzx.github.io/2019/04/28/Spark%E7%BC%96%E8%AF%91hadoop-2.6.0-cdh2.7.0/>
