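Building a Spark Cluster Completely from Scratch
Note: these steps assume everything is installed and run as root. A production environment should have proper permission controls; I will experiment with that and write it up in a separate tutorial later.
1. Install each piece of software and set the environment variables (each package has to be downloaded separately)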
export JAVA_HOME=/usr/java/jdk1.8.0_71
export JAVA_BIN=/usr/java/jdk1.8.0_71/bin
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export JAVA_HOME JAVA_BIN PATH CLASSPATH
export HADOOP_HOME=/usr/local/hadoop-2.6.0
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_HOME}/lib/native
export HADOOP_OPTS="-Djava.library.path=${HADOOP_HOME}/lib"
export PATH=${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:$PATH
export SCALA_HOME=/usr/local/scala-2.10.4
export PATH=${SCALA_HOME}/bin:$PATH
export SPARK_HOME=/usr/local/spark/spark-1.6.0-bin-hadoop2.6
export PATH=${SPARK_HOME}/bin:${SPARK_HOME}/sbin:$PATH
export ZOOKEEPER_HOME=/usr/local/zookeeper-3.4.6
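Before moving on, it can help to confirm that every home directory set above really exists, since a mistyped path tends to surface only much later. The following is a minimal sketch and not part of the original procedure: check_homes is a hypothetical helper name, and it assumes the exports above have already been loaded into the current shell.

```shell
#!/bin/sh
# check_homes: report whether each *_HOME variable points at an existing directory.
# Hypothetical helper; the variable names are the ones exported above.
check_homes() {
  status=0
  for name in JAVA_HOME HADOOP_HOME SCALA_HOME SPARK_HOME ZOOKEEPER_HOME; do
    eval "dir=\${$name:-}"          # look up the variable whose name is stored in $name
    if [ -n "$dir" ] && [ -d "$dir" ]; then
      echo "ok      $name=$dir"
    else
      echo "MISSING $name=${dir:-<unset>}"
      status=1
    fi
  done
  return "$status"                  # non-zero if at least one directory is missing
}

check_homes || echo "fix the entries marked MISSING before continuing"
```

Every line marked MISSING points at a variable that is unset or at a directory that has not been unpacked yet.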
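2. SSH setup
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa   # generate the key pair at ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys   # append the public key to the authorized keys
3. Host name and host-name mapping
vi /etc/hostname   # set it to Master, or to Worker1, Worker2, Worker3, Worker4
vim /etc/hosts   # map the IP of every machine to its host name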
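As an illustration of what the /etc/hosts mapping can look like, here is a small sketch that appends one line per machine. The 192.168.1.x addresses and the add_host helper are assumptions for illustration only; replace the addresses with the real ones. By default the sketch writes to a scratch file, ./hosts.example, so nothing is changed until HOSTS_FILE is pointed at /etc/hosts.

```shell
#!/bin/sh
# Target file; defaults to a scratch file so the real /etc/hosts is untouched.
HOSTS_FILE="${HOSTS_FILE:-./hosts.example}"

# add_host IP NAME: append the mapping unless NAME is already listed in the file.
add_host() {
  if grep -qw "$2" "$HOSTS_FILE" 2>/dev/null; then
    echo "skip $2 (already in $HOSTS_FILE)"
  else
    printf '%s\t%s\n' "$1" "$2" >> "$HOSTS_FILE"
  fi
}

# Hypothetical addresses: use the real IPs of your machines.
add_host 192.168.1.10 Master
add_host 192.168.1.11 Worker1
add_host 192.168.1.12 Worker2
add_host 192.168.1.13 Worker3
```

The same file content is needed on every machine, so that each node can resolve Master and all the workers by name.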
4. Hadoop configuration
1) In $HADOOP_HOME/etc/hadoop/, edit core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://Master:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/hadoop-2.6.0/tmp</value>
</property>
<property>
<name>hadoop.native.lib</name>
<value>true</value>
<description>Should native hadoop libraries, if present, be used</description>
</property>
</configuration>
2) Still in $HADOOP_HOME/etc/hadoop/, edit hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>Master:50090</value>
<description>The secondary namenode http server address and port</description>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop/hadoop-2.6.0/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop/hadoop-2.6.0/dfs/data</value>
</property>
<property>
<name>dfs.namenode.checkpoint.dir</name>
<value>file:///usr/local/hadoop/hadoop-2.6.0/dfs/namesecondary</value>
<description>Determines where on the local filesystem the DFS secondary name node should store the temporary images to merge. If this is a comma-delimited list of directories then the image is replicated in all of the directories for redundancy.</description>
</property>
</configuration>
3) Still in $HADOOP_HOME/etc/hadoop/, edit mapred-site.xml (Hadoop 2.6.0 ships only mapred-site.xml.template, so copy that to mapred-site.xml first)
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
4) Still in $HADOOP_HOME/etc/hadoop/, edit yarn-site.xml
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>Master</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
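5) Still in $HADOOP_HOME/etc/hadoop/, edit hadoop-env.sh
export JAVA_HOME=/usr/java/jdk1.8.0_71   # the directory of your JDK
If you want Master to act as a worker node as well, you can add Master too, for example when you are short of machines. But if the Driver runs on Master alongside other programs such as web queries, using Master as a worker node is not recommended.
======= Configure one machine up to this point, then clone it to the other machines, and continue with the steps below =============
6) Go back to step 3 (host name and host-name mapping) and set the host name and /etc/hosts on every machine
7) Still in $HADOOP_HOME/etc/hadoop/, edit slaves
List the host names of all your slave machines, one per line, for example: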
Worker1
Worker2
Worker3
Then copy the file to each machine:
scp slaves root@Worker1:/usr/local/hadoop-2.6.0/etc/hadoop/slaves
scp slaves root@Worker2:/usr/local/hadoop-2.6.0/etc/hadoop/slaves
scp slaves root@Worker3:/usr/local/hadoop-2.6.0/etc/hadoop/slaves
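With more than a handful of workers, typing one scp per machine gets error-prone (a missing colon after the host name is enough to make scp copy to a local path instead). A loop is one way around that. This is only a sketch: push_to_workers and the RUN switch are hypothetical names, and by default the commands are printed rather than executed so they can be reviewed first.

```shell
#!/bin/sh
# RUN defaults to "echo", which only prints each command.
# Invoke with RUN set to the empty string to really execute them.
RUN="${RUN-echo}"
WORKERS="Worker1 Worker2 Worker3"

# push_to_workers LOCAL_FILE REMOTE_PATH: copy one file to the same path on every worker.
push_to_workers() {
  for w in $WORKERS; do
    $RUN scp "$1" "root@$w:$2"
  done
}

push_to_workers slaves /usr/local/hadoop-2.6.0/etc/hadoop/slaves
```

The host list lives in one place, so adding Worker4 later means changing a single line.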
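8) Still in $HADOOP_HOME/etc/hadoop/, edit the file Master; its content is simply Master
When Master is not set up as an HA cluster, the Master file has to be copied to every machine. It is best to copy it in any case, so that things still run when the HA cluster is not started.
If Master is to be made highly available with ZooKeeper, that is configured in the ZooKeeper part.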
scp Master root@Worker1:/usr/local/hadoop-2.6.0/etc/hadoop/Master
scp Master root@Worker2:/usr/local/hadoop-2.6.0/etc/hadoop/Master
scp Master root@Worker3:/usr/local/hadoop-2.6.0/etc/hadoop/Master
9) Format the file system on Master
mkdir -p /usr/local/hadoop/hadoop-2.6.0/tmp   # if the directory already existed, delete it first and recreate it
hdfs namenode -format
10) Start HDFS
cd $HADOOP_HOME/sbin
./start-dfs.sh
Then open
http://Master:50070/dfshealth.html to check the state of HDFS
If nothing shows up there, for example Configured Capacity is only 0 B, try turning off the firewall on every machine:
systemctl stop firewalld.service
systemctl disable firewalld.service
This is only acceptable on development machines, though; in production, work out exactly which ports have to be open instead.
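As a starting point for the production case, the ports can be opened one by one instead of stopping firewalld. The sketch below is an assumption-laden example, not a verified list: it covers the ports that appear in this tutorial (9000, 50070, 50090, 7077, 8080, 18080, 2181, 2888, 3888) plus the Hadoop 2.x DataNode defaults (50010, 50020, 50075), and it assumes none of them has been changed. Spark workers and executors also use further ports that are chosen at random unless configured, so the list is probably incomplete. open_ports and RUN are hypothetical names; the commands are only printed by default.

```shell
#!/bin/sh
# RUN defaults to "echo" (print only); invoke with RUN set to the empty string to execute.
RUN="${RUN-echo}"

# Ports named in this tutorial plus the Hadoop 2.x DataNode defaults (assumed unchanged).
PORTS="9000 50070 50090 50010 50020 50075 7077 8080 18080 2181 2888 3888"

# open_ports: add a permanent firewalld rule per port, then reload the firewall.
open_ports() {
  for p in $PORTS; do
    $RUN firewall-cmd --permanent --add-port="$p/tcp"
  done
  $RUN firewall-cmd --reload
}

open_ports
```

If DataNodes still report no capacity after this, compare the listening ports on each node with the list and extend PORTS accordingly.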
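*************** If you only need Spark, this is enough; Hadoop itself is covered separately ****************
5. Spark configuration
1) spark-env.sh
cd $SPARK_HOME/conf
cp spark-env.sh.template spark-env.sh   # copy the template, then add the following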
export JAVA_HOME=/usr/java/jdk1.8.0_71
export SCALA_HOME=/usr/local/scala-2.10.4
export HADOOP_HOME=/usr/local/hadoop-2.6.0
export HADOOP_CONF_DIR=/usr/local/hadoop-2.6.0/etc/hadoop
#export SPARK_CLASSPATH=$SPARK_CLASSPATH:$SPARK_HOME/lib/ojdbc-14.jar:$SPARK_HOME/lib/jieyi-tools-1.2.0.7.RELEASE.jar
#export SPARK_MASTER_IP=Master
export SPARK_WORKER_MEMORY=2g
export SPARK_EXECUTOR_MEMORY=2g
export SPARK_DRIVER_MEMORY=2g
export SPARK_WORKER_CORES=8
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=Master:2181,Worker1:2181,Worker2:2181 -Dspark.deploy.zookeeper.dir=/spark"
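What the parameters mean:
export JAVA_HOME=/usr/java/jdk1.8.0_71
export SCALA_HOME=/usr/local/scala-2.10.4
export HADOOP_HOME=/usr/local/hadoop-2.6.0
export HADOOP_CONF_DIR=/usr/local/hadoop-2.6.0/etc/hadoop   # required when running in YARN mode
export SPARK_MASTER_IP=Master   # host name (or IP) of the Spark master
export SPARK_WORKER_MEMORY=2g   # memory available on each worker machine
export SPARK_EXECUTOR_MEMORY=2g   # memory for each executor, i.e. for the actual computation
export SPARK_DRIVER_MEMORY=2g
export SPARK_WORKER_CORES=8   # cores per worker, i.e. how many tasks run concurrently
Of these, export SPARK_MASTER_IP=Master is the setting for a single master, while export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=Master:2181,Worker1:2181,Worker2:2181 -Dspark.deploy.zookeeper.dir=/spark" is the setting for a cluster of masters (HA). Use one or the other.
After editing, sync the file: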
scp spark-env.sh root@Worker1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/conf/spark-env.sh
scp spark-env.sh root@Worker2:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/conf/spark-env.sh
scp spark-env.sh root@Worker3:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/conf/spark-env.sh
2) slaves
cd $SPARK_HOME/conf
cp slaves.template slaves
with the following content:
Worker1
Worker2
Worker3
After editing, sync the file:
scp slaves root@Worker1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/conf/slaves
scp slaves root@Worker2:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/conf/slaves
scp slaves root@Worker3:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/conf/slaves
3) spark-defaults.conf
cd $SPARK_HOME/conf
cp spark-defaults.conf.template spark-defaults.conf
spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
spark.eventLog.enabled true
spark.eventLog.dir hdfs://Master:9000/historyserverforSpark1
spark.yarn.historyServer.address Master:18080
spark.history.fs.logDirectory hdfs://Master:9000/historyserverforSpark1
After editing, sync the file:
scp spark-defaults.conf root@Worker1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/conf/spark-defaults.conf
scp spark-defaults.conf root@Worker2:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/conf/spark-defaults.conf
scp spark-defaults.conf root@Worker3:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/conf/spark-defaults.conf
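The three sync steps for spark-env.sh, slaves and spark-defaults.conf follow the same pattern, so they can be folded into one nested loop. This is a sketch under the assumption that Spark sits at the same path on every machine, as the scp commands above imply; sync_spark_conf and RUN are hypothetical names, and the commands are printed instead of executed unless RUN is set to the empty string.

```shell
#!/bin/sh
# RUN defaults to "echo" (print only); invoke with RUN set to the empty string to execute.
RUN="${RUN-echo}"
WORKERS="Worker1 Worker2 Worker3"
CONF_DIR=/usr/local/spark/spark-1.6.0-bin-hadoop2.6/conf

# sync_spark_conf: copy each of the three edited files to every worker.
sync_spark_conf() {
  for f in spark-env.sh slaves spark-defaults.conf; do
    for w in $WORKERS; do
      $RUN scp "$CONF_DIR/$f" "root@$w:$CONF_DIR/$f"
    done
  done
}

sync_spark_conf
```

Rerunning it after any later change to one of the three files keeps all workers consistent with Master.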
Or do all three steps above in one go by copying the whole Spark directory from its parent directory:
cd $SPARK_HOME/..
scp -r ./spark-1.6.0-bin-hadoop2.6/ root@Worker1:/usr/local/spark
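4) Create the history directory (required on a first install)
hdfs dfs -rm -r -f /historyserverforSpark1
hdfs dfs -mkdir /historyserverforSpark1
The directory now exists in HDFS. Its name has to match the one used for spark.eventLog.dir and spark.history.fs.logDirectory in spark-defaults.conf.
5) Start Spark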
cd $SPARK_HOME/sbin
./start-all.sh
Check the web console:
http://Master:8080/
6) Start the history server
cd $SPARK_HOME/sbin
./start-history-server.sh
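7) Try out the Pi example:
cd $SPARK_HOME/bin
./spark-submit --class org.apache.spark.examples.SparkPi --master spark://Master:7077 ../lib/spark-examples-1.6.0-hadoop2.6.0.jar 100
Let the wonderful Spark journey begin!
*************** For a single Spark master this is enough; what follows adds ZooKeeper to make the master highly available ****************
6. ZooKeeper cluster setup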
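1) Unpack ZooKeeper on the first machine, into the directory given by the environment variable at the top
Go into the ZooKeeper directory and create two directories, data and logs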
root@Master:/usr/local/zookeeper-3.4.6# mkdir data
root@Master:/usr/local/zookeeper-3.4.6# mkdir logs
2) Copy zoo_sample.cfg to zoo.cfg and edit it
root@Master:/usr/local/zookeeper-3.4.6/conf# cp zoo_sample.cfg zoo.cfg
root@Master:/usr/local/zookeeper-3.4.6/conf# vi zoo.cfg
Change the following (for a three-machine cluster):
dataDir=/usr/local/zookeeper-3.4.6/data
dataLogDir=/usr/local/zookeeper-3.4.6/logs
server.0=Master:2888:3888
server.1=Worker1:2888:3888
server.2=Worker2:2888:3888
3) Give the machine its id under data
root@Master:/usr/local/zookeeper-3.4.6/conf# cd ../data/
root@Master:/usr/local/zookeeper-3.4.6/data# echo 0 > myid
root@Master:/usr/local/zookeeper-3.4.6/data# ls
myid
root@Master:/usr/local/zookeeper-3.4.6/data# cat myid
0
Mind the space after the 0: "echo 0>myid" is parsed as a redirection of file descriptor 0 and leaves myid empty. If that happens, write the 0 into the file with vi myid instead.
At this point one machine is fully configured
4) Copy it to the other two machines and change myid on each
root@Master:/usr/local# scp -r ./zookeeper-3.4.6 root@Worker1:/usr/local
root@Master:/usr/local# scp -r ./zookeeper-3.4.6 root@Worker2:/usr/local
Then log in to Worker1 and Worker2 and change myid to 1 and 2 respectively
At this point ZooKeeper is configured on all three machines
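Since the myid on each machine has to agree with its server.N line in zoo.cfg, the ids can also be written from Master in one pass over ssh. This is a sketch that assumes the numbering used above (Master is 0, Worker1 is 1, Worker2 is 2) and passwordless ssh as root; write_myids and RUN are hypothetical names, and by default the commands are only printed.

```shell
#!/bin/sh
# RUN defaults to "echo" (print only); invoke with RUN set to the empty string to execute.
RUN="${RUN-echo}"
ZK_DATA=/usr/local/zookeeper-3.4.6/data

# write_myids: give each host the id matching its server.N entry in zoo.cfg.
write_myids() {
  id=0
  for host in Master Worker1 Worker2; do   # order must follow server.0, server.1, server.2
    $RUN ssh "root@$host" "echo $id > $ZK_DATA/myid"
    id=$((id + 1))
  done
}

write_myids
```

Afterwards, cat the myid file on each machine once to confirm it contains the expected number.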
5) The next step is to make Spark support HA through ZooKeeper
Configure it in spark-env.sh
root@Worker1:/usr/local/spark-1.6.0-bin-hadoop2.6/conf# vi spark-env.sh
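# The state of the whole cluster is maintained and recovered through ZooKeeper (the line below is the HA setting discussed earlier; switching between a single master and HA depends on it)
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=Master:2181,Worker1:2181,Worker2:2181 -Dspark.deploy.zookeeper.dir=/spark"
HA is now configured, so this line has to stay commented out: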
#export SPARK_MASTER_IP=Master
root@Worker1:/usr/local/spark-1.6.0-bin-hadoop2.6/conf# scp spark-env.sh root@Worker1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/conf/spark-env.sh
spark-env.sh 100% 500 0.5KB/s 00:00
root@Worker1:/usr/local/spark-1.6.0-bin-hadoop2.6/conf# scp spark-env.sh root@Worker2:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/conf/spark-env.sh
spark-env.sh 100% 500 0.5KB/s 00:00
At this point Spark is configured on all three machines as well; what remains is starting everything
6) Overall startup sequence
Start Hadoop HDFS
cd $HADOOP_HOME/sbin
./start-dfs.sh
Start ZooKeeper on each of the three ZooKeeper machines:
cd $ZOOKEEPER_HOME/bin
./zkServer.sh start
Start Spark
On Master:
cd $SPARK_HOME/sbin
./start-all.sh
./start-history-server.sh
On the other two machines:
cd $SPARK_HOME/sbin
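./start-master.sh
Run jps on each of the three machines to check the processes,
or look at the web console.
The whole cluster is now up.
7) To try out the failover
start ./spark-shell --master spark://Master:7077,Worker1:7077,Worker2:7077
Then stop the master process on Master with ./stop-master.sh. After a while (a few seconds to a few minutes, depending on the machines) the active master switches automatically to one of the other machines.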
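The multi-master URL is just the list of master hosts, each with port 7077, joined by commas after a single spark:// prefix. A small helper can build it from the host list so that the same string is used for spark-shell and spark-submit. This is a sketch: master_url is a hypothetical helper name, and it assumes every master listens on the default port 7077 as in the commands above.

```shell
#!/bin/sh
# master_url HOST...: join the hosts into one spark:// URL, each on port 7077.
master_url() {
  url=""
  for h in "$@"; do
    url="${url:+$url,}$h:7077"   # prepend a comma only when url is non-empty
  done
  echo "spark://$url"
}

master_url Master Worker1 Worker2
# prints spark://Master:7077,Worker1:7077,Worker2:7077
```

It can then be used as ./spark-shell --master "$(master_url Master Worker1 Worker2)", and the client registers with whichever of the listed masters is currently active.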