1. Image Build Approach
We will use Docker to build a hadoop, spark, hive and mysql cluster. First an image is built with a Dockerfile, copying the required software into agreed-upon directories. The configuration files are prepared outside the image in advance and are then copied or moved into the hadoop, spark and hive configuration directories when the image is built and the containers are started (docker build / docker run). One point to note: for Spark to read data stored in Hive, the configuration file hive-site.xml must be copied into Spark's conf directory (when Spark reads Hive tables it uses hive-site.xml to locate and communicate with Hive). In addition, because mysql will be used to store the Hive metadata, mysql must be reachable from the other nodes, so its access permissions have to be configured accordingly.
If we edited the configuration files inside a running container instead, everything would be lost once the container is removed with docker rm, unless the changes had first been saved back into the image with docker commit.
2. Overall Cluster Architecture Design
There are five nodes in total, i.e. five containers will be started. The three containers hadoop-maste, hadoop-node1 and hadoop-node2 run the hadoop and spark cluster, the hadoop-hive container runs Hive, and the hadoop-mysql container runs the mysql database.
In Spark, calling enableHiveSupport() on the SparkSession builder enables support for operating on Hive warehouse tables. Mysql is used to store Hive's metadata. A Spark DataFrame can, of course, also write its data into Mysql through its write API.
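To make this concrete, here is a minimal PySpark sketch of the two operations just described. The Hive table test_db.orders, the target MySQL database, table, user and password are placeholder names, and it assumes the mysql-connector-java driver listed in section 4 is on Spark's classpath:
from pyspark.sql import SparkSession

# Build a SparkSession with Hive support so Spark can read Hive warehouse tables.
spark = SparkSession.builder \
    .appName("hive-mysql-demo") \
    .enableHiveSupport() \
    .getOrCreate()

# Read a Hive table (hypothetical database and table names).
df = spark.sql("SELECT * FROM test_db.orders")

# Write the DataFrame into MySQL over JDBC. The hostname comes from the cluster
# plan above; database, table, user and password are placeholders.
df.write.format("jdbc") \
    .option("url", "jdbc:mysql://hadoop-mysql:3306/test_db") \
    .option("driver", "com.mysql.jdbc.Driver") \
    .option("dbtable", "orders_copy") \
    .option("user", "root") \
    .option("password", "your_password_here") \
    .mode("append") \
    .save()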
3. Cluster Network Planning and Subnet Configuration
The network is configured through Docker's built-in networking support (Docker Networking). First we set up the network; in Docker a subnet is created with the docker network create command. Here we create the following subnet from the command line:
docker network create --subnet=172.16.0.0/16 spark
--subnet specifies the network segment of the subnet, and the subnet is given the name spark.
Next we plan the IP address of each container in the cluster within the spark subnet we just created. The IP allocation is as follows:
Note: all five container hostnames start with hadoop-*, because we are going to configure SSH login between the containers. Rather than generating id_rsa.pub public keys, we can enable communication between the containers by configuring SSH host-matching rules.
4. Software Versions
Spark: the latest version, 2.3.0
Hadoop: the stable release hadoop-2.7.3
Hive: the latest stable version, hive-2.3.2
Scala: Scala-2.11.8
JDK: jdk-8u101-linux-x64
Mysql: mysql-5.5.45-linux2.6-x86_64
Driver for Hive and Spark to connect to Mysql: mysql-connector-java-5.1.37-bin.jar
5. SSH Key-Free Login Rule Configuration
Here we do not take the usual approach of generating id_rsa.pub with ssh-keygen -t rsa -P and having the cluster nodes copy each other's id_rsa.pub into their authorized_keys files. Instead, an ssh_conf file is placed in the .ssh directory; ssh_conf defines the rules SSH follows when the containers communicate.
ssh_conf contents:
Host localhost
StrictHostKeyChecking no
Host 0.0.0.0
StrictHostKeyChecking no
Host hadoop-*
StrictHostKeyChecking no
6. Hadoop, HDFS and Yarn Configuration Files
Hadoop's configuration files live under HADOOP_HOME/etc/hadoop. The nine important ones are core-site.xml, hadoop-env.sh, hdfs-site.xml, mapred-env.sh, mapred-site.xml, yarn-env.sh, yarn-site.xml, master and slaves.
Among them, core-site.xml configures the access path of hadoop's default file system, along with related settings such as which users and groups may access it (the proxy-user entries). core-site.xml is configured as follows:
<?xml version="1.0"?>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop-maste:9000/</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/usr/local/hadoop/tmp</value>
</property>
<property>
<name>hadoop.proxyuser.root.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.root.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.oozie.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.oozie.groups</name>
<value>*</value>
</property>
</configuration>
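As a side note on what fs.defaultFS means in practice: with this setting, any path given without a scheme is resolved against hdfs://hadoop-maste:9000. A minimal, hypothetical PySpark sketch (the path /tmp/words.txt is a placeholder):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("defaultfs-demo").getOrCreate()

# "/tmp/words.txt" resolves to hdfs://hadoop-maste:9000/tmp/words.txt,
# because fs.defaultFS in core-site.xml points at the namenode.
lines = spark.read.text("/tmp/words.txt")
print(lines.count())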
hadoop-env.sh
This file configures the JDK environment that hadoop depends on at runtime, plus a few JVM parameters. Apart from the JDK path, nothing else needs to be changed. Its contents are:
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Set Hadoop-specific environment variables here.
# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.
# The java implementation to use.
# This line must be set explicitly: export JAVA_HOME
export JAVA_HOME=/usr/local/jdk1.8.0_101
# The jsvc implementation to use. Jsvc is required to run secure datanodes
# that bind to privileged ports to provide authentication of data transfer
# protocol. Jsvc is not required if SASL is configured for authentication of
# data transfer protocol using non-privileged ports.
#export JSVC_HOME=${JSVC_HOME}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}
# Extra Java CLASSPATH elements. Automatically insert capacity-scheduler.
for f in $HADOOP_HOME/contrib/capacity-scheduler/*.jar; do
if [ "$HADOOP_CLASSPATH" ]; then
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
else
export HADOOP_CLASSPATH=$f
fi
done
# The maximum amount of heap to use, in MB. Default is 1000.
#export HADOOP_HEAPSIZE=
#export HADOOP_NAMENODE_INIT_HEAPSIZE=""
# Extra Java runtime options. Empty by default.
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"
# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_NAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS $HADOOP_DATANODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_SECONDARYNAMENODE_OPTS"
export HADOOP_NFS3_OPTS="$HADOOP_NFS3_OPTS"
export HADOOP_PORTMAP_OPTS="-Xmx512m $HADOOP_PORTMAP_OPTS"
# The following applies to multiple commands (fs, dfs, fsck, distcp etc)
export HADOOP_CLIENT_OPTS="-Xmx512m $HADOOP_CLIENT_OPTS"
#HADOOP_JAVA_PLATFORM_OPTS="-XX:-UsePerfData $HADOOP_JAVA_PLATFORM_OPTS"
# On secure datanodes, user to run the datanode as after dropping privileges.
# This **MUST** be uncommented to enable secure HDFS if using privileged ports
# to provide authentication of data transfer protocol. This **MUST NOT** be
# defined if SASL is configured for authentication of data transfer protocol
# using non-privileged ports.
export HADOOP_SECURE_DN_USER=${HADOOP_SECURE_DN_USER}
# Where log files are stored. $HADOOP_HOME/logs by default.
#export HADOOP_LOG_DIR=${HADOOP_LOG_DIR}/$USER
# Where log files are stored in the secure data environment.
export HADOOP_SECURE_DN_LOG_DIR=${HADOOP_LOG_DIR}/${HADOOP_HDFS_USER}
###
# HDFS Mover specific parameters
###
# Specify the JVM options to be used when starting the HDFS Mover.
# These options will be appended to the options specified as HADOOP_OPTS
# and therefore may override any similar flags set in HADOOP_OPTS
#
# export HADOOP_MOVER_OPTS=""
###
# Advanced Users Only!
###
# The directory where pid files are stored. /tmp by default.
# NOTE: this should be set to a directory that can only be written to by
# the user that will run the hadoop daemons. Otherwise there is the
# potential for a symlink attack.
export HADOOP_PID_DIR=${HADOOP_PID_DIR}
export HADOOP_SECURE_DN_PID_DIR=${HADOOP_PID_DIR}
# A string representing this instance of hadoop. $USER by default.
export HADOOP_IDENT_STRING=$USER
Next we configure hdfs-site.xml. It mainly sets the storage paths for the namenode and datanode data of the hdfs distributed file system, and the replication factor for data blocks.
<?xml version="1.0"?>
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop2.7/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop2.7/dfs/data</value>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
</property>
</configuration>
The mapred-env.sh and mapred-site.xml files configure the runtime environment parameters and network settings of the mapreduce computation framework. Since mapreduce's compute performance is inferior to spark's, we will not actually use it, so these files need only minimal attention.
mapred-site.xml configuration:
<?xml version="1.0"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<!-- Set the actual master hostname and port -->
<value>hadoop-maste:10020</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>8192</value>
</property>
<property>
<name>yarn.app.mapreduce.am.staging-dir</name>
<value>/stage</value>
</property>
<property>
<name>mapreduce.jobhistory.done-dir</name>
<value>/mr-history/done</value>
</property>
<property>
<name>mapreduce.jobhistory.intermediate-done-dir</name>
<value>/mr-history/tmp</value>
</property>
</configuration>
Yarn is configured through two files, yarn-env.sh and yarn-site.xml. Yarn is hadoop's task scheduling system, and as the file names suggest, the two files cover the yarn runtime environment and the network settings respectively. yarn-env.sh reads the JAVA_HOME environment variable and sets some default jdk parameters, so under normal circumstances we do not need to modify it.
yarn-site.xml configuration:
<?xml version="1.0"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop-maste</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>hadoop-maste:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>hadoop-maste:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>hadoop-maste:8035</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>hadoop-maste:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>hadoop-maste:8088</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>5</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>22528</value>
<description>Memory available on each node, in MB</description>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>4096</value>
<description>Minimum memory a single task can request, default 1024 MB</description>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>16384</value>
<description>Maximum memory a single task can request, default 8192 MB</description>
</property>
</configuration>
Finally there are the master and slaves files. Hadoop is a distributed system with a master-slave structure; which node acts as the master and which nodes act as slaves is decided by these two files.
The master file:
hadoop-maste
That is, the master node runs in the container whose hostname in the network plan is hadoop-maste.
The slaves file:
hadoop-node1
hadoop-node2
That is, the slave nodes are hadoop-node1 and hadoop-node2; these two containers will run the DataNode processes of HDFS and the NodeManager processes started by the YARN resource management system.
7. Spark Configuration Files
The main ones are masters, slaves, spark-defaults.conf and spark-env.sh.