Building a Spark Environment on Docker

1. Image Build Plan

We are going to use Docker to build a Hadoop, Spark, Hive and MySQL cluster. First we build an image with a Dockerfile that copies the relevant software into agreed-upon directories. The configuration files are prepared outside the image beforehand and then copied or moved into the hadoop, spark and hive configuration directories during docker build / docker run. One thing to note: for Spark to read data stored in Hive, the hive-site.xml configuration file must be copied into Spark's conf directory (when Spark reads Hive tables, it relies on hive-site.xml to know how to talk to Hive). In addition, so that MySQL can be reached from the other nodes (MySQL stores Hive's metadata), MySQL's access privileges have to be configured.
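For example, those two steps boil down to something like the following minimal sketch (HIVE_HOME, SPARK_HOME and the root/root credentials are placeholders, not values fixed by this article):

# let Spark SQL find the Hive metastore configuration
cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf/

# allow MySQL (which will hold the Hive metadata) to be reached from other containers
mysql -uroot -proot -e "GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'root' WITH GRANT OPTION; FLUSH PRIVILEGES;"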

If you configure files inside a container, then once the container is removed with docker rm, every change made inside it is lost unless it was committed back to the image with docker commit first.
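For instance (the container and image names here are placeholders):

docker commit hadoop-maste my-spark-image:v2   # persist the container's current state into a new image
docker rm -f hadoop-maste                      # now safe to remove; the changes live on in my-spark-image:v2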

2. Overall Cluster Architecture

There are five nodes in total, i.e. we start five containers. The three containers hadoop-maste, hadoop-node1 and hadoop-node2 host the Hadoop and Spark cluster, the hadoop-hive container hosts Hive, and the hadoop-mysql container hosts the MySQL database.

[Figure: overall architecture of the five-container cluster]

In Spark, support for operating on Hive warehouse tables is enabled by calling enableHiveSupport() on the SparkSession builder. MySQL is used to store Hive's metadata. A Spark DataFrame can of course also write its data into MySQL through its write method.
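As a quick check, something along these lines can be run in spark-shell on the master container once hive-site.xml is in Spark's conf directory (a sketch only: the driver-jar path, database, table and credentials are placeholders):

spark-shell --jars /usr/local/mysql-connector-java-5.1.37-bin.jar <<'EOF'
import org.apache.spark.sql.SparkSession
// enableHiveSupport() lets spark.sql() see Hive warehouse tables;
// in spark-shell, getOrCreate() returns the shell's existing session
val session = SparkSession.builder().appName("hive-demo").enableHiveSupport().getOrCreate()
session.sql("SHOW DATABASES").show()
// write a small DataFrame into MySQL over JDBC
val df = session.sql("SELECT 1 AS id")
df.write.format("jdbc").option("url", "jdbc:mysql://hadoop-mysql:3306/test").option("dbtable", "demo").option("user", "root").option("password", "root").mode("append").save()
EOF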

3. Cluster Network Planning and Subnet Configuration

The network is set up through Docker's networking support. First we create the network: in Docker a subnet is created with docker network create, and here we create the following subnet with this command:

docker network create --subnet=172.16.0.0/16 spark

--subnet specifies the subnet's address range, and the subnet is named spark.

Next, within the spark subnet we just created, we plan the IP address of each container in the cluster. The IP allocation is as follows:

[Figure: IP allocation of the containers in the spark subnet]

Note: all five containers have hostnames beginning with hadoop-*, because we are going to configure SSH login between the containers; without generating id_rsa.pub public keys, inter-container communication can be set up by configuring SSH filtering rules.
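For example, starting two of the containers on this subnet could look like the following (the image name my-spark-image and the concrete IPs are placeholders; the actual allocation is the one shown in the figure above):

docker run -itd --name hadoop-maste --hostname hadoop-maste --net spark --ip 172.16.0.2 my-spark-image /bin/bash
docker run -itd --name hadoop-node1 --hostname hadoop-node1 --net spark --ip 172.16.0.3 my-spark-image /bin/bash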

4. Software Versions

Spark: 2.3.0, the latest version at the time of writing

Hadoop: the stable hadoop-2.7.3

Hive: the latest stable release, hive-2.3.2

Scala: Scala-2.11.8

JDK: jdk-8u101-linux-x64

MySQL: mysql-5.5.45-linux2.6-x86_64

Driver used by Hive and Spark to connect to MySQL: mysql-connector-java-5.1.37-bin.jar

5. Key-less SSH Login Rule Configuration

Rather than generating id_rsa.pub with ssh-keygen -t rsa -P and copying each node's id_rsa.pub into the other nodes' authorized_keys files, we place an ssh_conf file in the .ssh directory; ssh_conf holds the SSH communication rules.

Contents of ssh_conf:

Host localhost
	StrictHostKeyChecking no
	
Host 0.0.0.0
	StrictHostKeyChecking no
	
Host hadoop-*
	StrictHostKeyChecking no
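The OpenSSH client reads its per-user rules from ~/.ssh/config, so one way of activating the rules above inside each container (an assumption about the wiring, not necessarily the author's exact script) is:

mkdir -p /root/.ssh
cp ssh_conf /root/.ssh/config      # or append it to an existing config file
chmod 600 /root/.ssh/config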

6. Hadoop, HDFS and YARN Configuration Files

Hadoop's configuration files live in HADOOP_HOME/etc/hadoop. The important ones are these nine files: core-site.xml, hadoop-env.sh, hdfs-site.xml, mapred-env.sh, mapred-site.xml, yarn-env.sh, yarn-site.xml, master and slaves.

core-site.xml configures Hadoop's default file-system URI, along with the users and user groups allowed to access the file system (the proxy-user settings). core-site.xml is configured as follows:

<?xml version="1.0"?>
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop-maste:9000/</value>
    </property>
	<property>
         <name>hadoop.tmp.dir</name>
         <value>file:/usr/local/hadoop/tmp</value>
    </property>
	<property>
        <name>hadoop.proxyuser.root.hosts</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.root.groups</name>
        <value>*</value>
    </property>
	<property>
        <name>hadoop.proxyuser.oozie.hosts</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.oozie.groups</name>
        <value>*</value>
    </property>
</configuration>
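Because fs.defaultFS points at hdfs://hadoop-maste:9000/, the two commands below address the same file system once HDFS is running (the start-up commands appear later in this section):

hdfs dfs -ls hdfs://hadoop-maste:9000/   # explicit URI
hdfs dfs -ls /                           # bare path, resolved against fs.defaultFS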

hadoop-env.sh configures the JDK environment that Hadoop depends on at run time, as well as a number of JVM parameters. Apart from the JDK path, nothing here needs to be changed. Its contents are as follows:

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Set Hadoop-specific environment variables here.

# The only required environment variable is JAVA_HOME.  All others are
# optional.  When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use.
# This is the one setting that must be changed here: point JAVA_HOME at our JDK
export JAVA_HOME=/usr/local/jdk1.8.0_101

# The jsvc implementation to use. Jsvc is required to run secure datanodes
# that bind to privileged ports to provide authentication of data transfer
# protocol.  Jsvc is not required if SASL is configured for authentication of
# data transfer protocol using non-privileged ports.
#export JSVC_HOME=${JSVC_HOME}

export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}

# Extra Java CLASSPATH elements.  Automatically insert capacity-scheduler.
for f in $HADOOP_HOME/contrib/capacity-scheduler/*.jar; do
  if [ "$HADOOP_CLASSPATH" ]; then
    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
  else
    export HADOOP_CLASSPATH=$f
  fi
done

# The maximum amount of heap to use, in MB. Default is 1000.
#export HADOOP_HEAPSIZE=
#export HADOOP_NAMENODE_INIT_HEAPSIZE=""

# Extra Java runtime options.  Empty by default.
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"

# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_NAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS $HADOOP_DATANODE_OPTS"

export HADOOP_SECONDARYNAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_SECONDARYNAMENODE_OPTS"

export HADOOP_NFS3_OPTS="$HADOOP_NFS3_OPTS"
export HADOOP_PORTMAP_OPTS="-Xmx512m $HADOOP_PORTMAP_OPTS"

# The following applies to multiple commands (fs, dfs, fsck, distcp etc)
export HADOOP_CLIENT_OPTS="-Xmx512m $HADOOP_CLIENT_OPTS"
#HADOOP_JAVA_PLATFORM_OPTS="-XX:-UsePerfData $HADOOP_JAVA_PLATFORM_OPTS"

# On secure datanodes, user to run the datanode as after dropping privileges.
# This **MUST** be uncommented to enable secure HDFS if using privileged ports
# to provide authentication of data transfer protocol.  This **MUST NOT** be
# defined if SASL is configured for authentication of data transfer protocol
# using non-privileged ports.
export HADOOP_SECURE_DN_USER=${HADOOP_SECURE_DN_USER}

# Where log files are stored.  $HADOOP_HOME/logs by default.
#export HADOOP_LOG_DIR=${HADOOP_LOG_DIR}/$USER

# Where log files are stored in the secure data environment.
export HADOOP_SECURE_DN_LOG_DIR=${HADOOP_LOG_DIR}/${HADOOP_HDFS_USER}

###
# HDFS Mover specific parameters
###
# Specify the JVM options to be used when starting the HDFS Mover.
# These options will be appended to the options specified as HADOOP_OPTS
# and therefore may override any similar flags set in HADOOP_OPTS
#
# export HADOOP_MOVER_OPTS=""

###
# Advanced Users Only!
###

# The directory where pid files are stored. /tmp by default.
# NOTE: this should be set to a directory that can only be written to by 
#       the user that will run the hadoop daemons.  Otherwise there is the
#       potential for a symlink attack.
export HADOOP_PID_DIR=${HADOOP_PID_DIR}
export HADOOP_SECURE_DN_PID_DIR=${HADOOP_PID_DIR}

# A string representing this instance of hadoop. $USER by default.
export HADOOP_IDENT_STRING=$USER

Next we configure hdfs-site.xml. It mainly sets the storage paths of the NameNode and DataNode data of the HDFS distributed file system, as well as the replication factor of data blocks.

<?xml version="1.0"?>
<configuration>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop2.7/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop2.7/dfs/data</value>
    </property>
	<property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
	<property>
	    <name>dfs.permissions.enabled</name>
		<value>false</value>
	</property>
 </configuration>
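Once the cluster is up, whether both DataNodes registered and whether the replication factor of 2 is actually applied can be checked with the standard HDFS tools (the file path below is a placeholder):

hdfs dfsadmin -report            # live DataNodes, capacity and per-node usage
hdfs dfs -stat %r /tmp/somefile  # prints the replication factor of an existing file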

mapred-env.sh and mapred-site.xml configure the runtime environment parameters and network settings of the MapReduce computing framework. We will not actually use MapReduce, because its computing performance is inferior to Spark's.

mapred-site.xml configuration:

<?xml version="1.0"?>
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <!-- set this to the actual master hostname and port -->
        <value>hadoop-maste:10020</value>
    </property>
    <property>
        <name>mapreduce.map.memory.mb</name>
        <value>4096</value>
    </property>
    <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>8192</value>
    </property>
	<property>
      <name>yarn.app.mapreduce.am.staging-dir</name>
      <value>/stage</value>
    </property>
    <property>
      <name>mapreduce.jobhistory.done-dir</name>
      <value>/mr-history/done</value>
    </property>
    <property>
      <name>mapreduce.jobhistory.intermediate-done-dir</name>
      <value>/mr-history/tmp</value>
    </property>
</configuration>

YARN is configured through two files, yarn-env.sh and yarn-site.xml. YARN is Hadoop's job scheduling system and, as the file names suggest, the two files configure YARN's runtime environment and its network settings respectively. yarn-env.sh reads the JAVA_HOME environment variable and sets some default JDK parameters, so under normal circumstances it does not need to be modified.

yarn-site.xml configuration:

<?xml version="1.0"?>
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop-maste</value>
    </property>
	<property>
        <name>yarn.resourcemanager.address</name>
        <value>hadoop-maste:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>hadoop-maste:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>hadoop-maste:8035</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>hadoop-maste:8033</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>hadoop-maste:8088</value>
    </property>
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
   </property>
    <property>
       <name>yarn.nodemanager.vmem-pmem-ratio</name>
       <value>5</value>
    </property>
<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>22528</value>
    <description>memory available on each node, in MB</description>
  </property>
  
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>4096</value>
    <description>minimum memory a single task may request; default 1024 MB</description>
  </property>
  
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>16384</value>
    <description>maximum memory a single task may request; default 8192 MB</description>
  </property>
</configuration>
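Once YARN is started, the NodeManager registrations and the memory limits configured above can be checked against the addresses defined here, for example:

yarn node -list -all                    # NodeManagers registered with the ResourceManager at hadoop-maste:8032
curl http://hadoop-maste:8088/cluster   # ResourceManager web UI (yarn.resourcemanager.webapp.address)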

Finally, there are the master and slaves configuration files. Hadoop is a master-slave distributed system, and which node is the master and which nodes are the slaves is determined by these two files.

The master configuration file:

hadoop-maste

That is, the master node runs in the container whose hostname, according to our network plan, is hadoop-maste.

The slaves configuration file:

hadoop-node1
hadoop-node2

That is, the slave nodes are hadoop-node1 and hadoop-node2. These two containers will run the HDFS DataNode processes and the NodeManager processes started by the YARN resource management system.
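With these nine files in place, a typical first start from the hadoop-maste container looks like the following (standard Hadoop 2.7 scripts; the format step is needed only on the very first boot):

hdfs namenode -format            # first boot only: initializes the NameNode metadata
$HADOOP_HOME/sbin/start-dfs.sh   # NameNode on hadoop-maste, DataNodes on hadoop-node1/2
$HADOOP_HOME/sbin/start-yarn.sh  # ResourceManager plus the NodeManagers
jps                              # confirm the expected daemons are running on each node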

7. Spark Configuration Files

The main ones are masters, slaves, spark-defaults.conf and spark-env.sh.
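As a rough sketch only (the concrete values below are assumptions, not this article's final settings), spark-env.sh typically needs at least the JDK, Hadoop's configuration directory so that Spark can find HDFS and YARN, and the standalone master host:

export JAVA_HOME=/usr/local/jdk1.8.0_101              # same JDK as in hadoop-env.sh
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop   # assumed Hadoop install path
export SPARK_MASTER_HOST=hadoop-maste                 # the standalone master runs on the master container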
