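Installing and configuring a Hadoop cluster breaks down into two parts. The first is host environment configuration: installing and configuring the operating system and the software the cluster depends on, which covers operating system installation, JDK installation and configuration, host planning and IP address mapping, and passwordless SSH authentication. The second is basic Hadoop configuration: configuring the cluster's core components, namely HDFS and MapReduce.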
- Operating system: Ubuntu-11.10
- Sun JDK: jdk-6u30-linux-i586.bin
- Hadoop: hadoop-0.22.0.tar.gz
Host Environment Configuration
- JDK installation and configuration
cd ~/installation
chmod +x jdk-6u30-linux-i586.bin
./jdk-6u30-linux-i586.bin
To configure the JDK, edit the environment file ~/.bashrc (vi ~/.bashrc) and append the following lines at the end:
export JAVA_HOME=/home/hadoop/installation/jdk1.6.0_30
export CLASSPATH=$JAVA_HOME/lib/*:$JAVA_HOME/jre/lib/*
export PATH=$PATH:$JAVA_HOME/bin
Finally, apply the configuration and verify it:
. ~/.bashrc
echo $JAVA_HOME
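Printing JAVA_HOME only shows that the variable is set, not that it points at a usable JDK. The sketch below adds that check; the function name check_java_home is hypothetical and not part of the JDK or of Hadoop.

```shell
#!/bin/sh
# check_java_home DIR: succeed only if DIR/bin/java exists and is executable.
# (Hypothetical helper for illustration.)
check_java_home() {
    dir=$1
    if [ -z "$dir" ]; then
        echo "JAVA_HOME is not set" >&2
        return 1
    fi
    if [ ! -x "$dir/bin/java" ]; then
        echo "no executable java under $dir/bin" >&2
        return 1
    fi
    echo "JAVA_HOME looks valid: $dir"
}

# Report on the current environment without aborting the shell.
check_java_home "${JAVA_HOME:-}" || true
```

Running this on each machine after sourcing ~/.bashrc gives a quick pass/fail answer before Hadoop is installed.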
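Repeat the steps above on every machine.
- Host planning and IP address mapping
1. Master node configuration
The cluster has two kinds of nodes: the master node and slave nodes. We configure one master and two slaves. The /etc/hosts file on the master contains:
127.0.0.1 localhost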
192.168.0.190 master
192.168.0.186 slave-01
192.168.0.183 slave-02
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
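Host name of the master, as set in /etc/hostname:
master
2. Slave node slave-01 configuration
The /etc/hosts file on slave-01 contains: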
127.0.0.1 localhost
192.168.0.190 master
192.168.0.186 slave-01
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
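Host name of slave-01, as set in /etc/hostname:
slave-01
3. Slave node slave-02 configuration
The /etc/hosts file on slave-02 contains: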
127.0.0.1 localhost
192.168.0.190 master
192.168.0.183 slave-02
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
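Host name of slave-02, as set in /etc/hostname:
slave-02
4. Summary
Second, it scales well. Because /etc/hosts maps host names to IP addresses, a host name can stay the same even when its IP address changes. Installing Hadoop requires configuring the masters and slaves files. If both files used IP addresses, consider a 200-node cluster whose IPs all change because the company replans its network and reassigns subnets: every one of those entries would have to be edited, which is a lot of work. With host names, the Hadoop configuration needs no change at all, and the host-name-to-IP mapping can be left to the system administrator.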
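The mapping rule used in the three /etc/hosts files above (the master lists every node, each slave lists only the master and itself) can be sketched as a small generator. The variable CLUSTER_HOSTS and the function hosts_entries_for are hypothetical names introduced for illustration; the addresses are the ones from this article.

```shell
#!/bin/sh
# Cluster table taken from the article: one "IP name" pair per line.
CLUSTER_HOSTS='192.168.0.190 master
192.168.0.186 slave-01
192.168.0.183 slave-02'

# hosts_entries_for NODE: print the mappings NODE needs in /etc/hosts.
# The master gets every node; a slave gets only the master and itself.
hosts_entries_for() {
    node=$1
    echo "$CLUSTER_HOSTS" | while read -r ip name; do
        if [ "$node" = master ] || [ "$name" = master ] || [ "$name" = "$node" ]; then
            printf '%s %s\n' "$ip" "$name"
        fi
    done
}

# Entries for slave-01: the master line and its own line.
hosts_entries_for slave-01
```

When the network is renumbered, only CLUSTER_HOSTS changes; the host names used by Hadoop stay the same.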
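- Passwordless SSH authentication
1. Basic configuration
First check whether ssh is installed and running; if not, install it:
sudo apt-get install openssh-server
On the master node, generate a key pair:
ssh-keygen -t rsa
Press Enter at every prompt to generate the public key (~/.ssh/id_rsa.pub) and private key (~/.ssh/id_rsa) files. Then append the public key to the local authorized keys file:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys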
chmod 644 ~/.ssh/authorized_keys
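Verify the configuration with the following command:
ssh master
If you are logged in to master (the local machine) without being asked for a password, the configuration is correct. Next, copy the master's public key to slave-01:
scp ~/.ssh/id_rsa.pub hadoop@slave-01:/home/hadoop/.ssh/id_rsa.pub.master
Because this step transfers data between nodes (master to slave-01), you must enter the login password of the hadoop user on slave-01 before the remote copy can run. Entering a password here is normal; do not confuse it with the passwordless public-key authentication between nodes that is being set up. Note: when distributing the public key, rename the copy of the master's public key at the destination so that it does not overwrite a public key file that already exists and is in use on the slave.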
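On slave-01, append the master's public key to the authorized keys file:
cat ~/.ssh/id_rsa.pub.master >> ~/.ssh/authorized_keys
chmod 644 ~/.ssh/authorized_keys
On master, copy the public key to slave-02 in the same way:
scp ~/.ssh/id_rsa.pub hadoop@slave-02:/home/hadoop/.ssh/id_rsa.pub.master
On slave-02, append it to the authorized keys file:
cat ~/.ssh/id_rsa.pub.master >> ~/.ssh/authorized_keys
chmod 644 ~/.ssh/authorized_keys
Back on master, verify the login to each slave:
ssh slave-01
ssh slave-02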
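If no password is requested, the configuration succeeded.
2. Summary
(1) Why distribute the master's public key to every slave node in the cluster? SSH can authenticate a login in two ways. The first is password authentication: you must know the login user name and password on the remote host before you can log in and perform authorized operations there;
The second is authentication without a password, which relies on keys: when host A accesses host B, if B's authorized keys include A's public key, B can authenticate A with that key and A does not need to enter a password.
In a Hadoop cluster the master distributes its public key to the slaves, and each slave is configured for public-key authentication. Once the master can log in to the slaves over ssh, it can perform authorized operations on them. For example, to stop the whole HDFS cluster (namenode and datanodes), the master logs in to each slave directly and shuts down its datanode, which brings down the whole cluster. You do not have to log in to every datanode yourself and run the shutdown there.
Both DSA and RSA can be used for authentication; they simply rely on different cryptographic algorithms. See the relevant literature for details on DSA and RSA.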
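With more than a couple of slaves, the per-node key distribution above is easier to run as a loop. The following is a minimal sketch under the article's assumptions (user hadoop, slaves slave-01 and slave-02); DRY_RUN, run and distribute_key are hypothetical names. By default it only prints the commands, so nothing is contacted until DRY_RUN=0 is set.

```shell
#!/bin/sh
# Assumptions: user "hadoop" and the two slaves named in the article.
# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN=${DRY_RUN:-1}
SLAVES="slave-01 slave-02"

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

distribute_key() {
    for host in $SLAVES; do
        # Copy under a distinct name so the slave's own id_rsa.pub survives.
        run scp "$HOME/.ssh/id_rsa.pub" "hadoop@$host:/home/hadoop/.ssh/id_rsa.pub.master"
        # Append on the slave; this login still asks for the password.
        run ssh "hadoop@$host" 'cat ~/.ssh/id_rsa.pub.master >> ~/.ssh/authorized_keys && chmod 644 ~/.ssh/authorized_keys'
        # BatchMode makes ssh fail instead of prompting, so success here
        # means key-based login works.
        run ssh -o BatchMode=yes "hadoop@$host" true
    done
}

distribute_key
```

OpenSSH also ships ssh-copy-id, which performs the copy-and-append step in a single command; the loop above keeps the explicit steps so they match the manual procedure.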
Hadoop Cluster Configuration
- Basic Hadoop cluster configuration
cd ~/installation
tar -xvzf hadoop-0.22.0.tar.gz
2. Configure Hadoop environment variables
export HADOOP_HOME=/home/hadoop/installation/hadoop-0.22.0
export PATH=$PATH:$HADOOP_HOME/bin
. ~/.bashrc
4. Configure the masters and slaves files
Edit conf/masters so that it contains:
master
Edit conf/slaves so that it contains:
slave-01
slave-02
5. Configure hadoop-env.sh
export JAVA_HOME=/home/hadoop/installation/jdk1.6.0_30
Other options can be set as needed.
6. Configure conf/core-site.xml
The contents of conf/core-site.xml are as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000/</value>
<description></description>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>fs.inmemory.size.mb</name>
<value>10</value>
<description>Larger amount of memory allocated for the in-memory file-system used to merge map-outputs at the reduces.</description>
</property>
<property>
<name>io.sort.factor</name>
<value>10</value>
<description>More streams merged at once while sorting files.</description>
</property>
<property>
<name>io.sort.mb</name>
<value>10</value>
<description>Higher memory-limit while sorting data.</description>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
<description>Size of read/write buffer used in SequenceFiles.</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/storage/tmp/hadoop-${user.name}</value>
<description></description>
</property>
</configuration>
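The settings above concern basic HDFS properties; configuration that stays fixed while the system runs generally belongs here. Settings that need to change with the application can go in hdfs-site.xml, explained below.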
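After editing a configuration file it helps to confirm what a given property is actually set to. The sketch below reads one property with awk; get_property is a hypothetical helper, and it assumes the one-tag-per-line layout used in this article rather than parsing XML in general.

```shell
#!/bin/sh
# get_property FILE NAME: print the <value> that follows <name>NAME</name>.
# Assumes <name> and <value> each sit on their own line, as in the article.
get_property() {
    awk -v want="$2" '
        /<name>/ {
            n = $0
            sub(/.*<name>/, "", n); sub(/<\/name>.*/, "", n)
            found = (n == want)
        }
        found && /<value>/ {
            v = $0
            sub(/.*<value>/, "", v); sub(/<\/value>.*/, "", v)
            print v
            exit
        }
    ' "$1"
}

# Print the default file system URI if the file is present.
conf="${HADOOP_HOME:-.}/conf/core-site.xml"
if [ -f "$conf" ]; then
    get_property "$conf" fs.default.name
fi
```

The same helper works for hdfs-site.xml and mapred-site.xml as long as they keep this layout.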
7. Configure conf/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/home/hadoop/storage/name/a,/home/hadoop/storage/name/b</value>
<description>Path on the local filesystem where the NameNode stores the namespace and transactions logs persistently.</description>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/hadoop/storage/data/a,/home/hadoop/storage/data/b,/home/hadoop/storage/data/c</value>
<description>Comma separated list of paths on the local filesystem of a DataNode where it should store its blocks.</description>
</property>
<property>
<name>dfs.block.size</name>
<value>67108864</value>
<description>HDFS blocksize of 64MB for large file-systems.</description>
</property>
<property>
<name>dfs.namenode.handler.count</name>
<value>10</value>
<description>More NameNode server threads to handle RPCs from large number of DataNodes.</description>
</property>
</configuration>
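dfs.name.dir and dfs.data.dir take comma-separated directory lists, and dfs.block.size is given in bytes. For anyone who prefers to create the directories up front, the sketch below splits such a list and creates each entry; make_dirs is a hypothetical helper. It also shows where the block size value comes from.

```shell
#!/bin/sh
# make_dirs LIST: create every directory in a comma-separated LIST,
# the format used by dfs.name.dir and dfs.data.dir.
make_dirs() {
    old_ifs=$IFS
    IFS=,
    for d in $1; do
        mkdir -p "$d"
    done
    IFS=$old_ifs
}

# 64 MB expressed in bytes, the value given to dfs.block.size.
echo $((64 * 1024 * 1024))   # prints 67108864
```

For example, make_dirs "/home/hadoop/storage/name/a,/home/hadoop/storage/name/b" would create both name directories from the configuration above.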
8. Configure conf/mapred-site.xml
conf/mapred-site.xml holds the settings for MapReduce computation; in practice, adjust individual parameters such as the JVM heap size as needed. Its contents are as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>hdfs://master:19830/</value>
<description>Host or IP and port of JobTracker.</description>
</property>
<property>
<name>mapred.system.dir</name>
<value>/home/hadoop/storage/mapred/system</value>
<description>Path on the HDFS where the MapReduce framework stores system files. Note: This is in the default filesystem (HDFS) and must be accessible from both the server and client machines.</description>
</property>
<property>
<name>mapred.local.dir</name>
<value>/home/hadoop/storage/mapred/local</value>
<description>Comma-separated list of paths on the local filesystem where temporary MapReduce data is written. Note: Multiple paths help spread disk i/o.</description>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>10</value>
<description>The maximum number of Map tasks, which are run simultaneously on a given TaskTracker, individually.Note: Defaults to 2 maps, but vary it depending on your hardware.</description>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>2</value>
<description>The maximum number of Reduce tasks, which are run simultaneously on a given TaskTracker, individually. Note: Defaults to 2 reduces, but vary it depending on your hardware.</description>
</property>
<property>
<name>mapred.reduce.parallel.copies</name>
<value>5</value>
<description>Higher number of parallel copies run by reduces to fetch outputs from very large number of maps.</description>
</property>
<property>
<name>mapred.map.child.java.opts</name>
<value>-Xmx128M</value>
<description>Larger heap-size for child jvms of maps.</description>
</property>
<property>
<name>mapred.reduce.child.java.opts</name>
<value>-Xms64M</value>
<description>Larger heap-size for child jvms of reduces.</description>
</property>
<property>
<name>tasktracker.http.threads</name>
<value>5</value>
<description>More worker threads for the TaskTracker's http server. The http server is used by reduces to fetch intermediate map-outputs.</description>
</property>
<property>
<name>mapred.queue.names</name>
<value>default</value>
<description>Comma separated list of queues to which jobs can be submitted. Note: The MapReduce system always supports at least one queue with the name as default. Hence, this parameter's value should always contain the string default. Some job schedulers supported in Hadoop, like the Capacity Scheduler(http://hadoop.apache.org/common/docs/stable/capacity_scheduler.html), support multiple queues. If such a scheduler is being used, the list of configured queue names must be specified here. Once queues are defined, users can submit jobs to a queue using the property name mapred.job.queue.name in the job configuration. There could be a separate configuration file for configuring properties of these queues that is managed by the scheduler. Refer to the documentation of the scheduler for information on the same.</description>
</property>
<property>
<name>mapred.acls.enabled</name>
<value>false</value>
<description>Boolean, specifying whether checks for queue ACLs and job ACLs are to be done for authorizing users for doing queue operations and job operations. Note: If true, queue ACLs are checked while submitting and administering jobs and job ACLs are checked for authorizing view and modification of jobs. Queue ACLs are specified using the configuration parameters of the form mapred.queue.queue-name.acl-name, defined below under mapred-queue-acls.xml. Job ACLs are described at Job Authorization(http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html#Job+Authorization).</description>
</property>
</configuration>
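9. Distribute the installation to remote nodes
So far the base environment is configured on every machine and Hadoop is configured on master. Hadoop now has to be copied to the corresponding directory on each slave. The Hadoop distribution contains a lot of source code and documentation that takes up considerable space; it can be deleted before the remote copy, after backing it up first. For now I removed the following directories: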
rm -rf common/ hdfs/ contrib/ c++/ mapreduce/
Then run the remote copy commands:
cd ~/installation
scp -r ~/installation/hadoop-0.22.0/ hadoop@slave-01:/home/hadoop/installation
scp -r ~/installation/hadoop-0.22.0/ hadoop@slave-02:/home/hadoop/installation
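On a larger cluster, the copy commands above can be generated from the slaves file instead of being typed per node. This is a minimal sketch; list_slaves and copy_cmds are hypothetical helpers, and the commands are printed rather than executed so they can be reviewed first.

```shell
#!/bin/sh
# list_slaves FILE: print host names from a slaves file, skipping blank
# lines and lines that start with '#'.
list_slaves() {
    grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$1"
}

# copy_cmds FILE: print one scp command per slave (printed, not executed).
copy_cmds() {
    list_slaves "$1" | while read -r host; do
        echo "scp -r ~/installation/hadoop-0.22.0/ hadoop@$host:/home/hadoop/installation"
    done
}
```

Running copy_cmds "$HADOOP_HOME/conf/slaves" shows the commands; piping that output to sh executes them.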
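10. Other configuration
The configuration above uses the storage directory, which must be created beforehand on master and on the slaves with the following command:
mkdir ~/storage
- Verifying the Hadoop cluster configuration
1. Start the HDFS cluster
Format the HDFS file system first: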
hdfs namenode -format
If there are no errors, continue and start HDFS with:
start-dfs.sh
At this point, running jps on master shows that NameNode and SecondaryNameNode have started;
running jps on the slaves shows that DataNode has started.
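The jps check above can be scripted so that a missing daemon is reported explicitly. In this sketch, has_daemons is a hypothetical helper that reads jps output on standard input and compares the process-name column against the expected names.

```shell
#!/bin/sh
# has_daemons "NAME ...": read jps output on stdin and succeed only if
# every named daemon appears as a process name (second column).
has_daemons() {
    out=$(cat)
    for name in $1; do
        if ! echo "$out" | awk -v n="$name" '$2 == n { ok = 1 } END { exit !ok }'; then
            echo "missing: $name" >&2
            return 1
        fi
    done
}
```

On master this would be used as jps | has_daemons "NameNode SecondaryNameNode", and on a slave as jps | has_daemons "DataNode". The match is on the whole name, so SecondaryNameNode does not count as NameNode.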
tail -500f $HADOOP_HOME/logs/hadoop-hadoop-namenode-master.log
tail -500f $HADOOP_HOME/logs/hadoop-hadoop-secondarynamenode-master.log
tail -500f $HADOOP_HOME/logs/hadoop-hadoop-datanode-slave-01.log
tail -500f $HADOOP_HOME/logs/hadoop-hadoop-datanode-slave-02.log
master node: http://master:50070 or http://192.168.0.190:50070
slave-01 node: http://slave-01:50075 or http://192.168.0.186:50075
slave-02 node: http://slave-02:50075 or http://192.168.0.183:50075
start-mapred.sh
At this point, the logs can be followed with:
tail -500f $HADOOP_HOME/logs/hadoop-hadoop-jobtracker-master.log
tail -500f $HADOOP_HOME/logs/hadoop-hadoop-tasktracker-slave-01.log
tail -500f $HADOOP_HOME/logs/hadoop-hadoop-tasktracker-slave-02.log
The built-in Hadoop web server (Jetty) also allows monitoring from a browser:
master node: http://master:50030 or http://192.168.0.190:50030
slave-01 node: http://slave-01:50060 or http://192.168.0.186:50060
slave-02 node: http://slave-02:50060 or http://192.168.0.183:50060
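The monitoring pages can also be probed from the command line. The sketch below lists the URLs given in this article and reports which ones answer; ui_urls and check_uis are hypothetical helpers, and check_uis assumes that curl is installed and the cluster is running.

```shell
#!/bin/sh
# ui_urls: print the monitoring URLs listed in the article, one per line.
ui_urls() {
    echo "http://master:50070"     # NameNode
    echo "http://master:50030"     # JobTracker
    for host in slave-01 slave-02; do
        echo "http://$host:50075"  # DataNode
        echo "http://$host:50060"  # TaskTracker
    done
}

# check_uis: report which URLs answer within five seconds.
check_uis() {
    ui_urls | while read -r url; do
        if curl -s -o /dev/null --max-time 5 "$url"; then
            echo "up   $url"
        else
            echo "down $url"
        fi
    done
}

ui_urls
```

Calling check_uis from any machine that resolves the host names gives a one-line status per page.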
3. Upload a file to HDFS
For example, upload a file to HDFS with the copyFromLocal command:
hadoop@master:~/installation/hadoop-0.22.0$ hadoop fs -lsr
drwxr-xr-x - hadoop supergroup 0 2011-12-31 11:40 /user/hadoop/storage
hadoop@master:~/installation/hadoop-0.22.0$ hadoop fs -mkdir storage/input
hadoop@master:~/installation/hadoop-0.22.0$ hadoop fs -copyFromLocal ~/storage/files/extfile.txt ./storage/input/attractions.txt
hadoop@master:~/installation/hadoop-0.22.0$ hadoop fs -lsr
drwxr-xr-x - hadoop supergroup 0 2011-12-31 11:41 /user/hadoop/storage
drwxr-xr-x - hadoop supergroup 0 2011-12-31 11:41 /user/hadoop/storage/input
-rw-r--r-- 3 hadoop supergroup 66609 2011-12-31 11:41 /user/hadoop/storage/input/attractions.txt
This uploads extfile.txt to HDFS under the new name attractions.txt.
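One simple way to confirm the upload is to compare the size in the listing with the size of the local file. The helper hdfs_size below is hypothetical; it reads listing output in the format shown above and assumes paths without spaces.

```shell
#!/bin/sh
# hdfs_size PATH: read `hadoop fs -lsr` output on stdin and print the size
# column (field 5) of the entry whose path (last field) equals PATH.
hdfs_size() {
    awk -v p="$1" '$NF == p { print $5 }'
}

# Sample line copied from the listing in this article.
echo '-rw-r--r-- 3 hadoop supergroup 66609 2011-12-31 11:41 /user/hadoop/storage/input/attractions.txt' |
    hdfs_size /user/hadoop/storage/input/attractions.txt   # prints 66609
```

On the cluster, hadoop fs -lsr | hdfs_size /user/hadoop/storage/input/attractions.txt should match the output of wc -c < ~/storage/files/extfile.txt.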
4. Run a MapReduce job
hadoop jar $HADOOP_HOME/hadoop-mapred-examples-0.22.0.jar wordcount ./storage/input/ $HADOOP_HOME/output
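To sanity-check the job's result on a small input, the same counts can be computed locally with standard tools. This sketch assumes the example splits words on whitespace and writes one word, a tab and a count per line, as the stock WordCount does; local_wordcount is a hypothetical helper, not part of Hadoop.

```shell
#!/bin/sh
# local_wordcount: read text on stdin and print "word<TAB>count" per line,
# sorted by word.
local_wordcount() {
    tr -s '[:space:]' '\n' | sed '/^$/d' | LC_ALL=C sort | uniq -c |
        awk '{ print $2 "\t" $1 }'
}

# Small example: "a" occurs three times and "b" twice.
printf 'b a\na b a\n' | local_wordcount
```

For the uploaded file, hadoop fs -cat ./storage/input/attractions.txt | local_wordcount produces a list that can be compared with the job's output files.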
5. Summary
192.168.0.190 master
192.168.0.186 slave-01
192.168.0.183 slave-02
After saving, you can access the Hadoop cluster nodes by host name and see basic information about each node.