Pseudo-Distributed Hadoop Configuration on CentOS, with a Python Test

Install and configure Java

$ yum -y install java-1.7.0-openjdk*
$ ls -lrt /usr/bin/java
# lrwxrwxrwx 1 root root 22 Apr 29 13:47 /usr/bin/java -> /etc/alternatives/java
$ ls -lrt /etc/alternatives/java
# lrwxrwxrwx 1 root root 76 Apr 29 13:47 /etc/alternatives/java -> /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.221-2.6.18.0.el7_6.x86_64/jre/bin/java
$ echo 'export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.221-2.6.18.0.el7_6.x86_64/' >> /etc/bashrc
$ echo 'export JRE_HOME=$JAVA_HOME/jre' >> /etc/bashrc
$ echo 'export CLASSPATH=$JAVA_HOME/lib:$JRE_HOME/lib:$CLASSPATH' >> /etc/bashrc
$ echo 'export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$PATH' >> /etc/bashrc
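
A quick way to confirm the exports resolved correctly is a short Python check (a minimal sketch, assuming the lines above have already been sourced into the current shell):

#!/usr/bin/env python3
# sanity check: java must be on PATH and JAVA_HOME must be set
import os
import subprocess

# 'java -version' prints its report on stderr, not stdout
r = subprocess.run(["java", "-version"],
                   stderr=subprocess.PIPE, universal_newlines=True)
print(r.stderr.strip())
print("JAVA_HOME =", os.environ.get("JAVA_HOME"))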

Set up passwordless SSH

$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 600 ~/.ssh/authorized_keys
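
Hadoop's start scripts log in to localhost over SSH, so the key must work without a prompt. A small check (a sketch; BatchMode makes ssh fail instead of asking for a password):

#!/usr/bin/env python3
# verify passwordless SSH to localhost before starting Hadoop daemons
import subprocess

# BatchMode=yes: never prompt; a non-zero exit means key auth failed
rc = subprocess.run(["ssh", "-o", "BatchMode=yes", "localhost", "true"]).returncode
print("passwordless ssh OK" if rc == 0 else "passwordless ssh FAILED")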

Install Hadoop

$ mkdir ~/download
$ wget -P ~/download/ http://mirror.bit.edu.cn/apache/hadoop/common/hadoop-2.6.5/hadoop-2.6.5.tar.gz
$ tar zxf ~/download/hadoop-2.6.5.tar.gz -C /opt/
$ echo 'export HADOOP_HOME=/opt/hadoop-2.6.5' >> /etc/bashrc
$ echo 'export STREAM=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar' >> /etc/bashrc
$ source /etc/bashrc
$ cd $HADOOP_HOME
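
The $STREAM export leaves the streaming jar as a shell glob, which only expands when it is used; it is worth confirming that the pattern matches exactly one file (a sketch, assuming $HADOOP_HOME is set as above):

#!/usr/bin/env python3
# resolve the hadoop-streaming jar glob and confirm a single match
import glob
import os

pattern = os.path.join(os.environ["HADOOP_HOME"],
                       "share/hadoop/tools/lib/hadoop-streaming-*.jar")
print(glob.glob(pattern))  # expect exactly one jar, e.g. hadoop-streaming-2.6.5.jar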

Configure Hadoop

  1. Edit core-site.xml
$ vi $HADOOP_HOME/etc/hadoop/core-site.xml

Set the contents to the following (all four *-site.xml files in this section share this property-list format; see the sketch after this list):

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
  2. Edit hdfs-site.xml
$ vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Set the contents to:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
  3. Start HDFS
$ $HADOOP_HOME/bin/hdfs namenode -format
$ $HADOOP_HOME/sbin/start-dfs.sh
# check that the NameNode port is listening
$ netstat -ntpl|grep 9000
  4. Edit mapred-site.xml
# Hadoop 2.x ships only a template for this file, so copy it first
$ cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml
$ vi $HADOOP_HOME/etc/hadoop/mapred-site.xml

Set the contents to:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
  5. Edit yarn-site.xml
$ vi $HADOOP_HOME/etc/hadoop/yarn-site.xml

Set the contents to:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
  6. Start YARN
$ $HADOOP_HOME/sbin/start-yarn.sh
# check that the ResourceManager web UI port is listening
$ netstat -ntpl|grep 8088
  7. Other configuration
$ echo 'alias hadoop=$HADOOP_HOME/bin/hadoop' >> /etc/bashrc
$ echo 'alias hdfs=$HADOOP_HOME/bin/hdfs' >> /etc/bashrc
$ echo 'export HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar' >> /etc/bashrc
$ source /etc/bashrc
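
As referenced above, all four *-site.xml files share the same property-list format; a short sketch that dumps the properties from any of them using only the standard library:

#!/usr/bin/env python3
# print every <name>/<value> pair from a Hadoop *-site.xml file
import os
import xml.etree.ElementTree as ET

conf = os.path.join(os.environ["HADOOP_HOME"], "etc/hadoop/core-site.xml")
# each <property> element holds one <name>/<value> pair
for prop in ET.parse(conf).getroot().findall("property"):
    print(prop.findtext("name"), "=", prop.findtext("value"))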

Install Python 3

$ yum install epel-release
$ yum install python36
$ echo 'alias python=python3' >> /etc/bashrc
$ source /etc/bashrc

Write the Python programs

  1. Write mapper.py
#!/usr/bin/env python3
# mapper: emit one "word 1" line per alphabetic run found on stdin
import sys
import re

for line in sys.stdin:
    for word in line.strip().split():
        # keep lowercase alphabetic runs only, e.g. "it's" -> "it", "s"
        for w in re.findall("[a-z]+", word.lower()):
            print(w, 1)
  2. Write reducer.py (it assumes its input arrives sorted by key; see the simulation sketch after this list)
#!/usr/bin/env python3
# reducer: input is sorted by key, so equal words arrive on adjacent lines
import sys

curr_w, curr_c = None, 0

for line in sys.stdin:
    word, cnt = line.strip().split()
    if curr_w == word:
        curr_c += int(cnt)
    else:
        # key changed: flush the previous word's total
        if curr_w is not None:
            print(curr_w, curr_c)
        curr_w, curr_c = word, int(cnt)

# flush the final word (skipped when stdin was empty)
if curr_w is not None:
    print(curr_w, curr_c)
  3. Make the scripts executable
$ chmod +x mapper.py
$ chmod +x reducer.py
  4. Test locally
    First get three English articles; here three pieces were copied from ChinaDaily and saved as p1.txt, p2.txt, and p3.txt. Make sure the files are UTF-8 encoded.
$ cat p1.txt | ./mapper.py | sort | ./reducer.py | more
# a 11
# absorb 1
# according 1
# activated 1
# active 1
# activities 1
# added 1
# adopted 1
# after 5
# airport 3
# --More--
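
The reducer works only because `sort` (and, on the cluster, Hadoop's shuffle phase) delivers equal words on adjacent lines. A minimal in-process simulation of that contract (pure Python; the names map_words and reduce_counts are illustrative, not part of the scripts above):

#!/usr/bin/env python3
# simulate mapper | sort | reducer in one process
import re

def map_words(lines):
    # same logic as mapper.py: one (word, 1) pair per alphabetic run
    for line in lines:
        for w in re.findall("[a-z]+", line.lower()):
            yield (w, 1)

def reduce_counts(pairs):
    # same logic as reducer.py: pairs must arrive sorted by key
    curr_w, curr_c = None, 0
    for w, c in pairs:
        if w == curr_w:
            curr_c += c
        else:
            if curr_w is not None:
                yield (curr_w, curr_c)
            curr_w, curr_c = w, c
    if curr_w is not None:
        yield (curr_w, curr_c)

text = ["Hello world", "hello Hadoop world"]
# sorted() stands in for Hadoop's shuffle/sort phase
for word, count in reduce_counts(sorted(map_words(text))):
    print(word, count)  # -> hadoop 1, hello 2, world 2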

Test on Hadoop

  1. Stage the input files in HDFS
$ hdfs dfs -mkdir -p /user/`whoami`/input
$ hdfs dfs -put ~/p*.txt /user/`whoami`/input
  2. Write run.sh
$HADOOP_HOME/bin/hadoop jar $STREAM \
-files ./mapper.py,./reducer.py \
-mapper ./mapper.py \
-reducer ./reducer.py \
-input /user/`whoami`/input/p*.txt \
-output /user/`whoami`/output
  3. Run run.sh
$ chmod +x run.sh
$ ./run.sh
# job progress is printed as it runs
  4. View the results (see also the sketch below)
$ hdfs dfs -cat output/part-00000
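
To post-process the job output without leaving Python, the part file can be read back through the hdfs CLI; a sketch, assuming $HADOOP_HOME is set as above and the job has completed:

#!/usr/bin/env python3
# read the job output via the hdfs CLI and print the 10 most frequent words
import os
import subprocess

hdfs = os.path.join(os.environ["HADOOP_HOME"], "bin", "hdfs")
# relative HDFS paths resolve under /user/<name>, same as on the command line
r = subprocess.run([hdfs, "dfs", "-cat", "output/part-00000"],
                   stdout=subprocess.PIPE, universal_newlines=True, check=True)
pairs = (line.split() for line in r.stdout.splitlines() if line.strip())
top = sorted(((int(c), w) for w, c in pairs), reverse=True)[:10]
for count, word in top:
    print(word, count)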


Troubleshooting

  1. [python] Implementing a Hadoop MapReduce program in Python: computing the mean and variance of a data set
  2. Fix for the "could only be replicated to 0 nodes, instead of 1" error