This whole article is a personal record; do not treat it as a reference.
Environment setup reference: http://www.ityouknow.com/hadoop/2017/07/24/hadoop-cluster-setup.html
Word-count code reference: https://blog.csdn.net/a60782885/article/details/71308256
1. Environment setup
Three virtual machines play the leading roles here:
master:192.168.21.130
slave1:192.168.21.131
slave2:192.168.21.132
1.1. First, install the virtual machines. The physical host runs Windows 10; the VMs run CentOS 7 with a minimal install. After installation, the network interface may need to be enabled: edit /etc/sysconfig/network-scripts/ifcfg-xxxx (mine is ifcfg-ens33) and change ONBOOT=no to ONBOOT=yes so the machine can get online.
1.2. After installing the three VMs, rename the hosts to master, slave1, and slave2 respectively. Edit /etc/sysconfig/network and add HOSTNAME=master on the master machine, and HOSTNAME=slave1 / HOSTNAME=slave2 on the other two. (On CentOS 7, hostnamectl set-hostname <name> also works and takes effect immediately.)
1.3. Edit /etc/hosts on all three machines and add the following entries (use your own machines' IPs):
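Based on the three IPs listed above, the entries to add would be:

```
192.168.21.130 master
192.168.21.131 slave1
192.168.21.132 slave2
```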
1.4. Software installation, starting with the JDK.
wget http://download.oracle.com/otn-pub/java/jdk/8u161-b12/2f38c3b165be4555a1fa6e98c45e0808/jdk-8u161-linux-x64.tar.gz
tar -zxvf jdk-8u161-linux-x64.tar.gz
mv jdk1.8.0_161 jdk180161   # the tarball extracts to jdk1.8.0_161
Edit the environment variables (append to /etc/profile):
export JAVA_HOME=/root/jdk180161
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
source /etc/profile
1.5. Passwordless SSH login
The idea behind passwordless login:
Machine A wants to log into machine B without a password.
First, generate a key pair on machine A:
ssh-keygen -t rsa
Then append the public key to B's authorized_keys, and that is it.
The example below sets up master logging into slave1 without a password.
① Log into master and run ssh-keygen -t rsa; you can just press Enter through the prompts.
② Log into slave1 and run scp root@master:~/.ssh/id_rsa.pub /root/
③ On slave1, run cat /root/id_rsa.pub >> ~/.ssh/authorized_keys.
(If it fails, also run chmod 600 ~/.ssh/authorized_keys.)
④ On master, test with ssh slave1; if you can log in, it worked.
Then set up passwordless login between all three machines in the same way, including from each machine to itself (e.g. ssh master run on master should log in).
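As a side note, steps ① through ④ can be condensed with ssh-copy-id, where it is available; run on each of the three machines:

```
ssh-keygen -t rsa
ssh-copy-id root@master
ssh-copy-id root@slave1
ssh-copy-id root@slave2
```

ssh-copy-id appends the public key to the remote authorized_keys and fixes its permissions, replacing the manual scp/cat/chmod steps.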
1.6 Hadoop configuration
Run the following on all three machines.
wget http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.7.5/hadoop-2.7.5.tar.gz
(Note: closer.cgi is a mirror-selection page, so wget on this URL fetches HTML rather than the tarball; download from the mirror it suggests, or from https://archive.apache.org/dist/hadoop/common/hadoop-2.7.5/hadoop-2.7.5.tar.gz.)
tar -zxvf hadoop-2.7.5.tar.gz
Edit the environment variables (again in /etc/profile):
export HADOOP_HOME=/root/hadoop-2.7.5
export PATH=$PATH:$HADOOP_HOME/bin
Next, edit Hadoop's configuration files, found in etc/hadoop under the Hadoop installation directory.
Four files need to be modified:
① core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/root/hadoop-2.7.5/tmp</value><!-- change to the tmp directory under your own Hadoop install -->
<description>Abase for other temporary directories.</description>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value><!-- "master" must match the master node's hostname; change it if yours differs -->
</property>
</configuration>
② hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value><!-- set according to your cluster size -->
</property>
<property>
<name>dfs.name.dir</name>
<value>/root/hadoop-2.7.5/hdfs/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/root/hadoop-2.7.5/hdfs/data</value>
</property>
</configuration>
③ Copy mapred-site.xml.template to mapred-site.xml and edit it:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>http://master:9001</value>
</property>
</configuration>
④ yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>
</configuration>
Next come two fairly important changes, on all three machines:
vi /root/hadoop-2.7.5/etc/hadoop/masters
Add: master
On the master host only:
vi /root/hadoop-2.7.5/etc/hadoop/slaves
## add
slave1
slave2
1.7 Starting Hadoop
1.7.1 Format the HDFS filesystem
bin/hadoop namenode -format   (run from the Hadoop directory)
1.7.2 Start Hadoop
sbin/start-all.sh
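If startup succeeded, running jps on each node should show the Hadoop daemons. Roughly, for this layout (masters/slaves files as above), the expected processes are:

```
master:  NameNode, SecondaryNameNode, ResourceManager
slave1:  DataNode, NodeManager
slave2:  DataNode, NodeManager
```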
1.8. Problems you may hit
JAVA_HOME is not set and could not be found
vi /root/hadoop-2.7.5/etc/hadoop/hadoop-env.sh
## configuration entry
export JAVA_HOME=<path to your JDK>
1.9. The word-count program
The program is simple, so it is covered briefly.
Create a Maven quick-start project.
pom.xml (the Hadoop dependencies below are 2.7.1 while the cluster runs 2.7.5; matching them at 2.7.5 would be cleaner):
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>cn.edu.bupt.wcy</groupId>
<artifactId>wordcount</artifactId>
<version>0.0.1-SNAPSHOT</version>
<packaging>jar</packaging>
<name>wordcount</name>
<url>http://maven.apache.org</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.7.1</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>2.7.1</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.7.1</version>
</dependency>
</dependencies>
</project>
Three Java source files: the mapper, the reducer, and the runner (main class).
mapper:
package cn.edu.bupt.wcy.wordcount;

import java.io.IOException;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each call receives one line of input; emit (word, 1) for every space-separated token.
        String[] words = StringUtils.split(value.toString(), " ");
        for (String word : words) {
            context.write(new Text(word), new LongWritable(1));
        }
    }
}
reducer:
package cn.edu.bupt.wcy.wordcount;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all the 1s the mappers emitted for this word.
        long sum = 0;
        for (LongWritable num : values) {
            sum += num.get();
        }
        context.write(key, new LongWritable(sum));
    }
}
runner:
package cn.edu.bupt.wcy.wordcount;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountRunner {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf); // new Job(conf) is deprecated in Hadoop 2.x
        job.setJarByClass(WordCountRunner.class);
        job.setJobName("wordcount");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        // args[0] is the input path, args[1] the output path; the original
        // args[1]/args[2] indexing skips the first argument passed after the class name.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
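Before deploying, the split-and-sum logic of the mapper and reducer can be sanity-checked in plain Java, without any Hadoop types. This is only an illustrative sketch (the LocalWordCount class is hypothetical, not part of the project):

```java
import java.util.HashMap;
import java.util.Map;

public class LocalWordCount {

    // Same logic as WordCountMapper + WordCountReducer, collapsed:
    // split each line on spaces, then sum a count per word.
    static Map<String, Long> count(String[] lines) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                if (!word.isEmpty()) {          // skip empty tokens
                    counts.merge(word, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = count(new String[] {"hello world", "hello java"});
        System.out.println(counts.get("hello")); // prints 2
        System.out.println(counts.get("world")); // prints 1
    }
}
```

The MapReduce version distributes exactly this: the mapper emits the inner-loop (word, 1) pairs, and the framework groups them by word before the reducer sums.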
Package it into a jar and run it on the cluster.
First create a directory on HDFS:
hdfs dfs -mkdir /input_wordcount
Then put a file of words into it (hdfs dfs -put), for example:
hello world
I like playing basketball
hello java
...
Run the job:
hadoop jar WordCount.jar cn.edu.bupt.wcy.wordcount.WordCountRunner /input_wordcount /output_wordcount
(WordCount.jar is the packaged jar; since WordCountRunner lives in a package, the fully qualified class name is needed.)
When it finishes, check the output:
hdfs dfs -ls /output_wordcount
The result files are there; hdfs dfs -cat /output_wordcount/part-r-00000 shows the contents.
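Assuming the input contained only the three sample lines above, the output (TextOutputFormat, tab-separated, keys in Text byte order, so uppercase sorts first) would be:

```
I	1
basketball	1
hello	2
java	1
like	1
playing	1
world	1
```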