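Part 4: The WordCount Example

Downloading Eclipse

I downloaded Eclipse on Windows and then transferred the file from the physical machine to the virtual machine over FTP.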

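1. Installing Eclipse

1 Extracting the archive

Put the downloaded file in the regular user's ~/ directory on the virtual machine, then extract it: tar -xzvf ...

2 Starting Eclipse

1. Change into the extracted directory and run ./eclipse. This normally starts without errors.
2. Set the workspace. Use the default workspace, or change it to something like eclipseworkspace to keep things organized.
3. On the Welcome screen, try creating a new project. See other articles for the details.
4. Try the classic HelloWorld program.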

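2. Starting the Project

1 Importing the jars

Right-click the project and choose Build Path -> Configure Build Path -> Libraries -> Add External JARs.
Select the jars under ~/hadoop/share/hadoop/yarn/ and the jars in its lib directory.
Select the jars under ~/hadoop/share/hadoop/mapreduce/ and the jars in its lib directory.
Select the jars under ~/hadoop/share/hadoop/hdfs/ and the jars in its lib directory.
Select the jars under ~/hadoop/share/hadoop/common/ and the jars in its lib directory.

After importing, save and close the dialog.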
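To double-check which files the four steps above should pick up, the selection can be sketched in plain Java. This is only an illustration under assumptions: the class name JarListSketch and the method findJars are hypothetical, and the default location ~/hadoop/share/hadoop is taken from the paths listed above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper, not part of Hadoop or Eclipse.
class JarListSketch {
    // The four module directories named in the import steps.
    static final String[] MODULES = {"yarn", "mapreduce", "hdfs", "common"};

    /** Collects the jars directly inside each module directory and inside its lib directory. */
    static List<Path> findJars(Path shareHadoop) throws IOException {
        List<Path> jars = new ArrayList<>();
        for (String module : MODULES) {
            Path dir = shareHadoop.resolve(module);
            Path lib = dir.resolve("lib");
            if (!Files.isDirectory(dir)) {
                continue; // Skip modules that are not present.
            }
            // Depth 2 reaches the files inside dir/lib but nothing deeper.
            try (Stream<Path> walk = Files.walk(dir, 2)) {
                walk.filter(p -> p.getFileName().toString().endsWith(".jar"))
                    .filter(p -> p.getParent().equals(dir) || p.getParent().equals(lib))
                    .sorted()
                    .forEach(jars::add);
            }
        }
        return jars;
    }

    public static void main(String[] args) throws IOException {
        // Assumed install location, matching the paths used in this article.
        Path share = Paths.get(System.getProperty("user.home"), "hadoop", "share", "hadoop");
        for (Path jar : findJars(share)) {
            System.out.println(jar);
        }
    }
}
```

The printed list should correspond to the jars selected in the Add External JARs dialog.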

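2 Creating the classes

Create a package, for example com.xiaoguan.mapreduce, and create a class named WordCount in it.

Then create a WordMapper class:

After filling in the type parameters (KEYIN and so on) as needed, an error appears; see the solution.

Once that is fixed, the class shows no errors.

3 Importing the Hadoop source code. Source link

Search online for how to do this.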

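4 Writing the code

Before coding, prepare the input file. For convenience, this example uses README.txt from the Hadoop root directory.

Create a copy of that file in the ~/ directory: cat README.txt >> ~/wordcount.txt (note that >> appends if ~/wordcount.txt already exists).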

WordCount

package com.xiaoguan.mapreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
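	/**
	 * Configures and submits the word count job.
	 *
	 * @param args args[0] is the input path, args[1] is the output directory
	 * @throws Exception if the job cannot be configured or run
	 */
	public static void main(String[] args) throws Exception {
		if (args.length < 2) {
			System.err.println("Usage: WordCount <input path> <output path>");
			System.exit(2);
		}

		String inputPath = args[0];
		String outputPath = args[1];

		// Create the job; its name is the simple class name "WordCount".
		Job job = Job.getInstance(new Configuration(), WordCount.class.getSimpleName());

		// The mapper and reducer are passed as Class objects and are
		// instantiated by Hadoop through reflection. setJarByClass tells
		// Hadoop which jar to ship: the one that contains this class.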
		job.setJarByClass(WordCount.class);
		job.setMapperClass(WordMapper.class);
		job.setReducerClass(WordReduce.class);
		
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);
		
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		
		// Set the input and output file path.
		FileInputFormat.addInputPath(job, new Path(inputPath));
		FileOutputFormat.setOutputPath(job, new Path(outputPath));
		
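		// Wait for the job to finish and report the result through the
		// exit code: 0 on success, 1 on failure.
		boolean success = job.waitForCompletion(true);
		System.exit(success ? 0 : 1);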
	}

}

WordMapper

package com.xiaoguan.mapreduce;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;


public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
	/** Reusable count of 1; IntWritable is Hadoop's counterpart of Java's Integer. */
	private final IntWritable one = new IntWritable(1);
	
	@Override
	public void map(
			LongWritable key,
			Text value,
			Mapper<LongWritable, Text, Text, IntWritable>.Context context
			) throws IOException, InterruptedException {
		
		// Convert Hadoop's Text type to a plain Java String.
		String line = value.toString();
		
		// Split the line into words on single spaces.
		String[] words = line.split(" ");
		
		// Emit one ("word", 1) pair per word.
		for (String word : words) {
			
			// Skip the empty strings produced by consecutive spaces.
			if (word != null && word.length() > 0) {
				context.write(new Text(word), one);
			}
		}
		
	}

}
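The splitting logic in WordMapper can be tried without Hadoop. The following is a minimal sketch in plain Java, assuming the same split(" ") call and the same empty-string filter as the mapper; the class name TokenizeSketch and the method tokenize are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java imitation of the tokenizing step in WordMapper.map.
class TokenizeSketch {
    /** Splits a line on single spaces and drops empty tokens. */
    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        for (String word : line.split(" ")) {
            // Two spaces in a row produce an empty token, which is skipped here.
            if (word != null && word.length() > 0) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Punctuation stays attached to the word, so "Hadoop," and "Hadoop"
        // come out as two different tokens.
        System.out.println(tokenize("Apache  Hadoop, Hadoop"));
    }
}
```

This also explains why entries such as "Hadoop" and "Hadoop," are counted separately in the job output: only spaces are treated as separators.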

WordReduce

package com.xiaoguan.mapreduce;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
	@Override
	public void reduce(
			Text key,
			Iterable<IntWritable> values,
			Reducer<Text, IntWritable, Text, IntWritable>.Context context
			) throws IOException, InterruptedException {
		// Total number of occurrences of this word.
		int sum = 0;
		
		// Count.
		for (IntWritable count : values) {
			sum += count.get();
		}
		
		// Write the result to the context.
		context.write(key, new IntWritable(sum));
	}
}
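What happens between the mapper and the reducer can be imitated in a few lines of plain Java: the framework groups the ("word", 1) pairs by key, sorts the keys, and the reducer sums each group. The sketch below is not Hadoop code; ReduceSketch and countWords are hypothetical names, and a TreeMap stands in for the sorted grouping.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java imitation of grouping by key followed by the summing reducer.
class ReduceSketch {
    /** Counts each word once per occurrence and keeps the keys sorted. */
    static Map<String, Integer> countWords(List<String> words) {
        // For ASCII text the TreeMap order matches the order of Text keys:
        // uppercase letters sort before lowercase letters.
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : words) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> mapOutput = Arrays.asList("the", "The", "software", "the", "Export");
        for (Map.Entry<String, Integer> entry : countWords(mapOutput).entrySet()) {
            // Same "key<TAB>value" layout as the job's output file.
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```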

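Once the code is written, run it locally first to debug any problems.
Right-click the WordCount class and choose Run As -> Run Configurations, double-click Java Application, and open the Arguments tab. Set two program arguments there: the path of the input file and the output directory.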
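One thing to watch when re-running locally: Hadoop refuses to write into an output directory that already exists, so a second run with the same arguments fails until the old directory is removed. A minimal sketch of a cleanup helper for the local file system follows; OutputDirSketch and deleteIfExists are hypothetical names, and it only touches local paths, not HDFS.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for local test runs only.
class OutputDirSketch {
    /** Deletes a local directory tree if it exists; does nothing otherwise. */
    static void deleteIfExists(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order puts children before their parent directories.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path path : paths) {
            Files.delete(path);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("Usage: OutputDirSketch <local output directory>");
            return;
        }
        // Pass the same directory that is given to WordCount as its second argument.
        deleteIfExists(Paths.get(args[0]));
    }
}
```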

The job runs to completion.

The output folder then contains the following counts:

(BIS),	1
(ECCN)	1
(TSU)	1
(see	1
5D002.C.1,	1
740.13)	1
<http://www.wassenaar.org/>	1
Administration	1
Apache	1
BEFORE	1
BIS	1
Bureau	1
Commerce,	1
Commodity	1
Control	1
Core	1
Department	1
ENC	1
Exception	1
Export	2
For	1
Foundation	1
Government	1
Hadoop	1
Hadoop,	1
Industry	1
Jetty	1
License	1
Number	1
Regulations,	1
SSL	1
Section	1
Security	1
See	1
Software	2
Technology	1
The	4
This	1
U.S.	1
Unrestricted	1
about	1
algorithms.	1
and	6
and/or	1
another	1
any	1
as	1
asymmetric	1
at:	2
both	1
by	1
check	1
classified	1
code	1
code.	1
concerning	1
country	1
country's	1
country,	1
cryptographic	3
currently	1
details	1
distribution	2
eligible	1
encryption	3
exception	1
export	1
following	1
for	3
form	1
from	1
functions	1
has	1
have	1
http://hadoop.apache.org/core/	1
http://wiki.apache.org/hadoop/	1
if	1
import,	2
in	1
included	1
includes	2
information	2
information.	1
is	1
it	1
latest	1
laws,	1
libraries	1
makes	1
manner	1
may	1
more	2
mortbay.org.	1
object	1
of	5
on	2
or	2
our	2
performing	1
permitted.	1
please	2
policies	1
possession,	2
project	1
provides	1
re-export	2
regulations	1
reside	1
restrictions	1
security	1
see	1
software	2
software,	2
software.	2
software:	1
source	1
the	8
this	3
to	2
under	1
use,	2
uses	1
using	2
visit	1
website	1
which	2
wiki,	1
with	1
written	1
you	1
your	1

The program works as expected and runs normally.
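The output is sorted by word, not by count, so the most frequent words are scattered through the list. As a rough sketch, the "word<TAB>count" lines can be ranked in plain Java; TopWordsSketch, top, and count are hypothetical names, and the sample lines are copied from the listing above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Ranks "word<TAB>count" lines such as those in part-r-00000.
class TopWordsSketch {
    /** Extracts the count that follows the last tab of a line. */
    static int count(String line) {
        return Integer.parseInt(line.substring(line.lastIndexOf('\t') + 1).trim());
    }

    /** Returns the n lines with the highest counts, highest first. */
    static List<String> top(List<String> lines, int n) {
        List<String> sorted = new ArrayList<>(lines);
        sorted.sort((a, b) -> Integer.compare(count(b), count(a)));
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    public static void main(String[] args) {
        // A few lines taken from the output listing.
        List<String> lines = Arrays.asList("Export\t2", "The\t4", "and\t6", "of\t5", "the\t8");
        for (String line : top(lines, 3)) {
            System.out.println(line);
        }
    }
}
```

For the sample lines this ranks "the", "and", and "of" first, which matches their counts of 8, 6, and 5 in the listing.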

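3. Testing on the Cluster

With local testing done, the next step is to test on the cluster. Because my computer's specs are low, the cluster has only one Slave node.

First, package the class files into a jar: see the reference.

1. Start the cluster: start-all.sh. If all daemons come up, the cluster has started successfully.

2. Upload the file with the command: hadoop fs -put /home/xiaoguan/wordcount.txt /. The first path is the file on the local machine, and the second is the target directory on the cluster. The command hadoop fs -ls / lists the files in the cluster's root directory.

3. Run the job: hadoop jar wordcount.jar /wordcount.txt /wordcount_output. The first argument is the file whose word frequencies are counted, and the second is the directory for the output. The job starts right away.

The MRAppMaster process is visible on the slave node, which shows that the slave node is executing the job. Because the job is so small, it finishes quickly.

On the Master node you can see that the job has completed successfully.

You can also check the job status in a browser: http://master:18088

Now look at the output:
Use the command hadoop fs -ls / to list the files in the root directory of the file system.
Use the command hadoop fs -cat /wordcount_output/part-r-00000 to view the output.

Congratulations! You have completed it.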

Articles in this series

Part 1: Installing CentOS 6.5
Part 2: Basic Linux Configuration
Part 3: Installing and Deploying Hadoop
Part 4: The WordCount Example