Applying Hadoop MapReduce in Practice, Experiment 2: Sorting Tab-Separated Text

Original

magina507

2018-08-24 23:21

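I. Experiment Topic

Write a MapReduce program that sorts text separated by tabs.

II. Experiment Goal

Traverse the whole text, find the tab-separated sentences, and sort them.

III. Task Analysis

As in the previous experiment, the first step in processing text is to inspect the input file. Because line endings are represented differently across systems, the file needs to be viewed in Linux, as in the screenshot below:

[Screenshot: the input file viewed in Linux]

The file contains 21 sentences in total, separated by tabs. The goal of the experiment is to split these 21 sentences apart and then sort them.

The mapper is therefore easy to write: it reads the file line by line and skips empty lines. The code is as follows (shared by 廣州-Carl, a member of the chat group):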

package com.apress.hadoop.examples.ch2;

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits every non-empty input line as the map output key, so that the
// framework's shuffle phase sorts the lines.
public class SortingMapper extends Mapper<LongWritable, Text, Text, Text> {
	@Override
	public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
		String line = value.toString();
		// Skip empty lines.
		if (!line.isEmpty()) {
			// The value is only a placeholder; the reducer never reads it.
			context.write(new Text(line), new Text("temp"));
		}
	}
}

Each line's text is stored in the map output key, so the reducer only needs to work with the key.
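The mapper above emits each line whole. If a single line happened to hold several tab-separated sentences, an extra splitting step would be needed before emitting. The following is a minimal sketch of that step in plain Java, without any Hadoop dependency; the class name TabSplitSketch and the method splitSentences are hypothetical and are not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;

public class TabSplitSketch {
    // Splits one input line on tab characters and drops empty pieces.
    // Hypothetical helper: the original mapper does not perform this step.
    public static List<String> splitSentences(String line) {
        List<String> sentences = new ArrayList<>();
        for (String piece : line.split("\t")) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                sentences.add(trimmed);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Two consecutive tabs produce an empty piece, which is skipped.
        System.out.println(splitSentences("first sentence\tsecond sentence\t\tthird"));
    }
}
```

Inside a mapper, each element of the returned list would be written as its own key instead of the whole line.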

Because the MapReduce framework sorts the map output by key before handing it to the reducer, the reducer itself is very simple:

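package com.apress.hadoop.examples.ch2;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Numbers the keys in the order they arrive, which is already sorted order.
public class SortingReducer extends Reducer<Text, Text, IntWritable, Text> {
	// Running line number; each reducer instance keeps its own counter.
	private int index = 0;

	@Override
	protected void reduce(Text key, Iterable<Text> values, Context context)
			throws IOException, InterruptedException {
		// reduce() runs once per distinct key, so identical lines appear only once in the output.
		index++;
		context.write(new IntWritable(index), key);
	}
}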

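A custom counter, index, serves as the output key, and the output value is the key received from the mapper, that is, the sentence itself. The keys arrive already sorted, so the reducer only has to number them in order.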
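To make the data flow concrete, here is a rough local simulation of the map, sort, and reduce steps in plain Java, assuming a single reducer. It is only a sketch: LocalSortSketch and run are hypothetical names, a TreeMap stands in for the framework's shuffle, and String ordering matches Hadoop's Text ordering only for plain ASCII input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

public class LocalSortSketch {
    // Simulates map -> shuffle/sort -> reduce for a single reducer.
    public static List<String> run(List<String> lines) {
        // "Map" and "shuffle": group the placeholder values under each key;
        // the TreeMap keeps the keys in sorted order.
        TreeMap<String, List<String>> grouped = new TreeMap<>();
        for (String line : lines) {
            if (!line.isEmpty()) {
                grouped.computeIfAbsent(line, k -> new ArrayList<>()).add("temp");
            }
        }
        // "Reduce": invoked once per distinct key, in sorted key order.
        List<String> output = new ArrayList<>();
        int index = 0;
        for (String key : grouped.keySet()) {
            index++;
            // Key and value are joined by a tab, as in the job's text output.
            output.add(index + "\t" + key);
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("pear", "apple", "", "pear", "Zebra");
        for (String row : run(input)) {
            System.out.println(row);
        }
    }
}
```

The duplicate "pear" collapses into one row because the reduce step runs once per distinct key, which mirrors how SortingReducer treats repeated lines.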

Finally, the driver program:

package com.apress.hadoop.examples.ch2;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

// Configures and submits the sorting job.
// args[0] is the input path, args[1] is the output path.
public class SortingDriver {
	public static void main(String[] args) throws Exception {
		Configuration conf =  new Configuration();
		// On Hadoop 2 and later this constructor is deprecated in favor of
		// Job.getInstance(conf, "sortingdata").
		Job job = new Job(conf, "sortingdata");
		job.setJarByClass(SortingDriver.class);
		job.setMapperClass(SortingMapper.class);
		job.setReducerClass(SortingReducer.class);
		// Map output types: the line as key, a placeholder as value.
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(Text.class);
		// Final output types: the running index as key, the line as value.
		job.setOutputKeyClass(IntWritable.class);
		job.setOutputValueClass(Text.class);
		job.setInputFormatClass(TextInputFormat.class);
		job.setOutputFormatClass(TextOutputFormat.class);
		
		FileInputFormat.addInputPath(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		
		// Block until the job finishes and print progress to the console.
		boolean result = job.waitForCompletion(true);
		
		System.exit(result ? 0 : 1);
	}
}

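IV. Running the Job

1. Turn off Hadoop's safe mode with the command:

hadoop dfsadmin -safemode leave

2. Copy the input file into HDFS with the command:

bin/hadoop dfs -copyFromLocal <local source path> hdfs:/

(On Hadoop 2 and later, both commands still work but are deprecated in favor of hdfs dfsadmin and hdfs dfs.)

3. Start Eclipse, create a Java project, import the required jar files, and write the code.

4. When the code is finished, export it as a jar file.

5. Run the MapReduce job.

6. Check the result.

The output file shows the sentences sorted in A-Z order.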
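A note on what "A-Z order" means here: Hadoop's Text keys are compared byte by byte on their UTF-8 encoding, so digits sort before uppercase letters and uppercase letters sort before lowercase ones. The sketch below reproduces that ordering in plain Java for illustration; ByteOrderSketch, compareUtf8, and sortByBytes are hypothetical names, and the sample strings are made up rather than taken from the experiment's input.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ByteOrderSketch {
    // Lexicographic comparison of the UTF-8 bytes, treating bytes as unsigned.
    public static int compareUtf8(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int diff = (x[i] & 0xff) - (y[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        // A shorter string that is a prefix of the other sorts first.
        return x.length - y.length;
    }

    // Returns a sorted copy and leaves the input list untouched.
    public static List<String> sortByBytes(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        copy.sort(ByteOrderSketch::compareUtf8);
        return copy;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("banana", "Cherry", "apple", "10 items");
        // Prints [10 items, Cherry, apple, banana]
        System.out.println(sortByBytes(sample));
    }
}
```

If a case-insensitive order were wanted, one option would be for the mapper to emit a lowercased copy of the line as the key and carry the original text in the value.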

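V. Summary

The workload keeps growing, and the old habit of fitting a week's worth of work into a single day no longer holds up. My time management slipped this week, and the code in the reference document had problems this time; I only finished the experiment thanks to help from group members. Next week I should start the experiment earlier.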
