System environment
Linux Ubuntu 16.04
jdk-7u75-linux-x64
hadoop-2.6.0-cdh5.4.5
hadoop-2.6.0-eclipse-cdh5.4.5.jar
eclipse-java-juno-SR2-linux-gtk-x86_64
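Background
Some complex tasks cannot be completed in a single MapReduce pass and need several passes. Starting with Hadoop 2.0, MapReduce jobs support chained processing. It works like a factory production line in which every stage has its own task, for example supplying parts -> assembly -> stamping the production date. Dividing the work this way raises production efficiency, and chained MapReduce in Hadoop follows the same idea: the Mappers process the data stage by stage, like water flowing downstream, much as a Linux pipe does. The output of one Mapper is fed directly to the next Mapper as input, forming a pipeline.
Execution rule for chained MapReduce: a Job may contain only one Reducer; one or more Mappers may come before the Reducer, and zero or more Mappers may come after it.
Hadoop 2.0 supports the following three kinds of chained MapReduce jobs:
(1) Sequentially chained MapReduce jobs
This resembles a Unix pipe: mapreduce-1 | mapreduce-2 | mapreduce-3 … A job is created for each stage, and its input path is set to the output path of the previous stage. The intermediate data produced along the chain is deleted at the final stage.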
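The idea of feeding each stage the previous stage's output can be sketched without Hadoop. The following is only an illustration in plain Java (written to compile on the JDK 7 listed above): the class SequentialChainSketch, the Stage interface and the two sample stages are hypothetical names, and local temporary files stand in for HDFS paths.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class SequentialChainSketch {

    /** Hypothetical stand-in for one job of the chain: transforms a list of lines. */
    interface Stage {
        List<String> apply(List<String> lines);
    }

    /** Sample stage that drops blank lines. */
    static final Stage DROP_EMPTY = new Stage() {
        public List<String> apply(List<String> lines) {
            List<String> kept = new ArrayList<String>();
            for (String line : lines) {
                if (!line.trim().isEmpty()) {
                    kept.add(line);
                }
            }
            return kept;
        }
    };

    /** Sample stage that upper-cases every line. */
    static final Stage UPPER_CASE = new Stage() {
        public List<String> apply(List<String> lines) {
            List<String> out = new ArrayList<String>();
            for (String line : lines) {
                out.add(line.toUpperCase(Locale.ROOT));
            }
            return out;
        }
    };

    /** Runs the stages in order; each stage reads the file the previous one wrote. */
    static List<String> runChain(List<String> input, List<Stage> stages) throws IOException {
        Path workDir = Files.createTempDirectory("chain");
        List<Path> written = new ArrayList<Path>();
        Path current = workDir.resolve("input.txt");
        Files.write(current, input, StandardCharsets.UTF_8);
        written.add(current);
        for (int i = 0; i < stages.size(); i++) {
            Path out = workDir.resolve("stage-" + (i + 1) + ".txt");
            List<String> lines = Files.readAllLines(current, StandardCharsets.UTF_8);
            Files.write(out, stages.get(i).apply(lines), StandardCharsets.UTF_8);
            written.add(out);
            current = out; // the next stage's input is this stage's output
        }
        List<String> result = Files.readAllLines(current, StandardCharsets.UTF_8);
        // Clean up every file of the chain once the final result has been read.
        for (Path p : written) {
            Files.delete(p);
        }
        Files.delete(workDir);
        return result;
    }

    /** Small demo that wraps the checked exception for convenience. */
    static List<String> demo() {
        try {
            return runChain(Arrays.asList("a", "", "b"), Arrays.asList(DROP_EMPTY, UPPER_CASE));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [A, B]
    }
}
```

In a real Hadoop driver the same shape would presumably appear as one Job per stage, with each job's input path pointing at the previous job's output directory.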
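(2) MapReduce chains with complex dependencies
Suppose mapreduce-1 processes one data set, mapreduce-2 processes another, and mapreduce-3 performs an inner join on the two results. In this case the dependencies between the non-linear jobs are managed with the Job and JobControl classes (in the newer org.apache.hadoop.mapreduce API the job wrapper class is called ControlledJob). For example, x.addDependingJob(y) means that x will not start until y has completed.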
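The meaning of x.addDependingJob(y) can be illustrated with a small simulation. This is a plain-Java sketch, not the Hadoop JobControl API: SimJob and runAll are hypothetical names, and "running" a job here just records its name.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DependencySketch {

    /** Hypothetical stand-in for a controlled job; not a Hadoop class. */
    static class SimJob {
        final String name;
        final List<SimJob> dependsOn = new ArrayList<SimJob>();

        SimJob(String name) {
            this.name = name;
        }

        /** Same meaning as x.addDependingJob(y): this job waits until y has finished. */
        void addDependingJob(SimJob y) {
            dependsOn.add(y);
        }
    }

    /** Returns the order in which the jobs start; a job starts only after all its dependencies are done. */
    static List<String> runAll(List<SimJob> jobs) {
        Set<SimJob> done = new LinkedHashSet<SimJob>();
        List<String> order = new ArrayList<String>();
        while (done.size() < jobs.size()) {
            boolean progressed = false;
            for (SimJob job : jobs) {
                if (!done.contains(job) && done.containsAll(job.dependsOn)) {
                    order.add(job.name);
                    done.add(job);
                    progressed = true;
                }
            }
            if (!progressed) {
                throw new IllegalStateException("unsatisfiable or cyclic dependency");
            }
        }
        return order;
    }

    static List<String> demo() {
        SimJob job1 = new SimJob("mapreduce-1");
        SimJob job2 = new SimJob("mapreduce-2");
        SimJob job3 = new SimJob("mapreduce-3");
        job3.addDependingJob(job1);
        job3.addDependingJob(job2);
        // job3 is registered first on purpose: it still has to wait for job1 and job2.
        List<SimJob> jobs = new ArrayList<SimJob>();
        jobs.add(job3);
        jobs.add(job1);
        jobs.add(job2);
        return runAll(jobs);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [mapreduce-1, mapreduce-2, mapreduce-3]
    }
}
```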
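(3) Chaining preprocessing and postprocessing steps
Preprocessing and postprocessing steps are usually written as Mapper tasks. You can chain them yourself or use the ChainMapper and ChainReducer classes. The resulting job expression looks like: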
MAP+ | REDUCE | MAP*
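Take the job Map1 | Map2 | Reduce | Map3 | Map4 as an example: Map2 and Reduce form the core of the MapReduce job, Map1 is the preprocessing step, and Map3 and Map4 are postprocessing steps. ChainMapper is used to add the preprocessing Mappers; ChainReducer is used to set the Reducer and to add the postprocessing Mappers.
This lab uses the third pattern, chaining preprocessing and postprocessing steps, with a job expression of the form Map1 | Map2 | Reduce | Map3.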
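The expression MAP+ | REDUCE | MAP* can be read as a regular expression over stage types. As a minimal sketch (ChainPatternSketch and isValidChain are hypothetical helpers, and the rule that stage names start with "Map" or "Reduce" is an assumption made only for this example), the check looks like this:

```java
import java.util.regex.Pattern;

public class ChainPatternSketch {

    // M = a Mapper stage, R = the single Reducer stage: one or more M, one R, zero or more M.
    private static final Pattern CHAIN = Pattern.compile("M+RM*");

    /** Encodes stage names such as "Map1" or "Reduce" as M / R and tests them against the pattern. */
    static boolean isValidChain(String... stages) {
        StringBuilder encoded = new StringBuilder();
        for (String stage : stages) {
            if (stage.startsWith("Map")) {
                encoded.append('M');
            } else if (stage.startsWith("Reduce")) {
                encoded.append('R');
            } else {
                return false; // unknown stage type
            }
        }
        return CHAIN.matcher(encoded).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidChain("Map1", "Map2", "Reduce", "Map3", "Map4")); // true
        System.out.println(isValidChain("Map1", "Map2", "Reduce", "Map3"));         // true
        System.out.println(isValidChain("Reduce", "Map1"));                         // false: no Mapper before the Reducer
        System.out.println(isValidChain("Map1", "Reduce", "Reduce"));               // false: two Reducers
    }
}
```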
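Task
Practice processing a file with ChainMapReduce. Given goods_0, one day of product-view data from an e-commerce site, the job filters out products with more than 600 clicks in the first Mapper, filters out products with 100 to 600 clicks in the second Mapper, groups and sums the clicks per product in the Reducer and emits the totals, and in the Mapper after the Reducer filters out products whose name is 3 or more characters long.
The lab data is as follows:
Table goods_0 contains two fields (product name, clicks), separated by "\t".
Product name Clicks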
襪子 189
毛衣 600
褲子 780
鞋子 30
呢子外套 90
牛仔外套 130
羽絨服 7
帽子 21
帽子 6
羽絨服 12
The result data is as follows:
Product name Clicks
帽子 27.0
鞋子 30.0
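The expected result can be checked by hand against the task description. The following plain-Java sketch is not Hadoop code; ChainSimulation, run and sampleData are hypothetical names. It applies the same four steps to the ten sample rows. The product names are written as Unicode escapes so that the source file stays pure ASCII; the comments give an English gloss for each row.

```java
import java.util.Map;
import java.util.TreeMap;

public class ChainSimulation {

    /** Applies Mapper 1, Mapper 2, the Reducer and Mapper 3 in order; returns name -> total clicks. */
    static Map<String, Double> run(String[][] records) {
        // The TreeMap plays the role of the shuffle: it groups by key and keeps keys sorted.
        Map<String, Double> sums = new TreeMap<String, Double>();
        for (String[] record : records) {
            String name = record[0];
            double clicks = Double.parseDouble(record[1]);
            if (clicks > 600) {
                continue; // Mapper 1: drop products with more than 600 clicks
            }
            if (clicks >= 100) {
                continue; // Mapper 2: drop products with 100 to 600 clicks
            }
            Double old = sums.get(name); // Reducer: sum the clicks per product name
            sums.put(name, old == null ? clicks : old + clicks);
        }
        Map<String, Double> result = new TreeMap<String, Double>();
        for (Map.Entry<String, Double> e : sums.entrySet()) {
            if (e.getKey().length() < 3) { // Mapper 3: keep names shorter than 3 characters
                result.put(e.getKey(), e.getValue());
            }
        }
        return result;
    }

    /** The ten rows of goods_0 from the table above. */
    static String[][] sampleData() {
        return new String[][] {
            {"\u896A\u5B50", "189"},             // socks
            {"\u6BDB\u8863", "600"},             // sweater
            {"\u8932\u5B50", "780"},             // trousers
            {"\u978B\u5B50", "30"},              // shoes
            {"\u5462\u5B50\u5916\u5957", "90"},  // woolen coat
            {"\u725B\u4ED4\u5916\u5957", "130"}, // denim jacket
            {"\u7FBD\u7D68\u670D", "7"},         // down jacket
            {"\u5E3D\u5B50", "21"},              // hat
            {"\u5E3D\u5B50", "6"},               // hat
            {"\u7FBD\u7D68\u670D", "12"},        // down jacket
        };
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Double> e : run(sampleData()).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Only the hat (21 + 6 = 27.0) and the shoes (30.0) survive all three filters, which matches the result table.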
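Task steps
1. Change to the /apps/hadoop/sbin directory and start Hadoop.
cd /apps/hadoop/sbin
./start-all.sh
2. Create the /data/mapreduce10 directory on the local Linux file system.
mkdir -p /data/mapreduce10
3. On Linux, change to the /data/mapreduce10 directory and use wget to download the text file goods_0 from http://192.168.1.100:60000/allfiles/mapreduce10/goods_0.
cd /data/mapreduce10
wget http://192.168.1.100:60000/allfiles/mapreduce10/goods_0
Then, in the same directory, use wget to download the project's dependency package from http://192.168.1.100:60000/allfiles/mapreduce10/hadoop2lib.tar.gz.
wget http://192.168.1.100:60000/allfiles/mapreduce10/hadoop2lib.tar.gz
Extract hadoop2lib.tar.gz into the current directory.
tar zxvf hadoop2lib.tar.gz
4. First create the /mymapreduce10/in directory on HDFS, then upload the goods_0 file from the local /data/mapreduce10 directory into /mymapreduce10/in on HDFS.
hadoop fs -mkdir -p /mymapreduce10/in
hadoop fs -put /data/mapreduce10/goods_0 /mymapreduce10/in
5. Open Eclipse and create a new Java Project named mapreduce10.
Under the mapreduce10 project, create a package named mapreduce, and in that package create a class named ChainMapReduce.
6. Add the jar packages the project depends on. Right-click the project and create a new folder named hadoop2lib to hold the required jars.
Copy the jars from the hadoop2lib directory under /data/mapreduce10 into the hadoop2lib folder of the mapreduce10 project in Eclipse. Select all the jars in the project's hadoop2lib folder, then right-click and choose Build Path => Add to Build Path.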
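7. Write the program code and describe its design.
The overall MapReduce execution flow is shown in the figure below:
As the figure shows, ChainMapReduce executes as follows:
(1) The InputFormat instance splits the data in the text file into several small data sets (InputSplit), and the RecordReader instance parses each InputSplit into <key,value> pairs and passes them to Mapper1.
(2) The map function in Mapper1 splits the input value, uses the product-name field as the key and the click-count field as the value, keeps the <key,value> pairs whose value is less than or equal to 600, and passes them to Mapper2.
(3) The map function in Mapper2 keeps the <key,value> pairs whose value is less than 100 and emits them.
(4) The <key,value> pairs emitted by Mapper2 go through the shuffle, which collects all values with the same key into one list to form <key,value-list>; every <key,value-list> is then passed to the Reducer.
(5) The reduce function in the Reducer sums the elements of the value-list to produce a new value and passes <key,value> to Mapper3.
(6) The map function in Mapper3 keeps the <key,value> pairs whose key is shorter than 3 characters and writes them to HDFS as text.
The Java code for this ChainMapReduce has four main parts: FilterMapper1, FilterMapper2, SumReducer, and FilterMapper3.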
FilterMapper1 code
public static class FilterMapper1 extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    private Text outKey = new Text(); // reusable output key
    private DoubleWritable outValue = new DoubleWritable(); // reusable output value
    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, DoubleWritable>.Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        if (line.length() > 0) {
            String[] splits = line.split("\t"); // split the line into fields on the tab character
            double visit = Double.parseDouble(splits[1].trim());
            if (visit <= 600) { // keep only records with at most 600 clicks
                outKey.set(splits[0]);
                outValue.set(visit);
                context.write(outKey, outValue); // emit <product name, clicks>
            }
        }
    }
}
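First the output key and value types are defined. The map method then reads the line of text, splits it with split("\t"), converts the click-count field to double, and assigns it to visit. If visit is less than or equal to 600, the product-name field is set as the key and visit as the value, and context.write() emits <key,value>.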
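The map method reads splits[1] and calls Double.parseDouble without checking the line first, so a line with no tab, or a header row if the file contained one, would make the task throw. A defensive variant could look like the sketch below. It is only an illustration: LineParserSketch and parseClicks are hypothetical names, and silently skipping bad lines is an assumption about what the job should do.

```java
public class LineParserSketch {

    /**
     * Returns the click count of a "name<TAB>clicks" line, or null when the line
     * is empty, has no second field, or the second field is not a number.
     */
    static Double parseClicks(String line) {
        if (line == null || line.trim().isEmpty()) {
            return null;
        }
        String[] fields = line.split("\t");
        if (fields.length < 2) {
            return null; // no tab-separated second field
        }
        try {
            return Double.valueOf(fields[1].trim());
        } catch (NumberFormatException e) {
            return null; // for example a header row whose second field is text
        }
    }

    public static void main(String[] args) {
        System.out.println(parseClicks("shoes\t30"));    // 30.0
        System.out.println(parseClicks("name\tclicks")); // null
        System.out.println(parseClicks("no-tab-here"));  // null
    }
}
```

Inside a Mapper, a null result would simply mean "do not call context.write for this line".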
FilterMapper2 code
public static class FilterMapper2 extends Mapper<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void map(Text key, DoubleWritable value, Mapper<Text, DoubleWritable, Text, DoubleWritable>.Context context)
            throws IOException, InterruptedException {
        if (value.get() < 100) { // keep only records with fewer than 100 clicks
            context.write(key, value);
        }
    }
}
It receives the data passed on by FilterMapper1, reads the input value with value.get(), and, if the value is less than 100, emits the input key and value unchanged as <key,value>.
SumReducer code
public static class SumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    private DoubleWritable outValue = new DoubleWritable();
    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values, Reducer<Text, DoubleWritable, Text, DoubleWritable>.Context context)
            throws IOException, InterruptedException {
        double sum = 0;
        for (DoubleWritable val : values) { // add up all clicks recorded for this product
            sum += val.get();
        }
        outValue.set(sum);
        context.write(key, outValue);
    }
}
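The <key,value> pairs emitted by FilterMapper2 first go through the shuffle, which collects all values with the same key into one list to form <key,value-list>; every <key,value-list> is then passed to SumReducer. In the reduce function, an enhanced for loop iterates over the elements of the value-list and accumulates them into sum. outValue.set(sum) then wraps sum in a DoubleWritable as the output value, the input key is used as the output key, and context.write() emits <key,value>.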
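What the shuffle hands to the reducer can be pictured with ordinary collections. The sketch below is plain Java, not the Hadoop shuffle; ShuffleSketch, group and sum are hypothetical names, and the ASCII keys "hat" and "shoes" merely stand in for product names. It groups pairs into <key,value-list> and then sums each list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleSketch {

    /** Groups (key, value) pairs into key -> value-list, with keys kept in sorted order. */
    static Map<String, List<Double>> group(String[] keys, double[] values) {
        Map<String, List<Double>> grouped = new TreeMap<String, List<Double>>();
        for (int i = 0; i < keys.length; i++) {
            List<Double> list = grouped.get(keys[i]);
            if (list == null) {
                list = new ArrayList<Double>();
                grouped.put(keys[i], list);
            }
            list.add(values[i]);
        }
        return grouped;
    }

    /** Sums one value-list, like the loop inside SumReducer.reduce. */
    static double sum(List<Double> values) {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        String[] keys = {"hat", "shoes", "hat"}; // illustrative keys only
        double[] values = {21, 30, 6};
        for (Map.Entry<String, List<Double>> e : group(keys, values).entrySet()) {
            // Prints the key, its value-list, and the sum of that list.
            System.out.println(e.getKey() + "\t" + e.getValue() + "\t" + sum(e.getValue()));
        }
    }
}
```

In this sketch the values keep their input order; a real shuffle makes no such promise, which does not matter for a sum.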
FilterMapper3 code
public static class FilterMapper3 extends Mapper<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void map(Text key, DoubleWritable value, Mapper<Text, DoubleWritable, Text, DoubleWritable>.Context context)
            throws IOException, InterruptedException {
        if (key.toString().length() < 3) { // keep only names shorter than 3 characters
            System.out.println("Writing out: " + key.toString() + "++++" + value.toString());
            context.write(key, value);
        }
    }
}
It receives the data passed on by the reducer, obtains the key's length in characters with key.toString().length(), and, if that length is less than 3, emits the input key and value unchanged as <key,value>.
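The filter works on characters, not bytes, which matters for Chinese product names: String.length() counts UTF-16 code units, whereas Hadoop's Text.getLength() reports the length of the UTF-8 encoding in bytes. The sketch below shows the difference using only the JDK; KeyLengthSketch and its two helpers are hypothetical names, and the product names are written as Unicode escapes to keep the source ASCII.

```java
import java.nio.charset.StandardCharsets;

public class KeyLengthSketch {

    /** Length as used by FilterMapper3: number of UTF-16 code units. */
    static int charLength(String key) {
        return key.length();
    }

    /** Length of the UTF-8 encoding in bytes. */
    static int utf8Length(String key) {
        return key.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String hat = "\u5E3D\u5B50";              // 2-character product name (hat)
        String downJacket = "\u7FBD\u7D68\u670D"; // 3-character product name (down jacket)
        System.out.println(charLength(hat) + " chars, " + utf8Length(hat) + " bytes");               // 2 chars, 6 bytes
        System.out.println(charLength(downJacket) + " chars, " + utf8Length(downJacket) + " bytes"); // 3 chars, 9 bytes
    }
}
```

With a byte-based length even the two-character names would be 6 long and nothing would pass the "< 3" test, so converting the key to a String first is what makes the filter behave as described.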
Full code
package mapreduce;
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.DoubleWritable;
public class ChainMapReduce {
    private static final String INPUTPATH = "hdfs://localhost:9000/mymapreduce10/in/goods_0";
    private static final String OUTPUTPATH = "hdfs://localhost:9000/mymapreduce10/out";
    public static void main(String[] args) {
        try {
            Configuration conf = new Configuration();
            // Remove the output directory if it already exists; the job refuses to overwrite it.
            FileSystem fileSystem = FileSystem.get(new URI(OUTPUTPATH), conf);
            if (fileSystem.exists(new Path(OUTPUTPATH))) {
                fileSystem.delete(new Path(OUTPUTPATH), true);
            }
            // Job.getInstance replaces the deprecated Job constructor.
            Job job = Job.getInstance(conf, ChainMapReduce.class.getSimpleName());
            job.setJarByClass(ChainMapReduce.class);
            FileInputFormat.addInputPath(job, new Path(INPUTPATH));
            job.setInputFormatClass(TextInputFormat.class);
            // Build the chain Map1 | Map2 | Reduce | Map3.
            ChainMapper.addMapper(job, FilterMapper1.class, LongWritable.class, Text.class, Text.class, DoubleWritable.class, conf);
            ChainMapper.addMapper(job, FilterMapper2.class, Text.class, DoubleWritable.class, Text.class, DoubleWritable.class, conf);
            ChainReducer.setReducer(job, SumReducer.class, Text.class, DoubleWritable.class, Text.class, DoubleWritable.class, conf);
            ChainReducer.addMapper(job, FilterMapper3.class, Text.class, DoubleWritable.class, Text.class, DoubleWritable.class, conf);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(DoubleWritable.class);
            job.setPartitionerClass(HashPartitioner.class);
            job.setNumReduceTasks(1);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(DoubleWritable.class);
            FileOutputFormat.setOutputPath(job, new Path(OUTPUTPATH));
            job.setOutputFormatClass(TextOutputFormat.class);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        } catch (Exception e) {
            e.printStackTrace();
            System.exit(1); // report failure instead of exiting with status 0
        }
    }
    public static class FilterMapper1 extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        private Text outKey = new Text();
        private DoubleWritable outValue = new DoubleWritable();
        @Override
        protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, DoubleWritable>.Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            if (line.length() > 0) {
                String[] splits = line.split("\t");
                double visit = Double.parseDouble(splits[1].trim());
                if (visit <= 600) {
                    outKey.set(splits[0]);
                    outValue.set(visit);
                    context.write(outKey, outValue);
                }
            }
        }
    }
    public static class FilterMapper2 extends Mapper<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void map(Text key, DoubleWritable value, Mapper<Text, DoubleWritable, Text, DoubleWritable>.Context context)
                throws IOException, InterruptedException {
            if (value.get() < 100) {
                context.write(key, value);
            }
        }
    }
    public static class SumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        private DoubleWritable outValue = new DoubleWritable();
        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Reducer<Text, DoubleWritable, Text, DoubleWritable>.Context context)
                throws IOException, InterruptedException {
            double sum = 0;
            for (DoubleWritable val : values) {
                sum += val.get();
            }
            outValue.set(sum);
            context.write(key, outValue);
        }
    }
    public static class FilterMapper3 extends Mapper<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void map(Text key, DoubleWritable value, Mapper<Text, DoubleWritable, Text, DoubleWritable>.Context context)
                throws IOException, InterruptedException {
            if (key.toString().length() < 3) {
                System.out.println("Writing out: " + key.toString() + "++++" + value.toString());
                context.write(key, value);
            }
        }
    }
}
8. In the ChainMapReduce class file, right-click and choose Run As => Run on Hadoop to submit the MapReduce job to Hadoop.
9. When the job has finished, open a terminal and view the results in /mymapreduce10/out on HDFS.
hadoop fs -ls /mymapreduce10/out
hadoop fs -cat /mymapreduce10/out/part-r-00000