MapReduce
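Map: Each map task can be divided into four phases: record reader, mapper, combiner, and partitioner. The output of a map task is called the intermediate keys and values, and it is sent to the reducers for further processing.
Reduce: Each reduce task can be divided into four phases: shuffle, sort, reducer, and output format.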
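To make these phases concrete, here is a minimal in-memory sketch in plain Java. It is not ODPS or Hadoop code: the class name `PhaseSketch`, its methods, and the word-count task are assumptions chosen for illustration, and the record reader, partitioner, and output format are reduced to simple loops and collections.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: simulates the map and reduce phases in memory.
final class PhaseSketch {

    // Map side for one input split: record reader -> mapper -> combiner.
    // Returns the locally combined (word, count) pairs for that split.
    static Map<String, Long> mapTask(List<String> split) {
        Map<String, Long> combined = new TreeMap<String, Long>();
        for (String line : split) {                    // record reader: one line per record
            for (String word : line.split("\\s+")) {   // mapper: emit (word, 1)
                if (word.isEmpty()) {
                    continue;
                }
                combined.merge(word, 1L, Long::sum);   // combiner: local pre-aggregation
            }
        }
        return combined;
    }

    // Partitioner: decides which reducer receives a given key.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Reduce side: shuffle groups the values by key, sort orders the keys,
    // the reducer sums each group, and the returned maps stand in for the output format.
    static List<Map<String, Long>> run(List<List<String>> splits, int numReducers) {
        // Shuffle and sort: one TreeMap per reducer keeps its keys sorted.
        List<Map<String, List<Long>>> shuffled = new ArrayList<Map<String, List<Long>>>();
        for (int r = 0; r < numReducers; r++) {
            shuffled.add(new TreeMap<String, List<Long>>());
        }
        for (List<String> split : splits) {
            for (Map.Entry<String, Long> e : mapTask(split).entrySet()) {
                int r = partition(e.getKey(), numReducers);
                shuffled.get(r)
                        .computeIfAbsent(e.getKey(), k -> new ArrayList<Long>())
                        .add(e.getValue());
            }
        }
        // Reducer: one call per key, iterating over that key's values.
        List<Map<String, Long>> outputs = new ArrayList<Map<String, Long>>();
        for (Map<String, List<Long>> groups : shuffled) {
            Map<String, Long> out = new TreeMap<String, Long>();
            for (Map.Entry<String, List<Long>> g : groups.entrySet()) {
                long sum = 0;
                for (long v : g.getValue()) {
                    sum += v;
                }
                out.put(g.getKey(), sum);
            }
            outputs.add(out);
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<List<String>> splits = Arrays.asList(
                Arrays.asList("hello world", "hello"),
                Arrays.asList("world hello"));
        System.out.println(run(splits, 1)); // prints [{hello=3, world=2}]
    }
}
```

With more than one reducer, each key still lands in exactly one output map, but which one depends on the hash of the key.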
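Summarization patterns
Numerical summarization is a general pattern for computing aggregate statistics over data.
Intent
Group records by a key and compute a set of aggregate values for each group, giving a high-level view of a larger data set.
For many summarization functions, a combiner can greatly reduce the number of intermediate key/value pairs sent over the network to the reducers. Using a combiner requires the corresponding function to be associative and commutative.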
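The reason for this requirement is that partial results may be merged in any grouping and any order, and the final answer must not change. The sketch below checks this for a (min, max, count) summary in plain Java. The class name `CombinerLawSketch` and the `{min, max, count}` array layout are assumptions made for illustration, not part of any SDK.

```java
import java.util.Arrays;

// Hypothetical illustration: merging {min, max, count} summaries is
// associative and commutative, so it is safe to use in a combiner.
final class CombinerLawSketch {

    // Merges two partial summaries, each laid out as {min, max, count}.
    static long[] merge(long[] a, long[] b) {
        return new long[] {
            Math.min(a[0], b[0]),
            Math.max(a[1], b[1]),
            a[2] + b[2]
        };
    }

    public static void main(String[] args) {
        long[] a = {10, 10, 1};
        long[] b = {8, 8, 1};
        long[] c = {21, 21, 1};

        long[] leftFirst = merge(merge(a, b), c);   // (a + b) + c
        long[] rightFirst = merge(a, merge(b, c));  // a + (b + c)
        long[] reordered = merge(merge(c, a), b);   // different order

        System.out.println(Arrays.toString(leftFirst));            // [8, 21, 3]
        System.out.println(Arrays.equals(leftFirst, rightFirst));  // true
        System.out.println(Arrays.equals(leftFirst, reordered));   // true
    }
}
```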
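Applicability
- The data being processed is numeric, or is being counted
- The data can be grouped by specific fields
Examples include computing the minimum, maximum, average, median, and standard deviation.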
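Not every statistic in this list can be combined directly. The average is the usual example: averaging partial averages gives the wrong answer when the partial groups differ in size, so a combiner-friendly version typically carries (sum, count) and divides only at the end. The following plain-Java sketch shows the difference; `AverageSketch` and its methods are illustrative names, not ODPS APIs.

```java
// Hypothetical illustration: a combiner-safe average carries {sum, count}.
final class AverageSketch {

    // Merges two partial states, each laid out as {sum, count}.
    static long[] merge(long[] a, long[] b) {
        return new long[] {a[0] + b[0], a[1] + b[1]};
    }

    // Final step, done once in the reducer: sum / count.
    static double finish(long[] state) {
        return (double) state[0] / state[1];
    }

    public static void main(String[] args) {
        // Values 10 and 8 were seen by one map task, value 21 by another.
        long[] partial1 = {18, 2};
        long[] partial2 = {21, 1};

        double correct = finish(merge(partial1, partial2));       // 39 / 3
        double naive = (finish(partial1) + finish(partial2)) / 2; // (9 + 21) / 2

        System.out.println(correct); // 13.0
        System.out.println(naive);   // 15.0
    }
}
```

The median needs more state than this, because it cannot be derived from two partial medians.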
Structure
Data preparation
Create the tables dw_minmaxcount and dw0_minmaxcount
CREATE TABLE `dw_minmaxcount` (
    `userid` STRING,
    `min` BIGINT,
    `max` BIGINT,
    `count` BIGINT
)
LIFECYCLE 37000;
CREATE TABLE `dw0_minmaxcount` (
    `userid` STRING,
    `min` BIGINT,
    `max` BIGINT,
    `count` BIGINT
)
LIFECYCLE 37000;
Load the initial data
insert into table dw_minmaxcount values
('12345',10,10,1),('12345',8,8,1),('12345',21,21,1),
('54321',1,1,1),('54321',8,8,1),
('99999',110,110,1),('99999',81,81,1);
Create the ODPS project
For how to create a project, see the official Alibaba Cloud documentation.
Driver class
package minmaxcount;

import com.aliyun.odps.OdpsException;
import com.aliyun.odps.data.TableInfo;
import com.aliyun.odps.mapred.JobClient;
import com.aliyun.odps.mapred.RunningJob;
import com.aliyun.odps.mapred.conf.JobConf;
import com.aliyun.odps.mapred.utils.InputUtils;
import com.aliyun.odps.mapred.utils.OutputUtils;
import com.aliyun.odps.mapred.utils.SchemaUtils;

public class MinMaxCountDriver {
    public static void main(String[] args) throws OdpsException {
        JobConf job = new JobConf();
        // The key can consist of one or more columns
        job.setMapOutputKeySchema(SchemaUtils.fromString("userid:string"));
        // The value can consist of one or more columns
        job.setMapOutputValueSchema(SchemaUtils.fromString("min:Bigint,max:Bigint,count:Bigint"));
        // Input table
        InputUtils.addTable(TableInfo.builder().tableName("dw_minmaxcount").build(),
                job);
        // Output table
        OutputUtils.addTable(TableInfo.builder().tableName("dw0_minmaxcount").build(),
                job);
        // Group by userid
        job.setOutputGroupingColumns(new String[]{"userid"});
        job.setMapperClass(MinMaxCountMap.class);
        job.setCombinerClass(MinMaxCountCombiner.class);
        job.setReducerClass(MinMaxCountReducer.class);
        RunningJob rj = JobClient.runJob(job);
        rj.waitForCompletion();
    }
}
Map class
package minmaxcount;

import java.io.IOException;

import com.aliyun.odps.data.Record;
import com.aliyun.odps.mapred.MapperBase;

public class MinMaxCountMap extends MapperBase {
    private Record userid;
    private Record value;

    @Override
    public void setup(TaskContext context) throws IOException {
        // Reusable key and value records matching the map output schemas
        userid = context.createMapOutputKeyRecord();
        value = context.createMapOutputValueRecord();
    }

    @Override
    public void map(long recordNum, Record record, TaskContext context)
            throws IOException {
        // Use the userid field as the key
        userid.set(0, record.get("userid").toString());
        // Use the min, max, and count fields together as the value
        value.set(0, record.getBigint("min"));
        value.set(1, record.getBigint("max"));
        value.set(2, record.getBigint("count"));
        context.write(userid, value);
    }

    @Override
    public void cleanup(TaskContext context) throws IOException {
    }
}
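Combiner class
A combiner is an optional local reducer that aggregates data during the map phase. In some scenarios, aggregating with a combiner reduces the amount of data sent over the network.
For example, sending (hello world, 4) once over the network takes fewer bytes than sending (hello world, 1) four times.
Because the min, max, and count computations are associative and commutative, the reduce logic can be reused as the combiner.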
package minmaxcount;

import java.io.IOException;
import java.util.Iterator;

import com.aliyun.odps.data.Record;
import com.aliyun.odps.mapred.ReducerBase;

public class MinMaxCountCombiner extends ReducerBase {
    private Record value;
    private long min, max, count;

    @Override
    public void setup(TaskContext context) throws IOException {
        // The combiner emits records in the map output value schema
        value = context.createMapOutputValueRecord();
    }

    @Override
    public void reduce(Record key, Iterator<Record> values, TaskContext context)
            throws IOException {
        int i = 0;
        while (values.hasNext()) {
            Record record = values.next();
            if (i++ == 0) {
                // The first record for this key initializes the running values
                min = record.getBigint(0);
                max = record.getBigint(1);
                count = record.getBigint(2);
            } else {
                if (min > record.getBigint(0)) {
                    min = record.getBigint(0);
                }
                if (max < record.getBigint(1)) {
                    max = record.getBigint(1);
                }
                count += record.getBigint(2);
            }
        }
        value.setBigint(0, min);
        value.setBigint(1, max);
        value.setBigint(2, count);
        context.write(key, value);
    }

    @Override
    public void cleanup(TaskContext context) throws IOException {
    }
}
Reducer class
package minmaxcount;

import java.io.IOException;
import java.util.Iterator;

import com.aliyun.odps.data.Record;
import com.aliyun.odps.mapred.ReducerBase;

public class MinMaxCountReducer extends ReducerBase {
    private Record result;
    private long min, max, count;

    @Override
    public void setup(TaskContext context) throws IOException {
        result = context.createOutputRecord();
    }

    @Override
    public void reduce(Record key, Iterator<Record> values, TaskContext context)
            throws IOException {
        int i = 0;
        while (values.hasNext()) {
            Record record = values.next();
            // i must be incremented here; otherwise every record would reset
            // min, max, and count instead of being merged into them
            if (i++ == 0) {
                min = record.getBigint(0);
                max = record.getBigint(1);
                count = record.getBigint(2);
            } else {
                if (min > record.getBigint(0)) {
                    min = record.getBigint(0);
                }
                if (max < record.getBigint(1)) {
                    max = record.getBigint(1);
                }
                count += record.getBigint(2);
            }
        }
        result.set(0, key.get(0).toString());
        result.setBigint(1, min);
        result.setBigint(2, max);
        result.setBigint(3, count);
        // When MapReduce output is written to a table or partition, it overwrites the existing table or partition data
        context.write(result);
    }

    @Override
    public void cleanup(TaskContext context) throws IOException {
    }
}
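Package the program as a JAR; you only need to select the required program files.
Upload the JAR to resource management, create an MR development task, reference the JAR, and run the task.
Result: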
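The job's actual output is not reproduced here. As a cross-check, the following plain-Java sketch aggregates the seven sample rows inserted into dw_minmaxcount locally, without ODPS. The class name `LocalMinMaxCount` and its `summarize` method are illustrative assumptions, and the printed values are what this local computation yields, not captured job output.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical local check: computes min, max, and count per userid in memory.
final class LocalMinMaxCount {

    // rows[i] is {min, max, count} for userIds[i]; the result maps
    // each userid to its merged {min, max, count}.
    static Map<String, long[]> summarize(String[] userIds, long[][] rows) {
        Map<String, long[]> out = new TreeMap<String, long[]>();
        for (int i = 0; i < userIds.length; i++) {
            long[] row = rows[i];
            long[] acc = out.get(userIds[i]);
            if (acc == null) {
                out.put(userIds[i], row.clone()); // first row for this userid
            } else {
                acc[0] = Math.min(acc[0], row[0]);
                acc[1] = Math.max(acc[1], row[1]);
                acc[2] += row[2];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Same rows as the INSERT statement for dw_minmaxcount.
        String[] ids = {"12345", "12345", "12345", "54321", "54321", "99999", "99999"};
        long[][] rows = {
            {10, 10, 1}, {8, 8, 1}, {21, 21, 1},
            {1, 1, 1}, {8, 8, 1},
            {110, 110, 1}, {81, 81, 1}
        };
        for (Map.Entry<String, long[]> e : summarize(ids, rows).entrySet()) {
            long[] s = e.getValue();
            System.out.println(e.getKey() + " " + s[0] + " " + s[1] + " " + s[2]);
        }
        // prints:
        // 12345 8 21 3
        // 54321 1 8 2
        // 99999 81 110 2
    }
}
```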