In the Hadoop ecosystem, MapReduce can partition a large dataset or file during the map phase so the reduce phase can process the partitions in parallel; the number of partitions normally equals the number of reduce tasks. A custom bean that implements Hadoop's WritableComparable interface (serialization plus comparison) can be sorted within MapReduce. The benefit of grouping is that the reduce phase can process data grouped by a user-defined grouping property.
This article combines these points in a single demo: finding the maximum amount in each order.
Order Bean Serialization and Sorting
Implement the methods of the WritableComparable interface, where Writable is Hadoop's serialization interface and Comparable is the comparison interface invoked by the sorting algorithm. You also need to override toString, which MapReduce calls when serializing the bean to output.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

/**
 * Order bean: implements the serialization interface Writable and the
 * comparison interface Comparable (via WritableComparable).
 */
public class OrderBean implements WritableComparable<OrderBean> {
    private Text orderId;
    private DoubleWritable amount;

    public Text getOrderId() {
        return orderId;
    }
    public void setOrderId(Text orderId) {
        this.orderId = orderId;
    }
    public DoubleWritable getAmount() {
        return amount;
    }
    public void setAmount(DoubleWritable amount) {
        this.amount = amount;
    }

    // Serialize the fields in declaration order.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(orderId.toString());
        out.writeDouble(amount.get());
    }

    // Deserialize the fields in the same order they were written.
    public void readFields(DataInput in) throws IOException {
        String readUTF = in.readUTF();
        double readDouble = in.readDouble();
        this.orderId = new Text(readUTF);
        this.amount = new DoubleWritable(readDouble);
    }

    // Sort by orderId ascending; within the same order, by amount descending.
    public int compareTo(OrderBean o) {
        int res = this.getOrderId().compareTo(o.getOrderId());
        if (res == 0) {
            res = -this.getAmount().compareTo(o.getAmount());
        }
        return res;
    }

    // Called by MapReduce when the bean is written to the output.
    @Override
    public String toString() {
        return orderId.toString() + "\t" + amount.get();
    }
}
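As a quick sanity check, the following standalone snippet (a hypothetical local test, not part of the original article's job) round-trips the bean through write()/readFields() to verify that serialization and deserialization agree:
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;

public class OrderBeanRoundTrip {
    public static void main(String[] args) throws Exception {
        OrderBean in = new OrderBean();
        in.setOrderId(new Text("Order_0000001"));
        in.setAmount(new DoubleWritable(222.8));

        // serialize the bean into a byte buffer
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        in.write(new DataOutputStream(bytes));

        // deserialize a fresh bean from the same bytes
        OrderBean out = new OrderBean();
        out.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));

        System.out.println(out); // expected: Order_0000001	222.8
    }
}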
Partitioning
Hadoop's partitioning uses a hash-mod algorithm: the hash code of the partitioning property modulo the number of reduce tasks determines the partition, so records with the same property value always land in the same partition. In the reduce phase, each partition is processed by one reduce task.
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Partitioner;

/**
 * Partitions map output by orderId.
 */
public class OrderIdPartitioner extends Partitioner<OrderBean, NullWritable> {
    /**
     * The number of partitions is determined by the number of reduce tasks,
     * i.e. numPartitions == number of reduceTasks.
     */
    @Override
    public int getPartition(OrderBean key, NullWritable value, int numPartitions) {
        // mask the sign bit so the result is always non-negative
        return (key.getOrderId().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
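To illustrate, this small driver (hypothetical, not part of the job) applies the same hash-mod formula to the sample order IDs with numPartitions = 2; every record of a given order maps to a fixed partition and thus reaches the same reducer:
import org.apache.hadoop.io.Text;

public class PartitionDemo {
    public static void main(String[] args) {
        int numPartitions = 2;
        String[] ids = {"Order_0000001", "Order_0000002", "Order_0000003", "Order_0000004"};
        for (String id : ids) {
            // same computation as OrderIdPartitioner.getPartition
            int partition = (new Text(id).hashCode() & Integer.MAX_VALUE) % numPartitions;
            System.out.println(id + " -> partition " + partition);
        }
    }
}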
Grouping
Grouping in Hadoop is implemented with WritableComparator. In practice you define a grouping class that extends WritableComparator, register the key type and the grouping property (by overriding the compare method to compare only that property), and specify the grouping class on the job when submitting it to YARN.
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

/**
 * Groups orders on the reduce side by orderId.
 */
public class OrderGroupingComparator extends WritableComparator {
    // register the bean type used for grouping
    public OrderGroupingComparator() {
        super(OrderBean.class, true);
    }
    @SuppressWarnings("rawtypes")
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        OrderBean orderA = (OrderBean) a;
        OrderBean orderB = (OrderBean) b;
        // keys that compare as equal here fall into the same reduce group
        return orderA.getOrderId().compareTo(orderB.getOrderId());
    }
}
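To see why this setup surfaces each order's maximum, the following standalone sketch (a hypothetical local check, not part of the job) sorts sample records with OrderBean.compareTo and then walks them the way the grouping comparator would, printing the first record of each orderId group:
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;

public class GroupingSketch {
    public static void main(String[] args) {
        // (orderId, amount) pairs taken from the sample data
        String[][] rows = {
            {"Order_0000001", "222.8"}, {"Order_0000001", "25.8"},
            {"Order_0000002", "522.8"}, {"Order_0000002", "722.4"}
        };
        List<OrderBean> records = new ArrayList<OrderBean>();
        for (String[] r : rows) {
            OrderBean b = new OrderBean();
            b.setOrderId(new Text(r[0]));
            b.setAmount(new DoubleWritable(Double.parseDouble(r[1])));
            records.add(b);
        }
        // same ordering the shuffle produces: orderId asc, amount desc
        Collections.sort(records);
        String prevId = null;
        for (OrderBean b : records) {
            // a new orderId opens a new reduce group; its first record is the max
            if (!b.getOrderId().toString().equals(prevId)) {
                System.out.println("group key (max): " + b);
                prevId = b.getOrderId().toString();
            }
        }
    }
}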
The MapReduce Job
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Job submission class: aggregates per-order groups on the
 * reduce side using the grouping comparator.
 */
public class RunMain {

    static class OrderGroupMapper extends Mapper<LongWritable, Text, OrderBean, NullWritable> {
        OrderBean order = new OrderBean();
        NullWritable NullCons = NullWritable.get();

        // Parse each input line into the bean and emit it as the key.
        @Override
        protected void map(LongWritable key, Text value,
                Mapper<LongWritable, Text, OrderBean, NullWritable>.Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            order.setOrderId(new Text(fields[0]));
            order.setAmount(new DoubleWritable(Double.parseDouble(fields[2])));
            context.write(order, NullCons);
        }
    }

    static class OrderGroupReduce extends Reducer<OrderBean, NullWritable, OrderBean, NullWritable> {
        // No processing needed: before reduce() is called, Hadoop has already
        // sorted the keys (amount descending within each order) and applied
        // the custom grouping class, so the group key passed in is the record
        // with the maximum amount of that order. Just write it out.
        @Override
        protected void reduce(OrderBean arg0, Iterable<NullWritable> arg1,
                Reducer<OrderBean, NullWritable, OrderBean, NullWritable>.Context context)
                throws IOException, InterruptedException {
            context.write(arg0, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(RunMain.class);
        job.setMapperClass(OrderGroupMapper.class);
        job.setReducerClass(OrderGroupReduce.class);
        // Map/Reduce output key and value types
        job.setOutputKeyClass(OrderBean.class);
        job.setOutputValueClass(NullWritable.class);
        // input and output locations
        FileInputFormat.setInputPaths(job, new Path("hdfs://server1:9000/grouptest/order.txt"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://server1:9000/grouptest/result"));
        // number of reduce tasks, i.e. the number of partitions
        job.setNumReduceTasks(2);
        // grouping comparator class used during the shuffle
        job.setGroupingComparatorClass(OrderGroupingComparator.class);
        // partitioner class used during the shuffle
        job.setPartitionerClass(OrderIdPartitioner.class);
        // submit the job and print progress information
        boolean completion = job.waitForCompletion(true);
        System.exit(completion ? 0 : 1);
    }
}
Running the Test
Start the Hadoop cluster and upload the required data to HDFS, Hadoop's distributed file system. The order data is as follows:
Order_0000001 Pdt_01 222.8
Order_0000001 Pdt_05 25.8
Order_0000002 Pdt_03 522.8
Order_0000002 Pdt_04 122.4
Order_0000002 Pdt_05 722.4
Order_0000003 Pdt_01 223.8
Order_0000003 Pdt_01 23.8
Order_0000003 Pdt_01 322.8
Order_0000004 Pdt_01 701.9
Order_0000004 Pdt_01 120.8
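One way to upload the data might look like the following (the local file name and HDFS directory are assumptions chosen to match the paths used in RunMain):
hdfs dfs -mkdir -p /grouptest
hdfs dfs -put order.txt /grouptest/order.txt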
Export the Java files above as testgroup.jar, upload it to the server, and run:
hadoop jar testgroup.jar com.spark.mapreduce.group.RunMain
Checking the Results
The job's progress can be monitored in the web UI at http://192.168.10.121:8088/.
Because the number of reduce tasks was set to two, the output consists of two result files.
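For the sample input above, the two files together should contain one line per order, each holding that order's maximum amount (which file a given order lands in depends on its orderId hash code):
Order_0000001	222.8
Order_0000002	722.4
Order_0000003	322.8
Order_0000004	701.9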