Secondary Sort in MapReduce

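MapReduce ships with an example program, SecondarySort, that implements what is known as a secondary sort. A secondary sort makes tasks like the following possible: computing the maximum temperature for each year. If the temperatures for each year reach the reducer already sorted in descending order, we do not have to iterate over them to find the maximum; we take the first value for each year and ignore the rest. This is not the most efficient way to solve this particular problem, but it shows how the technique works.

To get that ordering, make the key composite: year plus temperature, sorted by year ascending and then by temperature descending. A composite key alone does not guarantee that records for the same year go to the same reducer, so a partitioner must be set that partitions on the year part of the key only. Even then, the reducer still groups values by the whole key within a partition, so grouping has to be controlled as well: by grouping values in the reducer on the year part of the key, all records for one year land in the same reduce group. Because they are sorted by temperature in descending order, the first one is the maximum temperature.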
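The idea can be tried out without Hadoop. The following is a minimal sketch in plain Java, not the bundled example: the class name `MaxTemperatureSketch`, the `{year, temperature}` array layout, and the sample numbers are all assumptions made for illustration. It sorts composite records by year ascending and temperature descending, then keeps the first record seen for each year.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of the secondary-sort idea (no Hadoop involved).
class MaxTemperatureSketch {

    // Each record is {year, temperature}; this layout is an assumption for the demo.
    static Map<Integer, Integer> maxPerYear(List<int[]> records) {
        List<int[]> sorted = new ArrayList<>(records);
        // Composite ordering: year ascending, then temperature descending.
        sorted.sort((a, b) -> a[0] != b[0]
                ? Integer.compare(a[0], b[0])
                : Integer.compare(b[1], a[1]));

        Map<Integer, Integer> result = new LinkedHashMap<>();
        for (int[] r : sorted) {
            // The first record seen for a year carries that year's maximum.
            result.putIfAbsent(r[0], r[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data.
        List<int[]> records = Arrays.asList(
                new int[] {1950, 0}, new int[] {1949, 111},
                new int[] {1950, 22}, new int[] {1949, 78});
        System.out.println(maxPerYear(records)); // prints {1949=111, 1950=22}
    }
}
```

In a real job the sort, the routing by year, and the grouping by year are done by the framework across machines; this sketch only shows why "first value per year" is enough once the order is right.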

The rest of this post walks through the SecondarySort source that ships with MapReduce:

(1) Custom key

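In MapReduce every key has to be comparable and sortable, and records are effectively ordered in two stages: first they are assigned to a partition, then they are sorted by key within that partition. This example also needs two levels of ordering: sort by the first field, then, among records whose first fields are equal, sort by the second field. To support this we build a composite class, IntPair, with two fields. Partitioning routes records by the first field, and the comparison inside each partition orders them by the first field and then the second. Every custom key should implement the WritableComparable interface, because a key must be both serializable and comparable.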

// Imports needed at the top of the enclosing source file.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// A custom key class should implement the WritableComparable interface.
public static class IntPair implements WritableComparable<IntPair> {
    int first;
    int second;

    public void set(int left, int right) {
        first = left;
        second = right;
    }

    public int getFirst() {
        return first;
    }

    public int getSecond() {
        return second;
    }

    // Deserialization: rebuild the IntPair from the binary data in the stream.
    @Override
    public void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
    }

    // Serialization: write the IntPair as binary data so it can be sent over a stream.
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
    }

    // Key comparison: order by first, then by second.
    @Override
    public int compareTo(IntPair o) {
        if (first != o.first) {
            return first < o.first ? -1 : 1;
        } else if (second != o.second) {
            return second < o.second ? -1 : 1;
        } else {
            return 0;
        }
    }

    // A newly defined key class should also override the following two methods.
    // The hashCode() method is used by the HashPartitioner (the default partitioner in MapReduce).
    @Override
    public int hashCode() {
        return first * 157 + second;
    }

    @Override
    public boolean equals(Object right) {
        if (right == null)
            return false;
        if (this == right)
            return true;
        if (right instanceof IntPair) {
            IntPair r = (IntPair) right;
            return r.first == first && r.second == second;
        } else {
            return false;
        }
    }
}
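To see what the write/readFields pair does, here is a small round-trip sketch that runs on the JDK alone. `PairRoundTrip` is a hypothetical stand-in for IntPair that drops the Hadoop interface; only `java.io` classes are used, and the sample values are made up.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for IntPair, without the Hadoop interface, so it runs with the JDK alone.
class PairRoundTrip {
    int first;
    int second;

    // Same field order as in readFields; the two methods must agree.
    void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
    }

    void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
    }

    // Serializes a pair to bytes and reads the bytes back into a fresh object.
    static PairRoundTrip roundTrip(int first, int second) {
        try {
            PairRoundTrip original = new PairRoundTrip();
            original.first = first;
            original.second = second;

            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            original.write(new DataOutputStream(buffer));

            PairRoundTrip copy = new PairRoundTrip();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        PairRoundTrip copy = roundTrip(1950, -11);
        System.out.println(copy.first + " " + copy.second); // prints 1950 -11
    }
}
```

The point to take away is that the fields must be read in exactly the order they were written; swapping the two reads would silently exchange first and second.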

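(2) Partitioner class

This is the first stage applied to a key: the partitioner looks only at the first field to decide which reducer receives the record.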

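// Import needed: org.apache.hadoop.mapreduce.Partitioner
public static class FirstPartitioner extends Partitioner<IntPair, IntWritable> {
    @Override
    public int getPartition(IntPair key, IntWritable value, int numPartitions) {
        // Only the first field is used, so keys with the same first value reach the same reducer.
        // The sign bit is masked off because Math.abs() stays negative for Integer.MIN_VALUE.
        return ((key.getFirst() * 127) & Integer.MAX_VALUE) % numPartitions;
    }
}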
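The partition arithmetic can be checked in isolation. This sketch is plain Java with hypothetical names (`PartitionSketch`, `withAbs`, `withMask`) and made-up reducer counts. It compares the `Math.abs(first * 127) % numPartitions` form of the calculation with a variant that clears the sign bit, and shows the one input where the `Math.abs` form yields a negative partition number.

```java
// Plain-Java illustration of partitioning on the first field only (no Hadoop involved).
class PartitionSketch {

    // Math.abs form of the calculation.
    static int withAbs(int first, int numPartitions) {
        return Math.abs(first * 127) % numPartitions;
    }

    // Variant that clears the sign bit instead, so the result is never negative.
    static int withMask(int first, int numPartitions) {
        return ((first * 127) & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // The second field never enters the calculation, so every key whose
        // first field is 1950 gets the same partition.
        System.out.println(withAbs(1950, 4));  // prints 2
        System.out.println(withMask(1950, 4)); // prints 2

        // Integer.MIN_VALUE * 127 overflows back to Integer.MIN_VALUE,
        // and Math.abs(Integer.MIN_VALUE) is still negative.
        System.out.println(withAbs(Integer.MIN_VALUE, 3));  // prints -2
        System.out.println(withMask(Integer.MIN_VALUE, 3)); // prints 0
    }
}
```

For non-negative products the two forms agree; for negative products they may pick different partitions, which is harmless as long as one form is used consistently within a job.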

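(3) Grouping comparator class

In the reduce phase, when the framework builds the value iterator that belongs to a key, all keys with the same first field count as one group and their values go into a single value iterator. The grouping class is a comparator and needs to extend WritableComparator.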

// Extends WritableComparator.
public static class GroupingComparator extends WritableComparator {
    protected GroupingComparator() {
        // true: have the parent class create IntPair instances to deserialize into.
        super(IntPair.class, true);
    }

    // Compare two WritableComparables by their first field only.
    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        IntPair ip1 = (IntPair) w1;
        IntPair ip2 = (IntPair) w2;
        int l = ip1.getFirst();
        int r = ip2.getFirst();
        return l == r ? 0 : (l < r ? -1 : 1);
    }
}
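The effect of a grouping comparator can be imitated in plain Java. The sketch below is an illustration under assumptions, not Hadoop's implementation: `GroupingSketch`, the `{first, second}` array layout, and the sample keys are hypothetical. It walks keys that are already sorted and opens a new group whenever the first-field comparison returns a non-zero value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of how a grouping comparator splits sorted keys into reduce groups.
class GroupingSketch {

    // Mirrors the grouping rule: only the first field decides equality.
    static int compareFirst(int[] a, int[] b) {
        return Integer.compare(a[0], b[0]);
    }

    // Walks keys that are already sorted and starts a new group
    // whenever the comparator reports a difference from the previous key.
    static List<List<int[]>> group(List<int[]> sortedKeys) {
        List<List<int[]>> groups = new ArrayList<>();
        int[] previous = null;
        for (int[] key : sortedKeys) {
            if (previous == null || compareFirst(previous, key) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(key);
            previous = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        // {first, second} pairs, already in sorted order (made-up sample data).
        List<int[]> keys = Arrays.asList(
                new int[] {1949, 78}, new int[] {1949, 111},
                new int[] {1950, 0}, new int[] {1950, 22});
        for (List<int[]> g : group(keys)) {
            System.out.println(g.get(0)[0] + " -> " + g.size() + " values");
        }
        // prints "1949 -> 2 values" then "1950 -> 2 values"
    }
}
```

Grouping only compares neighbouring keys, which is why it depends on the keys having been sorted first: equal first fields must already sit next to each other.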

(4) Settings in the main function

public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
    // Read the Hadoop configuration.
    Configuration conf = new Configuration();

    // Instantiate a job. This constructor is deprecated in Hadoop 2 and later,
    // where Job.getInstance(conf, "secondarysort") replaces it.
    Job job = new Job(conf, "secondarysort");
    job.setJarByClass(Sort.class);

    // Mapper class.
    job.setMapperClass(Map.class);

    // No Combiner is set, because the Combiner's output type <Text, IntWritable>
    // does not fit the Reduce input type <IntPair, IntWritable>.
    // job.setCombinerClass(Reduce.class);

    // Reducer class.
    job.setReducerClass(Reduce.class);

    // Partitioner.
    job.setPartitionerClass(FirstPartitioner.class);

    // Grouping comparator.
    job.setGroupingComparatorClass(GroupingComparator.class);

    // Type of the map output key.
    job.setMapOutputKeyClass(IntPair.class);

    // Type of the map output value.
    job.setMapOutputValueClass(IntWritable.class);

    // Type of the reduce output key. It is Text because the OutputFormatClass in use is TextOutputFormat.
    job.setOutputKeyClass(Text.class);

    // Type of the reduce output value.
    job.setOutputValueClass(IntWritable.class);

    // Splits the input data set into smaller chunks (splits) and provides a RecordReader implementation.
    job.setInputFormatClass(TextInputFormat.class);

    // Provides a RecordWriter implementation, which is responsible for writing the output.
    job.setOutputFormatClass(TextOutputFormat.class);

    // HDFS input path.
    FileInputFormat.setInputPaths(job, new Path(args[0]));

    // HDFS output path.
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Submit the job and wait for it to finish.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}


January 1, 2013
