How to Write MapReduce Code

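For background on MapReduce, see: http://en.wikipedia.org/wiki/MapReduce

This post assumes you already have some Hadoop programming experience.

A Mapper takes the raw input, for example web site logs, parses it, and emits intermediate results. Those results are sorted and grouped to become the Reducer's input, and the Reducer aggregates them and writes the final output. Of course, a job can chain several such map and reduce passes.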

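The Mapper is the simpler half, but it requires a deep understanding of the input: not only its format but also what it means. A few things to watch out for:

Try not to expand one input record into multiple output records, because that increases network traffic.
Choose the partition key carefully. It decides which reducer each record is sent to, so make sure the keys end up spread as evenly as possible across the reducers.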
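
As a rough illustration of the two points above, here is a minimal mapper sketch in Perl. The log format ("date site url status"), the sample lines, and the name map_line are all assumptions made up for this example. In a real streaming job the lines would come from STDIN rather than from an in-memory list.

```perl
use strict;
use warnings;

# Hypothetical log format (an assumption): "<date> <site> <url> <status>".
# Returns one "site<TAB>url<TAB>1" record for a usable line, or undef to drop it.
sub map_line {
    my ($line) = @_;
    chomp $line;
    my @fields = split ' ', $line;
    return unless @fields == 4;        # drop malformed lines
    my ( $date, $site, $url, $status ) = @fields;
    return unless $status eq '200';    # keep successful requests only
    # One input line yields at most one output record.
    return join "\t", $site, $url, 1;
}

# Made-up sample input standing in for STDIN.
my @sample = (
    "2012-05-01 a.example /index.html 200",
    "2012-05-01 a.example /missing 404",
    "garbage",
    "2012-05-02 b.example /about.html 200",
);

# Force scalar context so a dropped line comes back as undef, then filter it out.
my @records = grep { defined } map { scalar map_line($_) } @sample;
print "$_\n" for @records;
```

Here the site would serve as the partition key. Whether that spreads records evenly depends entirely on the data: if one site dominates the traffic, a finer key such as site plus URL may be the better choice.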

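The Reducer, in fact, has a ready-made template, and that is what I mainly want to present. The examples below are all in Perl.

For simple input, the template looks like this: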

use strict;
use warnings;

# read configuration
# initialize global variables
my ( $key, $cur_key );    # key of the current line, key being accumulated
# initialize key-level counters
# initialize group-level counters
# initialize final counters

### reset all key-level counters and remember the new key
sub onBeginKey { $cur_key = $key; }

### aggregate counts
sub onSameKey { }

### print out the counters
sub onEndKey { }

### main loop
while (<STDIN>) {
	chomp;

	# step 1: filter input

	# step 2: split input

	# step 3: get the key (set $key)

	# main logic
	# test with defined() so that a key such as "0" is not mistaken for "no key yet"
	if ( defined $cur_key ) {
		if ( $key ne $cur_key ) {
			onEndKey();
			onBeginKey();
		}
		onSameKey();
	}
	else {
		onBeginKey();
		onSameKey();
	}
}
if ( defined $cur_key ) {
	onEndKey();
}
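
To make the template more concrete, here is one way it might be filled in: a sketch that sums a count per key. The input layout ("key<TAB>count", already sorted by key), the sample data, and the variables $value, $count and @output are assumptions for illustration. The sketch reads from an in-memory list and collects its output in an array so it can be run on its own. A real reducer would read STDIN and print.

```perl
use strict;
use warnings;

my ( $key, $value );    # fields of the current line
my $cur_key;            # key currently being accumulated
my $count = 0;          # key-level counter
my @output;             # collected here instead of printed directly

sub onBeginKey { $cur_key = $key; $count = 0; }
sub onSameKey  { $count += $value; }
sub onEndKey   { push @output, "$cur_key\t$count"; }

# Made-up input, already sorted by key as the reducer expects.
my @sorted_input = (
    "/about.html\t1",
    "/index.html\t1",
    "/index.html\t1",
    "/index.html\t1",
);

for my $line (@sorted_input) {
    ( $key, $value ) = split /\t/, $line;
    next unless defined $value;    # filter malformed lines

    if ( defined $cur_key ) {
        if ( $key ne $cur_key ) {
            onEndKey();
            onBeginKey();
        }
        onSameKey();
    }
    else {
        onBeginKey();
        onSameKey();
    }
}
onEndKey() if defined $cur_key;    # flush the last key

print "$_\n" for @output;
```

With this sample input the sketch prints one line per distinct key: /about.html with a count of 1 and /index.html with a count of 3. Only one key's counters are held in memory at a time, which is the main point of the template.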

For more complex input, the template looks like this:

use strict;
use warnings;

# read configuration
# initialize global variables
my ( $group, $key );            # group and key of the current line
my ( $cur_group, $cur_key );    # group and key being accumulated
# initialize key-level counters
# initialize group-level counters
# initialize final counters

### reset all group-level counters and remember the new group
sub onBeginGroup { $cur_group = $group; }

### reset all key-level counters and remember the new key
sub onBeginKey { $cur_key = $key; }

### add counts at key level
sub onSameKey { }

### aggregate counts from key level to group level
sub onEndKey { }

### aggregate counts from group level to final result
sub onEndGroup { }

### main loop
while (<STDIN>) {
	chomp;

	# step 1: filter input

	# step 2: split input

	# step 3: get group and key (set $group and $key)

	# main logic
	# test with defined() so that a group such as "0" is not mistaken for "no group yet"
	if ( defined $cur_group ) {
		if ( $group ne $cur_group ) {
			onEndKey();
			onEndGroup();
			onBeginGroup();
			onBeginKey();
		}
		else {
			if ( $key ne $cur_key ) {
				onEndKey();
				onBeginKey();
			}    # else just the same key
		}
		onSameKey();
	}
	else {
		onBeginGroup();
		onBeginKey();
		onSameKey();
	}
}
if ( defined $cur_group ) {
	onEndKey();
	onEndGroup();
}

### print out the final counters
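
As with the simple version, a filled-in sketch may help. This one assumes input lines of the form "site<TAB>url<TAB>count", sorted by site and then by URL, and reports for each site the number of distinct URLs and the total hits, plus a grand total. The field layout, the sample data and all counter names are assumptions for illustration, not part of the template itself.

```perl
use strict;
use warnings;

my ( $group, $key, $value );    # fields of the current line
my ( $cur_group, $cur_key );    # group and key being accumulated
my ( $key_hits, $group_hits, $group_urls, $total_hits ) = ( 0, 0, 0, 0 );
my @output;                     # collected here instead of printed directly

sub onBeginGroup { $cur_group = $group; $group_hits = 0; $group_urls = 0; }
sub onBeginKey   { $cur_key = $key; $key_hits = 0; }
sub onSameKey    { $key_hits += $value; }
sub onEndKey     { $group_hits += $key_hits; $group_urls++; }

sub onEndGroup {
    $total_hits += $group_hits;
    push @output, join( "\t", $cur_group, $group_urls, $group_hits );
}

# Made-up input, sorted by group (site) and then by key (url).
my @sorted_input = (
    "a.example\t/about.html\t1",
    "a.example\t/index.html\t1",
    "a.example\t/index.html\t1",
    "b.example\t/index.html\t1",
);

for my $line (@sorted_input) {
    ( $group, $key, $value ) = split /\t/, $line;
    next unless defined $value;    # filter malformed lines

    if ( defined $cur_group ) {
        if ( $group ne $cur_group ) {
            onEndKey(); onEndGroup(); onBeginGroup(); onBeginKey();
        }
        elsif ( $key ne $cur_key ) {
            onEndKey(); onBeginKey();
        }
        onSameKey();
    }
    else {
        onBeginGroup(); onBeginKey(); onSameKey();
    }
}
if ( defined $cur_group ) { onEndKey(); onEndGroup(); }

push @output, "TOTAL\t$total_hits";    # final counter
print "$_\n" for @output;
```

Note that the last a.example line and the b.example line share the same URL. Because the group is compared before the key, the change of site still closes the old key and starts a new one, so the two sites are not merged.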

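The only difference between the two versions is the extra group level; the principle is the same. In theory you can keep nesting more levels in the same way.

Finally, here are a few Hadoop programming books on the market that I recommend: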

Hadoop: The Definitive Guide
Hadoop in Action
Pro Hadoop
