MIT6.824 2018 MapReduce Part II: Single-worker word count

原創

2021-01-30 10:13

Part II: Single-worker word count

Now you will implement word count — a simple Map/Reduce example. Look in main/wc.go; you'll find empty mapF() and reduceF() functions. Your job is to insert code so that wc.go reports the number of occurrences of each word in its input. A word is any contiguous sequence of letters, as determined by unicode.IsLetter.

There are some input files with pathnames of the form pg-*.txt in ~/6.824/src/main, downloaded from Project Gutenberg. Here's how to run wc with the input files:

$ cd 6.824
$ export "GOPATH=$PWD"
$ cd "$GOPATH/src/main"
$ go run wc.go master sequential pg-*.txt
# command-line-arguments
./wc.go:14: missing return at end of function
./wc.go:21: missing return at end of function

The compilation fails because mapF() and reduceF() are not complete.

Review Section 2 of the MapReduce paper. Your mapF() and reduceF() functions will differ a bit from those in the paper's Section 2.1. Your mapF() will be passed the name of a file, as well as that file's contents; it should split the contents into words, and return a Go slice of mapreduce.KeyValue. While you can choose what to put in the keys and values for the mapF output, for word count it only makes sense to use words as the keys. Your reduceF() will be called once for each key, with a slice of all the values generated by mapF() for that key. It must return a string containing the total number of occurences of the key.

a good read on Go strings is the Go Blog on strings.
you can use strings.FieldsFunc to split a string into components.
the strconv package (http://golang.org/pkg/strconv/) is handy to convert strings to integers etc.

You can test your solution using:

$ cd "$GOPATH/src/main"
$ time go run wc.go master sequential pg-*.txt
master: Starting Map/Reduce task wcseq
Merge: read mrtmp.wcseq-res-0
Merge: read mrtmp.wcseq-res-1
Merge: read mrtmp.wcseq-res-2
master: Map/Reduce task completed
2.59user 1.08system 0:02.81elapsed

The output will be in the file "mrtmp.wcseq". Your implementation is correct if the following command produces the output shown here:

$ sort -n -k2 mrtmp.wcseq | tail -10
that: 7871
it: 7987
in: 8415
was: 8578
a: 13382
of: 13536
I: 14296
to: 16079
and: 23612
the: 29748

You can remove the output file and all intermediate files with:

$ rm mrtmp.*

To make testing easy for you, run:

$ bash ./test-wc.sh

and it will report if your solution is correct or not. 來讀一下這段Shell腳本

#!/bin/bash
go run wc.go master sequential pg-*.txt
sort -n -k2 mrtmp.wcseq | tail -10 | diff - mr-testout.txt > diff.out
if [ -s diff.out ]
then
echo "Failed test. Output should be as in mr-testout.txt. Your output differs as follows (from diff.out):" > /dev/stderr
  cat diff.out
else
  echo "Passed test" > /dev/stderr
fi

這一部分主要是實現一個單線程串行化的mapreduce，這裏的map函數是對英文文本進行分詞（這裏是去查Go的API手冊），然後添加到key-value數組，value均爲1，reduce，是將同一個key，對應的value求和。得到真實的value，接口如下：

//
// The map function is called once for each file of input. The first
// argument is the name of the input file, and the second is the
// file's complete contents. You should ignore the input file name,
// and look only at the contents argument. The return value is a slice
// of key/value pairs.
//
func mapF(filename string, contents string) []mapreduce.KeyValue {
	// Your code here (Part II).
	// function to detect word separators.
	ff := func(r rune) bool { return !unicode.IsLetter(r) }

	// split contents into an array of words.
	words := strings.FieldsFunc(contents, ff)

	kva := []mapreduce.KeyValue{}
	for _, w := range words {
		kv := mapreduce.KeyValue{w, "1"}
		kva = append(kva, kv)
	}
	return kva

}

//
// The reduce function is called once for each key generated by the
// map tasks, with a list of all the values created for that key by
// any map task.
//
func reduceF(key string, values []string) string {
	// Your code here (Part II).
	return strconv.Itoa(len(values))
}

這樣Part2就完成了。

發表評論

所有評論

還沒有人評論，想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.

MIT6.824 2018 MapReduce Part II: Single-worker word count

Part II: Single-worker word count

《日本蠟燭圖》讀書筆記 & 技術分析回測

Python多線程編程深度探索：從入門到實戰

《期貨-市場技術分析》讀書筆記

mongodb處理json數據很好

頂級 Javaer 都在用的 20 個類庫，真香！

[轉帖]cpupower

google瀏覽器插件開發

35K*14 薪，入職了！這公司只要不裁員，我能一直呆下去！

Django Migrate 框架原理分析

2021.02.05學習總結

集齊支付寶福卡祕籍來了！

百度搞了一個“開發者搜索”

小於200人且有vlan劃分的網絡設計與配置

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結