MIT 6.824 MapReduce 2018 Part V: Inverted index generation

For this optional no-credit exercise, you will build Map and Reduce functions for generating an inverted index.

Inverted indices are widely used in computer science, and are particularly useful in document searching. Broadly speaking, an inverted index is a map from interesting facts about the underlying data, to the original location of that data. For example, in the context of search, it might be a map from keywords to documents that contain those words.
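To make the idea concrete, here is a minimal in-memory sketch of a keyword-to-documents index, with no MapReduce involved. The function name buildIndex and the sample file names are assumptions made for illustration; they are not part of the lab code.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode"
)

// buildIndex maps each word to the sorted list of documents that contain it.
// docs maps a (hypothetical) document name to that document's contents.
func buildIndex(docs map[string]string) map[string][]string {
	notLetter := func(r rune) bool { return !unicode.IsLetter(r) }

	// word -> set of document names
	seen := make(map[string]map[string]bool)
	for name, contents := range docs {
		for _, w := range strings.FieldsFunc(contents, notLetter) {
			if seen[w] == nil {
				seen[w] = make(map[string]bool)
			}
			seen[w][name] = true
		}
	}

	// Turn each set into a sorted slice.
	index := make(map[string][]string, len(seen))
	for w, set := range seen {
		for name := range set {
			index[w] = append(index[w], name)
		}
		sort.Strings(index[w])
	}
	return index
}

func main() {
	docs := map[string]string{
		"a.txt": "the quick fox",
		"b.txt": "the lazy dog",
	}
	index := buildIndex(docs)
	fmt.Println(index["the"]) // [a.txt b.txt]
	fmt.Println(index["fox"]) // [a.txt]
}
```

Looking up a word is then a single map access, which is why the structure suits search.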

We have created a second binary in main/ii.go that is very similar to the wc.go you built earlier. You should modify mapF and reduceF in main/ii.go so that they together produce an inverted index. Running ii.go should output a list of tuples, one per line, in the following format:

$ go run ii.go master sequential pg-*.txt
$ head -n5 mrtmp.iiseq
A: 8 pg-being_ernest.txt,pg-dorian_gray.txt,pg-frankenstein.txt,pg-grimm.txt,pg-huckleberry_finn.txt,pg-metamorphosis.txt,pg-sherlock_holmes.txt,pg-tom_sawyer.txt
ABOUT: 1 pg-tom_sawyer.txt
ACT: 1 pg-being_ernest.txt
ACTRESS: 1 pg-dorian_gray.txt
ACTUAL: 8 pg-being_ernest.txt,pg-dorian_gray.txt,pg-frankenstein.txt,pg-grimm.txt,pg-huckleberry_finn.txt,pg-metamorphosis.txt,pg-sherlock_holmes.txt,pg-tom_sawyer.txt

 

If it is not clear from the listing above, the format is:

word: #documents documents,sorted,and,separated,by,commas

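This part only requires redefining mapF and reduceF to implement inverted index generation. The key is a word and the value is the set of documents that contain it. The value is output as a single string: the first field is the number of documents, followed by the document names separated by commas.

mapF receives the document name in document and the file contents in value. We use the same splitting approach as in word count to turn the text into a slice of words, then iterate over the words and emit one pair per word, with the word as the key and the document name as the value.

The function returns the slice of key/value pairs, and that is the whole map step.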

// The mapping function is called once for each piece of the input.
// In this framework, the key is the name of the file that is being processed,
// and the value is the file's contents. The return value should be a slice of
// key/value pairs, each represented by a mapreduce.KeyValue.
func mapF(document string, value string) []mapreduce.KeyValue {
	// Your code here (Part V).
	// Treat every run of non-letter runes as a word separator.
	notLetter := func(r rune) bool { return !unicode.IsLetter(r) }

	// Split the contents into a slice of words.
	words := strings.FieldsFunc(value, notLetter)

	// Emit one (word, document) pair per occurrence; reduceF removes duplicates.
	kva := make([]mapreduce.KeyValue, 0, len(words))
	for _, w := range words {
		kva = append(kva, mapreduce.KeyValue{Key: w, Value: document})
	}
	return kva
}
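mapF above emits one pair for every occurrence of a word, so a word that appears many times in one file produces many identical pairs for the reduce step to discard. One possible variant, sketched below, drops the repeats inside the map step instead. KeyValue is a local stand-in for mapreduce.KeyValue so that the sketch compiles on its own, and the name mapUnique is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// KeyValue is a local stand-in for mapreduce.KeyValue (an assumption made
// so that this sketch is self-contained).
type KeyValue struct {
	Key   string
	Value string
}

// mapUnique emits at most one (word, document) pair per distinct word,
// in order of first appearance.
func mapUnique(document string, contents string) []KeyValue {
	notLetter := func(r rune) bool { return !unicode.IsLetter(r) }

	seen := make(map[string]bool)
	var kva []KeyValue
	for _, w := range strings.FieldsFunc(contents, notLetter) {
		if seen[w] {
			continue // this word was already emitted for this document
		}
		seen[w] = true
		kva = append(kva, KeyValue{Key: w, Value: document})
	}
	return kva
}

func main() {
	for _, kv := range mapUnique("doc.txt", "to be, or not to be") {
		fmt.Printf("%s -> %s\n", kv.Key, kv.Value)
	}
}
```

The reduce step still has to deduplicate, because the same word can arrive from several map tasks; this variant only reduces the amount of intermediate data.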

The reduce step merges the slice of values into a single string. Note that it relies on a map to remove duplicates, and then sorts the result.

// The reduce function is called once for each key generated by Map, with a
// list of that key's string value (merged across all inputs). The return value
// should be a single output value for that key.
func reduceF(key string, values []string) string {
	// Your code here (Part V).
	values = removeDuplicationAndSort(values)
	return strconv.Itoa(len(values)) + " " + strings.Join(values, ",")
}

// removeDuplicationAndSort returns the distinct values in sorted order.
func removeDuplicationAndSort(values []string) []string {
	// Use a map as a set; inserting an existing key is a no-op.
	seen := make(map[string]struct{})
	for _, value := range values {
		seen[value] = struct{}{}
	}
	ret := make([]string, 0, len(seen))
	for k := range seen {
		ret = append(ret, k)
	}
	sort.Strings(ret)
	return ret
}
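The helper above deduplicates with a map and then sorts. An alternative is to sort first and then drop adjacent duplicates, which avoids the map altogether. The names sortAndCompact and reduceIndex are hypothetical, and this is a different technique from the one used in the post, shown only for comparison.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// sortAndCompact returns the sorted distinct values. It works on a copy,
// so the caller's slice is left untouched.
func sortAndCompact(values []string) []string {
	sorted := append([]string(nil), values...)
	sort.Strings(sorted)

	// After sorting, equal values are adjacent, so keep a value only when
	// it differs from the last one kept. out reuses sorted's backing array.
	out := sorted[:0]
	for _, v := range sorted {
		if len(out) == 0 || out[len(out)-1] != v {
			out = append(out, v)
		}
	}
	return out
}

// reduceIndex formats the value as "<count> <doc1>,<doc2>,...".
func reduceIndex(key string, values []string) string {
	docs := sortAndCompact(values)
	return strconv.Itoa(len(docs)) + " " + strings.Join(docs, ",")
}

func main() {
	values := []string{"pg-grimm.txt", "pg-being_ernest.txt", "pg-grimm.txt"}
	fmt.Println(reduceIndex("A", values)) // 2 pg-being_ernest.txt,pg-grimm.txt
}
```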

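This example shows that the MapReduce demo framework provided here is general-purpose: the user only needs to supply suitable mapF and reduceF functions to solve the problem at hand.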
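As a rough illustration of that point, the sketch below runs word count and the inverted index through one tiny sequential driver, changing only the two functions passed in. The names runSequential, MapFunc, and ReduceFunc are hypothetical and are not part of the lab's mapreduce package. The driver ignores partitioning, intermediate files, and workers entirely.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// KeyValue is a local stand-in for mapreduce.KeyValue.
type KeyValue struct{ Key, Value string }

// MapFunc and ReduceFunc are hypothetical names for the two user hooks.
type (
	MapFunc    func(name, contents string) []KeyValue
	ReduceFunc func(key string, values []string) string
)

// runSequential maps every input, groups the values by key, and reduces
// the keys in sorted order, returning one "key: value" line per key.
func runSequential(inputs map[string]string, mapF MapFunc, reduceF ReduceFunc) []string {
	grouped := make(map[string][]string)
	for name, contents := range inputs {
		for _, kv := range mapF(name, contents) {
			grouped[kv.Key] = append(grouped[kv.Key], kv.Value)
		}
	}
	keys := make([]string, 0, len(grouped))
	for k := range grouped {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	lines := make([]string, 0, len(keys))
	for _, k := range keys {
		lines = append(lines, k+": "+reduceF(k, grouped[k]))
	}
	return lines
}

// Word count: emit (word, "1") and count the values.
func wcMap(_, contents string) []KeyValue {
	var kva []KeyValue
	for _, w := range strings.Fields(contents) {
		kva = append(kva, KeyValue{Key: w, Value: "1"})
	}
	return kva
}

func wcReduce(_ string, values []string) string {
	return strconv.Itoa(len(values))
}

// Inverted index: emit (word, document) and list the distinct documents.
func iiMap(name, contents string) []KeyValue {
	var kva []KeyValue
	for _, w := range strings.Fields(contents) {
		kva = append(kva, KeyValue{Key: w, Value: name})
	}
	return kva
}

func iiReduce(_ string, values []string) string {
	set := make(map[string]struct{})
	for _, v := range values {
		set[v] = struct{}{}
	}
	docs := make([]string, 0, len(set))
	for d := range set {
		docs = append(docs, d)
	}
	sort.Strings(docs)
	return strconv.Itoa(len(docs)) + " " + strings.Join(docs, ",")
}

func main() {
	inputs := map[string]string{"a.txt": "go is fun", "b.txt": "go is fast"}
	fmt.Println(strings.Join(runSequential(inputs, wcMap, wcReduce), "\n"))
	fmt.Println(strings.Join(runSequential(inputs, iiMap, iiReduce), "\n"))
}
```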
