MIT6.824 MapReduce 2018 Part V: Inverted index generation

For this optional no-credit exercise, you will build Map and Reduce functions for generating an inverted index.

Inverted indices are widely used in computer science, and are particularly useful in document searching. Broadly speaking, an inverted index is a map from interesting facts about the underlying data, to the original location of that data. For example, in the context of search, it might be a map from keywords to documents that contain those words.
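To make the idea concrete, here is a minimal in-memory sketch of an inverted index, independent of MapReduce. The function name buildIndex and the sample file names are assumptions made for illustration; they are not part of the lab code.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// buildIndex maps each word to the sorted list of document names that
// contain it. docs maps a (hypothetical) document name to its text.
func buildIndex(docs map[string]string) map[string][]string {
	// seen[word] is the set of documents in which word occurs.
	seen := make(map[string]map[string]bool)
	for name, text := range docs {
		for _, w := range strings.Fields(text) {
			if seen[w] == nil {
				seen[w] = make(map[string]bool)
			}
			seen[w][name] = true
		}
	}

	// Turn each set into a sorted slice so the output is deterministic.
	index := make(map[string][]string, len(seen))
	for w, names := range seen {
		for name := range names {
			index[w] = append(index[w], name)
		}
		sort.Strings(index[w])
	}
	return index
}

func main() {
	docs := map[string]string{
		"a.txt": "the quick fox",
		"b.txt": "the lazy dog",
	}
	index := buildIndex(docs)
	fmt.Println(strings.Join(index["the"], ",")) // prints "a.txt,b.txt"
	fmt.Println(strings.Join(index["fox"], ",")) // prints "a.txt"
}
```

The MapReduce version in this exercise computes the same mapping, but splits the work into a per-document map step and a per-word reduce step.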

We have created a second binary in main/ii.go that is very similar to the wc.go you built earlier. You should modify mapF and reduceF in main/ii.go so that they together produce an inverted index. Running ii.go should output a list of tuples, one per line, in the following format:

$ go run ii.go master sequential pg-*.txt
$ head -n5 mrtmp.iiseq
A: 8 pg-being_ernest.txt,pg-dorian_gray.txt,pg-frankenstein.txt,pg-grimm.txt,pg-huckleberry_finn.txt,pg-metamorphosis.txt,pg-sherlock_holmes.txt,pg-tom_sawyer.txt
ABOUT: 1 pg-tom_sawyer.txt
ACT: 1 pg-being_ernest.txt
ACTRESS: 1 pg-dorian_gray.txt
ACTUAL: 8 pg-being_ernest.txt,pg-dorian_gray.txt,pg-frankenstein.txt,pg-grimm.txt,pg-huckleberry_finn.txt,pg-metamorphosis.txt,pg-sherlock_holmes.txt,pg-tom_sawyer.txt

If it is not clear from the listing above, the format is:

word: #documents documents,sorted,and,separated,by,commas

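This part only requires redefining mapF and reduceF to implement inverted index generation. The key is a word and the value is the set of documents that contain it. The value is output as a single string: the first field is the number of documents, followed by the document names separated by commas.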

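The document argument passed to mapF is the name of the input file, and value is its contents. We use the same split method as in wc.go to turn the text into a slice of words, then iterate over the words, emitting one pair per word with the word as key and the document name as value.

mapF returns the resulting slice of key/value pairs.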

// The mapping function is called once for each piece of the input.
// In this framework, the key is the name of the file that is being processed,
// and the value is the file's contents. The return value should be a slice of
// key/value pairs, each represented by a mapreduce.KeyValue.
func mapF(document string, value string) (res []mapreduce.KeyValue) {
	// Your code here (Part V).
	// Needs the "strings" and "unicode" imports in ii.go.
	// A word boundary is any rune that is not a letter.
	ff := func(r rune) bool { return !unicode.IsLetter(r) }

	// Split contents into a slice of words.
	words := strings.FieldsFunc(value, ff)

	// Emit one (word, document) pair per occurrence; reduceF removes duplicates.
	for _, w := range words {
		res = append(res, mapreduce.KeyValue{Key: w, Value: document})
	}
	return res
}
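As a rough illustration of what the map step emits, the sketch below runs the same splitting logic outside the lab framework. The local KeyValue type is a stand-in for mapreduce.KeyValue so that the program compiles on its own, and emitPairs and doc1.txt are hypothetical names chosen for this example.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// KeyValue is a local stand-in for mapreduce.KeyValue (assumption: it has
// the same two string fields) so this sketch is self-contained.
type KeyValue struct {
	Key   string
	Value string
}

// emitPairs mirrors mapF: one (word, document) pair per word occurrence.
func emitPairs(document, contents string) []KeyValue {
	notLetter := func(r rune) bool { return !unicode.IsLetter(r) }
	var kva []KeyValue
	for _, w := range strings.FieldsFunc(contents, notLetter) {
		kva = append(kva, KeyValue{Key: w, Value: document})
	}
	return kva
}

func main() {
	// Prints six lines: to, be, or, not, to, be, each paired with doc1.txt.
	for _, kv := range emitPairs("doc1.txt", "to be, or not to be") {
		fmt.Printf("%s -> %s\n", kv.Key, kv.Value)
	}
}
```

Note that "to" and "be" each appear twice with the same document name, which is why the reduce step has to deduplicate.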

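The reduce step merges the slice of values into a single string. Because a word that occurs several times in one document yields the same document name several times, it uses a map to remove duplicates, and then sorts the result.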

// The reduce function is called once for each key generated by Map, with a
// list of that key's string value (merged across all inputs). The return value
// should be a single output value for that key.
func reduceF(key string, values []string) string {
	// Your code here (Part V).
	values = removeDuplicationAndSort(values)
	return strconv.Itoa(len(values)) + " " + strings.Join(values, ",")
}

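// removeDuplicationAndSort returns the distinct strings in values, sorted.
func removeDuplicationAndSort(values []string) []string {
	// Use a map as a set: repeated document names collapse into one key.
	kvs := make(map[string]struct{})
	for _, value := range values {
		kvs[value] = struct{}{}
	}
	ret := make([]string, 0, len(kvs))
	for k := range kvs {
		ret = append(ret, k)
	}
	// Map iteration order is random, so sort for deterministic output.
	sort.Strings(ret)
	return ret
}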
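The effect of the reduce step on a single key can be checked in isolation. The following is a small sketch under the assumption that the values for one word arrive as a plain slice of document names; formatPostings is a hypothetical name that combines the deduplication, sorting, and formatting shown above.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// formatPostings deduplicates and sorts the document names for one word,
// then formats them as "#documents doc1,doc2,...".
func formatPostings(values []string) string {
	set := make(map[string]struct{}, len(values))
	for _, v := range values {
		set[v] = struct{}{}
	}
	docs := make([]string, 0, len(set))
	for d := range set {
		docs = append(docs, d)
	}
	sort.Strings(docs)
	return strconv.Itoa(len(docs)) + " " + strings.Join(docs, ",")
}

func main() {
	values := []string{"pg-grimm.txt", "pg-tom_sawyer.txt", "pg-grimm.txt"}
	fmt.Println(formatPostings(values)) // prints "2 pg-grimm.txt,pg-tom_sawyer.txt"
}
```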

This example shows that the provided MapReduce demo framework is general-purpose: the user only has to supply suitable mapF and reduceF functions to solve a given problem.
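The sketch below illustrates that separation of concerns with a deliberately simplified, single-process runner. It is not the lab's implementation: runSequential, countMap, countReduce, and the local KeyValue type are all hypothetical names introduced here. The runner only groups values by key and calls whatever map and reduce functions it is given.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// KeyValue is a local stand-in for mapreduce.KeyValue.
type KeyValue struct{ Key, Value string }

// runSequential is a simplified stand-in for a MapReduce framework:
// it applies mapF to every input, groups values by key, and applies
// reduceF to each key in sorted order.
func runSequential(
	inputs map[string]string,
	mapF func(name, contents string) []KeyValue,
	reduceF func(key string, values []string) string,
) []string {
	grouped := make(map[string][]string)
	for name, contents := range inputs {
		for _, kv := range mapF(name, contents) {
			grouped[kv.Key] = append(grouped[kv.Key], kv.Value)
		}
	}
	keys := make([]string, 0, len(grouped))
	for k := range grouped {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	lines := make([]string, 0, len(keys))
	for _, k := range keys {
		lines = append(lines, k+": "+reduceF(k, grouped[k]))
	}
	return lines
}

// countMap and countReduce implement word count on the same runner.
func countMap(_, contents string) []KeyValue {
	var kva []KeyValue
	for _, w := range strings.Fields(contents) {
		kva = append(kva, KeyValue{Key: w, Value: "1"})
	}
	return kva
}

func countReduce(_ string, values []string) string {
	return strconv.Itoa(len(values))
}

func main() {
	inputs := map[string]string{"a.txt": "x y x", "b.txt": "y z"}
	// Prints "x: 2", "y: 2", "z: 1", one per line.
	for _, line := range runSequential(inputs, countMap, countReduce) {
		fmt.Println(line)
	}
}
```

Swapping countMap and countReduce for functions shaped like the mapF and reduceF above would turn the same runner into an inverted index generator without touching runSequential.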
