- Distributed Sort via MapReduce
- The map function simply emits a (key, record) pair for each input record.
- The intermediate keys are partitioned into R pieces by key range, so the R pieces form ordered partitions of the key domain. This works like a bucket sort.
- The reduce function quicksorts the keys of its partition (assuming each partition fits in memory, so no external sort is needed).
- With N keys in total, the computational cost is:
- Map phase: N
- Reduce phase: R * (N/R * log(N/R)) = N log N - N log R
- I/O: two rounds of reads and writes over the input
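The three steps above can be sketched in a single process. This is only an illustration, not a real MapReduce job: the names `map_fn`, `partition`, `reduce_fn` and `mapreduce_sort` are hypothetical, the splitter values are assumed to be known in advance (a real job would typically sample the keys to choose them), and Python's built-in `sorted` (Timsort) stands in for the quicksort mentioned above.

```python
import bisect
import random


def map_fn(record):
    """Map: emit (key, record); in this sketch the record is its own key."""
    return (record, record)


def partition(key, splitters):
    """Range partitioner: index of the partition whose key range holds `key`."""
    return bisect.bisect_right(splitters, key)


def reduce_fn(pairs):
    """Reduce: sort one partition in memory (Timsort standing in for quicksort)."""
    return [record for _, record in sorted(pairs)]


def mapreduce_sort(records, splitters):
    # R = len(splitters) + 1 partitions, ordered by key range.
    buckets = [[] for _ in range(len(splitters) + 1)]
    for record in records:
        key, value = map_fn(record)
        buckets[partition(key, splitters)].append((key, value))
    # Partitions cover ascending key ranges, so concatenating the
    # sorted partitions yields a globally sorted output.
    out = []
    for bucket in buckets:
        out.extend(reduce_fn(bucket))
    return out


random.seed(0)
data = [random.randrange(1000) for _ in range(10_000)]
splitters = [250, 500, 750]  # assumed splitters, giving R = 4
result = mapreduce_sort(data, splitters)
```

In a real cluster each call to `reduce_fn` would run on a different worker, which is where the parallelism comes from.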
- K-way merge + quicksort
- Quicksort cost: K * (N/K * log(N/K)) = N log N - N log K
- K-way merge cost: N log K
- I/O: two rounds of reads and writes over the input
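For comparison, a minimal in-memory sketch of the K-way merge approach is shown below. It assumes the sorted runs fit in Python lists (in an external sort they would be files on disk), uses `sorted` in place of quicksort, and relies on the standard-library `heapq.merge`, which keeps a heap of size K and so costs about log K per output element. The function name `kway_merge_sort` is made up for this example.

```python
import heapq
import random


def kway_merge_sort(records, k):
    """Sort by splitting into at most k runs, sorting each, then k-way merging."""
    # Ceiling division; max(1, ...) avoids a zero step for empty input.
    chunk = max(1, -(-len(records) // k))
    # Phase 1: sort each run independently, about (N/K) log(N/K) per run.
    runs = [sorted(records[i:i + chunk]) for i in range(0, len(records), chunk)]
    # Phase 2: a single k-way merge over all runs, about N log K in total.
    return list(heapq.merge(*runs))


random.seed(1)
data = [random.randrange(10**6) for _ in range(10_000)]
merged = kway_merge_sort(data, k=8)
```

The merge in phase 2 produces the output as one sequential stream, which matches the serial behaviour discussed in the summary.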
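- Summary
- If R == K, the two approaches have comparable computation and I/O costs. However, the reduce phase can run concurrently across machines, whereas the K-way merge can only run serially, so overall MapReduce does better in practice.
- Also note that in both approaches the I/O time and the CPU time are of the same order. Assume 1 TB (2^40 B) of data, an I/O rate of 100 MB/s, a 2 GHz CPU, and K = R = 1000. A rough estimate for serial processing follows (logs are base 2, with log 1000 ≈ 10); the concurrent case is similar:
- Compute time: 2^40 * (1 + log(2^40) - log 1000) / (2 * 2^30) = 2^9 * (1 + 40 - 10) = 15,872 s ≈ 16,000 s
- I/O time: 1 TB / (100 MB/s) * 2 = 2^21 / 100 ≈ 21,000 s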
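The estimate can be reproduced with a few lines of Python. The sketch keeps the simplifications of the text: every byte is counted as one key, the CPU is assumed to do one basic operation per cycle, 2 GHz is taken as 2 * 2^30 Hz, 100 MB/s as 100 * 2^20 B/s, and the data is read and written in two passes. The function and constant names are hypothetical, and the code uses the exact value of log2(1000) rather than rounding it to 10, so the compute time comes out slightly above the hand calculation.

```python
import math

N = 2**40               # 1 TB, one key per byte (simplification from the text)
IO_RATE = 100 * 2**20   # 100 MB/s in bytes per second
CPU_HZ = 2 * 2**30      # 2 GHz, assuming one operation per cycle
R = 1000                # number of partitions (K = R)


def compute_seconds(n, r, hz):
    """Map cost N plus reduce cost N log N - N log R, divided by clock rate."""
    return n * (1 + math.log2(n) - math.log2(r)) / hz


def io_seconds(n, rate, passes=2):
    """Time to move n bytes `passes` times at `rate` bytes per second."""
    return passes * n / rate


cpu_s = compute_seconds(N, R, CPU_HZ)
io_s = io_seconds(N, IO_RATE)
print(f"compute: {cpu_s:.0f} s, I/O: {io_s:.0f} s")
```

Both figures land in the range of a few hours, which is the point of the comparison: neither CPU nor I/O clearly dominates under these assumptions.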
Distributed Sort via MapReduce vs. K-way Merge + Quicksort