Distributed Sort via MapReduce vs. K-Way Merge + Quicksort

  • Distributed Sort via MapReduce
    • The Map function simply emits each record keyed by its sort key.
    • The intermediate keys are partitioned into R pieces, where each piece covers one contiguous range of the key domain and the ranges are in order. This acts as a bucket sort.
    • Each Reduce task quicksorts its input keys (assuming all of its keys fit in memory, so no external sort is needed).
    • The computational complexity, for N keys in total, is then:
      • Map phase: N
      • Reduce phase: R * (N/R * log(N/R)) = NlogN - NlogR
      • I/O: two rounds of reading and writing the input
  • K-way merge + quicksort
    • Quicksort complexity: K * (N/K * log(N/K)) = NlogN - NlogK
    • K-way merge complexity: NlogK
    • I/O: two rounds of reading and writing the input
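  • Summary
    • If R == K, the two approaches have comparable computation and I/O complexity. However, the Reduce phase can run concurrently across machines, whereas the K-way merge can only run serially, so overall MapReduce is the better choice in practice.
    • Note also that in both approaches the I/O time and the CPU time are of the same order. Suppose the data is 1 TB (2^40 B), I/O throughput is 100 MB/s, the CPU runs at 2 GHz, and K = R = 1000. For serial processing the rough estimate is as follows (the concurrent case is similar), taking N = 2^40, logs base 2 with log 1000 ≈ 10, and about one operation per clock cycle:
      • Compute time: 2^40 * (1 + log(2^40) - log 1000) / (2 * 2^30) ≈ 2^9 * (1 + 40 - 10) = 15872 s ≈ 16000 s
      • I/O time: 1 TB / (100 MB/s) * 2 = 2^21 / 100 s ≈ 21000 s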
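
The comparison above can be sketched on a single machine. The following is only an illustrative sketch, not a real MapReduce job: the function names (`mapreduce_style_sort`, `kway_merge_sort`, `estimate_seconds`) are hypothetical, the sampled splitters are an assumed way of choosing ordered key ranges, and Python's built-in `sorted` stands in for quicksort. The estimate function repeats the simplifications of the notes (N = 2^40, one operation per clock cycle, two passes of I/O).

```python
import heapq
import math
import random
from bisect import bisect_right


def mapreduce_style_sort(keys, r, seed=0):
    """Range-partition keys into r ordered buckets, sort each, concatenate."""
    if not keys:
        return []
    # Assumption: pick r-1 splitters from a sorted random sample of the input.
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(len(keys), 100 * r)))
    splitters = [sample[i * len(sample) // r] for i in range(1, r)]
    buckets = [[] for _ in range(r)]
    for key in keys:
        # "Map" + partition: route each key to the bucket for its key range.
        buckets[bisect_right(splitters, key)].append(key)
    # "Reduce": every bucket is sorted independently, so on a cluster
    # these r sorts could run at the same time on different machines.
    return [key for bucket in buckets for key in sorted(bucket)]


def kway_merge_sort(keys, k):
    """Cut keys into about k chunks, sort each chunk, then heap-merge the runs."""
    chunk = max(1, math.ceil(len(keys) / k))
    runs = [sorted(keys[i:i + chunk]) for i in range(0, len(keys), chunk)]
    # The final merge is one sequential pass over all runs.
    return list(heapq.merge(*runs))


def estimate_seconds(n=2**40, r=1000, cpu_hz=2 * 2**30,
                     io_bytes_per_s=100 * 2**20):
    """Rough serial CPU and I/O times under the simplifications of the notes."""
    cpu = n * (1 + math.log2(n) - math.log2(r)) / cpu_hz
    io = 2 * n / io_bytes_per_s  # two rounds over the data
    return cpu, io


rng = random.Random(42)
data = [rng.randrange(1_000_000) for _ in range(10_000)]
same_result = mapreduce_style_sort(data, 8) == kway_merge_sort(data, 8)
cpu_s, io_s = estimate_seconds()
```

Both functions should return the same sorted list; the difference is that the bucket sorts in `mapreduce_style_sort` are independent of each other, while `kway_merge_sort` ends with a single sequential merge. With the exact value of log2(1000) the compute estimate comes out near 15,900 s, and the I/O estimate near 21,000 s.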
