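MapReduce can express very complex aggregation logic and is highly flexible. However, MapReduce is slow and should not be used for real-time data analysis. A MapReduce job can run in parallel across multiple servers: each server handles only part of the workload, then sends its partial results to the master server, which merges them, computes the final result set, and returns it to the client.
The basic idea of MapReduce is shown in the figure below:
This example uses a summation. The Map phase runs first: it splits one large task into several small tasks, each running on a different node, which is what makes distributed computation possible (blue box in the figure). The results of the small tasks are then combined in a second computation that yields the final result, 55. This phase is called Reduce (red box in the figure).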
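The summation idea can be sketched in plain JavaScript, without MongoDB. This is only an illustration under assumptions: the input is taken to be the numbers 1 through 10 (which matches the result 55), the chunk size of 3 is an arbitrary stand-in for "nodes", and `splitIntoChunks` and `sum` are hypothetical helper names.

```javascript
// Plain JavaScript sketch of the summation example (no MongoDB required).
// Assumption: the input is 1..10, which adds up to the 55 mentioned above.
var numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];

// Split the big task into small tasks, one per hypothetical node.
function splitIntoChunks(items, size) {
  var chunks = [];
  for (var i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Add up an array of numbers.
function sum(values) {
  return values.reduce(function (acc, v) { return acc + v; }, 0);
}

// Map phase: every chunk is summed independently.
var partialSums = splitIntoChunks(numbers, 3).map(sum);

// Reduce phase: the partial results are combined into one value.
var total = sum(partialSums);

console.log(partialSums.join(", ")); // 6, 15, 24, 10
console.log(total);                  // 55
```

Because addition is associative, the chunk size does not affect the final total; only the partial sums change.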
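Computing an aggregation with MapReduce involves three steps: Map, Shuffle, and Reduce. Map and Reduce must be defined explicitly; Shuffle is performed by MongoDB.
- Map: applies an operation to each document, producing a key and a value
- Shuffle: groups the output by key and combines the values that share a key into an array
- Reduce: reduces the array of values to a single value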
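The three steps above can be imitated in memory to make the data flow visible. The sketch below is a simplified stand-in, not MongoDB's implementation: `sampleDocs`, `groups`, and `result` are hypothetical names, the global `emit` is replaced by a local function that does the grouping, and a plain `reduce` call replaces the shell's `Array.sum` so the code also runs outside the mongo shell.

```javascript
// Three hypothetical documents, shaped like the employee data used later.
var sampleDocs = [
  { ename: "SMITH", job: "CLERK" },
  { ename: "ALLEN", job: "SALESMAN" },
  { ename: "WARD", job: "SALESMAN" }
];

// Shuffle storage: key -> array of emitted values.
var groups = {};

// Stand-in for MongoDB's emit: collect each value under its key.
var emit = function (key, value) {
  if (!Object.prototype.hasOwnProperty.call(groups, key)) {
    groups[key] = [];
  }
  groups[key].push(value);
};

// Map: emit (job, 1) for the document bound to `this`.
var map = function () { emit(this.job, 1); };

// Reduce: collapse the array of values into a single number.
var reduce = function (key, values) {
  return values.reduce(function (a, b) { return a + b; }, 0);
};

// Map phase: run the map function once per document.
sampleDocs.forEach(function (doc) { map.apply(doc); });
console.log(JSON.stringify(groups)); // {"CLERK":[1],"SALESMAN":[1,1]}

// Reduce phase: one call per key.
// (Simplification: MongoDB skips reduce for keys that have a single value.)
var result = {};
Object.keys(groups).forEach(function (key) {
  result[key] = reduce(key, groups[key]);
});
console.log(JSON.stringify(result)); // {"CLERK":1,"SALESMAN":2}
```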
The examples below use the following test data (employee records).
db.emp.insert(
[
{_id:7369,ename:'SMITH' ,job:'CLERK' ,mgr:7902,hiredate:'17-12-80',sal:800,comm:0,deptno:20},
{_id:7499,ename:'ALLEN' ,job:'SALESMAN' ,mgr:7698,hiredate:'20-02-81',sal:1600,comm:300 ,deptno:30},
{_id:7521,ename:'WARD' ,job:'SALESMAN' ,mgr:7698,hiredate:'22-02-81',sal:1250,comm:500 ,deptno:30},
{_id:7566,ename:'JONES' ,job:'MANAGER' ,mgr:7839,hiredate:'02-04-81',sal:2975,comm:0,deptno:20},
{_id:7654,ename:'MARTIN',job:'SALESMAN' ,mgr:7698,hiredate:'28-09-81',sal:1250,comm:1400,deptno:30},
{_id:7698,ename:'BLAKE' ,job:'MANAGER' ,mgr:7839,hiredate:'01-05-81',sal:2850,comm:0,deptno:30},
{_id:7782,ename:'CLARK' ,job:'MANAGER' ,mgr:7839,hiredate:'09-06-81',sal:2450,comm:0,deptno:10},
{_id:7788,ename:'SCOTT' ,job:'ANALYST' ,mgr:7566,hiredate:'19-04-87',sal:3000,comm:0,deptno:20},
{_id:7839,ename:'KING' ,job:'PRESIDENT',mgr:0,hiredate:'17-11-81',sal:5000,comm:0,deptno:10},
{_id:7844,ename:'TURNER',job:'SALESMAN' ,mgr:7698,hiredate:'08-09-81',sal:1500,comm:0,deptno:30},
{_id:7876,ename:'ADAMS' ,job:'CLERK' ,mgr:7788,hiredate:'23-05-87',sal:1100,comm:0,deptno:20},
{_id:7900,ename:'JAMES' ,job:'CLERK' ,mgr:7698,hiredate:'03-12-81',sal:950,comm:0,deptno:30},
{_id:7902,ename:'FORD' ,job:'ANALYST' ,mgr:7566,hiredate:'03-12-81',sal:3000,comm:0,deptno:20},
{_id:7934,ename:'MILLER',job:'CLERK' ,mgr:7782,hiredate:'23-01-82',sal:1300,comm:0,deptno:10}
]
);
(Case 1) Count the number of employees in each job
var map1=function(){emit(this.job,1)}
var reduce1=function(job,count){return Array.sum(count)}
db.emp.mapReduce(map1,reduce1,{out:"mrdemo1"})
(Case 2) Compute the total salary of each department
var map2=function(){emit(this.deptno,this.sal)}
var reduce2=function(deptno,sal){return Array.sum(sal)}
db.emp.mapReduce(map2,reduce2,{out:"mrdemo2"})
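As a sanity check on Case 2, the same per-department totals can be reproduced in plain JavaScript. This is a sketch for cross-checking only: `deptSalPairs` and `totalsByDept` are hypothetical names, and the pairs are copied by hand from the `deptno` and `sal` fields of the test data above.

```javascript
// [deptno, sal] pairs copied from the emp test data, in insertion order.
var deptSalPairs = [
  [20, 800], [30, 1600], [30, 1250], [20, 2975], [30, 1250],
  [30, 2850], [10, 2450], [20, 3000], [10, 5000], [30, 1500],
  [20, 1100], [30, 950], [20, 3000], [10, 1300]
];

// Accumulate the salary total per department.
var totalsByDept = {};
deptSalPairs.forEach(function (pair) {
  var deptno = pair[0];
  var sal = pair[1];
  totalsByDept[deptno] = (totalsByDept[deptno] || 0) + sal;
});

// Expected totals: 10 -> 8750, 20 -> 10875, 30 -> 9400
console.log(JSON.stringify(totalsByDept));
```

MongoDB stores each key of a mapReduce output collection in `_id` and the reduced result in `value`, so `db.mrdemo2.find()` would be expected to show these three totals in that shape.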
(Case 3) Troubleshoot the Map Function
Define your own emit function:
var emit = function(key, value) {
print("emit");
print("key: " + key + " value: " + tojson(value));
}
Test with a single document:
emp7839=db.emp.findOne({_id:7839})
map2.apply(emp7839)
This prints the following output:
emit
key: 10 value: 5000
Test with multiple documents:
var myCursor=db.emp.find()
while (myCursor.hasNext()) {
var doc = myCursor.next();
print ("document _id= " + tojson(doc._id));
map2.apply(doc);
print();
}
(Case 4) Troubleshoot the Reduce Function
A simple test case:
var myTestValues = [ 5, 5, 10 ];
var reduce1=function(key,values){return Array.sum(values)}
reduce1("mykey",myTestValues)
Test: each value passed to Reduce contains multiple fields
Test data (salary and commission):
var myTestObjects = [
{ sal: 1000, comm: 5 },
{ sal: 2000, comm: 10 },
{ sal: 3000, comm: 15 }
];
Write the reduce function:
var reduce2=function(key,values) {
var reducedValue = { sal: 0, comm: 0 };
for(var i=0;i<values.length;i++) {
reducedValue.sal += values[i].sal;
reducedValue.comm += values[i].comm;
}
return reducedValue;
}
Test:
reduce2("aa",myTestObjects)
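One further check is worth running on a reduce function. MongoDB may call reduce more than once for the same key and feed an earlier result back in as one of the values, so the function should return the same shape it receives and give the same answer either way. The sketch below checks this in plain JavaScript; `reduceSalComm` and `testObjects` are hypothetical names that mirror `reduce2` and `myTestObjects` above.

```javascript
// Same logic as the reduce function above, under a hypothetical name.
var reduceSalComm = function (key, values) {
  var reducedValue = { sal: 0, comm: 0 };
  for (var i = 0; i < values.length; i++) {
    reducedValue.sal += values[i].sal;
    reducedValue.comm += values[i].comm;
  }
  return reducedValue;
};

var testObjects = [
  { sal: 1000, comm: 5 },
  { sal: 2000, comm: 10 },
  { sal: 3000, comm: 15 }
];

// Reduce all values in a single call.
var direct = reduceSalComm("aa", testObjects);

// Reduce the first two values, then feed that result back in with the third.
var partial = reduceSalComm("aa", [testObjects[0], testObjects[1]]);
var rereduced = reduceSalComm("aa", [partial, testObjects[2]]);

console.log(JSON.stringify(direct));    // {"sal":6000,"comm":30}
console.log(JSON.stringify(rereduced)); // {"sal":6000,"comm":30}
```

The two results match because the function returns an object with the same `sal` and `comm` fields as its inputs. A reduce function that returned, say, a bare number would fail this check.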