Data format:
4234 4565 89579 0989 ····
3455 879 123 9090 ····
2342 9897 765 5746 ····
987 8098 8008 80099 ····
····
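Requirement:
Count how many times each number occurs in this data set, sort the numbers by occurrence count in descending order, and output the first n numbers together with their counts (top n).
Python code: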
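As a point of reference for the expected result, the same top-n computation can be sketched on a single machine with `collections.Counter`. This is only a small in-memory illustration, not part of the MapReduce job; the function name `top_n` and the sample lines are made up for the example.

```python
from collections import Counter


def top_n(lines, n):
    """Return the n most frequent whitespace-separated tokens as (token, count) pairs."""
    counts = Counter()
    for line in lines:
        # Each line holds several numbers separated by whitespace.
        counts.update(line.split())
    # most_common sorts by count, largest first.
    return counts.most_common(n)


# Hypothetical sample input in the same format as the data above.
sample = ["4234 879 123", "879 123 879", "123 879"]
for num, count in top_n(sample, 2):
    print('%s\t%s' % (count, num))
```

This approach only works while all distinct numbers fit in the memory of one process, which is the limitation the mapper/reducer split is meant to work around.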
- mapper:
Emit a simple (num, 1) pair for every number read from the input.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys

def map():
    # Emit one "num<TAB>1" record for every whitespace-separated number on stdin.
    for line in sys.stdin:
        line = line.strip()
        words = line.split()
        for word in words:
            print('%s\t%s' % (word, 1))

if __name__ == '__main__':
    map()
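An optional variation, not taken from the original post: since the reducer sums the second field of every record rather than assuming it is 1, the mapper could pre-aggregate its counts locally and emit fewer records. The sketch below assumes that the distinct numbers seen by one mapper fit in memory; `map_with_local_counts` is a hypothetical name.

```python
from collections import Counter


def map_with_local_counts(lines):
    """Yield tab-separated 'num<TAB>count' records, pre-aggregated over the given lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    # One record per distinct number instead of one per occurrence.
    for num, count in counts.items():
        yield '%s\t%s' % (num, count)


# Hypothetical sample input; a real mapper would pass sys.stdin instead.
for record in map_with_local_counts(["1 2 1", "2 1"]):
    print(record)
```

In a streaming job the function would be called with `sys.stdin` and each record printed to stdout, exactly as in the mapper above.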
- reducer:
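Use groupby to group the records by key (num). Within each group the key is num and the values are (num, 1) tuples, so summing the second element of every tuple gives count, the total number of occurrences of that num. The reducer then compares the counts and keeps only the top n, where n is the argument passed to the Python script when it is run through Streaming. Because groupby only merges consecutive records that share a key, this relies on the reducer's input arriving sorted by key, as it does in a streaming MapReduce job such as Hadoop Streaming.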
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
from itertools import groupby

def from_stdin():
    # Parse each "num<TAB>count" record from stdin into a (num, count) tuple.
    for line in sys.stdin:
        word, count = line.strip().split('\t')
        yield (word, count)

def reduce():
    n = int(sys.argv[1])  # number of top entries to keep
    a = {}  # current top-n candidates: num -> count
    for word, group in groupby(from_stdin(), key=lambda x: x[0]):
        count = sum([int(tup[1]) for tup in group])
        if len(a) < n:
            a.setdefault(word, count)
        else:
            # Replace the smallest candidate if this count is larger.
            y = min(a, key=a.get)
            if count > a[y]:
                a.pop(y)
                a.setdefault(word, count)
    a = [(key, value) for key, value in a.items()]
    a.sort(reverse=True, key=lambda x: x[1])
    # Output "count<TAB>num", most frequent first.
    for b in a:
        print('%s\t%s' % (b[1], b[0]))

if __name__ == '__main__':
    reduce()
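To check the logic without a cluster, the map, sort, and reduce phases can be simulated in one process. The sketch below is an illustration under assumptions, not the author's code: `simulate_top_n` and the sample lines are hypothetical, a plain `sort` stands in for the shuffle phase, and `heapq.nlargest` replaces the dictionary-based selection used in the reducer above.

```python
import heapq
from itertools import groupby


def simulate_top_n(lines, n):
    """Simulate map -> sort -> reduce locally; return (num, count) pairs, most frequent first."""
    # Map phase: one (num, 1) pair per number.
    pairs = [(num, 1) for line in lines for num in line.split()]
    # Stand-in for the shuffle phase: sort by key so equal keys are adjacent.
    pairs.sort(key=lambda kv: kv[0])
    # Reduce phase: sum the counts of each run of equal keys.
    totals = ((num, sum(c for _, c in group))
              for num, group in groupby(pairs, key=lambda kv: kv[0]))
    # Keep only the n entries with the largest totals.
    return heapq.nlargest(n, totals, key=lambda kv: kv[1])


# Hypothetical sample input in the same format as the data above.
sample = ["4234 879 123", "879 123 879", "123 879"]
for num, count in simulate_top_n(sample, 2):
    print('%s\t%s' % (count, num))
```

Removing the `sort` call makes `groupby` emit the same number several times with partial counts, which shows why the reducer depends on sorted input. Note also that each reducer instance only sees its own share of the keys, so a global top n would presumably need a single reducer or a final merge step.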