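I suddenly remembered that I once used a Python version of WordCount and never wrote it up. Better late than never, so here it is, tidied up; it may well come in handy again.
I haven't used MapReduce much lately, but it seems to me that quite a few business scenarios can be implemented by adapting WordCount.
The Python implementation (one shell script, one Python script):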
#!/usr/bin/bash
##############################################################
# File Name: wordCount.sh
# Author:
# mail:
#=============================================================
# Run from the script's own directory so that wordCount.py resolves.
cd "$(dirname "$0")"
# HADOOP_STREAMING (path to the streaming jar) is presumably set in ~/.bashrc.
source ~/.bashrc
# Quoted so the local shell leaves the glob alone; Hadoop expands it.
in_dir='/user/lyx/input/*'
out_dir=/user/lyx/output/
# Remove any previous output; the job fails if the output directory exists.
hadoop fs -rm -r -skipTrash "$out_dir"
hadoop jar "$HADOOP_STREAMING" \
    -D mapreduce.job.name="word count test" \
    -D mapreduce.job.queuename=root.xxxxxxxxxxxxx \
    -D mapred.map.tasks=500 \
    -D mapred.reduce.tasks=1000 \
    -input "${in_dir}" \
    -output "${out_dir}" \
    -file wordCount.py \
    -mapper "python wordCount.py mapper" \
    -reducer "python wordCount.py reducer"

#!/usr/bin/python
# -*- coding=utf-8 -*-
##############################################################
# File Name: wordCount.py
# Author:
# mail:
# =============================================================
from __future__ import absolute_import
from __future__ import print_function
from __future__ import division
import sys


def mapper():
    # Emit "<word>\t1" for every whitespace-separated token on stdin.
    for line in sys.stdin:
        lsp = line.strip().split()
        for word in lsp:
            print(word + "\t" + str(1))


def reducer():
    # Hadoop sorts mapper output by key, so equal keys arrive on adjacent lines.
    current_key = None
    current_count = 0
    for line in sys.stdin:
        lsp = line.strip().split("\t")
        if len(lsp) < 2:
            continue
        key = lsp[0]
        count = int(lsp[1])
        if current_key == key:
            current_count += count
        else:
            # Key changed: flush the total for the previous key.
            if current_key is not None:
                print(current_key + "\t" + str(current_count))
            current_key = key
            current_count = count
    # Flush the last key; skipped when this reducer received no input.
    if current_key is not None:
        print(current_key + "\t" + str(current_count))


if __name__ == "__main__":
    mode = sys.argv[1] if len(sys.argv) > 1 else ""
    if mode == "mapper":
        mapper()
    elif mode == "reducer":
        reducer()
    else:
        sys.exit("usage: python wordCount.py mapper|reducer")
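
To check the logic without a cluster, the map, sort, and reduce stages can be imitated in a single process. The sketch below is not part of the original scripts. The helper names map_words, reduce_counts, and run_local are hypothetical, sorted() stands in for Hadoop's shuffle/sort step, and itertools.groupby replaces the manual current_key bookkeeping used in reducer().

```python
from itertools import groupby


def map_words(lines):
    """Yield "word\\t1" records, mirroring what mapper() prints."""
    for line in lines:
        for word in line.strip().split():
            yield word + "\t1"


def reduce_counts(sorted_records):
    """Sum the counts of adjacent records that share a key."""
    pairs = (record.split("\t") for record in sorted_records)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(int(count) for _, count in group)


def run_local(lines):
    """Simulate map -> sort by key -> reduce in one process."""
    # Assumption: sorting by key is a stand-in for Hadoop's shuffle/sort.
    records = sorted(map_words(lines), key=lambda r: r.split("\t")[0])
    return list(reduce_counts(records))


if __name__ == "__main__":
    sample = ["hello world", "hello hadoop"]
    for key, total in run_local(sample):
        print(key + "\t" + str(total))
```

The sort is what makes the reducer work: both groupby and the current_key loop only merge neighbouring records, so unsorted input would produce several partial totals for the same word.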
Note: these are my own study notes. If you spot any problems or mistakes, corrections are welcome. Thanks.