WordCount in Python (Notes)

Original

GrowthDiary007

2020-06-17 07:27

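It suddenly occurred to me that I once used a Python version of WordCount but never wrote it up. It is not too late to do so, and I may well need it again someday.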

I have not used MapReduce much lately, but my impression is that quite a few business scenarios can be implemented by adapting WordCount.

The Python implementation (one shell script and one Python script):

#!/usr/bin/bash
##############################################################
# File Name: wordCount.sh
# Author: 
# mail: 
#=============================================================

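# Run from the script's own directory so that -file can find wordCount.py.
cd "$(dirname "$0")" || exit 1

# Load the environment; HADOOP_STREAMING (path to the streaming jar) is assumed to be set there.
source ~/.bashrc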


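# Quote the glob so that HDFS, not the local shell, expands it.
in_dir='/user/lyx/input/*'
out_dir=/user/lyx/output/

# Remove any previous output; the job fails if the output directory already exists.
hadoop fs -rm -r -f -skipTrash "$out_dir"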

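# mapreduce.job.maps / mapreduce.job.reduces are the current names of the
# deprecated mapred.map.tasks / mapred.reduce.tasks; the map count is only a hint.
hadoop jar "$HADOOP_STREAMING" \
    -D mapreduce.job.name="word count test" \
    -D mapreduce.job.queuename=root.xxxxxxxxxxxxx \
    -D mapreduce.job.maps=500 \
    -D mapreduce.job.reduces=1000 \
    -input "${in_dir}" \
    -output "${out_dir}" \
    -file wordCount.py \
    -mapper "python wordCount.py mapper" \
    -reducer "python wordCount.py reducer"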

#!/usr/bin/python
# -*- coding: utf-8 -*-
##############################################################
# File Name: wordCount.py
# Author: 
# mail: 
# =============================================================

from __future__ import absolute_import
from __future__ import print_function
from __future__ import division

import sys


def mapper():
    """Emit "word<TAB>1" for every whitespace-separated word read from stdin."""
    for line in sys.stdin:
        lsp = line.strip().split()
        for word in lsp:
            print(word + "\t" + str(1))


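def reducer():
    """Sum the counts of each word; Hadoop delivers the input sorted by key."""
    current_key = None
    current_count = 0

    for line in sys.stdin:
        lsp = line.strip().split("\t")
        if len(lsp) < 2:
            continue  # skip malformed records

        key = lsp[0]
        count = int(lsp[1])

        if current_key == key:
            current_count += count
        else:
            # Key changed: flush the finished word before starting the next one.
            if current_key is not None:
                print(current_key + "\t" + str(current_count))
            current_key = key
            current_count = count

    # Flush the last word; skip this when the reducer received no input at all,
    # which would otherwise raise a TypeError on None + str.
    if current_key is not None:
        print(current_key + "\t" + str(current_count))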


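if __name__ == "__main__":
    # The first argument selects the role: "mapper" or "reducer".
    if len(sys.argv) < 2 or sys.argv[1] not in ("mapper", "reducer"):
        sys.exit("usage: wordCount.py mapper|reducer")
    if sys.argv[1] == "mapper":
        mapper()
    else:
        reducer()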
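
To sanity-check the logic without a cluster, the map, sort, and reduce stages can be imitated in a single process. The following is only a minimal sketch and not part of the original scripts: the names `map_words`, `reduce_counts`, and `run_local` are hypothetical, and `sorted()` stands in for the shuffle/sort step that Hadoop Streaming performs between the mapper and the reducer.

```python
from itertools import groupby


def map_words(lines):
    """Mimic mapper(): yield a (word, 1) pair for each whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_counts(sorted_pairs):
    """Mimic reducer(): sum the counts of consecutive pairs that share a key."""
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Chain map -> sort -> reduce; sorted() plays the role of Hadoop's shuffle."""
    return list(reduce_counts(sorted(map_words(lines))))


if __name__ == "__main__":
    sample = ["hello world", "hello hadoop"]
    for word, count in run_local(sample):
        print(word + "\t" + str(count))
```

The reducer only compares each key with the previous one, so it relies on its input arriving sorted by key; the sketch makes that dependency explicit by sorting before grouping.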

Note: these are study notes. If you spot any problems or mistakes, corrections are welcome. Thanks.
