Hadoop Python mrjob word count

I. Implementing WordCount with mrjob

# -*- coding: utf-8 -*-
# @Time    : 2019/12/1 9:45
# @Author  :

from mrjob.job import MRJob


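class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        # Called once per input line; the input key is unused
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        # Sum all values emitted for each key ("chars", "words", "lines")
        yield key, sum(values)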


if __name__ == '__main__':
    MRWordFrequencyCount.run()

1. Local test
python3 mr_word_count.py text.txt
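
To get a rough idea of the totals this local run should report, the map, shuffle, and reduce phases can be imitated in plain Python without mrjob. This is only a sketch: the function name simulate_word_frequency_count and the sample lines are made up for illustration, and it assumes each line reaches the mapper without its trailing newline.

```python
from collections import defaultdict


def mapper(line):
    # Same three key/value pairs the MRWordFrequencyCount mapper yields per line
    yield "chars", len(line)
    yield "words", len(line.split())
    yield "lines", 1


def simulate_word_frequency_count(lines):
    # Shuffle phase: collect every mapper value under its key
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    # Reduce phase: sum the grouped values, as the reducer does
    return {key: sum(values) for key, values in grouped.items()}


if __name__ == "__main__":
    sample = ["hello hadoop", "hello mrjob"]  # made-up input lines
    print(simulate_word_frequency_count(sample))
```

Note that the job reports one total each for characters, words, and lines, not a count per distinct word.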


2. Submit the job to the Hadoop cluster
# Make sure text.txt already exists on the Hadoop cluster
hadoop fs -ls /
hadoop fs -cat /text.txt
# Delete the output directory generated by a previous run
hadoop fs -rm -r /output

# Submit the job
python3 mr_word_count.py -r hadoop hdfs:///text.txt -o hdfs:///output

3. Problems you may encounter

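1. Error when uploading a file to Hadoop

could only be replicated to 0 nodes instead of minReplication (=1)
Running jps shows that the DataNode did not start.
Cause:
Possibly running hadoop namenode -format several times. Each format gives the NameNode a new clusterID, so it no longer matches the clusterID stored by the DataNode.
Solution:
Change the clusterID value in dfs/data/current/VERSION to the value in dfs/name/current/VERSION.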
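
The manual edit above could also be scripted. The following is a minimal sketch, assuming both VERSION files are plain key=value text containing a clusterID= line; the helper names read_cluster_id and sync_cluster_id and the IDs in the demo are hypothetical.

```python
import re


def read_cluster_id(version_text):
    # VERSION files are key=value lines; pick out the clusterID entry
    match = re.search(r"^clusterID=(.*)$", version_text, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no clusterID line found")
    return match.group(1).strip()


def sync_cluster_id(name_version_text, data_version_text):
    # Return the DataNode VERSION text with the NameNode's clusterID substituted in
    cluster_id = read_cluster_id(name_version_text)
    return re.sub(r"^clusterID=.*$",
                  lambda _m: "clusterID=" + cluster_id,
                  data_version_text,
                  flags=re.MULTILINE)


if __name__ == "__main__":
    # Made-up file contents; real files live under dfs/name and dfs/data
    name_text = "namespaceID=1\nclusterID=CID-aaa\ncTime=0\n"
    data_text = "storageID=DS-1\nclusterID=CID-bbb\ncTime=0\n"
    print(sync_cluster_id(name_text, data_text))
```

In practice one would read the two files, write the result back to dfs/data/current/VERSION after making a backup, and then restart the DataNode.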

2. Retrying connect to server: 0.0.0.0/0.0.0.0:8032

Possible causes:
1. The server does not have enough resources
2. yarn-site.xml is misconfigured
For a cluster HA setup, see: https://blog.csdn.net/Cocktail_py/article/details/102631199
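
One thing worth checking for the second cause: 0.0.0.0:8032 is what the ResourceManager address falls back to when yarn.resourcemanager.hostname is not set, so the client may simply not know where the ResourceManager runs. The sketch below looks up a property in the text of a yarn-site.xml; the helper name get_property and the hostname master are assumptions for illustration, not values from this cluster.

```python
import xml.etree.ElementTree as ET


def get_property(config_xml, name):
    # Hadoop *-site.xml files hold <property> entries with <name> and <value> children
    root = ET.fromstring(config_xml)
    for prop in root.findall("property"):
        if prop.findtext("name", default="").strip() == name:
            return prop.findtext("value", default="").strip()
    return None  # property is not set in this file


# Hypothetical yarn-site.xml content
SAMPLE = """<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    print(get_property(SAMPLE, "yarn.resourcemanager.hostname"))
```

If the lookup returns None for the real file, the default is in effect and setting the property on every node is a reasonable first step.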

3. subprocess failed with code 127
Reference: https://blog.csdn.net/Saltwind/article/details/82913477
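
As general background rather than a summary of the linked post: in POSIX shells, exit status 127 conventionally means the command was not found, so one plausible cause is that the interpreter the job invokes (for example python3) is missing from PATH on the worker nodes. The sketch below reproduces that status locally; the helper shell_exit_code and the command name are made up, and it assumes sh is available.

```python
import subprocess


def shell_exit_code(command):
    # Run the command through sh and report the exit status the shell returns
    result = subprocess.run(["sh", "-c", command], capture_output=True)
    return result.returncode


if __name__ == "__main__":
    # Hypothetical command name that should not exist on any machine
    print(shell_exit_code("no_such_interpreter_xyz --version"))
```

If this turns out to be the cause, installing the interpreter on every node or pointing mrjob at the right one through its python_bin option (--python-bin) may resolve it.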

II. Top-N word counts with mrjob

# -*- coding: utf-8 -*-
# @Time    : 2019/12/1 9:45
# @Author  :

from mrjob.job import MRJob
from mrjob.step import MRStep
import heapq


class TopNWords(MRJob):
    def mapper(self, _, line):
        if line.strip() != "":
            for word in line.strip().split():
                yield word, 1

    # Runs between the mapper and the reducer to pre-aggregate the mapper's output
    def combiner(self, word, counts):
        yield word, sum(counts)

    def reducer_sum(self, word, counts):
        yield None, (sum(counts), word)

    # Use heapq to rank the (count, word) pairs and take the 2 largest
    def top_n_reducer(self, _, word_cnts):
        for cnt, word in heapq.nlargest(2, word_cnts):
            yield word, cnt

    # Override steps() to specify the custom mapper, combiner and reducer methods
    # The list of MRStep objects sets the execution order
    def steps(self):
        return [
            MRStep(mapper=self.mapper,
                   combiner=self.combiner,
                   reducer=self.reducer_sum),
            MRStep(reducer=self.top_n_reducer)
        ]


def main():
    TopNWords.run()


if __name__ == '__main__':
    main()
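
The two steps can be checked without a cluster by imitating them in plain Python. This is a sketch only: top_n_words and the sample lines are hypothetical, and Counter stands in for the mapper, combiner, and reducer_sum of the first step.

```python
import heapq
from collections import Counter


def top_n_words(lines, n=2):
    # Step 1 (mapper + combiner + reducer_sum): total count per word
    counts = Counter(word for line in lines for word in line.split())
    # Step 2 (top_n_reducer): every (count, word) pair was emitted under the
    # same key (None), so one reducer call sees them all and picks the n largest
    pairs = [(cnt, word) for word, cnt in counts.items()]
    return [(word, cnt) for cnt, word in heapq.nlargest(n, pairs)]


if __name__ == "__main__":
    sample = ["a b a", "b a c"]  # made-up input lines
    print(top_n_words(sample))
```

Placing the count first in each pair matters: heapq.nlargest compares the pairs element by element, so they are ranked by count, and ties fall back to comparing the words themselves.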
