8. A Hands-On Guide to Programming the MapReduce Word Count Example

Original

zipo

2020-06-27 03:54

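I. Set up Eclipse and Maven on Linux and create a Maven project

2. Install the JDK and configure its environment variables.

3. Set up Maven and configure its environment variables as the root user.

4. Configure the Maven repository.

5. Unpack Eclipse.

6. Open Eclipse as a regular user and configure Maven.

Edit pom.xml

Configure the output path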

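II. Writing the WordCount program from the standard boilerplate template

A simple MapReduce program only needs to specify map(), reduce(), the input, and the output; the framework takes care of everything else.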

package org.apache.hadoop.mr01;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool {
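    // The four type parameters: the first two are the input key/value types,
    // the last two are the output key/value types.
    public static class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable mapOutputValue = new IntWritable(1);
        private final Text mapOutputKey = new Text();

        // Called once for every line of input that is read.
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            // Split on runs of whitespace and skip empty tokens, so that
            // leading or repeated spaces are not counted as words.
            for (String word : line.split("\\s+")) {
                if (word.isEmpty()) {
                    continue;
                }
                mapOutputKey.set(word);
                context.write(mapOutputKey, mapOutputValue);
            }
        }
    }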
    public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable outputValue = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            // Iterate over the list of values for this key and add them up.
            for (IntWritable value : values) {
                sum += value.get();
            }
            outputValue.set(sum);
            // Emit the total count for this word.
            context.write(key, outputValue);
        }
    }
    @Override
    public int run(String[] args) throws Exception {
        // Get the configuration that ToolRunner set on this Tool.
        Configuration configuration = super.getConf();
        // Create the job.
        Job job = Job.getInstance(configuration, this.getClass().getSimpleName());
        job.setJarByClass(this.getClass());
        // Input
        Path inPath = new Path(args[0]);
        FileInputFormat.addInputPath(job, inPath);
        // Mapper
        job.setMapperClass(WordCountMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // Reducer
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Output
        Path outPath = new Path(args[1]);
        FileOutputFormat.setOutputPath(job, outPath);

        // Submit the job and wait for it to finish.
        boolean isSuccess = job.waitForCompletion(true);
        return isSuccess ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        int status = ToolRunner.run(configuration, new WordCount(), args);
        System.exit(status);
    }
}
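
The tokenization step inside map() can be tried outside Hadoop. The following is a minimal sketch in plain Java; the class TokenizeSketch and the helper tokenize are hypothetical names used only for illustration. It compares splitting on a single space with splitting on runs of whitespace: with a single-space delimiter, consecutive spaces produce empty tokens, which a word count would then tally as if they were words.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenizeSketch {

    // Hypothetical helper: split a line on runs of whitespace and drop empty tokens.
    public static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.split("\\s+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String line = "hello  hadoop hello"; // two spaces after the first word

        // Splitting on a single space keeps an empty token between the two spaces.
        System.out.println(line.split(" ").length); // prints 4

        // Splitting on whitespace runs yields only the three words.
        System.out.println(tokenize(line)); // prints [hello, hadoop, hello]
    }
}
```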

Package the JAR and run a test on YARN

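III. Using the WordCount program to understand how MapReduce analyzes data in parallel

By default, each input split is processed by one map task. The output of the map tasks then goes through an intermediate data transformation before being handed to one or more reduce tasks.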
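
When there is more than one reduce task, every occurrence of a given word has to reach the same reducer, otherwise its count would be split across several outputs. Hadoop's default HashPartitioner handles this by taking the key's hash code modulo the number of reduce tasks. The sketch below reproduces that formula in plain Java. It is an illustration only: the class name and the reducer count of 3 are assumptions, and it hashes a String rather than a Hadoop Text, so the partition numbers it prints will not match those of a real job.

```java
public class PartitionSketch {

    // Same formula as Hadoop's default HashPartitioner: clear the sign bit
    // of the hash code, then take it modulo the number of reduce tasks.
    public static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 3; // assumed reducer count, for illustration only
        String[] keys = {"hadoop", "spark", "hadoop", "yarn"};
        for (String key : keys) {
            // Both "hadoop" records are assigned to the same reducer.
            System.out.println(key + " -> reducer " + partitionFor(key, numReduceTasks));
        }
    }
}
```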

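1. The input stage reads the data line by line and converts each line into a <key,value> pair that becomes the map input. The key is the offset of the line within the file, and the value is the content of the line.
2. The map splits the line into words and emits output records of the form <word,1>.
3. After the intermediate transformation, the records enter the reduce, whose input has the form <key,values>, for example <hadoop,list(1,1,1,1)>. The reduce adds up the values.
4. The output stage writes the data. By default each <key,value> pair is written as one line, with the key and the value separated by a tab character.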
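
The four steps above can be imitated in plain Java, without Hadoop, to see what the data looks like at each stage. This is only a single-process sketch under stated assumptions: the class and method names are hypothetical, the sample input is made up, lines end in "\n", and a TreeMap stands in for the framework's grouping and sorting of keys.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountFlowSketch {

    // Step 1: pair each line with its byte offset in the file (assumes "\n" line endings).
    public static Map<Long, String> readLines(String fileContent) {
        Map<Long, String> records = new TreeMap<>();
        long offset = 0;
        for (String line : fileContent.split("\n")) {
            records.put(offset, line);
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for '\n'
        }
        return records;
    }

    // Step 2 plus the intermediate stage: emit <word,1> per word, then group the 1s by word.
    public static Map<String, List<Integer>> mapAndShuffle(Map<Long, String> records) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : records.values()) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
                }
            }
        }
        return grouped;
    }

    // Steps 3 and 4: sum each list and format one "key<TAB>value" line per word.
    public static List<String> reduceAndFormat(Map<String, List<Integer>> grouped) {
        List<String> output = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int one : entry.getValue()) {
                sum += one;
            }
            output.add(entry.getKey() + "\t" + sum);
        }
        return output;
    }

    public static void main(String[] args) {
        String fileContent = "hadoop yarn\nhadoop mapreduce\n"; // made-up sample input
        Map<Long, String> records = readLines(fileContent);
        System.out.println(records);  // prints {0=hadoop yarn, 12=hadoop mapreduce}
        Map<String, List<Integer>> grouped = mapAndShuffle(records);
        System.out.println(grouped);  // prints {hadoop=[1, 1], mapreduce=[1], yarn=[1]}
        reduceAndFormat(grouped).forEach(System.out::println);
    }
}
```

The last statement prints three lines, one per word, each with the word, a tab, and its count: hadoop 2, mapreduce 1, yarn 1. In a real job these stages run on different machines and the grouping is done by the framework, but the shape of the records at each step is the same.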
