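Hadoop WordCount: A Detailed Walkthrough

1. MapReduce Theory Overview

1.1 The MapReduce Programming Model

MapReduce follows a "divide and conquer" approach: an operation on a large data set is distributed to the worker nodes managed by a master node, and the intermediate results from those nodes are then combined into the final result. Put simply, MapReduce is "decompose the task, then aggregate the results."

In Hadoop, two machine roles take part in executing MapReduce jobs: the JobTracker, which schedules work, and the TaskTracker, which executes it. A Hadoop cluster has only one JobTracker.

In distributed computing, the MapReduce framework takes care of the hard problems of parallel programming, such as distributed storage, job scheduling, load balancing, fault tolerance, and network communication. It abstracts the processing into two functions, map and reduce: map breaks a job into many tasks, and reduce aggregates the results of those tasks.

Note that a data set (or task) processed with MapReduce must have this property: it can be broken into many small data sets, each of which can be processed completely in parallel.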
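The requirement that the data can be split into independently processable pieces can be illustrated without Hadoop at all. The following is a minimal plain-Java sketch, not Hadoop code: the class name SplitAndSumSketch, its methods, and the chunk size are hypothetical. It sums each chunk on its own and then adds the partial sums together.

```java
import java.util.stream.IntStream;

// Hypothetical illustration of "split, process in parallel, combine".
class SplitAndSumSketch {

    // Processes one chunk independently of all the others.
    static long sumChunk(int[] data, int from, int to) {
        long sum = 0;
        for (int i = from; i < to; i++) {
            sum += data[i];
        }
        return sum;
    }

    // Splits the array into chunks, sums the chunks in parallel,
    // then adds the partial sums together.
    static long parallelSum(int[] data, int chunkSize) {
        int chunks = (data.length + chunkSize - 1) / chunkSize;
        return IntStream.range(0, chunks)
                .parallel()
                .mapToLong(c -> sumChunk(data, c * chunkSize,
                        Math.min((c + 1) * chunkSize, data.length)))
                .sum();
    }
}
```

This works because no chunk needs anything from another chunk; a computation where each step depends on the previous step's result would not decompose this way.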

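1.2 How MapReduce Processes Data

In Hadoop, every MapReduce task is initialized as a Job, and each Job has two phases: the map phase and the reduce phase. Each phase is represented by a function, the map function and the reduce function. The map function takes an input of the form <key,value> and produces intermediate output of the same <key,value> form. The reduce function takes an input of the form <key,(list of values)>, processes that collection of values, and produces zero or one output per call; the reduce output is also in <key,value> form.

Figure: How MapReduce processes a large data set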
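To make the two function shapes concrete, here is a small in-memory sketch in plain Java. It is not the Hadoop API: the class MapReduceModelSketch and its methods are hypothetical, and the grouping step that Hadoop performs between map and reduce is imitated with a sorted map.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical in-memory model of map -> group by key -> reduce.
class MapReduceModelSketch {

    // map: one input line -> a list of intermediate <word,1> pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            out.add(new AbstractMap.SimpleEntry<>(tokens.nextToken(), 1));
        }
        return out;
    }

    // reduce: <word,(list of counts)> -> the total for that word.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    static Map<String, Integer> run(List<String> lines) {
        // Group the intermediate values by key; TreeMap keeps keys sorted.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                        .add(pair.getValue());
            }
        }
        // Call reduce once per key.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints {Hadoop=1, Hello=2, World=1}
        System.out.println(run(Arrays.asList("Hello World", "Hello Hadoop")));
    }
}
```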

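2. Running the WordCount Program

Word count is one of the simplest programs that still captures the idea of MapReduce, and is often called the MapReduce "Hello World". Its complete source can be found in the "src/examples" directory of the Hadoop distribution. The program counts how many times each word occurs in a set of text files, as shown in the figure below.

2.1 Preparation

Log in to the "Master.Hadoop" server as the ordinary user "hadoop".

1) Create local sample files

First, create a folder named "file" under "/home/hadoop".

Then create two text files, file1.txt and file2.txt, with the content "Hello World" in file1.txt and "Hello Hadoop" in file2.txt.

2) Create the input folder on HDFS

3) Upload the files from the local "file" folder to the cluster's input directory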

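2.2 Running the Example

1) Run the WordCount program on the cluster

Note: input is used as the input directory and output as the output directory.

The precompiled WordCount jar, "hadoop-examples-1.0.0.jar", is located under "/usr/hadoop", so give the full path when running the command below; otherwise Hadoop reports that the jar cannot be found.

2) Information displayed while MapReduce runs

The hadoop command starts a JVM to run the MapReduce program, picks up the Hadoop configuration automatically, and adds the Hadoop libraries (and their dependencies) to the classpath. The output above is the run log of the Hadoop Job. It shows that the Job was assigned the ID job_201202292213_0002 and that there were two input files (Total input paths to process : 2). It also reports the map input and output records (record counts and byte counts) and the reduce input and output records. In this example there are 2 map tasks and 1 reduce task, the map input is 2 records, and the map output is 4 records.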

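2.3 Viewing the Results

1) List the contents of the output directory on HDFS

The figure above shows that three files were generated; the results are in "part-r-00000".

2) View the contents of the result file

3. WordCount Source Code Analysis

3.1 Special Data Types

Hadoop provides the following data types. They all implement the WritableComparable interface, so data of these types can be serialized for network transfer and file storage, and can be compared for ordering.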

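BooleanWritable: standard boolean value

ByteWritable: single-byte value

DoubleWritable: double-precision floating-point number

FloatWritable: single-precision floating-point number

IntWritable: integer

LongWritable: long integer

Text: text stored in UTF-8 format

NullWritable: used when the key or value in <key,value> is empty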
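What "serializable and comparable" means for these types can be sketched in plain Java. The class below is a hypothetical stand-in for an IntWritable-like type, not the Hadoop class. It borrows the method names write(DataOutput) and readFields(DataInput) from Hadoop's Writable interface; the roundTrip helper is invented for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for an IntWritable-like type; not the Hadoop class.
class IntBoxSketch implements Comparable<IntBoxSketch> {
    private int value;

    IntBoxSketch() { }

    IntBoxSketch(int value) {
        this.value = value;
    }

    int get() {
        return value;
    }

    // Serializes the value as 4 bytes.
    void write(DataOutput out) throws IOException {
        out.writeInt(value);
    }

    // Overwrites this object's state with the value read from the stream.
    void readFields(DataInput in) throws IOException {
        value = in.readInt();
    }

    // Ordering is what allows keys to be sorted between map and reduce.
    @Override
    public int compareTo(IntBoxSketch other) {
        return Integer.compare(value, other.value);
    }

    // Assumed helper: writes the object to bytes and reads it back into a new one.
    static IntBoxSketch roundTrip(IntBoxSketch original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            IntBoxSketch copy = new IntBoxSketch();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Note that readFields fills in an existing object rather than returning a new one, which is why the WordCount listings in this article can reuse a single Text or IntWritable instance across records.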

3.2 The Old WordCount

1) Source code

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount {

    public static class Map extends MapReduceBase implements
            Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    public static class Reduce extends MapReduceBase implements
            Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}

2) Analysis of the main method

public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
}

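First, Job initialization. The main function uses the JobConf class to initialize the MapReduce Job and then calls setJobName() to name it. A well-chosen name makes the Job easier to find and monitor on the JobTracker and TaskTracker web pages.

JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");

Next, the key and value types of the Job's output <key,value> are set. Because the result is <word,count>, the key is set to "Text", the counterpart of Java's String, and the value is set to "IntWritable", the counterpart of Java's int.

conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);

Then the classes that handle the Job's Map (splitting), Combiner (merging intermediate results), and Reduce (merging) steps are set. Here the Reduce class is also used to merge the intermediate results produced by Map, which reduces the volume of data sent over the network.

conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);

Next, setInputFormat() and setOutputFormat() select the input and output formats. The input and output paths themselves are set separately, through FileInputFormat.setInputPaths() and FileOutputFormat.setOutputPath().

conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);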

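(1) InputFormat and InputSplit

An InputSplit is what Hadoop passes to each individual map task. It does not store the data itself; it stores the length of the split and an array of locations where the data resides. How InputSplits are generated is determined by the InputFormat.

When data is passed to a map task, the task hands its input split to the InputFormat, which calls getRecordReader() to create a RecordReader. The RecordReader then uses createKey() and createValue() to create the <key,value> pairs that map processes. In short, the InputFormat is what produces the <key,value> pairs for map.

Hadoop predefines several ways to convert different kinds of input data into <key,value> pairs that map can process. They all derive from InputFormat: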

    InputFormat
        |
        |---BaileyBorweinPlouffe.BbpInputFormat
        |---ComposableInputFormat
        |---CompositeInputFormat
        |---DBInputFormat
        |---DistSum.Machine.AbstractInputFormat
        |---FileInputFormat
            |---CombineFileInputFormat
            |---KeyValueTextInputFormat
            |---NLineInputFormat
            |---SequenceFileInputFormat
            |---TeraInputFormat
            |---TextInputFormat

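TextInputFormat is Hadoop's default input format. With TextInputFormat, each file (or part of a file) is given separately as input to a map task; this behavior is inherited from FileInputFormat. Each line of data then becomes one record, expressed as a <key,value> pair:

The key is the byte offset at which the record's line starts within the file; its type is LongWritable.

The value is the content of the line; its type is Text.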
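The <offset,line> records described above can be imitated in a few lines of plain Java. This is only a sketch under assumptions: LineOffsetSketch is a hypothetical class, not Hadoop's record reader, and the two-line input is made up for illustration (in this article each sample file holds a single line, so its only key is 0).

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical imitation of how a text file becomes <byte offset, line> records.
class LineOffsetSketch {

    // Returns one <offset of line start, line text> entry per line.
    static Map<Long, String> records(String fileContent) {
        Map<Long, String> out = new LinkedHashMap<>();
        byte[] bytes = fileContent.getBytes(StandardCharsets.UTF_8);
        int lineStart = 0;
        for (int i = 0; i <= bytes.length; i++) {
            boolean atEnd = (i == bytes.length);
            // A record ends at '\n', or at end of input if text is still pending.
            if (atEnd ? lineStart < i : bytes[i] == '\n') {
                int end = i;
                // Drop a trailing '\r' so Windows line endings yield the same text.
                if (end > lineStart && bytes[end - 1] == '\r') {
                    end--;
                }
                out.put((long) lineStart,
                        new String(bytes, lineStart, end - lineStart, StandardCharsets.UTF_8));
                lineStart = i + 1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints {0=Hello World, 12=Hello Hadoop}
        System.out.println(records("Hello World\nHello Hadoop\n"));
        // Prints {0=Hello World, 13=Hello Hadoop}
        System.out.println(records("Hello World\r\nHello Hadoop\r\n"));
    }
}
```

The second call shows why the keys differ between Linux and Windows files: the line-break bytes count toward the offset of the following line even though they are not part of the value.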

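(2) OutputFormat

Each input format has a corresponding output format. The default output format is TextOutputFormat, which, like its input counterpart, writes each record as one line of a text file. Its keys and values can be of any type, because the framework calls toString() on them to convert them to strings before writing.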
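As a rough picture of the resulting file, the sketch below formats records the way the text output is described here: one record per line, with key and value converted through toString(). The tab separator reflects TextOutputFormat's default as far as I know; the class TextOutputSketch itself is hypothetical and does not touch HDFS.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical formatter: one "key<TAB>value" line per record.
class TextOutputSketch {

    static String format(Map<?, ?> records) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<?, ?> e : records.entrySet()) {
            // Keys and values may be any type; only toString() is required.
            sb.append(e.getKey().toString())
              .append('\t')
              .append(e.getValue().toString())
              .append('\n');
        }
        return sb.toString();
    }

    // Formats the word counts for the two sample files used in this article.
    static String demo() {
        Map<String, Integer> counts = new TreeMap<>();
        counts.put("Hello", 2);
        counts.put("World", 1);
        counts.put("Hadoop", 1);
        return format(counts);
    }
}
```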

3) Analysis of the map method in the Map class

public static class Map extends MapReduceBase implements
        Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}

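The Map class extends MapReduceBase and implements the Mapper interface. Mapper is a generic interface with four type parameters, which specify the map's input key type, input value type, output key type, and output value type. In this example TextInputFormat is used, whose output key is LongWritable and output value is Text, so the map input type is <LongWritable,Text>. The example needs to emit pairs of the form <word,1>, so the output key type is Text and the output value type is IntWritable.

A class implementing this interface must also implement the map method, which does the actual work on the input. In this example, the map method splits each input line on whitespace and uses the OutputCollector to collect the resulting <word,1> pairs.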
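It helps to see exactly what the tokenizing step produces. The sketch below uses only java.util.StringTokenizer, the same class the map method relies on; the class TokenizeSketch and the sample line are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Shows how StringTokenizer with default delimiters splits a line.
class TokenizeSketch {

    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        // Default delimiters are space, tab, newline, carriage return, form feed.
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            out.add(tokenizer.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [Hello,, World, Hello, hadoop]
        System.out.println(tokens("Hello, World\tHello  hadoop"));
    }
}
```

Runs of whitespace produce no empty tokens, but punctuation and letter case are left untouched, so with this mapper "Hello," and "Hello" would be counted as two different words.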

4) Analysis of the reduce method in the Reduce class

public static class Reduce extends MapReduceBase implements
        Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

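The Reduce class also extends MapReduceBase and implements the Reducer interface. Reduce takes the map output as its input, so its input type is <Text,IntWritable>. Its output is a word and its count, so the output type is also <Text,IntWritable>. The Reduce class must implement the reduce method, which uses the input key as the output key and adds up the values it receives to produce the output value.

3.3 The New WordCount

1) Source code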

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

1) The Map phase

public static class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}

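The Map phase extends the Mapper class in the org.apache.hadoop.mapreduce package and overrides its map method. Adding two statements to the map method that print the key and value to the console shows that the value holds one line of the text file (with the line break as the end-of-line marker) and the key is the offset of the first character of that line from the start of the file. The StringTokenizer class then splits each line into individual words, and <word,1> is emitted as the output of the map method. The MapReduce framework handles the rest.

2) The Reduce phase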

public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

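The Reduce phase extends the Reducer class in the org.apache.hadoop.mapreduce package and overrides its reduce method. In the <key,values> passed to reduce, key is a single word and values is the list of counts for that word, which the framework has grouped from the Map output. The reduce method therefore only has to iterate over values and sum them to get the total count for the word.

3) Running the MapReduce job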

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
        System.err.println("Usage: wordcount <in> <out>");
        System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

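In MapReduce, a Job object manages and runs a computation, and the job's parameters are set through methods on Job. Here TokenizerMapper is set to handle the Map phase and IntSumReducer to handle both the Combine and Reduce phases. The output types of the Map and Reduce phases are also set: Text for the key and IntWritable for the value. The input and output paths are taken from the command-line arguments and set through FileInputFormat and FileOutputFormat respectively. Once the parameters are set, job.waitForCompletion() is called to run the job.

4. How WordCount Is Processed

This section walks through WordCount in more detail. The execution steps are as follows:

1) The files are divided into splits. Because the test files are small, each file is one split. Each split is then divided by line into <key,value> pairs, as shown in Figure 4-1. The MapReduce framework does this automatically. The offset (the key) includes the characters taken up by the line break, which differ between Windows and Linux.

Figure 4-1 Splitting

2) The <key,value> pairs are passed to the user-defined map method, which produces new <key,value> pairs, as shown in Figure 4-2.

Figure 4-2 Running the map method

3) Once the map method has emitted its <key,value> pairs, the Mapper sorts them by key and runs the Combine step, which adds up the values that share the same key. This gives the Mapper's final output, as shown in Figure 4-3.

Figure 4-3 Map-side sort and Combine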
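The effect of the map-side sort and Combine step can be imitated on a single split with plain Java. This is a hypothetical sketch: CombineSketch is not part of Hadoop, and the input line "Hello World Bye World" is an assumed example chosen because it contains a repeated word.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical model of one mapper's output after sorting and combining.
class CombineSketch {

    // Emits <word,1> for every token and immediately adds up equal keys.
    // TreeMap keeps the keys sorted, standing in for the map-side sort.
    static Map<String, Integer> mapAndCombine(String splitLine) {
        Map<String, Integer> combined = new TreeMap<>();
        StringTokenizer tokens = new StringTokenizer(splitLine);
        while (tokens.hasMoreTokens()) {
            combined.merge(tokens.nextToken(), 1, Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        // Prints {Bye=1, Hello=1, World=2}
        System.out.println(mapAndCombine("Hello World Bye World"));
    }
}
```

Four <word,1> pairs come out of map, but only three records leave the mapper after combining. On real data with many repeated words, this local aggregation is what cuts down the traffic sent to the reducers.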

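4) The Reducer first sorts the data received from the Mappers and then passes it to the user-defined reduce method, which produces new <key,value> pairs as the output of WordCount, as shown in Figure 4-4.

Figure 4-4 Reduce-side sort and output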
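The reduce side can be modeled the same way: take the already-combined output of each mapper, group the values by key in sorted order, and sum each group. The sketch below is hypothetical plain Java, not Hadoop code, and the two mapper outputs used in demo() are assumed sample data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of the reduce side: merge mapper outputs, then sum per key.
class ReduceSideSketch {

    static Map<String, Integer> shuffleAndReduce(List<Map<String, Integer>> mapperOutputs) {
        // Group values from all mappers by key; TreeMap stands in for the sort.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> output : mapperOutputs) {
            for (Map.Entry<String, Integer> e : output.entrySet()) {
                grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                        .add(e.getValue());
            }
        }
        // The user-defined reduce step: add up the list of values for each key.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            result.put(e.getKey(), sum);
        }
        return result;
    }

    // Assumed sample data: combined outputs of two mappers.
    static Map<String, Integer> demo() {
        Map<String, Integer> first = new TreeMap<>();
        first.put("Bye", 1);
        first.put("Hello", 1);
        first.put("World", 2);
        Map<String, Integer> second = new TreeMap<>();
        second.put("Bye", 1);
        second.put("Hadoop", 2);
        second.put("Hello", 1);
        return shuffleAndReduce(Arrays.asList(first, second));
    }
}
```

Because the Combine step may already have pre-summed some values, reduce receives counts such as 2 and not only 1s. Summing still gives the right answer, which is why the same class can serve as both combiner and reducer in WordCount.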

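5. Changes Between the Old and New MapReduce APIs

Release 0.20.0 of Hadoop introduced a completely new Java MapReduce API, sometimes referred to as "Context Objects".

The new API is not type-compatible with the old one, so existing applications have to be rewritten to take advantage of it.

There are several notable differences between the new and old APIs.

The new API favors abstract classes over interfaces, because they are easier to evolve. For example, a method with a default implementation can be added to an abstract class without breaking existing implementations of that class. In the new API, Mapper and Reducer are abstract classes.
The new API lives in the org.apache.hadoop.mapreduce package (and its subpackages). The old API is in org.apache.hadoop.mapred.
The new API makes extensive use of context objects, which let user code communicate with the MapReduce system. For example, MapContext essentially unifies the roles of JobConf, OutputCollector, and Reporter.
The new API supports both "push" and "pull" styles of iteration. In both APIs, key/value record pairs are pushed to the mapper, but the new API also lets a mapper pull records from within the map() method, and the same applies to the reducer. A useful example of the "pull" style is processing records in batches rather than one at a time.
The new API unifies configuration. The old API has a special JobConf object for job configuration, which is an extension of Hadoop's ordinary Configuration object. In the new API this distinction is gone, so job configuration is done through a Configuration. Job control is handled by the Job class rather than by JobClient, which no longer exists in the new API.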
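The "pull" style mentioned above can be sketched without Hadoop. In the sketch below, RecordContext is a hypothetical stand-in for a context object, and its methods nextValue(), getCurrentValue(), and write() are invented for this illustration. As far as I know, the new API exposes a comparable loop through Mapper.run() and the context's nextKeyValue(), but the code here does not depend on that.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical model of pull-style iteration used to process records in batches.
class PullStyleSketch {

    // Stand-in for a context the user code can pull records from.
    static class RecordContext {
        private final Iterator<String> records;
        private String current;
        final List<String> written = new ArrayList<>();

        RecordContext(List<String> records) {
            this.records = records.iterator();
        }

        // Advances to the next record; returns false when input is exhausted.
        boolean nextValue() {
            if (!records.hasNext()) {
                return false;
            }
            current = records.next();
            return true;
        }

        String getCurrentValue() {
            return current;
        }

        void write(String output) {
            written.add(output);
        }
    }

    // The user code drives the loop itself, so it can collect batchSize
    // records before emitting anything.
    static void runBatched(RecordContext context, int batchSize) {
        List<String> batch = new ArrayList<>();
        while (context.nextValue()) {
            batch.add(context.getCurrentValue());
            if (batch.size() == batchSize) {
                context.write(String.join("+", batch));
                batch.clear();
            }
        }
        // Emit whatever is left in a final, smaller batch.
        if (!batch.isEmpty()) {
            context.write(String.join("+", batch));
        }
    }
}
```

With a pure push model the framework calls map() once per record and the user code never sees the loop, so batching like this would need extra state carried between calls.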
