HADOOP: MapReduce types and common knowledge

Hadoop 2.x

A brief explanation of the types

In Hadoop MapReduce, the map and reduce functions follow this general form.

map(K1,V1) --> list(K2,V2)
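reduce(K2,list(V2))  --> list(K3,V3)

The map output types must match the reduce input types.
If a combine function is used, it has the same form as the reduce function (it is a Reducer implementation that runs on the mapper side), except that its output types are the intermediate key-value types (K2,V2).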

map(K1,V1) --> list(K2,V2)
combine(K2,list(V2)) --> list(K2,V2)
reduce(K2,list(V2))  --> list(K3,V3)
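
The type flow above can be tried out without a cluster. The following is a minimal Hadoop-free sketch of a word count, assuming K1 = long (line offset), V1 = String (line), K2 = K3 = String and V2 = V3 = Integer. The class TypeFlowSketch and all of its methods are hypothetical names used only for illustration; each input line plays the role of one map task.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hadoop-free sketch of the map -> combine -> reduce type flow (word count).
class TypeFlowSketch {

    // map(K1, V1) -> list(K2, V2): emit (word, 1) for every word in the line.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // combine(K2, list(V2)) -> (K2, V2): same shape as reduce, intermediate types out.
    static Map.Entry<String, Integer> combine(String key, List<Integer> values) {
        return new AbstractMap.SimpleEntry<>(key, sumOf(values));
    }

    // reduce(K2, list(V2)) -> (K3, V3): here K3/V3 happen to equal K2/V2.
    static Map.Entry<String, Integer> reduce(String key, List<Integer> values) {
        return new AbstractMap.SimpleEntry<>(key, sumOf(values));
    }

    static int sumOf(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Groups (key, value) pairs into key -> list(values), like the shuffle does.
    static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            // The combiner runs on the output of a single map task.
            for (Map.Entry<String, List<Integer>> e : group(map(offset, line)).entrySet()) {
                shuffled.add(combine(e.getKey(), e.getValue()));
            }
            offset += line.length() + 1;
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : group(shuffled).entrySet()) {
            Map.Entry<String, Integer> r = reduce(e.getKey(), e.getValue());
            result.put(r.getKey(), r.getValue());
        }
        return result;
    }
}
```

Because combine emits (K2,V2) again, the reduce step cannot tell whether a combiner ran or not, which is exactly why the combine output types have to be the intermediate types.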

The input types are set by the input format. For example, TextInputFormat has the key type LongWritable and the value type Text.

Example:

        Job job = Job.getInstance(getConf());

        job.setJarByClass(WhereTimeEQsTask.class);
        job.setJobName("select other filelds name");

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // must match the reduce output types
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        // must match the map output types
        job.setMapOutputKeyClass(NullWritable.class);
        job.setMapOutputValueClass(Text.class);

        job.setPartitionerClass(HashPartitioner.class);

        job.setMapperClass(WhereTimeEQsMapper.class);
        job.setReducerClass(WhereTimeEQsReduce.class);
        job.setNumReduceTasks(1); // 1 reduce task is already the default

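Input types

FileInputFormat

Uses data files as the input source. It is the base class for file-based InputFormat implementations.

Input paths

public static void addInputPath(Job job, Path path) throws IOException;
public static void addInputPaths(Job job, String commaSeparatedPaths) throws IOException;
public static void setInputPaths(Job job, Path... inputPaths) throws IOException;
public static void setInputPaths(Job job, String commaSeparatedPaths) throws IOException;

addInputPath and addInputPaths append paths to the path list.
setInputPaths replaces the path list with the complete list given.
A path can be a file, a directory, or a glob.
If a directory contains subdirectories, the job fails, because the subdirectories are interpreted as files. There are two fixes: 1. restrict the input with a glob or a filter; 2. set the parameter mapred.input.dir.recursive to true to read recursively (in Hadoop 2.x this property is named mapreduce.input.fileinputformat.input.dir.recursive, and the old name is deprecated).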

Path filtering

public static void setInputPathFilter(Job job,Class<? extends PathFilter> filter) ;

Set the filter this way. The filter must implement the accept() method of the PathFilter interface.
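
To show what such a filter decides, here is a minimal Hadoop-free sketch. RegexExcludeFilterSketch is a hypothetical stand-in: it takes paths as plain strings, whereas a real filter would implement PathFilter and put the same test inside accept(Path). The exclusion regex is an assumption chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Stand-in for a PathFilter: accept(path) returns true for paths to keep.
class RegexExcludeFilterSketch {
    private final Pattern exclude;

    RegexExcludeFilterSketch(String excludeRegex) {
        this.exclude = Pattern.compile(excludeRegex);
    }

    // Mirrors the shape of PathFilter.accept(Path), with a String instead of a Path.
    boolean accept(String path) {
        return !exclude.matcher(path).matches();
    }

    // Applies the filter to a list of candidate input paths.
    List<String> filter(List<String> paths) {
        List<String> kept = new ArrayList<>();
        for (String p : paths) {
            if (accept(p)) {
                kept.add(p);
            }
        }
        return kept;
    }
}
```

For example, new RegexExcludeFilterSketch(".*\\.tmp$") keeps /data/a.txt and drops /data/a.tmp.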

CombineFileInputFormat

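CombineFileInputFormat is designed for processing large numbers of small files. It is an abstract class, so use one of its subclasses, such as CombineSequenceFileInputFormat or CombineTextInputFormat.

A similar solution is to use SequenceFile, which merges many small files into one large file.

Avoiding splits

Large input files are usually split. To have a single map process an entire file, override the isSplitable() method to return false.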
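
The effect of isSplitable() on the number of map tasks can be estimated with a little arithmetic. The sketch below assumes the usual rule splitSize = max(minSize, min(maxSize, blockSize)) and a 10% slop on the last split; SplitCountSketch and its method names are hypothetical, and the numbers are only an approximation of what a real job would report.

```java
// Rough estimate of how many splits (and so map tasks) one file produces.
class SplitCountSketch {
    // The last split may be up to 10% larger than splitSize (assumed slop factor).
    static final double SPLIT_SLOP = 1.1;

    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    static int countSplits(long fileLength, long splitSize, boolean splitable) {
        if (fileLength <= 0 || splitSize <= 0) {
            throw new IllegalArgumentException("lengths must be positive");
        }
        if (!splitable) {
            return 1; // isSplitable() == false: the whole file goes to one map
        }
        int splits = 0;
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++; // the tail becomes the final split
        }
        return splits;
    }
}
```

With a 128 MB split size, a 1 GB file yields 8 splits when splitable and exactly 1 when isSplitable() returns false.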

KeyValueTextInputFormat

Suitable when each line of the input file is a key-value pair separated by a delimiter.
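
A minimal sketch of how one such line breaks into a key and a value, assuming the line is cut at the first occurrence of the separator (a tab is the usual default) and that a line without a separator becomes a key with an empty value. KeyValueLineSketch is a hypothetical name and works on plain strings rather than Text objects.

```java
// Splits one input line into (key, value) at the first separator.
class KeyValueLineSketch {
    static String[] split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            // No separator: the whole line is the key, the value is empty.
            return new String[] { line, "" };
        }
        // Only the first separator counts; later ones stay inside the value.
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }
}
```

So the line "k1<TAB>v1<TAB>v2" gives the key "k1" and the value "v1<TAB>v2".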

NLineInputFormat

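Suitable when each mapper should receive a fixed number of input lines. With NLineInputFormat as the InputFormat, the keys and values are the same as with TextInputFormat: the key is the byte offset of the line within the file, and the value is the line itself.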
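
The grouping itself is simple: every N consecutive lines form one split, and the last split takes whatever is left. A minimal Hadoop-free sketch, with NLineSplitSketch as a hypothetical name and an in-memory list standing in for the input file:

```java
import java.util.ArrayList;
import java.util.List;

// Groups the lines of a file into splits of n lines each.
class NLineSplitSketch {
    static List<List<String>> splitByLines(List<String> lines, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<List<String>> splits = new ArrayList<>();
        for (int start = 0; start < lines.size(); start += n) {
            int end = Math.min(start + n, lines.size());
            // Each sublist is what a single mapper would receive.
            splits.add(new ArrayList<>(lines.subList(start, end)));
        }
        return splits;
    }
}
```

Five lines with n = 2 give three splits of sizes 2, 2 and 1, so three mappers.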

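Binary input

Hadoop can also process data in binary input formats. SequenceFile is a binary file format that stores binary key-value pairs.

The corresponding input and output formats are SequenceFileInputFormat and SequenceFileOutputFormat.

hadoop fs -text [dic] // view a SequenceFile stored in HDFS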

SequenceFileAsTextInputFormat

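SequenceFileAsTextInputFormat is a variant of SequenceFileInputFormat that converts the keys and values of a sequence file to Text objects.

SequenceFileAsBinaryInputFormat

SequenceFileAsBinaryInputFormat is a variant of SequenceFileInputFormat that reads the keys and values of a sequence file as raw binary objects, wrapped in BytesWritable objects.

Multiple inputs: MultipleInputs

There are two ways to add an input: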

MultipleInputs.addInputPath(job, path, inputFormatClass);
MultipleInputs.addInputPath(job, path, inputFormatClass, mapperClass);

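Database input and output

DBInputFormat and DBOutputFormat are suitable for small data volumes. With large volumes the database may not withstand the load; in that case use an ETL tool such as Sqoop.

// some key code
DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver", "jdbc:mysql://127.0.0.1:3306/oled", "root", "123456"); // configure the connection
job.setInputFormatClass(DBInputFormat.class); // set the input format
DBInputFormat.setInput(job, MyDBWritable.class, "select id,name from DB", "select count(1) from DB"); // set the input query and the count query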

public static class MyMapper extends  
            Mapper<LongWritable, MyDBWritable, LongWritable, Text> {  
        Text v2 = new Text();  

        protected void map(LongWritable key, MyDBWritable value,Mapper<LongWritable, MyDBWritable, LongWritable, Text>.Context context)throws InterruptedException, IOException {  
               ***
        }  
    }  

// assume the table has two columns: id and name
public static class MyDBWritable implements Writable,DBWritable{  
        int id;  
        String name;  
        public void write(PreparedStatement statement) throws SQLException {  
            statement.setInt(1, id);  
            statement.setString(2, name);  
        }  

        public void readFields(ResultSet resultSet) throws SQLException {  
            this.id=resultSet.getInt(1);  
            this.name=resultSet.getString(2);  
        }  

        public void write(DataOutput out) throws IOException {  
            out.writeInt(id); // writeInt, to match readInt in readFields
            out.writeUTF(name);  
        }  

        public void readFields(DataInput in) throws IOException {  
            this.id=in.readInt();  
            this.name=in.readUTF();  
        }     
        public String toString(){  
            return "MyDBWritable[id="+id+",\t"+"name="+name+"]";  
        }  
    }
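
In a Writable, write(DataOutput) and readFields(DataInput) must mirror each other field by field: a value read with readInt() has to be written with writeInt(), because DataOutput.write(int) emits only the low 8 bits. The following sketch checks this with the standard java.io streams only; WritableRoundTripSketch is a hypothetical name and no Hadoop class is involved.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Round-trips (id, name) the way a Writable's write/readFields pair would.
class WritableRoundTripSketch {

    static byte[] serialize(int id, String name) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(id);   // 4 bytes; out.write(id) would emit a single byte
            out.writeUTF(name); // 2-byte length prefix plus the encoded characters
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static String deserialize(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int id = in.readInt();      // must mirror writeInt
            String name = in.readUTF(); // must mirror writeUTF
            return id + ":" + name;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Round-tripping (300, "bob") returns "300:bob". If the id were written with write(int) instead, readInt() would consume bytes that belong to the name.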

Output types

Each of the input formats above has a corresponding output format.

combine

Contents of a.txt

a
b
a
a
a
c
c
c
c
b
…

A direct implementation of counting

public static class WordCountCombineCom extends Reducer<Text,IntWritable,Text,IntWritable>{
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0; // local, so the total restarts at 0 for every key
            for (IntWritable value : values) {
                sum += value.get(); // add the value; the input may already be partial sums
            }
            context.write(key,new IntWritable(sum));
        }
    }
job.setNumReduceTasks(1); // must not be 0
job.setCombinerClass(WordCountCombineCom.class);
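
A combiner may run zero, one, or several times, so its logic has to give the same final result in every case. Summing the values has that property; counting how many values arrive does not, because after a combine step each value already stands for several records. A minimal Hadoop-free sketch, with CombinerSafetySketch as a hypothetical name:

```java
import java.util.List;

// Compares a combiner-safe aggregation (sum) with an unsafe one (count).
class CombinerSafetySketch {

    // Safe: summing partial sums gives the same total as summing the raw 1s.
    static int sum(List<Integer> values) {
        int total = 0; // local, so every key starts from 0
        for (int v : values) {
            total += v;
        }
        return total;
    }

    // Unsafe as combiner/reducer logic: only correct when every value is 1.
    static int count(List<Integer> values) {
        return values.size();
    }
}
```

Suppose one map task emits ("a",1) three times and another emits it twice. After combining, the reducer sees the values [3, 2]: sum gives 5, which is correct, while count gives 2.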

partition

A simple custom Partitioner

public static class MonthPartition extends Partitioner<Text,IntWritable>{

        @Override
        public int getPartition(Text text, IntWritable intWritable, int numPartitions) {
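            return 1; // the return value is the partition index. For example, with an obvious field such as the month, the data could be split into 12 partitions by month here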
        }
}

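job.setPartitionerClass(MonthPartition.class); // set the partitioner

Partitioner class hierarchy

Partitioner is the base class of all partitioners; a custom partitioner must extend it.
HashPartitioner is the default MapReduce partitioner. It computes the target reducer as (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks.
BinaryPartitioner extends Partitioner<BinaryComparable, V> and is a partially specialized subclass of Partitioner. It provides leftOffset and rightOffset, and hashes only the bytes of the key K in the range [leftOffset, rightOffset] when computing the target reducer:
reducer = (hash & Integer.MAX_VALUE) % numReduceTasks
KeyFieldBasedPartitioner is also a hash-based partitioner. Unlike BinaryPartitioner, it supports multiple ranges for computing the hash. When the number of ranges is 0, KeyFieldBasedPartitioner degenerates into HashPartitioner.
TotalOrderPartitioner makes a totally ordered output possible. Unlike the three partitioners above, it is not hash-based. It is described in detail below.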
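
The mask with Integer.MAX_VALUE in the hash formula matters because hashCode() can be negative, and in Java the remainder of a negative number is negative too. A minimal sketch of the formula next to a naive variant; HashPartitionSketch and naivePartition are hypothetical names for illustration.

```java
// The hash partitioning formula, next to a naive variant without the mask.
class HashPartitionSketch {

    // Clears the sign bit first, so the result is always in [0, numReduceTasks).
    static int partition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Without the mask a negative hash code yields a negative partition index.
    static int naivePartition(Object key, int numReduceTasks) {
        return key.hashCode() % numReduceTasks;
    }
}
```

For the key Integer.valueOf(-7), whose hash code is -7, naivePartition with 3 reducers returns -1, which is not a valid partition, while partition stays within 0..2.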

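Total ordering
The simplest method: send all the data to a single reduce and let it sort internally.
This is no different from sorting on one machine and gains nothing from distributed computing; once the data grows, a single reduce is very inefficient.
Distributed approach:
First, create a series of sorted files; next, concatenate them; the result is one globally sorted file.
The main idea is to use a partitioner that describes the globally sorted output.
This gives the following steps for sorting a large amount of data with Hadoop:
1) Sample the data to be sorted;
2) Sort the sample to produce the split points;
3) Map determines which two split points each input record falls between and sends the record to the reduce with the corresponding range ID;
4) Reduce writes out the data it receives directly.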
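
The four steps above can be sketched in memory without Hadoop. This is only an illustration under stated assumptions: keys are plain strings, the split points are taken at evenly spaced positions of the sorted sample, and each bucket stands for one reduce. TotalOrderSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// In-memory sketch of sample -> split points -> range partition -> concatenate.
class TotalOrderSketch {

    // Steps 1-2: sort the sample and pick numPartitions - 1 split points.
    static String[] splitPoints(List<String> sample, int numPartitions) {
        if (sample.isEmpty() || numPartitions < 1) {
            throw new IllegalArgumentException("need a sample and at least one partition");
        }
        List<String> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        String[] points = new String[numPartitions - 1];
        float step = sorted.size() / (float) numPartitions;
        for (int i = 1; i < numPartitions; i++) {
            int index = Math.min(sorted.size() - 1, Math.round(step * i));
            points[i - 1] = sorted.get(index);
        }
        return points;
    }

    // Step 3: binary search for the range that contains the key.
    static int partition(String key, String[] points) {
        int pos = Arrays.binarySearch(points, key);
        // Found at index p: the key opens range p + 1. Not found: use the insertion point.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    // Step 4: each bucket is sorted on its own; concatenation is globally sorted.
    static List<String> sort(List<String> keys, List<String> sample, int numPartitions) {
        String[] points = splitPoints(sample, numPartitions);
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String key : keys) {
            buckets.get(partition(key, points)).add(key);
        }
        List<String> out = new ArrayList<>();
        for (List<String> bucket : buckets) {
            Collections.sort(bucket); // stands in for the framework's per-reduce sort
            out.addAll(bucket);
        }
        return out;
    }
}
```

Every key in bucket i is smaller than every key in bucket i + 1, so no merge step is needed; the quality of the sample only affects how evenly the buckets are filled, not the correctness of the order.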
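Sorting a set of URLs is used as the example here:

Java implementation:
1) InputSampler
The input sampling class, which samples the data under the input directory. Its nested sampler classes implement the InputSampler.Sampler interface; the goal is to create a sequence file that stores the keys defining the partitions. Three sampling methods are provided.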

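Comparison of sampling methods:

Class	Sampling method	Constructor arguments	Efficiency	Notes
SplitSampler	Samples the first n records	total samples, number of splits	highest	
RandomSampler	Scans all the data and samples randomly	sampling frequency, total samples, number of splits	lowest	
IntervalSampler	Samples at fixed intervals	sampling frequency, number of splits	medium	Well suited to sorted data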

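InputSampler.Sampler<IntWritable, Text> sampler = new InputSampler.RandomSampler<IntWritable, Text>(0.1, 10000, 10); // RandomSampler's three arguments are the sampling rate, the maximum number of samples, and the maximum number of input splits to sample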

2) TotalOrderPartitioner

TotalOrderPartitioner.setPartitionFile(conf, partitionFile);
InputSampler.writePartitionFile(conf, sampler);

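The partition file written by InputSampler is placed in the input directory.
TotalOrderPartitioner is pointed at the partition file. The partition file must contain keys (the split points) numbering one fewer than the reducers, sorted in ascending order.
writePartitionFile sorts the samples provided by the sampler, then selects (number of reducers - 1) of them at evenly spaced positions and writes them to the partition file. With split points derived from a sample, each range holds roughly the same number of key-value pairs, which balances the load across reducers.
DistributedCache.addCacheFile(partitionUri, conf);
This loads the partition file into the distributed cache.

Caching with DistributedCache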

job.addCacheFile(URI uri);
job.addCacheArchive(URI uri);

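For performance, these files are copied to the nodes that run the tasks when the job starts, and the setup method can then retrieve them.

On the map side this looks as follows: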


private Map<String,String[]> map1 = new HashMap<String,String[]>();
private Map<String,String[]> map2 = new HashMap<String,String[]>();
private Map<String,String[]> map3 = new HashMap<String,String[]>();

@Override
    protected void setup(Mapper<LongWritable, Text, Text, Text>.Context context)
            throws IOException, InterruptedException {
        URI[] uris = context.getCacheFiles();
        for (URI uri : uris) {
            if (uri.toString().endsWith("")) {
                something();//map1
            }
            if (uri.toString().endsWith("")) {
                something();//map2
            }
            if (uri.toString().endsWith("")) {
                something();//map3
            }
        }
    }
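
What the something() placeholders do depends on the cached files. As one possibility, the sketch below fills a Map<String,String[]> from a text source. The line format "key<TAB>v1,v2,..." is an assumption made for illustration, and LookupTableSketch is a hypothetical name; in a mapper the Reader would be opened on the local copy of the cached file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Loads a small lookup table, as setup() might do for one cached file.
class LookupTableSketch {

    // Parses lines of the assumed form "key<TAB>v1,v2,..." into key -> values.
    static Map<String, String[]> load(Reader source) {
        Map<String, String[]> table = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab < 0) {
                    continue; // skip lines that do not match the assumed format
                }
                table.put(line.substring(0, tab), line.substring(tab + 1).split(","));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return table;
    }
}
```

Loading once in setup keeps the table in memory for every map() call of the task, which is the point of shipping small reference files through the cache.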

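Saving results to multiple output files

MultipleOutputs.addNamedOutput(job, namedOutput, outputFormatClass, keyClass, valueClass); // job configuration

In the mapper:

MultipleOutputs mos = new MultipleOutputs(context); // instantiate in the setup method
this.mos.write(namedOutput, key, value, baseOutputPath); // write a record to the named output
// call mos.close() in the cleanup method, otherwise output may be lost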

Custom counters

Define an enum

Counters can also be created dynamically by supplying a group name and a counter name. Counters are global.

// define the enum inside the mapper
enum OntimeStatices {
    DEPONTIME1,
    DEPONTIME2,
    DEPONTIME3
}
// update the counter inside map
context.getCounter(OntimeStatices.DEPONTIME1).increment(1); // for example, add 1

            // read the counters in the job driver; this can be wrapped in a method
            CounterGroup xntrgrp = job.getCounters().getGroup("");
            Iterator<Counter> cntLter = xntrgrp.iterator();
            while(cntLter.hasNext()){
                Counter c = cntLter.next();
                System.out.println(c.getName() + c.getValue());
            }
