1. MapReduce Workflow

[Figure: MapTask and Shuffle stages]
[Figure: ReduceTask stage]

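1.1 How MapTask Works

Read stage: MapTask uses the RecordReader (user-written, or the one supplied by the InputFormat) to parse individual key-value pairs (KV) from its InputSplit.
Map stage: each parsed KV is handed to the user-written map() function, which produces a series of new KVs.
Collect stage: when map() finishes processing a record, it usually calls OutputCollector.collect() (context.write() in the new API) to emit the result. Internally, that call assigns the KV to a partition (by calling the Partitioner) and writes it into a circular in-memory buffer.
Spill stage: once the circular buffer fills to its spill threshold (80% by default), MapReduce writes the data to local disk as a temporary file. Before the write, the data is sorted locally and, where necessary, combined and compressed.
Combine stage: after all data has been processed, MapTask merges all temporary spill files so that it ends up with exactly one output data file. (This is a file merge; it is distinct from the Combiner component in section 3.4.)
Reduce stage: each ReduceTask uses its partition number to fetch the matching partition data from every MapTask machine, merges those files (merge sort), and then runs the reduce() logic.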

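1.2 How Shuffle Works

The size of the Shuffle buffer affects how efficiently a MapReduce program runs: in principle, the larger the buffer, the fewer disk I/O operations and the faster the job.

The buffer size is configurable: the parameter is io.sort.mb (named mapreduce.task.io.sort.mb since Hadoop 2.x), and the default is 100 MB.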
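
To make the buffer-size argument concrete, here is a rough back-of-the-envelope sketch. It assumes a spill starts every time the buffer reaches its spill threshold (80% by default) and ignores the per-record metadata that shares the buffer, so real spill counts will be somewhat higher. The class and method names are hypothetical and not part of Hadoop.

```java
// Rough estimate only: ignores the per-record metadata kept in the same buffer.
final class SpillEstimate {

    /** Estimated number of spill files for one MapTask. */
    static long estimateSpills(long mapOutputBytes, long bufferBytes, int spillPercent) {
        if (mapOutputBytes <= 0) {
            return 0;
        }
        // Bytes that fit in the buffer before a spill is triggered.
        long usable = bufferBytes * spillPercent / 100;
        // Ceiling division: a partially filled buffer still needs a final spill.
        return (mapOutputBytes + usable - 1) / usable;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println("100 MB buffer: " + estimateSpills(1000 * mb, 100 * mb, 80));
        System.out.println("200 MB buffer: " + estimateSpills(1000 * mb, 200 * mb, 80));
    }
}
```

Under these assumptions, doubling the buffer roughly halves the number of spill files for the same map output, which in turn shortens the final merge.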

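1.3 How ReduceTask Works

Copy stage: ReduceTask remotely copies a piece of data from each MapTask. If a piece exceeds a size threshold it is written to disk; otherwise it is kept in memory.
Merge stage: while the remote copy is in progress, ReduceTask runs two background threads that merge the in-memory and on-disk files, so that neither memory usage nor the number of disk files grows too large.
Sort stage: by MapReduce semantics, the input to the user-written reduce() function is a set of data grouped by key. Hadoop uses a sort-based strategy to bring records with the same key together. Because each MapTask has already sorted its own output locally, ReduceTask only needs to run one merge sort over all the data.
Reduce stage: the reduce() function writes its results to HDFS.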

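2. InputFormat: Data Input

InputFormat has two key responsibilities: splitting the input data and turning each split into KVs.

2.1 Data Splits

A split divides the input logically; it does not cut the data into pieces on disk. (A Block, by contrast, is how HDFS physically divides the data.)

The parallelism of a Job's Map phase is determined by the number of splits computed on the client when the Job is submitted.
Each split is assigned to one MapTask instance, and the instances run in parallel.
By default, split size = BlockSize.
Splitting does not consider the dataset as a whole; each file is split on its own.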

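FileInputFormat split source-code walkthrough (input.getSplits):

1. The program first locates the directory where the data is stored.

2. It then iterates over every file in that directory and plans its splits:

Get the file size with file.getLen().
Compute the split size: long splitSize = this.computeSplitSize(blockSize, minSize, maxSize);, which evaluates Math.max(minSize, Math.min(maxSize, blockSize)). With the default minSize (1) and maxSize (Long.MAX_VALUE), splitSize = blockSize, which is 128 MB with the default HDFS block size in Hadoop 2.x. Before each cut, the remaining part is checked: as long as it is more than 1.1 times splitSize, another split of splitSize is cut off; once it is no more than 1.1 times splitSize, the whole remainder becomes the last split.
Write the split information to a split plan file (splits).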
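
The split-planning rule can be sketched in plain Java without any Hadoop dependency. The class below is a simplified illustration: it only tracks split lengths for a single file, and the names SplitPlanner and planSplits are hypothetical. Only the computeSplitSize formula and the 1.1 slop factor follow the description above.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration of per-file split planning; not the Hadoop class itself.
final class SplitPlanner {

    // A remainder up to 1.1 x splitSize is kept as one split instead of being cut again.
    static final double SPLIT_SLOP = 1.1;

    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Returns the lengths of the logical splits planned for one file. */
    static List<Long> planSplits(long fileLength, long splitSize) {
        List<Long> splits = new ArrayList<>();
        long remaining = fileLength;
        // Keep cutting full-size splits while the remainder exceeds 1.1 x splitSize.
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits.add(splitSize);
            remaining -= splitSize;
        }
        // Whatever is left (if anything) becomes the final split.
        if (remaining != 0) {
            splits.add(remaining);
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = computeSplitSize(128 * mb, 1L, Long.MAX_VALUE);
        System.out.println(planSplits(300 * mb, splitSize).size() + " splits for a 300 MB file");
        System.out.println(planSplits(129 * mb, splitSize).size() + " split(s) for a 129 MB file");
    }
}
```

A 129 MB file therefore stays in one split even though it spans two 128 MB blocks, which avoids launching a MapTask for a 1 MB tail.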

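2.2 FileInputFormat Implementations

When running a MapReduce program, the input may be line-based log files, binary files, database tables, and so on. How does MapReduce read each of these data types?

Common FileInputFormat implementations include TextInputFormat, KeyValueTextInputFormat, NLineInputFormat, CombineTextInputFormat, and custom InputFormats.

① TextInputFormat

TextInputFormat is the default FileInputFormat implementation.

Split method: FileInputFormat's split method

KV method: LineRecordReader
Each record is one line. The key is the byte offset of the line's start within the whole file, of type LongWritable. The value is the line's content without any line terminator (line feed or carriage return), of type Text.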
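
As an illustration of what these keys and values look like, the following plain-Java sketch derives (byte offset, line) pairs from an in-memory string. It is not LineRecordReader itself: the class name LineOffsets is hypothetical, and the whole input is assumed to fit in memory and to be UTF-8 encoded.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: mimics the (offset, line) records that a line-based reader produces.
final class LineOffsets {

    /** Maps each line's starting byte offset to its content, with the terminator stripped. */
    static Map<Long, String> read(String content) {
        byte[] data = content.getBytes(StandardCharsets.UTF_8);
        Map<Long, String> records = new LinkedHashMap<>();
        int start = 0;
        for (int i = 0; i <= data.length; i++) {
            boolean atEnd = (i == data.length);
            // A record ends at a line feed, or at end of input if bytes are still pending.
            if (atEnd ? start < data.length : data[i] == '\n') {
                int end = i;
                if (end > start && data[end - 1] == '\r') {
                    end--; // drop the carriage return of a \r\n terminator
                }
                records.put((long) start, new String(data, start, end - start, StandardCharsets.UTF_8));
                start = i + 1;
            }
        }
        return records;
    }

    public static void main(String[] args) {
        read("ab\ncde\nf").forEach((offset, line) -> System.out.println(offset + "\t" + line));
    }
}
```

Because offsets are counted in bytes, the key of a line depends on the encoded length of everything before it, not on its line number.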

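② KeyValueTextInputFormat

Split method: FileInputFormat's split method

KV method: KeyValueLineRecordReader
Each line is one record, divided into K and V by a separator. The separator can be set in the driver class with
conf.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR, "\t"). The default separator is tab (\t).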
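
The way a line is divided can be sketched as follows. This is a plain-Java approximation with hypothetical names: the line is cut at the first occurrence of the separator, and a line without a separator is treated as a key with an empty value.

```java
// Illustration only: splits one line into {key, value} at the first separator.
final class KeyValueLine {

    static String[] split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            // No separator found: the whole line is the key, the value is empty.
            return new String[] { line, "" };
        }
        // Later separators stay inside the value.
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        String[] kv = split("line1\tfirst value\twith a tab", '\t');
        System.out.println("key=" + kv[0] + ", value=" + kv[1]);
    }
}
```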

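③ NLineInputFormat

Split method: custom split method
With NLineInputFormat, the InputSplit handled by each MapTask is no longer divided by Block but by the line count N configured for NLineInputFormat. That is, number of splits = total lines in the input file / N; if the division is not exact, number of splits = quotient + 1.

KV method: LineRecordReader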

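④ CombineTextInputFormat

Split method: custom split method (inherited from CombineFileInputFormat)
CombineTextInputFormat targets workloads with too many small files: it can logically plan multiple small files into a single split, so that one MapTask processes them all.

Setting the maximum virtual-storage split size: CombineTextInputFormat.setMaxInputSplitSize(job, 4194304); // 4 MB

KV method: LineRecordReader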

⑤ FixedLengthInputFormat

Split method: FileInputFormat's split method

KV method: FixedLengthRecordReader
Unlike LineRecordReader, which reads one line at a time, FixedLengthRecordReader reads a fixed number of bytes at a time.

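⑥ SequenceFileInputFormat

Split method: FileInputFormat's split method

KV method: SequenceFileRecordReader

The input is a SequenceFile, typically the output written by a previous MapReduce job.

⑦ Custom InputFormat Example

A custom InputFormat that merges small files: multiple small files are combined into one SequenceFile (SequenceFile is Hadoop's file format for storing key-value pairs in binary form). The SequenceFile holds many files, each stored with its path plus name as the key and its content as the value.

① Custom RecordReader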

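import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/**
 * Custom RecordReader that turns a whole file into a single KV pair
 *
 * @author HuChan
 */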
public class WholeFileRecordReader extends RecordReader<Text, BytesWritable> {

    private boolean notRead = true;

    private Text key = new Text();

    private BytesWritable value = new BytesWritable();

    private FSDataInputStream inputStream;

    private FileSplit fs;

    /**
     * Initialization; the framework calls this once before reading starts
     */
    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        // Cast the generic split to a file split
        fs = (FileSplit) split;
        // Get the file path from the split
        Path path = fs.getPath();
        // Get the file system from the path
        FileSystem fileSystem = path.getFileSystem(context.getConfiguration());
        // Open the input stream
        inputStream = fileSystem.open(path);
    }

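    /**
     * Reads the next KV pair.
     * Returns true if a pair was read, false once everything has been read
     */
    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (notRead) {
            // The key is the file path
            key.set(fs.getPath().toString());
            // The value is the entire file content
            byte[] bytes = new byte[(int) fs.getLength()];
            // readFully fills the whole buffer; a single read() may return fewer bytes
            IOUtils.readFully(inputStream, bytes, 0, bytes.length);
            value.set(bytes, 0, bytes.length);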
            notRead = false;
            return true;
        } else {
            return false;
        }
    }

    /**
     * Returns the key that was just read
     */
    @Override
    public Text getCurrentKey() throws IOException, InterruptedException {
        return key;
    }

    /**
     * Returns the value that was just read
     */
    @Override
    public BytesWritable getCurrentValue() throws IOException, InterruptedException {
        return value;
    }

    /**
     * Reports read progress: 0 before the file is read, 1 afterwards
     */
    @Override
    public float getProgress() throws IOException, InterruptedException {
        return notRead ? 0 : 1;
    }

    /**
     * Closes the stream
     */
    @Override
    public void close() throws IOException {
        // Close the input stream
        IOUtils.closeStream(inputStream);
    }
}

② Custom InputFormat

public class WholeFileInputFormat extends FileInputFormat<Text, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        return new WholeFileRecordReader();
    }
}

③ Driver Setup

public class WholeFileDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Job job = Job.getInstance(new Configuration());

        job.setJarByClass(WholeFileDriver.class);
        /*
         * Use the default Mapper and Reducer
         */
        //job.setMapperClass(WholeFileMapper.class);
        //job.setReducerClass(WholeFileReducer.class);

        job.setInputFormatClass(WholeFileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(BytesWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BytesWritable.class);

        FileInputFormat.setInputPaths(job, new Path("D:\\MyFile\\test"));
        FileOutputFormat.setOutputPath(job, new Path("d:\\output"));

        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}

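3. The Shuffle Mechanism

MapReduce guarantees that the input to every Reducer is sorted by key. The process by which the system performs this sort and transfers the Map output to the Reducers as input is called Shuffle.

3.1 Partitioning

In the default partitioner, key.hashCode() & Integer.MAX_VALUE is always non-negative, and that value modulo the number of ReduceTasks is the partition number. With the default, the user cannot control which partition a given key is stored in.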

public class HashPartitioner<K, V> extends Partitioner<K, V> {
  /** Use {@link Object#hashCode()} to partition. */
  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
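
Why the mask matters can be seen in a small stand-alone sketch. The class HashPartitionDemo and its method names are hypothetical; only the masked expression mirrors the HashPartitioner listing above.

```java
// Illustration only: compares the masked partition formula with a naive modulo.
final class HashPartitionDemo {

    /** Same expression as in HashPartitioner: the mask clears the sign bit. */
    static int partition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    /** Naive variant: Java's % keeps the sign of the dividend, so this can be negative. */
    static int naivePartition(Object key, int numReduceTasks) {
        return key.hashCode() % numReduceTasks;
    }

    public static void main(String[] args) {
        Integer key = -7; // Integer.hashCode() returns the value itself
        System.out.println("masked: " + partition(key, 5));
        System.out.println("naive:  " + naivePartition(key, 5));
    }
}
```

The mask is also safer than Math.abs(), which still returns a negative number for Integer.MIN_VALUE.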

3.2 Custom Partitioner

Exercise: send phone numbers starting with 136, 137, 138, and 139 to four separate files, one per prefix, and all other numbers to a fifth file.

Custom Partitioner:

public class FlowPartitioner extends Partitioner<Text, FlowBean> {

    @Override
    public int getPartition(Text text, FlowBean flowBean, int i) {
        // Take the first three digits of the phone number
        String preNum = text.toString().substring(0, 3);
        int partitionNum = 4;
        switch (preNum) {
            case "136":
                partitionNum = 0;
                break;
            case "137":
                partitionNum = 1;
                break;
            case "138":
                partitionNum = 2;
                break;
            case "139":
                partitionNum = 3;
                break;
        }
        return partitionNum;
    }
}

Add these settings to the driver:

// Set the Partitioner
job.setPartitionerClass(FlowPartitioner.class);
// Set the number of reduce tasks
job.setNumReduceTasks(5);

[Figure: test run results]

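Notes:

If the number of ReduceTasks > the number of partitions returned by getPartition, several extra empty output files part-r-000xx are produced.
If 1 < the number of ReduceTasks < the number of partitions returned by getPartition, some partitions have no ReduceTask to receive them, and the job fails with an exception.
If the number of ReduceTasks = 1, then no matter how many partitions the MapTask side would produce, everything goes to that single ReduceTask, and only one result file, part-r-00000, is produced.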
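
The three cases can be reproduced with a simplified stand-in for the framework's range check. This is a sketch under assumptions: PartitionCheck and route are hypothetical names, the prefix rule copies the FlowPartitioner above, and an unchecked exception stands in for the I/O error that Hadoop reports for an illegal partition.

```java
// Illustration only: a simplified model of how a partition number is validated.
final class PartitionCheck {

    /** Same rule as FlowPartitioner: four known prefixes, everything else goes to partition 4. */
    static int phonePartition(String phone) {
        String prefix = phone.length() >= 3 ? phone.substring(0, 3) : phone;
        switch (prefix) {
            case "136": return 0;
            case "137": return 1;
            case "138": return 2;
            case "139": return 3;
            default:    return 4;
        }
    }

    /** Returns the ReduceTask index that receives the record, or fails if none exists. */
    static int route(String phone, int numReduceTasks) {
        if (numReduceTasks == 1) {
            return 0; // a single reducer receives everything; the custom rule is not consulted
        }
        int partition = phonePartition(phone);
        if (partition < 0 || partition >= numReduceTasks) {
            // Assumption: modeled as an unchecked exception to keep the sketch short.
            throw new IllegalArgumentException("Illegal partition for " + phone + " (" + partition + ")");
        }
        return partition;
    }

    public static void main(String[] args) {
        System.out.println(route("13912345678", 5)); // one reducer per partition
        System.out.println(route("13512345678", 1)); // single reducer
    }
}
```

With five or more ReduceTasks every record finds a target, with exactly one the partitioner is irrelevant, and with two to four the records for the higher partitions have nowhere to go.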

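3.3 Sorting

Sorting is one of the most important operations in the MapReduce framework. Both MapTask and ReduceTask sort data by key; this is Hadoop's default behavior. The data of every application is sorted, whether or not the logic requires it. The default order is lexicographic, and the sort is implemented with quicksort.

A MapTask places its results in the circular buffer temporarily. When buffer usage reaches a threshold, it quicksorts the data in the buffer and spills the sorted data to disk; when all data has been processed, it merge-sorts all the files on disk.

A ReduceTask remotely copies the corresponding data file from every MapTask. A file larger than a size threshold goes to disk; otherwise it stays in memory. When the number of files on disk reaches a threshold, they are merged into one larger file; when the size or number of files in memory exceeds a threshold, they are merged and written to disk. Once all data has been copied, the ReduceTask merges and merge-sorts everything in memory and on disk in one pass.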
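
The final merge step on both sides combines runs that are already sorted. A minimal k-way merge over in-memory lists shows the idea; it is a sketch with hypothetical names, not Hadoop's merger, which works on files and raw bytes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Illustration only: merges several individually sorted runs into one sorted list.
final class KWayMerge {

    static List<String> merge(List<List<String>> sortedRuns) {
        // Each heap entry is {runIndex, positionInRun}; entries are ordered by the key they point at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
                (a, b) -> sortedRuns.get(a[0]).get(a[1]).compareTo(sortedRuns.get(b[0]).get(b[1])));
        for (int run = 0; run < sortedRuns.size(); run++) {
            if (!sortedRuns.get(run).isEmpty()) {
                heap.add(new int[] { run, 0 });
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] head = heap.poll();
            List<String> run = sortedRuns.get(head[0]);
            merged.add(run.get(head[1]));
            // Advance within the run the smallest key came from.
            if (head[1] + 1 < run.size()) {
                heap.add(new int[] { head[0], head[1] + 1 });
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> runs = new ArrayList<>();
        runs.add(List.of("a", "d"));
        runs.add(List.of("b", "c", "e"));
        System.out.println(merge(runs));
    }
}
```

Because every run is already sorted, the merge only ever compares the current head of each run, which is why the earlier local sorts make the final pass cheap.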

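Types of sorting:

① Partial sort: MapReduce sorts the dataset by the keys of the input records, guaranteeing that each output file is sorted internally.

② Total sort: the final output is a single, internally sorted file. This is achieved by configuring only one ReduceTask, but it is extremely inefficient for large files, because one machine processes all the data and the parallelism MapReduce offers is lost entirely.

③ Auxiliary sort (GroupingComparator grouping): groups keys on the Reduce side. When the key is a bean object and keys that share one or a few fields (but are not equal on all fields) should go to the same reduce() call, use a grouping comparator.

④ Secondary sort: in a custom sort, if compareTo compares on two conditions, it is a secondary sort.

① WritableComparable: total sort and in-partition sort

The entity class implements the WritableComparable<T> interface and overrides compareTo():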

    @Override
    public int compareTo(FlowBean o) {
        return Long.compare(o.getSumFlow(), this.sumFlow);
    }

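② GroupingComparator grouping (auxiliary sort)

Given the orders below, find the highest item amount in each order.

Input data:

Order ID	Item ID	Amount
0000001	sku001	1
0000001	sku002	2
0000002	sku003	3
0000002	sku004	4

Expected output:

0000001	2
0000002	4

Requirements analysis:

Use the order id and the amount together as the key, so that all order records read in the Map phase are sorted by id ascending and, for equal ids, by amount descending before being sent to Reduce.
On the Reduce side, use a GroupingComparator to group KVs with the same order id; the first KV in each group is then the most expensive item of that order.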
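
Before wiring this into Hadoop classes, the two steps can be checked with a plain-Java sketch: sort by (id ascending, amount descending), then keep the first record of each id group. The class MaxPricePerOrder and its nested Order type are hypothetical stand-ins for the bean and comparator.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: secondary sort followed by grouping, without any Hadoop types.
final class MaxPricePerOrder {

    static final class Order {
        final int orderId;
        final double price;

        Order(int orderId, double price) {
            this.orderId = orderId;
            this.price = price;
        }
    }

    static Map<Integer, Double> maxPerOrder(List<Order> orders) {
        List<Order> sorted = new ArrayList<>(orders);
        // Secondary sort: order id ascending, then price descending.
        sorted.sort((a, b) -> {
            int byId = Integer.compare(a.orderId, b.orderId);
            return byId != 0 ? byId : Double.compare(b.price, a.price);
        });
        Map<Integer, Double> result = new LinkedHashMap<>();
        for (Order o : sorted) {
            // Grouping: the first record seen for an id is its most expensive item.
            result.putIfAbsent(o.orderId, o.price);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Order> orders = List.of(
                new Order(1, 1), new Order(1, 2), new Order(2, 3), new Order(2, 4));
        System.out.println(maxPerOrder(orders));
    }
}
```

In the MapReduce version the sort is done by the framework through compareTo, and the grouping comparator decides where one reduce() call ends and the next begins.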

OrderBean class (order information):

public class OrderBean implements WritableComparable<OrderBean> {

	private int order_id; // order id
	private double price; // price

	public OrderBean() {
		super();
	}

	public OrderBean(int order_id, double price) {
		super();
		this.order_id = order_id;
		this.price = price;
	}

	@Override
	public void write(DataOutput out) throws IOException {
		out.writeInt(order_id);
		out.writeDouble(price);
	}

	@Override
	public void readFields(DataInput in) throws IOException {
		order_id = in.readInt();
		price = in.readDouble();
	}

	@Override
	public String toString() {
		return order_id + "\t" + price;
	}

	public int getOrder_id() {
		return order_id;
	}

	public void setOrder_id(int order_id) {
		this.order_id = order_id;
	}

	public double getPrice() {
		return price;
	}

	public void setPrice(double price) {
		this.price = price;
	}

	// Secondary sort
	@Override
	public int compareTo(OrderBean o) {

		int result;

		if (order_id > o.getOrder_id()) {
			result = 1;
		} else if (order_id < o.getOrder_id()) {
			result = -1;
		} else {
			// Price descending; returns 0 for equal prices so the compareTo contract holds
			result = Double.compare(o.getPrice(), price);
		}

		return result;
	}
}

Mapper class:

public class OrderMapper extends Mapper<LongWritable, Text, OrderBean, NullWritable> {

	OrderBean k = new OrderBean();
	
	@Override
	protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
		
		// 1. Read one line
		String line = value.toString();

		// 2. Split it into fields
		String[] fields = line.split("\t");

		// 3. Populate the key object
		k.setOrder_id(Integer.parseInt(fields[0]));
		k.setPrice(Double.parseDouble(fields[2]));

		// 4. Emit the record
		context.write(k, NullWritable.get());
	}
}

OrderGroupingComparator class:

public class OrderGroupingComparator extends WritableComparator {

	protected OrderGroupingComparator() {
		super(OrderBean.class, true);
	}

	@Override
	public int compare(WritableComparable a, WritableComparable b) {

		OrderBean aBean = (OrderBean) a;
		OrderBean bBean = (OrderBean) b;

		int result;
		if (aBean.getOrder_id() > bBean.getOrder_id()) {
			result = 1;
		} else if (aBean.getOrder_id() < bBean.getOrder_id()) {
			result = -1;
		} else {
			result = 0;
		}

		return result;
	}
}

Reducer class:

public class OrderReducer extends Reducer<OrderBean, NullWritable, OrderBean, NullWritable> {

	@Override
	protected void reduce(OrderBean key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
		
		context.write(key, NullWritable.get());
	}
}

Driver class:

public class OrderDriver {

	public static void main(String[] args) {
	  ...
	  // Set the grouping comparator used on the reduce side
	  job.setGroupingComparatorClass(OrderGroupingComparator.class);
	  ...
	}
}

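3.4 Combiner

The Combiner is a component of an MR program in addition to the Mapper and Reducer; its parent class is Reducer. A Combiner runs on the node of each MapTask, and its purpose is to aggregate each MapTask's output locally to reduce the amount of data sent over the network.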
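
The saving can be illustrated with a plain-Java sketch of one MapTask's word-count output. The class CombinerEffect is hypothetical and only models the local aggregation, not the Hadoop Combiner API.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: local aggregation of the (word, 1) records of a single map task.
final class CombinerEffect {

    /** Collapses repeated words into one (word, count) record each. */
    static Map<String, Integer> combine(List<String> mapOutputWords) {
        Map<String, Integer> combined = new TreeMap<>();
        for (String word : mapOutputWords) {
            combined.merge(word, 1, Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<String> mapOutput = List.of("hadoop", "spark", "hadoop", "hadoop");
        Map<String, Integer> combined = combine(mapOutput);
        System.out.println("records before: " + mapOutput.size());
        System.out.println("records after:  " + combined.size() + " " + combined);
    }
}
```

Here four records shrink to two before they cross the network, and the Reducer still computes the same totals because summing partial sums gives the same result as summing the individual ones.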

Custom WordCountCombiner:

public class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {

    IntWritable v = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // 1. Sum the counts
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        v.set(sum);
        context.write(key, v);
    }
}

Specify the Combiner in the driver class:

job.setCombinerClass(WordCountCombiner.class);

[Figure: before using the Combiner]

[Figure: after using the Combiner]

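4. OutputFormat Implementations

OutputFormat is the base class for MapReduce output; every MapReduce output implementation extends it. Several common OutputFormat implementations are described below.

Text output: TextOutputFormat
The default output format is TextOutputFormat, which writes each record as a line of text. Its keys and values can be of any type, because TextOutputFormat calls toString() to convert them to strings.
SequenceFileOutputFormat
SequenceFileOutputFormat writes its output as a sequence file. It is a good choice when the output will serve as input to a subsequent MapReduce job, because the format is compact and compresses well.
Custom OutputFormat
Implements output tailored to the user's requirements.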

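4.1 Custom OutputFormat

Use case: define a custom OutputFormat to control the path and format of the final output files.

Steps for a custom OutputFormat:

Define a class that extends FileOutputFormat.
Implement a RecordWriter and override its write() method.

Example:
Filter the input log.txt: lines for sites containing google go to d:/output/google.log, and all other lines go to d:/output/other.log.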

Mapper class:

public class FilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        context.write(value, NullWritable.get());
    }
}

Reducer class:

public class FilterReducer extends Reducer<Text, NullWritable, Text, NullWritable> {

    Text k = new Text();

    @Override
    protected void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        // Get the line
        String line = key.toString();
        line = line + "\r\n";
        k.set(line);
        context.write(k, NullWritable.get());
    }
}

Custom RecordWriter:

public class FilterRecordWriter extends RecordWriter<Text, NullWritable> {

    FSDataOutputStream os1 = null;
    FSDataOutputStream os2 = null;

    public FilterRecordWriter(TaskAttemptContext job) {
        // 1. Get the file system
        FileSystem fs;
        try {
            fs = FileSystem.get(job.getConfiguration());
            os1 = fs.create(new Path("d:/output/google.log"));
            os2 = fs.create(new Path("d:/output/other.log"));
        } catch (IOException e) {
            e.printStackTrace();
        }

    }

    @Override
    public void write(Text key, NullWritable value) throws IOException, InterruptedException {
        if (key.toString().contains("google")) {
            os1.write(key.toString().getBytes());
        } else {
            os2.write(key.toString().getBytes());
        }
    }

    @Override
    public void close(TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        IOUtils.closeStream(os1);
        IOUtils.closeStream(os2);
    }
}

Custom FileOutputFormat:

public class FilterOutputFormat extends FileOutputFormat<Text, NullWritable> {
    @Override
    public RecordWriter<Text, NullWritable> getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException {
        return new FilterRecordWriter(job);
    }
}

Driver:

public class FilterDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        job.setJarByClass(FilterDriver.class);
        job.setMapperClass(FilterMapper.class);
        job.setReducerClass(FilterReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // Register the custom output format with the job
        job.setOutputFormatClass(FilterOutputFormat.class);

        FileInputFormat.setInputPaths(job, new Path("D:\\MyFile\\test"));
        // Output directory; the _SUCCESS marker file is written here
        FileOutputFormat.setOutputPath(job, new Path("d:\\output"));
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);

    }
}

