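MapReduce Core Principles (Part 2)

Sorting in MapReduce

Both MapTask and ReduceTask sort data by key. This is Hadoop's default behavior: the data is sorted whether or not the application needs it. The default order is lexicographic, and the in-memory sort uses quicksort.

The sorting process works as follows:

MapTask

A MapTask first places its output in a circular buffer. Once buffer usage reaches a configured threshold, it quicksorts the data in the buffer and spills the sorted data to disk.
After the last spill, it merge-sorts all the spill files on disk.

ReduceTask

Once all the data has been copied, the ReduceTask runs a single merge sort over everything it holds in memory and on disk.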
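The two steps described above (sort each full buffer, then merge the sorted runs) can be sketched in plain Java without any Hadoop dependency. This is only an illustration under simplifying assumptions: the class name SpillMergeSketch, the fixed bufferSize, and the in-memory lists standing in for spill files are all hypothetical, and Collections.sort stands in for the quicksort Hadoop applies to the buffer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

class SpillMergeSketch {

    /** Map side: sort each full buffer on its own before it is "spilled". */
    static List<List<String>> sortSpills(List<String> keys, int bufferSize) {
        List<List<String>> spills = new ArrayList<>();
        for (int start = 0; start < keys.size(); start += bufferSize) {
            int end = Math.min(start + bufferSize, keys.size());
            List<String> spill = new ArrayList<>(keys.subList(start, end));
            Collections.sort(spill); // lexicographic order; stand-in for the buffer quicksort
            spills.add(spill);
        }
        return spills;
    }

    /** Merge already-sorted runs into one sorted list (k-way merge). */
    static List<String> merge(List<List<String>> runs) {
        // Each heap entry is {runIndex, positionInRun}, ordered by the key it points at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
                (a, b) -> runs.get(a[0]).get(a[1]).compareTo(runs.get(b[0]).get(b[1])));
        for (int i = 0; i < runs.size(); i++) {
            if (!runs.get(i).isEmpty()) {
                heap.add(new int[] {i, 0});
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] head = heap.poll();
            List<String> run = runs.get(head[0]);
            merged.add(run.get(head[1]));
            if (head[1] + 1 < run.size()) {
                heap.add(new int[] {head[0], head[1] + 1}); // advance within the same run
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("pear", "apple", "fig", "kiwi", "banana", "cherry", "date");
        List<List<String>> spills = sortSpills(keys, 3);
        System.out.println("sorted spills: " + spills);
        System.out.println("merged:        " + merge(spills));
    }
}
```

The merge never re-sorts data; it only picks the smallest head among runs that are already sorted, which is why merge sort suits the spill files on the map side and the copied map outputs on the reduce side.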

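Sort types

Partial sort

MapReduce sorts the data set by the keys of the input records, which guarantees that each output file is sorted internally.

Total sort

The final output is a single, internally sorted file. The way to get this is to configure only one ReduceTask, but this is extremely inefficient for large files: a single machine processes all the data, so the parallelism that MapReduce provides is lost entirely.

Auxiliary sort (grouping sort)

Keys are grouped on the reduce side. Use this when the key is a bean object and you want keys that share one or more field values to go to the same reduce() call.

Secondary sort

In a custom sort, if compareTo compares on two conditions, the result is a secondary sort.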
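As a minimal sketch of a two-condition compareTo, the plain Java class below sorts first by one field and then by a second. The ScoreKey class and its fields are hypothetical and exist only for illustration; a real Hadoop key would implement WritableComparable and also provide write and readFields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SecondarySortSketch {

    /** Hypothetical key with two fields. */
    static class ScoreKey implements Comparable<ScoreKey> {
        final String className;
        final int score;

        ScoreKey(String className, int score) {
            this.className = className;
            this.score = score;
        }

        @Override
        public int compareTo(ScoreKey other) {
            // First condition: class name, ascending.
            int byClass = className.compareTo(other.className);
            if (byClass != 0) {
                return byClass;
            }
            // Second condition: score, descending.
            return Integer.compare(other.score, score);
        }

        @Override
        public String toString() {
            return className + ":" + score;
        }
    }

    /** Sorts a copy of the keys and returns their string forms in order. */
    static List<String> sortedLabels(List<ScoreKey> keys) {
        List<ScoreKey> copy = new ArrayList<>(keys);
        Collections.sort(copy);
        List<String> labels = new ArrayList<>();
        for (ScoreKey key : copy) {
            labels.add(key.toString());
        }
        return labels;
    }

    public static void main(String[] args) {
        List<ScoreKey> keys = Arrays.asList(
                new ScoreKey("b", 70), new ScoreKey("a", 80),
                new ScoreKey("b", 90), new ScoreKey("a", 95));
        System.out.println(sortedLabels(keys));
    }
}
```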

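The sorting interface WritableComparable

As described above, MapReduce sorts by key. To use a bean object as the key, the bean must therefore implement the WritableComparable interface and override compareTo to define the sort order.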

@Setter
@Getter
public class CustomSort implements WritableComparable<CustomSort> {

    private Long orderId;

    private String orderCode;



    @Override
    public int compareTo(CustomSort o) {
        return orderId.compareTo(o.orderId);
    }

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeLong(orderId);
        dataOutput.writeUTF(orderCode);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        this.orderId = dataInput.readLong();
        this.orderCode = dataInput.readUTF();
    }
}

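Grouping with GroupingComparator

GroupingComparator is a reduce-side component of MapReduce. Its job is to decide which records form one group, and each group triggers one call to the reduce logic. By default, every distinct key is its own group. With a custom GroupingComparator, different keys can be treated as one group and handled in a single reduce call.

Worked example: find the transaction with the largest amount in each order. The data below gives only the order line id and the amount. Order lines whose id has the same part before "_" belong to the same order.

Order line id	Amount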
order1_1	345
order1_2	4325
order1_3	44
order2_1	33
order2_2	11
order2_3	55

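Approach

Mapper:

Read one line of text and split it into fields.
Wrap the order line id and the amount in a bean object and use it as the key. The sort rule is: order by the order id (the part of the order line id before "_"), and for equal order ids, by amount in descending order.
Emit key = the bean object, value = NullWritable.get().

Shuffle:

Use a custom partitioner so that records with the same order id go to the same partition.

Reduce:

Use a custom GroupingComparator whose rule is that records with the same order id belong to the same group.
Each reduce() call writes the first record of its group, which is the record with the largest amount.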
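The interplay of the sort rule and the grouping rule can be simulated in plain Java on the sample data, without running a job. This is only a sketch: GroupingSketch, OrderLine, SORT, and GROUP are hypothetical names, and the loop merely imitates how the framework starts a new reduce() call whenever the grouping comparator reports a different key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class GroupingSketch {

    /** Hypothetical stand-in for the bean used as the map output key. */
    static class OrderLine {
        final String orderLineId;
        final double price;

        OrderLine(String orderLineId, double price) {
            this.orderLineId = orderLineId;
            this.price = price;
        }

        String orderId() {
            return orderLineId.split("_")[0];
        }
    }

    // Sort rule of the key: order id ascending, then price descending.
    static final Comparator<OrderLine> SORT = (a, b) -> {
        int byOrder = a.orderId().compareTo(b.orderId());
        return byOrder != 0 ? byOrder : Double.compare(b.price, a.price);
    };

    // Grouping rule: only the order id decides whether two keys share a group.
    static final Comparator<OrderLine> GROUP = (a, b) -> a.orderId().compareTo(b.orderId());

    /** Returns the first record of every group, i.e. the largest amount per order. */
    static List<String> maxPerOrder(List<OrderLine> lines) {
        List<OrderLine> sorted = new ArrayList<>(lines);
        sorted.sort(SORT);
        List<String> result = new ArrayList<>();
        OrderLine previous = null;
        for (OrderLine line : sorted) {
            // A new group begins when the grouping comparator sees a different order id.
            if (previous == null || GROUP.compare(previous, line) != 0) {
                result.add(line.orderLineId + "\t" + line.price);
            }
            previous = line;
        }
        return result;
    }

    public static void main(String[] args) {
        List<OrderLine> lines = Arrays.asList(
                new OrderLine("order1_1", 345), new OrderLine("order1_2", 4325),
                new OrderLine("order1_3", 44), new OrderLine("order2_1", 33),
                new OrderLine("order2_2", 11), new OrderLine("order2_3", 55));
        for (String row : maxPerOrder(lines)) {
            System.out.println(row);
        }
    }
}
```

Because the sort puts the largest amount first within each order, the reducer does not need to iterate over the values at all; writing the key it receives is enough.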

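Reference code:

// Lombok generates the getters and setters that the mapper, partitioner and comparator call.
@Setter
@Getter
public class OrderBean implements WritableComparable<OrderBean> {

    private String orderLineId;

    private Double price;

    @Override
    public int compareTo(OrderBean o) {
        String thisOrderId = orderLineId.split("_")[0];
        String otherOrderId = o.getOrderLineId().split("_")[0];
        // Order id ascending first.
        int compare = thisOrderId.compareTo(otherOrderId);
        if (compare == 0) {
            // Same order: higher price first, so the maximum leads its group.
            return o.price.compareTo(price);
        }
        return compare;
    }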

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeUTF(orderLineId);
        dataOutput.writeDouble(price);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        this.orderLineId = dataInput.readUTF();
        this.price = dataInput.readDouble();
    }
}

public class OrderMapper extends Mapper<LongWritable, Text,OrderBean, NullWritable> {

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, OrderBean, NullWritable>.Context context) throws IOException, InterruptedException {
        // The fields in the sample data may be separated by a tab or by spaces.
        String[] split = value.toString().split("\\s+");
        OrderBean orderBean=new OrderBean();
        orderBean.setOrderLineId(split[0]);
        orderBean.setPrice(Double.parseDouble(split[1]));

        context.write(orderBean,NullWritable.get());
    }
}

Custom partitioner:

public class OrderPartitioner extends Partitioner<OrderBean, NullWritable> {
    @Override
    public int getPartition(OrderBean orderBean, NullWritable nullWritable, int i) {
        // Records with the same order id go to the same reducer
        String orderId = orderBean.getOrderLineId().split("_")[0];
        return (orderId.hashCode() & Integer.MAX_VALUE) % i;
    }
}
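The partition formula can be tried out in isolation. The sketch below repeats the same expression in plain Java; PartitionSketch and partitionFor are hypothetical names. Masking with Integer.MAX_VALUE clears the sign bit, so the result stays non-negative even when hashCode() is negative.

```java
class PartitionSketch {

    /** Maps an order line id to a partition using the order id before "_". */
    static int partitionFor(String orderLineId, int numPartitions) {
        String orderId = orderLineId.split("_")[0];
        // Clear the sign bit first, then take the remainder.
        return (orderId.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        String[] ids = {"order1_1", "order1_2", "order2_1", "order2_3"};
        for (String id : ids) {
            System.out.println(id + " -> partition " + partitionFor(id, 2));
        }
    }
}
```

All lines of one order land in the same partition because only the order id feeds the hash, which is what the grouping step on the reduce side relies on.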

Grouping comparator:

public class OrderGroupingComparator extends WritableComparator {

    public OrderGroupingComparator() {
        super(OrderBean.class,true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String aOrderId = ((OrderBean) a).getOrderLineId().split("_")[0];
        String bOrderId = ((OrderBean) b).getOrderLineId().split("_")[0];
        return aOrderId.compareTo(bOrderId);
    }
}

public class OrderReducer extends Reducer<OrderBean, NullWritable,OrderBean,NullWritable> {

    @Override
    protected void reduce(OrderBean key, Iterable<NullWritable> values, Reducer<OrderBean, NullWritable, OrderBean, NullWritable>.Context context) throws IOException, InterruptedException {
        context.write(key,NullWritable.get());
    }
}

Driver class:

public class OrderDriver {

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
//        System.setProperty("java.library.path","d://");
        Configuration conf = new Configuration();
        Job job=Job.getInstance(conf,"OrderDriver");

        // Locate the jar that contains this program
        job.setJarByClass(OrderDriver.class);

        // Mapper and Reducer classes for this job
        job.setMapperClass(OrderMapper.class);
        job.setReducerClass(OrderReducer.class);

        // Key/value types of the mapper output
        job.setMapOutputKeyClass(OrderBean.class);
        job.setMapOutputValueClass(NullWritable.class);

        // Key/value types of the reduce output
        job.setOutputKeyClass(OrderBean.class);
        job.setOutputValueClass(NullWritable.class);

        // Input and output directories of the job
        FileInputFormat.setInputPaths(job,new Path(args[0]));
        FileOutputFormat.setOutputPath(job,new Path(args[1]));

        // Custom partitioner and grouping comparator
        job.setPartitionerClass(OrderPartitioner.class);
        job.setGroupingComparatorClass(OrderGroupingComparator.class);

//        job.setNumReduceTasks(2);
        boolean result = job.waitForCompletion(true);
        System.exit( result ? 0: 1);

    }
}

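Reading and writing data in MapReduce

InputFormat

The input to a MapReduce program can come in many formats: line-based log files, binary files, database tables, and so on. So how does MapReduce read each of these kinds of data?

InputFormat is the class the MapReduce framework uses to read data. Commonly used InputFormat subclasses:

TextInputFormat (plain text files; the framework's default implementation)
KeyValueTextInputFormat (reads one line of text at a time and splits it into a key and a value on a configured separator)
NLineInputFormat (creates input splits by number of lines)
CombineTextInputFormat (combines small files to avoid launching too many MapTasks)
Custom InputFormat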

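1. CombineTextInputFormat example

The framework's default TextInputFormat plans splits file by file. However small a file is, it becomes its own split and is handled by its own MapTask. With a large number of small files, a correspondingly large number of MapTasks is created and launched, and much of the cost goes into task initialization, startup, and teardown rather than useful work.

CombineTextInputFormat is meant for jobs with many small files. It logically groups several small files into one split, so that a single MapTask can process them, which improves resource utilization.

Usage:

// If no InputFormat is set, TextInputFormat.class is used by default
job.setInputFormatClass(CombineTextInputFormat.class);
// Set the maximum virtual storage split size to 4 MB
CombineTextInputFormat.setMaxInputSplitSize(job, 4194304);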

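How CombineTextInputFormat builds splits

Assume setMaxInputSplitSize is set to 4M and there are four small files: 1.txt (2M), 2.txt (7M), 3.txt (0.3M), and 4.txt (8.2M).

Virtual storage phase:

The size of each file in the input directory is compared in turn with the setMaxInputSplitSize value. A file no larger than the maximum becomes one logical block. If a file is more than twice the maximum, a block of the maximum size is cut off. When the remaining data is larger than the maximum but no more than twice the maximum, it is divided evenly into two virtual storage blocks, which prevents very small splits. For example, with setMaxInputSplitSize at 4M and an 8.02M input file, one 4M block is cut off first. The remainder is 4.02M; cutting another 4M block would leave a tiny 0.02M virtual storage file, so the remaining 4.02M is instead divided into two blocks of 2.01M each.

1.txt, 2M: one block
2.txt, 7M: larger than 4M but no more than twice 4M, so two blocks of 3.5M each
3.txt, 0.3M: one block
4.txt, 8.2M: more than twice 4M, so one 4M block is cut off and the remaining 4.2M is divided into two blocks of 2.1M each

Split phase:

Check whether each virtual storage block is at least as large as the setMaxInputSplitSize value. If it is, the block forms a split on its own.
If it is not, it is merged with the following virtual storage blocks until their combined size reaches the maximum, and together they form one split.
For the input above, the virtual storage phase turns the 4 files into 7 blocks: 2M, 3.5M, 3.5M, 0.3M, 4M, 2.1M, 2.1M
Merging yields 3 splits: (2+3.5)M, (3.5+0.3+4)M, (2.1+2.1)M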
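The two phases can be checked by hand with a small plain-Java simulation of the rules as stated above. It is a sketch of this simplified description only, not of Hadoop's actual implementation; CombineSplitSketch and its methods are hypothetical names, and sizes are given in hundredths of a megabyte (so 4M is 400) to keep the arithmetic in whole numbers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CombineSplitSketch {

    /** Virtual storage phase: cut every file into blocks no larger than max. */
    static List<Long> virtualBlocks(List<Long> fileSizes, long max) {
        List<Long> blocks = new ArrayList<>();
        for (long size : fileSizes) {
            long remaining = size;
            while (remaining > 2 * max) {   // more than twice the max: cut off one full block
                blocks.add(max);
                remaining -= max;
            }
            if (remaining > max) {          // between max and 2 * max: halve to avoid a tiny tail
                long half = remaining / 2;
                blocks.add(half);
                blocks.add(remaining - half);
            } else if (remaining > 0) {     // at most max: keep as a single block
                blocks.add(remaining);
            }
        }
        return blocks;
    }

    /** Split phase: accumulate blocks until the running total reaches max. */
    static List<Long> splits(List<Long> blocks, long max) {
        List<Long> result = new ArrayList<>();
        long current = 0;
        for (long block : blocks) {
            current += block;
            if (current >= max) {
                result.add(current);
                current = 0;
            }
        }
        if (current > 0) {
            result.add(current); // leftover blocks form the last split
        }
        return result;
    }

    public static void main(String[] args) {
        // 2M, 7M, 0.3M and 8.2M in hundredths of a megabyte; the maximum is 4M.
        List<Long> files = Arrays.asList(200L, 700L, 30L, 820L);
        List<Long> blocks = virtualBlocks(files, 400L);
        System.out.println("blocks: " + blocks);
        System.out.println("splits: " + splits(blocks, 400L));
    }
}
```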

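2. Custom InputFormat

Both HDFS and MapReduce handle small files very inefficiently, yet workloads with large numbers of small files are hard to avoid, so a solution is needed. One option is a custom InputFormat that merges the small files.

Worked example

Requirement:

Merge multiple small files into one SequenceFile (SequenceFile is the Hadoop file format for storing binary key-value pairs). The SequenceFile holds many files, each stored with its path and name as the key and its content as the value.

Approach:

Define a class that extends FileInputFormat.
Override isSplitable() to mark files as non-splittable, and override createRecordReader() to create a custom RecordReader.
Change the default way of reading data so that one whole file is read at a time and emitted as a single key-value pair.
Specify the custom InputFormat in the driver.

Reference code: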

public class CustomFileInputFormat extends FileInputFormat<Text, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        CustomRecordReader reader = new CustomRecordReader();
        reader.initialize(inputSplit, taskAttemptContext);
        return reader;
    }
}

public class CustomRecordReader extends RecordReader <Text, BytesWritable> {

    private Configuration conf;
    private FileSplit split;
    private boolean isProgress=true;

    private BytesWritable value = new BytesWritable();

    private Text key = new Text();

    @Override
    public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        this.split = (FileSplit) inputSplit;
        this.conf = taskAttemptContext.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if(isProgress){
            FSDataInputStream fis = null;
            try {
                // Allocate a buffer the size of the whole file
                byte[] contents = new byte[(int) split.getLength()];
                // Get the file system
                Path path = split.getPath();
                FileSystem fs = path.getFileSystem(conf);
                // Open the file
                fis = fs.open(path);
                // Read the whole file into the buffer
                IOUtils.readFully(fis,contents,0,contents.length);
                // Use the file content as the value
                value.set(contents,0,contents.length);
                // Use the file path as the key
                String name = split.getPath().toString();

                key.set(name);
            } finally {
                IOUtils.closeStream(fis);
            }
            isProgress = false;
            return true;
        }
        return false;
    }

    @Override
    public Text getCurrentKey() throws IOException, InterruptedException {
        return key;
    }

    @Override
    public BytesWritable getCurrentValue() throws IOException, InterruptedException {
        return value;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return 0;
    }

    @Override
    public void close() throws IOException {

    }
}

Set the InputFormat class in the driver:

job.setInputFormatClass(CustomFileInputFormat.class);

OutputFormat

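OutputFormat is the base class for MapReduce output: every MapReduce output format extends the OutputFormat abstract class. Some common OutputFormat subclasses are described below.

TextOutputFormat

The default output format is TextOutputFormat, which writes each record as a line of text.

SequenceFileOutputFormat

SequenceFileOutputFormat is a good choice when the output feeds a subsequent MapReduce job, because the format is compact and compresses well.

Custom OutputFormat

Worked example

Requirement:

A MapReduce program must write its results to different output files depending on whether each number is odd or even.

Approach:

Define a class that extends FileOutputFormat.
Implement a custom RecordWriter and override its write method.

Reference code: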

public class CustomOutputFormat extends FileOutputFormat<Text, NullWritable> {

    @Override
    public RecordWriter<Text, NullWritable> getRecordWriter(TaskAttemptContext context) throws IOException, InterruptedException {
        FileSystem fs = FileSystem.get(context.getConfiguration());
        FSDataOutputStream oddOut = fs.create(new Path("e:/odd.log"));
        FSDataOutputStream evenOut = fs.create(new Path("e:/even.log"));
        return new CustomWriter(oddOut, evenOut);
    }
}

public class CustomWriter extends RecordWriter<Text, NullWritable> {

    private FSDataOutputStream oddOut;
    private FSDataOutputStream evenOut;

    public CustomWriter(FSDataOutputStream oddOut, FSDataOutputStream evenOut) {
        this.oddOut = oddOut;
        this.evenOut = evenOut;
    }

    @Override
    public void write(Text text, NullWritable nullWritable) throws IOException, InterruptedException {
        Integer number = Integer.valueOf(text.toString());
        System.out.println(text.toString());
        if(number%2==0){
            evenOut.write(text.toString().getBytes());
            evenOut.write("\r\n".getBytes());
        }else {
            oddOut.write(text.toString().getBytes());
            oddOut.write("\r\n".getBytes());
        }
    }

    @Override
    public void close(TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        IOUtils.closeStream(oddOut);
        IOUtils.closeStream(evenOut);
    }
}

Set the OutputFormat class:

job.setOutputFormatClass(CustomOutputFormat.class);

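Data compression in the Shuffle phase

Compression formats supported by Hadoop

Data compression has two main benefits: it saves disk space, and it speeds up data transfer over the network and to and from disk.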
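To get a feel for the space saving, the sketch below compresses repetitive log-like text with the JDK's own GZIPOutputStream and compares sizes. It uses java.util.zip rather than Hadoop's codec classes, so it only illustrates the DEFLATE-based gzip format in general; GzipSizeSketch, sampleLog, and the log line content are made up for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

class GzipSizeSketch {

    /** Compresses the bytes in memory with gzip and returns the result. */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Builds repetitive, log-like sample text (hypothetical content). */
    static byte[] sampleLog(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("2024-01-01 INFO order processed id=").append(i % 10).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = sampleLog(1000);
        byte[] compressed = gzip(raw);
        System.out.println("raw bytes:        " + raw.length);
        System.out.println("compressed bytes: " + compressed.length);
    }
}
```

The saving depends heavily on the data: repetitive text such as logs shrinks a lot, while data that is already compressed barely shrinks at all.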

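Run bin/hadoop checknative to see which compression libraries the compiled Hadoop build supports. If openssl is reported as false, install the missing dependency package online.

Format	Bundled with Hadoop	Algorithm	File extension	Splittable	Changes needed to existing programs after switching format
DEFLATE	Yes	DEFLATE	.deflate	No	None
Gzip	Yes	DEFLATE	.gz	No	None
bzip2	Yes	bzip2	.bz2	Yes	None
LZO	No	LZO	.lzo	Yes	An index must be built and the input format specified
Snappy	No	Snappy	.snappy	No	None

[Figure: compression efficiency comparison]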

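Where to apply compression

Map input compression

Here compressed files serve as the Map input. There is no need to specify the codec explicitly: Hadoop checks the file extension and, if it matches a known compression format, selects the appropriate codec to decompress the data.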
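The extension check can be pictured as a simple lookup. The sketch below imitates it in plain Java using the extensions from the format table; it is not Hadoop's implementation (in Hadoop the lookup is done by CompressionCodecFactory), and CodecLookupSketch and formatFor are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class CodecLookupSketch {

    // File extension -> format name, taken from the table of supported formats.
    private static final Map<String, String> FORMAT_BY_EXTENSION = new LinkedHashMap<>();

    static {
        FORMAT_BY_EXTENSION.put(".deflate", "DEFLATE");
        FORMAT_BY_EXTENSION.put(".gz", "Gzip");
        FORMAT_BY_EXTENSION.put(".bz2", "bzip2");
        FORMAT_BY_EXTENSION.put(".lzo", "LZO");
        FORMAT_BY_EXTENSION.put(".snappy", "Snappy");
    }

    /** Returns the format name for a file, or null if it is treated as uncompressed. */
    static String formatFor(String fileName) {
        for (Map.Entry<String, String> entry : FORMAT_BY_EXTENSION.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(formatFor("access.log.gz"));
        System.out.println(formatFor("access.log"));
    }
}
```

One consequence of this design is that a compressed file with a missing or wrong extension will not be recognized, so it pays to keep the extensions from the table intact.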

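Map output compression

Shuffle is the most resource-intensive phase of an MR job. If the volume of data makes network transfer slow, consider compressing the map output.

Reduce output compression

Compressing the result data reduces the amount of data stored and the disk space required, and the compressed output can be used directly as the input of a second MR job.

How to configure compression

Set the options through Configuration in the driver code.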

// Enable compression of map output
Configuration configuration = new Configuration();
configuration.set("mapreduce.map.output.compress","true");
configuration.set("mapreduce.map.output.compress.codec","org.apache.hadoop.io.compress.SnappyCodec");
// Enable compression of reduce output
configuration.set("mapreduce.output.fileoutputformat.compress","true");
configuration.set("mapreduce.output.fileoutputformat.compress.type","RECORD");
configuration.set("mapreduce.output.fileoutputformat.compress.codec","org.apache.hadoop.io.compress.SnappyCodec");

Alternatively, configure mapred-site.xml. This setting is global and applies to all MR jobs.

<property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value>
</property>
<property>
    <name>mapreduce.output.fileoutputformat.compress.type</name>
    <value>RECORD</value>
</property>
<property>
    <name>mapreduce.output.fileoutputformat.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>

Compression in practice

Add the following to the driver code:

configuration.set("mapreduce.output.fileoutputformat.compress","true");
configuration.set("mapreduce.output.fileoutputformat.compress.type","RECORD");
configuration.set("mapreduce.output.fileoutputformat.compress.codec","org.apache.hadoop.io.compress.SnappyCodec");
