Introduction
In this post I'd like to share how to customize input and output in MapReduce. That may sound abstract, so let's walk through an example taken from a practical scenario.
The custom input problem
Suppose we have a batch of small data files to process, each holding its own content. Apart from using a combine-style approach, is there a more efficient, more elegant way to merge these small files?
Below we will achieve exactly that with custom input, by overriding the methods of FileInputFormat.
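For comparison, the built-in route usually suggested for the small-files problem is CombineTextInputFormat, which packs many small text files into fewer splits. A minimal sketch of how that would be wired into a driver, assuming plain text input (the 4 MB split size is only an illustrative value):

// Pack small text files into combined splits of at most ~4 MB each.
// import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
job.setInputFormatClass(CombineTextInputFormat.class);
CombineTextInputFormat.setMaxInputSplitSize(job, 4 * 1024 * 1024);

That keeps the number of map tasks down, but it gives us no control over how the file contents themselves are packaged, which is what the custom input below is for.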
Custom input
Directory structure:
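The layout below is reconstructed from the classes and paths that appear later in this post:

costomInputFormat/
├── MapperClass.java
├── reducerClass.java
├── costomInputFormat.java
├── costomRecordReader.java
├── driver.java
├── data/   (the small input files to merge)
└── out/    (the job output)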
MapperClass
package costomInputFormat;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import java.io.IOException;

/**
 * @Author: Braylon
 * @Date: 2020/1/29 10:33
 * @Version: 1.0
 */
public class MapperClass extends Mapper<NullWritable, BytesWritable, Text, BytesWritable> {
    Text k = new Text();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Use the name of the file behind this split as the output key.
        FileSplit split = (FileSplit) context.getInputSplit();
        String name = split.getPath().getName();
        k.set(name);
    }

    @Override
    protected void map(NullWritable key, BytesWritable value, Context context) throws IOException, InterruptedException {
        // Emit (file name, whole file content).
        context.write(k, value);
    }
}
Key points:
- Input and output types of the map phase
  Look at the four type parameters <NullWritable, BytesWritable, Text, BytesWritable>. In short, the input key no longer carries an offset; all we want is the file content, and BytesWritable is the type that represents that content. We then emit the file name as the key (Text) and the file content as the value (BytesWritable).
- The map method itself does almost nothing; the interesting logic sits in setup.
  It simply retrieves the file name from the split. I have already covered and used FileSplit in earlier posts, so I won't repeat that here.
reducerClass
package costomInputFormat;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * @Author: Braylon
 * @Date: 2020/1/29 10:43
 * @Version: 1.0
 */
public class reducerClass extends Reducer<Text, BytesWritable, Text, BytesWritable> {
    @Override
    protected void reduce(Text key, Iterable<BytesWritable> values, Context context) throws IOException, InterruptedException {
        // The default implementation simply writes every (key, value) pair through unchanged.
        super.reduce(key, values, context);
    }
}
The reduce stage does nothing special, so we can skip over it.
Of course, I am only demonstrating custom input here; you can add whatever business logic you need in the reduce stage.
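As a hypothetical example of such logic, a reducer along the following lines would record each file's total size instead of its raw bytes. This is only a sketch: SizeReducer and its output type are not part of the original code, and you would also have to change job.setOutputValueClass in the driver accordingly.

package costomInputFormat;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

// Hypothetical example: emit (file name, total size in bytes) instead of the raw content.
public class SizeReducer extends Reducer<Text, BytesWritable, Text, LongWritable> {
    private final LongWritable size = new LongWritable();

    @Override
    protected void reduce(Text key, Iterable<BytesWritable> values, Context context) throws IOException, InterruptedException {
        long total = 0;
        for (BytesWritable value : values) {
            total += value.getLength();
        }
        size.set(total);
        context.write(key, size);
    }
}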
The key part:
costomInputFormat:
package costomInputFormat;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import java.io.IOException;

/**
 * @Author: Braylon
 * @Date: 2020/1/29 10:47
 * @Version: 1.0
 */
public class costomInputFormat extends FileInputFormat<NullWritable, BytesWritable> {
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        // Each small file should be read as a whole, so never split it.
        return false;
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        // Hand the framework our custom reader, which loads one whole file per record.
        costomRecordReader reader = new costomRecordReader();
        reader.initialize(inputSplit, taskAttemptContext);
        return reader;
    }
}
To customize the input, we override FileInputFormat's methods.
isSplitable decides whether a file may be split; whether you allow that depends on your data volume and requirements. My demo data is tiny, so there is no need to split.
createRecordReader is the crucial method: in the MapReduce flow the InputFormat actually reads data through a RecordReader, so costomRecordReader, our custom class below, is where the real custom-input logic lives.
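Conceptually, the map task drives our reader roughly like this (a simplified sketch, not the actual Hadoop source):

// Simplified sketch of how the map task consumes a RecordReader.
reader.initialize(split, context);
while (reader.nextKeyValue()) {
    mapper.map(reader.getCurrentKey(), reader.getCurrentValue(), context);
}
reader.close();

Because our reader returns true exactly once per file, each mapper receives a single record holding the whole file.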
costomRecordReader:
package costomInputFormat;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import java.io.IOException;

/**
 * @Author: Braylon
 * @Date: 2020/1/29 11:02
 * @Version: 1.0
 */
public class costomRecordReader extends RecordReader<NullWritable, BytesWritable> {
    private Configuration configuration;
    private FileSplit fileSplit;
    // Whether this split has already been processed, to avoid reading it twice.
    private boolean processed = false;
    // Buffer holding the value (the whole file content).
    private BytesWritable value = new BytesWritable();

    @Override
    public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        this.fileSplit = (FileSplit) inputSplit;
        configuration = taskAttemptContext.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (!processed) {
            byte[] bytes = new byte[(int) fileSplit.getLength()];
            // Get the file system
            Path path = fileSplit.getPath();
            FileSystem fs = path.getFileSystem(configuration);
            // Read the whole file into the buffer
            FSDataInputStream fis = fs.open(path);
            IOUtils.readFully(fis, bytes, 0, bytes.length);
            // Expose the file content as the current value
            value.set(bytes, 0, bytes.length);
            IOUtils.closeStream(fis);
            processed = true;
            return true;
        }
        return false;
    }

    @Override
    public NullWritable getCurrentKey() throws IOException, InterruptedException {
        return NullWritable.get();
    }

    @Override
    public BytesWritable getCurrentValue() throws IOException, InterruptedException {
        return value;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return processed ? 1 : 0;
    }

    @Override
    public void close() throws IOException {
    }
}
Key points:
- Fields
  private Configuration configuration; and private FileSplit fileSplit; hold the configuration and split information captured during initialization.
  private boolean processed = false; records whether this split has already been processed, so the file is not read twice.
  private BytesWritable value = new BytesWritable(); is the buffer that stores the value.
- Flow
  initialize reads the configuration and split information.
  nextKeyValue contains the main custom logic; once the file has been read, processed is set to true.
  getCurrentKey and getCurrentValue expose the key/value pair that is handed to the mapper.
driver
package costomInputFormat;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

import java.io.IOException;

/**
 * @Author: Braylon
 * @Date: 2020/1/29 11:17
 * @Version: 1.0
 */
public class driver {
    public static void main(String[] args) throws IOException {
        args = new String[]{"D:\\idea\\HDFS\\src\\main\\java\\costomInputFormat\\data", "D:\\idea\\HDFS\\src\\main\\java\\costomInputFormat\\out"};

        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(driver.class);
        job.setMapperClass(MapperClass.class);
        job.setReducerClass(reducerClass.class);

        // Plug in the custom input format and write the merged result as a SequenceFile.
        job.setInputFormatClass(costomInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(BytesWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BytesWritable.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        try {
            job.waitForCompletion(true);
            System.out.println("done");
        } catch (InterruptedException | ClassNotFoundException e) {
            e.printStackTrace();
        }
    }
}
Key points:
- job.setInputFormatClass(costomInputFormat.class);
  job.setOutputFormatClass(SequenceFileOutputFormat.class);
  The first call plugs in our custom class as the input format; the output format is one that ships with Hadoop and writes the result as a binary SequenceFile.
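To check the result you can read the SequenceFile back. A small sketch, assuming the job produced a single reduce output named part-r-00000 (adjust the path to your actual output):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ReadMergedFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("D:\\idea\\HDFS\\src\\main\\java\\costomInputFormat\\out\\part-r-00000");
        // Iterate over all (file name, file content) records in the merged output.
        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))) {
            Text key = new Text();
            BytesWritable value = new BytesWritable();
            while (reader.next(key, value)) {
                System.out.println(key + " -> " + value.getLength() + " bytes");
            }
        }
    }
}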
Corrections are welcome.
Let's keep learning together~~