Big Data Technology: Hadoop MapReduce (3): Custom InputFormat in Practice

Original

张反水

2020-03-16 14:38

3.1.9 Custom InputFormat in Practice

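Both HDFS and MapReduce handle small files very inefficiently, yet workloads with large numbers of small files are hard to avoid, so a workaround is needed. One option is a custom InputFormat that merges the small files.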

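1. Requirements

Merge several small files into one SequenceFile (SequenceFile is the Hadoop file format for storing binary key-value pairs). The SequenceFile holds multiple files, each stored as one record: the key is the file path plus file name, and the value is the file content.

(1) Input data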

one.txt

yongpeng weidong weinan
sanfeng luozong xiaoming

two.txt

longlong fanfan
mazong kailun yuhang yixin
longlong fanfan
mazong kailun yuhang yixi

three.txt

shuaige changmo zhenqiang
dongli lingu xuanxuan

(2) Expected output file format
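The record layout described in the requirement (file path as key, whole file content as value) can be modeled without Hadoop. The sketch below is only an illustration under that assumption: the class name SmallFilePacker and the method pack are hypothetical, and a sorted in-memory map stands in for the SequenceFile.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper, not part of Hadoop: models "path -> whole file content" records. */
class SmallFilePacker {

    /** Reads every regular file directly under dir into a map sorted by absolute path. */
    static Map<String, byte[]> pack(Path dir) throws IOException {
        List<Path> files;
        try (Stream<Path> listing = Files.list(dir)) {
            files = listing.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        Map<String, byte[]> records = new TreeMap<>();
        for (Path file : files) {
            // key = full path including the file name, value = the entire file content
            records.put(file.toAbsolutePath().toString(), Files.readAllBytes(file));
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway input directory with two of the sample files
        Path dir = Files.createTempDirectory("small-files");
        Files.write(dir.resolve("one.txt"),
                "yongpeng weidong weinan\nsanfeng luozong xiaoming\n".getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("three.txt"),
                "shuaige changmo zhenqiang\ndongli lingu xuanxuan\n".getBytes(StandardCharsets.UTF_8));

        for (Map.Entry<String, byte[]> record : pack(dir).entrySet()) {
            System.out.println(record.getKey() + " -> " + record.getValue().length + " bytes");
        }
    }
}
```

In the real job the same pairing is produced by the custom RecordReader and written out by SequenceFileOutputFormat; the map here only makes the key and value roles visible.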

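2. Implementation

(1) Custom InputFormat

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

/**
 * @Author zhangyong
 * @Date 2020/3/6 22:24
 * @Version 1.0
 * Custom InputFormat that extends FileInputFormat and reads each file as one record.
 */
public class WholeFileInputformat extends FileInputFormat<Text, BytesWritable> {

    // Never split a file: each small file must reach the RecordReader whole.
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        WholeRecordReader recordReader = new WholeRecordReader();
        // The framework also calls initialize() before reading, so this call is redundant but harmless.
        recordReader.initialize(split, context);
        return recordReader;
    }
}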
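To see what returning false from isSplitable guarantees, the sketch below models how a file might be cut into splits. It is a simplified model, not Hadoop code: the class SplitCounter, the method countSplits and the 1.1 slack factor are assumptions of this sketch, and sizes are given in arbitrary units.

```java
/** Hypothetical, simplified model of split counting; not the Hadoop implementation. */
class SplitCounter {

    // Assumed slack: the last split may be up to 10% larger than splitSize.
    static final double SPLIT_SLOP = 1.1;

    /** Returns how many splits a file of the given length yields under this model. */
    static int countSplits(long length, long splitSize, boolean splitable) {
        if (length == 0 || !splitable) {
            return 1; // the whole file goes to a single split
        }
        int splits = 0;
        long remaining = length;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++; // the tail of the file
        }
        return splits;
    }

    public static void main(String[] args) {
        System.out.println(countSplits(300, 128, true));  // a large file is cut into pieces
        System.out.println(countSplits(300, 128, false)); // a non-splittable file stays whole
    }
}
```

Files smaller than one split already stay whole in this model; overriding isSplitable makes that a guarantee regardless of file size, which the whole-file RecordReader relies on.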

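(2) Custom RecordReader class

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/**
 * @Author zhangyong
 * @Date 2020/3/6 22:24
 * @Version 1.0
 * RecordReader that emits one record per file: key = file path, value = file content.
 */
public class WholeRecordReader extends RecordReader<Text, BytesWritable> {
    private Configuration configuration;
    private FileSplit split;
    // true until the single record (the whole file) has been read
    private boolean isProgress = true;
    private BytesWritable value = new BytesWritable();
    private Text k = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        this.split = (FileSplit) split;
        configuration = context.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (isProgress) {
            // 1 Allocate a buffer the size of the split (the split is the whole file)
            byte[] contents = new byte[(int) split.getLength()];
            FSDataInputStream fis = null;
            try {
                // 2 Get the file system
                Path path = split.getPath();
                FileSystem fs = path.getFileSystem(configuration);
                // 3 Open the file
                fis = fs.open(path);
                // 4 Read the whole file into the buffer
                IOUtils.readFully(fis, contents, 0, contents.length);
                // 5 Store the file content as the value
                value.set(contents, 0, contents.length);
                // 6 Use the file path and name as the key
                k.set(path.toString());
            } finally {
                // A read failure now propagates as IOException instead of being swallowed
                IOUtils.closeStream(fis);
            }
            isProgress = false;
            return true;
        }
        return false;
    }

    @Override
    public Text getCurrentKey() throws IOException, InterruptedException {
        return k;
    }

    @Override
    public BytesWritable getCurrentValue() throws IOException, InterruptedException {
        return value;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        // Only one record exists, so progress is either 0 or 1
        return isProgress ? 0.0f : 1.0f;
    }

    @Override
    public void close() throws IOException {
        // Nothing to do: the stream is closed in nextKeyValue()
    }
}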
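The isProgress flag above exists because a map task keeps calling nextKeyValue() until it returns false. The sketch below imitates that loop in plain Java; OneRecordReader and drain are hypothetical names for illustration, not Hadoop classes, and the key string is made up.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a one-record reader; not a Hadoop class. */
class OneRecordReader {
    private final String key;
    private final byte[] value;
    private boolean pending = true; // plays the role of isProgress in WholeRecordReader

    OneRecordReader(String key, byte[] value) {
        this.key = key;
        this.value = value;
    }

    /** Returns true exactly once, then false on every later call. */
    boolean nextKeyValue() {
        if (pending) {
            pending = false;
            return true;
        }
        return false;
    }

    String getCurrentKey() {
        return key;
    }

    byte[] getCurrentValue() {
        return value;
    }

    float getProgress() {
        return pending ? 0.0f : 1.0f;
    }

    /** Imitates the read loop of a map task and returns the keys it saw. */
    static List<String> drain(OneRecordReader reader) {
        List<String> keys = new ArrayList<>();
        while (reader.nextKeyValue()) {
            keys.add(reader.getCurrentKey());
        }
        return keys;
    }

    public static void main(String[] args) {
        OneRecordReader reader = new OneRecordReader("file:/input/one.txt", new byte[] {1, 2, 3});
        System.out.println(drain(reader)); // the loop body runs once for the single record
    }
}
```

Without the flag, nextKeyValue() would return true forever and the loop would never end, which is why the reader flips it after the first record.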

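(3) Write the SequenceFileMapper class

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * @Author zhangyong
 * @Date 2020/3/6 22:24
 * @Version 1.0
 * Mapper that passes each (file path, file content) record through unchanged.
 */
public class SequenceFileMapper extends Mapper<Text, BytesWritable, Text, BytesWritable> {
    @Override
    protected void map(Text key, BytesWritable value, Context context) throws IOException, InterruptedException {
        context.write(key, value);
    }
}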

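(4) Write the SequenceFileReducer class

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * @Author zhangyong
 * @Date 2020/3/6 22:24
 * @Version 1.0
 */
public class SequenceFileReducer extends Reducer<Text, BytesWritable, Text, BytesWritable> {

    @Override
    protected void reduce(Text key, Iterable<BytesWritable> values, Context context) throws IOException, InterruptedException {
        // Each key is a distinct file path, so it carries exactly one value
        context.write(key, values.iterator().next());
    }
}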

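(5) Write the SequenceFileDriver class

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

/**
 * @Author zhangyong
 * @Date 2020/3/6 22:24
 * @Version 1.0
 */
public class SequenceFileDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        // Input and output paths (hard-coded for local testing; this overrides any command-line arguments)
        args = new String[2];
        args[0] = "src/main/resources/format/input";
        args[1] = "src/main/resources/format/output";

        // 1 Build the configuration: run locally against the local file system
        Configuration cfg = new Configuration();
        cfg.set("mapreduce.framework.name", "local");
        cfg.set("fs.defaultFS", "file:///");

        // Delete the output directory if it already exists, otherwise the job fails
        final FileSystem filesystem = FileSystem.get(cfg);
        if (filesystem.exists(new Path(args[1]))) {
            filesystem.delete(new Path(args[1]), true);
        }

        Job job = Job.getInstance(cfg);

        // 2 Set the jar location and attach the custom mapper and reducer
        job.setJarByClass(SequenceFileDriver.class);
        job.setMapperClass(SequenceFileMapper.class);
        job.setReducerClass(SequenceFileReducer.class);

        // 3 Set the input format
        job.setInputFormatClass(WholeFileInputformat.class);

        // 4 Set the output format
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        // 5 Set the map output key/value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(BytesWritable.class);

        // 6 Set the final output key/value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BytesWritable.class);

        // 7 Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // 8 Submit the job
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}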

3. Output
