Introduction
In this post I'd like to share how to customize input and output in MapReduce. That may sound abstract, so let's walk through an example taken from a practical scenario.
The custom input problem
Suppose we have a batch of small data files to process, each holding its own content. Apart from using a combine-style approach, is there a more efficient, more elegant way to merge these small files?
Below we will achieve exactly that with custom input, by overriding the methods of FileInputFormat.
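For comparison, the built-in route usually suggested for the small-files problem is CombineTextInputFormat, which packs many small text files into fewer splits. A minimal sketch of how that would be wired into a driver, assuming plain text input (the 4 MB split size is only an illustrative value):

// Pack small text files into combined splits of at most ~4 MB each.
// import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
job.setInputFormatClass(CombineTextInputFormat.class);
CombineTextInputFormat.setMaxInputSplitSize(job, 4 * 1024 * 1024);

That keeps the number of map tasks down, but it gives us no control over how the file contents themselves are packaged, which is what the custom input below is for.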
Custom input
Directory structure:
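The layout below is reconstructed from the classes and paths that appear later in this post:

costomInputFormat/
├── MapperClass.java
├── reducerClass.java
├── costomInputFormat.java
├── costomRecordReader.java
├── driver.java
├── data/   (the small input files to merge)
└── out/    (the job output)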
MapperClass
package costomInputFormat;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import java.io.IOException;

/**
 * @Author: Braylon
 * @Date: 2020/1/29 10:33
 * @Version: 1.0
 */
public class MapperClass extends Mapper<NullWritable, BytesWritable, Text, BytesWritable> {
    Text k = new Text();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Use the name of the file behind this split as the output key.
        FileSplit split = (FileSplit) context.getInputSplit();
        String name = split.getPath().getName();
        k.set(name);
    }

    @Override
    protected void map(NullWritable key, BytesWritable value, Context context) throws IOException, InterruptedException {
        // Emit (file name, whole file content).
        context.write(k, value);
    }
}
Key points:
- Input and output types of the map phase
  Look at the four type parameters <NullWritable, BytesWritable, Text, BytesWritable>. In short, the input key no longer carries an offset; all we want is the file content, and BytesWritable is the type that represents that content. We then emit the file name as the key (Text) and the file content as the value (BytesWritable).
- The map method itself does almost nothing; the interesting logic sits in setup.
  It simply retrieves the file name from the split. I have already covered and used FileSplit in earlier posts, so I won't repeat that here.
reducerClass
package costomInputFormat;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * @Author: Braylon
 * @Date: 2020/1/29 10:43
 * @Version: 1.0
 */
public class reducerClass extends Reducer<Text, BytesWritable, Text, BytesWritable> {
    @Override
    protected void reduce(Text key, Iterable<BytesWritable> values, Context context) throws IOException, InterruptedException {
        // The default implementation simply writes every (key, value) pair through unchanged.
        super.reduce(key, values, context);
    }
}
The reduce stage does nothing special, so we can skip over it.
Of course, I am only demonstrating custom input here; you can add whatever business logic you need in the reduce stage.
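As a hypothetical example of such logic, a reducer along the following lines would record each file's total size instead of its raw bytes. This is only a sketch: SizeReducer and its output type are not part of the original code, and you would also have to change job.setOutputValueClass in the driver accordingly.

package costomInputFormat;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

// Hypothetical example: emit (file name, total size in bytes) instead of the raw content.
public class SizeReducer extends Reducer<Text, BytesWritable, Text, LongWritable> {
    private final LongWritable size = new LongWritable();

    @Override
    protected void reduce(Text key, Iterable<BytesWritable> values, Context context) throws IOException, InterruptedException {
        long total = 0;
        for (BytesWritable value : values) {
            total += value.getLength();
        }
        size.set(total);
        context.write(key, size);
    }
}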
The key part:
costomInputFormat:
package costomInputFormat;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import java.io.IOException;

/**
 * @Author: Braylon
 * @Date: 2020/1/29 10:47
 * @Version: 1.0
 */
public class costomInputFormat extends FileInputFormat<NullWritable, BytesWritable> {
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        // Each small file should be read as a whole, so never split it.
        return false;
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        // Hand the framework our custom reader, which loads one whole file per record.
        costomRecordReader reader = new costomRecordReader();
        reader.initialize(inputSplit, taskAttemptContext);
        return reader;
    }
}
To customize the input, we override FileInputFormat's methods.
isSplitable decides whether a file may be split; whether you allow that depends on your data volume and requirements. My demo data is tiny, so there is no need to split.
createRecordReader is the crucial method: in the MapReduce flow the InputFormat actually reads data through a RecordReader, so costomRecordReader, our custom class below, is where the real custom-input logic lives.
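Conceptually, the map task drives our reader roughly like this (a simplified sketch, not the actual Hadoop source):

// Simplified sketch of how the map task consumes a RecordReader.
reader.initialize(split, context);
while (reader.nextKeyValue()) {
    mapper.map(reader.getCurrentKey(), reader.getCurrentValue(), context);
}
reader.close();

Because our reader returns true exactly once per file, each mapper receives a single record holding the whole file.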
costomRecordReader:
package costomInputFormat;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import java.io.IOException;

/**
 * @Author: Braylon
 * @Date: 2020/1/29 11:02
 * @Version: 1.0
 */
public class costomRecordReader extends RecordReader<NullWritable, BytesWritable> {
    private Configuration configuration;
    private FileSplit fileSplit;
    // Whether this split has already been processed, to avoid reading it twice.
    private boolean processed = false;
    // Buffer holding the value (the whole file content).
    private BytesWritable value = new BytesWritable();

    @Override
    public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        this.fileSplit = (FileSplit) inputSplit;
        configuration = taskAttemptContext.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (!processed) {
            byte[] bytes = new byte[(int) fileSplit.getLength()];
            // Get the file system
            Path path = fileSplit.getPath();
            FileSystem fs = path.getFileSystem(configuration);
            // Read the whole file into the buffer
            FSDataInputStream fis = fs.open(path);
            IOUtils.readFully(fis, bytes, 0, bytes.length);
            // Expose the file content as the current value
            value.set(bytes, 0, bytes.length);
            IOUtils.closeStream(fis);
            processed = true;
            return true;
        }
        return false;
    }

    @Override
    public NullWritable getCurrentKey() throws IOException, InterruptedException {
        return NullWritable.get();
    }

    @Override
    public BytesWritable getCurrentValue() throws IOException, InterruptedException {
        return value;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return processed ? 1 : 0;
    }

    @Override
    public void close() throws IOException {
    }
}
Key points:
- Fields
  private Configuration configuration; and private FileSplit fileSplit; hold the configuration and split information captured during initialization.
  private boolean processed = false; records whether this split has already been processed, so the file is not read twice.
  private BytesWritable value = new BytesWritable(); is the buffer that stores the value.
- Flow
  initialize reads the configuration and split information.
  nextKeyValue contains the main custom logic; once the file has been read, processed is set to true.
  getCurrentKey and getCurrentValue expose the key/value pair that is handed to the mapper.
driver
package costomInputFormat;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

import java.io.IOException;

/**
 * @Author: Braylon
 * @Date: 2020/1/29 11:17
 * @Version: 1.0
 */
public class driver {
    public static void main(String[] args) throws IOException {
        args = new String[]{"D:\\idea\\HDFS\\src\\main\\java\\costomInputFormat\\data", "D:\\idea\\HDFS\\src\\main\\java\\costomInputFormat\\out"};

        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(driver.class);
        job.setMapperClass(MapperClass.class);
        job.setReducerClass(reducerClass.class);

        // Plug in the custom input format and write the merged result as a SequenceFile.
        job.setInputFormatClass(costomInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(BytesWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BytesWritable.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        try {
            job.waitForCompletion(true);
            System.out.println("done");
        } catch (InterruptedException | ClassNotFoundException e) {
            e.printStackTrace();
        }
    }
}
Key points:
- job.setInputFormatClass(costomInputFormat.class);
  job.setOutputFormatClass(SequenceFileOutputFormat.class);
  The first call plugs in our custom class as the input format; the output format is one that ships with Hadoop and writes the result as a binary SequenceFile.
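To check the result you can read the SequenceFile back. A small sketch, assuming the job produced a single reduce output named part-r-00000 (adjust the path to your actual output):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ReadMergedFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("D:\\idea\\HDFS\\src\\main\\java\\costomInputFormat\\out\\part-r-00000");
        // Iterate over all (file name, file content) records in the merged output.
        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))) {
            Text key = new Text();
            BytesWritable value = new BytesWritable();
            while (reader.next(key, value)) {
                System.out.println(key + " -> " + value.getLength() + " bytes");
            }
        }
    }
}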
Corrections are welcome.
Let's keep learning together~~