Hadoop's base InputFormat implementation is FileInputFormat<K,V>, which has the following five subclasses: CombineFileInputFormat<K,V>, TextInputFormat<K,V>, KeyValueTextInputFormat<K,V>, NLineInputFormat<K,V>, and SequenceFileInputFormat<K,V>. Among these, TextInputFormat<K,V> is the default. Its input handling works as follows: each line of the input text becomes one record, with the line's byte offset as the key and the line's content as the value. This article covers another subclass of FileInputFormat: NLineInputFormat.
NLineInputFormat is an input format that Hadoop does not use by default. Unlike the default TextInputFormat, it builds each InputSplit from a fixed, user-specified number of lines, which in certain situations can noticeably improve job efficiency. For example: suppose the data file contains one floating-point number per line, 100 lines in total; the number of reducers is set to 5 and the lines per split to 20. Then every 20 lines of the text become one split, and each mapper is fed the content of one split. Since the number of mappers in Hadoop is determined by the number of splits, and for NLineInputFormat the mapper count is computed as mappers = splits = total lines / lines per split, the example above is divided into 5 mappers in total. The source code implementing this example is given below.
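The example assumes a 100-line input file of floats (the ComparatorNum.txt path referenced in the driver below). A minimal sketch to generate such a test file; the relative output path, file name, and fixed seed here are illustrative assumptions, adjust them to your environment:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class GenTestData {
    public static void main(String[] args) throws IOException {
        // 100 lines, one random float per line, matching the example above
        Path out = Paths.get("ComparatorNum.txt");
        Random rnd = new Random(42); // fixed seed for reproducibility
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            lines.add(String.valueOf(rnd.nextDouble()));
        }
        Files.write(out, lines);
        // With 20 lines per split, this file yields 100 / 20 = 5 splits
        System.out.println("lines written: " + lines.size());
    }
}
```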
package Hadoop_InputFormat;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class NLineSplite {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        String input = "E:/Document/Study/Data/ComparatorNum.txt";
        String output = "E:/Document/Study/Data/output";
        System.setProperty("hadoop.home.dir", "D:/Tools/Office/hadoop-2.6.0");
        Configuration conf = new Configuration();
        // Job(Configuration, String) is deprecated; use Job.getInstance instead
        Job job = Job.getInstance(conf, "NLineSplite");
        // Use NLineInputFormat with 20 lines per split: a 100-line input
        // yields 5 splits and hence 5 mappers
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 20);
        job.setNumReduceTasks(5);
        job.setJarByClass(NLineSplite.class);
        job.setMapperClass(NLineSpliteMap.class);
        job.setPartitionerClass(PartitionerByIndex.class);
        job.setReducerClass(NLineSpliteReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.setInputPaths(job, new Path(input));
        FileOutputFormat.setOutputPath(job, new Path(output));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
class NLineSpliteMap extends Mapper<LongWritable, Text, Text, LongWritable> {
    // One random index per mapper instance, used to tag every record of
    // this split so the custom partitioner routes the whole split to a
    // single reducer
    private final int index = (int) (Math.random() * 10);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Emit "line index" as the key and the line's byte offset as the value
        context.write(new Text(line + " " + index), key);
    }
}
class PartitionerByIndex extends Partitioner<Text, LongWritable> {
    @Override
    public int getPartition(Text key, LongWritable value, int numPartitions) {
        // The key has the form "line index"; extract the index and map it
        // onto one of the numPartitions reducers
        int index = Integer.parseInt(key.toString().split(" ")[1]);
        return Math.abs(index % numPartitions);
    }
}
class NLineSpliteReduce extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> value, Context context)
            throws IOException, InterruptedException {
        // Emit each tagged line once with a count of 1
        context.write(new Text(key.toString()), new LongWritable(1));
    }
}
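The routing arithmetic in the partitioner can be sketched standalone (PartitionDemo is a hypothetical illustration class, not part of the job): each mapper's random tag in [0, 10) is reduced modulo the reducer count.

```java
public class PartitionDemo {
    // Same arithmetic as PartitionerByIndex.getPartition above
    static int getPartition(int index, int numPartitions) {
        return Math.abs(index % numPartitions);
    }

    public static void main(String[] args) {
        int numReducers = 5;
        // A mapper's random tag is drawn from 0..9; show where each lands
        for (int index = 0; index < 10; index++) {
            System.out.println("index " + index + " -> reducer "
                    + getPartition(index, numReducers));
        }
    }
}
```

Note that with 5 reducers, tags 5 to 9 collide with 0 to 4, and because each tag is drawn at random per mapper instance, two mappers may land on the same reducer; the 5 splits are therefore not guaranteed to spread evenly across the 5 reducers.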
When each mapper is instantiated, it generates a random index that serves as the tag for its split; the custom partitioner then uses this tag to send each split's output to a different reducer. The run results are shown below: