[Hadoop Series] Multiple File Output in Hadoop MapReduce

 Original work by inkfish. Please do not reprint for commercial purposes; when reprinting, credit the source (http://blog.csdn.net/inkfish).

  Hadoop's default output format is TextOutputFormat, and its output file names cannot be customized. Hadoop 0.19.x shipped org.apache.hadoop.mapred.lib.MultipleOutputFormat, which can write multiple output files with custom names, but as of Hadoop 0.20.x every class in the package containing MultipleOutputFormat is marked as deprecated, so code that still relies on it may stop working in a future Hadoop release. In this article we implement a simple MultipleOutputFormat of our own and modify the WordCount example that ships with Hadoop to test it.

Environment:

  Ubuntu 8.0.4 Server 32bit
  Hadoop 0.20.1
  JDK 1.6.0_16-b01
  Eclipse 3.5

The code consists of three classes:

1. LineRecordWriter:

  An implementation of RecordWriter that turns a <Key, Value> pair into one line of text. In Hadoop this class exists as a nested class of TextOutputFormat with protected access, so ordinary programs cannot use it directly. Here LineRecordWriter is simply extracted from TextOutputFormat and turned into a standalone public class.

package inkfish.hadoop.study;

import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UnsupportedEncodingException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

/** Extracted from the LineRecordWriter nested inside {@link TextOutputFormat}. */
public class LineRecordWriter<K, V> extends RecordWriter<K, V> {
    private static final String utf8 = "UTF-8";
    private static final byte[] newline;
    static {
        try {
            newline = "\n".getBytes(utf8);
        } catch (UnsupportedEncodingException uee) {
            throw new IllegalArgumentException("can't find " + utf8 + " encoding");
        }
    }

    protected DataOutputStream out;
    private final byte[] keyValueSeparator;

    public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
        this.out = out;
        try {
            this.keyValueSeparator = keyValueSeparator.getBytes(utf8);
        } catch (UnsupportedEncodingException uee) {
            throw new IllegalArgumentException("can't find " + utf8 + " encoding");
        }
    }

    public LineRecordWriter(DataOutputStream out) {
        this(out, "\t");
    }

    private void writeObject(Object o) throws IOException {
        if (o instanceof Text) {
            Text to = (Text) o;
            out.write(to.getBytes(), 0, to.getLength());
        } else {
            out.write(o.toString().getBytes(utf8));
        }
    }

    public synchronized void write(K key, V value) throws IOException {
        boolean nullKey = key == null || key instanceof NullWritable;
        boolean nullValue = value == null || value instanceof NullWritable;
        if (nullKey && nullValue) {
            return;
        }
        if (!nullKey) {
            writeObject(key);
        }
        if (!(nullKey || nullValue)) {
            out.write(keyValueSeparator);
        }
        if (!nullValue) {
            writeObject(value);
        }
        out.write(newline);
    }

    public synchronized void close(TaskAttemptContext context) throws IOException {
        out.close();
    }
}

2. MultipleOutputFormat:

  An abstract class modeled mainly on org.apache.hadoop.mapred.lib.MultipleOutputFormat. The only method a subclass must implement is String generateFileNameForKeyValue(K key, V value, Configuration conf), which determines the output file name (including extension) from the key, the value, and the job configuration.

package inkfish.hadoop.study;

import java.io.DataOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.ReflectionUtils;

public abstract class MultipleOutputFormat<K extends WritableComparable<?>, V extends Writable>
        extends FileOutputFormat<K, V> {

    private MultiRecordWriter writer = null;

    public RecordWriter<K, V> getRecordWriter(TaskAttemptContext job) throws IOException,
            InterruptedException {
        if (writer == null) {
            writer = new MultiRecordWriter(job, getTaskOutputPath(job));
        }
        return writer;
    }

    private Path getTaskOutputPath(TaskAttemptContext conf) throws IOException {
        Path workPath = null;
        OutputCommitter committer = super.getOutputCommitter(conf);
        if (committer instanceof FileOutputCommitter) {
            workPath = ((FileOutputCommitter) committer).getWorkPath();
        } else {
            Path outputPath = super.getOutputPath(conf);
            if (outputPath == null) {
                throw new IOException("Undefined job output-path");
            }
            workPath = outputPath;
        }
        return workPath;
    }

    /** Determine the output file name (including extension) from key, value, and conf. */
    protected abstract String generateFileNameForKeyValue(K key, V value, Configuration conf);

    public class MultiRecordWriter extends RecordWriter<K, V> {
        /** Cache of RecordWriters, one per output file name. */
        private HashMap<String, RecordWriter<K, V>> recordWriters = null;
        private TaskAttemptContext job = null;
        /** Output directory. */
        private Path workPath = null;

        public MultiRecordWriter(TaskAttemptContext job, Path workPath) {
            super();
            this.job = job;
            this.workPath = workPath;
            recordWriters = new HashMap<String, RecordWriter<K, V>>();
        }

        @Override
        public void close(TaskAttemptContext context) throws IOException, InterruptedException {
            Iterator<RecordWriter<K, V>> values = this.recordWriters.values().iterator();
            while (values.hasNext()) {
                values.next().close(context);
            }
            this.recordWriters.clear();
        }

        @Override
        public void write(K key, V value) throws IOException, InterruptedException {
            // determine the output file name for this record
            String baseName = generateFileNameForKeyValue(key, value, job.getConfiguration());
            RecordWriter<K, V> rw = this.recordWriters.get(baseName);
            if (rw == null) {
                rw = getBaseRecordWriter(job, baseName);
                this.recordWriters.put(baseName, rw);
            }
            rw.write(key, value);
        }

        // ${mapred.out.dir}/_temporary/_${taskid}/${nameWithExtension}
        private RecordWriter<K, V> getBaseRecordWriter(TaskAttemptContext job, String baseName)
                throws IOException, InterruptedException {
            Configuration conf = job.getConfiguration();
            boolean isCompressed = getCompressOutput(job);
            String keyValueSeparator = ",";
            RecordWriter<K, V> recordWriter = null;
            if (isCompressed) {
                Class<? extends CompressionCodec> codecClass = getOutputCompressorClass(job,
                        GzipCodec.class);
                CompressionCodec codec = ReflectionUtils.newInstance(codecClass, conf);
                Path file = new Path(workPath, baseName + codec.getDefaultExtension());
                FSDataOutputStream fileOut = file.getFileSystem(conf).create(file, false);
                recordWriter = new LineRecordWriter<K, V>(new DataOutputStream(codec
                        .createOutputStream(fileOut)), keyValueSeparator);
            } else {
                Path file = new Path(workPath, baseName);
                FSDataOutputStream fileOut = file.getFileSystem(conf).create(file, false);
                recordWriter = new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
            }
            return recordWriter;
        }
    }
}

3. WordCount:

  This keeps the WordCount example that ships with Hadoop largely intact, adding one static inner class, AlphabetOutputFormat, which extends MultipleOutputFormat. The naming rule: words beginning with an English letter are written to a file named "<first letter>.txt"; all other words go to "other.txt".

package inkfish.hadoop.study;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException,
                InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static class AlphabetOutputFormat extends MultipleOutputFormat<Text, IntWritable> {
        @Override
        protected String generateFileNameForKeyValue(Text key, IntWritable value, Configuration conf) {
            char c = key.toString().toLowerCase().charAt(0);
            if (c >= 'a' && c <= 'z') {
                return c + ".txt";
            }
            return "other.txt";
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setOutputFormatClass(AlphabetOutputFormat.class); // set the output format
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
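
With the three classes packaged into a jar, the job is submitted like the standard WordCount example. A minimal sketch of the invocation (the jar name wordcount.jar and the in/out paths are assumptions, not taken from the original run):

  # submit the job; the two arguments are the input and output directories
  hadoop jar wordcount.jar inkfish.hadoop.study.WordCount in out
  # list the per-letter output files once the job finishes
  hadoop fs -ls out/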

Results from a run in my test environment:

10/01/08 20:35:34 INFO mapred.JobClient: Job complete: job_201001052238_0013
10/01/08 20:35:34 INFO mapred.JobClient: Counters: 15
10/01/08 20:35:34 INFO mapred.JobClient: Job Counters
10/01/08 20:35:34 INFO mapred.JobClient: Launched reduce tasks=1
10/01/08 20:35:34 INFO mapred.JobClient: Rack-local map tasks=38
10/01/08 20:35:34 INFO mapred.JobClient: Launched map tasks=38
10/01/08 20:35:34 INFO mapred.JobClient: FileSystemCounters
10/01/08 20:35:34 INFO mapred.JobClient: FILE_BYTES_READ=1473227
10/01/08 20:35:34 INFO mapred.JobClient: FILE_BYTES_WRITTEN=1370636
10/01/08 20:35:34 INFO mapred.JobClient: Map-Reduce Framework
10/01/08 20:35:34 INFO mapred.JobClient: Reduce input groups=0
10/01/08 20:35:34 INFO mapred.JobClient: Combine output records=29045
10/01/08 20:35:34 INFO mapred.JobClient: Map input records=19313
10/01/08 20:35:34 INFO mapred.JobClient: Reduce shuffle bytes=517685
10/01/08 20:35:34 INFO mapred.JobClient: Reduce output records=0
10/01/08 20:35:34 INFO mapred.JobClient: Spilled Records=58090
10/01/08 20:35:34 INFO mapred.JobClient: Map output bytes=1393868
10/01/08 20:35:34 INFO mapred.JobClient: Combine input records=119552
10/01/08 20:35:34 INFO mapred.JobClient: Map output records=119552
10/01/08 20:35:34 INFO mapred.JobClient: Reduce input records=29045
user@cloud-2:~/software/test$ ls out/
a.txt c.txt e.txt g.txt i.txt k.txt l.txt n.txt o.txt q.txt s.txt u.txt w.txt y.txt
b.txt d.txt f.txt h.txt j.txt _logs m.txt other.txt p.txt r.txt t.txt v.txt x.txt z.txt
user@cloud-2:~/software/test$
