The MapReduce Programming Process

Overview

  • Flow diagram
  • Default classes
  • Complete WordCount example

Flow Diagram

  • The complete MapReduce flow (diagram)

Default Classes
  • These correspond to the classes in the flow diagram above; a driver sketch showing how each one is set on a Job follows this list.
  • InputFormat: TextInputFormat
    RecordReader: LineRecordReader
    InputSplit: FileSplit
    Map: the identity Mapper by default (IdentityMapper in the old API)
    Combine: none by default
    Partitioner: HashPartitioner
    GroupingComparator: the remarkable one, arguably where the magic happens; it can be defined inside the key class or set in the driver class
    Reduce: the identity Reducer by default (IdentityReducer in the old API)
    OutputFormat: TextOutputFormat (a subclass of FileOutputFormat)
    RecordWriter: LineRecordWriter
    OutputCommitter: FileOutputCommitter
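  • All of the defaults above can be overridden in the driver. Below is a hedged sketch of the matching setters on org.apache.hadoop.mapreduce.Job; the classes set here are the defaults themselves (assuming Text map output keys), so this job behaves as if nothing had been set, and any line can be swapped for a custom class.
  • import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;
    
    public class DefaultsAsSetters {
      public static Job newJob() throws Exception {
        Job job = new Job(new Configuration(), "defaults demo");
        job.setInputFormatClass(TextInputFormat.class);        // InputFormat -> LineRecordReader / FileSplit
        job.setMapperClass(Mapper.class);                      // identity map
        // no combiner runs unless job.setCombinerClass(...) is called
        job.setPartitionerClass(HashPartitioner.class);        // key.hashCode() % numReduceTasks
        job.setGroupingComparatorClass(Text.Comparator.class); // grouping comparator for Text keys
        job.setReducerClass(Reducer.class);                    // identity reduce
        job.setOutputFormatClass(TextOutputFormat.class);      // LineRecordWriter + FileOutputCommitter
        return job;
      }
    }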
Complete WordCount Code Example

  • The InputFormat here is TextInputFormat, which in turn extends FileInputFormat; most of the substance actually lives in FileInputFormat. The FileInputFormat class is shown first below, followed by TextInputFormat.
  • FileInputFormat source code
  • /** 
     * A base class for file-based {@link InputFormat}s.
     * 
     * <p><code>FileInputFormat</code> is the base class for all file-based 
     * <code>InputFormat</code>s. This provides a generic implementation of
     * {@link #getSplits(JobContext)}.
     * Subclasses of <code>FileInputFormat</code> can also override the 
     * {@link #isSplitable(JobContext, Path)} method to ensure input-files are
     * not split-up and are processed as a whole by {@link Mapper}s.
     */
    public abstract class FileInputFormat<K, V> extends InputFormat<K, V> {
    
      public static enum Counter { 
        BYTES_READ
      }
      
      private static final Log LOG = LogFactory.getLog(FileInputFormat.class);
    
      private static final double SPLIT_SLOP = 1.1;   // 10% slop
    
      private static final PathFilter hiddenFileFilter = new PathFilter(){
          public boolean accept(Path p){
            String name = p.getName(); 
            return !name.startsWith("_") && !name.startsWith("."); 
          }
        }; 
    
      static final String NUM_INPUT_FILES = "mapreduce.input.num.files";
    
      /**
       * Proxy PathFilter that accepts a path only if all filters given in the
       * constructor do. Used by the listPaths() to apply the built-in
       * hiddenFileFilter together with a user provided one (if any).
       */
      private static class MultiPathFilter implements PathFilter {
        private List<PathFilter> filters;
    
        public MultiPathFilter(List<PathFilter> filters) {
          this.filters = filters;
        }
    
        public boolean accept(Path path) {
          for (PathFilter filter : filters) {
            if (!filter.accept(path)) {
              return false;
            }
          }
          return true;
        }
      }
    
      /**
       * Get the lower bound on split size imposed by the format.
       * @return the number of bytes of the minimal split for this format
       */
      protected long getFormatMinSplitSize() {
        return 1;
      }
    
      /**
       * Is the given filename splitable? Usually, true, but if the file is
       * stream compressed, it will not be.
       * 
       * <code>FileInputFormat</code> implementations can override this and return
       * <code>false</code> to ensure that individual input files are never split-up
       * so that {@link Mapper}s process entire files.
       * 
       * @param context the job context
       * @param filename the file name to check
       * @return is this file splitable?
       */
      protected boolean isSplitable(JobContext context, Path filename) {
        return true;
      }
    
      /**
       * Set a PathFilter to be applied to the input paths for the map-reduce job.
       * @param job the job to modify
       * @param filter the PathFilter class use for filtering the input paths.
       */
      public static void setInputPathFilter(Job job,
                                            Class<? extends PathFilter> filter) {
        job.getConfiguration().setClass("mapred.input.pathFilter.class", filter, 
                                        PathFilter.class);
      }
    
      /**
       * Set the minimum input split size
       * @param job the job to modify
       * @param size the minimum size
       */
      public static void setMinInputSplitSize(Job job,
                                              long size) {
        job.getConfiguration().setLong("mapred.min.split.size", size);
      }
    
      /**
       * Get the minimum split size
       * @param job the job
       * @return the minimum number of bytes that can be in a split
       */
      public static long getMinSplitSize(JobContext job) {
        return job.getConfiguration().getLong("mapred.min.split.size", 1L);
      }
    
      /**
       * Set the maximum split size
       * @param job the job to modify
       * @param size the maximum split size
       */
      public static void setMaxInputSplitSize(Job job,
                                              long size) {
        job.getConfiguration().setLong("mapred.max.split.size", size);
      }
    
      /**
       * Get the maximum split size.
       * @param context the job to look at.
       * @return the maximum number of bytes a split can include
       */
      public static long getMaxSplitSize(JobContext context) {
        return context.getConfiguration().getLong("mapred.max.split.size", 
                                                  Long.MAX_VALUE);
      }
    
      /**
       * Get a PathFilter instance of the filter set for the input paths.
       *
       * @return the PathFilter instance set for the job, NULL if none has been set.
       */
      public static PathFilter getInputPathFilter(JobContext context) {
        Configuration conf = context.getConfiguration();
        Class<?> filterClass = conf.getClass("mapred.input.pathFilter.class", null,
            PathFilter.class);
        return (filterClass != null) ?
            (PathFilter) ReflectionUtils.newInstance(filterClass, conf) : null;
      }
    
      /** List input directories.
       * Subclasses may override to, e.g., select only files matching a regular
       * expression. 
       * 
       * @param job the job to list input paths for
       * @return array of FileStatus objects
       * @throws IOException if zero items.
       */
      protected List<FileStatus> listStatus(JobContext job
                                            ) throws IOException {
        List<FileStatus> result = new ArrayList<FileStatus>();
        Path[] dirs = getInputPaths(job);
        if (dirs.length == 0) {
          throw new IOException("No input paths specified in job");
        }
        
        // get tokens for all the required FileSystems..
        TokenCache.obtainTokensForNamenodes(job.getCredentials(), dirs, 
                                            job.getConfiguration());
    
        List<IOException> errors = new ArrayList<IOException>();
        
        // creates a MultiPathFilter with the hiddenFileFilter and the
        // user provided one (if any).
        List<PathFilter> filters = new ArrayList<PathFilter>();
        filters.add(hiddenFileFilter);
        PathFilter jobFilter = getInputPathFilter(job);
        if (jobFilter != null) {
          filters.add(jobFilter);
        }
        PathFilter inputFilter = new MultiPathFilter(filters);
        
        for (int i=0; i < dirs.length; ++i) {
          Path p = dirs[i];
          FileSystem fs = p.getFileSystem(job.getConfiguration()); 
          FileStatus[] matches = fs.globStatus(p, inputFilter);
          if (matches == null) {
            errors.add(new IOException("Input path does not exist: " + p));
          } else if (matches.length == 0) {
            errors.add(new IOException("Input Pattern " + p + " matches 0 files"));
          } else {
            for (FileStatus globStat: matches) {
              if (globStat.isDir()) {
                for(FileStatus stat: fs.listStatus(globStat.getPath(),
                    inputFilter)) {
                  result.add(stat);
                }          
              } else {
                result.add(globStat);
              }
            }
          }
        }
    
        if (!errors.isEmpty()) {
          throw new InvalidInputException(errors);
        }
        LOG.info("Total input paths to process : " + result.size()); 
        return result;
      }
      
    
      /** 
       * Generate the list of files and make them into FileSplits.
       */ 
      public List<InputSplit> getSplits(JobContext job
                                        ) throws IOException {
        long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
        long maxSize = getMaxSplitSize(job);
    
        // generate splits
        List<InputSplit> splits = new ArrayList<InputSplit>();
        List<FileStatus>files = listStatus(job);
        for (FileStatus file: files) {
          Path path = file.getPath();
          FileSystem fs = path.getFileSystem(job.getConfiguration());
          long length = file.getLen();
          BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
          if ((length != 0) && isSplitable(job, path)) { 
            long blockSize = file.getBlockSize();
            long splitSize = computeSplitSize(blockSize, minSize, maxSize);
    
            long bytesRemaining = length;
            while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
              int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
              splits.add(new FileSplit(path, length-bytesRemaining, splitSize, 
                                       blkLocations[blkIndex].getHosts()));
              bytesRemaining -= splitSize;
            }
            
            if (bytesRemaining != 0) {
              splits.add(new FileSplit(path, length-bytesRemaining, bytesRemaining, 
                         blkLocations[blkLocations.length-1].getHosts()));
            }
          } else if (length != 0) {
            splits.add(new FileSplit(path, 0, length, blkLocations[0].getHosts()));
          } else { 
            //Create empty hosts array for zero length files
            splits.add(new FileSplit(path, 0, length, new String[0]));
          }
        }
        
        // Save the number of input files in the job-conf
        job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
    
        LOG.debug("Total # of splits: " + splits.size());
        return splits;
      }
    
      protected long computeSplitSize(long blockSize, long minSize,
                                      long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
      }
    
      protected int getBlockIndex(BlockLocation[] blkLocations, 
                                  long offset) {
        for (int i = 0 ; i < blkLocations.length; i++) {
          // is the offset inside this block?
          if ((blkLocations[i].getOffset() <= offset) &&
              (offset < blkLocations[i].getOffset() + blkLocations[i].getLength())){
            return i;
          }
        }
        BlockLocation last = blkLocations[blkLocations.length -1];
        long fileLength = last.getOffset() + last.getLength() -1;
        throw new IllegalArgumentException("Offset " + offset + 
                                           " is outside of file (0.." +
                                           fileLength + ")");
      }
    
      /**
       * Sets the given comma separated paths as the list of inputs 
       * for the map-reduce job.
       * 
       * @param job the job
       * @param commaSeparatedPaths Comma separated paths to be set as 
       *        the list of inputs for the map-reduce job.
       */
      public static void setInputPaths(Job job, 
                                       String commaSeparatedPaths
                                       ) throws IOException {
        setInputPaths(job, StringUtils.stringToPath(
                            getPathStrings(commaSeparatedPaths)));
      }
    
      /**
       * Add the given comma separated paths to the list of inputs for
       *  the map-reduce job.
       * 
       * @param job The job to modify
       * @param commaSeparatedPaths Comma separated paths to be added to
       *        the list of inputs for the map-reduce job.
       */
      public static void addInputPaths(Job job, 
                                       String commaSeparatedPaths
                                       ) throws IOException {
        for (String str : getPathStrings(commaSeparatedPaths)) {
          addInputPath(job, new Path(str));
        }
      }
    
      /**
       * Set the array of {@link Path}s as the list of inputs
       * for the map-reduce job.
       * 
       * @param job The job to modify 
       * @param inputPaths the {@link Path}s of the input directories/files 
       * for the map-reduce job.
       */ 
      public static void setInputPaths(Job job, 
                                       Path... inputPaths) throws IOException {
        Configuration conf = job.getConfiguration();
        Path path = inputPaths[0].getFileSystem(conf).makeQualified(inputPaths[0]);
        StringBuffer str = new StringBuffer(StringUtils.escapeString(path.toString()));
        for(int i = 1; i < inputPaths.length;i++) {
          str.append(StringUtils.COMMA_STR);
          path = inputPaths[i].getFileSystem(conf).makeQualified(inputPaths[i]);
          str.append(StringUtils.escapeString(path.toString()));
        }
        conf.set("mapred.input.dir", str.toString());
      }
    
      /**
       * Add a {@link Path} to the list of inputs for the map-reduce job.
       * 
       * @param job The {@link Job} to modify
       * @param path {@link Path} to be added to the list of inputs for 
       *            the map-reduce job.
       */
      public static void addInputPath(Job job, 
                                      Path path) throws IOException {
        Configuration conf = job.getConfiguration();
        path = path.getFileSystem(conf).makeQualified(path);
        String dirStr = StringUtils.escapeString(path.toString());
        String dirs = conf.get("mapred.input.dir");
        conf.set("mapred.input.dir", dirs == null ? dirStr : dirs + "," + dirStr);
      }
      
      // This method escapes commas in the glob pattern of the given paths.
      private static String[] getPathStrings(String commaSeparatedPaths) {
        int length = commaSeparatedPaths.length();
        int curlyOpen = 0;
        int pathStart = 0;
        boolean globPattern = false;
        List<String> pathStrings = new ArrayList<String>();
        
        for (int i=0; i<length; i++) {
          char ch = commaSeparatedPaths.charAt(i);
          switch(ch) {
            case '{' : {
              curlyOpen++;
              if (!globPattern) {
                globPattern = true;
              }
              break;
            }
            case '}' : {
              curlyOpen--;
              if (curlyOpen == 0 && globPattern) {
                globPattern = false;
              }
              break;
            }
            case ',' : {
              if (!globPattern) {
                pathStrings.add(commaSeparatedPaths.substring(pathStart, i));
                pathStart = i + 1 ;
              }
              break;
            }
          }
        }
        pathStrings.add(commaSeparatedPaths.substring(pathStart, length));
        
        return pathStrings.toArray(new String[0]);
      }
      
      /**
       * Get the list of input {@link Path}s for the map-reduce job.
       * 
       * @param context The job
       * @return the list of input {@link Path}s for the map-reduce job.
       */
      public static Path[] getInputPaths(JobContext context) {
        String dirs = context.getConfiguration().get("mapred.input.dir", "");
        String [] list = StringUtils.split(dirs);
        Path[] result = new Path[list.length];
        for (int i = 0; i < list.length; i++) {
          result[i] = new Path(StringUtils.unEscapeString(list[i]));
        }
        return result;
      }
    
    }
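  • Before moving on to TextInputFormat, here is a minimal stand-alone sketch of the arithmetic in getSplits() and computeSplitSize() above, assuming a hypothetical 200 MB file, a 64 MB block size, and the default min/max split sizes.
  • public class SplitSizeDemo {
      public static void main(String[] args) {
        final double SPLIT_SLOP = 1.1;                  // same 10% slop as FileInputFormat
        long blockSize = 64L * 1024 * 1024;
        long minSize = 1L;                              // getFormatMinSplitSize() / mapred.min.split.size
        long maxSize = Long.MAX_VALUE;                  // mapred.max.split.size default
        long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));   // computeSplitSize()
    
        long length = 200L * 1024 * 1024;               // hypothetical file length
        long bytesRemaining = length;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
          System.out.println("split at offset " + (length - bytesRemaining) + ", length " + splitSize);
          bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
          // the last split may be up to 10% larger than splitSize
          System.out.println("split at offset " + (length - bytesRemaining) + ", length " + bytesRemaining);
        }
        // prints three 64 MB splits followed by one final 8 MB split
      }
    }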
  • TextInputFormat source code
  • public class TextInputFormat extends FileInputFormat<LongWritable, Text> {
    
      @Override
      public RecordReader<LongWritable, Text> 
        createRecordReader(InputSplit split,
                           TaskAttemptContext context) {
        return new LineRecordReader();
      }
    
      @Override
      protected boolean isSplitable(JobContext context, Path file) {
        CompressionCodec codec = 
          new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
        return codec == null;
      }
    
    }
  • Inside the InputFormat, the input files are first divided into splits (FileSplit), and then the RecordReader processes the data one FileSplit at a time.
  • In my view the real substance is in LineRecordReader; the splitting itself is simple, roughly like cutting a file of size 10 into pieces of 4, 4, and 2. The FileSplit class is shown first below, followed by LineRecordReader.
  • public class FileSplit extends InputSplit implements Writable {
      private Path file;
      private long start;
      private long length;
      private String[] hosts;
    
      FileSplit() {}
    
      /** Constructs a split with host information
       *
       * @param file the file name
       * @param start the position of the first byte in the file to process
       * @param length the number of bytes in the file to process
       * @param hosts the list of hosts containing the block, possibly null
       */
      public FileSplit(Path file, long start, long length, String[] hosts) {
        this.file = file;
        this.start = start;
        this.length = length;
        this.hosts = hosts;
      }
     
      /** The file containing this split's data. */
      public Path getPath() { return file; }
      
      /** The position of the first byte in the file to process. */
      public long getStart() { return start; }
      
      /** The number of bytes in the file to process. */
      @Override
      public long getLength() { return length; }
    
      @Override
      public String toString() { return file + ":" + start + "+" + length; }
    
      ////////////////////////////////////////////
      // Writable methods
      ////////////////////////////////////////////
    
      @Override
      public void write(DataOutput out) throws IOException {
        Text.writeString(out, file.toString());
        out.writeLong(start);
        out.writeLong(length);
      }
    
      @Override
      public void readFields(DataInput in) throws IOException {
        file = new Path(Text.readString(in));
        start = in.readLong();
        length = in.readLong();
        hosts = null;
      }
    
      @Override
      public String[] getLocations() throws IOException {
        if (this.hosts == null) {
          return new String[]{};
        } else {
          return this.hosts;
        }
      }
    }
  • LineRecordReader source code; there is a lot to learn from this one (a small split-boundary demo follows the source).
  • public class LineRecordReader extends RecordReader<LongWritable, Text> {
      private static final Log LOG = LogFactory.getLog(LineRecordReader.class);
    
      private CompressionCodecFactory compressionCodecs = null;
      private long start;
      private long pos;
      private long end;
      private LineReader in;
      private int maxLineLength;
      private LongWritable key = null;
      private Text value = null;
    
      public void initialize(InputSplit genericSplit,
                             TaskAttemptContext context) throws IOException {
        FileSplit split = (FileSplit) genericSplit;
        Configuration job = context.getConfiguration();
        this.maxLineLength = job.getInt("mapred.linerecordreader.maxlength",
                                        Integer.MAX_VALUE);
        start = split.getStart();
        end = start + split.getLength();
        final Path file = split.getPath();
        compressionCodecs = new CompressionCodecFactory(job);
        final CompressionCodec codec = compressionCodecs.getCodec(file);
    
        // open the file and seek to the start of the split
        FileSystem fs = file.getFileSystem(job);
        FSDataInputStream fileIn = fs.open(split.getPath());
        boolean skipFirstLine = false;
        if (codec != null) {
          in = new LineReader(codec.createInputStream(fileIn), job);
          end = Long.MAX_VALUE;
        } else {
          if (start != 0) {
            skipFirstLine = true;
            --start;
            fileIn.seek(start);
          }
          in = new LineReader(fileIn, job);
        }
        if (skipFirstLine) {  // skip first line and re-establish "start".
          start += in.readLine(new Text(), 0,
                               (int)Math.min((long)Integer.MAX_VALUE, end - start));
        }
        this.pos = start;
      }
      
      public boolean nextKeyValue() throws IOException {
        if (key == null) {
          key = new LongWritable();
        }
        key.set(pos);
        if (value == null) {
          value = new Text();
        }
        int newSize = 0;
        while (pos < end) {
          newSize = in.readLine(value, maxLineLength,
                                Math.max((int)Math.min(Integer.MAX_VALUE, end-pos),
                                         maxLineLength));
          if (newSize == 0) {
            break;
          }
          pos += newSize;
          if (newSize < maxLineLength) {
            break;
          }
    
          // line too long. try again
          LOG.info("Skipped line of size " + newSize + " at pos " + 
                   (pos - newSize));
        }
        if (newSize == 0) {
          key = null;
          value = null;
          return false;
        } else {
          return true;
        }
      }
    
      @Override
      public LongWritable getCurrentKey() {
        return key;
      }
    
      @Override
      public Text getCurrentValue() {
        return value;
      }
    
      /**
       * Get the progress within the split
       */
      public float getProgress() {
        if (start == end) {
          return 0.0f;
        } else {
          return Math.min(1.0f, (pos - start) / (float)(end - start));
        }
      }
      
      public synchronized void close() throws IOException {
        if (in != null) {
          in.close(); 
        }
      }
    }
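  • A minimal stand-alone sketch (assuming hadoop-common on the classpath) of the boundary rule used by initialize() and nextKeyValue() above: a reader whose split does not start at byte 0 discards the first, possibly partial, line, while the previous split's reader reads one line past its own end, so every line is consumed exactly once. The real LineRecordReader also backs up one byte before skipping, but the net effect is the same.
  • import java.io.ByteArrayInputStream;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.util.LineReader;
    
    public class SplitBoundaryDemo {
      public static void main(String[] args) throws Exception {
        byte[] data = "alpha\nbravo\ncharlie\n".getBytes("UTF-8");
        int splitPoint = 8;                              // falls in the middle of "bravo"
    
        // split 1 covers [0, 8): it keeps reading whole lines until its position passes 8
        LineReader first = new LineReader(new ByteArrayInputStream(data));
        Text line = new Text();
        long pos = 0;
        while (pos < splitPoint) {
          pos += first.readLine(line);
          System.out.println("split1: " + line);         // alpha, bravo
        }
    
        // split 2 covers [8, end): it discards the partial line it starts inside
        ByteArrayInputStream rest = new ByteArrayInputStream(data, splitPoint, data.length - splitPoint);
        LineReader second = new LineReader(rest);
        second.readLine(new Text());                     // throw away the "vo" remainder
        while (second.readLine(line) > 0) {
          System.out.println("split2: " + line);         // charlie
        }
      }
    }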
  • The Map class: below is the Mapper that WordCount uses (a tiny demo of its output follows the class).
    public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>{
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
          
        public void map(Object key, Text value, Context context
                        ) throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }
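  • A tiny stand-alone sketch of what this map() emits for one hypothetical input line; the combiner and reducer below then sum the 1s per word.
  • import java.util.StringTokenizer;
    
    public class MapLogicDemo {
      public static void main(String[] args) {
        String line = "hello world hello";               // a sample input value
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
          // the real mapper calls context.write(new Text(token), new IntWritable(1))
          System.out.println("(" + itr.nextToken() + ", 1)");
        }
        // prints (hello, 1) (world, 1) (hello, 1); after combine/reduce: (hello, 2) (world, 1)
      }
    }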
  • The Combine class: WordCount uses IntSumReducer as its combiner, which is exactly the same class as its reducer.
  • public class IntSumReducer 
           extends Reducer<Text,IntWritable,Text,IntWritable> {
        private IntWritable result = new IntWritable();
    
        public void reduce(Text key, Iterable<IntWritable> values, 
                           Context context
                           ) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }
  • The HashPartitioner class decides which reducer each key and its values are sent to; the source code is as follows, and a small routing demo comes right after it.
  • public class HashPartitioner<K, V> extends Partitioner<K, V> {
    
      /** Use {@link Object#hashCode()} to partition. */
      public int getPartition(K key, V value,int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }
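  • A minimal sketch (assuming the Hadoop jars on the classpath) of how HashPartitioner routes keys when a job runs with, say, three reduce tasks; the word list is arbitrary and the partition numbers are simply whatever Text.hashCode() produces.
  • import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;
    
    public class PartitionDemo {
      public static void main(String[] args) {
        HashPartitioner<Text, IntWritable> partitioner = new HashPartitioner<Text, IntWritable>();
        int numReduceTasks = 3;
        for (String word : new String[] {"hadoop", "mapreduce", "word", "count"}) {
          int p = partitioner.getPartition(new Text(word), new IntWritable(1), numReduceTasks);
          System.out.println(word + " -> reducer " + p);  // always in the range [0, 3)
        }
      }
    }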
  • GroupingComparator, where the magic happens. WordCount's key type is Text, and the Comparator nested inside Text is what I mean by the GroupingComparator here. Used well, this class lets you implement secondary grouping, as illustrated in the sketch after the snippet below.
  • public static class Comparator extends WritableComparator {
        public Comparator() {
          super(Text.class);
        }
        public int compare(byte[] b1, int s1, int l1,
                           byte[] b2, int s2, int l2) {
          int n1 = WritableUtils.decodeVIntSize(b1[s1]);
          int n2 = WritableUtils.decodeVIntSize(b2[s2]);
          return compareBytes(b1, s1+n1, l1-n1, b2, s2+n2, l2-n2);
        }
      }
      static {
        // register this comparator
        WritableComparator.define(Text.class, new Comparator());
      }
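  • As an illustration of the secondary-grouping point above, here is a hypothetical grouping comparator that is not part of WordCount: assuming composite Text keys of the form word#count, it groups only on the part before the '#', so a single reduce() call receives every record for a word even though the full keys differ. It would be registered in the driver with job.setGroupingComparatorClass(PrefixGroupingComparator.class).
  • import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;
    
    public class PrefixGroupingComparator extends WritableComparator {
      public PrefixGroupingComparator() {
        super(Text.class, true);             // create key instances so the object compare below is used
      }
    
      @Override
      public int compare(WritableComparable a, WritableComparable b) {
        String left = ((Text) a).toString();
        String right = ((Text) b).toString();
        // compare only the portion before the '#' separator
        String lp = left.contains("#") ? left.substring(0, left.indexOf('#')) : left;
        String rp = right.contains("#") ? right.substring(0, right.indexOf('#')) : right;
        return lp.compareTo(rp);
      }
    }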
  • WordCount's reduce class, identical to the combine class above; strictly speaking, the combine class above is the same class as this reducer.
  • public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> {
        private IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> values,Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
    }
  • The TextOutputFormat class: like TextInputFormat, it also has a parent class, FileOutputFormat, and most of the substance lives in FileOutputFormat.
  • FileOutputFormat source code
  • public abstract class FileOutputFormat<K, V> extends OutputFormat<K, V> {
      
      protected static final String BASE_OUTPUT_NAME = "mapreduce.output.basename";
      protected static final String PART = "part";
    
      public static enum Counter { 
        BYTES_WRITTEN
      }
    
      /** Construct output file names so that, when an output directory listing is
       * sorted lexicographically, positions correspond to output partitions.*/
      private static final NumberFormat NUMBER_FORMAT = NumberFormat.getInstance();
      static {
        NUMBER_FORMAT.setMinimumIntegerDigits(5);
        NUMBER_FORMAT.setGroupingUsed(false);
      }
      private FileOutputCommitter committer = null;
    
      /**
       * Set whether the output of the job is compressed.
       * @param job the job to modify
       * @param compress should the output of the job be compressed?
       */
      public static void setCompressOutput(Job job, boolean compress) {
        job.getConfiguration().setBoolean("mapred.output.compress", compress);
      }
      
      /**
       * Is the job output compressed?
       * @param job the Job to look in
       * @return <code>true</code> if the job output should be compressed,
       *         <code>false</code> otherwise
       */
      public static boolean getCompressOutput(JobContext job) {
        return job.getConfiguration().getBoolean("mapred.output.compress", false);
      }
      
      /**
       * Set the {@link CompressionCodec} to be used to compress job outputs.
       * @param job the job to modify
       * @param codecClass the {@link CompressionCodec} to be used to
       *                   compress the job outputs
       */
      public static void 
      setOutputCompressorClass(Job job, 
                               Class<? extends CompressionCodec> codecClass) {
        setCompressOutput(job, true);
        job.getConfiguration().setClass("mapred.output.compression.codec", 
                                        codecClass, 
                                        CompressionCodec.class);
      }
      
      /**
       * Get the {@link CompressionCodec} for compressing the job outputs.
       * @param job the {@link Job} to look in
       * @param defaultValue the {@link CompressionCodec} to return if not set
       * @return the {@link CompressionCodec} to be used to compress the 
       *         job outputs
       * @throws IllegalArgumentException if the class was specified, but not found
       */
      public static Class<? extends CompressionCodec> 
      getOutputCompressorClass(JobContext job, 
    		                       Class<? extends CompressionCodec> defaultValue) {
        Class<? extends CompressionCodec> codecClass = defaultValue;
        Configuration conf = job.getConfiguration();
        String name = conf.get("mapred.output.compression.codec");
        if (name != null) {
          try {
            codecClass = 
            	conf.getClassByName(name).asSubclass(CompressionCodec.class);
          } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("Compression codec " + name + 
                                               " was not found.", e);
          }
        }
        return codecClass;
      }
      
      public abstract RecordWriter<K, V> 
         getRecordWriter(TaskAttemptContext job
                         ) throws IOException, InterruptedException;
    
      public void checkOutputSpecs(JobContext job
                                   ) throws FileAlreadyExistsException, IOException{
        // Ensure that the output directory is set and not already there
        Path outDir = getOutputPath(job);
        if (outDir == null) {
          throw new InvalidJobConfException("Output directory not set.");
        }
        
        // get delegation token for outDir's file system
        TokenCache.obtainTokensForNamenodes(job.getCredentials(), 
                                            new Path[] {outDir}, 
                                            job.getConfiguration());
    
        if (outDir.getFileSystem(job.getConfiguration()).exists(outDir)) {
          throw new FileAlreadyExistsException("Output directory " + outDir + 
                                               " already exists");
        }
      }
    
      /**
       * Set the {@link Path} of the output directory for the map-reduce job.
       *
       * @param job The job to modify
       * @param outputDir the {@link Path} of the output directory for 
       * the map-reduce job.
       */
      public static void setOutputPath(Job job, Path outputDir) {
        job.getConfiguration().set("mapred.output.dir", outputDir.toString());
      }
    
      /**
       * Get the {@link Path} to the output directory for the map-reduce job.
       * 
       * @return the {@link Path} to the output directory for the map-reduce job.
       * @see FileOutputFormat#getWorkOutputPath(TaskInputOutputContext)
       */
      public static Path getOutputPath(JobContext job) {
        String name = job.getConfiguration().get("mapred.output.dir");
        return name == null ? null: new Path(name);
      }
      
      /**
       *  Get the {@link Path} to the task's temporary output directory 
       *  for the map-reduce job
       *  
       * <h4 id="SideEffectFiles">Tasks' Side-Effect Files</h4>
       * 
       * <p>Some applications need to create/write-to side-files, which differ from
       * the actual job-outputs.
       * 
       * <p>In such cases there could be issues with 2 instances of the same TIP 
       * (running simultaneously e.g. speculative tasks) trying to open/write-to the
       * same file (path) on HDFS. Hence the application-writer will have to pick 
       * unique names per task-attempt (e.g. using the attemptid, say 
       * <tt>attempt_200709221812_0001_m_000000_0</tt>), not just per TIP.</p> 
       * 
       * <p>To get around this the Map-Reduce framework helps the application-writer 
       * out by maintaining a special 
       * <tt>${mapred.output.dir}/_temporary/_${taskid}</tt> 
       * sub-directory for each task-attempt on HDFS where the output of the 
       * task-attempt goes. On successful completion of the task-attempt the files 
       * in the <tt>${mapred.output.dir}/_temporary/_${taskid}</tt> (only) 
       * are <i>promoted</i> to <tt>${mapred.output.dir}</tt>. Of course, the 
       * framework discards the sub-directory of unsuccessful task-attempts. This 
       * is completely transparent to the application.</p>
       * 
       * <p>The application-writer can take advantage of this by creating any 
       * side-files required in a work directory during execution 
       * of his task i.e. via 
       * {@link #getWorkOutputPath(TaskInputOutputContext)}, and
       * the framework will move them out similarly - thus she doesn't have to pick 
       * unique paths per task-attempt.</p>
       * 
       * <p>The entire discussion holds true for maps of jobs with 
       * reducer=NONE (i.e. 0 reduces) since output of the map, in that case, 
       * goes directly to HDFS.</p> 
       * 
       * @return the {@link Path} to the task's temporary output directory 
       * for the map-reduce job.
       */
      public static Path getWorkOutputPath(TaskInputOutputContext<?,?,?,?> context
                                           ) throws IOException, 
                                                    InterruptedException {
        FileOutputCommitter committer = (FileOutputCommitter) 
          context.getOutputCommitter();
        return committer.getWorkPath();
      }
    
      /**
       * Helper function to generate a {@link Path} for a file that is unique for
       * the task within the job output directory.
       *
       * <p>The path can be used to create custom files from within the map and
       * reduce tasks. The path name will be unique for each task. The path parent
       * will be the job output directory.</p>ls
       *
       * <p>This method uses the {@link #getUniqueFile} method to make the file name
       * unique for the task.</p>
       *
       * @param context the context for the task.
       * @param name the name for the file.
       * @param extension the extension for the file
       * @return a unique path accross all tasks of the job.
       */
      public 
      static Path getPathForWorkFile(TaskInputOutputContext<?,?,?,?> context, 
                                     String name,
                                     String extension
                                    ) throws IOException, InterruptedException {
        return new Path(getWorkOutputPath(context),
                        getUniqueFile(context, name, extension));
      }
    
      /**
       * Generate a unique filename, based on the task id, name, and extension
       * @param context the task that is calling this
       * @param name the base filename
       * @param extension the filename extension
       * @return a string like $name-[mr]-$id$extension
       */
      public synchronized static String getUniqueFile(TaskAttemptContext context,
                                                      String name,
                                                      String extension) {
        TaskID taskId = context.getTaskAttemptID().getTaskID();
        int partition = taskId.getId();
        StringBuilder result = new StringBuilder();
        result.append(name);
        result.append('-');
        result.append(taskId.isMap() ? 'm' : 'r');
        result.append('-');
        result.append(NUMBER_FORMAT.format(partition));
        result.append(extension);
        return result.toString();
      }
    
      /**
       * Get the default path and filename for the output format.
       * @param context the task context
       * @param extension an extension to add to the filename
       * @return a full path $output/_temporary/$taskid/part-[mr]-$id
       * @throws IOException
       */
      public Path getDefaultWorkFile(TaskAttemptContext context,
                                     String extension) throws IOException{
        FileOutputCommitter committer = 
          (FileOutputCommitter) getOutputCommitter(context);
        return new Path(committer.getWorkPath(), getUniqueFile(context, 
            getOutputName(context), extension));
      }
      
      /**
       * Get the base output name for the output file.
       */
      protected static String getOutputName(JobContext job) {
        return job.getConfiguration().get(BASE_OUTPUT_NAME, PART);
      }
    
      /**
       * Set the base output name for output file to be created.
       */
      protected static void setOutputName(JobContext job, String name) {
        job.getConfiguration().set(BASE_OUTPUT_NAME, name);
      }
    
      public synchronized 
         OutputCommitter getOutputCommitter(TaskAttemptContext context
                                            ) throws IOException {
        if (committer == null) {
          Path output = getOutputPath(context);
          committer = new FileOutputCommitter(output, context);
        }
        return committer;
      }
    }
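  • A hedged driver-side sketch of how the compression setters shown in FileOutputFormat above are typically used; GzipCodec is just an example codec.
  • import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    
    public class CompressionConfigExample {
      public static void configure(Job job) {
        FileOutputFormat.setCompressOutput(job, true);                    // mapred.output.compress = true
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);  // mapred.output.compression.codec
      }
    }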
  • Here is the TextOutputFormat class itself, which is much simpler.
  • public class TextOutputFormat<K, V> extends FileOutputFormat<K, V> {
      protected static class LineRecordWriter<K, V>
        extends RecordWriter<K, V> {
        private static final String utf8 = "UTF-8";
        private static final byte[] newline;
        static {
          try {
            newline = "\n".getBytes(utf8);
          } catch (UnsupportedEncodingException uee) {
            throw new IllegalArgumentException("can't find " + utf8 + " encoding");
          }
        }
    
        protected DataOutputStream out;
        private final byte[] keyValueSeparator;
    
        public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
          this.out = out;
          try {
            this.keyValueSeparator = keyValueSeparator.getBytes(utf8);
          } catch (UnsupportedEncodingException uee) {
            throw new IllegalArgumentException("can't find " + utf8 + " encoding");
          }
        }
    
        public LineRecordWriter(DataOutputStream out) {
          this(out, "\t");
        }
    
        /**
         * Write the object to the byte stream, handling Text as a special
         * case.
         * @param o the object to print
         * @throws IOException if the write throws, we pass it on
         */
        private void writeObject(Object o) throws IOException {
          if (o instanceof Text) {
            Text to = (Text) o;
            out.write(to.getBytes(), 0, to.getLength());
          } else {
            out.write(o.toString().getBytes(utf8));
          }
        }
    
        public synchronized void write(K key, V value)
          throws IOException {
    
          boolean nullKey = key == null || key instanceof NullWritable;
          boolean nullValue = value == null || value instanceof NullWritable;
          if (nullKey && nullValue) {
            return;
          }
          if (!nullKey) {
            writeObject(key);
          }
          if (!(nullKey || nullValue)) {
            out.write(keyValueSeparator);
          }
          if (!nullValue) {
            writeObject(value);
          }
          out.write(newline);
        }
    
        public synchronized 
        void close(TaskAttemptContext context) throws IOException {
          out.close();
        }
      }
    
      public RecordWriter<K, V> 
             getRecordWriter(TaskAttemptContext job
                             ) throws IOException, InterruptedException {
        Configuration conf = job.getConfiguration();
        boolean isCompressed = getCompressOutput(job);
        String keyValueSeparator= conf.get("mapred.textoutputformat.separator",
                                           "\t");
        CompressionCodec codec = null;
        String extension = "";
        if (isCompressed) {
          Class<? extends CompressionCodec> codecClass = 
            getOutputCompressorClass(job, GzipCodec.class);
          codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
          extension = codec.getDefaultExtension();
        }
        Path file = getDefaultWorkFile(job, extension);
        FileSystem fs = file.getFileSystem(conf);
        if (!isCompressed) {
          FSDataOutputStream fileOut = fs.create(file, false);
          return new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
        } else {
          FSDataOutputStream fileOut = fs.create(file, false);
          return new LineRecordWriter<K, V>(new DataOutputStream
                                            (codec.createOutputStream(fileOut)),
                                            keyValueSeparator);
        }
      }
    }
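  • A small sketch of the configuration knob read by getRecordWriter() above: the key/value separator defaults to a tab, so WordCount's part-r-00000 contains lines like "hello<TAB>2", and it can be changed per job.
  • import org.apache.hadoop.conf.Configuration;
    
    public class SeparatorConfigExample {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("mapred.textoutputformat.separator", ",");   // emit "hello,2" instead of "hello\t2"
        System.out.println(conf.get("mapred.textoutputformat.separator"));
      }
    }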
  • Two further classes carry the output path: LineRecordWriter (used by TextOutputFormat) and FileOutputCommitter (used by FileOutputFormat).
  • LineRecordWriter appears as an inner class of TextOutputFormat, so it was already shown above; it is listed again here for completeness.
  • protected static class LineRecordWriter<K, V>
        extends RecordWriter<K, V> {
        private static final String utf8 = "UTF-8";
        private static final byte[] newline;
        static {
          try {
            newline = "\n".getBytes(utf8);
          } catch (UnsupportedEncodingException uee) {
            throw new IllegalArgumentException("can't find " + utf8 + " encoding");
          }
        }
    
        protected DataOutputStream out;
        private final byte[] keyValueSeparator;
    
        public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
          this.out = out;
          try {
            this.keyValueSeparator = keyValueSeparator.getBytes(utf8);
          } catch (UnsupportedEncodingException uee) {
            throw new IllegalArgumentException("can't find " + utf8 + " encoding");
          }
        }
    
        public LineRecordWriter(DataOutputStream out) {
          this(out, "\t");
        }
    
        /**
         * Write the object to the byte stream, handling Text as a special
         * case.
         * @param o the object to print
         * @throws IOException if the write throws, we pass it on
         */
        private void writeObject(Object o) throws IOException {
          if (o instanceof Text) {
            Text to = (Text) o;
            out.write(to.getBytes(), 0, to.getLength());
          } else {
            out.write(o.toString().getBytes(utf8));
          }
        }
    
        public synchronized void write(K key, V value)
          throws IOException {
    
          boolean nullKey = key == null || key instanceof NullWritable;
          boolean nullValue = value == null || value instanceof NullWritable;
          if (nullKey && nullValue) {
            return;
          }
          if (!nullKey) {
            writeObject(key);
          }
          if (!(nullKey || nullValue)) {
            out.write(keyValueSeparator);
          }
          if (!nullValue) {
            writeObject(value);
          }
          out.write(newline);
        }
    
        public synchronized 
        void close(TaskAttemptContext context) throws IOException {
          out.close();
        }
      }
  • The FileOutputCommitter class does a mix of housekeeping work, such as renaming output files and checking whether the task succeeded; it mainly handles post-processing.
  • public class FileOutputCommitter extends OutputCommitter {
    
      private static final Log LOG = LogFactory.getLog(FileOutputCommitter.class);
    
      /**
       * Temporary directory name 
       */
      protected static final String TEMP_DIR_NAME = "_temporary";
      public static final String SUCCEEDED_FILE_NAME = "_SUCCESS";
      static final String SUCCESSFUL_JOB_OUTPUT_DIR_MARKER =
        "mapreduce.fileoutputcommitter.marksuccessfuljobs";
      private FileSystem outputFileSystem = null;
      private Path outputPath = null;
      private Path workPath = null;
    
      /**
       * Create a file output committer
       * @param outputPath the job's output path
       * @param context the task's context
       * @throws IOException
       */
      public FileOutputCommitter(Path outputPath, 
                                 TaskAttemptContext context) throws IOException {
        if (outputPath != null) {
          this.outputPath = outputPath;
          outputFileSystem = outputPath.getFileSystem(context.getConfiguration());
          workPath = new Path(outputPath,
                              (FileOutputCommitter.TEMP_DIR_NAME + Path.SEPARATOR +
                               "_" + context.getTaskAttemptID().toString()
                               )).makeQualified(outputFileSystem);
        }
      }
    
      /**
       * Create the temporary directory that is the root of all of the task 
       * work directories.
       * @param context the job's context
       */
      public void setupJob(JobContext context) throws IOException {
        if (outputPath != null) {
          Path tmpDir = new Path(outputPath, FileOutputCommitter.TEMP_DIR_NAME);
          FileSystem fileSys = tmpDir.getFileSystem(context.getConfiguration());
          if (!fileSys.mkdirs(tmpDir)) {
            LOG.error("Mkdirs failed to create " + tmpDir.toString());
          }
        }
      }
    
      private static boolean shouldMarkOutputDir(Configuration conf) {
        return conf.getBoolean(SUCCESSFUL_JOB_OUTPUT_DIR_MARKER, 
                               true);
      }
    
      // Mark the output dir of the job for which the context is passed.
      private void markOutputDirSuccessful(JobContext context)
      throws IOException {
        if (outputPath != null) {
          FileSystem fileSys = outputPath.getFileSystem(context.getConfiguration());
          if (fileSys.exists(outputPath)) {
            // create a file in the folder to mark it
            Path filePath = new Path(outputPath, SUCCEEDED_FILE_NAME);
            fileSys.create(filePath).close();
          }
        }
      }
    
      /**
       * Delete the temporary directory, including all of the work directories.
       * This is called for all jobs whose final run state is SUCCEEDED
       * @param context the job's context.
       */
      public void commitJob(JobContext context) throws IOException {
        // delete the _temporary folder
        cleanupJob(context);
        // check if the o/p dir should be marked
        if (shouldMarkOutputDir(context.getConfiguration())) {
          // create a _success file in the o/p folder
          markOutputDirSuccessful(context);
        }
      }
    
      @Override
      @Deprecated
      public void cleanupJob(JobContext context) throws IOException {
        if (outputPath != null) {
          Path tmpDir = new Path(outputPath, FileOutputCommitter.TEMP_DIR_NAME);
          FileSystem fileSys = tmpDir.getFileSystem(context.getConfiguration());
          if (fileSys.exists(tmpDir)) {
            fileSys.delete(tmpDir, true);
          }
        } else {
          LOG.warn("Output path is null in cleanup");
        }
      }
    
      /**
       * Delete the temporary directory, including all of the work directories.
       * @param context the job's context
       * @param state final run state of the job, should be FAILED or KILLED
       */
      @Override
      public void abortJob(JobContext context, JobStatus.State state)
      throws IOException {
        cleanupJob(context);
      }
      
      /**
       * No task setup required.
       */
      @Override
      public void setupTask(TaskAttemptContext context) throws IOException {
        // FileOutputCommitter's setupTask doesn't do anything. Because the
        // temporary task directory is created on demand when the 
        // task is writing.
      }
    
      /**
       * Move the files from the work directory to the job output directory
       * @param context the task context
       */
      public void commitTask(TaskAttemptContext context) 
      throws IOException {
        TaskAttemptID attemptId = context.getTaskAttemptID();
        if (workPath != null) {
          context.progress();
          if (outputFileSystem.exists(workPath)) {
            // Move the task outputs to their final place
            moveTaskOutputs(context, outputFileSystem, outputPath, workPath);
            // Delete the temporary task-specific output directory
            if (!outputFileSystem.delete(workPath, true)) {
              LOG.warn("Failed to delete the temporary output" + 
              " directory of task: " + attemptId + " - " + workPath);
            }
            LOG.info("Saved output of task '" + attemptId + "' to " + 
                     outputPath);
          }
        }
      }
    
      /**
       * Move all of the files from the work directory to the final output
       * @param context the task context
       * @param fs the output file system
       * @param jobOutputDir the final output direcotry
       * @param taskOutput the work path
       * @throws IOException
       */
      private void moveTaskOutputs(TaskAttemptContext context,
                                   FileSystem fs,
                                   Path jobOutputDir,
                                   Path taskOutput) 
      throws IOException {
        TaskAttemptID attemptId = context.getTaskAttemptID();
        context.progress();
        if (fs.isFile(taskOutput)) {
          Path finalOutputPath = getFinalPath(jobOutputDir, taskOutput, 
                                              workPath);
          if (!fs.rename(taskOutput, finalOutputPath)) {
            if (!fs.delete(finalOutputPath, true)) {
              throw new IOException("Failed to delete earlier output of task: " + 
                                     attemptId);
            }
            if (!fs.rename(taskOutput, finalOutputPath)) {
              throw new IOException("Failed to save output of task: " + 
            		  attemptId);
            }
          }
          LOG.debug("Moved " + taskOutput + " to " + finalOutputPath);
        } else if(fs.getFileStatus(taskOutput).isDir()) {
          FileStatus[] paths = fs.listStatus(taskOutput);
          Path finalOutputPath = getFinalPath(jobOutputDir, taskOutput, workPath);
          fs.mkdirs(finalOutputPath);
          if (paths != null) {
            for (FileStatus path : paths) {
              moveTaskOutputs(context, fs, jobOutputDir, path.getPath());
            }
          }
        }
      }
    
      /**
       * Delete the work directory
       */
      @Override
      public void abortTask(TaskAttemptContext context) {
        try {
          if (workPath != null) { 
            context.progress();
            outputFileSystem.delete(workPath, true);
          }
        } catch (IOException ie) {
          LOG.warn("Error discarding output" + StringUtils.stringifyException(ie));
        }
      }
    
      /**
       * Find the final name of a given output file, given the job output directory
       * and the work directory.
       * @param jobOutputDir the job's output directory
       * @param taskOutput the specific task output file
       * @param taskOutputPath the job's work directory
       * @return the final path for the specific output file
       * @throws IOException
       */
      private Path getFinalPath(Path jobOutputDir, Path taskOutput, 
                                Path taskOutputPath) throws IOException {
        URI taskOutputUri = taskOutput.toUri();
        URI relativePath = taskOutputPath.toUri().relativize(taskOutputUri);
        if (taskOutputUri == relativePath) {
          throw new IOException("Can not get the relative path: base = " + 
              taskOutputPath + " child = " + taskOutput);
        }
        if (relativePath.getPath().length() > 0) {
          return new Path(jobOutputDir, relativePath.getPath());
        } else {
          return jobOutputDir;
        }
      }
    
      /**
       * Did this task write any files in the work directory?
       * @param context the task's context
       */
      @Override
      public boolean needsTaskCommit(TaskAttemptContext context
                                     ) throws IOException {
        return workPath != null && outputFileSystem.exists(workPath);
      }
    
      /**
       * Get the directory that the task should write results into
       * @return the work directory
       * @throws IOException
       */
      public Path getWorkPath() throws IOException {
        return workPath;
      }
    }
  • This is what actually goes into writing WordCount; many posts online only show the driver class, which gives a very incomplete picture. Below is WordCount's driver class, the entry point at run time.
  • public class WordCount {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
          System.err.println("Usage: wordcount <in> <out>");
          System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.addInputPath(job, new Path(otherArgs[0]));
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setGroupingComparatorClass(Text.Comparator.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }
  • The command to run it is as follows
  • hadoop --config <conf-dir> jar <path-to-jar> WordCount in out
  • This concludes the complete WordCount walkthrough.