Hadoop Partitioner: A Deep Dive

Original article: http://www.cnblogs.com/archimedes/p/hadoop-partitioner.html

The Partitioner in the Old API

The Partitioner's job is to split the intermediate results produced by the Mappers so that records belonging to the same group are handed to the same Reducer; it therefore directly affects load balancing in the Reduce phase. The class diagram of the old-API Partitioner is shown in the figure. It extends JobConfigurable and can be initialized through the configure method. It declares only one method to implement, getPartition, which takes three parameters, all supplied by the framework: the first two are the key and value, and the third, numPartitions, is the number of partitions each Mapper produces, i.e. the number of Reducers.


MapReduce ships with two Partitioner implementations: HashPartitioner and TotalOrderPartitioner. HashPartitioner is the default; it partitions by hashing the key, as the following code shows:

public int getPartition(K2 key, V2 value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}

TotalOrderPartitioner provides range-based partitioning and is typically used for a total sort of the data. In a MapReduce setting, the obvious total-sort scheme is merge sort: each Map Task sorts its data locally in the Map phase, and a single Reduce Task performs the global merge in the Reduce phase. Because the job can then have only one Reduce Task, the Reduce phase becomes the bottleneck. To make a total sort faster and more scalable, MapReduce provides TotalOrderPartitioner. It splits the data into ordered ranges (partitions) such that every record in one range is greater than every record in the previous range, which reduces a total sort to the following steps:

Step 1: sample the data. The client obtains the range split points by sampling. Hadoop ships with several samplers, such as IntervalSampler, RandomSampler, and SplitSampler (see the InputSampler class in the org.apache.hadoop.mapred.lib package). An example follows.

Sampled data: b, abc, abd, bcd, abcd, efg, hii, afd, rrr, mnk

After sorting: abc, abcd, abd, afd, b, bcd, efg, hii, mnk, rrr

If there are 4 Reduce Tasks, the quartile points of the sample are abd, bcd, and mnk, and these three strings become the split points.

Step 2: the Map phase. Two components are involved here, the Mapper and the Partitioner. The Mapper can simply be IdentityMapper, which writes its input straight to its output, but the Partitioner must be TotalOrderPartitioner. It loads the split points obtained in step 1 into a trie so that the range containing any record can be located quickly; each Map Task thus produces R (the number of Reduce Tasks) ranges, and the ranges themselves are ordered. TotalOrderPartitioner looks up the Reduce Task number for each record through the trie. As shown in the figure, with the split points stored in a trie of depth 2, the input string "abg" maps to partition1, i.e. the second Reduce Task, and the string "mnz" maps to partition3, i.e. the fourth Reduce Task.


Step 3: the Reduce phase. Each Reducer sorts the data in its assigned range locally, and the concatenation of the outputs is fully sorted. As the steps above show, the efficiency of a TotalOrderPartitioner-based total sort depends directly on the key distribution and the sampling algorithm: the more uniform the keys and the more representative the sample, the better balanced the Reduce Tasks and the faster the sort. TotalOrderPartitioner has two classic applications: TeraSort and bulk loading into HBase. TeraSort is an example application that ships with Hadoop; it once won first place in the terabyte sort benchmark, and TotalOrderPartitioner was in fact extracted from it. HBase is a NoSQL data store built on top of Hadoop. It divides data into Regions; the data inside a Region is sorted by key, and the Regions themselves are ordered as well. Clearly, the R output files of a total-sort MapReduce job map one-to-one onto R HBase Regions.

The Partitioner in the New API

The class diagram of the new-API Partitioner is shown in the figure. It no longer implements the JobConfigurable interface. If you need the Partitioner to be initialized from a JobConf object, implement the Configurable interface yourself, as TotalOrderPartitioner does:

public class TotalOrderPartitioner<K, V> extends Partitioner<K,V> implements Configurable
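
For illustration, a hypothetical new-API partitioner that reads a setting from the job configuration might look like the following sketch (the class and the example.partition.offset property are made up for this example; they are not part of Hadoop):

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/** Hypothetical example: a new-API partitioner initialized from the job configuration. */
public class ConfigurableModuloPartitioner extends Partitioner<Text, Text> implements Configurable {

  private Configuration conf;
  private int offset;   // illustrative setting read from the configuration

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
    // The framework calls setConf() when it instantiates a Configurable class via ReflectionUtils.
    this.offset = conf.getInt("example.partition.offset", 0);
  }

  @Override
  public Configuration getConf() {
    return conf;
  }

  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    // Mask the sign bit so the result is never negative, then spread across reducers.
    return ((key.hashCode() + offset) & Integer.MAX_VALUE) % numPartitions;
  }
}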


Where the Partitioner Fits In


The Partitioner's main job is to route each map output record to the appropriate reducer, which places two requirements on it:

1) Load balancing: spread the work as evenly as possible across the reducers.

2) Efficiency: the assignment itself must be fast.

Partitioners Provided by MapReduce


Partitioner class structure

1. Partitioner<K,V> is the base type for all partitioners; a custom partitioner must extend or implement it as well. Its (old-API) source is shown below, and a small custom example follows it:

package org.apache.hadoop.mapred;
/** 
 * Partitions the key space.
 * 
 * <p><code>Partitioner</code> controls the partitioning of the keys of the 
 * intermediate map-outputs. The key (or a subset of the key) is used to derive
 * the partition, typically by a hash function. The total number of partitions
 * is the same as the number of reduce tasks for the job. Hence this controls
 * which of the <code>m</code> reduce tasks the intermediate key (and hence the 
 * record) is sent for reduction.</p>
 * 
 * @see Reducer
 * @deprecated Use {@link org.apache.hadoop.mapreduce.Partitioner} instead.
 */
@Deprecated
public interface Partitioner<K2, V2> extends JobConfigurable {
  
  /** 
   * Get the paritition number for a given key (hence record) given the total 
   * number of partitions i.e. number of reduce-tasks for the job.
   *   
   * <p>Typically a hash function on a all or a subset of the key.</p>
   *
   * @param key the key to be paritioned.
   * @param value the entry value.
   * @param numPartitions the total number of partitions.
   * @return the partition number for the <code>key</code>.
   */
  int getPartition(K2 key, V2 value, int numPartitions);
}
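
As a minimal illustration of a custom old-API partitioner (a hypothetical example, not part of Hadoop), the following routes Text records by the first character of the key:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

/** Hypothetical example: partition Text keys by their first character. */
public class FirstCharPartitioner implements Partitioner<Text, Text> {

  @Override
  public void configure(JobConf job) {
    // No configuration needed for this simple example.
  }

  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    if (key.getLength() == 0) {
      return 0;
    }
    // Mask the sign bit so the result is never negative, then spread across reducers.
    return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
  }
}

It would be registered on a job with JobConf.setPartitionerClass(FirstCharPartitioner.class).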

2. HashPartitioner<K,V> is MapReduce's default partitioner. Its source:

package org.apache.hadoop.mapreduce.lib.partition;
import org.apache.hadoop.mapreduce.Partitioner;
/** Partition keys by their {@link Object#hashCode()}. */
public class HashPartitioner<K, V> extends Partitioner<K, V> {
  /** Use {@link Object#hashCode()} to partition. */
  public int getPartition(K key, V value,
                          int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

3. BinaryPartitioner extends Partitioner<BinaryComparable, V> and is a specialization of Partitioner<K,V> for binary keys. It exposes a leftOffset and a rightOffset, and when deciding which reducer a record goes to it hashes only the bytes of the key in the range [leftOffset, rightOffset] (a small illustrative sketch follows the formula below):

reducer=(hash & Integer.MAX_VALUE) % numReduceTasks
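
A hypothetical sketch of that idea (the class and method below are made up for illustration; the real BinaryPartitioner reads the two offsets from the configuration and has its own hashing routine):

/** Illustrative only: hash just the bytes of the key in [leftOffset, rightOffset]. */
public class ByteRangePartitionSketch {

    static int getPartition(byte[] key, int leftOffset, int rightOffset, int numReduceTasks) {
        int hash = 1;
        for (int i = leftOffset; i <= rightOffset && i < key.length; i++) {
            hash = 31 * hash + key[i];   // same rolling-hash style as KeyFieldBasedPartitioner below
        }
        return (hash & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        byte[] key = "2019-03-17-payload".getBytes();
        // Partition on the date prefix only (bytes 0..9), ignoring the rest of the key.
        System.out.println(getPartition(key, 0, 9, 4));
    }
}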

4. KeyFieldBasedPartitioner<K2, V2> is another hash-based partitioner. Unlike BinaryPartitioner, it can hash over several key-field ranges; when no ranges are specified it degenerates into HashPartitioner. Its source is shown below, followed by a short configuration sketch:

package org.apache.hadoop.mapred.lib;

import java.io.UnsupportedEncodingException;
import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;
import org.apache.hadoop.mapred.lib.KeyFieldHelper.KeyDescription;

 /**   
  *  Defines a way to partition keys based on certain key fields (also see
  *  {@link KeyFieldBasedComparator}.
  *  The key specification supported is of the form -k pos1[,pos2], where,
  *  pos is of the form f[.c][opts], where f is the number
  *  of the key field to use, and c is the number of the first character from
  *  the beginning of the field. Fields and character posns are numbered 
  *  starting with 1; a character position of zero in pos2 indicates the
  *  field's last character. If '.c' is omitted from pos1, it defaults to 1
  *  (the beginning of the field); if omitted from pos2, it defaults to 0 
  *  (the end of the field).
  * 
  */
public class KeyFieldBasedPartitioner<K2, V2> implements Partitioner<K2, V2> {
  private static final Log LOG = LogFactory.getLog(KeyFieldBasedPartitioner.class.getName());
  private int numOfPartitionFields;
  
  private KeyFieldHelper keyFieldHelper = new KeyFieldHelper();

  public void configure(JobConf job) {
    String keyFieldSeparator = job.get("map.output.key.field.separator", "\t");
    keyFieldHelper.setKeyFieldSeparator(keyFieldSeparator);
    if (job.get("num.key.fields.for.partition") != null) {
      LOG.warn("Using deprecated num.key.fields.for.partition. " +
              "Use mapred.text.key.partitioner.options instead");
      this.numOfPartitionFields = job.getInt("num.key.fields.for.partition",0);
      keyFieldHelper.setKeyFieldSpec(1,numOfPartitionFields);
    } else {
      String option = job.getKeyFieldPartitionerOption();
      keyFieldHelper.parseOption(option);
    }
  }

  public int getPartition(K2 key, V2 value,
      int numReduceTasks) {
    byte[] keyBytes;

    List <KeyDescription> allKeySpecs = keyFieldHelper.keySpecs();
    if (allKeySpecs.size() == 0) {
      return getPartition(key.toString().hashCode(), numReduceTasks);
    }

    try {
      keyBytes = key.toString().getBytes("UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new RuntimeException("The current system does not " +
          "support UTF-8 encoding!", e);
    }
    // return 0 if the key is empty
    if (keyBytes.length == 0) {
      return 0;
    }
    
    int []lengthIndicesFirst = keyFieldHelper.getWordLengths(keyBytes, 0, 
        keyBytes.length);
    int currentHash = 0;
    for (KeyDescription keySpec : allKeySpecs) {
      int startChar = keyFieldHelper.getStartOffset(keyBytes, 0, keyBytes.length, 
          lengthIndicesFirst, keySpec);
       // no key found! continue
      if (startChar < 0) {
        continue;
      }
      int endChar = keyFieldHelper.getEndOffset(keyBytes, 0, keyBytes.length, 
          lengthIndicesFirst, keySpec);
      currentHash = hashCode(keyBytes, startChar, endChar, 
          currentHash);
    }
    return getPartition(currentHash, numReduceTasks);
  }
  
  protected int hashCode(byte[] b, int start, int end, int currentHash) {
    for (int i = start; i <= end; i++) {
      currentHash = 31*currentHash + b[i];
    }
    return currentHash;
  }

  protected int getPartition(int hash, int numReduceTasks) {
    return (hash & Integer.MAX_VALUE) % numReduceTasks;
  }
}
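
As a usage illustration (the key layout and option value here are assumptions made up for this example), a job that partitions dot-separated keys on their second field could be configured roughly as follows:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner;

/** Hypothetical configuration sketch: partition keys such as "2019.03.17.12345" on their second field. */
public class KeyFieldPartitionerConfigExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    conf.setPartitionerClass(KeyFieldBasedPartitioner.class);
    // Separator between key fields (the same property read in configure() above).
    conf.set("map.output.key.field.separator", ".");
    // Hash only the second field; the option uses the -k pos1[,pos2] form described in the javadoc above.
    conf.set("mapred.text.key.partitioner.options", "-k2,2");
  }
}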

5. TotalOrderPartitioner makes it possible to produce totally ordered output. Unlike the three partitioners above, it is not hash-based. It is described in detail below.

The TotalOrderPartitioner Class

Each reducer's output is sorted by default, but when the reducers' inputs are not range-partitioned there is no ordering between the outputs of different reducers. To make the overall output totally sorted you use TotalOrderPartitioner.

To use TotalOrderPartitioner you must supply it with a partition file. The keys in this file are the split points; there must be exactly one fewer of them than there are reducers, and they must be stored in ascending order. Why such a file is needed, and what exactly goes into it, is discussed further below.

TotalOrderPartitioner provides two lookup strategies, depending on the key type:

1) For keys that are not BinaryComparable, TotalOrderPartitioner uses binary search to find the index of the range the key falls into.

For example, with 5 reducers the partition file supplies 4 split points, say [2, 4, 6, 8]. For the record <4, "good">, binary search finds the key at index 1, and index + 1 = 2, so the record is sent to partition 2. For the record <4.5, "good">, binary search returns -3; adding 1 and then negating also gives 2, which is the partition this record goes to.

For numeric keys, binary search costs O(log R), where R is the number of reducers, so the lookup is fast, as the sketch below shows.
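
The following standalone sketch simply reproduces this arithmetic with the split points above (it is not the Hadoop class itself):

import java.util.Arrays;

public class BinarySearchPartitionDemo {

    /** Partition index = number of split points less than or equal to the key. */
    static int findPartition(double[] splitPoints, double key) {
        int pos = Arrays.binarySearch(splitPoints, key) + 1;
        // A negative value means the key was not found; negate the shifted insertion point.
        return (pos < 0) ? -pos : pos;
    }

    public static void main(String[] args) {
        double[] splitPoints = {2, 4, 6, 8};                  // 4 split points => 5 reducers
        System.out.println(findPartition(splitPoints, 4));    // found at index 1 -> partition 2
        System.out.println(findPartition(splitPoints, 4.5));  // binarySearch returns -3 -> partition 2
        System.out.println(findPartition(splitPoints, 1));    // smaller than every split point -> partition 0
    }
}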

2) For BinaryComparable keys (think of them simply as strings), the keys can be ordered lexicographically.

So here, too, a set of split points can be supplied to send different string keys to different reducers; the handling is very similar to the numeric case.

For example, with 5 reducers and the 4 split points ["abc", "bce", "eaa", "fhc"] in the partition file, the string "ab" is assigned to the first reducer because it is smaller than the first split point "abc".

Unlike numeric keys, however, strings cannot be located with numeric comparisons alone. MapReduce therefore looks up string keys with a trie (see the separate article on trie trees). The lookup takes O(m) time, where m is the depth of the trie, at a space cost of roughly O(255^(m-1)); it is a classic space-for-time trade-off.

Building the trie

Suppose the maximum depth of the trie is 3 and the split points are [aaad, aaaf, aaaeh, abbx].


The trie in MapReduce is made up of two kinds of nodes:

1) Inner nodes (InnerTrieNode)

An inner trie node holds a child slot for each of the 255 possible byte values, so it is quite wide; the example in the figure above shows only the 26 English letters.

2) Leaf nodes: UnsplitTrieNode, SinglySplitTrieNode, and LeafTrieNode

An UnsplitTrieNode is a leaf that contains no split point.

A SinglySplitTrieNode is a leaf that contains exactly one split point.

A LeafTrieNode is a leaf that contains several split points. (This case is rare: it only occurs when the maximum depth of the trie is reached, and it is seldom seen in practice.)

Searching the trie

Continuing with the example above:

1) For a record such as <aad, 10>, the search reaches the LeafTrieNode in the figure; inside that leaf a binary search finds the index of aad in the split-point array, or, if it is not present, the index of the closest split point.

2) If the search reaches a SinglySplitTrieNode and the key is equal to or smaller than that node's split point, the node's index is returned; if the key is larger, index + 1 is returned.

3) If the search reaches an UnsplitTrieNode, the index determined by the split points that precede it is returned; for example, <zaa, 20> lies beyond the last split point abbx, so it is sent to the last partition. A small runnable sketch of the whole lookup follows.
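
To make the lookup concrete, here is a minimal, self-contained sketch of the idea (class and method names are invented for this example; the real TotalOrderPartitioner precomputes the prefix ranges in trie nodes rather than narrowing them on every call):

import java.util.Arrays;

public class TrieLookupSketch {

    private final String[] splits;   // sorted split points (numReduceTasks - 1 of them)
    private final int maxDepth;

    TrieLookupSketch(String[] sortedSplits, int maxDepth) {
        this.splits = sortedSplits;
        this.maxDepth = maxDepth;
    }

    /** Partition = number of split points <= key; the prefix walk only narrows the search range. */
    int getPartition(String key) {
        int lo = 0, hi = splits.length;
        for (int depth = 0; depth < maxDepth && depth < key.length() && lo < hi; depth++) {
            char c = key.charAt(depth);
            // Split points shorter than depth+1 characters, or with a smaller character here,
            // are definitely smaller than the key and stay below the narrowed range.
            while (lo < hi && (splits[lo].length() <= depth || splits[lo].charAt(depth) < c)) {
                lo++;
            }
            // Keep only the split points that share the key's first depth+1 characters.
            int end = lo;
            while (end < hi && splits[end].length() > depth && splits[end].charAt(depth) == c) {
                end++;
            }
            hi = end;
        }
        // Same "+1, negate if negative" rule as the numeric binary search above.
        int pos = Arrays.binarySearch(splits, lo, hi, key) + 1;
        return (pos < 0) ? -pos : pos;
    }

    public static void main(String[] args) {
        // Split points from the example, stored in ascending order, with a maximum depth of 3.
        TrieLookupSketch trie = new TrieLookupSketch(
                new String[] {"aaad", "aaaeh", "aaaf", "abbx"}, 3);
        System.out.println(trie.getPartition("aaae"));  // 1: falls between "aaad" and "aaaeh"
        System.out.println(trie.getPartition("zaa"));   // 4: larger than every split point
    }
}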

An Open Question About TotalOrderPartitioner

As noted earlier, a partitioner has two requirements: speed and balanced load. The trie takes care of lookup speed, but how do we obtain a partition file whose split points actually balance the load?

InputSampler

InputSampler samples the data under the job's input directories. It provides three sampling strategies.


Sampler class structure

Comparison of the samplers:

Class                  Sampling method                          Constructor parameters                               Efficiency   Notes
SplitSampler<K,V>      first n records of each sampled split    number of samples, max splits to sample              highest
RandomSampler<K,V>     random sample over all the data          frequency, number of samples, max splits to sample   lowest
IntervalSampler<K,V>   records taken at fixed intervals         frequency, max splits to sample                                   well suited to sorted data

The key method is writePartitionFile. It takes the sample returned by the sampler, sorts it with the job's output key comparator, picks (number of reducers − 1) keys at evenly spaced ranks, and writes them to the partition file. Because the split points are derived from a sample of the data, each resulting range holds roughly the same number of key/value pairs, which is what balances the load.
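
A simplified sketch of that selection step (plain strings and in-memory arrays only; the real method uses the job's output key comparator, skips duplicate keys, and writes a SequenceFile, as the source at the end of this article shows):

import java.util.Arrays;

public class PartitionFileSketch {

    /** Pick numPartitions - 1 boundary keys at evenly spaced ranks of the sorted sample. */
    static String[] pickSplitPoints(String[] samples, int numPartitions) {
        Arrays.sort(samples);  // the real code sorts with the job's output key comparator
        float stepSize = samples.length / (float) numPartitions;
        String[] splitPoints = new String[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            splitPoints[i - 1] = samples[Math.round(stepSize * i)];
        }
        return splitPoints;
    }

    public static void main(String[] args) {
        String[] samples = {"b", "abc", "abd", "bcd", "abcd", "efg", "hii", "afd", "rrr", "mnk"};
        // Prints three boundary keys that cut the sorted sample into four roughly equal ranges.
        System.out.println(Arrays.toString(pickSplitPoints(samples, 4)));
    }
}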

The source of SplitSampler:

/**
 * Samples the first n records from s splits.
 * Inexpensive way to sample random data.
 */
public static class SplitSampler<K,V> implements Sampler<K,V> {
  private final int numSamples;
  private final int maxSplitsSampled;
  /**
   * Create a SplitSampler sampling <em>all</em> splits.
   * Takes the first numSamples / numSplits records from each split.
   * @param numSamples Total number of samples to obtain from all selected
   *                   splits.
   */
  public SplitSampler(int numSamples) {
    this(numSamples, Integer.MAX_VALUE);
  }
  /**
   * Create a new SplitSampler.
   * @param numSamples Total number of samples to obtain from all selected
   *                   splits.
   * @param maxSplitsSampled The maximum number of splits to examine.
   */
  public SplitSampler(int numSamples, int maxSplitsSampled) {
    this.numSamples = numSamples;
    this.maxSplitsSampled = maxSplitsSampled;
  }
  /**
   * From each split sampled, take the first numSamples / numSplits records.
   */
  @SuppressWarnings("unchecked") // ArrayList::toArray doesn't preserve type
  public K[] getSample(InputFormat<K,V> inf, JobConf job) throws IOException {
    InputSplit[] splits = inf.getSplits(job, job.getNumMapTasks());
    ArrayList<K> samples = new ArrayList<K>(numSamples);
    int splitsToSample = Math.min(maxSplitsSampled, splits.length);
    int splitStep = splits.length / splitsToSample;
    int samplesPerSplit = numSamples / splitsToSample;
    long records = 0;
    for (int i = 0; i < splitsToSample; ++i) {
      RecordReader<K,V> reader = inf.getRecordReader(splits[i * splitStep],
          job, Reporter.NULL);
      K key = reader.createKey();
      V value = reader.createValue();
      while (reader.next(key, value)) {
        samples.add(key);
        key = reader.createKey();
        ++records;
        if ((i+1) * samplesPerSplit <= records) {
          break;
        }
      }
      reader.close();
    }
    return (K[])samples.toArray();
  }
}

The source of RandomSampler:

/**
 * Sample from random points in the input.
 * General-purpose sampler. Takes numSamples / maxSplitsSampled inputs from
 * each split.
 */
public static class RandomSampler<K,V> implements Sampler<K,V> {
  private double freq;
  private final int numSamples;
  private final int maxSplitsSampled;

  /**
   * Create a new RandomSampler sampling <em>all</em> splits.
   * This will read every split at the client, which is very expensive.
   * @param freq Probability with which a key will be chosen.
   * @param numSamples Total number of samples to obtain from all selected
   *                   splits.
   */
  public RandomSampler(double freq, int numSamples) {
    this(freq, numSamples, Integer.MAX_VALUE);
  }

  /**
   * Create a new RandomSampler.
   * @param freq Probability with which a key will be chosen.
   * @param numSamples Total number of samples to obtain from all selected
   *                   splits.
   * @param maxSplitsSampled The maximum number of splits to examine.
   */
  public RandomSampler(double freq, int numSamples, int maxSplitsSampled) {
    this.freq = freq;
    this.numSamples = numSamples;
    this.maxSplitsSampled = maxSplitsSampled;
  }

  /**
   * Randomize the split order, then take the specified number of keys from
   * each split sampled, where each key is selected with the specified
   * probability and possibly replaced by a subsequently selected key when
   * the quota of keys from that split is satisfied.
   */
  @SuppressWarnings("unchecked") // ArrayList::toArray doesn't preserve type
  public K[] getSample(InputFormat<K,V> inf, JobConf job) throws IOException {
    InputSplit[] splits = inf.getSplits(job, job.getNumMapTasks());
    ArrayList<K> samples = new ArrayList<K>(numSamples);
    int splitsToSample = Math.min(maxSplitsSampled, splits.length);

    Random r = new Random();
    long seed = r.nextLong();
    r.setSeed(seed);
    LOG.debug("seed: " + seed);
    // shuffle splits
    for (int i = 0; i < splits.length; ++i) {
      InputSplit tmp = splits[i];
      int j = r.nextInt(splits.length);
      splits[i] = splits[j];
      splits[j] = tmp;
    }
    // our target rate is in terms of the maximum number of sample splits,
    // but we accept the possibility of sampling additional splits to hit
    // the target sample keyset
    for (int i = 0; i < splitsToSample ||
                   (i < splits.length && samples.size() < numSamples); ++i) {
      RecordReader<K,V> reader = inf.getRecordReader(splits[i], job,
          Reporter.NULL);
      K key = reader.createKey();
      V value = reader.createValue();
      while (reader.next(key, value)) {
        if (r.nextDouble() <= freq) {
          if (samples.size() < numSamples) {
            samples.add(key);
          } else {
            // When exceeding the maximum number of samples, replace a
            // random element with this one, then adjust the frequency
            // to reflect the possibility of existing elements being
            // pushed out
            int ind = r.nextInt(numSamples);
            if (ind != numSamples) {
              samples.set(ind, key);
            }
            freq *= (numSamples - 1) / (double) numSamples;
          }
          key = reader.createKey();
        }
      }
      reader.close();
    }
    return (K[])samples.toArray();
  }
}

The source of IntervalSampler:

/**
 * Sample from s splits at regular intervals.
 * Useful for sorted data.
 */
public static class IntervalSampler<K,V> implements Sampler<K,V> {
  private final double freq;
  private final int maxSplitsSampled;

  /**
   * Create a new IntervalSampler sampling <em>all</em> splits.
   * @param freq The frequency with which records will be emitted.
   */
  public IntervalSampler(double freq) {
    this(freq, Integer.MAX_VALUE);
  }

  /**
   * Create a new IntervalSampler.
   * @param freq The frequency with which records will be emitted.
   * @param maxSplitsSampled The maximum number of splits to examine.
   * @see #getSample
   */
  public IntervalSampler(double freq, int maxSplitsSampled) {
    this.freq = freq;
    this.maxSplitsSampled = maxSplitsSampled;
  }

  /**
   * For each split sampled, emit when the ratio of the number of records
   * retained to the total record count is less than the specified
   * frequency.
   */
  @SuppressWarnings("unchecked") // ArrayList::toArray doesn't preserve type
  public K[] getSample(InputFormat<K,V> inf, JobConf job) throws IOException {
    InputSplit[] splits = inf.getSplits(job, job.getNumMapTasks());
    ArrayList<K> samples = new ArrayList<K>();
    int splitsToSample = Math.min(maxSplitsSampled, splits.length);
    int splitStep = splits.length / splitsToSample;
    long records = 0;
    long kept = 0;
    for (int i = 0; i < splitsToSample; ++i) {
      RecordReader<K,V> reader = inf.getRecordReader(splits[i * splitStep],
          job, Reporter.NULL);
      K key = reader.createKey();
      V value = reader.createValue();
      while (reader.next(key, value)) {
        ++records;
        if ((double) kept / records < freq) {
          ++kept;
          samples.add(key);
          key = reader.createKey();
        }
      }
      reader.close();
    }
    return (K[])samples.toArray();
  }
}

The complete source of InputSampler:

package org.apache.hadoop.mapred.lib;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Random;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * Utility for collecting samples and writing a partition file for
 * {@link org.apache.hadoop.mapred.lib.TotalOrderPartitioner}.
 */
public class InputSampler<K,V> implements Tool {

  private static final Log LOG = LogFactory.getLog(InputSampler.class);

  static int printUsage() {
    System.out.println("sampler -r <reduces>\n" +
                       "      [-inFormat <input format class>]\n" +
                       "      [-keyClass <map input & output key class>]\n" +
                       "      [-splitRandom <double pcnt> <numSamples> <maxsplits> | " +
                       "// Sample from random splits at random (general)\n" +
                       "       -splitSample <numSamples> <maxsplits> | " +
                       "             // Sample from first records in splits (random data)\n"+
                       "       -splitInterval <double pcnt> <maxsplits>]" +
                       "             // Sample from splits at intervals (sorted data)");
    System.out.println("Default sampler: -splitRandom 0.1 10000 10");
    ToolRunner.printGenericCommandUsage(System.out);
    return -1;
  }

  private JobConf conf;

  public InputSampler(JobConf conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }

  public void setConf(Configuration conf) {
    if (!(conf instanceof JobConf)) {
      this.conf = new JobConf(conf);
    } else {
      this.conf = (JobConf) conf;
    }
  }

  /**
   * Interface to sample using an {@link org.apache.hadoop.mapred.InputFormat}.
   */
  public interface Sampler<K,V> {
    /**
     * For a given job, collect and return a subset of the keys from the
     * input data.
     */
    K[] getSample(InputFormat<K,V> inf, JobConf job) throws IOException;
  }

  /**
   * Samples the first n records from s splits.
   * Inexpensive way to sample random data.
   */
  public static class SplitSampler<K,V> implements Sampler<K,V> {

    private final int numSamples;
    private final int maxSplitsSampled;

    /**
     * Create a SplitSampler sampling <em>all</em> splits.
     * Takes the first numSamples / numSplits records from each split.
     * @param numSamples Total number of samples to obtain from all selected
     *                   splits.
     */
    public SplitSampler(int numSamples) {
      this(numSamples, Integer.MAX_VALUE);
    }

    /**
     * Create a new SplitSampler.
     * @param numSamples Total number of samples to obtain from all selected
     *                   splits.
     * @param maxSplitsSampled The maximum number of splits to examine.
     */
    public SplitSampler(int numSamples, int maxSplitsSampled) {
      this.numSamples = numSamples;
      this.maxSplitsSampled = maxSplitsSampled;
    }

    /**
     * From each split sampled, take the first numSamples / numSplits records.
     */
    @SuppressWarnings("unchecked") // ArrayList::toArray doesn't preserve type
    public K[] getSample(InputFormat<K,V> inf, JobConf job) throws IOException {
      InputSplit[] splits = inf.getSplits(job, job.getNumMapTasks());
      ArrayList<K> samples = new ArrayList<K>(numSamples);
      int splitsToSample = Math.min(maxSplitsSampled, splits.length);
      int splitStep = splits.length / splitsToSample;
      int samplesPerSplit = numSamples / splitsToSample;
      long records = 0;
      for (int i = 0; i < splitsToSample; ++i) {
        RecordReader<K,V> reader = inf.getRecordReader(splits[i * splitStep],
            job, Reporter.NULL);
        K key = reader.createKey();
        V value = reader.createValue();
        while (reader.next(key, value)) {
          samples.add(key);
          key = reader.createKey();
          ++records;
          if ((i+1) * samplesPerSplit <= records) {
            break;
          }
        }
        reader.close();
      }
      return (K[])samples.toArray();
    }
  }

  /**
   * Sample from random points in the input.
   * General-purpose sampler. Takes numSamples / maxSplitsSampled inputs from
   * each split.
   */
  public static class RandomSampler<K,V> implements Sampler<K,V> {
    private double freq;
    private final int numSamples;
    private final int maxSplitsSampled;

    /**
     * Create a new RandomSampler sampling <em>all</em> splits.
     * This will read every split at the client, which is very expensive.
     * @param freq Probability with which a key will be chosen.
     * @param numSamples Total number of samples to obtain from all selected
     *                   splits.
     */
    public RandomSampler(double freq, int numSamples) {
      this(freq, numSamples, Integer.MAX_VALUE);
    }

    /**
     * Create a new RandomSampler.
     * @param freq Probability with which a key will be chosen.
     * @param numSamples Total number of samples to obtain from all selected
     *                   splits.
     * @param maxSplitsSampled The maximum number of splits to examine.
     */
    public RandomSampler(double freq, int numSamples, int maxSplitsSampled) {
      this.freq = freq;
      this.numSamples = numSamples;
      this.maxSplitsSampled = maxSplitsSampled;
    }

    /**
     * Randomize the split order, then take the specified number of keys from
     * each split sampled, where each key is selected with the specified
     * probability and possibly replaced by a subsequently selected key when
     * the quota of keys from that split is satisfied.
     */
    @SuppressWarnings("unchecked") // ArrayList::toArray doesn't preserve type
    public K[] getSample(InputFormat<K,V> inf, JobConf job) throws IOException {
      InputSplit[] splits = inf.getSplits(job, job.getNumMapTasks());
      ArrayList<K> samples = new ArrayList<K>(numSamples);
      int splitsToSample = Math.min(maxSplitsSampled, splits.length);

      Random r = new Random();
      long seed = r.nextLong();
      r.setSeed(seed);
      LOG.debug("seed: " + seed);
      // shuffle splits
      for (int i = 0; i < splits.length; ++i) {
        InputSplit tmp = splits[i];
        int j = r.nextInt(splits.length);
        splits[i] = splits[j];
        splits[j] = tmp;
      }
      // our target rate is in terms of the maximum number of sample splits,
      // but we accept the possibility of sampling additional splits to hit
      // the target sample keyset
      for (int i = 0; i < splitsToSample ||
                     (i < splits.length && samples.size() < numSamples); ++i) {
        RecordReader<K,V> reader = inf.getRecordReader(splits[i], job,
            Reporter.NULL);
        K key = reader.createKey();
        V value = reader.createValue();
        while (reader.next(key, value)) {
          if (r.nextDouble() <= freq) {
            if (samples.size() < numSamples) {
              samples.add(key);
            } else {
              // When exceeding the maximum number of samples, replace a
              // random element with this one, then adjust the frequency
              // to reflect the possibility of existing elements being
              // pushed out
              int ind = r.nextInt(numSamples);
              if (ind != numSamples) {
                samples.set(ind, key);
              }
              freq *= (numSamples - 1) / (double) numSamples;
            }
            key = reader.createKey();
          }
        }
        reader.close();
      }
      return (K[])samples.toArray();
    }
  }

  /**
   * Sample from s splits at regular intervals.
   * Useful for sorted data.
   */
  public static class IntervalSampler<K,V> implements Sampler<K,V> {
    private final double freq;
    private final int maxSplitsSampled;

    /**
     * Create a new IntervalSampler sampling <em>all</em> splits.
     * @param freq The frequency with which records will be emitted.
     */
    public IntervalSampler(double freq) {
      this(freq, Integer.MAX_VALUE);
    }

    /**
     * Create a new IntervalSampler.
     * @param freq The frequency with which records will be emitted.
     * @param maxSplitsSampled The maximum number of splits to examine.
     * @see #getSample
     */
    public IntervalSampler(double freq, int maxSplitsSampled) {
      this.freq = freq;
      this.maxSplitsSampled = maxSplitsSampled;
    }

    /**
     * For each split sampled, emit when the ratio of the number of records
     * retained to the total record count is less than the specified
     * frequency.
     */
    @SuppressWarnings("unchecked") // ArrayList::toArray doesn't preserve type
    public K[] getSample(InputFormat<K,V> inf, JobConf job) throws IOException {
      InputSplit[] splits = inf.getSplits(job, job.getNumMapTasks());
      ArrayList<K> samples = new ArrayList<K>();
      int splitsToSample = Math.min(maxSplitsSampled, splits.length);
      int splitStep = splits.length / splitsToSample;
      long records = 0;
      long kept = 0;
      for (int i = 0; i < splitsToSample; ++i) {
        RecordReader<K,V> reader = inf.getRecordReader(splits[i * splitStep],
            job, Reporter.NULL);
        K key = reader.createKey();
        V value = reader.createValue();
        while (reader.next(key, value)) {
          ++records;
          if ((double) kept / records < freq) {
            ++kept;
            samples.add(key);
            key = reader.createKey();
          }
        }
        reader.close();
      }
      return (K[])samples.toArray();
    }
  }

  /**
   * Write a partition file for the given job, using the Sampler provided.
   * Queries the sampler for a sample keyset, sorts by the output key
   * comparator, selects the keys for each rank, and writes to the destination
   * returned from {@link
     org.apache.hadoop.mapred.lib.TotalOrderPartitioner#getPartitionFile}.
   */
  @SuppressWarnings("unchecked") // getInputFormat, getOutputKeyComparator
  public static <K,V> void writePartitionFile(JobConf job,
      Sampler<K,V> sampler) throws IOException {
    final InputFormat<K,V> inf = (InputFormat<K,V>) job.getInputFormat();
    int numPartitions = job.getNumReduceTasks();
    K[] samples = sampler.getSample(inf, job);
    LOG.info("Using " + samples.length + " samples");
    RawComparator<K> comparator =
      (RawComparator<K>) job.getOutputKeyComparator();
    Arrays.sort(samples, comparator);
    Path dst = new Path(TotalOrderPartitioner.getPartitionFile(job));
    FileSystem fs = dst.getFileSystem(job);
    if (fs.exists(dst)) {
      fs.delete(dst, false);
    }
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, job, dst,
        job.getMapOutputKeyClass(), NullWritable.class);
    NullWritable nullValue = NullWritable.get();
    float stepSize = samples.length / (float) numPartitions;
    int last = -1;
    for(int i = 1; i < numPartitions; ++i) {
      int k = Math.round(stepSize * i);
      while (last >= k && comparator.compare(samples[last], samples[k]) == 0) {
        ++k;
      }
      writer.append(samples[k], nullValue);
      last = k;
    }
    writer.close();
  }

  /**
   * Driver for InputSampler from the command line.
   * Configures a JobConf instance and calls {@link #writePartitionFile}.
   */
  public int run(String[] args) throws Exception {
    JobConf job = (JobConf) getConf();
    ArrayList<String> otherArgs = new ArrayList<String>();
    Sampler<K,V> sampler = null;
    for(int i=0; i < args.length; ++i) {
      try {
        if ("-r".equals(args[i])) {
          job.setNumReduceTasks(Integer.parseInt(args[++i]));
        } else if ("-inFormat".equals(args[i])) {
          job.setInputFormat(
              Class.forName(args[++i]).asSubclass(InputFormat.class));
        } else if ("-keyClass".equals(args[i])) {
          job.setMapOutputKeyClass(
              Class.forName(args[++i]).asSubclass(WritableComparable.class));
        } else if ("-splitSample".equals(args[i])) {
          int numSamples = Integer.parseInt(args[++i]);
          int maxSplits = Integer.parseInt(args[++i]);
          if (0 >= maxSplits) maxSplits = Integer.MAX_VALUE;
          sampler = new SplitSampler<K,V>(numSamples, maxSplits);
        } else if ("-splitRandom".equals(args[i])) {
          double pcnt = Double.parseDouble(args[++i]);
          int numSamples = Integer.parseInt(args[++i]);
          int maxSplits = Integer.parseInt(args[++i]);
          if (0 >= maxSplits) maxSplits = Integer.MAX_VALUE;
          sampler = new RandomSampler<K,V>(pcnt, numSamples, maxSplits);
        } else if ("-splitInterval".equals(args[i])) {
          double pcnt = Double.parseDouble(args[++i]);
          int maxSplits = Integer.parseInt(args[++i]);
          if (0 >= maxSplits) maxSplits = Integer.MAX_VALUE;
          sampler = new IntervalSampler<K,V>(pcnt, maxSplits);
        } else {
          otherArgs.add(args[i]);
        }
      } catch (NumberFormatException except) {
        System.out.println("ERROR: Integer expected instead of " + args[i]);
        return printUsage();
      } catch (ArrayIndexOutOfBoundsException except) {
        System.out.println("ERROR: Required parameter missing from " +
            args[i-1]);
        return printUsage();
      }
    }
    if (job.getNumReduceTasks() <= 1) {
      System.err.println("Sampler requires more than one reducer");
      return printUsage();
    }
    if (otherArgs.size() < 2) {
      System.out.println("ERROR: Wrong number of parameters: ");
      return printUsage();
    }
    if (null == sampler) {
      sampler = new RandomSampler<K,V>(0.1, 10000, 10);
    }

    Path outf = new Path(otherArgs.remove(otherArgs.size() - 1));
    TotalOrderPartitioner.setPartitionFile(job, outf);
    for (String s : otherArgs) {
      FileInputFormat.addInputPath(job, new Path(s));
    }
    InputSampler.<K,V>writePartitionFile(job, sampler);

    return 0;
  }

  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(InputSampler.class);
    InputSampler<?,?> sampler = new InputSampler(job);
    int res = ToolRunner.run(sampler, args);
    System.exit(res);
  }
}

A TotalOrderPartitioner Example

// Example adapted from "Hadoop: The Definitive Guide"; JobBuilder is a helper class
// from that book's example code, not part of Hadoop itself.
import java.net.URI;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.InputSampler;
import org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SortByTemperatureUsingTotalOrderPartitioner extends Configured
    implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    JobConf conf = JobBuilder.parseInputAndOutput(this, getConf(), args);
    if (conf == null) {
      return -1;
    }
    conf.setInputFormat(SequenceFileInputFormat.class);
    conf.setOutputKeyClass(IntWritable.class);
    conf.setOutputFormat(SequenceFileOutputFormat.class);
    SequenceFileOutputFormat.setCompressOutput(conf, true);
    SequenceFileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
    SequenceFileOutputFormat.setOutputCompressionType(conf, CompressionType.BLOCK);
    conf.setPartitionerClass(TotalOrderPartitioner.class);

    // Sample the input to build the partition file.
    InputSampler.Sampler<IntWritable, Text> sampler =
        new InputSampler.RandomSampler<IntWritable, Text>(0.1, 10000, 10);
    Path input = FileInputFormat.getInputPaths(conf)[0];
    input = input.makeQualified(input.getFileSystem(conf));
    Path partitionFile = new Path(input, "_partitions");
    TotalOrderPartitioner.setPartitionFile(conf, partitionFile);
    InputSampler.writePartitionFile(conf, sampler);

    // Ship the partition file to each task via the DistributedCache.
    URI partitionUri = new URI(partitionFile.toString() + "#_partitions");
    DistributedCache.addCacheFile(partitionUri, conf);
    DistributedCache.createSymlink(conf);

    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(
        new SortByTemperatureUsingTotalOrderPartitioner(), args);
    System.exit(exitCode);
  }
}

References

1.《Hadoop技術內幕 深入理解MapReduce架構設計與實現原理》

2. http://www.cnblogs.com/xwdreamer/archive/2011/10/27/2296943.html

3. http://blog.oddfoo.net/2011/04/17/mapreduce-partition%E5%88%86%E6%9E%90-2/
