Hadoop Partitioner: A Deep Dive

Original article: http://www.cnblogs.com/archimedes/p/hadoop-partitioner.html

The Partitioner in the Old API

The Partitioner's job is to split the intermediate results produced by the Mappers so that records belonging to the same group are handed to the same Reducer; it therefore directly affects load balancing in the Reduce phase. The class diagram of the old-API Partitioner is shown in the figure. It extends JobConfigurable and can be initialized through the configure method. It declares only one method to implement, getPartition, which takes three parameters, all supplied by the framework: the first two are the key and value, and the third, numPartitions, is the number of partitions each Mapper produces, i.e. the number of Reducers.


MapReduce ships with two Partitioner implementations: HashPartitioner and TotalOrderPartitioner. HashPartitioner is the default; it partitions by hashing the key, as the following code shows:

public int getPartition(K2 key, V2 value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}

TotalOrderPartitioner provides range-based partitioning and is typically used for a total sort of the data. In a MapReduce setting, the obvious total-sort scheme is merge sort: each Map Task sorts its data locally in the Map phase, and a single Reduce Task performs the global merge in the Reduce phase. Because the job can then have only one Reduce Task, the Reduce phase becomes the bottleneck. To make a total sort faster and more scalable, MapReduce provides TotalOrderPartitioner. It splits the data into ordered ranges (partitions) such that every record in one range is greater than every record in the previous range, which reduces a total sort to the following steps:

Step 1: sample the data. The client obtains the range split points by sampling. Hadoop ships with several samplers, such as IntervalSampler, RandomSampler, and SplitSampler (see the InputSampler class in the org.apache.hadoop.mapred.lib package). An example follows.

Sampled data: b, abc, abd, bcd, abcd, efg, hii, afd, rrr, mnk

After sorting: abc, abcd, abd, afd, b, bcd, efg, hii, mnk, rrr

If there are 4 Reduce Tasks, the quartile points of the sample are abd, bcd, and mnk, and these three strings become the split points.

Step 2: the Map phase. Two components are involved here, the Mapper and the Partitioner. The Mapper can simply be IdentityMapper, which writes its input straight to its output, but the Partitioner must be TotalOrderPartitioner. It loads the split points obtained in step 1 into a trie so that the range containing any record can be located quickly; each Map Task thus produces R (the number of Reduce Tasks) ranges, and the ranges themselves are ordered. TotalOrderPartitioner looks up the Reduce Task number for each record through the trie. As shown in the figure, with the split points stored in a trie of depth 2, the input string "abg" maps to partition1, i.e. the second Reduce Task, and the string "mnz" maps to partition3, i.e. the fourth Reduce Task.


Step 3: the Reduce phase. Each Reducer sorts the data in its assigned range locally, and the concatenation of the outputs is fully sorted. As the steps above show, the efficiency of a TotalOrderPartitioner-based total sort depends directly on the key distribution and the sampling algorithm: the more uniform the keys and the more representative the sample, the better balanced the Reduce Tasks and the faster the sort. TotalOrderPartitioner has two classic applications: TeraSort and bulk loading into HBase. TeraSort is an example application that ships with Hadoop; it once won first place in the terabyte sort benchmark, and TotalOrderPartitioner was in fact extracted from it. HBase is a NoSQL data store built on top of Hadoop. It divides data into Regions; the data inside a Region is sorted by key, and the Regions themselves are ordered as well. Clearly, the R output files of a total-sort MapReduce job map one-to-one onto R HBase Regions.

The Partitioner in the New API

The class diagram of the new-API Partitioner is shown in the figure. It no longer implements the JobConfigurable interface. If you need the Partitioner to be initialized from a JobConf object, implement the Configurable interface yourself, as TotalOrderPartitioner does:

public class TotalOrderPartitioner<K, V> extends Partitioner<K,V> implements Configurable
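
For illustration, a hypothetical new-API partitioner that reads a setting from the job configuration might look like the following sketch (the class and the example.partition.offset property are made up for this example; they are not part of Hadoop):

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/** Hypothetical example: a new-API partitioner initialized from the job configuration. */
public class ConfigurableModuloPartitioner extends Partitioner<Text, Text> implements Configurable {

  private Configuration conf;
  private int offset;   // illustrative setting read from the configuration

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
    // The framework calls setConf() when it instantiates a Configurable class via ReflectionUtils.
    this.offset = conf.getInt("example.partition.offset", 0);
  }

  @Override
  public Configuration getConf() {
    return conf;
  }

  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    // Mask the sign bit so the result is never negative, then spread across reducers.
    return ((key.hashCode() + offset) & Integer.MAX_VALUE) % numPartitions;
  }
}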


Where the Partitioner Fits In


The Partitioner's main job is to route each map output record to the appropriate reducer, which places two requirements on it:

1) Load balancing: spread the work as evenly as possible across the reducers.

2) Efficiency: the assignment itself must be fast.

Partitioners Provided by MapReduce


Partitioner class structure

1. Partitioner<K,V> is the base type for all partitioners; a custom partitioner must extend or implement it as well. Its (old-API) source is shown below, and a small custom example follows it:

package org.apache.hadoop.mapred;
/** 
 * Partitions the key space.
 * 
 * <p><code>Partitioner</code> controls the partitioning of the keys of the 
 * intermediate map-outputs. The key (or a subset of the key) is used to derive
 * the partition, typically by a hash function. The total number of partitions
 * is the same as the number of reduce tasks for the job. Hence this controls
 * which of the <code>m</code> reduce tasks the intermediate key (and hence the 
 * record) is sent for reduction.</p>
 * 
 * @see Reducer
 * @deprecated Use {@link org.apache.hadoop.mapreduce.Partitioner} instead.
 */
@Deprecated
public interface Partitioner<K2, V2> extends JobConfigurable {
  
  /** 
   * Get the paritition number for a given key (hence record) given the total 
   * number of partitions i.e. number of reduce-tasks for the job.
   *   
   * <p>Typically a hash function on a all or a subset of the key.</p>
   *
   * @param key the key to be paritioned.
   * @param value the entry value.
   * @param numPartitions the total number of partitions.
   * @return the partition number for the <code>key</code>.
   */
  int getPartition(K2 key, V2 value, int numPartitions);
}
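
As a minimal illustration of a custom old-API partitioner (a hypothetical example, not part of Hadoop), the following routes Text records by the first character of the key:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

/** Hypothetical example: partition Text keys by their first character. */
public class FirstCharPartitioner implements Partitioner<Text, Text> {

  @Override
  public void configure(JobConf job) {
    // No configuration needed for this simple example.
  }

  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    if (key.getLength() == 0) {
      return 0;
    }
    // Mask the sign bit so the result is never negative, then spread across reducers.
    return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
  }
}

It would be registered on a job with JobConf.setPartitionerClass(FirstCharPartitioner.class).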

2. HashPartitioner<K,V> is MapReduce's default partitioner. Its source:

package org.apache.hadoop.mapreduce.lib.partition;
import org.apache.hadoop.mapreduce.Partitioner;
/** Partition keys by their {@link Object#hashCode()}. */
public class HashPartitioner<K, V> extends Partitioner<K, V> {
  /** Use {@link Object#hashCode()} to partition. */
  public int getPartition(K key, V value,
                          int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

3. BinaryPartitioner extends Partitioner<BinaryComparable, V> and is a specialization of Partitioner<K,V> for binary keys. It exposes a leftOffset and a rightOffset, and when deciding which reducer a record goes to it hashes only the bytes of the key in the range [leftOffset, rightOffset] (a small illustrative sketch follows the formula below):

reducer=(hash & Integer.MAX_VALUE) % numReduceTasks
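
A hypothetical sketch of that idea (the class and method below are made up for illustration; the real BinaryPartitioner reads the two offsets from the configuration and has its own hashing routine):

/** Illustrative only: hash just the bytes of the key in [leftOffset, rightOffset]. */
public class ByteRangePartitionSketch {

    static int getPartition(byte[] key, int leftOffset, int rightOffset, int numReduceTasks) {
        int hash = 1;
        for (int i = leftOffset; i <= rightOffset && i < key.length; i++) {
            hash = 31 * hash + key[i];   // same rolling-hash style as KeyFieldBasedPartitioner below
        }
        return (hash & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        byte[] key = "2019-03-17-payload".getBytes();
        // Partition on the date prefix only (bytes 0..9), ignoring the rest of the key.
        System.out.println(getPartition(key, 0, 9, 4));
    }
}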

4. KeyFieldBasedPartitioner<K2, V2> is another hash-based partitioner. Unlike BinaryPartitioner, it can hash over several key-field ranges; when no ranges are specified it degenerates into HashPartitioner. Its source is shown below, followed by a short configuration sketch:

package org.apache.hadoop.mapred.lib;

import java.io.UnsupportedEncodingException;
import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;
import org.apache.hadoop.mapred.lib.KeyFieldHelper.KeyDescription;

 /**   
  *  Defines a way to partition keys based on certain key fields (also see
  *  {@link KeyFieldBasedComparator}.
  *  The key specification supported is of the form -k pos1[,pos2], where,
  *  pos is of the form f[.c][opts], where f is the number
  *  of the key field to use, and c is the number of the first character from
  *  the beginning of the field. Fields and character posns are numbered 
  *  starting with 1; a character position of zero in pos2 indicates the
  *  field's last character. If '.c' is omitted from pos1, it defaults to 1
  *  (the beginning of the field); if omitted from pos2, it defaults to 0 
  *  (the end of the field).
  * 
  */
public class KeyFieldBasedPartitioner<K2, V2> implements Partitioner<K2, V2> {
  private static final Log LOG = LogFactory.getLog(KeyFieldBasedPartitioner.class.getName());
  private int numOfPartitionFields;
  
  private KeyFieldHelper keyFieldHelper = new KeyFieldHelper();

  public void configure(JobConf job) {
    String keyFieldSeparator = job.get("map.output.key.field.separator", "\t");
    keyFieldHelper.setKeyFieldSeparator(keyFieldSeparator);
    if (job.get("num.key.fields.for.partition") != null) {
      LOG.warn("Using deprecated num.key.fields.for.partition. " +
              "Use mapred.text.key.partitioner.options instead");
      this.numOfPartitionFields = job.getInt("num.key.fields.for.partition",0);
      keyFieldHelper.setKeyFieldSpec(1,numOfPartitionFields);
    } else {
      String option = job.getKeyFieldPartitionerOption();
      keyFieldHelper.parseOption(option);
    }
  }

  public int getPartition(K2 key, V2 value,
      int numReduceTasks) {
    byte[] keyBytes;

    List <KeyDescription> allKeySpecs = keyFieldHelper.keySpecs();
    if (allKeySpecs.size() == 0) {
      return getPartition(key.toString().hashCode(), numReduceTasks);
    }

    try {
      keyBytes = key.toString().getBytes("UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new RuntimeException("The current system does not " +
          "support UTF-8 encoding!", e);
    }
    // return 0 if the key is empty
    if (keyBytes.length == 0) {
      return 0;
    }
    
    int []lengthIndicesFirst = keyFieldHelper.getWordLengths(keyBytes, 0, 
        keyBytes.length);
    int currentHash = 0;
    for (KeyDescription keySpec : allKeySpecs) {
      int startChar = keyFieldHelper.getStartOffset(keyBytes, 0, keyBytes.length, 
          lengthIndicesFirst, keySpec);
       // no key found! continue
      if (startChar < 0) {
        continue;
      }
      int endChar = keyFieldHelper.getEndOffset(keyBytes, 0, keyBytes.length, 
          lengthIndicesFirst, keySpec);
      currentHash = hashCode(keyBytes, startChar, endChar, 
          currentHash);
    }
    return getPartition(currentHash, numReduceTasks);
  }
  
  protected int hashCode(byte[] b, int start, int end, int currentHash) {
    for (int i = start; i <= end; i++) {
      currentHash = 31*currentHash + b[i];
    }
    return currentHash;
  }

  protected int getPartition(int hash, int numReduceTasks) {
    return (hash & Integer.MAX_VALUE) % numReduceTasks;
  }
}
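
As a usage illustration (the key layout and option value here are assumptions made up for this example), a job that partitions dot-separated keys on their second field could be configured roughly as follows:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner;

/** Hypothetical configuration sketch: partition keys such as "2019.03.17.12345" on their second field. */
public class KeyFieldPartitionerConfigExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    conf.setPartitionerClass(KeyFieldBasedPartitioner.class);
    // Separator between key fields (the same property read in configure() above).
    conf.set("map.output.key.field.separator", ".");
    // Hash only the second field; the option uses the -k pos1[,pos2] form described in the javadoc above.
    conf.set("mapred.text.key.partitioner.options", "-k2,2");
  }
}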

5. TotalOrderPartitioner makes it possible to produce totally ordered output. Unlike the three partitioners above, it is not hash-based. It is described in detail below.

The TotalOrderPartitioner Class

Each reducer's output is sorted by default, but when the reducers' inputs are not range-partitioned there is no ordering between the outputs of different reducers. To make the overall output totally sorted you use TotalOrderPartitioner.

To use TotalOrderPartitioner you must supply it with a partition file. The keys in this file are the split points; there must be exactly one fewer of them than there are reducers, and they must be stored in ascending order. Why such a file is needed, and what exactly goes into it, is discussed further below.

TotalOrderPartitioner provides two lookup strategies, depending on the key type:

1) For keys that are not BinaryComparable, TotalOrderPartitioner uses binary search to find the index of the range the key falls into.

For example, with 5 reducers the partition file supplies 4 split points, say [2, 4, 6, 8]. For the record <4, "good">, binary search finds the key at index 1, and index + 1 = 2, so the record is sent to partition 2. For the record <4.5, "good">, binary search returns -3; adding 1 and then negating also gives 2, which is the partition this record goes to.

For numeric keys, binary search costs O(log R), where R is the number of reducers, so the lookup is fast, as the sketch below shows.
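
The following standalone sketch simply reproduces this arithmetic with the split points above (it is not the Hadoop class itself):

import java.util.Arrays;

public class BinarySearchPartitionDemo {

    /** Partition index = number of split points less than or equal to the key. */
    static int findPartition(double[] splitPoints, double key) {
        int pos = Arrays.binarySearch(splitPoints, key) + 1;
        // A negative value means the key was not found; negate the shifted insertion point.
        return (pos < 0) ? -pos : pos;
    }

    public static void main(String[] args) {
        double[] splitPoints = {2, 4, 6, 8};                  // 4 split points => 5 reducers
        System.out.println(findPartition(splitPoints, 4));    // found at index 1 -> partition 2
        System.out.println(findPartition(splitPoints, 4.5));  // binarySearch returns -3 -> partition 2
        System.out.println(findPartition(splitPoints, 1));    // smaller than every split point -> partition 0
    }
}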

2) For BinaryComparable keys (think of them simply as strings), the keys can be ordered lexicographically.

So here, too, a set of split points can be supplied to send different string keys to different reducers; the handling is very similar to the numeric case.

For example, with 5 reducers and the 4 split points ["abc", "bce", "eaa", "fhc"] in the partition file, the string "ab" is assigned to the first reducer because it is smaller than the first split point "abc".

Unlike numeric keys, however, strings cannot be located with numeric comparisons alone. MapReduce therefore looks up string keys with a trie (see the separate article on trie trees). The lookup takes O(m) time, where m is the depth of the trie, at a space cost of roughly O(255^(m-1)); it is a classic space-for-time trade-off.

Building the trie

Suppose the maximum depth of the trie is 3 and the split points are [aaad, aaaf, aaaeh, abbx].


The trie in MapReduce is made up of two kinds of nodes:

1) Inner nodes (InnerTrieNode)

An inner trie node holds a child slot for each of the 255 possible byte values, so it is quite wide; the example in the figure above shows only the 26 English letters.

2) Leaf nodes: UnsplitTrieNode, SinglySplitTrieNode, and LeafTrieNode

An UnsplitTrieNode is a leaf that contains no split point.

A SinglySplitTrieNode is a leaf that contains exactly one split point.

A LeafTrieNode is a leaf that contains several split points. (This case is rare: it only occurs when the maximum depth of the trie is reached, and it is seldom seen in practice.)

Searching the trie

Continuing with the example above:

1) For a record such as <aad, 10>, the search reaches the LeafTrieNode in the figure; inside that leaf a binary search finds the index of aad in the split-point array, or, if it is not present, the index of the closest split point.

2) If the search reaches a SinglySplitTrieNode and the key is equal to or smaller than that node's split point, the node's index is returned; if the key is larger, index + 1 is returned.

3) If the search reaches an UnsplitTrieNode, the index determined by the split points that precede it is returned; for example, <zaa, 20> lies beyond the last split point abbx, so it is sent to the last partition. A small runnable sketch of the whole lookup follows.
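
To make the lookup concrete, here is a minimal, self-contained sketch of the idea (class and method names are invented for this example; the real TotalOrderPartitioner precomputes the prefix ranges in trie nodes rather than narrowing them on every call):

import java.util.Arrays;

public class TrieLookupSketch {

    private final String[] splits;   // sorted split points (numReduceTasks - 1 of them)
    private final int maxDepth;

    TrieLookupSketch(String[] sortedSplits, int maxDepth) {
        this.splits = sortedSplits;
        this.maxDepth = maxDepth;
    }

    /** Partition = number of split points <= key; the prefix walk only narrows the search range. */
    int getPartition(String key) {
        int lo = 0, hi = splits.length;
        for (int depth = 0; depth < maxDepth && depth < key.length() && lo < hi; depth++) {
            char c = key.charAt(depth);
            // Split points shorter than depth+1 characters, or with a smaller character here,
            // are definitely smaller than the key and stay below the narrowed range.
            while (lo < hi && (splits[lo].length() <= depth || splits[lo].charAt(depth) < c)) {
                lo++;
            }
            // Keep only the split points that share the key's first depth+1 characters.
            int end = lo;
            while (end < hi && splits[end].length() > depth && splits[end].charAt(depth) == c) {
                end++;
            }
            hi = end;
        }
        // Same "+1, negate if negative" rule as the numeric binary search above.
        int pos = Arrays.binarySearch(splits, lo, hi, key) + 1;
        return (pos < 0) ? -pos : pos;
    }

    public static void main(String[] args) {
        // Split points from the example, stored in ascending order, with a maximum depth of 3.
        TrieLookupSketch trie = new TrieLookupSketch(
                new String[] {"aaad", "aaaeh", "aaaf", "abbx"}, 3);
        System.out.println(trie.getPartition("aaae"));  // 1: falls between "aaad" and "aaaeh"
        System.out.println(trie.getPartition("zaa"));   // 4: larger than every split point
    }
}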

An Open Question About TotalOrderPartitioner

As noted earlier, a partitioner has two requirements: speed and balanced load. The trie takes care of lookup speed, but how do we obtain a partition file whose split points actually balance the load?

InputSampler

InputSampler samples the data under the job's input directories. It provides three sampling strategies.


Sampler class structure

Comparison of the samplers:

Class                  Sampling method                          Constructor parameters                               Efficiency   Notes
SplitSampler<K,V>      first n records of each sampled split    number of samples, max splits to sample              highest
RandomSampler<K,V>     random sample over all the data          frequency, number of samples, max splits to sample   lowest
IntervalSampler<K,V>   records taken at fixed intervals         frequency, max splits to sample                                   well suited to sorted data

The key method is writePartitionFile. It takes the sample returned by the sampler, sorts it with the job's output key comparator, picks (number of reducers − 1) keys at evenly spaced ranks, and writes them to the partition file. Because the split points are derived from a sample of the data, each resulting range holds roughly the same number of key/value pairs, which is what balances the load.
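
A simplified sketch of that selection step (plain strings and in-memory arrays only; the real method uses the job's output key comparator, skips duplicate keys, and writes a SequenceFile, as the source at the end of this article shows):

import java.util.Arrays;

public class PartitionFileSketch {

    /** Pick numPartitions - 1 boundary keys at evenly spaced ranks of the sorted sample. */
    static String[] pickSplitPoints(String[] samples, int numPartitions) {
        Arrays.sort(samples);  // the real code sorts with the job's output key comparator
        float stepSize = samples.length / (float) numPartitions;
        String[] splitPoints = new String[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            splitPoints[i - 1] = samples[Math.round(stepSize * i)];
        }
        return splitPoints;
    }

    public static void main(String[] args) {
        String[] samples = {"b", "abc", "abd", "bcd", "abcd", "efg", "hii", "afd", "rrr", "mnk"};
        // Prints three boundary keys that cut the sorted sample into four roughly equal ranges.
        System.out.println(Arrays.toString(pickSplitPoints(samples, 4)));
    }
}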

The source of SplitSampler:

/**
 * Samples the first n records from s splits.
 * Inexpensive way to sample random data.
 */
public static class SplitSampler<K,V> implements Sampler<K,V> {
  private final int numSamples;
  private final int maxSplitsSampled;
  /**
   * Create a SplitSampler sampling <em>all</em> splits.
   * Takes the first numSamples / numSplits records from each split.
   * @param numSamples Total number of samples to obtain from all selected
   *                   splits.
   */
  public SplitSampler(int numSamples) {
    this(numSamples, Integer.MAX_VALUE);
  }
  /**
   * Create a new SplitSampler.
   * @param numSamples Total number of samples to obtain from all selected
   *                   splits.
   * @param maxSplitsSampled The maximum number of splits to examine.
   */
  public SplitSampler(int numSamples, int maxSplitsSampled) {
    this.numSamples = numSamples;
    this.maxSplitsSampled = maxSplitsSampled;
  }
  /**
   * From each split sampled, take the first numSamples / numSplits records.
   */
  @SuppressWarnings("unchecked") // ArrayList::toArray doesn't preserve type
  public K[] getSample(InputFormat<K,V> inf, JobConf job) throws IOException {
    InputSplit[] splits = inf.getSplits(job, job.getNumMapTasks());
    ArrayList<K> samples = new ArrayList<K>(numSamples);
    int splitsToSample = Math.min(maxSplitsSampled, splits.length);
    int splitStep = splits.length / splitsToSample;
    int samplesPerSplit = numSamples / splitsToSample;
    long records = 0;
    for (int i = 0; i < splitsToSample; ++i) {
      RecordReader<K,V> reader = inf.getRecordReader(splits[i * splitStep],
          job, Reporter.NULL);
      K key = reader.createKey();
      V value = reader.createValue();
      while (reader.next(key, value)) {
        samples.add(key);
        key = reader.createKey();
        ++records;
        if ((i+1) * samplesPerSplit <= records) {
          break;
        }
      }
      reader.close();
    }
    return (K[])samples.toArray();
  }
}

The source of RandomSampler:

/**
 * Sample from random points in the input.
 * General-purpose sampler. Takes numSamples / maxSplitsSampled inputs from
 * each split.
 */
public static class RandomSampler<K,V> implements Sampler<K,V> {
  private double freq;
  private final int numSamples;
  private final int maxSplitsSampled;

  /**
   * Create a new RandomSampler sampling <em>all</em> splits.
   * This will read every split at the client, which is very expensive.
   * @param freq Probability with which a key will be chosen.
   * @param numSamples Total number of samples to obtain from all selected
   *                   splits.
   */
  public RandomSampler(double freq, int numSamples) {
    this(freq, numSamples, Integer.MAX_VALUE);
  }

  /**
   * Create a new RandomSampler.
   * @param freq Probability with which a key will be chosen.
   * @param numSamples Total number of samples to obtain from all selected
   *                   splits.
   * @param maxSplitsSampled The maximum number of splits to examine.
   */
  public RandomSampler(double freq, int numSamples, int maxSplitsSampled) {
    this.freq = freq;
    this.numSamples = numSamples;
    this.maxSplitsSampled = maxSplitsSampled;
  }

  /**
   * Randomize the split order, then take the specified number of keys from
   * each split sampled, where each key is selected with the specified
   * probability and possibly replaced by a subsequently selected key when
   * the quota of keys from that split is satisfied.
   */
  @SuppressWarnings("unchecked") // ArrayList::toArray doesn't preserve type
  public K[] getSample(InputFormat<K,V> inf, JobConf job) throws IOException {
    InputSplit[] splits = inf.getSplits(job, job.getNumMapTasks());
    ArrayList<K> samples = new ArrayList<K>(numSamples);
    int splitsToSample = Math.min(maxSplitsSampled, splits.length);

    Random r = new Random();
    long seed = r.nextLong();
    r.setSeed(seed);
    LOG.debug("seed: " + seed);
    // shuffle splits
    for (int i = 0; i < splits.length; ++i) {
      InputSplit tmp = splits[i];
      int j = r.nextInt(splits.length);
      splits[i] = splits[j];
      splits[j] = tmp;
    }
    // our target rate is in terms of the maximum number of sample splits,
    // but we accept the possibility of sampling additional splits to hit
    // the target sample keyset
    for (int i = 0; i < splitsToSample ||
                   (i < splits.length && samples.size() < numSamples); ++i) {
      RecordReader<K,V> reader = inf.getRecordReader(splits[i], job,
          Reporter.NULL);
      K key = reader.createKey();
      V value = reader.createValue();
      while (reader.next(key, value)) {
        if (r.nextDouble() <= freq) {
          if (samples.size() < numSamples) {
            samples.add(key);
          } else {
            // When exceeding the maximum number of samples, replace a
            // random element with this one, then adjust the frequency
            // to reflect the possibility of existing elements being
            // pushed out
            int ind = r.nextInt(numSamples);
            if (ind != numSamples) {
              samples.set(ind, key);
            }
            freq *= (numSamples - 1) / (double) numSamples;
          }
          key = reader.createKey();
        }
      }
      reader.close();
    }
    return (K[])samples.toArray();
  }
}

The source of IntervalSampler:

/**
 * Sample from s splits at regular intervals.
 * Useful for sorted data.
 */
public static class IntervalSampler<K,V> implements Sampler<K,V> {
  private final double freq;
  private final int maxSplitsSampled;

  /**
   * Create a new IntervalSampler sampling <em>all</em> splits.
   * @param freq The frequency with which records will be emitted.
   */
  public IntervalSampler(double freq) {
    this(freq, Integer.MAX_VALUE);
  }

  /**
   * Create a new IntervalSampler.
   * @param freq The frequency with which records will be emitted.
   * @param maxSplitsSampled The maximum number of splits to examine.
   * @see #getSample
   */
  public IntervalSampler(double freq, int maxSplitsSampled) {
    this.freq = freq;
    this.maxSplitsSampled = maxSplitsSampled;
  }

  /**
   * For each split sampled, emit when the ratio of the number of records
   * retained to the total record count is less than the specified
   * frequency.
   */
  @SuppressWarnings("unchecked") // ArrayList::toArray doesn't preserve type
  public K[] getSample(InputFormat<K,V> inf, JobConf job) throws IOException {
    InputSplit[] splits = inf.getSplits(job, job.getNumMapTasks());
    ArrayList<K> samples = new ArrayList<K>();
    int splitsToSample = Math.min(maxSplitsSampled, splits.length);
    int splitStep = splits.length / splitsToSample;
    long records = 0;
    long kept = 0;
    for (int i = 0; i < splitsToSample; ++i) {
      RecordReader<K,V> reader = inf.getRecordReader(splits[i * splitStep],
          job, Reporter.NULL);
      K key = reader.createKey();
      V value = reader.createValue();
      while (reader.next(key, value)) {
        ++records;
        if ((double) kept / records < freq) {
          ++kept;
          samples.add(key);
          key = reader.createKey();
        }
      }
      reader.close();
    }
    return (K[])samples.toArray();
  }
}

The complete source of InputSampler:

package org.apache.hadoop.mapred.lib;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Random;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * Utility for collecting samples and writing a partition file for
 * {@link org.apache.hadoop.mapred.lib.TotalOrderPartitioner}.
 */
public class InputSampler<K,V> implements Tool {

  private static final Log LOG = LogFactory.getLog(InputSampler.class);

  static int printUsage() {
    System.out.println("sampler -r <reduces>\n" +
                       "      [-inFormat <input format class>]\n" +
                       "      [-keyClass <map input & output key class>]\n" +
                       "      [-splitRandom <double pcnt> <numSamples> <maxsplits> | " +
                       "// Sample from random splits at random (general)\n" +
                       "       -splitSample <numSamples> <maxsplits> | " +
                       "             // Sample from first records in splits (random data)\n"+
                       "       -splitInterval <double pcnt> <maxsplits>]" +
                       "             // Sample from splits at intervals (sorted data)");
    System.out.println("Default sampler: -splitRandom 0.1 10000 10");
    ToolRunner.printGenericCommandUsage(System.out);
    return -1;
  }

  private JobConf conf;

  public InputSampler(JobConf conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }

  public void setConf(Configuration conf) {
    if (!(conf instanceof JobConf)) {
      this.conf = new JobConf(conf);
    } else {
      this.conf = (JobConf) conf;
    }
  }

  /**
   * Interface to sample using an {@link org.apache.hadoop.mapred.InputFormat}.
   */
  public interface Sampler<K,V> {
    /**
     * For a given job, collect and return a subset of the keys from the
     * input data.
     */
    K[] getSample(InputFormat<K,V> inf, JobConf job) throws IOException;
  }

  /**
   * Samples the first n records from s splits.
   * Inexpensive way to sample random data.
   */
  public static class SplitSampler<K,V> implements Sampler<K,V> {

    private final int numSamples;
    private final int maxSplitsSampled;

    /**
     * Create a SplitSampler sampling <em>all</em> splits.
     * Takes the first numSamples / numSplits records from each split.
     * @param numSamples Total number of samples to obtain from all selected
     *                   splits.
     */
    public SplitSampler(int numSamples) {
      this(numSamples, Integer.MAX_VALUE);
    }

    /**
     * Create a new SplitSampler.
     * @param numSamples Total number of samples to obtain from all selected
     *                   splits.
     * @param maxSplitsSampled The maximum number of splits to examine.
     */
    public SplitSampler(int numSamples, int maxSplitsSampled) {
      this.numSamples = numSamples;
      this.maxSplitsSampled = maxSplitsSampled;
    }

    /**
     * From each split sampled, take the first numSamples / numSplits records.
     */
    @SuppressWarnings("unchecked") // ArrayList::toArray doesn't preserve type
    public K[] getSample(InputFormat<K,V> inf, JobConf job) throws IOException {
      InputSplit[] splits = inf.getSplits(job, job.getNumMapTasks());
      ArrayList<K> samples = new ArrayList<K>(numSamples);
      int splitsToSample = Math.min(maxSplitsSampled, splits.length);
      int splitStep = splits.length / splitsToSample;
      int samplesPerSplit = numSamples / splitsToSample;
      long records = 0;
      for (int i = 0; i < splitsToSample; ++i) {
        RecordReader<K,V> reader = inf.getRecordReader(splits[i * splitStep],
            job, Reporter.NULL);
        K key = reader.createKey();
        V value = reader.createValue();
        while (reader.next(key, value)) {
          samples.add(key);
          key = reader.createKey();
          ++records;
          if ((i+1) * samplesPerSplit <= records) {
            break;
          }
        }
        reader.close();
      }
      return (K[])samples.toArray();
    }
  }

  /**
   * Sample from random points in the input.
   * General-purpose sampler. Takes numSamples / maxSplitsSampled inputs from
   * each split.
   */
  public static class RandomSampler<K,V> implements Sampler<K,V> {
    private double freq;
    private final int numSamples;
    private final int maxSplitsSampled;

    /**
     * Create a new RandomSampler sampling <em>all</em> splits.
     * This will read every split at the client, which is very expensive.
     * @param freq Probability with which a key will be chosen.
     * @param numSamples Total number of samples to obtain from all selected
     *                   splits.
     */
    public RandomSampler(double freq, int numSamples) {
      this(freq, numSamples, Integer.MAX_VALUE);
    }

    /**
     * Create a new RandomSampler.
     * @param freq Probability with which a key will be chosen.
     * @param numSamples Total number of samples to obtain from all selected
     *                   splits.
     * @param maxSplitsSampled The maximum number of splits to examine.
     */
    public RandomSampler(double freq, int numSamples, int maxSplitsSampled) {
      this.freq = freq;
      this.numSamples = numSamples;
      this.maxSplitsSampled = maxSplitsSampled;
    }

    /**
     * Randomize the split order, then take the specified number of keys from
     * each split sampled, where each key is selected with the specified
     * probability and possibly replaced by a subsequently selected key when
     * the quota of keys from that split is satisfied.
     */
    @SuppressWarnings("unchecked") // ArrayList::toArray doesn't preserve type
    public K[] getSample(InputFormat<K,V> inf, JobConf job) throws IOException {
      InputSplit[] splits = inf.getSplits(job, job.getNumMapTasks());
      ArrayList<K> samples = new ArrayList<K>(numSamples);
      int splitsToSample = Math.min(maxSplitsSampled, splits.length);

      Random r = new Random();
      long seed = r.nextLong();
      r.setSeed(seed);
      LOG.debug("seed: " + seed);
      // shuffle splits
      for (int i = 0; i < splits.length; ++i) {
        InputSplit tmp = splits[i];
        int j = r.nextInt(splits.length);
        splits[i] = splits[j];
        splits[j] = tmp;
      }
      // our target rate is in terms of the maximum number of sample splits,
      // but we accept the possibility of sampling additional splits to hit
      // the target sample keyset
      for (int i = 0; i < splitsToSample ||
                     (i < splits.length && samples.size() < numSamples); ++i) {
        RecordReader<K,V> reader = inf.getRecordReader(splits[i], job,
            Reporter.NULL);
        K key = reader.createKey();
        V value = reader.createValue();
        while (reader.next(key, value)) {
          if (r.nextDouble() <= freq) {
            if (samples.size() < numSamples) {
              samples.add(key);
            } else {
              // When exceeding the maximum number of samples, replace a
              // random element with this one, then adjust the frequency
              // to reflect the possibility of existing elements being
              // pushed out
              int ind = r.nextInt(numSamples);
              if (ind != numSamples) {
                samples.set(ind, key);
              }
              freq *= (numSamples - 1) / (double) numSamples;
            }
            key = reader.createKey();
          }
        }
        reader.close();
      }
      return (K[])samples.toArray();
    }
  }

  /**
   * Sample from s splits at regular intervals.
   * Useful for sorted data.
   */
  public static class IntervalSampler<K,V> implements Sampler<K,V> {
    private final double freq;
    private final int maxSplitsSampled;

    /**
     * Create a new IntervalSampler sampling <em>all</em> splits.
     * @param freq The frequency with which records will be emitted.
     */
    public IntervalSampler(double freq) {
      this(freq, Integer.MAX_VALUE);
    }

    /**
     * Create a new IntervalSampler.
     * @param freq The frequency with which records will be emitted.
     * @param maxSplitsSampled The maximum number of splits to examine.
     * @see #getSample
     */
    public IntervalSampler(double freq, int maxSplitsSampled) {
      this.freq = freq;
      this.maxSplitsSampled = maxSplitsSampled;
    }

    /**
     * For each split sampled, emit when the ratio of the number of records
     * retained to the total record count is less than the specified
     * frequency.
     */
    @SuppressWarnings("unchecked") // ArrayList::toArray doesn't preserve type
    public K[] getSample(InputFormat<K,V> inf, JobConf job) throws IOException {
      InputSplit[] splits = inf.getSplits(job, job.getNumMapTasks());
      ArrayList<K> samples = new ArrayList<K>();
      int splitsToSample = Math.min(maxSplitsSampled, splits.length);
      int splitStep = splits.length / splitsToSample;
      long records = 0;
      long kept = 0;
      for (int i = 0; i < splitsToSample; ++i) {
        RecordReader<K,V> reader = inf.getRecordReader(splits[i * splitStep],
            job, Reporter.NULL);
        K key = reader.createKey();
        V value = reader.createValue();
        while (reader.next(key, value)) {
          ++records;
          if ((double) kept / records < freq) {
            ++kept;
            samples.add(key);
            key = reader.createKey();
          }
        }
        reader.close();
      }
      return (K[])samples.toArray();
    }
  }

  /**
   * Write a partition file for the given job, using the Sampler provided.
   * Queries the sampler for a sample keyset, sorts by the output key
   * comparator, selects the keys for each rank, and writes to the destination
   * returned from {@link
     org.apache.hadoop.mapred.lib.TotalOrderPartitioner#getPartitionFile}.
   */
  @SuppressWarnings("unchecked") // getInputFormat, getOutputKeyComparator
  public static <K,V> void writePartitionFile(JobConf job,
      Sampler<K,V> sampler) throws IOException {
    final InputFormat<K,V> inf = (InputFormat<K,V>) job.getInputFormat();
    int numPartitions = job.getNumReduceTasks();
    K[] samples = sampler.getSample(inf, job);
    LOG.info("Using " + samples.length + " samples");
    RawComparator<K> comparator =
      (RawComparator<K>) job.getOutputKeyComparator();
    Arrays.sort(samples, comparator);
    Path dst = new Path(TotalOrderPartitioner.getPartitionFile(job));
    FileSystem fs = dst.getFileSystem(job);
    if (fs.exists(dst)) {
      fs.delete(dst, false);
    }
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, job, dst,
        job.getMapOutputKeyClass(), NullWritable.class);
    NullWritable nullValue = NullWritable.get();
    float stepSize = samples.length / (float) numPartitions;
    int last = -1;
    for(int i = 1; i < numPartitions; ++i) {
      int k = Math.round(stepSize * i);
      while (last >= k && comparator.compare(samples[last], samples[k]) == 0) {
        ++k;
      }
      writer.append(samples[k], nullValue);
      last = k;
    }
    writer.close();
  }

  /**
   * Driver for InputSampler from the command line.
   * Configures a JobConf instance and calls {@link #writePartitionFile}.
   */
  public int run(String[] args) throws Exception {
    JobConf job = (JobConf) getConf();
    ArrayList<String> otherArgs = new ArrayList<String>();
    Sampler<K,V> sampler = null;
    for(int i=0; i < args.length; ++i) {
      try {
        if ("-r".equals(args[i])) {
          job.setNumReduceTasks(Integer.parseInt(args[++i]));
        } else if ("-inFormat".equals(args[i])) {
          job.setInputFormat(
              Class.forName(args[++i]).asSubclass(InputFormat.class));
        } else if ("-keyClass".equals(args[i])) {
          job.setMapOutputKeyClass(
              Class.forName(args[++i]).asSubclass(WritableComparable.class));
        } else if ("-splitSample".equals(args[i])) {
          int numSamples = Integer.parseInt(args[++i]);
          int maxSplits = Integer.parseInt(args[++i]);
          if (0 >= maxSplits) maxSplits = Integer.MAX_VALUE;
          sampler = new SplitSampler<K,V>(numSamples, maxSplits);
        } else if ("-splitRandom".equals(args[i])) {
          double pcnt = Double.parseDouble(args[++i]);
          int numSamples = Integer.parseInt(args[++i]);
          int maxSplits = Integer.parseInt(args[++i]);
          if (0 >= maxSplits) maxSplits = Integer.MAX_VALUE;
          sampler = new RandomSampler<K,V>(pcnt, numSamples, maxSplits);
        } else if ("-splitInterval".equals(args[i])) {
          double pcnt = Double.parseDouble(args[++i]);
          int maxSplits = Integer.parseInt(args[++i]);
          if (0 >= maxSplits) maxSplits = Integer.MAX_VALUE;
          sampler = new IntervalSampler<K,V>(pcnt, maxSplits);
        } else {
          otherArgs.add(args[i]);
        }
      } catch (NumberFormatException except) {
        System.out.println("ERROR: Integer expected instead of " + args[i]);
        return printUsage();
      } catch (ArrayIndexOutOfBoundsException except) {
        System.out.println("ERROR: Required parameter missing from " +
            args[i-1]);
        return printUsage();
      }
    }
    if (job.getNumReduceTasks() <= 1) {
      System.err.println("Sampler requires more than one reducer");
      return printUsage();
    }
    if (otherArgs.size() < 2) {
      System.out.println("ERROR: Wrong number of parameters: ");
      return printUsage();
    }
    if (null == sampler) {
      sampler = new RandomSampler<K,V>(0.1, 10000, 10);
    }

    Path outf = new Path(otherArgs.remove(otherArgs.size() - 1));
    TotalOrderPartitioner.setPartitionFile(job, outf);
    for (String s : otherArgs) {
      FileInputFormat.addInputPath(job, new Path(s));
    }
    InputSampler.<K,V>writePartitionFile(job, sampler);

    return 0;
  }

  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(InputSampler.class);
    InputSampler<?,?> sampler = new InputSampler(job);
    int res = ToolRunner.run(sampler, args);
    System.exit(res);
  }
}

A TotalOrderPartitioner Example

// Example adapted from "Hadoop: The Definitive Guide"; JobBuilder is a helper class
// from that book's example code, not part of Hadoop itself.
import java.net.URI;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.InputSampler;
import org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SortByTemperatureUsingTotalOrderPartitioner extends Configured
    implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    JobConf conf = JobBuilder.parseInputAndOutput(this, getConf(), args);
    if (conf == null) {
      return -1;
    }
    conf.setInputFormat(SequenceFileInputFormat.class);
    conf.setOutputKeyClass(IntWritable.class);
    conf.setOutputFormat(SequenceFileOutputFormat.class);
    SequenceFileOutputFormat.setCompressOutput(conf, true);
    SequenceFileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
    SequenceFileOutputFormat.setOutputCompressionType(conf, CompressionType.BLOCK);
    conf.setPartitionerClass(TotalOrderPartitioner.class);

    // Sample the input to build the partition file.
    InputSampler.Sampler<IntWritable, Text> sampler =
        new InputSampler.RandomSampler<IntWritable, Text>(0.1, 10000, 10);
    Path input = FileInputFormat.getInputPaths(conf)[0];
    input = input.makeQualified(input.getFileSystem(conf));
    Path partitionFile = new Path(input, "_partitions");
    TotalOrderPartitioner.setPartitionFile(conf, partitionFile);
    InputSampler.writePartitionFile(conf, sampler);

    // Ship the partition file to each task via the DistributedCache.
    URI partitionUri = new URI(partitionFile.toString() + "#_partitions");
    DistributedCache.addCacheFile(partitionUri, conf);
    DistributedCache.createSymlink(conf);

    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(
        new SortByTemperatureUsingTotalOrderPartitioner(), args);
    System.exit(exitCode);
  }
}

References

1.《Hadoop技術內幕 深入理解MapReduce架構設計與實現原理》

2. http://www.cnblogs.com/xwdreamer/archive/2011/10/27/2296943.html

3. http://blog.oddfoo.net/2011/04/17/mapreduce-partition%E5%88%86%E6%9E%90-2/
