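Data Partitioning Explained

Five common ways to partition data:

1. Random partitioning

Advantage: data is spread roughly evenly across the partitions.
Disadvantage: records that share a characteristic (for example, the same key) are not guaranteed to land in the same partition.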

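2. Hash partitioning

Advantage: records with the same key are guaranteed to be assigned to the same partition.
Disadvantage: it can cause data skew when some keys are much more frequent than others.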
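
To make the skew concrete, here is a minimal sketch that counts how many records each hash partition receives. The class name HashSkewDemo and the sample keys are made up for illustration; the hash rule is the same one used in the test code of this article.

```java
import java.util.Arrays;
import java.util.List;

/** Counts how many records each hash partition receives. */
class HashSkewDemo {

  /** Maps a key to a partition; masking the sign bit keeps the index non-negative. */
  static int hashPartition(String key, int numPartitions) {
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }

  /** Returns the number of records assigned to each partition. */
  static int[] partitionSizes(List<String> keys, int numPartitions) {
    int[] sizes = new int[numPartitions];
    for (String key : keys) {
      sizes[hashPartition(key, numPartitions)]++;
    }
    return sizes;
  }

  public static void main(String[] args) {
    // Hypothetical skewed input: the key "a" is far more frequent than the others.
    List<String> keys = Arrays.asList("a", "a", "a", "a", "a", "a", "b", "c", "d", "e");
    // Prints [1, 1, 6, 1, 1]: every "a" lands in partition 2.
    System.out.println(Arrays.toString(partitionSizes(keys, 5)));
  }
}
```

All records with the hot key end up in one partition, which is exactly the guarantee hash partitioning gives and also the reason it skews.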

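3. Range partitioning

Advantage: faster queries, range queries in particular, because adjacent data sits in the same partition.
Disadvantage: some partitions can grow much larger than the others and have to be split so that the data volume stays even across all partitions. If the data inside each partition is not sorted, splitting becomes very difficult.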
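
The following is a rough sketch of why sorted partitions are easy to split: with sorted data, the split point is just an index near the middle, nudged to the nearest key boundary so that equal keys stay together. The class RangeSplitDemo and its methods are hypothetical names for illustration, not part of any framework.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Splits an oversized, already sorted range partition into two. */
class RangeSplitDemo {

  /**
   * Finds a cut index close to the middle that does not separate equal keys.
   * Returns -1 when the partition holds fewer than two distinct keys.
   */
  static int findCut(List<String> sorted) {
    int n = sorted.size();
    int mid = n / 2;
    // Look for a key boundary at or to the right of the middle.
    for (int cut = mid; cut < n; cut++) {
      if (cut > 0 && !sorted.get(cut).equals(sorted.get(cut - 1))) {
        return cut;
      }
    }
    // Otherwise look to the left of the middle.
    for (int cut = mid - 1; cut > 0; cut--) {
      if (!sorted.get(cut).equals(sorted.get(cut - 1))) {
        return cut;
      }
    }
    return -1;
  }

  /** Returns two partitions, or the original one if it cannot be split. */
  static List<List<String>> split(List<String> sorted) {
    List<List<String>> result = new ArrayList<>();
    int cut = findCut(sorted);
    if (cut < 0) {
      result.add(new ArrayList<>(sorted));
      return result;
    }
    result.add(new ArrayList<>(sorted.subList(0, cut)));
    result.add(new ArrayList<>(sorted.subList(cut, sorted.size())));
    return result;
  }

  public static void main(String[] args) {
    // Hypothetical oversized partition dominated by the key "a".
    List<String> partition = Arrays.asList("a", "a", "a", "a", "b", "c");
    // Prints [[a, a, a, a], [b, c]]
    System.out.println(split(partition));
  }
}
```

Note that a partition consisting of a single hot key cannot be split this way at all, so splitting alone does not remove skew caused by one very frequent key.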

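4. Round-robin partitioning

A kind of load-balancing algorithm.
Advantage: guarantees that data skew cannot occur, since the record counts of any two partitions differ by at most one.
Disadvantage: it cannot distribute storage or compute load according to each node's storage or compute capacity.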
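
One possible way around this limitation, not covered further in this article, is a weighted variant of round-robin. The sketch below is only an illustration under that assumption: the class WeightedRoundRobin and the weights are hypothetical, and a weight stands for the relative capacity of a partition.

```java
/** Round-robin that visits each partition in proportion to its weight. */
class WeightedRoundRobin {

  // Partition i appears weights[i] times in one full cycle of the schedule.
  private final int[] schedule;
  private int next = 0;

  WeightedRoundRobin(int[] weights) {
    int total = 0;
    for (int w : weights) {
      if (w < 0) {
        throw new IllegalArgumentException("weights must be non-negative");
      }
      total += w;
    }
    if (total == 0) {
      throw new IllegalArgumentException("at least one weight must be positive");
    }
    schedule = new int[total];
    int pos = 0;
    for (int p = 0; p < weights.length; p++) {
      for (int k = 0; k < weights[p]; k++) {
        schedule[pos++] = p;
      }
    }
  }

  /** Returns the partition for the next record and advances the cycle. */
  int nextPartition() {
    int partition = schedule[next];
    next = (next + 1) % schedule.length;
    return partition;
  }

  public static void main(String[] args) {
    // Hypothetical capacities: partition 0 can take three times the load of the others.
    WeightedRoundRobin rr = new WeightedRoundRobin(new int[] {3, 1, 1});
    int[] counts = new int[3];
    for (int i = 0; i < 10; i++) {
      counts[rr.nextPartition()]++;
    }
    // Prints 6 2 2
    System.out.println(counts[0] + " " + counts[1] + " " + counts[2]);
  }
}
```

With all weights equal to 1, this reduces to the plain round-robin described above.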

5. Custom partitioning

For reference, these are Flink's partitioning methods:
public static enum PartitionMethod {
   REBALANCE,       // round-robin partitioning
   HASH,            // hash partitioning
   RANGE,           // range partitioning
   CUSTOM;          // custom partitioning
}

Here is the definition of MapReduce's Partitioner, the abstract class you extend to implement custom partitioning:

/** 
 * Partitions the key space.
 * 
 * <p><code>Partitioner</code> controls the partitioning of the keys of the 
 * intermediate map-outputs. The key (or a subset of the key) is used to derive
 * the partition, typically by a hash function. The total number of partitions
 * is the same as the number of reduce tasks for the job. Hence this controls
 * which of the <code>m</code> reduce tasks the intermediate key (and hence the 
 * record) is sent for reduction.</p>
 * 
 * Note: If you require your Partitioner class to obtain the Job's configuration
 * object, implement the {@link Configurable} interface.
 * 
 * @see Reducer
 */
@InterfaceAudience.Public
@InterfaceStability.Stable
public abstract class Partitioner<KEY, VALUE> {
  
  /** 
   * Get the partition number for a given key (hence record) given the total 
   * number of partitions i.e. number of reduce-tasks for the job.
   *   
   * <p>Typically a hash function on a all or a subset of the key.</p>
   *
   * @param key the key to be partioned.
   * @param value the entry value.
   * @param numPartitions the total number of partitions.
   * @return the partition number for the <code>key</code>.
   */
  public abstract int getPartition(KEY key, VALUE value, int numPartitions);
  
}

Here is the definition of Flink's Partitioner interface for custom partitioning:

/**
 * Function to implement a custom partition assignment for keys.
 *
 * @param <K> The type of the key to be partitioned.
 */
@Public
@FunctionalInterface
public interface Partitioner<K> extends java.io.Serializable, Function {

	/**
	 * Computes the partition for the given key.
	 *
	 * @param key The key.
	 * @param numPartitions The number of partitions to partition into.
	 * @return The partition index.
	 */
	int partition(K key, int numPartitions);
}

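The two have one thing in common:

You hand an element to the partitioner, and the logic of a single method on the partitioner decides which partition the element is sent to.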
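
That shared shape can be sketched without either framework. In the example below, KeyPartitioner is a local stand-in that only mirrors the signature of Flink's interface (Hadoop's getPartition additionally receives the value); it is not the real API, and the by-length rule is a made-up example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PartitionerShapeDemo {

  /** Local stand-in with the same shape as Flink's Partitioner; not the real interface. */
  @FunctionalInterface
  interface KeyPartitioner<K> {
    int partition(K key, int numPartitions);
  }

  /** Routes each element to the bucket chosen by the partitioner. */
  static <K> List<List<K>> route(List<K> items, KeyPartitioner<K> partitioner, int numPartitions) {
    List<List<K>> buckets = new ArrayList<>();
    for (int i = 0; i < numPartitions; i++) {
      buckets.add(new ArrayList<>());
    }
    for (K item : items) {
      buckets.get(partitioner.partition(item, numPartitions)).add(item);
    }
    return buckets;
  }

  public static void main(String[] args) {
    // Hypothetical custom rule: partition words by their length.
    KeyPartitioner<String> byLength = (key, n) -> key.length() % n;
    // Prints [[ccc], [a, e], [bb, dd]]
    System.out.println(route(Arrays.asList("a", "bb", "ccc", "dd", "e"), byLength, 3));
  }
}
```

Because the partitioner is a single method, any rule that maps a key and a partition count to an index in the range 0 to numPartitions - 1 can be plugged in.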

6. Test code

package com.aura.funny.partition;

import java.util.*;

/**
 * Author: 马中华   https://blog.csdn.net/zhongqi2513
 * Date: 2019/6/27 14:02
 * Description:
 *      Test code for data partitioning
 */
public class PartitionTest02 {

  public static void main(String[] args) {

    /**
     * The data set to be partitioned
     */
    List<String> data = Arrays.asList(
            "a", "b", "c", "d", "e", "f", "g",
            "h", "i", "j", "k", "l", "m", "n",
            "o", "p", "q", "r", "s", "t",
            "u", "v", "w", "x", "y", "z",
            "a", "a", "a", "a",
            "a", "a", "a", "a",
            "a", "a", "a", "a",
            "b", "b", "b", "b");
    /**
     * Number of partitions
     */
    int partitionNumber = 5;





    /**
     * Strategy 1: hash partitioning
     */
    System.out.println("\n---------Strategy 1: hash partitioning------------");
    List<List<String>> partitionList1 = partitionData(data, new Partitioner() {
      @Override
      public int getPartition(String item, int numPartitions) {
        return (item.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
    }, partitionNumber);
    printPartitionedData(partitionList1);





    /**
     * Strategy 2: random partitioning
     */
    System.out.println("\n---------Strategy 2: random partitioning------------");
    List<List<String>> partitionList2 = partitionData(data, new Partitioner() {
      Random random = new Random();
      @Override
      public int getPartition(String item, int numPartitions) {
        return random.nextInt(numPartitions);
      }
    }, partitionNumber);
    printPartitionedData(partitionList2, false);





    /**
     * Strategy 3: round-robin partitioning
     */
    System.out.println("\n---------Strategy 3: round-robin partitioning------------");
    List<List<String>> partitionList3 = partitionData(data, new Partitioner() {
      int counter = 0;
      @Override
      public int getPartition(String item, int numPartitions) {
        int partitionIndex = counter;
        counter++;
        if (counter == numPartitions) {
          counter = 0;
        }
        return partitionIndex;
      }
    }, partitionNumber);
    printPartitionedData(partitionList3, false);




    /**
     * Strategy 4: range partitioning
     */
    System.out.println("\n---------Strategy 4: range partitioning------------");
    List<List<String>> partitionList4 = partitionData(data, new Partitioner() {
      // Sorted distinct keys, computed once instead of once per element
      final List<String> distinctItemList = new ArrayList<String>(new TreeSet<String>(data));

      @Override
      public int getPartition(String item, int numPartitions) {

        // Number of distinct keys covered by each range
        int step = distinctItemList.size() / numPartitions + 1;

        // The position of the key in sorted order determines its range
        int index = distinctItemList.indexOf(item);
        return index / step;
      }
    }, partitionNumber);
    printPartitionedData(partitionList4);






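    /**
     * Strategy 5: custom partitioning
     */
    System.out.println("\n---------Strategy 5: custom partitioning------------");
    List<List<String>> partitionList5 = partitionData(data, new Partitioner() {
      @Override
      public int getPartition(String item, int numPartitions) {

        /**
         * Put your custom partitioning logic here to decide which partition
         * the element item is placed in. This placeholder sends everything to partition 0.
         */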

        return 0;
      }
    }, partitionNumber);
    printPartitionedData(partitionList5, false);

  }

  /**
   * Partitions the data with the given partitioner
   */
  public static List<List<String>> partitionData(List<String> data, Partitioner partitioner, int numPartitions){
    List<List<String>> partitionList = initPartitionContext(numPartitions);
    for (String item : data) {
      // Ask the partitioner which partition this element belongs to
      int partitionNum = partitioner.getPartition(item, numPartitions);
      partitionList.get(partitionNum).add(item);
    }
    return partitionList;
  }

  /**
   * Initializes the containers that hold each partition's data
   */
  public static List<List<String>> initPartitionContext(int numPartitions){
    List<List<String>> partitionList = new ArrayList<List<String>>();
    // Create one list per partition up front
    for (int i = 0; i < numPartitions; i++) {
      partitionList.add(new ArrayList<String>());
    }
    return partitionList;
  }

  /**
   * Prints the partitioned data set, sorting each partition first
   */
  public static void printPartitionedData(List<List<String>> partitionResult){
    printPartitionedData(partitionResult, true);
  }

  /**
   * Prints the partitioned data set, optionally sorting each partition first
   */
  public static void printPartitionedData(List<List<String>> partitionResult, boolean sort){

    if(sort){
      // Sort each partition so the output is easier to read
      for (List<String> partition : partitionResult) {
        // The null check must come first, otherwise it can never take effect
        if(partition != null && !partition.isEmpty()){
          Collections.sort(partition);
        }
      }
    }

    // Print the contents of each partition
    for (List<String> partition : partitionResult) {
      if(partition != null && !partition.isEmpty()){
        System.out.println(String.join(",", partition));
      }else{
        System.out.println("This partition is empty");
      }
    }
  }

}

/**
 * An interface that defines the partitioning logic
 */
interface Partitioner{
  int getPartition(String item, int numPartitions);
}

You can copy this code and run it as is to see the results.

7. Results

Here is the output from one run of the code (the random-partitioning output differs from run to run):

---------Strategy 1: hash partitioning------------
d,i,n,s,x
e,j,o,t,y
a,a,a,a,a,a,a,a,a,a,a,a,a,f,k,p,u,z
b,b,b,b,b,g,l,q,v
c,h,m,r,w

---------Strategy 2: random partitioning------------
b,c,f,h,j,o,u,y,a,a,a,b
d,v,a,a
a,e,l,n,r,t,w,x,z,a,a,a,b,b
k,p,q,s,a,b
g,i,m,a,a,a

---------Strategy 3: round-robin partitioning------------
a,f,k,p,u,z,a,a,b
b,g,l,q,v,a,a,a,b
c,h,m,r,w,a,a,a
d,i,n,s,x,a,a,b
e,j,o,t,y,a,a,b

---------Strategy 4: range partitioning------------
a,a,a,a,a,a,a,a,a,a,a,a,a,b,b,b,b,b,c,d,e,f
g,h,i,j,k,l
m,n,o,p,q,r
s,t,u,v,w,x
y,z

---------Strategy 5: custom partitioning------------
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,a,a,a,a,a,a,a,a,a,a,a,a,b,b,b,b
This partition is empty
This partition is empty
This partition is empty
This partition is empty
