Partitioning Phone Numbers with a Custom Spark Partitioner

Requirement: write a custom partitioner that partitions phone numbers by their first three digits.

How to partition

We know that the reduceByKey operator partitions its output by hashing keys under the hood. The relevant source is:
  /**
   * Merge the values for each key using an associative and commutative reduce function. This will
   * also perform the merging locally on each mapper before sending results to a reducer, similarly
   * to a "combiner" in MapReduce. Output will be hash-partitioned with numPartitions partitions.
   */
  def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)] = self.withScope {
    reduceByKey(new HashPartitioner(numPartitions), func)
  }
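For reference, HashPartitioner computes the partition index from the key's hash code. Below is a paraphrased sketch of the Spark source; Utils.nonNegativeMod is Spark's internal helper that keeps hashCode % numPartitions non-negative:

class HashPartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions

  override def getPartition(key: Any): Int = key match {
    case null => 0  // null keys always land in partition 0
    case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)  // hash modulo partition count
  }
}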


HashPartitioner itself is implemented by extending the org.apache.spark.Partitioner class and overriding its numPartitions and getPartition methods. So all we need is a class of our own that extends Partitioner and implements those two methods. The code is shown below:
import org.apache.spark.Partitioner

/**
  * Requirement: partition phone numbers by their first three digits.
  * @param num total number of partitions; must be at least 5, because
  *            getPartition below returns indices 0 through 4
  */
class MyPartition(num: Int) extends Partitioner {
    override def numPartitions: Int = num

    override def getPartition(key: Any): Int = {
        key match {
            case null => 0                                       // null keys go to partition 0, as HashPartitioner does
            case key if key.toString.startsWith("137") => 1
            case key if key.toString.startsWith("138") => 2
            case key if key.toString.startsWith("133") => 3
            case _ => 4                                          // every other prefix falls into partition 4
        }
    }
}

Test code. Note that MyPartition(5) is used because getPartition returns indices 0 through 4, and every returned index must be smaller than numPartitions.

    @Test
    def myPartition(): Unit = {
        // sc is assumed to be a SparkContext prepared in the test class,
        // e.g. created from a SparkConf with master "local[*]"
        sc.parallelize(Seq(("1379999", 1), ("138999", 1), ("1333889", 1), ("1333889", 1)), 6)
          .reduceByKey(new MyPartition(5), _ + _)
          .mapPartitionsWithIndex((index, iter) => {
              // Materialize the iterator once; calling toBuffer on it and then
              // returning the same iterator would yield empty partitions,
              // since an iterator can only be traversed once
              val items = iter.toBuffer
              println(s"index: $index, items: $items")
              items.iterator
          })
          .collect()
    }
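In this run, the key starting with 137 lands in partition 1, the one starting with 138 in partition 2, and the two ("1333889", 1) records are merged into ("1333889", 2) in partition 3; partitions 0 and 4 stay empty. The same partitioner also works with any other shuffle operator that accepts a Partitioner. A minimal sketch with partitionBy, using a couple of illustrative keys and assuming the same sc:

sc.parallelize(Seq(("1371111", 1), ("1891111", 1)))
  .partitionBy(new MyPartition(5))  // repartition without aggregating values
  .glom()                           // gather each partition into an array for inspection
  .collect()
  .zipWithIndex
  .foreach { case (part, idx) => println(s"partition $idx: ${part.toSeq}") }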