Custom RDD Repartitioning in Spark

Original

2020-02-25 17:49

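In some computing scenarios, we need part of the data from two related inputs, that is, some of the records in two RDDs, to end up in the same partition so they can be matched against each other. When that happens, we have to define a custom RDD repartitioning strategy based on the actual business requirements.

The code below shows how to implement it. Spark provides an abstract class for partitioning, Partitioner: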

package org.apache.spark
/**
 * An object that defines how the elements in a key-value pair RDD are partitioned by key.
 * Maps each key to a partition ID, from 0 to `numPartitions - 1`.
 */
abstract class Partitioner extends Serializable {
  def numPartitions: Int
  def getPartition(key: Any): Int
  // equals and hashCode are inherited from AnyRef rather than declared here;
  // custom partitioners should still override them.
}

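    def numPartitions: Int: returns the number of partitions you want to create;
    def getPartition(key: Any): Int: computes on the input key and returns the partition ID for that key, a value from 0 to numPartitions - 1;
    equals(): the standard Java equality method. It should be implemented because Spark compares the partitioners of two RDDs to check whether they are partitioned the same way.

A concrete implementation looks like this: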
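One detail of the getPartition contract is easy to miss: the result must lie between 0 and numPartitions - 1, but Scala's % keeps the sign of the dividend, so a negative key or hash produces a negative remainder. A minimal sketch of a guard follows; the object and function names are illustrative and not part of the Partitioner API.

```scala
object PartitionIdSketch {
  /** Maps any Int (including negatives) into the range [0, numPartitions). */
  def nonNegativeMod(x: Int, numPartitions: Int): Int = {
    val raw = x % numPartitions // keeps the sign of x, so it may be negative
    if (raw < 0) raw + numPartitions else raw
  }
}
```

For example, -7 % 4 evaluates to -3 in Scala, whereas PartitionIdSketch.nonNegativeMod(-7, 4) yields 1.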

package com.ais.common.tools.partitioner

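import com.ais.common.tools.HashTool
import org.apache.spark.Partitioner

/** Routes Int keys by value and String keys by hash; every other key goes to partition 0. */
class IndexPartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions

  // % can return a negative value; getPartition must stay within [0, numPartitions).
  private def nonNegative(mod: Int): Int = if (mod < 0) mod + numPartitions else mod

  override def getPartition(key: Any): Int = {
    key match {
      case null => 0
      case iKey: Int => nonNegative(iKey % numPartitions)
      case textKey: String => nonNegative((HashTool.hash(textKey) % numPartitions).toInt)
      case _ => 0 // unsupported key types all land in partition 0
    }
  }

  // Spark uses equals to decide whether two RDDs are partitioned the same way.
  override def equals(other: Any): Boolean = {
    other match {
      case h: IndexPartitioner =>
        h.numPartitions == numPartitions
      case _ =>
        false
    }
  }

  // Keep hashCode consistent with equals.
  override def hashCode: Int = numPartitions
}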

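This custom partitioning strategy works as follows:

If the key is an integer, it is assigned by taking the key modulo the number of partitions; if the key is a string, its hash is computed and taken modulo the number of partitions. As long as the keys of the two RDDs are kept consistent and both RDDs are repartitioned with this partitioner, records with the same key are guaranteed to be shuffled to the same partition. You can of course define a more complex partitioning strategy to suit your own business needs.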
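The sketch below shows one way such a partitioner might be applied to two pair RDDs so that matching can happen partition by partition. Everything specific here is an assumption made for illustration: KeyModPartitioner is a hypothetical stand-in for IndexPartitioner that uses String.hashCode instead of the author's HashTool, and the sample data, application name, and local[2] master are made up.

```scala
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// Hypothetical stand-in for IndexPartitioner (String.hashCode replaces HashTool).
class KeyModPartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions

  private def nonNegativeMod(x: Int): Int = {
    val r = x % partitions
    if (r < 0) r + partitions else r
  }

  override def getPartition(key: Any): Int = key match {
    case null      => 0
    case i: Int    => nonNegativeMod(i)
    case s: String => nonNegativeMod(s.hashCode)
    case _         => 0
  }

  override def equals(other: Any): Boolean = other match {
    case p: KeyModPartitioner => p.numPartitions == numPartitions
    case _                    => false
  }

  override def hashCode: Int = numPartitions
}

object CoPartitionSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("co-partition-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val partitioner = new KeyModPartitioner(4)
      // Made-up sample data: both RDDs are keyed by the same Int id.
      val users  = sc.parallelize(Seq(1 -> "alice", 2 -> "bob", 5 -> "carol")).partitionBy(partitioner)
      val orders = sc.parallelize(Seq(1 -> "book", 5 -> "pen", 5 -> "ink")).partitionBy(partitioner)

      // Relies on equals: the two partitioners compare equal, so the RDDs are co-partitioned.
      assert(users.partitioner == orders.partitioner)

      // Partition i of `users` holds the same keys as partition i of `orders`,
      // so the matching can be done locally within each pair of partitions.
      val matched = users.zipPartitions(orders) {
        (us: Iterator[(Int, String)], os: Iterator[(Int, String)]) =>
          val nameById = us.toMap
          os.flatMap { case (id, item) => nameById.get(id).map(name => (id, name, item)) }
      }
      matched.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

zipPartitions requires both RDDs to have the same number of partitions, which holds here because both were partitioned by the same partitioner. A plain users.join(orders) would also benefit, since Spark can reuse the shared partitioning instead of shuffling either side again.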
