Some Questions About Spark RDD Partitioning
VM configuration: 2 processors, 16 cores
Case 1: makeRDD
Code
import org.apache.spark.{SparkConf, SparkContext}

object Spark21 extends App {
  // Set up the Spark environment
  val config: SparkConf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
  // Create the context
  val sc = new SparkContext(config)
  val listRDD = sc.makeRDD(List(1, 2, 3, 4))
  // Save the RDD's data to files
  listRDD.saveAsTextFile("output")
}
Result
The numbers in the list were scattered across 16 partitions.
makeRDD source code
override def defaultParallelism(): Int = {
  conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 2))
}

/** Get a parameter as an integer, falling back to a default if not set */
def getInt(key: String, defaultValue: Int): Int = {
  getOption(key).map(_.toInt).getOrElse(defaultValue)
}

// Use an atomic variable to track total number of cores in the cluster for simplicity and speed
protected val totalCoreCount = new AtomicInteger(0)
From the source code we can see:
totalCoreCount gets the total number of cores, and math.max takes the larger of that and 2. Here the total is 16, so the partition count is 16.
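To make this concrete, here is a minimal sketch (reusing the sc from the code above; the counts shown assume the same 16-core local[*] run) that inspects how makeRDD spreads the elements, and shows that an explicit slice count overrides defaultParallelism:

  val listRDD = sc.makeRDD(List(1, 2, 3, 4))
  println(listRDD.getNumPartitions) // 16 here: max(totalCoreCount, 2)

  // glom() turns each partition into an array so its contents can be printed
  listRDD.glom().zipWithIndex().collect().foreach { case (part, idx) =>
    println(s"partition $idx: ${part.mkString(",")}")
  }

  // makeRDD also accepts an explicit numSlices argument, which takes precedence
  val fourParts = sc.makeRDD(List(1, 2, 3, 4), 4)
  println(fourParts.getNumPartitions) // 4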
Case 2: textFile
Code
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Spark21 extends App {
  // Set up the Spark environment
  val config: SparkConf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
  // Create the context
  val sc = new SparkContext(config)
  // val listRDD = sc.makeRDD(List(1, 2, 3, 4))
  // Read the text files under the "in" directory
  private val listRDD: RDD[String] = sc.textFile("in")
  // Save the RDD's data to files
  listRDD.saveAsTextFile("output")
}
The result has two partitions. Let's analyze the source code to see why.
textFile source code
/**
 * Read a text file from HDFS, a local file system (available on all nodes), or any
 * Hadoop-supported file system URI, and return it as an RDD of Strings.
 * @param path path to the text file on a supported file system
 * @param minPartitions suggested minimum number of partitions for the resulting RDD
 * @return RDD of lines of the text file
 */
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}

/**
 * Default min number of partitions for Hadoop RDDs when not given by user
 * Notice that we use math.min so the "defaultMinPartitions" cannot be higher than 2.
 * The reasons for this are discussed in https://github.com/mesos/spark/pull/718
 */
def defaultMinPartitions: Int = math.min(defaultParallelism, 2)

// Get the default level of parallelism to use in the cluster, as a hint for sizing jobs.
def defaultParallelism(): Int
From the source code we can see:
The textFile method takes two parameters: a path and a suggested minimum partition count (minPartitions).
If minPartitions is not set, the default is the smaller of defaultParallelism and 2.
defaultParallelism is the parallelism level, which is 16 here; compared with 2, the latter is smaller, so the result is 2 partitions.
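A minimal sketch of both paths (the "in" directory is the one used above; the exact count for the explicit case depends on the files' sizes and Hadoop's split computation):

  val rdd2 = sc.textFile("in")     // minPartitions = min(16, 2) = 2
  val rdd4 = sc.textFile("in", 4)  // suggest at least 4 input splits
  println(rdd2.getNumPartitions)   // 2 in this run
  println(rdd4.getNumPartitions)   // usually >= 4, but hadoopFile decides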
Addendum: when textFile reads files with minPartitions set to 2, the result can still have 3 partitions.
Here textFile's actual split count is determined by hadoopFile's input-split computation; I don't fully understand the details yet, so I'm recording it here for now.
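My current understanding (an assumption based on reading Hadoop's old-API FileInputFormat.getSplits, not verified against this run): minPartitions only sets a goal size per split, and the 1.1 "split slop" factor plus the leftover bytes can produce one extra split. A rough sketch of the arithmetic for a hypothetical 7-byte file with minPartitions = 2:

  val totalSize = 7L                   // total bytes across the input files (hypothetical)
  val numSplits = 2                    // the requested minPartitions
  val goalSize = totalSize / numSplits // 7 / 2 = 3 (integer division)
  val SPLIT_SLOP = 1.1                 // Hadoop tolerates a 10% overshoot per split

  var remaining = totalSize
  var splits = 0
  while (remaining.toDouble / goalSize > SPLIT_SLOP) {
    splits += 1                        // carve off goalSize bytes: splits of 3, then 3
    remaining -= goalSize
  }
  if (remaining > 0) splits += 1       // the 1-byte remainder becomes its own split
  println(splits)                      // 3, not the requested 2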
Case 3
setMaster source code
/**
 * The master URL to connect to, such as "local" to run locally with one thread, "local[4]" to
 * run locally with 4 cores, or "spark://master:7077" to run on a Spark standalone cluster.
 */
def setMaster(master: String): SparkConf = {
  set("spark.master", master)
}
In other words, the bracketed number in local[N] controls how many cores (threads) the local run uses.
Based on this doc comment, change the master from local[*] to local[1]:
// before
val config: SparkConf = new SparkConf().setMaster("local[*]").setAppName("WordCount")
// after
val config: SparkConf = new SparkConf().setMaster("local[1]").setAppName("WordCount")
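A quick way to check the effect (a sketch; the printed values depend on which scheduler backend supplies defaultParallelism for the chosen master):

  val conf1 = new SparkConf().setMaster("local[1]").setAppName("WordCount")
  val sc1 = new SparkContext(conf1)
  println(sc1.defaultParallelism)                         // parallelism under local[1]
  println(sc1.makeRDD(List(1, 2, 3, 4)).getNumPartitions) // partitions makeRDD now creates
  sc1.stop()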