Spark Core Study Notes

Closures

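Roughly speaking, a closure lets a function access variables defined outside it. In Spark, however, changes that the function makes to those variables are not visible outside the function: the closure is serialized and shipped to the executors, so each task modifies its own copy rather than the driver's variable.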
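The behaviour described above can be sketched with a small local-mode job. This is an illustration only: the object name `ClosureScopeDemo`, the `local[2]` master, and the sample data are assumptions, not part of the original notes. The value of `counter` after `foreach` is deliberately not asserted, because Spark does not guarantee what the driver sees.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ClosureScopeDemo {
  // Returns (driver-side counter after foreach, sum computed by an action).
  def run(sc: SparkContext): (Int, Int) = {
    val rdd = sc.parallelize(1 to 100, 4)

    var counter = 0
    // The closure is serialized for each task, so the increments are applied
    // to a copy of `counter`. Do not rely on the driver-side value afterwards.
    rdd.foreach(x => counter += x)

    // Results should come back through an action instead.
    val total = rdd.reduce(_ + _)
    (counter, total)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("closure-scope-demo")
    val sc = new SparkContext(conf)
    try {
      val (counter, total) = run(sc)
      println(s"counter seen on the driver: $counter")
      println(s"sum via reduce: $total")
    } finally sc.stop()
  }
}
```

Returning the sum through `reduce` keeps the computation inside Spark's own result path, which is why it is reliable where the captured variable is not.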

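First, a closure is a function. Second, and most fundamentally, the function refers to (depends on) variables that are neither global nor local to it, but are defined in the enclosing context. Such variables are called "free variables". The "magic" of a closure is that it can "cache", or continuously "track", the variables it references. (At the language-implementation level, this means those variables and the objects they reference are not reclaimed by the GC while the closure is still reachable.) Put another way: a closure is a function that carries an implicit context behind it, and that context holds the free variables the function references.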
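A minimal plain-Scala sketch of a free variable being captured, with no Spark involved. The names `ClosureDemo` and `makeCounter` are made up for this illustration.

```scala
object ClosureDemo {
  // Returns a function that closes over `count`. Inside the returned function,
  // `count` is a free variable: it is neither a parameter nor a local of it.
  def makeCounter(): () => Int = {
    var count = 0
    () => { count += 1; count }
  }

  def main(args: Array[String]): Unit = {
    val next = makeCounter()
    println(next()) // 1
    println(next()) // 2: the captured `count` outlives the call to makeCounter

    val other = makeCounter() // a second closure gets its own captured context
    println(other()) // 1
  }
}
```

Each call to `makeCounter` creates a fresh `count`, so the two closures do not share state. That is the "implicit context" the paragraph above refers to.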

mapPartitions

  /**
   * Return a new RDD by applying a function to each partition of this RDD.
   *
   * `preservesPartitioning` indicates whether the input function preserves the partitioner, which
   * should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
   */
  def mapPartitions[U: ClassTag](
      f: Iterator[T] => Iterator[U],
      preservesPartitioning: Boolean = false): RDD[U] = withScope {
    val cleanedF = sc.clean(f)
    new MapPartitionsRDD(
      this,
      (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(iter),
      preservesPartitioning)
  }
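The effect of `preservesPartitioning`, described in the doc comment above, can be checked with a small sketch. It assumes a local-mode context and a pair RDD partitioned with `HashPartitioner`; the object name and sample data are hypothetical.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PreservesPartitioningDemo {
  // Returns (partitioner kept with the flag set, partitioner kept with the default).
  def run(sc: SparkContext): (Boolean, Boolean) = {
    val pairs = sc.parallelize(Seq("a" -> 1, "b" -> 2, "c" -> 3))
      .partitionBy(new HashPartitioner(4))

    // Only the values change, so the existing partitioner is still valid.
    val kept = pairs.mapPartitions(
      iter => iter.map { case (k, v) => (k, v * 10) },
      preservesPartitioning = true)

    // Same function with the default flag (false): the partitioner is dropped.
    val dropped = pairs.mapPartitions(
      iter => iter.map { case (k, v) => (k, v * 10) })

    (kept.partitioner.isDefined, dropped.partitioner.isDefined)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("preserves-partitioning-demo")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

Keeping the partitioner matters when a later key-based operation could otherwise trigger an unnecessary shuffle. Setting the flag to `true` when the function does change keys would leave Spark with a partitioner that no longer matches the data.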

map: if a partition holds 10,000 records, your function is invoked 10,000 times, once per record.

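mapPartitions: a task invokes the function only once per partition, and the function receives an iterator over all of that partition's records. Because the function runs only once, any per-call overhead is paid once, and performance is usually better.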
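The difference in invocation counts can be made visible by having each call emit a `1` and summing the results. This is a sketch under assumed settings: 10,000 records split into 4 partitions in local mode, and a made-up object name.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object InvocationCountDemo {
  // Returns (number of map calls, number of mapPartitions calls).
  def counts(sc: SparkContext): (Long, Long) = {
    val rdd = sc.parallelize(1 to 10000, 4)

    // map: the function runs once per element, so it emits one 1 per record.
    val mapCalls = rdd.map(_ => 1L).reduce(_ + _)

    // mapPartitions: the function runs once per partition and emits a single 1.
    val mapPartitionsCalls = rdd.mapPartitions(_ => Iterator(1L)).reduce(_ + _)

    (mapCalls, mapPartitionsCalls)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("invocation-count-demo")
    val sc = new SparkContext(conf)
    try println(counts(sc)) // (10000,4)
    finally sc.stop()
  }
}
```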

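If the map step has to create expensive extra objects over and over (for example, when writing RDD data to a database over JDBC, map creates a connection for every element, whereas mapPartitions creates one connection per partition), mapPartitions is far more efficient than map.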
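The one-connection-per-partition pattern might look like the sketch below. `FakeConnection` is a hypothetical stand-in for a real JDBC connection so that the example runs without a database; the object name and sample data are likewise assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical stand-in for a JDBC connection; not a real driver API.
class FakeConnection {
  def write(record: Int): String = s"wrote $record"
  def close(): Unit = ()
}

object PerPartitionResourceDemo {
  def writeAll(sc: SparkContext): Array[String] = {
    val rdd = sc.parallelize(1 to 8, 2)
    rdd.mapPartitions { iter =>
      // Created once per partition, on the executor, so it is never serialized.
      val conn = new FakeConnection()
      try {
        // iter.map is lazy: without toList the writes would run after close().
        // Materializing holds the partition's results in memory.
        iter.map(conn.write).toList.iterator
      } finally conn.close()
    }.collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("per-partition-resource-demo")
    val sc = new SparkContext(conf)
    try writeAll(sc).foreach(println) finally sc.stop()
  }
}
```

When the job only writes and returns nothing, `foreachPartition` is usually a more natural fit, since it avoids building a result iterator at all.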

Spark SQL and the DataFrame API generally apply this kind of per-partition optimization to a program by default.

Demo

val standardResultRDD = inputFile.mapPartitions(l => {
  val standardInstalledAppUtilObject =
    new DeSerializableDimHdfs(obj.value).getStandardInstalledAppUtilObject
  for (e <- l) yield {
    OdsStandardInstlledAppPro.odsStandardInstlledApp(e, standardInstalledAppUtilObject)
  }
}).filter(_ != null)

broadcast

  /**
   * Broadcast a read-only variable to the cluster, returning a
   * [[org.apache.spark.broadcast.Broadcast]] object for reading it in distributed functions.
   * The variable will be sent to each cluster only once.
   */
  def broadcast[T: ClassTag](value: T): Broadcast[T] = {
    assertNotStopped()
    require(!classOf[RDD[_]].isAssignableFrom(classTag[T].runtimeClass),
      "Can not directly broadcast RDDs; instead, call collect() and broadcast the result.")
    val bc = env.broadcastManager.newBroadcast[T](value, isLocal)
    val callSite = getCallSite
    logInfo("Created broadcast " + bc.id + " from " + callSite.shortForm)
    cleaner.foreach(_.registerBroadcastForCleanup(bc))
    bc
  }

A Broadcast variable is read-only, and the data it carries must be serializable.

Demo
val obj = sc.broadcast(objPro)
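A slightly fuller sketch of how a broadcast value is typically read inside a task, through `.value`. The lookup table, the object name, and the sample codes are hypothetical and only illustrate the call pattern.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastDemo {
  def lookupNames(sc: SparkContext): Array[String] = {
    // Hypothetical read-only lookup table; it must be serializable.
    val countryNames = Map("cn" -> "China", "us" -> "United States")
    val bc = sc.broadcast(countryNames)

    val codes = sc.parallelize(Seq("cn", "us", "fr"), 2)
    // Tasks read the shared copy through bc.value instead of capturing
    // the map itself in every closure.
    codes.map(code => bc.value.getOrElse(code, "unknown")).collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("broadcast-demo")
    val sc = new SparkContext(conf)
    try lookupNames(sc).foreach(println) finally sc.stop()
  }
}
```

Capturing `countryNames` directly would also work, but the map would then be serialized with every task's closure. Broadcasting tends to pay off once the value is large or reused across many tasks.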
