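Background

I tested one case: running the standard LPA (label propagation) algorithm on GraphX 1.6, using the built-in LabelPropagation implementation. The dataset is the Google web graph (leaving aside that it may not be the most suitable dataset). The cluster runs in standalone mode with 18 workers, one executor per worker, each with 50 GB of memory and 32 cores. The data is loaded into 18 partitions.

The case runs 200 iterations. Code: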

import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.graphx.lib._

// load the graph as an edge list with 18 edge partitions
// (sc is the SparkContext provided by spark-shell)
val google = GraphLoader.edgeListFile(sc, "/home/admin/benchmark/data/google/web-Google.txt", false, 18)

LabelPropagation.run(google, 200)

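How GraphX executes

GraphX runs LPA on its own Pregel wrapper. The advantages come first; the problems are analyzed after they show up below:
1. It hides the BSP process built on VertexRDD and EdgeRDD; the API is simple and the generics are clear.
2. If no messages flow in an iteration, the loop stops early and the job ends.
3. The graph is cached automatically before the iterations start, and intermediate RDDs that are no longer needed are unpersisted automatically after each iteration.

The code is as follows: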

  def apply[VD: ClassTag, ED: ClassTag, A: ClassTag]
     (graph: Graph[VD, ED],
      initialMsg: A,
      maxIterations: Int = Int.MaxValue,
      activeDirection: EdgeDirection = EdgeDirection.Either)
     (vprog: (VertexId, VD, A) => VD,
      sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
      mergeMsg: (A, A) => A)
    : Graph[VD, ED] =
  {
    var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache()
    // compute the messages
    var messages = g.mapReduceTriplets(sendMsg, mergeMsg)
    var activeMessages = messages.count()
    // Loop
    var prevG: Graph[VD, ED] = null
    var i = 0
    while (activeMessages > 0 && i < maxIterations) {
      // Receive the messages and update the vertices.
      prevG = g
      g = g.joinVertices(messages)(vprog).cache()

      val oldMessages = messages
      // Send new messages, skipping edges where neither side received a message. We must cache
      // messages so it can be materialized on the next line, allowing us to uncache the previous
      // iteration.
      messages = g.mapReduceTriplets(
        sendMsg, mergeMsg, Some((oldMessages, activeDirection))).cache()
      // The call to count() materializes `messages` and the vertices of `g`. This hides oldMessages
      // (depended on by the vertices of g) and the vertices of prevG (depended on by oldMessages
      // and the vertices of g).
      activeMessages = messages.count()

      logInfo("Pregel finished iteration " + i)

      // Unpersist the RDDs hidden by newly-materialized RDDs
      oldMessages.unpersist(blocking = false)
      prevG.unpersistVertices(blocking = false)
      prevG.edges.unpersist(blocking = false)
      // count the iteration
      i += 1
    }

    g
  } // end of apply
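
To make the loop above easier to follow, here is a minimal single-machine imitation of it in plain Scala. It is only a sketch, not GraphX code: `toyPregel` and its simplified `sendMsg` signature are hypothetical names made up for illustration, and unlike the real implementation it does not restrict sending to edges whose endpoints received a message. The usage propagates the minimum vertex id (a connected-components style computation) rather than LPA, because that converges and therefore shows the early stop.

```scala
// Toy imitation of the Pregel loop (not GraphX code; all names are hypothetical).
type VertexId = Long

def toyPregel[VD, A](
    vertices: Map[VertexId, VD],
    edges: Seq[(VertexId, VertexId)],
    initialMsg: A,
    maxIterations: Int)(
    vprog: (VertexId, VD, A) => VD,
    sendMsg: (VertexId, VD, VertexId, VD) => Iterator[(VertexId, A)],
    mergeMsg: (A, A) => A): (Map[VertexId, VD], Int) = {

  // Superstep 0: every vertex receives the initial message.
  var g = vertices.map { case (id, attr) => id -> vprog(id, attr, initialMsg) }

  // Collect the messages produced along every edge and merge them per target vertex.
  def exchange(state: Map[VertexId, VD]): Map[VertexId, A] =
    edges.iterator
      .flatMap { case (src, dst) => sendMsg(src, state(src), dst, state(dst)) }
      .toSeq
      .groupBy(_._1)
      .map { case (id, msgs) => id -> msgs.map(_._2).reduce(mergeMsg) }

  var messages = exchange(g)
  var i = 0
  // Same exit condition as Pregel: stop when no messages flow or the cap is reached.
  while (messages.nonEmpty && i < maxIterations) {
    // Only vertices that received a message run vprog.
    g = g.map { case (id, attr) =>
      id -> messages.get(id).map(m => vprog(id, attr, m)).getOrElse(attr)
    }
    messages = exchange(g)
    i += 1
  }
  (g, i)
}

// Usage: vertices 1-2-3 form a chain, vertex 4 is isolated.
val verts = Map(1L -> 1L, 2L -> 2L, 3L -> 3L, 4L -> 4L)
val es = Seq((1L, 2L), (2L, 3L))
val (result, rounds) = toyPregel[Long, Long](verts, es, Long.MaxValue, 200)(
  (_, attr, msg) => math.min(attr, msg),
  (src, sAttr, dst, dAttr) =>
    if (sAttr < dAttr) Iterator((dst, sAttr))
    else if (dAttr < sAttr) Iterator((src, dAttr))
    else Iterator.empty,
  (a, b) => math.min(a, b))
```

With a cap of 200 the toy loop still ends after two rounds, because no messages are left to deliver. LPA as shipped sends a message along every edge in every round, so it does not benefit from this early stop and runs all 200 iterations.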

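The Spark driver becomes the bottleneck

The driver is the entry point for submitting the job, but it also "supervises" the execution of the DAG. With the default 1 GB of driver memory, the driver threw an OOM exception after the job had run for 10 minutes, and this reproduced reliably. Here are two of the stack traces.

This one occurred while traversing the RDD dependency chain (checkpoint-related calls).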

Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.spark.rdd.RDD.checkpointRDD(RDD.scala:217)
    at org.apache.spark.rdd.RDD.dependencies(RDD.scala:224)
    at org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$visit$1(RDD.scala:283)
    at org.apache.spark.rdd.RDD$$anonfun$org$apache$spark$rdd$RDD$$visit$1$1.apply(RDD.scala:288)
    at org.apache.spark.rdd.RDD$$anonfun$org$apache$spark$rdd$RDD$$visit$1$1.apply(RDD.scala:286)
    ...

This one occurred while creating a ShuffleMapStage (that is, while building the execution plan, before it is submitted).

Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: GC overhead limit exceeded
    ...
    at org.apache.spark.storage.RDDInfo$.fromRdd(RDDInfo.scala:58)
    at org.apache.spark.scheduler.StageInfo$$anonfun$1.apply(StageInfo.scala:80)
    at org.apache.spark.scheduler.StageInfo$$anonfun$1.apply(StageInfo.scala:80)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.AbstractTraversable.map(Traversable.scala:105)
    at org.apache.spark.scheduler.StageInfo$.fromStage(StageInfo.scala:80)
    at org.apache.spark.scheduler.Stage.<init>(Stage.scala:99)
    at org.apache.spark.scheduler.ShuffleMapStage.<init>(ShuffleMapStage.scala:36)
    at org.apache.spark.scheduler.DAGScheduler.newShuffleMapStage(DAGScheduler.scala:317)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$newOrUsedShuffleStage(DAGScheduler.scala:352)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getShuffleMapStage$1.apply(DAGScheduler.scala:286)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getShuffleMapStage$1.apply(DAGScheduler.scala:285)
    ...

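After raising spark.driver.memory to 10g: at the time of writing, Pregel has reached iteration 175, the job has been running for 29 minutes, and driver memory usage is steady at 10 GB. GC status at that point (screenshot in the original post):

Young GCs and full GCs make the driver the bottleneck of the whole iterative job.

On the worker side, the executors are under little memory or CPU pressure: memory usage holds at around 16 GB, and CPU usage never exceeds 250% at its peak.

Job timeline in the UI (screenshot in the original post):

The job finally finished after about 45 minutes. With 10 GB of memory, the driver went through 760 young GCs and 31 full GCs.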

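Analysis

The way GraphX runs an iterative graph algorithm amounts to submitting n Spark jobs. So first, Spark has to generate and submit each DAG quickly, with low enough overhead. Second, the cost of transmitting and responding to events during stage execution also has to be low enough.

In practice, the bulk of the memory goes to the lineage between graphs (RDDs), which is kept as the recovery mechanism for failover. The rest of the overhead comes from receiving and processing the status and start/stop events of every stage in every iteration.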
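
To get a feel for why lineage hurts more and more as iterations accumulate, here is a deliberately crude toy model in plain Scala. It is not Spark code, and it rests on two assumptions of mine: each iteration appends a fixed number of RDD nodes to one chain (`rddsPerIteration`, a made-up parameter), and each job submission visits the whole chain. Under those assumptions the total driver-side work grows quadratically with the number of iterations.

```scala
// Toy model (not Spark code; all names are hypothetical): a lineage chain that
// grows by `rddsPerIteration` nodes per iteration and is walked once per job.
final case class ToyRdd(id: Int, parent: Option[ToyRdd])

// Walk from the newest node back to the root and count the nodes on the way.
def lineageLength(rdd: ToyRdd): Int = {
  var n = 0
  var cur: Option[ToyRdd] = Some(rdd)
  while (cur.nonEmpty) {
    n += 1
    cur = cur.get.parent
  }
  n
}

// Total number of node visits over all simulated job submissions.
def totalVisits(iterations: Int, rddsPerIteration: Int): Long = {
  var head = ToyRdd(0, None)
  var visits = 0L
  for (_ <- 1 to iterations) {
    // One iteration adds a fixed number of new nodes on top of the chain.
    for (_ <- 1 to rddsPerIteration) head = ToyRdd(head.id + 1, Some(head))
    // Assumption: submitting the job touches every node created so far.
    visits += lineageLength(head)
  }
  visits
}

val visits20 = totalVisits(20, 10)   // 2120
val visits200 = totalVisits(200, 10) // 201200
```

In this model, ten times as many iterations cost roughly 95 times as many node visits. The real scheduler is more subtle than this, so the numbers only illustrate the shape of the growth.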

Below is the heap histogram (histo) taken after about 20 minutes, when the 10 GB of memory had been used up:

 num     #instances         #bytes  class name
----------------------------------------------
   1:      64438657     1546527768  scala.collection.immutable.$colon$colon
   2:      15713541     1131374952  org.apache.spark.storage.RDDInfo
   3:      32323255      517172080  java.lang.Integer
   4:       1420435      363557008  [B
   5:       4116166      328222184  [I
   6:       2804579      316599136  [Ljava.lang.Object;
   7:       6750506      314468944  [C
   8:       9262969      296415008  scala.collection.mutable.ListBuffer
   9:       6468190      155236560  java.lang.String
  10:       4364012       69824192  org.apache.spark.rdd.RDD$$anonfun$checkpointRDD$1
  11:        896697       64562184  java.util.regex.Pattern
  12:        896648       57385472  java.util.regex.Matcher
  13:       3484333       55749328  org.apache.spark.rdd.RDD$$anonfun$dependencies$1
  14:       1742140       55748480  org.apache.spark.scheduler.DAGScheduler$$anonfun$visit$1$1
  15:        896618       50210608  [Ljava.util.regex.Pattern$GroupHead;
  16:        371868       46546816  [Lscala.collection.mutable.HashEntry;
  17:       1728270       41478480  scala.collection.mutable.ArrayBuffer
  18:       1670208       40084992  scala.collection.mutable.DefaultEntry
  19:       2312219       36995504  scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1
  20:       1355395       32529480  org.apache.spark.ui.scope.RDDOperationEdge
  21:       1309374       31424976  org.apache.spark.scheduler.CompressedMapStatus

RDDInfo here is essentially the in-memory evidence of the lineage chain; the rest includes checkpointRDD, RDD dependencies, and so on.

class RDDInfo(
    val id: Int,
    val name: String,
    val numPartitions: Int,
    var storageLevel: StorageLevel,
    val parentIds: Seq[Int],
    val callSite: String = "",
    val scope: Option[RDDOperationScope] = None)
  extends Ordered[RDDInfo] {

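The same job, run on our own compute framework in a streaming-iteration style, takes a few hundred seconds without any particular optimization, and that time scales roughly linearly from what we measure at small iteration counts. At small iteration counts (where the driver is not the bottleneck), GraphX is also several times slower than ours, but I do not think the gap is large enough yet, so I will come back to it after further improvements.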

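Thoughts

This is a typical case. The work LPA does inside its functions is very cheap, with basically no computational complexity or overhead. The data volume is not large either. The iteration count is fairly high, but not extreme. What gets shipped during the shuffle is just a Map whose keys and values are primitive numeric types, so serialization and payload size are not the problem either.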

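This case showed me that when GraphX runs a graph algorithm with many iterations, the driver becomes the bottleneck. Granted, GraphX keeps the lineage of every iteration's graph so that it can recover. But picture a sizable Spark cluster where many people want to run graph jobs: even if the executors are fine, how many resources would the drivers need?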

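If GraphX is to run many iterations, you have to hand-write it as: run n rounds, cache the graph once, then continue. With n too small, it still goes OOM because of lineage. With n too large, apart from the recovery problem, unrolling the DAG into several hundred stages is in my view a problem of its own.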
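
The effect of cutting the chain every n rounds can be sketched with a small plain-Scala model. This is not Spark code: `lineageDepths`, `rddsPerIteration`, and `truncateEvery` are hypothetical names, and the model assumes that the cut really drops the ancestry. In Spark that is what checkpointing an RDD does; caching alone keeps the lineage.

```scala
// Toy model (not Spark code; all names are hypothetical): the lineage depth the
// driver has to handle in each iteration when the chain is cut every
// `truncateEvery` iterations.
def lineageDepths(iterations: Int, rddsPerIteration: Int, truncateEvery: Int): Seq[Int] = {
  var depth = 1 // the materialized starting graph
  (1 to iterations).map { i =>
    depth += rddsPerIteration
    val seen = depth // depth observed while this iteration is being scheduled
    // Assumption: materializing here removes all ancestry, leaving a single node.
    if (i % truncateEvery == 0) depth = 1
    seen
  }
}

// 200 iterations, 10 new RDD nodes per iteration (an arbitrary illustrative figure).
val untruncated = lineageDepths(200, 10, Int.MaxValue)
val every20 = lineageDepths(200, 10, 20)

val maxUntruncated = untruncated.max // 2001
val maxEvery20 = every20.max         // 201
```

In the model, cutting every 20 rounds caps the depth at 201 nodes instead of 2001, and the summed depth over the run drops from 201200 to 21200. What the model leaves out is the cost of each cut, which is exactly the trade-off in choosing n.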

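From the angle of supporting a graph business in production, GraphX's loader is also a problem. Building the graph model in GraphX is simple and fast, but updating an already built graph means loading the incremental part first and then doing something like a join of two graphs. So the simple, comfortable API that GraphX shows its users comes at the price of an inflexible built-in graph modeling process. It would be better to separate compute from storage, so that compute only needs the modeling metadata held by the storage layer in order to run.

Seen this way: first, GraphX is not suited to computations with many iterations, and its QPS is practically zero. Second, GraphX is not suited to scenarios where the graph data is updated. Developers can add extra work and mechanisms to make it possible, but GraphX itself has not really considered the problem. Whether GraphX suits computation on large graphs is a question I will leave aside for now.

In my view, GraphX is only suitable as one link in a data analysis pipeline. There is almost no chance of it carrying a graph business in production on its own.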

