Writing to Redis from Spark, plus Spark Resource Configuration Notes

1. It Started with an Error

19/10/16 11:22:06 ERROR YarnClusterScheduler: Lost executor 28 on **********: Container marked as failed: container_********** on host: **********. Exit status: 137. Diagnostics: Container killed on request. Exit code is 137 Killed by external signal
19/10/16 11:32:59 ERROR YarnClusterScheduler: Lost executor 38 on 100.76.80.197: Container marked as failed: container_********** on host: **********. Exit status: 137. Diagnostics: Container killed on request. Exit code is 137 Killed by external signal
19/10/16 11:40:27 ERROR YarnClusterScheduler: Lost executor 39 on **********: Container marked as failed: container_1567762627991_1638740_01_000343 on host: **********. Exit status: 137. Diagnostics: Container killed on request. Exit code is 137 Killed by external signal
19/10/16 11:49:29 ERROR YarnClusterScheduler: Lost executor 40 on **********: Container marked as failed: container********** on host: **********. Exit status: 137. Diagnostics: Container killed on request. Exit code is 137 Killed by external signal
19/10/16 11:49:29 ERROR TaskSetManager: Task 51 in stage 4.0 failed 4 times; aborting job
19/10/16 11:49:29 ERROR ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 51 in stage 4.0 failed 4 times, most recent failure: Lost task 51.3 in stage 4.0 (TID 160, **********, executor 40): ExecutorLostFailure (executor 40 exited caused by one of the running tasks) Reason: Container marked as failed: container_1567762627991_1638740_01_000353 on host: 100.76.26.136. Exit status: 137. Diagnostics: Container killed on request. Exit code is 137 Killed by external signal
Driver stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 51 in stage 4.0 failed 4 times, most recent failure: Lost task 51.3 in stage 4.0 (TID 160, **********, executor 40): ExecutorLostFailure (executor 40 exited caused by one of the running tasks) Reason: Container marked as failed: container_********** on host: 100.76.26.136. Exit status: 137. Diagnostics: Container killed on request. Exit code is 137 Killed by external signal

Exit code 137 means the container was killed by SIGKILL (128 + 9), which typically indicates it exceeded its memory limit. For this kind of job, the most likely cause is an unreasonable resource configuration for the Redis-writing stage: the volume of data being pushed is too large for Redis to absorb.

The key Spark resource settings are:

  • driver-memory: memory for the driver; it must not exceed the total memory of a single machine;
  • executor-memory: memory allocated to each executor; around 2–3 GB per core is recommended;
  • num-executors: how many executors to launch;
  • executor-cores: the number of concurrent task threads per executor, i.e., the maximum number of tasks a single executor can run in parallel.

The error messages show that YARN lost executors, most likely because the executors were killed, so the first thing to check is whether driver-memory and executor-memory are large enough.
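A typical spark-submit invocation setting these four flags might look like the sketch below. All values are illustrative, and the main class and jar name are hypothetical; tune the numbers for your own cluster and data volume.

```shell
# Illustrative values only; executor-memory follows the ~2-3 GB per core
# guideline above (2 cores x 3 GB = 6g).
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 4g \
  --num-executors 10 \
  --executor-cores 2 \
  --executor-memory 6g \
  --class com.example.RedisWriter \
  redis-writer.jar
```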

2. Spark-Redis

This example uses the Spark 2.0 Scala API together with the Jedis client API. The Maven dependency:

<dependency>
  <groupId>redis.clients</groupId>
  <artifactId>jedis</artifactId>
  <version>2.9.0</version>
  <type>jar</type>
</dependency>

The code that writes the data to Redis:

import redis.clients.jedis.Jedis

sampleData.repartition(500).foreachPartition { rows =>
  // One connection per partition, not per record
  val rc = new Jedis(redisHost, redisPort)
  rc.auth(redisPassword)
  try {
    val pipe = rc.pipelined()
    rows.foreach { r =>
      val redisKey = r.getAs[String]("key")
      val redisValue = r.getAs[String]("value")
      pipe.set(redisKey, redisValue)
      pipe.expire(redisKey, expireDays * 3600 * 24)
    }
    pipe.sync() // flush the whole pipeline once per partition
  } finally {
    rc.close() // avoid leaking connections when a partition fails
  }
}

3. Summary

3.1 Limit the number of Redis clients created

sampleData is a Dataset in which each row has two fields: key and value. Because constructing a Jedis client carries real overhead, never write records to Redis one by one inside map; use mapPartitions or foreachPartition instead. That way the connection overhead scales with the number of partitions, not the total number of records. Consider that if sampleData had 100 million rows, a map would construct 100 million Jedis objects.
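To make the contrast concrete, this is the per-record anti-pattern the paragraph warns against (a sketch only; do not use it):

```scala
// Anti-pattern: one Jedis connection per record.
// With 100 million rows this would open 100 million connections.
sampleData.foreach { r =>
  val rc = new Jedis(redisHost, redisPort) // per-record connection: very expensive
  rc.auth(redisPassword)
  rc.set(r.getAs[String]("key"), r.getAs[String]("value"))
  rc.close()
}
```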

3.2 Insert data into Redis in batches

Pipelining is recommended for batch insertion; it is dramatically faster than inserting records one at a time. But batching has a serious pitfall. In the code above, an entire partition is flushed in a single pipeline call; if a single partition happens to be very large (exceeding what the Redis pipeline can absorb, or the timeout), Redis can run out of memory (or time out), making the service unavailable!

The fix is to repartition the whole Dataset before foreachPartition so that no single partition is too large; roughly 1k–20k records per partition is recommended. In the example above, sampleData is split into 500 partitions of about 10,000 records each, i.e., about 5 million records in total. However, if the total data volume is very large, keeping partitions this small means a very large number of partitions, which in turn requires more driver memory; otherwise the driver itself can run out of memory.
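Another way to bound the size of each pipeline flush, independently of how large a partition turns out to be, is to sync the pipeline every N records inside the partition. A sketch, where the batch size of 1,000 is an assumed value:

```scala
import redis.clients.jedis.Jedis

sampleData.repartition(500).foreachPartition { rows =>
  val rc = new Jedis(redisHost, redisPort)
  rc.auth(redisPassword)
  try {
    // Flush the pipeline every 1000 records instead of once per partition,
    // so a single oversized partition cannot overwhelm Redis.
    rows.grouped(1000).foreach { batch =>
      val pipe = rc.pipelined()
      batch.foreach { r =>
        val redisKey = r.getAs[String]("key")
        pipe.set(redisKey, r.getAs[String]("value"))
        pipe.expire(redisKey, expireDays * 3600 * 24)
      }
      pipe.sync()
    }
  } finally {
    rc.close()
  }
}
```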

3.3 Limit concurrency when updating a live instance

Redis usually backs an online service. To avoid competing with front-end traffic while writing, do not use too many executors; otherwise the write QPS becomes too high, degrading online response times or even bringing Redis down. The recommended practice is to increase the number of partitions so that each partition stays small, then gradually raise the concurrency (the number of executors), observing the write QPS at each level until it settles in an acceptable range.
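While ramping up executors, one way to observe the load is Redis's own rolling throughput counter, assuming you have redis-cli access to the instance:

```shell
# instantaneous_ops_per_sec is Redis's rolling estimate of commands/sec;
# watch it as you increase --num-executors.
redis-cli -h "$REDIS_HOST" -a "$REDIS_PASSWORD" info stats | grep instantaneous_ops_per_sec
```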
