A Spark 2.3 bug when creating temporary functions over a JDBC connection to the Thrift Server

Problem description

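Our production environment currently runs Spark 2.3. A customer has recently been using UDFs to implement some features, with the following workflow: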

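Write UDF.jar.
Connect to the Thrift Server (YARN mode) with beeline (or JDBC) and run a create temporary function using udf.jar command to create a temporary function. The function can then be used in the current session.

After use, the jar is deleted. From then on, SQL executed in other sessions fails with FileNotFoundException: File XXX does not exist. Restarting the Thrift service resolves the problem.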
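
As a rough illustration of the workflow above, the sketch below issues the statement over JDBC. The host, port, user, function name, class name, and jar path are all hypothetical placeholders, and it assumes the Hive JDBC driver is on the classpath.

```scala
import java.sql.DriverManager

object TempFunctionRepro {
  // Builds the CREATE TEMPORARY FUNCTION statement. All arguments are placeholders.
  def createTempFunctionSql(name: String, className: String, jarUri: String): String =
    s"CREATE TEMPORARY FUNCTION $name AS '$className' USING JAR '$jarUri'"

  def main(args: Array[String]): Unit = {
    // Hypothetical Thrift Server address; adjust to the real deployment.
    val url = "jdbc:hive2://thrift-host:10000/default"
    val conn = DriverManager.getConnection(url, "user", "")
    try {
      val stmt = conn.createStatement()
      // Registers the function for this session only, but the jar is added
      // to the SparkContext that all sessions share.
      stmt.execute(createTempFunctionSql("my_udf", "com.example.MyUdf", "hdfs:///tmp/udf.jar"))
      stmt.execute("SELECT my_udf('x')")
      stmt.close()
    } finally conn.close()
  }
}
```

Deleting the jar after this session ends is what sets up the failure seen later in other sessions.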

Locating the problem

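Why would other tasks still need the jar after it has been deleted? With that question in mind, I went digging through the source code once again.
Start from where the error is raised, Executor.scala line 801: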

  /**
   * Download any missing dependencies if we receive a new set of files and JARs from the
   * SparkContext. Also adds any new JARs we fetched to the class loader.
   */
  private def updateDependencies(newFiles: Map[String, Long], newJars: Map[String, Long]) {
    lazy val hadoopConf = SparkHadoopUtil.get.newConfiguration(conf)
    synchronized {
      // Fetch missing dependencies
      for ((name, timestamp) <- newFiles if currentFiles.getOrElse(name, -1L) < timestamp) {
        logInfo("Fetching " + name + " with timestamp " + timestamp)
        // This is where the error is thrown
        // Fetch file with useCache mode, close cache for local mode.
        Utils.fetchFile(name, new File(SparkFiles.getRootDirectory()), conf,
          env.securityManager, hadoopConf, timestamp, useCache = !isLocal)
        currentFiles(name) = timestamp
      }
      initThreadCurrentJars()
      // remaining code omitted


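This code downloads the jar and file dependencies of a task. It first compares timestamps: a dependency is fetched only if the executor has no record of it or the timestamp carried by the task is newer than the recorded one. When a fetch is needed, the executor looks in its local cache first and, if nothing is there, downloads from the file system that the URI points at.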
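
The timestamp check and the cache-first lookup described above can be sketched with a toy model. This is not Spark code: ExecutorModel, remoteFs, and localCache are hypothetical stand-ins, and the cache is simplified to live inside a single executor.

```scala
import java.io.FileNotFoundException
import scala.collection.mutable

object DependencyFetchSketch {
  // Stand-in for one executor. `remoteFs` is a hypothetical set of URIs that
  // currently exist on the shared file system; it is not a Spark type.
  final class ExecutorModel(remoteFs: mutable.Set[String]) {
    // Plays the role of currentFiles/currentJars: URI -> timestamp already fetched.
    private val current = mutable.HashMap[String, Long]()
    // Executor-local cache of downloaded copies, keyed by (URI, timestamp).
    private val localCache = mutable.Set.empty[(String, Long)]

    // Cache first; otherwise go to the file system the URI points at.
    private def fetch(uri: String, timestamp: Long): Unit =
      if (!localCache.contains((uri, timestamp))) {
        if (!remoteFs.contains(uri))
          throw new FileNotFoundException(s"File $uri does not exist.")
        localCache += ((uri, timestamp))
      }

    // Fetches entries that are unknown or newer; returns the URIs it fetched.
    def updateDependencies(incoming: Map[String, Long]): Seq[String] = synchronized {
      val fetched = mutable.ArrayBuffer.empty[String]
      for ((uri, ts) <- incoming if current.getOrElse(uri, -1L) < ts) {
        fetch(uri, ts)
        current(uri) = ts
        fetched += uri
      }
      fetched.toList
    }
  }
}
```

In this model, an executor that has already fetched a jar skips it on later tasks even after the jar is removed from remoteFs, while a freshly created ExecutorModel given the same dependency map throws FileNotFoundException.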

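The method takes two arguments: the newly added file and jar dependencies. Tracing them upstream shows that they are passed in through the taskDescription argument when the executor launches a task.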

  def launchTask(context: ExecutorBackend, taskDescription: TaskDescription): Unit = {
    val tr = new TaskRunner(context, taskDescription)
    runningTasks.put(taskDescription.taskId, tr)
    threadPool.execute(tr)
  }

The taskDescription itself is initialized in TaskSetManager, which passes addedJars and addedFiles as arguments. These are the jars and files that the task will later download.


  // SPARK-21563 make a copy of the jars/files so they are consistent across the TaskSet
  private val addedJars = HashMap[String, Long](sched.sc.addedJars.toSeq: _*)
  private val addedFiles = HashMap[String, Long](sched.sc.addedFiles.toSeq: _*)

Both variables come from the addedFiles and addedJars fields of the job's SparkContext.

  // Used to store a URL for each static file/jar together with the file's local timestamp
  private[spark] val addedFiles = new ConcurrentHashMap[String, Long]().asScala
  private[spark] val addedJars = new ConcurrentHashMap[String, Long]().asScala

This raised a question: doesn't each session have its own SparkContext? So I went to look at what openSession actually does.

override def openSession(
                            protocol: TProtocolVersion,
                            username: String,
                            passwd: String,
                            ipAddress: String,
                            sessionConf: java.util.Map[String, String],
                            withImpersonation: Boolean,
                            delegationToken: String): SessionHandle = {
    val sessionHandle =
      super.openSession(protocol, username, passwd, ipAddress, sessionConf, withImpersonation,
        delegationToken)
    val session = super.getSession(sessionHandle)
    val ss = session.getSessionState
    val hiveConf = session.getHiveConf
    ss.initTxnMgr(hiveConf)
    val txnManager = ss.getTxnMgr

    val ctx = if (sqlContext.conf.hiveThriftServerSingleSession) {
      sqlContext
    } else {
      sqlContext.newSession()
    }
    // remaining code omitted
  }

The key is the line sqlContext.newSession().

  /**
   * Returns a [[SQLContext]] as new session, with separated SQL configurations, temporary
   * tables, registered functions, but sharing the same `SparkContext`, cached data and
   * other things.
   *
   * @since 1.6.0
   */
  def newSession(): SQLContext = sparkSession.newSession().sqlContext

One sentence in this comment is important:
Returns a [[SQLContext]] as new session, with separated SQL configurations, temporary tables, registered functions, but sharing the same SparkContext, cached data and other things.
In other words, all JDBC sessions share the same SparkContext, cached data, and so on. That explains the problem above.

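When SQL is executed through a JDBC connection to the Spark Thrift Server, multiple sessions share one SparkContext. When a session references a jar by creating a temporary function, that jar is added to the SparkContext permanently. From then on, the tasks of every job list it as a dependency and try to download it at execution time, preferring the locally cached copy as noted above. Unless the Thrift Server is restarted, the SparkContext stays shared. If the user later stops using the function and deletes its jar, then once an executor's local cache becomes invalid or the executor restarts, tasks go back to the file system to download the jar, hit the error above, and fail.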
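
The sharing described above can be modelled in a few lines. This is a simplified sketch, not Spark's implementation: Context and Session are hypothetical stand-ins for SparkContext and a Thrift Server session, and taskDependencies only imitates the copy that TaskSetManager takes.

```scala
import scala.collection.mutable

object SharedContextSketch {
  // Stand-in for SparkContext: one instance per Thrift Server process.
  final class Context {
    val addedJars = mutable.LinkedHashMap[String, Long]()
  }

  // Stand-in for a JDBC session: its own function registry, a shared Context.
  final class Session(val ctx: Context) {
    val tempFunctions = mutable.Set.empty[String]

    // New sessions get fresh session state but reuse the same Context.
    def newSession(): Session = new Session(ctx)

    def createTemporaryFunction(name: String, jarUri: String): Unit = {
      tempFunctions += name
      // Registered once on the shared context and never removed.
      if (!ctx.addedJars.contains(jarUri)) ctx.addedJars(jarUri) = System.currentTimeMillis()
    }

    // Every task set copies the context-wide jar map, whichever session runs it.
    def taskDependencies(): Map[String, Long] = ctx.addedJars.toMap
  }
}
```

If session A calls createTemporaryFunction and session B is created with newSession, B's tempFunctions stays empty, yet B's taskDependencies still contains A's jar. That is the leak behind the error.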

Solutions

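Having found the cause, we told the user not to delete the jar for now so that the environment stays usable, but that is only a stopgap. To actually fix the problem, we proposed two solutions:
1. When an executor fails to download a jar, only log the failure instead of throwing. If a function later cannot find its jar at execution time, it fails with an error, probably something like ClassNotFound, which tells the user to upload the jar again. This is fairly simple to implement but carries some risk: for jars specified through spark-submit or added in other ways, if the upload really did fail, the error at execution time may not point clearly at the cause, which makes troubleshooting harder.
2. On session close and on drop function, remove the function's jar from the SparkContext, so that other jobs do not pick up dependencies they do not need. This fixes the root cause, but development and testing are somewhat more involved.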
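
The second option could look roughly like the sketch below. It is a toy model under stated assumptions, not a patch against Spark: Context, Session, and in particular removeJar are hypothetical names, and reference counting is one possible way to avoid removing a jar that another session's function still uses.

```scala
import scala.collection.mutable

object JarCleanupSketch {
  // Stand-in for the shared SparkContext; removeJar is hypothetical.
  final class Context {
    val addedJars = mutable.HashMap[String, Long]()
    // How many live temporary functions still need each jar.
    private val refCounts = mutable.HashMap[String, Int]().withDefaultValue(0)

    def addJar(uri: String): Unit = synchronized {
      if (!addedJars.contains(uri)) addedJars(uri) = System.currentTimeMillis()
      refCounts(uri) += 1
    }

    // Drops the jar from the dependency map once nothing references it.
    def removeJar(uri: String): Unit = synchronized {
      val remaining = refCounts(uri) - 1
      if (remaining <= 0) { refCounts -= uri; addedJars -= uri }
      else refCounts(uri) = remaining
    }
  }

  final class Session(ctx: Context) {
    // Function name -> jar it was created from.
    private val functionJars = mutable.HashMap[String, String]()

    def createTemporaryFunction(name: String, jarUri: String): Unit = {
      ctx.addJar(jarUri)
      // Redefining a function releases the jar of the previous definition.
      functionJars.put(name, jarUri).foreach(ctx.removeJar)
    }

    def dropTemporaryFunction(name: String): Unit =
      functionJars.remove(name).foreach(ctx.removeJar)

    // Closing the session drops whatever the user forgot to drop.
    def close(): Unit = functionJars.keys.toList.foreach(dropTemporaryFunction)
  }
}
```

The sketch only covers the driver-side bookkeeping. Classes that executors have already loaded from the jar are outside its scope, which is part of why this option needs more development and testing.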

Summary

That is how this strange problem was discovered and tracked down. I am recording it here, and discussion is welcome.
