A field investigation: why a Spark query against Hive was slow, and what it reveals about Spark's architecture

A while back, a colleague on the data-mining team reported that a piece of pyspark code of theirs ran extremely slowly, even though the code itself is trivial: it queries a Hive view and caps the output with limit 10.
Here is the code:

from pyspark.sql import HiveContext
from pyspark.sql.functions import *
import json

hc = HiveContext(sc)  # sc: the SparkContext provided by the pyspark shell
hc.setConf("hive.exec.orc.split.strategy", "ETL")
hc.setConf("hive.security.authorization.enabled", "false")
zj_sql = 'select * from silver_ep.zj_v limit 10'
zj_df = hc.sql(zj_sql)
zj_df.collect()  # fetch the 10 rows back to the driver

The SQL merely selects 10 rows from a view, so it should return almost instantly. Instead, the job was still running after 30 minutes. The underlying table's data files are stored in parquet format.


Possible cause 1: are we using a deprecated Python API? We are on Spark 2.1.0, and the pyspark API specification for 2.1.0 says:

class pyspark.sql.HiveContext(sparkContext, jhiveContext=None)
A variant of Spark SQL that integrates with data stored in Hive.
Configuration for Hive is read from hive-site.xml on the classpath. It supports running both SQL and HiveQL commands.
Parameters: 
sparkContext – The SparkContext to wrap.
jhiveContext – An optional JVM Scala HiveContext. If set, we do not instantiate a new HiveContext in the JVM, instead we make all calls to this object.
Note Deprecated in 2.0.0. Use SparkSession.builder.enableHiveSupport().getOrCreate().

class pyspark.sql.SQLContext(sparkContext, sparkSession=None, jsqlContext=None)
The entry point for working with structured data (rows and columns) in Spark, in Spark 1.x.
As of Spark 2.0, this is replaced by SparkSession. However, we are keeping the class here for backward compatibility.
A SQLContext can be used create DataFrame, register DataFrame as tables, execute SQL over tables, cache tables, and read parquet files.

Spark 2.0+ no longer recommends SQLContext and HiveContext. My first instinct was that this was unlikely to be the cause: even though our code uses the old API, the Spark server itself is up to date, and an old client API surely does not somehow route to an old Spark runtime. Still, I tried the SparkSession entry point that newer Spark recommends; as expected, execution was just as slow. Cause ruled out.
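For reference, a minimal sketch of the SparkSession variant I tried (in the pyspark shell a session usually already exists as spark; the builder call follows the deprecation note quoted above):

from pyspark.sql import SparkSession

# Hive-enabled session, as the 2.x docs recommend over HiveContext.
spark = SparkSession.builder \
    .appName("zj_v-query") \
    .enableHiveSupport() \
    .getOrCreate()

zj_df = spark.sql('select * from silver_ep.zj_v limit 10')
zj_df.collect()  # just as slow: the entry point was not the problem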


Possible cause 2: the query targets a view, not an ordinary table. Querying a view adds extra work: the view's defining query must first be evaluated, and the outer query then runs over that result.
So I suspected the view's definition contained joins or other operations whose subqueries never finished, making the whole query slow. If that guess were right, the same SQL run directly in Hive should also be slow. But running it through beeline returned very quickly. The view's definition also turned out to be trivial:

CREATE VIEW `zj_v` AS SELECT `zj`.`hdate`,
       MD5(`zj`.`firmid`) AS `FIRM_ID`,
       `zj`.`allenablemoney`,
       `zj`.`alloutmoney`,
       `zj`.`zcmoney`,
       `zj`.`netzcmoney`,
       `zj`.`rzmoney`,
       `zj`.`rhmoney`,
       `zj`.`minmoney`
  FROM `SILVER_SILVER_NJSSEL`.`ZJ`

No joins or other expensive operations, just a simple projection. Cause ruled out.


Possible cause 3: a problem in Spark's own SQL engine.
The beeline route parses files and runs queries through Hadoop's MapReduce engine, while Spark parses with its own SQL engine. Could Spark's engine be missing some optimization? To check, I ran the same query in the spark-sql shell: it returned in about 2 seconds. So Spark's SQL engine handles this query fine; the slowness is specific to our pyspark session.


Possible cause 4: is our limit keyword not taking effect? That is, does Spark first ship all the data to the driver and only then apply the limit, scanning the entire table before the collect() action completes? Let's check the execution plan with explain:

>>> zj_df.explain(True)
== Parsed Logical Plan ==
'GlobalLimit 10
+- 'LocalLimit 10
   +- 'Project [*]
      +- 'UnresolvedRelation `silver_ep`.`zj_v`

== Analyzed Logical Plan ==
hdate: string, FIRM_ID: string, allenablemoney: string, alloutmoney: string, zcmoney: string, netzcmoney: string, rz
GlobalLimit 10
+- LocalLimit 10
   +- Project [hdate#38, FIRM_ID#37, allenablemoney#40, alloutmoney#41, zcmoney#42, netzcmoney#43, rzmoney#44, rhmon
      +- SubqueryAlias zj_v
         +- Project [hdate#38, md5(cast(firmid#39 as binary)) AS FIRM_ID#37, allenablemoney#40, alloutmoney#41, zcmo
            +- SubqueryAlias zj
               +- Relation[hdate#38,firmid#39,allenablemoney#40,alloutmoney#41,zcmoney#42,netzcmoney#43,rzmoney#44,r

== Optimized Logical Plan ==
GlobalLimit 10
+- LocalLimit 10
   +- Project [hdate#38, md5(cast(firmid#39 as binary)) AS FIRM_ID#37, allenablemoney#40, alloutmoney#41, zcmoney#42
      +- Relation[hdate#38,firmid#39,allenablemoney#40,alloutmoney#41,zcmoney#42,netzcmoney#43,rzmoney#44,rhmoney#45

== Physical Plan ==
CollectLimit 10
+- *Project [hdate#38, md5(cast(firmid#39 as binary)) AS FIRM_ID#37, allenablemoney#40, alloutmoney#41, zcmoney#42, 
   +- *BatchedScan parquet silver_silver_njssel.zj[hdate#38,firmid#39,allenablemoney#40,alloutmoney#41,zcmoney#42,ner/hive/warehouse/silver_silver_njssel.db/zj, PushedFilters: [], ReadSchema: struct<hdate:string,firmid:string,allena
>>> 

The explain output shows that once the driver receives our SQL, the "limit 10" becomes a
GlobalLimit 10
and beneath it the planner derives, for each worker (one or more executor processes per machine; in an interactive pyspark session these are in fact many executor threads under a single pyspark-launched process), a
LocalLimit 10
So each executor has already applied limit 10 by the time it produces its result: what it sends back is a locally limited result, not the full partition contents. The same pushdown can be reproduced on a synthetic DataFrame, as sketched below.
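A sketch, assuming a SparkSession named spark as above:

df = spark.range(1000).limit(10)
df.explain()
# The physical plan again shows CollectLimit 10 sitting directly on top of
# the scan, i.e. the limit is planned before any action ever runs.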
So where is the time actually being spent?


Possible cause 5: the collect() operation itself is what takes this long.

To observe more precisely what Spark does while executing our Hive query, I lowered pyspark's log level from WARN to INFO with sc.setLogLevel("INFO") (editing the log4j configuration had no visible effect), then re-ran the earlier query.

2017-02-21 20:46:52,757 INFO  [Executor task launch worker-8] datasources.FileScanRDD: Reading File path: hdfs://datah row]
2017-02-21 20:46:52,757 INFO  [Executor task launch worker-28] datasources.FileScanRDD: Reading File path: hdfs://datay row]
2017-02-21 20:46:52,756 INFO  [Executor task launch worker-11] datasources.FileScanRDD: Reading File path: hdfs://datay row]
2017-02-21 20:46:52,757 INFO  [Executor task launch worker-6] datasources.FileScanRDD: Reading File path: hdfs://datah row]
2017-02-21 20:46:52,757 INFO  [Executor task launch worker-21] datasources.FileScanRDD: Reading File path: hdfs://datay row]
2017-02-21 20:46:52,756 INFO  [Executor task launch worker-18] datasources.FileScanRDD: Reading File path: hdfs://datay row]
2017-02-21 20:46:52,756 INFO  [Executor task launch worker-16] datasources.FileScanRDD: Reading File path: hdfs://datay row]
2017-02-21 20:46:52,756 INFO  [Executor task launch worker-0] datasources.FileScanRDD: Reading File path: hdfs://datah row]
2017-02-21 20:46:52,756 INFO  [Executor task launch worker-15] datasources.FileScanRDD: Reading File path: hdfs://datay row]
2017-02-21 20:46:52,756 INFO  [Executor task launch worker-31] datasources.FileScanRDD: Reading File path: hdfs://datay row]
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further d

These logs show that Spark's scan strategy creates one worker task per data file. Because we ran this through pyspark, execution was in local mode: all executors are threads of one and the same process, so the whole job ran on a single machine, sharing one JVM heap.
Moreover, during these runs we frequently (though not deterministically) hit an OutOfMemoryError:

2017-02-22 12:07:37,724 ERROR [dag-scheduler-event-loop] scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerTaskEnd(0,0,ShuffleMapTask,ExceptionFailure(java.lang.OutOfMemoryError,Java heap space,[Ljava.lang.StackTraceElement;@394278bc,java.lang.OutOfMemoryError: Java heap space
    at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:755)
    at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:494)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:270)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:225)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:128)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
    at org.apache.spark.scheduler.Task.run(Task.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

The cause is obvious: the OOM happens while reading parquet data into memory. But that raises the next question: we only need 10 rows, so each executor should be able to stop after reading at most 10 rows. Why does it need to load so much of the parquet file into memory? I then checked the sizes of these parquet files:

-rwxr-xr-x   2 appuser supergroup  126287896 2017-02-22 09:08 hdfs://datahdfsmaster/hive/warehouse/silver_silver_njssel.db/zj/000000_0
-rwxr-xr-x   2 appuser supergroup  179992288 2017-02-22 09:08 hdfs://datahdfsmaster/hive/warehouse/silver_silver_njssel.db/zj/000001_0
-rwxr-xr-x   2 appuser supergroup  155053353 2017-02-22 09:08 hdfs://datahdfsmaster/hive/warehouse/silver_silver_njssel.db/zj/000002_0
-rwxr-xr-x   2 appuser supergroup  163026985 2017-02-22 09:08 hdfs://datahdfsmaster/hive/warehouse/silver_silver_njssel.db/zj/000003_0
-rwxr-xr-x   2 appuser supergroup  155736832 2017-02-22 09:08 hdfs://datahdfsmaster/hive/warehouse/silver_silver_njssel.db/zj/000004_0
-rwxr-xr-x   2 appuser supergroup  157311028 2017-02-22 09:08 hdfs://datahdfsmaster/hive/warehouse/silver_silver_njssel.db/zj/000005_0
-rwxr-xr-x   2 appuser supergroup  150175977 2017-02-22 09:08 hdfs://datahdfsmaster/hive/warehouse/silver_silver_njssel.db/zj/000006_0
-rwxr-xr-x   2 appuser supergroup  184228405 2017-02-22 09:08 hdfs://datahdfsmaster/hive/warehouse/silver_silver_njssel.db/zj/000007_0
-rwxr-xr-x   2 appuser supergroup  162361165 2017-02-22 09:08 hdfs://datahdfsmaster/hive/warehouse/silver_silver_njssel.db/zj/000008_0

Each file is around 150 MB, and the parquet storage layout dictates that a reader consumes a file one row group at a time, never row by row. Parsing these files with parquet-tools showed that most of them contain at most 2 row groups, which means every row group is very large.
So when Spark created multiple concurrent tasks to read these files, each task needed only one row group in memory at a time, but because all the tasks belong to the same process, together they could easily exhaust the heap.
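The row-group layout can also be checked programmatically; here is an illustrative pyarrow sketch (pyarrow is an assumption of this sketch, not a tool used in the original investigation, and the file would first be copied out of HDFS):

import pyarrow.parquet as pq

# Path is illustrative: a local copy of one of the zj data files.
pf = pq.ParquetFile("000000_0")
print("row groups:", pf.num_row_groups)
for i in range(pf.num_row_groups):
    rg = pf.metadata.row_group(i)
    print("row group %d: %d rows, %d bytes" % (i, rg.num_rows, rg.total_byte_size))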
How much memory did this process get when it started? Let's look at the details of the running JVM:

appuser  20739     1  3 Feb14 ?        06:00:46 /home/jdk/bin/java -cp /home/hbase/conf/:/home/spark/hadooplib/*:/home/spark/hivelib/*:/home/spark/hbaselib/*:/home/spark/kafkalib/*:/home/spark/extlib/*:/home/spark/conf/:/home/spark/jars/*:/home/hadoop/etc/hadoop/ -Dspark.history.ui.port=18080 -Dspark.history.fs.logDirectory=hdfs://datahdfsmaster/spark/history -Xmx1024m org.apache.spark.deploy.history.HistoryServer

As you can see, the process was given a 1 GB heap. Where is that configured? We can follow the launch scripts and code to find out:

pyspark:

export PYSPARK_DRIVER_PYTHON
export PYSPARK_DRIVER_PYTHON_OPTS
exec "${SPARK_HOME}"/bin/spark-submit pyspark-shell-main --name "PySparkShell" "$@"

spark-submit:

if [ -z "${SPARK_HOME}" ]; then
  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi
# disable randomized hash for string in Python 3.3+
export PYTHONHASHSEED=0
exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"

spark-class:

build_command() {
  "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
  printf "%d\0" $?
}

CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(build_command "$@")

COUNT=${#CMD[@]}
LAST=$((COUNT - 1))
LAUNCHER_EXIT_CODE=${CMD[$LAST]}
if [ $LAUNCHER_EXIT_CODE != 0 ]; then
  exit $LAUNCHER_EXIT_CODE
fi

CMD=("${CMD[@]:0:$LAST}")
exec "${CMD[@]}"

These three scripts make up pyspark's launch chain; spark-class is only reached when we actually run a job from pyspark. There,

"$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"

assembles the command used to spawn a standalone Java process (the -Xmx128m here belongs to the short-lived launcher helper, not to the process it builds), and finally

exec "${CMD[@]}"

replaces the shell with that independent Linux process, which runs our distributed job. $RUNNER is ${JAVA_HOME}/bin/java, and the real -Xmx parameter is computed inside

org.apache.spark.launcher.Main

along this call chain:

org.apache.spark.launcher.Main.main() [line 86]
  -> org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCommand(Map<String, String> env) [line 151]
  -> org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitCommand(Map<String, String> env)

Here is the decisive part of the implementation:

String tsMemory =
  isThriftServer(mainClass) ? System.getenv("SPARK_DAEMON_MEMORY") : null;
String memory = firstNonEmpty(tsMemory, config.get(SparkLauncher.DRIVER_MEMORY),
  System.getenv("SPARK_DRIVER_MEMORY"), System.getenv("SPARK_MEM"), DEFAULT_MEM);
cmd.add("-Xmx" + memory);

See it? The first non-empty value among the SPARK_DAEMON_MEMORY environment variable (thrift server only), the spark.driver.memory config entry, the SPARK_DRIVER_MEMORY and SPARK_MEM environment variables, and the default DEFAULT_MEM (1g) becomes the -Xmx of the launched process. In our case everything fell through to the default, so the JVM started with a 1 GB heap.
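Given that resolution order, the driver heap can be raised explicitly; a sketch of the options (the 4g figure is illustrative, and none of this was applied in the original run):

# At launch time, either of these sets the driver JVM's -Xmx:
#   pyspark --driver-memory 4g
#   spark-submit --driver-memory 4g my_job.py
# Or programmatically, before any JVM has been started:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("zj_v-query")
         .config("spark.driver.memory", "4g")  # ignored if the JVM is already running
         .enableHiveSupport()
         .getOrCreate())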
So can today's most popular distributed processing system really not handle such a simple operation in such a simple use case?
All we need is 10 rows; as soon as any one executor has those 10 rows, the goal is met, and there is no reason to wait for every executor to return its result.
I therefore switched from collect() to DataFrame.show(), and the result came back almost instantly; see the sketch below.
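A minimal sketch of the fast path (zj_df as defined at the top of the post):

# show() is built on take(), so it stops scanning partitions as soon as it
# has enough rows to display:
zj_df.show(10)   # returns almost immediately

# take() has the same early-exit behavior when the rows are needed as objects:
rows = zj_df.take(10)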
A quick comparison of show() against collect() reveals the difference. show() is implemented on top of take(), and although both collect() and take() ultimately submit their work and fetch results through SparkContext.runJob(), runJob() is overloaded and the two call different variants. Here are collect() and take() in the RDD source:

  def collect(): Array[T] = withScope {
    val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
    Array.concat(results: _*)
  }

  def take(num: Int): Array[T] = withScope {
    val scaleUpFactor = Math.max(conf.getInt("spark.rdd.limit.scaleUpFactor", 4), 2)
    if (num == 0) { // nothing requested: return an empty array immediately
      new Array[T](0)
    } else {
      val buf = new ArrayBuffer[T]
      val totalParts = this.partitions.length // number of partitions in this RDD
      var partsScanned = 0
      while (buf.size < num && partsScanned < totalParts) { // not enough rows yet, and some partitions remain unscanned
        // The number of partitions to try in this iteration. It is ok for this number to be
        // greater than totalParts because we actually cap it at totalParts in runJob.
        var numPartsToTry = 1L
        if (partsScanned > 0) {
          // If we didn't find any rows after the previous iteration, quadruple and retry.
          // Otherwise, interpolate the number of partitions we need to try, but overestimate
          // it by 50%. We also cap the estimation in the end.
          if (buf.isEmpty) {
            numPartsToTry = partsScanned * scaleUpFactor // last pass found nothing: scan more partitions next time
          } else {
            // the left side of max is >=1 whenever partsScanned >= 2
            numPartsToTry = Math.max((1.5 * num * partsScanned / buf.size).toInt - partsScanned, 1)
            numPartsToTry = Math.min(numPartsToTry, partsScanned * scaleUpFactor)
          }
        }

        val left = num - buf.size
        // Partition range for this pass: from where the previous pass stopped, up to the
        // smaller of (partsScanned + numPartsToTry) and the total partition count.
        val p = partsScanned.until(math.min(partsScanned + numPartsToTry, totalParts).toInt)
        val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p) // run the job on just those partitions

        res.foreach(buf ++= _.take(num - buf.size))
        partsScanned += p.size
      }
      buf.toArray
    }
  }

The runJob() variant that collect() calls is:

  /**
   * Run a job on all partitions in an RDD and return the results in an array.
   *
   * @param rdd target RDD to run tasks on
   * @param func a function to run on each partition of the RDD
   * @return in-memory collection with a result of the job (each collection element will contain
   * a result from one partition)
   */
  def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U): Array[U] = {
    runJob(rdd, func, 0 until rdd.partitions.length)
  }

while the runJob() that take() calls is:

  /**
   * Run a function on a given set of partitions in an RDD and return the results as an array.
   *
   * @param rdd target RDD to run tasks on
   * @param func a function to run on each partition of the RDD
   * @param partitions set of partitions to run on; some jobs may not want to compute on all
   * partitions of the target RDD, e.g. for operations like first()
   * @return in-memory collection with a result of the job (each collection element will contain
   * a result from one partition)
   */
  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: Iterator[T] => U,
      partitions: Seq[Int]): Array[U] = {
    val cleanedFunc = clean(func)
    runJob(rdd, (ctx: TaskContext, it: Iterator[T]) => cleanedFunc(it), partitions)
  }

The difference between the two runJob() variants is the partitions argument: the former runs tasks on every partition of the RDD, while the latter runs them only on a chosen subset. Reading take()'s source and comments makes its strategy clear: it keeps submitting small jobs over a growing range of partitions until it has gathered the requested number of rows, or, if every partition has been processed and the total is still short, it simply returns whatever it has collected. With the default scaleUpFactor of 4, for example, passes that find no rows scan 1, then 4, then 16 partitions, and so on. The toy model below walks through the same loop.
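To make that ramp-up concrete, here is a toy Python model of take()'s loop (a sketch mirroring the Scala above, with plain lists standing in for RDD partitions):

# partitions: a list of lists, one inner list per RDD partition
def simulate_take(num, partitions, scale_up=4):
    buf, parts_scanned, total = [], 0, len(partitions)
    while len(buf) < num and parts_scanned < total:
        num_parts_to_try = 1
        if parts_scanned > 0:
            if not buf:
                # previous pass found nothing: quadruple the partition count
                num_parts_to_try = parts_scanned * scale_up
            else:
                # interpolate how many partitions should suffice, overshoot by
                # 50%, and cap the estimate
                num_parts_to_try = max(int(1.5 * num * parts_scanned / len(buf)) - parts_scanned, 1)
                num_parts_to_try = min(num_parts_to_try, parts_scanned * scale_up)
        left = num - len(buf)
        upper = min(parts_scanned + num_parts_to_try, total)
        # one "job" over partitions [parts_scanned, upper); each task returns
        # at most `left` rows, just like it.take(left) above
        for res in (p[:left] for p in partitions[parts_scanned:upper]):
            buf.extend(res[:num - len(buf)])
        parts_scanned = upper
    return buf

# 9 partitions of 100 rows each: only the first partition is ever scanned.
print(len(simulate_take(10, [list(range(100))] * 9)))  # -> 10

Since every partition of our table had plenty of rows, take(10) needed a single pass over a single partition, which is exactly why show() came back instantly while collect() insisted on reading every file.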
