Analyzing slow and unresponsive Hive queries

Problem description

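A job that queries data through Hive ran slower each day for several consecutive days, and eventually began failing with execution errors.
[Figure: Hive queries slowing down]
[Figure: Hive query failure]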

Problem analysis

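Investigation showed that jobs connecting to Hive on machine A ran steadily, while every failing job connected to machine B. On machine B, the Hive process was using an abnormally high amount of CPU.
[Figure: high CPU usage]
We found that PID 4154 belongs to HiveServer2. Checking machine B's load in Cloudera Manager showed that the thread count had reached more than 18,000.
[Figure: high thread count]
Over the last thirty days the thread count rose continuously; even after a restart it climbed again over time. This pointed to a thread problem in HiveServer2.
[Figure: thread count rising steadily]
Tracing the logs showed that on June 1 the HiveServer2 JVM paused for about 3.7 s. We first suspected GC, although the JvmPauseMonitor entry itself reports "No GCs detected".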

2019-06-01 07:22:36,244 INFO  org.apache.hadoop.hive.ql.session.SessionState: [HiveServer2-Handler-Pool: Thread-268382]: Deleted directory: /tmp/hive/7b4cb2c9-5c18-435f-a7ed-05c2d705c23e on fs with scheme file
2019-06-01 07:22:36,244 INFO  hive.metastore: [HiveServer2-Handler-Pool: Thread-268382]: Closed a connection to metastore, current connections: 1
2019-06-01 11:00:11,604 WARN  org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: [AllocationFileReloader]: Failed to reload fair scheduler config file because last modified returned 0. File exists: false
2019-06-01 13:00:12,234 INFO  org.apache.hadoop.hive.common.JvmPauseMonitor: [org.apache.hadoop.hive.common.JvmPauseMonitor$Monitor@7dcb42a9]: Detected pause in JVM or host machine (eg GC): pause of approximately 3754ms
No GCs detected
2019-06-02 00:01:02,064 ERROR org.apache.hadoop.conf.Configuration: [HiveServer2-Handler-Pool: Thread-268382]: error parsing conf file:/opt/cm-5.12.0/run/cloudera-scm-agent/process/11384-hive-HIVESERVER2/hive-site.xml
java.io.FileNotFoundException: /opt/cm-5.12.0/run/cloudera-scm-agent/process/11384-hive-HIVESERVER2/hive-site.xml (沒有那個文件或目錄)
        at java.io.FileInputStream.open0(Native Method)
        at java.io.FileInputStream.open(FileInputStream.java:195)
        at java.io.FileInputStream.<init>(FileInputStream.java:138)
        at java.io.FileInputStream.<init>(FileInputStream.java:93)
        at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90)
        at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188)
        at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2482)
        at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2550)
        at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2516)
        at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2412)
        at org.apache.hadoop.conf.Configuration.get(Configuration.java:1234)
        at org.apache.hadoop.hive.conf.HiveConf.getVar(HiveConf.java:2697)
        at org.apache.hadoop.hive.conf.HiveConf.getVar(HiveConf.java:2718)
        at org.apache.hadoop.hive.conf.HiveConf.initialize(HiveConf.java:2790)
        at org.apache.hadoop.hive.conf.HiveConf.<init>(HiveConf.java:2733)
        at org.apache.hive.service.auth.AuthenticationProviderFactory.getAuthenticationProvider(AuthenticationProviderFactory.java:61)
        at org.apache.hive.service.auth.PlainSaslHelper$PlainServerCallbackHandler.handle(PlainSaslHelper.java:104)
        at org.apache.hive.service.auth.PlainSaslServer.evaluateResponse(PlainSaslServer.java:102)
        at org.apache.thrift.transport.TSaslTransport$SaslParticipant.evaluateChallengeOrResponse(TSaslTransport.java:539)
        at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:283)
        at org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
        at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
        at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:269)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2019-06-02 00:01:02,064 ERROR org.apache.thrift.server.TThreadPoolServer: [HiveServer2-Handler-Pool: Thread-268382]: Error occurred during processing of message.
java.lang.RuntimeException: java.io.FileNotFoundException: /opt/cm-5.12.0/run/cloudera-scm-agent/process/11384-hive-HIVESERVER2/hive-site.xml (沒有那個文件或目錄)
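
For context on the "Detected pause in JVM or host machine" entry above: a pause can be detected by sleeping for a fixed interval and checking how much later than requested the thread wakes up. The following is a minimal sketch of that sleep-and-measure idea only. It is not the code of Hive's JvmPauseMonitor; the class name PauseProbe, the 500 ms interval, and the 1000 ms threshold are assumptions made for illustration.

```java
public class PauseProbe {

    // Time spent beyond the requested sleep; values near zero mean no noticeable pause.
    static long extraPauseMillis(long requestedSleepMs, long actualElapsedMs) {
        return Math.max(0L, actualElapsedMs - requestedSleepMs);
    }

    // Sleep once and report how much later than requested the thread woke up.
    static long probeOnce(long sleepMs) {
        long start = System.nanoTime();
        try {
            Thread.sleep(sleepMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
        return extraPauseMillis(sleepMs, elapsedMs);
    }

    public static void main(String[] args) {
        long sleepMs = 500;          // hypothetical probe interval
        long reportThresholdMs = 1000; // hypothetical reporting threshold
        for (int i = 0; i < 10; i++) {
            long extra = probeOnce(sleepMs);
            if (extra > reportThresholdMs) {
                System.out.println("Detected pause of approximately " + extra + "ms");
            }
        }
    }
}
```

A probe like this cannot tell a GC pause from a stalled host, which is consistent with the log wording "JVM or host machine (eg GC)".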

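We also noticed that the thread [HiveServer2-Handler-Pool: Thread-268382] has a very large thread id. Could this pool hold too many threads, many of them never destroyed?

Further log entries, however, show this same thread id being reused; it does not grow with each incoming request.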
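
A pooled worker thread keeps its name across requests, which is why the same thread id keeps appearing in the log. The sketch below illustrates this with a plain java.util.concurrent pool; it is not HiveServer2's handler pool, and the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class WorkerReuseDemo {

    // Run `tasks` trivial jobs one after another and record which thread ran each.
    static List<String> workerNames(int poolSize, int tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        Callable<String> whoAmI = () -> Thread.currentThread().getName();
        List<String> names = new ArrayList<>();
        try {
            for (int i = 0; i < tasks; i++) {
                // get() waits for the job, so jobs run strictly in sequence
                names.add(pool.submit(whoAmI).get());
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
        return names;
    }

    public static void main(String[] args) {
        // With a single worker, every job reports the same thread name.
        System.out.println(workerNames(1, 3));
    }
}
```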

So we turned to the JVM itself.
First, log in to master01 and switch to the hive user.

$ sudo su - hive -s /bin/bash

Download and start Arthas, a Java diagnostic tool:

$ wget https://alibaba.github.io/arthas/arthas-boot.jar
$ java -jar arthas-boot.jar
[INFO] arthas-boot version: 3.1.1
[INFO] Process 15280 already using port 3658
[INFO] Process 15280 already using port 8563
[INFO] Found existing java process, please choose one and hit RETURN.
* [1]: 15280 org.apache.hadoop.util.RunJar
  [2]: 4171 org.apache.hadoop.util.RunJar
1
[INFO] arthas home: /var/lib/hive/.arthas/lib/3.1.1/arthas
[INFO] The target process already listen port 3658, skip attach.
[INFO] arthas-client connect 127.0.0.1 3658
  ,---.  ,------. ,--------.,--.  ,--.  ,---.   ,---.
 /  O  \ |  .--. ''--.  .--'|  '--'  | /  O  \ '   .-'
|  .-.  ||  '--'.'   |  |   |  .--.  ||  .-.  |`.  `-.
|  | |  ||  |\  \    |  |   |  |  |  ||  | |  |.-'    |
`--' `--'`--' '--'   `--'   `--'  `--'`--' `--'`-----'
 
 
wiki      https://alibaba.github.io/arthas
tutorials https://alibaba.github.io/arthas/arthas-tutorials
version   3.1.1
pid       15280
time      2019-06-12 16:57:33

List the threads:

$ thread

[Figure: Arthas thread dashboard]

Everywhere we looked, there were Get-Input-Paths threads.
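
When a dump holds thousands of threads, grouping them by name prefix makes the dominant pool obvious. The sketch below does this for thread names supplied as strings; its main method inspects only the JVM it runs in, so applying it to HiveServer2 would mean feeding it names taken from a thread dump. The class ThreadNameHistogram and its methods are hypothetical helpers, not part of Arthas or Hive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ThreadNameHistogram {

    // Strip a trailing "-<digits>" so "Get-Input-Paths-12" and
    // "Get-Input-Paths-13" land in the same bucket.
    static String prefixOf(String threadName) {
        return threadName.replaceFirst("-\\d+$", "");
    }

    // Count thread names per prefix, sorted by prefix.
    static Map<String, Integer> countByPrefix(Iterable<String> threadNames) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String name : threadNames) {
            counts.merge(prefixOf(name), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Demo input: the live threads of the current JVM.
        List<String> names = new ArrayList<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            names.add(t.getName());
        }
        countByPrefix(names).forEach(
                (prefix, n) -> System.out.println(n + "\t" + prefix));
    }
}
```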
After downloading the Hive source, we found the related thread pools in org.apache.hadoop.hive.ql.exec.Utilities:

ExecutorService pool = Executors.newFixedThreadPool(numExecutors,
        new ThreadFactoryBuilder().setDaemon(true).setNameFormat("Get-Input-Paths-%d").build());
 
 
executor = Executors.newFixedThreadPool(numExecutors,
        new ThreadFactoryBuilder().setDaemon(true)
                .setNameFormat("Get-Input-Summary-%d").build());

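In effect, every call to getInputPaths or getInputSummary creates a new thread pool. Was the pool not being shut down properly? Yet the source we had downloaded already contains the corresponding executor.shutdown() logic.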

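Searching Google for these keywords turned up the Jira issue [HIVE-16949] Leak of threads from Get-Input-Paths and Get-Input-Summary thread pool. It turns out that older Hive versions never shut these thread pools down; the bug was fixed in Hive 3.0. The Hive shipped with our CDH 5.12.0 indeed lacks the code that shuts the pools down.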
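
The leak pattern can be reproduced outside Hive. The sketch below creates a fixed thread pool on every call, once without shutdown() and once with shutdown() in a finally block, then counts the surviving threads by name. It illustrates the general pattern only and is not the HIVE-16949 patch; PoolLeakDemo and the prefixes Leaky-Pool and Safe-Pool are names invented for this example.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

public class PoolLeakDemo {

    // Daemon threads with a recognizable name prefix.
    static ThreadFactory namedDaemonFactory(String prefix) {
        AtomicInteger seq = new AtomicInteger();
        return r -> {
            Thread t = new Thread(r, prefix + "-" + seq.getAndIncrement());
            t.setDaemon(true);
            return t;
        };
    }

    static void runOneTask(ExecutorService pool) {
        try {
            pool.submit(() -> { }).get(); // wait for the no-op task to finish
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Leaky pattern: a new pool per call that is never shut down,
    // so its idle worker thread stays parked waiting for more work.
    static void leakyCall(String prefix) {
        ExecutorService pool = Executors.newFixedThreadPool(2, namedDaemonFactory(prefix));
        runOneTask(pool);
    }

    // Same work, but the pool is always shut down, letting the worker exit.
    static void safeCall(String prefix) {
        ExecutorService pool = Executors.newFixedThreadPool(2, namedDaemonFactory(prefix));
        try {
            runOneTask(pool);
        } finally {
            pool.shutdown();
        }
    }

    static long liveThreads(String prefix) {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(t -> t.isAlive() && t.getName().startsWith(prefix))
                .count();
    }

    // Poll until no thread with the prefix is alive, or the timeout expires.
    static boolean waitUntilGone(String prefix, long timeoutMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (liveThreads(prefix) > 0 && System.currentTimeMillis() < deadline) {
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return liveThreads(prefix) == 0;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            leakyCall("Leaky-Pool");
            safeCall("Safe-Pool");
        }
        waitUntilGone("Safe-Pool", 5000);
        System.out.println("Leaky-Pool threads alive: " + liveThreads("Leaky-Pool"));
        System.out.println("Safe-Pool threads alive:  " + liveThreads("Safe-Pool"));
    }
}
```

Each leaky call is expected to leave one idle worker behind, so the count grows with the number of calls. Over many queries this matches the steadily rising thread count observed on machine B.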

Solution

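  1. Download the Hive source for the matching version, hive-1.1.0-cdh5.12.0-src.tar.gz, from the CDH website, patch it, and replace hive-exec.jar on every node of the cluster. For the patch, follow the Jira issue above.
  2. Upgrade CDH to 5.12.2, which already includes the fix for HIVE-16949; see "Issues Fixed in CDH 5.12.x".

Choose either one of the two approaches above.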
