Production Pitfall Series: The Hive on Spark Connection Timeout Problem

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"起因","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"7/16凌晨,钉钉突然收到了一条告警,一个公司所有业务部门的组织架构表的ETL过程中,数据推送到DIM层的过程中出现异常,导致任务失败。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因为这个数据会影响到第二天所有大数据组对外的应用服务中组织架构基础数据,当然,我们的Pla-nB也不是吃素的,一旦出现错误,后面的权限管理模块与网关会自动配合切换前一天的最后一次成功处理到DIM中的组织架构数据,只会影响到在前一天做过组织架构变化的同事在系统上的操作,但是这个影响数量是可控的,并且我们会也有所有组织架构变化的审计数据,如果第二天这个推数的ETL修复不完的话,我们会手动按照审计数据对这些用户先进行操作,保证线上的稳定性。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"技术架构","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"集群:CDH 256G/64C计算物理集群 X 18台","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"调度:dolphin","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"数据抽取:datax","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DIM层数据库:Doris","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Hive版本:2.1.1","attrs":{}}]}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"告警","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"告警策略现在是有机器人去捕捉dolphin的告警邮件,发到钉钉群里,dolphin其实是可以获取到异常的,需要进行一系列的开发,但是担心复杂的调度过程会有任务监控的遗漏,导致告警丢失,这样就是大问题,所以简单粗暴,机器人代替人来读取邮件并发送告警到钉钉,这样只关注这个幸福来敲门的小可爱即可。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/21/21eaad0893443e3ba950429c5a66a3bf.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"集群log","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"kotlin"},"content":[{"type":"text","text":"Log Type: stderr\n\nLog Upload Time: Fri Jul 16 01:27:46 +0800 2021\n\nLog Length: 10569\n\nSLF4J: Class path contains multiple SLF4J bindings.\nSLF4J: Found binding in [jar:file:/data7/yarn/nm/usercache/dolphinscheduler/filecache/8096/__spark_libs__6065796770539359217.zip/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]\nSLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]\nSLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.\nSLF4J: Actual binding is of type 
### Cluster Log

The stderr log of the failed ApplicationMaster attempt (the two later copies of the identical stack trace are omitted):

```
Log Type: stderr

Log Upload Time: Fri Jul 16 01:27:46 +0800 2021

Log Length: 10569

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data7/yarn/nm/usercache/dolphinscheduler/filecache/8096/__spark_libs__6065796770539359217.zip/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
21/07/16 01:27:43 INFO util.SignalUtils: Registered signal handler for TERM
21/07/16 01:27:43 INFO util.SignalUtils: Registered signal handler for HUP
21/07/16 01:27:43 INFO util.SignalUtils: Registered signal handler for INT
21/07/16 01:27:43 INFO spark.SecurityManager: Changing view acls to: yarn,dolphinscheduler
21/07/16 01:27:43 INFO spark.SecurityManager: Changing modify acls to: yarn,dolphinscheduler
21/07/16 01:27:43 INFO spark.SecurityManager: Changing view acls groups to:
21/07/16 01:27:43 INFO spark.SecurityManager: Changing modify acls groups to:
21/07/16 01:27:43 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, dolphinscheduler); groups with view permissions: Set(); users with modify permissions: Set(yarn, dolphinscheduler); groups with modify permissions: Set()
21/07/16 01:27:43 INFO yarn.ApplicationMaster: ApplicationAttemptId: appattempt_1625364172078_3093_000001
21/07/16 01:27:43 INFO yarn.ApplicationMaster: Starting the user application in a separate Thread
21/07/16 01:27:43 INFO yarn.ApplicationMaster: Waiting for spark context initialization...
21/07/16 01:27:43 INFO client.RemoteDriver: Connecting to HiveServer2 address: hadoop-task-1.bigdata.xx.com:24173
21/07/16 01:27:44 INFO conf.HiveConf: Found configuration file file:/data8/yarn/nm/usercache/dolphinscheduler/filecache/8097/__spark_conf__.zip/__hadoop_conf__/hive-site.xml
21/07/16 01:27:44 ERROR yarn.ApplicationMaster: User class threw exception: java.util.concurrent.ExecutionException: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: hadoop-task-1.bigdata.xx.com/10.25.15.104:24173
java.util.concurrent.ExecutionException: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: hadoop-task-1.bigdata.xx.com/10.25.15.104:24173
    at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:41)
    at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:155)
    at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:559)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:673)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: hadoop-task-1.bigdata.xx.com/10.25.15.104:24173
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:715)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.ConnectException: Connection refused
    ... 10 more
21/07/16 01:27:44 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: User class threw exception: java.util.concurrent.ExecutionException: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: hadoop-task-1.bigdata.xx.com/10.25.15.104:24173
    [identical stack trace as above omitted]
)
21/07/16 01:27:44 ERROR yarn.ApplicationMaster: Uncaught exception:
org.apache.spark.SparkException: Exception thrown in awaitResult:
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
    at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:447)
    at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:275)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:805)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:804)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
    at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:804)
    at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
Caused by: java.util.concurrent.ExecutionException: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: hadoop-task-1.bigdata.xx.com/10.25.15.104:24173
    [identical stack trace as above omitted]
Caused by: java.net.ConnectException: Connection refused
    ... 10 more
21/07/16 01:27:44 INFO yarn.ApplicationMaster: Deleting staging directory hdfs://lbc/user/dolphinscheduler/.sparkStaging/application_1625364172078_3093
21/07/16 01:27:44 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/07/16 01:27:44 INFO util.ShutdownHookManager: Shutdown hook called
```

# Root-Cause Analysis

## The Ops Angle

### Resource Load

The ops team's first instinct is always to check load. But this 18-node compute cluster had been running steadily for quite a while, task parallelism in that late-night window was low, and every YARN queue is isolated. We also pulled all planned and actual execution schedules for our dolphin tasks straight out of dolphin's MySQL database. So "the cluster was saturated" is not a convincing explanation, and nothing in the evidence supported it.

### Network

For a `ConnectionTimeOut` the usual move is to ask the network team to pull the switch logs and sift through them, but unless something is seriously wrong with the network this rarely turns anything up. Sure enough, the network team reported that everything looked normal.

## The Development Angle

### Application Bugs

This ETL pipeline had been running correctly for several weeks, tracked the whole time by dolphin's e-mail monitoring, the DingTalk bot, and the task footprint monitoring, so an application bug was unlikely. I say "unlikely" rather than "impossible" because an ETL job is exposed to its sources: data types, data sets, and sudden upstream changes can all affect the downstream data movement, and some corner case we never thought of could have picked this moment to strike, in which case the job's robustness would need hardening. A first pass over the code ruled the application out.

### Open-Source Tooling

If the application is not at fault either, that is scarier news. We then have to reconstruct the execution flow of the open-source tooling from the logs, lay out the steps, and reason from the failure point through every way it could have happened; worst of all, the problem may not be reproducible. At this point you want someone who knows the open-source framework very well, or at least is familiar with how it runs internally; of course, anyone who has never traced this part of the source but is curious enough can start from scratch too. This time we had to analyze the problem from the architecture of the tooling itself.

# Problem Analysis

From the log, the failure happened inside Hive on Spark: the HQL had already been turned into a Spark job, and the Driver thread had been initialized in the ApplicationMaster. (I would like to write up how the Driver starts and how it relates to the Executors in a separate series when I find the time.)

## Pinpointing the Failure

The two most important log lines are:

```
21/07/16 01:27:43 INFO yarn.ApplicationMaster: Starting the user application in a separate Thread
21/07/16 01:27:43 INFO yarn.ApplicationMaster: Waiting for spark context initialization...
```

They are printed by the `runDriver` and `startUserApplication` methods of `org.apache.spark.deploy.yarn.ApplicationMaster.scala`:

- `runDriver`

```scala
  private def runDriver(): Unit = {
    addAmIpFilter(None)
    userClassThread = startUserApplication()

    // This a bit hacky, but we need to wait until the spark.driver.port property has
    // been set by the Thread executing the user class.
    logInfo("Waiting for spark context initialization...")
```

- `startUserApplication`

```scala
  /**
   * Start the user class, which contains the spark driver, in a separate Thread.
   * If the main routine exits cleanly or exits with System.exit(N) for any N
   * we assume it was successful, for all other cases we assume failure.
   *
   * Returns the user thread that was started.
   */
  private def startUserApplication(): Thread = {
    logInfo("Starting the user application in a separate Thread")
```

These two methods are how, in yarn-cluster mode, the ApplicationMaster launches the jar the user submitted. Normally the user code in that jar is plain data-processing logic: operator after operator, with stages split at shuffle operators and the job submitted when an action operator is hit.

## Analyzing the Possible Causes

### Reading the Error

Yet while the Driver thread was running, this error appeared:

```
21/07/16 01:27:43 INFO yarn.ApplicationMaster: Waiting for spark context initialization...
21/07/16 01:27:43 INFO client.RemoteDriver: Connecting to HiveServer2 address: hadoop-task-1.bigdata.xx.com:24173
21/07/16 01:27:44 INFO conf.HiveConf: Found configuration file file:/data8/yarn/nm/usercache/dolphinscheduler/filecache/8097/__spark_conf__.zip/__hadoop_conf__/hive-site.xml
21/07/16 01:27:44 ERROR yarn.ApplicationMaster: User class threw exception: java.util.concurrent.ExecutionException: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: hadoop-task-1.bigdata.xx.com/10.25.15.104:24173
java.util.concurrent.ExecutionException: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: hadoop-task-1.bigdata.xx.com/10.25.15.104:24173
```

My way of analyzing a problem is always evidence-driven: narrow the scope from large to small deliberately, strike out the impossible options, and pin down the answer among the ones that remain.

**From the log above we can conclude:**

1. The Driver had started.
2. Inside the Driver, a connection to HiveServer2 was made.
3. That connection failed with the `ConnectionTimeout` error.

### The Surface Cause

Reading the log, the failure is this: after the Driver started, the Driver thread connected back to HiveServer2 (Hive's server process), the connection timed out, and the task failed. At this point we know exactly where the problem is; the next question is why.

How does Hive on Spark actually work? Why does the Driver thread connect to the HiveServer2 service at all? And what in that flow can cause a timeout? With these questions in mind, the only way to find out is to go digging in the source code.
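Before the source dive, one quick sanity check worth running from the NodeManager host that ran the failing Driver is to probe the endpoint in the log directly. A rough sketch; the host and port are the ones printed above, and the RPC port is allocated per session, so this is only meaningful while a session is live:

```bash
# Is the HiveServer2-side RPC endpoint reachable from the node that ran the AM/Driver?
# 24173 is the per-session port from the log; it changes for every Hive on Spark session.
nc -vz -w 3 hadoop-task-1.bigdata.xx.com 24173

# Basic name-resolution and routing checks from the same host.
getent hosts hadoop-task-1.bigdata.xx.com
ping -c 3 10.25.15.104
```

In our case nothing at this level looked wrong, which is consistent with the network team's report and pushes the suspicion up to the application-level handshake.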
# Deep Dive

## The Hive on Spark (HOS) Mechanism

Hive on Spark means Hive's SQL (HQL) is executed on the Spark engine instead of the default MapReduce, using Spark's speed and computing power to replace the heavyweight native MR implementation.
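As a quick illustration of what "on Spark" means operationally, the engine is just a Hive setting; a minimal example, where the table name is made up and whether you set the property per session or globally in hive-site.xml depends on the deployment:

```bash
# Run one statement on the Spark engine instead of MapReduce.
# `dw.dim_org` is a made-up table used purely for illustration.
hive -e "
set hive.execution.engine=spark;
select count(*) from dw.dim_org;
"
```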
## The Hive on Spark Architecture

A diagram is needed here (it comes from the web, but judging by my reading of the source code the architecture it shows is correct).

Just form a rough impression of this big picture for now; it will be easier to understand when we come back to it while analyzing each piece in detail.

![](https://static001.geekbang.org/infoq/00/00c3c40ea6df9d571737be6568f30ce9.png)

## Hive on Spark: Technical Details

### Entry Point: SparkTask

Think of it this way: the HQL is submitted on the Hive client side, and after submission it goes through a series of transformations into a Spark job; the place where that conversion starts is `SparkTask`. How execution reaches `SparkTask` in the first place I have not studied closely yet; if it becomes necessary, or if anyone is interested, we can trace that part of the logic together later.

![](https://static001.geekbang.org/infoq/35/355f6e85a2743eae55c8761bec080ef7.png)

**Echoing the architecture diagram, submitting one HQL job (ignoring the later job monitoring) really comes down to two steps:**

- creating the session
- submitting through the session

**The rough call flow of those two steps:**

- **How the SparkSession is obtained:**
  - `sparkSession = SparkUtilities.getSparkSession(conf, sparkSessionManager)`
  - → `sparkSession = sparkSessionManager.getSession(sparkSession, conf, true);`
  - → `existingSession.open(conf);`
  - → `hiveSparkClient = HiveSparkClientFactory.createHiveSparkClient(conf);`
  - → `return new RemoteHiveSparkClient(hiveconf, sparkConf);`
  - → `createRemoteClient();`
  - → `remoteClient = SparkClientFactory.createClient(conf, hiveConf);`
  - → `return new SparkClientImpl(server, sparkConf, hiveConf);`
  - → `this.driverThread = startDriver(rpcServer, clientId, secret);`
  - → **the Driver jar is thrown at the Spark cluster**, i.e. effectively a `spark-submit --class xxxx.jar ...`

- **How `sparkSession.submit` works:**
  - `SparkJobRef jobRef = sparkSession.submit(driverContext, sparkWork);`
  - → `return hiveSparkClient.execute(driverContext, sparkWork);`
  - → [**RemoteHiveSparkClient**] `return submit(driverContext, sparkWork);`
  - → `JobHandle jobHandle = remoteClient.submit(job);`
  - → [**SparkClientImpl**] `return protocol.submit(job);`
  - → [**ClientProtocol**] `final io.netty.util.concurrent.Future rpc = driverRpc.call(new JobRequest(jobId, job));`

### Initialization: SparkSession

As everyone knows, a batch Spark job means the user writes a User Class, packages it into a jar, and drops that jar into the Spark cluster; in production we normally submit it to YARN with `--master yarn --deploy-mode cluster`.

I had always assumed that in HOS each submitted HQL is simply parsed into a Spark job and submitted to the cluster. But is that job packaged into a jar every single time, or is everything bundled into one jar up front? I had never looked into it carefully, and on reflection, building a jar per query would be a rather silly design. After reading the SparkSession implementation, the actual design of HOS becomes clear.

- **First, recall the simple model of submitting a Spark job**

![](https://static001.geekbang.org/infoq/c2/c2071dd39661371719d8fbae7e935cb1.png)

- **With HOS added, the flow becomes**

![](https://static001.geekbang.org/infoq/9a/9a3c09269f88e4a6f81cbdbe90ca0bf4.png)

Simply initializing a SparkSession ends up completing the submission of a User Class to the Spark cluster, and the way it does so is rather ingenious.

### Connecting HiveServer2 (HS2) to the Spark Cluster

The HQL is submitted to the HS2 server, and HS2 parses it, converts it into a SparkTask, and runs a series of steps. As shown in the diagram above, HS2 first stands up a Netty server via `SparkClientFactory.initialize`, then initializes a `SparkClientImpl` through `SparkClientFactory.createClient`; the `SparkClientImpl` constructor calls `startDriver`, and that method is where the `spark-submit` happens. The relevant snippet:

```java
if (sparkHome != null) {
  argv.add(new File(sparkHome, "bin/spark-submit").getAbsolutePath());
} else {
  LOG.info("No spark.home provided, calling SparkSubmit directly.");
  argv.add(new File(System.getProperty("java.home"), "bin/java").getAbsolutePath());

  if (master.startsWith("local") || master.startsWith("mesos") || master.endsWith("-client") || master.startsWith("spark")) {
    String mem = conf.get("spark.driver.memory");
    if (mem != null) {
      argv.add("-Xms" + mem);
      argv.add("-Xmx" + mem);
    }

    String cp = conf.get("spark.driver.extraClassPath");
    if (cp != null) {
      argv.add("-classpath");
      argv.add(cp);
    }
```

You can see it assembles a command line around `bin/spark-submit`.
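To make that concrete, the command HS2 ends up running is conceptually like the sketch below. The class name `org.apache.hive.spark.client.RemoteDriver` is real, but the jar path, deploy settings, and host/port values are illustrative placeholders, and arguments such as the client id and secret are left out:

```bash
# Illustrative only: roughly what SparkClientImpl.startDriver() assembles and launches.
$SPARK_HOME/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class org.apache.hive.spark.client.RemoteDriver \
  /opt/cloudera/parcels/CDH/lib/hive/lib/hive-exec.jar \
  --remote-host hs2-host.example.com \
  --remote-port 24173
```

In other words, the "User Class" that actually gets submitted is not our HQL at all; it is Hive's own `RemoteDriver`.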
**Recall what the code of a typical submitted User Class looks like:**

- **A typical User Class**
  - Normally the User Class (jar file) we submit to the Spark cluster is just a body of code, such as the word-count snippet below, in which
  - `reduceByKey` is a shuffle operator, splitting the job into 2 stages
  - `saveAsTextFile` is an action operator, which submits the whole job

```scala
object WordCount {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)

    // read
    val line: RDD[String] = sc.textFile(args(0))

    // flatmap
    val words: RDD[String] = line.flatMap(_.split(" "))

    val wordAndOne = words.map((_, 1))

    val wordAndCount = wordAndOne.reduceByKey(_ + _)

    // save to hdfs
    wordAndCount.saveAsTextFile(args(1))

    // close
    sc.stop()
  }
}
```

**The clever part of this flow**

- **The submitted User Class (`RemoteDriver.java`) is itself a Netty client.**
  - Once RemoteDriver has been shipped to the Spark cluster, it starts a Netty client and connects back to the Netty server on HS2, the very server we saw being built earlier.
- **The submitted HQL then travels between that Netty server and the Netty client (the RemoteDriver already running in the Spark cluster).**
  - From the figure we can see (or at least infer) that the submitted HQL is carried over this Netty link into the Spark cluster, where the actual computation takes place.

![](https://static001.geekbang.org/infoq/9a/9a3c09269f88e4a6f81cbdbe90ca0bf4.png)

### Inspecting the Suspicious Spot

The error log above tells us that after the Driver thread started RemoteDriver, the reverse connection back to HS2 timed out; in other words, the timeout occurred in the Netty RPC connection shown in the figure. So we need a closer look at how that connection is handled.

- **Details inside RemoteDriver**
  - During initialization there is this piece of code that builds the Netty client, and the comment in it is very telling: it is effectively warning us about a timeout. `Rpc.createClient` returns a Promise, and `.get()` is then called on it, blocking synchronously until `clientRpc` is obtained. If that wait is not kept under control, a timeout is indeed easy to hit.

```java
// The RPC library takes care of timing out this.
this.clientRpc = Rpc.createClient(mapConf, egroup, serverAddress, serverPort,
    clientId, secret, protocol).get();
this.running = true;
```

- **Details of the `Rpc.createClient` that RemoteDriver calls**
  - I have annotated the code inline. While building the client bootstrap, it uses a timeout constant that comes from `int connectTimeoutMs = (int) rpcConf.getConnectTimeoutMs();`

```java
  public static Promise<Rpc> createClient(
      Map<String, String> config,
      final NioEventLoopGroup eloop,
      String host,
      int port,
      final String clientId,
      final String secret,
      final RpcDispatcher dispatcher) throws Exception {
    final RpcConfiguration rpcConf = new RpcConfiguration(config);

    // Timeout for the client connecting back to the Netty server.
    int connectTimeoutMs = (int) rpcConf.getConnectTimeoutMs();

    final ChannelFuture cf = new Bootstrap()
        .group(eloop)
        .handler(new ChannelInboundHandlerAdapter() { })
        .channel(NioSocketChannel.class)
        .option(ChannelOption.SO_KEEPALIVE, true)
        // The connect timeout is applied here.
        .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, connectTimeoutMs)
        .connect(host, port);

    final Promise<Rpc> promise = eloop.next().newPromise();
    final AtomicReference<Rpc> rpc = new AtomicReference<Rpc>();

    // Set up a timeout to undo everything.
    final Runnable timeoutTask = new Runnable() {
      @Override
      public void run() {
        promise.setFailure(new TimeoutException("Timed out waiting for RPC server connection."));
      }
    };
    final ScheduledFuture<?> timeoutFuture = eloop.schedule(timeoutTask,
        rpcConf.getServerConnectTimeoutMs(), TimeUnit.MILLISECONDS);

    // The channel listener instantiates the Rpc instance when the connection is established,
    // and initiates the SASL handshake.
    cf.addListener(new ChannelFutureListener() {
      @Override
      public void operationComplete(ChannelFuture cf) throws Exception {
        if (cf.isSuccess()) {
          SaslClientHandler saslHandler = new SaslClientHandler(rpcConf, clientId, promise,
              timeoutFuture, secret, dispatcher);
          Rpc rpc = createRpc(rpcConf, saslHandler, (SocketChannel) cf.channel(), eloop);
          saslHandler.rpc = rpc;
          saslHandler.sendHello(cf.channel());
        } else {
          promise.setFailure(cf.cause());
        }
      }
    });

    // Handle cancellation of the promise.
    promise.addListener(new GenericFutureListener<Promise<Rpc>>() {
      @Override
      public void operationComplete(Promise<Rpc> p) {
        if (p.isCancelled()) {
          cf.cancel(true);
        }
      }
    });

    return promise;
  }
```

- **Tracking down this suspicious timeout**
  - As the code below shows, the default value of this timeout is 1000 ms. As soon as the client takes more than one second to connect to the server, the connection fails with exactly the timeout error described at the beginning of this post.

```java
  long getConnectTimeoutMs() {
    String value = config.get(HiveConf.ConfVars.SPARK_RPC_CLIENT_CONNECT_TIMEOUT.varname);
    return value != null ? Integer.parseInt(value) : DEFAULT_CONF.getTimeVar(
        HiveConf.ConfVars.SPARK_RPC_CLIENT_CONNECT_TIMEOUT, TimeUnit.MILLISECONDS);
  }

  SPARK_RPC_CLIENT_CONNECT_TIMEOUT("hive.spark.client.connect.timeout",
      "1000ms", new TimeValidator(TimeUnit.MILLISECONDS),
      "Timeout for remote Spark driver in connecting back to Hive client."),
```

### Verdict

As the analysis above shows, the timeout for the Driver thread connecting back to the HS2 server is set to 1 s; anything slower than that produces the timeout error. One second is simply too short and makes failures easy to trigger, so raising this timeout is enough to fix the problem and improve stability.
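Concretely, the knob is the `hive.spark.client.connect.timeout` property shown in the code above. A sketch of how we would raise it; the 30 s value is an example rather than a tuned recommendation, and the HQL file name is a placeholder:

```bash
# Per-job override (works as long as HiveServer2 does not restrict the property):
beeline -u "jdbc:hive2://hs2-host.example.com:10000" \
        --hiveconf hive.spark.client.connect.timeout=30000ms \
        -f push_org_to_dim.hql

# Cluster-wide: put the same property into hive-site.xml (on CDH, via the
# HiveServer2 safety valve) and restart HiveServer2:
#   <property>
#     <name>hive.spark.client.connect.timeout</name>
#     <value>30000ms</value>
#   </property>
```

The companion handshake timeout, `hive.spark.client.server.connect.timeout` (90 s by default), sits on the HS2 side of the same connection and is worth reviewing at the same time.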
# The Official Answer

In fact, I had already found an upstream bug fix for this exact problem early on in my searching.

See:

https://issues.apache.org/jira/browse/HIVE-16794

https://issues.apache.org/jira/secure/attachment/12872466/HIVE-16794.patch

The reason I did not simply apply that issue's change or upgrade Hive straight away was that I wanted to study the essence of the problem, and the fundamentals of HOS, in more detail first.

# What's Next

This troubleshooting exercise was only a quick pass over the basics of HOS. In a follow-up I will explain in detail how RemoteDriver submits jobs to Spark, and how a job is produced on the HS2 Netty server side and handed over to RemoteDriver.

# The End