Flink線上問題: The assigned slot container_xxx was removed

Flink線上問題: The assigned slot container_xxx was removed


客戶現場使用Flink(on Yarn)進行數據抽取,Source是JDBC,Sink是Kafka,客戶反映流程差不多跑10天左右就掛,讓我看看.

環境:

Flink: 1.5.2

jdk: 1.8.0_25

Hadoop: 2.4.1

jobmanger和TaskManger都分配1G內存

首先我看了一下我們系統收集到的日誌,有2段可能有用.

第一段:

2019-12-26 08:42:42,157 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@tdh04:46540] has failed, address is now gated for [50] ms. Reason: [Disassociated] 
2019-12-26 08:42:42,877 INFO  org.apache.flink.yarn.YarnResourceManager                     - Closing TaskExecutor connection container_1576488269936_0008_01_000008 because: Exception from container-launch.
Container id: container_e05_1576488269936_0008_01_000008
Exit code: 255
Stack trace: ExitCodeException exitCode=255: 
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:585)
	at org.apache.hadoop.util.Shell.run(Shell.java:482)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 255

第二段:

2019-12-26 08:42:42,877 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Unregister TaskManager f4660fcc70ee329e2427b5ed1245aa83 from the SlotManager.
2019-12-26 08:42:42,878 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: JDBC Source -> Timestamps/Watermarks -> sink_0_projection -> 數據輸出_0_CheckOutputTypeFunction -> Sink: Unnamed (1/1) (96b3d4c2da227693ac34a8e8d2a4abea) switched from RUNNING to FAILED.
org.apache.flink.util.FlinkException: The assigned slot container_1576488269936_0008_01_000008_0 was removed.
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:786)
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:756)
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:948)
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:372)
	at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:793)
	at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:339)
	at org.apache.flink.yarn.YarnResourceManager$$Lambda$212/930337248.run(Unknown Source)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
	at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
	at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
	at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
	at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
	at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
	at akka.actor.ActorCell.invoke(ActorCell.scala:495)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
	at akka.dispatch.Mailbox.run(Mailbox.scala:224)
	at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2019-12-26 08:42:42,879 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Discarding checkpoint 82951 of job 42983ed24d98360747a8a535f2a3c8ba because: The assigned slot container_1576488269936_0008_01_000008_0 was removed.
2019-12-26 08:42:42,879 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job flow_1575462616880_0011 (42983ed24d98360747a8a535f2a3c8ba) switched from state RUNNING to FAILING.
org.apache.flink.util.FlinkException: The assigned slot container_1576488269936_0008_01_000008_0 was removed.
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:786)
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:756)
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:948)
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:372)
	at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:793)
	at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:339)
	at org.apache.flink.yarn.YarnResourceManager$$Lambda$212/930337248.run(Unknown Source)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
	at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
	at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
	at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
	at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
	at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
	at akka.actor.ActorCell.invoke(ActorCell.scala:495)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
	at akka.dispatch.Mailbox.run(Mailbox.scala:224)
	at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2019-12-26 08:42:42,882 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Try to restart or fail the job flow_1575462616880_0011 (42983ed24d98360747a8a535f2a3c8ba) if no longer possible.
2019-12-26 08:42:42,882 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job flow_1575462616880_0011 (42983ed24d98360747a8a535f2a3c8ba) switched from state FAILING to FAILED.
org.apache.flink.util.FlinkException: The assigned slot container_1576488269936_0008_01_000008_0 was removed.
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:786)
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:756)
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:948)
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:372)
	at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:793)
	at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:339)
	at org.apache.flink.yarn.YarnResourceManager$$Lambda$212/930337248.run(Unknown Source)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
	at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
	at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
	at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
	at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
	at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
	at akka.actor.ActorCell.invoke(ActorCell.scala:495)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
	at akka.dispatch.Mailbox.run(Mailbox.scala:224)
	at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2019-12-26 08:42:42,882 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Could not restart the job flow_1575462616880_0011 (42983ed24d98360747a8a535f2a3c8ba) because the restart strategy prevented it.
org.apache.flink.util.FlinkException: The assigned slot container_1576488269936_0008_01_000008_0 was removed.
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:786)
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:756)
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:948)
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:372)
	at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:793)
	at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:339)
	at org.apache.flink.yarn.YarnResourceManager$$Lambda$212/930337248.run(Unknown Source)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
	at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
	at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
	at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
	at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
	at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
	at akka.actor.ActorCell.invoke(ActorCell.scala:495)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
	at akka.dispatch.Mailbox.run(Mailbox.scala:224)
	at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

看完日誌能得到的信息非常有限,就知道了container被移除了,至於爲什麼被移除還不知道.接下來去Yarn上看一下日誌.(Yarn要開啓日誌聚合,第一次讓我看問題的時候日誌聚合沒有開,我把日誌聚合打開後告訴他有問題之後再找我)

Yarn上有用的日誌:

jobmanager.out

2019-12-26 08:42:44,233 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.

taskmanager.err

SEVERE: Failed to resolve default logging config file: config/java.util.logging.properties
Uncaught error from thread [flink-akka.actor.default-dispatcher-4]: GC overhead limit exceeded, shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for for ActorSystem[flink]
java.lang.OutOfMemoryError: GC overhead limit exceeded
	at sun.reflect.AccessorGenerator.emitConstructor(AccessorGenerator.java:429)
	at sun.reflect.MethodAccessorGenerator.generate(MethodAccessorGenerator.java:379)
	at sun.reflect.MethodAccessorGenerator.generateSerializationConstructor(MethodAccessorGenerator.java:112)
	at sun.reflect.ReflectionFactory.newConstructorForSerialization(ReflectionFactory.java:340)
	at java.io.ObjectStreamClass.getSerializableConstructor(ObjectStreamClass.java:1376)
	at java.io.ObjectStreamClass.access$1500(ObjectStreamClass.java:72)
	at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:493)
	at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:468)
	at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
	at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:602)
	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1623)
	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1623)
	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
	at java.io.ObjectInputStream.readClass(ObjectInputStream.java:1484)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1334)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
	at org.apache.flink.runtime.rpc.messages.RemoteRpcInvocation$MethodInvocation.readObject(RemoteRpcInvocation.java:204)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:483)
	at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1896)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
	at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:502)
	at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:489)
	at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:477)

我按時間先後整理了一下:

2019-12-26 08:42:42,877  Container退出
2019-12-26 08:42:44.084  kill job
2019-12-26 08:42:44,233 ClusterEntrypoint  - RECEIVED SIGNAL 15
2019-12-26 08:42:45,005  FINISH_APPLICATION sent to absent application application_1576488269936_0008

首先是TaskManger的Container退出,其實這時候任務就失敗了,由於任務異常結束,系統主動kill Yarn上的任務,所以SIGNAL 15其實是我們自己發的,下面那條其是在Yarn上kill一個已經不存在的任務時發出的警告.

由於沒有找到明顯導致Container退出的原因,結合以下信息:

  1. 任務可以正常啓動
  2. 幾乎週期性的失敗
  3. akka報的OutOfMemory

猜測可能是存在內存泄露,然後我就去業務代碼(不是我寫的)裏看,看到JDBC PreparedStatement每次都弄個新的,並且沒有close,我猜測可能就是它導致的,然後寫了一段Demo,如下:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class JDBC {

    public static void main(String[] args) throws Exception {

        Class.forName("com.mysql.jdbc.Driver");
        Connection conn = DriverManager.getConnection(
            "jdbc:mysql://localhost:3306/test", "root", "123456");

        PreparedStatement preparedStatement;
        while (true) {
            preparedStatement = conn.prepareStatement("select count(*) from table_a");
            ResultSet resultSet = preparedStatement.executeQuery();
            while (resultSet.next()) {
                resultSet.getObject(1);
            }
            //preparedStatement.close();
            //Thread.sleep(10);
        }
    }
}

JVM參數:-Xmx100M

使用Visual VM觀察內存變化,啓動程序後很快就OOM了,所以加了個sleep,通過Visual VM觀察老年代一直在增長,後來發生了GC但是也不管用還是OOM了.

更改代碼把preparedStatement關閉之後老年代雖然也可以幾乎漲滿,但是GC過後內存就下來了.

如果不是任務啓動的時候因爲其他異常比如ClassNotFound導致的The assigned slot container_xxx was removed,並且週期性的出現問題,要優先考慮一下內存泄露.

歡迎關注公衆號:大數據開發者
在這裏插入圖片描述

發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章