Locating a Flink data-loss bug

This happened in the dev environment: data was being lost. At first I assumed my SQL was at fault; it took an afternoon, an evening, and the next morning to basically pin it down. First, the symptoms.

Restart the SQL job and send two records. The job runs in event-time mode with an allowed out-of-orderness delay of 1000 ms.

The two records are:

Record 1

{
"domain":"ymm-sms-webabcdefg0",
"hostName":"lzq",
"ipAddress":"192.168.56.256",
"metric":"alert.test",
"sample":0.1,
"step":30,
"tags":{"hostname":"abcdefg"},
"timestamp":1551865000000,
"value":13.0
}

Record 2

{
"domain":"ymm-sms-webabcdefg1",
"hostName":"lzq",
"ipAddress":"192.168.56.256",
"metric":"alert.test",
"sample":0.1,
"step":30,
"tags":{"hostname":"abcdefg"},
"timestamp":1551865000000,
"value":13.0
}

Note that the two records differ only in the domain suffix (one ends in 0, the other in 1), and both carry the same timestamp, 1551865000000.

The window size in my SQL's group-by is 5000 ms.

Since my allowed delay is 1000 ms, I then send a third, watermark-advancing record with timestamp 1551865000000 + 5000 + 1000 - 1.

In theory this should trigger the computation of two windows, one per domain.
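The trigger arithmetic above can be sanity-checked with a small sketch. This is my own reconstruction of the tumbling-window and bounded-out-of-orderness-watermark formulas, not code from the job:

```java
public class WindowMath {
    // Tumbling-window start for a timestamp (offset 0), as in Flink's TimeWindow
    static long windowStart(long ts, long size) { return ts - (ts % size); }
    // A window's max timestamp is end - 1; the event-time trigger fires once watermark >= maxTimestamp
    static long maxTimestamp(long ts, long size) { return windowStart(ts, size) + size - 1; }
    // Watermark produced by a bounded-out-of-orderness extractor with the given delay
    static long watermark(long maxSeenTs, long delay) { return maxSeenTs - delay; }

    public static void main(String[] args) {
        long size = 5000L, delay = 1000L, dataTs = 1551865000000L;
        long probeTs = dataTs + size + delay - 1;             // 1551865005999
        long wm = watermark(probeTs, delay);                  // 1551865004999
        // watermark equals the max timestamp of the window holding both records, so it fires
        System.out.println(wm >= maxTimestamp(dataTs, size)); // true
    }
}
```

The two records land in window [1551865000000, 1551865005000); the probe record advances the watermark to exactly 1551865004999, the window's max timestamp, which is why +5000+1000-1 is the smallest timestamp that should fire it.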

And here is the problem: only the record with suffix 1 comes out; the record with suffix 0 has vanished!

--- Now to figure out why it vanished.

The key is to localize this with a systematic procedure, otherwise it's just wasted time. My plan: first capture the call stack of a record that is processed normally, then see at which step the abnormal record falls off that path.

I had dug into the normal call stack before, so it was familiar. It is captured with a breakpoint at org.apache.flink.streaming.runtime.operators.windowing.WindowOperator.processElement; in text form:

[1] org.apache.flink.streaming.runtime.operators.windowing.WindowOperator.processElement (WindowOperator.java:295)
[2] org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput (StreamInputProcessor.java:202)
[3] org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run (OneInputStreamTask.java:103)
[4] org.apache.flink.streaming.runtime.tasks.StreamTask.invoke (StreamTask.java:306)
[5] org.apache.flink.runtime.taskmanager.Task.run (Task.java:703)
[6] java.lang.Thread.run (Thread.java:748)

This stack is worth a closer look to see what it tells us.

Here, printing currentRecordDeserializer in the debugger gives:

currentRecordDeserializer = "org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer@23772fa4"

Stepping into SpillingAdaptiveSpanningRecordDeserializer's getNextRecord method, we can see a read into a buffer; the read target is

target = "org.apache.flink.runtime.plugable.NonReusingDeserializationDelegate@28b94ca3"

This side only reads. So where does the data come from? In other words, who writes into this deserializer? Let's locate that.

The write-side call stack:

  [1] org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.setNextBuffer (SpillingAdaptiveSpanningRecordDeserializer.java:68)
  [2] org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput (StreamInputProcessor.java:214)
  [3] org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run (OneInputStreamTask.java:103)
  [4] org.apache.flink.streaming.runtime.tasks.StreamTask.invoke (StreamTask.java:306)
  [5] org.apache.flink.runtime.taskmanager.Task.run (Task.java:703)
  [6] java.lang.Thread.run (Thread.java:748)

Compare this with the first stack:

[1] org.apache.flink.streaming.runtime.operators.windowing.WindowOperator.processElement (WindowOperator.java:295)
[2] org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput (StreamInputProcessor.java:202)
[3] org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run (OneInputStreamTask.java:103)
[4] org.apache.flink.streaming.runtime.tasks.StreamTask.invoke (StreamTask.java:306)
[5] org.apache.flink.runtime.taskmanager.Task.run (Task.java:703)
[6] java.lang.Thread.run (Thread.java:748)

It is actually the same thread, so the picture is roughly clear: read data out of the buffer first, then deserialize it, and execute one of four branches depending on the parse result.
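That single-threaded read-then-parse loop can be caricatured like this (hypothetical stand-in types, not the real Flink classes): the same thread pulls a network buffer, hands it to the deserializer via setNextBuffer, then drains complete records via getNextRecord:

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class InputLoopSketch {
    static final Queue<byte[]> networkBuffers = new ArrayDeque<>(); // filled by the writer side
    static final Queue<Byte> pending = new ArrayDeque<>();          // stand-in for the spanning deserializer

    // like SpillingAdaptiveSpanningRecordDeserializer.setNextBuffer
    static void setNextBuffer(byte[] buf) { for (byte b : buf) pending.add(b); }
    // like getNextRecord: here, trivially, one byte == one complete record
    static Byte getNextRecord() { return pending.poll(); }

    public static void main(String[] args) {
        networkBuffers.add(new byte[]{1, 2});
        networkBuffers.add(new byte[]{3});
        int processed = 0;
        byte[] buf;
        while ((buf = networkBuffers.poll()) != null) {   // like fetching the next buffer/event
            setNextBuffer(buf);
            while (getNextRecord() != null) processed++;  // like operator.processElement(record)
        }
        System.out.println(processed); // 3
    }
}
```

The point of the caricature: if nothing ever lands in the buffer queue, the inner loop never runs, and a breakpoint inside processElement will never fire.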

The key question is where, at bottom, the data is read from. See below.

Set a breakpoint with `stop at org.apache.flink.streaming.runtime.io.StreamInputProcessor:209`

print barrierHandler
 barrierHandler = "org.apache.flink.streaming.runtime.io.BarrierTracker@66d1ea7e"

The code is at

org.apache.flink.streaming.runtime.io.BarrierTracker.getNextNonBlocked(), line=94 bci=0

Continue printing:

print inputGate
 inputGate = "org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate@60cca197"

The word "consumer" in the class name is a good sign.

Following the call further, it goes to

org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(), line=502 bci=0

Stepping into this function, everything lines up with what I had learned before.

So the next experiment: when the first record arrives, does this place get triggered at all?

How is the channel mentioned above used? That points us at

org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel.getNextBuffer

Breakpoint:

stop at org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel:186

Then print:
print subpartitionView.getClass()
 subpartitionView.getClass() = "class org.apache.flink.runtime.io.network.partition.PipelinedSubpartitionView"

The next question: who puts data into this subpartition?

Breakpoint at

org.apache.flink.runtime.io.network.partition.PipelinedSubpartition.add
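Conceptually (this is my own sketch, not the real class), a pipelined subpartition is just a lock-protected queue: the producer side add()s serialized buffers, and the consumer's view polls them:

```java
import java.util.ArrayDeque;

public class SubpartitionSketch {
    private final ArrayDeque<String> buffers = new ArrayDeque<>();

    // producer side: reached from RecordWriter.emit
    synchronized void add(String buffer) { buffers.add(buffer); }
    // consumer side: reached from LocalInputChannel.getNextBuffer via the view
    synchronized String pollBuffer() { return buffers.poll(); }
    synchronized boolean isAvailable() { return !buffers.isEmpty(); }

    public static void main(String[] args) {
        SubpartitionSketch p = new SubpartitionSketch();
        p.add("record-0");                   // writer task
        System.out.println(p.pollBuffer());  // reader task: record-0
        System.out.println(p.isAvailable()); // false
    }
}
```

If add() is never reached for some record, the consumer side simply has nothing to read, and every downstream breakpoint stays silent; that is exactly why the next step is to trace the sending side.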

 

At this point I traced the sending side in the forward direction; the emit call stack is:

  [1] org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit (RecordWriter.java:104)
  [2] org.apache.flink.streaming.runtime.io.StreamRecordWriter.emit (StreamRecordWriter.java:81)
  [3] org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter (RecordWriterOutput.java:107)
  [4] org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect (RecordWriterOutput.java:89)
  [5] org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect (RecordWriterOutput.java:45)
  [6] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:679)
  [7] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:657)
  [8] org.apache.flink.streaming.api.operators.TimestampedCollector.collect (TimestampedCollector.java:51)
  [9] org.apache.flink.table.runtime.RowtimeProcessFunction.processElement (RowtimeProcessFunction.scala:45)
  [10] org.apache.flink.table.runtime.RowtimeProcessFunction.processElement (RowtimeProcessFunction.scala:32)
  [11] org.apache.flink.streaming.api.operators.ProcessOperator.processElement (ProcessOperator.java:66)
  [12] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator (OperatorChain.java:560)
  [13] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:535)
  [14] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:515)
  [15] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:679)
  [16] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:657)
  [17] org.apache.flink.streaming.api.operators.TimestampedCollector.collect (TimestampedCollector.java:51)
  [18] org.apache.flink.table.runtime.CRowWrappingCollector.collect (CRowWrappingCollector.scala:37)
  [19] org.apache.flink.table.runtime.CRowWrappingCollector.collect (CRowWrappingCollector.scala:28)
  [20] DataStreamCalcRule$9091.processElement (null)
  [21] org.apache.flink.table.runtime.CRowProcessRunner.processElement (CRowProcessRunner.scala:66)
  [22] org.apache.flink.table.runtime.CRowProcessRunner.processElement (CRowProcessRunner.scala:35)
  [23] org.apache.flink.streaming.api.operators.ProcessOperator.processElement (ProcessOperator.java:66)
  [24] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator (OperatorChain.java:560)
  [25] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:535)
  [26] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:515)
  [27] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:679)
  [28] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:657)
  [29] org.apache.flink.streaming.api.operators.TimestampedCollector.collect (TimestampedCollector.java:51)
  [30] org.apache.flink.table.runtime.CRowWrappingCollector.collect (CRowWrappingCollector.scala:37)
  [31] org.apache.flink.table.runtime.CRowWrappingCollector.collect (CRowWrappingCollector.scala:28)
  [32] TableFunctionCollector$9065.collect (null)
  [33] org.apache.flink.table.functions.TableFunction.collect (TableFunction.scala:92)
  [34] com.ymm.hubble.metric.flink.config.RowsByMetricAndTags.eval (RowsByMetricAndTags.java:161)
  [35] DataStreamCorrelateRule$9023.processElement (null)
  [36] org.apache.flink.table.runtime.CRowCorrelateProcessRunner.processElement (CRowCorrelateProcessRunner.scala:79)
  [37] org.apache.flink.table.runtime.CRowCorrelateProcessRunner.processElement (CRowCorrelateProcessRunner.scala:35)
  [38] org.apache.flink.streaming.api.operators.ProcessOperator.processElement (ProcessOperator.java:66)
  [39] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator (OperatorChain.java:560)
  [40] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:535)
  [41] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:515)
  [42] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:679)
  [43] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:657)
  [44] org.apache.flink.streaming.api.operators.TimestampedCollector.collect (TimestampedCollector.java:51)
  [45] org.apache.flink.table.runtime.CRowWrappingCollector.collect (CRowWrappingCollector.scala:37)
  [46] org.apache.flink.table.runtime.CRowWrappingCollector.collect (CRowWrappingCollector.scala:28)
  [47] DataStreamCalcRule$8990.processElement (null)
  [48] org.apache.flink.table.runtime.CRowProcessRunner.processElement (CRowProcessRunner.scala:66)
  [49] org.apache.flink.table.runtime.CRowProcessRunner.processElement (CRowProcessRunner.scala:35)
  [50] org.apache.flink.streaming.api.operators.ProcessOperator.processElement (ProcessOperator.java:66)
  [51] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator (OperatorChain.java:560)
  [52] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:535)
  [53] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:515)
  [54] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:679)
  [55] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:657)
  [56] org.apache.flink.streaming.runtime.operators.TimestampsAndPeriodicWatermarksOperator.processElement (TimestampsAndPeriodicWatermarksOperator.java:67)
  [57] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator (OperatorChain.java:560)
  [58] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:535)
  [59] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:515)
  [60] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:679)
  [61] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:657)
  [62] org.apache.flink.streaming.api.operators.TimestampedCollector.collect (TimestampedCollector.java:51)
  [63] org.apache.flink.table.runtime.CRowWrappingCollector.collect (CRowWrappingCollector.scala:37)
  [64] org.apache.flink.table.runtime.CRowWrappingCollector.collect (CRowWrappingCollector.scala:28)
  [65] DataStreamSourceConversion$8954.processElement (null)
  [66] org.apache.flink.table.runtime.CRowOutputProcessRunner.processElement (CRowOutputProcessRunner.scala:67)
  [67] org.apache.flink.streaming.api.operators.ProcessOperator.processElement (ProcessOperator.java:66)
  [68] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator (OperatorChain.java:560)
  [69] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:535)
  [70] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:515)
  [71] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:679)
  [72] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:657)
  [73] org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollectWithTimestamp (StreamSourceContexts.java:310)
  [74] org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collectWithTimestamp (StreamSourceContexts.java:409)
  [75] org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.emitRecordWithTimestamp (AbstractFetcher.java:398)
  [76] org.apache.flink.streaming.connectors.kafka.internal.Kafka010Fetcher.emitRecord (Kafka010Fetcher.java:89)
  [77] org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.runFetchLoop (Kafka09Fetcher.java:154)
  [78] org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run (FlinkKafkaConsumerBase.java:721)
  [79] org.apache.flink.streaming.api.operators.StreamSource.run (StreamSource.java:87)
  [80] org.apache.flink.streaming.api.operators.StreamSource.run (StreamSource.java:56)
  [81] org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run (SourceStreamTask.java:99)
  [82] org.apache.flink.streaming.runtime.tasks.StreamTask.invoke (StreamTask.java:306)
  [83] org.apache.flink.runtime.taskmanager.Task.run (Task.java:703)
  [84] java.lang.Thread.run (Thread.java:748)

I repeated the test several times (screenshot not reproduced here).

To state the symptom more precisely: in a freshly started job, sending the first record does not trigger org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit at all, while the second record does trigger it. So breaking at RecordWriter.emit is too late; the breakpoint has to move earlier!

The question to decide: was the message never read from Kafka at all, or was it read but never reached RecordWriter.emit?

-------------------------------------------------------------------------------------

// where messages are first read from Kafka
stop in org.apache.flink.streaming.connectors.kafka.internal.Kafka010Fetcher.emitRecord

// where records are emitted into the channel's buffers (possibly involving a remote transfer)
stop in org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit

// where records are picked up for processing
stop at org.apache.flink.streaming.runtime.operators.windowing.WindowOperator:299

By moving the first breakpoint step by step, the failure point can be bracketed.

In the end it turned out to be a bug in our own code: the first record triggered the lazy initialization of the config-pulling thread and was then dropped; the second record was processed normally.
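The bug shape, reconstructed as a minimal hypothetical (the real code pulled configuration; all names here are invented): the first call into the per-record path only kicks off initialization and returns without forwarding the record, so exactly one record is silently swallowed:

```java
import java.util.ArrayList;
import java.util.List;

public class LazyInitDrop {
    private boolean configLoaded = false;       // hypothetical lazy config state
    final List<String> emitted = new ArrayList<>();

    // buggy per-record path: the first record pays for initialization and is dropped
    void processElement(String record) {
        if (!configLoaded) {
            configLoaded = true;                // e.g. start the config-pulling thread
            return;                             // BUG: the record is not forwarded
        }
        emitted.add(record);
    }

    public static void main(String[] args) {
        LazyInitDrop op = new LazyInitDrop();
        op.processElement("domain-suffix-0");   // swallowed by initialization
        op.processElement("domain-suffix-1");   // processed normally
        System.out.println(op.emitted);         // [domain-suffix-1]
    }
}
```

The general fix for this pattern is to do one-time setup in a dedicated lifecycle hook (in Flink, a rich function's open()), never in the per-record path, so that the first record takes the same code path as every later one.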
