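In the development environment I found that data was being lost. At first I assumed my own SQL was at fault; after an afternoon, an evening, and the following morning I had basically pinned it down. First, the symptoms.
Restart the SQL service and send two records. Everything runs in event-time mode, with the delay set to 1000 ms.
The two records sent are as follows:
Record 1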
{
"domain":"ymm-sms-webabcdefg0",
"hostName":"lzq",
"ipAddress":"192.168.56.256",
"metric":"alert.test",
"sample":0.1,
"step":30,
"tags":{"hostname":"abcdefg"},
"timestamp":1551865000000,
"value":13.0
}
Record 2
{
"domain":"ymm-sms-webabcdefg1",
"hostName":"lzq",
"ipAddress":"192.168.56.256",
"metric":"alert.test",
"sample":0.1,
"step":30,
"tags":{"hostname":"abcdefg"},
"timestamp":1551865000000,
"value":13.0
}
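A note on these two records: the domain differs, one ending in 0 and the other in 1; also note that both timestamps are 1551865000000.
The window grouping interval in my SQL is 5000 ms.
Because I set the delay to 1000 ms, I gave the next message the timestamp 1551865000000+5000+1000-1.
In theory this should trigger the computation of 2 windows.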
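The arithmetic behind that timestamp can be checked by hand with a small sketch. It assumes tumbling 5000 ms windows aligned to the epoch and a watermark equal to the largest timestamp seen minus the 1000 ms delay. The class and method names (WindowMath, windowStart, watermark, fires) are made up for illustration and are not Flink APIs.

```java
/** Hypothetical helper, not a Flink class: redoes the window arithmetic by hand. */
final class WindowMath {
    /** Start of the epoch-aligned tumbling window that contains the timestamp. */
    static long windowStart(long timestamp, long size) {
        return timestamp - Math.floorMod(timestamp, size);
    }

    /** Assumed watermark rule: largest timestamp seen so far minus the delay. */
    static long watermark(long maxTimestamp, long delay) {
        return maxTimestamp - delay;
    }

    /** A window [start, start + size) fires once the watermark reaches its last millisecond. */
    static boolean fires(long windowStart, long size, long watermark) {
        return watermark >= windowStart + size - 1;
    }

    public static void main(String[] args) {
        long size = 5000L;
        long delay = 1000L;
        long first = 1551865000000L;              // timestamp of both test records
        long third = first + 5000 + 1000 - 1;     // timestamp of the follow-up message

        long start = windowStart(first, size);
        long wm = watermark(third, delay);
        System.out.println("window    = [" + start + ", " + (start + size) + ")");
        System.out.println("watermark = " + wm);
        System.out.println("fires     = " + fires(start, size, wm));
    }
}
```

Under these assumptions the follow-up message carries the earliest timestamp that closes the window; one millisecond less would not be enough. Both test records fall into the same interval, so if the SQL groups by domain, two window results are expected.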
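Here is the problem: only the record with suffix 1 comes out; the message with suffix 0 has vanished!
--- Now I have to figure out why it vanished. What a trap!
The key question is how to locate the fault and which steps to take; otherwise it is just wasted time. My idea: first capture the call stack through which a normal record gets processed, then see at which step the abnormal one goes wrong.
I had studied the normal call stack before and know it fairly well, so here it is directly:
Via a breakpoint on org.apache.flink.streaming.runtime.operators.windowing.WindowOperator.processElement
Text version of the call stack: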
[1] org.apache.flink.streaming.runtime.operators.windowing.WindowOperator.processElement (WindowOperator.java:295)
[2] org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput (StreamInputProcessor.java:202)
[3] org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run (OneInputStreamTask.java:103)
[4] org.apache.flink.streaming.runtime.tasks.StreamTask.invoke (StreamTask.java:306)
[5] org.apache.flink.runtime.taskmanager.Task.run (Task.java:703)
[6] java.lang.Thread.run (Thread.java:748)
This call stack is worth studying first to see what conclusions it yields.
Printing currentRecordDeserializer in the debugger gives:
currentRecordDeserializer = "org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer@23772fa4"
Stepping into the getNextRecord method of SpillingAdaptiveSpanningRecordDeserializer, you can see
a read operation that reads into a buffer; the target of the read is
target = "org.apache.flink.runtime.plugable.NonReusingDeserializationDelegate@28b94ca3"
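All of this is reading, so where does the data come from? In other words, who writes into it? Let's track that down.
The call stack is as follows: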
[1] org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.setNextBuffer (SpillingAdaptiveSpanningRecordDeserializer.java:68)
[2] org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput (StreamInputProcessor.java:214)
[3] org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run (OneInputStreamTask.java:103)
[4] org.apache.flink.streaming.runtime.tasks.StreamTask.invoke (StreamTask.java:306)
[5] org.apache.flink.runtime.taskmanager.Task.run (Task.java:703)
[6] java.lang.Thread.run (Thread.java:748)
Compare it with the first one above:
[1] org.apache.flink.streaming.runtime.operators.windowing.WindowOperator.processElement (WindowOperator.java:295)
[2] org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput (StreamInputProcessor.java:202)
[3] org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run (OneInputStreamTask.java:103)
[4] org.apache.flink.streaming.runtime.tasks.StreamTask.invoke (StreamTask.java:306)
[5] org.apache.flink.runtime.taskmanager.Task.run (Task.java:703)
[6] java.lang.Thread.run (Thread.java:748)
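They are in fact the same thread, so the picture is roughly clear: first read the data in the buffer, then parse it, and take one of 4 branches depending on the result.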
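The read-then-dispatch step can be modelled with a small sketch. As far as I recall the Flink source of that era, the four branches in StreamInputProcessor.processInput are watermark, stream status, latency marker, and data record, with only the last one reaching the operator's processElement; treat that as an assumption. None of the types below are Flink classes; they are stand-ins for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for the dispatch step; none of these types are Flink classes. */
final class DispatchSketch {
    /** The four kinds of element assumed to come out of the deserializer. */
    enum Kind { WATERMARK, STREAM_STATUS, LATENCY_MARKER, RECORD }

    static final class Element {
        final Kind kind;
        final String payload;

        Element(Kind kind, String payload) {
            this.kind = kind;
            this.payload = payload;
        }
    }

    /** Records which branch handled each element, in arrival order. */
    final List<String> log = new ArrayList<>();

    /** Returns true only when a data record was handed to the operator. */
    boolean dispatch(Element e) {
        switch (e.kind) {
            case WATERMARK:
                log.add("watermark:" + e.payload);
                return false;
            case STREAM_STATUS:
                log.add("streamStatus:" + e.payload);
                return false;
            case LATENCY_MARKER:
                log.add("latencyMarker:" + e.payload);
                return false;
            default:
                // Only this branch corresponds to WindowOperator.processElement in the stack above.
                log.add("processElement:" + e.payload);
                return true;
        }
    }

    public static void main(String[] args) {
        DispatchSketch d = new DispatchSketch();
        d.dispatch(new Element(Kind.RECORD, "ymm-sms-webabcdefg1"));
        d.dispatch(new Element(Kind.WATERMARK, "1551865004999"));
        System.out.println(d.log);
    }
}
```

The point of the model is that a breakpoint on processElement only proves that a record reached the record branch; a record that never arrives in the buffer leaves no trace here at all.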
The key is where the data is ultimately read from; see below.
We set a breakpoint with stop at org.apache.flink.streaming.runtime.io.StreamInputProcessor:209
print barrierHandler
barrierHandler = "org.apache.flink.streaming.runtime.io.BarrierTracker@66d1ea7e"
The code is at
org.apache.flink.streaming.runtime.io.BarrierTracker.getNextNonBlocked(), line=94 bci=0
This is what shows up there.
Keep printing:
print inputGate
inputGate = "org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate@60cca197"
The word "consumer" shows up, nice.
Continuing the trace, the call goes to
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(), line=502 bci=0
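Step into this function and take a look.
At this point it all connects with what I had learned earlier!
So first check whether this spot gets hit when the first message arrives! On with the experiment.
How is the channel mentioned above used?
So now we turn our attention to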
org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel.getNextBuffer
Breakpoint at
stop at org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel:186
Then print
print subpartitionView.getClass()
subpartitionView.getClass() = "class org.apache.flink.runtime.io.network.partition.PipelinedSubpartitionView"
The next question: who puts data in there?
Breakpoint at
org.apache.flink.runtime.io.network.partition.PipelinedSubpartition.add
At this point I trace the call stack of the send path, in the forward direction:
[1] org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit (RecordWriter.java:104)
[2] org.apache.flink.streaming.runtime.io.StreamRecordWriter.emit (StreamRecordWriter.java:81)
[3] org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter (RecordWriterOutput.java:107)
[4] org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect (RecordWriterOutput.java:89)
[5] org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect (RecordWriterOutput.java:45)
[6] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:679)
[7] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:657)
[8] org.apache.flink.streaming.api.operators.TimestampedCollector.collect (TimestampedCollector.java:51)
[9] org.apache.flink.table.runtime.RowtimeProcessFunction.processElement (RowtimeProcessFunction.scala:45)
[10] org.apache.flink.table.runtime.RowtimeProcessFunction.processElement (RowtimeProcessFunction.scala:32)
[11] org.apache.flink.streaming.api.operators.ProcessOperator.processElement (ProcessOperator.java:66)
[12] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator (OperatorChain.java:560)
[13] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:535)
[14] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:515)
[15] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:679)
[16] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:657)
[17] org.apache.flink.streaming.api.operators.TimestampedCollector.collect (TimestampedCollector.java:51)
[18] org.apache.flink.table.runtime.CRowWrappingCollector.collect (CRowWrappingCollector.scala:37)
[19] org.apache.flink.table.runtime.CRowWrappingCollector.collect (CRowWrappingCollector.scala:28)
[20] DataStreamCalcRule$9091.processElement (null)
[21] org.apache.flink.table.runtime.CRowProcessRunner.processElement (CRowProcessRunner.scala:66)
[22] org.apache.flink.table.runtime.CRowProcessRunner.processElement (CRowProcessRunner.scala:35)
[23] org.apache.flink.streaming.api.operators.ProcessOperator.processElement (ProcessOperator.java:66)
[24] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator (OperatorChain.java:560)
[25] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:535)
[26] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:515)
[27] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:679)
[28] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:657)
[29] org.apache.flink.streaming.api.operators.TimestampedCollector.collect (TimestampedCollector.java:51)
[30] org.apache.flink.table.runtime.CRowWrappingCollector.collect (CRowWrappingCollector.scala:37)
[31] org.apache.flink.table.runtime.CRowWrappingCollector.collect (CRowWrappingCollector.scala:28)
[32] TableFunctionCollector$9065.collect (null)
[33] org.apache.flink.table.functions.TableFunction.collect (TableFunction.scala:92)
[34] com.ymm.hubble.metric.flink.config.RowsByMetricAndTags.eval (RowsByMetricAndTags.java:161)
[35] DataStreamCorrelateRule$9023.processElement (null)
[36] org.apache.flink.table.runtime.CRowCorrelateProcessRunner.processElement (CRowCorrelateProcessRunner.scala:79)
[37] org.apache.flink.table.runtime.CRowCorrelateProcessRunner.processElement (CRowCorrelateProcessRunner.scala:35)
[38] org.apache.flink.streaming.api.operators.ProcessOperator.processElement (ProcessOperator.java:66)
[39] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator (OperatorChain.java:560)
[40] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:535)
[41] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:515)
[42] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:679)
[43] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:657)
[44] org.apache.flink.streaming.api.operators.TimestampedCollector.collect (TimestampedCollector.java:51)
[45] org.apache.flink.table.runtime.CRowWrappingCollector.collect (CRowWrappingCollector.scala:37)
[46] org.apache.flink.table.runtime.CRowWrappingCollector.collect (CRowWrappingCollector.scala:28)
[47] DataStreamCalcRule$8990.processElement (null)
[48] org.apache.flink.table.runtime.CRowProcessRunner.processElement (CRowProcessRunner.scala:66)
[49] org.apache.flink.table.runtime.CRowProcessRunner.processElement (CRowProcessRunner.scala:35)
[50] org.apache.flink.streaming.api.operators.ProcessOperator.processElement (ProcessOperator.java:66)
[51] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator (OperatorChain.java:560)
[52] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:535)
[53] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:515)
[54] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:679)
[55] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:657)
[56] org.apache.flink.streaming.runtime.operators.TimestampsAndPeriodicWatermarksOperator.processElement (TimestampsAndPeriodicWatermarksOperator.java:67)
[57] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator (OperatorChain.java:560)
[58] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:535)
[59] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:515)
[60] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:679)
[61] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:657)
[62] org.apache.flink.streaming.api.operators.TimestampedCollector.collect (TimestampedCollector.java:51)
[63] org.apache.flink.table.runtime.CRowWrappingCollector.collect (CRowWrappingCollector.scala:37)
[64] org.apache.flink.table.runtime.CRowWrappingCollector.collect (CRowWrappingCollector.scala:28)
[65] DataStreamSourceConversion$8954.processElement (null)
[66] org.apache.flink.table.runtime.CRowOutputProcessRunner.processElement (CRowOutputProcessRunner.scala:67)
[67] org.apache.flink.streaming.api.operators.ProcessOperator.processElement (ProcessOperator.java:66)
[68] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator (OperatorChain.java:560)
[69] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:535)
[70] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:515)
[71] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:679)
[72] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:657)
[73] org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollectWithTimestamp (StreamSourceContexts.java:310)
[74] org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collectWithTimestamp (StreamSourceContexts.java:409)
[75] org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.emitRecordWithTimestamp (AbstractFetcher.java:398)
[76] org.apache.flink.streaming.connectors.kafka.internal.Kafka010Fetcher.emitRecord (Kafka010Fetcher.java:89)
[77] org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.runFetchLoop (Kafka09Fetcher.java:154)
[78] org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run (FlinkKafkaConsumerBase.java:721)
[79] org.apache.flink.streaming.api.operators.StreamSource.run (StreamSource.java:87)
[80] org.apache.flink.streaming.api.operators.StreamSource.run (StreamSource.java:56)
[81] org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run (SourceStreamTask.java:99)
[82] org.apache.flink.streaming.runtime.tasks.StreamTask.invoke (StreamTask.java:306)
[83] org.apache.flink.runtime.taskmanager.Task.run (Task.java:703)
[84] java.lang.Thread.run (Thread.java:748)
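I repeated the test several times. [screenshot]
To make the symptom more specific: for a freshly started job, sending the 1st record does not trigger org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit, while sending the 2nd record does. So the breakpoint cannot sit at org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit; it has to go earlier!
The question to settle: was the message never read from Kafka, or was it read without triggering RecordWriter.emit?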
-------------------------------------------------------------------------------------
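// where messages are first read from Kafka
stop in org.apache.flink.streaming.connectors.kafka.internal.Kafka010Fetcher.emitRecord
// emitted into the channel's buffer [may involve remote operations]
stop in org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit
// the message is received and processed
stop at org.apache.flink.streaming.runtime.operators.windowing.WindowOperator:299
Moving the first breakpoint around is enough to pin down the problem!
In the end I finally figured out that the problem was in my code! The first record triggered the initialization of the config-pulling thread and was then dropped! The 2nd record was processed normally!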
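The real code is not reproduced here. As a minimal sketch of this kind of bug, assume a function whose first call kicks off the configuration loading and returns before emitting anything. Every name below (FirstRecordDropSketch, Buggy, Fixed, eval, out) is hypothetical, and a boolean flag stands in for starting the config-pulling thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of a "first record is dropped during lazy init" bug. */
final class FirstRecordDropSketch {

    /** Buggy shape: the call that triggers initialization returns without emitting. */
    static final class Buggy {
        private boolean initialized;
        final List<String> out = new ArrayList<>();

        void eval(String record) {
            if (!initialized) {
                initialized = true; // stand-in for starting the config-pulling thread
                return;             // the record that triggered the init is silently dropped
            }
            out.add(record);
        }
    }

    /** One possible fix: initialize, then fall through and still handle the record. */
    static final class Fixed {
        private boolean initialized;
        final List<String> out = new ArrayList<>();

        void eval(String record) {
            if (!initialized) {
                initialized = true; // same stand-in, but without the early return
            }
            out.add(record);
        }
    }

    public static void main(String[] args) {
        Buggy buggy = new Buggy();
        Fixed fixed = new Fixed();
        for (String domain : new String[] {"ymm-sms-webabcdefg0", "ymm-sms-webabcdefg1"}) {
            buggy.eval(domain);
            fixed.eval(domain);
        }
        System.out.println("buggy emits: " + buggy.out);
        System.out.println("fixed emits: " + fixed.out);
    }
}
```

With the two test domains, the buggy variant emits only the record with suffix 1, which matches the symptom above. A real fix would also have to decide what happens while the configuration is still loading, for example by loading it before the first record arrives.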