In the development environment I noticed data being dropped. At first I assumed it was a problem with my SQL; it took an afternoon plus an evening plus the next morning to basically pin it down. First, the symptoms.
Restart the SQL job and send two records. The job runs in event-time mode with a watermark delay of 1000 ms.
The two records sent are shown below:
Record 1
{
"domain":"ymm-sms-webabcdefg0",
"hostName":"lzq",
"ipAddress":"192.168.56.256",
"metric":"alert.test",
"sample":0.1,
"step":30,
"tags":{"hostname":"abcdefg"},
"timestamp":1551865000000,
"value":13.0
}
Record 2
{
"domain":"ymm-sms-webabcdefg1",
"hostName":"lzq",
"ipAddress":"192.168.56.256",
"metric":"alert.test",
"sample":0.1,
"step":30,
"tags":{"hostname":"abcdefg"},
"timestamp":1551865000000,
"value":13.0
}
To explain the two records above: the domains differ, one with suffix 0 and one with suffix 1; note that both timestamps are 1551865000000.
The window grouping interval in my SQL is 5000 ms.
Because I set the watermark delay to 1000 ms, I next sent a message with timestamp 1551865000000 + 5000 + 1000 - 1.
In theory this should trigger the computation of both windows.
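As a sanity check on that arithmetic, here is a minimal sketch in plain Java (not Flink code; the window-start formula mirrors Flink's TimeWindow.getWindowStartWithOffset with offset 0, and the watermark model assumes bounded out-of-orderness, i.e. watermark = maxTimestamp - delay):

```java
public class WindowMath {
    // Window start for a tumbling window of the given size (offset 0),
    // mirroring Flink's TimeWindow.getWindowStartWithOffset.
    static long windowStart(long ts, long size) {
        return ts - (ts % size);
    }

    // An event-time window fires once the watermark reaches windowEnd - 1.
    static boolean fires(long maxSeenTs, long windowEnd, long delay) {
        long watermark = maxSeenTs - delay; // bounded-out-of-orderness model
        return watermark >= windowEnd - 1;
    }

    public static void main(String[] args) {
        long ts = 1551865000000L, size = 5000L, delay = 1000L;
        long end = windowStart(ts, size) + size;   // 1551865005000
        long probe = ts + size + delay - 1;        // 1551865005999
        System.out.println(fires(probe, end, delay));     // fires
        System.out.println(fires(probe - 1, end, delay)); // one ms earlier: not yet
    }
}
```

This confirms that 1551865000000 + 5000 + 1000 - 1 is exactly the smallest probe timestamp that pushes the watermark past the window boundary.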
And here lies the problem: only the suffix-1 record came out; the suffix-0 message was gone!
--- Now I had to figure out why it vanished; what a nasty one!
The key is how to localize the problem, and with what steps; otherwise it is just wasted time. My idea: first capture the call stack under which a normal record gets computed, then see at which step the abnormal one goes wrong.
I had already studied the normal call stack, so it was easy to pull out:
Via a breakpoint at org.apache.flink.streaming.runtime.operators.windowing.WindowOperator.processElement,
here is the call stack in text form:
[1] org.apache.flink.streaming.runtime.operators.windowing.WindowOperator.processElement (WindowOperator.java:295)
[2] org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput (StreamInputProcessor.java:202)
[3] org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run (OneInputStreamTask.java:103)
[4] org.apache.flink.streaming.runtime.tasks.StreamTask.invoke (StreamTask.java:306)
[5] org.apache.flink.runtime.taskmanager.Task.run (Task.java:703)
[6] java.lang.Thread.run (Thread.java:748)
This call stack is worth studying first; let's see what conclusions it yields.
Printing currentRecordDeserializer in the debugger gives:
currentRecordDeserializer = "org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer@23772fa4"
Stepping into SpillingAdaptiveSpanningRecordDeserializer's getNextRecord method, we can see
a read operation into a buffer; the object being read into is
target = "org.apache.flink.runtime.plugable.NonReusingDeserializationDelegate@28b94ca3"
These are all reads, so where does the data come from? In other words, who writes it in? Time to locate that.
The call stack there looks like this:
[1] org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.setNextBuffer (SpillingAdaptiveSpanningRecordDeserializer.java:68)
[2] org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput (StreamInputProcessor.java:214)
[3] org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run (OneInputStreamTask.java:103)
[4] org.apache.flink.streaming.runtime.tasks.StreamTask.invoke (StreamTask.java:306)
[5] org.apache.flink.runtime.taskmanager.Task.run (Task.java:703)
[6] java.lang.Thread.run (Thread.java:748)
Compare this with the first stack above:
[1] org.apache.flink.streaming.runtime.operators.windowing.WindowOperator.processElement (WindowOperator.java:295)
[2] org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput (StreamInputProcessor.java:202)
[3] org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run (OneInputStreamTask.java:103)
[4] org.apache.flink.streaming.runtime.tasks.StreamTask.invoke (StreamTask.java:306)
[5] org.apache.flink.runtime.taskmanager.Task.run (Task.java:703)
[6] java.lang.Thread.run (Thread.java:748)
It is actually the same thread, so the picture is now roughly clear: the loop first reads data from the buffer, then deserializes it, and executes one of four branches depending on the parse result.
The key question is where, fundamentally, the data is read from; see below.
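To make that "read, deserialize, branch" loop concrete, here is a toy model (the names and element kinds are illustrative, not the real Flink API; the actual loop lives in StreamInputProcessor.processInput): the task thread pulls deserialized elements one by one and dispatches each on one of the branches.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class InputLoopSketch {
    // The four element kinds the dispatch distinguishes (illustrative names).
    enum Kind { RECORD, WATERMARK, STREAM_STATUS, LATENCY_MARKER }

    static String dispatch(Kind kind, long value) {
        switch (kind) {
            case WATERMARK:      return "advance event time to " + value;
            case STREAM_STATUS:  return "toggle active/idle";
            case LATENCY_MARKER: return "forward latency marker";
            default:             return "operator.processElement(" + value + ")";
        }
    }

    public static void main(String[] args) {
        // Pretend the deserializer produced one record and one watermark.
        Queue<long[]> elements = new ArrayDeque<>();
        elements.add(new long[]{Kind.RECORD.ordinal(), 13});
        elements.add(new long[]{Kind.WATERMARK.ordinal(), 1551865004999L});
        while (!elements.isEmpty()) {
            long[] e = elements.poll();
            System.out.println(dispatch(Kind.values()[(int) e[0]], e[1]));
        }
    }
}
```

The point is that one single thread does everything here: the read from the buffer, the deserialization, and the operator call all happen in sequence, which is why the two stacks above share the same frames below processInput.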
We set a breakpoint with stop at org.apache.flink.streaming.runtime.io.StreamInputProcessor:209
print barrierHandler
barrierHandler = "org.apache.flink.streaming.runtime.io.BarrierTracker@66d1ea7e"
The code sits at
org.apache.flink.streaming.runtime.io.BarrierTracker.getNextNonBlocked(), line=94 bci=0
which is what we wanted to see.
Continue printing:
print inputGate
inputGate = "org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate@60cca197"
The word "consumer" right there in the name, nice.
Keep tracing; the call lands in
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(), line=502 bci=0
Step into this function and take a look.
At this point things link up with what I had learned before!
So first check whether this spot is triggered when the first message arrives! Keep experimenting.
How is the channel mentioned above used?
So now we turn our attention to
org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel.getNextBuffer
The breakpoint goes at
stop at org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel:186
Then print
print subpartitionView.getClass()
subpartitionView.getClass() = "class org.apache.flink.runtime.io.network.partition.PipelinedSubpartitionView"
The next question: who puts data into it?
Breakpoint at
org.apache.flink.runtime.io.network.partition.PipelinedSubpartition.add
This time I traced the sending side in the forward direction; its call stack:
[1] org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit (RecordWriter.java:104)
[2] org.apache.flink.streaming.runtime.io.StreamRecordWriter.emit (StreamRecordWriter.java:81)
[3] org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter (RecordWriterOutput.java:107)
[4] org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect (RecordWriterOutput.java:89)
[5] org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect (RecordWriterOutput.java:45)
[6] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:679)
[7] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:657)
[8] org.apache.flink.streaming.api.operators.TimestampedCollector.collect (TimestampedCollector.java:51)
[9] org.apache.flink.table.runtime.RowtimeProcessFunction.processElement (RowtimeProcessFunction.scala:45)
[10] org.apache.flink.table.runtime.RowtimeProcessFunction.processElement (RowtimeProcessFunction.scala:32)
[11] org.apache.flink.streaming.api.operators.ProcessOperator.processElement (ProcessOperator.java:66)
[12] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator (OperatorChain.java:560)
[13] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:535)
[14] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:515)
[15] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:679)
[16] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:657)
[17] org.apache.flink.streaming.api.operators.TimestampedCollector.collect (TimestampedCollector.java:51)
[18] org.apache.flink.table.runtime.CRowWrappingCollector.collect (CRowWrappingCollector.scala:37)
[19] org.apache.flink.table.runtime.CRowWrappingCollector.collect (CRowWrappingCollector.scala:28)
[20] DataStreamCalcRule$9091.processElement (null)
[21] org.apache.flink.table.runtime.CRowProcessRunner.processElement (CRowProcessRunner.scala:66)
[22] org.apache.flink.table.runtime.CRowProcessRunner.processElement (CRowProcessRunner.scala:35)
[23] org.apache.flink.streaming.api.operators.ProcessOperator.processElement (ProcessOperator.java:66)
[24] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator (OperatorChain.java:560)
[25] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:535)
[26] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:515)
[27] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:679)
[28] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:657)
[29] org.apache.flink.streaming.api.operators.TimestampedCollector.collect (TimestampedCollector.java:51)
[30] org.apache.flink.table.runtime.CRowWrappingCollector.collect (CRowWrappingCollector.scala:37)
[31] org.apache.flink.table.runtime.CRowWrappingCollector.collect (CRowWrappingCollector.scala:28)
[32] TableFunctionCollector$9065.collect (null)
[33] org.apache.flink.table.functions.TableFunction.collect (TableFunction.scala:92)
[34] com.ymm.hubble.metric.flink.config.RowsByMetricAndTags.eval (RowsByMetricAndTags.java:161)
[35] DataStreamCorrelateRule$9023.processElement (null)
[36] org.apache.flink.table.runtime.CRowCorrelateProcessRunner.processElement (CRowCorrelateProcessRunner.scala:79)
[37] org.apache.flink.table.runtime.CRowCorrelateProcessRunner.processElement (CRowCorrelateProcessRunner.scala:35)
[38] org.apache.flink.streaming.api.operators.ProcessOperator.processElement (ProcessOperator.java:66)
[39] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator (OperatorChain.java:560)
[40] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:535)
[41] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:515)
[42] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:679)
[43] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:657)
[44] org.apache.flink.streaming.api.operators.TimestampedCollector.collect (TimestampedCollector.java:51)
[45] org.apache.flink.table.runtime.CRowWrappingCollector.collect (CRowWrappingCollector.scala:37)
[46] org.apache.flink.table.runtime.CRowWrappingCollector.collect (CRowWrappingCollector.scala:28)
[47] DataStreamCalcRule$8990.processElement (null)
[48] org.apache.flink.table.runtime.CRowProcessRunner.processElement (CRowProcessRunner.scala:66)
[49] org.apache.flink.table.runtime.CRowProcessRunner.processElement (CRowProcessRunner.scala:35)
[50] org.apache.flink.streaming.api.operators.ProcessOperator.processElement (ProcessOperator.java:66)
[51] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator (OperatorChain.java:560)
[52] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:535)
[53] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:515)
[54] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:679)
[55] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:657)
[56] org.apache.flink.streaming.runtime.operators.TimestampsAndPeriodicWatermarksOperator.processElement (TimestampsAndPeriodicWatermarksOperator.java:67)
[57] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator (OperatorChain.java:560)
[58] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:535)
[59] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:515)
[60] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:679)
[61] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:657)
[62] org.apache.flink.streaming.api.operators.TimestampedCollector.collect (TimestampedCollector.java:51)
[63] org.apache.flink.table.runtime.CRowWrappingCollector.collect (CRowWrappingCollector.scala:37)
[64] org.apache.flink.table.runtime.CRowWrappingCollector.collect (CRowWrappingCollector.scala:28)
[65] DataStreamSourceConversion$8954.processElement (null)
[66] org.apache.flink.table.runtime.CRowOutputProcessRunner.processElement (CRowOutputProcessRunner.scala:67)
[67] org.apache.flink.streaming.api.operators.ProcessOperator.processElement (ProcessOperator.java:66)
[68] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator (OperatorChain.java:560)
[69] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:535)
[70] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:515)
[71] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:679)
[72] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:657)
[73] org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollectWithTimestamp (StreamSourceContexts.java:310)
[74] org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collectWithTimestamp (StreamSourceContexts.java:409)
[75] org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.emitRecordWithTimestamp (AbstractFetcher.java:398)
[76] org.apache.flink.streaming.connectors.kafka.internal.Kafka010Fetcher.emitRecord (Kafka010Fetcher.java:89)
[77] org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.runFetchLoop (Kafka09Fetcher.java:154)
[78] org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run (FlinkKafkaConsumerBase.java:721)
[79] org.apache.flink.streaming.api.operators.StreamSource.run (StreamSource.java:87)
[80] org.apache.flink.streaming.api.operators.StreamSource.run (StreamSource.java:56)
[81] org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run (SourceStreamTask.java:99)
[82] org.apache.flink.streaming.runtime.tasks.StreamTask.invoke (StreamTask.java:306)
[83] org.apache.flink.runtime.taskmanager.Task.run (Task.java:703)
[84] java.lang.Thread.run (Thread.java:748)
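Together, the consumer-side and producer-side stacks show a classic producer/consumer handoff: the upstream task thread pushes serialized buffers in via PipelinedSubpartition.add, and the downstream task thread polls them out through the subpartition view. A toy model of that handoff (just a synchronized queue; the real PipelinedSubpartition additionally handles backpressure, buffer recycling, and availability notifications):

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class SubpartitionSketch {
    private final Queue<byte[]> buffers = new ArrayDeque<>();

    // Producer side: RecordWriter.emit -> ... -> PipelinedSubpartition.add
    synchronized void add(byte[] buffer) {
        buffers.add(buffer);
        notifyAll(); // wake a consumer blocked in getNextBuffer
    }

    // Consumer side: LocalInputChannel.getNextBuffer via the subpartition view
    synchronized byte[] getNextBuffer() throws InterruptedException {
        while (buffers.isEmpty()) {
            wait();
        }
        return buffers.poll();
    }

    public static void main(String[] args) throws InterruptedException {
        SubpartitionSketch s = new SubpartitionSketch();
        s.add(new byte[]{1, 2, 3});
        System.out.println(s.getNextBuffer().length);
    }
}
```

The practical consequence for debugging: if nothing ever calls add for a given record, the consumer side has nothing to poll, which is exactly the symptom observed next.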
I repeated the test several times.
To state the symptom more precisely: on a freshly started job, the first record sent does not trigger org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit at all, while the second one does. So the breakpoint cannot stay at RecordWriter.emit; it has to move earlier!
We need to determine whether the message was never read from Kafka at all, or was read but never reached RecordWriter.emit.
-------------------------------------------------------------------------------------
//where the message is first read from Kafka
stop in org.apache.flink.streaming.connectors.kafka.internal.Kafka010Fetcher.emitRecord
//where records are emitted into the channel's buffer [may involve a remote operation]
stop in org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit
//where the message is picked up for processing
stop at org.apache.flink.streaming.runtime.operators.windowing.WindowOperator:299
By moving the first breakpoint forward or backward we can narrow the problem down!
In the end it turned out to be a problem in our own code! The first record triggered the lazy initialization of the config-fetching thread and was dropped in the process; the second record was processed normally.
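For the record, the failure mode looks roughly like this (a hypothetical stand-in; the real class is the RowsByMetricAndTags UDF, whose exact code is not shown here): the first element finds the config missing, kicks off an async loader, and returns nothing, so it is silently dropped. The fix is to wait for (or buffer until) the config on the first call:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

public class LazyInitDrop {
    private volatile boolean configReady = false;
    private final CountDownLatch loaded = new CountDownLatch(1);

    private void startConfigLoaderAsync() {
        new Thread(() -> {
            // simulate a slow config fetch
            try { Thread.sleep(50); } catch (InterruptedException ignored) { }
            configReady = true;
            loaded.countDown();
        }).start();
    }

    // Buggy: the first element triggers initialization and is discarded.
    List<String> processBuggy(String element) {
        List<String> out = new ArrayList<>();
        if (!configReady) {
            startConfigLoaderAsync();
            return out; // BUG: element #1 is silently lost here
        }
        out.add(element);
        return out;
    }

    // Fixed: block once until the config is available, then process.
    List<String> processFixed(String element) {
        if (!configReady) {
            startConfigLoaderAsync();
            try {
                loaded.await(); // first caller pays the init cost, loses nothing
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        List<String> out = new ArrayList<>();
        out.add(element);
        return out;
    }

    public static void main(String[] args) {
        LazyInitDrop d = new LazyInitDrop();
        System.out.println(d.processBuggy("first"));  // empty: dropped
        System.out.println(d.processFixed("second")); // processed
    }
}
```

This also explains why the first probe record never reached RecordWriter.emit: the UDF swallowed it before anything was ever written to the subpartition.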