Tracking down a Flink data-loss bug

In the development environment I noticed data being lost. At first I thought it was a problem with my SQL; it took an afternoon, an evening, and the next morning to basically pin it down. First, the symptom.

Restart the SQL service and send two records. Everything runs in event-time mode, with a watermark delay of 1000 ms.

The two records sent are as follows:

Record 1

{
"domain":"ymm-sms-webabcdefg0",
"hostName":"lzq",
"ipAddress":"192.168.56.256",
"metric":"alert.test",
"sample":0.1,
"step":30,
"tags":{"hostname":"abcdefg"},
"timestamp":1551865000000,
"value":13.0
}

Record 2

{
"domain":"ymm-sms-webabcdefg1",
"hostName":"lzq",
"ipAddress":"192.168.56.256",
"metric":"alert.test",
"sample":0.1,
"step":30,
"tags":{"hostname":"abcdefg"},
"timestamp":1551865000000,
"value":13.0
}

A word on these two records: the domains differ (one ends in 0, the other in 1), and note that both carry the same timestamp, 1551865000000.

The window size in my SQL is 5000 ms.

Because the watermark delay is set to 1000 ms, the next message I send carries timestamp 1551865000000 + 5000 + 1000 - 1.

In theory this should trigger the computation of both window instances (one per domain).
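The arithmetic behind that expectation can be checked with a small sketch (plain Java; the helper names are mine, not Flink API). It assumes the standard tumbling-window formula, start = timestamp - (timestamp mod size), and a bounded-out-of-orderness watermark, watermark = maxTimestamp - delay; a window [start, start + size) fires once the watermark reaches start + size - 1.

```java
// Sketch of the event-time tumbling-window arithmetic used above.
// Helper names are mine, not Flink's. Window size 5000 ms, delay 1000 ms.
public class WindowMath {
    static final long SIZE = 5000L;
    static final long DELAY = 1000L;

    // Start of the tumbling window a timestamp falls into.
    static long windowStart(long ts) { return ts - (ts % SIZE); }

    // Bounded-out-of-orderness watermark for the max timestamp seen so far.
    static long watermark(long maxTs) { return maxTs - DELAY; }

    // A window [start, start + SIZE) fires once watermark >= start + SIZE - 1.
    static boolean fires(long start, long wm) { return wm >= start + SIZE - 1; }

    public static void main(String[] args) {
        long ts = 1551865000000L;           // timestamp of the two test records
        long probe = ts + 5000 + 1000 - 1;  // the "trigger" record from the post
        long start = windowStart(ts);
        System.out.println("window = [" + start + ", " + (start + SIZE) + ")");
        System.out.println("watermark after probe = " + watermark(probe));
        System.out.println("fires? " + fires(start, watermark(probe)));
    }
}
```

With these numbers the probe record advances the watermark to exactly the last millisecond of the window, so both keyed window instances should fire.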

And right here is the problem: only the record with suffix 1 came out; the record with suffix 0 was gone!

--- Now I had to figure out why it vanished. Infuriating!

The key is how to localize it, and with what sequence of steps; otherwise you just burn time. My idea: first capture the call stack along which a normal record gets computed, then see at which call the abnormal one goes wrong.

I had looked into the normal call stack before, so it was familiar; here it is:

Via a breakpoint at org.apache.flink.streaming.runtime.operators.windowing.WindowOperator.processElement

The call stack in text form:

[1] org.apache.flink.streaming.runtime.operators.windowing.WindowOperator.processElement (WindowOperator.java:295)
[2] org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput (StreamInputProcessor.java:202)
[3] org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run (OneInputStreamTask.java:103)
[4] org.apache.flink.streaming.runtime.tasks.StreamTask.invoke (StreamTask.java:306)
[5] org.apache.flink.runtime.taskmanager.Task.run (Task.java:703)
[6] java.lang.Thread.run (Thread.java:748)

This stack is worth a closer look first, to see what conclusions it supports.

Printing currentRecordDeserializer in the debugger here gives:

currentRecordDeserializer = "org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer@23772fa4"

Stepping into SpillingAdaptiveSpanningRecordDeserializer's getNextRecord method, we can see

a read into a buffer. The target of that read is:

target = "org.apache.flink.runtime.plugable.NonReusingDeserializationDelegate@28b94ca3"

Everything here is a read. So where does the data come from? In other words, who writes into this? Time to locate that.

The call stack there is:

  [1] org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.setNextBuffer (SpillingAdaptiveSpanningRecordDeserializer.java:68)
  [2] org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput (StreamInputProcessor.java:214)
  [3] org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run (OneInputStreamTask.java:103)
  [4] org.apache.flink.streaming.runtime.tasks.StreamTask.invoke (StreamTask.java:306)
  [5] org.apache.flink.runtime.taskmanager.Task.run (Task.java:703)
  [6] java.lang.Thread.run (Thread.java:748)

Compare it with the first one above:

[1] org.apache.flink.streaming.runtime.operators.windowing.WindowOperator.processElement (WindowOperator.java:295)
[2] org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput (StreamInputProcessor.java:202)
[3] org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run (OneInputStreamTask.java:103)
[4] org.apache.flink.streaming.runtime.tasks.StreamTask.invoke (StreamTask.java:306)
[5] org.apache.flink.runtime.taskmanager.Task.run (Task.java:703)
[6] java.lang.Thread.run (Thread.java:748)

It is in fact the same thread. So the picture is roughly clear: the thread first reads data out of the buffer, then deserializes it, and takes one of four branches depending on what it decoded.

The key question is where, at the bottom, the data is read from. See below.
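The four-branch dispatch just described can be reduced to a sketch (my own stand-in types, not Flink classes): deserialize the next element, then branch on what it is.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Reduced sketch (my own stand-in types, not Flink classes) of the
// single-threaded loop in StreamInputProcessor.processInput: deserialize
// the next element from the input buffer, then dispatch on what it is.
public class InputLoopSketch {
    enum Kind { RECORD, WATERMARK, LATENCY_MARKER, STREAM_STATUS }

    static final class Element {
        final Kind kind;
        final Object payload;
        Element(Kind kind, Object payload) { this.kind = kind; this.payload = payload; }
    }

    // One of the four branches taken after deserialization succeeds.
    static String dispatch(Element e) {
        switch (e.kind) {
            case RECORD:         return "operator.processElement(" + e.payload + ")";
            case WATERMARK:      return "valve.inputWatermark(" + e.payload + ")";
            case LATENCY_MARKER: return "operator.processLatencyMarker(...)";
            default:             return "valve.inputStreamStatus(...)";
        }
    }

    public static void main(String[] args) {
        Queue<Element> deserialized = new ArrayDeque<>();
        deserialized.add(new Element(Kind.RECORD, "row-suffix-1"));
        deserialized.add(new Element(Kind.WATERMARK, 1551865004999L));
        while (!deserialized.isEmpty()) {
            System.out.println(dispatch(deserialized.poll()));
        }
    }
}
```

The point of the sketch: records and watermarks travel through the same read loop, so if a record never shows up here, it was lost upstream of this thread.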

We set a breakpoint: stop at org.apache.flink.streaming.runtime.io.StreamInputProcessor:209

print barrierHandler
 barrierHandler = "org.apache.flink.streaming.runtime.io.BarrierTracker@66d1ea7e"

The code is at
org.apache.flink.streaming.runtime.io.BarrierTracker.getNextNonBlocked(), line=94 bci=0

From there, continue printing:

print inputGate
 inputGate = "org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate@60cca197"

The name says "consumer". Nice.

Tracing onward, the call lands in

org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(), line=502 bci=0

Step into this function and take a look.

At this point it all links up with what I had learned before!

 

So let's first check whether this spot is triggered at all when the first message arrives! On with the experiments.

How is the channel mentioned above actually used?

So we now turn our attention to

org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel.getNextBuffer

Breakpoint:

stop at org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel:186

Then print:
print subpartitionView.getClass()
 subpartitionView.getClass() = "class org.apache.flink.runtime.io.network.partition.PipelinedSubpartitionView"

The next question: who puts data into it?

 

Breakpoint at

org.apache.flink.runtime.io.network.partition.PipelinedSubpartition.add
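Conceptually, the subpartition is a queue: the writer side appends serialized buffers, and the consumer view polls them. A minimal stand-in (not the real Flink class, which uses its own Buffer type and availability notifications) looks like this:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Minimal stand-in (not the real Flink class) for the relationship between
// PipelinedSubpartition.add on the writer side and the consumer view's
// poll on the reader side. A BlockingQueue is simply the easiest way to
// show the hand-off between the two tasks.
public class SubpartitionSketch {
    private final BlockingQueue<byte[]> buffers = new ArrayBlockingQueue<>(16);

    // Reached from the upstream task's RecordWriter.emit path.
    boolean add(byte[] serializedRecord) { return buffers.offer(serializedRecord); }

    // Reached from the downstream task via LocalInputChannel.getNextBuffer.
    byte[] pollBuffer() { return buffers.poll(); }

    public static void main(String[] args) {
        SubpartitionSketch sub = new SubpartitionSketch();
        sub.add("record-1".getBytes());
        System.out.println(new String(sub.pollBuffer()));
    }
}
```

So a breakpoint on the add side answers the question directly: if nothing is ever enqueued for the missing record, the loss happened before the network layer.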

 

At this point I also traced the sending side, forward this time; its call stack:

  [1] org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit (RecordWriter.java:104)
  [2] org.apache.flink.streaming.runtime.io.StreamRecordWriter.emit (StreamRecordWriter.java:81)
  [3] org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter (RecordWriterOutput.java:107)
  [4] org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect (RecordWriterOutput.java:89)
  [5] org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect (RecordWriterOutput.java:45)
  [6] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:679)
  [7] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:657)
  [8] org.apache.flink.streaming.api.operators.TimestampedCollector.collect (TimestampedCollector.java:51)
  [9] org.apache.flink.table.runtime.RowtimeProcessFunction.processElement (RowtimeProcessFunction.scala:45)
  [10] org.apache.flink.table.runtime.RowtimeProcessFunction.processElement (RowtimeProcessFunction.scala:32)
  [11] org.apache.flink.streaming.api.operators.ProcessOperator.processElement (ProcessOperator.java:66)
  [12] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator (OperatorChain.java:560)
  [13] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:535)
  [14] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:515)
  [15] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:679)
  [16] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:657)
  [17] org.apache.flink.streaming.api.operators.TimestampedCollector.collect (TimestampedCollector.java:51)
  [18] org.apache.flink.table.runtime.CRowWrappingCollector.collect (CRowWrappingCollector.scala:37)
  [19] org.apache.flink.table.runtime.CRowWrappingCollector.collect (CRowWrappingCollector.scala:28)
  [20] DataStreamCalcRule$9091.processElement (null)
  [21] org.apache.flink.table.runtime.CRowProcessRunner.processElement (CRowProcessRunner.scala:66)
  [22] org.apache.flink.table.runtime.CRowProcessRunner.processElement (CRowProcessRunner.scala:35)
  [23] org.apache.flink.streaming.api.operators.ProcessOperator.processElement (ProcessOperator.java:66)
  [24] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator (OperatorChain.java:560)
  [25] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:535)
  [26] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:515)
  [27] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:679)
  [28] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:657)
  [29] org.apache.flink.streaming.api.operators.TimestampedCollector.collect (TimestampedCollector.java:51)
  [30] org.apache.flink.table.runtime.CRowWrappingCollector.collect (CRowWrappingCollector.scala:37)
  [31] org.apache.flink.table.runtime.CRowWrappingCollector.collect (CRowWrappingCollector.scala:28)
  [32] TableFunctionCollector$9065.collect (null)
  [33] org.apache.flink.table.functions.TableFunction.collect (TableFunction.scala:92)
  [34] com.ymm.hubble.metric.flink.config.RowsByMetricAndTags.eval (RowsByMetricAndTags.java:161)
  [35] DataStreamCorrelateRule$9023.processElement (null)
  [36] org.apache.flink.table.runtime.CRowCorrelateProcessRunner.processElement (CRowCorrelateProcessRunner.scala:79)
  [37] org.apache.flink.table.runtime.CRowCorrelateProcessRunner.processElement (CRowCorrelateProcessRunner.scala:35)
  [38] org.apache.flink.streaming.api.operators.ProcessOperator.processElement (ProcessOperator.java:66)
  [39] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator (OperatorChain.java:560)
  [40] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:535)
  [41] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:515)
  [42] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:679)
  [43] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:657)
  [44] org.apache.flink.streaming.api.operators.TimestampedCollector.collect (TimestampedCollector.java:51)
  [45] org.apache.flink.table.runtime.CRowWrappingCollector.collect (CRowWrappingCollector.scala:37)
  [46] org.apache.flink.table.runtime.CRowWrappingCollector.collect (CRowWrappingCollector.scala:28)
  [47] DataStreamCalcRule$8990.processElement (null)
  [48] org.apache.flink.table.runtime.CRowProcessRunner.processElement (CRowProcessRunner.scala:66)
  [49] org.apache.flink.table.runtime.CRowProcessRunner.processElement (CRowProcessRunner.scala:35)
  [50] org.apache.flink.streaming.api.operators.ProcessOperator.processElement (ProcessOperator.java:66)
  [51] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator (OperatorChain.java:560)
  [52] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:535)
  [53] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:515)
  [54] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:679)
  [55] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:657)
  [56] org.apache.flink.streaming.runtime.operators.TimestampsAndPeriodicWatermarksOperator.processElement (TimestampsAndPeriodicWatermarksOperator.java:67)
  [57] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator (OperatorChain.java:560)
  [58] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:535)
  [59] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:515)
  [60] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:679)
  [61] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:657)
  [62] org.apache.flink.streaming.api.operators.TimestampedCollector.collect (TimestampedCollector.java:51)
  [63] org.apache.flink.table.runtime.CRowWrappingCollector.collect (CRowWrappingCollector.scala:37)
  [64] org.apache.flink.table.runtime.CRowWrappingCollector.collect (CRowWrappingCollector.scala:28)
  [65] DataStreamSourceConversion$8954.processElement (null)
  [66] org.apache.flink.table.runtime.CRowOutputProcessRunner.processElement (CRowOutputProcessRunner.scala:67)
  [67] org.apache.flink.streaming.api.operators.ProcessOperator.processElement (ProcessOperator.java:66)
  [68] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator (OperatorChain.java:560)
  [69] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:535)
  [70] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:515)
  [71] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:679)
  [72] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:657)
  [73] org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollectWithTimestamp (StreamSourceContexts.java:310)
  [74] org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collectWithTimestamp (StreamSourceContexts.java:409)
  [75] org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.emitRecordWithTimestamp (AbstractFetcher.java:398)
  [76] org.apache.flink.streaming.connectors.kafka.internal.Kafka010Fetcher.emitRecord (Kafka010Fetcher.java:89)
  [77] org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.runFetchLoop (Kafka09Fetcher.java:154)
  [78] org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run (FlinkKafkaConsumerBase.java:721)
  [79] org.apache.flink.streaming.api.operators.StreamSource.run (StreamSource.java:87)
  [80] org.apache.flink.streaming.api.operators.StreamSource.run (StreamSource.java:56)
  [81] org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run (SourceStreamTask.java:99)
  [82] org.apache.flink.streaming.runtime.tasks.StreamTask.invoke (StreamTask.java:306)
  [83] org.apache.flink.runtime.taskmanager.Task.run (Task.java:703)
  [84] java.lang.Thread.run (Thread.java:748)

I repeated the test several times (screenshot omitted).

To restate the symptom more precisely: on a freshly started job, sending the first record never triggers org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit; sending the second one does. So the breakpoint cannot stay at RecordWriter.emit; it has to move earlier!

The question to settle: was the message never read from Kafka at all, or was it read but never reached RecordWriter.emit?

-------------------------------------------------------------------------------------

//the earliest point where a message read from Kafka enters the stream
stop in org.apache.flink.streaming.connectors.kafka.internal.Kafka010Fetcher.emitRecord

//where the record is emitted into the channel's buffer [may involve a remote transfer]
stop in org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit

//where the record is picked up for processing
stop at org.apache.flink.streaming.runtime.operators.windowing.WindowOperator:299

Moving the first breakpoint around is enough to tell which case we are in!

In the end it turned out to be a problem in our own code! The first record triggered the initialization of the config-fetching thread and was dropped in the process; the second record was then processed normally.
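Reduced to a sketch (hypothetical names; the project's real code is not shown in this post), the bug class looks like this: the element that happens to trigger lazy initialization is returned instead of being processed once initialization has been kicked off.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the bug class found here (hypothetical names, not the project's
// actual code): the first element triggers one-time initialization and is
// then silently dropped instead of being processed.
public class LazyInitBug {
    private boolean initialized = false;
    final List<String> output = new ArrayList<>();

    // Buggy version: the early return swallows the element that arrived
    // first (here, imagine it kicks off the config-fetch thread).
    void processBuggy(String element) {
        if (!initialized) {
            initialized = true;
            return; // BUG: the first element is lost
        }
        output.add(element);
    }

    // Fixed version: initialize, then fall through and process normally.
    void processFixed(String element) {
        if (!initialized) {
            initialized = true;
        }
        output.add(element);
    }

    public static void main(String[] args) {
        LazyInitBug job = new LazyInitBug();
        job.processBuggy("suffix-0");
        job.processBuggy("suffix-1");
        System.out.println(job.output); // only suffix-1 survives
    }
}
```

This matches the symptom exactly: only the second record ever reached the window operator, so suffix-0 vanished while suffix-1 came out.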
