Faster Flink Adoption: Pinterest's Self-Service Diagnosis Tool in Practice

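{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To simplify and speed up troubleshooting, the Pinterest Stream Processing Platform team built and launched a Flink diagnosis tool called Dr. Squirrel. It surfaces and aggregates job symptoms, offers insight into root causes, and provides actionable steps for resolving issues. Since its release, the tool has significantly improved the productivity of developers and the platform team."}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Note: the authors, "},{"type":"link","attrs":{"href":"https:\/\/www.linkedin.com\/in\/fanshu-jiang\/","title":null,"type":null},"content":[{"type":"text","text":"Fanshu Jiang"}]},{"type":"text","text":" and "},{"type":"link","attrs":{"href":"https:\/\/www.linkedin.com\/in\/luniu\/","title":null,"type":null},"content":[{"type":"text","text":"Lu Niu"}]},{"type":"text","text":", work on the Pinterest Stream Processing Platform team."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Stream processing at Pinterest already powers a number of "},{"type":"link","attrs":{"href":"https:\/\/medium.com\/pinterest-engineering\/unified-flink-source-at-pinterest-streaming-data-processing-c9d4e89f2ed6","title":null,"type":null},"content":[{"type":"text","text":"real-time use cases"}]},{"type":"text","text":". In recent years, the Flink-based platform has delivered fresh content and metrics reports in near real time, demonstrating great value to the business, and it has the potential to power many more use cases. Unlocking Flink's full potential, however, requires addressing development velocity."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Getting from the first line of code to a stable data flow in production can take weeks. Troubleshooting and tuning Flink jobs is especially time-consuming: troubleshooting means wading through massive volumes of logs and metrics, and tuning involves a wide variety of configurations. Finding the root cause of a development issue often requires a deep understanding of Flink internals. This not only slows development and makes onboarding to Flink worse than expected, it also generates a heavy platform support load and limits the scalability of stream processing use cases."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To simplify and speed up troubleshooting, we built and launched a Flink diagnosis tool called "},{"type":"text","marks":[{"type":"strong"}],"text":"Dr. Squirrel"},{"type":"text","text":". The tool surfaces and aggregates job symptoms, offers insight into root causes, and provides actionable steps for resolving issues. Since its release, it has significantly improved the productivity of developers and the platform team."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Pain Points of Troubleshooting Flink Jobs"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Logs and metrics are scattered across large stores, and only a few relate to the failure"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When troubleshooting, engineers typically:"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Scroll through lengthy JM\/TM logs in the YARN UI."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Check dozens of job\/server metrics dashboards."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Look up the job's configurations and verify them one by one."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Click through the job graphs in the Flink Web UI to check details such as checkpoint alignment, data skew, and backpressure."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Checking these stats is time-consuming, yet 90% of them turn out to be normal or unrelated to the root cause. A one-stop aggregation that surfaces only the stats relevant to troubleshooting would save a great deal of time."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"What should be done once a problematic metric is found?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Job owners often ask this question after spotting a problematic metric, because getting to the root cause takes further reasoning. For example, a checkpoint timeout may indicate a misconfigured timeout, but it can also be caused by backpressure, slow uploads to the S3 file system, garbage collection, or data skew. A TaskManager-lost log entry may indicate a node failure, but it is usually caused by heap issues or "},{"type":"link","attrs":{"href":"https:\/\/issues.apache.org\/jira\/browse\/FLINK-18712","title":null,"type":null},"content":[{"type":"text","text":"RocksDB statebackend OOM"}]},{"type":"text","text":" issues. Checking and thoroughly verifying each possible cause takes time. Yet 80% of fixes follow recognizable patterns. As the platform team, we therefore asked whether we could analyze system stats programmatically and report the actual cause without any inference by job owners."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Troubleshooting documentation is far from enough"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We provide troubleshooting documentation to our users. But as troubleshooting use cases keep growing, the document keeps getting longer, and it becomes hard to quickly find the diagnosis and actions relevant to a given issue. To determine the root cause, engineers have to walk through if-else diagnosis logic by hand, which makes self-service diagnosis hard to carry out, so many issues still depend on the platform team for troubleshooting. In addition, the documentation cannot respond whenever the platform introduces new job health requirements. We realized that we needed a new tool to share troubleshooting takeaways more effectively and to enforce job health requirements across the cluster."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Dr. Squirrel: A Self-Service Troubleshooting and Diagnosis Tool"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Given these challenges, we built and launched a tool for fast issue detection and troubleshooting, called Dr. 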
Squirrel. Its design goals are:"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Cut troubleshooting time from hours to minutes."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Consolidate the many troubleshooting tools developers use into one."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Require only minimal knowledge of Flink internals to troubleshoot."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In short, the tool aggregates useful information in one place, runs job health checks, clearly flags unhealthy jobs, analyzes root causes, and provides actionable steps to help fix issues. Some feature highlights follow."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"More efficient log viewing"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For each job run, Dr. Squirrel highlights the exceptions that directly trigger restarts, such as TaskManager loss and OOM, helping users quickly find the exceptions worth attention in a huge pile of logs. It also collects every warning, error, and info log that contains a stack trace into separate sections, and checks each log's content for the keyword “error” to provide leads toward step-by-step fixes in the troubleshooting guide."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/3d\/3d9df3e9059643766d0948378ee53231.png","alt":"","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Dr. 
Squirrel's search bar supports searching across all logs, and on top of that it offers two more efficient ways to view logs: the Timeline view and the Unique exception view. The Timeline view, shown below, presents logs in chronological order along with their class names, with pre-generated ElasticSearch links for drilling into details."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/44\/441edcda4e4c019f39b87e940420ac43.png","alt":"","title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With a single click, users can switch to the Unique exception view. Identical exceptions are grouped into one row, with metadata such as first occurrence, last occurrence, and total count, which makes it easy to identify the most frequent exceptions."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/44\/44da52760182082ae175919576890e9b.png","alt":"","title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Job health at a glance"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Dr. Squirrel provides a health check page that shows exactly how healthy a job is, useful to seasoned engineers and beginners alike. Unlike dashboards that simply display metrics, Dr. 
Squirrel monitors each metric over the past hour and states clearly whether it meets our platform stability requirements. For the platform team, this is an efficient and scalable way to communicate and enforce job stability requirements."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The health check page is divided into several sections, each focusing on a different aspect of job health. A quick scan gives a good sense of the job's overall health."}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Basic Job Stats"},{"type":"text","text":" section: shows basic health stats such as throughput, full restart rate, checkpoint size and duration, consecutive checkpoint failures, and maximum parallelism over the past hour. Metrics that fail the health check are marked “Failed” and sorted to the top."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/c7\/c79f52e74e00ca685f9e9eb53b0b8ea8.png","alt":"","title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Backpressured Tasks"},{"type":"text","text":" section: tracks the backpressure status of each operator at a fine granularity. Each minute is shown as a green square if there was no backpressure, and as a red square otherwise. Every operator gets 60 squares, representing its backpressure status over the past hour. This makes it easy to see how often backpressure occurs and which operator it started from."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/16\/16f8dc5eb5cd5383bda2f8d786442ef0.png","alt":"","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null
,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"GC Old Gen Time"},{"type":"text","text":" section: uses the same visualization as the Backpressured Tasks section to give an overview of whether garbage collection happens too often. Garbage collection can affect throughput and checkpoints. Because the visualization is the same, it is easy to see whether GC and backpressure occur at the same time, and thus whether GC is a potential cause of backpressure."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/dc\/dcc8a57bb758956f6480207f84603e2a.png","alt":"","title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"JobManager\/TaskManager Memory Usage"},{"type":"text","text":" section: tracks the memory usage of YARN containers, namely the resident set size (RSS) of the Flink Java process, collected by a daemon running on the worker nodes. RSS covers everything in the Flink memory model as well as memory Flink does not track, such as JVM process stacks, thread metadata, and memory allocated by user code through JNI, so it gives a more accurate picture of memory usage. The section marks the configured maximum JM\/TM memory and a 90% usage threshold, so users can quickly locate the containers that are close to OOM."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/61\/61020ae9095b53fcb86c5fa2f877697d.png","alt":"","title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"CPU Usage"},{"type":"text","text":" section: lists all containers that use more CPU than their allocated vcores, helping to monitor for and avoid the noisy neighbor problem in multi-tenant Hadoop clusters, where one user's workload with excessive CPU usage degrades performance and stability for other users."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/61\/61020ae9095b53fcb86c5fa2f877697d.png","alt":"","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Effective configurations"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Flink jobs can be configured at different levels, for example in-code configuration at the execution level, job properties files and command-line arguments at the client level, and flink-conf.yaml at the system level. During testing and hotfixes, engineers often end up setting the same parameter at more than one level. Because the levels override one another in different ways, it is hard to tell which value actually takes effect. To solve this, we built a configuration library that determines the effective configuration values a job runs with, and we surface them in Dr. Squirrel."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Queryable cluster-wide job health"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Dr. 
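Squirrel offers a rich set of job stats, making it a resource center for understanding job health across the cluster and a source of insight for platform improvements. Examples include the top 10 root causes of restarts and the percentage of jobs with memory or backpressure issues."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Architecture"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As the features above show, metrics and logs are collected in one place. To gather them in a scalable way, we added MetricReporter and KafkaLog4jAppender components to our custom Flink build, which continuously send metrics and logs to Kafka topics. KafkaLog4jAppender also filters for the logs that matter to us: warning, error, and info logs that carry a stack trace. Next, FlinkJobWatcher, itself a Flink job, runs a series of analyses and transformations and joins the metrics and logs of each job. Every 5 minutes, FlinkJobWatcher creates a job health snapshot and sends it to the JobSnapshot Kafka topic."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The growing number of Flink use cases produces a large volume of logs and metrics. As a Flink job, FlinkJobWatcher can handle the growing data volume, and its parallelism is easy to tune, so system throughput keeps pace with the growth in use cases."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/89\/89d70adebb14eeee68188a45a55dae58.png","alt":"","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With JobSnapshot in place, more data needs to be fetched and merged into it. For this, we built a RESTful service with "},{"type":"link","attrs":{"href":"http:\/\/www.dropwizard.io\/","title":null,"type":null},"content":[{"type":"text","text":"dropwizard"}]},{"type":"text","text":" that continuously reads the JobSnapshot topic and pulls external data via RPC. The external data sources include the YARN ResourceManager for static data such as username and launch time, the Flink REST API for configurations, an internal tool that compares time-series metrics against thresholds using fine-grained criteria, called Automated Canary 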
Analysis (ACA), and several other internal tools that collect custom metrics such as RSS memory and CPU usage through daemons running on the worker nodes. We also built the front-end UI with React to make the health stats easier to explore."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/f8\/f86d309490784bb97b688c545db90116.png","alt":"","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"What's Next"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We will keep improving Dr. Squirrel to provide better diagnostic capabilities and move toward fully self-service diagnosis. Specifically:"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Capacity planning"},{"type":"text","text":": monitor and evaluate throughput, memory, and vcore usage to find the most efficient resource settings."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Integration with CI\/CD"},{"type":"text","text":": we run an automated CI\/CD pipeline to validate changes and push them from development to production. We will integrate Dr. Squirrel into CI\/CD so that each new change is pushed with a more accurate picture of job health."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Alerting and notification"},{"type":"text","text":": send health summary reports to job owners and the platform team."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Per-job cost estimation"},{"type":"text","text":": estimate the cost of each job based on its resource usage, as input for budget planning and control."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"More Pinterest stream processing references:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/medium.com\/pinterest-engineering\/unified-flink-source-at-pinterest-streaming-data-processing-c9d4e89f2ed6","title":null,"type":null},"content":[{"type":"text","text":"Unified Flink Source at Pinterest: Streaming Data Processing"}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/medium.com\/pinterest-engineering\/detecting-image-similarity-in-near-real-time-using-apache-flink-723ce072b7d2","title":null,"type":null},"content":[{"type":"text","text":"Detecting Image Similarity in (Near) Real-time Using Apache Flink"}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/medium.com\/pinterest-engineering\/pinterest-visual-signals-infrastructure-evolution-from-lambda-to-kappa-architecture-f8f58b127d98","title":null,"type":null},"content":[{"type":"text","text":"Pinterest Visual Signals Infrastructure: Evolution from Lambda to Kappa Architecture"}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/medium.com\/pinterest-engineering\/real-time-experiment-analytics-at-pinterest-using-apache-flink-841c8df98dc2","title":null,"type":null},"content":[{"type":"text","text":"Real-time experiment analytics at Pinterest using Apache Flink"}]}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Original article:"},{"type":"text","text":" "},{"type":"link","attrs":{"href":"https:\/\/medium.com\/pinterest-engineering\/faster-flink-adoption-with-self-service-diagnosis-tool-at-pinterest-50a07143f444","title":null,"type":null},"content":[{"type":"text","text":"Faster Flink adoption with self-service diagnosis tool at Pinterest"}]}]}]}