Gartner APM Magic Quadrant Technical Interpretation: Store Everything? No! Store on Demand? Yes!

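{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A call chain records the complete state and flow of every request, which makes it a vast store of valuable data. However, the cost and performance problems caused by its sheer volume are a challenge that no one running Tracing in production can avoid. How to record the most valuable traces and their associated data on demand, at the lowest possible cost, is the main topic of this article."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"The core keywords are: edge computing + hot/cold data separation."},{"type":"text","text":" If storing every call chain costs too much, yet sampling leaves you unable to find the data you need or makes your charts inaccurate, please read on; this article should give you some ideas."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/53\/5341ad59fb9933fcebba5b276ffec308.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Edge Computing: Recording More Valuable Data"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Edge computing, as the name suggests, means computing on data at the edge nodes; to use a trendier term, it can also be called “shift-left computing”. "},{"type":"text","marks":[{"type":"strong"}],"text":"When network bandwidth is limited and transmission overhead and global data hotspots are hard to resolve, edge computing is an effective way to find the best balance between cost and value."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In the Tracing field, the most common form of edge computing is filtering and analyzing data inside the user process. In a public cloud environment, processing data inside the user's cluster or private network also counts as edge computing; it saves a large amount of public network transmission overhead and spreads out the load of global data computation."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In addition, from the data perspective, edge computing can on the one hand filter out the more valuable data, and on the other hand extract deeper value from the data through processing, so that the most valuable data is recorded at minimal cost."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Filtering Out More Valuable Data"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"The value of trace data is not evenly distributed."},{"type":"text","text":" According to incomplete statistics, fewer than one in a million call chains are ever actually queried. Storing all of the data not only wastes a great deal of money but also significantly affects the performance and stability of the entire data pipeline. Two common filtering strategies are listed below."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Sample and report call chains based on characteristics of the trace data (Tag-based Sampling). Examples include keeping all erroneous or slow calls, sampling the first N calls per second for a specific service, and custom sampling for specific business scenarios. The figure below shows the custom sampling configuration page of Alibaba Cloud ARMS ("},{"type":"text","marks":[{"type":"italic"}],"text":"https:\/\/help.aliyun.com\/document_detail\/194773.html"},{"type":"text","text":"). Users can freely tailor the storage policy to their own needs, and the actual storage cost is usually less than 5% of that of the raw data."}]}]}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/d3\/d30245462ac06765a417905def0c73d6.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Automatically preserve the associated on-site data in abnormal scenarios. When diagnosing the root cause of a problem, we need more than the call chain: logs, exception stacks, local method timings, memory snapshots and other associated information all feed into the judgment. If the associated information of every single request were recorded, the system would very likely collapse. "},{"type":"text","marks":[{"type":"strong"}],"text":"Therefore, whether a Tracing product can automatically preserve snapshots of abnormal scenarios through edge computing is one of the important criteria for judging its quality."}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As shown in the figures below, Alibaba Cloud ARMS provides capabilities such as thread profiling for slow calls ("},{"type":"text","marks":[{"type":"italic"}],"text":"https:\/\/help.aliyun.com\/document_detail\/87560.html"},{"type":"text","text":") and HeapDump for memory anomalies ("},{"type":"text","marks":[{"type":"italic"}],"text":"https:\/\/help.aliyun.com\/document_detail\/72191.html"},{"type":"text","text":")."}]}]}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/4d\/4dfcb4a8526187152bd0d3c73fa5d381.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/57\/57be7b3857a53fc54ba8850fc4a2e938.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Whichever filtering strategy is used, the core idea is the same: compute on the data at the edge nodes, discard useless or low-value data, and keep high-value data that captures an abnormal scene or satisfies specific conditions."},{"type":"text","text":" This value-based selective reporting strategy is far more cost-effective than reporting everything, and it may well become the mainstream trend in Tracing."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Extracting the Value of Data"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Besides filtering, processing data at the edge nodes, for example through pre-aggregation and compression, can also effectively save transmission and storage costs while still meeting user needs."}]},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Pre-aggregated statistics: the biggest benefit of pre-aggregating on the client is that the volume of reported data drops sharply without any loss of statistical precision. For example, even after call chains are sampled at 1%, accurate monitoring and alerting capabilities such as service overview and upstream/downstream views can still be provided."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data compression: compressing and encoding long text that appears repeatedly (such as exception stacks and SQL statements) also effectively reduces network overhead. Combining this with blurring of non-critical fields works even better."}]}]}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Hot/Cold Data Separation: Meeting Custom Post-Aggregation Analysis Needs at Low Cost"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Edge computing can cover most pre-aggregation analysis scenarios, but it cannot satisfy the wide variety of post-aggregation analysis needs. For example, a business may need statistics on the interfaces whose latency exceeds 3 seconds and the distribution of their sources; such custom post-aggregation rules cannot be enumerated in advance. When the analysis rules cannot be defined beforehand, it seems the only option is the extremely expensive approach of storing all the raw data. Is there really no room for optimization? There is. Next we introduce a low-cost solution to the post-aggregation analysis problem: hot/cold data separation."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"A Brief Description of Hot/Cold Data Separation"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"The value of hot/cold data separation rests on the fact that user query behavior follows the principle of temporal locality."},{"type":"text","text":" Put simply, the most recent data is queried most often, while cold data is rarely queried. For example, because problem diagnosis is time-sensitive, more than 50% of trace queries and analyses concern data generated within the previous 30 minutes, and queries on traces older than 7 days usually focus on erroneous or slow call chains. With the theoretical basis established, we next discuss how to implement hot/cold data separation."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First, hot data is time-sensitive. If only the hot data from the most recent period needs to be recorded, the storage space required drops considerably. In addition, in a public cloud environment the data of different users is naturally isolated. A scheme that computes and stores hot data inside the user's VPC is therefore more cost-effective."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Second, queries on cold data are targeted, so different sampling strategies can be used to select the cold data that meets diagnostic needs for persistent storage, for example error/slow sampling or sampling for specific business scenarios. Because cold data is stored for a long time and has high stability requirements, it can be managed centrally within the Region."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In summary, hot data has a short storage period and low cost, yet it supports real-time post-aggregation analysis over the full data set; cold data, after precise sampling, shrinks dramatically in total volume, usually to only 1% to 10% of the raw data, while still satisfying the diagnostic needs of most scenarios. Combining the two achieves the best balance between cost and experience. Leading APM products at home and abroad, such as ARMS, Datadog, and Lightstep, have all adopted a hot/cold data separation storage scheme."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/0c\/0cb4f4151d0d3ff8931d61085b89fb84.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Real-Time Full Analysis of Hot Data"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Detailed trace data contains the most complete and richest call information. The views most commonly used in the APM field, such as service dashboards, upstream/downstream dependencies, and application topology, are all derived from statistics over detailed trace data. Post-aggregation analysis based on detailed trace data can locate problems more effectively according to each user's specific needs. However, the biggest challenge of post-aggregation analysis is that the statistics must be computed over the full data set; otherwise sample skew can lead to conclusions that are far from reality."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Alibaba Cloud ARMS, the only Chinese cloud vendor included in the 2021 Gartner APM Magic Quadrant, provides full analysis of hot data from the last 30 minutes, supporting filtering and aggregation under any combination of conditions, as shown in the figure below:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/56\/566693224efa86aec4ea9ce800448d5c.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Persistent Sampled Analysis of Cold Data"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Persistently storing every call chain is very expensive. As mentioned earlier, fewer than one in a million call chains are actually queried after 30 minutes, and most of those queries focus on erroneous or slow call chains or on traces that match specific business characteristics; anyone who troubleshoots trace problems regularly will probably agree. We should therefore keep only the small number of call chains that satisfy precise sampling rules, which greatly reduces the cost of persistent cold data storage."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"So how should precise sampling be implemented? The methods commonly used in the industry fall into two categories: Head-based Sampling and Tail-based Sampling. Head-based sampling is generally performed at edge nodes such as the client Agent, for example rate-limited sampling or fixed-ratio sampling per interface or service; tail-based sampling usually filters over the full set of hot data, for example keeping all erroneous and slow calls."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"The ideal sampling strategy would store only the data that will actually be queried. APM products need to provide flexible sampling policy configuration and best practices, and users should adjust the policies adaptively based on their own business scenarios."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Conclusion"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As more and more enterprises and applications move to the cloud and public cloud clusters grow explosively, “cost” will be a key factor in how enterprises evaluate cloud usage. In the cloud-native era, making full use of the computing and storage capacity of edge nodes, combined with hot/cold data separation, to explore the value of data cost-effectively has gradually become mainstream in the APM field. The traditional approach of reporting, storing, and then analyzing all data will face ever greater challenges. What the future holds remains to be seen."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This article is reprinted from: Alibaba Middleware (阿里巴巴中間件, ID: Aliware_2018)"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Original link: "},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s\/nWu2vIXZlouTQKAvc65_mQ","title":"xxx","type":null},"content":[{"type":"text","text":"Gartner APM Magic Quadrant Technical Interpretation: Store Everything? No! Store on Demand? Yes!"}]}]}]}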