The Evolution of the Prometheus Storage Layer

{"type":"doc","content":[
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Prometheus is one of the most popular monitoring platforms today. Its main job is to scrape monitoring data from target nodes, persist it in a local time series database, and expose a convenient query interface. This article traces how the Prometheus storage layer evolved, drawing mainly on talks the Prometheus team gave at successive PromCon conferences."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The time series database is one part of the Prometheus monitoring platform. Before looking at how its storage layer evolved, we first need to understand what a time series database is and the fundamental problem it has to solve."}]},
{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"TSDB"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A time series database (TSDB)"},{"type":"text","text":" is a member of the database family that specializes in storing data that changes over time, such as stock prices, sensor readings, and machine status metrics. A "},{"type":"text","marks":[{"type":"strong"}],"text":"time series"},{"type":"text","text":" is the full history of a variable as it changes over time, and a "},{"type":"text","marks":[{"type":"strong"}],"text":"sample"},{"type":"text","text":" is the instantaneous value of that variable at one point in that history:"}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/20/201272f18e27f0072f47aad8a3e41299.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Each sample consists of three parts: a "},{"type":"text","marks":[{"type":"strong"}],"text":"series identifier"},{"type":"text","text":", a "},{"type":"text","marks":[{"type":"strong"}],"text":"timestamp"},{"type":"text","text":", and a "},{"type":"text","marks":[{"type":"strong"}],"text":"value"},{"type":"text","text":". The series it belongs to is simply the sequence of such samples. Because time is continuous, we cannot record the value of a series at every instant, nor do we need to, so the "},{"type":"text","marks":[{"type":"strong"}],"text":"sampling interval"},{"type":"text","text":" is also an important property of a series. A shorter interval means more samples and more captured detail; a longer interval means fewer samples and more missed detail. For server monitoring, a typical sampling interval is 15 seconds."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Efficient queries depend on indexes, and for time series data the one natural index is time (the timestamp). As a result, the storage layer of a time series database is usually much simpler than that of a relational database. On reflection, time series data is in some sense a subset of key-value data, so a key-value database is a natural carrier for it. A time series database typically holds millions of series or more, and finding the handful that a query needs is not trivial, so an efficient index over the series themselves matters as well."}]},
{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"The Fundamental Problem of TSDBs"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The fundamental problem a TSDB has to solve can be summarized by the figure below:"}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/61/61e987cbb05f960a3ac43fb93c3a4ded.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Engineers who have studied storage engine internals and performance tuning will know that:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Many of the clever tricks in databases exist to bridge the mismatch between memory and disk in access patterns and performance."}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A time series database is still a database, so as long as it persists data it is no exception. Compared with key-value databases, however, time series data has more distinctive read and write characteristics, which Björn Rabenstein describes as:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Vertical writes, horizontal(-ish) reads"}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Each horizontal line in the figure is one series, made up of samples collected at a (nearly) fixed interval. A time series database usually has many active series, so writes can be pictured as a narrow vertical box: every series receives a new sample. When querying, users typically look at how one or a few series change over a time range, or aggregate over them, so reads can be pictured as a horizontal box. Hence “vertical writes, horizontal reads”."}]},
{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Storage Layer of Prometheus"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Prometheus was built for monitoring in cloud-native environments, and its design has to account for at least the following two points:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"In a cloud-native environment, instances can appear and disappear at any time, and so can series. The system therefore holds a large number of series, only some of which are active. This raises several challenges: how to store a large number of series without wasting resources, and how to locate the few series a query asks for."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"The monitoring system itself should depend on external services as little as possible; otherwise a failure in an external service takes the monitoring system down with it."}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For point 2, the Prometheus team chose to give up clustering in favor of a single-node architecture that persists data in a local TSDB and depends on no external service. Point 1 has to be solved jointly by the storage, index, and query engine layers; below we look more closely at the storage layer's part in it. The evolution of the Prometheus storage layer can be divided into three stages:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1st Generation: Prototype"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2nd Generation: Prometheus V1"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3rd Generation: Prometheus V2"}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic"}],"text":"Note: this section covers only the storage of Prometheus time series data, not the storage of other data such as indexes or the WAL."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Data Model"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The data model is an abstraction on top of the storage layer and in theory should not influence the storage layer's design. Even so, understanding the data model helps us understand the storage layer faster."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In Prometheus, each series is in fact identified by multiple "},{"type":"text","marks":[{"type":"strong"}],"text":"labels"},{"type":"text","text":", for example:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"api_http_requests_total{path=\"/users\",status=200,method=\"GET\",instance=\"10.111.201.26\"}"}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This series is named api_http_requests_total and carries the labels path, status, method, and instance. Two series are the same series only if their names and all label key-value pairs match exactly. In fact, the series name is itself a hidden label:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"{__name__=\"api_http_requests_total\",path=\"/users\",status=200,method=\"GET\","}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"instance=\"10.111.201.26\"}"}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"From the user's point of view labels have no inherent order, and a user may be interested in:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"the status of all API calls"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"the success rate and QPS of calls to a given path"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"the success rate of calls to a given path on a given instance"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"…"}]}]}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"1st Generation: Prototype"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In the prototype stage, Prometheus used an open-source key-value database (LevelDB) directly as its local persistent store, with a schema similar to "},{"type":"link","attrs":{"href":"https://cloud.google.com/bigtable/docs/schema-design-time-series?hl=en#server_metrics","title":null},"content":[{"type":"text","text":"the one BigTable recommends for time series data"}]},{"type":"text","text":":"}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c1/c1ee35cd6c56a0abdba02facd0eb8e23.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The "},{"type":"text","marks":[{"type":"strong"}],"text":"series name, labels (in a fixed order), and timestamp"},{"type":"text","text":" are concatenated to form the key of each sample, so the data of one series is stored contiguously in the key-value database, which makes range queries more efficient. As the figure shows, however, the resulting keys are long. Although the key-value database compresses data internally, holding data in memory in this form wastes space and does not meet the project's design requirements. Prometheus wants to compress data in memory so that more active series fit, and to use a similar compressed encoding on disk for efficiency. Time series data also has more pronounced regularities than general key-value data, so even where a key-value database can compress data, a compression scheme tailored to time series achieves a better ratio. For these reasons the idea of building on a third-party key-value database was abandoned at the prototype stage."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2nd Generation: Prometheus V1"}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Compression"}]},
{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"text","text":"Why Compression"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Suppose the monitoring system has the following requirements:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"5 million active series"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"30-second sampling interval"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1 month of data retention"}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A quick calculation gives the concrete storage requirements:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"about 166,000 samples ingested per second on average"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"432 billion samples stored in total"}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With no compression at all and ignoring series identifiers, each sample takes 16 bytes (8 for the timestamp and 8 for the value), so the system needs "},{"type":"text","marks":[{"type":"strong"}],"text":"about 7 TB of storage"},{"type":"text","text":". If the data has to be kept for 6 months, the "},{"type":"text","marks":[{"type":"strong"}],"text":"total is 42 TB"},{"type":"text","text":". An effective way to compress the data would therefore let a single machine hold more series, over longer periods, in memory and on disk."}]},
{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"text","text":"Chunked Storage Abstraction"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As noted above, the fundamental problem of a TSDB is “vertical writes, horizontal reads”: every scrape writes one sample to every active series. Writing 16 bytes per series to an HDD or SSD each time is clearly unfriendly to block storage devices and inefficient. Prometheus V1 therefore cuts each series into fixed-size chunks, which makes compression and batched reads and writes easier."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When accessing time series data, Prometheus uses three layers of abstraction, as shown below:"}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b3/b3c625910f6056532e3786a15df352c4.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The application layer uses a Series Iterator to walk through the samples of a series in order. Underneath, a Series Iterator is stitched together from individual Chunk Iterators, each of which decodes compressed series data and returns it. The benefit is that "},{"type":"text","marks":[{"type":"strong"}],"text":"each chunk can even use a completely different encoding"},{"type":"text","text":", which makes it easy for the development team to experiment with different encoding schemes."}]},
{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"text","text":"Timestamp Compression: Double Delta"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Because the sampling interval is usually fixed, the difference between consecutive timestamps is nearly constant, for example 15s or 30s. If we go one step further and store only the difference between consecutive differences, a new timestamp costs almost no extra space. This is the so-called “"},{"type":"text","marks":[{"type":"strong"}],"text":"Double Delta"},{"type":"text","text":"” encoding. In essence, if every future scrape timestamp could be predicted exactly, each new timestamp would carry 0 bits of information. Reality is not that tidy: the network may be delayed or interrupted, instances may hit GC pauses or restart, and the sampling interval can fluctuate at any time:"}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/01/01b5489a1d90b48f3e0e91bea69a2d57.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The fluctuation is bounded, though. Prometheus encodes timestamps in a way similar to Gorilla, Facebook's in-memory time series database. For details, see my other post on "},{"type":"link","attrs":{"href":"https://tech.ipalfish.com/blog/2020/02/16/Gorilla-A-Fast-Scalable-In-Memory-Time-Series-Database-2015/","title":null},"content":[{"type":"text","text":"Gorilla"}]},{"type":"text","text":" and the "},{"type":"link","attrs":{"href":"https://docs.google.com/presentation/d/1TMvzwdaS8Vw9MtscI9ehDyiMngII8iB_Z5D4QW4U4ho/edit#slide=id.g15afea0287_0_16","title":null},"content":[{"type":"text","text":"slides"}]},{"type":"text","text":" from Björn Rabenstein's PromCon 2016 talk. The details are fairly fiddly, so I will not repeat them here."}]},
{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"text","text":"Value Compression"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Every sample value in both Prometheus and Gorilla is a float64. Gorilla exploits the binary representation of float64 (IEEE 754) and XORs consecutive sample values to find room for compression, reaching 1.37 bytes/sample. The approach in Prometheus V1 is simpler:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Store values as integers (8/16/32 bits) where possible, otherwise as float32, and fall back to storing float64 directly only as a last resort."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If the values increase very regularly, use no extra space to store them."}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"These techniques give Prometheus V1 a compression of 3.3 bytes/sample. Unlike Gorilla, which keeps all of its data in memory, Prometheus found this level of compression sufficient, although V2 also adopted the compression techniques used by Gorilla."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Chunk Encoding"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Prometheus V1 splits each series into chunks of 1KB, as shown below:"}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/2c/2c9dd3faa9bdb888b25078c4cf9ddf1d.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The most recently written chunks are kept in memory, and the head chunk is the one receiving new samples. As soon as a head chunk fills up to 1KB it is frozen; we call it a complete chunk, and from that moment its data is immutable. A new head chunk is created to absorb incoming writes. Each complete chunk is persisted to disk as soon as possible. Memory holds the chunks of each series that were most recently written or accessed; when there are too many, the storage engine evicts the excess using an LRU policy."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In Prometheus V1, each series is stored in its own dedicated file, which means a large number of series produces a large number of files. The storage engine periodically checks the series files on disk for chunks older than the retention period and deletes them (copy, then delete)."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Prometheus's query engine must execute queries entirely in memory, so before execution the storage engine has to preload any chunks that are not already in memory:"}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/09/0949cf5ea604441385fc5b2640886787.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If the system crashes before in-memory chunks have been persisted, data is lost. To reduce that loss, Prometheus V1 also uses an additional checkpoint file that stores the chunks of each series not yet written to disk:"}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/30/30f4bd7c297d3d74cfed5289d0fbbf11.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Prometheus V1 vs. Gorilla"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Because Prometheus V1 and Gorilla differ in design philosophy and requirements, comparing the two helps explain why their designs made different decisions."}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/8c/8ce5cdc24e75e89dcffb0122bdd1d018.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"3rd Generation: Prometheus V2"}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"The Main Problem With 2nd Generation"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In Prometheus V1, mapping each series to one file on disk caused the system considerable trouble:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In a cloud-native environment new series are constantly created and old ones abandoned (series churn), so the storage layer needs far more files than there are active series. Left unchecked, this eventually exhausts the file system's inodes, and recovering from that is extremely painful. Moreover, during heavy churn the old series have not yet been evicted from memory, so memory usage spikes and can cause OOM."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Even with chunks used to batch reads and writes, the system as a whole still writes thousands of chunks to disk every second, which puts pressure on I/O. Reducing the number of I/O operations by writing larger batches puts pressure on memory instead."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Keeping all series files open at once is unreasonable and consumes a lot of resources, while opening and closing files around each query adds query latency."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When data exceeds its retention period the corresponding chunks must be deleted, which means running a delete operation over millions of files at regular intervals. This can take hours."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Periodically writing unpersisted chunks to the checkpoint file does in theory reduce data loss, but if recovery takes a long time the system misses new data in the meantime, and it might be better not to recover at all."}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The main change in the third-generation storage engine is therefore to abandon the “one file per series” design."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Macro Design"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The on-disk file layout of the third-generation storage engine is shown below:"}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/6b/6b3e116902771cf008d83da02001f6fd.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Under the root directory, numbered blocks are laid out in sequence. Each block contains an index and a chunk directory; the latter holds numbered chunks, each of which contains "},{"type":"text","marks":[{"type":"strong"}],"text":"sample data from many different series"},{"type":"text","text":". The index file helps us quickly identify series labels and their possible values, and from there the relevant series and the chunks holding their samples. Note that the newest block directory also contains a wal directory, which is responsible for crash recovery."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Many Little Databases"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The third-generation storage engine shards all time series data by time, that is, it splits the data along the time dimension into non-overlapping blocks, as shown below:"}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/5f/5f0c8ffaf88cc7fb5950f134ab6fd221.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Each block is effectively a small database that stores all series data within its time window, so it needs its own index and chunks. Apart from the newest block, which is still receiving fresh data, all blocks are immutable. Because new data is written in memory, writes are efficient:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ee/eed5d98a2f2c9d85a0b53e1373ac9069.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To prevent data loss, all newly collected data is also written to a write-ahead log (WAL), from which it can be quickly restored into memory on recovery. At query time, the query has to be sent to the relevant blocks and the results merged."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Sharding data by time gives the storage engine new capabilities:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When querying a time range, blocks outside that range can be skipped entirely."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Once a block is complete it is easy to persist to disk, because only a handful of files need to be written."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"New data, which is also the most frequently queried, stays in memory, improving query efficiency (the second generation supports this as well)."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Chunks are no longer fixed at 1KB; any suitable size and compression scheme can be chosen."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Deleting data past its retention period becomes trivial: just delete the whole directory."}]}]}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"mmap"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"By merging millions of small files into a small number of large ones, the third-generation engine also makes mmap practical. With mmap, file I/O and cache management are handed to the operating system, which reduces how often OOM occurs."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Compaction"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In the macro design, all time series data is cut by time into many blocks, and once a newly filled block has been persisted to disk the corresponding WAL files are removed. For writes, we want each block to stay fairly small, around 2 hours for example, to avoid accumulating too much data in memory. For reads, a query spanning multiple time ranges has to run against many blocks separately and then merge the results. A query over one week of data would touch more than 80 blocks, which hurts read efficiency."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To make both writes and reads fast we introduce compaction, which merges the data of one or more blocks into a larger block. During the merge it also drops deleted data, merges multiple versions of data, and restructures chunks to improve query efficiency, as shown below:"}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/39/39790ee1365a2c32c142d68b4f8eea8b.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Retention"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When data exceeds its retention period, removing the old data is straightforward:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/89/89aad68806cb19fba69f39c52954d09b.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Simply delete the block directories that lie beyond the retention boundary. If the boundary falls inside a block, the block is kept until the boundary has moved past it entirely. Compaction merges old blocks into larger ones, whereas retention works better with finer granularity, so the two strategies pull against each other. Prometheus provides settings that cap the size of a single block to strike a balance between them."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"By now you have probably noticed: "},{"type":"text","marks":[{"type":"strong"}],"text":"isn't this just an "},{"type":"text","text":"LSM Tree"},{"type":"text","marks":[{"type":"strong"}],"text":"?"},{"type":"text","text":" Each block is an SSTable ordered by time, and the in-memory block is the MemTable."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Compression"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The third-generation storage engine incorporates Gorilla's XOR float encoding, improving compression to 1-2 bytes/sample. The scheme can be summarized as: apply the first of the following strategies, in order, that fits."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"Zero encoding: if the value is fully predictable, no extra space is needed"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"Integer double-delta encoding: for integer values, apply the double-delta principle and encode differing intervals in 6/13/20/33 bits to save space"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"XOR float encoding: see Gorilla"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"Direct encoding: store the float64 as is"}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"On average this achieves 1.28 bytes/sample."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://tech.ipalfish.com/blog/about/","title":""},"content":[{"type":"text","text":"Learn more about the Palfish (伴魚) engineering team"}]}]}]},
{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"References"}]},
{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://www.youtube.com/watch?v=b_pEevMAC3I&feature=youtu.be","title":null},"content":[{"type":"text","text":"PromCon 2017: Storing 16 Bytes at Scale - Fabian Reinartz"}]},{"type":"text","text":", "},{"type":"link","attrs":{"href":"https://promcon.io/2017-munich/slides/storing-16-bytes-at-scale.pdf","title":null},"content":[{"type":"text","text":"slides"}]}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://fabxc.org/tsdb/","title":null},"content":[{"type":"text","text":"Writing a Time Series Database from Scratch"}]}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://www.youtube.com/watch?v=HbnGSNEjhUc","title":null},"content":[{"type":"text","text":"PromCon 2016: The Prometheus Time Series Database - Björn Rabenstein"}]},{"type":"text","text":", "},{"type":"link","attrs":{"href":"https://docs.google.com/presentation/d/1TMvzwdaS8Vw9MtscI9ehDyiMngII8iB_Z5D4QW4U4ho/edit#slide=id.g59e2f6081_1_0","title":null},"content":[{"type":"text","text":"slides"}]}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://www.youtube.com/watch?v=evPYwNzoltU&t=782s","title":null},"content":[{"type":"text","text":"Percona Live Open Source Database Conference 2017: Life of a PromQL query"}]}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://prometheus.io/docs/prometheus/1.8/storage/","title":null},"content":[{"type":"text","text":"Prometheus 1.8 doc: storage"}]}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://prometheus.io/docs/prometheus/latest/storage/","title":null},"content":[{"type":"text","text":"Prometheus 2.16 doc: storage"}]}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://cloud.google.com/bigtable/docs/schema-design-time-series?hl=en#server_metrics","title":null},"content":[{"type":"text","text":"Google Cloud: Schema Design for Time Series Data"}]}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}
]}
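As a supplementary illustration of the idea behind the Timestamp Compression: Double Delta section, here is a minimal Python sketch. It only shows the arithmetic of storing "differences of differences"; it does not reproduce the bit-level layout used by Prometheus or Gorilla, and the function names `double_delta_encode` and `double_delta_decode` are hypothetical, chosen for this example.

```python
from typing import List


def double_delta_encode(timestamps: List[int]) -> List[int]:
    """Encode as [first timestamp, first delta, delta-of-deltas...].

    Assumption for illustration: integer timestamps (e.g. seconds).
    """
    if not timestamps:
        return []
    encoded = [timestamps[0]]
    prev_ts = timestamps[0]
    prev_delta = 0
    for i, ts in enumerate(timestamps[1:]):
        delta = ts - prev_ts
        # The first delta is stored as is; later entries store only
        # how much the delta changed relative to the previous delta.
        encoded.append(delta if i == 0 else delta - prev_delta)
        prev_ts, prev_delta = ts, delta
    return encoded


def double_delta_decode(encoded: List[int]) -> List[int]:
    """Invert double_delta_encode."""
    if not encoded:
        return []
    timestamps = [encoded[0]]
    delta = 0
    for i, value in enumerate(encoded[1:]):
        # Rebuild the running delta, then the timestamp.
        delta = value if i == 0 else delta + value
        timestamps.append(timestamps[-1] + delta)
    return timestamps


# A nominal 15 s scrape interval with a little jitter.
scrapes = [1000, 1015, 1030, 1046, 1060]
encoded = double_delta_encode(scrapes)
print(encoded)
print(double_delta_decode(encoded) == scrapes)
```

With a perfectly regular interval every entry after the first delta is 0, which is why a variable-length bit encoding can then spend close to nothing on each new timestamp; jitter shows up only as small values near zero.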