Apache Pulsar Performance Tuning in Practice at BIGO (Part 1)

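{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Background"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Powered by artificial intelligence, BIGO's video-based products and services are widely popular and have users in more than 150 countries and regions. They include Bigo Live (live streaming) and Likee (short video). Bigo Live has taken off in more than 150 countries and regions, and Likee has over 100 million users and is popular with Generation Z."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As the business has grown rapidly, the data volume carried by BIGO's message queue platform has multiplied, and downstream workloads such as online model training, online recommendation, real-time data analysis, and the real-time data warehouse place higher demands on message timeliness and stability."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"BIGO's message queue platform was built on open-source Kafka. However, as business data volume multiplied and the requirements for message timeliness and system stability kept rising, maintaining multiple Kafka clusters became increasingly expensive, mainly in the following ways:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data storage is bound to the message queue service, so scaling the cluster in or out and rebalancing partitions require copying large amounts of data, which degrades cluster performance"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When a partition replica is not in the ISR (in-sync replica) state, a single broker failure may cause data loss or leave that partition unable to serve reads and writes"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When a Kafka broker's disk fails or its usage is too high, manual intervention is required"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Cross-region cluster synchronization uses KMM (Kafka Mirror 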
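Maker), whose performance and stability fall short of expectations"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In catch-up read scenarios, the PageCache is easily polluted, which degrades read and write performance"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Although each Kafka topic partition is written sequentially, when a broker hosts hundreds or thousands of topic partitions the writes become random from the disk's point of view. Disk read/write performance then drops as the number of topic partitions grows, so the number of topic partitions a Kafka broker can store is limited"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As Kafka clusters grow, their operations cost rises sharply and day-to-day operations require substantial manpower. At BIGO, adding one machine to a Kafka cluster and rebalancing partitions takes 0.5 person-days; removing one machine takes 1 person-day"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To improve the timeliness, stability, and reliability of the message queue and to lower operations cost, we re-examined the shortcomings of Kafka's architecture and investigated whether these problems could be solved at the architectural level to meet our current business requirements."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"The next-generation messaging and streaming platform: Pulsar"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Apache Pulsar is a top-level project of the Apache Software Foundation. It is a next-generation cloud-native distributed messaging and streaming platform that combines messaging, storage, and lightweight functional computing. Pulsar was open-sourced by Yahoo in 2016 and donated to the Apache Software Foundation for incubation, and it became an Apache top-level project in 2018."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Pulsar uses a layered architecture that separates compute from storage. It supports multi-tenancy, persistent storage, and cross-region replication across data centers, and it offers highly scalable stream storage with strong consistency, high throughput, and low latency."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Pulsar 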
attracted us mainly with the following features:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Linear scalability: scales out seamlessly to hundreds or thousands of nodes"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"High throughput: proven in Yahoo's production environment, supporting publish-subscribe (Pub-Sub) at millions of messages per second"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Low latency: keeps latency low (under 5 ms) even at large message volumes"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Persistence: Pulsar's persistence layer is built on Apache BookKeeper, which provides read/write separation"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Read/write separation: BookKeeper's read/write-separated IO model makes full use of sequential disk write performance and is relatively friendly to mechanical hard disks; the number of topics a single bookie node can support is not limited"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Apache Pulsar's architecture addresses the problems we ran into with Kafka, and it offers many valuable features such as multi-tenancy, a consumption model that unifies message queuing with batch and stream processing, and strong consistency."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To deepen our understanding of Apache Pulsar and to judge whether it could really meet our production need for large-scale Pub-Sub messaging, we began a series of stress tests in December 2019. Because we use mechanical hard disks rather than SSDs, we ran into a series of performance problems during the stress tests. We are very grateful to the StreamNative team for their help, and to Sijie, Jia Zhai, and Penghui for their patient guidance and discussion. Through a series of tuning steps we steadily improved Pulsar's throughput and stability."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After 3~4 
months of stress testing and tuning, we officially started using Pulsar clusters in production in April 2020. We co-locate the bookie and the broker on the same node and are gradually replacing the production Kafka clusters. To date, the production Pulsar cluster has a dozen or so machines and processes tens of billions of messages per day, and we are gradually expanding it and migrating Kafka traffic to it."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Problems encountered when stress testing and using Pulsar"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"You may run into the following problems when using or stress testing Pulsar:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":"1","normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"Pulsar broker nodes are unevenly loaded."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"The broker-side cache hit rate is low, so a large number of read requests reach the bookies, where read performance is poor."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"Brokers frequently run out of memory (OOM) during stress tests."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"Bookies hit direct memory OOM, which kills the process."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":5,"align":null,"origin":null},"content":[{"type":"text","text":"Bookie nodes are unevenly loaded and the load fluctuates frequently."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":6,"align":null,"origin":null},"content":[{"type":"text","text":"When the Journal disk is an HDD, the bookie add entry 99th percentile latency stays high and write performance is poor even with fsync disabled."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":7,"align":null,"origin":null},"content":[{"type":"text","text":"When a bookie serves a large number of read requests, writes are back-pressured and add entry latency 
rises."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":8,"align":null,"origin":null},"content":[{"type":"text","text":"Pulsar clients frequently report “Lookup Timeout Exception”."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":9,"align":null,"origin":null},"content":[{"type":"text","text":"ZooKeeper read/write latency is too high, which makes the whole Pulsar cluster unstable."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":10,"align":null,"origin":null},"content":[{"type":"text","text":"Consuming a Pulsar topic through the reader API (e.g. the pulsar flink connector) is slow (in versions before Pulsar 2.5.2)."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When the Journal/Ledger disks are mechanical hard disks (HDD), problems 4, 5, 6, and 7 are especially severe. Intuitively, these problems come from disks that are not fast enough: if the Journal/Ledger disks could read and write fast enough, messages would not pile up in direct memory and the resulting OOMs would not occur."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Our production message queue system has to store a large amount of data (TB to PB scale), so using SSDs for both the Journal and Ledger disks would be costly. Is it possible to tune parameters and policies in Pulsar / BookKeeper so that HDDs can also deliver good performance?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"While stress testing and using Pulsar, we ran into a series of performance problems, which fall mainly into the Pulsar broker layer and the BookKeeper layer. This series therefore describes how BIGO tuned the Pulsar broker and BookKeeper so that Pulsar performs well whether the disks are SSDs or HDDs."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For reasons of length, the series is split into two parts: the first covers performance tuning of the Pulsar broker, and the second covers BookKeeper and Pulsar 
working together, and how to tune that combination."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The rest of this article covers the performance-related parts of Pulsar / BookKeeper and offers some tuning suggestions (these tuning measures have been running stably in BIGO's production systems and have paid off well)."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Environment deployment and monitoring"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Environment deployment"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Because BookKeeper and the Pulsar broker depend heavily on ZooKeeper, keeping Pulsar stable requires low ZooKeeper read/write latency. In addition, BookKeeper is IO-intensive, so to keep IO streams from interfering with each other, the Journal and Ledger directories are placed on separate disks. In summary:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Put the bookie Journal/Ledger directories on dedicated disks"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When the Journal/Ledger directories are on HDDs, do not put the ZooKeeper dataDir/dataLogDir on the same disk as the Journal/Ledger directories"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Both BookKeeper and the Pulsar broker rely on direct memory, and BookKeeper also relies on the PageCache to speed up reads and writes, so a sensible memory allocation strategy is critical as well. Sijie from the Pulsar community 
recommends the following memory allocation strategy:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"OS: 1 ~ 2 GB"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"JVM: 1/2"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"\t* heap: 1/3"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"\t* direct memory: 2/3"}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"PageCache: 1/2"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Assuming a machine with 128 GB of physical memory and the bookie and broker co-located, memory is allocated as follows:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"OS: 2GB"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Broker: 31GB"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"\t* heap: 10GB"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"\t* direct memory: 
21GB"}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Bookie: 32GB"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"\t* heap: 10GB"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"\t* direct memory: 22GB"}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"PageCache: 63GB"}]}]}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Monitor: monitoring comes before tuning"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To spot performance bottlenecks more directly, we need a complete monitoring system for Pulsar/BookKeeper in which every stage reports metrics, so that when an anomaly occurs (including but not limited to performance problems) we can use those metrics to locate the bottleneck quickly and work out a fix."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Both Pulsar and BookKeeper expose Prometheus endpoints, so their metrics can be fetched over HTTP and fed straight into Prometheus/Grafana. If you are interested, you can follow the Pulsar Manager instructions to install it: https://github.com/streamnative/pulsar-manager"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The key metrics to watch are:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":"1","normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"Pulsar 
Broker"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"\t* jvm heap/gc"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"\t* bytes in per broker"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"\t* message in per broker"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"\t* loadbalance"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"\t* broker-side cache hit rate"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"\t* bookie client quarantine ratio"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"\t* bookie client request queue"}]},{"type":"numberedlist","attrs":{"start":"2","normalizeStart":"2"},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"BookKeeper"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"\t* bookie request queue size"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"\t* bookie request queue wait time"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"\t* add entry 99th latency"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"\t* read entry 99th latency"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"\t* journal create log 
latency"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"\t* ledger write cache flush latency"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"\t* entry read throttle"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":"3","normalizeStart":"3"},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"ZooKeeper"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"\t* local/global ZooKeeper read/write request latency"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Some of these metrics have no Grafana template in the repo above; you can add your own PromQL queries to configure them."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Pulsar broker performance tuning"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Tuning the Pulsar broker mainly involves the following areas:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":"1","normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"Load balancing"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"\t* Broker load balancing"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"\t* Bookie 
node load balancing"}]},{"type":"numberedlist","attrs":{"start":"2","normalizeStart":"2"},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"Throttling"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"\t* The broker needs flow control when receiving messages, so that a sudden traffic spike does not cause broker direct memory OOM."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"\t* The broker needs flow control when sending messages to a consumer/reader, so that sending too many messages at once does not cause frequent GC in the consumer/reader."}]},{"type":"numberedlist","attrs":{"start":"3","normalizeStart":"3"},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"Improve the cache hit rate"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"Keep ZooKeeper read/write latency low"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":5,"align":null,"origin":null},"content":[{"type":"text","text":"Disable auto bundle split to keep the system stable"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Load balancing"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Broker load balancing"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Balancing load across brokers raises broker node utilization, improves the broker cache hit rate, and lowers the chance of broker OOM. This part mainly concerns Pulsar bundle rebalancing."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The namespace bundle structure is shown below. Each namespace consists of a certain number of bundles, and every topic under the namespace is mapped by hash to exactly one 
bundle; bundles are then loaded onto or unloaded from the brokers that serve them."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If a broker has no bundles, or fewer bundles than other brokers, its traffic will be lower than theirs."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/f1/f183defecdc4cd233d33774897ee12d6.jpeg","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The existing default bundle rebalance strategy (OverloadShedder) works as follows: once a minute it checks, for every broker in the cluster, whether the maximum of its CPU, Memory, Direct Memory, Bandwidth In, and Bandwidth Out usage exceeds a threshold (85% by default). If it does, a certain number of bundles with high inbound traffic are unloaded from that broker, and the leader then decides to reload the unloaded bundles onto the least-loaded broker."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This strategy has the following problems:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":"1","normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"The default threshold is hard to reach, so most of the cluster's traffic can easily end up concentrated on a few brokers;"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"It is hard to decide how to adjust the threshold, which is strongly affected by other factors, especially when other services are deployed on the same node;"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"After a broker 
restarts, no traffic is rebalanced onto it for a long time, because none of the other broker nodes has reached the bundle unload threshold."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To address this, we developed a load balancing strategy based on the cluster average, with configurable weights for CPU, Memory, Direct Memory, Bandwidth In, and Bandwidth Out. For details, see "},{"type":"link","attrs":{"href":"https://github.com/apache/pulsar/pull/6772","title":""},"content":[{"type":"text","text":"PR-6772"}]},{"type":"text","text":"."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This strategy is supported starting with Pulsar 2.6.0 and is disabled by default. To enable it, set the following parameter in broker.conf:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"loadBalancerLoadSheddingStrategy=org.apache.pulsar.broker.loadbalance.impl.ThresholdShedder"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The following parameters give fine-grained control over the weight of each collected metric:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"# The broker resource usage threshold.\n# When the broker resource usage is greater than the pulsar cluster average resource usage,\n# the threshold shedder will be triggered to offload bundles from the broker.\n# It only takes effect in ThresholdShedder strategy.\nloadBalancerBrokerThresholdShedderPercentage=10\n\n# The proportion that historical usage accounts for when calculating new resource usage.\n# It only takes effect in ThresholdShedder strategy.\nloadBalancerHistoryResourcePercentage=0.9\n\n# The bandwidth-in usage weight when calculating new resource usage.\n# It only takes effect in ThresholdShedder 
strategy.\nloadBalancerBandwithInResourceWeight=1.0\n\n# The bandwidth-out usage weight when calculating new resource usage.\n# It only takes effect in ThresholdShedder strategy.\nloadBalancerBandwithOutResourceWeight=1.0\n\n# The CPU usage weight when calculating new resource usage.\n# It only takes effect in ThresholdShedder strategy.\nloadBalancerCPUResourceWeight=1.0\n\n# The heap memory usage weight when calculating new resource usage.\n# It only takes effect in ThresholdShedder strategy.\nloadBalancerMemoryResourceWeight=1.0\n\n# The direct memory usage weight when calculating new resource usage.\n# It only takes effect in ThresholdShedder strategy.\nloadBalancerDirectMemoryResourceWeight=1.0\n\n# Minimum throughput threshold (MB) for unloading a bundle, to avoid unloading bundles too frequently.\n# It only takes effect in ThresholdShedder strategy.\nloadBalancerBundleUnloadMinThroughputThreshold=10"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Balancing load across bookie nodes"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Bookie node load monitoring is shown in the figure below. We can see that:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":"1","normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"Load is not even across bookie nodes; the busiest and least busy nodes may differ by several hundred MB/s"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"Under high load, the load on some nodes may rise and fall periodically, with a period of 30 
minutes"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/f3/f3b9adf54e320509494b214f5de1e231.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The effect of these problems is that unbalanced bookie load lowers BookKeeper cluster utilization and makes the cluster prone to jitter."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The cause is that the bookie client's circuit-breaking policy for bookie write requests is too coarse-grained."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First, recall how the Pulsar broker writes to bookies:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When the broker receives a message from a producer, it first stores the message in the broker's direct memory, then calls the bookie client to send the message to the bookies in a pipelined fashion according to the configured (EnsembleSize, WriteQuorum, AckQuorum) policy."}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Every minute, the bookie client tallies write failures for each bookie (including write timeouts and other exceptions). By default, when a bookie exceeds 5 failures per minute, it is quarantined for 30 minutes so that data is not continuously written to a faulty bookie, which protects the message 
write success rate."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/9f/9fe51934c9432525834cf6d433d5aa07.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The problem with this circuit-breaking policy is that when a bookie is under very heavy load (traffic), all messages written to it may slow down at once, and all bookie clients may see write errors such as write timeouts at the same time. All bookie clients then quarantine that bookie for 30 minutes simultaneously, and add it back to the writable list simultaneously 30 minutes later. This is what makes the bookie's load rise and fall periodically."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To solve this, we introduced a probability-based quarantine mechanism: when a bookie client hits a write error, it does not quarantine the bookie outright but decides probabilistically whether to quarantine it."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This quarantine policy keeps all bookie clients from quarantining the same bookie at the same time, which avoids jitter in bookie inbound traffic. See BookKeeper PR-2327 for the related change. Because the code has not been merged and released in the bookie 
mainline version, anyone who wants to use this feature has to build the code themselves: https://github.com/apache/bookkeeper/pull/2327"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c1/c1ffa2414c0c4cc217af5410060b8e34.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In BIGO's tests, this feature reduced the standard deviation of inbound traffic across bookie nodes from 75 MB/s to 40 MB/s."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/35/3587f445f63cf54ae7d7ca9dd826cace.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Throttling"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Broker direct memory OOM (out of memory)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In production, under high throughput, we often hit broker direct memory OOM, which kills the broker process. A likely cause is that writes to the underlying bookies slow down, so a large amount of data backs up in broker direct memory. A message sent by a producer is handled in the broker 
中的處理過程如下圖所示:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/2b/2b0328cb284c1d814b094481e8919b6f.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在生產環境中,我們不能保證底層 bookie 始終保持非常低的寫延遲,所以需要在 broker 層做限流。Pulsar 社區的鵬輝開發了限流功能,限流邏輯如下圖所示:"}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/bb/bb18efff1482b876378b5fa047e211e2.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在 Pulsar 2.5.1 版本中已發佈,請參見 "},{"type":"link","attrs":{"href":"https://github.com/apache/pulsar/pull/6178","title":""},"content":[{"type":"text","text":"PR-6178"}]},{"type":"text","text":"。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Consumer 消耗大量內存"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當 producer 端以 batch 模式發送消息時,consumer 端往往會佔用過多內存導致頻繁 GC,監控上的表現是:這個 topic 的負載在 consumer 啓動時飆升,然後逐漸迴歸到正常水平。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這個問題的原因需要結合 consumer 
端的消費模式來看。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當 consumer 調用 receive 接口消費一條消息時,它會直接從本地的 receiverQueue 中請求一條消息,如果 receiverQueue 中還有消息可以獲取,則直接將消息返回給 consumer 端,並更新 availablePermit,當 availablePermit < receiverQueueSize/2 時,Pulsar client 會將 availablePermit 發送給 broker,告訴 broker 需要 push 多少條消息過來;如果 receiverQueue 中沒有消息可以獲取,則等待/返回失敗,直到 receiverQueue 收到 broker 推送的消息纔將 consumer 喚醒。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/4e/4e208ca1e8f3ae5509d94ac5ed136138.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Broker 收到 availablePermit 之後,會從 broker Cache/bookie 中讀取 "},{"type":"codeinline","content":[{"type":"text","text":"max(availablePermit, batchSize)"}]},{"type":"text","text":" 條 entry,併發送給 consumer 端。處理邏輯如下圖所示:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/d6/d682e913dbe5a6490a87a53542a08d49.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這裏的問題是:當 producer 開啓 batch 模式發送,一個 entry 包含多條消息,但是 broker 處理 availablePermit 請求仍然把一條消息作爲一個 entry 來處理,從而導致 
broker 一次性將大量信息發送給 consumer,這些消息數量遠遠超過 availiablePermit(availiablePermit vs. availiablePermit * batchSize)的接受能力,引起 consumer 佔用內存暴漲,引發頻繁 GC,降低消費性能。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了解決 consumer 端內存暴漲問題,我們在 broker 端統計每個 topic 平均 entry 包含的消息數(avgMessageSizePerEntry), 當接收到 consumer 請求的 availablePermit 時,將其換算成需要發送的 entry 大小,然後從 broker Cache/bookie 中拉取相應數量的 entry,然後發送給 consumer。處理邏輯如下圖所示:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/81/818ca46c6f1d210059d603c69cdfd2b5.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這個功能在 Pulsar 2.6.0 中已發佈,默認是關閉的,大家可以通過如下開關啓用該功能:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"# Precise dispatcher flow control according to history message number of each entry\npreciseDispatcherFlowControl=true"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"提高 Cache 命中率"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Pulsar 中有多層 Cache 提升 message 
的讀性能,主要包括:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Broker Cache"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Bookie write Cache(Memtable)"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Bookie read Cache"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"OS PageCache"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本章主要介紹 broker Cache 的運行機制和調優方案,bookie 側的 Cache 調優放在下篇介紹。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當 broker 收到 producer 發送給某個 topic 的消息時,首先會判斷該 topic 是否有 Active Cursor,如果有,則將收到的消息寫入該 topic 對應的 Cache 中;否則,不寫入 Cache。處理流程如下圖所示:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/9d/9dea1ebb893833d030b73d25120eb0cf.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"判斷是否有 Active Cursor 
需要同時滿足以下兩個條件:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":"1","normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"有 durable cursor"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"Cursor 的 lag 在 managedLedgerCursorBackloggedThreshold 範圍內"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由於 reader 使用 non-durable cursor 進行消費,所以 producer 寫入的消息不會進入 broker Cache,從而導致大量請求落到 bookie 上,性能有所損耗。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"streamnative/pulsar-flink-connector 使用的是 reader API 進行消費,所以同樣存在消費性能低的問題。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們 BIGO 消息隊列團隊的趙榮生同學修復了這個問題,將 durable cursor 從 Active Cursor 判斷條件中刪除,詳情請見 "},{"type":"link","attrs":{"href":"https://github.com/apache/pulsar/pull/6769","title":""},"content":[{"type":"text","text":"PR-6769"}]},{"type":"text","text":" ,這個 feature 在 Pulsar 2.5.2 發佈,有遇到相關性能問題的同學請升級 Pulsar 版本到 2.5.2 以上。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"此外,我們針對 topic 的每個 subscription 添加了 Cache 命中率監控,方便進行消費性能問題定位,後續會貢獻到社區。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Tailing 
Read"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於已經在 broker Cache 中的數據,在 tailing read 場景下,我們怎樣提高 Cache 命中率,降低從 bookie 讀取數據的概率呢?我們的思路是儘可能讓數據從 broker Cache 中讀取,爲了保證這一點,我們從兩個地方着手優化:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":"1","normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"控制判定爲 Active Cursor 的最大 lag 範圍,默認是 1000 個 entry ,由如下參數控:"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"# Configure the threshold (in number of entries) from where a cursor should be considered 'backlogged'\n# and thus should be set as inactive.\nmanagedLedgerCursorBackloggedThreshold=1000"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Active Cursor 的判定如下圖所示。"}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/52/52f0abaddb060614c3c8fff1071befee.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":"2","normalizeStart":"2"},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"控制 broker Cache 的 eviction 策略,目前 Pulsar 中只支持默認 eviction 策略,有需求的同學可以自行擴展。默認 eviction 
策略由如下參數控制:"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"# Amount of memory to use for caching data payload in managed ledger. This memory\n# is allocated from JVM direct memory and it's shared across all the topics\n# running in the same broker. By default, uses 1/5th of available direct memory\nmanagedLedgerCacheSizeMB=\n\n# Whether we should make a copy of the entry payloads when inserting in cache\nmanagedLedgerCacheCopyEntries=false\n\n# Threshold to which bring down the cache level when eviction is triggered\nmanagedLedgerCacheEvictionWatermark=0.9\n\n# Configure the cache eviction frequency for the managed ledger cache (evictions/sec)\nmanagedLedgerCacheEvictionFrequency=100.0\n\n# All entries that have stayed in cache for more than the configured time, will be evicted\nmanagedLedgerCacheEvictionTimeThresholdMillis=1000"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"####Catchup Read"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於 Catchup Read 場景,broker Cache 大概率會丟失,所有的 read 請求都會落到 bookie 上,那麼有沒有辦法提高讀 bookie 的性能呢?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Broker 向 bookie 批量發送讀取請求,最大 batch 由 dispatcherMaxReadBatchSize 控制,默認是 100 個 entry。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"# Max number of entries to read from bookkeeper. 
By default it is 100 entries.\ndispatcherMaxReadBatchSize=100"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"一次讀取的 batchSize 越大,底層 bookie 從磁盤讀取的效率越高,均攤到單個 entry 的 read latency 就越低。但是如果過大也會造成 batch 讀取延遲增加,因爲底層 bookie 讀取操作時每次讀一條 entry,而且是同步讀取。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這一部分的讀取調優放在《Apache Pulsar 在 BIGO 的性能調優實戰(下)》中介紹。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"保證 ZooKeeper 讀寫低延遲"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由於 Pulsar 和 BookKeeper 都是嚴重依賴 ZooKeeper 的,如果 ZooKeeper 讀寫延遲增加,就會導致 Pulsar 服務不穩定。所以需要優先保證 ZooKeeper 讀寫低延遲。建議如下:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":"1","normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"在磁盤爲 HDD 情況下,ZooKeeper dataDir/dataLogDir 不要和其他消耗 IO 的服務(如 bookie Journal/Ledger 目錄)放在同一塊盤上(SSD 除外);"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"ZooKeeper dataDir 和 dataLogDir 最好能夠放在兩塊獨立磁盤上(SSD 除外);"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"監控 broker/bookie 網卡利用率,避免由於網卡打滿而造成和 ZooKeeper 
失聯。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"關閉 auto bundle split,保證系統穩定"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Pulsar bundle split 是一個比較耗費資源的操作,會造成連接到這個 bundle 上的所有 producer/consumer/reader 連接斷開並重連。一般情況下,觸發 "},{"type":"codeinline","content":[{"type":"text","text":"auto bundle split"}]},{"type":"text","text":" 的原因是這個 bundle 的壓力比較大,需要切分成兩個 bundle,將流量分攤到其他 broker,來降低這個 bundle 的壓力。控制 auto bundle split 的參數如下:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":""},"content":[{"type":"text","text":"# enable/disable namespace bundle auto split\nloadBalancerAutoBundleSplitEnabled=true\n\n# enable/disable automatic unloading of split bundles\nloadBalancerAutoUnloadSplitBundlesEnabled=true\n\n# maximum topics in a bundle, otherwise bundle split will be triggered\nloadBalancerNamespaceBundleMaxTopics=1000\n\n# maximum sessions (producers + consumers) in a bundle, otherwise bundle split will be triggered\nloadBalancerNamespaceBundleMaxSessions=1000\n\n# maximum msgRate (in + out) in a bundle, otherwise bundle split will be triggered\nloadBalancerNamespaceBundleMaxMsgRate=30000\n\n# maximum bandwidth (in + out) in a bundle, otherwise bundle split will be triggered\nloadBalancerNamespaceBundleMaxBandwidthMbytes=100"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當觸發 auto bundle split 時 broker 負載比較高,關閉這個 bundle 上的 producer/consumer/reader,連接就會變慢,並且 bundle split 的耗時也會變長,就很容易造成 client 端(producer/consumer/reader)連接超時而失敗,觸發 client 端自動重連,造成 Pulsar/Pulsar client 
不穩定。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於生產環境,我們的建議是:預先爲每個 namespace 分配好 bundle 數,並關閉 auto bundle split 功能。如果在運行過程中發現某個 bundle 壓力過大,可以在流量低峯期進行手動 bundle split,降低對 client 端的影響。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"關於預先分配的 bundle 數量不宜太大,bundle 數太多會給 ZooKeeper 造成比較大的壓力,因爲每一個 bundle 都要定期向 ZooKeeper 彙報自身的統計數據。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"總結"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本篇從性能調優角度介紹了 Pulsar 在 BIGO 實踐中的優化方案,主要分爲環境部署、流量均衡、限流措施、提高 Cache 命中率、保證 Pulsar 穩定性等 5 個方面,並深入介紹了 BIGO 消息隊列團隊在進行 Pulsar 生產落地過程中的一些經驗。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本篇主要解決了開篇提到的這幾個問題(1、2、5、7、8、9 )。對於問題 3,我們提出了一個緩解方案,但並沒有指出 Pulsar broker OOM 的根本原因,這個問題需要從 BookKeeper 角度來解決,剩下的問題都和 BookKeeper 相關。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由於 Pulsar 使用分層存儲架構,底層的 BookKeeper 仍需要進行一系列調優來配合上層 Pulsar,充分發揮高吞吐、低延遲性能;下篇將從 BookKeeper 性能調優角度介紹 BIGO 的實踐經驗。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"非常感謝 StreamNative 同學的悉心指導和無私幫助,讓 Pulsar 在 BIGO 落地邁出了堅實的一步。Apache Pulsar 提供的高吞吐、低延遲、高可靠性等特性極大提高了 BIGO 
消息處理能力,降低了消息隊列運維成本,節約了近一半的硬件成本。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"同時,我們也積極融入 Pulsar 社區,並將相關成果貢獻回社區。我們在 Pulsar Broker 負載均衡、Broker Cache 命中率優化、Broker 相關監控、Bookkeeper 讀寫性能優、Bookkeeper 磁盤 IO 性能優化、Pulsar 與 Flink & Flink SQL 結合等方面做了大量工作,幫助社區進一步優化、完善 Pulsar 功能。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"關於作者"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"陳航,BIGO 大數據消息平臺團隊負責人,負責承載大規模服務與應用的集中發佈-訂閱消息平臺的創建與開發。他將 Apache Pulsar 引入到 BIGO 消息平臺,並打通上下游系統,如 Flink、ClickHouse 和其他實時推薦與分析系統。他目前聚焦 Pulsar 性能調優、新功能開發及 Pulsar 生態集成方向。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}