Percent (百分點) Big Data Technology Team: Building a Trillion-Scale Big Data Monitoring Platform in Practice

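{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As internet services grow rapidly, users expect more and more from the systems behind them. Solid monitoring safeguards those systems and measurably improves reliability, availability, and user experience. Monitoring is one of the most important links in operations, and indeed in the whole project and product lifecycle. Working on a big data platform project, the Percent (百分點) big data technology team built monitoring for platform services that carry traffic in the tens of billions across a cluster of more than 3,000 servers, and distilled a set of monitoring architecture ideas, design methods, and implementation practices suited to its own business and technology."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This article describes how the big data monitoring platform was built, in two main parts: the overall design of the monitoring system and the implementation of the technical solution. It aims to explain the design thinking behind the system and to offer practical guidance for building one."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Overall Design"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For the overall design, the Percent big data team adopted the principles of “decentralization” and “service transparency”, while also designing for strong scalability, automation, and high reliability."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Decentralized design: "},{"type":"text","text":"Eighteen geographically separate data centers have to be monitored at once. The team first considered letting each of the 18 centers monitor itself, but that gives a fragmented, unintuitive picture and is costly to maintain. After weighing link bandwidth, monitoring tool performance, and data volume, the team decided to build the full pipeline, from metric collection to visualization, in a single primary center only; the other centers merely ship monitoring data to it. The result is a “1 Server+18 Slaves” monitoring framework covering all 18 data centers."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Service-transparent design: "},{"type":"text","text":"Each component's storage, processing, and query capacity is quantified against a standard so the system stays stable and controllable. Concretely, capacity and per-metric thresholds are defined for every component, the component's capability figures and current state are visualized, and the standard values drive early-warning rules and matching remediation steps, all without users noticing."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Scalability and automation design: "},{"type":"text","text":"Onboarding a data center's monitoring data and tuning its metrics takes about half a day, and the design integrates monitoring data from multiple data centers seamlessly."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1.1 Monitoring Design Method"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The three most important criteria for judging a monitoring system are monitoring granularity, completeness of metrics, and timeliness. Following the layering of the system itself, monitoring can be divided into three layers:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Business layer: "},{"type":"text","text":"A business system exists to achieve business goals, so the most effective way to tell whether it is healthy is to check from the data whether those goals are being met. Monitoring operational data such as data trends and traffic volume exposes program bugs and flaws in business logic early. Because business systems are so varied, each one should implement its own monitoring metrics."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Platform layer: "},{"type":"text","text":"This layer is about understanding and controlling the overall running state of applications. If an application is treated as a black box, developers and operators cannot know its current state or catch latent faults in time. Application monitoring should not stop at business systems; it also covers middleware and compute engines such as ClickHouse, ElasticSearch, Redis, Zookeeper, and Kafka. Commonly monitored data includes JVM heap memory, GC, CPU utilization, thread count, TPS, and throughput, usually gathered as application-level metrics by a unified, abstracted collection component."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"System layer: "},{"type":"text","text":"Track the working state of each server in real time, watching performance, memory consumption, capacity, and overall system health so that servers run stably. Metrics: system-level performance indicators such as memory, disk, CPU, network traffic, and system processes."}]},{"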
type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The later section on key monitoring metrics describes, for the components at each layer, what their metrics mean, what thresholds apply, and so on."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1.2 System Design"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Good work needs good tools. From a survey of existing monitoring products, the layering of monitoring, and the problems to be solved, a monitoring system's pipeline from collection to analysis follows a common architecture: collect, store, analyze, display, alert."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Data collection: "},{"type":"text","text":"Data is collected from systems through protocols and mechanisms such as SNMP, Agent, ICMP, SSH, and IPMI."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Data storage: "},{"type":"text","text":"Data is stored mainly in MySQL, though other database services can be used."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Data analysis: "},{"type":"text","text":"When a failure needs a post-incident review, the monitoring system supplies graphs, timestamps, and related information that make it easier to pinpoint the fault."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Data presentation:"},{"type":"text","text":" 
Web UI (a mobile app, or a web front end written in Java or PHP, works as well)."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Alerting: "},{"type":"text","text":"Phone, email, WeChat, and SMS alerts, alert escalation, and so on (any alert channel will do)."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Alert handling: "},{"type":"text","text":"When an alert arrives, it is handled according to the severity of the fault, for example important and urgent versus important but not urgent, and the right people are brought in to resolve it quickly."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/4f\/4fbf28d7c9fa42fb3f3301fce85648ec.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The monitoring requirements cover 12 components in total, spanning base components and big data components. Each has multiple metrics, about 519 in all. So that the past 90 days of history can be reviewed, all collected monitoring data is retained for 90 days, which comes to roughly 800 GB; each metric is sampled every 15 s or 30 s depending on its characteristics. Based on this requirements analysis, the Percent big data team built the pipeline from source data collection and storage through targeted data cleaning and analysis, and finally aggregated the results into the monitoring platform, which provides alerting and early warning. The platform's dashboards are visually polished and can also be shown on a large wall display."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Technical Solution"}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.1 Technical Architecture"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/b2\/b29a0166ac9a40901af1644c3e485a70.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The monitoring solution uses real-time data collection, real-time processing and visualization, and high-availability techniques to monitor the performance metrics of many big data platform components. The system has two parts: Zabbix, and Prometheus + Grafana. Zabbix handles server hardware monitoring; Prometheus + Grafana handles cluster state monitoring."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Zabbix monitors server hardware using distributed active checks. Each Zabbix Agent requests its list of items from a Zabbix Proxy and then periodically sends newly collected values back to that proxy. The proxy buffers data from many monitored devices locally and forwards it to the Zabbix Server it belongs to."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Prometheus collects component metrics through a range of exporters. As the diagram above shows, off-the-shelf exporters such as Node Exporter and Clickhouse Exporter collect data for their respective components, while Jmx Exporter collects data from Oss Tomcat, HBase, the business systems, and the data flows. All of it is stored in the local time-series database."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Grafana reads the data Prometheus has collected through API calls and metric queries, and visualizes it."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.2 Technology Selection"}]},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(1) Zabbix"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Zabbix is an enterprise-grade open-source solution that provides distributed system monitoring and network monitoring through a web interface. It monitors a wide range of network parameters to keep server systems running safely, and its flexible notification mechanism lets administrators locate and resolve problems quickly, which makes it a strong tool for automated operations monitoring. Zabbix's flexible design exposes easy-to-use extension interfaces, so users can rely on built-in functionality and also define custom items for hardware, operating systems, service processes, network devices, and more. Notably, its proxy-based distributed architecture can monitor devices in multiple remote regions while offloading monitoring work from the server without adding maintenance complexity, which simplifies project rollout."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As the high-availability design diagram shows, Zabbix collects hardware metrics from every server in the project through proxies, then alerts on and displays them. Zabbix Agent is installed and started on servers in bulk with Ansible; each agent then actively registers itself with the Zabbix Server, which completes the server's configuration in Zabbix Web automatically."}]},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(2) Prometheus"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Prometheus is an open-source monitoring system written in Go, created by former Google engineers and officially released in 2015. Beyond its memorable name, it has strong backing from Google and the K8s community and a very active open-source community. It joined the Cloud Native Computing Foundation in 2016 as the second hosted project after K8s, and its prospects are widely regarded as good. Data collection is pull-based. The architecture is simple and does not depend on external storage: a single server node works on its own, started from one binary, which makes it a lightweight server that is easy to migrate and maintain. Monitoring data is stored directly in the local time-series database on the Prometheus Server, and a single instance can handle millions of metrics. Prometheus's flexible data model and powerful query language support detailed monitoring of a service's internal state as well as ad hoc querying of that data, which helps locate and diagnose problems quickly and suits service-oriented architectures well."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In this architecture, each Prometheus instance pulls metrics for all components in its region and stores them locally. The Prometheus UI can be used to check whether a required metric in that region is being collected and whether its values look normal, and thus to judge the state of the collection side."}]},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(3) Grafana"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Grafana is a visualization dashboard. It brings together the data collected by each region's Prometheus to monitor that region's clusters, and presents it to users cleanly and directly. A Grafana data source points at the Prometheus URL, and the incoming data is grouped, filtered, and aggregated so that each panel conveys what a metric means at a glance."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.3 Non-Functional Implementation"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In this large IT environment, system components are spread across 18 cities, spanning many nodes and multiple IDCs, with complex business types and diverse requirements, so the monitoring system has to keep up with constantly changing needs. The first step in building monitoring for such an environment is to get the global picture, while also anticipating where the business is heading. The solution must meet today's requirements and keep meeting them as the business grows, so it has to address three factors: high availability, high throughput, and scalability."}]},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(1) High Availability"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The base stack is LAMP. Keepalived provides high availability for the Zabbix and Grafana servers, so that if MySQL or httpd on the primary server goes down, service fails over to the standby server. The databases replicate master-master, keeping data consistent on both servers and making the database itself highly available. The Zabbix and Grafana databases both sit on RAID 5 storage, so data remains accessible with one disk offline. The figure below shows the Zabbix high-availability distributed architecture."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/f6\/f6a9b62cf78aa1c690ff2b9598a9772f.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(2) High Throughput"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Together, Zabbix, Grafana, and Prometheus monitor more than 3,000 servers, covering everything from the hardware layer to the application layer with over 230,000 Items and over 170,000 Triggers. More than 24,300 values are updated per second, producing over 1.1 TB of data per day."}]},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(3) Scalability"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A Zabbix Proxy can collect performance and availability data on behalf of the Zabbix Server and then report it to the server. This takes some load off the Zabbix Server without adding to the maintenance complexity of the monitoring system."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Each Prometheus instance collects runtime state for all server-side services in one region, and Grafana visualizes the data by calling the API through a plugin. The figure below shows the Ansible code for bulk-installing Proxy nodes:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/b1\/b1d1c8d5b1dd7cb39255ce3b63285afe.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.4 Core Component Monitoring Metrics"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The most important part of building a good monitoring system is the set of items monitored for each service and the metrics behind each item. You need to understand exactly what each one means and set its threshold well; the accuracy of those thresholds determines the quality of the monitoring system."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Zabbix monitors server hardware through ICMP ping, disks, fans, memory, power supplies, motherboard temperature, CPU temperature, voltage, RAID status, batteries, network cards, and more. It also monitors component processes to detect whether applications are alive."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/cf\/cfcbf99640c5eb33635415e7df6639aa.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Grafana + Prometheus is mainly responsible for monitoring the state of services and components such as the business systems, CK (ClickHouse), ES (ElasticSearch), Ceph, Oss, Kafka, ZK (Zookeeper), and the data flows."}]},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(1) ElasticSearch Metrics"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ES monitoring works at two levels: cluster and node. Cluster-level monitoring looks at the ES cluster as a whole, including its health and state. Node-level monitoring looks at each ES instance, including its query and indexing metrics and its physical resource usage. Cluster-level metrics show the cluster's running state; node-level metrics are used more for troubleshooting. When the cluster has a problem, more often than not the investigation goes straight to a specific ES instance, examining that instance's resource usage or other metrics."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/8d\/8d2a580ccb40c526bbd598f04cb700d3.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(2) ClickHouse Metrics"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Metrics such as slow queries, rejected writes, QPS, read and write load, HTTP and TCP connection counts, and Zookeeper status reflect, in real time, users' raw read and write requests and the read\/write performance of the ClickHouse cluster."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/ba\/bacbb4712c7f8d275f2d06bfff4b8664.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(3) Kafka Metrics"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When a Kafka cluster misbehaves, metrics such as whether the Kafka Controller is alive, replica Leader election latency, how far Followers lag behind their Leader in replicated messages, and key Broker-side JMX metrics, combined with historical state data, help locate and analyze the problem quickly."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/f3\/f3a2d45a29cfa83c66c9f34e69f252c9.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(4) Ceph Metrics"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When a Ceph cluster reports an abnormal state, cluster details are needed to identify the failing node. Ceph is therefore monitored on the following aspects: cluster status, OSD status, cluster capacity, OSD utilization, latency, recovery progress, and Objects status."}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/b5\/b524799aeeb0adb485edde125bcb4098.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(5) HBase Metrics"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The monitoring data collected for HBase mainly covers the JVM state of every Regionserver and Master machine, such as thread information, GC count and time, memory usage, and the number of ERROR, WARN, and Fatal events, together with statistics from the Regionserver and Master processes."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/c4\/c4ac2bd0ffeebb9df0c875596074c9aa.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(6) Zookeeper Metrics"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Zookeeper is monitored from two angles: the system and the Zookeeper nodes. System monitoring covers memory usage, network bandwidth usage, disk usage, and so on; Zookeeper node monitoring covers the number of live nodes, latency, packets sent and received, connection count, number of ephemeral nodes, and so on."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/59\/595579cd48b66b2ed3c299c0875b2570.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Best Practices"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Running Zabbix at this scale, the Zabbix Server came under heavy pressure as the number of monitored objects grew, and a series of performance bottlenecks appeared:"}]},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Zabbix queue backed up to more than 300,000 Items, with Items delayed by around 10 minutes;"}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Triggers using the nodata() function fired alerts;"}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The front end became unresponsive or very slow because of the volume of data displayed."}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To address these three problems, performance was tuned mainly on two fronts, Zabbix configuration parameters and database parameters, and some general recommendations are given for other engineers' reference. The figure below shows the Zabbix queue backlog:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/d6\/d68322350149f59f16c34d520601d2d0.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.1 Parameter Tuning Notes"}]},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(1) Zabbix Configuration Tuning"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"HistoryStorageDateIndex=1"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"# 
Number of poller processes started at initialization. Active checks are used here, so this can be set to its minimum"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"StartPollers=1"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"# Preprocessing worker processes"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"StartPreprocessors=40"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"StartPollersUnreachable=1"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"StartTrappers=15"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"# Number of ICMP pinger processes to start"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"StartPingers=1"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"# Number of discoverer processes used for host auto-discovery"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"StartDiscoverers=1"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"# Disable Zabbix's built-in housekeeping"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"HousekeepingFrequency=0"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"# Shared memory Zabbix allocates at startup for storing configuration data"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"CacheSize=2G"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"# 
Number of processes that sync collected data from cache to the database"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"StartDBSyncers=25"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"# Reserve 2G of memory for collected history data"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"HistoryCacheSize=2G"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"# Size of the history index cache"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"HistoryIndexCacheSize=256M"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"# Memory allocated for caching trend data"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"TrendCacheSize=256M"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ValueCacheSize=2G"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Timeout=10"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"AlertScriptsPath=\/usr\/lib\/zabbix\/alertscripts"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ExternalScripts=\/usr\/lib\/zabbix\/externalscripts"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"FpingLocation=\/usr\/sbin\/fping"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"LogSlowQueries=1000"}]},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(2) Database Tuning"}]},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Follow MySQL's performance tuning guidance."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For MySQL, use InnoDB tables. With InnoDB, Zabbix runs at least 1.5 times faster."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Partition the frequently used tables and clean them up on a schedule. Those tables are:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"'history', 'history_str', 'items', 'functions', 'triggers', 'trends'."}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/d8\/d84e84b64444b014a30765bd1e6f3310.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(3) General Performance Recommendations"}]},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Monitor only the parameters you need;"}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Tune the “update interval” of every item;"}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Tune the parameters of the default templates;"}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type"
:"text","text":"Tune the housekeeping parameters;"}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Avoid trigger functions that take long time periods as arguments; for example, max(3600) is evaluated noticeably more slowly than max(60)."}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The effect of the Zabbix tuning, before and after, is shown below:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/2b\/2bb51e5e0c03b102977bd11aab587f7e.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Before tuning"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/de\/de6768051e12690ca414fd9a60515c05.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"After tuning"}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.2 Hardware Monitoring in Practice"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Zabbix Agent requests its list of checks from the address set in ServerActive in the zabbix_agentd.conf configuration file. The Server responds with the hardware monitoring list configured in Zabbix Web, and the Agent parses the Item Name entries in the response and invokes the corresponding parameters to start collecting data periodically."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Note: $IPMI_IP is the IPMI IP address, and 1.3.6.1.4.1.674.10892.5.5.1.20.130.1.1.37.1 is the SNMP OID of the RAID card on Dell servers."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"UserParameter=RAIDControllerStatus,\/etc\/zabbix\/scripts\/zabbix_agent_snmp.sh RAIDControllerStatus"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"cat \/etc\/zabbix\/scripts\/zabbix_agent_snmp.sh"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"function get_RAIDControllerStatus(){"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"   RAIDControllerStatusvalue=`snmpwalk -v 2c -c public $IPMI_IP 1.3.6.1.4.1.674.10892.5.5.1.20.130.1.1.37.1 |awk -F 'INTEGER: ' '{print $2}'`"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"}"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The hardware metrics collected through Zabbix Agent are shown below:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/c6\/c6b982e24ea76d7c206258f0827e5049.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Zabbix can monitor each server's hardware through Zabbix Agent and alert promptly, but it does not give a good aggregated view of a whole region of the project. The Percent big data team therefore combined Prometheus with Grafana to rank all servers in the current region by disk space and memory usage in descending order."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Grafana metric expressions for disk usage on the root filesystem are as follows (used bytes first, then the usage ratio):"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"node_filesystem_size_bytes{IP_Range=\"$IP_Range\",fstype=\"xfs\",mountpoint=\"\/\"}-node_filesystem_free_bytes{IP_Range=\"$IP_Range\",fstype=\"xfs\",mountpoint=\"\/\"}"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1-(node_filesystem_free_bytes{IP_Range=\"$IP_Range\",fstype=\"xfs\",mountpoint=\"\/\"}\/node_filesystem_size_bytes{IP_Range=\"$IP_Range\",fstype=\"xfs\",mountpoint=\"\/\"})"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The result looks like this:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/12\/122ccbda49b29486bb3b3e367cb43b88.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To locate and fix problems quickly, an overview of common metrics across all servers plus detailed hardware monitoring of each server is not enough; a broad, intuitive view of each server's system-layer behavior is also needed. The figure below shows the system-layer view for a single server:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/8d\/8d2468597fa130694792a7ae701aac2e.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.3 Platform Component Cluster Monitoring in Practice"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The figure below shows the overall monitoring list for every program running on the system, including business systems and data flows as well as clusters such as ClickHouse, Ceph, and ElasticSearch."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/e9\/e930c435d21da141c411e83d452f38f3.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(1) ElasticSearch Cluster Monitoring"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"An ES collection program aggregates monitoring data from each ES cluster into a dedicated ES monitoring cluster, and Grafana connects to that monitoring cluster for display."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Part of the collector code is shown below:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/79\/79accb0d5b68d70260f797306f46f2a3.png","alt":"圖片","title":null,"style":[{"key":"width","
value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The resulting dashboard is shown below:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/71\/7162ccf8a856f1e275e283c6fc23d485.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(2) ClickHouse Cluster Monitoring"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ClickHouse data collection has two parts: ① Prometheus actively pulls the data collected by Ck_exporter; ② custom metrics are pushed to Pushgateway, from which Prometheus scrapes them."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Some of the custom Pushgateway metrics are shown below:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/11\/11d180d2f5f5e9269b432b63af5d9bdd.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The final dashboard:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/01\/01e2a3adde55138b20c87eda0936084a.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"border
type","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(3) Kafka Cluster Monitoring"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Some Kafka monitoring metrics are obtained through JMX on the Kafka cluster. Open Kafka's JMX port by inserting the following content into .\/bin\/kafka-server-start.sh at the position shown in the figure below, place the jar and yml files in the corresponding locations, and then restart the Kafka cluster."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/6f\/6f1695d85fdada4fd54da4636a583ad0.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The JMX monitoring dashboard is shown below:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/2e\/2ee736902023b3d19201585fdd753c60.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(4) Ceph Cluster Monitoring"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A single Ceph Exporter can collect data for the entire Ceph cluster; to avoid a single point of failure, the Ceph Exporter is deployed here in a highly available setup. The Ceph Exporter is downloaded directly from the community site and started, and Prometheus then pulls the data from the Ceph 
Exporter and applies grouping, aggregation, and similar operations to produce the dashboard below:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/d8\/d88e444e1b9954352cbf55986b312cab.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(5) HBase Cluster Monitoring"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Because HBase is integrated into Ambari, the JMX ports of HMaster and HRegionServer must be enabled through the Ambari Web interface before their metrics can be displayed. Insert the following content into the hbase-env.sh configuration file:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/4c\/4c177d148fbea38d21491904bb6a751c.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The HBase dashboard is shown below:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/76\/76acd7173bea652ede8719be4752f6ef.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"(6) ZooKeeper Cluster Monitoring"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Prometheus collects data through zk_exporter, a third-party tool for ZooKeeper that can be downloaded from the community site and started directly. After metric filtering and aggregation, the final dashboard is shown below:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{
"src":"https:\/\/static001.infoq.cn\/resource\/image\/53\/83\/536ba28d59ac5743d2d3e4a38bb00c83.jpg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"結語與展望"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"百分點科技希望通過本篇文章的分享,幫助大家快速瞭解大規模機器集羣下的監控設計架構思路,以及每個核心組件重要的監控指標項含義和閾值範圍,提供最佳實踐的優化參數,爲大家在實施過程中提供一些參考。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"關於配置文件、Json面板文件和更詳細的過程信息等問題,歡迎您來諮詢,大家一起探討、共同進步。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文轉載自:百分點(ID:baifendian_com)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"原文鏈接:"},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s\/PVmboxVJg2leJbiHOWvO8w","title":"xxx","type":null},"content":[{"type":"text","text":"百分點大數據技術團隊:萬億級大數據監控平臺建設實踐"}]}]}]}