Hardcore Presto | Presto Internals, Tuning, Interviews & Practice: The Fully Upgraded Edition

Original

2021-06-28 10:43

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A long time ago I wrote an article titled ","attrs":{}},{"type":"text","marks":[{"type":"underline","attrs":{}}],"text":"“","attrs":{}},{"type":"link","attrs":{"href":"http://mp.weixin.qq.com/s?__biz=MzU3MzgwNTU2Mg==&mid=2247497505&idx=1&sn=6ae3547d253cf76afe4c813d850e999d&chksm=fd3eb1b4ca4938a253b7f0d54d35eaf110890f24d846a5c9ee85e48a2884bb4ad526ea0eccb1&scene=21#wechat_redirect","title":null,"type":null},"content":[{"type":"text","text":"Practice and Exploration of Presto in Big Data","attrs":{}}],"marks":[{"type":"underline"}]},{"type":"text","marks":[{"type":"underline","attrs":{}}],"text":"”","attrs":{}},{"type":"text","text":". It explains Presto's principles and applications in detail.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Today's article is the upgraded edition: a systematic write-up of my notes from the articles and books I have read. ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"It is a comprehensive upgrade covering origins, internals, tuning, interview questions, and practical applications","attrs":{}},{"type":"text","text":". I hope you find it helpful.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"I. Origins","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Presto is an MPP compute engine open-sourced by Facebook, built mainly to provide low-latency interactive analytics over Facebook's massive Hadoop data warehouse. The Facebook version of Presto focuses primarily on the company's internal needs; it is also called PrestoDB, and its versions are numbered 0.xxx.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Later, several members of the Presto team left to create a more general-purpose fork named PrestoSQL, with versions numbered xxx (for example, version 345); this open-source version is the more widely adopted one. More recently, to better distinguish itself from Facebook's Presto, PrestoSQL was renamed Trino; apart from the name, nothing else changed. PrestoDB and PrestoSQL share the same roots, so most of their mechanisms and principles are the same.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Who am I? Where do I come from? Where am I going?","attrs":{}}]},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"Presto is an 
open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.\nPresto allows querying data where it lives, including Hive, Cassandra, relational databases or even proprietary data stores. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization.\nPresto is targeted at analysts who expect response times ranging from sub-second to minutes. Presto breaks the false choice between having fast analytics using an expensive commercial solution or using a slow \"free\" solution that requires excessive hardware.\n","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This is the official website's definition of Presto. Presto is a distributed SQL query engine for big data, open-sourced by Facebook. It is designed for interactive analytic queries and supports many data sources, including HDFS, RDBMS, and Kafka, and it provides a friendly interface for developing data source connectors.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"II. Features and Use Cases","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"1. Features","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Compared with other engines, Presto stands out in two respects: it is multi-source and ad hoc. Multi-source means it supports federated queries across different data sources. Ad hoc means computing in real time: the data a query needs is pulled in on demand, computed on the spot, and the result is returned. Beyond that, the engine itself has several notable features:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(1) Multi-tenancy: it supports concurrent execution of hundreds of memory-, I/O-, and CPU-intensive queries, and scales to clusters of thousands of nodes; (2) Federated queries: developers can use the open interfaces to build custom connectors for different data sources, enabling federated queries across many kinds of data sources; (3) Built-in features: to keep the engine efficient, Presto also applies several optimizations, such as running on the JVM, code generation, 
and so on.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"2. Use Cases","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Presto is used in a wide range of scenarios; here we introduce several of the most common ones.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(1) Interactive analytics: interactive querying is Presto's primary use case, and its ad hoc computation model and internal design exist to support interactive analysis well. Just as users query data in HDFS interactively through Hive, they can use Presto to query data in many different data sources. (2) Batch ETL. (3) Facebook's A/B testing infrastructure is also built on Presto.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Presto stands out among in-memory computing databases for the following reasons:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A clean architecture: it is a system that runs on its own and does not depend on any other external system. For scheduling, for example, Presto monitors the cluster itself and schedules based on that monitoring information.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A simple data structure: columnar storage with logical rows. Most data can easily be converted into the structure Presto requires.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A rich plugin interface that integrates cleanly with external storage systems and allows custom functions to be added.","attrs":{}}]}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"III. Overall Architecture","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/66/66c3c574f8eed6fa906987aba6e48c48.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":nul
l,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Figure 1: Presto architecture","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Presto consists mainly of the Client, Coordinator, Worker, and Connector components.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"1. SQL statement submission:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A user or application submits a SQL query through Presto's JDBC interface or CLI; the submitted SQL is ultimately passed to the Coordinator for further processing.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"2. Lexical/syntactic analysis:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The received query is first lexically and syntactically analyzed to build an abstract syntax tree (AST). The AST is then analyzed to produce a logical query plan.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/7a/7a82fc0efe4cb4c547ef5e5bfdcaa129.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Figure 2: Query SQL","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"3. Logical plan generation:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Figure 2 is a SQL statement from the TPC-H benchmark, an example of a two-table join with a grouped aggregation. After lexical and syntactic analysis, we obtain the 
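AST; further analysis then yields the logical plan below.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/7a/7a82fc0efe4cb4c547ef5e5bfdcaa129.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Figure 3: Logical plan","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The figure above is a logical plan tree. Each node represents a physical or logical operation, and a node's children serve as its inputs. A logical plan only describes the execution logic of the SQL; it carries no concrete execution information, such as whether an operation runs on a single node or in parallel on multiple nodes, or when a data shuffle is required.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"4. Query optimization:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Coordinator applies a series of optimization strategies (such as pruning, predicate pushdown, and condition pushdown) to the sub-plans of the logical plan, transforming it into a structure better suited to physical execution and producing a more efficient execution strategy.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Here is what the optimizer does in several specific areas:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(1) Adaptivity: a Presto Connector can use the Data Layout API to expose the physical layout of the data (such as location, partitioning, sorting, grouping, and indexes). If a table has several different storage layouts, the Connector can return all of them, so the Presto optimizer can choose the most efficient layout for reading and processing the data based on the characteristics of the query.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(2) Predicate pushdown: a very widely used optimization that pushes filter conditions or column selections as far down toward the leaf nodes as possible, ultimately handing them to the data source to execute, which greatly reduces engine-to-data-source 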
I/O and improves efficiency.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b7/b79e5ccc6bb28c0803995bf28323a4e4.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Figure 4: The execution plan obtained by further transforming the logical plan in Figure 3 (not performed)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(3) Inter-node parallelism: shuffling data between stages incurs significant memory and CPU overhead, so minimizing the number of shuffles is an important goal. To that end, Presto can draw on the following two kinds of information:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data layout information: the physical layout information mentioned above can also be used here to reduce shuffles. For example, if the join keys of the two tables being joined are both partitioning columns, the join can be performed locally on each node, greatly reducing the data shuffle.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Likewise, if the join keys of the two tables are indexed, a nested-loop join strategy can be considered.","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(4) Intra-node parallelism: the optimizer uses multiple threads within a node to raise intra-node parallelism, which has lower latency and is more efficient than inter-node parallelism.","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Interactive analytics: interactive workloads are mostly short, one-off queries that are usually not tuned, so data skew occurs from time to time. The typical symptom is that a small number of nodes receive a large share of the data.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Batch 
ETL: these queries pull large amounts of unfiltered data from the leaf nodes up to the upper-level nodes for transformation, putting heavy pressure on the upper-level nodes.","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To address the problems in both scenarios, the engine can run a single sequence of operators (a pipeline) across multiple threads. As shown in Figure 5, pipelines 1 and 2 execute in parallel on multiple threads to speed up the build side of the hash join.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/7a/7a82fc0efe4cb4c547ef5e5bfdcaa129.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Figure 5: Pipelines 1 and 2 execute in parallel on multiple threads to speed up the build side of the hash join","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Beyond the optimization strategies listed above, which the Presto optimizer already implements, Presto is also actively exploring the Cascades framework, so the optimizer can be expected to improve further.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"5. Fault tolerance","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Presto can recover from some transient errors using low-level retries. Beyond that, it relies on clients to automatically rerun failed queries. Built-in fault tolerance for the failure of coordinator or worker 
nodes is not yet well supported by Presto.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Standard checkpointing or partial-recovery techniques are computationally expensive and hard to implement in a system that returns results to the client as soon as they are available (an ad hoc query system).","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"IV. Resources and Scheduling","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We borrow an architecture diagram from a Meituan blog post:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/87/873fd180d9696f690f69f8b7b7b811cd.webp","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Presto query engine has a master-slave architecture consisting of one Coordinator node, one Discovery Server node, and multiple Worker nodes; the Discovery Server is usually embedded in the Coordinator node.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Coordinator parses SQL statements, generates execution plans, and dispatches execution tasks to Worker nodes.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Worker nodes actually execute the query tasks. After starting, a Worker node registers with the Discovery Server, and the Coordinator obtains the list of healthy Worker nodes from the Discovery Server. If the Hive connector is configured, a Hive 
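Metastore service must also be configured to provide Hive metadata to Presto; Worker nodes read data by interacting with HDFS.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Presto service processes","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A Presto cluster runs two kinds of processes: the Coordinator service process and the worker service process. The Coordinator accepts query requests, parses query statements, generates query execution plans, schedules tasks, and manages workers. The worker service process executes the tasks into which a query is decomposed.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Coordinator service process is deployed on a dedicated node and is the management node of the whole Presto cluster. It accepts query requests, parses query statements, generates the query execution plan with its Stages and Tasks, schedules the generated Tasks, and manages workers. As the master process of the Presto cluster, the Coordinator must communicate with workers to obtain up-to-date worker information, and also with clients to accept query requests. The Coordinator exposes REST services for this work.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A Presto cluster has one Coordinator and multiple Worker nodes. Each Worker node runs a worker service process, which mainly processes data and executes Tasks. The worker service process sends a heartbeat to the Coordinator at regular intervals. After receiving a query request, the Coordinator selects suitable nodes from the currently live workers to run tasks.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e8/e8797c164e40686b10cb8cdd39678755.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The figure above gives a high-level view of the components of a Presto cluster: one coordinator and multiple worker nodes. Users connect to the coordinator through a client, which can be the JDBC driver or the Presto command-line interface (CLI).","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Presto is a distributed SQL query engine in the style of massively parallel processing (MPP) databases and query engines. Rather than relying on vertical scaling of a single machine, it can scale horizontally, distributing all processing across the machines in the cluster. This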
means you can gain more processing power by adding more nodes.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With this architecture, the Presto query engine can process SQL queries over large-scale data in parallel across the machines of the cluster. Presto runs as a single-process service on each node. Multiple nodes running Presto, configured to collaborate with each other, make up a complete Presto cluster.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/91/911c559558d47930030822ae716d1fd4.webp","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The figure above shows the communication between the coordinator and workers, and among workers, within a cluster. The coordinator talks to multiple workers to assign tasks, update status, and fetch the final results to return to the user. Workers talk to each other to fetch data from the upstream nodes of their tasks. All workers read data from the data sources.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Coordinator","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Coordinator is responsible for:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Receiving SQL statements from users","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Parsing SQL statements","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Planning query execution","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","at
trs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Managing the state of worker nodes","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Coordinator is the brain of a Presto cluster and is the component that communicates with clients. Users interact with the coordinator through the Presto CLI, JDBC or ODBC drivers, or client libraries for other languages. Computation can begin only after the Coordinator has accepted a SQL statement, such as a SELECT statement, from a client.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Every Presto cluster must have one coordinator and can have one or more workers. In development and test environments, a single Presto process can be configured to play both roles.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Coordinator tracks activity on every worker and coordinates query execution.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Coordinator creates a logical model of the query made up of multiple stages. Once it receives a SQL statement, the Coordinator parses, analyzes, plans, and schedules the query's execution across multiple worker nodes; the statement is translated into a series of tasks that run on those worker nodes.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As workers process data, the coordinator retrieves the results and places them in an output buffer, where they are exposed to the client.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Once the client has fully read the output buffer, the coordinator requests more data from the workers on the client's behalf.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Worker nodes, in turn, interact with the data sources and fetch data from them. As a result, the client continuously reads data and the data sources continuously supply it until the query finishes.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Coordinator communicates with workers and clients over an HTTP-based protocol.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/6e/6edbaa027bf7807c6904e870e9b54f9f.webp","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","val
ue":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The figure above shows the communication among the client, the coordinator, and the workers.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Workers","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A Presto worker is a service in a Presto cluster. It executes the tasks assigned to it by the coordinator and processes data. Worker nodes fetch data from data sources through connectors and can exchange data with one another. The final results are passed to the coordinator, which collects them from the workers and delivers them to the client.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Communication among workers, and between workers and the coordinator, uses an HTTP-based protocol. The figure below shows how multiple workers fetch data from the data sources and cooperate to process it, until one worker provides the data to the coordinator.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/7a/7a82fc0efe4cb4c547ef5e5bfdcaa129.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Query scheduling:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Through the Coordinator, Presto distributes stages to worker nodes in the form of tasks. The Coordinator links tasks together stage by stage, chaining the stages into an execution tree according to their execution order so that data flows along the stages.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Presto 
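engine performs two levels of scheduling for each query: the first decides the order in which stages are executed; the second decides how many tasks each stage needs to schedule and which worker node each task should be dispatched to.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(1) Stage scheduling","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Presto supports two stage scheduling policies: all-at-once and phased. The all-at-once policy schedules all stages together regardless of the data flow order between them; a stage's tasks can start processing as soon as their data is ready. The phased policy follows the directed graph of stage dependencies and executes in order: downstream tasks are scheduled only after their upstream tasks have finished. For example, in a hash join, Presto does not schedule the left-side table until the hash table is ready.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(2) Task scheduling","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When scheduling a task, the scheduler first determines which kind of stage the task belongs to: a leaf stage or an intermediate stage. Leaf stages read data from data sources through connectors, while intermediate stages process intermediate results from other upstream stages.","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Leaf stages: when dispatching the tasks of a leaf stage to worker nodes, the scheduler must take network and connector constraints into account. For example, in a shared-nothing deployment, worker nodes are co-located with storage, so the scheduler can use the connector Data Layout API to decide which worker nodes to dispatch tasks to. Reports indicate that most CPU time in a production cluster is spent decompressing, decoding, filtering, and transforming data read from connectors, so for this kind of work parallelism should be maximized by using all worker nodes in parallel.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Intermediate stages: these tasks can in principle be dispatched to any worker node, but the engine still has to decide how many tasks each stage gets, which depends on the relevant configuration; the engine can sometimes also change, at runtime, the task 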
count.","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(3) Split scheduling","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When a task in a leaf stage starts executing on a worker node, it receives one or more splits. The information in a split differs by connector. In the simplest case, a split contains the IP for the split and its offset relative to the whole file. For a key-value store such as Redis, a split may contain table information, the key and value formats, and the list of hosts to query. A task in a leaf stage must be assigned one or more splits to run, whereas a task in an intermediate stage needs none.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(4) Split assignment: once tasks have been assigned to the worker nodes, the coordinator begins assigning splits to each task. Presto requires connectors to hand out splits to tasks lazily, in small batches. This is a very useful property with the following benefits:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Decoupled timing: the up-front work of preparing splits is separated from the actual query execution time.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Less unnecessary data loading: a query may be cancelled after producing its first results but before finishing, or a LIMIT clause may end it after only part of the data has been read; lazy assignment avoids loading data that is never needed.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Split queue maintenance: each worker node maintains a queue of the splits assigned to it, and the Coordinator assigns new splits to the task with the shortest queue.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Less metadata to hold: this approach avoids keeping all split metadata in memory during a query. With the Hive connector, for example, a Hive query may generate millions of splits, which could easily fill up the Coordinator 
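memory. This approach is not without drawbacks, however: it can make query progress hard to estimate and report accurately.","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Resource management","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"One important reason Presto suits multi-tenant deployments is that it fully integrates a fine-grained resource management system. A single cluster can run hundreds of queries concurrently while maximizing the use of CPU, I/O, and memory resources.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(1) CPU scheduling","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Presto's primary goal is to optimize overall cluster throughput, that is, the total CPU used for processing data. Local (node-level) scheduling additionally optimizes for low turnaround time for computationally inexpensive tasks, and for fair CPU sharing among tasks with similar CPU requirements. A task's resource usage is the cumulative execution time of all of its splits. To minimize coordination overhead, Presto tracks CPU usage at the task level and makes scheduling decisions locally on each node.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Presto achieves multi-tenancy by scheduling many tasks concurrently on each node using a cooperative multitasking model. Any split may occupy a running thread for at most one second, after which it must yield the thread and return to the queue. If the task's buffers are full or it runs out of memory, it is switched out for another task even before its time quantum expires, maximizing CPU utilization.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When a split leaves a running thread, Presto has to decide which task (each containing one or more splits) runs next.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Rather than predicting in advance how long a new query will take, Presto sums each task's total CPU time and uses it to classify tasks into five queue levels. The more CPU time a task has accumulated, the higher its level. Presto allocates each level a fixed share of total CPU 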
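time.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The scheduler also adapts to certain situations: if an operation exceeds its time quantum, the scheduler records how long it actually held the thread and temporarily reduces its subsequent executions. This helps handle a wide variety of query types. Giving short-running tasks higher priority matches the reality that inexpensive tasks are usually expected to finish quickly, while long-running tasks are less sensitive to time.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(2) Memory management","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In a multi-tenant system like Presto, memory is one of the main resource management challenges.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1. Memory pools","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In Presto, memory is classified as user memory or system memory, and both kinds are held in memory pools. User memory is memory usage that users can reason about given only basic knowledge of the system or the input data (for example, the memory used by an aggregation is proportional to its cardinality). System memory, on the other hand, is a byproduct of implementation decisions (for example, shuffle buffers) and may be unrelated to the query and the volume of input data. In other words, user memory is tied to the running task and can be estimated from the query itself, whereas system memory is mostly fixed overhead.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The engine applies separate limits to user memory and to total memory (user + system). A query that exceeds the global total memory limit or the per-node memory limit is killed. When a node runs out of memory, the query's memory reservations are blocked by halting its tasks.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Sometimes cluster memory cannot be fully utilized, for example because of data skew. Presto provides two mechanisms to mitigate this: spilling and reserved pools.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2. Spilling","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When a node runs out of memory, the engine starts memory revocation: it first sorts the running tasks in ascending order, then selects eligible task 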
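candidates for memory revocation (that is, spilling their state to disk) until enough memory is available to satisfy the last request in the task sequence.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3. Reserved pool","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If the cluster is not configured to spill, then when a node runs out of memory or has no revocable memory, the reserved memory mechanism unblocks the cluster. Under this policy the query memory pool is further divided into two pools: general and reserved. When a query exhausts the general pool, it is additionally granted the reserved pool on every node, so its total memory usage is the sum of the general-pool and reserved-pool resources. To avoid deadlock, only one query in the cluster may use the reserved pool at a time; reserved-pool requests from other tasks are blocked. This is somewhat wasteful in some cases, so a cluster can instead be configured to kill the query rather than block most nodes.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"V. Presto Tuning","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Partition sensibly","attrs":{}},{"type":"text","text":": as with Hive, Presto reads partitioned data based on metadata, so sensible partitioning reduces the amount of data Presto reads and improves query performance.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Use columnar storage","attrs":{}},{"type":"text","text":": Presto has specific optimizations for reading ORC files, so when creating Hive tables for use by Presto, the ORC format is recommended. Presto supports ORC better than Parquet.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Use compression","attrs":{}},{"type":"text","text":": compression reduces the I/O bandwidth pressure of transferring data between nodes. Ad hoc queries need fast decompression, so Snappy is recommended.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Pre-sort","attrs":{}},{"type":"text","text":": for sorted data, the ORC format can skip reading unnecessary data during the filtering phase of a query. For example, columns that are frequently filtered on can be sorted in advance.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Memory tuning","attrs":{}},{"type":"text","text":": Presto has three memory pools: GENERAL_POOL, RESERVED_POO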
L, and SYSTEM_POOL.","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"GENERAL_POOL: used by the physical operators of ordinary queries. Its size is total memory (the Xmx value) - reserved (max-memory-per-node) - system (0.4 * Xmx).","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"SYSTEM_POOL: memory reserved for the system, used for read/write buffers, worker initialization, and the memory needed to execute tasks. Its size is set by resources.reserved-system-memory in config.properties. The default is JVM max memory * 0.4.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"RESERVED_POOL: idle most of the time. It is used only when certain conditions are met simultaneously; the query consuming the most memory is then picked from all queries and moved into RESERVED_POOL to run. Note that RESERVED_POOL can serve only one query at a time. Its size is set by query.max-memory-per-node in config.properties; the default is JVM max memory * 0.1.","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The sizes of the three memory pools are allocated by the following logic:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"builder.put(RESERVED_POOL, new MemoryPool(RESERVED_POOL, config.getMaxQueryMemoryPerNode()));\nbuilder.put(SYSTEM_POOL, new MemoryPool(SYSTEM_POOL, systemMemoryConfig.getReservedSystemMemory()));\nlong maxHeap = Runtime.getRuntime().maxMemory();\nmaxMemory = new DataSize(maxHeap - systemMemoryConfig.getReservedSystemMemory().toBytes(), BYTE);\nDataSize generalPoolSize = new DataSize(Math.max(0, maxMemory.toBytes() - config.getMaxQueryMemoryPerNode().toBytes()), BYTE);\nbuilder.put(GENERAL_POOL, new MemoryPool(GENERAL_POOL, 
generalPoolSize));\n","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In short, the size of RESERVED_POOL is set by query.max-memory-per-node in config.properties, and the size of SYSTEM_POOL is set by resources.reserved-system-memory in config.properties; if it is not set, the default is Runtime.getRuntime().maxMemory() * 0.4, that is, 0.4 * Xmx. The size of GENERAL_POOL is:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Total memory (the Xmx value) - reserved memory (max-memory-per-node) - system memory (0.4 * Xmx).","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Presto developer documentation describes the pools as follows:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"GENERAL_POOL is the memory pool used by the physical operators in a query.\nSYSTEM_POOL is mostly used by the exchange buffers and readers/writers.\nRESERVED_POOL is for running a large query when the general pool becomes full.\n","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In short, GENERAL_POOL is used by the physical operators of ordinary queries, and SYSTEM_POOL is used for read/write buffers. RESERVED_POOL is special: it is idle most of the time and is used only when the general pool becomes full, at which point the query using the most memory is moved into RESERVED_POOL to run. Note that RESERVED_POOL can serve only one query at a time.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Some errors we run into often:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"Query exceeded per-node total memory limit of xx\nIncrease query.max-total-memory-per-node as appropriate.\n\nQuery exceeded distributed user memory limit of xx\nIncrease query.max-memory as appropriate.\n\nCould not communicate with the remote task. 
The node may have crashed or be under too much load\nThe node ran out of memory and crashed; check /var/log/messages.\n","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Parallelism","attrs":{}},{"type":"text","text":": Adjust the thread counts to raise task concurrency and improve efficiency. Parameters to modify:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/7a/7a82fc0efe4cb4c547ef5e5bfdcaa129.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"SQL optimization","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Select only the fields you need: because storage is columnar, selecting only the required fields speeds up reads and reduces data volume. Avoid using * to read every field.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Filter conditions must include the partition field.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Optimize GROUP BY: ordering the fields in a GROUP BY clause sensibly gives some performance gain. List the GROUP BY fields in descending order of their number of distinct values. Reducing the number of fields in the GROUP BY clause also reduces memory usage.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Use LIMIT with ORDER BY, and avoid ORDER BY where possible: ORDER 
BY has to gather the data onto a single worker node to sort it, so that worker needs a large amount of memory.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Use approximate aggregate functions: for queries that can tolerate a small error, these functions improve query performance substantially. For example, approx_distinct() has an error of roughly 2.3% compared with COUNT(DISTINCT x).","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Replace multiple LIKE clauses with regexp_like: the Presto query optimizer does not optimize multiple LIKE clauses, so using regexp_like improves performance considerably.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Put the large table on the left side of a JOIN: Presto's default join algorithm is the broadcast join, which splits the table on the left of the join across multiple workers and sends a full copy of the table on the right to every worker for computation. If the right-hand table is too large, the query may fail with an out-of-memory error.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Use the rank function instead of the row_number function to get the top N.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Use UNION ALL instead of UNION: it skips deduplication.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Use WITH clauses: when a query is very complex or has several levels of nested subqueries, try to factor the subqueries out with WITH.","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Metadata cache","attrs":{}},{"type":"text","text":": Presto supports the Hive connector, and its metadata is stored in the Hive metastore 
service, so tuning the metadata cache parameters can make metadata access more efficient. Parameters to modify:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/9f/9f4b0cd8cd898ae688c699e111c9a0b5.webp","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Hash optimization","attrs":{}},{"type":"text","text":": Optimizations for hash-based operations. Parameters to modify:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/d1/d1b187d6cd52caf14bc3da5ecc5b4f97.webp","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Tune OBS parameters","attrs":{}},{"type":"text","text":": Presto can run on OBS, and when reading from or writing to OBS you can tune the OBS 
client parameters to improve read/write efficiency. Parameters to modify:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/7a/7a82fc0efe4cb4c547ef5e5bfdcaa129.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"6. Presto Data Model","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Presto uses a three-level table structure, which we can compare with MySQL:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"catalog: corresponds to one kind of data source, for example Hive data or MySQL data","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"schema: corresponds to a database in MySQL","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"table: corresponds to a table in MySQL","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To locate a table in Presto, you generally start from the catalog. For example, the fully qualified name of a table is hive.testdata.test, which, within the hive catalog, denotes the 
testdata schema's test table.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Put simply: data_source.database.table.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/7a/7a82fc0efe4cb4c547ef5e5bfdcaa129.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In addition, Presto's storage units are:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Page: a collection of multiple rows that contains data for multiple columns. It exposes only logical rows; the data is actually stored column by column.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Block: one column of data. Different data types usually use different encodings, and understanding these encodings helps when connecting your own storage system to Presto.","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The smallest unit of data Presto processes is a Page object, whose structure is shown in the figure. A Page contains multiple Block objects; each Block is a byte array that stores several rows of one field. One row cut across all the Blocks is one actual row of data. A Page is at most 1 MB and holds at most 16 * 
1024 rows.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"The core question: why is Presto so fast?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Computation speed is a major consideration when choosing Presto, because a compute engine in the same space as SparkSQL will quickly be abandoned if it cannot deliver speed and efficiency.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Meituan blog gives the following answer:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Fully in-memory parallel computation","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Pipelined processing","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data-local computation","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Dynamically compiled execution plans","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Careful use of memory and data structures","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"BlinkDB-style approximate queries","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"GC control","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{
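"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As a rough illustration of two points from the list above, columnar Pages and pipelined processing, the following minimal Java sketch models a Page as a group of column Blocks and pushes each Page through a filter stage and an aggregation stage in memory before the next Page is read. The names PipelineSketch, LongBlock, and Page are hypothetical and chosen for illustration only; they are not Presto's actual classes, and the sketch assumes Java 16 or later for records.","attrs":{}}]},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"import java.util.ArrayList;\nimport java.util.List;\n\n// Hypothetical sketch: these are NOT Presto's real Page/Block classes.\npublic class PipelineSketch {\n    // A Block holds the values of one column for a batch of rows.\n    record LongBlock(long[] values) {}\n\n    // A Page groups equally sized Blocks; row i is the i-th value of every Block.\n    record Page(LongBlock... blocks) {\n        int rowCount() { return blocks[0].values().length; }\n    }\n\n    // Filter stage: keep the rows whose first column is >= min.\n    static Page filter(Page in, long min) {\n        long[] key = in.blocks()[0].values();\n        List<Integer> keep = new ArrayList<>();\n        for (int i = 0; i < in.rowCount(); i++) {\n            if (key[i] >= min) keep.add(i);\n        }\n        LongBlock[] out = new LongBlock[in.blocks().length];\n        for (int b = 0; b < out.length; b++) {\n            long[] src = in.blocks()[b].values();\n            long[] dst = new long[keep.size()];\n            for (int j = 0; j < dst.length; j++) dst[j] = src[keep.get(j)];\n            out[b] = new LongBlock(dst);\n        }\n        return new Page(out);\n    }\n\n    // Aggregation stage: sum the second column of a Page.\n    static long sumSecondColumn(Page in) {\n        long sum = 0;\n        for (long v : in.blocks()[1].values()) sum += v;\n        return sum;\n    }\n\n    public static void main(String[] args) {\n        List<Page> source = List.of(\n            new Page(new LongBlock(new long[]{1, 5, 9}), new LongBlock(new long[]{10, 20, 30})),\n            new Page(new LongBlock(new long[]{7, 2}), new LongBlock(new long[]{40, 50})));\n        long total = 0;\n        // Each Page passes through both stages in memory before the next Page is read.\n        for (Page p : source) total += sumSecondColumn(filter(p, 5));\n        System.out.println(total);\n    }\n}\n","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With the sample data, the first Page contributes 20 + 30 and the second contributes 40, so the program prints 90. No intermediate result is written to disk between the two stages, which is the property the list attributes to Presto.","attrs":{}}]},{"type":"paragraph","attrs":{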
"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"和Hive這種需要調度生成計劃且需要中間落盤的核心優勢在於：Presto是常駐任務，接受請求立即執行，全內存並行計算；Hive需要用yarn做資源調度，接受查詢需要先申請資源，啓動進程，並且中間結果會經過磁盤。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"七、行業典型應用","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Presto 在滴滴的應用","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"滴滴 Presto 用了3年時間逐漸接入公司各大數據平臺，併成爲了公司首選 Ad-Hoc 查詢引擎及 Hive SQL 加速引擎，支持了包含以下的業務場景：","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Hive SQL查詢加速","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據平臺Ad-Hoc查詢","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"報表（BI報表、自定義報表）","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"活動營銷","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據質量檢測","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"資產管理","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs"
:{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Fixed data products","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/7a/7a82fc0efe4cb4c547ef5e5bfdcaa129.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To fit each business line, the team built JDBC, Go, Python, CLI, R, Node.js, HTTP, and other access methods through secondary development and integrated them with the company's internal permission system, so that business teams can connect to Presto quickly and easily with whatever technology stack they use.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Presto is connected to the query-routing Gateway, which intelligently chooses a suitable engine. A user query is sent to Presto first; if it fails, it is retried on Spark, and if that also fails, it is finally sent to Hive. At the Gateway layer we made some optimizations to distinguish large, medium, and small queries, for example by using HBO (history-based statistics) and the number of JOINs. Queries that take less than 3 minutes are considered suitable for Presto. The architecture is shown below:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/36/36939ab058bbcaf690c698af4f24a7f8.webp","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Inside Didi, Presto is used mainly for ad hoc queries and Hive SQL query acceleration. To help users migrate SQL to the Presto engine quickly and to improve Presto's query performance, we did a large amount of secondary development on Presto. The main features include:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Hive SQL 
compatibility","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Physical resource isolation","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A connector that queries Druid directly","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Multi-tenancy, and more","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Running Presto brings many stability problems, such as Coordinator OOM and Worker Full GC.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Didi has summarized the common Coordinator problems and their fixes:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Using the HDFS FileSystem Cache caused a memory leak; the fix was to disable the FileSystem Cache, and Presto later maintained its own FileSystem 
Cache.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Jetty leaked off-heap memory because of Gzip; upgrading Jetty fixed it.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Too many splits left no available ports and drove TIME_WAIT too high; tuning TCP parameters fixed it.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A Presto core bug: too many failed SQL queries caused a Coordinator memory leak; the community has fixed it.","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Presto Workers mainly do computation, so their performance bottlenecks are mostly memory and CPU. For memory, three approaches are used to protect the cluster and track down problems:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Use Resource Groups to limit business concurrency and prevent severe overselling.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Tune the JVM to resolve common memory problems such as Young GC Exhausted.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Make good use of the MAT tool to find memory bottlenecks.","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Presto 
at Youzan","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Youzan uses Presto mainly to support the following workloads:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Ad hoc queries on the data platform (DP): the unified entry point that Youzan's big data team uses for exploratory data analysis; it also provides data masking, auditing, and similar features.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"BI report engine: provides merchants with all kinds of analytical reports.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Metadata data quality validation and similar checks: the metadata system uses Presto to validate data quality.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data products: CRM data analysis, audience profiling, and similar products use Presto for computation.","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/7a/7a82fc0efe4cb4c547ef5e5bfdcaa129.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Of course, Youzan's use of Presto also went through a long series of iterations:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Stage 1: Presto and Hadoop 
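deployed together","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Stage 2: fully independent Presto cluster","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Stage 3: dedicated Presto cluster for low-latency workloads","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In stage 2, resource isolation still relied mainly on Resource Groups, but this kind of isolation is relatively weak: it cannot provide fine-grained isolation, and tasks still affect one another. In addition, different workloads differ in SQL type, data volume queried, query time, tolerable SLA, and optimal configuration. Some business teams need a guarantee of very low response time, so Youzan deployed a dedicated cluster for these workloads.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Workloads on this cluster require low latency, usually under 3 seconds and in some cases under 1 second, along with a certain amount of concurrency. These workloads usually do not involve very large data volumes and mostly query wide tables, so there is no need to join other data; the group cardinality produced by GROUP BY and the volume of aggregated data are not especially large, and query time is spent mostly on scanning and reading data. These business teams also get a dedicated Presto cluster with fully independent resources and a local HDFS. Youzan additionally runs in-depth performance tests for these workloads and adjusts the configuration accordingly. For example, setting Task Concurrency to 1 actually performed better in high-concurrency tests because it reduced switching between threads.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Finally, the main problems Youzan ran into while using Presto include:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The HDFS small-files problem","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The HDFS small-files problem is common in the big data field. Some Hive tables in the data warehouse have several thousand files and are especially slow to query. Two Presto parameters cap how many splits each node and each task can take on; the split 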
limits are:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"node-scheduler.max-splits-per-node=100\nnode-scheduler.max-pending-splits-per-task=10\n","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The problem of DISTINCT on multiple columns","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Put simply, a proper optimizer should use grouping sets to merge multiple GROUP BYs into one to improve performance:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":" SELECT a1, a2,..., an, F1(b1), F2(b2), F3(b3), ...., Fm(bm), F1(distinct c1), ...., Fm(distinct cm) FROM Table GROUP BY a1, a2, ..., an\n\n is transformed into\n\n SELECT a1, a2,..., an, arbitrary(if(group = 0, f1)),...., arbitrary(if(group = 0, fm)), F(if(group = 1, c1)), ...., F(if(group = m, cm)) FROM\n SELECT a1, a2,..., an, F1(b1) as f1, F2(b2) as f2,...., Fm(bm) as fm, c1,..., cm group FROM\n SELECT a1, a2,..., an, b1, b2, ... ,bn, c1,..., cm FROM Table GROUP BY GROUPING SETS ((a1, a2,..., an, b1, b2, ... 
,bn), (a1, a2,..., an, c1), ..., ((a1, a2,..., an, cm)))\n GROUP BY a1, a2,..., an, c1,..., cm group\n GROUP BY a1, a2,..., an\n","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Unfortunately, Presto does not implement this. That concludes Youzan's experience with Presto.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"8. Summary","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As with other OLAP engines, I learned Presto through a long process of searching documentation and exploring the official site, improving step by step. In fact, everyone runs into all kinds of problems when adopting any new technology, and the real test is how to solve them with the material that is already available.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/d3/d33376d0950657df14f4fda990dd420c.webp","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"http://mp.weixin.qq.com/s?__biz=MzU3MzgwNTU2Mg==&mid=2247503995&idx=1&sn=ead9bbd4ea821c94efc1e18875c1722c&chksm=fd3e96eeca491ff8e0b6a0e1ad3c9ada5365457dd730ce9e19339803504ae10d8108c296fc27&scene=21#wechat_redirect","title":null,"type":null},"content":[{"type":"text","marks":[{"type":"underline","attrs":{}}],"text":"《硬剛Presto|Presto原理&調優&面試&實戰全面升級版》","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"http://mp.weixin.qq.com/s?__biz=MzU3MzgwNTU2Mg==&mid=2247503879&idx=2&sn=bd009e298f2bdf9bb8abc9271b515143&chksm=fd3e9692ca491f84722c922000754aafc4d5a271b0ee40d525ce177de3c4442ef83b5ade8e05&scene=21#wechat_redirect","title":null,"type":null},"content":[{"type":"text","marks":[{"type":"underline","attrs":{}}],"text
":"《硬剛Apache Iceberg | 技術調研&在各大公司的實踐應用大總結》","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"http://mp.weixin.qq.com/s?__biz=MzU3MzgwNTU2Mg==&mid=2247503675&idx=1&sn=3ee6af64d0126c78b48cad219308f81e&chksm=fd3e89aeca4900b8b8954e9569ee3c0877881fac8c792bfafc22e7e9d3e8524da8eb860d33d8&scene=21#wechat_redirect","title":null,"type":null},"content":[{"type":"text","marks":[{"type":"underline","attrs":{}}],"text":"《硬剛ClickHouse | 4萬字長文ClickHouse基礎&實踐&調優全視角解析》","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"http://mp.weixin.qq.com/s?__biz=MzU3MzgwNTU2Mg==&mid=2247503576&idx=1&sn=f9fc428799e0fcc78e94360e1cec7b95&chksm=fd3e884dca49015b6d38c437f603b4deffeeb0cefafd32d358a891bfe820f734116100395bba&scene=21#wechat_redirect","title":null,"type":null},"content":[{"type":"text","marks":[{"type":"underline","attrs":{}}],"text":"《硬剛數據倉庫|SQL Boy的福音之數據倉庫體系建模&實施&注意事項小總結》","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"http://mp.weixin.qq.com/s?__biz=MzU3MzgwNTU2Mg==&mid=2247502750&idx=1&sn=bd9a9173d060dc4e4ebd49c8efc6acfe&chksm=fd3e8d0bca49041dea84da93910e5efdc4935e520525c09887c986691377aeb48e5cf7fb5667&scene=21#wechat_redirect","title":null,"type":null},"content":[{"type":"text","text":"《硬剛Hive | 
4萬字基礎調優面試小總結》","attrs":{}}],"marks":[{"type":"underline"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"http://mp.weixin.qq.com/s?__biz=MzU3MzgwNTU2Mg==&mid=2247503741&idx=1&sn=e5039be93123f2e337013756a818bfc3&chksm=fd3e89e8ca4900fe603b63c5722a6fb8a32bd63d6ba23e0028851948a71b877eb1f742d95087&scene=21#wechat_redirect","title":"","type":null},"content":[{"type":"text","marks":[{"type":"underline","attrs":{}}],"text":"《硬剛用戶畫像(一) | 標籤體系下的用戶畫像建設小指南》","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"http://mp.weixin.qq.com/s?__biz=MzU3MzgwNTU2Mg==&mid=2247489571&idx=1&sn=56a634d66fb689907b4ab51ed2d3707a&chksm=fd3d5eb6ca4ad7a0cc5fa4f895354e58ed7f2cb8558369ed6149560a5e7fca97b8545036fe87&scene=21#wechat_redirect","title":null,"type":null},"content":[{"type":"text","marks":[{"type":"underline","attrs":{}}],"text":"《硬剛用戶畫像(二) | 基於大數據的用戶畫像構建小百科全書》","attrs":{}}]}]}]}