Notes on Database Kernels (17): A Deep Dive into Code-Gen

{"type":"doc","content":[
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"This article is part of the Database Kernel series."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"In this article we take a closer look at code-gen, based on Thomas Neumann's paper: Efficiently Compiling Efficient Query Plans for Modern Hardware."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Shortcomings of the Volcano Model"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"A quick review of the traditional Volcano execution model: the optimizer produces a physical execution plan made up of physical operators; the plan is a tree, and each upper operator pulls data from below by calling the next method of its child operator to fetch the next tuple. The Volcano model is easy to understand, very general, and easy to implement. That generality, however, costs performance. Consider just the calls to next: first, generality implies dynamic binding; second, each method call has its own overhead; and every operator has to call next for every tuple it processes. Imagine a complex query made up of 20 operators processing about 10 million tuples: the calls to next alone number roughly 200 million. In addition, because data is processed one tuple at a time, data locality is poor: the data to be processed may not sit close together in memory, which lowers the efficiency of CPU register usage (data has to be moved in and out of registers over and over) and hurts performance."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"The Insight from Hand-Written Code"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The paper cites an observation from another MonetDB\/X100 article (MonetDB\/X100 - a DBMS in the CPU cache): given a SQL statement, if a highly skilled programmer (say, Jeff Dean) hand-writes code to implement it, that code will certainly outperform the Volcano model. On reflection this is easy to believe. First, since all the operators in the query are known up front, they can all be implemented inside a single function, which avoids the countless calls to next. Second, since the code does not have to be general with respect to data or operators, it can read and manipulate data directly in memory. Leaving other optimizations aside, these two alone noticeably improve query performance. The MonetDB article gives a comparison, shown in the figure below."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/1b\/1b848f4ad848f2b39f5535b653741204.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Building on this insight, the paper proposes a new execution model and optimization approach: given a SQL statement, do not stop at an optimized execution plan, but go one step further and generate machine code for that plan. This machine code serves only this one statement, so it does not need to be general or compatible with anything else. The generated code may not truly reach Jeff Dean's level, but a number of optimizations can bring it close to the best achievable performance."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"How is best performance defined? Since programs are ultimately executed by the CPU, the objective the paper optimizes for is to execute as few CPU instructions as possible. And how can the number of CPU instructions be reduced? The paper's answer is to execute the SQL statement in a data-centric way rather than an operator-centric way. This sounds a bit mystical, but the essence is simple: keep the data being processed in CPU registers for as long as possible. That removes the wasted work of moving data in and out of registers, and so reduces the number of CPU instructions."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"The New Execution Model"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The execution model proposed in the paper organizes execution logic around the data as much as possible."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"First, the paper introduces a new concept to help describe this model: the pipeline-breaker. If the execution logic of an operator in the plan requires taking an incoming tuple out of the CPU registers, that operator is a pipeline-breaker. If an operator has to wait until its child operator has delivered all of its tuples before it can process any data, it is a full-pipeline-breaker. (In practice a single tuple may of course be too large to fit in the CPU registers; the paper makes the simplifying assumption that there are enough registers.)"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"With this definition in hand, the next question is how to keep data in registers the whole time. The traditional Volcano model described above clearly cannot do this: its operators pass tuples to each other through method calls, method calls update the call stack, and by then the data has long since been evicted from registers. The approach in the paper is to keep pushing data to the next operator that has to process it until a pipeline-breaker is reached. That explanation is not very intuitive. My understanding is: for a given tuple, line up the operators that need to process it (up to the next pipeline-breaker) and apply them one after another; throughout this pipeline the data stays in CPU registers, so executing a pipeline is very efficient. The pipeline-breaker operators thus split an execution plan into several pipelines, and from this point of view data always finishes one pipeline before entering the next."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"An example makes the whole process more concrete. Consider the SQL statement below:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"It is a join query with a subquery nested inside."}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/72\/729a55cc507abc96243cd112df06bd17.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"A conventionally generated execution plan looks like this:"}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/af\/afe769d71c49e2f6930e025470ddd07b.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"First R2 is scanned and filtered, then a group by with count(*) is computed on R2.z, producing a temporary table; that table is joined with R3, and the result is finally joined with the processed tuples of R1."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"By the definition of pipeline-breaker, the GroupBy operator and the Join operators in the figure are pipeline-breakers, because each of them needs all the tuples from a child operator before it can proceed: the GroupBy operator needs its full input before it can produce the count aggregate grouped by z, and each Join operator has to build a hash table from its left input in order to perform a hash join. By the earlier definition, the plan is split into 4 pipelines, as shown below:"}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/8c\/8c645c161144b855ab02e4fc2ea41ebd.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Within each pipeline, data can be kept in CPU registers as much as possible. From this decomposition, the following code for the statement can be generated:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/20\/20b0c8778a9615288ec85e2060df56ff.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"As you can see, each pipeline corresponds to one outermost for loop, so there are 4 outer for loops in total; the last one, a three-level nested for loop, carries out the two joins. For each of these loops, code optimization can keep the data in CPU registers as much as possible."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The paper also introduces a new algebra to describe how the operators are reworked, and walks through an example of how, with this algebra, an execution plan is split into pipelines and then turned into pseudocode. I find it somewhat abstract and will not go into it here; interested readers can consult the original paper."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"code-gen: Code Generation"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The example above showed the pseudocode of 4 for loops after splitting the plan into 4 pipelines. Now let us discuss how to generate machine code the computer can actually run. The author mentions that the initial idea was to have the optimizer emit C++ code for the optimized SQL statement (similar to the for-loop pseudocode above) and then compile that into machine code. Experiments revealed drawbacks, the most important being that producing optimized code through the C++ route is very slow, taking on the order of seconds, which is unacceptable."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"In the end, based on experiments, the authors decided to use LLVM directly: the statement is translated into LLVM assembly code, and LLVM's own JIT compiler turns that into machine code and runs it. The paper lists the following benefits of LLVM: 1) the generated code is of high quality and makes the best possible use of CPU registers; moreover, LLVM assembly code is portable and can be turned into machine code by the LLVM JIT compiler on each platform. 2) LLVM assembly code is strongly typed (I am not sure what the benefit is; perhaps strong typing gives better performance at run time). 3) LLVM is widely used in industry, so its quality is assured and it keeps improving. 4) Compilation is fast, measured in milliseconds."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"However, the paper points out that the SQL statement is not compiled into LLVM code wholesale, mainly because the engineering effort would be too large. An execution engine written in C++ already exists, and there is no need to rewrite all of that code in LLVM assembly; besides, LLVM assembly code and C++ code can call each other seamlessly in both directions. Complicated logic and complicated operators can reuse the existing C++ implementation, for example access to complex data structures or spilling data to files. In effect, the LLVM code dynamically stitches these precompiled C++ pieces together for the specific statement being executed, producing the code for one complete execution plan. Hot code (code that is run for almost every tuple) can be generated entirely as LLVM assembly to improve performance. The paper illustrates how LLVM and C++ code fit together with a diagram."}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/a0\/a0e21aed40f5bf3770166aa1cd6cff09.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"In addition, in testing, complex operators make it hard to generate one complete function that contains all the logic of a statement, so the LLVM assembly code can call existing C++ code for those parts."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The paper also describes using SIMD as an advanced optimization to improve performance further. If the block processed inside the for loop is too large, it is unfriendly to CPU registers and hurts performance. The block size can be reduced so that all the data still fits in registers, and SIMD instructions can then speed up the processing. Besides SIMD, multi-core processing can be used as well: split the data to be processed into multiple parts and process them in parallel on multiple CPUs. In the example above, independent pipelines can also run in parallel, for instance the pipeline over R1 and the pipeline over R2."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The paper closes with some benchmarks of the technique; the results are, unsurprisingly, clear-cut."}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/75\/75b2e544c9a3d28d7add03be71a74378.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"From the figure it is easy to conclude:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"1) Compiling C++ code takes far longer than compiling with LLVM."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"2) Even counting compilation time, LLVM compilation plus execution gives the best performance."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"3) Vec-exec also performs well."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"4) Both techniques are far better than traditional database execution engines."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Summary"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"This article walked through Thomas Neumann's paper Efficiently Compiling Efficient Query Plans for Modern Hardware to understand how code-gen improves performance. By defining pipeline-breaker operators, the paper splits an execution plan into multiple pipelines; pipelines with no dependencies between them can run in parallel, and each pipeline can be optimized to keep data in CPU registers as much as possible. The paper also demonstrates that compiling with LLVM is feasible and argues that it is superior to generating C++ code. Among commercial database systems, Snowflake (covered last time) is not the only one that uses code-gen to improve execution performance. This Greenplum GPDB "},{"type":"link","attrs":{"href":"https:\/\/engineering.pivotal.io\/post\/codegen-gpdb-qx\/","title":"xxx","type":null},"content":[{"type":"text","text":"blog post"}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" describes how code-gen is used to improve performance; it is included here for reference, and interested readers can read on."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Thanks for reading this installment! "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#303030","name":"user"}}],"text":"One of my goals for 2021 is to share more about technology and management. If you would like to talk more, you are welcome to follow my Knowledge Planet: "},{"type":"link","attrs":{"href":"https:\/\/t.zsxq.com\/feEUfay","title":null,"type":null},"content":[{"type":"text","text":"Dr.ZZZ 聊技術和管理"}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#303030","name":"user"}}],"text":"."}]}
]}
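The contrast between the Volcano pull model and a data-centric pipeline can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: the classes `Scan` and `Filter`, the functions `volcano_count_by_y` and `fused_count_by_y`, and the toy query (filter on `x`, then count per `y`) are hypothetical names invented here, not code from the paper, and pure Python cannot show register residency. It only shows the structural difference: one `next` call per operator per tuple versus a single fused loop that ends at a pipeline-breaker (the group-by hash table).

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Tuple[int, int]  # hypothetical (x, y) tuple layout


class Scan:
    """Volcano-style leaf operator: returns one tuple per next() call."""

    def __init__(self, rows: List[Row]) -> None:
        self._rows = rows
        self._pos = 0
        self.calls = 0  # number of next() invocations, for comparison only

    def next(self) -> Optional[Row]:
        self.calls += 1
        if self._pos >= len(self._rows):
            return None  # end of input
        row = self._rows[self._pos]
        self._pos += 1
        return row


class Filter:
    """Volcano-style filter: pulls from its child until a tuple qualifies."""

    def __init__(self, child: Scan, pred: Callable[[Row], bool]) -> None:
        self._child = child
        self._pred = pred
        self.calls = 0

    def next(self) -> Optional[Row]:
        self.calls += 1
        while True:
            row = self._child.next()
            if row is None or self._pred(row):
                return row


def volcano_count_by_y(rows: List[Row], threshold: int) -> Tuple[Dict[int, int], int]:
    """Pull model: the group-by at the root repeatedly calls next() on its child."""
    scan = Scan(rows)
    flt = Filter(scan, lambda r: r[0] > threshold)
    counts: Dict[int, int] = {}
    row = flt.next()
    while row is not None:
        counts[row[1]] = counts.get(row[1], 0) + 1
        row = flt.next()
    return counts, scan.calls + flt.calls


def fused_count_by_y(rows: List[Row], threshold: int) -> Dict[int, int]:
    """Data-centric form: scan, filter and aggregation fused into one loop."""
    counts: Dict[int, int] = {}
    for x, y in rows:                          # scan
        if x > threshold:                      # filter, no operator boundary
            counts[y] = counts.get(y, 0) + 1   # group-by: the pipeline-breaker
    return counts


if __name__ == "__main__":
    data = [(1, 10), (5, 10), (7, 20), (9, 20), (2, 30)]
    pulled, calls = volcano_count_by_y(data, 4)
    print(pulled, "next() calls:", calls)
    print(fused_count_by_y(data, 4))
```

Both functions compute the same result; the pull version simply pays for a chain of `next` calls on every tuple, which is the overhead that the generated code is meant to eliminate.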