百度C++工程師的那些極限優化(併發篇)

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"導讀:","attrs":{}},{"type":"text","text":"對於工程經驗比較豐富的同學,併發應該也並不是陌生的概念了,但是每個人所理解的併發問題,卻又往往並不統一,本文系統梳理了百度C++工程師在進行併發優化時所作的工作。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}},{"type":"underline","attrs":{}}],"text":"全文15706字,預計閱讀時間24分鐘。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"一、背景","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"簡單回顧一下,一個程序的性能構成要件大概有三個,即算法複雜度、IO開銷和併發能力。由於現代計算機體系結構複雜化,造成很多時候,工程師的性能優化會更集中在算法複雜度之外的另外兩個方向上,即IO和併發,在之前的","attrs":{}},{"type":"link","attrs":{"href":"https://mp.weixin.qq.com/s?__biz=Mzg5MjU0NTI5OQ==&mid=2247489076&idx=1&sn=748bf716d94d5ed2739ea8a9385cd4a6&chksm=c03d2648f74aaf5e11298cf450c3453a273eb6d2161bc90e411b6d62fa0c1b96a45e411af805&token=1987775079&lang=zh_CN&scene=21#wechat_redirect","title":null,"type":null},"content":[{"type":"text","text":"《百度C++工程師的那些極限優化(內存篇)》","attrs":{}}]},{"type":"text","text":"中,我們介紹了百度C++工程師工程師爲了優化性能,從內存IO角度出發所做的一些優化案例。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/1a/1ad70589e7495ae19d615bc87759170a.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這次我們就再來聊一聊另外一個性能優化的方向,也就是所謂的","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"併發優化","attrs":{}},{"type":"text","text":"。和IO方向類似,對於工程經驗比較豐富的同學,併發應該也並不是陌生的概念了,但是每個人所理解的併發問題,卻又往往並不統一。所以下面我們先回到一個更根本的問題,重新梳理一下所謂的併發優化。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"二、爲什麼我們需要併發?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"是的,這個問題可能有些跳躍,但是在自然地進展到如何處理各種併發問題之前,我們確實需要先停下來,回想一下爲什麼我們需要併發?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這時第一個會冒出來的概念可能會是","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"大規模","attrs":{}},{"type":"text","text":",例如我們要設計大規模互聯網應用,大規模機器學習系統。可是我們仔細思考一下,無論使用了那種程度的併發設計,這樣的規模化系統背後,都需要成百上千的實例來支撐。也就是,如果一個設計(尤其是無狀態計算服務設計)已經可以支持某種小規模業務。那麼當規模擴大時,很可能手段並不是提升某個業務單元的處理能力,而是增加更多業務單元,並解決可能遇到的分佈式問題。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其實真正讓併發編程變得有價值的背景,更多是業務單元本身的處理能力無法滿足需求,例如一次請求處理時間過久,業務精細化導致複雜度積累提升等等問題。那麼又是什麼導致了近些年來,業務單元處理能力問題不足的問題呈現更加突出的趨勢?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"可能下面這個統計會很說明問題:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"(https://www.karlrupp.net/2015/06/40-years-of-microprocessor-trend-data/)","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c9/c9c455e30f8765ccd5218a13f6415821.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上圖從一個長線角度,統計了CPU的核心指標參數趨勢。從其中的晶體管數目趨勢可以看出,雖然可能逐漸艱難,但是摩爾定律依然尚能維持。然而近十多年,出於控制功耗等因素的考慮,CPU的主頻增長基本已經停滯,持續增加的晶體管轉而用來構建了更多的核心。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"從CPU廠商角度來看,單片處理器所能提供的性能還是保持了持續提升的,但是單線程的性能增長已經顯著放緩。從工程師角度來看,最大的變化是硬件紅利不再能透明地轉化成程序的性能提升了。隨時代進步,更精準的算法,更復雜的計算需求,都在對的計算性能提出持續提升的要求。早些年,這些算力的增長需求大部分還可以通過處理器更新換代來自然解決,可是隨着主頻增長停滯,如果無法利用多核心來加速,程序的處理性能就會隨主頻一同面臨增長停滯的問題。因此近些年來,是否能夠充分利用多核心計算,也越來越成爲高性能程序的一個標籤,也只有具備了充分的多核心利用能力,才能隨新型硬件演進,繼續表現出指數級的性能提升。而伴隨多核心多線程程序設計的普及,如何處理好程序的併發也逐漸成了工程師的一項必要技能。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/db/db5d2edf564b71748513f4a42f2f4d7f.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上圖描述了併發加速的基本原理,首先是對原始算法的單一執行塊拆分成多個能夠同時運行的子任務,並設計好子任務間的協同。之後利用底層的並行執行部件能力,將多個子任務在時間上真正重疊起來,達到真正提升處理速度的目的。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"需要注意的是還有一條從下而上的反向剪頭,主要表達了,爲了正確高效地利用並行執行部件,往往會反向指導上層的併發設計,例如正確地數據對齊,合理的臨界區實現等。雖然加速看似完全是由底層並行執行部件的能力所帶來的,程序設計上只需要做到子任務拆分即可。但是現階段,執行部件對上層還無法達到透明的程度,導致這條反向依賴對於最終的正確性和性能依然至關重要。既瞭解算法,又理解底層設計,並結合起來實現合理的併發改造,也就成爲了工程師的一項重要技能。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"三、單線程中的並行執行","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"提到並行執行部件,大家的第一個印象往往時多核心多線程技術。不過在進入到多線程之前,我們先來看看,即使是單線程的程序設計中,依然需要關注的那些並行執行能力。回過頭再仔細看前文的處理器趨勢圖其實可以發現,雖然近年主頻不再增長,甚至穩中有降,但是單線程處理性能其實還是有細微的提升的。這其實意味着,在單位時鐘週期上,單核心的計算能力依然在提升,而這種提升,很大程度上就得益於單核心單線程內的細粒度並行執行能力。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"3.1 SIMD","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其中一個重要的細粒度並行能力就是SIMD(Single Instruction Multiple Data),也就是多個執行單元,同時對多個數據應用相同指令進行計算的模式。在經典分類上,一般單核心CPU被歸入SISD(Single Instruction Single Data),而多核心CPU被歸入MIMD(Mingle Instruction Multiple D ata),而GPU才被歸入SIMD的範疇。但是現代CPU上,除了多核心的MIMD基礎模型,也同時附帶了細粒度SIMD計算能力。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/3c/3c662bb51f021d6c27bf9a8d2b7d91d7.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/98/98e0820f20e7aef0cf8f3f0d4ca0bd6e.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上圖是Intel關於SIMD指令的一個示意圖,通過增加更大位寬的寄存器實現在一個寄存器中,“壓縮”保存多個較小位寬數據的能力。再通過增加特殊的運算指令,對寄存器中的每個小位寬的數據元素,批量完成某種相同的計算操作,例如圖示中最典型的對位相加運算。以這個對位相加操作爲例,CPU只需要增大寄存器,內存傳輸和計算部件位寬,針對這個特殊的應用場景,就提升到了8倍的計算性能。相比將核心數通用地提升到8倍大小,這種方式付出的成本是非常少的,指令流水線系統,緩存系統都做到了複用。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"從CPU發展的視角來看,爲了能夠在單位週期內處理更多數據,增加核心數的MIMD強化是最直觀的實現路徑。但是增加一套核心,就意味增加一套 完整的指令部件、流水線部件和緩存部件,而且實際應用時,還要考慮額外的核心間數據分散和聚合的傳輸和同步開銷。一方面高昂的部件需求, 導致完整的核心擴展成本過高,另一方面,多核心間傳輸和同步的開銷針對小數據集場景額外消耗過大,還會進一步限制應用範圍。爲了最大限度利用好有限的晶體管,現代CPU在塑造更多核心的同時,也在另一個維度上擴展單核心的處理和計算位寬,從而實現提升理論計算性能(核心數 * 數據寬度)的目的。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"不過提起CPU上的SIMD指令支持,有一個繞不開的話題就是和GPU的對比。CPU上早期SIMD指令集(MMX)的誕生背景,和GPU的功能定位就十分類似,專注於加速圖像相關算法,近些年又隨着神經網絡計算的興起,轉向通用矩陣類計算加速。但是由於GPU在設計基礎上就以面向密集可重複計算負載設計,指令部件、流水線部件和緩存部件等可以遠比CPU簡潔,也因此更容易在量級上進行擴展。這就導致,當計算密度足夠大,數據的傳輸和同步開銷被足夠沖淡的情況下(這也是典型神經網絡計算的的特性),CPU僅作爲控制流進行指揮,而數據批量傳輸到GPU協同執行反而 會更簡單高效。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由於Intel自身對SIMD指令集的宣傳,也集中圍繞神經網絡類計算來展開,而在當前工程實踐經驗上,主流的密集計算又以GPU實現爲主。這就導致了不少CPU上SIMD指令集無用論應運而生,尤其是近兩年Intel在AVX512初代型號上的降頻事件,進一步強化了『CPU就應該做好CPU該做的事情』這一論調。但是單單從這一的視角來認識CPU上的SIMD指令又未免有些片面,容易忽視掉一些真正有意義的CPU上SIMD應用場景。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/70/70f1fa38038db8c970994caec85f854c.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於一段程序來講,如果將每讀取單位數據,對應的純計算複雜度大小定義爲計算密度,而將算法在不同數據單元上執行的計算流的相同程度定義爲模式重複度,那麼可以以此將程序劃分爲4個象限。在大密度可重複的計算負載(典型的重型神經網絡計算),和顯著小密度和非重複計算負載(例如HTML樹狀解析)場景下,業界在CPU和GPU的選取上其實是有相對明確“最優解”的。不過對於過渡地帶,計算的重複特徵沒有那麼強,  或者運算密度沒有那麼大的場景下,雙方的弱點都會被進一步放大。即便是規整可重複的計算負載,隨着計算本身強度減小,傳輸和啓動成本逐漸顯著。另一方面,即便是不太規整可重複的計算負載,隨着計算負荷加大,核心數不足也會逐漸成爲瓶頸。這時候,引入SIMD的CPU和引入SIMT 的GPU間如何選擇和使用,就形成了沒有那麼明確,見仁見智的權衡空間。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"即使排除了重型神經網絡,從程序的一般特性而言,具有一定規模的重複特性也是一種普遍現象。例如從概念上講,程序中的循環段落,都或多或少意味着批量/重複的計算負載。儘管因爲摻雜着分支控制,導致重複得沒有那麼純粹,但這種一定規模的細粒度重複,正是CPU上SIMD發揮獨特價值的地方。例如最常見的SIMD優化其實就是memcpy,現代的memcpy實現會探測CPU所能支持的SIMD指令位寬,並盡力使用來加速內存傳輸。另一方面現代編譯器也會利用SIMD指令來是優化對象拷貝,進行簡單循環向量化等方式來進行加速。類似這樣的一類優化方法偏『自動透明』,也是默默支撐着主頻不變情況下,性能稍有上升的重要推手。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"可惜這類簡單的自動優化能做到的事情還相當有限,爲了能夠充分利用CPU上的SIMD加速,現階段還非常依賴程序層進行主動算法適應性改造,有 目的地使用,換言之,就是主動實施這種單線程內的併發改造。一個無法自動優化的例子就是《內存篇》中提到的字符串切分的優化,現階段通過編譯器分析還很難從循環 + 判斷分支提取出數據並行pattern並轉換成SIMD化的match&mask動作。而更爲顯著的是近年來一批針對SIMD指令重新設計的算法,例如Swiss Table哈希表,simdjson解析庫,base64編解碼庫等,在各自的領域都帶來了倍數級的提升,而這一類算法適應性改造,就已經完全脫離了自動透明所能觸及的範圍。可以預知近些年,尤其隨着先進工藝下AVX512降頻問題的逐漸解決,還會/也需要湧現出更多的傳統基礎算法的SIMD改造。而熟練運用SIMD指令優化技術,也將成爲C++工程師的一項必要技能。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"3.2 OoOE","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另一個重要的單線程內並行能力就是亂序執行OoOE(Out of Order Execution)。經典教科書上的CPU流水線機制一般描述如下(經典5級RISC流水線)。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/0a/0a9bc3bf172d54f5814a93d53900bf95.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"指令簡化表達爲取指/譯碼/計算/訪存/寫回環節,當執行環節遇到數據依賴,以及緩存未命中等場景,就會導致整體停頓的產生。其中MEM環節的影響尤其顯著,主要也是因爲緩存層次的深化和多核心的共享現象,帶來單次訪存所需週期數參差不齊的現象越來越嚴重。上圖中的流水線在多層緩存下的表現,可能更像下圖所示:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/f6/f6fba1aa64b938927b2a8b0a71c50365.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了減輕停頓的影響,現代面向性能優化的CPU一般引入了亂序執行結合超標量的技術。也就是一方面,對於重點執行部件,比如計算部件,訪存部件等,增加多份來支持並行。另一方面,在執行部件前引入緩衝池/隊列機制,通用更長的預測執行來儘可能打滿每個部件。最終從流水線模式,轉向了更類似『多線程』的設計模式:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/49/496d2c783f987e5b20820ab2d9731665.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"亂序執行系統中,一般會將通過預測維護一個較長的指令序列,並構建一個指令池,通過解析指令池內的依賴關係,形成一張DAG(有向無環圖) 組織的網狀結構。通過對DAG關係的計算,其中依賴就緒的指令,就可以進入執行態,被提交到實際的執行部件中處理。執行部件類似多線程模型中的工作線程,根據特性細分爲計算和訪存兩類。計算類一般有相對固定可預期的執行週期,而訪存類由於指令週期差異較大,採用了異步回調的模型,通過Load/Store Buffer支持同時發起數十個訪存操作。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"亂序執行系統和傳統流水線模式的區別主要體現在,當一條訪存指令因爲Cache Miss而無法立即完成時,其後無依賴關係的指令可以插隊執行(類似於多線程模型中某個線程阻塞後,OS將其掛起並調度其他線程)。插隊的計算類指令可以填補空窗充分利用計算能力,而插隊的訪存指令通過更早啓動傳輸,讓訪存停頓期儘量重疊來減小整體的停頓。因此亂序執行系統的效率,很大程度上會受到窗口內指令DAG的『扁平』程度的影響,依賴深度較淺的DAG可以提供更高的指令級併發能力,進而提供更高的執行部件利用率,以及更少的停頓週期。另一方面,由於Load/Store Buffer也有最大的容量限制,處理較大區域的內存訪問負載時,將可能帶來更深層穿透的訪存指令儘量靠近排布,來提高訪存停頓的重疊,也能夠有效減少整體的停頓。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"雖然理論比較清晰,可是在實踐中,僅僅從外部指標觀測到的性能表現,往往難以定位亂序執行系統內部的熱點。最直白的CPU利用率其實只能表達線程未受阻塞,真實在使用CPU的時間週期,但是其實並不能體現CPU內部部件真正的利用效率如何。稍微進階一些的IPC(Instruction Per Cyc le),可以相對深入地反應一些利用效能,但是影響IPC的因素又多種多樣。是指令並行度不足?還是長週期ALU計算負載大?又或者是訪存停頓過久?甚至可能是分支預測失敗率過高?真實程序中,這幾項問題往往是並存的,而且單一地統計往往又難以統一比較,例如10次訪存停頓/20次ALU 未打滿/30個週期的頁表遍歷,到底意味着瓶頸在哪裏?這個問題單一的指標往往就難以回答了。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"3.3 TMAM","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"TMAM(Top-down Microarchitecture Analysis Method)是一種利用CPU內部PMU(Performance Monitoring Unit)計數器來從上至下分解定位部件瓶頸的手段。例如在最頂層,首先以標定最大指令完成速率爲標準(例如Skylake上爲單週期4條微指令),如果無法達到標定,則認爲瓶頸在於未能充分利用部件。進一步細分以指令池爲分界線,如果指令池未滿,但是取指部件又無法滿負荷輸出微指令,就表明『前端』存在瓶頸。另一種無法達到最大指令速率的因素,是『前端』雖然在發射指令到指令池,但是因爲錯誤的預測,最終沒有產出有效結果,這類損耗則被歸入『錯誤預測』。除此以外的問題就是因爲指令池調度執行能力不足產生的反壓停頓,這一類被歸爲『後端』瓶頸。進一步例如『後端』瓶頸還可以根 據,停頓發生時,是否伴隨了ALU利用不充分,是否伴隨了Load/Store Buffer滿負荷等因素,繼續進行分解細化,形成了一套整體的分析方法。例如針對Intel,這一過程可以通過pmu-tools來被自動完成,對於指導精細化的程序瓶頸分析和優化往往有很大幫助。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"int array[1024];\nfor (size_t i = 0; i data[j] += j;  \n }\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這是一個非連續內存上進行累加計算的例子,隨外層迭代會跳躍式緩存行訪問,內層循環在連續緩存行上進行無依賴的計算和訪存操作。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/9b/9b5a2ea1975c147549aa2d3b39286849.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"可以看到,這一次的瓶頸到了穿透緩存後的內存訪存延遲上,但同時內存訪問的帶寬並沒有被充分利用。這是因爲指令窗口內雖然併發度不低,不過因爲緩存層次系統的特性,內層循環中的多個訪存指令,其實最終都是等待同一行被從內存加載到緩存。導致真正觸發的底層訪存壓力並不足以打滿傳輸帶寬,但是程序卻表現出了較大的停頓。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"for (size_t i = 0; i data[j] += j; \n line2->data[j] += j; \n ...   \n } \n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/6c/6c1f67d71bf30912ec3ca02ce8150be4.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過調整循環結構,在每一輪內層循環中一次性計算多行數據,可以在儘量在停頓到來的指令窗口內,讓更多行出於同時從內存系統進行傳輸。從統計指標上也可以看出,瓶頸重心開始從穿透訪存的延遲,逐步轉化向訪存帶寬,而實際的緩存傳輸部件Fill Buffer也開始出現了滿負荷運作的情況。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"3.4 總結一下單線程併發","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"現代CPU在遇到主頻瓶頸後,除了改爲增加核心數,也在單核心內逐步強化並行能力。如果說多進程多線程技術的普及,讓多核心的利用技術多少不那麼罕見和困難,那麼單核心內的並行加速技術,因爲更加黑盒(多級緩存加亂序執行),規範性不足(SIMD),相對普及度和利用率都會更差一些。雖然硬件更多的細節嚮應用層暴露讓程序的實現更加困難,不過困難和機會往往也是伴隨出現的,既然客觀發展上這種複雜性增加已經無可避免,那麼是否能善加利用也成了工程師進行性能優化時的一項利器。隨着體系結構的進一步複雜化,可見的未來一段時間裏,能否利用一些體系結構的原理和工具來進行優化,也會不可避免地成爲服務端工程師的一項重要技能。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"四、多線程併發中的臨界區保護","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"相比單線程中的併發設計,多線程併發應該是更爲工程師所熟悉的概念。如今,將計算劃分到多線程執行的應用技術本身已經相對成熟了,相信各個服務端工程師都有各自熟悉的隊列+線程池的小工具箱。在不做其他額外考慮的情況下,單純的大任務分段拆分,提交線程池並回收結果可能也僅僅是幾行代碼就可以解決的事情了。真正的難點,其實往往不在於『拆』,而在於『合』的部分,也就是任務拆分中無法避免掉的共享數據操作環節。如果說更高的分佈式層面,還可以儘可能地利用Share Nothing思想,在計算髮生之前,就先儘量通過任務劃分來做到儘可能充分地隔離資源。但是深入到具體的計算節點內部,如果再進行一些細粒度的拆分加速時,共享往往就難以徹底避免了。如何正確高效地處理這些無法避免的共享問題,就涉及到併發編程中的一項重要技術,臨界區保護。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"4.1 什麼是臨界區","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/69/699b7cb609e696ad47312365563933af.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"算法併發改造中,一般會產生兩類段落,一類是多個線程間無需交互就可以獨立執行的部分,這一部分隨着核心增多,可以順利地水平擴展。而另一類是需要通過操作共享的數據來完成執行,這部分操作爲了能夠正確執行,無法被多個核心同時執行,只能每個線程排隊通過。因此臨界區內的代碼,也就無法隨着核心增多來擴展,往往會成爲多線程程序的瓶頸點。也是因爲這個特性,臨界區的效率就變得至關重要,而如何保證各個線程安全地通過臨界區的方法,就是臨界區保護技術。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4.1.1 Mutual Exclusion","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e5/e5a3f19df7aa0e479879222b129e14e2.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最基本的臨界區保護方法,就是互斥技術。這是一種典型的悲觀鎖算法,也就是假設臨界區高概率存在競爭,因此需要先利用底層提供的機制進行仲裁,成功獲得所有權之後,才進入臨界區運行。這種互斥算法,有一個典型的全局阻塞問題,也就是上圖中,當臨界區內的線程發生阻塞,或被操作系統換出時,會出現一個全局執行空窗。這個執行空窗內,不僅自身無法繼續操作,未獲得鎖的線程也只能一同等待,造成了阻塞放大的現象。但是對於並行區,單一線程的阻塞只會影響自身,同樣位於在上圖中的第二次阻塞就是如此。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由於真實發生在臨界區內的阻塞往往又是不可預期的,例如發生了缺頁中斷,或者爲了申請一塊內存而要先進行一次比較複雜的內存整理。這就會讓阻塞擴散的問題更加嚴重,很可能改爲讓另一個線程先進入臨界區,反而可以更快順利完成,但是現在必須所有併發參與者,都一起等待臨界區持有者來完成一些並沒有那麼『關鍵』的操作。因爲存在全局阻塞的可能性,採用互斥技術進行臨界區保護的算法有着最低的阻塞容忍能力,一般在『非阻塞算法』領域作爲典型的反面教材存在。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4.1.2 Lock Free","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/83/83a949ed9a1f63601704a1842ea4e610.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"針對互斥技術中的阻塞問題,一個改良型的臨界區保護算法是無鎖技術。雖然叫做無鎖,不過主要是取自非阻塞算法等級中的一種分類術語,本質上是一種樂觀鎖算法。也就是首先假設臨界區不存在競爭,因此直接開始臨界區的執行,但是通過良好的設計,讓這段預先的執行是無衝突可回滾的。但是最終設計一個需要同步的提交操作,一般基於原子變量CAS(Compare And Swap),或者版本校驗等機制完成。在提交階段如果發生衝突,那麼被仲裁爲失敗的各方需要對臨界區預執行進行回滾,並重新發起一輪嘗試。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"無鎖技術和互斥技術最大的區別是,臨界區核心的執行段落是可以類似並行段落一樣獨立進行,不過又不同於真正的並行段落,同時執行的臨界區中,只有一個是真正有效的,其餘最終將被仲裁爲無效並回滾。但是引入了冗餘的執行操作後,當臨界區內再次發生阻塞時,不會像互斥算法那樣在參與線程之間進行傳播,轉而讓一個次優的線程成功提交。雖然從每個併發算法參與線程的角度,存在沒有執行『實質有效』計算的段落,但是這種浪費計算的段落,一定對應着另一個參與線程執行了『有效』的計算。所以從整個算法層面,能夠保證不會全局停頓,總是有一些有效的計算在運行。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4.1.3 Wait-Free","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e2/e2ebfae94b61f7a0ac0ab2c597be20c0.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"無鎖技術主要解決了臨界區內的阻塞傳播問題,但是本質上,多個線程依然是排隊順序經過臨界區。形象來說,有些類似交通中的三叉路口匯合, 無論是互斥還是無鎖,最終都是把兩條車道匯聚成了一條單車道,區別只是指揮是否高明能保證沒有斷流出現。可是無論如何,臨界區內全局吞吐降低成串行這點是共同的缺陷。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"而Wait Free級別和無鎖的主要區別也就體現在這個吞吐的問題上,在無全局停頓的基礎上,Wait Free進一步保障了任意算法參與線程,都應該在有限的步驟內完成。這就和無鎖技術產生了區別,不只是整體算法時時刻刻存在有效計算,每個線程視角依然是需要持續進行有效計算。這就要求了多線程在臨界區內不能被細粒度地串行起來,而必須是同時都能進行有效計算。回到上面三叉路口匯聚的例子,就以爲着在Wait Free級別下,最終匯聚的道路依舊需要是多車道的,以保證可以同時都能夠有進展。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/37/37a6679ecc77f8fe32f4abf7187f457d.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"雖然理論角度存在不少有Wait Free級別的算法,不過大多爲概念探索,並不具備工業使用價值。主要是由於Wait Free限制了同時有進展,但是並沒有描述這個進展有多快。因此進一步又提出了細分子類,以比較有實際意義的Wait-Free Population Oblivious級別來說,額外限制了每個參與線程必須要在預先可給出的明確執行週期內完成,且這個週期不能和與參與線程數相關。這一點明確拒絕了一些類似線程間協作的方案(這些方案往往引起較大的緩存競爭),以及一些需要很長很長的有限步來完成的設計。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/a5/a523d9a6b359204e411816dab7dd66b9.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上圖實例了一個典型的Wait Free Population Oblivious思路。進行臨界區操作前,通過一個協同操作爲參與線程分配獨立的ticket,之後每個參與線程可以通過獲取到的ticket作爲標識,操作一塊獨立的互不干擾的工作區,並在其中完成操作。工業可用的Wait Free算法一般較難設計,例如ticket機制要求在協調動作中原子完成工作區分配,而很多數據結構是不容易做到這樣的拆分的。時至今日各種數據結構上工業可用的Wait    Free算法依舊是一項持續探索中的領域。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4.2 無鎖不是萬能的","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"從非阻塞編程的角度看,上面的幾類臨界區處理方案優劣有着顯著的偏序關係,即Wait Free > Lock Free > Mutual Exclusion。這主要是從阻塞適應性角度進行的衡量,原理上並不能直接對應到性能緯度。但是依然很容易給工程師造成一個普適印象,也就是『鎖是很邪惡的東西,不使用鎖來實現算法可以顯著提高性能』,再結合廣爲流傳的鎖操作自身開銷很重的認知,很多工程師在實踐中會有對鎖敬而遠之的傾向。那麼,這個指導思想是否是完全正確的?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"讓我們先來一組實驗:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"// 在一個cache line上進行指定步長的斐波那契計算來模擬臨界區計算負載\nuint64_t calc(uint64_t* sequence, size_t size) { \n size_t i; \n for (i = 0; i < size; ++i) { \n sequence[(i + 1) & 7] += sequence[i & 7]; \n } \n return sequence[i & 7];\n}\n{ // Mutual Exclusion \n ::std::lock_guard<::std::mutex> lock(mutex); \n sum += calc(sequence, workload);\n}\n{ // Lock Free / Atomic CAS \n auto current = atomic_sum.load(::std::memory_order_relaxed); \n auto next = current; \n do { \n next = current + calc(sequence, workload); \n } while (!atomic_sum.compare_exchange_weak( \n current, next, ::std::memory_order_relaxed));\n}\n{ // Wait Free / Atomic Modify \n atomic_sum.fetch_add(calc(sequence, workload), ::std::memory_order_relaxed);\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這裏採用多線程累加作爲案例,分別採用上鎖後累加,累加後CAS提交,以及累加後FAA(Fetch And Add)提交三種方法對全局累加結果做臨界區保護。針對不同的併發數量,以及不同的臨界區負載,可以形成如下的三維曲線圖。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其中Latency項除以臨界區規模進行了歸一,便於形象展示臨界區負載變化下的臨界區保護開銷趨勢,因此跨不同負載等級下不具備橫向可比性。Cycles項表示多線程協同完成總量爲同樣次數的累加,用到的CPU週期總和,總體隨臨界區負載變化有少量天然傾斜。100/1600兩個截面圖將3中算法疊加在一起展示,便於直觀對比。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/6d/6d7bc0403c80d222cfd0ec6ad0e71782.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/7a/7a70a93b53d31b3ead9599cdf45f63b1.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/3a/3a3aca8debe35cd35f4b64c96657a668.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/75/750cf75c9b3d26202077ca7bcfba646f.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/fe/feeafd21700ce04bfcaa29e7c7533531.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"從上面的數據中可以分析出這樣一些信息:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1、基於FAA的Wait Free模式各方面都顯著勝過其他方法;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2、無鎖算法相比互斥算法在平均吞吐上有一定優勢,但是並沒有達到數量級水平;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3、無鎖算法隨競爭提升(臨界區大小增大,或者線程增多),cpu消耗顯著上升;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基於這些信息來分析,會發現一個和之前提到的『鎖性能』的常規認知相悖的點。性能的分水嶺並沒有出現在基於鎖的互斥算法和無鎖算法中間, 而是出現在同爲『未使用鎖』的Lock Free和Wait Free算法中間。而且從CPU消耗角度來看,對臨界區比較複雜,競爭強度高的場景,甚至Lock Free因爲『無效預測執行』過多反而引起了過多的消耗。這表明了鎖操作本身的開銷雖然稍重於原子操作,但其實也並非洪水猛獸,而真正影響性能的,是臨界區被迫串行執行所帶來的並行能力折損。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因此當我們遇到臨界區保護的問題時,可以先思考一下,是否可以採用Wait Free的方法來完成保護動作,如果可以的話,在性能上能夠接近完全消除了臨界區的效果。而在多數情況下,往往還是要採用互斥或Lock Free來進行臨界區的保護。此時臨界區的串行不可避免,所以充分縮減臨界區的佔比是共性的第一要務,而是否進一步採用Lock Free技術來減少臨界區保護開銷,討論的前提也是臨界區已經顯著很短,不會引起過多的無效預 測。除此以外,由於Lock Free算法一般對臨界區需要設計成兩階段提交,以便支持回滾撤銷,因此往往需要比對應的互斥保護算法更復雜,局部性也可能更差(例如某些場景必須引入鏈表來替換數組)。綜合來看,一般如果無法做到Wait Free,那麼無需對Lock Free過度執着,充分優化臨界區的互斥方法往往也足以提供和Lock Free相當的性能表現了。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"4.3 併發計數器優化案例","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"從上文針對臨界區保護的多種方法所做的實驗,還可以發現一個現象。隨着臨界區逐漸減小,保護措施開銷隨線程數量增加而提升的趨勢都預發顯著,即便是設計上效率和參與線程數本應無關的Wait Free級別也是一樣。這對於臨界區極小的併發計數器場景,依舊會是一個顯著的問題。那麼我們就先從鎖和原子操作的實現角度,看看這些損耗是如何導致的。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/a2/a21e36408314cdb7a3872ec016231684.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先給出一個典型的鎖實現,左側是鎖的fast path,也就是如果在外層的原子變量操作中未發現競爭,那麼其實上鎖和解鎖其實就只經歷了一組原子變量操作。當fast  path檢測到可能出現衝突時,纔會進入內核,嘗試進行排隊等待。fast  path的存在大幅優化了低衝突場景下的鎖表現,而且現代操作系統內核爲了優化鎖的內存開銷,都提供了『Wait On Address』的功能,也就是爲了支持這套排隊機制,每個鎖常態只需要一個整數的存儲開銷即可,只有在嘗試等待時,纔會創建和佔用額外的輔助結構。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因此實際設計中,鎖可以創建很多,甚至非常多,只要能夠達到足夠細粒度拆解衝突的效果。這其中最典型的就是brpc中計數器框架bvar的設計。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e9/e9f93c4dbfbd8de0639239afa3df4353.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這是bvar中基礎統計框架的設計,局部計數和全局匯聚時都通過每個tls附加的鎖來進行臨界區保護。因爲採集週期很長,衝突可以忽略不記,因此雖然默認使用了大量的鎖(統計量 * 線程數),但是並沒有很大的內存消耗,而且運行開銷其實很低,能夠用來支持任意的匯聚操作。這個例子也能進一步體現,鎖本身的消耗其實並不顯著,競爭帶來的軟件或硬件上的串行化纔是開銷的核心。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/af/af644517c6cdce776f4f654149748ee5.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"不過即使競爭很低,鎖也還是會由一組原子操作實現,而當我們自己查看原子操作時,實際是由cache鎖操作保護的原子指令構成,而且這個指令會在亂序執行中起到內存屏障的效果降低訪存重疊的可能性。因此針對非常常用的簡單計數器,在百度內部我們進行了進一步去除局部鎖的改造,來試圖進一步降低統計開銷。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/a4/a4fb90ea8765b0e3b77c5f06ce30f804.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"例如對於需要同時記錄次數和總和的IntRecorder,因爲需要兩個64位加法,曾經只能依賴鎖來保證原子更新。但隨着新x86機型的不斷普及,在比較新的X86和ARM服務端機型上已經可以做到128bit的原子load/store,因此可以利用相應的高位寬指令和正確對齊來實現鎖的去除。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/cc/cc1a1b6bd963290d5a6b61c7c1b92712.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另一個例子是Percentile分位值統計,由於抽樣數據是一個多元素容器,而且分位值統計需要週期清空重算,因此常規也是採用了互斥保護的方法。不過如果引入版本號機制,將清空操作轉交給計數線程自己完成,將sample區域的讀寫完全分離。在這個基礎上,就可以比較簡單的做到線程安全,而且也不用引入原子修改。嚴格意義上,異步清空存在邊界樣本收集丟失的可能性,不過因爲核心的蓄水池抽樣算髮本身也具有隨機性,在監控指標統計領域已經擁有足夠精度。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/21/21fb7579b29c6eeb9af25c965316682c.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"除了運行時操作,線程局部變量的組織方式原先採用鎖保護的鏈表進行管理,採用分段數據結合線程編號的方法替換後,做到空間連續化。最終整體進一步改善了計數器的性能。","attrs":{}}]},{"type":"embedcomp","attrs":{"type":"table","data":{"content":"

local count
get value
bvar::IntRecorder
16
7
babylon::IntRecorder
1377
14
bvar::Percentile
938
28
babylon::Percentile
66
14
"}}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"4.4 併發隊列優化案例","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另一個在多線程編程中經常出現的數據結構就是隊列,爲了保證可以安全地處理併發的入隊和出隊操作,最基礎的算法是整個隊列用鎖來保護起來。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/f2/f21cd4d99fb664c524dcd11ac07d752c.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這個方法的缺點是顯而易見的,因爲隊列往往作爲多線程驅動的數據中樞位置,大量的競爭下,隊列操作被串行很容易影響整體計算的並行度。因此一個自然的改進點是,將隊列頭尾分開保護,先將生產者和消費者解耦開,只追加必要的同步操作來保證不會過度入隊和出隊。這也是Jave中LinkedBlockingQueue所使用的做法。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/37/377222e55be09a8ae193f17d34d70512.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在頭尾分離之後,進一步的優化進入了兩個方向。首先是因爲單節點的操作具備了Lock Free化的可能,因此產生了對應的Michael & Scott無鎖隊列算法。業界的典型實現有Java的ConcurrentLinkedQueue,以及boost中的boost::lockfree::queue。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/a3/a39247e96942e6480ceca91157ed5b13.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"而另一個方向是隊列分片,即將隊列拆解成多個子隊列,通過領取token的方式選擇子隊列,而子隊列內部使用傳統隊列算法,例如tbb:: concurrent_queue就是分片隊列的典型實現。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/1f/1f640d240642419bb09279e8e68a0256.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"embedcomp","attrs":{"type":"table","data":{"content":"

latency
boost::lockfree::queue
35075
tbb::concurrent_queue
4614
"}}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對兩種方式進行對比,可以發現,在強競爭下,分片隊列的效果其實顯著勝過單純的無鎖處理,這也是前文對於無鎖技術真實效果分析的一個體現。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"除了這類通用隊列,還有一個強化競爭發佈,串行消費的隊列也就是bthread::ExecutionQueue,它在是brpc中主要用於解決多線程競爭fd寫入的問題。利用一些有趣的技巧,對多線程生產側做到了Wait Free級別。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/5e/5e7656009d272358cbacf51292b8fb4e.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"整個隊列只持有隊尾,而無隊頭。在生產側,第一步直接將新節點和當前尾指針進行原子交換,之後再將之前的隊尾銜接到新節點之後。因爲無論是否存在競爭,入隊操作都能通過固定的兩步完成,因此入隊算法是Wait Free的。不過這給消費側帶來的麻煩,消費同樣從一個原子交換開始,將隊尾置換成nullptr,之後的消費動作就是遍歷取到的單鏈表。但是因爲生產操作分了兩部完成,此時可能發現部分節點尚處於『斷鏈』狀態,由於消費者無從知曉後續節點信息,只能輪詢等待生產者最終完成第二步。所以理論上,生產/消費算法其實甚至不是Lock Free的,因爲如果生產者在兩階段中間被換出,那麼消費者會被這個阻塞傳播影響,整個消費也只能先阻塞住。但是在排隊寫入fd的場景下,專項優化生產併發是合理,也因此可以獲得更好的執行效率。","attrs":{}}]},{"type":"embedcomp","attrs":{"type":"table","data":{"content":"

latency
tbb::concurrent_queue
4614
bthread::ExecutionQueue
2390
"}}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"不過爲了能利用原子操作完成算法,bthread::ExecutionQueue引入了鏈表作爲數據組織方式,而鏈表天然存在訪存跳躍的問題。那麼是否可以用數組來同樣實現Wait Free的生產甚至消費併發呢?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"這就是babylon::ConcurrentBoundedQueue所希望解決的問題了。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"不過介紹這個隊列併發原理之前,先插入一個勘誤信息。其實這個隊列在《內存篇》最後也簡單提到過,不過當時粗略的評測顯示了acquire- release等級下,即使不做cache line隔離性能也可以保障。文章發表後收到業界同好反饋,討論發現當時的測試用例命中了Intel Write Combining 優化技術,即當僅存在唯一一個處於等待加載的緩存行時,只寫動作可以無阻塞提前完成,等緩存行真實加載完畢後,再統一提交生效。但是由於內存序問題,一旦觸發了第二個待加載的緩存行後,對於第一個緩存行的Write Combine就無法繼續生效,只能等待第二個緩存行的寫完成後,才能繼續提交。原理上,Write Combine技術確實緩解了只寫場景下的False Sharing,但是隻能等待一個緩存行的限制在真實場景下想要針對性利用起來限制相當大。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"例如在隊列這個典型場景下,往往會同時兩路操作數據和完成標記,很可能同時處於穿透加載中,此時是無法應用Write Combine技術的。此外,能夠在緩存行加載週期內,有如此充分的同行寫入,可能也只有並無真實意義的評測程序才能做到。","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"所以從結論上講,通常意義上的多線程cache line隔離還是很有必要的。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/df/dfb7b918223150055c53fe4130a085eb.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"回到babylon::ConcurrentBoundedQueue的設計思路上,其實是將子隊列拆分做到極致,將同步量粒度降低到每個數據槽位上。每個入隊和出隊  請求,首先利用原子自增領取一個遞增的序號,之後利用循環數組的存儲方式,就可以映射到一個具體的數據槽位上。根據操作是入隊還是出隊, 在循環數組上發生了多少次摺疊,就可以在一個數據槽位上形成一個連續的版本序列。例如1號入隊和5號出隊都對應了1號數據槽位,而1號入隊預期的版本轉移是0到1,而5號出隊的版本轉移是2到3。這樣針對同一個槽位的入隊和出隊也可以形成一個連續的版本變更序列,一個領到序號的具體操作,只需要明確檢測版本即可確認自己當前是否可以開始操作,並通過自己的版本變更和後續的操作進行同步。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過同步量下放到每個元素的方式,入隊和出隊操作在可以除了最開始的序號領取存在原子操作級別的同步,後續都可以無干擾並行開展。而更連續的數據組織,也解決了鏈表存儲的訪存跳躍問題。生產消費雙端可併發的特點,也提供了更強的泛用性,實際在MPMC(Multiple Producer Mult iple Consumer)和MPSC(Multiple Producer Single Consumer)場景下都有不錯的性能表現,在具備一定小批量處理的場景下尤其顯著。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/35/3554fc6fcc2145de9d9097e1feea9af5.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"招聘信息:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"歡迎出色的C++ 工程師加入百度,與大神一起成長。關注同名公衆號百度Geek說,輸入內推即可,我們期待你的加入!","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"推薦閱讀:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://mp.weixin.qq.com/s?__biz=Mzg5MjU0NTI5OQ==&mid=2247494399&idx=1&sn=0516ad01baf50442933865d33e88e1af&chksm=c03eda83f749539562f223c320c86ce0d8092e4c1d793994fd602fae7cc986c9129f2b77c0ab&token=1987775079&lang=zh_CN&scene=21#wechat_redirect","title":null,"type":null},"content":[{"type":"text","text":"|","attrs":{}}]},{"type":"link","attrs":{"href":"http://mp.weixin.qq.com/s?__biz=Mzg5MjU0NTI5OQ==&mid=2247489076&idx=1&sn=748bf716d94d5ed2739ea8a9385cd4a6&chksm=c03d2648f74aaf5e11298cf450c3453a273eb6d2161bc90e411b6d62fa0c1b96a45e411af805&scene=21#wechat_redirect","title":null,"type":null},"content":[{"type":"text","text":"百度C++工程師的那些極限優化(內存篇)","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://mp.weixin.qq.com/s?__biz=Mzg5MjU0NTI5OQ==&mid=2247494399&idx=1&sn=0516ad01baf50442933865d33e88e1af&chksm=c03eda83f749539562f223c320c86ce0d8092e4c1d793994fd602fae7cc986c9129f2b77c0ab&token=1987775079&lang=zh_CN&scene=21#wechat_redirect","title":null,"type":null},"content":[{"type":"text","text":"|百度大規模Service Mesh落地實踐","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://mp.weixin.qq.com/s?__biz=Mzg5MjU0NTI5OQ==&mid=2247494399&idx=1&sn=0516ad01baf50442933865d33e88e1af&chksm=c03eda83f749539562f223c320c86ce0d8092e4c1d793994fd602fae7cc986c9129f2b77c0ab&token=1987775079&lang=zh_CN&scene=21#wechat_redirect","title":null,"type":null},"content":[{"type":"text","text":"|","attrs":{}}]},{"type":"link","attrs":{"href":"http://mp.weixin.qq.com/s?__biz=Mzg5MjU0NTI5OQ==&mid=2247493116&idx=1&sn=90925b509f4d8bfedc7066f2317e3d9c&chksm=c03ed580f7495c9621068194b799dd7fcc9ebff535a6fa04aacf593eae549c8d500b06df57d1&scene=21#wechat_redirect","title":null,"type":null},"content":[{"type":"text","text":"一種基於實時分位數計算的系統及方法","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"---------- END ----------","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"百度Geek說","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"百度官方技術公衆號上線啦!","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"技術乾貨 · 行業資訊 · 線上沙龍 · 行業大會","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"招聘信息 · 內推信息 · 技術書籍 · 百度周邊","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"歡迎各位同學關注","attrs":{}}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章