High-Performance Server Architecture

Introduction

The purpose of this document is to share some ideas that I've developed over the years about how to develop a certain kind of application for which the term "server" is only a weak approximation. More accurately, I'll be writing about a broad class of programs that are designed to handle very large numbers of discrete messages or requests per second. Network servers most commonly fit this definition, but not all programs that do are really servers in any sense of the word. For the sake of simplicity, though, and because "High-Performance Request-Handling Programs" is a really lousy title, we'll just say "server" and be done with it.

I will not be writing about "mildly parallel" applications, even though multitasking within a single program is now commonplace. The browser you're using to read this probably does some things in parallel, but such low levels of parallelism really don't introduce many interesting challenges. The interesting challenges occur when the request-handling infrastructure itself is the limiting factor on overall performance, so that improving the infrastructure actually improves performance. That's not often the case for a browser running on a gigahertz processor with a gigabyte of memory doing six simultaneous downloads over a DSL line. The focus here is not on applications that sip through a straw but on those that drink from a firehose, on the very edge of hardware capabilities where how you do it really does matter.

Some people will inevitably take issue with some of my comments and suggestions, or think they have an even better way. Fine. I'm not trying to be the Voice of God here; these are just methods that I've found to work for me, not only in terms of their effects on performance but also in terms of their effects on the difficulty of debugging or extending code later. Your mileage may vary. If something else works better for you that's great, but be warned that almost everything I suggest here exists as an alternative to something else that I tried once only to be disgusted or horrified by the results. Your pet idea might very well feature prominently in one of these stories, and innocent readers might be bored to death if you encourage me to start telling them. You wouldn't want to hurt them, would you?

The rest of this article is going to be centered around what I'll call the Four Horsemen of Poor Performance:

  1. Data copies
  2. Context switches
  3. Memory allocation
  4. Lock contention

There will also be a catch-all section at the end, but these are the biggest performance-killers. If you can handle most requests without copying data, without a context switch, without going through the memory allocator and without contending for locks, you'll have a server that performs well even if it gets some of the minor parts wrong.

Data Copies

This could be a very short section, for one very simple reason: most people have learned this lesson already. Everybody knows data copies are bad; it's obvious, right? Well, actually, it probably only seems obvious because you learned it very early in your computing career, and that only happened because somebody started putting out the word decades ago. I know that's true for me, but I digress. Nowadays it's covered in every school curriculum and in every informal how-to. Even the marketing types have figured out that "zero copy" is a good buzzword.

Despite the after-the-fact obviousness of copies being bad, though, there still seem to be nuances that people miss. The most important of these is that data copies are often hidden and disguised. Do you really know whether any code you call in drivers or libraries does data copies? It's probably more than you think. Guess what "Programmed I/O" on a PC refers to. An example of a copy that's disguised rather than hidden is a hash function, which has all the memory-access cost of a copy and also involves more computation. Once it's pointed out that hashing is effectively "copying plus" it seems obvious that it should be avoided, but I know at least one group of brilliant people who had to figure it out the hard way. If you really want to get rid of data copies, either because they really are hurting performance or because you want to put "zero-copy operation" on your hacker-conference slides, you'll need to track down a lot of things that really are data copies but don't advertise themselves as such.

The tried and true method for avoiding data copies is to use indirection, and pass buffer descriptors (or chains of buffer descriptors) around instead of mere buffer pointers. Each descriptor typically consists of the following:

  • A pointer and length for the whole buffer.
  • A pointer and length, or offset and length, for the part of the buffer that's actually filled.
  • Forward and back pointers to other buffer descriptors in a list.
  • A reference count.

Now, instead of copying a piece of data to make sure it stays in memory, code can simply increment a reference count on the appropriate buffer descriptor. This can work extremely well under some conditions, including the way that a typical network protocol stack operates, but it can also become a really big headache. Generally speaking, it's easy to add buffers at the beginning or end of a chain, to add references to whole buffers, and to deallocate a whole chain at once. Adding in the middle, deallocating piece by piece, or referring to partial buffers will each make life increasingly difficult. Trying to split or combine buffers will simply drive you insane.
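
To make the descriptor idea concrete, here is a minimal sketch in C. The structure layout follows the four bullets above; the names (buf_desc, buf_ref, buf_unref) and the non-atomic reference count are assumptions of the example, not anybody's production code.

    #include <stdlib.h>

    /* One descriptor per buffer; the data itself is shared by reference,
     * never copied. */
    struct buf_desc {
        char            *base;    /* start of the whole buffer            */
        size_t           cap;     /* total length of the buffer           */
        size_t           off;     /* offset of the part actually filled   */
        size_t           len;     /* length of the part actually filled   */
        struct buf_desc *next;    /* forward link to other descriptors    */
        struct buf_desc *prev;    /* backward link                        */
        int              refcnt;  /* how many holders reference this data */
    };

    /* "Copy" a buffer by taking another reference to it. */
    static struct buf_desc *buf_ref(struct buf_desc *d)
    {
        d->refcnt++;              /* a real stack would do this atomically */
        return d;
    }

    /* Drop a reference; free the data only when the last holder lets go. */
    static void buf_unref(struct buf_desc *d)
    {
        if (--d->refcnt == 0) {
            free(d->base);
            free(d);
        }
    }

A protocol layer that wants to hang on to a payload calls buf_ref() where it would otherwise have reached for memcpy().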

I don't actually recommend using this approach for everything, though. Why not? Because it gets to be a huge pain when you have to walk through descriptor chains every time you want to look at a header field. There really are worse things than data copies. I find that the best thing to do is to identify the large objects in a program, such as data blocks, make sure those get allocated separately as described above so that they don't need to be copied, and not sweat too much about the other stuff.

This brings me to my last point about data copies: don't go overboard avoiding them. I've seen way too much code that avoids data copies by doing something even worse, like forcing a context switch or breaking up a large I/O request. Data copies are expensive, and when you're looking for places to avoid redundant operations they're one of the first things you should look at, but there is a point of diminishing returns. Combing through code and then making it twice as complicated just to get rid of that last few data copies is usually a waste of time that could be better spent in other ways.

Context Switches

Whereas everyone thinks it's obvious that data copies are bad, I'm often surprised by how many people totally ignore the effect of context switches on performance. In my experience, context switches are actually behind more total "meltdowns" at high load than data copies; the system starts spending more time going from one thread to another than it actually spends within any thread doing useful work. The amazing thing is that, at one level, it's totally obvious what causes excessive context switching. The #1 cause of context switches is having more active threads than you have processors. As the ratio of active threads to processors increases, the number of context switches also increases - linearly if you're lucky, but often exponentially. This very simple fact explains why multi-threaded designs that have one thread per connection scale very poorly. The only realistic alternative for a scalable system is to limit the number of active threads so it's (usually) less than or equal to the number of processors. One popular variant of this approach is to use only one thread, ever; while such an approach does avoid context thrashing, and avoids the need for locking as well, it is also incapable of achieving more than one processor's worth of total throughput and thus remains beneath contempt unless the program will be non-CPU-bound (usually network-I/O-bound) anyway.

The first thing that a "thread-frugal" program has to do is figure out how it's going to make one thread handle multiple connections at once. This usually implies a front end that uses select/poll, asynchronous I/O, signals or completion ports, with an event-driven structure behind that. Many "religious wars" have been fought, and continue to be fought, over which of the various front-end APIs is best. Dan Kegel's C10K paper is a good resource in this area. Personally, I think all flavors of select/poll and signals are ugly hacks, and therefore favor either AIO or completion ports, but it actually doesn't matter that much. They all - except maybe select() - work reasonably well, and don't really do much to address the matter of what happens past the very outermost layer of your program's front end.
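
For readers who have never written one of these front ends, the following is roughly what the simplest flavor looks like: a single thread watching many connections with poll(). It is only a skeleton (no accept handling, no writes, no error recovery), and as noted above the choice of front-end API matters less than what sits behind it.

    #include <poll.h>
    #include <unistd.h>

    /* Skeleton of a single thread watching many connections with poll().
     * A real front end also handles accept(), writes, and errors.        */
    static void event_loop(struct pollfd *fds, int nfds)
    {
        char buf[4096];

        for (;;) {
            if (poll(fds, nfds, -1) < 0)          /* wait for any ready fd */
                break;
            for (int i = 0; i < nfds; i++) {
                if (fds[i].revents & POLLIN) {
                    ssize_t n = read(fds[i].fd, buf, sizeof(buf));
                    if (n <= 0)
                        fds[i].fd = -1;           /* poll() skips negative fds */
                    /* else: feed the bytes to the event-driven machinery */
                }
            }
        }
    }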

The simplest conceptual model of a multi-threaded event-driven server has a queue at its center; requests are read by one or more "listener" threads and put on queues, from which one or more "worker" threads will remove and process them. Conceptually, this is a good model, but all too often people actually implement their code this way. Why is this wrong? Because the #2 cause of context switches is transferring work from one thread to another. Some people even compound the error by requiring that the response to a request be sent by the original thread - guaranteeing not one but two context switches per request. It's very important to use a "symmetric" approach in which a given thread can go from being a listener to a worker to a listener again without ever changing context. Whether this involves partitioning connections between threads or having all threads take turns being listener for the entire set of connections seems to matter a lot less.

Usually, it's not possible to know how many threads will be active even one instant into the future. After all, requests can come in on any connection at any moment, or "background" threads dedicated to various maintenance tasks could pick that moment to wake up. If you don't know how many threads are active, how can you limit how many are active? In my experience, one of the most effective approaches is also one of the simplest: use an old-fashioned counting semaphore which each thread must hold whenever it's doing "real work". If the thread limit has already been reached then each listen-mode thread might incur one extra context switch as it wakes up and then blocks on the semaphore, but once all listen-mode threads have blocked in this way they won't continue contending for resources until one of the existing threads "retires", so the system effect is negligible. More importantly, this method handles maintenance threads - which sleep most of the time and therefore don't count against the active thread count - more gracefully than most alternatives.
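
A minimal sketch of that semaphore scheme, using POSIX counting semaphores. The limit of four "real work" slots and the helper functions are placeholders I've assumed for illustration.

    #include <pthread.h>
    #include <semaphore.h>

    #define MAX_ACTIVE 4                /* usually: the number of processors */

    static sem_t active_slots;          /* counts threads doing "real work"  */

    /* Placeholders for whatever your server really does. */
    static void *wait_for_request(void) { return NULL; }
    static void  process_request(void *req) { (void)req; }

    static void *listener_thread(void *arg)
    {
        (void)arg;
        for (;;) {
            void *req = wait_for_request();   /* listen mode: no slot held    */
            sem_wait(&active_slots);          /* may block if too many active */
            process_request(req);             /* "real work" with a slot held */
            sem_post(&active_slots);          /* retire: let someone else in  */
        }
        return NULL;
    }

    int main(void)
    {
        sem_init(&active_slots, 0, MAX_ACTIVE);
        /* ...spawn listener_thread on a handful of pthreads, then join... */
        (void)listener_thread;
        return 0;
    }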

Once the processing of requests has been broken up into two stages (listener and worker) with multiple threads to service the stages, it's natural to break up the processing even further into more than two stages. In its simplest form, processing a request thus becomes a matter of invoking stages successively in one direction, and then in the other (for replies). However, things can get more complicated; a stage might represent a "fork" between two processing paths which involve different stages, or it might generate a reply (e.g. a cached value) itself without invoking further stages. Therefore, each stage needs to be able to specify "what should happen next" for a request. There are three possibilities, represented by return values from the stage's dispatch function:

  • The request needs to be passed on to another stage (an ID or pointer in the return value).
  • The request has been completed (a special "request done" return value).
  • The request was blocked (a special "request blocked" return value). This is equivalent to the previous case, except that the request is not freed and will be continued later from another thread.

Note that, in this model, queuing of requests is done within stages, not between stages. This avoids the common silliness of constantly putting a request on a successor stage's queue, then immediately invoking that successor stage and dequeuing the request again; I call that lots of queue activity - and locking - for nothing.
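
One plausible way to express those three return values, and the "queuing within stages" point, in C. The types and the driver loop are a hypothetical sketch, not code from any particular server.

    /* What a stage's dispatch function can report back to its caller. */
    enum stage_status {
        STAGE_NEXT,     /* pass the request to the stage named in *next  */
        STAGE_DONE,     /* request fully handled                         */
        STAGE_BLOCKED   /* parked; another thread will continue it later */
    };

    struct request;     /* opaque for the purposes of this sketch */

    /* Each stage is just a function; it may name a successor via *next. */
    typedef enum stage_status (*stage_fn)(struct request *req, int *next);

    /* Drive a request through its stages inside one thread: no queue hops
     * and no hand-off to another thread between stages.                  */
    static void run_request(struct request *req, stage_fn *stages, int first)
    {
        int cur = first;
        for (;;) {
            int next = -1;
            switch (stages[cur](req, &next)) {
            case STAGE_NEXT:    cur = next; break;  /* keep going, same thread */
            case STAGE_DONE:    return;             /* reply sent, all done    */
            case STAGE_BLOCKED: return;             /* resumed later elsewhere */
            }
        }
    }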

If this idea of separating a complex task into multiple smaller communicating parts seems familiar, that's because it's actually very old. My approach has its roots in the Communicating Sequential Processes concept elucidated by C.A.R. Hoare in 1978, based in turn on ideas from Per Brinch Hansen and Matthew Conway going back to 1963 - before I was born! However, when Hoare coined the term CSP he meant "process" in the abstract mathematical sense, and a CSP process need bear no relation to the operating-system entities of the same name. In my opinion, the common approach of implementing CSP via thread-like coroutines within a single OS thread gives the user all of the headaches of concurrency with none of the scalability.

A contemporary example of the staged-execution idea evolved in a saner direction is Matt Welsh's SEDA. In fact, SEDA is such a good example of "server architecture done right" that it's worth commenting on some of its specific characteristics (especially where those differ from what I've outlined above).

  1. SEDA's "batching" tends to emphasize processing multiple requests through a stage at once, while my approach tends to emphasize processing a single request through multiple stages at once.
  2. SEDA's one significant flaw, in my opinion, is that it allocates a separate thread pool to each stage with only "background" reallocation of threads between stages in response to load. As a result, the #1 and #2 causes of context switches noted above are still very much present.
  3. In the context of an academic research project, implementing SEDA in Java might make sense. In the real world, though, I think the choice can be characterized as unfortunate.

Memory Allocation

Allocating and freeing memory is one of the most common operations in many applications. Accordingly, many clever tricks have been developed to make general-purpose memory allocators more efficient. However, no amount of cleverness can make up for the fact that the very generality of such allocators inevitably makes them far less efficient than the alternatives in many cases. I therefore have three suggestions for how to avoid the system memory allocator altogether.

Suggestion #1 is simple preallocation. We all know that static allocation is bad when it imposes artificial limits on program functionality, but there are many other forms of preallocation that can be quite beneficial. Usually the reason comes down to the fact that one trip through the system memory allocator is better than several, even when some memory is "wasted" in the process. Thus, if it's possible to assert that no more than N items could ever be in use at once, preallocation at program startup might be a valid choice. Even when that's not the case, preallocating everything that a request handler might need right at the beginning might be preferable to allocating each piece as it's needed; aside from the possibility of allocating multiple items contiguously in one trip through the system allocator, this often greatly simplifies error-recovery code. If memory is very tight then preallocation might not be an option, but in all but the most extreme circumstances it generally turns out to be a net win.
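
As a small illustration of Suggestion #1, here is a hypothetical per-request context carved out of a single allocation at the start of request handling instead of piecemeal; the structure and sizes are invented for the example.

    #include <stdlib.h>

    /* Everything a hypothetical request handler might need, allocated in
     * one trip through the system allocator instead of several.          */
    struct request_ctx {
        char hdr[8192];        /* header bytes being parsed         */
        char scratch[4096];    /* temporary working space           */
        /* ...whatever other per-request state you need...          */
    };

    struct request_ctx *request_ctx_new(void)
    {
        return calloc(1, sizeof(struct request_ctx));  /* one NULL check */
    }

    void request_ctx_free(struct request_ctx *ctx)
    {
        free(ctx);                                     /* one free       */
    }

Error recovery collapses to a single NULL check instead of one per piece.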

Suggestion #2 is to use lookaside lists for objects that are allocated and freed frequently. The basic idea is to put recently-freed objects onto a list instead of actually freeing them, in the hope that if they're needed again soon they need merely be taken off the list instead of being allocated from system memory. As an additional benefit, transitions to/from a lookaside list can often be implemented to skip complex object initialization/finalization.

It's generally undesirable to have lookaside lists grow without bound, never actually freeing anything even when your program is idle. Therefore, it's usually necessary to have some sort of periodic "sweeper" task to free inactive objects, but it would also be undesirable if the sweeper introduced undue locking complexity or contention. A good compromise is therefore a system in which a lookaside list actually consists of separately locked "old" and "new" lists. Allocation is done preferentially from the new list, then from the old list, and from the system only as a last resort; objects are always freed onto the new list. The sweeper thread operates as follows:

  1. Lock both lists.
  2. Save the head for the old list.
  3. Make the (previously) new list into the old list by assigning list heads.
  4. Unlock.
  5. Free everything on the saved old list at leisure.

Objects in this sort of system are only actually freed when they have not been needed for at least one full sweeper interval, but always less than two. Most importantly, the sweeper does most of its work without holding any locks to contend with regular threads. In theory, the same approach can be generalized to more than two stages, but I have yet to find that useful.
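
Here is a compact sketch of the two-list lookaside and the five-step sweeper described above, using pthread mutexes; the names and the choice of a singly linked free list are assumptions of the example.

    #include <pthread.h>
    #include <stdlib.h>

    struct node { struct node *next; /* ...object payload... */ };

    struct lookaside {
        pthread_mutex_t new_lock, old_lock;
        struct node *new_head;      /* recently freed objects      */
        struct node *old_head;      /* survivors of the last sweep */
    };

    /* "Free": push onto the new list instead of calling free(). */
    static void la_free(struct lookaside *la, struct node *n)
    {
        pthread_mutex_lock(&la->new_lock);
        n->next = la->new_head;
        la->new_head = n;
        pthread_mutex_unlock(&la->new_lock);
    }

    /* Allocate: new list first, old list next, system allocator last. */
    static struct node *la_alloc(struct lookaside *la)
    {
        struct node *n;
        pthread_mutex_lock(&la->new_lock);
        if ((n = la->new_head) != NULL)
            la->new_head = n->next;
        pthread_mutex_unlock(&la->new_lock);
        if (n)
            return n;
        pthread_mutex_lock(&la->old_lock);
        if ((n = la->old_head) != NULL)
            la->old_head = n->next;
        pthread_mutex_unlock(&la->old_lock);
        return n ? n : calloc(1, sizeof(*n));
    }

    /* The sweeper: steps 1-5 from the list above. */
    static void la_sweep(struct lookaside *la)
    {
        pthread_mutex_lock(&la->new_lock);       /* 1. lock both lists  */
        pthread_mutex_lock(&la->old_lock);
        struct node *doomed = la->old_head;      /* 2. save old head    */
        la->old_head = la->new_head;             /* 3. new becomes old  */
        la->new_head = NULL;
        pthread_mutex_unlock(&la->old_lock);     /* 4. unlock           */
        pthread_mutex_unlock(&la->new_lock);
        while (doomed) {                         /* 5. free at leisure, */
            struct node *next = doomed->next;    /*    holding no locks */
            free(doomed);
            doomed = next;
        }
    }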

One concern with using lookaside lists is that the list pointers might increase object size. In my experience, most of the objects that I'd use lookaside lists for already contain list pointers anyway, so it's kind of a moot point. Even if the pointers were only needed for the lookaside lists, though, the savings in terms of avoided trips through the system memory allocator (and object initialization) would more than make up for the extra memory.

Suggestion #3 actually has to do with locking, which we haven't discussed yet, but I'll toss it in anyway. Lock contention is often the biggest cost in allocating memory, even when lookaside lists are in use. One solution is to maintain multiple private lookaside lists, such that there's absolutely no possibility of contention for any one list. For example, you could have a separate lookaside list for each thread. One list per processor can be even better, due to cache-warmth considerations, but only works if threads cannot be preempted. The private lookaside lists can even be combined with a shared list if necessary, to create a system with extremely low allocation overhead.
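
Continuing the lookaside sketch above, Suggestion #3 can be as simple as putting a thread-private list in front of the shared one; the GCC/Clang-style __thread qualifier is assumed here, and a real version would cap the private list and spill back to the shared one instead of growing without bound.

    /* Per-thread cache in front of the shared lookaside list: no locking
     * at all on the fast path, because no other thread can see tls_head. */
    static __thread struct node *tls_head;

    static struct node *fast_alloc(struct lookaside *shared)
    {
        struct node *n = tls_head;
        if (n) {                       /* fast path: private list, no lock */
            tls_head = n->next;
            return n;
        }
        return la_alloc(shared);       /* slow path: shared list or malloc */
    }

    static void fast_free(struct node *n)
    {
        n->next = tls_head;            /* always free to the private list  */
        tls_head = n;
    }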

Lock Contention

Efficient locking schemes are notoriously hard to design, because of what I call Scylla and Charybdis after the monsters in the Odyssey. Scylla is locking that's too simplistic and/or coarse-grained, serializing activities that can or should proceed in parallel and thus sacrificing performance and scalability; Charybdis is overly complex or fine-grained locking, with space for locks and time for lock operations again sapping performance. Near Scylla are shoals representing deadlock and livelock conditions; near Charybdis are shoals representing race conditions. In between, there's a narrow channel that represents locking which is both efficient and correct...or is there? Since locking tends to be deeply tied to program logic, it's often impossible to design a good locking scheme without fundamentally changing how the program works. This is why people hate locking, and try to rationalize their use of non-scalable single-threaded approaches.

Almost every locking scheme starts off as "one big lock around everything" and a vague hope that performance won't suck. When that hope is dashed, and it almost always is, the big lock is broken up into smaller ones and the prayer is repeated, and then the whole process is repeated, presumably until performance is adequate. Often, though, each iteration increases complexity and locking overhead by 20-50% in return for a 5-10% decrease in lock contention. With luck, the net result is still a modest increase in performance, but actual decreases are not uncommon. The designer is left scratching his head (I use "his" because I'm a guy myself; get over it). "I made the locks finer grained like all the textbooks said I should," he thinks, "so why did performance get worse?"

In my opinion, things got worse because the aforementioned approach is fundamentally misguided. Imagine the "solution space" as a mountain range, with high points representing good solutions and low points representing bad ones. The problem is that the "one big lock" starting point is almost always separated from the higher peaks by all manner of valleys, saddles, lesser peaks and dead ends. It's a classic hill-climbing problem; trying to get from such a starting point to the higher peaks only by taking small steps and never going downhill almost never works. What's needed is a fundamentally different way of approaching the peaks.

The first thing you have to do is form a mental map of your program's locking. This map has two axes:

  • The vertical axis represents code. If you're using a staged architecture with non-branching stages, you probably already have a diagram showing these divisions, like the ones everybody uses for OSI-model network protocol stacks.
  • The horizontal axis represents data. In every stage, each request should be assigned to a data set with its own resources separate from any other set.

You now have a grid, where each cell represents a particular data set in a particular processing stage. What's most important is the following rule: two requests should not be in contention unless they are in the same data set and the same processing stage. If you can manage that, you've already won half the battle.
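
A literal-minded sketch of that rule: one lock per cell of the stage-by-data-set grid, with the grid sizes and the key-to-set mapping chosen arbitrarily for illustration.

    #include <pthread.h>

    #define NUM_STAGES 4     /* rows: processing stages      */
    #define NUM_SETS   16    /* columns: data sets per stage */

    /* One lock per cell of the grid.  Two requests can only contend if
     * they map to the same data set while in the same stage.           */
    static pthread_mutex_t grid[NUM_STAGES][NUM_SETS];

    static void grid_init(void)
    {
        for (int s = 0; s < NUM_STAGES; s++)
            for (int d = 0; d < NUM_SETS; d++)
                pthread_mutex_init(&grid[s][d], NULL);
    }

    /* Pick the data set from some per-request key, e.g. a transaction ID. */
    static unsigned set_for(unsigned long key)
    {
        return (unsigned)(key % NUM_SETS);
    }

    static void stage_work(int stage, unsigned long key)
    {
        pthread_mutex_t *cell = &grid[stage][set_for(key)];
        pthread_mutex_lock(cell);
        /* ...touch only data belonging to this stage and this data set... */
        pthread_mutex_unlock(cell);
    }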

Once you've defined the grid, every type of locking your program does can be plotted, and your next goal is to ensure that the resulting dots are as evenly distributed along both axes as possible. Unfortunately, this part is very application-specific. You have to think like a diamond-cutter, using your knowledge of what the program does to find the natural "cleavage lines" between stages and data sets. Sometimes they're obvious to start with. Sometimes they're harder to find, but seem more obvious in retrospect. Dividing code into stages is a complicated matter of program design, so there's not much I can offer there, but here are some suggestions for how to define data sets:

  • If you have some sort of a block number or hash or transaction ID associated with requests, you can rarely do better than to divide that value by the number of data sets.
  • Sometimes, it's better to assign requests to data sets dynamically, based on which data set has the most resources available rather than some intrinsic property of the request. Think of it like multiple integer units in a modern CPU; those guys know a thing or two about making discrete requests flow through a system.
  • It's often helpful to make sure that the data-set assignment is different for each stage, so that requests which would contend at one stage are guaranteed not to do so at another stage.

If you've divided your "locking space" both vertically and horizontally, and made sure that lock activity is spread evenly across the resulting cells, you can be pretty sure that your locking is in pretty good shape. There's one more step, though. Do you remember the "small steps" approach I derided a few paragraphs ago? It still has its place, because now you're at a good starting point instead of a terrible one. In metaphorical terms you're probably well up the slope on one of the mountain range's highest peaks, but you're probably not at the top of one. Now is the time to collect contention statistics and see what you need to do to improve, splitting stages and data sets in different ways and then collecting more statistics until you're satisfied. If you do all that, you're sure to have a fine view from the mountaintop.

Other Stuff

As promised, I've covered the four biggest performance problems in server design. There are still some important issues that any particular server will need to address, though. Mostly, these come down to knowing your platform/environment:

  • How does your storage subsystem perform with larger vs. smaller requests? With sequential vs. random? How well do read-ahead and write-behind work?
  • How efficient is the network protocol you're using? Are there parameters or flags you can set to make it perform better? Are there facilities like TCP_CORK, MSG_PUSH, or the Nagle-toggling trick that you can use to avoid tiny messages?
  • Does your system support scatter/gather I/O (e.g. readv/writev)? Using these can improve performance and also take much of the pain out of using buffer chains (see the sketch after this list).
  • What's your page size? What's your cache-line size? Is it worth it to align stuff on these boundaries? How expensive are system calls or context switches, relative to other things?
  • Are your reader/writer lock primitives subject to starvation? Of whom? Do your events have "thundering herd" problems? Does your sleep/wakeup have the nasty (but very common) behavior that when X wakes Y a context switch to Y happens immediately even if X still has things to do?
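
As promised above, here is what the scatter/gather point looks like in practice: a gather write that sends a header and a payload in one system call without first copying them into a contiguous buffer. Error handling and short-write retries are omitted.

    #include <sys/uio.h>
    #include <unistd.h>

    /* Send a protocol header and a payload in one system call, without
     * first copying them into a single contiguous buffer.              */
    static ssize_t send_reply(int fd, const void *hdr, size_t hdr_len,
                              const void *body, size_t body_len)
    {
        struct iovec iov[2];
        iov[0].iov_base = (void *)hdr;
        iov[0].iov_len  = hdr_len;
        iov[1].iov_base = (void *)body;
        iov[1].iov_len  = body_len;
        return writev(fd, iov, 2);    /* gather write: two pieces, one call */
    }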

I'm sure I could think of many more questions in this vein. I'm sure you could too. In any particular situation it might not be worthwhile to do anything about any one of these issues, but it's usually worth at least thinking about them. If you don't know the answers - many of which you will not find in the system documentation - find out. Write a test program or micro-benchmark to find the answers empirically; writing such code is a useful skill in and of itself anyway. If you're writing code to run on multiple platforms, many of these questions correlate with points where you should probably be abstracting functionality into per-platform libraries so you can realize a performance gain on that one platform that supports a particular feature.
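
A trivial harness of the kind suggested here, assuming POSIX clock_gettime is available. The empty loop is a stand-in for whatever operation you want to measure; as written it mostly measures loop overhead (and may be optimized away entirely), so put something real inside it.

    #include <stdio.h>
    #include <time.h>

    static double seconds_now(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void)
    {
        enum { ITERS = 1000000 };
        double t0 = seconds_now();
        for (int i = 0; i < ITERS; i++) {
            /* ...the operation you actually care about... */
        }
        double t1 = seconds_now();
        printf("%.1f ns per iteration\n", (t1 - t0) / ITERS * 1e9);
        return 0;
    }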

The "know the answers" theory applies to your own code, too. Figure out what the important high-level operations in your code are, and time them under different conditions. This is not quite the same as traditional profiling; it's about measuring design elements, not actual implementations. Low-level optimization is generally the last resort of someone who screwed up the design.
