深入分析Linux操作系統對於TCP/IP棧的實現原理與具體過程

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"一、Linux內核與網絡體系結構","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在我們瞭解整個linux系統的網絡體系結構之前,我們需要對整個網絡體系調用,初始化和交互的位置,同時也是Linux操作系統中最爲關鍵的一部分代碼-------內核,有一個初步的認知。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"文章相關視頻講解:","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"link","attrs":{"href":"https://link.zhihu.com/?target=https%3A//ke.qq.com/course/417774%3FflowToken%3D1033508","title":null,"type":null},"content":[{"type":"text","text":"c/c++Linux後臺服務器開發高級架構師學習視頻資料","attrs":{}}]}]},{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"link","attrs":{"href":"https://link.zhihu.com/?target=https%3A//www.bilibili.com/video/BV1Lb4y1D7cn/","title":null,"type":null},"content":[{"type":"text","text":"Linux內核網絡協議棧詳解","attrs":{}}]}]},{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"link","attrs":{"href":"https://link.zhihu.com/?target=https%3A//www.bilibili.com/video/BV1HX4y1G7ER","title":null,"type":null},"content":[{"type":"text","text":"詳解epoll源碼及原理","attrs":{}}]}]},{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"link","attrs":{"href":"https://link.zhihu.com/?target=https%3A//www.bilibili.com/video/BV1Av41187M8","title":null,"type":null},"content":[{"type":"text","text":"網絡編程的細節處理","attrs":{}}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"1、Linux內核的結構","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先,從功能上,我們將linux內核劃分爲五個不同的部分,分別是","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(1)進程管理:主要負載CPU的訪問控制,對CPU進行調度管理;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(2)內存管理:主要提供對內存資源的訪問控制;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(3)文件系統:將硬盤的扇區組織成文件系統,實現文件讀寫等操作;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(4)設備管理:用於控制所有的外部設備及控制器;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(5)網洛:主要負責管理各種網絡設備,並實現各種網絡協議棧,最終實現通過網絡連接其它系統的功能;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"每個部分分別處理一項明確的功能,又向其它各個部分提供自己所完成的功能,相互協調,共同完成操作系統的任務。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Linux內核架構如下圖所示:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/4d/4d72a4b35747132d710db660295290d1.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Linux內核架構圖","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2、Linux網絡子系統","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"內核的基本架構我們已經瞭解清楚了,接下來我們重點關注到內核中的網絡模塊,觀察在linux內核中,我們是如何實現及運用TCP/IP協議,並完成網絡的初始化及各個模塊調用調度。我們將內核中的網絡部分抽出,通過對比TCP/IP分層協議,與Linux網絡實現體系相對比,深入的瞭解學習linux內核是怎樣具體的實現TCP/IP協議棧的。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Linux網絡體系與TCP/IP協議棧如下圖所示。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/d8/d8c2817d8d27a4d52fd3c281ac0b3b14.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" linux網絡體系","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/60/604b0e2563febd1868900a074a5799ea.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"TCP/IP協議棧和OSI參考模型對應關係","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"可以看到,在圖中,linux爲了抽象與實現相分離,將內核中的網絡部分劃分爲五層:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"系統調用接口:系統調用接口是用戶空間的應用程序正常訪問內核的唯一途徑,系統調用一般以sys開頭。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"協議無關接口:協議無關接口是由socket來實現的,它提供一組通用函數來支持各種不同的協議。Linux中socket結構是struct sock,這個結構定義了socket所需要的所 有狀態信息,包括socke所使用的協議以及可以在socket上執行的操作。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"網絡協議:Linux支持多種協議,每一個協議都對應net_family[]數組中的一項,net_family[]的元素爲一個結構體指針,指向一個包含註冊協議信息的結構體 net_proto_family;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"設備無關接口:設備無關接口net_device實現的,任何設備與上層通信都是通過net_device設備無關接口。它將設備與具有很多功能的不同硬件連接在一起,這一層提供一組通用函數供底層網絡設備驅動程序使用,讓它們可以對高層協議棧進行操作。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"設備驅動程序:網絡體系結構的最底部是負責管理物理網絡設備的設備驅動程序層。","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Linux網絡子系統通過這五層結構的相互交互,共同完成TCP/IP協議棧的運行。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"3、TCP/IP協議棧","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.1 網絡架構","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Linux網絡協議棧的架構如下圖所示。該圖展示瞭如何實現Internet模型,在最上面的是用戶空間中實現的應用層,而中間爲內核空間中實現的網絡子系統,底部爲物理設備,提供了對網絡的連接能力。在網絡協議棧內部流動的是","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"套接口緩衝區","attrs":{}},{"type":"text","text":"(SKB),用於在協議棧的底層、上層以及應用層之間傳遞報文數據。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"網絡協議棧頂部是系統調用接口,爲用戶空間中的應用程序提供一種訪問內核網絡子系統的接口。下面是一個協議無關層,它提供了一種通用方法來使用傳輸層協議。然後是傳輸層的具體協議,包括TCP、UDP。在傳輸層下面是網絡層,之後是鄰居子系統,再下面是網絡設備接口,提供了與各個設備驅動通信的通用接口。最底層是設備驅動程序。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/0f/0f98c415bd3d227953304f175f6506a1.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Linux網絡協議棧的架構","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.2 協議無關接口","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過網絡協議棧通信需要對套接口進行操作,套接口是一個與協議無關的接口,它提供了一組接口來支持各種協議,套接口層不但可以支持典型的TCP和UDP協議,還可以支持RAW套接口、RAW以太網以及其他傳輸協議。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Linux中使用socket結構描述套接口,代表一條通信鏈路的一端,用來存儲與該鏈路有關的所有信息,這些信息中包括:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"所使用的協議","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"協議的狀態信息(包括源地址和目標地址)","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"到達的連接隊列","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據緩存和可選標誌等等","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其示意圖如下所示:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/97/97a0007770f5bcfea48e2cb2060647c0.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"使用socket結構描述套接口","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其中最關鍵的成員是","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"sk和ops","attrs":{}},{"type":"text","text":",sk指向與該套接口相關的","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"傳輸控制塊","attrs":{}},{"type":"text","text":",ops指向特定的傳輸協議的操作集。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下圖詳細展示了socket結構體中的sk和ops字段,以TCP爲例。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e2/e2d2d829ee4d2f4c671ecbd365bdf96c.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"TCP的socket結構體中的sk和ops字段","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"sk字段指向與該套接口相關的傳輸控制塊,傳輸層使用傳輸控制塊來存放套接口所需的信息,在上圖中即爲TCP傳輸控制塊,即tcp_sock結構。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ops字段指向特定傳輸協議的操作集接口,proto_pos結構中定義的接口函數是從套接口系統調用到傳輸層調用的入口,因此其成員與socket系統調用基本上是一一對應的。整個proto_ops結構就是一張套接口系統調用的跳轉表,TCP、UDP、RAW套接口的傳輸層操作集分別爲inet_stream_ops、inet_dgram_ops、inet_sockraw_ops。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.3 套接口緩存","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如下圖所示,網絡子系統中用來存儲數據的緩衝區叫做套接口緩存,簡稱爲SKB,該緩存區能夠處理可變長數據,即能夠很容易地在數據區頭尾部添加和移除數據,且儘量避免數據的複製,通常每個報文使用一個SKB表示,各協議棧報文頭通過一組指針進行定位,由於SKB是網絡子系統中數據管理的核心,因此有很多管理函數是對它進行操作的。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"SKB主要用於在網絡驅動程序和應用程序之間傳遞、複製數據包。當應用程序要發送一個數據包時:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據通過系統調用提交到內核中","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"系統會分配一個SKB來存儲數據","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"之後向下層傳遞","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"再傳遞給網絡驅動後纔將其釋放","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當網絡設備接收到數據包也要分配一個SKB來對數據進行存儲,之後再向上傳遞,最終將數據複製到應用程序後進行釋放。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/39/3905c19e1da640b81c70320376be913d.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"SKB原理","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.4 重要的數據結構","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.4.1 sk_buf","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"sk_buf是Linux網絡協議棧最重要的數據結構之一,該數據結構貫穿於整個數據包處理的流程。由於協議採用分層結構,上層向下層傳遞數據時需要增加包頭,下層向上層數據時又需要去掉包頭。sk_buff中保存了L2,L3,L4層的頭指針,這樣在層傳遞時只需要對數據緩衝區改變頭部信息,並調整sk_buff中的指針,而不需要拷貝數據,這樣大大減少了內存拷貝的需要。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"sk_buf的示意圖如下:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/9a/9a8a660a47982ec34d49d625b8f4fef0.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"sk_buf示意圖","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"各字段含義如下:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"head:指向分配給的線性數據內存首地址。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"data:指向保存數據內容的首地址。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"tail:指向數據的結尾。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"end:指向分配的內存塊的結尾。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"len:數據的長度。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"head room: 位於head至data之間的空間,用於存儲:protocol header,例如:TCP header, IP header, Ethernet header等。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"user data: 位於data至tail之間的空間,用於存儲:應用層數據,一般系統調用時會使用到。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"tail room: 位於tail至end之間的空間,用於填充用戶數據未使用完的空間。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"skb_shared_info: 位於end之後,用於存儲特殊數據結構skb_shared_info,該結構用於描述分片信息。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"sk_buf的常用操作函數如下:","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"alloc_skb:分配sk_buf。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"skb_reserve:爲sk_buff設置header空間。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"skb_put:添加用戶層數據。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"skb_push:向header空間添加協議頭。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"skb_pull:復位data至數據區。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"操作sk_buf的簡單示意圖如下:","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"操作sk_buf的簡單示意圖如下:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/46/461f2128ec9d492f68102e9c322e0df6.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"sk_buf操作前後示意圖","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.4.2 net_device","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在網絡適配器硬件和軟件協議棧之間需要一個接口,共同完成操作系統內核中協議棧數據處理與異步收發的功能。在Linux網絡體系結構中,這個接口要滿足以下要求:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  (1)抽象出網絡適配器的硬件特性。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  (2)爲協議棧提供統一的調用接口。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"以上兩個要求在Linux內核的網絡體系結構中分別由兩個軟件(設備獨立接口文件dev.c和網絡設備驅動程序)和一個主要的數據結構net_device實現。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"設備獨立接口文件dev.c中實現了對上層協議的統一調用接口,dev.c文件中的函數實現了以下主要功能。","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"協議調用與驅動程序函數對應:dev.c文件中的函數查看數據包由哪個網絡設備(由sk_buff結構中*dev數據域指明該數據包由哪個網絡設備net_device實例接收/發送)傳送,根據系統中註冊的設備實例,調用網絡設備驅動程序函數,實現硬件的收發。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對net_device數據結構的數據域統一初始化:dev.c提供了一些常規函數,來初始化net_device結構中的這樣一些數據域:它們的值對所有類型的設備都一樣,驅動程序可以調用這些函數來設置其設備實例的默認值,也可以重寫由內核初始化的值。","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"每一個網絡設備都必須有一個驅動程序,並提供一個初始化函數供內核啓動時調用,或在裝載網絡驅動程序模塊時調用。不管網絡設備內部有什麼不同,有一件事是所有網絡設備驅動程序必須首先完成的任務:初始化一個net_device數據結構的實例作爲網絡設備在內核中的實體,並將net_device數據結構實例的各數據域初始化爲可工作的狀態,然後將設備實例註冊到內核中,爲協議棧提供傳送服務。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"net_device數據結構從以下兩個方面描述了網絡設備的硬件特性在內核中的表示。","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"描述設備屬性","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"net_device數據結構實例是網絡設備在內核中的表示,它是每個網絡設備在內核中的基礎數據結構,它包含的信息不僅僅是網絡設備的硬件屬性(中斷、端口地址、驅動程序函數等),還包括網絡中與設備有關的上層協議棧的配置信息(如IP地址、子網掩碼等)。它跟蹤連接到 TCP/IP協議棧上的所有設備的狀態信息。","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"實現設備驅動程序接口","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"net_device數據結構代表了上層的網絡協議和硬件之間的一個通用接口,使我們可以將網絡協議層的實現從具體的網絡硬件部件中抽象出來,獨立於硬件設備。爲了有效地實現這種抽象,net_device中使用了大量函數指針,這樣相對於上層的協議棧,它們在做數據收發操作時調用的函數的名字是相同的,但具體的函數實現細節可以根據不同的網絡適配器而不同,由設備驅動程序提供,對網絡協議棧透明。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/1c/1c599c32e3b2375afeb736e84fe976a2.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"網絡設備抽象數據結構","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.4.3 socket","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"內核中的進程可以通過socket結構體來訪問linux內核中的網絡系統中的傳輸層、網絡層以及數據鏈路層,也可以說socket是內核中的進程與內核中的網絡系統的橋樑。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們知道在TCP層中使用兩個協議:tcp協議和udp協議。而在將TCP層中的數據往下傳輸時,要使用網絡層的協議,而網絡層的協議很多,不同的網絡使用不同的網絡層協議。我們常用的因特網中,網絡層使用的是IPV4和IPV6協議。所以在內核中的進程在使用struct socket提取內核網絡系統中的數據時,不光要指明struct socket的類型(用於說明是提取TCP層中tcp協議負載的數據,還是udp層負載的數據),還要指明網絡層的協議類型(網絡層的協議用於負載TCP層中的數據)。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"linux內核中的網絡系統中的網絡層的協議,在linux中被稱爲address family(地址簇,通常以AF_XXX表示)或protocol family(協議簇,通常以PF_XXX表示)。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"文章福利 Linux後端開發網絡底層原理知識學習提升 點擊","attrs":{}},{"type":"text","text":" ","attrs":{}},{"type":"link","attrs":{"href":"https://link.zhihu.com/?target=https%3A//jq.qq.com/%3F_wv%3D1027%26k%3DsSXkvDJ2","title":null,"type":null},"content":[{"type":"text","text":"學習資料","attrs":{}}]},{"type":"text","text":" ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"獲取,完善技術棧,內容知識點包括Linux,Nginx,ZeroMQ,MySQL,Redis,線程池,MongoDB,ZK,Linux內核,CDN,P2P,epoll,Docker,TCP/IP,協程,DPDK等等。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/de/de89c4cd8d14ccd955843ef34d792d16.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"二、網絡信息處理流程","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1、硬中斷處理","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  首先當數據幀從網線到達網卡上的時候,第一站是網卡的接收隊列。網卡在分配給自己的RingBuffer中尋找可用的內存位置,找到後DMA引擎會把數據DMA到網卡之前關聯的內存裏,這個時候CPU都是無感的。當DMA操作完成以後,網卡會像CPU發起一個硬中斷,通知CPU有數據到達。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/48/48d33de04f70bcf49e5af49c97314065.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 網卡數據硬中斷處理過程","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"注意,當RingBuffer滿的時候,新來的數據包將給丟棄。ifconfig查看網卡的時候,可以裏面有個overruns,表示因爲環形隊列滿被丟棄的包。如果發現有丟包,可能需要通過ethtool命令來加大環形隊列的長度。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"網卡的硬中斷註冊的處理函數是igb_msix_ring。","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"//file: drivers/net/ethernet/intel/igb/igb_main.c\nstatic irqreturn_t igb_msix_ring(int irq, void *data)\n{\n struct igb_q_vector *q_vector = data;\n\n /* Write the ITR value calculated from the previous interrupt. */\n igb_write_itr(q_vector);\n\n napi_schedule(&q_vector->napi);\n\n return IRQ_HANDLED;\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"igb_write_itr只是記錄一下硬件中斷頻率(據說目的是在減少對CPU的中斷頻率時用到)。順着napi_schedule調用一路跟蹤下去,__napi_schedule=>____napi_schedule。","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"/* Called with irq disabled */\nstatic inline void ____napi_schedule(struct softnet_data *sd,\n struct napi_struct *napi)\n{\n list_add_tail(&napi->poll_list, &sd->poll_list);\n __raise_softirq_irqoff(NET_RX_SOFTIRQ);\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這裏我們看到,list_add_tail修改了CPU變量softnet_data裏的poll_list,將驅動napi_struct傳過來的poll_list添加了進來。 其中softnet_data中的poll_list是一個雙向列表,其中的設備都帶有輸入幀等着被處理。緊接着__raise_softirq_irqoff觸發了一個軟中斷NET_RX_SOFTIRQ, 這個所謂的觸發過程只是對一個變量進行了一次或運算而已。","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"void __raise_softirq_irqoff(unsigned int nr)\n{\n trace_softirq_raise(nr);\n or_softirq_pending(1UL << nr);\n}\n//file: include/linux/irq_cpustat.h\n#define or_softirq_pending(x) (local_softirq_pending() |= (x))","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Linux在硬中斷裏只完成簡單必要的工作,剩下的大部分的處理都是轉交給軟中斷的。通過上面代碼可以看到,硬中斷處理過程真的是非常短。只是記錄了一個寄存器,修改了一下下CPU的poll_list,然後發出個軟中斷。就這麼簡單,硬中斷工作就算是完成了。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2、ksoftirqd內核線程處理軟中斷","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/1d/1dbfb2be3a105fef030713ce120fcd93.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ksoftirqd內核線程","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ksoftirqd_should_run代碼如下:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"static int ksoftirqd_should_run(unsigned int cpu)\n{\n return local_softirq_pending();\n}\n\n#define local_softirq_pending() \\\n __IRQ_STAT(smp_processor_id(), __softirq_pending)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這裏看到和硬中斷中調用了同一個函數local_softirq_pending。使用方式不同的是硬中斷位置是爲了寫入標記,這裏僅僅只是讀取。如果硬中斷中設置了NET_RX_SOFTIRQ,這裏自然能讀取的到。接下來會真正進入線程函數中run_ksoftirqd處理:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"static void run_ksoftirqd(unsigned int cpu)\n{\n local_irq_disable();\n if (local_softirq_pending()) {\n __do_softirq();\n rcu_note_context_switch(cpu);\n local_irq_enable();\n cond_resched();\n return;\n }\n local_irq_enable();\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在__do_softirq中,判斷根據當前CPU的軟中斷類型,調用其註冊的action方法。","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"asmlinkage void __do_softirq(void)\n{\n do {\n if (pending & 1) {\n unsigned int vec_nr = h - softirq_vec;\n int prev_count = preempt_count();\n\n ...\n trace_softirq_entry(vec_nr);\n h->action(h);\n trace_softirq_exit(vec_nr);\n ...\n }\n h++;\n pending >>= 1;\n } while (pending);\n}  ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在網絡子系統初始化小節, 我們看到我們爲NET_RX_SOFTIRQ註冊了處理函數net_rx_action。所以net_rx_action函數就會被執行到了。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這裏需要注意一個細節,硬中斷中設置軟中斷標記,和ksoftirq的判斷是否有軟中斷到達,都是基於smp_processor_id()的。這意味着只要硬中斷在哪個CPU上被響應,那麼軟中斷也是在這個CPU上處理的。所以說,如果你發現你的Linux軟中斷CPU消耗都集中在一個核上的話,做法是要把調整硬中斷的CPU親和性,來將硬中斷打散到不通的CPU核上去。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們再來把精力集中到這個核心函數net_rx_action上來。","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"static void net_rx_action(struct softirq_action *h)\n{\n struct softnet_data *sd = &__get_cpu_var(softnet_data);\n unsigned long time_limit = jiffies + 2;\n int budget = netdev_budget;\n void *have;\n\n local_irq_disable();\n\n while (!list_empty(&sd->poll_list)) {\n ......\n n = list_first_entry(&sd->poll_list, struct napi_struct, poll_list);\n\n work = 0;\n if (test_bit(NAPI_STATE_SCHED, &n->state)) {\n work = n->poll(n, weight);\n trace_napi_poll(n);\n }\n\n budget -= work;\n }\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"函數開頭的time_limit和budget是用來控制net_rx_action函數主動退出的,目的是保證網絡包的接收不霸佔CPU不放。 等下次網卡再有硬中斷過來的時候再處理剩下的接收數據包。其中budget可以通過內核參數調整。 這個函數中剩下的核心邏輯是獲取到當前CPU變量softnet_data,對其poll_list進行遍歷, 然後執行到網卡驅動註冊到的poll函數。對於igb網卡來說,就是igb驅動力的igb_poll函數了。","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"/**\n * igb_poll - NAPI Rx polling callback\n * @napi: napi polling structure\n * @budget: count of how many packets we should handle\n **/\nstatic int igb_poll(struct napi_struct *napi, int budget)\n{\n ...\n if (q_vector->tx.ring)\n clean_complete = igb_clean_tx_irq(q_vector);\n\n if (q_vector->rx.ring)\n clean_complete &= igb_clean_rx_irq(q_vector, budget);\n ...\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在讀取操作中,igb_poll的重點工作是對igb_clean_rx_irq的調用。","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"static bool igb_clean_rx_irq(struct igb_q_vector *q_vector, const int budget)\n{\n ...\n\n do {\n\n /* retrieve a buffer from the ring */\n skb = igb_fetch_rx_buffer(rx_ring, rx_desc, skb);\n\n /* fetch next buffer in frame if non-eop */\n if (igb_is_non_eop(rx_ring, rx_desc))\n continue;\n }\n\n /* verify the packet layout is correct */\n if (igb_cleanup_headers(rx_ring, rx_desc, skb)) {\n skb = NULL;\n continue;\n }\n\n /* populate checksum, timestamp, VLAN, and protocol */\n igb_process_skb_fields(rx_ring, rx_desc, skb);\n\n napi_gro_receive(&q_vector->napi, skb);\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"igb_fetch_rx_buffer和igb_is_non_eop的作用就是把數據幀從RingBuffer上取下來。爲什麼需要兩個函數呢?因爲有可能幀要佔多多個RingBuffer,所以是在一個循環中獲取的,直到幀尾部。獲取下來的一個數據幀用一個sk_buff來表示。收取完數據以後,對其進行一些校驗,然後開始設置sbk變量的timestamp, VLAN id, protocol等字段。接下來進入到napi_gro_receive中:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"//file: net/core/dev.c\ngro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)\n{\n skb_gro_reset_offset(skb);\n\n return napi_skb_finish(dev_gro_receive(napi, skb), skb);\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"dev_gro_receive這個函數代表的是網卡GRO特性,可以簡單理解成把相關的小包合併成一個大包就行,目的是減少傳送給網絡棧的包數,這有助於減少 CPU 的使用量。我們暫且忽略,直接看napi_skb_finish, 這個函數主要就是調用了netif_receive_skb。","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"//file: net/core/dev.c\nstatic gro_result_t napi_skb_finish(gro_result_t ret, struct sk_buff *skb)\n{\n switch (ret) {\n case GRO_NORMAL:\n if (netif_receive_skb(skb))\n ret = GRO_DROP;\n break;\n ......\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在netif_receive_skb中,數據包將被送到協議棧中。聲明,以下的3.3, 3.4, 3.5也都屬於軟中斷的處理過程,只不過由於篇幅太長,單獨拿出來成小節。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3、網絡協議棧處理","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"netif_receive_skb函數會根據包的協議,假如是udp包,會將包依次送到ip_rcv(),udp_rcv()協議處理函數中進行處理。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/2d/2d69d1e95af334901b2b7731e4bbd501.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"網絡協議棧處理","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"//file: net/core/dev.c\nint netif_receive_skb(struct sk_buff *skb)\n{\n //RPS處理邏輯,先忽略\n ......\n\n return __netif_receive_skb(skb);\n}\n\nstatic int __netif_receive_skb(struct sk_buff *skb)\n{\n ...... \n ret = __netif_receive_skb_core(skb, false);\n}\n\nstatic int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)\n{\n ......\n\n //pcap邏輯,這裏會將數據送入抓包點。tcpdump就是從這個入口獲取包的\n list_for_each_entry_rcu(ptype, &ptype_all, list) {\n if (!ptype->dev || ptype->dev == skb->dev) {\n if (pt_prev)\n ret = deliver_skb(skb, pt_prev, orig_dev);\n pt_prev = ptype;\n }\n }\n\n ......\n\n list_for_each_entry_rcu(ptype,\n &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {\n if (ptype->type == type &&\n (ptype->dev == null_or_dev || ptype->dev == skb->dev ||\n ptype->dev == orig_dev)) {\n if (pt_prev)\n ret = deliver_skb(skb, pt_prev, orig_dev);\n pt_prev = ptype;\n }\n }\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在__netif_receive_skb_core中,我看着原來經常使用的tcpdump的抓包點,很是激動,看來讀一遍源代碼時間真的沒白浪費。接着__netif_receive_skb_core取出protocol,它會從數據包中取出協議信息,然後遍歷註冊在這個協議上的回調函數列表。ptype_base 是一個 hash table,在協議註冊小節我們提到過。ip_rcv 函數地址就是存在這個 hash table中的。","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"//file: net/core/dev.c\nstatic inline int deliver_skb(struct sk_buff *skb,\n struct packet_type *pt_prev,\n struct net_device *orig_dev)\n{\n ......\n return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"pt_prev->func這一行就調用到了協議層註冊的處理函數了。對於ip包來講,就會進入到ip_rcv(如果是arp包的話,會進入到arp_rcv)。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4、IP協議層處理","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們再來大致看一下linux在ip協議層都做了什麼,包又是怎麼樣進一步被送到udp或tcp協議處理函數中的。","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"//file: net/ipv4/ip_input.c\nint ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev)\n{\n ......\n\n return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING, skb, dev, NULL,\n ip_rcv_finish);\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這裏NF_HOOK是一個鉤子函數,當執行完註冊的鉤子後就會執行到最後一個參數指向的函數ip_rcv_finish。","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"static int ip_rcv_finish(struct sk_buff *skb)\n{\n ......\n\n if (!skb_dst(skb)) {\n int err = ip_route_input_noref(skb, iph->daddr, iph->saddr,\n iph->tos, skb->dev);\n ...\n }\n\n ......\n\n return dst_input(skb);\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"跟蹤ip_route_input_noref 後看到它又調用了 ip_route_input_mc。 在ip_route_input_mc中,函數ip_local_deliver被賦值給了dst.input, 如下:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"//file: net/ipv4/route.c\nstatic int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr,\n u8 tos, struct net_device *dev, int our)\n{\n if (our) {\n rth->dst.input= ip_local_deliver;\n rth->rt_flags |= RTCF_LOCAL;\n }\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"所以回到ip_rcv_finish中的return dst_input(skb)。","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"/* Input packet from network to transport. */\nstatic inline int dst_input(struct sk_buff *skb)\n{\n return skb_dst(skb)->input(skb);\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"skb_dst(skb)->input調用的input方法就是路由子系統賦的ip_local_deliver。","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"//file: net/ipv4/ip_input.c\nint ip_local_deliver(struct sk_buff *skb)\n{\n /*\n * Reassemble IP fragments.\n */\n\n if (ip_is_fragment(ip_hdr(skb))) {\n if (ip_defrag(skb, IP_DEFRAG_LOCAL_DELIVER))\n return 0;\n }\n\n return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN, skb, skb->dev, NULL,\n ip_local_deliver_finish);\n}\nstatic int ip_local_deliver_finish(struct sk_buff *skb)\n{\n ......\n\n int protocol = ip_hdr(skb)->protocol;\n const struct net_protocol *ipprot;\n\n ipprot = rcu_dereference(inet_protos[protocol]);\n if (ipprot != NULL) {\n ret = ipprot->handler(skb);\n }\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如協議註冊小節看到inet_protos中保存着tcp_rcv()和udp_rcv()的函數地址。這裏將會根據包中的協議類型選擇進行分發,在這裏skb包將會進一步被派送到更上層的協議中,udp和tcp。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"5、UDP協議層處理","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"udp協議的處理函數是udp_rcv。","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"//file: net/ipv4/udp.c\nint udp_rcv(struct sk_buff *skb)\n{\n return __udp4_lib_rcv(skb, &udp_table, IPPROTO_UDP);\n}\n\n\nint __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,\n int proto)\n{\n sk = __udp4_lib_lookup_skb(skb, uh->source, uh->dest, udptable);\n\n if (sk != NULL) {\n int ret = udp_queue_rcv_skb(sk, skb\n }\n\n icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PORT_UNREACH, 0);\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"__udp4_lib_lookup_skb是根據skb來尋找對應的socket,當找到以後將數據包放到socket的緩存隊列裏。如果沒有找到,則發送一個目標不可達的icmp包。","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"//file: net/ipv4/udp.c\nint udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)\n{ \n ......\n\n if (sk_rcvqueues_full(sk, skb, sk->sk_rcvbuf))\n goto drop;\n\n rc = 0;\n\n ipv4_pktinfo_prepare(skb);\n bh_lock_sock(sk);\n if (!sock_owned_by_user(sk))\n rc = __udp_queue_rcv_skb(sk, skb);\n else if (sk_add_backlog(sk, skb, sk->sk_rcvbuf)) {\n bh_unlock_sock(sk);\n goto drop;\n }\n bh_unlock_sock(sk);\n\n return rc;\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"sock_owned_by_user判斷的是用戶是不是正在這個socker上進行系統調用(socket被佔用),如果沒有,那就可以直接放到socket的接收隊列中。如果有,那就通過sk_add_backlog把數據包添加到backlog隊列。 當用戶釋放的socket的時候,內核會檢查backlog隊列,如果有數據再移動到接收隊列中。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"sk_rcvqueues_full接收隊列如果滿了的話,將直接把包丟棄。接收隊列大小受內核參數net.core.rmem_max和net.core.rmem_default影響。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"文章福利 Linux後端開發網絡底層原理知識學習提升 點擊","attrs":{}},{"type":"text","text":" ","attrs":{}},{"type":"link","attrs":{"href":"https://link.zhihu.com/?target=https%3A//jq.qq.com/%3F_wv%3D1027%26k%3DsSXkvDJ2","title":null,"type":null},"content":[{"type":"text","text":"學習資料","attrs":{}}]},{"type":"text","text":" ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"獲取,完善技術棧,內容知識點包括Linux,Nginx,ZeroMQ,MySQL,Redis,線程池,MongoDB,ZK,Linux內核,CDN,P2P,epoll,Docker,TCP/IP,協程,DPDK等等。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/de/de89c4cd8d14ccd955843ef34d792d16.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"三、send分析","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1、傳輸層分析","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"send的定義如下所示:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ssize_t send(int sockfd, const void *buf, size_t len, int flags)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當在調用send函數的時候,內核封裝send()爲sendto(),然後發起系統調用。其實也很好理解,send()就是sendto()的一種特殊情況,而sendto()在內核的系統調用服務程序爲sys_sendto,sys_sendto的代碼如下所示:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"int __sys_sendto(int fd, void __user *buff, size_t len, unsigned int flags,\n struct sockaddr __user *addr, int addr_len)\n{\n struct socket *sock;\n struct sockaddr_storage address;\n int err;\n struct msghdr msg; //用來表示要發送的數據的一些屬性\n struct iovec iov;\n int fput_needed;\n err = import_single_range(WRITE, buff, len, &iov, &msg.msg_iter);\n if (unlikely(err))\n return err;\n sock = sockfd_lookup_light(fd, &err, &fput_needed);\n if (!sock)\n goto out;\n\n msg.msg_name = NULL;\n msg.msg_control = NULL;\n msg.msg_controllen = 0;\n msg.msg_namelen = 0;\n if (addr) {\n err = move_addr_to_kernel(addr, addr_len, &address);\n if (err < 0)\n goto out_put;\n msg.msg_name = (struct sockaddr *)&address;\n msg.msg_namelen = addr_len;\n }\n if (sock->file->f_flags & O_NONBLOCK)\n flags |= MSG_DONTWAIT;\n msg.msg_flags = flags;\n err = sock_sendmsg(sock, &msg); //實際的發送函數\n\nout_put:\n fput_light(sock->file, fput_needed);\nout:\n return err;\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"__sys_sendto函數其實做了3件事:","attrs":{}}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"通過fd獲取了對應的struct socket","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"創建了用來描述要發送的數據的結構體struct msghdr","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"調用了sock_sendmsg來執行實際的發送","attrs":{}}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"繼續追蹤sock_sendmsg,發現其最終調用的是sock->ops->sendmsg(sock, msg, msg_data_left(msg)),即socet在初始化時賦值給結構體struct proto tcp_prot的函數tcp_sendmsg,如下所示:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"struct proto tcp_prot = {\n .name = \"TCP\",\n .owner = THIS_MODULE,\n .close = tcp_close,\n .pre_connect = tcp_v4_pre_connect,\n .connect = tcp_v4_connect,\n .disconnect = tcp_disconnect,\n .accept = inet_csk_accept,\n .ioctl = tcp_ioctl,\n .init = tcp_v4_init_sock,\n .destroy = tcp_v4_destroy_sock,\n .shutdown = tcp_shutdown,\n .setsockopt = tcp_setsockopt,\n .getsockopt = tcp_getsockopt,\n .keepalive = tcp_set_keepalive,\n .recvmsg = tcp_recvmsg,\n .sendmsg = tcp_sendmsg,\n ...","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"而tcp_send函數實際調用的是tcp_sendmsg_locked函數,該函數的定義如下所示:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)\n{\n struct tcp_sock *tp = tcp_sk(sk);/*進行了強制類型轉換*/\n struct sk_buff *skb;\n flags = msg->msg_flags;\n ......\n if (copied)\n tcp_push(sk, flags & ~MSG_MORE, mss_now,\n TCP_NAGLE_PUSH, size_goal);\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在tcp_sendmsg_locked中,完成的是將所有的數據組織成發送隊列,這個發送隊列是struct sock結構中的一個域sk_write_queue,這個隊列的每一個元素是一個skb,裏面存放的就是待發送的數據。在該函數中通過調用tcp_push()函數將數據加入到發送隊列中。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"sock結構體的部分代碼如下所示:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"struct sock{\n ...\n struct sk_buff_head sk_write_queue;/*指向skb隊列的第一個元素*/\n ...\n struct sk_buff *sk_send_head;/*指向隊列第一個還沒有發送的元素*/\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"tcp_push的代碼如下所示:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"static void tcp_push(struct sock *sk, int flags, int mss_now,\n int nonagle, int size_goal)\n{\n struct tcp_sock *tp = tcp_sk(sk);\n struct sk_buff *skb;\n\n skb = tcp_write_queue_tail(sk);\n if (!skb)\n return;\n if (!(flags & MSG_MORE) || forced_push(tp))\n tcp_mark_push(tp, skb);\n\n tcp_mark_urg(tp, flags);\n\n if (tcp_should_autocork(sk, skb, size_goal)) {\n\n /* avoid atomic op if TSQ_THROTTLED bit is already set */\n if (!test_bit(TSQ_THROTTLED, &sk->sk_tsq_flags)) {\n NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPAUTOCORKING);\n set_bit(TSQ_THROTTLED, &sk->sk_tsq_flags);\n }\n /* It is possible TX completion already happened\n * before we set TSQ_THROTTLED.\n */\n if (refcount_read(&sk->sk_wmem_alloc) > skb->truesize)\n return;\n }\n\n if (flags & MSG_MORE)\n nonagle = TCP_NAGLE_CORK;\n\n __tcp_push_pending_frames(sk, mss_now, nonagle); //最終通過調用該函數發送數據\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在之後tcp_push調用了__tcp_push_pending_frames(sk, mss_now, nonagle);來發送數據","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"__tcp_push_pending_frames的代碼如下所示:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,\n int nonagle)\n{\n\n if (tcp_write_xmit(sk, cur_mss, nonagle, 0,\n sk_gfp_mask(sk, GFP_ATOMIC))) //調用該函數發送數據\n tcp_check_probe_timer(sk);\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在__tcp_push_pending_frames又調用了tcp_write_xmit來發送數據,代碼如下所示:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,\n int push_one, gfp_t gfp)\n{\n struct tcp_sock *tp = tcp_sk(sk);\n struct sk_buff *skb;\n unsigned int tso_segs, sent_pkts;\n int cwnd_quota;\n int result;\n bool is_cwnd_limited = false, is_rwnd_limited = false;\n u32 max_segs;\n /*統計已發送的報文總數*/\n sent_pkts = 0;\n ......\n\n /*若發送隊列未滿,則準備發送報文*/\n while ((skb = tcp_send_head(sk))) {\n unsigned int limit;\n\n if (unlikely(tp->repair) && tp->repair_queue == TCP_SEND_QUEUE) {\n /* \"skb_mstamp_ns\" is used as a start point for the retransmit timer */\n skb->skb_mstamp_ns = tp->tcp_wstamp_ns = tp->tcp_clock_cache;\n list_move_tail(&skb->tcp_tsorted_anchor, &tp->tsorted_sent_queue);\n tcp_init_tso_segs(skb, mss_now);\n goto repair; /* Skip network transmission */\n }\n\n if (tcp_pacing_check(sk))\n break;\n\n tso_segs = tcp_init_tso_segs(skb, mss_now);\n BUG_ON(!tso_segs);\n /*檢查發送窗口的大小*/\n cwnd_quota = tcp_cwnd_test(tp, skb);\n if (!cwnd_quota) {\n if (push_one == 2)\n /* Force out a loss probe pkt. */\n cwnd_quota = 1;\n else\n break;\n }\n\n if (unlikely(!tcp_snd_wnd_test(tp, skb, mss_now))) {\n is_rwnd_limited = true;\n break;\n ......\n limit = mss_now;\n if (tso_segs > 1 && !tcp_urg_mode(tp))\n limit = tcp_mss_split_point(sk, skb, mss_now,\n min_t(unsigned int,\n cwnd_quota,\n max_segs),\n nonagle);\n\n if (skb->len > limit &&\n unlikely(tso_fragment(sk, TCP_FRAG_IN_WRITE_QUEUE,\n skb, limit, mss_now, gfp)))\n break;\n\n if (tcp_small_queue_check(sk, skb, 0))\n break;\n\n if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp))) //調用該函數發送數據\n break;\n ......","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"tcp_write_xmit位於tcpoutput.c中,它實現了tcp的擁塞控制,然後調用了tcp_transmit_skb(sk, skb, 1, gfp)傳輸數據,實際上調用的是__tcp_transmit_skb。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"__tcp_transmit_skb的部分代碼如下所示:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,\n int clone_it, gfp_t gfp_mask, u32 rcv_nxt)\n{\n \n skb_push(skb, tcp_header_size);\n skb_reset_transport_header(skb);\n ......\n /* 構建TCP頭部和校驗和 */\n th = (struct tcphdr *)skb->data;\n th->source = inet->inet_sport;\n th->dest = inet->inet_dport;\n th->seq = htonl(tcb->seq);\n th->ack_seq = htonl(rcv_nxt);\n\n tcp_options_write((__be32 *)(th + 1), tp, &opts);\n skb_shinfo(skb)->gso_type = sk->sk_gso_type;\n if (likely(!(tcb->tcp_flags & TCPHDR_SYN))) {\n th->window = htons(tcp_select_window(sk));\n tcp_ecn_send(sk, skb, th, tcp_header_size);\n } else {\n /* RFC1323: The window in SYN & SYN/ACK segments\n * is never scaled.\n */\n th->window = htons(min(tp->rcv_wnd, 65535U));\n }\n ......\n icsk->icsk_af_ops->send_check(sk, skb);\n\n if (likely(tcb->tcp_flags & TCPHDR_ACK))\n tcp_event_ack_sent(sk, tcp_skb_pcount(skb), rcv_nxt);\n\n if (skb->len != tcp_header_size) {\n tcp_event_data_sent(tp, sk);\n tp->data_segs_out += tcp_skb_pcount(skb);\n tp->bytes_sent += skb->len - tcp_header_size;\n }\n\n if (after(tcb->end_seq, tp->snd_nxt) || tcb->seq == tcb->end_seq)\n TCP_ADD_STATS(sock_net(sk), TCP_MIB_OUTSEGS,\n tcp_skb_pcount(skb));\n\n tp->segs_out += tcp_skb_pcount(skb);\n /* OK, its time to fill skb_shinfo(skb)->gso_{segs|size} */\n skb_shinfo(skb)->gso_segs = tcp_skb_pcount(skb);\n skb_shinfo(skb)->gso_size = tcp_skb_mss(skb);\n\n /* Leave earliest departure time in skb->tstamp (skb->skb_mstamp_ns) */\n\n /* Cleanup our debris for IP stacks */\n memset(skb->cb, 0, max(sizeof(struct inet_skb_parm),\n sizeof(struct inet6_skb_parm)));\n\n err = icsk->icsk_af_ops->queue_xmit(sk, skb, &inet->cork.fl); //調用網絡層的發送接口\n ......\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"__tcp_transmit_skb是位於傳輸層發送tcp數據的最後一步,這裏首先對TCP數據段的頭部進行了處理,然後調用了網絡層提供的發送接口:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"icsk->icsk_af_ops->queue_xmit(sk, skb, &inet->cork.fl);實現了數據的發送,自此,數據離開了傳輸層,傳輸層的任務也就結束了。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"傳輸層時序圖如下圖所示:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/4b/4bf192f7f2ba338ae2f984dce68b742f.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"傳輸層時序圖","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"GDB調試如下所示。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/61/61a149a37bf85259c385de6ae3f16cb0.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"GDB調試","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2、網絡層分析","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"將TCP傳輸過來的數據包打包成IP數據報,將數據打包成IP數據包之後,通過調用ip_local_out函數,在該函數內部調用了__ip_local_out,該函數返回了一個nf_hook函數,在該函數內部調用了dst_output","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"int ip_queue_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl)\n{\n struct inet_sock *inet = inet_sk(sk);\n struct net *net = sock_net(sk);\n struct ip_options_rcu *inet_opt;\n struct flowi4 *fl4;\n struct rtable *rt;\n struct iphdr *iph;\n int res;\n\n /* Skip all of this if the packet is already routed,\n * f.e. by something like SCTP.\n */\n rcu_read_lock();\n /*\n * 如果待輸出的數據包已準備好路由緩存,\n * 則無需再查找路由,直接跳轉到packet_routed\n * 處作處理。\n */\n inet_opt = rcu_dereference(inet->inet_opt);\n fl4 = &fl->u.ip4;\n rt = skb_rtable(skb);\n if (rt)\n goto packet_routed;\n\n /* Make sure we can route this packet. */\n /*\n * 如果輸出該數據包的傳輸控制塊中\n * 緩存了輸出路由緩存項,則需檢測\n * 該路由緩存項是否過期。\n * 如果過期,重新通過輸出網絡設備、\n * 目的地址、源地址等信息查找輸出\n * 路由緩存項。如果查找到對應的路\n * 由緩存項,則將其緩存到傳輸控制\n * 塊中,否則丟棄該數據包。\n * 如果未過期,則直接使用緩存在\n * 傳輸控制塊中的路由緩存項。\n */\n rt = (struct rtable *)__sk_dst_check(sk, 0);\n if (!rt) { /* 緩存過期 */\n __be32 daddr;\n\n /* Use correct destination address if we have options. */\n daddr = inet->inet_daddr; /* 目的地址 */\n if (inet_opt && inet_opt->opt.srr)\n daddr = inet_opt->opt.faddr; /* 嚴格路由選項 */\n\n /* If this fails, retransmit mechanism of transport layer will\n * keep trying until route appears or the connection times\n * itself out.\n */ /* 查找路由緩存 */\n rt = ip_route_output_ports(net, fl4, sk,\n daddr, inet->inet_saddr,\n inet->inet_dport,\n inet->inet_sport,\n sk->sk_protocol,\n RT_CONN_FLAGS(sk),\n sk->sk_bound_dev_if);\n if (IS_ERR(rt))\n goto no_route;\n sk_setup_caps(sk, &rt->dst); /* 設置控制塊的路由緩存 */\n }\n skb_dst_set_noref(skb, &rt->dst);/* 將路由設置到skb中 */\n\npacket_routed:\n if (inet_opt && inet_opt->opt.is_strictroute && rt->rt_uses_gateway)\n goto no_route;\n\n /* OK, we know where to send it, allocate and build IP header. */\n /*\n * 設置IP首部中各字段的值。如果存在IP選項,\n * 則在IP數據包首部中構建IP選項。\n */\n skb_push(skb, sizeof(struct iphdr) + (inet_opt ? inet_opt->opt.optlen : 0));\n skb_reset_network_header(skb);\n iph = ip_hdr(skb);/* 構造ip頭 */\n *((__be16 *)iph) = htons((4 << 12) | (5 << 8) | (inet->tos & 0xff));\n if (ip_dont_fragment(sk, &rt->dst) && !skb->ignore_df)\n iph->frag_off = htons(IP_DF);\n else\n iph->frag_off = 0;\n iph->ttl = ip_select_ttl(inet, &rt->dst);\n iph->protocol = sk->sk_protocol;\n ip_copy_addrs(iph, fl4);\n\n /* Transport layer set skb->h.foo itself. */\n /* 構造ip選項 */\n if (inet_opt && inet_opt->opt.optlen) {\n iph->ihl += inet_opt->opt.optlen >> 2;\n ip_options_build(skb, &inet_opt->opt, inet->inet_daddr, rt, 0);\n }\n\n ip_select_ident_segs(net, skb, sk,\n skb_shinfo(skb)->gso_segs ?: 1);\n\n /* TODO : should we use skb->sk here instead of sk ? */\n /*\n * 設置輸出數據包的QoS類型。\n */\n skb->priority = sk->sk_priority;\n skb->mark = sk->sk_mark;\n\n res = ip_local_out(net, sk, skb); /* 輸出 */\n rcu_read_unlock();\n return res;\n\nno_route:\n rcu_read_unlock();\n /*\n * 如果查找不到對應的路由緩存項,\n * 在此處理,將該數據包丟棄。\n */\n IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);\n kfree_skb(skb);\n return -EHOSTUNREACH;\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"dst_output()實際調用skb_dst(skb)->output(skb),skb_dst(skb)就是skb所對應的路由項。skb_dst(skb)指向的是路由項dst_entry,它的input在收到報文時賦值ip_local_deliver(),而output在發送報文時賦值ip_output(),該函數的作用是處理單播數據報,設置數據報的輸出網絡設備以及網絡層協議類型參數。隨後調用ip_finish_output,觀察數據報長度是否大於MTU,若大於,則調用ip_fragment分片,否則調用ip_finish_output2輸出。在ip_finish_output2函數中會檢測skb的前部空間是否還能存儲鏈路層首部。如果不夠,就會申請更大的存儲空間,最終會調用鄰居子系統的輸出函數neigh_output進行輸出,輸出分爲有二層頭緩存和沒有兩種情況,有緩存時調用neigh_hh_output進行快速輸出,沒有緩存時,則調用鄰居子系統的輸出回調函數進行慢速輸出。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"網絡層時序圖如下圖所示。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c8/c8c85c7a7d1072d0544a24af4fe5b76f.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"網絡層時序圖","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"GDB調試結果如下。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/eb/eba762cacebd82e58142def1e88c9646.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"GDB調試結果","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3、數據鏈路層分析","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"網絡層最終會通過調用dev_queue_xmit來發送報文,在該函數中調用的是__dev_queue_xmit(skb, NULL);,如下所示:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"int dev_queue_xmit(struct sk_buff *skb)\n{\n return __dev_queue_xmit(skb, NULL);\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"直接調用__dev_queue_xmit傳入的參數是一個skb 數據包","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"__dev_queue_xmit函數會根據不同的情況會調用__dev_xmit_skb或者sch_direct_xmit函數,最終會調用dev_hard_start_xmit函數,該函數最終會調用xmit_one來發送一到多個數據包。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據鏈路層時序圖如下所示。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/4b/4ba10e46e236e714ab5e18ce38e1f072.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 數據鏈路層時序圖","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"GDB調試結果。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/3b/3b911be928aac8f843c2b5e6fe137275.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"GDB調試結果","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"四、recv分析","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1、數據鏈路層分析","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在數據鏈路層接受數據並傳遞給上層的步驟如下所示:","attrs":{}}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"一個 package 到達機器的物理網絡適配器,當它接收到數據幀時,就會觸發一箇中斷,並將通過 DMA 傳送到位於 linux kernel 內存中的 rx_ring。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"網卡發出中斷,通知 CPU 有個 package 需要它處理。中斷處理程序主要進行以下一些操作,包括分配 skb_buff 數據結構,並將接收到的數據幀從網絡適配器I/O端口拷貝到skb_buff 緩衝區中;從數據幀中提取出一些信息,並設置 skb_buff相應的參數,這些參數將被上層的網絡協議使用,例如skb->protocol;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"終端處理程序經過簡單處理後,發出一個軟中斷(NET_RX_SOFTIRQ),通知內核接收到新的數據幀。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"內核 2.5 中引入一組新的 API 來處理接收的數據幀,即 NAPI。所以,驅動有兩種方式通知內核:(1) 通過以前的函數netif_rx;(2)通過NAPI機制。該中斷處理程序調用 Network device的 netif_rx_schedule函數,進入軟中斷處理流程,再調用net_rx_action函數。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":5,"align":null,"origin":null},"content":[{"type":"text","text":"該函數關閉中斷,獲取每個 Network device 的 rx_ring 中的所有 package,最終 pacakage 從 rx_ring 中被刪除,進入netif _receive_skb處理流程。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":6,"align":null,"origin":null},"content":[{"type":"text","text":"netif_receive_skb是鏈路層接收數據報的最後一站。它根據註冊在全局數組 ptype_all 和 ptype_base 裏的網絡層數據報類型,把數據報遞交給不同的網絡層協議的接收函數(INET域中主要是ip_rcv和arp_rcv)。該函數主要就是調用第三層協議的接收函數處理該skb包,進入第三層網絡層處理。","attrs":{}}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據鏈路層的時序圖如下所示。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/25/25efa36586f3d3b0cb3958ce7f59f09b.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據鏈路層時序圖","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"GDB調試如下。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/13/13e1249f54c43d8d8a5b8b8b478f85f5.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"GDB調試","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2、網絡層分析","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ip層的入口在ip_rcv函數,該函數首先會做包括 package checksum 在內的各種檢查,如果需要的話會做 IP defragment(將多個分片合併),然後 packet 調用已經註冊的 Pre-routing netfilter hook ,完成後最終到達ip_rcv_finish函數。","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,\n struct net_device *orig_dev)\n{\n struct net *net = dev_net(dev);\n\n skb = ip_rcv_core(skb, net);\n if (skb == NULL)\n return NET_RX_DROP;\n\n return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING,\n net, NULL, skb, dev, NULL,\n ip_rcv_finish);\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ip_rcv_finish函數如下所示:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)\n{\n struct net_device *dev = skb->dev;\n int ret;\n\n /* if ingress device is enslaved to an L3 master device pass the\n * skb to its handler for processing\n */\n skb = l3mdev_ip_rcv(skb);\n if (!skb)\n return NET_RX_SUCCESS;\n\n ret = ip_rcv_finish_core(net, sk, skb, dev, NULL);\n if (ret != NET_RX_DROP)\n ret = dst_input(skb);\n return ret;\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ip_rcv_finish 函數最終會調用ip_route_input函數,進入路由處理環節。它首先會調用 ip_route_input 來更新路由,然後查找 route,決定該 package 將會被髮到本機還是會被轉發還是丟棄:","attrs":{}}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"如果是發到本機的話,調用ip_local_deliver 函數,可能會做 de-fragment(合併多個 IP packet),然後調用ip_local_deliver函數。該函數根據 package 的下一個處理層的 protocal number,調用下一層接口,包括 tcp_v4_rcv (TCP), udp_rcv (UDP),icmp_rcv (ICMP),igmp_rcv(IGMP)。對於 TCP 來說,函數 tcp_v4_rcv 函數會被調用,從而處理流程進入 TCP 棧。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"如果需要轉發 (forward),則進入轉發流程。該流程需要處理 TTL,再調用dst_input函數。該函數會 (1)處理 Netfilter Hook (2)執行 IP fragmentation (3)調用 dev_queue_xmit,進入鏈路層處理流程。","attrs":{}}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"網絡層時序圖如下圖所示。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/00/006071f7b3d76fd254b4703432c84f10.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"網絡層時序圖","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"GDB調試如下圖所示。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/8e/8ee52b02d6532be10029aca61557b5be.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"GDB調試","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3、傳輸層分析","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於recv函數,與send函數類似,調用的系統調用是__sys_recvfrom,其代碼如下所示:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"int __sys_recvfrom(int fd, void __user *ubuf, size_t size, unsigned int flags,\n struct sockaddr __user *addr, int __user *addr_len)\n{\n ......\n err = import_single_range(READ, ubuf, size, &iov, &msg.msg_iter);\n if (unlikely(err))\n return err;\n sock = sockfd_lookup_light(fd, &err, &fput_needed);\n .....\n msg.msg_control = NULL;\n msg.msg_controllen = 0;\n /* Save some cycles and don't copy the address if not needed */\n msg.msg_name = addr ? (struct sockaddr *)&address : NULL;\n /* We assume all kernel code knows the size of sockaddr_storage */\n msg.msg_namelen = 0;\n msg.msg_iocb = NULL;\n msg.msg_flags = 0;\n if (sock->file->f_flags & O_NONBLOCK)\n flags |= MSG_DONTWAIT;\n err = sock_recvmsg(sock, &msg, flags); //調用該函數接受數據\n\n if (err >= 0 && addr != NULL) {\n err2 = move_addr_to_user(&address,\n msg.msg_namelen, addr, addr_len);\n .....\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"__sys_recvfrom通過調用sock_recvmsg來對數據進行接收,該函數實際調用的是sock->ops->recvmsg(sock, msg, msg_data_left(msg), flags); ,同樣類似send函數中,調用的實際上是socket在初始化時賦值給結構體struct proto tcp_prot的函數tcp_rcvmsg,如下所示:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"struct proto tcp_prot = {\n .name = \"TCP\",\n .owner = THIS_MODULE,\n .close = tcp_close,\n .pre_connect = tcp_v4_pre_connect,\n .connect = tcp_v4_connect,\n .disconnect = tcp_disconnect,\n .accept = inet_csk_accept,\n .ioctl = tcp_ioctl,\n .init = tcp_v4_init_sock,\n .destroy = tcp_v4_destroy_sock,\n .shutdown = tcp_shutdown,\n .setsockopt = tcp_setsockopt,\n .getsockopt = tcp_getsockopt,\n .keepalive = tcp_set_keepalive,\n .recvmsg = tcp_recvmsg,\n .sendmsg = tcp_sendmsg,\n ...","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"tcp_rcvmsg的代碼如下所示:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,\n int flags, int *addr_len)\n{\n ......\n if (sk_can_busy_loop(sk) && skb_queue_empty(&sk->sk_receive_queue) &&\n (sk->sk_state == TCP_ESTABLISHED))\n sk_busy_loop(sk, nonblock); //如果接收隊列爲空,則會在該函數內循環等待\n\n lock_sock(sk);\n .....\n if (unlikely(tp->repair)) {\n err = -EPERM;\n if (!(flags & MSG_PEEK))\n goto out;\n\n if (tp->repair_queue == TCP_SEND_QUEUE) \n goto recv_sndq;\n\n err = -EINVAL;\n if (tp->repair_queue == TCP_NO_QUEUE)\n goto out;\n ......\n last = skb_peek_tail(&sk->sk_receive_queue); \n skb_queue_walk(&sk->sk_receive_queue, skb) {\n last = skb;\n ......\n if (!(flags & MSG_TRUNC)) {\n err = skb_copy_datagram_msg(skb, offset, msg, used); //將接收到的數據拷貝到用戶態\n if (err) {\n /* Exception. Bailout! */\n if (!copied)\n copied = -EFAULT;\n break;\n }\n }\n\n *seq += used;\n copied += used;\n len -= used;\n\n tcp_rcv_space_adjust(sk);","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在連接建立後,若沒有數據到來,接收隊列爲空,進程會在sk_busy_loop函數內循環等待,知道接收隊列不爲空,並調用函數數skb_copy_datagram_msg將接收到的數據拷貝到用戶態,該函數內部實際調用的是__skb_datagram_iter,其代碼如下所示:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"int __skb_datagram_iter(const struct sk_buff *skb, int offset,\n struct iov_iter *to, int len, bool fault_short,\n size_t (*cb)(const void *, size_t, void *, struct iov_iter *),\n void *data)\n{\n int start = skb_headlen(skb);\n int i, copy = start - offset, start_off = offset, n;\n struct sk_buff *frag_iter;\n\n /* 拷貝tcp頭部 */\n if (copy > 0) {\n if (copy > len)\n copy = len;\n n = cb(skb->data + offset, copy, data, to);\n offset += n;\n if (n != copy)\n goto short_copy;\n if ((len -= copy) == 0)\n return 0;\n }\n\n /* 拷貝數據部分 */\n for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {\n int end;\n const skb_frag_t *frag = &skb_shinfo(skb)->frags[i];\n\n WARN_ON(start > offset + len);\n\n end = start + skb_frag_size(frag);\n if ((copy = end - offset) > 0) {\n struct page *page = skb_frag_page(frag);\n u8 *vaddr = kmap(page);\n\n if (copy > len)\n copy = len;\n n = cb(vaddr + frag->page_offset +\n offset - start, copy, data, to);\n kunmap(page);\n offset += n;\n if (n != copy)\n goto short_copy;\n if (!(len -= copy))\n return 0;\n }\n start = end;\n }","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"傳輸層時序圖如下圖所示。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/37/37e03b75087797e0a3598ae6fd3e88c4.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 傳輸層時序圖","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" GDB調試如下圖所示。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e0/e00d0d023617acb8a8831aae0eee9bfb.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"GDB調試","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"五、小結","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"從Linux操作系統實現入手,深入的分析了Linux操作系統對於TCP/IP棧的實現原理與具體過程,瞭解了Linux網絡子系統的具體構成及流程,通過這次調研,使我對TCP/IP協議的原理及具體實現有了極其深入的理解。","attrs":{}}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章