Linux內核--網絡棧數據包的傳遞過程

本文轉自： http://blog.csdn.net/yming0221/article/details/7492423

上一篇博文中我們從宏觀上分析了Linux內核中網絡棧的初始化過程，這裏我們再從宏觀上分析一下一個數據包在各網絡層的傳遞的過程。

我們知道網絡的OSI模型和TCP/IP模型層次結構如下：

上文中我們看到了網絡棧的層次結構：

我們就從最底層開始追溯一個數據包的傳遞流程。

1、網絡接口層

* 硬件監聽物理介質，進行數據的接收，當接收的數據填滿了緩衝區，硬件就會產生中斷，中斷產生後，系統會轉向中斷服務子程序。

* 在中斷服務子程序中，數據會從硬件的緩衝區複製到內核的空間緩衝區，幷包裝成一個數據結構（sk_buff），然後調用對驅動層的接口函數netif_rx()將數據包發送給鏈路層。該函數的實現在net/inet/dev.c中，（在整個網絡棧實現中dev.c文件的作用重大，它銜接了其下的驅動層和其上的網絡層，可以稱它爲鏈路層模塊的實現）

該函數的實現如下：

[cpp]view
plaincopyprint?

/* 

 *  Receive a packet from a device driver and queue it for the upper 

 *  (protocol) levels.  It always succeeds. This is the recommended  

 *  interface to use. 

 *    從設備驅動層接受到的數據發送到協議的 

 *    上層，該函數實際是一個接口。 

 */  

void netif_rx(struct sk_buff *skb)  

{  

    static int dropping = 0;  

    /* 

     *  Any received buffers are un-owned and should be discarded 

     *  when freed. These will be updated later as the frames get 

     *  owners. 

     */  

    skb->sk = NULL;  

    skb->free = 1;  

    if(skb->stamp.tv_sec==0)  

        skb->stamp = xtime;  

    /* 

     *  Check that we aren't overdoing things. 

     */  

    if (!backlog_size)  

        dropping = 0;  

    else if (backlog_size > 300)  

        dropping = 1;  

    if (dropping)   

    {  

        kfree_skb(skb, FREE_READ);  

        return;  

    }  

    /* 

     *  Add it to the "backlog" queue.  

     */  

#ifdef CONFIG_SKB_CHECK  

    IS_SKB(skb);  

#endif    

    skb_queue_tail(&backlog,skb);//加入隊列backlog  

    backlog_size++;  

    /* 

     *  If any packet arrived, mark it for processing after the 

     *  hardware interrupt returns. 

     */  

    mark_bh(NET_BH);//下半部分bottom half技術可以減少中斷處理程序的執行時間  

    return;  

}

該函數中用到了bootom half技術，該技術的原理是將中斷處理程序人爲的分爲兩部分，上半部分是實時性要求較高的任務，後半部分可以稍後完成，這樣就可以節省中斷程序的處理時間。可整體的提高系統的性能。該技術將會在後續的博文中詳細分析。

我們從上一篇分析中知道，在網絡棧初始化的時候，已經將NET的下半部分執行函數定義成了net_bh(在socket.c文件中1375行左右)

[cpp]view
plaincopyprint?

bh_base[NET_BH].routine= net_bh;//設置NET 下半部分的處理函數爲net_bh  

* 函數net_bh的實現在net/inet/dev.c中

[cpp]view
plaincopyprint?

/* 

 *  When we are called the queue is ready to grab, the interrupts are 

 *  on and hardware can interrupt and queue to the receive queue a we 

 *  run with no problems. 

 *  This is run as a bottom half after an interrupt handler that does 

 *  mark_bh(NET_BH); 

 */  

void net_bh(void *tmp)  

{  

    struct sk_buff *skb;  

    struct packet_type *ptype;  

    struct packet_type *pt_prev;  

    unsigned short type;  

    /* 

     *  Atomically check and mark our BUSY state.  

     */  

    if (set_bit(1, (void*)&in_bh))//標記BUSY狀態  

        return;  

    /* 

     *  Can we send anything now? We want to clear the 

     *  decks for any more sends that get done as we 

     *  process the input. 

     */  

    dev_transmit();//調用dev_tinit()函數發送數據  

    /* 

     *  Any data left to process. This may occur because a 

     *  mark_bh() is done after we empty the queue including 

     *  that from the device which does a mark_bh() just after 

     */  

    cli();//防止隊列操作錯誤，需要關中斷和開中斷  

    /* 

     *  While the queue is not empty 

     */  

    while((skb=skb_dequeue(&backlog))!=NULL)//出隊直到隊列爲空  

    {  

        /* 

         *  We have a packet. Therefore the queue has shrunk 

         */  

        backlog_size--;//隊列元素個數減一  

        sti();  

           /* 

        *   Bump the pointer to the next structure. 

        *   This assumes that the basic 'skb' pointer points to 

        *   the MAC header, if any (as indicated by its "length" 

        *   field).  Take care now! 

        */  

        skb->h.raw = skb->data + skb->dev->hard_header_len;  

        skb->len -= skb->dev->hard_header_len;  

           /* 

        *   Fetch the packet protocol ID.  This is also quite ugly, as 

        *   it depends on the protocol driver (the interface itself) to 

        *   know what the type is, or where to get it from.  The Ethernet 

        *   interfaces fetch the ID from the two bytes in the Ethernet MAC 

        *   header (the h_proto field in struct ethhdr), but other drivers 

        *   may either use the ethernet ID's or extra ones that do not 

        *   clash (eg ETH_P_AX25). We could set this before we queue the 

        *   frame. In fact I may change this when I have time. 

        */  

        type = skb->dev->type_trans(skb, skb->dev);//取出該數據包所屬的協議類型  

        /* 

         *  We got a packet ID.  Now loop over the "known protocols" 

         *  table (which is actually a linked list, but this will 

         *  change soon if I get my way- FvK), and forward the packet 

         *  to anyone who wants it. 

         * 

         *  [FvK didn't get his way but he is right this ought to be 

         *  hashed so we typically get a single hit. The speed cost 

         *  here is minimal but no doubt adds up at the 4,000+ pkts/second 

         *  rate we can hit flat out] 

         */  

        pt_prev = NULL;  

        for (ptype = ptype_base; ptype != NULL; ptype = ptype->next) //遍歷ptype_base所指向的網絡協議隊列  

        {  

            //判斷協議號是否匹配  

            if ((ptype->type == type || ptype->type == htons(ETH_P_ALL)) && (!ptype->dev || ptype->dev==skb->dev))  

            {  

                /* 

                 *  We already have a match queued. Deliver 

                 *  to it and then remember the new match 

                 */  

                if(pt_prev)  

                {  

                    struct sk_buff *skb2;  

                    skb2=skb_clone(skb, GFP_ATOMIC);//複製數據包結構  

                    /* 

                     *  Kick the protocol handler. This should be fast 

                     *  and efficient code. 

                     */  

                    if(skb2)  

                        pt_prev->func(skb2, skb->dev, pt_prev);//調用相應協議的處理函數，  

                                            //這裏和網絡協議的種類有關係  

                                            //如IP 協議的處理函數就是ip_rcv  

                }  

                /* Remember the current last to do */  

                pt_prev=ptype;  

            }  

        } /* End of protocol list loop */  

        /* 

         *  Is there a last item to send to ? 

         */  

        if(pt_prev)  

            pt_prev->func(skb, skb->dev, pt_prev);  

        /* 

         *  Has an unknown packet has been received ? 

         */  

        else  

            kfree_skb(skb, FREE_WRITE);  

        /* 

         *  Again, see if we can transmit anything now.  

         *  [Ought to take this out judging by tests it slows 

         *   us down not speeds us up] 

         */  

        dev_transmit();  

        cli();  

    }   /* End of queue loop */  

    /* 

     *  We have emptied the queue 

     */  

    in_bh = 0;//BUSY狀態還原  

    sti();  

    /* 

     *  One last output flush. 

     */  

    dev_transmit();//清空緩衝區  

}

2、網絡層
* 就以IP數據包爲例來說明，那麼從鏈路層向網絡層傳遞時將調用ip_rcv函數。該函數完成本層的處理後會根據IP首部中使用的傳輸層協議來調用相應協議的處理函數。

UDP對應udp_rcv、TCP對應tcp_rcv、ICMP對應icmp_rcv、IGMP對應igmp_rcv（雖然這裏的ICMP,IGMP一般成爲網絡層協議，但是實際上他們都封裝在IP協議裏面，作爲傳輸層對待）

這個函數比較複雜，後續會詳細分析。這裏粘貼一下，讓我們對整體瞭解更清楚

[cpp]view
plaincopyprint?

/* 

 *  This function receives all incoming IP datagrams. 

 */  

int ip_rcv(struct sk_buff *skb, struct device *dev, struct packet_type *pt)  

{  

    struct iphdr *iph = skb->h.iph;  

    struct sock *raw_sk=NULL;  

    unsigned char hash;  

    unsigned char flag = 0;  

    unsigned char opts_p = 0;   /* Set iff the packet has options. */  

    struct inet_protocol *ipprot;  

    static struct options opt; /* since we don't use these yet, and they 

                take up stack space. */  

    int brd=IS_MYADDR;  

    int is_frag=0;  

#ifdef CONFIG_IP_FIREWALL  

    int err;  

#endif    

    ip_statistics.IpInReceives++;  

    /* 

     *  Tag the ip header of this packet so we can find it 

     */  

    skb->ip_hdr = iph;  

    /* 

     *  Is the datagram acceptable? 

     * 

     *  1.  Length at least the size of an ip header 

     *  2.  Version of 4 

     *  3.  Checksums correctly. [Speed optimisation for later, skip loopback checksums] 

     *  (4. We ought to check for IP multicast addresses and undefined types.. does this matter ?) 

     */  

    if (skb->len<sizeof(struct iphdr) || iph->ihl<5 || iph->version != 4 ||  

        skb->len<ntohs(iph->tot_len) || ip_fast_csum((unsigned char *)iph, iph->ihl) !=0)  

    {  

        ip_statistics.IpInHdrErrors++;  

        kfree_skb(skb, FREE_WRITE);  

        return(0);  

    }  

    /* 

     *  See if the firewall wants to dispose of the packet.  

     */  

#ifdef  CONFIG_IP_FIREWALL  

    if ((err=ip_fw_chk(iph,dev,ip_fw_blk_chain,ip_fw_blk_policy, 0))!=1)  

    {  

        if(err==-1)  

            icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PORT_UNREACH, 0, dev);  

        kfree_skb(skb, FREE_WRITE);  

        return 0;     

    }  

#endif  

    /* 

     *  Our transport medium may have padded the buffer out. Now we know it 

     *  is IP we can trim to the true length of the frame. 

     */  

    skb->len=ntohs(iph->tot_len);  

    /* 

     *  Next analyse the packet for options. Studies show under one packet in 

     *  a thousand have options.... 

     */  

    if (iph->ihl != 5)  

    {   /* Fast path for the typical optionless IP packet. */  

        memset((char *) &opt, 0, sizeof(opt));  

        if (do_options(iph, &opt) != 0)  

            return 0;  

        opts_p = 1;  

    }  

    /* 

     *  Remember if the frame is fragmented. 

     */  

    if(iph->frag_off)  

    {  

        if (iph->frag_off & 0x0020)  

            is_frag|=1;  

        /* 

         *  Last fragment ? 

         */  

        if (ntohs(iph->frag_off) & 0x1fff)  

            is_frag|=2;  

    }  

    /* 

     *  Do any IP forwarding required.  chk_addr() is expensive -- avoid it someday. 

     * 

     *  This is inefficient. While finding out if it is for us we could also compute 

     *  the routing table entry. This is where the great unified cache theory comes 

     *  in as and when someone implements it 

     * 

     *  For most hosts over 99% of packets match the first conditional 

     *  and don't go via ip_chk_addr. Note: brd is set to IS_MYADDR at 

     *  function entry. 

     */  

    if ( iph->daddr != skb->dev->pa_addr && (brd = ip_chk_addr(iph->daddr)) == 0)  

    {  

        /* 

         *  Don't forward multicast or broadcast frames. 

         */  

        if(skb->pkt_type!=PACKET_HOST || brd==IS_BROADCAST)  

        {  

            kfree_skb(skb,FREE_WRITE);  

            return 0;  

        }  

        /* 

         *  The packet is for another target. Forward the frame 

         */  

#ifdef CONFIG_IP_FORWARD  

        ip_forward(skb, dev, is_frag);  

#else  

/*      printk("Machine %lx tried to use us as a forwarder to %lx but we have forwarding disabled!\n", 

            iph->saddr,iph->daddr);*/  

        ip_statistics.IpInAddrErrors++;  

#endif  

        /* 

         *  The forwarder is inefficient and copies the packet. We 

         *  free the original now. 

         */  

        kfree_skb(skb, FREE_WRITE);  

        return(0);  

    }  

#ifdef CONFIG_IP_MULTICAST    

    if(brd==IS_MULTICAST && iph->daddr!=IGMP_ALL_HOSTS && !(dev->flags&IFF_LOOPBACK))  

    {  

        /* 

         *  Check it is for one of our groups 

         */  

        struct ip_mc_list *ip_mc=dev->ip_mc_list;  

        do  

        {  

            if(ip_mc==NULL)  

            {     

                kfree_skb(skb, FREE_WRITE);  

                return 0;  

            }  

            if(ip_mc->multiaddr==iph->daddr)  

                break;  

            ip_mc=ip_mc->next;  

        }  

        while(1);  

    }  

#endif  

    /* 

     *  Account for the packet 

     */  

#ifdef CONFIG_IP_ACCT  

    ip_acct_cnt(iph,dev, ip_acct_chain);  

#endif    

    /* 

     * Reassemble IP fragments. 

     */  

    if(is_frag)  

    {  

        /* Defragment. Obtain the complete packet if there is one */  

        skb=ip_defrag(iph,skb,dev);  

        if(skb==NULL)  

            return 0;  

        skb->dev = dev;  

        iph=skb->h.iph;  

    }  

    /* 

     *  Point into the IP datagram, just past the header. 

     */  

    skb->ip_hdr = iph;  

    skb->h.raw += iph->ihl*4;  

    /* 

     *  Deliver to raw sockets. This is fun as to avoid copies we want to make no surplus copies. 

     */  

    hash = iph->protocol & (SOCK_ARRAY_SIZE-1);  

    /* If there maybe a raw socket we must check - if not we don't care less */  

    if((raw_sk=raw_prot.sock_array[hash])!=NULL)  

    {  

        struct sock *sknext=NULL;  

        struct sk_buff *skb1;  

        raw_sk=get_sock_raw(raw_sk, hash,  iph->saddr, iph->daddr);  

        if(raw_sk)  /* Any raw sockets */  

        {  

            do  

            {  

                /* Find the next */  

                sknext=get_sock_raw(raw_sk->next, hash, iph->saddr, iph->daddr);  

                if(sknext)  

                    skb1=skb_clone(skb, GFP_ATOMIC);  

                else  

                    break;  /* One pending raw socket left */  

                if(skb1)  

                    raw_rcv(raw_sk, skb1, dev, iph->saddr,iph->daddr);  

                raw_sk=sknext;  

            }  

            while(raw_sk!=NULL);  

            /* Here either raw_sk is the last raw socket, or NULL if none */  

            /* We deliver to the last raw socket AFTER the protocol checks as it avoids a surplus copy */  

        }  

    }  

    /* 

     *  skb->h.raw now points at the protocol beyond the IP header. 

     */  

    hash = iph->protocol & (MAX_INET_PROTOS -1);  

    for (ipprot = (struct inet_protocol *)inet_protos[hash];ipprot != NULL;ipprot=(struct inet_protocol *)ipprot->next)  

    {  

        struct sk_buff *skb2;  

        if (ipprot->protocol != iph->protocol)  

            continue;  

       /* 

    *   See if we need to make a copy of it.  This will 

    *   only be set if more than one protocol wants it. 

    *   and then not for the last one. If there is a pending 

    *   raw delivery wait for that 

    */  

        if (ipprot->copy || raw_sk)  

        {  

            skb2 = skb_clone(skb, GFP_ATOMIC);  

            if(skb2==NULL)  

                continue;  

        }  

        else  

        {  

            skb2 = skb;  

        }  

        flag = 1;  

           /* 

        * Pass on the datagram to each protocol that wants it, 

        * based on the datagram protocol.  We should really 

        * check the protocol handler's return values here... 

        */  

        ipprot->handler(skb2, dev, opts_p ? &opt : 0, iph->daddr,  

                (ntohs(iph->tot_len) - (iph->ihl * 4)),  

                iph->saddr, 0, ipprot);  

    }  

    /* 

     * All protocols checked. 

     * If this packet was a broadcast, we may *not* reply to it, since that 

     * causes (proven, grin) ARP storms and a leakage of memory (i.e. all 

     * ICMP reply messages get queued up for transmission...) 

     */  

    if(raw_sk!=NULL)    /* Shift to last raw user */  

        raw_rcv(raw_sk, skb, dev, iph->saddr, iph->daddr);  

    else if (!flag)     /* Free and report errors */  

    {  

        if (brd != IS_BROADCAST && brd!=IS_MULTICAST)  

            icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PROT_UNREACH, 0, dev);  

        kfree_skb(skb, FREE_WRITE);  

    }  

    return(0);  

}

3、傳輸層

如果在IP數據報的首部標明的是使用TCP傳輸數據，則在上述函數中會調用tcp_rcv函數。該函數的大體處理流程爲：

“所有使用TCP 協議的套接字對應sock 結構都被掛入tcp_prot 全局變量表示的proto 結構之sock_array 數組中，採用以本地端口號爲索引的插入方式，所以當tcp_rcv 函數接收到一個數據包，在完成必要的檢查和處理後，其將以TCP 協議首部中目的端口號（對於一個接收的數據包而言，其目的端口號就是本地所使用的端口號）爲索引，在tcp_prot 對應sock 結構之sock_array 數組中得到正確的sock 結構隊列，在輔之以其他條件遍歷該隊列進行對應sock 結構的查詢，在得到匹配的sock 結構後，將數據包掛入該sock 結構中的緩存隊列中（由sock 結構中receive_queue 字段指向），從而完成數據包的最終接收。”

該函數的實現也會比較複雜，這是由TCP協議的複雜功能決定的。附代碼如下：

[cpp]view
plaincopyprint?

/* 

 *  A TCP packet has arrived. 

 */  

int tcp_rcv(struct sk_buff *skb, struct device *dev, struct options *opt,  

    unsigned long daddr, unsigned short len,  

    unsigned long saddr, int redo, struct inet_protocol * protocol)  

{  

    struct tcphdr *th;  

    struct sock *sk;  

    int syn_ok=0;  

    if (!skb)   

    {  

        printk("IMPOSSIBLE 1\n");  

        return(0);  

    }  

    if (!dev)   

    {  

        printk("IMPOSSIBLE 2\n");  

        return(0);  

    }  

    tcp_statistics.TcpInSegs++;  

    if(skb->pkt_type!=PACKET_HOST)  

    {  

        kfree_skb(skb,FREE_READ);  

        return(0);  

    }  

    th = skb->h.th;  

    /* 

     *  Find the socket. 

     */  

    sk = get_sock(&tcp_prot, th->dest, saddr, th->source, daddr);  

    /* 

     *  If this socket has got a reset it's to all intents and purposes  

     *  really dead. Count closed sockets as dead. 

     * 

     *  Note: BSD appears to have a bug here. A 'closed' TCP in BSD 

     *  simply drops data. This seems incorrect as a 'closed' TCP doesn't 

     *  exist so should cause resets as if the port was unreachable. 

     */  

    if (sk!=NULL && (sk->zapped || sk->state==TCP_CLOSE))  

        sk=NULL;  

    if (!redo)   

    {  

        if (tcp_check(th, len, saddr, daddr ))   

        {  

            skb->sk = NULL;  

            kfree_skb(skb,FREE_READ);  

            /* 

             *  We don't release the socket because it was 

             *  never marked in use. 

             */  

            return(0);  

        }  

        th->seq = ntohl(th->seq);  

        /* See if we know about the socket. */  

        if (sk == NULL)   

        {  

            /* 

             *  No such TCB. If th->rst is 0 send a reset (checked in tcp_reset) 

             */  

            tcp_reset(daddr, saddr, th, &tcp_prot, opt,dev,skb->ip_hdr->tos,255);  

            skb->sk = NULL;  

            /* 

             *  Discard frame 

             */  

            kfree_skb(skb, FREE_READ);  

            return(0);  

        }  

        skb->len = len;  

        skb->acked = 0;  

        skb->used = 0;  

        skb->free = 0;  

        skb->saddr = daddr;  

        skb->daddr = saddr;  

        /* We may need to add it to the backlog here. */  

        cli();  

        if (sk->inuse)   

        {  

            skb_queue_tail(&sk->back_log, skb);  

            sti();  

            return(0);  

        }  

        sk->inuse = 1;  

        sti();  

    }  

    else  

    {  

        if (sk==NULL)   

        {  

            tcp_reset(daddr, saddr, th, &tcp_prot, opt,dev,skb->ip_hdr->tos,255);  

            skb->sk = NULL;  

            kfree_skb(skb, FREE_READ);  

            return(0);  

        }  

    }  

    if (!sk->prot)   

    {  

        printk("IMPOSSIBLE 3\n");  

        return(0);  

    }  

    /* 

     *  Charge the memory to the socket.  

     */  

    if (sk->rmem_alloc + skb->mem_len >= sk->rcvbuf)   

    {  

        kfree_skb(skb, FREE_READ);  

        release_sock(sk);  

        return(0);  

    }  

    skb->sk=sk;  

    sk->rmem_alloc += skb->mem_len;  

    /* 

     *  This basically follows the flow suggested by RFC793, with the corrections in RFC1122. We 

     *  don't implement precedence and we process URG incorrectly (deliberately so) for BSD bug 

     *  compatibility. We also set up variables more thoroughly [Karn notes in the 

     *  KA9Q code the RFC793 incoming segment rules don't initialise the variables for all paths]. 

     */  

    if(sk->state!=TCP_ESTABLISHED)       /* Skip this lot for normal flow */  

    {  

        /* 

         *  Now deal with unusual cases. 

         */  

        if(sk->state==TCP_LISTEN)  

        {  

            if(th->ack)  /* These use the socket TOS.. might want to be the received TOS */  

                tcp_reset(daddr,saddr,th,sk->prot,opt,dev,sk->ip_tos, sk->ip_ttl);  

            /* 

             *  We don't care for RST, and non SYN are absorbed (old segments) 

             *  Broadcast/multicast SYN isn't allowed. Note - bug if you change the 

             *  netmask on a running connection it can go broadcast. Even Sun's have 

             *  this problem so I'm ignoring it  

             */  

            if(th->rst || !th->syn || th->ack || ip_chk_addr(daddr)!=IS_MYADDR)  

            {  

                kfree_skb(skb, FREE_READ);  

                release_sock(sk);  

                return 0;  

            }  

            /*   

             *  Guess we need to make a new socket up  

             */  

            tcp_conn_request(sk, skb, daddr, saddr, opt, dev, tcp_init_seq());  

            /* 

             *  Now we have several options: In theory there is nothing else 

             *  in the frame. KA9Q has an option to send data with the syn, 

             *  BSD accepts data with the syn up to the [to be] advertised window 

             *  and Solaris 2.1 gives you a protocol error. For now we just ignore 

             *  it, that fits the spec precisely and avoids incompatibilities. It 

             *  would be nice in future to drop through and process the data. 

             */  

            release_sock(sk);  

            return 0;  

        }  

        /* retransmitted SYN? */  

        if (sk->state == TCP_SYN_RECV && th->syn && th->seq+1 == sk->acked_seq)  

        {  

            kfree_skb(skb, FREE_READ);  

            release_sock(sk);  

            return 0;  

        }  

        /* 

         *  SYN sent means we have to look for a suitable ack and either reset 

         *  for bad matches or go to connected  

         */  

        if(sk->state==TCP_SYN_SENT)  

        {  

            /* Crossed SYN or previous junk segment */  

            if(th->ack)  

            {  

                /* We got an ack, but it's not a good ack */  

                if(!tcp_ack(sk,th,saddr,len))  

                {  

                    /* Reset the ack - its an ack from a  

                       different connection  [ th->rst is checked in tcp_reset()] */  

                    tcp_statistics.TcpAttemptFails++;  

                    tcp_reset(daddr, saddr, th,  

                        sk->prot, opt,dev,sk->ip_tos,sk->ip_ttl);  

                    kfree_skb(skb, FREE_READ);  

                    release_sock(sk);  

                    return(0);  

                }  

                if(th->rst)  

                    return tcp_std_reset(sk,skb);  

                if(!th->syn)  

                {  

                    /* A valid ack from a different connection 

                       start. Shouldn't happen but cover it */  

                    kfree_skb(skb, FREE_READ);  

                    release_sock(sk);  

                    return 0;  

                }  

                /* 

                 *  Ok.. it's good. Set up sequence numbers and 

                 *  move to established. 

                 */  

                syn_ok=1;   /* Don't reset this connection for the syn */  

                sk->acked_seq=th->seq+1;  

                sk->fin_seq=th->seq;  

                tcp_send_ack(sk->sent_seq,sk->acked_seq,sk,th,sk->daddr);  

                tcp_set_state(sk, TCP_ESTABLISHED);  

                tcp_options(sk,th);  

                sk->dummy_th.dest=th->source;  

                sk->copied_seq = sk->acked_seq;  

                if(!sk->dead)  

                {  

                    sk->state_change(sk);  

                    sock_wake_async(sk->socket, 0);  

                }  

                if(sk->max_window==0)  

                {  

                    sk->max_window = 32;  

                    sk->mss = min(sk->max_window, sk->mtu);  

                }  

            }  

            else  

            {  

                /* See if SYN's cross. Drop if boring */  

                if(th->syn && !th->rst)  

                {  

                    /* Crossed SYN's are fine - but talking to 

                       yourself is right out... */  

                    if(sk->saddr==saddr && sk->daddr==daddr &&  

                        sk->dummy_th.source==th->source &&  

                        sk->dummy_th.dest==th->dest)  

                    {  

                        tcp_statistics.TcpAttemptFails++;  

                        return tcp_std_reset(sk,skb);  

                    }  

                    tcp_set_state(sk,TCP_SYN_RECV);  

                    /* 

                     *  FIXME: 

                     *  Must send SYN|ACK here 

                     */  

                }         

                /* Discard junk segment */  

                kfree_skb(skb, FREE_READ);  

                release_sock(sk);  

                return 0;  

            }  

            /* 

             *  SYN_RECV with data maybe.. drop through 

             */  

            goto rfc_step6;  

        }  

    /* 

     *  BSD has a funny hack with TIME_WAIT and fast reuse of a port. There is 

     *  a more complex suggestion for fixing these reuse issues in RFC1644 

     *  but not yet ready for general use. Also see RFC1379. 

     */  

#define BSD_TIME_WAIT  

#ifdef BSD_TIME_WAIT  

        if (sk->state == TCP_TIME_WAIT && th->syn && sk->dead &&   

            after(th->seq, sk->acked_seq) && !th->rst)  

        {  

            long seq=sk->write_seq;  

            if(sk->debug)  

                printk("Doing a BSD time wait\n");  

            tcp_statistics.TcpEstabResets++;         

            sk->rmem_alloc -= skb->mem_len;  

            skb->sk = NULL;  

            sk->err=ECONNRESET;  

            tcp_set_state(sk, TCP_CLOSE);  

            sk->shutdown = SHUTDOWN_MASK;  

            release_sock(sk);  

            sk=get_sock(&tcp_prot, th->dest, saddr, th->source, daddr);  

            if (sk && sk->state==TCP_LISTEN)  

            {  

                sk->inuse=1;  

                skb->sk = sk;  

                sk->rmem_alloc += skb->mem_len;  

                tcp_conn_request(sk, skb, daddr, saddr,opt, dev,seq+128000);  

                release_sock(sk);  

                return 0;  

            }  

            kfree_skb(skb, FREE_READ);  

            return 0;  

        }  

#endif    

    }  

    /* 

     *  We are now in normal data flow (see the step list in the RFC) 

     *  Note most of these are inline now. I'll inline the lot when 

     *  I have time to test it hard and look at what gcc outputs  

     */  

    if(!tcp_sequence(sk,th,len,opt,saddr,dev))  

    {  

        kfree_skb(skb, FREE_READ);  

        release_sock(sk);  

        return 0;  

    }  

    if(th->rst)  

        return tcp_std_reset(sk,skb);  

    /* 

     *  !syn_ok is effectively the state test in RFC793. 

     */  

    if(th->syn && !syn_ok)  

    {  

        tcp_reset(daddr,saddr,th, &tcp_prot, opt, dev, skb->ip_hdr->tos, 255);  

        return tcp_std_reset(sk,skb);     

    }  

    /* 

     *  Process the ACK 

     */  

    if(th->ack && !tcp_ack(sk,th,saddr,len))  

    {  

        /* 

         *  Our three way handshake failed. 

         */  

        if(sk->state==TCP_SYN_RECV)  

        {  

            tcp_reset(daddr, saddr, th,sk->prot, opt, dev,sk->ip_tos,sk->ip_ttl);  

        }  

        kfree_skb(skb, FREE_READ);  

        release_sock(sk);  

        return 0;  

    }  

rfc_step6:      /* I'll clean this up later */  

    /* 

     *  Process urgent data 

     */  

    if(tcp_urg(sk, th, saddr, len))  

    {  

        kfree_skb(skb, FREE_READ);  

        release_sock(sk);  

        return 0;  

    }  

    /* 

     *  Process the encapsulated data 

     */  

    if(tcp_data(skb,sk, saddr, len))  

    {  

        kfree_skb(skb, FREE_READ);  

        release_sock(sk);  

        return 0;  

    }  

    /* 

     *  And done 

     */   

    release_sock(sk);  

    return 0;  

}

4、應用層

當用戶需要接收數據時，首先根據文件描述符inode得到socket結構和sock結構，然後從sock結構中指向的隊列recieve_queue中讀取數據包，將數據包COPY到用戶空間緩衝區。數據就完整的從硬件中傳輸到用戶空間。這樣也完成了一次完整的從下到上的傳輸。

曦軒

發佈了33 篇原創文章 · 獲贊 64 · 訪問量 22萬+

私信關注

Linux內核--網絡棧數據包的傳遞過程

10分鐘搞定Mysql主從部署配置

如何使用 JS 判斷用戶是否處於活躍狀態

「Pygors跨平臺GUI」2：安裝MinGW-w64、MSYS2還是WSL2

[轉帖]

python列出centos7內存使用前50的進程信息

「Pygors跨平臺GUI」1：Pygors跨平臺GUI應用研究

一鍵自動化博客發佈工具,用過的人都說好(掘金篇)

lightdb數據庫超時相關控制參數

lightdb秒級增加列和刪除列（not null帶默認值）

Java ThreadPoolShutdown

Cloud App -- Baidu App Engine爬蟲應用開發

C/C++ -- 編程中的內存屏障(Memory Barriers) (1)

UtilBox基礎組件

UtilBox(ub)基礎組件 -- Log日誌(1)

內存使用技巧及內存池實現(二)

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結