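Here is how the problem came up.
Action: while the client is requesting data from the server, suddenly unplug the client's network cable.
Symptom: the client blocks forever, and the server-side socket never goes away.
A search online suggested setting the KEEPALIVE option.
So I set the KEEPALIVE option on both the client and the server.
The code is as follows: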
int keepalive = 1;    // enable keepalive
int keepidle = 10;    // start sending probes after 10 s of idle time (system default: 2 hours)
int keepinterval = 1; // interval between probes (system default: 75 s)
int keepcount = 5;    // number of probes; if all 5 go unanswered, the peer is considered disconnected (system default: 9)
setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &keepalive, sizeof(keepalive));
setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &keepidle, sizeof(keepidle));
setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &keepinterval, sizeof(keepinterval));
setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &keepcount, sizeof(keepcount));
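The snippet above ignores the return values of setsockopt, so a rejected option would go unnoticed. Below is a minimal sketch of the same four calls with error checks, plus the rough arithmetic for how long an idle dead peer takes to be noticed. The helper names enable_keepalive and keepalive_detect_seconds are my own, not from any library, and the code assumes Linux headers.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: enable keepalive on fd with the given timings,
 * all in seconds. Returns 0 on success, or -1 on the first setsockopt
 * failure (errno is left set by setsockopt). */
int enable_keepalive(int fd, int idle, int interval, int count)
{
    int on = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof(interval)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count)) < 0)
        return -1;
    return 0;
}

/* Rough time, in seconds, before an idle connection to a dead peer is
 * dropped: the idle period, then `count` probes spaced `interval` apart. */
int keepalive_detect_seconds(int idle, int interval, int count)
{
    return idle + interval * count;
}
```

With the values used in this post, keepalive_detect_seconds(10, 1, 5) gives 15, so an idle connection to a vanished peer should be dropped after roughly 15 seconds.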
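With this in place the client was fine and could close the connection on its own, but the server still waited forever; in other words, keepalive had no effect on the server side.
I never actually found the root cause. As an aside, Baidu search really is not much use (and Google happens to be blocked, and the company will not pay for a VPN, which is a little sad).
Later, using a Google IP that was not blocked, I found this option: TCP_USER_TIMEOUT (since Linux 2.6.37).
Link: http://man7.org/linux/man-pages/man7/tcp.7.html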
This option takes an unsigned int as an argument. When the
value is greater than 0, it specifies the maximum amount of
time in milliseconds that transmitted data may remain
unacknowledged before TCP will forcibly close the
corresponding connection and return ETIMEDOUT to the
application. If the option value is specified as 0, TCP will
use the system default.
Increasing user timeouts allows a TCP connection to survive
extended periods without end-to-end connectivity. Decreasing
user timeouts allows applications to "fail fast", if so
desired. Otherwise, failure may take up to 20 minutes with
the current system defaults in a normal WAN environment.
This option can be set during any state of a TCP connection,
but is only effective during the synchronized states of a
connection (ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT,
CLOSING, and LAST-ACK). Moreover, when used with the TCP
keepalive (SO_KEEPALIVE) option, TCP_USER_TIMEOUT will
override keepalive to determine when to close a connection due
to keepalive failure.
The option has no effect on when TCP retransmits a packet, nor
when a keepalive probe is sent.
This option, like many others, will be inherited by the socket
returned by accept(2), if it was set on the listening socket.
Further details on the user timeout feature can be found in
RFC 793 and RFC 5482 ("TCP User Timeout Option").
So we added the TCP_USER_TIMEOUT option on the server, and the problem was solved.
unsigned int timeout = 10000; // 10 s, in milliseconds
setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT, &timeout, sizeof(timeout));
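A slightly more defensive version of the same call is sketched below: it checks the result of setsockopt and can read the value back with getsockopt to confirm what the kernel stored. The function names set_user_timeout_ms and get_user_timeout_ms are hypothetical, and the code assumes a Linux system whose headers define TCP_USER_TIMEOUT. Per the man page excerpt above, setting the option on the listening socket should be enough for accepted sockets to inherit it.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: set TCP_USER_TIMEOUT on fd, in milliseconds.
 * Returns 0 on success, -1 on failure (errno set by setsockopt). */
int set_user_timeout_ms(int fd, unsigned int timeout_ms)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                      &timeout_ms, sizeof(timeout_ms));
}

/* Hypothetical helper: read the current TCP_USER_TIMEOUT of fd into
 * *timeout_ms. Returns 0 on success, -1 on failure. */
int get_user_timeout_ms(int fd, unsigned int *timeout_ms)
{
    socklen_t len = sizeof(*timeout_ms);
    return getsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT, timeout_ms, &len);
}
```

On the server this would typically be called once on the listening socket, before accept, with the same 10000 ms used above.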
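Searching further, I found confirmation in the article below.
Some excerpts follow; for the original, see: http://blog.leeyiw.org/tcp-keep-alive/
Using TCP Keep-alive and TCP_USER_TIMEOUT to determine whether the peer is alive
Problem 1: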
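When the peer's network cable is unplugged, or its network card is removed or disabled, the peer has no chance to send a TCP RST or FIN to the local operating system to close the connection. The local operating system therefore does not consider the peer dead, and a call to send still returns the number of bytes we asked to send (send only copies the data into the kernel's send buffer). Since the return value of send cannot tell us whether the peer is alive, we need the TCP Keep-alive mechanism.
"Unix Network Programming, Volume 1" explains that the SO_KEEPALIVE socket option enables the keep-alive mechanism on a socket.
Once the keepalive option is set on a TCP socket, if no data is exchanged in either direction on that socket for 2 hours, TCP automatically sends a keepalive probe to the peer.
TCP provides this mechanism to help us determine whether the peer is alive. If the peer does not respond properly to the keepalive probes, the next send or recv on the socket fails, and the application can detect the error.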
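To make "the application can detect the error" concrete, here is a small sketch of how a failed send or recv might be interpreted from errno. The enum and the function classify_socket_errno are hypothetical names of mine; the errno meanings follow the keepalive discussion in "Unix Network Programming" (no reply to the probes gives ETIMEDOUT, an ICMP unreachable gives EHOSTUNREACH, a rebooted peer answers with RST and gives ECONNRESET).

```c
#include <errno.h>

/* Hypothetical classification of a failed send()/recv() on a socket
 * that has keepalive enabled. */
typedef enum {
    PEER_RETRY,        /* transient condition: call again later */
    PEER_TIMED_OUT,    /* probes (or data) went unanswered */
    PEER_UNREACHABLE,  /* network reported the host as unreachable */
    PEER_RESET,        /* peer answered with RST, e.g. after a reboot */
    PEER_OTHER_ERROR   /* anything else: treat the connection as failed */
} peer_status;

/* Map the errno left by a failed send()/recv() to a peer_status. */
peer_status classify_socket_errno(int err)
{
    if (err == EINTR || err == EAGAIN || err == EWOULDBLOCK)
        return PEER_RETRY;
    if (err == ETIMEDOUT)
        return PEER_TIMED_OUT;
    if (err == EHOSTUNREACH || err == ENETUNREACH)
        return PEER_UNREACHABLE;
    if (err == ECONNRESET)
        return PEER_RESET;
    return PEER_OTHER_ERROR;
}
```

A caller would pass errno right after recv or send returns -1, and close the socket for every result other than PEER_RETRY.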
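Problem 2:
If the sender has transmitted data for which it has not yet received an ACK from the receiver, the TCP Keep-alive mechanism is not started. TCP runs its retransmission timer instead, so TCP Keep-alive is ineffective for as long as ACKs are outstanding.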
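To get a feel for how long the retransmission path can keep a dead connection open without TCP_USER_TIMEOUT, the arithmetic can be sketched as below. The numbers are assumptions on my part, not from the excerpt: a first retransmission timeout of 200 ms, a cap of 120 s, and 15 retries (the usual Linux defaults for TCP_RTO_MIN, TCP_RTO_MAX and tcp_retries2). Real timeouts depend on the measured round-trip time, so treat this as an order-of-magnitude estimate. The function name retransmit_budget_ms is hypothetical.

```c
/* Hypothetical estimate of how long TCP keeps retransmitting unacked
 * data before giving up, assuming exponential backoff that starts at
 * rto_min_ms, doubles after every timeout and is capped at rto_max_ms
 * (rto_min_ms <= rto_max_ms). With `retries` retransmissions there are
 * retries + 1 timeouts to wait out. Result is in milliseconds. */
long retransmit_budget_ms(long rto_min_ms, long rto_max_ms, int retries)
{
    long total = 0;
    long rto = rto_min_ms;

    for (int i = 0; i <= retries; i++) {
        total += rto;
        /* Double the timeout for the next round, up to the cap. */
        rto = (rto * 2 > rto_max_ms) ? rto_max_ms : rto * 2;
    }
    return total;
}
```

Under those assumptions, retransmit_budget_ms(200, 120000, 15) comes to 924600 ms, about 15 minutes, which is in line with the "up to 20 minutes" figure in the man page and explains why the server appeared to wait forever until TCP_USER_TIMEOUT was set to 10 s.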