Using keepalive and timeout to detect dead connections
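Here is how the problem came up.
Steps: while the client is requesting data from the server, unplug the client's network cable.
Symptom: the client waits forever, and the server-side socket never goes away.
After searching online, I learned that the KEEPALIVE option needs to be set.
So I set the KEEPALIVE option on both the client and the server.
The code is as follows: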
int keepalive = 1;    // enable keepalive
int keepidle = 10;    // start sending probes after 10 s of idle time (system default: 2 hours)
int keepinterval = 1; // interval between probes (system default: 75 s)
int keepcount = 5;    // number of probes; if all 5 go unanswered, the peer is considered disconnected (system default: 9)
setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &keepalive, sizeof(keepalive));
setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &keepidle, sizeof(keepidle));
setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &keepinterval, sizeof(keepinterval));
setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &keepcount, sizeof(keepcount));
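With this in place the client was fine and could close the connection on its own, but the server still waited forever; in other words, keepalive had no effect there.
I never actually tracked down the cause. As an aside, Baidu search really is hard to use (and Google happens to be blocked, and the company will not pay for a VPN, which is a little sad).
Later, through a Google IP that had not been blocked, I found this option: TCP_USER_TIMEOUT (since Linux 2.6.37).
Link: http://man7.org/linux/man-pages/man7/tcp.7.html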
This option takes an unsigned int as an argument. When the
value is greater than 0, it specifies the maximum amount of
time in milliseconds that transmitted data may remain
unacknowledged before TCP will forcibly close the
corresponding connection and return ETIMEDOUT to the
application. If the option value is specified as 0, TCP will
use the system default.
Increasing user timeouts allows a TCP connection to survive
extended periods without end-to-end connectivity. Decreasing
user timeouts allows applications to "fail fast", if so
desired. Otherwise, failure may take up to 20 minutes with
the current system defaults in a normal WAN environment.
This option can be set during any state of a TCP connection,
but is only effective during the synchronized states of a
connection (ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT,
CLOSING, and LAST-ACK). Moreover, when used with the TCP
keepalive (SO_KEEPALIVE) option, TCP_USER_TIMEOUT will
override keepalive to determine when to close a connection due
to keepalive failure.
The option has no effect on when TCP retransmits a packet, nor
when a keepalive probe is sent.
This option, like many others, will be inherited by the socket
returned by accept(2), if it was set on the listening socket.
Further details on the user timeout feature can be found in
RFC 793 and RFC 5482 ("TCP User Timeout Option").
So we added the TCP_USER_TIMEOUT option on the server, and the problem was solved.
unsigned int timeout = 10000; // 10 s, expressed in milliseconds
setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT, &timeout, sizeof(timeout));
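The snippets in this post ignore the return value of setsockopt, so a rejected option would go unnoticed. Below is a minimal sketch of how the keepalive settings and TCP_USER_TIMEOUT could be applied together with error checking. The function name enable_dead_peer_detection and its parameters are hypothetical (they are not from the original code), and the options used are Linux-specific.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper, not part of the original post.
 * Enables keepalive probing and TCP_USER_TIMEOUT on a TCP socket.
 * Linux-specific; TCP_USER_TIMEOUT requires Linux 2.6.37 or later.
 * Returns 0 on success, or -1 on the first failing setsockopt call
 * (errno is left as set by that call). */
int enable_dead_peer_detection(int fd, int idle_s, int interval_s,
                               int count, unsigned int user_timeout_ms)
{
    int on = 1;

    /* Turn keepalive on for this socket. */
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    /* Idle time in seconds before the first probe is sent. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof(idle_s)) < 0)
        return -1;
    /* Seconds between successive probes. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval_s,
                   sizeof(interval_s)) < 0)
        return -1;
    /* Number of unanswered probes before the peer is declared dead. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count)) < 0)
        return -1;
    /* Maximum time in milliseconds that sent data may stay unacknowledged. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT, &user_timeout_ms,
                   sizeof(user_timeout_ms)) < 0)
        return -1;

    return 0;
}
```

With the values used in this post, the call would be enable_dead_peer_detection(fd, 10, 1, 5, 10000). According to the man page excerpt above, setting the options on the listening socket should also work, since sockets returned by accept(2) inherit them.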
Later I searched some more and found confirmation in the article below.
Here are some excerpts; for the original, see: http://blog.leeyiw.org/tcp-keep-alive/
Using TCP Keep-Alive and TCP_USER_TIMEOUT to determine whether the peer is alive
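The first problem:
When the peer's network cable is unplugged, or its network card is removed or disabled, the peer has no chance to send a TCP RST or FIN packet to the local operating system to close the connection. The operating system therefore does not consider the peer dead, so a call to send still returns the number of bytes we asked to send. Since the return value of send cannot tell us whether the peer is alive, we need the TCP Keep-alive mechanism.
"Unix Network Programming, Volume 1" mentions that the SO_KEEPALIVE socket option enables the keep-alive mechanism on a socket.
Once the keepalive option is set on a TCP socket, if no data is exchanged in either direction on that socket for 2 hours, TCP automatically sends a keepalive probe to the peer.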
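As a rough worked example of the timing, the delay before keepalive gives up on a silent peer can be estimated as the idle period plus one probe interval per unanswered probe. The function name below is hypothetical, and the result is an approximation that assumes the connection is idle with no unacknowledged data in flight.

```c
/* Hypothetical helper: approximate number of seconds keepalive needs to
 * declare a silent peer dead. The first probe goes out after idle_s
 * seconds of inactivity; the connection is dropped once `count` probes,
 * spaced interval_s seconds apart, have gone unanswered. */
long keepalive_detection_seconds(long idle_s, long interval_s, long count)
{
    return idle_s + interval_s * count;
}
```

With the settings used earlier in this post (10 s idle, 1 s interval, 5 probes) this gives 15 seconds. With the system defaults quoted in the code comments (2 hours, 75 s, 9 probes) it gives 7875 seconds, a little over 2 hours and 11 minutes.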
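TCP provides this mechanism to help us determine whether the peer is alive: if the peer does not respond properly to the keepalive probes, the next send or recv on the socket fails, and the application can detect the failure.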
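To illustrate how an application might act on that failure, here is a minimal sketch that interprets the return value of recv together with the errno value captured right after the call. The enum and the function classify_recv are hypothetical names introduced for illustration. Treating ETIMEDOUT as the dead-peer signal follows the man page excerpt above; other error codes may also appear in practice and land in the generic error case here.

```c
#include <errno.h>
#include <sys/types.h>

/* Hypothetical result type for this sketch. */
enum conn_state {
    CONN_OK,      /* data was received */
    CONN_RETRY,   /* transient condition, call recv again later */
    CONN_CLOSED,  /* peer closed the connection in an orderly way */
    CONN_DEAD,    /* keepalive or user timeout expired */
    CONN_ERROR    /* any other failure */
};

/* n   : value returned by recv() for a request of more than 0 bytes
 * err : errno saved immediately after that recv() call */
enum conn_state classify_recv(ssize_t n, int err)
{
    if (n > 0)
        return CONN_OK;
    if (n == 0)
        return CONN_CLOSED;   /* FIN received from the peer */

    switch (err) {
    case EINTR:
    case EAGAIN:
#if EWOULDBLOCK != EAGAIN
    case EWOULDBLOCK:
#endif
        return CONN_RETRY;
    case ETIMEDOUT:
        return CONN_DEAD;     /* probes unanswered or data never acknowledged */
    default:
        return CONN_ERROR;
    }
}
```

A caller would typically save errno right after recv returns, pass both values to classify_recv, and close the socket on CONN_CLOSED, CONN_DEAD, or CONN_ERROR.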
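The second problem:
If the sender has transmitted data and has not received the receiver's ACK for it, the TCP Keep-alive mechanism is not started; TCP starts its timeout retransmission mechanism instead. As a result, keepalive is ineffective whenever an ACK is outstanding. In that situation TCP_USER_TIMEOUT is what limits how long unacknowledged data may keep the connection open.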