Resetting a TCP connection and SO_LINGER
Can you quickly close a TCP connection in your program by sending a reset (“RST”) packet?
Calling close() usually starts an orderly shutdown via a “FIN” packet. This means the system has to go through the full TCP shutdown sequence: it has to get back an ACK, and also a FIN from the other end, which itself needs a final ACK (the other end waits for that in the aptly named LAST_ACK state).
So: How can you send a reset and get rid of the connection in one shot?
Using the linger option to reset
Here’s what I found on FreeBSD 8.4, based on what I see in the kernel.
- Set the SO_LINGER option on the socket, with a linger interval of zero.
- Call close().
In C, you just need these lines to set up the socket:
Here is the relevant kernel code that gets executed, from
netinet/tcp_usrreq.c
. The function is tcp_disconnect()
:
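Abridged to the branch that matters here (a paraphrase from memory, not a verbatim quote; see the file itself for the full function):

```c
/* tcp_disconnect(), abridged: SO_LINGER with a zero interval turns
 * the disconnect into a drop, and tcp_drop() emits the RST. */
if (tp->t_state < TCPS_ESTABLISHED) {
	tp = tcp_close(tp);
} else if ((so->so_options & SO_LINGER) && so->so_linger == 0) {
	tp = tcp_drop(tp, 0);		/* abort: send a reset */
} else {
	soisdisconnecting(so);
	sbflush(&so->so_rcv);
	tcp_usrclosed(tp);
	tcp_output(tp);
}
```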
Sample traces
I wrote a simple program to connect to a server, set the above
SO_LINGER
option with linger interval of 0, and then close. Here is
the packet trace, edited to focus on TCP flags.
12.828218 IP clt.59784 > svr.http: Flags [S], seq 2974233491, ... length 0
12.828273 IP svr.http > clt.59784: Flags [S.], seq 3927686263, ack 2974233492, ... length 0
12.828312 IP clt.59784 > svr.http: Flags [.], ack 1, ... length 0
12.828324 IP clt.59784 > svr.http: Flags [R.], seq 1, ack 1, ... length 0
Note the abrupt close via reset (‘R’ flag).
Here is a trace without the SO_LINGER option, again trimmed for clarity.
32.812991 IP clt.16575 > svr.http: Flags [S], seq 2769061407, ... length 0
32.813092 IP svr.http > clt.16575: Flags [S.], seq 1785234162, ack 2769061408, ... length 0
32.813110 IP clt.16575 > svr.http: Flags [.], ack 1, ... length 0
32.813786 IP clt.16575 > svr.http: Flags [F.], seq 1, ack 1, ... length 0
32.813843 IP svr.http > clt.16575: Flags [.], ack 2, ... length 0
32.814907 IP svr.http > clt.16575: Flags [F.], seq 1, ack 2, ... length 0
32.815107 IP clt.16575 > svr.http: Flags [.], ack 2, ... length 0
Note the extra packets, and also the FIN packets (‘F’ flag) and their acknowledgments (‘.’ flag).
What is SO_LINGER?
The remaining part of this post is about SO_LINGER
. Let us review
the documentation:
SO_LINGER controls the action taken when unsent messages are queued on socket and a close(2) is performed. If the socket promises reliable delivery of data and SO_LINGER is set, the system will block the process on the close(2) attempt until it is able to transmit the data or until it decides it is unable to deliver the information (a timeout period, termed the linger interval, is specified in seconds in the setsockopt() system call when SO_LINGER is requested).
From the man page, it seems as if the system waits only for the duration of the linger interval to drain the data. But it turns out to be slightly different, as we will see: the user application is blocked for at most the linger interval, but the OS continues to try to drain the data beyond that.
The relevant source code is in uipc_socket.c
, function soclose()
:
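Abridged to the linger handling (again a paraphrase, not a verbatim quote):

```c
/* soclose(), abridged: with SO_LINGER set, a blocking socket sleeps
 * until the connection winds down or the linger interval expires;
 * a non-blocking (SS_NBIO) socket skips the wait entirely. */
if (so->so_options & SO_LINGER) {
	if ((so->so_state & SS_ISDISCONNECTING) &&
	    (so->so_state & SS_NBIO))
		goto drop;
	while (so->so_state & SS_ISCONNECTED) {
		error = tsleep(&so->so_timeo, PSOCK | PCATCH,
		    "soclos", so->so_linger * hz);
		if (error)
			break;
	}
}
```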
tsleep()
is documented in
sleep(9).
It is a bit tricky to exercise SO_LINGER
, but not impossible. In
fact, this is how I tried to understand the linger option:
I wrote a client program that sets the linger timeout to 5 seconds, queues 90001 bytes of data to send over TCP, calls close(), and then exits. I also wrote a server program that accepts a connection, sleeps for 30 seconds, and then reads all the remaining data. This simulates unsent data on the client at the time close() is invoked.
Here is the trace:
# connection open
09:41:15.940732 IP clt.44145 > svr.5000: Flags [S], seq 4078427161, win 65535, length 0
15.940780 IP svr.5000 > clt.44145: Flags [S.], seq 3774371314, ack 4078427162, win 65535, length 0
15.940797 IP clt.44145 > svr.5000: Flags [.], ack 1, win 8960, length 0
# initial data push, server asleep but system accepts data up to socket buffer
15.940832 IP clt.44145 > svr.5000: Flags [P.], ack 1, win 8960, length 9
15.941616 IP clt.44145 > svr.5000: Flags [.], ack 1, win 8960, length 14336
15.941628 IP svr.5000 > clt.44145: Flags [.], ack 14346, win 7166, length 0
15.941684 IP clt.44145 > svr.5000: Flags [P.], ack 1, win 8960, length 1
15.942466 IP clt.44145 > svr.5000: Flags [.], ack 1, win 8960, length 14336
15.942479 IP svr.5000 > clt.44145: Flags [.], ack 28683, win 5374, length 0
15.942518 IP clt.44145 > svr.5000: Flags [P.], ack 1, win 8960, length 1
15.943301 IP clt.44145 > svr.5000: Flags [.], ack 1, win 8960, length 14336
15.943318 IP svr.5000 > clt.44145: Flags [.], ack 43020, win 3582, length 0
15.943359 IP clt.44145 > svr.5000: Flags [P.], ack 1, win 8960, length 1
15.944146 IP clt.44145 > svr.5000: Flags [.], ack 1, win 8960, length 14336
15.944163 IP svr.5000 > clt.44145: Flags [.], ack 57357, win 1790, length 0
15.944204 IP clt.44145 > svr.5000: Flags [P.], ack 1, win 8960, length 1
16.040862 IP svr.5000 > clt.44145: Flags [.], ack 57358, win 1790, length 0
# trying to drain after linger interval
21.040918 IP clt.44145 > svr.5000: Flags [.], ack 1, win 8960, length 14320
# server buffer full
21.140894 IP svr.5000 > clt.44145: Flags [.], ack 71678, win 0, length 0
# client probes after every linger interval
26.140943 IP clt.44145 > svr.5000: Flags [.], ack 1, win 8960, length 1
26.141008 IP svr.5000 > clt.44145: Flags [.], ack 71679, win 0, length 0
31.140992 IP clt.44145 > svr.5000: Flags [.], ack 1, win 8960, length 1
31.141056 IP svr.5000 > clt.44145: Flags [.], ack 71680, win 0, length 0
36.141022 IP clt.44145 > svr.5000: Flags [.], ack 1, win 8960, length 1
36.141091 IP svr.5000 > clt.44145: Flags [.], ack 71681, win 0, length 0
41.141102 IP clt.44145 > svr.5000: Flags [.], ack 1, win 8960, length 1
41.141154 IP svr.5000 > clt.44145: Flags [.], ack 71681, win 0, length 0
# server wakes up, reads buffered data
45.951247 IP svr.5000 > clt.44145: Flags [.], ack 71681, win 3601, length 0
# client drains rest of the data, closes connection
45.951317 IP clt.44145 > svr.5000: Flags [.], ack 1, win 8960, length 14336
45.951331 IP clt.44145 > svr.5000: Flags [FP.], seq 86017:90001, ack 1, win 8960, length 3984
# server acks, opens window
45.951345 IP svr.5000 > clt.44145: Flags [.], ack 90002, win 4244, length 0
45.951419 IP svr.5000 > clt.44145: Flags [.], ack 90002, win 7846, length 0
46.008793 IP svr.5000 > clt.44145: Flags [.], ack 90002, win 8960, length 0
# socket already closed at client, so tear down connection with a reset
46.008835 IP clt.44145 > svr.5000: Flags [R], seq 4078517163, win 0, length 0
On the client, the close() call fails with EWOULDBLOCK as expected (shown below as “Resource temporarily unavailable”), but note how the program is blocked for the full 5 seconds:
closing, time now: Sat Oct 22 09:41:15 PDT 2016
close: Resource temporarily unavailable
exiting, time now: Sat Oct 22 09:41:20 PDT 2016
The socket wakes up every 5 seconds (our linger interval) and tries to drain the remaining data. Note that the server's buffers are exhausted, as evidenced by the zero window it advertises. Finally, the server reads the remaining data, and the connection is closed with a reset because the client application has already exited.
Even though the server application is asleep, the server system is able to accept 71678 bytes of data into its receive buffers before advertising a zero window. It is the remaining unsent data on the client that triggers the ‘linger’ behavior.
I tried a few other combinations with my test programs to see how SO_LINGER behaves:
- If the client program is still running, the OS keeps the connection open and waits for the server to send a FIN from its end.
- If the server wakes up before the linger interval, close() returns early.
- If the socket is non-blocking, close() does not block, but the system still tries to drain the data.
I’ve posted the test programs: server, client.
To linger or not, that is the question
So, what does this all mean to the BSD network programmer?
- Use SO_LINGER with a linger interval of zero to send a reset immediately, whether your socket is blocking or non-blocking.
- Use SO_LINGER with a non-zero linger interval to block your program for up to that long while the data is sent. It can wake up earlier if it manages to drain the data. If your socket is non-blocking, close() returns immediately, but the OS still tries to drain the data.
- Make sure you set the l_onoff field in the linger structure to a non-zero value, else none of this applies.
SO_LINGER on Linux
I had to tweak the client program a bit to simulate this on Linux. I
used Ubuntu 16.04, running kernel 4.8.3. I kept calling non-blocking
send(..., MSG_DONTWAIT)
on the client until I got an EWOULDBLOCK
error. The tcpdump
utility needed a bigger buffer (-B
option) so
as not to drop packets.
As far as the user program is concerned, the behavior is similar:
- A zero linger interval resulted in a TCP reset packet
- The client program was blocked until the linger interval elapsed or the data was drained
- After that, the OS continued to try to drain the data
- When it was done, it sent a FIN to initiate a close
But I saw a few differences:
- The close() call did not return EWOULDBLOCK as it does on BSD when I used a regular (i.e., blocking) socket.
- The close() call blocked the program for the duration of the linger interval, even on a non-blocking socket. This looks bad!
- The probes to drain data became less frequent over time (‘exponential backoff’):
# ... many packets omitted ...
21:38:45.228236 IP clt.5000 > clt.35512: Flags [.], ack 938815, win 141, ... length 0
21:38:45.464879 IP clt.35512 > clt.5000: Flags [P.], seq 938815:956863, ack 1, win 342, ... length 18048
# first zero window from server
21:38:45.464910 IP clt.5000 > clt.35512: Flags [.], ack 956863, win 0, ... length 0
# 240ms interval
21:38:45.704875 IP clt.35512 > clt.5000: Flags [.], ack 1, win 342, ... length 0
21:38:45.704905 IP clt.5000 > clt.35512: Flags [.], ack 956863, win 0, ... length 0
# 450ms interval
21:38:46.158240 IP clt.35512 > clt.5000: Flags [.], ack 1, win 342, ... length 0
# 880ms interval
21:38:47.038229 IP clt.35512 > clt.5000: Flags [.], ack 1, win 342, ... length 0
21:38:47.038258 IP clt.5000 > clt.35512: Flags [.], ack 956863, win 0, ... length 0
# 1.89s interval
21:38:48.931618 IP clt.35512 > clt.5000: Flags [.], ack 1, win 342, ... length 0
21:38:48.931670 IP clt.5000 > clt.35512: Flags [.], ack 956863, win 0, ... length 0
# 3.62s interval
21:38:52.558238 IP clt.35512 > clt.5000: Flags [.], ack 1, win 342, ... length 0
21:38:52.558278 IP clt.5000 > clt.35512: Flags [.], ack 956863, win 0, ... length 0
# 7.04s interval
21:38:59.598276 IP clt.35512 > clt.5000: Flags [.], ack 1, win 342, ... length 0
21:38:59.598322 IP clt.5000 > clt.35512: Flags [.], ack 956863, win 0, ... length 0
# 14.7s interval
21:39:14.318292 IP clt.35512 > clt.5000: Flags [.], ack 1, win 342, ... length 0
21:39:14.318357 IP clt.5000 > clt.35512: Flags [.], ack 956863, win 0, ... length 0
# 29.0s interval
21:39:43.331579 IP clt.35512 > clt.5000: Flags [.], ack 1, win 342, ... length 0
21:39:43.331630 IP clt.5000 > clt.35512: Flags [.], ack 956863, win 0, ... length 0
# 56.3s interval
21:40:39.651571 IP clt.35512 > clt.5000: Flags [.], ack 1, win 342, ... length 0
21:40:39.651616 IP clt.5000 > clt.35512: Flags [.], ack 956863, win 0, ... length 0
# 116s interval
21:42:35.704920 IP clt.35512 > clt.5000: Flags [.], ack 1, win 342, ... length 0
21:42:35.704995 IP clt.5000 > clt.35512: Flags [.], ack 956863, win 0, ... length 0
# ... many more packets omitted ...
# server woke up and read data
21:43:45.074313 IP clt.5000 > clt.35512: Flags [.], ack 2444924, win 8527, ... length 0
# drain almost complete ...
21:43:45.074351 IP clt.35512 > clt.5000: Flags [.], seq 3296203:3361686, ack 1, win 342, ... length 65483
# complete, and the FIN packet
21:43:45.074379 IP clt.35512 > clt.5000: Flags [FP.], seq 3361686:3427169, ack 1, win 342, ... length 65483
The netstat -t
command showed the socket in FIN_WAIT_1 state while
this drain was in progress. Also note the non-empty send queue.
$ netstat -t
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 2470307 localhost:35512 localhost:5000 FIN_WAIT1
tcp 956862 0 localhost:5000 localhost:35512 ESTABLISHED
This seems queer, because FIN_WAIT_1 implies that the FIN packet has already been sent out. But this may be a reflection of the socket state rather than the state of the underlying TCP connection. The socket did move to FIN_WAIT_2 after the server woke up and acknowledged the FIN; the server did not bother to call a close() of its own, so its end stayed in CLOSE_WAIT.
$ netstat -t
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 localhost:35512 localhost:5000 FIN_WAIT2
tcp 0 0 localhost:5000 localhost:35512 CLOSE_WAIT
Linux implementation
I had a cursory look at the kernel source code. It is in
net/ipv4/tcp.c
, function tcp_close()
:
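The relevant branches, abridged (a paraphrase of the 4.x code, not a verbatim quote):

```c
/* tcp_close(), abridged: zero linger (checked after unread data)
 * means disconnect right away, which sends the RST; otherwise a FIN
 * goes out and the caller waits below, up to the linger timeout. */
if (data_was_unread) {
	/* Unread data was tossed: zap the connection with a reset. */
	tcp_set_state(sk, TCP_CLOSE);
	tcp_send_active_reset(sk, sk->sk_allocation);
} else if (sock_flag(sk, SOCK_LINGER) && !sk->sk_lingertime) {
	/* Check zero linger _after_ checking for unread data. */
	sk->sk_prot->disconnect(sk, 0);
} else if (tcp_close_state(sk)) {
	tcp_send_fin(sk);
}

sk_stream_wait_close(sk, timeout);
```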
The tcp_close() function takes the timeout as a parameter. It is set up in net/ipv4/af_inet.c, function inet_release(), from the socket's linger interval.
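Abridged, the relevant part of inet_release() (again a paraphrase):

```c
/* inet_release(), abridged: a lingering socket passes its linger
 * time as the timeout to the protocol close routine, which is
 * tcp_close() in our case. */
long timeout = 0;
if (sock_flag(sk, SOCK_LINGER) && !(current->flags & PF_EXITING))
	timeout = sk->sk_lingertime;
sk->sk_prot->close(sk, timeout);
```

Note that nothing here consults the socket's non-blocking flag, which may explain why close() blocked even on a non-blocking socket.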
To linger or not, Linux edition
So, what does this mean to the Linux network programmer?
- If you’re using non-blocking sockets, note that calling close() may still block the program.
- Otherwise, the advice given above for BSD still holds good.