When two computers fall in love and decide that they shall have a connection, their love-making programs may issue the following POSIX calls in order to establish a TCP connection:
- On the server side:
socket() will create a socket object and hopefully (on success) return its handle.
bind() will set the listening port and, if desired, the listening address.
listen() will tell the kernel that the application is now ready to process network connections. From this point on, an incoming TCP packet with set SYN flag for the port specified in the preceding call to bind() will be answered with SYN ACK, and the connection will technically (at least from the client's point of view) be established. Note that the application itself at this moment still hasn't done anything with the incoming connection -- the connection is still waiting at the kernel for the application to proceed, pretty much like a guest at a restaurant who still stands near the door, waiting to be seated by the service personnel.
accept() finally requests the next incoming TCP connection from the kernel and returns its socket handle. Usually, this call either returns a handle or blocks until one is available. If there is already an established connection, it is returned immediately and can be used as one would imagine: write to it, read from it, close it.
write() takes three parameters: file descriptor, buffer, and length of the buffer to write. It returns the number of bytes written. And here comes a dirty little secret: the return value is not the number of bytes written successfully to the wire (a.k.a. the number of bytes which have been sent and acknowledged by the connection peer) -- it is the number of bytes written successfully to the operating system's send buffer for this connection. In case the writing (to the send buffer, that is!) fails, -1 is returned.
read() fetches any data from a connection; it probably blocks if there currently is no data available. Whatevs, this article isn't really about read() anyway.
shutdown() should be issued to signal that the TCP connection is to be closed and that there are no further pending write()s. The connection shutdown with FIN flags will be initiated.
close() finally tells the kernel that the connection resources are no longer needed. If the connection hasn't been shut down gracefully (see above), most kernels will now send a TCP packet with set RST flag to signal that this connection is gone.
- On the client side:
socket() is used to create a socket, too.
connect() is used to initiate the sending of the first SYN packet to the targeted server.
read() and write() may be used as discussed above to receive and send data once the connection is established.
shutdown() should be issued before closing the connection, if the program is polite.
close() finally frees the system resources allocated with socket(). If shutdown() hasn't been issued, this will cause the system to send a packet with set RST flag. A minimal sketch of both sides follows below.
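Both sides in one toy program, assuming port 7777 on loopback and a made-up ``hello'' payload (neither is mandated by the API):

    /* Minimal sketch of both sides -- a toy, not production code.
     * Build: cc -o tcpdemo tcpdemo.c
     * Run:   ./tcpdemo server    (in one terminal)
     *        ./tcpdemo           (client, in another)
     */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        struct sockaddr_in addr = { 0 };
        addr.sin_family = AF_INET;
        addr.sin_port = htons(7777);
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

        int fd = socket(AF_INET, SOCK_STREAM, 0);            /* socket() */
        if (fd < 0) { perror("socket"); return 1; }

        if (argc > 1 && strcmp(argv[1], "server") == 0) {
            if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {  /* bind() */
                perror("bind"); return 1;
            }
            listen(fd, 5);                      /* from here on the kernel SYN-ACKs */
            int conn = accept(fd, NULL, NULL);  /* fetch the next waiting connection */
            if (conn < 0) { perror("accept"); return 1; }

            char buf[64];
            ssize_t n = read(conn, buf, sizeof buf);   /* blocks until data arrives */
            if (n > 0)
                write(conn, buf, n);  /* "success" = bytes in send buffer, not bytes ACKed */
            shutdown(conn, SHUT_RDWR);          /* polite: initiate the FIN handshake */
            close(conn);
        } else {
            if (connect(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {  /* sends SYN */
                perror("connect"); return 1;
            }
            write(fd, "hello", 5);
            char buf[64];
            ssize_t n = read(fd, buf, sizeof buf);
            if (n > 0) printf("echoed back: %.*s\n", (int)n, buf);
            shutdown(fd, SHUT_RDWR);            /* no further pending write()s */
        }
        close(fd);                              /* free the resources from socket() */
        return 0;
    }

Running tcpdump alongside shows the three-way handshake on connect() and the FIN exchange on shutdown().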
While TCP ensures that there is no data loss within an existing connection, reality provides an opportunity to lose data even when using TCP. To be specific: when using an unreliable TCP connection.
So what is an unreliable TCP connection? In the context of this article, the term shall describe a TCP connection to a peer which may suddenly disappear (because Alice kill -9ed the corresponding process or Bob physically disconnected the network cable).
But since our service is clever, it offers automatic reconnect. And this reconnect -- that's the gap in which data disappears.
To help understand the situation, a little drawing:
    Appli.: --[socket()][bind()][listen()]--------[accept()][write()]
                                          \        /            \
    Peer1:  --------------------------[ L i s t e n ][ C o n n e c t e d
                                         /     \      /                  \
                                       SYN   SYNACK  ACK                  \
                                       /         \   /                     \
    Peer2:  ---------------------[ c o n n e c t ][Connected][... somehow lost...

                                                Time flows this direction -->

In other words, data loss happens when
write()ing to a connection which has been interrupted on the side of Peer2 (as shown in the figure) without notification of Peer1. Note that no RST or FIN has been sent to Peer1. The application calls, say,
write(fd, buf, 20) and it will return 20, because the underlying OS has taken these 20 bytes into its send buffer. The return value of 20 signals ``Got it, dismissed!'' to the application, which may then feel free to free() the buffer, overwrite it, or whatever. And now the problem occurs: the OS tries to transmit these 20 bytes. A TCP packet is sent. In the best case, Peer2 will return an ICMP message saying that no one is interested in that packet any more; maybe Peer2's kernel will even decide to answer by resetting the connection. In the worst case, Peer1's kernel will just keep on retransmitting those 20 bytes until a timeout occurs and the connection is officially declared dead, because absolutely no reaction occurred. The next time the application tries to issue a
write(), the kernel will return -1. When the lost connection is discovered, a SIGPIPE may also be raised, which normally causes an application to exit, unless it installed a signal handler and decided to ignore the signal.
In any of these cases, the 20 bytes are lost: the application assumes they have been sent to the other peer; the peer never got them.
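A small sketch of what this means for calling code, assuming fd is a connected TCP socket (send_chunk is a hypothetical helper name): the only honest reading of a successful write() is ``the kernel took it'', and errors about a dead peer arrive one write() too late.

    /* Hedged sketch: what "success" from write() actually promises. */
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    ssize_t send_chunk(int fd, const void *buf, size_t len)
    {
        /* In real code this is done once at startup: without it, writing
         * to a connection reset by the peer raises SIGPIPE and kills us. */
        signal(SIGPIPE, SIG_IGN);

        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            /* EPIPE/ECONNRESET: the kernel has meanwhile learned that the
             * peer is gone -- but the bytes accepted by *earlier* calls may
             * already be lost without any error ever being reported. */
            perror("write");
            return -1;
        }
        /* n bytes now sit in the kernel's send buffer.  Nothing more is
         * promised: they may be retransmitted for minutes, then dropped. */
        return n;
    }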
You can easily stress-test programs for such behavior by writing a server which issues just listen(): the kernel will SYN ACK the connection, the other side will write their first data packet, and then you can lean back and just close() the connection. Use Wireshark or tcpdump to observe.
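A hedged sketch of such a test server; the port (7777) and the 30-second nap are arbitrary choices:

    /* Stress-test server: listen(), never accept(), then close(). */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        struct sockaddr_in addr = { 0 };
        addr.sin_family = AF_INET;
        addr.sin_port = htons(7777);
        addr.sin_addr.s_addr = htonl(INADDR_ANY);

        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
            perror("socket/bind");
            return 1;
        }
        listen(fd, 5);   /* kernel completes handshakes on its own from here on */
        sleep(30);       /* lean back; clients "connect" and write() happily    */
        close(fd);       /* never accept()ed: queued connections get torn down  */
        return 0;
    }

While it sleeps, clients connect and write() successfully; the capture should show their data being ACKed by a kernel whose application never looks at it.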
In the context of an application, it may be desirable to wait for a reconnect and then re-transfer the lost sequence. Unfortunately, to do so, the application must know how many bytes have been lost. A quick-and-dirty solution seems to be keeping the last used buffer around and, in case write() fails, retransmitting the previous and the current buffer. Unfortunately, that won't work reliably. There is no need for each call to write() to be connected to sending a packet of its own (and vice versa for read(), which is a common misconception about the socket API: there is no guarantee that when I send three ten-byte buffers I will receive three ten-byte buffers. I may receive two fifteen-byte buffers or one thirty-byte buffer; TCP just ensures that the data arrives in the correct order and doesn't get lost, nothing else). And now things start getting ugly. And there is a reason: the application starts doing stuff applications are not supposed to do. Applications do application stuff like displaying GUIs, harassing users, torturing the FPU with calculations. Deciding layer-4 network matters is strictly kernel business. Having this problem is a typical symptom of ``you're doing something you shouldn't be doing!''. A real application would know its data and thus be able to handle such problems without any trouble, because it would have a session concept covering a transaction model.
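The stream property has a practical consequence for every reader: if a protocol needs exactly len bytes, the receiving side has to loop. A minimal sketch (read_exactly is a hypothetical helper name):

    /* Because TCP is a byte stream, one write() on the far side may arrive
     * split or merged arbitrarily -- so a reader has to loop. */
    #include <unistd.h>

    /* Returns 0 on success, -1 on error or premature end of stream. */
    int read_exactly(int fd, char *buf, size_t len)
    {
        size_t done = 0;
        while (done < len) {
            ssize_t n = read(fd, buf + done, len - done);
            if (n <= 0)          /* 0: peer closed the connection; <0: error */
                return -1;
            done += (size_t)n;
        }
        return 0;
    }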
Unfortunately, the world needs programs which do not adhere to such principles. Even sadder, I have to write one. Whatever.
As for my program, I do not know anything about the transaction state, because my program is just a relay station intended for universal use. And thus I have only three possible ways out of the dilemma:
- Discard data -- whenever the connection is lost, the buffer gets lost. Higher application layers must know how to deal with it; it is just not my application's problem at this moment. With every relay station, the size of the lost buffers grows.
- Pain -- I forcefully set the send buffer size to 1 by issuing a setsockopt() call (see the sketch after this list). Thus I ensure that every single byte is likely to be sent in its very own data packet (since the kernel has no other way to deal with the send buffer: if it is full, it can only be emptied by sending a data packet). Now cases in which Peer2's kernel sends an answer (the case of Alice killing the program on Peer2) will quickly lead to the next write() returning -1 (or an earlier SIGPIPE). However, in case of Bob cutting the network cable, write() is likely to block, waiting for the kernel to accept the next byte. In any case this is not funny: putting a whole TCP packet on the wire just to transmit one byte is a performance killer par excellence.
- Religion, that is, prayer -- I assume that my kernel will be fair enough to have the second write() after connection loss fail or at least block. In that case I can assume the previous buffer to be on the wire and thus re-transmit that previous buffer after the connection is re-established. But as I do not have any knowledge about the data transferred (that is, about whether the re-transmit makes sense, or even worse, whether parts of that buffer have already been transmitted), this is the hope-and-pray version. Which may work, or which may mess up everything.
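For the ``pain'' variant, the call itself is a one-liner; a hedged sketch follows (make_send_buffer_tiny is a hypothetical helper name). Note that kernels clamp SO_SNDBUF to an implementation minimum (Linux, for instance, documents a doubled minimum of 2048 bytes), so requesting 1 asks for the smallest buffer the kernel is willing to grant rather than a literal one-byte buffer.

    /* Hedged sketch of the "pain" option: shrink the send buffer so that
     * as little data as the kernel allows sits silently buffered. */
    #include <stdio.h>
    #include <sys/socket.h>

    int make_send_buffer_tiny(int fd)
    {
        int size = 1;   /* a request, not a promise -- kernels clamp this */
        if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &size, sizeof size) < 0) {
            perror("setsockopt(SO_SNDBUF)");
            return -1;
        }
        return 0;
    }

Called right after socket(), before any write(), this keeps the amount of silently buffered (and thus silently losable) data as small as the kernel permits.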