A lot of TIME_WAITs

  • Kyösti Komulainen
    New Member
    • Aug 2007
    • 1

    A lot of TIME_WAITs

    Hi,
    when I run the netstat command at the MS-DOS prompt,
    I get a long printout like this:

    TCP my_hostname:3546 my_hostname:3545 TIME_WAIT
    TCP my_hostname:3552 my_hostname:3551 TIME_WAIT
    TCP my_hostname:3555 my_hostname:3554 TIME_WAIT
    TCP my_hostname:3561 my_hostname:3560 TIME_WAIT
    TCP my_hostname:3563 my_hostname:3562 TIME_WAIT
    TCP my_hostname:3565 my_hostname:3564 TIME_WAIT
    TCP my_hostname:3570 my_hostname:3569 TIME_WAIT
    TCP my_hostname:3572 my_hostname:3571 TIME_WAIT
    TCP my_hostname:3574 my_hostname:3573 TIME_WAIT
    TCP my_hostname:3582 my_hostname:3581 TIME_WAIT
    TCP my_hostname:3584 my_hostname:3583 TIME_WAIT
    TCP my_hostname:3586 my_hostname:3585 TIME_WAIT
    TCP my_hostname:3588 my_hostname:3587 TIME_WAIT


    Can someone tell me what this means?
    My laptop runs the Microsoft Windows 2000 operating system.

    Regards,
    Mike
  • epots9
    Recognized Expert Top Contributor
    • May 2007
    • 1352

    #2
    here is a bit of reading:

    1. First the application at one endpoint--in this example, that would be the Web server--initiates what is called an "active close." The Web server itself is now done with the connection, but the TCP implementation that supplied the socket it was using still has some work to do. It sends a FIN to the other endpoint and goes into a state called FIN_WAIT_1.

    2. Next the TCP endpoint on the browser's side of the connection acknowledges the server's FIN by sending back an ACK, and goes into a state called CLOSE_WAIT. When the server side receives this ACK, it switches to a state called FIN_WAIT_2. The connection is now half-closed.

    3. At this point, the socket on the client side is in a "passive close," meaning it waits for the application that was using it (the browser) to close. When this happens, the client sends its own FIN to the server and, once that FIN is acknowledged, deallocates the socket on the client side. It's done.

    4. When the server gets that last FIN, it of course sends back an ACK to acknowledge it, and then goes into the infamous TIME_WAIT state (the short sketch after this list reproduces these four steps on localhost). For how long? Ah, there's the rub.
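
    Here is a minimal sketch of those four steps (mine, not from the article above), using Python's standard socket module on the loopback interface, so you can watch the endpoint that closed first land in TIME_WAIT via netstat:

    import socket

    # Throwaway listener on an arbitrary free local port.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))   # port 0 = let the OS pick one
    server.listen(1)
    port = server.getsockname()[1]

    # Client connects; server accepts the connection.
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(("127.0.0.1", port))
    conn, _ = server.accept()

    # Active close: the accepted (server-side) socket sends its FIN first,
    # so it is the endpoint that will end up in TIME_WAIT.
    conn.close()

    # Passive close: the client sends its own FIN back and is done.
    client.close()
    server.close()

    print("Closed; 'netstat -an' should now show port", port, "in TIME_WAIT")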

    The socket that initiated the close is supposed to stay in this state for twice the Maximum Segment Lifetime--2MSL in geek speak. The MSL is supposed to be the length of time a TCP segment can stay alive in the network. So, 2MSL makes sure that any segments still out there when the close starts have time to arrive and be discarded. Why bother with this, you ask?

    Because of delayed duplicates, that's why. Given the nature of TCP/IP, it's possible that, after an active close has commenced, there are still duplicate packets running around, trying desperately to make their way to their destination sockets. If a new socket binds to the same IP/port combination before these old packets have had time to get flushed out of the network, old and new data could become intermixed. Imagine the havoc this could cause around the office: "You got JavaScript in my JPEG!"

    So, TIME_WAIT was invented to keep new connections from being haunted by the ghosts of connections past. That seems like a good thing. So what's the problem?

    The problem is that 2MSL happens to be a rather long time--240 seconds, by default. There are several costs associated with this. The state for each socket is maintained in a data structure called a TCP Control Block (TCB). When IP packets come in, they have to be associated with the right TCB, and the more TCBs there are, the longer that search takes. Modern implementations of TCP combat this by using a hash table instead of a linear search. Also, since each TIME_WAIT ties up an IP/port combination, too many of them can lead to exhaustion of the default number of ephemeral ports available for handling new requests. And even if the TCB search is relatively fast, and even if there are plenty of ports to bind to, the extra TCBs still take up memory on the server side. In short, the need to limit the costs of TIME_WAIT turns out to be a long-standing problem. In fact, this was part of the original case for persistent connections in HTTP 1.1.
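
    If you want to see how big the buildup actually is, a rough sketch like the one below (my own, in Python; any scripting language would do) tallies connections per state from netstat output:

    import subprocess
    from collections import Counter

    # Run netstat and count TCP connections per state (TIME_WAIT, ESTABLISHED, ...).
    output = subprocess.run(["netstat", "-an"], capture_output=True, text=True).stdout

    states = Counter()
    for line in output.splitlines():
        fields = line.split()
        # TCP lines end with the connection state; the exact column layout
        # varies between Windows and Unix, so this is only a rough parse.
        if fields and fields[0].upper().startswith("TCP") and len(fields) >= 4:
            states[fields[-1]] += 1

    for state, count in states.most_common():
        print(f"{state:<12} {count}")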

    The good news is that you can address this problem by shortening the TIME_WAIT interval. This article by Brett Hill explains how to do so for IIS. As Brett explains, four minutes is probably longer than needed for duplicate packets to flush out of the network, given that modern network latencies tend to be much shorter than that. The bad news is that, while shortening the interval is quite common, it still entails risks. As Faber, Touch and Yue (who are the real experts on this) explain: "The size of the MSL to maintain a given memory usage level is inversely proportional to the connection rate." In other words, the more you find yourself needing to reduce the length of TIME_WAIT, the more likely doing so will cause problems.
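
    For the record, on Windows the interval is usually adjusted through the TcpTimedWaitDelay DWORD value (in seconds) under HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters; I'm assuming that's the setting the Brett Hill article walks through. A read-only check from Python would look roughly like this (actually changing the value needs admin rights and a reboot to take effect):

    import winreg

    KEY_PATH = r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH) as key:
        try:
            value, _type = winreg.QueryValueEx(key, "TcpTimedWaitDelay")
            print("TcpTimedWaitDelay =", value, "seconds")
        except FileNotFoundError:
            # Value absent means the stack falls back to its default (240 s).
            print("TcpTimedWaitDelay not set; default of 240 seconds applies")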
    hope it helps
