LPC: Making the net go faster
By Jonathan Corbet
September 13, 2011
Almost every service offered by Google is delivered over the Internet, so
it makes sense that the company would have an interest in improving how the
net performs. The networking session at the 2011 Linux Plumbers Conference
featured presentations from three Google developers, each of whom had a
proposal for a significant implementation change. Between the three, it
seems, there is still a lot of room for improvement in how we do
networking.
Proportional rate reduction
The "congestion window" is a TCP sender's idea of how much data it can have
in flight to the other end before it starts to overload a link in the middle.
Dropped packets are often a sign that the congestion window is too large,
so TCP implementations normally reduce the window significantly when loss
happens. Cutting the congestion window will reduce performance, though; if
the packet loss was a one-time event, that slowdown will be entirely
unnecessary. RFC 3517
describes an algorithm for bringing the connection up to speed quickly
after a lost packet, but, Nandita Dukkipati says, we can do better.
According to Nandita, a large portion of the network sessions involving
Google's servers
experience losses at some point; the ones that do can take 7-10 times
longer to complete. RFC 3517 is part of the problem. This algorithm
responds to a packet loss by immediately cutting the congestion window in
half; if the congestion window had been full at the time of the loss, the
sending system must then wait for ACKs covering half of the in-transit
packets before it can transmit again. That causes the sender to go silent for
an extended period of time. It works well enough in simple cases (a single
packet lost in a long-lasting flow), but it tends to clog up the works when
dealing with short flows or extended packet losses.
Linux does not use strict RFC 3517 now; it uses, instead, an enhancement
called "rate halving." With this algorithm, the congestion window is not
halved immediately. Once the connection goes into loss recovery, each
incoming ACK (which will typically acknowledge the receipt of two packets
at the other end) will cause the congestion window to be reduced by a
single packet. Over the course of one full set of in-flight packets, the
window will be cut in half, but the sending system will continue to
transmit (at a lower rate) while that reduction is happening. The result
is a smoother flow and reduced latency.
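In rough C-like pseudocode, the rate-halving behavior might look something
like the sketch below; the names are invented for this example and do not
correspond to the kernel's actual implementation.

    /* Illustrative sketch of rate halving as described above; ssthresh
     * is assumed to hold half of the pre-loss congestion window. */
    static void rate_halving_on_ack(unsigned int *snd_cwnd,
                                    unsigned int ssthresh)
    {
        /* Each ACK during loss recovery trims one packet from the
         * window, so the sender keeps transmitting (more slowly)
         * while the window shrinks toward half its former size. */
        if (*snd_cwnd > ssthresh)
            (*snd_cwnd)--;
    }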
But rate halving can be improved upon. The ACKs it depends on are
themselves subject to loss; an extended loss can cause significant
reduction of the congestion window and slow recovery. This algorithm also
does not even begin the process of raising the congestion window back to
the highest workable value until the recovery process is complete. So it
can take quite a while to get back up to full speed.
The proportional rate reduction algorithm takes a different approach. The
first step is to calculate an estimate for the amount of data still in
flight, followed by a calculation of what, according to the congestion
control algorithm in use, the congestion window should now be. If the
amount of data in the pipeline is less than the target congestion window,
the system just goes directly into the TCP slow start algorithm to bring
the congestion window back up. Thus, when the connection experiences a
burst of losses, it will start trying to rebuild the congestion window
right away instead of creeping along with a small window for an extended
period.
If, instead, the amount of data in flight is at least as large as the new
congestion window, an algorithm
similar to rate halving is used. The actual reduction is calculated
relative to the new congestion window, though, rather than being a strict
one-half cut. For both large and small losses, the emphasis on using
estimates of the
amount of in-flight data instead of counting ACKs is said to make recovery
go more smoothly and to avoid needless reductions in the congestion window.
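As a way of visualizing the difference, here is a highly simplified sketch
of the decision just described; it is not the actual proportional rate
reduction code, and the structure and field names are assumptions made for
this example.

    /* Simplified sketch of the proportional rate reduction decision;
     * not the real implementation. */
    struct prr_example_state {
        unsigned int pipe;        /* estimate of packets still in flight */
        unsigned int target_cwnd; /* window chosen by the congestion-control
                                     algorithm in use */
        unsigned int snd_cwnd;    /* current congestion window */
    };

    static void prr_example_on_ack(struct prr_example_state *s)
    {
        if (s->pipe < s->target_cwnd) {
            /* Heavy losses have drained the pipe below the target:
             * grow the window again right away, slow-start style. */
            s->snd_cwnd = s->pipe + 1;
        } else if (s->snd_cwnd > s->target_cwnd) {
            /* Otherwise reduce gradually, in proportion to the new
             * target window rather than a strict one-half cut. */
            s->snd_cwnd--;
        }
    }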
How much better is it? Nandita said that Google has been running
experiments on some of its systems; the result has been a 3-10% reduction
in average latency. Recovery timeouts have been reduced by 5%. This
code is being deployed more widely on Google's servers; it also has been
accepted for merging during the 3.2 development cycle. More information
can be found in this
draft RFC.
TCP fast open
Opening a TCP connection requires a three-packet handshake: a SYN packet
sent by the client, a SYN-ACK response from the server, and a final ACK
from the client. Until the handshake is complete, the link can carry no
data, so the handshake imposes an unavoidable startup latency on every
connection. But what would happen, asked Yuchung Cheng, if one were to
send data with the handshake packets? For simple transactions - an HTTP
GET request followed by the contents of a web page, for example - sending
the relevant data with the handshake packets would eliminate that latency.
The result of this thought is the "TCP fast open" proposal.
RFC 793 (describing TCP)
does allow data to be passed with the handshake packets, with the proviso
that the data not be passed to applications until the handshake completes.
One can consider fudging that last requirement to speed the process of
transmitting data through a TCP connection, but there are some hazards to
be dealt with. An obvious problem is the amplification of SYN flood
attacks, which are bad enough when they only involve the kernel; if each
received SYN packet were to take up application resources as well, the
denial of service possibilities would be significantly worse.
Yuchung described an approach to fast open which is intended to get
around most of the problems. The first step is the creation of a
per-server secret which is hashed with information from each client to
create a per-client cookie. That cookie is sent to the client as a special
option on an ordinary SYN-ACK packet; the client can keep it and use it for
fast opens in the future. The requirement to get a cookie first is a low
bar for the prevention of SYN flood attacks, but it does make things a
little harder. In addition, the server's secret is changed relatively
often, and,
if the server starts to see too many connections, fast open will simply be
disabled until things calm down.
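The cookie mechanism can be sketched roughly as follows; the hash, the use
of the client address as the only input, and all of the names here are
assumptions made for illustration, not details taken from the actual
patches.

    /* Hypothetical sketch of the fast-open cookie scheme. */
    #include <stdint.h>

    static uint64_t server_secret;      /* rotated relatively often */

    /* Toy stand-in for a proper keyed hash; a real implementation
     * would use a cryptographic construction. */
    static uint64_t toy_hash(uint32_t addr, uint64_t key)
    {
        uint64_t x = addr ^ key;
        x *= 0x9e3779b97f4a7c15ULL;     /* multiply-and-mix, illustration only */
        return x ^ (x >> 29);
    }

    /* The cookie handed out as an option on an ordinary SYN-ACK. */
    static uint64_t fastopen_cookie(uint32_t client_addr)
    {
        return toy_hash(client_addr, server_secret);
    }

    /* A SYN carrying data is only honored if it presents a valid cookie. */
    static int fastopen_cookie_valid(uint32_t client_addr, uint64_t cookie)
    {
        return cookie == fastopen_cookie(client_addr);
    }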
One remaining problem is that about 5% of the systems on the net will drop
SYN packets containing unknown options or data. There is little to be done
in this situation; TCP fast open simply will not work. The client must
thus remember cases where the fast-open SYN packet did not get through and
just use ordinary opens in the future.
Fast open will not happen by default; applications on both ends of the
connection must specifically request it. On the client side, the
sendto() system call is used to request a fast-open connection;
with the new MSG_FAST_OPEN flag, it functions like the combination
of connect() and sendmsg(). On the server side, a
setsockopt() call with the TCP_FAST_OPEN option will
enable fast opens. Either way, applications need not worry about dealing
with the fast-open cookies and such.
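In application code, the interface as described might be used something
like this; the flag and option names come from the talk, but their values
and exact semantics are placeholders here and could change before the
patches are merged.

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <string.h>

    #ifndef MSG_FAST_OPEN
    #define MSG_FAST_OPEN 0x20000000   /* placeholder value for this sketch */
    #endif
    #ifndef TCP_FAST_OPEN
    #define TCP_FAST_OPEN 23           /* placeholder value for this sketch */
    #endif

    /* Client: sendto() with MSG_FAST_OPEN acts like connect() plus
     * sendmsg(), letting the request ride along with the SYN when a
     * valid cookie is on hand. */
    static ssize_t fastopen_request(int sock, const struct sockaddr_in *srv,
                                    const char *req)
    {
        return sendto(sock, req, strlen(req), MSG_FAST_OPEN,
                      (const struct sockaddr *)srv, sizeof(*srv));
    }

    /* Server: opt in before accepting connections.  The meaning of the
     * option value is an assumption for this example. */
    static int fastopen_enable(int listen_sock)
    {
        int on = 1;
        return setsockopt(listen_sock, IPPROTO_TCP, TCP_FAST_OPEN,
                          &on, sizeof(on));
    }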
In Google's testing, TCP fast open has been seen to improve page load times
by anything between 4% and 40%. This technique works best in situations
where the round trip time is high, naturally; the bigger the latency, the
more value there is in removing it. A patch implementing this feature will
be submitted for inclusion sometime soon.
Briefly: user-space network queues
While the previous two talks were concerned with improving the efficiency
of data transfer over the net, Willem de Bruijn is concerned with network
processing on the local host. In particular, he is working with high-end
hardware: high-speed links, numerous processors, and, importantly, smart
network adapters that can recognize specific flows and direct packets to
connection-specific queues. By the time the kernel gets around to thinking
about a given packet at all, it will already be sorted into the proper
place, waiting for the application to ask for the data.
Actual processing of the packets will happen in the context of the
receiving process as needed. So it all happens in the right context and on
the right CPU; intermediate processing at the software IRQ level will be
avoided. Willem even described a new interface whereby the application
would receive packets directly from the kernel via a shared memory
segment.
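To give a sense of the idea, a connection-specific receive queue shared
between kernel and application might look conceptually like the sketch
below; everything in it (structures, names, layout) is hypothetical and is
not taken from the unetq patches.

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical per-flow ring living in memory shared with the kernel. */
    struct flow_slot {
        uint32_t len;             /* bytes of packet data in this slot */
        uint8_t  data[2048];
    };

    struct flow_ring {
        volatile uint32_t head;   /* advanced by the kernel as packets arrive */
        uint32_t tail;            /* advanced by the application */
        uint32_t mask;            /* ring size minus one (power of two) */
        struct flow_slot slots[];
    };

    /* Drain the ring in process context, on the CPU where the application
     * runs, handing each packet to a protocol-processing callback. */
    static void flow_ring_poll(struct flow_ring *ring,
                               void (*handler)(const uint8_t *data, size_t len))
    {
        while (ring->tail != ring->head) {
            uint32_t idx = ring->tail & ring->mask;
            handler(ring->slots[idx].data, ring->slots[idx].len);
            ring->tail++;
        }
    }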
In other words, this talk described a variant of the network channels
concept, where packet processing is pushed as close to the application as
possible. There are numerous details to be dealt with, including the usual
hangups for the channels idea: firewall processing and such. The proposed
use of a file in sysfs to pass packets to user space also seems unlikely to
pass review. But this work may eventually reach a point where it is
generally useful; those who are interested can find the patches on the unetq page.
Also see: Van Jacobson's network channels