Recently, there was discussion on the ietf-http-wg mailing list about the current HTTP/2.0 draft’s inclusion of a SPDY feature that allows the server to send its TCP congestion window (often abbreviated as cwnd) to the client, so it can echo it back to the server when opening a new connection. It’s a pretty interesting conversation so I figured I’d share the backstory of how we got to where we are today.
A long time ago in a galaxy far, far away (the 1980s), the series of tubes got clogged. Many personal Internets got delayed and the world was sad. Van Jacobson was pretty annoyed that his own personal Internets were getting delayed too due to all this congestion, so he proposed that we all agree to avoid congestion rather than cause it.
A key part of that proposal was Slow Start, which called for using a congestion window and exponentially growing it until a loss event happens. This congestion window specified a limit on the number of TCP segments a sender could have outstanding at any point. The TCP implementation would initialize the connection’s congestion window to a predefined “initial window” (often abbreviated as initcwnd or IW) and increase the window by 1 upon receipt of an ACK. If you got an ACK for each TCP segment you sent out, then in each roundtrip, the congestion window would double, thus leading to exponential growth. Originally, the initial congestion window was set to 1 segment, and eventually RFC 3390 increased it to 2-4. The congestion window would continue to grow until the connection encountered a loss event, which it assumes is due to congestion. In this way, TCP starts with low utilization and increases utilization (exponentially each roundtrip) until a loss event occurs.
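To make the doubling concrete, here is a minimal sketch of idealized Slow Start. It assumes no loss, one ACK per segment (no delayed ACKs), and a sender that always has data to send; the function name `slow_start_windows` is made up for illustration and is not taken from any real TCP stack.

```python
def slow_start_windows(initcwnd: int, round_trips: int) -> list[int]:
    """Return the congestion window (in segments) at the start of each round trip."""
    windows = []
    cwnd = initcwnd
    for _ in range(round_trips):
        windows.append(cwnd)
        # Idealized: every one of the cwnd segments in flight is ACKed,
        # and each ACK grows the window by 1 segment, so cwnd doubles.
        cwnd += cwnd
    return windows


print(slow_start_windows(1, 5))  # [1, 2, 4, 8, 16]
print(slow_start_windows(4, 5))  # [4, 8, 16, 32, 64]
```

Real stacks deviate from this (delayed ACKs alone slow the growth), so treat the numbers as an upper bound on how fast the window opens.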
Way back when (the 1990s), Tim Berners-Lee felt a burning need to see kitten photos, so he invented the World Wide Web, where web browsers would access kitten photo websites via HTTP over TCP. For a long time, the recommendation was to only use two connections per host, in order to prevent kitten photo congestion in the tubes:
Clients that use persistent connections SHOULD limit the number of simultaneous connections that they maintain to a given server. A single-user client SHOULD NOT maintain more than 2 connections with any server or proxy. A proxy SHOULD use up to 2*N connections to another server or proxy, where N is the number of simultaneously active users. These guidelines are intended to improve HTTP response times and avoid congestion.
However, over time, people noticed that their kitten photos weren’t loading as fast as they wanted them to. Due to this connection per host limit of 2, browsers were unable to send enough HTTP requests to receive enough responses to saturate the available downstream bandwidth. But users wanted more kitten photos faster and incentivized browser vendors to increase their connection per host limits so they could download more kitten photos in parallel. Indeed, the httpbis working group recognized that this limit was way too low and got rid of it. Yet even this wasn’t fast enough, and many users were still using older browsers, so website authors started domain sharding in order to send even more kitten photos in parallel to users.
Now that we see that browsers have raised their parallelization limits, and websites sometimes use domain sharding to get even more parallelization, it’s interesting to see how that interacts with TCP Slow Start. As previously noted, TCP Slow Start begins the congestion window with a conservative initial value and rapidly increases it until a loss event occurs. A key part of avoiding congestion is that the initial congestion window starts off below the utilization level that would lead to congestion. Note, however, that if N connections simultaneously enter Slow Start, then the effective initial congestion window across those N connections is N*initcwnd. Obviously, for high enough values of N and initcwnd, congestion will occur immediately on network paths which cannot handle such high utilization, which may lead to poor goodput (the “useful” throughput) due to wasting throughput on retransmissions. Of course, this raises the question of how many connections and what initcwnd values we see in practice.
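The N*initcwnd arithmetic is simple enough to sketch. The helper below is purely illustrative: the name `effective_initial_window`, the 1460-byte MSS, and the 4-shard example are my assumptions, not measurements.

```python
def effective_initial_window(connections: int, initcwnd: int,
                             mss_bytes: int = 1460) -> tuple[int, int]:
    """Return (segments, bytes) that can hit the network in the first round trip.

    mss_bytes=1460 is an assumed, typical Ethernet-path MSS.
    """
    segments = connections * initcwnd
    return segments, segments * mss_bytes


# 6 connections to a single host, each with an initial window of 10 segments.
print(effective_initial_window(6, 10))   # (60, 87600)

# Hypothetical page sharded across 4 hostnames: 4 * 6 = 24 connections.
print(effective_initial_window(24, 10))  # (240, 350400)
```

Whether a burst of that size causes loss depends on the bandwidth and buffering along the path, which this sketch doesn't model.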
The Web has changed a lot since Tim Berners-Lee put up the first kitten website. Websites have way more images now, and may even have tons of third party scripts and stylesheets, not to mention third party ads. Fortunately, Steve Souders maintains the HTTP Archive which can help us figure out, for today’s Web, roughly how many parallel connections may be required to load them.
Looking at this diagram, we see that the average number of domains involved in a page load today is around 16. Depending on how the webpage is structured, we can probably expect to open upwards of tens of connections during the page load. This is of course only a very rough estimate, given this available data.
In their continuing efforts to make the web fast, Google researchers have figured out that removing throttles will make things go faster, so they proposed raising the initial congestion window to 10, and it’s on its way to becoming an RFC. It’s key to note why this makes a big difference to web browsing. Slow Start requires multiple roundtrips to increase the congestion window, but web browsing is interactive and bursty, so those roundtrips directly impact user perceived latency, and by the time TCP roughly converges on an appropriate congestion window for the connection, the page load most likely has completed.
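As a rough illustration of why the initial window matters for short, bursty transfers, the sketch below counts how many round trips idealized Slow Start needs to deliver a response. It assumes no loss and a window that doubles every round trip; `round_trips_to_send` and the 30-segment response size are hypothetical.

```python
def round_trips_to_send(total_segments: int, initcwnd: int) -> int:
    """Count round trips needed to send total_segments under idealized Slow Start."""
    if total_segments <= 0 or initcwnd <= 0:
        raise ValueError("total_segments and initcwnd must be positive")
    sent = 0
    cwnd = initcwnd
    trips = 0
    while sent < total_segments:
        sent += cwnd   # send a full window this round trip
        cwnd *= 2      # all segments ACKed, so the window doubles
        trips += 1
    return trips


# A hypothetical 30-segment response (roughly 43 KB with a 1460-byte MSS).
print(round_trips_to_send(30, 3))   # 4
print(round_trips_to_send(30, 10))  # 2
```

On a path with a 100 ms roundtrip time, that difference of two round trips would be about 200 ms of user perceived latency for a single response.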
Linux kernel developers found Google’s argument compelling and have already switched the kernel to using IW10 by default. CDN developers were likewise pretty quick to figure out that if they wanted to compete in the make kitten photos go fast space, they needed to raise their initcwnd values too to keep up.
As David Miller points out, any app can increase the initial congestion window by just opening more connections, and obviously websites can make the browser open more connections by domain sharding. So websites can effectively get as high an initial congestion window as they want.
But the question is - does this happen in practice and what happens in that case? To answer this, I used a highly scientific methodology to identify representative websites on the Web, loaded them up in WebPageTest with a fairly “slow” (DSL) network setup, and looked to see what happened:
As shown above, plenty of major websites are using a significant amount of domain sharding, which can be highly detrimental for their user experience when the user has insufficient bandwidth to handle the burst from the servers. With so many connections opening up in parallel, the effective initial congestion window across all the connections is huge, and Slow Start becomes ineffective at avoiding congestion.
The advent of SPDY changes this since it helps fix HTTP/1.X bottlenecks and thus obviates the need for workaround hacks like domain sharding that increase parallelism by opening up more connections. However, after fixing the application layer bottlenecks in HTTP/1.X, SPDY runs into transport layer bottlenecks, such as the initial congestion window, due to web browsing’s bursty nature. SPDY is at a distinct disadvantage here compared to HTTP: browsers will open up to 6 connections per host for HTTP, whereas they’ll only open a single connection for SPDY, so HTTP often gets 6 times the initial congestion window that SPDY does.
Given this knowledge, what should we be doing? Well, there’s some argument to be made that since SPDY tries to be a good citizen and use fewer connections than HTTP/1.X, perhaps individual SPDY connections should get a higher initial congestion window than individual HTTP/1.X connections. One way to do so is for the kernel to provide a socket option for the application to control the initial congestion window. Indeed, Google TCP developers proposed this Linux patch, but David Miller shut the proposal down pretty firmly:
Stop pretending a network path characteristic can be made into an application level one, else I’ll stop reading your patches.
You can try to use smoke and mirrors to make your justification by saying that an application can circumvent things right now by openning up multiple connections. But guess what? If that act overflows a network queue, we’ll pull the CWND back on all of those connections while their CWNDs are still small and therefore way before things get out of hand.
Whereas if you set the initial window high, the CWND is wildly out of control before we are even started.
And even after your patch the “abuse” ability is still there. So since your patch doesn’t prevent the “abuse”, you really don’t care about CWND abuse. Instead, you simply want to pimp your feature.
Despite the rejection of this patch, Google’s servers have been leveraging this kernel modification to experiment with increasing the initial congestion window for SPDY connections. Beyond just experimenting with statically setting the initial congestion window, Google has also experimented with using SPDY level cookies (see SETTINGS_CURRENT_CWND) to cache the server’s congestion window at the browser for reuse later on when the browser re-establishes a SPDY connection to the server. This is precisely the functionality recently under debate in the httpbis working group.
It’s easy to see why this would be controversial. It definitely does raise a number of valid concerns:
- Violates layering. Why is a TCP internal state variable being stored by the application layer?
- Allows the client to request the server use a specific congestion window value.
- Attempts to reuse an old, very possibly inaccurate congestion window value.
- Possibly increases the likelihood of overshooting the appropriate congestion window.
So why are SPDY developers even experimenting at all with this? Well, there are some more and less reasonable explanations:
- While it’s true that this is clearly a layering violation, this is the only way to experiment with caching the congestion window for web traffic at reasonable scale. This could be conceivably done by the TCP layer itself with a TCP option, but given the rate of users updating to new operating system versions, that would take ages. Experimenting at the application layer, while ugly, is much more easily deployable in the short term.
- As with any variable controllable by the client, servers must be careful about abuse, treat this information as only advisory, and ignore ridiculous values.
- Yes, there’s definitely reason to be skeptical of an old congestion window value. But as Patrick McManus points out, a reasonable counter-question is whether there’s any reason to believe that an uninformed static guess like IW10 is better than an informed guess based on an old congestion window.
- Is reusing an old congestion window more likely to cause overshooting the appropriate congestion window? Well, given that in an HTTP/1.X world, browsers are often opening 6 connections per host, and thus combined with IW10 often have an effective initial congestion window per host of 60, it’s interesting to note what values the SPDY CWND cookie caches in practice. Chromium (and Firefox too, according to Patrick McManus) sees median values around 30, so in many ways, reusing the old congestion window is more conservative than what happens today with multiple HTTP/1.X connections.
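The second point above, treating the echoed value as advisory and ignoring ridiculous values, might look something like the following sketch. To be clear, this is not how Google’s servers or any real SPDY implementation do it: `choose_initial_cwnd`, `DEFAULT_INITCWND`, and especially the `MAX_TRUSTED_CWND` cutoff are hypothetical names and numbers I picked for illustration.

```python
from typing import Optional

DEFAULT_INITCWND = 10   # static fallback (IW10)
MAX_TRUSTED_CWND = 64   # hypothetical sanity cutoff, in segments


def choose_initial_cwnd(echoed_cwnd: Optional[int]) -> int:
    """Pick an initial congestion window given a client-echoed cached value.

    The echoed value is only a hint: missing or implausible values are
    ignored and the static default is used instead.
    """
    if echoed_cwnd is None:
        return DEFAULT_INITCWND
    if echoed_cwnd <= 0 or echoed_cwnd > MAX_TRUSTED_CWND:
        return DEFAULT_INITCWND  # ridiculous value, ignore it
    return echoed_cwnd


print(choose_initial_cwnd(None))    # 10
print(choose_initial_cwnd(30))      # 30
print(choose_initial_cwnd(100000))  # 10
```

Falling back to the default rather than clamping to the cutoff is a design choice: a client that sends an absurd value gets no benefit from doing so.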
It’s great though to see this get discussed in IETF, and especially great to see tcpm and transport area folks get involved in the HTTP related discussions. And it looks like there’s increasing interest from these parties in collaborating more closely in the future, so I’m hopeful that we’ll see more research in this area and come up with some better solutions here.
Until SPDY and HTTP/2 are widely deployed enough that web developers don’t need to rely on domain sharding, and we’re still years out from this, we have to make things work well in today’s web. And that means dealing with domain sharding. Even though domain sharding may make sense for older browsers with low connection per host limits, when domain sharding is combined with higher limits, then opening so many connections in parallel can actually be harmful to both performance and network congestion as previously demonstrated.
As I’ve previously discussed, Chromium historically has had poor control over connection parallelism. That said, now we have our fancy new ResourceScheduler, courtesy of my colleague James Simonsen. With that, we now have page level visibility into all resource requests. With that capability, we can start placing some limits on request parallelism on a per-page basis. Indeed, we’ve experimented with doing so and found compelling data to support limiting the number of concurrent image requests to 10, which should roll out to stable channel in the upcoming Chrome 27 release and dramatically mitigate congestion related issues. For further details, look at the waterfalls here and watch the video comparing the page loads.
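For readers who want a feel for what a per-page cap on concurrent image requests does, here is a toy sketch using an `asyncio.Semaphore`. It has nothing to do with how Chromium’s ResourceScheduler is actually implemented; the function names, the example.com URLs, and the `asyncio.sleep` stand-in for a network transfer are all made up for illustration.

```python
import asyncio

MAX_CONCURRENT_IMAGE_REQUESTS = 10  # the per-page limit discussed above


async def fetch_image(url: str, limiter: asyncio.Semaphore, stats: dict) -> str:
    # Wait for a free slot before "starting" the request.
    async with limiter:
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.01)  # stand-in for the actual network transfer
        stats["in_flight"] -= 1
    return url


async def load_page(urls: list[str]) -> int:
    """Fetch all images for a page and return the peak number in flight."""
    limiter = asyncio.Semaphore(MAX_CONCURRENT_IMAGE_REQUESTS)
    stats = {"in_flight": 0, "peak": 0}
    await asyncio.gather(*(fetch_image(u, limiter, stats) for u in urls))
    return stats["peak"]


urls = [f"https://example.com/kitten{i}.jpg" for i in range(40)]
peak = asyncio.run(load_page(urls))
print(peak)  # 10
```

Even though all 40 requests are queued at once, no more than 10 are ever in flight, which bounds the burst the servers can send back in parallel.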
While this limit would obviously be good for users with slower internet connections, I was mildly concerned that it would lead to degraded performance for users with faster internet connections due to not being able to fully saturate the connection’s available bandwidth. That said, it appears that it is still high enough for at least cable modem type bandwidth.
- Congestion is bad for the internet and for user perceived latency
- Extra roundtrips are bad for user perceived latency
- Slow Start helps avoid congestion
- Slow Start can incur painful roundtrips
- Low initial congestion windows can be bottlenecks for user perceived latency
- But high initial congestion windows can lead to immediate congestion, which can also hurt user perceived latency
- There’s an interesting discussion in IETF on whether or not we can improve upon Slow Start by picking better initial congestion windows, possibly dynamically by using old information.
- Web developers using high amounts of domain sharding to work around low connection per host limits in old browsers should reconsider their number of shards for newer browsers. Anything over 2 is probably too much, unless most of your user base is using older browsers. Better yet, stop hacking around HTTP/1.X deficiencies and use SPDY or HTTP/2 instead.
Food for thought (and possibly future posts):
- Impact on bufferbloat
- How does TCP behave exactly when it encounters such high initial congestion like some of the demonstrated packet traces show? How does this change with the advent of newer TCP algorithms like Proportional Rate Reduction and Tail Loss Probe that are available in newer Linux kernel versions?
- Limiting requests will reduce attempted bandwidth utilization, which reduces congestion risk (and the risk for precipitous drops in goodput under excessive congestion). How else does it improve performance? For hints, see some of my old posts like this and this.