William Chan's blag

Sonic DSL & Bufferbloat

[TL;DR: Sonic’s DSL modem (Pace 5268AC) suffers from terrible bufferbloat that renders the DSL service unusable whenever devices upload substantial data (e.g. whenever I come home and my phone syncs photos to the cloud). Otherwise, Sonic provides solid service. I’d consider going back for their fiber service if/when that becomes available in my neighborhood.]

I moved within SF last year, and was excited to finally ditch Comcast in favor of…well…any other ISP. I had heard great things about Sonic, especially regarding user privacy, so I figured I’d give them a try. Sadly, despite their many good qualities, I ultimately decided to switch to Monkeybrains. I’ve scheduled an appointment with them in a few weeks, and will see how Sonic’s cancellation process works shortly thereafter.

First, let me talk about the good things about Sonic. Their installation was wonderfully uneventful. I ordered Sonic Fusion DSL before I moved in, and they showed up a week later and set everything up. No hassle whatsoever. I was only without internet for a day or so, since I had timed the installation to happen shortly after I moved in, and everything just worked.

Secondly, their customer service has been pretty good. When I first encountered slow internet issues, I emailed them late at night, and they responded first thing in the morning with the correct diagnosis (WiFi interference). It seems like they have good monitoring infrastructure set up to help diagnose network issues when using their provided DSL modem/router combination.

Subsequently, I contacted them about some issues with streaming media, which took them 5 days to respond to. That was disappointing, and by that point, they weren’t able to diagnose it. That said, the problem never recurred, so I didn’t mind so much. It could very well have been a problem on my end for all I know.

I continued to use Sonic for the past half year or so, and many times a week, the internet connectivity would just suck. Browsing was super slow, and many HTTP requests would just time out. It would pass after some time, so I never got around to trying to diagnose it. But one time, it was absolutely terrible, and I had friends over who couldn’t use my WiFi, so I figured I’d track it down. I contacted Sonic again, and it again took them like 4 days to get back to me. In the meantime, I kept sending Sonic output from ping, trying to demonstrate the high latency I would see. When they finally got back to me, they provided the useful information that the periods of high latency appeared to correlate with high uplink bandwidth utilization. Props to them for being able to identify that. Maybe I have low standards to be impressed by that, but Comcast was never helpful with this stuff.

Anyhow, from there, it was pretty easy to debug everything. I made a screen recording of the bufferbloat in action, and will break it down here. The initial setup is pretty simple. I use apenwarr’s Blip tool to visualize latency. Basically, it pings with XHRs to a fast site (Google) and a slower site (presumably apenwarr’s personal website). Simultaneously, I have ping commands (every 1s) running to an external host and to the DSL modem/router at its LAN address. With this monitoring going, I use Google’s speedtest (run by Measurement Lab) to test downloads and uploads. See this setup below:

The short summary is, when I run the test, latency is fine during downloads, but goes to shit during uploads. During downloads, latency looks stable both in Blip and in the command line pings:

On the other hand, uploads immediately start looking terrible. The ping to the external host starts timing out, even though the router pings complete fine. Since Blip sends XHRs at 100ms intervals until they time out, you can see the initial XHR pings’ latencies jump until they max out at 1000ms (at which point Blip times out the XHRs).

For the entirety of the upload test, all the pings time out, as you can see below. I captured this screenshot right after the speedtest ended:

It’s interesting to see what happens shortly afterward, as the uplink interface gets a chance to drain its buffer from the speedtest’s upload portion. Blip’s XHR latencies immediately start recovering, and all the external pings which had timed out (icmp_seq 19-31) actually still manage to complete. They weren’t dropped by network queues; rather, they were just excessively buffered. The primary component of their network delay is not the propagation delay across the network hops to the nearest Google server, but the queueing delay from excessive buffering at the DSL modem interface.
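The arithmetic behind that queueing delay is worth spelling out. Here’s a back-of-the-envelope sketch; the buffer size and uplink rate are made-up illustrative numbers, not measurements of the 5268AC:

```python
# Back-of-the-envelope: how much latency does a full uplink buffer add?
# A new packet arriving at a full FIFO queue must wait for everything
# ahead of it to drain at the uplink rate, so:
#   added delay = buffered bytes * 8 / uplink bits-per-second

def queueing_delay_s(buffer_bytes: int, uplink_bits_per_s: float) -> float:
    return (buffer_bytes * 8) / uplink_bits_per_s

# Hypothetical numbers: a 1 Mbps DSL uplink behind a 256 KB modem buffer.
delay = queueing_delay_s(256 * 1024, 1_000_000)
print(f"{delay:.1f}s of queueing delay")  # ~2.1s -- enough to time out pings
```

A couple hundred kilobytes of buffer on a slow uplink is all it takes to push round trips past the point where pings and XHRs time out.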

Nothing super surprising here. It’s pretty textbook bufferbloat. I guess I’m just frustrated because this is 2017, and I didn’t expect such a highly recommended ISP to be providing DSL modems with so much bufferbloat. I Googled [sonic bufferbloat] and found numerous hits. I wish I had seen these before signing up for Sonic’s DSL service. It’s extra unfortunate because it seems like Dane Jasper, their CEO, is aware of the problem but doesn’t think bufferbloat is to blame.

I think that buffer bloat is a red herring here, and suspect that QoS is the real challenge. This is something we are actively working with Pace on, the 4111N doesn’t currently support upstream ACK prioritization, and that can impair performance during times of upstream saturation with some applications.

It looks like Dave Taht from the Bufferbloat project chimed in to correct Dane, but still nothing has been done:

I do not think bufferbloat is a “red herring” here. The vast majority of modern dsl interfaces are dramatically overbuffered, and your 1+ second results are in line with rather large datasets.

Best results would come from using a modem with absolute minimal firmware buffering using bytes rather than packets (“BQL”, essentially), then combined with a latency sensitive aqm/fq system like fq_codel on top of that. Older DSL modems did this, with hardware flow control. The only DSL device I know of getting it right this way, today, is free.fr’s revolution V6 modems. Newer ones overbuffer and connect to switches that cannot do hardware flow control.

Second best results are using a good QoS system that also does fq and aqm underneath, running at slightly less than line rate - and many QoS systems do do that, nowadays. But I was reluctant to call it QoS because that implies that packet prioritization is useful when there are seconds of uncontrolled buffers underneath. We ended up inventing a new term - “smart queue management” that described things better. See openwrt’s sqm-scripts for details.

I am not a fan of ack prioritization - what’s an ack? In IPv6? In QUIC? In other protocols? - but of a combined fair queuing and aqm approach as per the above.

It was also doubly frustrating as a user to have Sonic’s customer service ask me what cloud software I’m using, as if the onus is on the user to make sure he/she doesn’t use too much upstream bandwidth. I mean, do they really expect me to rate limit the upload bandwidth utilization of all my devices, and the devices of guests who I may have over, in order to make sure I have usable internet?

apenwarr told me I could fix this by putting a Linux box in between my devices and the DSL modem that rate limits upstream bandwidth, so that the DSL modem upstream buffer never fills up. I mean, yes, I could totally do this, but I don’t think a normal user should have to do this. I don’t really want to maintain my own networking setup, I just want it to work out of box. I really wish they’d get Pace to take Dave Taht’s advice to fix the DSL modems they provide to customers.
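The idea behind apenwarr’s suggestion is simple: shape upstream traffic to just under the DSL line rate, so the modem’s buffer never fills and the queue stays on a box you control (where fq_codel or sqm-scripts can manage it). In practice you’d do this with OpenWrt’s sqm-scripts or tc; the toy token bucket below just illustrates the shaping principle, with made-up rates:

```python
# Toy token-bucket shaper: admit at most `rate` bytes per tick on average,
# so the modem's own buffer never builds a standing queue. Packets the
# shaper rejects wait (or drop) on our box, where we control the queue.
class TokenBucket:
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst  # bytes per tick, max bucket size
        self.tokens, self.last = burst, 0

    def allow(self, nbytes: int, now: int) -> bool:
        # Refill tokens for the elapsed ticks, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= nbytes:
            self.tokens -= nbytes
            return True
        return False

# Shape to ~90% of a hypothetical 125 KB/s (1 Mbps) uplink. With 10 ms
# ticks, that's a budget of 1125 bytes per tick.
tb = TokenBucket(rate=1125, burst=15000)
sent = sum(1500 for t in range(100) if tb.allow(1500, now=t))
print(sent, "bytes admitted over 1 simulated second")
```

After the initial burst drains, the shaper converges on the configured rate, which is exactly what keeps the modem’s buffer empty and latency low.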

But since there’s no sign that it’ll be fixed in the near future, nor do I know when they’ll expand their fiber service to my SF neighborhood, I’m sadly giving up on Sonic for now. I had been holding out on canceling my Sonic DSL service, because I was hoping to switch to their fiber service whenever it expands to the Mission, but it’s pretty lame for me to deal with having the internet service go out to lunch whenever I come home and my devices sync photos/videos/files to the cloud, or whenever guests come over and their devices do the same thing. It’s simply unacceptable to have to ask people to make sure they’re not uploading any large files, so that other folks can use the internet while they’re over at my place. And when I’m videoconferencing and some cloud sync process fires, I don’t want to hunt around to figure out which device/process is causing it.

Anyway, hopefully Monkeybrains is better. We’ll see!

Back to SF After the U.S. Digital Service

After 6 months working for the U.S. Digital Service, I returned to San Francisco in mid-April to figure out what was next for me. I got up that Thursday morning and went out to ride the Google shuttle down to Mountain View so I could see what was happening back at Google.

It’s hard to describe the multitude of feelings returning back to Google invoked inside me. Sitting there at a Google cafe, eating delicious, healthy, free food, walking around the hallways with all the internal corporate advertisements about tech talks, yoga classes, cinema nights, etc…Yes, after 6 months through a brutal DC winter, working in dreary government offices, I had forgotten how comfortable the warm embrace of corporate America was.

By far, the most conflicted thing for me was catching up with my colleagues. They’d heartily welcome me back and ask for stories about working for the U.S. Digital Service. Everyone always loves hearing about train wrecks, so I’d regale them with horror stories about the terrifying things I saw in government. But then I’d tell them that there were also good, smart people in government who are trying to do the right thing, but the organizational dysfunction binds their hands. How, at the U.S. Digital Service, we’re able to leverage our unique position as outsiders, with no vested interest other than getting shit to work, to effect change at different U.S. government agencies.

My colleagues would listen, fascinated by the stories and frustrated by the irritatingly simple problems I’d describe, and then they’d say it was awesome how we were able to fix some of them. Some would then say how they’d like to take time off to work for the U.S. Digital Service too. And then I’d ask them how things were going back at Google. They’d excitedly talk about their work on networking protocols, web standards, web performance, web security, etc…all the kind of stuff that I used to work on before I left for DC.

Within a few seconds, I’d fall back into my old groove, geeking out with other engineers about tech stuff. It was so fun and intellectually satisfying to nerd out with them about all the techie subjects I loved. But after a period of time, sometimes sooner, sometimes later, this background process would kick off in the back of my head and ask me “But does this matter?”

I like to think that it does matter. I still believe that the advancing state of computing technology has overall benefitted peoples’ lives, so I’m proud to have worked on improving core technology like network protocols and the web platform at Google. But when I look at what the tech industry is spending its energy on, I see them working on helping rich people find taxis more easily, selling ads more effectively, or building sexting apps. It bothers me that I might be stuck in an ivory tower, solving abstract computing technology problems that enable the tech industry to make money off silly products for the 1%, rather than solving important problems for people that need help.

Approximately 200 million people suffer from malaria each year, and the death estimates range between 400,000 and 800,000. About 90 percent of those mortalities are in sub-Saharan Africa, and three-quarters of them occur in children younger than five. The second-order effects of the disease are vicious: Malaria is a massive impediment to economic growth, since survivors often cannot work, and parents have to devote their lives to caring for their sick children.

I’ve read, and typed, and read again these numbers, and they are so stark to me that they can easily float away into the atmosphere of statistics, escaping true empathy. Understanding one nation’s experience feels more visceral: Every day, more than 500 people die from malaria in the Democratic Republic of Congo, and the majority of these deaths are children under the age of five. AMF offers a shattering metaphor: Imagine a fully booked 747 airplane and infants strapped into seats A through K of every row of the economy section; their feet cannot reach the floor. Every day, this plane disappears into the Congo River, killing every soul on board. That is malaria—in one country. By GiveWell’s calculations it would cost $1.7 million to save the airplane.

This is nothing I hadn’t thought about before. What was different this time was that I just came back from 6 months helping fix government digital services that, by and large, benefited the neediest demographic in the United States. These critical government services were failing or at risk for many frustratingly simple (from the technical standpoint) reasons, hampered by systemic organizational dysfunction that normal federal employees or contractors working within the system are ill-equipped to address. The U.S. Digital Service, on the other hand, for a mix of reasons, had unique leverage to actually execute and make a large impact.

For a long time, I had struggled to find projects with software engineering problems that I thought were ultimately impactful on peoples’ lives. Working for the U.S. Digital Service, I finally found something where my skillset could be applied to meaningfully improve large numbers of poor peoples’ lives. If not for the fact that the bulk of my friends and loved ones live in California, I could have seen myself staying at the U.S. Digital Service, working to improve the government services that our society, especially its poorest segments, relies on so heavily.

Working for the U.S. Digital Service showed me that, indeed, there were far better ways for me to spend my time if I really wanted to make a difference in people’s lives. I couldn’t go back to what I was doing before. Even though I’m still a believer in technology, particularly in the web platform, as a force for good, society needs more people working directly on problems that matter. Google will always be able to find someone to work on its core software platforms and business. I needed to redirect my energies towards problems that society wasn’t adequately addressing.

So, after a few minutes of geeking out with a colleague on the state of tech, that nagging voice in the back of my head would smother my little bit of happiness, and I would look for the next opportunity to end the conversation. As it wound down, my coworker would always ask the same question: “So, are you back?”

I didn’t know. I was not hopeful that I would find a suitable project for me at Google, or for that matter, in the Bay Area. Most social impact projects and organizations are on the east coast, and, at least as far as I could tell (though in all likelihood, it’s because I suck at finding them), most of them don’t seem to have significant, meaningful software projects. I had a few leads, and fortunately one of them panned out. I have a new project, working on problems that negatively impact, and sometimes kill, millions of the poorest people in the world. So I’m pretty happy to say that, yes, I’m back here to stay.

HTTP/2 Considerations and Tradeoffs

[Disclaimer: The views expressed in this post are my own, and do not represent those of my employer.]

[Disclaimer: I’m a Chromium developer working on HTTP/2 amongst other things.]

There are many things to love about HTTP/2, but there are also many things to hate about it. It improves over HTTP/1.X in a number of ways, yet it makes things worse in a number of ways as well. When proposing something new, it’s important to try to comprehensively examine the tradeoffs made, to evaluate if it makes sense overall. I’m going to try (and probably fail) to provide a reasonably comprehensive description of the important considerations & tradeoffs involved with HTTP/2. Warning: this is going to be a long post, but it’s required in order to try to present a more complete analysis of HTTP/2’s tradeoffs.

Oh, and if you don’t know what HTTP/2 is, it’s the in-progress next major version of HTTP (the latest version is 1.1), based on Google’s SPDY protocol. It keeps the same HTTP message semantics, but uses a new connection management layer. For a basic introduction to SPDY, I recommend reading the whitepaper or Ilya’s post. But at a high level, it uses:

  • Multiplexed streams (each one roughly corresponding to a HTTP request/response pair)
  • Stream prioritization (to advise the peer of the stream’s priority relative to other streams in the connection)
  • Stateful (HTTP) header compression
  • Secure transport (TLS) in all browsers that support SPDY (it’s not required by spec though, just by current browser implementations)

I’m not going to bother explaining further, since there are plenty of great descriptions of HTTP/2 out there; if you want to know more, use your favorite search engine. Let’s dive into the considerations.

Major Considerations:

Network Performance

HTTP/1.X is very inefficient with its network usage. Excepting pipelining, which has its own issues, HTTP/1.X only allows one transaction per connection at any point in time. This causes major head-of-line blocking issues that cost expensive roundtrips, and roundtrips are the dominant factor (in terms of networking) in page load performance.

If page load performance matters, then in terms of networking protocol design, reducing roundtrips is the way to go, since reducing the actual latency of a roundtrip is very hard (the speed of light does not show signs of increasing, so the only remaining option is to pay for more servers closer to the clients and solving the associated distributed systems problems). There are some HTTP/1.X workarounds (hostname sharding, resource inlining, CSS sprites, etc), but they are all worse than supporting prioritized multiplexing in HTTP/2. Some reasons why include:

  • Hostname sharding incurs DNS lookups (more roundtrips to DNS servers) for each hostname shard
  • Hostname sharding ends up opening more connections which:
    • Increases contention since there’s no inter-connection prioritization mechanism
    • Can lead to network congestion due to multiplying the effective cwnd
    • Incurs more per-connection overhead in intermediaries, servers, etc.
    • Requires waiting for each connection to open, rather than for just one connection to open (and multiplex all the requests on that one connection)
  • Resource inlining, CSS sprites, etc. are all forms of resource concatenation which:
    • Prevents fine grained caching of resources, and may even outright prevent caching (if you’re inlining into uncacheable content like most HTML documents)
    • Bloats resources, which delays their overall download time. Many resources must be downloaded in full before the browser can begin processing them. Inlining images as data URIs in CSS can hurt performance, because documents can’t render before they download all external stylesheets in the document head.
    • Can interfere with resource prioritization. Theoretically you want lower priority resources (like images) downloaded later rather than be inlined in the middle of a high priority resource (like HTML).
  • These techniques all require extra work/maintenance for the web developer, so only websites with developers who know these techniques and are willing to put up with the extra maintenance burden will actually employ them. HTTP/2 makes the simple, natural way of authoring web content just work fast, so the benefits accrue to the entire web platform, not just the websites with engineers who know how to hack around issues.
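The comparison above mostly boils down to counting roundtrips. A toy model makes it concrete (a sketch with made-up RTT and DNS numbers; it ignores cwnd growth and assumes each HTTP/1.X connection handles one request at a time):

```python
# Toy roundtrip model: sharded HTTP/1.1 vs one multiplexed HTTP/2
# connection. Purely illustrative -- real page loads are messier.

def http1_sharded_ms(n_requests, shards, conns_per_shard,
                     rtt_ms=100, dns_ms=50):
    conns = shards * conns_per_shard
    setup = dns_ms + rtt_ms            # DNS lookup + TCP handshake (parallel)
    rounds = -(-n_requests // conns)   # ceil: serialized rounds per connection
    return setup + rounds * rtt_ms

def http2_ms(n_requests, rtt_ms=100, dns_ms=50):
    # One hostname, one connection; all requests multiplexed in one round.
    return dns_ms + rtt_ms + rtt_ms

print(http1_sharded_ms(60, shards=3, conns_per_shard=6))  # 550 ms
print(http2_ms(60))                                       # 250 ms
```

Even with 18 sharded connections working in parallel, the serialized rounds pile up; the multiplexed connection pays its setup cost once and is done.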

Another important aspect of network performance is optimizing for the mobile network, where roundtrips are even more costly, and the uplink bandwidth is even more constrained. Header compression makes the per-request overhead cheap, since lots of the overhead manifests in the form of large HTTP headers like cookies. Indeed, HTTP’s per-request overhead is costly enough that web performance advocates recommend using fewer requests. Where this really can kill you is in the roundtrips required to grow the client-side TCP congestion window as Patrick McManus from Firefox explains. Indeed, he goes as far as to say that it’s effectively necessary in order to reach sufficient parallelization levels:

Header compression on the upstream path is more or less required to enable effective prioritized mux of multiple transactions due to interactions with TCP congestion control. If you don’t have it you cannot effectively achieve the parallelism needed to leverage the most important HTTP/2 feature with out an RTT penalty. and RTT’s are the enemy.

Without compression you can “pipeline” between 1 and 10 requests depending on your cookie size. Probably closer to 3 or 4. With compression, the sky is more or less the limit.
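McManus’s point is easy to make concrete with a bit of arithmetic. The initcwnd and header sizes below are typical ballpark values I’m assuming for illustration, not measurements:

```python
# How many requests fit in the upstream budget of the first RTT?
# Assumed typical values: initcwnd of 10 segments, ~1460 B payload each.
INITCWND_BYTES = 10 * 1460  # 14,600 bytes sendable before waiting for ACKs

def requests_in_first_rtt(request_header_bytes: int) -> int:
    return INITCWND_BYTES // request_header_bytes

# ~1 KB of uncompressed headers (cookies, UA, accept-*) per request:
print(requests_in_first_rtt(1024))  # 14
# Compressed repeats often shrink to tens of bytes per request:
print(requests_in_first_rtt(64))    # 228
```

Without compression, the upstream congestion window caps parallelism at a handful of requests per RTT; with compression, header size stops being the bottleneck.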

For more details about HTTP/2 networking performance concepts, check out my colleague Ilya’s wonderful talk and slidedeck.

On the flip side, browsers that currently deploy SPDY only do so over TLS for various reasons (see the later deployability and TLS sections), and that TLS handshake will generally incur at least an extra 1-2 roundtrips. Moreover, to the degree that webpages load resources from different domains, HTTP/2 will be unable to multiplex requests for those resources over the same connection, and the browser will instead have to open up separate connections.

Now, that’s the theory behind most of the networking performance improvements offered by HTTP/2. But there’s also some debate over how well it performs in practice. That topic is beyond the scope of this post, and indeed, many posts have been written about it. Guy wrote a pretty good post identifying issues with SPDY / HTTP/2 performance on unmodified websites. When reading that, it’s also important to read Mike’s counterpost, where he critiques Guy’s first party domain classifier. TL;DR: Guy points out that if web sites don’t change, they won’t see much benefit, since resources are loaded across too many different hostnames; but Mike points out that lots of those hostnames are first party (belonging to the website owner), so in a real deployment, they would be shared over the same HTTP/2 connections.

And then it’s important to note that Google, Twitter, and Facebook (notably all sites that are already primarily HTTPS, so they aren’t paying any additional TLS penalty for switching) all deployed SPDY because of its wins. If the website is already using TLS, then deploying SPDY is a clear win from a page load time perspective. One particularly exciting result, which Google announced in a blog post, is that when they switched from non-SSL search to SSL search, SPDY-capable browsers actually loaded the search results page even faster. Another result from Google+ is that leveraging SPDY prioritization dramatically sped up their page loads.

Putting aside the page load performance aspect of network performance for now, let’s consider the other way in which HTTP/2 may affect network performance: real-time networking latency. For a long time now, folks have been raising a big fuss over bufferbloat, and with good reason! These bloated buffers are leading to huge queueing delays for everything going through these queues (located in routers, switches, NICs, etc). This is why when someone is uploading or downloading a whole lot of data, like video download/upload, it fills these huge buffers, which massively increases the queueing delays and kills the interactivity of real-time applications like Skype and videoconferencing which depend on low latency. Web browsing tends to be another one of these bad contributors to network congestion due to websites using so many HTTP/1.X connections in the page load. As Patrick McManus observes, HTTP/2’s multiplexing will lead to a reduction of connections a browser will have to use to load a page. Fewer connections will both decrease the overall effective cwnd to something more reasonable and increase the likelihood that congestion related packet loss signals will affect the transmission rate, leading to less queue buildup. HTTP/2 is a key piece in the overall incentives so that web developers don’t have to increase the connection count in order to get sufficient parallelization.
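The effective cwnd multiplication is simple arithmetic (again assuming the common defaults of a 10-segment initial window and ~1460-byte segments):

```python
# Aggregate initial burst: each connection brings its own initcwnd, so
# many parallel connections multiply the data blasted into the network
# before any congestion signal arrives.
SEGMENT_BYTES = 1460  # typical TCP payload per segment
INITCWND = 10         # typical initial congestion window, in segments

def initial_burst_bytes(connections: int) -> int:
    return connections * INITCWND * SEGMENT_BYTES

# Sharded HTTP/1.X page load: 6 connections per host across 5 hostnames.
print(initial_burst_bytes(30))  # 438000
# Single multiplexed HTTP/2 connection:
print(initial_burst_bytes(1))   # 14600
```

A 30-connection page load can dump well over 400 KB into the network’s queues before a single loss signal comes back, which is exactly the queue-building behavior bufferbloat punishes.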

I’ve primarily discussed networking performance from a browser page load performance here, but the same principles apply to non-browser use cases too. As Martin Thomson (HTTP/2 editor) says:

I don’t think that I’m alone in this, but the bulk of my day job at Skype was building those sorts of systems with what you call “RESTful APIs”. Having HTTP/2.0 was identified as a being hugely important to the long term viability of those systems. The primary feature there was multiplexing (reducing HOL blocking is a big win), but we did also identify compression as important (and potentially massively so). We were also speculatively interested in push for a couple of use cases.

Indeed, Twitter even provides us with performance data for its API users who experienced significant performance improvements when using SPDY:

Scalability & DoS

Another major HTTP concern relates to scalability and DoS. The working group is very sensitive to these issues, especially because a large chunk of the active participants of the httpbis working group represent intermediaries / large services (e.g. Akamai, Twitter, Google, HAProxy, Varnish, Squid, Apache Traffic Server, etc). HTTP/2 has a number of different features/considerations that influence scalability:

  • Header compression - very controversial
  • Multiplexing - not controversial at all from a scalability standpoint, widely considered a Good Thing (TM)
  • Binary framing - not controversial at all from a scalability standpoint, widely considered a Good Thing (TM)
  • TLS - controversial

To get an understanding of scalability & DoS concerns, it’s useful to see what some of the high scalability server folks have to say here. Willy Tarreau of HAProxy has written / talked extensively about these issues. From Willy’s IETF83 httpbis presentation slides:

Slide 2:

Intermediaries have a complex role :

  • must support unusual but compliant message formating (eg: variable case in header names, single LF, line folding, variable number of spaces between colon and field value)
  • fix what ought to be fixed before forwarding (eg: multiple content-length and folding), adapt a few headers (eg: Connection)
  • must not affect end-to-end behaviour even if applications rely on improper assumptions (effects of rechunking or multiplexing)
  • need to maintain per-connection context as small as possible in order to support very large amounts of concurrent connections
  • need to maintain per-request processing time as short as possible in order to support very high request rates
  • front line during DDoS, need to take decisions very quickly

Slide 4:

Intermediaries would benefit from :

  • Reduced connection/requests ratio (more requests per connection)
    • drop of connection rate
    • drop of memory footprint (by limiting concurrent conns)
  • Reduced per-request processing cost and factorize it per-connection
    • higher average request rate
    • connection setup cost is already “high” anyway
  • Reduced network packet rate by use of pipelining/multiplexing
    • reduces infrastructure costs
    • significantly reduces RTT impacts on the client side

As Willy discusses in his HTTP/2 expression of interest for HAProxy, he’s overall favorably inclined towards the SPDY proposal, which became the starting point for HTTP/2. Prefixing header name/values with their size makes header parsing simpler and more efficient. And the big win really is multiplexing, since you reduce the number of concurrent connections, and lots of the memory footprint is per connection. Also, as can be seen on HAProxy’s homepage, a large majority of the time is spent in the kernel. By multiplexing multiple transactions onto the same connection, you reduce the number of expensive syscalls you have to execute, since you can do more work per syscall. Header compression is a key enabler of high levels of parallelism while multiplexing, so it helps enable doing more work per syscall. On the flip side, header compression is also the major issue Willy has with the original SPDY proposal:

For haproxy, I’m much concerned about the compression overhead. It requires memory copies and cache-inefficient lookups, and still maintains expensive header validation costs. For haproxy, checking a Host header or a cookie value in a compressed stream requires important additional work which significantly reduces performance. Adding a Set-Cookie header into a response or remapping a URI and Location header will require even more processing. And dealing with DDoSes is embarrassing with compressed traffic, as it improves the client/server strength ratio by one or two orders of magnitude, which is critical in DDoS fighting.

That said, that’s in reference to SPDY header compression which used zlib. The new header compression proposal (HPACK) is different in that it is CRIME-resistant, by not being stream-based, and instead relying on delta encoding of header key-value pairs. It is notably still stateful, but allows the memory requirement to be bounded (and even set to 0, which effectively disables compression). This latter facility is key, because if header compression does pose significant enough scalability or DDoS concerns, it can definitely be disabled. There is a long, complicated, somewhat dated thread discussing it, which is highly educational. I think it’s safe to say that the feature is still fairly controversial, although there’s definitely a lot of momentum behind it.
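The delta-encoding idea at HPACK’s core can be sketched with a toy version. This is my own simplified illustration, not the actual HPACK wire format (no static table, no Huffman coding, no dynamic-table eviction):

```python
# Toy illustration of HPACK's core idea: both ends remember header fields
# already sent, so repeated fields become a small index instead of a full
# re-transmission of the bytes.
class ToyHeaderTable:
    def __init__(self):
        self.table = {}  # (name, value) -> index

    def encode(self, headers):
        out = []
        for field in headers:
            if field in self.table:
                out.append(("index", self.table[field]))  # a couple of bytes
            else:
                self.table[field] = len(self.table)
                out.append(("literal", field))            # full header bytes
        return out

enc = ToyHeaderTable()
req1 = [(":method", "GET"), ("cookie", "session=abc123"), (":path", "/a")]
req2 = [(":method", "GET"), ("cookie", "session=abc123"), (":path", "/b")]
enc.encode(req1)  # first request: everything is a literal
second = enc.encode(req2)
literals = [f for kind, f in second if kind == "literal"]
print(literals)  # [(':path', '/b')] -- only the changed field is re-sent
```

On the second request, the method and the big cookie collapse to indices, which is why repeated large cookies are where HPACK wins most; bounding (or zeroing) the table size is what lets an intermediary cap the state this scheme requires.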

There have been alternate proposals for header encoding, mostly based around binary/typed coding. Many people hope that this is enough, and it removes the requirement for stateful compression, but many fear that it is not enough due to huge, opaque cookie blobs. I do not mention it further here since while these proposals have merit, they don’t seem to generate sufficient interest/discussion for the working group at large to want to pursue them. Although recently someone expressed interest again.

And Varnish maintainer Poul-Henning Kamp has been especially critical of HTTP/2 on the scalability / DDoS prevention front. Indeed, one of his slides from his RAMP presentation “HTTP Performance is a solved problem” says it best:

HTTPng HTTP/2.0 –

  • ”Solves” non-problems (bandwidth)
  • Ignores actual problems (Privacy, DoS, speed)
  • Net benefit: At best, very marginal
  • -> Adoption: Why bother ?

PHK is well known for being a bit hyperbolic, so rather than address the actual words he writes here, I’ll interpret the general meaning being that we aren’t doing enough to make HTTP more scalable (computationally performant, reduced memory consumption, and DDoS resistant). He puts forth a number of concrete proposals for things we could do better here:

What does a better protocol look like ?

  • Flow-label routing
  • Only encryption where it matters
  • Non-encrypted routable envelope
  • Fixed-size, fixed order fields
  • No dynamic compression tables

Notably HPACK runs against a number of these. You can’t simply do flow-label (subset of headers represents a flow that always gets routed to backend X) routing since HPACK requires “processing” all headers, and the order is not fixed. HPACK has a number of advantages, but PHK’s primary counterpoint is that most of those advantages come in a world where cookies comprise the largest portion of HTTP headers, and cookies are simply evil. PHK would love to see us kill cookies in HTTP/2 and replace them with tiny session IDs. Lots of his rationale here comes from political/legal opinions about cookies and user tracking / privacy, which I’m not going to dwell on. But he makes an interesting point that perhaps the main thing HPACK fixes is redundant sending of large cookies, and he’d like us to kill them off and replace them with session ids. As Willy points out, this is actually contrary to overall scalability due to requiring distributed server-side session management syncing across machines/datacenters. This is obviously controversial because if we want to encourage servers to adopt HTTP/2, we need to provide incentives to do so, and breaking backwards compatibility with cookies and requiring deploying distributed data stores is likely to make most companies question the wisdom of switching to HTTP/2.

Now, the other major controversial issue from a scalability standpoint is TLS, and the reasons are fairly obvious. It incurs extra buffer copies for symmetric encryption/decryption, expensive computation for the asymmetric crypto used in the handshake, and extra session state. Going into detail on the exact cost here is beyond the scope of this post, but you can read up about it on the internets: Adam Langley on SSL performance at Google & Vincent Bernat’s two posts on the performance of common SSL terminators.

Another obvious scalability issue with encryption is that intermediaries sometimes want to improve scalability (and latency) by caching, or by modifying content (e.g. image and video transcoding, downsampling, etc.). However, if you can’t inspect the payload, you can’t cache it, which means caching intermediaries are not viable unless they are able to decrypt the traffic. This clearly has some internet scalability / latency implications, especially in the developing world and other places far from the web servers, like Africa and Australia.

Lastly, TLS faces some scalability issues due to IPv4 address space scarcity. With HTTP, webservers can reduce their IP address utilization by using virtual hosting. That same virtual hosting method doesn’t work for HTTPS without SNI support, meaning that as long as a significant user base doesn’t support SNI (e.g. IE users on XP and the Android browser pre-Honeycomb), servers that don’t want to break those users will need to use a separate VIP per hostname. Given IPv4 address scarcity and the insufficient deployment of IPv6, TLS can present additional scalability issues in terms of address space availability.
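
The constraint can be sketched as the certificate-selection step at a TLS terminator (all names and addresses below are hypothetical): with SNI the server learns the hostname during the handshake, so one IP can front many certificates; without it, the certificate must be chosen by IP alone, hence one VIP per hostname.

```python
# Hypothetical certificate selection at a TLS terminator.
CERTS_BY_HOST = {"a.example.com": "cert-A", "b.example.com": "cert-B"}
CERTS_BY_IP = {"192.0.2.1": "cert-A", "192.0.2.2": "cert-B"}

def select_cert(server_ip, sni_hostname=None):
    if sni_hostname:
        # SNI: hostname is known at handshake time, so one IP can
        # serve many hostnames, each with its own certificate.
        return CERTS_BY_HOST.get(sni_hostname, "cert-default")
    # No SNI: the hostname is unknown until after the handshake, so the
    # cert can only be keyed on the IP -- one VIP per hostname.
    return CERTS_BY_IP.get(server_ip, "cert-default")

print(select_cert("192.0.2.1"))                     # → cert-A
print(select_cert("203.0.113.9", "b.example.com"))  # → cert-B
```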

Multiplexing is a big win for scalability, but stuff like header compression is more ambiguous, and TLS clearly imposes scalability costs. There hasn’t been much public release of numbers here, except for this one article from Neotys which concludes with:

It’s no surprise that SPDY improves response times on the client side. That’s what it was designed to do. It turns out that SPDY also has advantages on the server side:

  • Compared to HTTPS, SPDY requests consume less resources (CPU and memory) on the server.
  • Compared to HTTP, SPDY requests consume less memory but a bit more CPU. This may be good, bad, or irrelevant depending on which resource (if either) is currently limiting your server.
  • Compared to HTTP/S, SPDY requires fewer Apache worker threads, which increases server capacity. As a result, the server may attract more SPDY traffic.

Implementation Complexity

A lot has been said about SPDY and HTTP/2’s implementation complexity. I think it’s pretty useful to see what actual implementers have to say about the matter, since they have real experience:

  • Patrick McManus discusses the binary framing decision here, where he explains why binary is so much simpler and more efficient than ASCII for SPDY / HTTP/2.
  • On the flip side, Jamie Hall, who had worked on SPDY for his undergraduate dissertation, discusses his SPDY/3 implementation here. Notably, he says that most things were fine, but that header compression and flow control were a little complicated.
  • Jesse Wilson of Square also wrote up his feedback on HTTP/2 header compression. He has a lot of substantive “nitpicks” about the “rough edges”, but says that “Overall I’m very happy with the direction of HPACK and HTTP/2.0.”
  • Adrian Cole of Square also wrote his thoughts on HPACK, saying: “I found that implementing this protocol, while not trivial, can be done very efficiently.” and “All in, HPACK draft 5 has been quite enjoyable to develop. Thanks for the good work.”
  • Stephen Ludin of Akamai also noted that:

I just implemented the header compression spec and it felt incredibly complex so I am definitely inclined to figure out a scheme for simplification. My concern is similar to Mike’s: we will get buggy implementations out there that will cause us to avoid its use in the long run. The beauty of the rest of the HTTP/2.0 spec is its simplicity of implementation. I like to think we can get there with header compression as well.

  • James Snell from IBM has participated heavily in the working group and has written up a number of good, detailed blog posts highlighting complexity in the HTTP/2 draft specs. He spends the vast majority of the time criticizing the complexity, so people might think that he’s overall negative on HTTP/2, but they miss the point that he’s soundly in favor of the core parts of HTTP/2 - binary framing and multiplexing. As he says:

Many people who have responded to my initial post have commented to the effect that switching to a binary protocol is a bad thing. I disagree with that sentiment. While there are definitely things I don’t like in HTTP/2 currently, I think the binary framing model is fantastic, and the ability to multiplex multiple requests is a huge advantage. I just think it needs to be significantly less complicated.

Overall, I think it goes without saying that there is more implementation complexity in HTTP/2, especially around header compression and state management. Some people have incorrectly viewed the binary framing as a source of that complexity. This is simply false, as Patrick McManus demonstrates in great detail. Indeed, if you go look at the actual implementations out there, the binary framing is some of the simplest, most straightforward code. But there are plenty of areas that are clearly more complicated, with multiplexing, header compression, and server push among them. For a sense of the complexities of header compression, one has only to look at Jesse Wilson’s email to httpbis and James Snell’s blog post on header compression.

Header compression used to be significantly less complicated, practically speaking, when SPDY used zlib for header compression. When you don’t need a whole, new, separate spec for header compression, and can instead rely on an existing standard that has widely deployed open source implementations, header compression adds significantly less implementation complexity. That said, CRIME rendered zlib insecure for use in HTTP header compression, leaving us with the burden of added complexity for implementing a new header compression algorithm if we still want header compression.
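
For contrast, here’s a sketch of the zlib approach SPDY originally used: one long-lived compression stream per direction, so bytes repeated across requests (big cookies!) compress to almost nothing on the second request. The header strings are made up for illustration, and this cross-request sharing of compressor state is also precisely the property that CRIME exploited.

```python
import zlib

# One long-lived zlib stream for all requests in one direction,
# as SPDY did before CRIME. Header strings are illustrative.
compressor = zlib.compressobj()

req1 = b"GET /a HTTP/1.1\r\nHost: example.com\r\nCookie: id=" + b"x" * 400 + b"\r\n"
req2 = b"GET /b HTTP/1.1\r\nHost: example.com\r\nCookie: id=" + b"x" * 400 + b"\r\n"

# Z_SYNC_FLUSH emits all pending output while keeping the stream alive.
out1 = compressor.compress(req1) + compressor.flush(zlib.Z_SYNC_FLUSH)
out2 = compressor.compress(req2) + compressor.flush(zlib.Z_SYNC_FLUSH)

# The second request mostly back-references the first across the
# shared stream, so out2 is far smaller than out1.
print(len(req1), len(out1), len(out2))
```

No new spec needed, just an existing, widely deployed library — which is the implementation-complexity point being made above.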

Putting header compression aside, another big chunk of the complexity in HTTP/2 is due to stream multiplexing. Now, multiplexing sounds simple, and conceptually it is. You tag frames with a stream id, big deal. And now each stream is its own, mostly independent unit with its own moderately sized state machine, which definitely isn’t trivial, but is quite understandable. But the simple act of introducing multiplexed streams leads to other issues. Many of them have to do with handling races and keeping connection state synchronized. None of these are hard per se, but they increase the number of edge cases that implementers need to be aware of.
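
The core idea really is that small. A minimal sketch — the frame tuples here are a made-up stand-in for real HTTP/2 frames:

```python
# Multiplexing in miniature: frames carry a stream id, and the connection
# demultiplexes them into independent per-stream sequences.

def demux(frames):
    streams = {}
    for stream_id, payload in frames:
        streams.setdefault(stream_id, []).append(payload)
    return streams

# Two responses interleaved on one connection:
frames = [(1, b"<html>"), (3, b"\x89PNG"), (1, b"</html>"), (3, b"...")]
print(demux(frames))
# → {1: [b'<html>', b'</html>'], 3: [b'\x89PNG', b'...']}
```

The ten-line version hides all the hard parts, of course: what a real implementation adds is exactly the per-stream state machines, races, and synchronization mentioned above.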

One of the bigger implications of multiplexing is the necessity of HTTP/2-level flow control. Previously, with HTTP/1.X, there was only ever one stream of data per transport connection, so the transport’s flow control mechanisms were by themselves sufficient. If you ever ran out of memory for buffers, you just stopped reading from the transport connection socket, and the transport flow control mechanism kicked in. However, with multiplexed streams, things get more complicated. TL;DR: Fundamentally, the issue is that not reading from the socket will block all streams, even if only one stream has buffering issues. Flow control has non-trivial complexity, and naive implementations can easily kill their performance by keeping the windows too small. Indeed, the spec actually recommends effectively disabling flow control on the receiving end (by advertising a large window) unless you truly need the control it provides.
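
Here is a sketch of the receiver side of per-stream flow control. The class, names, and the replenish-after-the-app-reads policy are illustrative choices, not taken from the spec, but the bookkeeping shape is the one HTTP/2 implementations end up with:

```python
# Sketch of receiver-side, per-stream flow control (illustrative).

class StreamReceiver:
    def __init__(self, initial_window=65535):
        self.window = initial_window
        self.buffer = bytearray()
        self.updates_sent = []           # WINDOW_UPDATE increments we'd emit

    def on_data_frame(self, payload):
        if len(payload) > self.window:   # peer overran our advertised window
            raise ConnectionError("FLOW_CONTROL_ERROR")
        self.window -= len(payload)
        self.buffer += payload

    def app_read(self, n):
        data, self.buffer = bytes(self.buffer[:n]), self.buffer[n:]
        # Replenish: tell the peer it may send len(data) more bytes.
        self.window += len(data)
        self.updates_sent.append(len(data))
        return data

r = StreamReceiver(initial_window=10)
r.on_data_frame(b"0123456789")   # window hits 0: the sender must stall
data = r.app_read(4)             # app consumes 4 bytes -> WINDOW_UPDATE(4)
print(r.window, r.updates_sent)  # → 4 [4]
```

Note how a tiny window (here, 10 bytes) throttles the sender to a stop after one frame — the "naive implementations kill their performance" failure mode in miniature.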

Another controversial part of HTTP/2, primarily due to its complexity, is server push. HTTP is traditionally a request/response protocol, so having the server push responses clearly complicates the stream state model, not to mention introduces the possibility of races which the protocol has to address. There is a lot more to say about server push, but it clearly adds additional complexity to HTTP/2, and for that reason amongst others, its status within the draft spec has always been rather shaky. The primary counterargument is that servers already use existing but suboptimal techniques (most notably inlining small resources) to “push” responses, so server push would just provide a better, protocol-based solution.

I could go on about the other complexities of HTTP/2, but I think it’s fair to say that it’s clearly nontrivially more complicated than HTTP/1.X. It’s not rocket science though, and this complexity is clearly quite tractable, given the wide availability of both SPDY and HTTP/2 implementations.

Text vs Binary

One of the most noticeable changes with HTTP/2 is that it changed from a text based protocol to a binary protocol. There are a lot of tradeoffs between textual and binary protocols, and a lot of them have been written up before. Stack Overflow has a great thread on some of the tradeoffs, and ESR also wrote about it in The Art of Unix Programming.

In the specific HTTP case, there are lots of wins to binary. I’ve already covered them earlier in the scalability and implementation complexity sections. Binary framing with length indicators makes parsing way more efficient and easy. The interesting thing to discuss here is the downsides, first and foremost of which is ease of use/understanding.
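
The parsing-efficiency claim is easy to see concretely. Below is a parser for the fixed-size frame header, using the 9-octet layout HTTP/2 eventually standardized (24-bit payload length, 8-bit type, 8-bit flags, one reserved bit plus a 31-bit stream id). There is no scanning for delimiters and no ambiguity about how many bytes to wait for:

```python
import struct

# Parse the fixed-size HTTP/2 frame header (9 octets, per the layout
# eventually standardized): length(24) | type(8) | flags(8) | R+stream(32).

def parse_frame_header(buf):
    if len(buf) < 9:
        return None                      # need more bytes -- no guesswork
    len_hi, len_lo, ftype, flags, stream = struct.unpack(">BHBBI", buf[:9])
    length = (len_hi << 16) | len_lo
    return length, ftype, flags, stream & 0x7FFFFFFF  # clear reserved bit

# A 5-byte DATA frame (type 0x0) with END_STREAM (flags 0x1) on stream 1:
header = b"\x00\x00\x05" + b"\x00" + b"\x01" + b"\x00\x00\x00\x01"
print(parse_frame_header(header))  # → (5, 0, 1, 1)
```

Compare that to HTTP/1.x parsing, which scans for CRLFs, handles folded headers, and tolerates endless malformed-input corner cases.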

Binary sucks in terms of ease of use and understanding. In order to handle the binary, you’ll most likely have to pipe stuff through a library or tool to process it. No more straight up telnet. No more human readable output by just sniffing the bytes off the network. This obviously sucks. One of the best things about the web has been its low barrier of entry - the ease of understanding by just looking at what’s actually getting sent over the wire, and testing things out with telnet.

That said, it’s interesting to think about how things will change in a HTTP/2 world, especially one that’s sitting atop TLS. How are people poking around today? Are they running tcpdump and examining the raw bytes? Are they using telnet and netcat and what not? My gut instinct, which could be very wrong, is that most people are using the browser web developer tools. HTTP/2 may introduce wire level changes, but the HTTP semantics are unchanged, and you’ll still see the same old HTTP request/response pairs in the developer tools, emitted to and parsed from the wire in HTTP/2 binary format for you by the browser. Server-side, the webservers that support HTTP/2 will likewise handle all the parsing and generating of HTTP/2 frames into HTTP messages, and will likely emit the same logs for debugging purposes. For people who actually examine the network traffic, they’re probably using Wireshark, which already has basic support for HTTP/2 draft versions and can decrypt SSL sessions when the pre-master secret is provided.

It’s difficult for me to say how much binary will impact ease of use and understanding. It’s not clear to me how often people care about the actual wire format, rather than the HTTP request/response messages, which will still be good ol’ text. Who’s to say?

Moreover, depending on whether or not you’re sold on HTTP/2 being over TLS everywhere, then a good part of the binary vs text discussion may already be decided. If you’re a HTTPS/TLS all the things! kinda guy, then yeah, the wire format is going to be a bunch of binary gibberish anyway.

Deployability

One of the major considerations with SPDY and HTTP/2 has been how to deploy it widely and safely. If you don’t care about wide, public interop, then there’s not a huge need to standardize a protocol and you can go ahead and use whatever private protocol you want. But otherwise, this is a major factor to keep in mind. SPDY sidesteps many issues by (1) not changing HTTP semantics, thereby avoiding compatibility issues and (2) using TLS in all major deployments, which avoids issues with intermediaries. Let’s review some of the various other options for deployment that people have discussed:

Don’t bother with HTTP/2, just use HTTP Pipelining

Many have questioned the need for HTTP/2 if pipelining can already provide many of the benefits. Putting aside the ambiguous performance benefits of pipelining, the main reason that pipelining isn’t adopted by browsers is because of the compatibility issues. As Patrick explains for Firefox:

This is a painful note for me to write, because I’m a fan of HTTP pipelines.

But, despite some obvious wins with pipelining it remains disabled as a high risk / high maintenance item. I use it and test with it every day with success, but much of the risk is tied up in a user’s specific topology - intermediaries (including virus checkers - an oft ignored but very common intermediary) are generally the problem with 100% interop. I encourage readers of planet moz to test it too - but that’s a world apart from turning it on by default - the test matrix of topologies is just very different.

I explain the same reasoning in more detail for why Chromium won’t enable pipelining by default either:

For example, the test tries to pipeline 6 requests, and we’ve seen that only around 65%-75% of users can successfully pipeline all of them. That said, 90%-98% of users can pipeline up to 3 requests, suggesting that 3 might be a magic pipeline depth constant in intermediaries. It’s unclear what these intermediaries are…transparent proxies, or virus scanners, or what not, but in any case, they clearly are interfering with pipelining. Even 98% is frankly way too low a percentage for us to enable pipelining by default without detecting broken intermediaries using a known origin server, which has its aforementioned downsides. Also, it’s unclear if the battery of tests we run would provide sufficient coverage.

You’ll note that Patrick refers to 100% interop, and I say that 98% interop is way too low. Breakage has to be absolutely tiny, because users have low tolerance for breakages. If a user can’t load their favorite kitten website, they’ll give up and switch browsers very quickly.

HTTP Upgrade in the clear over port 80

HTTP Upgrade is definitely an option, and it’s currently in the HTTP/2 draft spec, primarily to serve as the cleartext option for HTTP/2. One of the major issues with it, from a deployment standpoint, is that it has a relatively high failure rate over the internet. People around the web are independently discovering this themselves and are recommending that you only deploy WebSockets over TLS. There’s some hope that if HTTP/2 gets standardized with the Upgrade method, HTTP intermediaries will eventually (years, decades, who knows) get updated to accept upgrading to HTTP/2. But internet application developers, especially those who have real customers such that any loss of customer connectivity affects the bottom line, simply have very little incentive to use HTTP Upgrade as their HTTP/2 deployment option. On the other hand, within private networks like corporate intranets, there shouldn’t be as many troublesome uncontrolled HTTP intermediaries, so HTTP Upgrade might be much more successful in that scenario.

Use a different transport protocol

In IETF 87 in Berlin, there was a joint tsvwg (transport area folks) and httpbis (HTTP folks) meeting. One of the things that came out of this meeting was that the transport area folk wanted a list of features that application (HTTP) folks wanted, which my colleague Roberto provided. This led to a series of responses asking why HTTP/2 was reinventing the wheel and why not use other protocols like SCTP/IP and what not. Almost all of these basically come down to deployability. These transport features are not available on the host OSes that our applications run on top of, nor do they traverse NATs particularly well. That makes them not deployable for a large number of users on the internet. SCTP/UDP/IP is much more interesting to consider, although as noted by Mike, it has its own issues like extra roundtrips for the handshake.

Use a new URL scheme and ports

As James Snell blogged about previously:

Why is it this complicated? The only reason is because it was decided that HTTP/2 absolutely must use the same default ports as HTTP/1.1…. which, honestly, does not make any real sense to me. What would be easier? (1) Defining new default TCP/IP ports for HTTP/2 and HTTP/2 over TLS. (2) Creating a new URL scheme http2 and https2 and (3) Using DNS records to aid discovery. It shouldn’t be any more complicated than that.

Going back to the earlier numbers that Adam Langley shared with the TLS working group on the WebSocket experiment, using a new TCP port has a much lower connectivity success rate than SSL, most likely due to firewalls that whitelist only TCP ports 80 and 443. Granted, the success rate was higher than upgrading over port 80, but whereas HTTP Upgrade theoretically gracefully falls back to HTTP/1.1 when the upgrade fails (at least we hope so!), failure to connect to a new port is just a failure. And it might not even be a TCP RST or something, it might just hang (some firewalls send RSTs, some just drop packets), and force clients to employ timer based fallback solutions which are absolutely terrible for interactivity (like browsing).

The incentives simply aren’t favorable for this deployment strategy to succeed. Server operators and content owners aren’t terribly inclined to support this new protocol if it means they both have to update URLs in their content to use the new scheme and also tolerate a loss in customer connectivity. And client (e.g. browser) vendors aren’t terribly incentivized to support the new protocol if using it results in connectivity failures, because the first thing a user does when a page fails to load in browser X is to try browser Y, and if it works there, switch to it. Switching to a new scheme also breaks the shareability of URLs. Let’s say I see a cute kitten photo at http2:// in my favorite browser that supports the http2 scheme. I send this URL to my friend who is on IE8. It fails to load. This is a terrible user experience. We’ve seen this happen before with WebP (when not using content negotiation via Accept):

It turns out that Facebook users routinely do things like copy image URLs and save images locally. The use of WebP substantially defeated both of these use cases, because apart from Chrome and Opera, there’s very little software that’s in regular day-to-day use that supports WebP. As a result, users were left with unusable URLs and files; files they couldn’t open locally, and URLs that didn’t work in Internet Explorer, Firefox, or Safari.

While this proposal simplifies the negotiation as James points out, it suffers from the downsides of terrible user experience, deployability difficulties, and requiring updating all URLs in all content.

Forget backwards compatibility, fix issues from HTTP/1.1

PHK has written and talked extensively about his issues with the current HTTP/2 proposal. One of the major things he asserts is that the HTTP/2 proposal simply layers more complexity on top of the existing layers, without tackling the deep architectural issues with HTTP/1.1. For one, he suggests killing off cookies and replacing them with session IDs. Cookies are hated for many reasons (privacy, network bloat, etc), and the user agent definitely has some incentive to replace them with something better. However, as Willy notes and Stephen affirms, servers don’t really have much incentive to switch to this replacement. PHK claims that:

In my view, HTTP/2.0 should kill Cookies as a concept, and replace it with a session/identity facility, which makes it easier to do things right with HTTP/2.0 than with HTTP/1.1.

Being able to be “automatically in compliance” by using HTTP/2.0 no matter how big dick-heads your advertisers are or how incompetent your web-developers are, would be a big selling point for HTTP/2.0 over HTTP/1.1.

It’s unclear to me how effective this “automatically in compliance” carrot is compared to the technical and financial costs of updating web servers/content to replace client-side cookies with server-side distributed, synchronized data stores. As PHK himself says, this raises the question for me: what if they made a new protocol, and nobody used it?

TLS & Privacy

Ah yes, privacy. Well, that’s a good thing, right? If communications aren’t kept confidential, then it’s difficult to maintain one’s privacy. So why not try to preserve confidentiality in all communications, by doing stuff like mandating running HTTP/2 over a secure transport such as TLS?

Political / Legal / Corporate restrictions on Privacy

Well for one, some people think that increasing the use of TLS on the internet is overall bad for privacy due to political, legal, and economic reasons. PHK covers a long list of reasons in his ACM Queue article on encryption, but his email to httpbis summarizes it pretty well:

Correct, but if you make encrypt mandatory, they will have to break all encryption, that’s what the law tells them to.

As long as encryption only affects a minority of traffic and they can easier go around (ie: FaceBook, Google etc. delivering the goods) they don’t need to render all encryption transparent.

PHK seems to fundamentally believe that it’s futile to preserve confidentiality via technical means like encryption. If more people use encrypted communications, then that will just incentivize governments to break that encryption. The only solution here is political, and the IETF has no business trying to mandate the use of encryption. On the other hand, some other folks believe that various parties already have plenty of incentive to try to break encryption, not to mention that they are more likely to use other attack vectors than directly breaking the cryptosystems, since those vectors are typically easier to exploit.

Putting aside the economics of breaking encryption, the legal aspect of mandatory encryption is very useful to note. As PHK’s ACM Queue article details, mandatory encryption clashes with legal mandates in certain nation-states like the UK. Adrien de Croy (author of WinGate) also explains these issues:

Proponents of mandatory crypto are making an assumption that privacy is ALWAYS desirable.

It is not ALWAYS desirable.

In many cases privacy is undesirable or even illegal.

  1. Many states have outlawed use of crypto. So will this be HTTP only for the “free” societies? Or is the IETF trying to achieve political change in “repressed” countries? I’ve dealt with our equivalent of the NSA on this matter. Have you? I know the IETF has a neutral position on enabling wiretapping. Mandating SSL is not a neutral / apolitical stance. Steering HTTP into a collision course with governments just doesn’t seem like much of a smart idea.

  2. Most prisons do not allow inmates to have privacy when it comes to communications. Would you deny [*] all prisoners access to the web?
    There are other scenarios where privacy is not expected, permitted or desirable.

As Adrien and PHK have both pointed out, there are many situations where arguably privacy is undesirable. Employees sometimes don’t get privacy because their companies want to scan all traffic to detect malware or loss of corporate secrets. Schools often must monitor students’ computer use for porn. Going further, Adrien blames large websites whose use of SSL has encouraged companies, schools, and other organizations to deploy MITM proxies (relying on locally installed root certificates) that intercept SSL connections to fulfill their corporate needs or legal obligations:

We added MITM in WinGate mostly because Google and FB went to https.
Google and FB you may take a bow.

Does this improve security of the web overall? IMO no. People can now snaffle banking passwords with a filter plugin.

You really want to scale this out? How will that make it any better?

The counter to this argument seems fairly obvious to me. Just because some subset of users in specific situations (at work, at school, in a police-state, etc) may not be “allowed” to have privacy (due to whatever reason, be it legal, corporate policy, school policy, etc), we shouldn’t give up trying to provide confidentiality of communication for everyone else the rest of the time. I responded accordingly and Patrick McManus followed up saying:

On Wed, Nov 13, 2013 at 7:09 PM, William Chan (陈智昌)
replied to Willy:
> Just to be clear, the MITM works because the enterprises are adding new
> SSL root certificates to the system cert store, right? I agree that that is
> terrible. I wouldn’t use that computer :) I hope we increase awareness of
> this issue.

This is a super important point. If someone can install a root cert onto your computer then you are already owned - there is no end to the other things they can do too. Call it a virus, call it an enterprise, but call it a day - you’re owned and there is no in-charter policy this working group can enact to change the security level of that user for good or for bad.. The good news is not everyone is already owned and SSL helps those people today.

The Cost of TLS & PKI

Another common argument against trying to increase TLS usage is that it’s costly. This cost takes a number of different forms:

  • Computational / scalability costs - As discussed earlier, TLS incurs some amount of computational cost (more buffer copies, symmetric encryption/decryption, asymmetric crypto for key exchange, preventing caching at intermediaries, etc). As before, I won’t delve into these costs in detail; there is already plenty of information out on the interwebs about them. The costs are real.
  • PKI cost - Setting up certificates can incur some amount of cost. Note that StartSSL provides free certificates, and indeed that’s what I myself use. That said, there are definitely costs to acquiring a certificate.
  • Operational costs - Properly managing the keys (key rotation, etc) and certificates is difficult. Debugging is more difficult since now you have to decrypt the traffic. Setting up HTTPS requires a certain amount of knowledge, and it’s not obvious when you do it incorrectly.

Cost has to be put into perspective. What portion of the total cost of operation does TLS increase? Is it significant? Here’s a discussion thread between Mike Belshe and Yoav Nir on the topic:

> It’s not contentious, it’s just false. Go to the pricing page for Amazon
> CloudFront CDN (I would have
> picked Akamai, but they don’t put pricing on their website), and you pay
> 33% more plus a special fee for the certificate for using HTTPS. That’s
> pretty much in line with the 40% figure. That’s real cost that everybody
> has to bear. And you will get similar numbers if you host your site on your
> own servers.

I think you’re thinking like an engineer. You’re right, they do charge more (and I’m right those prices will continue to come down). But those prices are already TINY. I know 33% sounds like a lot, but this is not the primary cost of operating a business. So if you want to do a price comparison, do an all-in price comparison. And you’ll find that the cost of TLS is less than a fraction of a percent difference in operating cost for most businesses. And if you’re not talking about businesses, but consumers, CDNs aren’t really relevant. As an example, I run my home site at Amazon for zero extra cost, but I did buy a 5yr $50 cert.
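
Belshe’s “all-in” framing is easy to illustrate with entirely made-up numbers:

```python
# Illustrative, hypothetical numbers only: a 33% bump on the bandwidth
# bill can still be a tiny share of total operating cost.

bandwidth_cost = 2_000        # $/month CDN/bandwidth bill (hypothetical)
other_opex     = 150_000      # $/month salaries, hosting, everything else
tls_multiplier = 1.33         # the quoted ~33% bandwidth surcharge

total_without = bandwidth_cost + other_opex
total_with    = bandwidth_cost * tls_multiplier + other_opex
increase_pct  = 100 * (total_with - total_without) / total_without
print(round(increase_pct, 2))  # → 0.43: under half a percent all-in
```

The argument obviously cuts differently for a business whose bandwidth bill dominates its costs, which is partly why CDNs themselves are the ones quoting the 33–40% figures.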

Another consideration that has come up a few times is that there are a number of other HTTP users that aren’t browsers, such as printers (and other electronic devices). These users may want the new capabilities of HTTP/2, or they may simply not want to be stuck with an older/unsupported version of HTTP, yet they may not want to bear the cost of TLS. I mean, do people that own printers really want to have to set up a certificate for it? How important is it really to secure the communications to your printer or router? What about when washers, refrigerators, toasters, etc. are all networked?

Do We Want To Encrypt Everything?

In addition to many already well-known issues like passive monitoring and attacks on unencrypted wifi networks, it’s become clear after the Snowden revelations that large-scale, state-sponsored pervasive passive monitoring is occurring. In light of this information, how much traffic do we want to secure? It’s costly, so is it OK if we don’t encrypt communications to a printer? What about “unimportant” traffic? What if people don’t think their traffic is important enough to encrypt?

Many people believe that if they have nothing to hide, then surveillance does not impact them. There have been numerous articles arguing that this is a flawed belief. And as time goes on, there’s more and more evidence showing how our governments are indiscriminately gathering all sorts of information on people for later possible use.

Given all the known (and potential) threats to privacy (and security), many people feel that it’s of utmost importance to secure as much of people’s communications as possible. Part of the reason is that, as Stephen Farrell (IETF Security Area Director) says:

Here, Mike is entirely correct. The network stack cannot know when the payload or meta-data are sensitive so the only approach that makes sense is to encrypt the lot to the extent that that is practical. Snowdonia should be evidence enough for that approach even for those who previously doubted pervasive monitoring.

HTTP is used for lots of sensitive data all the time in places that don’t use https:// URIs today. Sensitive data doesn’t require any life or death argument, it can be nicely mundane, e.g. a doctor visit being the example Alissa used in the plenary in Vancouver.

We now can, and just should, fix that. There’s no hyperbole needed to make that argument compelling.

It’s difficult to know what is “sensitive” and what isn’t. It’s reasonable to assume that most users browsing the web expect/want privacy in their browsing habits, yet don’t know exactly what communications need to be secured. Making communications secure by default would address some of the privacy issues here.

It’s, of course, unclear how much this helps, and indeed some argue that doing this may provide a false sense of security. There are well-known issues with the CA system, there are suspicions about whether or not cryptographic algorithms / algorithmic parameters have been compromised, and does transport-level security even matter if governments can simply persuade the endpoints to hand over data? On the flip side, many feel that despite any issues it has, encryption does work. At the IETF 88 Technical Plenary, Bruce Schneier drove this point home:

So we have a choice, we have a choice of an Internet that is vulnerable to all attackers or an Internet that is secure for all users.

All right? We have made surveillance too cheap, and we need to make it more expensive.

Now, there’s good news-bad news. All right? Edward Snowden said in the first interview after his documents became public, he said, “Encryption works. Properly implemented, strong cryptosystems are one of the few things that you can rely on.”

All right? We know this is true. This is the lesson from the NSA’s attempts to break Tor. They can’t do it, and it pisses them off.

This is the lessons of the NSA’s attempt to collect contact lists from the Internet backbone. They got about ten times as much information from Yahoo! users than from Google users, even though I’m sure the ratio of users is the reverse. The reason? Google use SSL by default; Yahoo! does not.

This is the lessons from MUSCULAR. You look at the slides. They deliberately targeted the data where SSL wasn’t protecting it. Encryption works.

On the other hand, some folks are worried that encrypting too much traffic might make hostile traffic emanating from one’s device harder to find, and thereby lower overall security. Bruce Perens chimed in, originally to point out that encryption is illegal for ham radio, but also to argue that encrypting normal traffic will lower security by making the hostile encrypted traffic harder to find:

Let’s make this more clear and ignore the Amateur Radio issue for now. I don’t wish to be forced into concealment in my normal operations on the Internet.

Nor do I wish to have traffic over my personal network which I can not supervise. Unfortunately, there are a lot of operating systems and applications that I have not written which use that network. When I can’t see the contents of their network traffic, it is more likely that traffic is being used to eavesdrop upon me. Surrounding that traffic with chaff by requiring encryption of all HTTP traffic means that this hostile encrypted traffic will be impossible to find.

Thus, my security is reduced.

Opportunistic Encryption

Let’s assume for the sake of discussion that securing more traffic is a good thing. How would one do so? There are a number of barriers to increased HTTPS adoption, which is why it’s very slow going. But what about trying to secure http:// URIs too? That’s the fundamental idea behind opportunistic encryption - to opportunistically encrypt http:// URIs when the server advertises support for it. Mark Nottingham (httpbis chair) put together a draft for this. It’s key to note that from the web platform perspective, http:// URIs remain http:// URIs, so the origins aren’t changing, nor would the browser SSL indicator UI change.

There used to be some discussion of whether or not opportunistic encryption should require authentication. Requiring authentication would be a big barrier to adoption, since acquiring certificates is a major blocker for some folks. It’s definitely an interesting middle ground, but I won’t bother discussing it further since it’s mostly died out for now.

The appeal of unauthenticated encryption should be fairly evident. It doesn’t require CA-signed certificates, which means that it becomes perhaps feasible to achieve wide deployment of encryption by adding support for this into a few common webservers (perhaps enabled by default) and the major browsers.

Now, unauthenticated encryption obviously has some issues. If you do not authenticate the peer, then it’s easy to do an active MITM attack. Authentication is key to preventing this, so unauthenticated encryption can only thwart passive attackers. However, is that good enough? Active attacks leave traces (and thus are detectable) and cost more, so if unauthenticated encryption is cheap to deploy, it at least raises the bar, which is a good thing. That said, it’s an open question how much it raises the bar. It’s easily defeated by cheap downgrade attacks, although there’s some hope that some form of pinning (maybe Trust on First Use?) could mitigate that.
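To make the pinning idea concrete, here’s a minimal Trust-on-First-Use sketch in Python. It’s entirely hypothetical (`check_tofu` and the in-memory pin store are illustrative, not any browser’s actual mechanism): record a server’s certificate fingerprint the first time we see it, and treat any later mismatch as a possible MITM or downgrade.

```python
import hashlib

# Hypothetical in-memory pin store; a real client would persist this
# across sessions, or TOFU buys nothing.
_pins = {}

def check_tofu(host, cert_der):
    """Trust on First Use: pin the first certificate seen for a host.

    Returns True if the certificate matches the pin (or is the first
    one seen for this host), False if it differs from the pinned
    fingerprint -- a possible active MITM or downgrade attempt.
    """
    fingerprint = hashlib.sha256(cert_der).hexdigest()
    pinned = _pins.setdefault(host, fingerprint)
    return pinned == fingerprint
```

On first sight of a host, the fingerprint is pinned and the connection proceeds; any subsequent certificate with a different fingerprint is flagged. Real pinning designs also need expiry and key-rotation stories, which is part of why this remains an open question rather than a solved one.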

It’s key here to note that opportunistic encryption (at least without pinning or some mechanism to defeat active attackers) provides very marginal security benefit here, but might have some privacy wins if it can impose enough costs to make large-scale pervasive surveillance too costly. It’s easy to see that any active attacker can downgrade/MITM you, so no one should have any illusions about the actual security benefit here. As for raising the costs for organizations to do pervasive surveillance, I don’t think anyone would argue against that as long as it has no downsides. That said, it might be unwise to underestimate the resources that governments are willing to pour into pervasive surveillance.

As to whether or not it has tradeoffs, it definitely has some, although they’re hard to quantify. By providing a middle ground between cleartext HTTP and secured HTTPS, it may lead to some folks adopting opportunistic encryption and not going all the way to HTTPS. Some folks think that encryption is sufficient, and are skeptical of the value of authentication. Thus, the fear among people who want fully authenticated HTTPS everywhere is that offering the middle ground of opportunistic encryption may prevent some folks from biting the bullet and going all the way to HTTPS.

As Tim notes, it’s hard to weigh these benefits and costs here since there’s no hard data:

> As for downsides, will people read too much into the marginal security
> benefit and thus think that it’s OK not to switch to HTTPS? If so, that
> would be terrible. It’s hard to assess how large this risk is though. Do
> you guys have thoughts here?

I agree that’s a risk, but we’re all kind of talking out of our asses here because we don’t have any data. My intuition is that people who actually understand the issues will understand the shortcomings of opportunistic and not use it where inappropriate, and people who don’t get why they should encrypt at all will get some encryption happening anyhow. But intuition is a lousy substitute for data.

Due to the difficulty of this cost/benefit analysis, in addition to the value (questioned by some) of encrypting as much traffic as possible, opportunistic encryption remains a hotly debated topic in httpbis. It’ll be very interesting to see how it turns out.

Proxies

All this talk about switching more traffic to being encrypted end to end obviously raises a big question about what that means for proxies. Interception (otherwise known as “transparent”) proxies are quite ubiquitous in computer networks, especially in corporate networks, ISPs, mobile operators, etc. If use of encryption does increase on the web, then what happens to all these interception proxies? In the current world, there are only two options: (1) Do nothing. The interception proxies will simply see less traffic. (2) Turn the interception proxies into active SSL MITM proxies, by installing additional root certs on devices, for use by the proxy.

There are many issues with option (2), some of which are:

  • MITM proxies are not detectable by users or servers.
  • When additional root certs are installed on devices, then the user agent is no longer sure it’s authenticating the real server, so enhanced security mechanisms such as public key pinning must be disabled.
  • Likewise, SSL client authentication cannot work, since the MITM proxy does not (at least one should hope not!) have the client’s private key.
  • In order to achieve their goals, MITM proxies have to fully break the TLS connection. This means complete loss of confidentiality, integrity, and authentication guarantees, even when the proxy operators may only want to break confidentiality (e.g. for malware scanning) or just metadata confidentiality (e.g. examining HTTP headers in order to perhaps serve a cached copy).

For these various reasons, many in the httpbis working group are looking into “explicit” proxies, where the “explicit” is mostly in contrast to the transparency of interception proxies. And many of these proposals call for making the proxy “trusted”, for various definitions of trusted (give up confidentiality? integrity? everything?). Note that today’s HTTP proxies (both configured and transparent) are essentially “trusted”, since they can MITM HTTP transactions and do whatever they want. The question is whether or not we want to “weaken” HTTPS in any explicit manner. There are a variety of proposals on the table here, almost all of which explicitly give up some of HTTPS’ guarantees in some way. Some people go as far as to suggest that the client should fully trust the proxy. Some propose server, protocol (TLS or HTTP/2), or content modifications in order to provide more specific loss of guarantees, perhaps by switching to object level integrity instead of end to end TLS channel integrity.

What Can/Should the IETF Do?

There are some big questions here about what is reasonable and feasible for the IETF to do. A number of folks feel like the IETF has no business requiring certain behavior, but merely should specify mechanisms and leave it up to individual actors to decide what to adopt. For example, Adrien had this to say about it:

Maybe the problem is us.

e.g. that we think the level of https adoption is a problem to be solved.

personally I do not.

What if it simply reflects the desires of the people who own and run the sites. Exercising their choice.

We are proposing (yet again) taking that choice away which I have a major problem with. It’s a philosophical problem, I don’t believe any of us have the right to make those choices for everyone else, especially considering (which few seem to be) the ENORMOUS cost.

Moreover, some people think that mandating encryption in HTTP/2 is security theater. As Eliot Lear said:

There simply is no magic bullet. The economics are clear. The means to encrypt has existed nearly two decades. Mandating encryption from the IETF has been tried before – specifically with IPv6 and IPsec. If anything, that mandate may have acted as an inhibitor to IPv6 implementations and deployment and served as a point of ridicule.

On the flip side, there’s a strong movement in the IETF to explicitly treat pervasive surveillance as an attack that IETF protocols should defend against. This goes a step further beyond the IETF’s previous consensus stance here, documented in RFC 2804. Brian Carpenter explained further that:

My understanding of the debate in Vancouver was that we intend to go one step beyond the RAVEN consensus (RFC 2804). Then, we agreed not to consider wiretapping requirements as part of the standards development process. This time, we agreed to treat pervasive surveillance as an attack, and therefore to try to make protocols resistant to it.

Which is completely disjoint from whether operators deploy anti-surveillance measures; that is a matter of national law and not our department.

While I still consider it very much under debate as to what the IETF should be doing here, I think the current sentiment is definitely leaning towards adopting Stephen’s draft to treat pervasive surveillance as an attack. In terms of how this applies to HTTP/2, httpbis chair Mark Nottingham answered that for me:

> What do “adequately address pervasive monitoring in HTTP/2.0”

Well, that’s the fun part. Since there isn’t specific guidance in this draft, we’ll need to come up with the details ourselves. So far, our discussion has encompassed mandatory HTTPS (which has been controversial, but also seems likely to be in some of the first implementations of HTTP/2.0) and opportunistic encryption (which seems to have decent support in principle, but there also seems to be some reluctance to implement, if I read the tea leaves correctly). Either of those would probably “adequately address” if we wrote them into HTTP/2.0. Alternatively, it may be that we don’t address pervasive monitoring in the core HTTP/2.0 document itself, since HTTP is used in a such a wide variety of ways, but instead “adequately address” in a companion document. One proposal that might have merit is shipping a “HTTP/2.0 for Web Browsing” document and addressing pervasive monitoring there. My biggest concern at this point is the schedule; we don’t have the luxury of a drawn-out two year debate on how to do this.

> and “we’ll very likely get knocked back for it” mean?

It means the IESG would send the documents back to us for further work when we go to Last Call.

This puts the httpbis working group in an interesting situation of perhaps being required to do something to address pervasive surveillance in HTTP/2. Of course, whether or not there’s any consensus at all to do something here remains to be seen. Most of the players involved seem to be sticking to their various positions, which makes me a little skeptical that mandating any sort of behavior will reach “rough consensus”.

What will happen if the IETF can’t get any consensus on anything TLS-related? At that point, it’s likely that the market will decide. So it’s interesting to see what the current landscape of vendor opinion is. On the browser front, all major browsers currently only support SPDY over TLS. Chromium and Firefox representatives have insisted on only supporting HTTP/2 over TLS, whereas Microsoft insists that HTTP/2 must provide a cleartext option:

Patrick McManus (Firefox):

I will not deploy another cleartext protocol. Especially another one where the choice of encryption is solely made by the server. It doesn’t serve my user base, or imo the web.

Me (Chromium):

Well, it should be no surprise that the Chromium project is still planning on supporting HTTP/2 only over a secure channel (aka TLS unless something better comes along…).

Rob Trace (WinInet, in other words, IE):

We are one browser vendor who is in support of HTTP 2.0 for HTTP:// URIs. The same is true for our web server. I also believe that we should strongly encourage the use of TLS with HTTP, but not at the expense of creating a standard that is as broadly applicable as HTTP 1.1.

I think this statement correctly captures the proposal:

> To be clear - we will still define how to use HTTP/2.0 with http://
> URIs, because in some use cases, an implementer may make an
> informed choice to use the protocol without encryption.

If these vendors maintain these positions, and all signs point towards that being the case, then it’ll be interesting to see how those market forces, combined with the unreliability of deploying new non-HTTP/1.X cleartext protocols over port 80, will affect how HTTP/2 gets deployed on the internet.

Prioritization Only Works When There’s Pending Data to Prioritize

[ TL;DR: It’s possible to write data to sockets, even when they will not immediately be sent. Committing data too early can result in suboptimal scheduling decisions if higher priority data arrives later. If multiplexing different priority data onto the same TCP connection (e.g. SPDY, HTTP/2), consider using the TCP_NOTSENT_LOWAT socket option / sysctl on relevant platforms to reduce the amount of unsent, buffered data in the kernel, keeping it queued up in user space instead so the application can apply prioritization policies as desired. ]

The Problem

A large part of my job is helping different teams at Google understand their frontend networking performance. Last year, one of the teams that asked for debugging assistance was the Google Maps team, since they were launching the new version of Google Maps. Amongst the resources loaded on the page were the minimaps used in parts of the UI and the actual tiles for the map, which are the main content they care about. So what they would like is for the tiles to have a higher priority than the minimaps.

Recall that browsers will assign priorities to resources by their resource type. The new Google Maps will load these minimaps as image resources, whereas they’ll fetch the tile data using XHRs. Chromium, if not all modern browsers, will give image resources a lower priority than XHRs. Moreover, Google Maps uses HTTPS by default, so SPDY-capable browsers, like Chromium, should see higher priority responses preempt lower priority responses, assuming response data for a higher priority stream is available at the server. Yet, Michael Davidson (Maps engineer) claimed that this wasn’t happening. Better yet, he had Chromium net-internals logfiles to prove it.

He had one with a fast connection (probably on the Google corp network) and one with a simulated 3G connection. Let’s look at what they said for the fast connection: “The requests that we really care about are stream_id 45 and 47. These are requested via XHR, and are the tiles for the site. Note that Chrome is servicing 33-43 before we get a response for 45 and 47.” Here are the relevant snippets of the net-internals logfile, with the unimportant fluff elided. Note that “[st]” refers to start time in milliseconds since the net-internals log started.

           --> fin = false
           --> :status: 200 OK
               content-length: 12354
               content-type: image/jpeg
           --> stream_id = 33
           --> fin = false
           --> size = 1155
           --> stream_id = 33
// ...
           --> fin = true
           --> size = 0
           --> stream_id = 33
// ... (I've cut out the rest of the minimap image responses for brevity)
           --> fin = true
           --> size = 0
           --> stream_id = 43
           --> fin = false
           --> :status: 200 OK
               content-length: 14027
               content-type: image/jpeg
           --> stream_id = 41
           --> fin = false
           --> size = 4644
           --> stream_id = 41
// ...
           --> fin = true
           --> size = 0
           --> stream_id = 41

// At this point, the minimap image responses are done.
// 77~ms of network idle time later, we start receiving the tile responses.

           --> fin = false
           --> :status: 200 OK
               content-type: application/; charset=x-user-defined
           --> stream_id = 45
           --> fin = false
           --> size = 8192
           --> stream_id = 45
// ... The data is rather large and keeps streaming in over 100ms. At which point the other tile
//     response becomes available and starts interleaving since they're the same priority.

           --> fin = false
           --> size = 3170
           --> stream_id = 45
           --> fin = false
           --> :status: 200 OK
               content-type: application/; charset=x-user-defined
           --> stream_id = 47
           --> fin = false
           --> size = 141
           --> stream_id = 47
           --> fin = false
           --> size = 1155
           --> stream_id = 45

           --> fin = true
           --> size = 0
           --> stream_id = 47

           --> fin = true
           --> size = 0
           --> stream_id = 45

Indeed, Michael is quite correct that the response data for streams 45 and 47 are arriving after the other streams. Now, the question is whether or not this is due to incorrect prioritization. It’s interesting to note that there’s a 70+ms gap from st=2004 to st=2081 where stream 41 has finished and the network goes idle for 70+ms before starting to send the response for stream 45. This makes the incorrect prioritization hypothesis somewhat suspect. At this point, it’ll be useful to describe how a common SPDY deployment works. Let me borrow the NSA’s useful diagram of Google’s serving infrastructure:

Both SSL and SPDY are added and removed at GFE. GFE will demux incoming requests on the client SPDY connection to backends (application frontends, like Maps) and mux the backend responses onto the client SPDY connection.
       Client                          Google

                            |               +---------+
                            |               |         |
                            |          +--->| Backend |
                            |          |+--+|         |
             Image reqs     |          +v   +---------+
+---------+  Tile reqs      |     +-----+
|         |+----------------|---->|     |+------>.
| Browser |                 |     | GFE |        . More backends
|         |<----------------|----+|     |<------+.
+---------+  Image resps    |     +-----+
             Tile resps     |          ^+   +---------+
                            |          |+-->|         |
                            |          +---+| Backend |
                            |               |         |
                            |               +---------+

                     Backend Response Queues
                       (X represents data)

+--------------+                                +-----------+
|              | Queue                          |           |
|              | XXXXXXXXXX       X      X    X | Backend 1 |
|              |<------------------------------+|           |
|              |                                +-----------+
|              |
|              |                                +-----------+
|              | Queue                          |           |
|              | XXXXX        X        X        | Backend 2 |
|              |<------------------------------+|           |
|      GFE     |                                +-----------+
|              |
|              |
|              |                                     ...
|              |
|              |
|              |                                +-----------+
|              | Queue                          |           |
|              |                                | Backend X |
|              |<------------------------------+|           |
+--------------+                                +-----------+

Hopefully these diagrams highlight why the 70+ms delay between the minimap image responses and the tile responses makes the hypothesis that prioritization is broken seem less likely. Request prioritization will only work if there is data from multiple responses to choose from. If there’s no higher priority data to choose to prioritize, then prioritization cannot have any effect. For example, if the image responses were coming from backends 1 & 2 in the above diagram, but the higher priority tile responses were coming from backend X, then given the backend response queues in the diagram, the GFE would only be able to respond with data from backends 1 and 2, and has no data from backend X to stream instead. That’s why it’s key to note the 70+ms delay between the image responses and the tile responses. That implies that backend responses for the tiles only arrived at the GFE after the image responses had already been forwarded onward to the browser. There are other possible explanations like TCP delays and what not, but they’re less likely for various reasons I won’t bother explaining here. Theoretically, if the GFE wanted to strictly enforce prioritization, it could hold up the image responses until the tile responses arrived, but that’s generally a pretty silly thing to do, as it implies wasting available bandwidth, since it’s likely the peer can still process lower priority responses while waiting for higher priority responses to arrive.
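The scheduling constraint can be sketched in a few lines of Python (a hypothetical muxer, not actual GFE code): the muxer always drains the highest-priority non-empty queue, so if the high-priority queue happens to be empty when the connection is writable, low-priority data is all it can send.

```python
def next_frame(queues):
    """Pick the next chunk to mux onto the multiplexed connection.

    queues maps priority (lower number = more urgent) to a list of
    pending response chunks buffered from the backends.
    Returns the next chunk to send, or None if nothing is buffered.
    """
    for priority in sorted(queues):
        if queues[priority]:
            return queues[priority].pop(0)
    return None  # nothing buffered at all; prioritization has no say
```

If the tile backend hasn’t responded yet, the only non-empty queues are the low-priority image queues, so image data is what goes out; prioritization never got a choice to make.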

The hypothesis that prioritization does not happen because the higher priority responses only arrive later after the lower priority responses have already been drained is a reasonable hypothesis given the network characteristics here (the corporate network has low latency and high bandwidth to Google datacenters). That of course raises the question of what would happen were we to use a slower network connection, which might conceivably allow backend responses to build up in GFE queues. Luckily, Michael also had a net-internals log for that case, using a simulated 3G connection. Michael helpfully identifies the noteworthy streams in the log for us: “Stream ID 51 is the tiles we care about. 49 is an image that we don’t care about. We don’t get any data for 51 until 49 completes. In the network tab, this appears as ‘waiting’ time. We waited 1.3 seconds for 51, and 3.8 seconds for 53!”

            --> fin = false
            --> :status: 200 OK
                content-length: 13885
                content-type: image/jpeg
            --> stream_id = 49
            --> fin = false
            --> size = 1315
            --> stream_id = 49
            --> fin = true
            --> size = 0
            --> stream_id = 49
            --> fin = false
            --> :status: 200 OK
                content-type: application/; charset=x-user-defined
            --> stream_id = 51

As we can see here, the responses are clearly back to back. Due to the low bandwidth and high RTT of the simulated 3G connection, it takes over 100ms to stream the response for the minimap image in stream 49. A millisecond after that stream finishes, the server immediately streams the higher priority response in stream 51. So either GFE’s SPDY prioritization is broken (which would be quite a significant performance regression) or it still doesn’t have the chance to take effect, even with this slow network. Since I’ve done a fair amount of performance debugging of Chromium-GFE SPDY prioritization before, I did not believe it was at fault here. Rather, I suspected that we discovered our first concrete instance of a problem that Roberto and I had previously only theorized could occur - the TCP send socket buffer was buffering too much unsent data.

                |                                                               |
                |                                                               |XXXXX
                |                                                               |<------------
                |                                                               |  Backend 1
                |                                                               |
                |                                                               |XXXXXXXX
                |                                                               |<------------
                |                                                               |  Backend 2
                |                                                               |
                |       One of the GFE kernel socket buffers                    |XXX
                |+------------------------------------------------+             |<------------
                ||                                                |             |  Backend 3
                ||                      limit of sendable data due|             |
                ||                      to stuff like rwnd & cwnd |             |
                ||                                   +            |             |
                ||                                   |            |             |    ...
                ||+----------------------------------v-----------+|             |
                ||+----------------------------------|-----------+|             |
                ||              sent data             unsent data |             |<------------
                || (kept for possible retransmission)             |             |  Backend X
                |+------------------------------------------------+             |

Once a proxy receives data from the previous hop, it will likely try to forward it onward to the next hop if the outgoing socket is writable. Otherwise, it will have to queue up the data and possibly assert flow control on the previous hop to prevent it from having to queue up too much data (which would eventually lead to OOM). If the outgoing TCP connection is a SPDY connection, and thus supports prioritized multiplexing, the proxy will probably mux the data onto the same SPDY connection from the incoming queues in more or less priority order.

Of course, this raises the question of when the TCP socket is writable. That’s a fairly complicated question, as it is based on variables like the peer’s receive window, the estimated BDP, and other settings like the socket send buffer size. On the one hand, if the sender wants to fully saturate the bandwidth, it needs to buffer at least a BDP’s worth of data in the kernel socket buffer, since it needs to keep it around to retransmit in case of loss. On the other hand, it doesn’t want to use an excessive amount of memory for the socket buffers. Yet, it may also want to minimize the computational costs of extra kernel-user context switches to copy data from user space to the kernel for transmission.
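To put a number on the buffering needed for saturation (the link figures below are purely illustrative):

```python
def bdp_bytes(bandwidth_bps, rtt_seconds):
    """Bandwidth-delay product: roughly how much sent-but-unacked data
    must be buffered in the kernel to keep the link fully utilized."""
    return bandwidth_bps / 8 * rtt_seconds

# e.g. a 10 Mbps link with a 100 ms RTT:
bdp_bytes(10_000_000, 0.1)  # 125000.0 bytes, i.e. ~125 KB in flight
```

Any socket buffer smaller than that throttles throughput on such a path, which is why autotuning tends to err on the side of bigger buffers.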

At the end of the day, depending on the various variables, a TCP socket may become writable even if the TCP stack won’t immediately write the data out to the network. The data will simply sit in the kernel socket buffer until the TCP stack decides to send it (possibly when already sent data has been ACK’d).

The problem here should be evident: given the current BSD interface for stream sockets (TCP sockets are stream sockets), there’s no way to re-order the data (e.g. data for a higher priority stream just became available) once it has been written to the socket. And even if there were, that wouldn’t help the typical SPDY case, since it runs over TLS, which uses the sequence number in its ciphers, so it can’t reorder data that has already been processed by the TLS stack. Both the TLS stack API and the BSD stream socket API operate similarly to high latency FIFO queues (inserted at the sender end, and consumed at the network receiver’s end). Therefore, in order to better prioritize data, it’s desirable to delay committing the data to the TLS/TCP stacks until one knows that the data will be immediately sent out over the network.

Unfortunately, the only signal a user space application has related to this is the socket becoming writable (aka POLLOUT), meaning the kernel is willing to accept new data into its buffers, although it may not want to or be able to send the data out immediately. Furthermore, POLLOUT does not indicate to the user space application how much data it can write. This poses a difficulty for the application developer, who does not know how much data to feed into the TLS/TCP stack. An obvious solution is to simply write smaller chunks of data to the socket, but this of course requires more syscalls.
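The “write smaller chunks” approach can be sketched like so (hypothetical Python; `on_writable`, the `(priority, buffer)` stream representation, and the chunk size are all illustrative):

```python
CHUNK = 16 * 1024  # small writes: re-make the scheduling decision often

def on_writable(sock, streams):
    """Invoked when poll() reports POLLOUT on the connection.

    streams is a list of (priority, bytearray) pairs, lower number =
    more urgent. Instead of committing a whole low-priority response
    to the kernel at once, write at most one small chunk of the most
    urgent stream, so high-priority data that shows up later isn't
    stuck behind bulk data already sitting in the socket buffer.
    """
    pending = [s for s in streams if s[1]]
    if not pending:
        return 0
    _, buf = min(pending, key=lambda s: s[0])
    sent = sock.send(buf[:CHUNK])
    del buf[:sent]
    return sent
```

The tradeoff described above is visible here: smaller chunks mean more syscalls per byte transferred, but each POLLOUT event is a fresh chance to pick the most urgent data.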

The Solution

Anyhow, at this point, I voiced my suspicions and my wonderful colleagues Hasan and Roberto (GFE devs) investigated the hypothesis and confirmed that it was indeed the case. Great! But now what do we do? Well, as I stated earlier, what we want is for the kernel to only mark the socket as writable if the data will be promptly written out to the network. Good thing our resident Linux networking ninja Eric added this feature to the Linux kernel. This feature allows controlling the limit of unsent bytes in the kernel socket buffer. Setting it as low as possible will enable a SPDY server to make the best possible scheduling decision (based on stream priorities), but will also incur extra computational costs from extra syscalls and context switches. As my colleague Jana explains in detail:

A bit of background: this socket buffer is used for two reasons – for unacked data and for unsent data. TCP on a high Bandwidth-Delay Product (BDP) path will have a large window and will have a large amount of unacked data; setting this buffer too small limits TCP’s ability to grow its sending window. TCP buffer autotuning is often used to increase the buffersize to match a higher BDP than the default (most OSes do autotuning now, and I don’t think that they generally ever decrease the buffer size.) As Will points out, this increasing buffer to match the BDP also ends up opening up more room for unsent data to sit in. Generally autotuning errs on the side of more buffersize so as to not limit the TCP connection, and this is at odds with an app that wants to retain scheduling decisions (such as SPDY with prioritized sending). TCP_NOTSENT_LOWAT is that knob that allows the app to decide how much unsent data ought to live in the socket buffer.

The simplest use of this sockopt is to set it to the amount of buffer that we want in the kernel (per our latency tolerance). But there’s another tradeoff to be watchful of – if the watermark is set to be too low, the TCP sender will not be able to keep the connection saturated and will lose throughput. If an ack arrives and the socket buffer has no data in it, TCP will lose throughput. Ideally, you want to set the watermark high enough to exactly match the TCP drain rate ( the ack-clock tick, ~ cwnd/rtt) to the process’s schedule rate (which is f(context switch time, load)).

Practically, you want the buffer to be small in low-bw environments (cellular) and larger in high-bw environments (wifi). The right metric to be considering in general is delay – how long did a data segment spend in the socket buffer – and we can use this to tweak the watermark. (An app can measure this by measuring time between two POLLOUTs, given that we set the low watermark, and send data into the socket above the low watermark).
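On Linux (kernel 3.12+), the knob Jana describes is exposed both as the net.ipv4.tcp_notsent_lowat sysctl and as a per-socket option. Here’s a minimal sketch of the per-socket form in Python (the fallback constant 25 is TCP_NOTSENT_LOWAT from linux/tcp.h; the 16 KB watermark is an illustrative value, not a recommendation):

```python
import socket

# Some Python builds expose socket.TCP_NOTSENT_LOWAT; fall back to the
# raw option value from <linux/tcp.h> otherwise.
TCP_NOTSENT_LOWAT = getattr(socket, "TCP_NOTSENT_LOWAT", 25)

def limit_unsent(sock, lowat_bytes):
    """Cap how much *unsent* data the kernel will buffer for this socket,
    so it only reports writable when written data will soon hit the wire."""
    sock.setsockopt(socket.IPPROTO_TCP, TCP_NOTSENT_LOWAT, lowat_bytes)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
limit_unsent(sock, 16 * 1024)
```

With this set, POLLOUT approximates “a write now will be transmitted promptly”, which is exactly the signal a prioritizing SPDY sender wants; the cost is the extra syscalls and context switches discussed above.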

So far in this discussion, I’ve focused almost entirely on the HTTP response path from a SPDY origin server. That’s simply because this is the case that typically matters most for this issue. This case can also happen on the upstream path from a client to a server. For example, let’s say you have a client that is uploading a large file (perhaps syncing data to Google Drive?) up to a server over a SPDY connection. That’s a bulk data transfer and theoretically should be assigned a low priority. It cares about throughput, so it will try to hand off data to the kernel ASAP so that the kernel buffers never drain and it maximizes the network throughput. However, if the client tries to send interactive requests over the same SPDY connection, then those requests may sit in the client kernel send socket buffer behind the bulk upload data, which would be unfortunate. Clients usually have an advantage here over servers, in that the extra kernel-user context switch and syscall costs are generally not significant issues, at least in comparison to highly scalable servers.

More generally speaking, this is very analogous to bufferbloat. In the typical bufferbloat scenario, multiple network flows of varying priority levels share a network queue. The way TCP works, it will try to maximize throughput (and thereby fill the buffers/queues) until a congestion event occurs (packet loss being the primary signal here), which leads to queueing delays for everyone else. The queueing delays especially hurt applications with higher priority network traffic, like videoconferencing or gaming, where high latency/jitter can seriously impact the user experience. In the SPDY prioritization case above, the shared network queue is actually a single TCP connection, including not just the inflight data, but the data sitting in the kernel send and receive socket buffers, and the multiple flows are the different SPDY streams that are being multiplexed on the same TCP connection. And applications, much like TCP flows, will greedily fill the buffers if given the chance. Unlike how network devices treat packets, it’s usually unacceptable for the kernel to discard unack’d application data that it received via the stream socket API, as the application expects it to be reliable.

Great, so now we understand both the problem and the solution, right? More or less that’s true. The appropriate/available solution mechanism is the TCP_NOTSENT_LOWAT socket option / sysctl. AFAICT, it’s available only on Linux and OS X / iOS based operating systems. The bigger question is how to use this. As mentioned previously, there are tradeoffs. Lowering TCP_NOTSENT_LOWAT will keep more of the data buffered in user space instead, allowing the application to make better scheduling decisions. On the flip side, it may incur extra computation from increased syscalls and kernel-user context switches and, if set too low, may also underutilize available bandwidth (if the kernel socket buffer drains all unsent data before the application can respond to POLLOUT and write more data to the socket buffer). Since these are all workload-dependent tradeoffs, it’s hard to specify a single value that will solve everyone’s problems. So, the answer is to measure and tune your application performance using this knob.


Observant readers may note the timing of this blog post relative to the TCP_NOTSENT_LOWAT Linux patch: my colleague Eric Dumazet upstreamed the patch in July 2013, before I debugged this problem for our Maps team later that year. The problem persisted because we had not yet updated our frontend servers to take advantage of the new socket option. Thanks to Stuart Cheshire at Apple, my colleagues at Google were already aware of this theoretical problem and implemented the TCP_NOTSENT_LOWAT solution in Linux (the option had existed in iOS and OS X since around 2011).

Network Congestion and Web Browsing

Recently, there was discussion on the ietf-http-wg mailing list about the current HTTP/2.0 draft’s inclusion of a SPDY feature that allows the server to send its TCP congestion window (often abbreviated as cwnd) to the client, so it can echo it back to the server when opening a new connection. It’s a pretty interesting conversation so I figured I’d share the backstory of how we got to where we are today.

TCP Slow Start

A long time ago in a galaxy far, far away (the 1980s), the series of tubes got clogged. Many personal Internets got delayed and the world was sad. Van Jacobson was pretty annoyed that his own personal Internets were getting delayed too due to all this congestion, so he proposed that we all agree to avoid congestion rather than cause it.

A key part of that proposal was Slow Start, which called for using a congestion window and exponentially growing it until a loss event happens. This congestion window specified a limit on the number of TCP segments a sender could have outstanding at any point. The TCP implementation would initialize the connection’s congestion window to a predefined “initial window” (often abbreviated as initcwnd or IW) and increase the window by 1 upon receipt of an ACK. If you got an ACK for each TCP segment you sent out, then in each roundtrip, the congestion window would double, thus leading to exponential growth. Originally, the initial congestion window was set to 1 segment, and eventually RFC 3390 increased it to 2-4. The congestion window would continue to grow until the connection encountered a loss event, which it assumes is due to congestion. In this way, TCP starts with low utilization and increases utilization (exponentially each roundtrip) until a loss event occurs.
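To make the exponential growth concrete, here’s a toy model (my own sketch, ignoring delayed ACKs and loss) counting the roundtrips Slow Start needs before the congestion window has covered a response of a given size:

```python
def roundtrips_to_send(total_segments, initcwnd):
    """Toy Slow Start model: one ACK per segment and no loss, so the
    congestion window doubles each roundtrip until all segments are sent."""
    cwnd, sent, rtts = initcwnd, 0, 0
    while sent < total_segments:
        sent += min(cwnd, total_segments - sent)
        cwnd *= 2  # each ACKed segment grows cwnd by 1 => doubling per RTT
        rtts += 1
    return rtts

# A ~100KB response is roughly 70 segments of 1460 bytes each.
print(roundtrips_to_send(70, 2))   # 6 roundtrips with IW2
print(roundtrips_to_send(70, 10))  # 3 roundtrips with IW10
```

Under these idealized assumptions, raising the initial window from 2 to 10 halves the roundtrips needed for a ~100KB response, which foreshadows the IW10 discussion below.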

TCP, HTTP, and the Web

Way back when (the 1990s), Tim Berners-Lee felt a burning need to see kitten photos, so he invented the World Wide Web, where web browsers would access kitten photo websites via HTTP over TCP. For a long time, the recommendation was to only use two connections per host, in order to prevent kitten photo congestion in the tubes:

Clients that use persistent connections SHOULD limit the number of simultaneous connections that they maintain to a given server. A single-user client SHOULD NOT maintain more than 2 connections with any server or proxy. A proxy SHOULD use up to 2*N connections to another server or proxy, where N is the number of simultaneously active users. These guidelines are intended to improve HTTP response times and avoid congestion.

However, over time, people noticed that their kitten photos weren’t loading as fast as they wanted them to. Due to this connection per host limit of 2, browsers were unable to send enough HTTP requests to receive enough responses to saturate the available downstream bandwidth. But users wanted more kitten photos faster and incentivized browser vendors to increase their connection per host limits so they could download more kitten photos in parallel. Indeed, the httpbis working group recognized that this limit was way too low and got rid of it. Yet even this wasn’t fast enough, and many users were still using older browsers, so website authors started domain sharding in order to send even more kitten photos in parallel to users.

Multiple TCP Connections and Slow Start

Now that we see that browsers have raised their parallelization limits, and websites sometimes use domain sharding to get even more parallelization, it’s interesting to see how that interacts with TCP Slow Start. As previously noted, TCP Slow Start begins the congestion window with a conservative initial value and rapidly increases it until a loss event occurs. A key part to avoiding congestion is that the initial congestion window starts off below the utilization level that would lead to congestion. Note however, that if N connections simultaneously enter Slow Start, then the effective initial congestion window across those N connections is N*initcwnd. Obviously for values of N and initcwnd high enough, congestion will occur immediately on network paths which cannot handle such high utilization, which may lead to poor goodput (the “useful” throughput) due to wasting throughput on retransmissions. Of course, this raises the question of how many connections and what initcwnd values we see in practice.
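A quick back-of-the-envelope sketch puts numbers on that N*initcwnd effect (assuming 1460-byte segments; the connection counts are illustrative):

```python
def effective_burst(n_connections, initcwnd, mss=1460):
    """Bytes a server cluster can inject in the *first* roundtrip across
    N simultaneous connections, each starting at initcwnd segments."""
    return n_connections * initcwnd * mss

# 6 connections per host * 4 sharded hosts, each with IW10:
# ~340KB hits the bottleneck queue before any congestion feedback arrives.
print(effective_burst(6 * 4, 10))  # 350400 bytes
```

A ~340KB burst in one roundtrip is far beyond what a slow DSL downlink can absorb without queueing or drops, which is exactly the pattern the packet traces below show.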

Number of Connections in a Page Load

The Web has changed a lot since Tim Berners-Lee put up the first kitten website. Websites have way more images now, and may even have tons of third party scripts and stylesheets, not to mention third party ads. Fortunately, Steve Souders maintains the HTTP Archive, which can help us figure out, for today’s Web, roughly how many parallel connections may be required to load a page.

As we can see, loading a modern webpage requires connecting to many domains and issuing many requests.

Looking at this diagram, we see that the average number of domains involved in a page load today is around 16. Depending on how the webpage is structured, we can probably expect to open upwards of tens of connections during the page load. This is of course only a very rough estimate, given this available data.

Typical Initial Congestion Window Values

In their continuing efforts to make the web fast, Google researchers have figured out that removing throttles will make things go faster, so they proposed raising the initial congestion window to 10, and it’s on its way to becoming an RFC. It’s key to note why this makes a big difference to web browsing. Slow Start requires multiple roundtrips to increase the congestion window, but web browsing is interactive and bursty, so those roundtrips directly impact user perceived latency, and by the time TCP roughly converges on an appropriate congestion window for the connection, the page load most likely has completed.

Linux kernel developers found Google’s argument compelling and have already switched the kernel to using IW10 by default. CDN developers were likewise pretty quick to figure out that if they wanted to compete in the make kitten photos go fast space, they needed to raise their initcwnd values too to keep up.

But What About Slow Start?

As David Miller says, the TCP initial congestion window is a myth.

As he points out, any app can increase the effective initial congestion window by just opening more connections, and obviously websites can make the browser open more connections by domain sharding. So websites can effectively get as high an initial congestion window as they want.

But the question is: does this happen in practice, and what happens when it does? To answer this, I used a highly scientific methodology to identify representative websites on the Web, loaded them up in WebPageTest with a fairly “slow” (DSL) network setup, and looked to see what happened:

Google Kitten Search

Here I examine my daily “kittens” image search in WebPageTest and also use CloudShark to do basic TCP analysis.

Waterfall of the Google Kitten Search page load. Of note is the portion where the browser opens 6 connections in parallel to each of the sharded encrypted-tbn\[0-3\] hostnames, leading to severe congestion-related slowdown during both the SSL handshakes (due to transmitting the SSL certificates) and the image downloads. In particular, the waterfall makes clear the increased latency in completing the SSL handshake, which delays the start of the image downloads and thus the moment the user sees the images he/she is searching for.
CloudShark analysis graph of goodput (which I’ve defined here as bytes that weren’t retransmissions, using the !tcp.analysis.retransmission filter) vs retransmissions (tcp.analysis.retransmission) vs “spurious” retransmissions (tcp.analysis.spurious_retransmission - subset of retransmissions where the bytes had already previously been received). Retransmissions means the sender is having to waste bytes on repeated transmissions, but from the receiver’s perspective, the bytes aren’t necessarily wasted. Spurious retransmissions on the other hand are clearly wasteful from the receiver’s perspective and increase the time until the document is completely loaded.


Etsy shards its image asset hostnames 4 ways (img[0-3]), in addition to its other hostnames, opening around 30 connections in parallel.
Etsy’s sharding causes so much congestion related spurious retransmissions that it _dramatically_ impacts page load time.

SPDY - Initial Congestion Window Performance Bottleneck

As shown above, plenty of major websites are using a significant amount of domain sharding, which can be highly detrimental for their user experience when the user has insufficient bandwidth to handle the burst from the servers. With so many connections opening up in parallel, the effective initial congestion window across all the connections is huge, and Slow Start becomes ineffective at avoiding congestion.

The advent of SPDY changes this since it helps fix HTTP/1.X bottlenecks and thus obviates the need for workaround hacks like domain sharding that increase parallelism by opening up more connections. However, after fixing the application layer bottlenecks in HTTP/1.X, SPDY runs into transport layer bottlenecks, such as the initial congestion window, due to web browsing’s bursty nature. SPDY is at a distinct disadvantage here compared to HTTP, since browsers will open up upwards of 6 connections per host for HTTP, whereas they’ll only open a single connection for SPDY, therefore HTTP often gets 6 times the initial congestion window that SPDY does.

Given this knowledge, what should we be doing? Well, there’s some argument to be made that since SPDY tries to be a good citizen and use fewer connections than HTTP/1.X, perhaps individual SPDY connections should get a higher initial congestion window than individual HTTP/1.X connections. One way to do so is for the kernel to provide a socket option for the application to control the initial congestion window. Indeed, Google TCP developers proposed this Linux patch, but David Miller shut the proposal down pretty firmly:

Stop pretending a network path characteristic can be made into an application level one, else I’ll stop reading your patches.

You can try to use smoke and mirrors to make your justification by saying that an application can circumvent things right now by openning up multiple connections. But guess what? If that act overflows a network queue, we’ll pull the CWND back on all of those connections while their CWNDs are still small and therefore way before things get out of hand.

Whereas if you set the initial window high, the CWND is wildly out of control before we are even started.

And even after your patch the “abuse” ability is still there. So since your patch doesn’t prevent the “abuse”, you really don’t care about CWND abuse. Instead, you simply want to pimp your feature.

SPDY - Caching the Congestion Window

Despite the rejection of this patch, Google’s servers have been leveraging this kernel modification to experiment with increasing the initial congestion window for SPDY connections. Beyond just experimenting with statically setting the initial congestion window, Google has also experimented with using SPDY level cookies (see SETTINGS_CURRENT_CWND) to cache the server’s congestion window at the browser for reuse later on when the browser re-establishes a SPDY connection to the server. This is precisely the functionality recently under debate in the httpbis working group.

It’s easy to see why this would be controversial. It definitely does raise a number of valid concerns:

  • Violates layering. Why is a TCP internal state variable being stored by the application layer?
  • Allows the client to request the server use a specific congestion window value.
  • Attempts to reuse an old, very possibly inaccurate congestion window value.
  • Possibly increases the likelihood of overshooting the appropriate congestion window.

So why are SPDY developers even experimenting at all with this? Well, there are some more and less reasonable explanations:

  • While it’s true that this is clearly a layering violation, this is the only way to experiment with caching the congestion window for web traffic at reasonable scale. This could be conceivably done by the TCP layer itself with a TCP option, but given the rate of users updating to new operating system versions, that would take ages. Experimenting at the application layer, while ugly, is much more easily deployable in the short term.
  • As with any variable controllable by the client, servers must be careful about abuse, treat this information as only advisory, and ignore ridiculous values.
  • Yes, while it’s true that there’s definitely reason to be skeptical of an old congestion window value, a reasonable counterpoint question is, as Patrick McManus mentions, is there a reason to believe that an uninformed static guess like IW10 is any better than an informed guess based on an old congestion window?
  • Is reusing an old congestion window more likely to cause overshooting the appropriate congestion window? Well, given that in an HTTP/1.X world, browsers are often opening 6 connections per host, and thus combined with IW10 often have an effective initial congestion window per host of 60, it’s interesting to note what values the SPDY CWND cookie caches in practice. Chromium (and Firefox too, according to Patrick McManus) sees median values around 30, so in many ways, reusing the old congestion window is more conservative than what happens today with multiple HTTP/1.X connections.

It’s great though to see this get discussed in IETF, and especially great to see tcpm and transport area folks get involved in the HTTP related discussions. And it looks like there’s increasing interest from these parties in collaborating more closely in the future, so I’m hopeful that we’ll see more research in this area and come up with some better solutions here.

Chromium Strikes Back

SPDY and HTTP/2 are still years away from being widely deployed enough that web developers don’t need to rely on domain sharding, so in the meantime we have to make things work well in today’s web. And that means dealing with domain sharding. Even though domain sharding may make sense for older browsers with low connection per host limits, when domain sharding is combined with higher limits, then opening so many connections in parallel can actually be harmful to both performance and network congestion as previously demonstrated.

As I’ve previously discussed, Chromium historically has had poor control over connection parallelism. That said, now we have our fancy new ResourceScheduler, courtesy of my colleague James Simonsen. With that, we now have page level visibility into all resource requests. With that capability, we can start placing some limits on request parallelism on a per-page basis. Indeed, we’ve experimented with doing so and found compelling data to support limiting the number of concurrent image requests to 10, which should roll out to stable channel in the upcoming Chrome 27 release and dramatically mitigate congestion related issues. For further details, look at the waterfalls here and watch the video comparing the page loads.

Chrome 26 demonstrates a fair amount of congestion related retransmissions.
Chrome 29 demonstrates that the image parallelism change dramatically improves the situation here.

While this limit would obviously be good for users with slower internet connections, I was mildly concerned that it would lead to degraded performance for users with faster internet connections due to not being able to fully saturate the connection’s available bandwidth. That said, it appears that it is still high enough for at least cable modem type bandwidth.


In summary:

  • Congestion is bad for the internet and for user perceived latency
  • Extra roundtrips are bad for user perceived latency
  • Slow Start helps avoid congestion
  • Slow Start can incur painful roundtrips
  • Low initial congestion windows can be bottlenecks for user perceived latency
  • But high initial congestion windows can lead to immediate congestion, which can also hurt user perceived latency
  • There’s an interesting discussion in IETF on whether or not we can improve upon Slow Start by picking better initial congestion windows, possibly dynamically by using old information.
  • Web developers using high amounts of domain sharding to work around low connection per host limits in old browsers should reconsider their number of shards for newer browsers. Anything over 2 is probably too much, unless most of your user base is using older browsers. Better yet, stop hacking around HTTP/1.X deficiencies and use SPDY or HTTP/2 instead.

Food for thought (and possibly future posts):

  • Impact on bufferbloat
  • How does TCP behave exactly when it encounters such high initial congestion like some of the demonstrated packet traces show? How does this change with the advent of newer TCP algorithms like Proportional Rate Reduction and Tail Loss Probe that are available in newer Linux kernel versions?
  • Limiting requests will reduce attempted bandwidth utilization, which reduces congestion risk (and the risk for precipitous drops in goodput under excessive congestion). How else does it improve performance? For hints, see some of my old posts like this and this.

Name Collisions

One of the problems of having a common name is name collisions :)

Getting overwritten

One day I came into my Google office and was hacking away as normal. As the day went on, I started getting mysterious failures, losing access to various systems with credential failures. I was boggled and eventually consulted Tech Stop (Google tech support) about what was going on. I sat there chuckling as I watched myself lose access to internal systems one by one. For kicks I went to see my compensation page which I discovered to be amusingly empty. No biggie, I didn’t need my salary/vacation/stock anyway. I sat there wondering if Google was trying to subtly fire me. First they take my stapler, then they take me off the payroll, and next thing I know my manager will be moving me to the basement.

Eventually Tech Stop found out who was writing all the changes to my accounts - HR. They were about to call up HR to find out why I was getting deleted from Google when they noticed that a new William Chan was entering the system, taking my spot. Apparently we had a new acquisition, which of course had their own employee accounts, and this new William Chan had the same username as me, and HR was overwriting me with William Chan v2. I think I scared away the imposter because I don’t recall him being around later when I searched for other William Chans who could possibly be receiving my office mail.

Good thing we had backups. It’s kinda nice getting paid.

Losing my “chance” to be a co-founder

Back in 2002, I had the good fortune of hanging out with a wonderful fellow by the name of Rolland Yip. We were both Stanford computer science students studying abroad in Japan and working internships in Tokyo. Fast forward nearly a decade and we had totally lost touch (I’m terrible at that), but Rolland was going to do a new startup in Hong Kong. So, of course he thinks about finding a clever co-founder to help him get things rolling. Apparently I made a good impression on him all those years ago (haha, fooled him!), so he went to look me up. He goes through the various social networks looking up William Chan, Stanford, Google, etc and finds “me”.

Little did he know, I’m not the only William Chan who graduated from Stanford with a CS degree and went to work at Google. This William Chan actually sat down the hall from me at Google in building 43, and we regularly received each others’ mail. He’s a swell fellow though, so he always returned my new camera lenses I ordered off Amazon, my reimbursement checks from when I bought Google stuff on my own account (quite hilarious since he could actually cash those checks), etc. After he left Google, I did discover that he had used one of my free massage credits though, the bastard! :P

So Rolland reaches out, and discovers that he got the wrong William Chan, but the William Chan he contacted is totally interested in moving to Hong Kong and doing a startup. 2012 comes around and Google flies me to Tokyo for a HTTP standardization meeting, and I decide to take a slight detour to visit my old classmate in Hong Kong, where I get the lowdown on how these two fellows became co-founders. Hah! Great guys, I’m glad it worked out for them. Especially since I didn’t really have any intention of moving to Hong Kong nor leaving Google at that point in my life, so it’s great Rolland found a William Chan who did :)

Some Quick Thoughts on TCP Fast Open and Chromium

Since this came up yesterday, I figured I ought to write up a quick & dirty post about Chromium and TCP Fast Open. Currently, it’s experimentally supported on new Linux kernel versions (and requires both client and server kernel support). On the Chromium side, there’s a flag to enable it for testing. All that flag does is: if the kernel supports TCP Fast Open (and the system-wide setting hasn’t disabled it), then we’ll try to use it. We’ll optimistically return success for Connect() and try to take advantage of TCP Fast Open cookies on the first Write(). That’s really all the code does for now. But this implementation is broken in a number of critical edge cases. Let’s consider the ways it’s broken:

  1. TCP Fast Open requires application level idempotency. As noted in the text, it’s quite possible for the server to receive a duplicate SYN (with data). It must be resilient to this. In the web browsing scenario, POSTs are not idempotent. GETs are supposed to be idempotent, and indeed things like HTTP pipelining depend heavily on this requirement. Of course, it’s not strictly true in practice, but that’s a server-side bug. In the web browser case, proper use of TCP Fast Open would probably require only attempting to use it with idempotent requests. We’re not doing that yet. So there’s definitely a small risk of multiple POSTs, and if the server application doesn’t detect that, then it could be a serious problem. We need to fix our implementation to only try TCP Fast Open when the initial data packet doesn’t have any unsafe side effects (HTTP GET, or higher layer handshakes like SSL’s CLIENT_HELLO).
  2. TCP Fast Open violates a number of internal code assumptions in the Chromium HTTP stack. We naturally assume that all connection errors will be returned upon completion of Connect(). But as I explained earlier, we just optimistically return success for Connect() and return the connection error, if any, in the response to the Write(). This mostly works, but it’s obviously pretty dicey and we need to go through all our code to iron out the edge cases where this fails. Or we need to change the API to detect Fast Open support prior to the Connect() call, and if it’s likely to succeed, we can call a new method like ConnectWithData() or something.
  3. TCP Fast Open defeats late binding when the zero roundtrip connect falls back to the 3-way handshake. When our code optimistically returns connection success, the socket will be handed up to the HTTP stack for use and it’ll try to use it. But if the TCP Fast Open cookie fails to work and we fall back to the 3-way handshake, then our HTTP request is tightly bound to this socket that is in the process of being connected. This prevents late binding, which would result in better prioritized allocation of requests to available connected sockets. Unfortunately, there’s no good API to the kernel to detect whether or not the kernel has a TCP Fast Open cookie for the destination server (and who knows how likely it is that servers will keep cookies around for long enough for them to be useful?). Until a better API exists, the only other alternative is to try to implement an application level cache to predict whether or not we think it’s worth it to try TCP Fast Open for a host.
  4. TCP Fast Open doesn’t play well with our TCP preconnect code. Our TCP preconnect code will try to start the SYNs early so we can mitigate the roundtrip cost of the TCP 3-way handshake. We get caught in this situation where perhaps we shouldn’t preconnect, because there’s a chance that Blink’s resource loader will request a resource on that host in the near future, before a TCP SYNACK would come in, and we might want to attempt a TCP Fast Open “connect” instead. But the preconnect code is way more dependable, since it works on all servers and doesn’t rely on TCP extensions that middleboxes may barf on. Note however that TCP Fast Open should combine nicely with SSL (and other higher layer) preconnect handshakes.

These problems are all addressable to some degree. Indeed, problems (1) and (2) are simple correctness issues that we just need to fix, but we haven’t begun addressing them yet. (3) is also relatively easy to address with some extra application layer logic to improve the TCP Fast Open success rate. (4) is a little bit tricky and we’ll have to experiment to see what works best in practice. And it’s hard to conceive of a case where TCP Fast Open is not a straight up win for HTTPS.

In short, there’s a lot of potential here, but the implementation is totally naïve now and needs more work to fix correctness and performance problems before it’s ready for actual end user usage.


Yay, I finally began part-time employment at Google on March 18!

  • I’m still a Google / Chromium representative at IETF httpbis for HTTP/2.0 related work. That’s officially my only work responsibility now, although I’m sure people will continue to bug me about random Chromium related stuff. I’ve handed off Chromium SPDY implementation work to my very capable colleagues.
  • I’ve still got a long queue of Chromium / webperf / networking related blog posts that I’ve been meaning to write. I’ll post them as I get around to them.
  • In my spare time, I switched to Octopress. I’ll write more on that later, but overall I really like it.

Connection Management in Chromium

Connection latency and parallelism are significant factors in the networking component of web performance. As such, Chromium engineers have spent a significant amount of time studying how best to manage our connections – how many to keep open, how long to keep them open, etc. Here, I present the various considerations that have motivated our current design and implementation.

Connection latency

It’s expensive to set up a new connection to a host. First, Chromium has to resolve the hostname, which is already an expensive process as I’ve previously discussed. Then it has to do various handshakes. This means doing the TCP handshake, and then potentially doing an SSL handshake. As seen below, both of these are expensive operations. Note that the samples gathered here represent the TCP connect()s and SSL handshakes as gathered over a single day in February 2013 from our users who opt-in to making Google Chrome better. And since these are gathered at the network stack level, they include more than just HTTP transactions, and more than just web browsing requests.

TCP connect() latency. Note the spikes towards the long tail, indicating the platform-specific TCP retransmission timeouts.
CDF of TCP connect() latency.
SSL handshake latency. Includes the cert verification time (which may include an online certificate revocation check).
CDF of SSL handshake latency.

As can be seen, new TCP connections, much less full SSL connections, generally cost anywhere from tens to hundreds of milliseconds, so hiding this latency as much as possible by better connection management is important for reducing user perceived latency.

Socket “late binding”

When I first began working on Chromium in 2009, the team had recently finished the cross platform network stack rewrite, and finally had time to begin optimizing it. Our HTTP network transaction class originally did the simple thing, maintaining the full state machine internally and proceeding serially through it – proxy resolution, host resolution, TCP connection, SSL handshake, etc. We called this original state “early binding”, where the HTTP transaction was bound early on to the target socket. This was obviously suboptimal, because while the HTTP transaction was bound to a socket and waiting for the TCP (and possibly the SSL) handshake to complete, a new socket might become available (perhaps a newly connected socket or a newly idle persistent HTTP connection). Furthermore, one HTTP transaction may be higher priority than another, but due to early binding and network unreliability (packet loss and reordering and what not), the higher priority transaction may get a connected socket later than the lower priority one. My first starter project on the team was to replace this with socket “late binding”, introducing the concepts of socket requests and connect jobs. HTTP transactions would issue requests for sockets, and sit in priority queues, waiting for connect jobs or newly released persistent HTTP connections to fulfill them. When we first launched this feature for TCP sockets, we saw a 6.5% reduction in time to first byte on each HTTP transaction, due to the improved connection reuse.

Example of socket late binding. Connection 1 is used for bar.html, and then reused for a.png. Connections 2 and 3 get kicked off, but 3 connects first, and binds to b.png. Before 2 finishes connecting, a.png is done on 1, so connection 1 is reused again for c.png. In this example, only connections 1 and 3 are used, and connection 2 is unused and idle.
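To make the request/job split concrete, here is a minimal sketch of late binding (the class and method names are illustrative, not Chromium’s actual API): socket requests wait in a priority queue, and whichever connect job or released persistent connection produces a socket first fulfills the highest priority waiter.

```python
import heapq
import itertools

class SocketPool:
    """Toy model of socket late binding: requests are never bound to a
    particular connect job; the highest-priority waiting request gets
    whichever socket becomes available first."""

    def __init__(self):
        self._waiting = []             # min-heap of (priority, seq, request)
        self._seq = itertools.count()  # tie-breaker: FIFO within a priority

    def request_socket(self, request, priority):
        # Lower number = higher priority.
        heapq.heappush(self._waiting, (priority, next(self._seq), request))

    def on_socket_available(self, sock):
        # Called both when a connect job finishes and when a persistent
        # connection is released back to the pool.
        if self._waiting:
            _, _, request = heapq.heappop(self._waiting)
            request.bind(sock)
            return request
        return None  # no waiter; the socket sits idle in the pool

class Request:
    def __init__(self, name):
        self.name, self.sock = name, None

    def bind(self, sock):
        self.sock = sock
```

In this model, a socket freed by a completed transaction and a socket from a freshly finished connect job are interchangeable, which is the essence of late binding.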

Leveraging socket late binding to enable more optimizations

Delaying the binding of a HTTP transaction to an actual socket gave us important flexibility in deciding which connect job would fulfill a socket request. This flexibility would enable us to improve the median and long tail cases in important ways.

TCP “backup” connect jobs

Examining the TCP connection latency chart above, you can see that the Windows connection latency has a few spikes. The spikes at the 3s mark and later correspond to the Windows TCP SYN retransmission timeout. Now, 3 seconds may make sense for some applications, but it’s horrible for an interactive application where the user is staring at the screen waiting for the page to load. Our data indicates that around 1% of Windows TCP connect() times fall into the ~3 second bucket. Our solution here was to introduce a “backup” connect job, set to start 250ms 2 after the first socket connect() to an origin. This very hacky solution attempts to work around the unacceptably long TCP SYN retransmission timeout by retrying sooner. Note that we only ever have one “backup” connect job active per destination host. After implementing this, our socket request histograms showed that the spike at 3s due to the TCP level SYN retransmission timer went down significantly. As someone who feels very passionately about long-tail latency and reducing jank, this change makes me feel very good, and I was excited when Firefox followed suit.

Here you see the DNS and TCP connect times of each socket connection vs the request time for a connected socket (filtered to only newly connected TCP sockets). It’s useful to note where the backup job helps: only with the TCP SYN retransmission timeouts, not with the DNS retransmission timeouts. This is because we don’t call getaddrinfo() multiple times per host, only once. With our upcoming new DNS stub resolver, we can control DNS retransmission timeouts ourselves. The backup connect job is fairly effective at reducing the spikes at the TCP retransmission timeouts.
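The backup job logic can be sketched as a simple race (an illustrative model, not Chromium’s actual code): start one connect, and if it hasn’t completed within 250ms, start a second attempt and take whichever finishes first.

```python
import asyncio

BACKUP_CONNECT_DELAY = 0.25  # seconds; Chromium's 250ms backup job delay

async def connect_with_backup(connect, host):
    """Start a connect attempt; if it hasn't finished within 250ms (e.g.
    the SYN was lost and the OS won't retransmit for ~3s), kick off one
    backup attempt and return whichever connects first."""
    first = asyncio.ensure_future(connect(host))
    done, _ = await asyncio.wait({first}, timeout=BACKUP_CONNECT_DELAY)
    if done:
        return first.result()
    backup = asyncio.ensure_future(connect(host))
    done, pending = await asyncio.wait(
        {first, backup}, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # only one backup job per destination; drop the loser
    return done.pop().result()
```

With a lost SYN, the backup attempt typically completes in roughly 250ms plus one round trip, instead of waiting out the full 3s retransmission timer.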

Some people have asked why we employ this approach instead of the IE9 style approach of opening 2 sockets in parallel 3. That idea is indeed intriguing and has its upsides as they point out. But in terms of compensating for SYN packet loss, it’s probably an inferior technique, if you assume a bursty packet loss model where bursts overflow routing buffers. The overall idea definitely has merits which we need to consider, but there’s less benefit for Chromium since Chromium also implements socket preconnect.

Socket preconnect

Once the HTTP transactions were detached from a specific socket and instead just used the first available one, we were able to speculatively preconnect sockets if we thought they would be needed. Basically, we simply leveraged our existing Predictor infrastructure that we used to power DNS prefetching, and hooked it up to preconnect sockets when we had extremely high confidence. Therefore, by the time WebKit decides it wants to initiate a resource request, the socket connect has often already started or even completed. As Mike demonstrates, this results in huge (7-9%) improvements in PLT in his synthetic tests. Later on, when we refactored the SSL connections to leverage the late binding socket pool infrastructure, we were able to expose a generic preconnect interface that allowed us to prewarm all types of connections (do TCP handshakes and HTTP CONNECTs and SSL handshakes) as well as prewarm multiple connections at the same time. Our Predictor infrastructure lets us predict, based on past browsing, the maximum number of concurrent transactions we’d want for a given network topology, and takes advantage of the genericized “socket” 4 pool infrastructure to fully “connect” a socket.
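Here is a toy sketch of the idea (names are hypothetical; the real Predictor is far more involved): while observing page loads, learn which hosts each page contacts and the peak connection concurrency to each, then replay that as a preconnect plan on the next visit.

```python
class Predictor:
    """Toy model of the Predictor infrastructure: remembers, per referring
    page, which hosts were contacted and the peak number of concurrent
    connections seen to each, so repeat visits can warm sockets early."""

    def __init__(self):
        self._learned = {}  # page URL -> {host: peak concurrent connections}

    def learn(self, page, host, concurrent):
        # Called while observing a page load; keeps the max concurrency seen.
        hosts = self._learned.setdefault(page, {})
        hosts[host] = max(hosts.get(host, 0), concurrent)

    def preconnect_plan(self, page):
        # On a repeat visit, this tells the socket pools how many
        # connections to prewarm per host, before any requests arrive.
        return dict(self._learned.get(page, {}))
```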

Improving connection reuse

While working on improving the code so we could better take advantage of available connections, we also spent some time studying how to increase the reusability of connections and optimize the choice of connection to reuse.

Determining how long to keep sockets open

The primary mechanism we studied for increasing connection reusability was adjusting our connection idle times. This is a sensitive area to play around with. Too long, and it’s likely that some middlebox or the origin server has closed the connection, and may or may not have sent a TCP FIN (or it might be on its way), leading to a wasted attempt to use the connection only to pay a roundtrip to receive a TCP RST. Too short, and we shrink the window for potential reuse. It also got more complicated in a socket late binding world, since sometimes we would get resource request cancellations or preconnects that would lead to sockets sitting idle in the socket pool prior to being used. Previously, a socket would only be idle in between reuses, but now a socket might sit idle before first use. It turns out that middleboxes and origin servers have different idle timeouts for unused vs previously used connections. Moreover, we have to be much more conservative with unused sockets, since if we get a TCP RST on first use, it may be indicative of server/network issues, rather than idle connection timeouts. Therefore, we only retry our HTTP request on previously used persistent HTTP(S) connections. With that in mind, we gathered histograms on the type of socket usage, and the time sockets sat idle before use/reuse, in order to guide our choice of appropriate socket timeouts.

Shows the amount of time we wait before using or reusing a socket. Note that Chromium currently times out unused sockets after 10-20s, and used sockets after ~6 min.
Shows how often we are waiting for a new UNUSED socket, how often we immediately grab an UNUSED_IDLE socket, and how often we get to reuse a previously used idle socket.

One thing to note about the first graph: the reason we’re fuzzy about the absolute idle socket timeout is that we run a timer every 10s that expires a socket if it has exceeded the timeout, so when the timeout is 10s (as it is for unused, idle sockets), the true timeout is 10-20s. What the above charts show is that there are significantly diminished returns to extending the idle socket times for previously used sockets, but that there is definite potential for reuse by extending the socket timeout for unused, idle sockets (that’s why the slope for their CDFs caps out so abruptly at 100%). Before people jump to concluding that we should indeed increase the socket timeout for unused, idle sockets, keep in mind that every time we get this wrong, we may show a user-visible error page, which is a terribly confusing user experience. All the user sees is that Chromium failed to load their requested resource (which is terrible if it’s a toplevel frame). Moreover, looking at the HTTP Socket Use Types chart, you can see that the percentage of times that we actually utilize unused, idle sockets is extremely low. Obviously, if we increased their idle socket timeout, the reuse percentage would increase somewhat, and we should also note that the use of an UNUSED or UNUSED_IDLE socket is extremely important, since it’s probably directly contributing to user perceived latency during page loads. There’s a tradeoff here, where we could gain more connection reuse, at the cost of more user visible errors.
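The resulting policy can be summarized in a small sketch (illustrative names; the 10s and ~6 min values are the current timeouts described above): separate idle limits for unused vs previously used sockets, and transparent retry only on the latter.

```python
UNUSED_IDLE_TIMEOUT = 10.0    # seconds; a timer firing every 10s makes the
                              # effective timeout 10-20s, as noted above
USED_IDLE_TIMEOUT = 6 * 60.0  # previously used sockets may idle far longer

class IdleSocket:
    def __init__(self, sock, used_before, idle_since):
        self.sock = sock
        self.used_before = used_before  # has this socket carried a request?
        self.idle_since = idle_since    # timestamp when it went idle

    def timed_out(self, now):
        limit = USED_IDLE_TIMEOUT if self.used_before else UNUSED_IDLE_TIMEOUT
        return now - self.idle_since > limit

def retry_is_safe(idle_socket):
    # A TCP RST on first use may indicate real server/network trouble, so
    # requests are only transparently retried on previously used connections.
    return idle_socket.used_before
```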

Determining how many total sockets to keep open

Beyond trying to optimize reuse of individual sockets, we also looked into our overall socket limits to see if we could increase the number of sockets we keep available for potential reuse. Originally we had no limits, like IE, but eventually we encountered issues with not having limits due to broken third party software, so we had to fix our accounting and lower our max socket limit. We knew this probably impacted performance due to lower likelihood of being able to reuse sockets, but since certain users were broken, we didn’t know what to do. Eventually, due to many bug reports where I saw us reach the max socket limit (not necessarily all in active use, but it meant we were closing idle sockets before their timeout, thereby limiting potential reuse), we reversed our policy and just decided that the third party should fix their code (some of our users helpfully reached out to the vendor) and bumped our limit up to 256, which hopefully should be enough for the foreseeable future.

Determining which idle socket to use

One thing to keep in mind about reusing idle sockets is that not all sockets are equal. If sockets have been idle longer, there’s a greater likelihood that the middleboxes/servers have terminated the connection or that the kernel has timed out the congestion window 5. If some sockets have read more data, then their congestion windows are probably larger. The problem with greedily optimizing at the HTTP stack level is that it’s unclear whether it’s ultimately beneficial for user perceived latency. We can more or less optimize for known HTTP transactions within the HTTP stack, but a millisecond later, WebKit might make a request for a much higher priority resource. We may hand out the “hottest” socket for the highest priority currently known HTTP transaction, but if a higher priority HTTP transaction comes in immediately afterward, that behavior may be worse for overall performance. Since we don’t have crystal balls lying around, it’s difficult to determine how best to allocate idle sockets to HTTP transactions. So, we decided to run live user experiments 6 to see what worked best in practice. Unfortunately, the results were inconclusive: “Skimming over the results, the primary determinant of whether there’s any difference in the four metrics I looked at, and if so, which group came out on top (Time to first byte, time from request send to last byte, PLT, and net error rates) was which day I was looking at. No consistent differences in either the long tail or everyone else.” Since we didn’t see any clearly better approach, we optimized for simplicity and kept our behavior of preferring the most recently used idle socket.
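The three heuristics we experimented with (defined in footnote 6) are easy to state as selection functions over the idle pool; this sketch uses a plain dict per socket rather than any real socket type:

```python
def warmest(sockets):
    # "warmest socket": the one that has read the most bytes
    return max(sockets, key=lambda s: s["bytes_read"])

def last_accessed(sockets):
    # "last accessed socket": the one idle for the least time
    return min(sockets, key=lambda s: s["idle_secs"])

def warm(sockets):
    # "warm socket": read bytes / (idle time)^0.25, balancing a likely-large
    # congestion window against the risk of a stale connection
    return max(sockets, key=lambda s: s["bytes_read"] / s["idle_secs"] ** 0.25)
```

Since none of these beat the others in the live experiments, Chromium stuck with the last-accessed (most recently used) behavior.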

How much parallelism is good?

The simple, naive reasoning is that more parallelism is good, so why don’t we parallelize our HTTP transactions as much as possible? The short answer is contention. There are different sorts of contention, but the primary concern is bandwidth. More parallelization helps achieve better link utilization, which, as I’ve discussed previously, often introduces contention in the web browsing scenario and thus may actually increase page load time. It’s difficult for the browser to know what level of parallelization is optimal for a given website, since that’s also predicated on network characteristics (RTT, bandwidth, etc) which we’re often not aware of (we can potentially estimate this on desktop, but it’s very difficult to do so on mobile given the high variance). So, in the end, for HTTP over TCP, at least for now, we’re basically stuck with choosing static parallelization values for all our users. Since Chromium’s HTTP stack is agnostic of web browsing (it has multiple consumers like captive portal detection, WebKit, omnibox suggestions, etc), it’s been historically difficult to control parallelization beyond the connections per host limit, but now that we’re developing our own Chromium-side ResourceScheduler that handles resource scheduling across all renderer processes, we may be able to be smarter about this. So our live experimentation in this area has primarily focused on studying connection per host limits.

Experimenting with the connection per host limit

In order to identify the optimal connection per host limit in real websites, we ran live experiments on Google Chrome’s dev channel with a variety of connection per host limits (4, 5, 6, 7, 8, 9, 16) back in summer 2010. We primarily analyzed two metrics: Time to First Byte (TTFB) and Page Load Time (PLT). 4 and 16 were immediately thrown out since their performance was very clearly worse, and we focused our experiment population on 5-9. Out of those options, 5 and 9 performed the worst, and 6-8 performed fairly similarly. 7 and 8 performed very similarly in both TTFB and PLT. 8 connections had anywhere from a 0-2% TTFB improvement over 6 connections. In terms of actual PLT improvement though, 8 connections had around a 0.2%-0.8% improvement over 6 connections up to the 90th percentile. Beyond the 90th percentile, 8 connections had a PLT regression in comparison to 6 connections of around 0.7-1.3%. The summary of our discussion was: “6 is actually a decent value to be using. 7 connections seemed to perform slightly better in some circumstances, but probably not enough to make us want to switch to it.” There are many qualifications to this experiment.

For example, these metrics are sort of flawed, but they were the best we had at the time. Today, there are other interesting metrics, like WebPageTest’s Speed Index, which would be cool to analyze. Also, this experiment data only shows us in aggregate across our population what the ideal connection per host limit might be, but as previously noted, there’s every reason to believe that the optimal connection per host limit would be different based on certain variables, like network type (mobile, modem, broadband), geographic location (roundtrip times are very high in certain regions), user agent (web content often varies greatly for mobile and desktop user agents), etc. As noted before, it’s difficult to do accurate network characteristics estimation, especially on mobile, but we might try to tweak the limit based on whether or not you’re on a mobile connection.

Opening too many TCP connections can cause congestion

Another issue that we’ve seen with increasing parallelism, aside from contention, is bypassing TCP’s congestion avoidance. TCP will normally try to avoid congestion by starting slow start with a conservative initial congestion window (initcwnd), and then ramp up from there. Note, however, that the more connections a client opens concurrently, the higher the effective initcwnd is. Also recall that newer Linux kernel versions have increased the TCP initcwnd to 10. What this means is that when the client opens too many connections to servers, it’s very possible to cause congestion and packet loss immediately, thus negatively impacting page load time. Indeed, as I’ve previously mentioned, this definitely happens in practice with some major websites. Really, as we’ve said time and time again, the solution is to deploy a protocol that supports prioritized multiplexing over a single connection, e.g. SPDY (HTTP/2), so web sites don’t need to open more TCP connections to achieve more parallelism.
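To see why this compounds, here is the back-of-the-envelope arithmetic (an MSS of 1460 bytes is a typical assumption, not a measured value):

```python
MSS = 1460      # typical TCP maximum segment size on Ethernet, in bytes
INITCWND = 10   # initial congestion window on newer Linux kernels

def first_rtt_burst(connections, initcwnd=INITCWND, mss=MSS):
    """Bytes a server may inject into the network in the first round trip
    when a client opens `connections` parallel TCP connections to it."""
    return connections * initcwnd * mss

# One connection bursts 10 segments (~14.6 KB); six connections to the same
# host effectively burst 60 segments (~87.6 KB) before the server has
# received any congestion feedback at all.
```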

Making Chromium’s connection management SP[ee]DY

As our SPDY implementation got under way, we realized that our existing connection management design held back the SPDY implementation from being optimal. What we realized was that our HTTP transaction didn’t really need to bind to a socket per se, but rather to a HTTP “stream” over which it could send a HTTP request and from which it could receive a HTTP response. The normal HttpBasicStream would of course sit directly on top of a socket, but this abstraction would let us also create a SpdyHttpStream that would translate HTTP requests to SPDY SYN_STREAMs, and SYN_REPLYs to HTTP responses. It also provided us with a good shim point in which to play around with a HTTP pipelining implementation that would create HttpPipelinedStreams. Now, with this new layer of indirection, we had a new place where we wanted to once again delay our HTTP transaction binding to a HttpStream as long as possible, which meant pulling more state out of our serial HTTP transaction state machine and moving it into HttpStreamFactory jobs. When a new SPDY session gets established over a socket, each HttpStreamFactory job for that origin can create SpdyHttpStreams from that single SPDY session. So, even if we don’t know that the origin server supports SPDY, and we may initiate multiple connections to the server, once one connection completes and indicates SPDY support, we are able to immediately execute all HTTP transactions for that origin over that first SPDY connection, without waiting for the other connections to complete. Moreover, in certain circumstances, a single SPDY session might even service HttpStreamFactory jobs for different origins.

HttpStream late binding, enabling a single SPDY session to create HttpStreams simultaneously.

Once we had the HttpStream abstraction, we could switch from preconnecting sockets to preconnecting HTTP streams. When we visit SPDY-capable origins, we record that it supports SPDY, so we know on repeat visits only to preconnect one actual socket, even though the Predictor requested X HTTP streams.

Mobile connection management

Ever since the Android browser started embedding Chromium’s network stack, the core network team has assisted our mobile teams in tuning networking for our mobile browsers.

Detecting blackholed TCP connections with SPDY

One of the very first changes we had to make was related to how we kept alive our idle sockets. Historically, we’ve found that TCP middleboxes may time out TCP connections (probably actually NAT mappings), but instead of notifying the sender with a TCP RST, they simply drop packets, leading to painful TCP retransmission timeouts, causing requests to hang for minutes until the TCP connection finally timed out. This happened for HTTP connections before, but since we often opened multiple HTTP connections to a server, page loads would generally be pretty resilient, and at worst the user could recover by reloading the page (the problematic connection would still be hung and thus not used for the new HTTP requests). However, for SPDY, where we try to send all requests over the same connection, a hung TCP connection would completely hang the page load. We solved this problem at first by making sure TCP middleboxes never timed out our connections, by using TCP keepalives at 45s intervals. However, this obviously wakes up the mobile radio (the radio is very power hungry) every 45s, which drains a ton of battery, so the Android browser team turned it off. And then they found that SPDY connections started hanging. So we sat around wondering what to do, and realized that what we wanted wasn’t to keep the connections alive, but rather to detect the dead connections (rather than waiting for lengthy TCP timeouts). So, we decided to use the SPDY PING frame as our liveness detection mechanism, since it was specced to require the peer to respond asap. This means that we could detect dead SPDY connections far sooner than dead HTTP/1.X connections, and close the SPDY connection and retry the requests over a new connection. We still have to be a bit conservative with the retry timeout, since if it is less than RTT, we’ll always fail the PING.
Really, what we want to do in the future is not actually close the connection down on timeout, but race building a new SPDY connection in parallel and only use that connection if it completes before the PING response returns.
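The detection logic amounts to a timed PING probe; a minimal sketch (the function names are hypothetical, and a real implementation would send an actual SPDY PING frame over the session):

```python
import asyncio

PING_TIMEOUT = 3.0  # must comfortably exceed the RTT, or we would declare
                    # perfectly healthy sessions dead

async def session_is_alive(send_ping, timeout=PING_TIMEOUT):
    """Send a SPDY PING and wait for the echo. False means the connection
    was likely blackholed (e.g. a silently expired NAT mapping) and should
    be torn down, with its requests retried on a fresh connection."""
    try:
        await asyncio.wait_for(send_ping(), timeout)
        return True
    except asyncio.TimeoutError:
        return False
```

Because the PING response is bounded by roughly one RTT rather than the multi-minute TCP retransmission timeout, a blackholed session is detected in seconds instead of minutes.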

Reducing unnecessary radio usage

While examining this issue for Android browser, they also informed me that they were under a lot of pressure to try to save even more power by closing all HTTP connections after single use. The reasoning is as follows:

  • After a server-specific timeout, the server will close the socket, sending a TCP FIN to the Android device which wakes up the radio just to close the socket.
  • After Chromium’s idle socket timeouts, we close the sockets, which sends a FIN from the Android device to the server, again waking up the radio just to send a FIN.
  • As Souders explains in detail, waking up the radio is expensive and the radio only powers back down based off timers that move the connection through a state machine.

The problem with this proposed solution is that closing down each connection after a single HTTP transaction prevents reuse, which slows down page load, which may ultimately keep the radio alive longer. It’s unclear to me what the right solution is, but I sort of suspect that closing down the HTTP connections after a single transaction is ultimately worse for power, and clearly worse from the performance perspective. One thing we can do to mitigate the problem is not to ever wake up the radio just to send a FIN. Indeed, this is what we now do for most platforms. However, that only addresses the client side of the problem, but not the server side. This is another one of those cases where SPDY helps. By reducing the number of connections, SPDY reduces the number of server FINs that will ultimately be received by the client, thus reducing the total number of times server FINs wake up / keep alive the client radio.


Connection management is one of the parts of Chromium’s networking stack that has a large impact on performance, so we have spent a lot of time optimizing it. Much of the work has been refactoring the original HTTP stack, which previously kept all HTTP transactions fairly independent of each other, with a serial state machine where the transaction would only be blocked on a single network event, into a much more intertwined HTTP stack, where networking events initiated by one HTTP transaction may create opportunities for a different HTTP transaction to make progress. All HTTP transactions share the same overall context, with different layers initiating jobs and servicing requests in priority order. We’ve also spent a lot of time experimenting with different constants in the HTTP stack, trying to figure out what static compile-time constants are best for the majority of our users. Yet we know that any single constant we pick will be suboptimal for a large number of users/networks. This is one reason why we advocate SPDY so strongly – it reduces our reliance on these constants by providing much better protocol semantics. Rather than having to make a tradeoff between link utilization/parallelism and contention, not to mention having to worry about bypassing TCP’s congestion avoidance with too many connections, we can have our cake and eat it too by leveraging SPDY’s prioritized multiplexing over a single connection. We’ve put a lot of work into this area of the Chromium network stack, but there’s still a lot more to be done. We really have only started scratching the surface of optimizing for mobile. And as Mike likes to say, SSL is the unoptimized frontier. Moreover, new technologies like TCP Fast Open are coming out, and we’re wrapping our heads around how to properly take advantage of a zero roundtrip connect (well, you only get to fit in a single packet’s worth of data, but that’s plenty for say…a SSL CLIENT_HELLO). So much to do, so little time.


  1. Well, not quite first. There are other steps for certain configurations, like proxy resolution.
  2. This constant is obviously suboptimal in lots of cases, and we’re looking into fixing it by implementing it more dynamically.
  3. It appears to have some interesting side effects. It looks like they issue double GETs, and cancel them, hoping that Windows will reuse the socket for other GETs, but sometimes it doesn’t work and they just burn a socket. They probably have to do it this way since they rely on WinINet, rather than implementing the HTTP stack themselves. But I’m just speculating, I don’t really know. [edit: Eric Lawrence chimed in with a much more authoritative explanation].
  4. Chromium abuses the term socket to generically refer to a “connected” byte stream.
  5. Linux (for HTTP, we generally care more about the server congestion window) by default times out the congestion window after an RTO, controllable by the tcp_slow_start_after_idle setting.
  6. “warmest socket” uses the socket which has read the most bytes. “last accessed socket” uses the socket that has been idle for the least amount of time. “warm socket” uses “read bytes / (idle time)^0.25”.

LocalStorage Load Times Redux

I previously discussed LocalStorage load times, and had concluded that it wasn’t too bad. However, when I got back from my new year’s holiday in Patagonia, I started digging through my email backlog and found an interesting line of questioning from GMail engineers, asking why Chrome’s first LocalStorage access was so slow in their use case, and providing data showing it. I, of course, knew why it should be slow, because it’s loading the entire DB from disk on the first access. But I previously had data that showed it wasn’t so bad. What I realized then was that I was not accounting for the size of the DB. So I added histograms to record LocalStorage DB sizes and load times by size. Note that LocalStorage is cached both in the renderer and browser processes. Furthermore, note that I’m only recording a single sample for each time LocalStorage is loaded into the process’s in-memory cache. This obviously means that long-lived renderer processes (roughly speaking, tabs) will only get one sample recorded, even if they heavily use LocalStorage.

Shows the size of the browser process and renderer process LocalStorage in-memory cache at load time (first access). There’s an interesting little jump at 1MB, which I’m unable to explain.

As can be seen, the vast majority of LocalStorage DBs are rather small in size (only on the order of a few KBs). That means that in my previous post where I recorded the LocalStorage load times, the histograms primarily consisted of samples for small LocalStorage DBs. In this iteration, I separated out the load times for DBs into three buckets: size < 100KB, 100KB < size < 1MB, and 1MB < size < 5MB, and got the following results.

Shows the load times of LocalStorage caches by size and process. Windows only.
Shows the load times of LocalStorage caches by size and process.

The short of it is that the long tail is terribly slow for large DBs. This definitely has implications for people considering using LocalStorage as an application resource cache for performance reasons, since caching resources will probably noticeably increase the size of the LocalStorage DB, and also has some security implications. And just in case you forgot, the API is synchronous, so the renderer main thread is blocked while the browser loads LocalStorage into memory, at most likely the worst time for performance – initial page load. And note the jump at the end of the chart…that’s because my max bucket was capped at 10s, since I didn’t think we’d have many samples that exceeded that. Unfortunately, I was wrong :(

In the end, as with all web performance techniques, you really should measure the impact of the technique to make sure that it is actually a performance win in your use case.

PS: I’ve published the full CDFs from the histogram data. Note that this consists of data gathered from Google Chrome 26 opted-in dev channel users in February over a space of 5-6 days. The results should definitely change somewhat for stable channel users (probably for the worse…dev channel users tend to have more advanced machines). Take the Mac and especially the Linux data with extra grains of salt, since their sample sizes are significantly lower.