Insouciant

William Chan's blag

“Server Push” Already Exists, but SPDY Server Push Is Better

SPDY server push is one of the most poorly understood parts of SPDY. When people hear that the protocol supports the server pushing resources to the client, some of them are excited by the possibilities, but many are scared that SPDY will allow the server to push undesired content. What many people don’t realize is that “server push”, where the server is able to push content down to the client that it may not have explicitly requested [yet], already exists. It’s called resource inlining. Servers already sometimes take external resources (e.g. scripts, stylesheets, and images) and directly inline them into the document (via inline script or style blocks, or data URIs). SPDY server push is superior to this approach in a number of ways. Here are a few:

  • Inlined resources cannot be cached separately from the document by the client. They are directly inlined into the document, which is usually kept uncacheable. People often complain that SPDY server push may push content that is already in the browser cache, but if servers inline resources instead, the resource will never be cached at all. With SPDY server push, at least, if the resource is already in the cache, the client can use a RST_STREAM frame to race to cancel the pushed stream.
  • Inlined resources cannot be shared across documents. If your website has multiple references to a resource, it’s desirable to reference it via the same URL for caching purposes. This reduces unnecessary redundancy across the pages on the website.
  • Inlined resources may be poorly prioritized. Generally speaking, browsers will try to prioritize the document over external resources like stylesheets, scripts, and images, since the document will usually allow discovering and downloading more resources sooner, thus speeding up resource loading. However, when resources that would otherwise be external are inlined directly into the document, their content preempts the rest of the content in the document. SPDY server push gives the server the ability to advertise that it will push some of the externally referenced resources, but rather than immediately pushing them before the rest of the document completes, as would happen with resource inlining, it can wait until the document finishes before pushing them.
  • Clients that truly do not want to be pushed content can disable it by setting SETTINGS_MAX_CONCURRENT_STREAMS to 0 (see the sketch below). On the other hand, clients are not able to disable resource inlining. Clients should be careful about disabling SPDY server push, though: if SPDY server push becomes too unreliable, servers may go back to resource inlining despite its downsides.

Fundamentally, once a client makes a request to a server, the server can send anything back in its response. For the servers that want to reduce roundtrips by pushing content instead, SPDY server push is a superior mechanism to resource inlining.
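For the protocol-curious, here is a rough sketch of the two client-side controls mentioned above (RST_STREAM cancellation and SETTINGS_MAX_CONCURRENT_STREAMS), based on my reading of the SPDY/3 draft. The frame layouts and constants come from that draft and may differ in other versions; the code itself is purely illustrative.

```typescript
// Illustrative only: building the SPDY/3 control frames a client can use to
// opt out of pushed streams. Constants are per my reading of the SPDY/3 draft.
const SPDY_VERSION = 3;
const TYPE_RST_STREAM = 3;
const TYPE_SETTINGS = 4;
const RST_STATUS_CANCEL = 5;                    // "stop sending this stream"
const SETTINGS_MAX_CONCURRENT_STREAMS = 4;      // settings ID

function controlFrame(type: number, payloadLength: number): DataView {
  const view = new DataView(new ArrayBuffer(8 + payloadLength));
  view.setUint16(0, 0x8000 | SPDY_VERSION);     // control bit + version
  view.setUint16(2, type);
  view.setUint32(4, payloadLength);             // flags byte (0) + 24-bit length
  return view;
}

// Race to cancel a pushed stream whose resource is already in the cache.
function cancelPushedStream(streamId: number): ArrayBuffer {
  const frame = controlFrame(TYPE_RST_STREAM, 8);
  frame.setUint32(8, streamId & 0x7fffffff);    // 31-bit stream ID
  frame.setUint32(12, RST_STATUS_CANCEL);       // status code
  return frame.buffer;
}

// Refuse pushed streams entirely by advertising a limit of 0 concurrent streams.
function disableServerPush(): ArrayBuffer {
  const frame = controlFrame(TYPE_SETTINGS, 12);
  frame.setUint32(8, 1);                        // number of ID/value pairs
  frame.setUint32(12, SETTINGS_MAX_CONCURRENT_STREAMS); // flags (0) + 24-bit ID
  frame.setUint32(16, 0);                       // value: allow 0 streams
  return frame.buffer;
}
```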

PS: I’m pleased to note that, despite earlier signs that SPDY server push may be removed from HTTP/2, at the interim httpbis meeting in Tokyo, everyone agreed that server push should stay in the spec.

SPDY Prioritization Case Study – Gmail

A while back, the Gmail team asked some Chromium devs why Google Chrome downloaded CSS so much more slowly than JS. This sounded strange to me, since I knew the code, and Chromium clearly prioritizes scripts and stylesheets at the same level (look for DetermineRequestPriority). I was initially skeptical of their claim, but then they said they had data that proved otherwise. Specifically, they told me:

I start downloading 400KB gzip'ed JS
JS  x--------;
Short after that I start downloading a JSON file containing our CSS, 60KB gzip'ed.
JS  x-------->;
CS   x---->;
I would expect to see this, in 99% of the cases:
JS x---------------------------------------| done
CS   x--------| done
But what we see often is this pattern:
JS x---------------------------------------| done
CS   x-----------------------------------------------------------------------------| done

At this point I was pretty intrigued. There’s nothing like a mystery to pique my interest. First, to understand Gmail’s loading infrastructure, check out Gmail’s webperf slides to see how they bootstrap the initial main page and then load the main script and stylesheet. So, I dove into the debugger and saw that not only is the CSS downloaded as JSON, it’s downloaded using an XHR (not very surprising in retrospect). On the other hand, the javascript resource isn’t a true javascript resource; it’s actually an iframe! I dove into Chrome DevTools to see what’s going on:

Viewing Gmail’s main JS and CSS loading in DevTools

As can be seen, the CSS is smaller, and it starts after the JS load started, but it generally finishes afterward. Indeed, in practice, this matches what we see in the wild. The question is, why? Well, the answer of course is that this is expected behavior with SPDY.

To figure out why, we need to see what the resources look like from the browser’s perspective. As previously mentioned, the javascript resource is actually an iframe where the html is full of inline script blocks. As the code I linked to earlier shows, an iframe is requested with the highest priority. And the CSS resource is actually JSON that is requested via an XHR. Again, as the code I linked to earlier shows, an XHR is requested with the lowest priority. If we look at the SPDY3 spec, we see that “The sender and recipient SHOULD use best-effort to process streams in the order of highest priority to lowest priority.” As specced, that means that Gmail’s iframe should pre-empt the XHR served over the same SPDY session. That’s of course why, from the Gmail team’s perspective, the CSS seems to download more slowly than the JS. They’re measuring from Javascript using techniques that do not have visibility into what’s happening in the SPDY session, so they can see that the CSS JSON is taking a long time to download, but cannot tell that it’s because it’s being downloaded at a lower priority than other resources and thus is getting pre-empted. Anyhow, mystery solved!
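For reference, the kind of mapping DetermineRequestPriority performs looks roughly like the sketch below. This is an illustrative approximation consistent with the behavior described in this post, not the actual Chromium code, and the exact buckets may differ.

```typescript
// Rough sketch of how a browser might bucket resource types into request
// priorities. Not Chromium's real code; names and levels are illustrative.
enum RequestPriority { LOWEST, LOW, MEDIUM, HIGHEST }

function determineRequestPriority(resourceType: string): RequestPriority {
  switch (resourceType) {
    case 'main-frame':
    case 'sub-frame':        // Gmail's "JS" is really an iframe -> top bucket
      return RequestPriority.HIGHEST;
    case 'script':
    case 'stylesheet':       // scripts and stylesheets share a level
      return RequestPriority.MEDIUM;
    case 'xhr':              // Gmail's "CSS" is JSON fetched via XHR -> bottom bucket
      return RequestPriority.LOWEST;
    default:
      return RequestPriority.LOW;
  }
}
```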

But we should step back here and ask, is SPDY doing the right thing? Should a resource (or rather, a SPDY stream) at a higher priority starve a resource at a lower priority? It’s understandable that it made the Gmail team wonder why CSS looked so slow to download. But if you think about it, SPDY prioritization isn’t changing the actual throughput. It’s simply affecting the order in which the resources’ bytes are sent by the server, so the time to download all resources should be the same as in the absence of prioritization. However, outside of exceptional, adversarial cases, it should generally result in individual resources completing sooner. Now, the question is whether it’s better to have individual resources complete sooner, or to get more interleaving amongst resources. It’s useful to note that for many resources, such as scripts, the web rendering engine is not able to progressively process them. The entire script is delivered as a whole to the script engine (V8 in Chromium’s case). On the other hand, resources like HTML can be incrementally processed. So, would it be better to completely deliver one HTML resource and then another, or to interleave them? That’s completely unclear to the browser. For images, it’s likewise unclear whether it’s better to deliver them serially or interleaved, although it’s generally more likely that images earlier in the document will be in the viewport. But some images are progressive, and interleaving them will let the user see lower quality versions of more images sooner, which is arguably a better user experience. And since documents don’t always have the layout markup for images, it may be better to interleave the images in order to get their dimensions sooner so the rendering engine can lay out the page correctly sooner. Current SPDY prioritization allows for conveying fixed priority levels, but it only has very rough semantics for describing where the browser wants strict ordering of resources versus equivalence (such that a server could interleave if that’s better). Indeed, SPDY3 has 3 bits for priorities, which isn’t necessarily enough since there are generally far more than 8 resources per page. This, amongst other reasons, is why we’re revamping SPDY prioritization in SPDY4, and I will present many of these use cases to the httpbis working group next week in Tokyo so we can design for them in HTTP/2.

Digging in even deeper, it’s curious to ask why Gmail is downloading JS using an iframe with inline script, and why it’s downloading CSS as JSON using an XHR. Chromium gives scripts and stylesheets the same priority levels, so if Gmail didn’t use these techniques, Google would probably have interleaved the two resources. As far as the JS goes, there are multiple reasons to use the script in iframe technique, and I won’t cover them all here. For one, it is an effective cross browser technique for loading resources in parallel. Also, Gmail is segmenting the script into multiple script blocks which allows the browser to incrementally feed script chunks to the javascript engine for parsing and execution in parallel with the download of more script blocks in the iframe. Moreover, this technique also allows Gmail to render the progress bar as each script block executes. Amongst other downsides though, it defeats the browser’s attempt to recognize the correct resource type and accurately prioritize the resource. In this case, it introduces contention with the initial main page, since both are considered to be documents and thus are at the same priority level, which may or may not be a good thing. So I asked Dr. Barth to help me figure out a replacement for the script in iframe technique, and he proposed supporting multipart/mixed responses for script elements.
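For readers unfamiliar with the technique, here is a rough sketch of how a script-in-iframe loader can work. The function names, chunk count, and wire format are invented for illustration and are not Gmail’s actual code.

```typescript
// Sketch of the "script in iframe" technique: the loader iframe's document is a
// stream of inline <script> blocks, each delivering one chunk of the app's JS.
const TOTAL_CHUNKS = 20;          // assumed to be known up front
let chunksLoaded = 0;

// Called from each inline <script> block in the loader iframe as it is parsed.
function onScriptChunk(chunkSource: string): void {
  // Each chunk is compiled and run as soon as its block arrives, in parallel
  // with the download of the later blocks in the iframe document.
  new Function(chunkSource)();
  chunksLoaded++;
  updateProgressBar(chunksLoaded / TOTAL_CHUNKS);
}

function updateProgressBar(fraction: number): void {
  const bar = document.getElementById('progress-bar');
  if (bar) {
    bar.style.width = `${Math.round(fraction * 100)}%`;
  }
}

// The iframe document served by the app would look roughly like:
//   <script>parent.onScriptChunk("/* module 1 source */");</script>
//   <script>parent.onScriptChunk("/* module 2 source */");</script>
//   ...
// Because the HTML parser executes each block as it streams in, the JS engine
// receives code incrementally instead of waiting for one monolithic file.
```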

As far as why Gmail downloads CSS as JSON using an XHR, it’s actually for a number of reasons, not all of which I’m going to dive into here. But one major reason is to avoid blocking first paint, since rendering engines will block first paint until relevant stylesheets come in, in order to prevent a flash of unstyled content (FOUC). Gmail doesn’t want to block first paint on downloading the CSS for the main part of the web app, since it wants to render the progress bar in the meanwhile. There are other reasons CSS as JSON is useful for them, but if we ignore those, then what Gmail needs is a way in the web platform to declaratively (so the speculative parser can discover the resource sooner) and asynchronously load stylesheets in a manner that doesn’t block first paint, fires a load event, and is properly recognized by the web platform as a stylesheet download (so it can be appropriately prioritized) rather than an opaque blob. There are many loading techniques, like creating a link element and appending it from script, that meet most of these goals, but not all of them.
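As an example of one of those partial solutions, here is the “create a link element and append it from script” technique mentioned above, as a minimal sketch (the file name is a placeholder). Because it is not declarative, the speculative parser cannot discover the stylesheet early, which is one of the goals it misses.

```typescript
// Load a stylesheet from script so the parser-discovered critical path (and, in
// most browsers, first paint) isn't held up waiting for it. A load event fires,
// but the resource is invisible to the speculative parser until this code runs.
function loadStylesheetAsync(href: string): Promise<void> {
  return new Promise((resolve, reject) => {
    const link = document.createElement('link');
    link.rel = 'stylesheet';
    link.href = href;
    link.onload = () => resolve();
    link.onerror = () => reject(new Error(`failed to load ${href}`));
    document.head.appendChild(link);
  });
}

// Usage: paint the progress bar with the small bootstrap CSS first, then pull
// in the heavy application stylesheet.
loadStylesheetAsync('/main-app.css').then(() => {
  // Safe to reveal UI that depends on the full stylesheet.
});
```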

Gmail’s a fascinating case to study, since they’ve done so many web performance optimizations that had to work across a large number of browsers, many of them very old. These web performance techniques were the best options for older browsers, but they interact in interesting, potentially suboptimal ways with modern browsers.

Status of HTTP Pipelining in Chromium

OK, people have asked me this enough times, so it’s time to write down what’s up with pipelining in Chromium. In short, Chromium has a very naive pipelining implementation that is off by default, and it’s unclear if we’ll ever enable it by default. The primary reasons we will not enable it for at least the foreseeable future are:

  • Interoperability concerns
  • Unclear performance benefits

Interoperability Concerns

This is probably the most important reason for not enabling HTTP pipelining by default. First, it’s important to recognize how critical interoperability is. If a browser feature breaks web compatibility, irrespective of which entity (browser/intermediary/server/etc) is at fault, the typical user response is to simply switch browsers. If we’re lucky, the user may reach out to our support forums or bug tracker. Indeed, interoperability concerns are the primary reason we disabled False Start, despite its clear performance benefits.

So, when we discuss the interoperability of HTTP pipelining, what are we worried about? Well, for one, we’re concerned about the failure modes of pipelining. What happens when a server or intermediary doesn’t support HTTP pipelining? Does it close the connection? Does it hang? Does it corrupt responses? If the failure mode is clearly detectable, we can probably retry without pipelining, albeit at the cost of a roundtrip to detect the failure. But what would you do if it hangs? Retry without pipelining after some sort of fixed timeout? If it actually simply corrupts responses, that’s super scary.

Also, where are the failures happening? Is it primarily origin servers? Intermediaries? In mnot’s internet draft discussing how to improve pipelining in the open web, he proposes some strategies. For origin servers, he proposes maintaining a blacklist of broken origins. It does seem conceivable that if we could reliably detect pipelining incompatibility, we could maintain a blacklist, though it’s not obvious to me that such detection is reliable. For what it’s worth, Firefox seems to have a server blacklist, and it seems to more or less work for them. Even assuming we could somehow detect all the pipelining incompatible origin servers, we’d still have to detect broken intermediaries. Indeed, this is a huge part of the problem, and Patrick McManus identifies this as the primary reason desktop Firefox does not enable pipelining by default. Again, mnot’s I-D proposes a solution: sending pipelined requests to a known pipeline-compatible origin server in order to detect problematic intermediaries. Now, this is problematic for many reasons. For one, it requires “phoning home” to a known server, which is always concerning from a privacy perspective. Moreover, it requires repeating the test when switching network topologies, which, given the increased mobility of today’s computers (phones, laptops, etc.), impacts its utility, not to mention wasting bandwidth that the user may be paying for (e.g. mobile data). Note that these downsides do not rule out the approach, but they must factor into any decision to rely on such a pipelining compatibility test.

That said, we implemented some of these basic pipelining tests in Chromium and enabled them for, at their peak, 100% of our Google Chrome dev channel users. As always, it’s important to caveat that the different Google Chrome release channels have different populations, and indeed we did see some differences between our canary and dev channel pipelining tests, so it’s important not to put too much faith in the numbers being exactly representative of all users. Still, the data offers some cool insights. For example, the test tries to pipeline 6 requests, and we’ve seen that only around 65%-75% of users can successfully pipeline all of them. However, 90%-98% of users can pipeline up to 3 requests, suggesting that 3 might be a magic pipeline depth constant in intermediaries. It’s unclear what these intermediaries are…transparent proxies, virus scanners, or what not, but in any case, they are clearly interfering with pipelining. Even 98% is frankly way too low a percentage for us to enable pipelining by default without detecting broken intermediaries using a known origin server, which has its aforementioned downsides. Also, it’s unclear if the battery of tests we run would provide sufficient coverage.

There are some potential modifications we could make to address some of these problems. For example, we could give up on true pipelining and do pseudo pipelining: only start pipelining the next request once the previous request’s response has started coming back. The assumption here is that some broken intermediaries / servers are not expecting a recv() to include multiple responses, so this helps ensure that doesn’t happen. It obviously loses most of the potential latency reduction of pipelining, but it might have much better interoperability, and it is less vulnerable to head of line blocking issues. Another idea is to only use pipelining with HTTPS. That helps eliminate the vast majority of interoperability problems with intermediaries (barring SSL MITM proxies and their ilk), as people trying to deploy WebSockets have found.
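To make the pseudo pipelining idea concrete, here is a rough sketch of the scheduling rule; the Connection interface is hypothetical, purely for illustration, and this is not Chromium’s implementation.

```typescript
// Pseudo pipelining: only put request N+1 on the wire once the response to
// request N has started arriving.
interface Connection {
  write(requestBytes: Uint8Array): void;
  // Fires when the first bytes of the response to the outstanding request arrive.
  onResponseStart(callback: () => void): void;
}

class PseudoPipeline {
  private queue: Uint8Array[] = [];
  private awaitingResponse = false;

  constructor(private conn: Connection) {
    conn.onResponseStart(() => {
      // The previous response has started coming back; per the scheme above,
      // it's now OK to send the next queued request.
      this.awaitingResponse = false;
      this.sendNext();
    });
  }

  enqueue(request: Uint8Array): void {
    this.queue.push(request);
    this.sendNext();
  }

  private sendNext(): void {
    if (this.awaitingResponse || this.queue.length === 0) {
      return;
    }
    this.awaitingResponse = true;
    this.conn.write(this.queue.shift()!);
  }
}
```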

I’ve given numbers about problems with intermediaries that our desktop users have seen, but what about mobile? People have been saying how pipelining totally works with mobile, and how it’s important for Chrome on Android to use it. The short answer is I don’t know. Maybe it’d be awesome, or maybe it’s sucky, I really don’t know. I’ve been waiting for us to get better data gathering infrastructure on mobile Chromium before making any claims here. I’ve definitely got some pet theories though. While I do think it’s possible that mobile carriers have fewer pipelining incompatible intermediaries, as far as I know mobile browsers aren’t turning pipelining on/off based on connection type (e.g. WiFi vs 3G), so I don’t think that can explain why mobile browsers seem to have fewer incompatibility issues with pipelining. I can think of two possible explanations. Either the truly problematic intermediaries are client-side (virus scanners and what not), and those are far less prevalent on these mobile devices, or people just have higher tolerance for mysterious hangs and what not on their mobile devices and press reload more often. It’s not like they’ve historically had other browsers to try out on their mobile devices when they get frustrated when the default browser hangs on loading a page :) Anyway, I’m just speculating here and have no real data about mobile pipelining compatibility.

Unclear Performance Benefits

Pipelining can definitely reduce latency significantly. That said, an optimal pipelining implementation is fairly complicated. Darin notes some of the complexities in a somewhat outdated FAQ on pipelining. Most of the issues revolve around the dreaded head of line blocking issue. On the one hand, pipelining deeper may reduce queueing delay, but if a request gets stuck behind a slow request, then it may have been better to schedule it on a different pipeline / connection. In order to minimize the likelihood of being stuck behind slow requests, it may be better to have shallower pipelines. Shallower pipelines also let the browser re-prioritize HTTP requests before they get assigned to an available pipeline: once a request is assigned to a pipeline, a newly discovered high priority resource like an iframe will sit behind whatever requests are already on that pipeline, despite perhaps being higher priority. It is certainly possible for pipelining to actually worsen page load time if requests end up waiting for a long time behind slower requests. There are mitigation strategies, primarily based on guessing at request latency using heuristics like resource type, and also re-requesting head of line blocked requests on a different pipeline / connection, perhaps based on a timer (again, possibly wasting bandwidth, which the user may have to pay for). In the end, they’re just heuristics, and heuristics can be wrong. Moreover, another problem with pipelines is that if we encounter a transport error on a pipeline, we have to retry all of its requests on another pipeline / connection, so it might again be slower.

All in all, tuning a pipelining implementation is fairly complicated, and Chromium’s implementation is nowhere near optimal, as we’ve primarily focused on detecting broken intermediaries in our compatibility tests. That said, even if we could tune it, how much of a difference would it make? Guypo’s done some analysis here and believes that pipelining doesn’t make much of a difference web performance wise. Looking at his study, I agree with most of his conclusions, which makes me even more lukewarm about enabling pipelining by default, at least for desktop.

Conclusion

Currently the majority of the Chromium developers discussing pipelining aren’t very excited about its prospects, at least for desktop. SPDY and HTTP/2 have always been our long-term plan for the future, but some of us had been hopeful that we could improve performance for users sooner by getting pipelining to work. For the foreseeable future, the pipelining code will probably stay in its zombie state while we work on other performance initiatives, unless we feel like killing it off or we get/gather new data showing interoperability concerns aren’t a big deal anymore or there are dramatic performance improvements to be had. There’s perhaps more reason to be optimistic about mobile, but I’ll wait until someone shows me the data. For now, I’m focusing my attention on HTTP/2. I’m looking forward to next week’s httpbis interim meeting in Tokyo!

Prioritization Is Critical to SPDY

I get the sense that when people discuss SPDY performance features, they pay attention to features like multiplexing and header compression, but very few note the importance of prioritization. I think it’s a shame, because resource prioritization is critical. SPDY prioritization enables browsers to advise servers on appropriate priority levels for resources, without having to resort to hacks like not requesting a low priority resource until higher priority resources have completed, which make it difficult for browsers to fully utilize the link. Theoretically, SPDY will let you have your cake and eat it too.

In order to demonstrate the effect of SPDY prioritization, my plan was to roll out SPDY on my own server and show the performance improvement we get from disabling WebKit’s ResourceLoadScheduler (thus sending all requests immediately to Chromium’s network stack rather than throttling them) and relying on SPDY prioritization instead. However, to my dismay, disabling WebKit’s ResourceLoadScheduler actually slowed down my personal website. See this video of the page load (Chrome 23 stable on the left, Chrome 25 canary on the right; stable still throttles subresources before first paint), where most of the page completes seconds faster on Chrome 23 than on Chrome 25.

Ugh, what’s wrong? Does SPDY prioritization not work as advertised? Check out the Chrome 23 stable release load of my website vs the Chrome 25 canary load of my website to see the difference in behavior.

Chrome 23 Stable load of https://insouciant.org
Chrome 25 Canary load of https://insouciant.org

The key thing to notice here is that in Chrome Canary, the critical JS and CSS resources are delayed due to contention. Why is there contention? Shouldn’t SPDY prioritization solve this issue? I looked at the nginx SPDY patch to find out how they were doing prioritization, and couldn’t figure out how it worked, since it didn’t even seem to be present. So I shot Valentin (the nginx dev who authored the SPDY patch) an email asking about SPDY prioritization not working, and he responded with:

Yes, it is known. I’m currently working on an implementation that will respect priorities as much as possible.
That’s one of the reasons of why we do not push current patch into nginx source.

OK! That makes sense. Nginx’s SPDY implementation is still in beta, so it “works” but does not respect prioritization yet. That’s fair, because they’re still working on it. The problem is that people are deploying real websites using nginx’s SPDY support, even though prioritization doesn’t work at all. For example, check out the Chrome 23 stable vs Chrome 25 canary load time waterfalls and video for https://getnodecraft.net:

Looking at the waterfalls I’ve linked above, you can see that Chrome 23 stable achieves a faster first paint and page load time because it reduces contention on the stylesheets and script, which lets it reach both first paint and DOMContentLoaded sooner. DOMContentLoaded fires off some more requests, so the overall page load also completes sooner, despite Chrome 25 canary’s better early link utilization.

If you’ve heard the SPDY team talk about optional features, you’ll know that we hate them (see Mike’s comment (d) in his blog post). However, despite prioritization’s importance, we sadly aren’t able to make servers that don’t respect priorities properly incompatible. That’s because prioritization has to be advisory, since it’s quite reasonable for a low priority resource to be immediately available for transmission (say, because it’s cached) while a higher priority resource needs more server-side processing. It would be wasteful not to allow the server to transmit lower priority resources while higher priority resources aren’t yet available, and since it’s impossible for the client to distinguish between this case and a lack of prioritization, it’s impossible to enforce server support from the client side.

While it’s encouraging that this shows people are excited about SPDY, I think it’s a bit unfortunate that nginx’s incomplete SPDY implementation has prematurely seen relatively wide adoption. It would be terrible if we reached a state where a nontrivial chunk of SPDY server deployments did not respect prioritization and there was little chance of change, since that would pressure browsers to give up on SPDY prioritization and fall back to throttling low priority resources before first paint. That said, I’m not too worried yet. SPDY is still in its early stages of deployment, and old SPDY versions will get disabled when we finish the SPDY/4 spec and start deploying that, not to mention HTTP/2. But once we get to the HTTP/2 deployment stage, we’ll have to push hard on server implementations to get prioritization right so clients can depend on server-side support.

Some More Reasons Hacking Around HTTP Bottlenecks Sucks and We Should Fix HTTP

It's not you I hate, HTTP. I hate the hacks people write because of you

Web developers keep devising more and more clever hacks that help work around HTTP level issues like overhead from roundtrips and lack of parallelism. The problem is that a lot of these techniques are not magic bullets, but have some important tradeoffs. Here I go through two such hacks, how they are suboptimal, and how they hopefully will become unnecessary if a prioritized multiplexing protocol that looks like SPDY gets standardized as HTTP/2.0. With prioritized multiplexing, the natural way of authoring content will hopefully also be the faster way, and then these hacks can fade away into irrelevance. This isn’t an exhaustive list of hacks and why they suck. They’re simply two that I have run into as of late and haven’t seen discussed elsewhere (but I could be wrong).

CSS Sprites

CSS Sprites are pretty neat, because they help reduce roundtrips by combining multiple images into a single image, thereby reducing the number of image requests (at some cost). CloudFlare does an awesome job spriting their website. For example, their CDN features page utilizes CSS sprites to great effect, helping reduce the number of requests. But it’s useful to examine the waterfall to see how referencing images in the stylesheet instead of the main document can slow things down.

As you can see, referencing the images in an external stylesheet requires downloading the stylesheet before the image gets requested. One might reasonably ask why we don’t speculatively download image resources in the stylesheet as the stylesheet comes in, but many stylesheets include a bunch of resources that are never used. For example, here’s the stylesheet included on my Google search for [flowers] when I’m signed in. Chrome tells me there are 96 instances of “background:url(” in that stylesheet, and you can bet that we’re not using all of those resources on the Google search page. Note that waiting for the external stylesheet download to complete before issuing image requests happens anyway in Chrome stable and IE.
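As an aside, if you want to reproduce that kind of count for one of your own stylesheets, a quick snippet along these lines works from the DevTools console (the URL is a placeholder; mind CORS for cross-origin stylesheets):

```typescript
// Count how many "background: url(...)" declarations a stylesheet contains,
// mirroring the crude count quoted above.
async function countBackgroundUrls(stylesheetUrl: string): Promise<number> {
  const response = await fetch(stylesheetUrl);
  const css = await response.text();
  const matches = css.match(/background\s*:\s*url\(/g);
  return matches ? matches.length : 0;
}

countBackgroundUrls('/assets/results.css').then(count =>
  console.log(`${count} background:url( references in the stylesheet`));
```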

So we have the situation where, in order to reduce the number of image requests, many content authors switch from using normal img tags to using CSS sprites. This prevents the browser from discovering the resources during speculative parsing (e.g. WebKit’s PreloadScanner); instead we have to wait for the stylesheet to complete downloading and then for the rendering engine to match the CSS selector for the parsed element before it decides to request the appropriate background image. It’s particularly frustrating because CSS sprites are a commonly accepted “best practice” among web developers, because they do indeed help work around deficiencies in HTTP, but they also prevent the browser from loading resources as early as possible.

Note that CloudFlare’s making the right tradeoffs here in an HTTP world. But they could be even faster if they undid that optimization and just declared the image resources normally in the document when serving over a prioritized multiplexing protocol like SPDY (and hopefully HTTP/2.0 in the future!). CloudFlare’s been pretty quick about iterating on new technologies, so I have no doubt that they’ll pick up on this missed optimization.

Hostname Sharding

Hostname sharding is another commonly accepted “best practice” designed to work around HTTP’s lack of parallelism. It’s true that it does indeed make browsing faster in general, but it has numerous downsides. It increases the number of connections, thereby increasing resource consumption at TCP endpoints and middleboxes, increases DNS traffic and entries (and the TCP connection is blocked on the DNS lookup), and splits congestion control information across multiple connections. Moreover, more parallelism doesn’t necessarily make things faster. It might result in more contention and more interleaving. For more details, see Patrick’s post on the matter.
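As a refresher, the technique itself is trivial: deterministically map each resource onto one of N shard hostnames so the browser opens up to N times the per-host connection limit. The sketch below uses made-up hostnames and is only meant to illustrate the mechanism.

```typescript
// Hostname sharding in a nutshell: hash the resource path onto one of N shard
// hostnames so the browser will open up to N * 6 connections instead of 6.
// Hostnames here are placeholders for illustration.
function shardedUrl(path: string, shardCount: number): string {
  // Hash the path so a given resource always lands on the same shard;
  // otherwise it would be re-downloaded and cached once per shard.
  let hash = 0;
  for (const ch of path) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  }
  const shard = (hash % shardCount) + 1;
  return `https://img${shard}.example.com${path}`;
}

// e.g. shardedUrl('/sprites/header.png', 6) always maps that path to the same
// img[1-6].example.com host.
```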

The benefits have been written about numerous times already, so I won’t bother repeating them. But I will warn about the dangers of hostname sharding, in particular because Patrick McManus showed me that it was being abused. Some sites seem to be using a scary number of shards, and it’s pretty bad for the internet and for their performance. Check out the waterfall for 163.com. It’s pretty absurd. Here we have one of the most popular sites on the web, and it’s opening up an astonishing 210 connections to serve its homepage! 72 of those connections are sharded across img[1-6].cache.netease.com, more are sharded across img[1-3].126.net, and a lot of connections go to g.163.com. Glancing at the connection view, it seems that around 100 connections are utilized at the same time. This is obviously absurd and totally bypasses TCP slow start, since all of these connections will start with a server-side initcwnd of at least 3, and very possibly more, meaning an aggregate initial congestion window of at least 300 packets! That’s just silly and totally busted, and you can see from CloudShark that the page load exhibits a number of self-inflicted congestion wounds:

Hostname sharding is a hack to work around HTTP’s lack of parallelism, but here the number of connections is simply too much, and it looks like it probably causes the page to load more slowly than it could. Site owners need to be careful with the number of hostname shards they use, as the increased parallelism may not always improve performance. The potential for congestion when using multiple connections is also higher nowadays due to the increased adoption of IW10. The real solution of course is to switch to a prioritized multiplexed protocol like SPDY and send all resources over as few connections as possible.

SSL Performance Case Study

Back when Mike Belshe was still at Google, he used to keep saying that SSL was the unoptimized frontier. Unfortunately, even years later, it still is. There’s low hanging fruit everywhere, and most folks, myself included, don’t know what they’re doing. Tons of people are making basic mistakes. Anyone deploying a website served over HTTPS really ought to read Adam Langley’s post on overclocking SSL. There is a lot of useful information in his post. I’m going to call out a few of them really quickly as they pertain to latency:

  • Send all necessary certificates in the chain – save the DNS + TCP + HTTP fetch for missing ones.
  • Try to keep the certificate chain short – initcwnd is often small and takes a while to ramp up due to TCP slow start.
  • Don’t waste bytes sending the root certificate – it should already be in the browser certificate store, otherwise it won’t pass certificate verification anyway.
  • Use reasonable SSL record sizes – user agents can only process each record as a whole, so don’t spread records across too many TCP packets or you’ll incur unnecessary delay (potentially roundtrips); see the sketch below.
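To make that last rule concrete, here is an illustrative server-side write loop that keeps each record small enough to fit in a single TCP segment. The tlsWrite callback stands in for a hypothetical TLS socket API; this is a sketch of the idea, not any particular server’s implementation.

```typescript
// Never hand the TLS layer more plaintext than fits in one TCP segment, so each
// record can be decrypted as soon as its single packet arrives.
const MAX_RECORD_PAYLOAD = 1400; // leave headroom for TLS and TCP/IP overhead

function writeResponseBody(
  body: Uint8Array,
  tlsWrite: (chunk: Uint8Array) => void  // hypothetical: one call = at most one record
): void {
  for (let offset = 0; offset < body.length; offset += MAX_RECORD_PAYLOAD) {
    tlsWrite(body.subarray(offset, offset + MAX_RECORD_PAYLOAD));
  }
}
```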

Case Study – CloudFlare

I decided to analyze a website (CloudFlare) to demonstrate how these rules can make a difference in SSL performance. I figured CloudFlare would be a good example, because with their new OCSP stapling support, I could demonstrate how that saved a serial OCSP request in exchange for increased cwnd pressure. Remember though, as Patrick McManus points out, missed optimizations are everywhere, so definitely don’t view this case as the exception.

First step, I loaded it up in WebPageTest (thanks Pat for a great product!) and looked at the SSL connection for the main document:

Here we go, we see that it takes around 800ms to finish the SSL handshake to https://www.cloudflare.com, including 2 OCSP requests to verify the certificates in the certificate chain. Let’s dive into the SSL handshake to see what’s taking all the time. To do so, I take the tcpdump from WebPageTest and feed it into CloudShark (learned about this webapp recently, ain’t it cool?). The TCP stream index for the main document is 4, so I use tcp.stream eq 4 in CloudShark to follow that TCP connection. Let’s check out the handshake messages.

Unfortunately, it appears that the Certificate and CertificateStatus (OCSP stapling) messages for CloudFlare are too large and thus overflow initcwnd, so we see frame 79 arrive a full RTT after frame 62. To see why this is the case, we dive into the actual certificate chain.

Yeah, so we see that the Certificate chain is 7123 bytes and consists of 5 certificates. Unfortunately, viewing this certificate chain in CloudShark makes it difficult to see what each certificate is signing. To see it more clearly, I turn to the openssl command line utility and run a command like openssl s_client -connect www.cloudflare.com:443, which gives me output that includes this blurb:

Certificate chain
0 s:/1.3.6.1.4.1.311.60.2.1.3=US/1.3.6.1.4.1.311.60.2.1.2=Delaware/businessCategory=Private Organization/serialNumber=4710875/C=US/ST=California/L=Palo Alto/O=CloudFlare, Inc./OU=Internet Security and Acceleration/OU=Terms of use at www.verisign.com/rpa (c)05/CN=www.cloudflare.com
  i:/C=US/O=VeriSign, Inc./OU=VeriSign Trust Network/OU=Terms of use at https://www.verisign.com/rpa (c)06/CN=VeriSign Class 3 Extended Validation SSL CA
1 s:/C=US/O=VeriSign, Inc./OU=VeriSign Trust Network/OU=Terms of use at https://www.verisign.com/rpa (c)06/CN=VeriSign Class 3 Extended Validation SSL CA
  i:/C=US/O=VeriSign, Inc./OU=VeriSign Trust Network/OU=(c) 2006 VeriSign, Inc. - For authorized use only/CN=VeriSign Class 3 Public Primary Certification Authority - G5
2 s:/C=US/O=VeriSign, Inc./OU=VeriSign Trust Network/OU=(c) 2006 VeriSign, Inc. - For authorized use only/CN=VeriSign Class 3 Public Primary Certification Authority - G5
  i:/C=US/O=VeriSign, Inc./OU=Class 3 Public Primary Certification Authority
3 s:/C=US/O=VeriSign, Inc./OU=VeriSign Trust Network/OU=Terms of use at https://www.verisign.com/rpa (c)06/CN=VeriSign Class 3 Extended Validation SSL CA
  i:/C=US/O=VeriSign, Inc./OU=VeriSign Trust Network/OU=(c) 2006 VeriSign, Inc. - For authorized use only/CN=VeriSign Class 3 Public Primary Certification Authority - G5
4 s:/C=US/O=VeriSign, Inc./OU=VeriSign Trust Network/OU=(c) 2006 VeriSign, Inc. - For authorized use only/CN=VeriSign Class 3 Public Primary Certification Authority - G5
  i:/C=US/O=VeriSign, Inc./OU=VeriSign Trust Network/OU=(c) 2006 VeriSign, Inc. - For authorized use only/CN=VeriSign Class 3 Public Primary Certification Authority - G5

I took a look at this before and was very confused about what was going on. My colleague Sir Ryan of Sleevi, who is always educating me on the grotty corner cases of SSL (including several chunks which appear in this post), helped me grok this by providing this helpful snippet:

[0] www.cloudflare.com
  [1] - Class 3 Extended Validation SSL CA
    [2] - Class 3 Public Primary CA - G5 [cross-signed version]
      [omitted] - Class 3 Public Primary CA - [legacy root]
  [3] - Class 3 Extended Validation SSL CA
    [4] - Class 3 Public Primary CA - G5 [self-signed version]

Really fascinating! [3] is actually a duplicate of [1], and [4] is a root certificate, which user agents should already have in their certificate store, so sending it is superfluous. The chain also violates the TLS 1.0 spec, section 7.4.2, which says:

This is a sequence (chain) of X.509v3 certificates. The sender's
certificate must come first in the list. Each following
certificate must directly certify the one preceding it. Because
certificate validation requires that root keys be distributed
independently, the self-signed certificate which specifies the
root certificate authority may optionally be omitted from the
chain, under the assumption that the remote end must already
possess it in order to validate it in any case.

Yeah. So CloudFlare’s setup is weird, violates the spec ([3] signs [0] instead of [2]), and leads to a bloated certificate chain that contributes to overflowing their initcwnd. I reached out to them and they confirmed that this was a configuration error that they would be fixing. Great!

Sadly though, it’s the OCSP stapled response that apparently pushes them over the edge. The CertificateStatus message is 1995 bytes, spilling over both frames 76 and 79. The ServerKeyExchange and ServerHelloDone messages on the other hand are only 331 and 4 bytes respectively, and without the CertificateStatus message, they would have fit in frame 76.

It’s rather unfortunate that this is the case, but at least the OCSP stapled response is saving them an OCSP request (DNS + TCP + HTTP). You can already see in the waterfall how much (500 ms?) the 2 OCSP requests that Windows’ cert verifier makes in requests 2 and 3 stall the SSL handshake for request 1 (the root document), so as this example demonstrates, OCSP stapling is totally better than an OCSP request. Also, once CloudFlare fixes their broken certificate chain, they probably won’t overflow initcwnd anymore. But man, wouldn’t it be great if we didn’t do these requests in the first place? I mean, how much value do they really add? Well, one option is to use CRLSets (which don’t apply in this case for EV certs, not to mention that on WebPageTest the CRLSets haven’t been downloaded yet, as they are downloaded separately after first run).

Anyway, back to TCP stream 4. Now that we’ve finished the SSL handshake, we’re good to go, right? Well, there are other things to watch out for. For one, we want to make sure that the SSL record sizes are small enough so we don’t unnecessarily delay the user agent from being able to process the HTTP response body. Unfortunately, we see that this is indeed happening:

9KB records split across 7 packets, leading to a ~45ms delay before the first packet of data (622 bytes) can be processed. I haven’t looked more thoroughly, but it’s likely that this kind of pattern repeats itself during the lifetime of this connection. Hopefully it won’t lead to roundtrips, especially once the congestion window ramps up a bit more.

Looking at the next SSL connection for ajax.cloudflare.com, we see some interesting SSL performance benefits/losses:

In contrast to the SSL connection for www.cloudflare.com, the certificate chain for ajax.cloudflare.com (rooted at GlobalSign) is relatively small (2495 bytes) and fits nicely in the initial congestion window. CloudFlare staples the OCSP response there too, saving a request, which is great, except that serving up the OCSP stapled response seems to take a shocking 280ms. I’m confused about what’s happening here, since when you examine the CertificateStatus message, it’s supposed to be 1507 bytes long, and frames 186 and 189 contain 330 and 1176 bytes respectively, which adds up to only 1506 bytes, one byte short of the full message, which is rather unfortunate. It doesn’t appear to be an initcwnd issue, since the 280ms delay is way above the RTT. It’s suspiciously close to an RTT after the delayed ACK in frame 228. Indeed, this pattern seems to repeat itself multiple times, so it’s not a one-time glitch. It smells of some sort of implementation detail. I reached out to CloudFlare earlier and they’re looking into it. In any case, on the upside, Chrome manages to save a roundtrip here because CloudFlare uses both forward secrecy and NPN (not to mention SPDY!), so Chrome can use False Start. Nice!

Next, we examine the SSL connection for ssl.google-analytics.com. It’s rather boring because everything just works great. The Certificate message is 1756 bytes long, there’s no online revocation check, and Google advertises NPN and uses forward secrecy, so Chrome can use False Start and save a roundtrip. And Google uses SSL records capped at 1345 bytes, which generally prevents records from spanning more than one packet. All in all, it looks pretty optimized to me.

Lastly, there’s the SSL connection for cdn01.smartling.com. Now this one highlights another common misconfiguration problem: not including the intermediate certificates. You can see from the waterfall how costly this is:

Despite the obvious cost here, there’s an argument to be made that not including the intermediate certificates will keep the cert chain short, and if the OS is smart it’ll cache the intermediates, thus preventing the painful lookup. Obviously, some intermediate certificates are more likely to be cached than others. Also, to my knowledge, so far only Windows caches intermediate certificates. Given the tradeoffs, I’d still recommend including the intermediate certificates in general.

Interestingly enough, I doublechecked today, and Smartling seems to have fixed their missing intermediate chain issue. That’s awesome, except now they seem to be including too many certificates in their certificate chain and exceed their tiny initcwnd :( They should definitely remove the ValiCert self-signed cert, and possibly the Go Daddy Class 2 Certification Authority one (it’s included in my Mac’s root certificate store, but I’m not sure if it’s on all relevant devices). Actually, now that I look closer, Smartling didn’t actually change, but rather has two EC2 instances running, each with a different configuration (ec2-50-18-112-254.us-west-1.compute.amazonaws.com and ec2-23-21-68-21.compute-1.amazonaws.com). You can see their respective chains here:

---
Certificate chain
 0 s:/O=*.smartling.com/OU=Domain Control Validated/CN=*.smartling.com
   i:/C=US/ST=Arizona/L=Scottsdale/O=GoDaddy.com, Inc./OU=http://certificates.godaddy.com/repository/CN=Go Daddy Secure Certification Authority/serialNumber=07969287
---

---
Certificate chain
 0 s:/O=*.smartling.com/OU=Domain Control Validated/CN=*.smartling.com
   i:/C=US/ST=Arizona/L=Scottsdale/O=GoDaddy.com, Inc./OU=http://certificates.godaddy.com/repository/CN=Go Daddy Secure Certification Authority/serialNumber=07969287
 1 s:/C=US/ST=Arizona/L=Scottsdale/O=GoDaddy.com, Inc./OU=http://certificates.godaddy.com/repository/CN=Go Daddy Secure Certification Authority/serialNumber=07969287
   i:/C=US/O=The Go Daddy Group, Inc./OU=Go Daddy Class 2 Certification Authority
 2 s:/C=US/O=The Go Daddy Group, Inc./OU=Go Daddy Class 2 Certification Authority
   i:/L=ValiCert Validation Network/O=ValiCert, Inc./OU=ValiCert Class 2 Policy Validation Authority/CN=http://www.valicert.com//emailAddress=info@valicert.com
 3 s:/L=ValiCert Validation Network/O=ValiCert, Inc./OU=ValiCert Class 2 Policy Validation Authority/CN=http://www.valicert.com//emailAddress=info@valicert.com
   i:/L=ValiCert Validation Network/O=ValiCert, Inc./OU=ValiCert Class 2 Policy Validation Authority/CN=http://www.valicert.com//emailAddress=info@valicert.com
---

Conclusion

In this single case study, we see that HTTPS pages are full of missed optimizations. All the rules I listed in the beginning of the post are pretty basic, but nobody knows to look for them. I highly recommend examining your website’s SSL packet traces to look for these common mistakes. At least try taking advantage of existing automated tools to do some basic verification of your SSL configuration.

Thanks to my colleague Ryan Sleevi for reading through a draft of this post to make sure I didn’t have any glaring inaccuracies.

Throttling Subresources Before First Paint

In my last post about resource prioritization, I mentioned that WebKit actually holds back from issuing subresources that can’t block the parser before first paint. Well, that’s true, except for the Chromium port, because recently we decided to disable that. You may be wondering, why would we do that?

Well, we’re doing it because the rendering engine is not the best place to be making resource scheduling decisions. The browser has access to more information than the renderer does. For example, if the origin server supports SPDY, then it’s better to simply issue all the requests. Or maybe the tab is in the background, so we should be de-prioritizing or outright suppressing the tab’s resource requests in order to prevent contention. Or maybe, given the amount of contention, we should just start the hostname resolution or preconnect a socket, rather than issuing a full resource request.

Therefore, we’re disabling WebKit’s various resource loading throttles in order to get all the resource requests to the browser and have it handle resource scheduling. We haven’t re-implemented WebKit’s mechanisms browser-side yet, so currently we have a situation where we aren’t throttling subresources before first paint. So if you’re running a Chromium build after r157740 (currently, only Google Chrome dev and canary channels), then your browser isn’t throttling subresources before first paint. It’s interesting to see what effects this has on page load time.

For example, for www.gap.com, we can compare page loads on Chrome 22 (stable) vs Chrome 24 (dev) using WebPageTest. In this case, we see that the onload and speed index are both worse in Chrome 24. The speed index is clearly worse because the first paint occurs way earlier for Chrome 22 than for Chrome 24. Looking a bit more closely at the waterfalls, that seems to be due to globalOptimized.js taking longer to download on Chrome 24 than on Chrome 22. Poking into the code, we see the document is synchronously loading that script, thereby blocking parsing and slowing down both the first paint and the DOMContentLoaded event. The DOMContentLoaded event is key here, since it appears to trigger a number of other resource requests, so slowing down the DOMContentLoaded event also slows down onload. Examining the connection view, we see that the reason globalOptimized.js (264.4 KB) takes longer to download is increased bandwidth contention from image resources.

gap.com waterfall, connection, bandwidth for Chrome 22
gap.com waterfall, connection, bandwidth for Chrome 24

Examining another example (cvs.com), Chrome 22 again has better first paint / speed index scores, but Chrome 24 has a shorter onload time. Diving into the waterfalls again, we can see that the reason Chrome 22 again has shorter first paint times is that it gets the stylesheets sooner, and stylesheets block first paint in order to prevent FOUC. What’s interesting about this case is that there’s no real bandwidth contention this time, since the last 3 stylesheets only add up to around 7KB. The contention is actually on the connections per host (the limit is 6): the images are requested first, and until some of them complete, the requests for the stylesheets (higher priority resources) cannot begin. However, unlike the gap.com case, there isn’t a period of low bandwidth utilization due to waiting for resources to be requested during the DOMContentLoaded event, so issuing the subresource requests earlier results in overall better bandwidth utilization, and thus reaching onload sooner.

cvs.com waterfall, connection, bandwidth for Chrome 22
cvs.com waterfall, connection, bandwidth for Chrome 24

Yeah, there are cases where this change worsens the user experience, and cases where it actually improves it. In the real world though, what does it usually do? Pat Meenan ran a test of Chrome stable vs Chrome canary for us a while back to check it out on a bunch of websites, and in aggregate, we saw a minor improvement in onload times with the new behavior, but a hit on the speed index and a significant hit on first paint time. Therefore, we’re calling this a regression and will either fix or revert the change by the time we hit code complete for Chrome 24.

Update: Thanks to Steve for editorial suggestions and teaching me how to compare tests on WebPageTest :)

Resource Prioritization in Chromium

Resource prioritization is a difficult problem for browsers, because they don’t fully understand the nature of the resources in the page, so they have to rely on heuristics. It seems like the main document is probably the most important resource, so it’d be good to prioritize that highly. Scripts block parsing, and stylesheets block rendering. Both can lead to discovering other resources, so it’s probably a good idea to prioritize them reasonably highly too. Then you have other resources like media and async XHRs and images. For the exact algorithm Chromium uses, you can refer to the code. It’s noticeably suboptimal with regard to sync vs async resources, and there is probably more room for improvement. Indeed, we’re kicking around some ideas we hope to play around with in the near future.

Note that it’s already difficult for browsers to characterize resource priority appropriately, and the techniques that web developers have to employ to get good cross browser performance make the situation worse. For example, Steve and Guy have great recommendations on various techniques to load scripts or stylesheets without blocking. The problem, from a resource prioritization perspective, is these techniques defeat the attempts of the browser to understand the resources and prioritize them accordingly. The XHR based techniques will all get a lower priority. The JS based techniques for adding a script tag defeat browser attempts to speculatively parse when blocked. The script in iframe technique gives the script the priority of a subframe (high). Really, what web devs probably want to do is declare the script/stylesheet resource as async.
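To make that last point concrete, here is a minimal contrast between the declarative form and a script-inserted equivalent (file names are placeholders). The declarative form is visible to the speculative parser even while the main parser is blocked; the script-inserted form is not, and it is harder for the browser to prioritize accurately.

```typescript
// Declarative markup the preload scanner can discover even while parsing is
// blocked (shown as a comment since it's HTML, not script):
//   <script src="app.js" async></script>
//
// Script-inserted equivalent: invisible to the speculative parser until this
// code actually runs.
const script = document.createElement('script');
script.src = 'app.js';
script.async = true;
document.head.appendChild(script);
```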

Now, how important is resource prioritization? Well, it depends, primarily on the number of domains content is hosted on and the number of resources. That’s because the main places we use resource prioritization are in our host resolution and connection pool priority queues. Chromium unfortunately caps concurrent DNS lookups at 6 due to the prevalence of crappy home routers. Our connection pools cap concurrent connections per host at 6, per proxy at 32, and in total at 256. Once you’ve got a TCP connection though, application level prioritization no longer plays a role, and the existing TCP connections will interleave data “fairly”. What this also implies is that for truly low priority resources, you may actually want to starve them. You want to throttle them to reduce network contention with higher priority resources, although when doing so you run the risk of underutilizing the pipe.
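Here is a toy sketch of the kind of connection pool priority queue described above: requests wait in priority order until one of the (at most 6) per-host connections frees up. The numbers and names are illustrative, not Chromium’s actual implementation.

```typescript
enum Priority { LOWEST = 0, LOW, MEDIUM, HIGHEST }

interface PendingRequest {
  url: string;
  priority: Priority;
  start: () => void;   // hands the request to an actual connection
}

class HostConnectionPool {
  private static readonly MAX_SOCKETS_PER_HOST = 6;
  private inUse = 0;
  private pending: PendingRequest[] = [];

  request(req: PendingRequest): void {
    if (this.inUse < HostConnectionPool.MAX_SOCKETS_PER_HOST) {
      this.inUse++;
      req.start();
      return;
    }
    // No socket available: queue, highest priority first. Note that once a
    // request is on the wire, priority no longer matters to TCP.
    this.pending.push(req);
    this.pending.sort((a, b) => b.priority - a.priority);
  }

  onConnectionReleased(): void {
    const next = this.pending.shift();
    if (next) {
      next.start();      // reuse the freed connection for the next request
    } else {
      this.inUse--;
    }
  }
}
```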

It’s sort of silly that the browser has to throttle itself to prevent network contention. What you really want is to send all the requests to the server tagged with priorities and let the server respond in the appropriately prioritized order. Turns out that some browsers already support this. Now, what happens if the browser and website both support SPDY? Well, then resource prioritization actually has significant effects, and higher priority resources may crowd out lower priority resources on the pipe (which probably is generally, but not always, good for page load time), depending on the server implementation. So if you’re utilizing one of the aforementioned loading techniques to asynchronously load resources, then you may incur performance hits due to suboptimal prioritization or lack of ability for the rendering engine to speculatively issue a request for said resource (when parsing is blocked). Thus, if HTTP/2.0 adopts SPDY features like multiplexing & prioritization, then it’ll become more important for web content to use appropriate markup to allow browsers to figure out appropriate resource prioritization.

Time to Load localStorage Into Memory

For all the reasons that Taras Glek lists in https://blog.mozilla.org/tglek/2012/02/22/psa-dom-local-storage-considered-harmful/, I’ve been very skeptical of using localStorage. Synchronous APIs that do I/O are the suck as far as I’m concerned. I was curious about how much of an effect this had in practice, so I added Chromium histograms for it. This tracks the time that it takes to load localStorage from the persistent disk store, which Chromium does on the first access, after which subsequent accesses *should* be a simple memory access (or an IPC roundtrip plus a memory access). There’s potentially room for optimization here, where we could speculatively prime the in-memory data structure before the first access. I’m skeptical that’d have much impact though.
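If you want to eyeball this effect on your own site, a rough way to approximate the first-access cost from the page itself is something like the snippet below. This measures from the renderer, so it includes IPC overhead and is not the same as the internal histogram described above; it’s just an illustration.

```typescript
// Time the first localStorage access (which is when Chromium has to prime the
// in-memory copy from disk), then time a second access for comparison.
function timeAccess(label: string): void {
  const start = performance.now();
  localStorage.getItem('probe-key');   // the key doesn't need to exist
  console.log(`${label}: ${(performance.now() - start).toFixed(2)} ms`);
}

timeAccess('first access (primes the store)');
timeAccess('second access (in-memory / IPC only)');
```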

I only have data from a day’s worth of Google Chrome dev channel weekend traffic on desktop Chrome. Mobile may well be a different beast, and I’ll be curious to see the data when I get a hold of it. In any case, here’s what I see:

Time in ms to prime localStorage from disk (Win/Mac/Linux) by percentile:
50th: 0/0/0
75th: 2/0/0
90th: 40/17/17
95th: 160/57/160
99th: 1200/890/1200

This data is very much subject to interpretation. My read of it is, it’s actually not so bad. It’d be interesting to do more slicing and dicing of data (distribution per user, distribution based on localStorage size, yada yada). I used to diss localStorage a lot before, but after seeing this data, I’m less concerned about its effect on performance, at least on desktop. I still think it’s probably a bad idea on mobile, but I’ll reserve judgment until I get data for Chrome on Android and iOS.

Configuring SSL – I Have No Idea What I’m Doing

When I first set up this server, I went to StartSSL to get a certificate. Not having done this ever before, I made a number of errors. First, I had StartSSL generate my private key for me. Probably a bad idea, I hope they don’t record that :P Second, I had them generate a 4096 bit key rather than a 2048 bit key. I had figured, the bigger the better, right? Well, in load testing this wimpy micro EC2 server, I found that the majority of the CPU usage is in nginx, and I have to imagine that it’s in the SSL handshake. Oops. I should have read this Stack Overflow thread first I guess. I went to revoke my certificate today, but apparently they charge $25 per revocation. Oh well, I don’t expect much traffic anyway, and I’ll just change it when my certificate expires in a year. Lesson learned.