How we brought HTTPS Everywhere to the cloud (part 1)

CloudFlare’s mission is to make HTTPS accessible for all our customers. It provides security for their websites, improved ranking on search engines, better performance with HTTP/2, and access to browser features such as geolocation that are being deprecated for plaintext HTTP. With Universal SSL or similar features, a simple button click can now enable encryption for a website.

Unfortunately, as described in a previous blog post, this is only half of the problem. To make sure that a page is secure and can’t be controlled or eavesdropped on by third parties, browsers must ensure that not only the page itself but also all of its dependencies are loaded via secure channels. Page elements that don’t fulfill this requirement are called mixed content and can result in the entire page being reported as insecure or even blocked outright, breaking it for the end user.
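
As a minimal illustration (the hostnames here are made up), a single sub-resource fetched over plain HTTP is enough to taint an otherwise secure page:

    <!-- page served from https://shop.example/ -->
    <script src="http://widgets.example/cart.js"></script>
    <img src="http://cdn.example/logo.png">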

What can we do about it?

When we conceived the Automatic HTTPS Rewrites project, we aimed to automatically reduce the amount of mixed content on customers’ web pages without breaking their websites, and without any delay noticeable to end users even though the page is rewritten on the fly.

A naive way to do this would be to simply rewrite http:// links to https://, or to let browsers do that with the Upgrade-Insecure-Requests directive.
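
For reference, that browser-side upgrade is switched on with a single response header:

    Content-Security-Policy: upgrade-insecure-requests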

Unfortunately, such an approach is fragile and unsafe unless you’re sure that:

  1. Every single HTTP sub-resource is also available via HTTPS.
  2. It’s available at the exact same domain and path after the protocol upgrade (more often than you might think, that’s not the case).

If either of these conditions is not met, you end up rewriting resources to non-existent URLs and breaking important page dependencies.

Thus we decided to take a look at the existing solutions.

How are these problems solved already?

Many security-aware people use the HTTPS Everywhere browser extension to avoid these kinds of issues. HTTPS Everywhere ships with a well-maintained database from the Electronic Frontier Foundation that holds all sorts of mappings for popular websites, rewriting HTTP versions of resources to HTTPS only when it can be done without breaking the page.

However, most users either aren’t aware of it or simply can’t use it, for example on mobile browsers.


CC BY 2.0 image by Jared Tarbell

So we decided to flip the model around. Instead of rewriting URLs in the browser, we would rewrite them inside the CloudFlare reverse proxy. By taking advantage of the existing database on the server side, website owners could turn the feature on and all their users would instantly benefit from HTTPS rewriting. The fact that it’s automatic is especially useful for websites with user-generated content, where it’s not trivial to find and fix every case of inserted insecure third-party content.

At our scale, we obviously couldn’t use the existing JavaScript rewriter. The performance challenges for a browser extension, which can find, match and cache rules lazily as the user opens websites, are very different from those of a CDN server that handles millions of requests per second. We usually don’t get a chance to rewrite pages before they hit the cache either, as many of them are generated dynamically on the origin server and go straight through us to the client.

That means, to take advantage of the database, we needed to learn how the existing implementation works and create our own in the form of a native library that could work without delays under our load. Let’s do the same here.

How does HTTPS Everywhere know what to rewrite?

HTTPS Everywhere rulesets can be found in the src/chrome/content/rules folder of the official repository. They are organized as XML files, each for its own set of hosts (with few exclusions). This allows users with basic technical skills to write and contribute missing rules to the database on their own.

Each ruleset is an XML file of the following structure (a simplified example modeled on the Google rules discussed throughout this post):

  <ruleset name="Google">
    <target host="google.*" />
    <target host="www.google.com" />
    <target host="google.com.ua" />

    <exclusion pattern="^http://(www.)?google.com/analytics/" />
    <exclusion pattern="^http://(www.)?google.com/imgres/" />

    <rule from="^http://google.com.(ua|au)/"
          to="https://www.google.com.$1/" />

    <test url="http://google.com.ua/" />
  </ruleset>

At the time of writing, the HTTPS Everywhere database consists of ~22K such rulesets covering ~113K domain wildcards, with ~32K rewrite rules and exclusions.

For performance reasons, we can’t keep all those ruleset XMLs in memory, walk through their nodes, check each wildcard, perform replacements based on a specific string format, and so on. All that work would introduce significant delays in page processing and increase memory consumption on our servers. That’s why we had to perform some compile-time tricks for each type of node to ensure that rewriting is smooth and fast for any user from the very first request.

Let’s walk through those nodes and see what can be done in each specific case.

Target domains

First of all, we get the target elements, which describe the domain wildcards that the current ruleset potentially covers:

  <target host="google.*" />
  <target host="*.google.com" />
  <target host="google.com.ua" />

If a wildcard is used, it can be either left-side or right-side.

A left-side wildcard like *.example.org covers any hostname that has example.org as a suffix, no matter how many subdomain levels it has.

A right-side wildcard like example.* covers only one level instead, so that hostnames with the same beginning but an unexpected extra domain level are not accidentally caught. For example, the Google ruleset, among others, uses the google.* wildcard, which should match google.com, google.ru, google.es, etc., but not google.mywebsite.com.

Note that a single host can be covered by several different rulesets, as wildcards can overlap, so the rewriter should be given the entire database in order to find the correct replacement. Still, matching the hostname instantly reduces all ~22,000 rulesets to only 3-5, which we can deal with much more easily.

Matching wildcards one by one at runtime is, of course, possible, but very inefficient with ~113K domain wildcards (and, as we noted above, one domain can match several rulesets, so we can’t even bail out early). We need to find a better way.


CC BY 2.0 image by vige

We use Ragel to build fast lexers in other pieces of our code. Ragel is a state machine compiler which takes grammars and actions described in its own syntax and generates source code in a given programming language as output. We decided to use it here too and wrote a script that generates a Ragel grammar from our set of wildcards. In turn, Ragel converts it into C code for a state machine that walks through the characters of a URL, matches hosts and invokes a custom handler for each ruleset it finds.

This leads us to another interesting problem. At the time of writing, among the ~113K domain wildcards we have ~4.7K with a left wildcard and fewer than 200 with a right wildcard. Left wildcards are expensive in state machines (including regular expressions), as they cause DFA space explosion during compilation, so Ragel got stuck for more than 10 minutes without producing any result, trying to analyze all the *. prefixes and merge all the possible states they can lead to, which results in a complex tree.

Instead, if we choose to look from the end of the host, we can significantly simplify the state tree (as only 200 wildcards need to be checked separately now instead of 4.7K), thus reducing compile time to less than 20 seconds.

Let’s take an oversimplified example to understand the difference. Say we have the following target wildcards (3 left wildcards against 1 right wildcard and 1 plain host):

  <target host="*.google.com" />
  <target host="*.google.co.uk" />
  <target host="*.google.es" />
  <target host="google.*" />
  <target host="google.com.ua" />

If we build a Ragel state machine directly from those:

%%{
    machine hosts;

    host_part = (alnum | [_-])+;

    main := (
        any+ '.google.com' |
        any+ '.google.co.uk' |
        any+ '.google.es' |
        'google.' host_part |
        'google.com.ua'
    );
}%%

We will get the following state graph:

You can see that the graph is already pretty complex: each starting character, even g, which is an explicit starting character of the 'google.' and 'google.com.ua' strings, still needs to simultaneously feed into the any+ matches. Even when you have already parsed the google. part of the host name, the input can still correctly match any of the given wildcards, whether as google.google.com, google.google.co.uk, google.google.es, google.tech or google.com.ua. This already blows up the complexity of the state machine, and we only took an oversimplified example with three left wildcards here.

However, if we simply reverse each rule in order to feed the string starting from the end:

%%{
    machine hosts;

    host_part = (alnum | [_-])+;

    main := (
        'moc.elgoog.' |
        'ku.oc.elgoog.' |
        'se.elgoog.' |
        host_part '.elgoog' |
        'au.moc.elgoog'
    );
}%%

we get a much simpler graph and, consequently, significantly reduced graph build and matching times:

So now, all we need to do is go through the host part of the URL, stop on the / right after it and start the machine backwards from that point. There is no need to waste time on in-memory string reversal, as Ragel provides the getkey instruction for custom data-access expressions, which we can use to access characters in reverse order once we have matched the ending slash.
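
Here is a minimal sketch of that idea (not the actual production code; the machine, variable and function names are illustrative). The machine’s position still advances forward, while getkey maps each position to the mirrored character of the host buffer:

// hosts_reversed.rl -- generate C code with: ragel -C hosts_reversed.rl
#include <string.h>

%%{
    machine hosts_reversed;

    # Position i of the machine corresponds to character
    # host[host_len - 1 - i], i.e. the host is read back to front.
    getkey host[host_len - 1 - (fpc - host)];

    # reversed form of the "*.google.com" wildcard
    main := 'moc.elgoog.' @{ matched = 1; };
}%%

%% write data;

static int matches_google_wildcard(const char *host, size_t host_len)
{
    int cs, matched = 0;
    const char *p = host, *pe = host + host_len;

    %% write init;
    %% write exec;

    return matched;
}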

Here is an animation of the full process:

After we’ve matched the host name and found potentially applicable rulesets, we need to ensure that we’re not rewriting URLs which are not available via HTTPS.

Exclusions

Exclusion elements serve exactly this goal. For example:

  <exclusion pattern="^http://(www.)?google.com/analytics/" />
  <exclusion pattern="^http://(www.)?google.com/imgres/" />

The rewriter needs to test a URL against all the exclusion patterns before applying any actual rules. Otherwise, paths that have issues or can’t be served over HTTPS at all would be incorrectly rewritten, potentially breaking the website.

We don’t care about captured groups, nor even about which particular regular expression matched, so as an extra optimization, instead of going through them one by one, we merge all the exclusion patterns of a ruleset into one regular expression that can be optimized internally by the regexp engine.

For example, for the exclusions above we can create the following regular expression, common parts of which can be merged internally by a regexp engine:

(^http://(www.)?google.com/analytics/)|(^http://(www.)?google.com/imgres/)

After that, in our action we just need to call pcre_exec without a match data destination: we don’t care about matched groups, only about the completion status. If the URL matches the regular expression, we bail out of this action, as the following rewrites shouldn’t be applied. Ragel will then automatically call the action of the next matched ruleset on its own until an applicable one is found.
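
A sketch of that check, assuming the merged pattern above has already been compiled with pcre_compile (this is not the production code, just the PCRE calls involved):

#include <pcre.h>
#include <string.h>

/* Returns 1 if the URL hits any exclusion in the merged pattern. */
static int is_excluded(const pcre *merged_exclusions, const char *url)
{
    /* No output vector is passed: we only care whether the pattern
       matched at all, not where the capture groups are. */
    int rc = pcre_exec(merged_exclusions, NULL, url, (int)strlen(url),
                       0, 0, NULL, 0);
    return rc >= 0;
}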

Finally, once we have matched the host name and ensured that the URL is not covered by any exclusion pattern, we can move on to the actual rewrite rules.

Rewrite rules

These rules are presented as JavaScript regular expressions and replacement patterns. The rewriter matches the URL against each of these regular expressions as soon as a host matches and the URL is not an exclusion. For example (simplified):

  <rule from="^http://(?:www.)?google.com.(ua|au)/"
        to="https://www.google.com.$1/" />

As soon as a match is found, the replacement is performed and the search can be stopped. Note: while exclusions cover dangerous replacements, it’s entirely possible and valid for the URL to not match any of the actual rules; in that case it is simply left intact.

After the previous steps we are usually left with only a couple of rules, so unlike with exclusions, we don’t apply any clever merging techniques to them. It turned out to be easier to go through them one by one than to create a regexp engine specifically optimized for multi-regexp replacements.

However, we don’t want to waste time on regexp analysis and compilation on our edge servers. That would require extra time during initialization and extra memory for carrying the unnecessary text sources of the regular expressions around. PCRE allows regular expressions to be precompiled into its own format using pcre_compile. Then, we gather all these compiled regular expressions into one binary file and link it in using ld --format=binary, a neat option that tells the linker to attach any given binary file as a named data resource available to the application.
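
A sketch of how that looks in practice (the file and symbol names below are illustrative): the blob of precompiled patterns is turned into an object file, and the linker exposes its contents through generated symbols.

#include <stddef.h>

/* Build step (illustrative):
 *
 *     ld -r --format=binary -o compiled_rules.o compiled_rules.bin
 *
 * The linker derives the symbol names from the input file name: */
extern const unsigned char _binary_compiled_rules_bin_start[];
extern const unsigned char _binary_compiled_rules_bin_end[];

static size_t compiled_rules_size(void)
{
    return (size_t)(_binary_compiled_rules_bin_end -
                    _binary_compiled_rules_bin_start);
}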


CC BY 2.0 image by DaveBleasdale

The second part of the rule is the replacement pattern, which uses the simplest feature of JavaScript regex replacement, number-based groups. It has the form https://www.google.com.$1/, which means that the resulting string should be the concatenation of "https://www.google.com.", the matched group at position 1, and "/".

Once again, we don’t want to waste time at runtime on repetitive analysis, looking for dollar signs and converting string indexes to numbers. Instead, it’s more efficient to split this pattern at compile time into the static substrings { "https://www.google.com.", "/" } plus an array of group indexes that need to be inserted in between, in our case just { 1 }. Then, at runtime, we simply build the string by going through both arrays and concatenating the static substrings with the found matches.
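
A sketch of what such a precompiled replacement might look like (the type and names are illustrative; the ovector follows the usual PCRE layout, where group g spans ovector[2g]..ovector[2g+1]):

#include <stddef.h>
#include <string.h>

typedef struct {
    const char **literals; /* n + 1 static substrings     */
    const int   *groups;   /* n capture-group indexes     */
    size_t       n;        /* number of group insertions  */
} replacement_t;

/* "https://www.google.com.$1/" split at compile time: */
static const char *pieces[]  = { "https://www.google.com.", "/" };
static const int   indexes[] = { 1 };
static const replacement_t google_cctld = { pieces, indexes, 1 };

/* At runtime, alternate static substrings with the matched groups.
   `out` is assumed to be large enough for the resulting string. */
static void apply(const replacement_t *r, const char *subject,
                  const int *ovector, char *out)
{
    for (size_t i = 0; i < r->n; i++) {
        size_t piece_len = strlen(r->literals[i]);
        memcpy(out, r->literals[i], piece_len);
        out += piece_len;

        int g = r->groups[i];
        size_t group_len = (size_t)(ovector[2 * g + 1] - ovector[2 * g]);
        memcpy(out, subject + ovector[2 * g], group_len);
        out += group_len;
    }
    strcpy(out, r->literals[r->n]);
}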

Finally, after such a string is built, it’s inserted in place of the previous attribute value and sent to the client.

Wait, but what about testing?

Glad you asked.

The HTTPS Everywhere extension uses an automated checker that verifies the validity of rewritten URLs on any change to a ruleset. To make that possible, rulesets are required to contain special test elements that cover all the rewrite rules, for example:

  <test url="http://google.com.ua/" />

What we need to do on our side is to collect those test URLs, combine them with our own tests auto-generated from the wildcards, and run both the HTTPS Everywhere built-in JavaScript rewriter and our own side by side to ensure that we get the same results: URLs that should be left intact are left intact by our implementation, and URLs that are rewritten are rewritten identically.

Can we fix even more mixed content?

After all this was done and tested, we decided to look around for other potential sources of guaranteed rewrites to extend our database.

One such source is the HSTS preload list maintained by Google and used by all the major browsers. It allows website owners who want to ensure that their website is never loaded via http:// to submit their hosts (optionally together with subdomains) and thereby opt in to having any http:// references auto-rewritten to https:// by modern browsers before the request even hits the origin.

This means the origin guarantees that the HTTPS version will always be available and will serve exactly the same content as HTTP; otherwise, any resources referenced from it would simply break, as the browser won’t attempt to fall back to HTTP once the domain is in the list. A perfect match for another ruleset!

As we already have a working solution and this list doesn’t involve any complexities around regular expressions, we can download the JSON version of it directly from the Chromium source and, as part of the build process, convert it into the same XML ruleset format with wildcards and exclusions that our system already understands and handles.
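
Roughly, each entry of the preload JSON looks like this (simplified), and the conversion turns it into a trivial ruleset of the kind we already handle (the generated ruleset below is illustrative):

    { "name": "example.com", "include_subdomains": true, "mode": "force-https" }

becomes

    <ruleset name="example.com (HSTS preload)">
      <target host="example.com" />
      <target host="*.example.com" />
      <rule from="^http:" to="https:" />
    </ruleset>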

This way, both databases are merged and work together, rewriting even more URLs on customer websites without any major changes to the code.

That was quite a trip

It was… but it’s not really the end of the story. You see, in order to provide safe and fast rewrites for everyone, and after analyzing the alternatives, we decided to write a new streaming HTML5 parser, which became the core of this feature. We intend to use it for even more tasks in the future to ensure that we can improve the security and performance of our customers’ websites in even more ways.

However, it deserves a separate blog post, so stay tuned.

And remember – if you’re into web performance, security or just excited about the possibility of working on features that do not break millions of pages every second – we’re hiring!
