Measuring Multi-CDN Performance with Resource Timing Data

If you’re using RUM only to tell you how fast your page is loading, you’re getting just a portion of the story: a headline. Modern browsers widely support both the Navigation Timing API and the Resource Timing API, and the possibilities with this data are endless. With distributed teams focused solely on their respective products, resources like the CDN are seen as ancillary to the overall page load. However, improperly referenced objects can wreak havoc on your end users.

This post will focus on expanding your RUM solution to isolate and measure CDN-delivered content and provide a strategy for comparing individual CDN performance on provider-agnostic hostnames.


CDN Steering and Agnosticism

If you’re only using one CDN provider, then this post may not do much for you. However, if your site is running on multiple CDNs around the world, you may be interested in digging deeper into which CDNs are providing the most benefit to your users. Additionally, you may want to be more informed about your steering decisions.

First, let’s evaluate a few steering options:

Finite Hostnames – you benefit from the ability to granularly steer your users on a user-to-CDN performance basis. This is good if browser caching is not a concern to you, i.e., you have highly dynamic content.

DNS Round-Robin (aka, CDN Roulette) is crap. Don’t do it. If you have to, then you will want to know if your weighted records are having the desired effect. Also important, if you are using DNS Geo Records, are your users being steered to the right CDN or is your DNS provider making mistakes?

Dynamic Steering using RUM and CDN-agnostic hostnames is what all the cool kids are doing these days. Dynamic Steering platforms do their jobs well, but if you’re looking to show their benefit via your RUM instrumentation, you will need to be able to provide granular metrics on how CDNs are performing.

For any of the above methods, the ability to identify and report on CDN performance is necessary for making informed business decisions.
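
As a rough illustration of checking whether weighted records are having the desired effect: assuming each page view can be labelled with the CDN that served it (a way to get that label is covered later in this post), the observed share per CDN can be compared with the configured weights. The function name, labels, and weights below are made up for the example.

```javascript
// Compare the observed share of page views per CDN with the configured weights.
// observedLabels: one CDN label per page view, e.g. ["CDN-A", "CDN-B", ...]
// configuredWeights: hypothetical DNS weights, e.g. { "CDN-A": 50, "CDN-B": 50 }
function compareShares(observedLabels, configuredWeights) {
  var counts = {};
  observedLabels.forEach(function (label) {
    counts[label] = (counts[label] || 0) + 1;
  });

  var cdns = Object.keys(configuredWeights);
  var totalWeight = cdns.reduce(function (sum, cdn) {
    return sum + configuredWeights[cdn];
  }, 0);

  return cdns.map(function (cdn) {
    return {
      cdn: cdn,
      expected: configuredWeights[cdn] / totalWeight,
      // avoid dividing by zero when nothing has been observed yet
      observed: observedLabels.length ? (counts[cdn] || 0) / observedLabels.length : 0
    };
  });
}

// Made-up sample: four page views against a 50/50 split
var shares = compareShares(
  ["CDN-A", "CDN-A", "CDN-A", "CDN-B"],
  { "CDN-A": 50, "CDN-B": 50 }
);
console.log(shares);
```

A large gap between expected and observed is only a prompt to investigate; resolver caching and small sample sizes can skew the split on their own.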

A Note on Resource Timing

Resource Timing provides a ton of additional insight into the objects that were served during page load. Rather than diving too deeply, simply take a look at the Resource Timing Level 1 Model below and start thinking about the things that are important to you, your users, and how this data can be used for your benefit.

https://www.w3.org/TR/resource-timing-1/#processing-model

Of the above metrics, the ones that can be derived are more interesting and useful for measuring CDNs than the ones presented.

  • Useful Metrics
    • Browser Caching: Whether the object was served from the browser cache rather than fetched from the CDN. Note that transferSize is also 0 for cross-origin objects served without Timing-Allow-Origin.
      resource.transferSize === 0
    • Time to First Byte: On content fetch, the time taken to receive the first byte of the server response.
      resource.responseStart - resource.startTime
    • Secure Connection Time: Provides TLS handshake timing information.
      (resource.secureConnectionStart > 0) ? (resource.connectEnd - resource.secureConnectionStart) : 0
    • DNS Lookup Time: Time for the browser to look up a cold DNS record.
      resource.domainLookupEnd - resource.domainLookupStart
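
The derived metrics above can be collected into one helper. This is a minimal sketch: the input field names come from the Resource Timing API, while the function name, the output keys, and the sample values are illustrative only. Keep in mind that cross-origin objects served without Timing-Allow-Origin report most of these fields as 0, so the results are only meaningful once that header is in place.

```javascript
// Derive CDN-relevant metrics from one PerformanceResourceTiming entry.
function deriveMetrics(resource) {
  return {
    name: resource.name,
    // transferSize is 0 when the object came from the browser cache
    browserCached: resource.transferSize === 0,
    // time from the start of the fetch to the first byte of the response
    ttfb: resource.responseStart - resource.startTime,
    // TLS handshake time; secureConnectionStart is 0 for non-TLS or reused connections
    tls: resource.secureConnectionStart > 0
      ? resource.connectEnd - resource.secureConnectionStart
      : 0,
    // DNS lookup time; 0 when the record was already cached
    dns: resource.domainLookupEnd - resource.domainLookupStart
  };
}

// Hand-written sample entry (milliseconds, values made up for illustration)
var sample = {
  name: "https://cdn.traffiq.io/logo.gif",
  startTime: 100,
  domainLookupStart: 105,
  domainLookupEnd: 125,
  secureConnectionStart: 140,
  connectEnd: 180,
  responseStart: 230,
  transferSize: 342
};
var metrics = deriveMetrics(sample);
console.log(metrics);
```

In a browser, the same helper could be mapped over performance.getEntriesByType("resource").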

The Power of X-Headers

Accessing Resource Timing information is great, but how it’s correlated to the respective CDN is what we’re really shooting for. After all, there’s no benefit to pulling granular data if you can’t turn it into useful information.

Headers already serve a very useful purpose by providing instructions for the browser and its respective RFC implementations on how to handle content despite a lack of uniformity across browsers. In addition to the RFC-defined headers, there exists a type of header that can be customized: the X-Header.

Despite some misconceptions, X-Headers are not necessarily deprecated; they are merely no longer acceptable for new protocols. The X- prefix has historically been used to distinguish non-standard headers from permanent ones, although some have crept their way into widely used protocols as de facto standard headers, such as “X-Sender” in email. X-Headers can be used for a variety of other things, too. For example, a CDN may use an X-Cache header to display a resource’s cache status for debugging and troubleshooting.

Almost all CDNs and proxies provide the ability to customize inbound and/or outbound response headers. With this, identifying which CDN served a request becomes fairly trivial. The result is a rule in your configuration that looks like this Varnish rule:

sub vcl_deliver {
  # resp is only available in vcl_deliver; vcl_fetch operates on beresp
  set resp.http.X-CDN = "CDN-A";
}

Exposing X-Headers with CORS

X-Headers alone are incredibly useful for troubleshooting, but now we’re stuck with a resource with additional headers and a browser that limits what we can access. CORS to the rescue.

Cross-Origin Resource Sharing, or CORS, is a specification designed to enable resources and their data to be shared across “origins” or hostnames. An example of this is an XMLHttpRequest executed by JavaScript on traffiq.com attempting to access a resource from cdn.traffiq.com. Since they are not the same “origin”, and depending on the resource type, most modern browsers will restrict what information is accessible, if any. With this in mind, we now see that simply using AJAX to access X-Headers requires just a bit more work.

To access more information, you simply need to add a few more headers. First, we’ll add Access-Control-Allow-Origin, which basically tells the calling User-Agent which origins may access that resource. Second, we’ll add Access-Control-Expose-Headers with a list of headers that we’re giving the User-Agent access to. Finally, for the sake of completeness, we’ll also add Timing-Allow-Origin. Timing-Allow-Origin is actually defined by the Resource Timing specification rather than CORS, but it is functionally similar to Access-Control-Allow-Origin: it tells the browser which origins may read the detailed timing attributes for the resource.

The same process for adding the X-CDN header above can be followed here. Once added, your response headers should resemble the following:

HTTP/1.1 200 OK
Date: Wed, 26 Apr 2017 21:47:45 GMT
Last-Modified: Fri, 10 Mar 2017 06:06:49 GMT
Content-Length: 42
Content-Type: image/gif
Access-Control-Allow-Origin: *
Timing-Allow-Origin: *
Access-Control-Expose-Headers: X-CDN
X-CDN: CDN-A
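
To verify a configuration like the one above, a small check can flag which of these headers are missing. This is only a sketch: the function name and the plain-object input format are assumptions, and it treats a wildcard in Access-Control-Expose-Headers as sufficient, which holds for requests made without credentials.

```javascript
// Report which of the headers discussed above are missing or do not cover X-CDN.
// `headers` is a plain object of header name -> value (names matched case-insensitively).
function checkCdnHeaders(headers) {
  var lower = {};
  Object.keys(headers).forEach(function (name) {
    lower[name.toLowerCase()] = String(headers[name]);
  });

  var problems = [];
  if (!lower["x-cdn"]) {
    problems.push("X-CDN is missing");
  }
  if (!lower["access-control-allow-origin"]) {
    problems.push("Access-Control-Allow-Origin is missing");
  }
  if (!lower["timing-allow-origin"]) {
    problems.push("Timing-Allow-Origin is missing");
  }

  // Access-Control-Expose-Headers is a comma-separated list of header names
  var exposed = (lower["access-control-expose-headers"] || "")
    .split(",")
    .map(function (name) { return name.trim().toLowerCase(); });
  if (exposed.indexOf("x-cdn") === -1 && exposed.indexOf("*") === -1) {
    problems.push("X-CDN is not listed in Access-Control-Expose-Headers");
  }
  return problems;
}

// The response headers from the example above
var problems = checkCdnHeaders({
  "Content-Type": "image/gif",
  "Access-Control-Allow-Origin": "*",
  "Timing-Allow-Origin": "*",
  "Access-Control-Expose-Headers": "X-CDN",
  "X-CDN": "CDN-A"
});
console.log(problems.length === 0 ? "headers look good" : problems);
```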

Bringing It All Together

Now we have some headers that identify a CDN and we’ve lifted restrictions on what can be done with them. But via the Resource Timing API, we won’t be able to access more than performance data, meaning those CDN-identifying headers aren’t really going to do much good. Or will they?

Let’s think about what we want: Resource Timing correlated to CDN.

Here’s a breakdown of how we’re going to accomplish this. First, we need to pull the resource timing for all of our objects. Then, we need to determine which objects are CDN objects; we can either use a regular expression or a finite list of hostnames. Finally, we’ll want to use an uncached CDN resource to determine which CDN served our page load. The JavaScript below is an example of how this can be achieved.

Sample Code for JavaScript Beacon

// helper function to extract hostname
function extractHostname(url) {
 var hostname;
 if (url.indexOf("://") > -1) {
   hostname = url.split('/')[2];
 } else {
   hostname = url.split('/')[0];
 }
 hostname = hostname.split(':')[0];

 return hostname;
}

// helper function to define which hostnames are CDN hostnames
function isCDN(hostname) {
  return hostname === "cdn.traffiq.io";
}

// get the resource timing objects and initialize our curated list
var resources = performance.getEntriesByType("resource");
var cdnResources = [];
for (var i = 0; i < resources.length; i++) {
  if (isCDN(extractHostname(resources[i].name))) {
    cdnResources.push(resources[i]);
  }
}

// pick a CDN object and extract headers (skip if no CDN objects were found)
if (cdnResources.length > 0) {
  var cdnObject = cdnResources[0].name;
  // alternatively, use a beacon
  // var cdnObject = "//cdn.traffiq.io/beacon.txt?unique=" + Date.now()/1000;
  var xhr = new XMLHttpRequest();
  xhr.open('GET', cdnObject, true);
  // the request is asynchronous, so headers are only readable once it completes
  xhr.onload = function () {
    var cdn = xhr.getResponseHeader("X-CDN");
    // report `cdn` together with the timings in cdnResources to your RUM collector
  };
  xhr.send();
}

A beacon object can be used in lieu of an object that has already been served. Resource Timing can tell us if an object is browser cached or not, but a beacon with a unique query string is definitive. The caveat here is that a beacon MUST BE a) small, b) cached at the CDN edge and c) unique enough to not be cached by the browser. Unique objects require additional setup, but the payoff is accuracy.
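
Once the X-CDN value is known, it can be combined with the timings collected in cdnResources before being sent to a RUM collector. The sketch below is one possible shape for that payload, not a prescribed format: the function names and payload keys are hypothetical, and the median TTFB is just one reasonable summary to report.

```javascript
// Median of a numeric array (returns null for an empty array)
function median(values) {
  if (values.length === 0) return null;
  var sorted = values.slice().sort(function (a, b) { return a - b; });
  var mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Build the payload for one page view: CDN label plus summary timings.
function buildCdnReport(cdn, cdnResources) {
  // only objects actually fetched over the network say anything about the CDN
  var fetched = cdnResources.filter(function (r) { return r.transferSize > 0; });
  return {
    cdn: cdn,
    objects: cdnResources.length,
    fetched: fetched.length,
    medianTtfb: median(fetched.map(function (r) {
      return r.responseStart - r.startTime;
    }))
  };
}

// Made-up entries: one browser-cached object and two fetched from the CDN
var report = buildCdnReport("CDN-A", [
  { startTime: 10, responseStart: 10, transferSize: 0 },
  { startTime: 20, responseStart: 100, transferSize: 500 },
  { startTime: 30, responseStart: 150, transferSize: 700 }
]);
console.log(JSON.stringify(report));

// In the browser this could be sent from the XHR callback, for example:
//   navigator.sendBeacon("/rum", JSON.stringify(report));
// where "/rum" stands in for your own collector endpoint.
```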

Conclusion

Integrating CDN identification into your current RUM implementation should be fairly straightforward, and the tips above will provide some clarity for better tracking and troubleshooting your multi-CDN strategy.

 

Zero to Anycast in 20 Days

Introduction

Inspired by my friend, Samir Jafferali, and his blog post, Build Your Own Anycast Network in 9 Steps, I decided to have a go at building my own anycast network. The following breakdown describes the steps I had to take, following guidance and best practices in some places and ignoring them in others. Additionally, you will find some easy copy-paste stuff to make those short-lived moments even shorter.

My experience is in CDN, DNS, Web Performance, and Development; HTTP, TCP, WebSockets, and a bunch of other stuff are my forte, and my Sys Admin experience is on the lighter side. With that in mind, on a scale from Utterly Painful to Slept Through It, I rate this project a solid That Wasn’t So Bad.

Chronology of Events

Samir mentioned in his guide the need to do some prep work. The bulk of the time was spent waiting for things to happen. During the waiting periods, system hardening and automation became a key focus. Additionally, there was a desire to build some custom software and to create a feedback loop. Here’s a timeline of that experience.

Day 1

Register with a Regional Internet Registry (RIR). Registration with RIPE is quick and easy. Nuff said.

Request to Acquire /24 & /48 from an LIR. Prager-IT e.U. [AT], based on Samir’s advice, was the best LIR. My experience with Stefan Prager was awesome. This portion of the transaction took the longest, mostly because of identity verification and acquisition of IP space. For identity verification, utility bills, government-issued identification, bank transfer verification, and verification of various email addresses associated with RIPE data were all required. Prager sent an ASN verification letter via International Post from Austria.

Day 1 – 3

Build a Network: Prager warned that there would be a minimum 10-day delay to get a /24. This was prime time for learning and setup. Per Samir and Prager’s suggestions, Vultr seemed to be the better choice. They have a decent presence globally, their web portal was super easy to work with, their APIs are pretty clean, and they allow BGP sessions in 3 really easy steps.

After 4 hours, start to finish, 3 North American Vultr nodes were running a CentOS 7 build with all the necessary updates and boot scripts. For the next 10 days, various tweaks had to be made to further automate. See below for a guide including what I did with Vultr.

Day 4

Server Hardening. Some nodes kept going down due to DDoS attempts on open ports, specifically ssh. It was apparent that script kiddies were looking for an in. After asking around, fail2ban seemed to be the right choice. It’s written in Python, so it’s easy to understand, and it’s clean and effective.

Why not just change ssh ports? These days were meant to be a learning experience. fail2ban could also come in handy for monitoring other logs and banning IPs for other services, nsd included (a default configuration for nsd already exists).

Day 5-8

Automation. If this system is going to serve its purpose, a few things have to be done to ensure successful boot up, teardown, graceful service recovery (especially with custom software), and synchronization.

tmux+cssh: Salt, Chef, Puppet, and Ansible are all great options. However, in the interest of time and past experience, tmux-cssh worked well. Synchronized shell with live output was more interesting than simply using mssh. tmux-cssh also looks cool in a Terminal Window.

systemctl: Part of this process involved enabling services, including custom software, to be brought up and down easily. That custom software also has external dependencies. systemd/systemctl are reliable standbys on RedHat/CentOS and allow for dependency chains. This is key for services that will rely on bird later.

update.sh: Again, rather than use push automation, tmux+cssh provides an interface for scripted updates. This includes pulling new patches, updating yum, updating system configs, etc.

Day 10

Fix Automation. Part of the boot script includes updating yum and installing some packages. In 10 days, a new version of fail2ban had been released. A new node in Sydney had become available, so a new node launch was attempted, but fail2ban was not running. As it turned out, the packagers failed to create directories for the fail2ban socket. A bug had been filed earlier that day and is fixed in version 0.10.

Day 11

Acquired /24 and /43. Prager sent the good news that a /24 and /43 (yes, more space) were ready. Unfortunately, the space sat idle until the route object could be created, which required ASN (origin) assignment and MNT (maintainers).

Day 12-18

Spurious Updates. With things moving slowly, this time was spent waiting, thinking, architecting, playing with ideas, writing software. Progress was made, but nothing pertinent to Anycast.

Day 15

Paperwork. From Day 2 to now, Prager had been sending paperwork. This included a Letter of Authorization, Lease Agreement, Verification Letter (for ASN), and other information crucial to the registration process. The day following my request, an ASN verification was posted. Once it was received, signed, and scanned, the process could continue moving forward, albeit slowly.

Day 19.0

Acquired an ASN. From verification letter to ASN sponsorship request, only 4 days elapsed. Prager’s ASN sponsorship and guidance throughout allowed the process to go very smoothly. He also spent some time guiding me through RIPE updates, made suggestions for what metadata to include or modify, provided pertinent links to shorten the amount of time otherwise spent hunting for them, etc. With an ASN assigned, route objects could be added and anycasting was well within reach.

NOTE: Prager’s sponsorship service is €200 as it is time consuming. Worth every penny.

Day 19.5

Establish BGP Session. This process was truly easy with Vultr. They only need your ASN, IP Prefix, and Letter of Authorization (for IP space). It’s as simple as creating a ticket. Given the amount of time spent thus far, expectation of a rapid turnaround was low. However, within minutes of filing a request, Vultr handed the process to their Network Administration team. They only needed a few more tidbits of information: description of use case, BGP session password (just make one up), and what routes were needed (no routes at all).

Started announcing. With pertinent information in hand from Vultr, setting up a dummy interface and announcing my IP were easy. This was a quick process; however, the routes needed to be whitelisted upstream.

Back to speaking in First Person …

Disappointed, I went to bed.

Day 20

Anycast! My eyes popped open at 4:50 AM. I rolled out of bed and went to my machine. To my amazement, everything was working. The rest of my day was spent playing with what I could do to bring down parts of my network and watch my traffic fail over to another host.

Conclusion

With my anycast project done, it’s time to move on to other things. Access to resources and knowledgeable people made this process ludicrously seamless. Using Samir’s blog as a guideline and adding my own tweaks along the way, I was able to get some stable servers in place, running both Apache and a custom GeoDNS server to proceed with some of my experiments. All in all, I’m happy and excited to move forward.


Recipe: Anycasting Vultr Nodes with CentOS 7

Ingredients:

• 2+ Vultr Nodes (CentOS 7, BGP Session already negotiated with Vultr)

• vim/emacs/whateva

• fail2ban

• bird

• net-tools

• bind-utils

• instance IP addresses

• an IP to anycast + ASN

• BGP Session password

Directions:
  1. Update yum and install the necessary tools …
    • yum update -y
      yum install vim fail2ban bird net-tools bind-utils -y
  2. Open up the firewall to allow DNS requests
    • firewall-cmd --permanent --zone=public --add-interface=dummy1
      firewall-cmd --permanent --zone=public --add-port=53/tcp --add-port=53/udp
      firewall-cmd --reload
  3. (Optional) Update fail2ban to protect yourself. You could also change your ssh port, but fail2ban could be used for DNS DDoS protection as well.
    • (Sample) /etc/fail2ban/jail.local
    • [DEFAULT]
      # Ban hosts for one hour:
      bantime = 3600
      
      # Override /etc/fail2ban/jail.d/00-firewalld.conf:
      banaction = iptables-multiport
      
      [sshd]
      enabled = true
      port = ssh
      logpath = %(sshd_log)s
      backend = %(sshd_backend)s
      
      
      [sshd-ddos]
      # This jail corresponds to the standard configuration in Fail2ban.
      # # The mail-whois action send a notification e-mail with a whois request
      # # in the body.
      enabled = true
      port = ssh
      logpath = %(sshd_log)s
      backend = %(sshd_backend)s
    • Restart fail2ban
    • systemctl restart fail2ban
    • (Examples) Check that fail2ban is running
    • # fail2ban-client status
      Status
      |- Number of jail: 2
      `- Jail list: sshd, sshd-ddos
    • # fail2ban-client status sshd
      Status for the jail: sshd
      |- Filter
      | |- Currently failed: 1
      | |- Total failed: 19451
      | `- Journal matches: _SYSTEMD_UNIT=sshd.service + _COMM=sshd
      `- Actions
       |- Currently banned: 5
       |- Total banned: 55
       `- Banned IP list: 27.72.65.19 218.87.109.150 186.61.164.149 221.186.116.210 116.31.116.5
  4. Create a dummy interface
    • ip link add dev dummy1 type dummy
      ip link set dummy1 up
      ip addr add dev dummy1 185.190.83.1/32
  5. Update bird config and restart
    • (Sample) /etc/bird.conf
    • router id 1.2.3.4;
      
      protocol static {
       route 185.190.83.0/24 via 1.2.3.4;
      }
      
      protocol bgp vultr {
       local as 43011;
       source address 1.2.3.4;
       import none;
       export all;
       graceful restart on;
       multihop 2;
       neighbor 169.254.169.254 as 64515;
       password "VULTUREPASSWORD";
      }
      
      protocol device {
       scan time 5;
      }
      
      protocol direct {
       interface "dummy*";
       import all;
      }
    • Restart bird (presumes it was previously enabled)
    • systemctl restart bird
    • (Example) Confirm bird is announcing
    • # birdc show proto all vultr
      BIRD 1.4.5 ready.
      name proto table state since info
      vultr BGP master up 19:27:00 Established
        Preference: 100
        Input filter: REJECT
        Output filter: ACCEPT
        Routes: 0 imported, 2 exported, 0 preferred
        Route change stats: received rejected filtered ignored accepted
          Import updates: 0 0 0 0 0
          Import withdraws: 0 0 --- 0 0
          Export updates: 2 0 0 --- 2
          Export withdraws: 0 --- --- --- 0
       BGP state: Established
          Neighbor address: 169.254.169.254
          Neighbor AS: 64515
          Neighbor ID: 104.156.236.170
          Neighbor caps: refresh restart-able AS4
          Session: external multihop AS4
          Source address: 1.2.3.4
          Hold timer: 167/240
          Keepalive timer: 15/80
  6. Start services…
    • NOTE: Apache did well in binding to the dummy interface, but my custom DNS server needed to be told explicitly to bind to and respond for my Anycast IP. This is also the case for nsd.
  7. Drink scotch.
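
Since the bird config in step 5 differs between nodes only by the instance IP, it can be rendered from a template when automating new node launches. This is a minimal sketch that assumes Node.js is available wherever the automation runs; the function name and the fields of the node object are hypothetical, and the output simply mirrors the sample /etc/bird.conf above.

```javascript
// Render a bird.conf like the sample in step 5 for one node.
// `node` and its field names are hypothetical; adapt them to your own inventory.
function renderBirdConf(node) {
  return [
    "router id " + node.instanceIp + ";",
    "",
    "protocol static {",
    " route " + node.prefix + " via " + node.instanceIp + ";",
    "}",
    "",
    "protocol bgp vultr {",
    " local as " + node.localAs + ";",
    " source address " + node.instanceIp + ";",
    " import none;",
    " export all;",
    " graceful restart on;",
    " multihop 2;",
    " neighbor " + node.neighborIp + " as " + node.neighborAs + ";",
    " password \"" + node.password + "\";",
    "}",
    "",
    "protocol device {",
    " scan time 5;",
    "}",
    "",
    "protocol direct {",
    " interface \"dummy*\";",
    " import all;",
    "}",
    ""
  ].join("\n");
}

// Values taken from the sample config above
var conf = renderBirdConf({
  instanceIp: "1.2.3.4",
  prefix: "185.190.83.0/24",
  localAs: 43011,
  neighborIp: "169.254.169.254",
  neighborAs: 64515,
  password: "VULTUREPASSWORD"
});
console.log(conf);
```

The rendered text would still need to be written to /etc/bird.conf on each node and followed by a bird restart, as in step 5.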