Peering into CDN and Cloud Availability Problems
by Pete Mastinon
You rely on a cloud or a content delivery network (CDN) to provide uptime, all the time, to ensure that users can access your content from wherever they are. But no matter how reliable your CDN is, it’s only part of the story. At any given time, a variety of incidents from peering issues to ISP outages can interrupt or halt content delivery for end users.
A Day in the Life of the Internet
At Cedexis, we developed Real end-User Monitoring (RUM) to measure platform performance from the perspective of an end user. Using RUM, we can look at what a typical day of availability looks like from the standpoint of a billion+ measurements (from end users around the world). What we found was that from the perspective of end users, the Internet can be messy.
How messy can it be? We’ll use heat maps to look at a single incident that began at 8:51am on April 15, 2014.
What’s in an Incident?
Before we look at heat maps, it’s worth going over how we define an incident. For our purposes, an incident is defined as a reduced ability to reach a platform from a set of ISPs, where a platform is defined as a CDN, cloud service, or datacenter.
Our methodology for identifying an incident is straightforward: if 5 consecutive one-minute averages are significantly under the 6-hour average, it counts as an event. Using this approach, Cedexis identifies between 200 and 300 events every month across 100+ platform providers. These events are impactful enough to cause video to buffer or pages to fail to load.
It is important to understand that the events we are talking about here are not the major outages that get tons of press. Those well-publicized outages are known for their financial impact in lost business and SLA violations. The outages we are looking at here are less well understood; we refer to them as micro-outages. They contribute significantly to the latency of the Internet worldwide, and every vendor that we measure has them. Micro-outages can be caused by peering issues, Anycast problems, BGP routing errors, hardware failures, caching server capacity problems, or a myriad of other failure types that any complex platform like a cloud or a content delivery network might have.
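To make the detection rule concrete, here is a minimal Python sketch of it, not the actual Cedexis implementation. The function name find_events, the use of a trailing 360-minute window as the 6-hour average, and the drop parameter (one possible reading of "significantly under", here an absolute drop of 10 points) are all assumptions made for illustration.

```python
from statistics import mean


def find_events(minute_availability, window=360, run_length=5, drop=0.10):
    """Return (start, end) index pairs of sustained availability drops.

    minute_availability: one-minute availability averages in [0, 1], oldest first.
    window: trailing minutes used as the baseline (360 minutes = 6 hours).
    run_length: consecutive low minutes required to count as an event.
    drop: ASSUMPTION standing in for "significantly under": a minute is low
          when it falls more than `drop` below the trailing baseline.
    """
    events = []
    run_start = None
    for i in range(window, len(minute_availability)):
        baseline = mean(minute_availability[i - window:i])
        low = minute_availability[i] < baseline - drop
        if low and run_start is None:
            run_start = i  # a possible event begins here
        elif not low and run_start is not None:
            if i - run_start >= run_length:
                events.append((run_start, i - 1))
            run_start = None
    # Close a run that is still open when the series ends.
    if run_start is not None and len(minute_availability) - run_start >= run_length:
        events.append((run_start, len(minute_availability) - 1))
    return events


# Hypothetical series: 6 healthy hours, a 10-minute dip to 70%, then recovery.
series = [1.0] * 360 + [0.7] * 10 + [1.0] * 20
events = find_events(series)
```

With this made-up series, the ten low minutes form a single event, while a dip lasting fewer than five minutes would be ignored. A production version would likely need a statistical definition of "significantly" and a baseline that excludes the incident itself.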
Let’s take a look at an incident identified by Cedexis that affected availability for a major CDN. It started at 8:51am on April 15 and lasted about 10 minutes.
One Morning in April
Now that we’ve laid out the methodology for identifying an availability incident, we’ll take a look at what we found. For the heat maps below, we’ve used the following legend:
• Green – Approximately 100% availability
• Yellow – Approximately 75% availability
• Red – Less than 40% availability
Europe at 75% availability
Based on the heat map, we can see that starting around 8:51am, availability for the CDN started to be affected, going quickly from approximately 100% across the globe to around 75% in Europe, with much milder availability problems in Africa. Other continents remained largely unaffected.
To get a more detailed look at where the problem is, we’ll drill down into availability in Europe.
Here, we can see that several European countries experienced problems reaching the CDN. France and Spain were the most affected, and only Poland experienced little or no problem reaching the CDN during the incident. Though availability was around 75% from Spain, the problem in France was severe, with less than 40% availability for several minutes.
To get even more detail on where availability was most affected, we’ll drill down to ISP level.
ISPs experiencing outage in Europe on April 15th
Here, we can see that availability was severely impacted for several ISPs throughout Europe. In Germany, availability was very good overall, but one ASN for Telefonica Germany had CDN availability drop below 40% for the duration of the incident. Another ASN for the same company was less affected.
The CDN’s availability dropped to less than 40% for many ISPs throughout France. Some, like Colt and Completel, recovered fairly quickly. Others remained in the red for the duration of the incident. Free Sas continued to experience availability problems reaching the CDN for several more minutes after other networks had recovered to 100% availability.
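The drill-down from continent to country to ISP amounts to regrouping the same measurements at a finer granularity. The Python sketch below illustrates the idea under stated assumptions: the record fields (continent, country, isp, ok), the helper availability_by, and the sample values are hypothetical and do not reflect the real RUM data format.

```python
from collections import defaultdict


def availability_by(measurements, *keys):
    """Return {group: percent of successful measurements}, grouped by `keys`."""
    totals = defaultdict(lambda: [0, 0])  # group -> [successes, attempts]
    for m in measurements:
        group = tuple(m[k] for k in keys)
        totals[group][0] += 1 if m["ok"] else 0
        totals[group][1] += 1
    return {group: 100.0 * ok / n for group, (ok, n) in totals.items()}


# Hypothetical sample records; field names and values are illustrative only.
sample = [
    {"continent": "EU", "country": "FR", "isp": "ISP A", "ok": False},
    {"continent": "EU", "country": "FR", "isp": "ISP A", "ok": False},
    {"continent": "EU", "country": "FR", "isp": "ISP B", "ok": True},
    {"continent": "EU", "country": "PL", "isp": "ISP C", "ok": True},
]

by_country = availability_by(sample, "continent", "country")
by_isp = availability_by(sample, "continent", "country", "isp")
```

In this toy data, the country-level figure for France hides the fact that one ISP is failing completely while another is fine, which is the same pattern the ISP-level heat map brings out.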
End User Perception is Reality
While RUM provides great information about how end users experience CDN availability, showing us where and when users experience problems, it won’t necessarily tell us why there was a problem. As many of us have experienced as end users, when we have problems reaching sites or streaming content, we often don’t know why: it could be the site, the CDN, the ISP, the device, or any point between the content and our device.
From the end user’s perspective, availability problems are frustrating, and those frustrations can hurt the reputation, brand, and revenue of organizations that depend upon CDNs to deliver content quickly and reliably to those users. According to an article on TechCrunch, a study found that only 79% of mobile users would retry a mobile app if it failed the first time, and only 16% would try it more than twice.
Understanding how users experience availability is vital to ensuring they return. Real end-User Measurements can give you insight into what your end users experience. You can read about how this works on our website.
In the next couple of days I will publish more of these heat maps of micro-outages to illustrate the differences (and similarities). I hope you find this interesting!