
Lessons Learned from Recent Major Outages

By admin

Aug 1, 2022




In 1988, one broken power line kicked off a series of events that cut off phone service for more than two weeks to over 50,000 Chicago-area customers, including businesses, hospitals, consumers, and Chicago’s O’Hare and Midway airports. At the time, that event, the Hinsdale Central Office Fire, was called the greatest telecommunications disaster ever.
Yet even the impact of the largest pre-Internet/cloud event ever does not compare to what happens on a regular basis these days with cloud outages.
The nature of today’s more interconnected business world makes cloud infrastructure and service disruptions more damaging. In the past, an outage was typically restricted to a small geographical area, and there were relatively easy ways to minimize the impact. For example, a cable cut would disrupt service only to those on that one circuit. Many companies would routinely protect themselves by using services from two providers, such as a leased T1 line from one and an ISDN line from another. If the primary line was down due to a cable cut, a site could still run core traffic over the lower-speed link until service was restored.
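The dual-provider approach described above can be sketched in a few lines of Python. This is only an illustration of the idea, not any vendor's implementation: the Link class and pick_link function are hypothetical names, and the speeds are the nominal rates for a T1 (1.544 Mbps) and a basic-rate ISDN line (128 kbps).

```python
from dataclasses import dataclass


@dataclass
class Link:
    """A hypothetical WAN link from one provider."""
    name: str
    mbps: float
    up: bool = True


def pick_link(links):
    """Return the fastest link that is currently up, or None if all are down."""
    live = [link for link in links if link.up]
    return max(live, key=lambda link: link.mbps) if live else None


t1 = Link("T1 (provider A)", 1.544)
isdn = Link("ISDN (provider B)", 0.128)
links = [t1, isdn]

primary = pick_link(links)   # the T1 while both links are healthy
t1.up = False                # simulate a cable cut on the primary circuit
fallback = pick_link(links)  # core traffic moves to the slower ISDN link
```

The point of the sketch is that the failure stays local: losing one circuit leaves the other provider's link untouched.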
Putting an Outage’s Impact into Perspective
Examples of major outages over the past year include:
Cloudflare, June 2022
The provider suffered a roughly one-hour outage impacting many companies and sites, including Discord, Shopify, Fitbit, and Peloton. The outage was caused by a network configuration change that disrupted traffic in 19 of Cloudflare’s sites.
Microsoft Azure and M365 Online, June 2022
East Coast companies that accessed services via Microsoft’s Virginia data center suffered a 12-hour outage. The cause of the outage, according to Microsoft, was “an unplanned power oscillation in one of our data centers” … “Components of our redundant power system created unexpected electrical transients, which resulted in the Air Handling Units (AHUs) detecting a potential fault, and therefore shutting themselves down pending a manual reset.” Customers with always-available or zone-redundant services in that region were not impacted.
Google Cloud, March 2022
Several sites and services, including Spotify and Discord, experienced an outage of more than two hours. The source of the problem was a change to the Traffic Director code that processes configuration data. The change assumed that a migration of the configuration data format had been fully completed; in fact, it had not.
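To illustrate the class of failure described above, here is a minimal Python sketch of a config loader that checks which data format a record is in rather than assuming a migration has finished. It is not Google's code: the load_config function, the format_version field, and both record layouts are hypothetical and exist only for illustration.

```python
def load_config(record: dict) -> dict:
    """Parse a config record in either the legacy or the migrated format.

    Hypothetical layouts, for illustration only:
      version 1 (legacy):   {"backend_list": "a,b"}
      version 2 (migrated): {"format_version": 2, "backends": ["a", "b"]}
    """
    # Records written before the migration carry no version field.
    version = record.get("format_version", 1)
    if version == 2:
        return {"backends": list(record["backends"])}
    if version == 1:
        # Legacy records are still accepted until the migration is verified.
        names = [b for b in record["backend_list"].split(",") if b]
        return {"backends": names}
    # Fail loudly on anything unexpected instead of guessing.
    raise ValueError(f"unknown config format version: {version}")


migrated = load_config({"format_version": 2, "backends": ["a", "b"]})
legacy = load_config({"backend_list": "a,b"})
```

Code that handled only the version 2 branch would work in testing against migrated data and fail on any record the migration had not yet reached.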
IBM Cloud, January 2022
Users worldwide experienced five hours of problems with provisioning and other resource management actions. The source of the problem was not publicly identified.
Amazon Web Services, December 2021
Amazon had three outages in the month. The smallest was due to a power outage at its Northern Virginia data center. Another outage, caused by malfunctioning network devices, knocked Amazon Ring and Roomba vacuums offline. And a five-hour outage in early December was due to a glitch in automated software that led to “unexpected behavior” that then “overwhelmed” AWS networking devices and hit computer systems on the East Coast.
Google Cloud, November 2021
A two-hour disruption took down Home Depot, Snapchat, Etsy, Discord, Spotify, and many more businesses. The outage was caused by a configuration change to Google Cloud’s load-balancing system.
Facebook, WhatsApp, Instagram, October 2021
A six-hour outage impacted not only the main sites (Facebook, WhatsApp, and Instagram) but also any sites and applications that rely on Facebook for logins. The outage was due to faulty configuration changes (related to Border Gateway Protocol) on the backbone routers.
CDNs Come into Focus
The increased use of content delivery networks (CDNs) to improve site performance and the user experience makes them as important as the underlying cloud services used by many enterprises. Two recent outages show just how much damage can be done when these services have problems.
In July 2021, Akamai had a roughly one-hour disruption impacting many sites and services, including Fidelity, Charles Schwab, Vanguard, Ally Bank, UPS, Delta Air Lines, Airbnb, The Home Depot, Southwest Airlines, HBO Max, McDonald’s, Sony’s PlayStation Network, and more.
The outage, initially described only as a “severe disruption,” was later attributed to a software configuration update that triggered a bug in the DNS system. The problem resulted in a global outage for as many as 29,000 websites. (The company handles roughly 15% to 30% of total web traffic.)
Similarly, in June 2021, CDN provider Fastly experienced a roughly hour-long outage impacting many sites, including eBay, PayPal, the Financial Times, Reddit, Twitch, The Guardian, The White House, and more. The cause of the problems was stated to be “a service configuration that triggered disruptions across our POPs globally.”
Other Sources of Disruption
Technical issues were not the only reminders over the past year of the fragility of our interconnected world. The roles of mother nature and human nature were also on display.
In January 2022, the Hunga Tonga-Hunga Ha’apai volcano eruption cut the island nation of Tonga off from the rest of the world. While the impact on international internet traffic was minimal, the event brought new attention to the fragility of the global undersea cable network that carries about 95% of intercontinental data traffic. The network is subject to disruptions from accidental cuts, malicious damage, and natural disasters such as hurricanes and tsunamis. Making matters worse, certain regions of the world, including the Hawaiian Islands and the Suez Canal, are major points where many cables converge and are also locations where natural disasters occur.
With respect to human nature, the war in Ukraine focused attention on potential disruptions to the core of the Internet: its DNS servers. Some seeking to isolate Russia asked the Internet Corporation for Assigned Names and Numbers (ICANN) to revoke specific country code top-level domains operated from within Russia, invalidate associated TLS/SSL certificates, and shut down Russian DNS root servers. ICANN noted that technically it could not do what was requested and added that it had no sanction-levying authority; its role is to ensure that the workings of the Internet are not politicized.
Takeaways: What Can Enterprises Do?
The numerous outages over the last year have caused wide-scale disruption to businesses worldwide. Most were caused by configuration changes made by the providers themselves, a handful were due to mother nature, and some were due to long-standing, familiar issues like power outages.
Unfortunately, enterprises have few options to minimize the impact on their business. In a few rare cases, such as in the June Azure outage, customers with premium services avoided problems.
The main thing enterprises can do to minimize the impact of outages is to better understand the work that providers and organizations like ICANN are doing to reduce outages in the future, and to put pressure on the providers to accelerate those efforts.


