Cloud Reboot Causes Cold Sweat at Netflix
by Jason Verge on October 3, 2014
Another tale has emerged from the great server reboot of 2014 to apply a Xen security patch that affected major cloud providers, including Amazon Web Services and Rackspace. Netflix, an AWS customer, lost 218 database servers during the reboot but managed to stay online.
Last week, a known issue that affects Xen environments forced AWS, Rackspace, and other Xen users to reboot portions of their clouds to apply a Xen security patch. Customers were given short notice. Luckily, the reboot went smoothly for both providers. Now a major customer story has emerged: Netflix was concerned about its Cassandra database.
The company has over 2,700 production Cassandra nodes, of which 218 were rebooted. A total of 22 of those were on hardware that did not reboot successfully. However, Netflix’s automation detected the failed nodes and replaced them all with minimal human intervention.
Chaos Monkey, the company’s homegrown tool for testing resiliency, had already wreaked havoc on the system in past exercises. That did not mean, however, that the company was unconcerned.
“When we got the news about the emergency EC2 reboots, our jaws dropped,” said Christos Kalantzis, Netflix manager for cloud database engineering. “When we got the list of how many Cassandra nodes would be affected, I felt ill. Then I remembered all the Chaos Monkey exercises we’ve gone through. My reaction was, ‘Bring it on!’”
Netflix has a massive AWS infrastructure. The streaming movie provider has perfected the art of resiliency through its “Simian Army” tools.
Chaos Monkey is a resiliency tool that randomly disables virtual machine instances that are in production on the Amazon cloud. The goal is to engineer applications so they can tolerate random instance failures. Chaos Gorilla disables all Netflix infrastructure in an AWS Availability Zone.
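The difference between the two failure modes can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Netflix's tooling: the `Instance` type, the function names, the zone labels, and the six-node fleet are all hypothetical, and nothing here calls a real AWS API.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    # Hypothetical stand-in for a virtual machine; not a real AWS object.
    instance_id: str
    zone: str


def chaos_monkey(instances, rng):
    """Pick one instance at random to disable (single-instance failure)."""
    return {rng.choice(instances)}


def chaos_gorilla(instances, zone):
    """Disable every instance in one zone (whole-zone failure)."""
    return {i for i in instances if i.zone == zone}


def survives(instances, disabled, min_healthy):
    """True if enough instances remain healthy after the failure."""
    healthy = [i for i in instances if i not in disabled]
    return len(healthy) >= min_healthy


# Assumed fleet: six nodes spread evenly over three made-up zones.
zones = ["zone-a", "zone-b", "zone-c"] * 2
fleet = [Instance(f"node-{n}", zone) for n, zone in enumerate(zones)]

one_down = chaos_monkey(fleet, random.Random(0))
zone_down = chaos_gorilla(fleet, "zone-a")

print(survives(fleet, one_down, min_healthy=4))   # one node lost, five remain
print(survives(fleet, zone_down, min_healthy=4))  # two nodes lost, four remain
print(survives(fleet, zone_down, min_healthy=5))  # a zone outage breaks this target
```

In this toy model, a service that needs four healthy nodes tolerates both a random instance failure and the loss of a whole zone, while one that needs five tolerates only the former.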