Results 1 to 3 of 3
  1. #1
    WHT-BR Top Member
    Join Date
    Dec 2010

    [EN] AWS: At Scale, Rare Events aren’t Rare

    James Hamilton
    April 5, 2017

    I’m a connoisseur of failure. I love reading about engineering failures of all forms and, unsurprisingly, I’m particularly interested in data center faults. It’s not that I delight in engineering failures. My interest is driven by believing that the more faults we all understand, the more likely we can engineer systems that don’t suffer from these weaknesses.

    It’s interesting, at least to me, that even fairly poorly-engineered data centers don’t fail all that frequently and really well-executed facilities might go many years between problems. So why am I so interested in understanding the cause of faults even in facilities where I’m not directly involved? Two big reasons: 1) the negative impact of a fault is disproportionately large and avoiding just one failure could save millions of dollars and 2) at extraordinary scale, even very rare faults can happen more frequently.
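    The "rare faults stop being rare at scale" point can be made concrete with a little probability. A minimal sketch, with an illustrative per-facility fault rate (the rate is an assumption, not a number from the post):

    ```python
    def p_at_least_one(p: float, n: int) -> float:
        """Chance that at least one of n facilities sees the fault in a year,
        given independent per-facility annual probability p."""
        return 1 - (1 - p) ** n

    p = 0.001  # a fault expected once per 1,000 facility-years (illustrative)
    for n in (1, 100, 1000):
        print(f"{n:>4} facilities: {p_at_least_one(p, n):.1%} per year")
    ```

    A fault a single facility might never see in decades becomes close to a coin flip, or better, across a large enough fleet.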

    Today’s example is from a major US airline last summer and it is a great example of “rare events happen dangerously frequently at scale.” I’m willing to bet this large airline had never before seen this particular fault and yet, operating at much higher scale, I’ve personally encountered it twice in my working life. This example is a good one because the negative impact is high, the fault mode is well understood, and, although it’s a relatively rare event, there are multiple public examples of this failure mode.

    Before getting into the details of what went wrong, let’s look at the impact of this failure on customers and the business. In this case, 1,000 flights were canceled on the day of the event but the negative impact continued for two more days, with 775 flights canceled the next day and 90 on the third day. The Chief Financial Officer reported that $100m of revenue, or roughly 2% of the airline’s worldwide monthly revenue, was lost in the fallout of this event. It’s more difficult to measure the negative impact on brand and customer future travel planning, but presumably there would have been impact on these dimensions as well.
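    The reported figures imply the airline’s scale; a quick back-of-envelope using only the numbers above:

    ```python
    lost_revenue = 100e6           # $100M reported lost
    share_of_monthly = 0.02        # ~2% of worldwide monthly revenue
    monthly_revenue = lost_revenue / share_of_monthly
    cancellations = 1000 + 775 + 90  # flights canceled over the three days
    print(f"implied monthly revenue ~ ${monthly_revenue/1e9:.0f}B, "
          f"{cancellations} flights canceled in total")
    ```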

    It’s rare that the negative impact of a data center failure will be published, but the magnitude of this particular fault isn’t surprising. Successful companies are automated and, when a systems failure brings them down, the revenue impact can be massive.

    What happened? The report was “switch gear failed and locked out reserve generators.” To understand the fault, it’s best to understand what the switch gear normally does and how faults are handled and then dig deeper into what went wrong in this case.

    In normal operation the utility power feeding a data center flows in from the mid-voltage transformers through the switch gear and then to the uninterruptible power supplies, which eventually feed the critical load (servers, storage, and networking equipment). In normal operation, the switch gear is just monitoring power quality.

    If the utility power goes outside of acceptable quality parameters or simply fails, the switch gear waits a few seconds since, in the vast majority of the cases, the power will return before further action needs to be taken. If the power does not return after a predetermined number of seconds (usually less than 10), the switch gear will signal the backup generators to start. The generators start, run up to operating RPM, and are usually given a very short period to stabilize. Once the generator power is within acceptable parameters, the load is switched to the generator. During the few seconds required to switch to generator power, the UPS has been holding the critical load and the switch to generators is transparent. When the utility power returns and is stable, the load is switched back to utility and the generators are brought back down.
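    The sequence above can be sketched as a small decision procedure. This is a minimal illustration, not real switch gear firmware; the ride-through window and the sampling model are assumptions:

    ```python
    from enum import Enum, auto

    class Source(Enum):
        UTILITY = auto()
        GENERATOR = auto()
        UPS_ONLY = auto()   # generators never came online; the UPS will discharge

    RIDE_THROUGH_TICKS = 8  # seconds to wait for utility to return (<10 s per the post)

    def transfer_sequence(utility_ok_per_second, generator_starts_ok=True):
        """Return which source ends up carrying the load after a utility fault.

        utility_ok_per_second: power-quality booleans sampled once per second
        after the initial fault.
        """
        # Wait a few seconds: in the vast majority of cases power returns.
        for ok in utility_ok_per_second[:RIDE_THROUGH_TICKS]:
            if ok:
                return Source.UTILITY
        # Utility did not return: start generators while the UPS holds the load.
        if generator_starts_ok:
            return Source.GENERATOR  # transferred once power is within parameters
        return Source.UPS_ONLY       # now minutes away from dropping the load
    ```

    For example, `transfer_sequence([False, False, True])` returns `Source.UTILITY`: the glitch self-healed inside the ride-through window and no generator start was needed.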

    The utility failure sequence described above happens correctly almost every time. In fact, it occurs exactly as designed so frequently that most facilities will never see the fault mode we are looking at today. The rare failure mode that can cost $100m looks like this: when the utility power fails, the switch gear detects a voltage anomaly sufficiently large to indicate a high probability of a ground fault within the data center. A generator brought online into a direct short could be damaged. With expensive equipment possibly at risk, the switch gear locks out the generator. Five to ten minutes after that decision, the UPS will discharge and row after row of servers will start blinking out.
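    The lockout path amounts to one extra decision ahead of that sequence. A sketch with an invented threshold (real switch gear uses vendor-specific, largely undocumented criteria):

    ```python
    # Per-unit voltage deviation beyond which the switch gear infers a probable
    # in-building ground fault; the 0.4 figure is invented for illustration.
    GROUND_FAULT_ANOMALY_PU = 0.4

    def generator_allowed(voltage_anomaly_pu: float) -> bool:
        """False means lockout: generators are withheld to protect them."""
        return voltage_anomaly_pu < GROUND_FAULT_ANOMALY_PU
    ```

    When this returns False, the UPS carries the load alone until it discharges five to ten minutes later.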

    This same fault mode caused the 34-minute outage at the 2012 Super Bowl: The Power Failure Seen Around the World.

    Backup generators run around three quarters of a million dollars, so I understand the switch gear engineering decision to lock out and protect an expensive component. And, while I suspect that some customers would want it that way, I’ve never worked for one of those customers, and the airline hit by this fault last summer certainly isn’t one of them either.

    There are likely many possible causes of a power anomaly of sufficient magnitude to cause switch gear lockout, but the two events I’ve been involved with were both caused by cars colliding with aluminum street light poles that subsequently fell across two phases of the utility power. Effectively, an excellent conductor landed across two phases of a high-voltage utility feed.

    One of the two times this happened, I was within driving distance of the data center and everyone I was with was getting massive numbers of alerts warning of a discharging UPS. We sped to the ailing facility and arrived just as servers were starting to go down as the UPSs discharged. With the help of the switch gear manufacturer and going through the event logs, we were able to determine what happened. What surprised me is that the switch gear manufacturer was unwilling to make the change to eliminate this lockout condition, even if we were willing to accept all equipment damage that resulted from that decision.

    What happens if the generator is brought into the load rather than locking out? In the vast majority of situations, and in 100% of those I’ve looked at, the fault is outside of the building and so the lockout has no value. If there was a ground fault in the facility, the impacted branch circuit breaker would open, the rest of the facility would continue to operate on generator, and the servers downstream of the open breaker would switch to secondary power and also continue to operate normally. No customer impact. If the fault was much higher in the power distribution system and without breaker protection, or the breaker failed to open, I suspect a generator might take damage, but I would rather put just under $1m at risk than be guaranteed that the load will be dropped. If just one customer could lose $100m, saving the generator just doesn’t feel like the right priority.
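    The trade-off in this paragraph can be framed as a rough expected-cost comparison. The in-building fault probability is an assumption (the post reports that every observed case was external), and the lockout cost assumes a customer with the airline’s exposure:

    ```python
    GENERATOR_COST = 0.75e6   # ~3/4 of a million dollars, per the post
    OUTAGE_COST = 100e6       # revenue loss reported by the airline
    P_FAULT_INSIDE = 0.01     # assumed small; all observed faults were outside

    # Lockout guarantees a dropped load once the UPS discharges; closing in
    # risks a generator only in the rare unprotected in-building case.
    expected_cost_lockout = OUTAGE_COST
    expected_cost_close_in = P_FAULT_INSIDE * GENERATOR_COST
    print(expected_cost_lockout, expected_cost_close_in)
    ```

    Even with generous assumptions about how often the fault is internal, the expected costs differ by orders of magnitude, which is the post’s point.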

    I’m lucky enough to work at a high-scale operator where custom engineering to avoid even a rare fault still makes excellent economic sense, so we solved this particular fault mode some years back. In our approach, we implemented custom control firmware such that we can continue to multi-source industry switch gear but it is our firmware that makes the load transfer decisions and, consequently, we don’t lock out.

  2. #2


    Vermont Fearer
    April 5, 2017 at 8:05 pm

    Really enjoyed this post and learned a lot about how data centers are designed. I had read about a similar case involving a rare event at an airline data center in Arizona that you might find interesting:

    James Hamilton
    April 6, 2017 at 1:17 pm

    Thanks for sending the Arizona Supreme Court filing on Qwest vs US Air. US Air lost but the real lesson here is they need a better supplier. One cut fiber bundle shouldn’t isolate a data center.

    Here’s a funny one I’ve not talked about before. Years ago I was out in my backyard (back when we still had a backyard) getting the garden ready for some spring planting when I found an old damaged black cable with some exposed conductors buried 6″ below the surface. Weird that I hadn’t found it before, but I cut it at the entry and exit of the garden and proceeded. Later that day I noticed our phone didn’t work but didn’t put the two events together. The next day I learned that the entire area had lost phone coverage the previous day and was still down. Really, wonder what happened there? :-)

    Florian Seidl-Schulz
    April 5, 2017 at 9:24 am

    You offered to take all legal responsibilities, if the generator went online, no matter what?
    As in, humans harmed by continuous power supply to a short and contract damages upon prolonged outage due to generator destruction?

    James Hamilton
    April 5, 2017 at 10:32 am

    You raised the two important issues, but human safety is the dominant one. Data centers legally have to meet jurisdictional electrical safety requirements, and these standards have evolved over more than 100 years of electrical system design and usage. The standards have become pretty good, reflect the industry focus on safety first, and industry design practices usually exceed these requirements on many dimensions.

    The switch gear lockout is not required by electrical standards, but it’s still worth looking at lockout and determining whether it adds safety or reduces it for the benefit of the equipment. When a lockout event occurs, an electrical professional will have to investigate, and they have two choices: they can try re-engaging the breaker to see if it was a transient event (by far the most common choice), or they can probe the conductors looking for faults. All well-designed and compliant data centers have many redundant breakers between the load and the generator. Closing that breaker via automation gives the investigating electrical professional more data when they investigate the event.

    Equipment damage is possible but, again, well-designed facilities are redundant with concurrent maintainability, which means they can have equipment offline for maintenance, then suffer an electrical fault, and still safely hold the load. Good designs need more generators than required to support the entire facility during a utility fault. A damaged generator represents a cost, but it should not lead to an outage in a well-designed facility.

    Human safety is a priority in all facilities. Equipment damage is not something any facility wants but, for many customers, availability is more important than possible generator repair cost avoidance.

    Ruprecht Schmidt
    April 5, 2017 at 7:22 am

    Loved the piece! I’m now inspired to ask someone about the switch gear situation at the data center where we host the majority of our gear. Thanks!

    James Hamilton
    April 5, 2017 at 10:15 am

    It’s actually fairly complex to chase down the precise triggering events that cause different switch modes to be entered. Only the switch gear engineering team really knows the details, the nuances, and the edge cases behind each switch mode. It’s hard data to get with precision and completeness.

    In many ways it’s worth asking about this fault mode but, remember, this really is a rare event. It’s a super unusual facility where this fault mode is anywhere close to the most likely fault to cause downtime. Just about all aspects of the UPSs need scrutiny first.


  3. #3
    April 5, 2017 at 5:13 am


    Back on topic, if there was such a fault, I have to wonder if more than mere equipment damage might be at stake in some cases. I suspect they also want to limit their own liability for doing something dangerous.

    James Hamilton
    April 5, 2017 at 10:07 am

    Human risk factors are the dominant concern for the data center operators and equipment operators. Data centers have high concentrations of power, but these concerns are just as important in office buildings, apartment buildings, and personal homes, and that’s why we have jurisdictional electrical standards designed to reduce the risk to occupants and operators directly, and indirectly through fire. The safeguards in place are important and required by all jurisdictions, but these safety regulations do not include switch gear lockout.

    All data centers have 5+ breakers between the generator and the load. There are breakers at the generator, the switch gear itself, the UPS, and downstream in some form of remote power panel and, depending upon the design, many more locations. As an industry we have lots of experience in electrical safety, and the designs operate well even when multiple faults are present because they all have many layers of defense.

    Let’s assume that the switch gear lockout is a part of this multi-layered human defense system, even though it is not required by electrical codes. Is it possible that this implementation is an important part of why modern electrical systems have such an excellent safety record? With the lockout design, the system goes dark and professional electrical engineers are called. Many critical facilities have electrical engineers on premises at all times but, even then, it’ll likely take more time than the UPS discharge time to get to the part of the building that is faulting. The first thing a professional engineer will do when investigating a switch gear lockout is re-engage the breaker and see if it was a transient event or an actual on-premises issue. Another investigative possibility is to probe the system for a ground fault, but most professionals choose to engage the breaker first, and it seems like a prudent first choice: probing potentially hot conductors is not a task best taken on under time pressure, and in the 99th percentile of cases the event is outside the facility, so just re-engaging the breaker is safer than probing.

    Doing this first-level test of re-engaging the open breaker through automation has the advantage of 1) not dropping the load in the common case, and 2) not requiring a human to be at the switch gear to engage it in a test. I hate closing 3,000A breakers and, if I personally have to do it, I always stand beside them rather than in front of the breaker. As safe as it is, it’s hard to feel totally comfortable with loads that high. Doing the first-level investigation in automation reduces human risk and puts more information on the table for the professional engineer who will investigate the issue. Of course, all issues, whether resolved through automation or not, still need full root cause investigation.
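    The automated first-level test described here is essentially a single supervised re-close. A sketch of the idea, with hypothetical function names standing in for the real controls:

    ```python
    def investigate_by_automation(close_breaker, page_engineers):
        """close_breaker() -> True if the breaker holds, False if it re-trips."""
        if close_breaker():
            return "load restored"   # common case: the fault was outside the facility
        page_engineers()             # real on-premises fault: humans take over,
        return "locked out"          # with the re-trip data already logged
    ```

    Either way a human still does the root cause investigation; the automation just avoids dropping the load in the common external-fault case and keeps people away from a hot switch board during the first test.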

    Denis Altudov
    April 4, 2017 at 11:29 pm

    A more, ahem, “pedestrian” story along the same lines.

    Electric bikes have batteries which may overheat under certain conditions such as weather, load, manufacturing defects, motor problems, shorts, etc. The battery controller in this case is programmed to cut the power, saving the battery from overheating and/or catching fire. Fair enough, right? A fire is avoided, the equipment is saved, and the user just coasts along for a while and comes to a safe stop.

    Fast forward to the “hoverboard” craze – the self-balancing boards with two wheels, one on each side. The batteries and controllers were repurposed in a hurry to serve the new hot market. When a battery overheats, the controller cuts the power to the motor, the self-balancing feature turns off, and the user face-plants into the pavement. But the $100 battery is saved!

    Sadly, I don’t have Amazon’s scale to rewrite the battery controller firmware, hence why I led my post with the word “pedestrian”. Off I go for a stroll.


    James Hamilton
    April 5, 2017 at 10:47 am

    Hey Denis. Good to hear from you. Lithium-ion batteries have massive power density and, just over the last few months, have been in the news frequently with reports of headphones suffering from an explosive discharge and cell phones catching fire. The hoverboard mishaps have included both the permanent battery lockout you describe and also fires from not isolating faulty cells.

    Safety around Li-ion batteries, especially large ones, is super important. Good battery designs include inter-cell fusing, some include a battery-wide fuse, and include charge/discharge monitoring firmware that includes temperature monitoring. Some of the more elaborate designs include liquid cooling. Tesla has been particularly careful in their design, partly since they couldn’t afford to have a fault in the early days but mostly because they were building a positively massive battery with thousands of cells.

    Good Li-ion battery designs use stable chemistries and have fail-safe monitoring with inter-cell fusing and often battery-wide fusing. These safety systems will cause the odd drone to crash and may cause sudden drive loss on hoverboards. In the case of hoverboards, because the basic system is unstable without power, a good design would have to ensure that there is sufficient reserve energy to safely shut down on battery fault. I’m sure this could be done but, as you point out, it usually wasn’t.

    My take is that transportation vehicles that are unstable or unsafe when not powered are probably just not ideal transportation vehicles. I’m sure hoverboards could be designed with sufficient backup power to allow them to shut down safely but the easy mitigation is get an electric bike :-)
