  1. #1
    WHT-BR Top Member
    Join Date: Dec 2010

    [EN] Data Center Power Outage Brings Down GitHub

    GitHub.com was unavailable for two hours and six minutes.

    by Yevgeniy Sverdlik on January 29, 2016

    GitHub, the most popular online code repository and hosting service, went down for roughly two hours on Wednesday afternoon Pacific time (early Thursday UTC) due to a power outage in its primary data center.

    “A brief power disruption at our primary data center caused a cascading failure that impacted several services critical to GitHub.com’s operation,” Sam Lambert, GitHub’s director of systems, wrote in a status update on the company’s blog Friday morning. “While we worked to recover service, GitHub.com was unavailable for two hours and six minutes.”

    The outage started around 4pm Pacific. The final procedure to fully restore the facility’s power infrastructure was completed Thursday evening.

    Utility power outages do not bring down data centers in most cases, since these facilities are designed with UPS units, backup generators, and transfer systems that fail over to the generators automatically. When power-related data center outages do happen, they are usually caused by failures in those backup systems.

    It’s unclear where San Francisco-based GitHub’s primary data center is. Lambert told us in an earlier interview that the company does not disclose its data center locations.

    What we do know is that it has to expand its data center capacity rapidly to keep up with the growing popularity of open source software. “We have a massive intake of new repositories and just new data, so we’re continually expanding our storage infrastructure,” Lambert said.

    GitHub doesn’t use virtual machines. Instead, it has built a bare-metal cloud, which allows it to provision physical machines the same way cloud VMs are provisioned.

    “We deploy onto physical machines,” he said. “We have a system internally that allows us to deploy physical machines as the cloud.”

    Effects of the outage were felt widely by startup and enterprise developers around the world, many of whom rely on GitHub for their day-to-day coding.

  2. #2
    WHT-BR Top Member
    Join Date: Dec 2010

    January 28th Incident Report

    Summary: The interruption in power meant that just over a quarter of GitHub's servers and network devices rebooted. Load balancers and application servers were unaffected, but could not serve requests because the backend systems were down.

    Initially, GitHub thought it was once again under a large distributed denial of service attack. As a result, GitHub spent time early on during the incident bringing up its DDoS defences, before it realised no such attack was taking place.

    Several factors conspired to slow down restoration of service for GitHub.

    • The affected servers used a specific type of hardware and spanned many racks and rows in the data centre, which left several members of GitHub's Redis in-memory data store clusters inaccessible.
    • GitHub had inadvertently added a hard dependency on the Redis clusters being fully available in the boot path of its application code, so the application processes failed to start.
    • A firmware bug in the servers that rebooted meant they weren't able to see their disks after power came back.

    Customers, meanwhile, were none the wiser, as GitHub's ChatOps team communications systems were on the servers that rebooted. This led to an eight-minute delay before the service indicator was set to red, notifying users that the site was down.

    Scott Sanders
    February 3, 2016

    Last week GitHub was unavailable for two hours and six minutes. We understand how much you rely on GitHub and consider the availability of our service one of the core features we offer. Over the last eight years we have made considerable progress towards ensuring that you can depend on GitHub to be there for you and for developers worldwide, but a week ago we failed to maintain the level of uptime you rightfully expect. We are deeply sorry for this, and would like to share with you the events that took place and the steps we’re taking to ensure you're able to access GitHub.

    The Event

    At 00:23am UTC on Thursday, January 28th, 2016 (4:23pm PST, Wednesday, January 27th) our primary data center experienced a brief disruption in the systems that supply power to our servers and equipment. Slightly over 25% of our servers and several network devices rebooted as a result. This left our infrastructure in a partially operational state and generated alerts to multiple on-call engineers. Our load balancing equipment and a large number of our frontend application servers were unaffected, but the systems they depend on to service your requests were unavailable. In response, our application began to deliver HTTP 503 response codes, which carry the unicorn image you see on our error page.

    Our early response to the event was complicated by the fact that many of our ChatOps systems were on servers that had rebooted. We do have redundancy built into our ChatOps systems, but this failure still caused some amount of confusion and delay at the very beginning of our response. One of the biggest customer-facing effects of this delay was that status.github.com wasn't set to status red until 00:32am UTC, eight minutes after the site became inaccessible. We consider this to be an unacceptably long delay, and will ensure faster communication to our users in the future.

    Initial notifications for unreachable servers and a spike in exceptions related to Redis connectivity directed our team to investigate a possible outage in our internal network. We also saw an increase in connection attempts that pointed to network problems. While later investigation revealed that a DDoS attack was not the underlying problem, we spent time early on bringing up DDoS defenses and investigating network health. Because we have experience mitigating DDoS attacks, our response procedure is now habit and we are pleased we could act quickly and confidently without distracting other efforts to resolve the incident.

    With our DDoS shields up, the response team began to methodically inspect our infrastructure and correlate these findings back to the initial outage alerts. The inability to reach all members of several Redis clusters led us to investigate uptime for devices across the facility. We discovered that some servers were reporting uptime of several minutes, but our network equipment was reporting uptimes that revealed they had not rebooted. Using this, we determined that all of the offline servers shared the same hardware class, and the ones that rebooted without issue were a different hardware class. The affected servers spanned many racks and rows in our data center, which resulted in several clusters experiencing reboots of all of their member servers, despite the clusters' members being distributed across different racks.
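
    The uptime correlation described here is easy to picture as a script. The sketch below is purely illustrative (the inventory fields, hardware-class labels, and ten-minute threshold are assumptions, not GitHub's actual tooling): it flags hosts whose reported uptime is only a few minutes and groups them by hardware class.

    ```python
    # Hypothetical sketch of the correlation described above: flag hosts whose
    # uptime suggests a recent reboot, then group them by hardware class.
    from collections import defaultdict

    REBOOT_THRESHOLD_SECONDS = 600  # "uptime of several minutes" (illustrative cutoff)

    def find_rebooted_hosts(inventory):
        """inventory: iterable of dicts with 'host', 'uptime_s', 'hardware_class'."""
        rebooted = [h for h in inventory if h["uptime_s"] < REBOOT_THRESHOLD_SECONDS]
        by_class = defaultdict(list)
        for host in rebooted:
            by_class[host["hardware_class"]].append(host["host"])
        return by_class

    if __name__ == "__main__":
        sample = [
            {"host": "redis-01", "uptime_s": 240, "hardware_class": "vendor-a"},
            {"host": "redis-02", "uptime_s": 310, "hardware_class": "vendor-a"},
            {"host": "fe-01", "uptime_s": 8_640_000, "hardware_class": "vendor-b"},
        ]
        for hw_class, hosts in find_rebooted_hosts(sample).items():
            print(f"{hw_class}: {len(hosts)} rebooted host(s): {', '.join(hosts)}")
    ```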

    As the minutes ticked by, we noticed that our application processes were not starting up as expected. Engineers began taking a look at the process table and logs on our application servers. These explained that the lack of backend capacity was a result of processes failing to start due to our Redis clusters being offline. We had inadvertently added a hard dependency on our Redis cluster being available within the boot path of our application code.
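
    As an illustration of that failure mode, here is a minimal sketch (not GitHub's code) using the redis-py client and a placeholder hostname: pinging Redis in the boot path kills the process at startup when the cluster is down, whereas deferring the connection lets the process start and degrade per request instead.

    ```python
    # Minimal sketch of a hard boot-path dependency on Redis, and a lazier
    # alternative. The hostname "redis.internal" is a placeholder.
    import redis

    # Hard dependency in the boot path: this raises ConnectionError at startup
    # if Redis is unreachable, and the application process never starts.
    #   cache = redis.Redis(host="redis.internal")
    #   cache.ping()  # boot fails here during an outage

    _cache = None

    def get_cache():
        """Lazily connect; callers handle unavailability at request time."""
        global _cache
        if _cache is None:
            _cache = redis.Redis(host="redis.internal", socket_connect_timeout=1)
        return _cache

    def fetch(key, fallback=None):
        try:
            return get_cache().get(key)
        except redis.exceptions.ConnectionError:
            return fallback  # degrade the request instead of refusing to boot
    ```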

    By this point, we had a fairly clear picture of what was required to restore service and began working towards that end. We needed to repair our servers that were not booting, and we needed to get our Redis clusters back up to allow our application processes to restart. Remote access console screenshots from the failed hardware showed boot failures because the physical drives were no longer recognized. One group of engineers split off to work with the on-site facilities technicians to bring these servers back online by draining the flea power to bring them up from a cold state so the disks would be visible. Another group began rebuilding the affected Redis clusters on alternate hardware. These efforts were complicated by a number of crucial internal systems residing on the offline hardware. This made provisioning new servers more difficult.

    Once the Redis cluster data was restored onto standby equipment, we were able to bring the Redis-server processes back online. Internal checks showed application processes recovering, and a healthy response from the application servers allowed our HAProxy load balancers to return these servers to the backend server pool. After verifying site operation, the maintenance page was removed and we moved to status yellow. This occurred two hours and six minutes after the initial outage.
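
    The health-check side of that hand-off can be sketched roughly as follows, assuming the redis-py client and a made-up /health path and hostname; a load balancer such as HAProxy (for example via an option httpchk probe) would only return a server to the backend pool once an endpoint like this answers 200.

    ```python
    # Illustrative health-check endpoint of the sort a load balancer can poll
    # before returning an application server to its pool. The path and Redis
    # host are assumptions; this is not GitHub's actual check.
    from http.server import BaseHTTPRequestHandler, HTTPServer
    import redis

    backend = redis.Redis(host="redis.internal", socket_connect_timeout=1)

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/health":
                self.send_response(404); self.end_headers(); return
            try:
                backend.ping()                 # backend dependency reachable?
                self.send_response(200)        # load balancer re-adds this server
            except redis.exceptions.RedisError:
                self.send_response(503)        # keep the server out of the pool
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
    ```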

    The following hours were spent confirming that all systems were performing normally, and verifying there was no data loss from this incident. We are grateful that much of the disaster mitigation work put in place by our engineers was successful in guaranteeing that all of your code, issues, pull requests, and other critical data remained safe and secure.

    Future Work

    Complex systems are defined by the interaction of many discrete components working together to achieve an end result. Understanding the dependencies for each component in a complex system is important, but unless these dependencies are rigorously tested it is possible for systems to fail in unique and novel ways. Over the past week, we have devoted significant time and effort towards understanding the nature of the cascading failure which led to GitHub being unavailable for over two hours. We don’t believe it is possible to fully prevent the events that resulted in a large part of our infrastructure losing power, but we can take steps to ensure recovery occurs in a fast and reliable manner. We can also take steps to mitigate the negative impact of these events on our users.

    We identified the hardware issue resulting in servers being unable to view their own drives after power-cycling as a known firmware issue that we are updating across our fleet. Updating our tooling to automatically open issues for the team when new firmware updates are available will force us to review the changelogs against our environment.
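
    The tooling change described here might look something like the following sketch: poll a vendor firmware feed and open a tracking issue through GitHub's create-issue API so the team reviews the changelog. The feed URL, repository, and token handling are placeholders; only the POST /repos/{owner}/{repo}/issues endpoint is a real API.

    ```python
    # Rough sketch: poll a (placeholder) vendor firmware feed and open a
    # tracking issue for each previously unseen release.
    import os
    import requests

    VENDOR_FEED = "https://vendor.example/firmware/releases.json"  # placeholder
    REPO = "example-org/infrastructure"                            # placeholder

    def open_firmware_issue(release):
        resp = requests.post(
            f"https://api.github.com/repos/{REPO}/issues",
            headers={"Authorization": f"token {os.environ['GITHUB_TOKEN']}"},
            json={
                "title": f"Review firmware {release['version']} for {release['model']}",
                "body": f"New firmware published: {release['changelog_url']}\n"
                        "Check the changelog against our fleet before rollout.",
            },
            timeout=10,
        )
        resp.raise_for_status()
        return resp.json()["html_url"]

    def check_feed(known_versions):
        releases = requests.get(VENDOR_FEED, timeout=10).json()
        for release in releases:
            if release["version"] not in known_versions:
                print("Opened:", open_firmware_issue(release))
    ```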

    We will be updating our application’s test suite to explicitly ensure that our application processes start even when certain external systems are unavailable and we are improving our circuit breakers so we can gracefully degrade functionality when these backend services are down. Obviously there are limits to this approach and there exists a minimum set of requirements needed to serve requests, but we can be more aggressive in paring down the list of these dependencies.
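
    A circuit breaker of the kind mentioned here can be sketched in a few lines; this is a generic illustration, not GitHub's implementation. After a run of consecutive failures the breaker opens and callers immediately get a degraded fallback, rather than waiting on a backend that is known to be down.

    ```python
    # Generic circuit-breaker sketch: open after repeated failures, serve a
    # fallback while open, and retry the backend after a cool-down.
    import time

    class CircuitBreaker:
        def __init__(self, max_failures=5, reset_after=30.0):
            self.max_failures = max_failures
            self.reset_after = reset_after
            self.failures = 0
            self.opened_at = None

        def call(self, func, fallback):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_after:
                    return fallback()            # open: degrade immediately
                self.opened_at = None            # half-open: try the backend again
                self.failures = 0
            try:
                result = func()
            except Exception:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.opened_at = time.monotonic()
                return fallback()
            self.failures = 0
            return result

    # Usage: breaker.call(lambda: cache.get("feed"), fallback=lambda: None)
    ```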

    We are reviewing the availability requirements of our internal systems that are responsible for crucial operations tasks such as provisioning new servers so that they are on par with our user-facing systems. Ultimately, if these systems are required to recover from an unexpected outage situation, they must be as reliable as the system being recovered.

    A number of less technical improvements are also being implemented. Strengthening our cross-team communications would have shaved minutes off the recovery time. Predefining escalation strategies during situations that require all hands on deck would have enabled our incident coordinators to spend more time managing recovery efforts and less time navigating documentation. Improving our messaging to you during this event would have helped you better understand what was happening and set expectations about when you could expect future updates.

    In Conclusion

    We realize how important GitHub is to the workflows that enable your projects and businesses to succeed. All of us at GitHub would like to apologize for the impact of this outage. We will continue to analyze the events leading up to this incident and the steps we took to restore service. This work will guide us as we improve the systems and processes that power GitHub.
