Results 1 to 6 of 6
  1. #1
    WHT-BR Top Member
    Join Date
    Dec 2010

    [EN] AWS: Massive S3 Outage

    Affected sites include Airbnb, Business Insider, Chef, Docker, Expedia, Heroku, Mailchimp, SendGrid, News Corp, Imgur, GitHub, Pinterest, Slack, and Quora, as well as parts of AWS’ own site and, ironically, Down Detector.

    Jordan Novet
    February 28, 2017

    Cloud infrastructure provider Amazon Web Services (AWS) today confirmed that it’s looking into issues with its widely used S3 storage service in the major us-east-1 region of data centers in Northern Virginia. Other services are affected as well.

    “We are investigating increased error rates for Amazon S3 requests in the US-EAST-1 Region,” AWS said at the top of its status page.

    The issues appear to be affecting Adobe’s services, Amazon’s Twitch, Atlassian’s Bitbucket and HipChat, Autodesk Live and Cloud Rendering, Buffer, Business Insider, Carto, Chef, Citrix, Clarifai, Codecademy, Coindesk, Convo, Coursera, Cracked, Docker, Elastic, Expedia, Expensify, FanDuel, FiftyThree, Flipboard, Flippa, Giphy, GitHub, GitLab, Google-owned Fabric, Greenhouse, Heroku, Home Chef, iFixit, IFTTT, Imgur, Ionic, Jamf, JSTOR, Kickstarter, Lonely Planet, Mailchimp, Mapbox, Medium, Microsoft’s HockeyApp, the MIT Technology Review, MuckRock, New Relic, News Corp, OrderAhead, PagerDuty, Pantheon, Quora, Razer, Signal, Slack, Sprout Social, StatusPage (which Atlassian recently acquired), Travis CI, Trello, Twilio, Unbounce, the U.S. Securities and Exchange Commission (SEC), The Verge, Vermont Public Radio, VSCO, Wix, Xero, and Zendesk, among others. Airbnb, Down Detector, Freshdesk, Pinterest, SendGrid, Snapchat’s Bitmoji, and Time Inc. are currently working slowly.

    Apple is acknowledging issues with its App Stores, Apple Music, FaceTime, iCloud services, iTunes, Photos, and other services on its system status page, but it’s not clear they’re attributable to today’s S3 difficulties.

    Parts of Amazon itself also seem to be facing technical problems at the moment. Ironically, the outage is restricting AWS’ ability to show errors on its own status dashboard.

    The dashboard not changing color is related to S3 issue. See the banner at the top of the dashboard for updates.

    — Amazon Web Services (@awscloud) February 28, 2017

    AWS outages do happen from time to time. In 2015 an outage lasted five hours. And AWS plays an increasingly prominent role in the finances of Amazon; in the fourth quarter it yielded $926 million in operating income and $3.53 billion in revenue for its parent company.

    Update at 10:30 a.m. Pacific: AWS has provided slightly more information about the S3 outage. “We’ve identified the issue as high error rates with S3 in US-EAST-1, which is also impacting applications and services dependent on S3. We are actively working on remediating the issue,” AWS said on its status page.

    Update at 10:51 a.m. Pacific: AWS has another S3 status update. “We’re continuing to work to remediate the availability issues for Amazon S3 in US-EAST-1. AWS services and customer applications depending on S3 will continue to experience high error rates as we are actively working to remediate the errors in Amazon S3,” AWS said on the status page.

    Update at 11:40 a.m. Pacific: A bit of good news from Amazon. “We have now repaired the ability to update the service health dashboard. The service updates are below. We continue to experience high error rates with S3 in US-EAST-1, which is impacting various AWS services. We are working hard at repairing S3, believe we understand root cause, and are working on implementing what we believe will remediate the issue,” AWS said on the status page.

    Update at 11:52 a.m. Pacific: Among the services based out of Northern Virginia that are affected today are Athena, CloudWatch, EC2, Elastic File System, Elastic Load Balancing (ELB), Kinesis Analytics, Redshift, Relational Database Service (RDS), Simple Email Service (SES), Simple Workflow Service, WorkDocs, WorkMail, CodeBuild, CodeCommit, CodeDeploy, Elastic Beanstalk, Key Management Service (KMS), Lambda, OpsWorks, Storage Gateway, and WAF (web application firewall). Yikes, that’s a lot.

    Update at 11:59 a.m. Pacific: AppStream, Elastic MapReduce (EMR), Kinesis Firehose, WorkSpaces, CloudFormation, CodePipeline are also dealing with issues now, according to the status page.

    Update at 12:06 p.m. Pacific: Some more services are down now. We’ve got API Gateway, CloudSearch, Cognito, the EC2 Container Registry, ElastiCache, the Elasticsearch Service, Glacier cold storage, Lightsail, Mobile Analytics, Pinpoint, Certificate Manager, CloudTrail, Config, Data Pipeline, Mobile Hub, and QuickSight. Wow.

    Update at 12:15 p.m. Pacific: Added information on issues affecting Apple.

    Update at 12:51 p.m. Pacific: On its status page, Trello just said that “S3 services appear to be slowly coming back up now.”

    Update at 12:52 p.m. Pacific: And now we hear from AWS! “We are seeing recovery for S3 object retrievals, listing and deletions. We continue to work on recovery for adding new objects to S3 and expect to start seeing improved error rates within the hour,” AWS said on its status page.

    Update at 1:19 p.m. Pacific: Things are looking better now. “S3 object retrieval, listing and deletion are fully recovered now. We are still working to recover normal operations for adding new objects to S3,” AWS said.

    Update at 2:35 p.m. Pacific: The S3 problems have been resolved! “As of 1:49 PM PST, we are fully recovered for operations for adding new objects in S3, which was our last operation showing a high error rate. The Amazon S3 service is operating normally,” Amazon said in a 2:08 p.m. update. But several other AWS services are still having problems.

    Update at 6:38 p.m. Pacific: Almost all the services affected in today’s outage are back up and running. CloudTrail, Config, and Lambda are still not fixed.

    Update at 10:16 p.m. Pacific: The status now indicates that all of today’s issues have been resolved. Back to work, everyone, move along.
    Last edited by 5ms; 01-03-2017 at 09:07.

  2. #2
    WHT-BR Top Member
    Join Date
    Dec 2010

    11-hour AWS failure hits websites and apps

    Sharon Gaudin
    Feb 28, 2017

    Amazon Web Services, a major cloud hosting service, had a system failure on Tuesday that affected numerous websites and apps.

    The issue was not fixed until just before 5 p.m. ET.

    “As of [4:49 p.m. ET] we are fully recovered for operations for adding new objects in S3, which was our last operation showing a high error rate. The Amazon S3 service is operating normally,” the company reported.

    The problem had lasted for approximately 11 hours and caused problems for websites and online services throughout the day.

    AWS had reported on its Service Health Dashboard at 2:35 p.m. ET that its engineers were working on the problem, which affected websites including Netflix, Reddit and Adobe.

    The Associated Press reported that its own photos, webfeeds and other online services were also affected. And at approximately 3 p.m. ET, Mashable tweeted that it was also struggling.

    "We can't publish our story about AWS being down because, well, AWS is down," the news outlet tweeted.

    Even Amazon had issues. AWS tweeted that the performance of its own Service Health Dashboard was affected for a while.

    While companies were dealing with the outage, some people were having a laugh about it on Twitter.

    "Due to the #AWS outage and its impact to Snapchat & other popular apps, millions of millennials just looked up for the first time in years."

  3. #3
    WHT-BR Top Member
    Join Date
    Dec 2010

    54 of the top 100 retail sites had slow load times during AWS's S3 outage

    Brandon Butler
    Mar 1, 2017

    Yesterday Amazon Web Services had a bad day. And when AWS has a bad day, so do a lot of other sites.

    Vendor Apica is a website monitoring service that keeps a close eye on some of the top retail websites around the country. All in all, the retail websites Apica tracks had trouble dealing with the elevated error rates AWS reported in S3 starting around midday Eastern Time.

    The most common problem was not sites being completely unavailable, but rather very slow load times. Apica found that 54 of the top 100 retail sites it monitors were impacted, with load times slowing by an average of 20%.

    Who had it worst? Apica found three sites that were completely down during the four-hour disruption: Express, Lululemon, and One Kings Lane.

    Other sites had extremely long load times. Disney, for example, was 1,165% slower than normal to load; Target was 991% slower. Nike, Nordstrom and Victoria’s Secret rounded out the top five slowest sites yesterday during the disruption.
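    For reference, a "percent slower" figure converts to a load-time multiplier of 1 + percent/100, so the Apica numbers quoted above work out as follows (a small sketch; the figures are those reported in the article):

```python
def slowdown_multiplier(percent_slower: float) -> float:
    """Convert a 'percent slower' figure into a load-time multiplier."""
    return 1 + percent_slower / 100

# Figures from the Apica data quoted above:
disney = slowdown_multiplier(1165)  # ≈ 12.65x normal load time
target = slowdown_multiplier(991)   # ≈ 10.91x normal load time
```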

    The outage impacted different sites in different ways. Apple, for example, didn’t see any slowdown in its retail site’s load time, but its cloud services were impacted by numerous errors, particularly iCloud.

    Read more about Apica's analysis of the S3 disruption here.

  4. #4
    WHT-BR Top Member
    Join Date
    Dec 2010

    Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region

    We’d like to give you some additional information about the service disruption that occurred in the Northern Virginia (US-EAST-1) Region on the morning of February 28th. The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.

    S3 subsystems are designed to support the removal or failure of significant capacity with little or no customer impact. We build our systems with the assumption that things will occasionally fail, and we rely on the ability to remove and replace capacity as one of our core operational processes. While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years. S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected. The index subsystem was the first of the two affected subsystems that needed to be restarted. By 12:26PM PST, the index subsystem had activated enough capacity to begin servicing S3 GET, LIST, and DELETE requests. By 1:18PM PST, the index subsystem was fully recovered and GET, LIST, and DELETE APIs were functioning normally. The S3 PUT API also required the placement subsystem. The placement subsystem began recovery when the index subsystem was functional and finished recovery at 1:54PM PST. At this point, S3 was operating normally. Other AWS services that were impacted by this event began recovering. Some of these services had accumulated a backlog of work during the S3 disruption and required additional time to fully recover.

    We are making several changes as a result of this operational event. While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future. We are also auditing our other operational tools to ensure we have similar safety checks. We will also make changes to improve the recovery time of key S3 subsystems. We employ multiple techniques to allow our services to recover from any failure quickly. One of the most important involves breaking services into small partitions which we call cells. By factoring services into cells, engineering teams can assess and thoroughly test recovery processes of even the largest service or subsystem. As S3 has scaled, the team has done considerable work to refactor parts of the service into smaller cells to reduce blast radius and improve recovery. During this event, the recovery time of the index subsystem still took longer than we expected. The S3 team had planned further partitioning of the index subsystem later this year. We are reprioritizing that work to begin immediately.
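    The safeguard AWS describes (refusing any removal that would take a subsystem below its minimum required capacity) can be sketched roughly as a precondition check. Everything here is illustrative; the names and the threshold are assumptions, not AWS's actual tooling:

```python
# Sketch of a minimum-capacity safeguard for a capacity-removal tool.
# MIN_CAPACITY and remove_capacity are hypothetical names; the threshold
# of 100 servers is an assumed figure for illustration only.

MIN_CAPACITY = 100  # minimum servers the subsystem needs to keep serving

def remove_capacity(current_servers: int, to_remove: int,
                    minimum: int = MIN_CAPACITY) -> int:
    """Return the new server count, refusing unsafe removals."""
    if to_remove < 0:
        raise ValueError("removal count must be non-negative")
    remaining = current_servers - to_remove
    if remaining < minimum:
        # A mistyped input now fails fast instead of taking the fleet down.
        raise RuntimeError(
            f"refusing removal: {remaining} servers would remain, "
            f"below the minimum of {minimum}"
        )
    return remaining
```

    The key design point from the postmortem is that the check happens before any server is touched, so an incorrect input produces an error rather than a partial removal.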

    From the beginning of this event until 11:37AM PST, we were unable to update the individual services’ status on the AWS Service Health Dashboard (SHD) because of a dependency the SHD administration console has on Amazon S3. Instead, we used the AWS Twitter feed (@AWSCloud) and SHD banner text to communicate status until we were able to update the individual services’ status on the SHD. We understand that the SHD provides important visibility to our customers during operational events and we have changed the SHD administration console to run across multiple AWS regions.

    Finally, we want to apologize for the impact this event caused for our customers. While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.

  5. #5
    WHT-BR Top Member
    Join Date
    Dec 2010
    Lydia Leong @cloudpundit 46 minutes ago

    AWS's uncomfortably similar to Joyent 2014

    Postmortem for outage of us-east-1

    May 28, 2014 - by The Joyent Team

    We would like to share the details on what occurred during the outage on 5/27/2014 in our us-east-1 data center, what we have learned, and what actions we are taking to prevent this from happening again. On behalf of all of Joyent, we are extremely sorry for this outage, and the severe inconvenience it may have caused to you, and your customers.


    In order to understand the event, first we need to explain a few basics about the architecture of our data centers. All of Joyent's data centers run our SmartDataCenter product, which provides centralized management of all administrative services, and compute nodes (servers) used to host customer instances. The architecture of the system is built such that the control plane, which includes both the API and boot sequences, is highly-available within a single data center and survives any two failures. In addition to this control plane stack, every server in the data center has a daemon on it that responds to normal, machine generated requests for things like provisioning, upgrades, and changes related to maintenance.

    In order for the system to be aware of all servers in the data center and their current instances (VMs), software levels, and relevant information, we have our own DHCP/TFTP system that responds to PXE boot requests from servers in the data center. The DHCP/TFTP system caches the state required to service these requests, ordinarily accessed via the control plane.

    Because of this, existing capacity that is already a part of the SDC architecture is able to boot, even when the control plane is not fully available such as during this event.

    Given this architecture, and the need to support automated upgrades, provisioning requests and many innocuous tasks across all data centers in the fleet, there exists tooling that can be remotely executed on one or more compute nodes. This tooling is primarily used by automated processes, but as part of specific operator needs it may be used by human operations as well.

    What Happened?

    Due to an operator error, all us-east-1 API systems and customer instances were simultaneously rebooted at 2014-05-27T20:13Z (13:13PDT). Rounded to minutes, the minimum downtime for customer instances was 20 minutes, and the maximum was 149 minutes (2.5 hours). 80 percent of customer instances were back within 32 minutes, and over 90 percent were back within 59 minutes. The instances that took longer than others were due to a few independent isolated problems which are described below.

    The us-east-1 API was available and the service was fully restored by 2014-05-27T21:30Z (1 hour and 17 minutes of downtime). Explanation of the extended API outage is also covered below.

    Root cause of this incident was an operator performing upgrades of some new capacity in our fleet using the tooling that allows for remote updates of software. The command to reboot the select set of new systems that needed to be updated was mistyped, and instead specified all servers in the data center. Unfortunately, the tool in question does not have enough input validation to prevent this from happening without extra steps/confirmation, and it went ahead and issued a reboot command to every server in the us-east-1 availability zone without delay.

    Once systems rebooted, they by design looked for a boot server to respond to PXE boot requests. Because there was a simultaneous reboot of every system in the data center, there was extremely high contention on the TFTP boot infrastructure, which like all of our infrastructure, normally has throttles in place to ensure that it cannot run away with a machine. We removed the throttles when we identified this was causing the compute nodes to boot more slowly. This enabled most customer instances to come online over the following 20-30 minutes.

    The small number of machines that lagged for the remaining time were subject to a known, transient bug in a network card driver on legacy hardware platforms whereby obtaining a DHCP lease upon boot occasionally fails. In our experience, platforms with this network device will encounter this boot-time issue about 10% of the time. The mitigation for this is for an operator to simply initiate another reboot, which we performed on those afflicted nodes as soon as we identified them. Our newer equipment uses different network cards which is why the impact was limited to a smaller number of compute nodes.

    The extended API outage time was due to the need for manual recovery of stateful components in the control plane. While the system is designed to handle 2F+1 failures of any stateful system, rebooting the data center resulted in complete failure of all these components, and they did not maintain enough history to bring themselves online. This is partially by design, as we would rather systems be unavailable than come up "split brain" and suffer data loss as a result. That said, we have identified several ways we can make this recovery much faster.

    Next Steps

    We will be taking several steps to prevent this failure mode from happening again, and ensuring that other business disaster scenarios are able to recover more quickly.

    First, we will be dramatically improving the tooling that humans (and systems) interact with such that input validation is much more strict and will not allow for all servers, and control plane servers to be rebooted simultaneously. We have already begun putting in place a number of immediate fixes to tools that operators use to mitigate this, and we will be rethinking what tools are necessary over the coming days and weeks so that "full power" tools are not the only means by which to accomplish routine tasks.
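    The stricter input validation Joyent describes might look something like the sketch below: a target check that refuses a fleet-wide reboot unless the operator passes an explicit override. The function and flag names are hypothetical, not Joyent's actual tooling:

```python
# Hypothetical sketch of strict target validation for a remote reboot tool:
# a mistyped target list that happens to cover the whole data center is
# rejected unless the operator deliberately confirms a fleet-wide action.

def validate_reboot_targets(targets, fleet_size, allow_fleet_wide=False):
    """Reject accidental fleet-wide reboots; return the approved target list."""
    if not targets:
        raise ValueError("no reboot targets specified")
    if len(targets) >= fleet_size and not allow_fleet_wide:
        raise ValueError(
            "target list covers the entire fleet; "
            "pass allow_fleet_wide=True to confirm this is intentional"
        )
    return list(targets)
```

    The point is the same as in the AWS postmortem above: the "full power" path still exists, but it requires a deliberate extra step rather than being the default behavior of a routine command.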

    Secondly, we are determining what extra steps in the control plane recovery can be done such that we can safely reboot all nodes simultaneously without waiting for operator intervention. We will not be able to serve requests during a complete outage, but we will ensure that we can record state in each node such that we can recover without human intervention.

    Lastly, we will be assessing migrating customer instances off our older legacy hardware platforms more aggressively. We will be doing this on a case by case basis, as there will be impact to instances as we do this, which will require us to work with each customer to do so.


    We want to reiterate our apology for the magnitude of this issue and the impact it caused our customers and their customers. We will be working as diligently as we can, and as expediently as we can, to prevent an issue like this from happening again.

    The Joyent Team

  6. #6
    WHT-BR Top Member
    Join Date
    Dec 2010

    $310M AWS fog storage mega-failure: Why everyone put their eggs in one region

    Shaun Nichols
    2 Mar 2017

    With Amazon now recovered from a four-hour outage that brought a large portion of the internet to a grinding halt, analysts are looking back to see what lessons companies can learn from the ordeal.

    The system breakdown – or as AWS put it, "increased error rates" – knocked out a single region of the AWS S3 storage service on Tuesday. That in turn brought down AWS's hosted services in the region, preventing EC2 instances from launching, Elastic Beanstalk from working, and so on. In the process, organizations from Docker and Slack to Nest and Adobe had some or all of their services knocked offline for the duration.

    According to analytics firm Cyence, S&P 500 companies alone lost about $150m (£122m) from the downtime, while financial services companies in the US dropped an estimated $160m (£130m).

    The epicenter of the outage was one region on the east coast of America: the US-East-1 facility in Virginia. Due to its lower cost and familiarity with application programmers, that one location is an immensely popular destination for companies that use AWS for their cloud storage and virtual machine instances.

    As a result of developers centralizing their code there, when it fell over, it took out a chunk of the web. Startups and larger orgs find it cheaper and easier to use US-East-1 out of all the other regions AWS provides. It's Amazon's oldest location, and the one they are most familiar with.

    Coders are, ideally, supposed to spread their software over multiple regions so any failures can be absorbed and recovered from. This is, to be blunt, too difficult to implement for some developers; it introduces extra complexity which means extra bugs, which makes engineers wary; and it pushes up costs.

    For instance, for the first 50TB, S3 storage in US-East-1 costs $0.023 per GB per month compared to $0.026 for US-West-1 in California. Transferring information between apps distributed across multiple data centers also costs money: AWS charges $0.010 per GB to copy data from US-East-1 to US-East-2 in Ohio, and $0.020 to any other region.
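    Using the per-GB rates quoted above, the monthly cost difference is straightforward arithmetic. This sketch assumes a flat rate across the whole first-50 TB tier, and decimal GB/TB units as S3 pricing uses:

```python
GB_PER_TB = 1000  # S3 pricing uses decimal units

def monthly_storage_cost(tb: float, price_per_gb: float) -> float:
    """Flat-rate monthly S3 storage cost within a single pricing tier."""
    return tb * GB_PER_TB * price_per_gb

# Rates quoted above (first 50 TB, early 2017):
east = monthly_storage_cost(50, 0.023)  # US-East-1
west = monthly_storage_cost(50, 0.026)  # US-West-1
print(f"Savings in US-East-1: ${west - east:.0f}/month")  # $150/month
```

    At 50 TB that is $1,150 versus $1,300 a month, which helps explain why cost-sensitive startups concentrated in US-East-1.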

    Then there are latency issues, too. It obviously takes time for packets from US-East-1 to reach US-West-1. In the end, it's easier to plonk your web application and smartphone app's backend in one friendly region, and ride out any storms. It's rare for a whole region to evaporate.

    "Being the oldest region, and the only public region in the US East coast until 2016, it hosts a number of their earliest and largest customers," said IDC research director Deepak Mohan. "It is also one of their largest regions. Due to this, impacts to the region typically affect a disproportionately high percentage of customers."

    Cost was a big factor, says Rob Enderle, principal analyst at the Enderle Group. "The issue with public cloud providers – particularly aggressively priced ones like Amazon – is that your data goes to the cheapest place. It is one of the tradeoffs you make when you go to Amazon versus an IBM Softlayer," Enderle said.

    "With an Amazon or Google you are going to have that risk of a regional outage that takes you out."
    'Pouring one hundred gallons of water through a one gallon hose'

    While those factors made the outage particularly difficult for customers who had come to rely on the US-East-1 region for their service, even those who had planned for such an occurrence and set up multiple regions were likely caught up in the outage. After US-East-1's cloud buckets froze and services vanished, some developers discovered their code running in other regions was unable to pick up the slack for various reasons.

    The takeaway, say the industry analysts, is that companies should consider building redundancy into their cloud instances just as they would for on-premises systems. This could come in the form of setting up virtual machines in multiple regions or sticking with the hybrid approach of keeping both cloud and on-premises systems. And, just like testing backups, companies should test that failovers actually work.
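    The failover pattern the analysts recommend can be expressed as a small, testable helper: try each region in order and fall back on failure. This is a generic sketch, not tied to any particular SDK; the region names and fetch function are illustrative:

```python
def fetch_with_failover(regions, fetch):
    """Try each region in order, returning the first successful result.

    `fetch` is any callable taking a region name; an exception triggers
    failover to the next region instead of an outage for the caller.
    """
    errors = {}
    for region in regions:
        try:
            return fetch(region)
        except Exception as exc:  # in real code, catch the SDK's error types
            errors[region] = exc
    raise RuntimeError(f"all regions failed: {errors}")

# Illustrative use: us-east-1 is down, us-west-1 answers.
def fake_fetch(region):
    if region == "us-east-1":
        raise ConnectionError("increased error rates")
    return f"object served from {region}"

print(fetch_with_failover(["us-east-1", "us-west-1"], fake_fetch))
```

    As the article notes, the hard part is not writing this logic but exercising it regularly, so that the fallback path is known to work before a region actually disappears.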

    "I think we have grown accustomed to the idea that the cloud has become a panacea for a lot of companies," King said. "It is important for businesses to recognize that the cloud is their new legacy system, and if the worst does occur the situation can be worse for businesses using cloud than those choosing their own private data centers, because they have less visibility and control."

    While the outage will probably do little to slow the move of companies into cloud services, it could give some a reason to pause, and that might not be a bad thing.

    "What this emphasizes is the importance of a disaster recovery path, for any application that has real uptime requirements, be it a consumer-facing website or an internal enterprise application," said IDC's Mohan.

    "The biggest takeaway here is the need for a sound disaster recovery architecture and a plan that meets the needs and constraints of the application. This may be through usage of multiple regions, multiple clouds, or other fallback configurations."
