29-06-2012, 18:46 #1
AWS EBS US-East-1 novamente apresenta problemas
The problem began at 11:45 (Brasília time) and was confirmed by Amazon. Some applications experienced difficulties with read and write access to EBS volumes, and new instances were not launched immediately. At 13:30, Amazon reported that new instances were launching normally again and that the affected EBS volumes were being recovered (re-mirroring), which was causing increased latency for those volumes.
Amazon Web Services is reporting another service outage this morning for some customers of its EC2 cloud computing service. Amazon has reported connectivity issues in its US-East-1 availability zone, the same zone which was hit by an outage earlier this month.
The problems began at about 10:45 a.m. Eastern time, and were confirmed by Amazon a short time later. “We can confirm network connectivity issues for some EC2 instances in a single Availability Zone in the US-EAST-1 region,” Amazon reported in its Service Health Dashboard. “Customers may be experiencing impaired read/write access to their EBS (Elastic Block Storage) volumes. New instance launches are also delayed. We are applying mitigations to address the connectivity issues … and connectivity is beginning to recover.” dotCloud also reported downtime due to the AWS problems.
UPDATE: As of 12:30 p.m. Eastern, Amazon reports progress. “Connectivity has been restored to the affected subset of EC2 instances and EBS volumes in the single Availability Zone in the US-EAST-1 region. New instance launches are completing normally. Some of the affected EBS volumes are still re-mirroring causing increased IO latency for those volumes.”
Amazon experienced an outage June 15 in its US-East-1 availability zone that was triggered by a series of failures in the power infrastructure in a northern Virginia data center, including the failure of a generator cooling fan while the facility was on emergency power. The downtime affected AWS customers including Heroku, Pinterest, Quora and HootSuite, along with a host of smaller sites.
Today’s problems seem to have affected fewer customers than the June 15 incident. One service reporting availability problems was the AppFog platform. “More AWS outages this morning (EC2, RDS, EBS), attempting to work around as best as we can,” the company reported on its Twitter feed. “Sorry for any inconvenience this has caused.”
It’s not clear whether the smaller number of visible customer problems were due to the issue being more limited, or whether companies impacted by the incident two weeks ago have since opted to extend their infrastructure across additional EC2 availability zones, as recommended by Amazon.
Today’s incident was the fourth in the last 14 months for the US-East-1 availability zone, which is Amazon’s oldest availability zone and resides in a data center in Ashburn, Virginia. The US-East-1 zone also had downtime in April 2011 and another less serious incident in March.
29-06-2012, 19:01 #2
Who knows, maybe it was the hairy hand of Ubuntu ...
29-06-2012, 22:36 #3
Welcome to the Jungle!
29-06-2012, 23:11 #4
Welcome to the Jungle!
30-06-2012, 02:07 #5
It was Google, diverting customers to the Google Compute Engine announced this week...
30-06-2012, 09:43 #6
And there's more:
Amazon Data Center Loses Power During Storm
An Amazon Web Services data center in northern Virginia lost power Friday night, causing extended downtime for services including Netflix, Heroku, Pinterest, Instagram and many others. The incident occurred as a powerful electrical storm struck the Washington, D.C. area, leaving as many as 1.5 million residents without power.
The data center in Ashburn, Virginia that hosts the US-East-1 region lost power for about 30 minutes, but customers were affected for a longer period as Amazon worked to recover virtual machine instances. “We can confirm that a large number of instances in a single Availability Zone have lost power due to electrical storms in the area,” Amazon reported at 8:30 pm Pacific time. An update 20 minutes later said that “power has been restored to the impacted Availability Zone and we are working to bring impacted instances and volumes back online.”
By 1:42 AM Pacific time, Amazon reported that it had “recovered the majority of EC2 instances and are continuing to work to recover the remaining EBS (Elastic Block Storage) volumes.”
Latest in Series of Outages
The outage marked the second time this month that the Amazon data center hosting the US-East-1 availability zone lost power during a utility outage. Major data centers are equipped with large backup generators to maintain power during utility outages, but the Amazon facility was apparently unable to make the switch to backup power.
Amazon experienced an outage June 15 in its US-East-1 availability zone that was triggered by a series of failures in the power infrastructure, including the failure of a generator cooling fan while the facility was on emergency power. The same data center also experienced problems early Friday, when customers experienced connectivity problems.
Even Netflix Impacted
The latest outage was unusual in that it affected Netflix, a marquee customer for Amazon Web Services that is known to spread its resources across multiple AWS availability zones, a strategy that allows cloud users to route around problems at a single data center. Netflix has remained online through past AWS outages affecting a single availability zone.
The Washington area was hit by powerful storms late Friday that left two people dead and more than 1.5 million residents without power. Dominion Power’s outage map showed that sporadic outages continued to affect the Ashburn area. Although the storm was intense, there were no immediate reports of other data centers in the region losing power. Ashburn is one of the busiest data center hubs in the country, and home to key infrastructure for dozens of providers and hundreds of Internet services.
30-06-2012, 10:13 #7
What’s interesting is that Netflix seems to have multi-region redundancy built in, but ran into issues with Elastic Load Balancing, the portion of Amazon’s service that tells web page requests which servers to fetch content from, connecting user requests to functioning instances.
AWS Power Outage Questions Reliability Of Public Cloud - Forbes
These guys aren't all that. They're "leaders" because they're alone in the market. I want to see what becomes of this company once the solutions from Microsoft, HP, Google and Rackspace go into production.
8:21 PM PDT We are investigating connectivity issues for a number of instances in the US-EAST-1 Region.
8:31 PM PDT We are investigating elevated error rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone.
8:40 PM PDT We can confirm that a large number of instances in a single Availability Zone have lost power due to electrical storms in the area. We are actively working to restore power.
8:49 PM PDT Power has been restored to the impacted Availability Zone and we are working to bring impacted instances and volumes back online.
9:20 PM PDT We are continuing to work to bring the instances and volumes back online. In addition, EC2 and EBS APIs are currently experiencing elevated error rates.
9:54 PM PDT EC2 and EBS APIs are once again operating normally. We are continuing to recover impacted instances and volumes.
10:36 PM PDT We continue to bring impacted instances and volumes back online. As a result of the power outage, some EBS volumes may have inconsistent data. As we bring volumes back online, any affected volumes will have their status in the "Status Checks" column in the Volume list in the AWS console listed as "Impaired." If your instances or volumes are not available, please login to the AWS Management Console and perform the following steps:
1) Navigate to your EBS volumes. If your volume was affected and has been brought back online, the "Status Checks" column in the Volume list in the console will be listed as "Impaired."
2) You can use the console to re-enable IO by clicking on "Enable Volume IO" in the volume detail section.
3) We recommend you verify the consistency of your data by using a tool such as fsck or chkdsk.
4) If your instance is unresponsive, depending on your operating system, resuming IO may return the instance to service.
5) If your instance still remains unresponsive after resuming IO, we recommend you reboot the instance from within the Management Console.
More information is available at: http://docs.amazonwebservices.com/AW...me-status.html
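For readers scripting the recovery rather than clicking through the console, the numbered steps above can be sketched as the command sequence below. This is an assumption-laden sketch: the AWS CLI shown here postdates this thread (the original instructions use the web console), and `vol-123` / `i-456` style IDs are placeholders. The subcommands `describe-volume-status`, `enable-volume-io` and `reboot-instances` are today's equivalents of the console actions described.

```python
def ebs_recovery_commands(volume_id, instance_id=None):
    """Return the AWS CLI equivalents (assumed; the original steps use
    the AWS console) of the EBS recovery steps above, in order."""
    cmds = [
        # 1) check whether the volume is flagged "impaired"
        f"aws ec2 describe-volume-status --volume-ids {volume_id}",
        # 2) re-enable IO on the impaired volume ("Enable Volume IO")
        f"aws ec2 enable-volume-io --volume-id {volume_id}",
        # 3) verify data consistency from inside the instance
        "# on the instance: fsck <device> (Linux) or chkdsk (Windows)",
    ]
    if instance_id:
        # 5) reboot only if the instance stays unresponsive after IO resumes
        cmds.append(f"aws ec2 reboot-instances --instance-ids {instance_id}")
    return cmds
```

Step 4 (resuming IO may itself return the instance to service) needs no command of its own, which is why only the conditional reboot follows the consistency check.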
11:19 PM PDT
We continue to make progress in recovering affected instances and volumes. Approximately 50% of impacted instances and 33% of impacted volumes have been recovered.
Jun 30, 12:15 AM PDT We continue to make steady progress recovering impacted instances and volumes. Elastic Load Balancers were also impacted by this event. ELBs are still experiencing delays in provisioning load balancers and in making updates to DNS records.
Jun 30, 12:37 AM PDT ELB is currently experiencing delayed provisioning and propagation of changes made in API requests. As a result, when you make a call to the ELB API to register instances, the registration request may take some time to process. As a result, when you use the DescribeInstanceHealth call for your ELB, the state may be inaccurately reflected at that time. To ensure your load balancer is routing traffic properly, it is best to get the IP addresses of the ELB's DNS name (via dig, etc.) then try your request on each IP address. We are working as fast as possible to get provisioning and the API latencies back to normal range.
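The dig-then-try-each-IP workaround Amazon describes can be sketched in Python: resolve every address behind the load balancer's DNS name, then issue a request to each IP directly, passing the original name in the Host header so the backend routes the request correctly. This is a minimal illustration, not Amazon tooling; the DNS name and port are placeholders you would substitute for your own ELB.

```python
import http.client
import socket

def check_elb_ips(elb_dns_name, port=80, path="/"):
    """Probe each IP behind a load balancer's DNS name individually
    (the `dig` + per-IP-request workaround from the status update).
    Returns a dict mapping each IP to an HTTP status or an error."""
    ips = sorted({ai[4][0] for ai in socket.getaddrinfo(
        elb_dns_name, port, type=socket.SOCK_STREAM)})
    results = {}
    for ip in ips:
        conn = http.client.HTTPConnection(ip, port, timeout=3)
        try:
            # Keep the original name in the Host header.
            conn.request("HEAD", path, headers={"Host": elb_dns_name})
            results[ip] = conn.getresponse().status
        except OSError as exc:
            results[ip] = repr(exc)
        finally:
            conn.close()
    return results
```

An IP that answers is healthy even while `DescribeInstanceHealth` is lagging; an IP that times out is likely one still pointing into the impaired zone.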
Jun 30, 1:42 AM PDT We have now recovered the majority of EC2 instances and are continuing to work to recover the remaining EBS volumes. ELBs continue to experience delays in propagating new changes.
Jun 30, 3:04 AM PDT We have now recovered the majority of EC2 instances and EBS volumes. We are still working to recover the remaining instances, volumes and ELBs.
Jun 30, 4:42 AM PDT We are continuing to work to recover the remaining EC2 instances, EBS volumes and ELBs.
Last edited by 5ms; 30-06-2012 at 10:28.
01-07-2012, 10:46 #8
UPDATE: While most Amazon customers recovered within several hours, a number of prominent services were offline for much longer. The photo-sharing service Instagram was unavailable until about noon Pacific time Saturday, more than 15 hours after the incident began. Cloud infrastructure provider Heroku, which runs its platform atop AWS, reported 8 hours of downtime for some services. Adrian Cockcroft, the Director of Architecture at Netflix, said the problem was a failure of Amazon’s Elastic Load Balancing service. “We only lost hardware in one zone, we replicate data over three,” Cockcroft tweeted. “Problem was traffic routing was broken across all zones.”
03-07-2012, 12:42 #9
Multiple Generator Failures Caused Amazon Cloud Outage
Amazon Web Services says that the repeated failure of multiple generators in a single data center caused last Friday night’s power outage, which led to downtime for Netflix, Instagram and many other popular web sites. The generators in this facility failed to operate properly during two utility outages over a short period Friday evening when the site lost utility power, depleting the emergency power in the uninterruptible power supply (UPS) systems.
Amazon said the data center outage affected a small percentage of its operations, but was exacerbated by problems with systems that allow customers to spread workloads across multiple data centers. The company apologized for the outage and outlined the steps it will take to address the problems and prevent a recurrence.
The generator failures in the June 29 incident came just two weeks after a June 14 outage that was caused by a series of problems with generators and electrical switching equipment.
Just 7 Percent of Instances Affected
Amazon said the incident affected only one availability zone within its US-East-1 region, and that only 7 percent of instances were offline. It did not identify the location of the data center, but said it was one of 10 facilities serving the US-East-1 region.
When the UPS units ran out of power at 8:04 p.m., the data center was left without power. Shortly afterward, Amazon staffers were able to manually start the generators, and power was restored at 8:24 p.m. Although the servers lost power for only 20 minutes, recovery took much longer. “The vast majority of these instances came back online between 11:15pm PDT and just after midnight,” Amazon said in its incident report.
Amazon said a bug in its Elastic Load Balancing (ELB) system prevented customers from quickly shifting workloads to other availability zones. This had the effect of magnifying the impact of the outage, as customers that normally use more than one availability zone to improve their reliability (such as Netflix) were unable to shift capacity.
Amazon: We Tested & Maintained The Generators
Amazon said the generators and electrical switching equipment that failed were all the same brand and all installed in late 2010 and early 2011, and had been tested regularly and rigorously maintained. “The generators and electrical equipment in this datacenter are less than two years old, maintained by manufacturer representatives to manufacturer standards, and tested weekly. In addition, these generators operated flawlessly, once brought online Friday night, for just over 30 hours until utility power was restored to this datacenter. The equipment will be repaired, recertified by the manufacturer, and retested at full load onsite or it will be replaced entirely.”
In the meantime, Amazon said it would adjust several settings in the process that switches the electrical load to the generators, making it easier to transfer power in the event the generators start slowly or experience uneven power quality as they come online. The company will also have additional staff available to start the generators manually if needed.
Amazon also addressed why the power outage was so widely felt, even though it apparently affected just 7 percent of virtual machine instances in the US-East-1 region.
Though the resources in this datacenter … represent a single-digit percentage of the total resources in the US East-1 Region, there was significant impact to many customers. The impact manifested in two forms. The first was the unavailability of instances and volumes running in the affected datacenter. This kind of impact was limited to the affected Availability Zone. Other Availability Zones in the US East-1 Region continued functioning normally. The second form of impact was degradation of service “control planes” which allow customers to take action and create, remove, or change resources across the Region. While control planes aren’t required for the ongoing use of resources, they are particularly useful in outages where customers are trying to react to the loss of resources in one Availability Zone by moving to another.
Load Balancing Bug Limited Workload Shifts
The incident report provides extensive details on the outage’s impact on control planes for its EC2 compute service, Elastic Block Storage (EBS) services and Relational Database Service (RDS). Of particular interest is Amazon’s explanation of the issues affecting its Elastic Load Balancing (ELB) service. The ELB service is important because it is widely used to improve customer reliability, allowing them to shift capacity between different availability zones, an important strategy in preserving uptime when a single data center experiences problems. Here’s a key excerpt from Amazon’s incident report regarding the issues with ELB on the June 29 outage.
During the disruption this past Friday night, the control plane (which encompasses calls to add a new ELB, scale an ELB, add EC2 instances to an ELB, and remove traffic from ELBs) began performing traffic shifts to account for the loss of load balancers in the affected Availability Zone. As the power and systems returned, a large number of ELBs came up in a state which triggered a bug we hadn’t seen before. The bug caused the ELB control plane to attempt to scale these ELBs to larger ELB instance sizes. This resulted in a sudden flood of requests which began to backlog the control plane. At the same time, customers began launching new EC2 instances to replace capacity lost in the impacted Availability Zone, requesting the instances be added to existing load balancers in the other zones. These requests further increased the ELB control plane backlog. Because the ELB control plane currently manages requests for the US East-1 Region through a shared queue, it fell increasingly behind in processing these requests; and pretty soon, these requests started taking a very long time to complete.
While direct impact was limited to those ELBs which had failed in the power-affected datacenter and hadn’t yet had their traffic shifted, the ELB service’s inability to quickly process new requests delayed recovery for many customers who were replacing lost EC2 capacity by launching new instances in other Availability Zones. For multi-Availability Zone ELBs, if a client attempted to connect to an ELB in a healthy Availability Zone, it succeeded. If a client attempted to connect to an ELB in the impacted Availability Zone and didn’t retry using one of the alternate IP addresses returned, it would fail to connect until the backlogged traffic shift occurred and it issued a new DNS query. As mentioned, many modern web browsers perform multiple attempts when given multiple IP addresses; but many clients, especially game consoles and other consumer electronics, only use one IP address returned from the DNS query.
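The client behavior contrasted in that last sentence, trying every address DNS returns instead of giving up after the first, can be sketched as follows (a minimal illustration, not Amazon's or any browser's actual code):

```python
import socket

def connect_any(host, port, timeout=3.0):
    """Try every address DNS returns for host:port, in order.
    Returns the first successfully connected socket, or raises the
    last error. Multi-attempt clients like browsers behave this way;
    a client that uses only the first IP fails outright if that
    address points at an impaired Availability Zone."""
    last_err = None
    for family, socktype, proto, _, addr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(addr)
            return sock  # first reachable address wins
        except OSError as exc:
            last_err = exc
            sock.close()
    raise last_err or OSError(f"no addresses for {host}")
```

This is exactly why game consoles and other single-IP clients kept failing until the backlogged traffic shift completed and a fresh DNS query returned healthy addresses.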
Last edited by 5ms; 03-07-2012 at 12:44.
03-07-2012, 15:46 #10
They outsourced it to DRL, and that's what came of it...