Results 1 to 3 of 3
  1. #1
    WHT-BR Top Member
    Join Date
    Dec 2010

    [EN] Lessons From EMC VNX 5400 Failure at OVH

    The real root cause of the problem: a leakage of cooling liquid that has 'touched' the EMC array.

    Philippe Nicolas

    One of our articles already covered the recent OVH service failure in Europe and France, which occurred at one of their datacenters, P19 more precisely.

    Even if the failure involved a Dell EMC storage array, here a VNX 5400 deployed in 2012, it seems that the procedures and configurations around that array and its dependent services were limited and surely not enough to support such IT services and operations. The impact was huge, with 50,000 sites down, as this array stored many databases used by all these sites.

    OVH has published an update (available here in French) and indicated the real root cause of the problem: a leak of cooling liquid that 'touched' the EMC array, an array that should not have been present in that room at that time due to changes in the original computing room. The monitoring tool was not in operation, and this array was planned for replacement. So in the end this failure was the result of a series of bad steps combined with a real lack of luck. The problem is that the procedures were not updated during this period, even though the failure exposure was greater in that context.

    OVH tried a few things to restore the service, but honestly these actions appeared to be old school: a backup/restoration procedure, plus moving a replacement array with one-day-old data sets from a remote site to swap out the failed one.

    What is really surprising is the lack of a modern process to protect the service and keep it up and running.

    First, OVH used an RPO of 24 hours, as the backup procedure runs once a day and is sent remotely to a system a few hundred kilometers away. Even if this model seems good on paper, an RPO of 24 hours is not aligned with the goals and mission of current cloud service providers; users expect different protection models and surely a better RPO. Back to the report published by OVH: the restoration took many hours, and 24 hours later 76 servers were up and running on the 96 SSDs.
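To make that exposure concrete, here is a minimal sketch (with illustrative numbers chosen for the example, not figures from the OVH report): the worst-case data loss of any periodic protection scheme is roughly one full interval, which is exactly what separates a daily backup from near-CDP or a synchronous mirror.

```python
def worst_case_loss_hours(scheme: str) -> float:
    """Worst-case data loss (effective RPO) if the array dies just
    before the next protection cycle runs. Tier names and numbers
    are illustrative, not vendor or OVH specifications."""
    tiers = {
        "daily-backup": 24.0,     # what the failed P19 array relied on
        "hourly-snapshot": 1.0,
        "near-cdp": 1.0 / 60,     # journal shipped roughly every minute
        "sync-mirror": 0.0,       # a surviving mirror holds every write
    }
    return tiers[scheme]

print(worst_case_loss_hours("daily-backup"))  # 24.0
```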

    Some confusion seems to exist here: even though the array was configured in active/active mode, only one array was used, and protection against a full stop was not designed in.

    Historically, IT architects used to deploy a volume manager spanning at least two storage entities, with mirroring between arrays in addition to path redundancy. In that case, if an array fails, or a path, a power supply, a volume… servers continue to interact with the surviving mirror without any downtime. You then have plenty of time to swap the failed component and initiate a re-hydration of the new storage entity from the live one. It is just a simple design, normally the default one, but a very efficient one that has proven its robustness for decades.
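The mirroring behaviour described above can be sketched in a few lines (toy classes, purely illustrative; a real volume manager such as LVM or a host-based mirror works at the block layer): every write goes to both legs, and reads transparently fall back to the surviving leg when one array fails.

```python
class Array:
    """Toy stand-in for a storage array (hypothetical, for illustration)."""
    def __init__(self, name):
        self.name, self.blocks, self.failed = name, {}, False
    def write(self, lba, data):
        if not self.failed:
            self.blocks[lba] = data
    def read(self, lba):
        if self.failed:
            raise IOError(f"{self.name} is down")
        return self.blocks[lba]

class MirroredVolume:
    """Volume-manager-style mirror: writes go to every leg,
    reads fall back to a surviving leg if one array fails."""
    def __init__(self, a, b):
        self.legs = [a, b]
    def write(self, lba, data):
        for leg in self.legs:
            leg.write(lba, data)
    def read(self, lba):
        for leg in self.legs:
            try:
                return leg.read(lba)
            except IOError:
                continue
        raise IOError("all mirror legs failed")

vnx, spare = Array("vnx5400"), Array("spare")
vol = MirroredVolume(vnx, spare)
vol.write(0, b"customer-db-page")
vnx.failed = True          # simulate the coolant-induced shutdown
print(vol.read(0))         # served from the surviving leg
```

With this design the failed array can be replaced and re-synchronized from the live mirror without the hosts ever noticing.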

    Second, data protection technologies such as snapshots, replication and even CDP have existed for a long time, and for sensitive data these three approaches are mandatory. But of course, in the current case, the design requires a very small RPO, and something around near-CDP could be the right, minimum choice. The VNX array offers all the features needed to activate these data services, and architects can even consider relying on array-, network- or server-based data services.
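As a rough illustration of the CDP idea (a sketch only; real CDP operates at the block layer with driver or appliance support): every write is journaled with a timestamp, so the volume can be rolled back to any point in time, giving an RPO of seconds rather than a day.

```python
class CDPJournal:
    """Continuous-data-protection sketch: each write is recorded with a
    timestamp, so any past volume image can be reconstructed on demand."""
    def __init__(self):
        self.entries = []  # (timestamp, lba, data), appended in time order
    def record(self, ts, lba, data):
        self.entries.append((ts, lba, data))
    def rollback(self, ts):
        """Rebuild the volume image as of time `ts`."""
        image = {}
        for when, lba, data in self.entries:
            if when <= ts:
                image[lba] = data
        return image

j = CDPJournal()
j.record(1, 0, "v1")
j.record(2, 0, "v2")   # overwrite of block 0
j.record(3, 1, "x")
print(j.rollback(2))   # {0: 'v2'} -- state just before the write at t=3
```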

    Third, it appears that architects confused data protection with application availability. In other words, the two work together: if you have an application without data, it is useless, and if you have 'good' data without applications, your business is stopped as well. Again, a model that protects data in line with RPO goals, plus an RTO to maintain the service, is the basis of the design of data services and operations.

    We can illustrate this approach with three classical examples:

    • Imagine a source code server: the protection model must protect data frequently to avoid any loss of lines of code. As there is no direct revenue associated with it, you can choose a long restore time. In that case, it means a (very) short RPO and a variable, flexible (but long) RTO.
    • The second example is a static web site that ultimately doesn't change often, even if you wish to capture all changes each time they occur. But as the company's visibility rides on the Internet, you wish to restart the web server as soon as a failure occurs. Here it means a long RPO (days, weeks or even months) and a very short RTO.
    • And finally, a mix of the two for dynamic web sites and commerce-oriented applications with associated revenues: you don't wish to lose any transactions or disappoint users, but must maintain business in all circumstances. With 50,000 web sites, OVH should have implemented this model to satisfy users and protect its image.
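The three workload classes above can be summarized as a small RPO/RTO policy table (the numbers below are hypothetical targets chosen for the example, not SLAs from the article), from which a protection tier follows almost mechanically:

```python
# Illustrative RPO/RTO targets for the three workload classes above
# (hypothetical figures for the sketch, not OVH's actual SLAs).
POLICIES = {
    "source-code-server": {"rpo_minutes": 5,     "rto_hours": 24},
    "static-web-site":    {"rpo_minutes": 10080, "rto_hours": 0.25},  # RPO ~1 week
    "commerce-site":      {"rpo_minutes": 0,     "rto_hours": 0.25},
}

def protection_for(workload: str) -> str:
    """Pick the cheapest protection tier meeting the workload's RPO."""
    rpo = POLICIES[workload]["rpo_minutes"]
    if rpo == 0:
        return "synchronous mirroring"
    if rpo <= 15:
        return "near-CDP / frequent snapshots"
    return "daily backup"

print(protection_for("commerce-site"))  # synchronous mirroring
```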

    For OVH, architects designed a weak data protection mechanism and, in the end, completely forgot about application availability. It seems that no application clustering, automatic or manual, was set up. A big mistake.

    So this case raises several questions, among them:

    • Why is only one array, a single point of failure, used in this configuration?
    • Why is the protection of data so weak, with an RPO of 24 hours?
    • Why were technologies such as snapshots, replication and CDP not used, or badly used?
    • Why was a volume manager not configured?
    • Why were applications not protected with an automatic failover mechanism to make downtime transparent? And,
    • Did this configuration change its role during its lifetime, hosting more and more critical applications without the data protection procedure being updated?

    We hope that OVH will re-examine all procedures related to data protection and application availability, especially as they recently raised €400 million; it should be obvious to consider a serious new plan. Of course, we always hear the bad news and never the good, as failures become invisible when everything is perfectly designed. And a final point: OVH could publish RPO/RTO goals and actuals for every datacenter, service, application instance… It would help users decide which services to pick and subscribe to the ones they need and require.

    We invite OVH, and especially Octave Klaba, CTO, to demonstrate the lessons learned from this downtime.

  2. #2

    ICYMI: Breakdown of EMC VNX 5400 Storage Subsystem at OVH - 50,000+ web sites down

    Jean Jacques Maleval

    OVH Group revealed that one of its storage bays in Paris, France, an EMC VNX 5400, was unable to boot, affecting more than 50,000 of its customers' web sites.

    The French firm is one of the largest European hosting providers, with 20 data centers, more than one million customers and three million hosted web sites across 138 countries and four continents. Revenue was close to €400 million in 2016/2017. The company just raised €400 million to support its development.

    The company stated on its Web site that it "ensures stable and reliable product and service offerings to clients across all its brands."

    But what it encountered was a huge technical problem on June 29 at 4:30PM, when an incident appeared on one of its VNX 5400s in its Paris P19 data center, a system containing databases and unable to boot. This system comprises 96 SSDs configured with active/active technology on several physical bays. But these databases are only backed up once a day to another data center at another French site (RBX1) in Roubaix.

    The hosting firm stated that the EMC technology is not the source of the incident. Octave Klaba, OVH technical director, said: "Our data centers are not adapted for this type of incident. Only some rooms are especially prepared for this type of hosting, but not this bay, which explains the origin of the problem."

    OVH is working with the vendor to find a solution. Another VNX 5400 was urgently moved from Roubaix to Paris, but the hosting provider does not know how long it will take to restart the bay and recover the data. In the morning, 15% of the databases had been recovered in read-only mode. On June 30 at midnight, all the databases were operational, and OVH is currently studying the state of the storage bay before making the data accessible.

  3. #3

    What did OVH learn from 24-hour outage? Water and servers do not mix

    Quote: Originally posted by 5ms View Post
    Jean Jacques Maleval

    But what it encountered was a huge technical problem on June 29 at 4:30PM, when an incident appeared on one of its VNX 5400s in its Paris P19 data center, a system containing databases and unable to boot.

    Chris Mellor
    13 Jul 2017


    The failure took place around 7pm in OVH's P19 data centre on June 29. This was OVH's first data centre (opened in 2003).

    OVH has developed its own liquid cooling concept, which is used in the P19 facility. It has coolant circulating through the centre of server racks and other components to cool them down via component-level heat exchangers, hooked up to a top-of-rack water-block heat exchanger. The heated water is then cooled down by thermal exchange with ground water. This scheme saves electricity by avoiding the use of air conditioners.

    According to the incident log, P19 has some gear in the basement, making cooling via outside air problematic, hence the water-cooling development.

    OVH later bought several VNX 5400 arrays from EMC. The one in question had 96 SSDs in three chassis, 15 shelves of local disk, and the standard active-active pair of controllers. The host says: "This architecture is designed to ensure local availability of data and tolerate failures of both data controllers and disks."

    Since then it has developed the use of non-proprietary commodity arrays using Ceph and ZFS at Gravelines and is migrating off the proprietary gear.

    The incident account says: "At 6:48pm, Thursday, June 29, in Room 3 of the P19 datacenter, due to a crack on a soft plastic pipe in our water-cooling system, a coolant leak causes fluid to enter the system.

    "One of the two proprietary storage bays (racks) were not cooled by this method but were in close proximity. This had the direct consequence of the detection of an electrical fault leading to the complete shutdown of the bay."

    OVH admits installing them in the same room as the water-cooled servers was a mistake. "We made a mistake in judgment, because we should have given these storage facilities a maximum level of protection, as is the case at all of our sites."

    Fault on fault

    Then the crisis was compounded by a fault in the audio warning system. Probes able to detect liquid in a rack broadcast an audio message across the data centre. But an update to add multi-language support had failed and the technicians were alerted to the leak late, arriving 11 minutes after it happened.

    At 6:59pm they tried to restart the array. By 9:25 they hadn't succeeded and decided to carry on trying to restart the failed array (plan A) while (plan B) trying to restore its data to a second system using backups.

    Plan A

    Dell EMC support was called at 8pm and eventually restarted the array but it stopped 20 minutes later when a safety mechanism was triggered. So the OVH techs decided to fetch a third VNX 5400 from a Roubaix site and transfer the failed machine's disk drives to this new chassis, using its power modules and controllers.

    The system from Roubaix arrived at 4.30am with all the failed system's disks moved over by 6am. The system was fired up at 7am but, disaster, the data on the disks was still inaccessible. Dell EMC support was recontacted at 8am and an on-site visit arranged.

    Plan B

    A daily backup was the resource used for Plan B, OVH noting: "This is a global infrastructure backup, carried out as part of our Business Recovery Plan, and not the snapshots of the databases that our customers have access to in their customer space.

    "Restoring data does not only mean migrating backup data from cold storage to a free space of the shared hosting technical platform. It is about recreating the entire production environment."

    This meant it was necessary, in order to restore the data, to:

    • Find available space on existing servers at P19
    • Migrate a complete environment of support services (VMs running the databases, with their operating systems, their specific packages and configuration)
    • Migrate data to the new host infrastructure
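A restore at this scale is essentially an orchestration problem. The three steps above might be sketched as follows (hypothetical function and field names, purely illustrative of the per-site pipeline, not OVH's actual tooling):

```python
def restore_site(site, hosts):
    """Sketch of the per-site restore pipeline described above."""
    # Step 1: find available space on an existing server at P19
    host = next(h for h in hosts if h["free_gb"] >= site["size_gb"])
    # Step 2: clone a support-services VM (OS, packages, configuration)
    vm = {"host": host["name"], "template": "mysql-template",
          "site": site["name"]}
    host["free_gb"] -= site["size_gb"]
    # Step 3: migrate the backup data onto the new host infrastructure
    vm["data"] = f"restored:{site['name']}"
    return vm

hosts = [{"name": "p19-h1", "free_gb": 10}, {"name": "p19-h2", "free_gb": 500}]
vm = restore_site({"name": "shop42", "size_gb": 120}, hosts)
print(vm["host"])  # p19-h2
```

Scripting and repeating this loop 50,000 times, under pressure, is exactly the kind of procedure DR testing is supposed to rehearse in advance.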

    This process had been tested in principle but not at a 50,000-website scale. A procedure was scripted and, at 3am the next day, VM cloning from a source template started.

    At 9am, 20 per cent of the instances had been recovered. Hours passed and "at 23:40, the restoration of the [last] instance ends, and all users find a functional site, with the exception of a few users whose database was hosted on MySQL 5.1 instances and was restored under MySQL 5.5."


    It was obvious that disaster recovery procedures for the affected array were inadequate and, in the circumstances, OVH's technical support staff did a heroic job but should not have had to.

    The VNX array was in the wrong room but, even so, the failover arrangements for it were non-existent. Active DR planning and testing were not up to the job.

    Communication with affected users was criticised and OVH took this on the chin. "About the confusion that surrounded the origin of the incident – namely a coolant leak from our water-cooling system – we do our mea culpa."

    And what have we learned?

    1. Do not mix storage arrays and water
    2. Do full DR planning and testing for all critical system components
    3. Repeat at regular intervals as system components change
    4. Do not update critical system components unless the update procedure has been bullet-proof tested
    Last edited by 5ms; 28-07-2017 at 16:03.
