  1. #1
    WHT-BR Top Member
    Join Date: Dec 2010
    Posts: 15,042

    OVH: 1 free month for VPS hit by the Ceph storage incident

    12+ hours of downtime.



    Octave Klaba / Oles @olesovhcom 14 hours ago
    If your VPS is still down, please reboot it.


    Vesta Vuurwerk @VestaCoupons 14 hours ago
    @olesovhcom Any compensations for the downtime?


    Octave Klaba / Oles @olesovhcom 14 hours ago
    free month of course for all 5000 VPS ! sorry again for the downtime !
    Last edited by 5ms; 02-10-2016 at 10:26.

  2. #2
    WHT-BR Top Member
    Join Date: Dec 2010
    Posts: 15,042

    FS#20636 — VPS Cloud 2016 - GRA Ceph

    Details
    We have an incident on the Ceph cluster at GRA for VPS Cloud. The Ceph team is investigating the problem.


    Comment by OVH - Saturday, 01 October 2016, 10:47AM

    Because of a misbehaving storage node, we are experiencing degraded performance.


    Comment by OVH - Saturday, 01 October 2016, 12:23PM

    We lost one OSD and, for an unknown reason, the whole cluster slowed down. The Ceph team is working on it.

    No ETA.

    Comment by OVH - Saturday, 01 October 2016, 13:13PM

    Our team is still on it; the cluster is still locked, and we are trying to unlock it.

    Comment by OVH - Saturday, 01 October 2016, 16:56PM

    We are still trying to debug what is locking the Ceph cluster. Some hex dumps have been done, without success so far.
    In parallel, we are trying to restore some backups.

    Comment by OVH - Saturday, 01 October 2016, 17:20PM

    To summarize: we had one failing HDD. After removing its OSD, 17 objects were left unfound in Ceph after recovery. Seven of them are in 6 PGs which we can neither query nor tell to lose the objects. We have tried to force the operations manually but did not succeed. We are still looking for a solution to unblock those PGs.
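    By way of illustration, the "query or tell them to lose the objects" step maps onto standard Ceph commands. Here is a minimal sketch of that inspection, not OVH's actual procedure; the PG id is a hypothetical example and the final, data-losing command is left commented out:

        # Minimal sketch: inspect a blocked PG and, as a last resort, tell Ceph
        # to give up on its unfound objects. The PG id is hypothetical.
        import json
        import subprocess

        PGID = "3.1a7"  # hypothetical placement group id

        def ceph(*args):
            """Run a ceph CLI command and return its stdout as text."""
            return subprocess.run(["ceph", *args], capture_output=True,
                                  text=True, check=True).stdout

        # Cluster-wide health, including the "N unfound" lines.
        print(ceph("health", "detail"))

        # Per-PG view: recovery state and the list of missing objects.
        query = json.loads(ceph("pg", PGID, "query"))
        print(query.get("state"))
        missing = json.loads(ceph("pg", PGID, "list_missing"))
        print(json.dumps(missing.get("objects", []), indent=2))

        # Last resort, loses data: revert (or delete) the unfound objects so
        # the PG can go active again.
        # print(ceph("pg", PGID, "mark_unfound_lost", "revert"))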

    Comment by OVH - Saturday, 01 October 2016, 18:12PM

    Good evening,

    We have a technical issue on one of our Ceph clusters. It is a cluster of about 200 disks of 2 TB each. All data is copied 3 times across the disks, so the total capacity of the cluster is 120 TB.

    One of the disks failed and we removed it. For a reason we are still investigating, the cluster considers that some data is missing (17 objects) and has put itself into a safe mode: it stops serving I/O for all the data.

    The teams have been working on the issue since 6:30 this morning. We have tried a lot of things since then to restart the cluster, but it is not working yet. Some of the devops have now been on the problem for 11.5 hours, so we decided to take a one-hour break and get back to work at 7:00 PM. We will start with a telepresence meeting between all the teams working on the issue in Wroclaw (Poland), Roubaix and Brest, to check whether we have missed something and to see what else we can try.

    The data is there. There is no reason we should not find a way to get the cluster running again.

    We know the impact is significant. About 5,000 VPS use this block storage, which is supposed to keep working even when things fail. The deal is simple: it has to work even if we lose 66.66% of the hosts, and here we lost a single hard disk. What we are seeing is that Ceph is very sensitive and at the same time does not replicate the data correctly. It is quite likely that we will choose another technology in the future; once the data is back up, we will write the post-mortem and see whether we should move to another technology for the block storage. But that is not where we are today. Right now, the only thing that matters is getting the cluster back up.

    We are sincerely sorry for this long outage. Rest assured of one thing: we will keep working until your VPS are up. Unfortunately I cannot give you an ETA yet, because we have not yet found the root cause of the problem.

    Regards,
    Octave
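    As an aside on the sizing quoted above, the replication arithmetic works out roughly as follows. A minimal sketch; only the disk count, disk size and replica count come from the thread:

        # Replication arithmetic for the figures quoted in the letter.
        disks = 200       # "about 200 hard disks" (a later update counts
                          # 24 servers x 12 disks = 288)
        disk_tb = 2       # 2 TB each
        replicas = 3      # every object is stored on 3 different disks

        raw_tb = disks * disk_tb
        usable_tb = raw_tb / replicas
        print(f"raw capacity:            {raw_tb} TB")         # 400 TB
        print(f"usable with 3x replicas: {usable_tb:.0f} TB")  # ~133 TB
        # The letter quotes ~120 TB managed, in the same ballpark once
        # free-space headroom for recovery and rebalancing is set aside.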

    Comment by OVH - Saturday, 01 October 2016, 21:27PM

    From War Room.

    Good evening,
    We recommend that you do not reboot your VPS.

    The impacted Ceph cluster runs on 24 servers with 12 disks each. We have a problem on 6 of the 24 hosts. That is why not all 5,000 VPS are down: Ceph keeps working on the remaining 18 servers. In total, 6 Placement Groups (PGs) are failed out of 10,533 in the cluster. We are trying to estimate the number of VPS actually impacted, because each piece of data is written across many PGs.

    The problem is the following: we have 17 objects that are written on the disks as version 696135'83055746, while in Ceph's metadata the version is 696135'83055747. So Ceph tries to read the file, decides the version is wrong, and stops. We have forced Ceph to forget these files, but it does not work: Ceph blocks because it cannot open the PG (Placement Group).

    We think the faulty disk flapped, and that Ceph probably wrote the file as version 696135'83055747 on that disk but did not write the version 696135'83055746 files on the other disks. Probably a bug in the version we run in production, 0.94.4. The latest version is 0.94.9.

    We have a 4-point action plan:

    1) We are going to patch the object import/export tool to read the file as version 696135'83055746 and force it to be written as version 696135'83055747. One team is working on this.

    2) We are looking into why Ceph blocks when we try to start it while forcing it to forget the 17 objects. Another team is working on this.

    3) In case none of that works, we will try to start Ceph without the 6 PGs that are causing the problem. The other PGs on those 6 hosts will be UP. This means we would lose some data. We cannot estimate who would be impacted and who would not, but most of the VMs would probably come back, others would probably come back after an fsck inside the VM, and a few VMs might not boot at all. That is why we have just started a local backup on every Ceph server; it will take about 6 hours. If, and I repeat, if we go with this strategy, we will then work from that backup to recover the data of any VMs that lost data, but that would be hours or days later. We hope we will not have to use this strategy. Another team is working on it, just in case.

    4) In parallel, we have started restoring VPS data from backups, but it is extremely slow. We do not have a restore estimate yet. Another team is working on this.

    That is the 4-point action plan.

    Regards,
    Octave
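    To make the "how many VPS are really impacted" question concrete, here is a back-of-the-envelope sketch. Only the 6 and 10,533 PG figures come from the post; the image size and the 4 MB RBD object size are illustrative assumptions:

        # Chance that a single VPS image touches at least one of the 6 failed
        # PGs, assuming its RBD image is striped into 4 MB objects that map
        # roughly uniformly onto the cluster's PGs.
        failed_pgs = 6
        total_pgs = 10_533

        image_gb = 50                   # hypothetical VPS disk size
        objects = image_gb * 1024 // 4  # 4 MB objects -> 12,800 per image

        p_clean = (1 - failed_pgs / total_pgs) ** objects
        print(f"objects per image:             {objects}")
        print(f"P(image avoids all 6 PGs):     {p_clean:.4%}")
        print(f"P(image touches a failed PG):  {1 - p_clean:.4%}")
        # With these assumptions nearly every image has some objects in the
        # failed PGs, which helps explain why so many of the 5,000 VPS saw
        # I/O errors even though only 6 of 10,533 PGs were down.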


    (cont)
    Last edited by 5ms; 02-10-2016 at 10:26.

  3. #3
    WHT-BR Top Member
    Join Date: Dec 2010
    Posts: 15,042
    ----------------------EN version--------------------------

    From War Room.

    Good evening,
    We recommend not rebooting your VPS.

    The Ceph cluster is based on 24 servers, each with 12 disks. We have the issue on 6 of the 24 servers. That is why not all 5,000 VPS are down: some of them are down, and the others continue to work on the remaining 18 servers. We have an issue on 6 Placement Groups (PGs) out of 10,533 PGs in this cluster. A small part of the data is failed, but it can impact a lot of VPS. We are trying to estimate how many VPS are really impacted; it depends on 1) whether the VPS is using the 6 PGs and 2) whether the VPS wants to read/write the data that is in those 6 PGs.

    The main issue is the version of the 17 objects. The objects are stored as version 696135'83055746, while the version in Ceph's metadata is 696135'83055747. So Ceph does not want to start. We have forced it to forget the bad files, but it does not work: Ceph freezes.

    We think Ceph was trying to write the data to the failed disk. It did so on the failed disk and updated the metadata, but it did not do so on the other disks. Probably a bug in the version we run, 0.94.4. The latest version is 0.94.9.

    4-point action plan:

    1) We are patching the tool that can import/export the objects, to force the version of the objects to 696135'83055747. One team is working on this.

    2) We are looking at how to restart Ceph without the 17 objects. Another team is working on that.

    3) In case nothing works, we will start Ceph without the 6 PGs, but that means losing some data. We do not know which data would be lost. That is why we have launched a local backup of the data, in case we have to work from it and restore the lost data later. It is the "last resort" strategy. One team is working on that.

    4) We are preparing the recovery of the data, but we are talking about 120 TB. It will be slow, and the backup is 24 hours old. One team is working on that.

    Regards,
    Octave

    Comment by OVH - Saturday, 01 October 2016, 22:09PM

    We are starting to get some good results with solution 2).

    On 4), we are trying to identify which VMs we can restore.

    Comment by OVH - Saturday, 01 October 2016, 22:46PM

    2) One object is still preventing Ceph from restarting. The verdict in a few minutes.

    In total we lose 17 objects of 4-8 MB each, so roughly 100 MB out of 120 TB: 0.08%.
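    As a quick check on those numbers, a minimal sketch using only the figures quoted above:

        # 17 objects of 4-8 MB each, in a 120 TB pool.
        objects, obj_mb_min, obj_mb_max = 17, 4, 8
        pool_mb = 120 * 1_000_000         # 120 TB expressed in MB

        lost_min = objects * obj_mb_min   # 68 MB
        lost_max = objects * obj_mb_max   # 136 MB
        print(f"lost data: {lost_min}-{lost_max} MB (~100 MB)")
        print(f"fraction of the pool: {lost_max / pool_mb:.6%}")
        # By raw volume the loss is a minute fraction of the pool; the real
        # impact came from the blocked PGs, not from the amount of data lost.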

    Comment by OVH - Saturday, 01 October 2016, 22:55PM

    2) Ceph has a command to make it forget an object. It worked for 10 of the objects but not for the other 7. To get the command to go through on those 7, we are forcing the delete command in a "while true" loop and, at the same time, trying to restart the cluster in another loop. With the two loops running together, the magic is that at some point Ceph starts up while forgetting the object.

    5 of the 6 PGs are UP now. 1 PG is still failed, along with the last object.

    !!!
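    For the curious, the "two loops at the same time" trick reads roughly like the sketch below. This is a reconstruction from the description above, not OVH's actual script; the PG id and the OSD service name are hypothetical:

        # Reconstruction of the "two while-true loops" trick described above.
        import subprocess
        import threading
        import time

        PGID = "3.1a7"               # hypothetical stuck placement group
        OSD_SERVICE = "ceph-osd@42"  # hypothetical OSD service holding that PG
        stop = threading.Event()

        def loop(cmd, interval=5.0):
            """Re-run a command until stop is set; failures are expected."""
            while not stop.is_set():
                subprocess.run(cmd, check=False)
                time.sleep(interval)

        # Loop 1: keep asking Ceph to give up on the unfound objects in the PG.
        forget = threading.Thread(
            target=loop, args=(["ceph", "pg", PGID, "mark_unfound_lost", "revert"],))
        # Loop 2: keep restarting the OSD so the PG re-peers while loop 1 runs.
        restart = threading.Thread(
            target=loop, args=(["systemctl", "restart", OSD_SERVICE],))

        forget.start()
        restart.start()
        try:
            # Watch "ceph -s" by hand; interrupt once the PG goes active.
            while True:
                time.sleep(10)
        except KeyboardInterrupt:
            stop.set()
            forget.join()
            restart.join()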


    Comment by OVH - Saturday, 01 October 2016, 23:05PM

    2) There is a lot of read/write activity on the Ceph cluster now that 5 of the PGs are UP. We still cannot get it to forget the last object.

    Comment by OVH - Saturday, 01 October 2016, 23:12PM

    UP !

    Comment by OVH - Saturday, 01 October 2016, 23:14PM

    If your VPS is still down, please reboot it.


    http://travaux.ovh.net/?do=details&id=20636

  4. #4
    WHT-BR Top Member
    Join Date: Dec 2010
    Posts: 15,042
    "The teams has been working on the issue from 6 :30am"

    Correction: 17 hours of downtime.

    Summary: a Ceph cluster of ~290 HDDs of 2 TB, with each file (object) replicated on 3 HDDs. After one (1) HDD failed and was removed from the cluster configuration, serious problems arose that compromised data integrity in part of the system and caused the outage.

    The Ceph cluster is based on 24 servers, each with 12 disks. We have the issue on 6 of the 24 servers. That is why not all 5,000 VPS are down: some of them are down, and the others continue to work on the remaining 18 servers. We have an issue on 6 Placement Groups (PGs) out of 10,533 PGs in this cluster. A small part of the data is failed, but it can impact a lot of VPS.

    The main issue is the version of the 17 objects. The objects are stored as version 696135'83055746, while the version in Ceph's metadata is 696135'83055747. So Ceph does not want to start. We have forced it to forget the bad files, but it does not work: Ceph freezes.
    Last edited by 5ms; 02-10-2016 at 11:02.

  5. #5
    WHT-BR Top Member
    Join Date: Dec 2010
    Posts: 15,042
    "We know the impact is important. About 5000 VPS are using this
    Ceph cluster. The deal is simple : it has to work even if we lose 66.66%
    of the hosts. Here we lost 1 hard disk and it’s broken. Once the data
    are UP, we will write the post-mortem and see if we can find out an
    another technology for the block storage.
    "

  6. #6
    WHT-BR Top Member
    Join Date: Dec 2010
    Posts: 15,042
    My R$ 0.01:

    1. Even allowing for the heat of the moment, the size of the damage was never made clear. It jumps from "degraded performance" in the first instant to a "locked" cluster, then to a statement that not all 5,000 VPS are down (the cluster is based on 24 servers, each with 12 disks; the issue was on 6 servers, not 24, which is why not all 5,000 VPS were down), then to the possibility of a full restore of *the entire* cluster (we are talking about 120 TB [x3]; it will be slow, and the backup is 24 hours old), and finally to the concession that "only" 0.08% of the stored data was lost (permanently?). After 17 hours of effort and improvised hacks, they manage to get Ceph (the cluster?) started. When the dust settles, the supreme leader announces 1 free month for all 5,000 VPS (from which one gathers that, contrary to what had been stated, everyone was affected). The cause of the incident? To be verified (probably a bug in the version they run, 0.94.4; the latest version is 0.94.9).

    2. The promise to look for other software rings hollow. This is far from the first time Ceph storage has taken OVH VPS down for long hours, and even so the version in production was not the latest. Besides, Octave has other commitments (#OVH is nominated for the Super User Award for the OpenStack Summit) and is not going to kick over the bucket of milk he has been drawing from the European technology-incentive agencies. The shortcomings of Ceph and of Inktank support are no secret (Ceph Apocalypse, 05-02-2015, on the storage unavailability of Friday November 21st to 26th, 2014):

    Try as they may to be redundant, OpenStack and Ceph architecturally force non-obvious single points of failure. Ceph is a nice transition away from traditional storage, but at the end of the day it is just a different implementation of the same thing. SAN and Software Defined Storage are all single points of failure when used for virtual machine storage. OpenStack enabled us to scale massively with commodity hardware, but proved unsustainable operationally speaking.
    Last edited by 5ms; 02-10-2016 at 13:01.

  7. #7
    WHT-BR Top Member
    Join Date: Dec 2010
    Posts: 15,042

    Bamboo reliability

    FS#18746 — Storage cluster, Strasbourg

    Details
    We are having a problem with one server in the cluster; some data is not accessible to the VPS.

    Comment by OVH - Tuesday, 21 June 2016, 12:10PM

    0.04% of the cluster's data is inaccessible.
    If a VPS tries to access that data, the cluster will not return it, which generates errors on the VPS.

    We are continuing to do what is needed to make the whole cluster accessible again.

    Comment by OVH - Tuesday, 21 June 2016, 14:33PM

    A solution has been tested in a test environment; we are going to apply it in production.

    Comment by OVH - Tuesday, 21 June 2016, 14:51PM

    The whole cluster is now accessible again; more information to come.

    Comment by OVH - Wednesday, 22 June 2016, 10:01AM

    We have manually exported the two impacted PGs and imported them onto two new OSDs.
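    The usual tool for this kind of manual PG surgery is ceph-objectstore-tool. A minimal sketch of what such an export/import looks like, not the exact commands OVH ran; the PG id, OSD ids and paths are hypothetical, and the affected OSD daemons must be stopped while it runs:

        # Sketch of a manual PG export/import with ceph-objectstore-tool.
        import subprocess

        PGID = "4.2b"               # hypothetical impacted placement group
        SRC_OSD, DST_OSD = 17, 203  # hypothetical source and destination OSDs
        DUMP = f"/var/tmp/pg-{PGID}.export"

        def objectstore_tool(osd_id, *args):
            return subprocess.run(
                ["ceph-objectstore-tool",
                 "--data-path", f"/var/lib/ceph/osd/ceph-{osd_id}",
                 "--journal-path", f"/var/lib/ceph/osd/ceph-{osd_id}/journal",
                 *args],
                check=True)

        # 1) Export the PG from an OSD that still holds a readable copy.
        objectstore_tool(SRC_OSD, "--pgid", PGID, "--op", "export", "--file", DUMP)
        # 2) Import it into a fresh OSD, then restart the OSDs and let Ceph
        #    re-peer and backfill the PG.
        objectstore_tool(DST_OSD, "--op", "import", "--file", DUMP)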

  8. #8
    WHT-BR Top Member
    Join Date: Dec 2010
    Posts: 15,042

    Ceph as a Service - €150 excl. VAT / 2TB / month

    "We know the impact is important ... The deal is simple : it has to work even if we lose 66.66% of the hosts. Here we lost 1 hard disk and it’s broken. Once the data are UP, we will write the post-mortem and see if we can find out an another technology for the block storage." Octave Klaba







    Distributed:

    Ceph is a massively scalable distributed storage system. It relies on a placement algorithm that stores data across your storage cluster. Through this automated process, data is replicated three times across 3 separate failure domains, greatly increasing the availability of your data from anywhere, at any time.
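    As a toy illustration of the idea (this is not CRUSH itself, just the flavour of deterministic placement across failure domains; zone names, host names and the PG count are made up):

        # Toy hash-based placement: object -> PG -> one host per failure domain.
        import hashlib

        ZONES = {                        # hypothetical failure domains
            "zone-a": ["host-a1", "host-a2"],
            "zone-b": ["host-b1", "host-b2"],
            "zone-c": ["host-c1", "host-c2"],
        }
        PG_COUNT = 128                   # hypothetical number of PGs

        def place(object_name: str):
            """Map an object to a PG, then pick one host per zone for 3 copies."""
            digest = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
            pg = digest % PG_COUNT
            return pg, [hosts[pg % len(hosts)] for hosts in ZONES.values()]

        print(place("rbd_data.1234.000000000001"))
        # Losing one host, or even one whole zone, leaves 2 of the 3 copies intact.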


    Features
    Storage Space: Dedicated 2 TB (the disks are not shared!)
    Datacenters: Strasbourg - SBG1 (Central Europe); Beauharnois - BHS1 (North America); Roubaix - RBX1 (West Europe)
    Ceph Version: Infernalis (9.2.1)
    Replication: x3, on different availability zones
    Erasure Coding: Soon
    IOPS: 6k IOPS (random-write, 1 job, 4k)
    Protocols: RBD (block device); soon: RadosGateway (object storage)
    Live resize (without downtime): Soon!
    Security: Configure ACLs directly in Manager v6 / API
    Compatibility: Any server on the OVH network (VPS, Dedicated Server, Kimsufi, SoYouStart, Dedicated Cloud, Public Cloud)
    Price: €149.99 excl. VAT / month

    https://www.runabove.com/ceph-as-a-service.xml
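    For reference, the "6k IOPS (random-write, 1 job, 4k)" line is the kind of figure typically produced by an fio run along these lines. A sketch, not OVH's benchmark; the target path is a placeholder for an attached RBD volume and the queue depth is an assumption:

        # How a "random-write, 1 job, 4k" IOPS figure is usually measured.
        import subprocess

        TARGET = "/dev/rbd0"   # hypothetical mapped RBD volume (will be written to!)

        subprocess.run([
            "fio",
            "--name=randwrite-4k",
            f"--filename={TARGET}",
            "--rw=randwrite",      # random writes
            "--bs=4k",             # 4 KiB blocks
            "--numjobs=1",         # "1 job"
            "--iodepth=32",        # queue depth not stated in the spec; assumed
            "--ioengine=libaio",
            "--direct=1",          # bypass the page cache
            "--runtime=60",
            "--time_based",
            "--group_reporting",
        ], check=True)
        # The "write: iops=" line in fio's output is the advertised number.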
    Last edited by 5ms; 02-10-2016 at 17:31.

  9. #9
    WHT-BR Top Member
    Join Date: Dec 2010
    Posts: 15,042

    Scality Partners With OVH For Large-Scale Cloud Storage

    Adam Armstrong
    August 30th, 2016


    Today at VMworld 2016 in Las Vegas, Scality announced that it has partnered with the world's third-largest hosting and Internet infrastructure provider, OVH, to deliver a joint solution designed for large-scale storage needs. Scality will now be able to run its software on OVH servers, which the two companies state is ideal for Hosted Private Cloud environments. OVH customers will now be able to run Scality RING and build petabyte-scale storage pools, giving them both efficient hosting and efficient storage.



    Founded in 1999, OVH specializes in cloud and Internet infrastructure. OVH has gained nearly a million customers since its founding, including Scality itself, which uses OVH for its private cloud. OVH attributes its success to innovation that lets it keep full control over the supply chain, from server manufacturing and in-house maintenance of its infrastructure right down to customer assistance.

    Hosted private clouds are becoming more and more attractive to many organizations. These clouds provide the cost-effectiveness and scale of public clouds with the security of private clouds. In fact, IDC states that within two years nearly 50% of all enterprises will create or use industry cloud platforms for their own innovations or source others. The combination of Scality and OVH will provide companies with a high-performance, highly reliable petabyte-scale storage solution.

    This joint solution is designed to tackle two specific needs:

    • Scale-Out NAS for 200TB usable and more
    • S3 Object Storage for 200TB usable and more


    http://www.storagereview.com/scality..._cloud_storage

  10. #10
    Member
    Join Date: Jan 2015
    Posts: 25
    Do other popular names like Vultr and DO also use Ceph? It would be in these products: https://www.vultr.com/pricing/blockstorage/ and https://www.digitalocean.com/communi...n-digitalocean

    I know DreamHost bets heavily on Ceph. And Linode does not yet have anything similar to what the younger competitors mentioned above offer.
