Online.net's C14 targets long-term storage (3-5 years), serving a purpose similar to Amazon's Glacier.
With C14, the user first creates a temporary staging area into which the desired files are copied. The transfer to C14 is started via the interface, or after a set period elapses, during which more files can be added to the same area. All files are encrypted with unique keys, which may or may not be kept at Online.net. If you choose not to store the keys there, which is advisable, losing a key makes the archived batch encrypted with it unrecoverable. To make the files available again, they must be transferred from C14 back to a working area. Different durability and recovery-time guarantees are offered. Charges are billed to the customer's Online.net account. No servers at Online.net are required. All files written to C14 at another data center are transferred to DC-4.
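The key-management consequence described above (no stored key, no recoverable data) can be illustrated with a toy sketch. The SHA-256 keystream construction below is an illustration only, not C14's actual cipher, and is not production cryptography:

```python
import hashlib
import os

def keystream(key: bytes, length: int) -> bytes:
    # Expand the key into a pseudo-random stream (toy construction).
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def xor_crypt(data: bytes, key: bytes) -> bytes:
    # XOR with the keystream; applying it twice with the same key decrypts.
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

archive_key = os.urandom(32)                    # unique key per archived batch
plaintext = b"files staged in the temporary area"
ciphertext = xor_crypt(plaintext, archive_key)
recovered = xor_crypt(ciphertext, archive_key)  # only possible with the key
```

Without `archive_key`, the ciphertext is just noise, which is exactly why losing the key means losing the batch.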
by Robin Harris
21 June, 2013
We want our data protected from device failures. When there is a failure we want to get our data back quickly. And we want to pay as little as possible for the protection and the restore. How?
Recent research by hyper-scale system managers – mostly Microsoft and Facebook engineers and scientists – has tried to answer that question. And the answers are way better than what we have today.
In XORing Elephants: Novel Erasure Codes for Big Data, authors Maheswaran Sathiamoorthy of USC, Alexandros G. Dimakis, Megasthenis Asteris, and Dimitris Papailiopoulos of [then] USC and now UT Austin, and Dhruba Borthakur, Ramkumar Vadali and Scott Chen of Facebook delve deeply into the issue.
RAID repair problem
Standard Reed-Solomon erasure codes aren’t well suited to hyperscale Hadoop workloads. Repair is a big issue, in both time and overhead.
Reed-Solomon codes suffer from an efficiency versus repair trade-off. Standard RAID5 or RAID6 needs a wide stripe for capacity efficiency, but a wide stripe makes the time to repair a failed disk much longer. Data has to be transferred from every other disk in the stripe, using up scarce disk I/Os and internal storage bandwidth.
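The trade-off can be put in rough numbers. The figures below (4 TB disks, the two stripe widths) are illustrative assumptions, not measurements from any particular system:

```python
DISK_TB = 4.0  # assumed capacity of the failed disk

def repair_read_tb(stripe_width: int) -> float:
    # Rebuilding one failed disk in a Reed-Solomon/RAID stripe means reading
    # every surviving member of the stripe.
    return (stripe_width - 1) * DISK_TB

narrow = repair_read_tb(5)   # e.g. a 4+1 RAID5 stripe: 16 TB read per rebuild
wide = repair_read_tb(14)    # e.g. an RS(10,4) stripe: 52 TB read per rebuild
```

Wider stripes are more capacity-efficient, but every extra stripe member adds a full disk's worth of repair reads.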
To avoid this problem a decade ago scale out storage dispensed with Reed-Solomon codes in favor of simple double or triple replication. Because they used inexpensive disks rather than expensive arrays this was economical.
But exponential data growth has overwhelmed the ability of big web companies to build infrastructure fast enough and large enough to handle the tsunami of data. Something had to give. Triple replication was it.
But even triple replication isn’t enough for them. Since they can’t back up they need a level of redundancy that even RAID6 cannot approach.
These systems are now designed to survive the loss of up to four storage elements – disks, servers, nodes or even entire data centers – without losing any data. What is even more remarkable is that, as this paper demonstrates, these codes achieve this reliability with a capacity overhead of only 60%.
An optimal storage solution would not only be capacity efficient but also reduce network repair traffic. The authors developed a new family of erasure codes called Locally Repairable Codes, or LRCs, that are efficient to repair in both disk I/O and bandwidth. They implemented these codes in a new Hadoop module called HDFS-Xorbas and tested it on Amazon and within Facebook.
The LRC tests produced several key results:
Disk I/O and network traffic were reduced by half compared to RS codes.
The LRC required 14% more storage than RS, which is the information-theoretic minimum for the locality obtained.
Repair times were much lower thanks to the local repair codes.
Much greater reliability thanks to fast repairs.
Reduced network traffic makes them suitable for geographic distribution.
Here’s the table comparing replication, Reed-Solomon, and LRC. The (10, 6, 5) refers to data stripe blocks, parity blocks, and local redundancy blocks, respectively.
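The "local" part of an LRC can be sketched with plain XOR parities. The grouping below (two local groups of five data blocks, each with one XOR parity) is a simplified illustration, not the exact Xorbas construction, but it shows why repairing a single lost block needs only five reads instead of ten:

```python
import os
from functools import reduce

def xor_blocks(blocks):
    # Bytewise XOR of equal-length blocks.
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

BLOCK = 64
data = [os.urandom(BLOCK) for _ in range(10)]   # 10 data blocks in the stripe

# Two local groups of 5, each protected by its own XOR parity block.
group1, group2 = data[:5], data[5:]
local_p1 = xor_blocks(group1)
local_p2 = xor_blocks(group2)

# Suppose data[2] is lost: repair reads only its 4 group peers plus the
# local parity (5 blocks), instead of 10 blocks for a plain RS(10,4) repair.
peers = [b for i, b in enumerate(group1) if i != 2]
repaired = xor_blocks(peers + [local_p1])
```

Global Reed-Solomon parities would still sit on top of these local groups to handle multi-block failures; the local parities exist purely to make the common single-failure case cheap.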
The company set up the systems so that in each tray, only one hard drive could be running at any given time. With fewer disks running, the system can get by on less power and stay cool with fewer fans, while the task set out for the cold storage -- replenishing data to the hot storage that feeds users directly -- still runs fast enough.
Development: tray of hard drives
May 4, 2015
Two billion photos are shared daily on Facebook services. Many of these photos are important memories for the people on Facebook and it's our challenge to ensure we can preserve those memories as long as people want us to in a way that's as sustainable and efficient as possible. As the number of photos continued to grow each month, we saw an opportunity to achieve significant efficiencies in how we store and serve this content and decided to run with it. The goal was to make sure your #tbt photos from years past were just as accessible as the latest popular cat meme but took up less storage space and used less power. The older, and thus less popular, photos could be stored with a lower replication factor but only if we were able to keep an additional, highly durable copy somewhere else.
Instead of trying to utilize an existing solution — like massive tape libraries — to fit our use case, we challenged ourselves to revisit the entire stack top to bottom. We're lucky at Facebook — we're empowered to rethink existing systems and create new solutions to technological problems. With the freedom to build an end-to-end system entirely optimized for us, we decided to reimagine the conventional data center building itself, as well as the hardware and software within it. The result was a new storage-based data center built literally from the ground up, with servers that power on as needed, managed by intelligent software that constantly verifies and rebalances data to optimize durability. Two of these cold storage facilities have opened within the past year, as part of our data centers in Prineville, Oregon, and Forest City, North Carolina.
The full-stack approach to efficiency
Reducing operating power was a goal from the beginning. So, we built a new facility that used a relatively low amount of power but had lots of floor space. The data centers are equipped with less than one-sixth of the power available to our traditional data centers, and, when fully loaded, can support up to one exabyte (1,000 PB) per data hall. Since storage density generally increases as technology advances, this was the baseline from which we started. In other words, there's plenty of room to grow.
Since these facilities would not be serving live production data, we also removed all redundant electrical systems, including all uninterruptible power supplies (DCUPS) and power generators, increasing the efficiency even further.
Cold storage racks
To get the most data into this footprint, we needed high density, but we also needed to remain media-agnostic and low-frills. Using the theme that “less is more,” we started with the Open Vault OCP specification and made changes from there. The biggest change was allowing only one drive per tray to be powered on at a time. In fact, to ensure that a software bug doesn’t power all drives on by mistake and blow fuses in a data center, we updated the firmware in the drive controller to enforce this constraint. The machines could power up with no drives receiving any power whatsoever, leaving our software to then control their duty cycle.
Not having to power all those drives at the same time gave us even more room to increase efficiency. We reduced the number of fans per storage node from six to four, reduced the number of power shelves from three to one, and even reduced the number of power supplies in the shelf from seven to five. These server changes then meant that we could reduce the number of Open Rack bus bars from three to one.
After modifying Open Rack to support these power-related hardware tweaks, we were able to build racks with 2 PB storage (using 4 TB drives) and operate them at one-quarter the power usage of conventional storage servers.
We knew that building a completely new system from top to bottom would bring challenges. But some of them were extremely nontechnical and simply a side effect of our scale. For example, one of our test production runs hit a complete standstill when we realized that the data center personnel simply could not move the racks. Since these racks were a modification of the OpenVault system, we used the same rack castors that allowed us to easily roll the racks into place. But the inclusion of 480 4 TB drives drove the weight to over 1,100 kg, effectively crushing the rubber wheels. This was the first time we'd ever deployed a rack that heavy, and it wasn't something we'd originally baked into the development plan!
Making it work
With our efficiency measures in place, we turned our attention to designing the software to be flexible enough to support our cold storage needs. For example, the software needed to handle even the smallest power disruptions at any time, without the help of redundant generators or battery backups, while still retaining the integrity and durability of the data on disk.
We approached the design process with a couple of principles in mind. First, durability was a main concern, so we built it into every system layer. We wanted to eliminate single points of failure and provide an ability to recover the entire system with as few pieces as possible. In fact, the separate system that managed the metadata would be considered a “nice to have” service for data recovery in case of catastrophic failure. In other words, let the data describe itself enough to be able to assist in recovery. After all, cold storage was intended to be the last point of recovery in case of data loss in other systems.
Second, the hardware constraints required careful command batching and trading latency for efficiency. We were working with the assumption that power could go away at any time, and access to the physical disks themselves would be limited based on the on/off duty cycle derived from the drive’s mean time before failure.
This last part was especially relevant since we were using low-end commodity storage that was by no means enterprise-quality.
We also needed to build for the future. Too often, we’ve all seen systems begin to get bogged down as they grow and become more utilized. So, right from the beginning, we vowed to make a system that not only didn’t degrade with size but also would get better as it grew.
With the major system hardware and software decisions made, we still had several big unknowns to tackle: most notably, actual disk failure rates and reliability, and data center power stability, since cold storage would not include any battery backup. The easiest way to protect against hardware failure is to keep multiple copies of the data in different hardware failure domains. While that does work, it is fairly resource-intensive, and we knew we could do better. We wondered, “Can we store fewer than two copies of the same data and still protect against loss?”
Fortunately, with a technique called erasure coding, we can. Reed-Solomon error correction codes are a popular and highly effective method of breaking data into small pieces and easily detecting and correcting errors. As an example, if we take a 1 GB file and break it into 10 chunks of 100 MB each, then through Reed-Solomon coding we can generate an additional set of blocks, say four, that function much like parity bits. As a result, you can reconstruct the original file using any 10 of those final 14 blocks. So, as long as you store those 14 chunks in different failure domains, you have a statistically high chance of recovering your original data if one of those domains fails.
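The "any 10 of 14" property can be demonstrated with a small polynomial code over a prime field. Real deployments use Reed-Solomon arithmetic over GF(256) on byte data; the integer field below is a simplified stand-in, but the recovery property is the same:

```python
P = 2_147_483_647   # prime modulus (real systems use GF(256) byte arithmetic)
K, N = 10, 14       # 10 data chunks plus 4 parity chunks

def interpolate(points, x):
    # Evaluate at x the unique degree-(K-1) polynomial through `points`, mod P.
    total = 0
    for xi, yi in points:
        num = den = 1
        for xj, _ in points:
            if xj != xi:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

def encode(data):
    # Data chunks sit at x = 0..9; the 4 parity chunks are extra evaluations
    # of the same polynomial at x = 10..13.
    points = list(enumerate(data))
    return points + [(x, interpolate(points, x)) for x in range(K, N)]

def reconstruct(surviving, x):
    # Any K surviving chunks determine the polynomial, hence any lost chunk.
    return interpolate(surviving[:K], x)
```

Encoding 10 values yields 14 chunks; delete any 4 of them and `reconstruct` still recovers every original value from the 10 that remain.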
Picking the right number of initial and parity blocks took some investigation and modeling based on the specific drive’s failure characteristics. However, while we were confident our initial Reed-Solomon parameter selections matched current drive reliability, we knew this was a setting that had to be flexible. So we created a re-encoding service that could reoptimize the data to support the next generation of cheap storage media, whether it was more reliable or less so.
As a result, we can now store the backups for 1 GB of data in 1.4 GB of space — in previous models, those backups would take up larger amounts of data, housed across multiple disks. This process creates data efficiency while greatly increasing the durability of what's being stored. However, this is not enough. Knowing that there is always a high chance that data will get corrupted, we create, maintain, and recheck checksums constantly to validate integrity. We keep one copy of the checksums next to the data itself, so we can quickly verify the data and replicate it as fast as possible somewhere else if an error is detected.
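The checksum-next-to-the-data idea can be sketched directly. The record layout below is a hypothetical illustration, not Facebook's actual on-disk format:

```python
import hashlib

def write_record(payload: bytes) -> dict:
    # Keep the checksum right next to the data it covers.
    return {"data": payload, "sha256": hashlib.sha256(payload).hexdigest()}

def verify(record: dict) -> bool:
    # Recompute and compare; a mismatch would trigger re-replication elsewhere.
    return hashlib.sha256(record["data"]).hexdigest() == record["sha256"]

record = write_record(b"photo chunk bytes")
intact = verify(record)                  # True for a clean record
record["data"] = b"photo chunk byteZ"    # simulate silent corruption
corrupted = not verify(record)           # True once the data no longer matches
```

Colocating the checksum means a single read is enough to validate a chunk, with no round-trip to a separate metadata service.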
We've learned a lot running large storage systems over the years. So we were particularly concerned about what’s affectionately known as “bit rot,” where data becomes corrupted while completely idle and untouched. To tackle this, we built a background “anti-entropy” process that detects data aberrations by periodically scanning all data on all the drives and reporting any detected corruptions. Given the inexpensive drives we would be using, we calculated that we should complete a full scan of all drives every 30 days or so to ensure we would be able to re-create any lost data successfully.
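The 30-day scrub target implies a modest per-drive scan rate, which matters because each drive is only powered for a fraction of the time. The numbers below (4 TB drives, 15 drives per tray with one powered at a time) are back-of-the-envelope assumptions, not published figures:

```python
DRIVE_MB = 4_000_000        # 4 TB drive, in MB
WINDOW_S = 30 * 24 * 3600   # 30-day scrub window, in seconds
DRIVES_PER_TRAY = 15        # assumed tray size; only one drive on at a time

# Average rate needed if the drive could run continuously: ~1.5 MB/s.
avg_mb_s = DRIVE_MB / WINDOW_S

# Each drive gets at most 1/15 of the tray's powered time, so while it is
# on it must scan 15x faster: ~23 MB/s, still well within disk bandwidth.
powered_mb_s = avg_mb_s * DRIVES_PER_TRAY
```

Even under the duty-cycle constraint, a full scrub every 30 days leaves most of the powered-on time free for repair and write traffic.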
Once an error was found and reported, another process would take over to read enough data to reconstruct the missing pieces and write them to new drives elsewhere. This separates the detection and root-cause analysis of the failure from reconstructing and protecting the data at hand. As a result of doing repairs in this distributed fashion, we were able to reduce reconstruction from hours to minutes.
By separating the storage of low-traffic content from that of high-traffic content, we’ve been able to save energy and other resources while still serving data when requested. Our two cold storage data centers are currently protecting hundreds of petabytes of data, an amount that's increasing every day. This is an incredible start, but as we like to say at Facebook, we’re only 1 percent finished.
We still want to dig into the wide range of storage mediums, such as low-endurance flash and Blu-ray discs, while also looking at new ways to spread files across multiple data centers to further improve durability.
One recent change that came to light only after we launched production was the fact that most modern file systems were not built to handle many frequent mounts and unmounts in fairly short times. We quickly began to see errors that were hard to debug since they were far below our software layer. As a result, we migrated all drives to a “raw disk” configuration that didn't use a file system at all. Now, we had deep visibility into the full data flow and storage throughout the system, which allowed us to feel even more confident about our durability guarantees.
A large number of people in multiple disciplines throughout the entire Infrastructure team worked on the cold storage project, and thus contributed to this blog post in their own unique way. Some joined the project as part of their regular jobs, while others did it as a dedicated initiative. Together, their joint contributions demonstrated a true entrepreneurial spirit spread across four offices in two countries and two data centers.
Special thanks to David Du, Katie Krilanovich, Mohamed Fawzy, and Pradeep Mani for their input on this blog post.