11-03-2016, 14:14 #1
[EN] Facebook: the rack is the server
Timothy Prickett Morgan
March 10, 2016
Hyperscalers have hundreds of millions to more than a billion users, which requires infrastructure on a vast scale. Those hyperscalers that are not running applications on behalf of others on public cloud slices have it easier than their cloudy peers because they have a relatively small number of applications to support. They can therefore keep the types of machines in their datacenters to a minimum, which keeps per-unit and operational costs low.
You might be surprised to know how few different servers it takes to run social media juggernaut Facebook, the driving force behind the Open Compute Project that was founded five years ago to shake up server, storage, and datacenter design and its supply chain. And it is perhaps even more surprising to see that the number of different types of systems deployed by Facebook is going down, not up.
Facebook, like other hyperscalers, deploys its infrastructure at the rack level and software is designed to make the fullest use possible of the compute, storage, and networking shared across the rack to run specific workloads. At the Open Compute Summit this week in San Jose, Facebook representatives and manufacturers of Open Compute gear were showing off some new hardware toys, which we will cover separately. Facebook, interestingly, gave a peek inside the current configurations of its racks, which are used to support web, database, and various storage workloads.
At the moment, Facebook has six different configurations that it rolls into its datacenters, which are characterized by its workloads and which are based on two different servers – code-named “Leopard” and “Yosemite” – and its “Wedge” top of rack switch. Here is the lineup:
The servers and storage servers in the racks are configured with different amounts of main memory, disk, and flash and depending on the workload have adjacent “Knox” Open Vault disk arrays for additional capacity. Rather than changing the form factors of the servers to accommodate more or less storage, Facebook keeps the nodes the same form factor for its Open Rack sleds and adjusts the storage using local bays and Open Vault bays. In a very real sense, the rack is the server for Facebook and its peers.
The Type I rack at Facebook is used to run its web services front end, which hosts its HipHop virtual machine and PHP stack. At the moment, this web front end has 30 servers per rack, and it can be based on either Leopard two-socket server nodes based on “Haswell” Xeon E5 v3 processors or on Yosemite quad-node sleds based on single-socket Xeon D processors. With the custom 16-core Xeon D chip that Intel created in conjunction with Facebook, each sled can have 64 cores. With Xeon E5 processors, the highest Facebook could drive that is 36 cores, and that would be using top-bin Xeon E5 parts that cost three times as much as middle SKU chips, and even more compared to the Xeon Ds. Facebook configures 32 GB per server node on the web service racks, and the Yosemite sleds have a better memory-to-core ratio and probably cost a lot less, too, which is why we think Facebook is probably not adding a lot of Leopard machines for the web tier right now. Yosemite, which we detailed here, was designed explicitly for this kind of work to drive up density and drive down cost. (We will be looking at the performance of Yosemite and Facebook’s new “Lightning” all-flash storage arrays separately.) The web tier nodes have a 500 GB disk drive each, and as you can see, it only takes one Wedge switch to link the nodes together and two power shelves to feed all the gear. As you can also see, the Type I rack has some empty bays for further expansion should a slightly different workload come along that needs more compute or storage or power.
The Type II rack, whatever it was, has been retired.
The next size up is the Type III rack, which is aimed at supporting the MySQL databases that underpin the Facebook PHP application stack. This rack is based on the Leopard two-socket sleds, which have 256 GB of main memory each (which Facebook characterizes as a high amount, although we would not call it that until you are pushing 768 GB). The server nodes each have a 128 GB microSATA drive plus two high I/O flash drives that come in at 3.2 TB each. (Facebook did not say which drive it uses.) There are two power shelves and one Wedge switch to glue it all together and again plenty of room to expand the rack with more compute and storage if necessary for workloads other than MySQL.
The Type IV rack at Facebook is configured for Hadoop data warehousing storage and analytics. Two years ago Facebook had over 25,000 nodes dedicated to Hadoop, and this has probably at least doubled since that time. The Hadoop racks have 18 Leopard servers, each configured with a RAID disk controller and linking up to nine Knox Open Vault storage shelves. Each Knox storage shelf has two bays of disk drives, each supporting 15 3.5-inch SATA drives. While the industry has moved on to 6 TB and 8 TB drives, Facebook is still using 4 TB drives, for a total of 120 TB per Knox array. Facebook partitions each of the nine Knox storage arrays into two slices, with 15 drives allocated to each of the 18 Leopard nodes. Each Leopard node has 128 GB of main memory (which Facebook calls a medium configuration) and has 60 TB of disk across those 15 spindles. We do not know what processors Facebook is using for Hadoop, but it seems likely that it is keeping close to parity between processor core count per node and the number of spindles attached to it. As you can see from the diagram, there is not a lot of empty space in this rack.
The Type V rack is used for Facebook’s “Haystack” object storage, which is used to house exabytes – yes, exabytes – of photos and videos, and it dials up the number of Knox arrays and down the number of Leopard servers within the Open Rack. The Leopard servers have a low 32 GB of memory per node, but each of the dozen nodes in the rack is allocated an entire Knox Open Vault array, yielding 30 4 TB drives per node for 120 TB total capacity. We are amazed that Facebook has not moved to fatter disk drives for Haystack, but when you buy in bulk, you can probably stay off the top-end parts. This time last year, Jason Taylor, vice president of infrastructure foundation at Facebook, told The Next Platform that users were uploading 40 PB of photos per day at Facebook, and with its new push into streaming video, the rate of capacity expansion here must be enormous.
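The per-node storage figures quoted for the Type IV and Type V racks follow from simple arithmetic on the Knox shelf layout. The sketch below reproduces them using only the numbers stated above; the function and constant names are illustrative and not anything Facebook publishes.

```python
# Figures are taken from the article; all names are illustrative only.
DRIVES_PER_BAY = 15   # 3.5-inch SATA drives per Knox bay
BAYS_PER_KNOX = 2     # bays per Knox Open Vault shelf
DRIVE_TB = 4          # drive size Facebook was using at the time


def storage_per_node(knox_shelves: int, nodes: int) -> tuple[int, int]:
    """Return (drives per node, raw TB per node) for one rack."""
    total_drives = knox_shelves * BAYS_PER_KNOX * DRIVES_PER_BAY
    drives_per_node = total_drives // nodes
    return drives_per_node, drives_per_node * DRIVE_TB


# Type IV (Hadoop): nine Knox shelves shared by 18 Leopard nodes.
hadoop = storage_per_node(knox_shelves=9, nodes=18)
# Type V (Haystack): one whole Knox shelf for each of 12 Leopard nodes.
haystack = storage_per_node(knox_shelves=12, nodes=12)

print("Hadoop   (drives, TB) per node:", hadoop)
print("Haystack (drives, TB) per node:", haystack)
```

These are raw capacities; usable capacity after RAID or replication would be lower.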
The Type VI rack is for heavy cache applications like the Facebook News Feed, ad serving, and search. There are no Knox disk arrays in this setup, but each of the 30 Leopard servers has 256 GB of main memory (a high configuration) and a 2 TB disk drive (a midsized one in Facebook’s categorization). We do not know this for certain, but the rack looks like a high compute/high memory setup designed to accelerate fast access to data, and it probably has a high core count per CPU. You might wonder why the News Feed is backed by disk instead of flash, but as Bobby Johnson, the creator of Facebook’s Haystack object storage, explained in a contributed article recently, the size of the News Feed data quickly outgrew the size of the flash and they had to move to disk. Obviously, with flash drives now pushing 10 TB, size is not the issue, but cost still is.
The final Facebook rack is actually not a single rack, but one of Facebook’s triplet Open Rack setups crammed with a total of six Leopard servers and 48 Knox storage arrays for implementing Facebook’s cold storage. Here’s what it looks like:
Each Leopard server has 128 GB of memory and has 240 drives in total attached to it for a total of 960 TB of capacity. The way the power management and erasure coding work on Facebook’s cold storage, data is spread across one drive per storage shelf per rack per server, and at any given time, only sixteen drives can be fired up and accessing data. The other drives in the rack are spun down and sitting quietly, waiting for an access. This allows Facebook to do 1 exabyte of storage in a power envelope of about 1.5 megawatts.
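The cold storage numbers can be checked the same way. The sketch below uses the figures given for the triplet (48 Knox arrays, six Leopard servers, 30 drives of 4 TB per array) and assumes a decimal exabyte (1,000,000 TB) and raw capacity with no erasure-coding overhead; those two assumptions are ours, not Facebook's.

```python
import math

# Figures from the article.
KNOX_ARRAYS = 48        # Knox arrays per triplet Open Rack setup
DRIVES_PER_KNOX = 30    # two bays of 15 drives each
SERVERS = 6             # Leopard servers per triplet
DRIVE_TB = 4

# Assumptions for illustration: decimal units, raw capacity only.
EXABYTE_TB = 1_000_000
POWER_ENVELOPE_W = 1_500_000  # "about 1.5 megawatts"

drives_per_server = KNOX_ARRAYS * DRIVES_PER_KNOX // SERVERS
tb_per_server = drives_per_server * DRIVE_TB
tb_per_triplet = tb_per_server * SERVERS

# Number of triplets needed to hold one raw exabyte, rounded up.
triplets_per_exabyte = math.ceil(EXABYTE_TB / tb_per_triplet)
# Power budget per terabyte implied by the quoted envelope.
watts_per_tb = POWER_ENVELOPE_W / EXABYTE_TB

print(drives_per_server, tb_per_server, tb_per_triplet)
print(triplets_per_exabyte, watts_per_tb)
```

Under these assumptions the quoted envelope works out to roughly 1.5 watts per raw terabyte stored.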
Last edited by 5ms; 11-03-2016 at 14:21.
11-03-2016, 16:18 #2
Facebook Data Centers: Huge Scale at Low Power Density
Yevgeniy Sverdlik
March 11, 2016
When Facebook was launching the first data center it designed and built on its own, the first of now several Facebook data centers in rural Oregon, Jason Taylor, the company’s VP of infrastructure, expected at least a little fallout from the new power distribution design that was deployed there.
Instead of a centralized UPS plant in a separate room behind the doors of the main data hall, Facebook had battery cabinets sitting side by side with IT racks, ready to push 48V DC power to the servers at a moment’s notice. What made him nervous was that the servers needed 12V AC power, and the mechanism that switched between the two different combinations of voltage and current had to work like a Swiss watch if you didn’t want to fry some gear.
“I would have expected at least some fallout,” Taylor said in an interview. But the system was tested many times in Prineville in the first couple of years – both intentionally and unintentionally – and, eventually, he learned to stop worrying and [insert the rest of the cliché].
Facebook later open sourced some of the innovations in data center design that were introduced in Prineville through the Open Compute Project, an initiative started by the social networking giant to bring some open source software ethos to IT hardware, power, and cooling infrastructure.
Better Efficiency Through Disaggregation
One of the interesting aspects of Facebook data center designs is that the company has been able to scale tremendously without increasing power density. Many data center industry experts predicted several years ago that the overall amount of power per rack would grow in data centers – a forecast that for the most part has not materialized.
“Rather than targeting 20kW per rack or 15kW per rack, we actually targeted about 5.5kW per rack,” Taylor said. “We understood that the low power density on racks was just fine.”
One big reason Facebook has been able to keep its data centers low-density is that its infrastructure and software teams have been willing to completely rethink their methods on a regular basis. This, coupled with advances in processors and networking technology, has resulted in new levels of efficiency that enabled Facebook to do more with less.
One of the most powerful concepts that resulted from this kind of rethinking is disaggregation, or looking at an individual component of a switch or a server as the basic infrastructure building block – be it CPU, memory, disk, or a NIC – not the entire box.
Disaggregation in Action
An example that demonstrates just how powerful disaggregation can be is the way the backend infrastructure that populates a Facebook user’s news feed is set up. Until sometime last year, Multifeed, the name of the news feed backend, consisted of uniform servers, each with the same amount of memory and CPU capacity.
The query engine that pulls data for the news feed, called Aggregator, uses a lot of CPU power. The storage layer it pulls data from keeps it in memory, so it can be delivered faster. This layer is called Leaf, and it taxes memory quite heavily.
The previous version of a Multifeed rack contained 20 servers, each running both Aggregator and Leaf. To keep up with user growth, Facebook engineers continued adding servers and eventually realized that while CPUs on those servers were being heavily utilized, a lot of the memory capacity was sitting idle.
To fix the inefficiency, they redesigned the way Multifeed works – the way the backend infrastructure was set up and the way the software used it. They designed separate servers for Aggregator and Leaf functions, the former with lots of compute, and the latter with lots of memory.
This resulted in a 40 percent efficiency improvement in the way Multifeed used CPU and memory resources. The infrastructure went from a CPU-to-RAM ratio of 20:20 to 20:5 or 20:4 – a 75 to 80 percent reduction in the amount of memory that needs to be deployed.
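The memory saving implied by those ratios is easy to verify: holding the CPU side at 20 units, dropping the RAM side from 20 to 5 is a 75 percent cut, and dropping it to 4 is an 80 percent cut. A minimal check, with a helper name of our own choosing:

```python
def memory_reduction(before: int, after: int) -> float:
    """Fraction of memory saved when going from `before` to `after` units."""
    return (before - after) / before


# CPU stays at 20 units; RAM goes from 20 units to 5 or 4.
low = memory_reduction(20, 5)
high = memory_reduction(20, 4)

print(f"{low:.0%} to {high:.0%} less memory deployed")
```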
Network – the Great Enabler
According to Taylor, this Leaf-Aggregator model, which is now also used for search and many other services, would not have been possible without the huge increases in network bandwidth Facebook has been able to enjoy.
“A lot of the most interesting stuff that’s happening in software at large scale is really being driven by the network,” he said. “We’re able to make these large long-term software bets on the network.”
Today, servers and switches in Facebook data centers are interconnected with 40-Gig links – up from 1 Gig links from the top-of-rack switch to the server just six years ago. New Facebook data centers being built today will use 100-Gig connectivity, thanks to the latest Wedge 100 switch the company designed and announced earlier this year.
“As of January of next year, everything will be 100 Gig,” Taylor said.
With that amount of bandwidth, having memory next to CPU is becoming less and less important. You can split the components and optimize for each individual one, without compromises.
“Locality is starting to become a thing of the past,” Taylor said. “The trend in networking over the last six years is too big to ignore.”
Disaggregation Keeps Density Down
Disaggregation has also helped keep overall power density in Facebook data centers at bay.
Some compute-heavy racks, such as the ones populated with web servers, can be between 10kW and 12kW per rack. Others, such as the ones packed with storage servers, can be about 4.5kW per rack.
As long as the overall facility averages out to about 5.5kW per rack, it works, Taylor said.
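To get a feel for what that average implies, the sketch below works out what share of racks could be compute-heavy if a facility held only the two rack types quoted above. This two-type mix is a simplifying assumption of ours; a real facility has more rack types, and Taylor gives no such breakdown.

```python
def compute_rack_share(target_kw: float, compute_kw: float, storage_kw: float) -> float:
    """Share of compute racks so that the blended average hits target_kw.

    Solves share * compute_kw + (1 - share) * storage_kw = target_kw.
    """
    return (target_kw - storage_kw) / (compute_kw - storage_kw)


TARGET_KW = 5.5    # facility-wide average quoted by Taylor
STORAGE_KW = 4.5   # storage-heavy rack

# Compute-heavy racks are quoted at 10 kW to 12 kW.
share_at_10 = compute_rack_share(TARGET_KW, 10.0, STORAGE_KW)
share_at_12 = compute_rack_share(TARGET_KW, 12.0, STORAGE_KW)

print(f"{share_at_12:.1%} to {share_at_10:.1%} of racks could be compute-heavy")
```

In this simplified mix, only somewhere between roughly one in eight and one in five racks could be compute-heavy before the facility average exceeds 5.5kW.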
One of the disaggregation extremes Facebook has gone to recently is designing storage servers specifically for rarely accessed user content, such as old photos, and designing separate facilities next to its primary data centers optimized just for those servers.
The “cold storage racks are unbelievably cold,” Taylor said, referring to the amount of power they consume. They are at 1 to 1.5kW per rack, he said.
As a result, it now takes 75 percent less energy to store and serve photos people dig out of their archives to post on a Thursday with a #tbt tag than it did when those photos were stored in the primary data centers.
As it looks for greater and greater efficiency, Facebook continues to re-examine and refine the way it designs software and the infrastructure that software runs on.
The concept of disaggregation has played a huge role in helping the company scale its infrastructure and increase its capacity without increasing the amount of power it requires, but disaggregation at that scale would not have been possible without the rapid progress in data center networking technology in recent years.