13-09-2016, 14:18 #1
[EN] How DreamHost Builds Its (OpenStack) Cloud
A roadmap for anyone thinking about a datacenter at home ...
This series of blog posts gives you a look from the datacenter, shows you the exact hardware your VM is running on and explains how we came to those choices. We are going to describe how we picked each major component of DreamHost’s OpenStack-based cloud architecture, describe the different choices available, and explain why we made each decision.
The beta period taught us that the requirements of a large object storage cluster and a cloud computing storage cluster were quite different. ... Our DreamCompute beta cluster had almost 10 times more storage than we needed, but the IO performance on it was subpar because the cluster had a low-end processor, RAM, and RAID card.
The racks we use in our data centers are 9 feet (58 rack units) tall. These racks are already installed so we had to use them. The racks have two 60-amp, 3-phase, 208-volt power strips installed. To give you an idea how much power that is, we can redundantly power about 250 60W bulbs (or 3,000 LED bulbs) per rack if we really want a bright data center.
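As a rough check on the light bulb comparison, the sketch below estimates the usable power of one strip. It assumes a unity power factor and an 80% continuous-load derating, neither of which is stated in the post, and the function name is illustrative only.

```python
import math

def three_phase_watts(volts: float, amps: float, derate: float = 0.8) -> float:
    """Usable power of one 3-phase circuit: sqrt(3) * V * A * derate.

    Assumes line-to-line voltage and a power factor of 1 (assumptions).
    """
    return math.sqrt(3) * volts * amps * derate

# "Redundantly" powered means the whole load must fit on a single strip.
strip_watts = three_phase_watts(208, 60)
max_60w_bulbs = int(strip_watts // 60)

# The post's own figures imply a per-LED wattage:
# 250 bulbs * 60 W spread across 3,000 LED bulbs.
implied_led_watts = 250 * 60 / 3000

print(f"{strip_watts:.0f} W per strip, up to {max_60w_bulbs} 60W bulbs")
print(f"implied LED bulb: {implied_led_watts} W")
```

Under these assumptions one strip carries a little over 17 kW, so the post's figure of about 250 bulbs (15 kW) leaves some headroom.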
03-10-2016, 14:52 #2
Selecting Hard Drives
October 1, 2016
To begin picking what drives we wanted to use in our new DreamCompute cluster, we first looked at the IO usage and feedback from the beta cluster. The amount of data moving around our beta cluster was very small. Despite the fact that we had over-provisioned and had very little disk activity in our beta cluster, we still received a lot of feedback requesting faster IO. There were three main reasons for the low throughput and high latency in the beta cluster.
The first reason was our choice of processor and amount of RAM. We used machines with the same specs we used for DreamObjects. This worked well there, where there is a lot of storage but it is not accessed very often. However, in DreamCompute we store much less data, but the data is comparatively much more active.
The second reason was density. Ceph functions better when you have fewer drives per machine. Recovery is also faster when you have smaller drives. With the original DreamCompute cluster, we had machines that contained 12 Ceph drives. Each drive was 3TB in size, for a total of 36TB of storage per node. This was too dense for our needs.
The third reason was the SAS expanders we used. We used “4 channel” RAID cards, meaning they can access four drives at a time. However, we had 14 drives attached (12 Ceph drives and two boot drives). This required us to use a “SAS expander”, a device that sits between the RAID card and the drives, acting as a traffic light. Unfortunately, the SAS expander we were using could only handle enough traffic to fully utilize two lanes. Imagine 14 cars traveling on one two-lane road. As long as the cars are parked most of the time, it’s fine. If they all want to drive at the same time, traffic gets slow.
For the new cluster, we wanted to remove the latency of a SAS expander and use the maximum number of drives that either the RAID card or motherboard supported. We also made sure that we had an ample amount of RAM and fast processors to avoid other hardware-related latency.
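The two-lane road analogy can be put into numbers with a small sketch. The 6 Gbit/s lane speed is an assumption for illustration (the post does not say what speed the beta cluster's lanes ran at), and the function name is hypothetical.

```python
def per_drive_gbps(active_drives: int, usable_lanes: int, lane_gbps: float) -> float:
    """Bandwidth each active drive gets when the usable lanes are shared evenly.

    A single drive can never exceed the speed of one lane.
    """
    if active_drives <= 0:
        raise ValueError("active_drives must be positive")
    total = usable_lanes * lane_gbps
    return min(lane_gbps, total / active_drives)

LANE_GBPS = 6.0  # assumed lane speed, not taken from the post

# Beta cluster: 14 drives behind an expander that can only fill two lanes.
busy = per_drive_gbps(14, 2, LANE_GBPS)
# Same expander when only one drive is active.
idle = per_drive_gbps(1, 2, LANE_GBPS)

print(f"all 14 drives busy: {busy:.2f} Gbit/s each")
print(f"one drive busy:     {idle:.2f} Gbit/s")
```

With every drive active, each one gets well under 1 Gbit/s in this model, which matches the post's point that the setup is only fine while most of the "cars" are parked.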
There are two broad categories of drives for both the server and consumer markets. There are traditional hard drives (HDDs), which magnetically store data on spinning platters, and solid state drives (SSDs), which store data in integrated memory circuits. SSDs are much faster and have a lower latency than HDDs because they don’t have to wait to get the data from a spinning disk. However, SSDs are currently substantially more expensive. From the feedback we had gotten from our customers, fast IO and low latency were very important, so we decided to go with SSDs!
There are three choices on how the SSD will interface with the server:
- The first is SATA. Most consumer desktops and laptops interface with their attached storage devices using the SATA interface. This is an improvement on the old PATA interface, which was also commonly a consumer interface. Historically, SATA hasn’t been used very often in data center environments as it doesn’t play very well with RAID cards. However, we won’t be putting these drives in a RAID array. SATA also has a max throughput of 6 Gbit/s, which is slower than our other options.
- The second is Serial Attached SCSI (SAS), which was developed from the more enterprise-focused SCSI (pronounced like “scuzzy”). SAS has many advantages, but the main benefit for us is its max throughput of 12 Gbit/s, double that of SATA. The major disadvantage of SAS is that it isn’t directly supported by any server chipsets. This means that you have to place it behind a RAID card, even if you don’t need any RAID functionality. SAS drives are also much more expensive because the SAS controllers they use are very expensive.
- The third is NVMe, an interface designed specifically for SSDs. NVMe drives are more than twice as fast as SATA or SAS SSDs, but they are just beginning to hit the market, so they are much more expensive than SATA or SAS.
All factors considered, we felt that SATA was the best choice. The drives were still four times faster than the drives in our original cluster, and each drive was only 1TB instead of 3TB. We could now measure recovery time after drives were added or failed in hours instead of days.
Another factor to consider with SSDs is how much information each memory cell will hold. The three options are one bit per cell (SLC), two bits per cell (MLC), or three bits per cell (TLC). The more information each cell is holding, the fewer times it can be rewritten in the lifetime of the cell, so durability decreases as data per cell increases. Because of the large price difference between MLC and SLC, manufacturers have also released a couple of intermediate options like eMLC (MLC with greater endurance and higher over-provisioning) and pSLC (pseudo SLC, which uses MLC but only writes one bit per cell). Based on testing in the original DreamCompute cluster, we wouldn’t even be near the TLC limitations for years. However, we purchased MLC-based drives just in case usage was higher than expected. Unfortunately, we had firmware issues with the MLC drives. We got those worked out, but as a precaution, we replaced half of the drives in our cluster with a different manufacturer’s TLC-based drives, so we now have a 50/50 mixture of MLC and TLC enterprise drives.
Enterprise SSDs, especially SATA ones, are very similar to the SSDs used in consumer desktops and laptops, with a few critical differences. The biggest difference is the use of an integrated supercapacitor. On enterprise drives, should power fail in the middle of a write, the supercapacitor will keep the drive powered long enough to finish the write it was working on and prevent data corruption. Another difference is the amount of over-provisioning. All SSDs display less storage than they actually have installed. The reason for this is that memory cells will sometimes need to be replaced. When this happens, the SSD will start using one of its spare cells. As enterprise SSDs are usually under more stress, more memory cells are allocated as spares. For example, a consumer drive with 512GB of internal storage is usually sold as a 500GB drive with 12GB of over-provisioning. The drives we buy are either 400GB or 480GB, depending on the workload. The firmware for enterprise SSDs is also tuned for an enterprise environment, which is usually much more write-heavy than consumer workloads. For these reasons, we only use enterprise-rated SSDs in DreamCompute.
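To make the over-provisioning comparison concrete, here is a minimal sketch that expresses spare capacity as a percentage of the advertised size. It assumes the 400 and 480 figures are gigabytes on drives with the same 512GB of internal flash as the consumer example, which the post implies but does not state, and the function name is illustrative.

```python
def over_provisioning_pct(raw_gb: float, advertised_gb: float) -> float:
    """Spare capacity as a percentage of the advertised capacity."""
    return (raw_gb - advertised_gb) / advertised_gb * 100

RAW_GB = 512  # internal flash from the post's consumer example

for advertised in (500, 480, 400):
    pct = over_provisioning_pct(RAW_GB, advertised)
    spare = RAW_GB - advertised
    print(f"{advertised}GB drive: {spare}GB spare ({pct:.1f}% over-provisioning)")
```

Under that assumption, the 400GB option reserves roughly ten times the spare capacity of the consumer drive, which is why it would suit the more write-heavy workloads.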
The final factor to consider was whether we should put the SSDs behind a RAID card or directly attach them to the motherboard. A RAID array can be created via software, but to be able to add additional features, arrays are often implemented via a separate card. The redundancy that a RAID card provides isn’t needed with Ceph. Ceph provides its own redundancy. RAID cards can also provide protection against data corruption by having a battery or capacitor that provides power to the card so that it can hold incomplete writes in memory during unexpected reboots. This also isn’t needed, as enterprise SSDs have capacitors built in. The final advantage of a RAID card is the onboard cache. The RAID card cache isn’t any faster than an SSD’s onboard cache, but by setting each disk up as a single-disk RAID array, we can use the drive’s capacitor-protected cache as a write cache and the RAID card’s cache as a read-ahead cache. Our testing showed a slight increase in speed using this setup.
The systems we plan on using have 10 SATA ports behind two controllers (four on CPU 1, six on CPU 2) on the motherboard. We could also use an 8-port RAID card for the eight Ceph drives and only have the two boot drives directly connected. In the end, we decided the expense and potential for problems weren’t worth the marginal speed increase and decided to just directly attach the drives to the motherboard.
So the final layout is eight 960GB Enterprise SSDs (half TLC, half MLC) directly attached to the motherboard with four Ceph drives and both boot drives going to one CPU, and four Ceph drives going to the second CPU.
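The final layout can be written down as a small data structure and checked against the port counts given earlier. Which CPU carries the boot drives is an inference from those counts (the controller with six ports is the only one that fits four Ceph drives plus two boot drives), and all names below are illustrative.

```python
# SATA ports available on each CPU's controller, as described in the post.
PORTS_PER_CPU = {"cpu1": 4, "cpu2": 6}

# Assumption: the six-port controller carries the two boot drives.
LAYOUT = {
    "cpu1": ["ceph"] * 4,
    "cpu2": ["ceph"] * 4 + ["boot"] * 2,
}

CEPH_DRIVE_GB = 960

def fits(layout: dict, ports: dict) -> bool:
    """True if no controller has more drives attached than it has ports."""
    return all(len(drives) <= ports[cpu] for cpu, drives in layout.items())

def count(layout: dict, role: str) -> int:
    """Number of drives with the given role across all controllers."""
    return sum(drives.count(role) for drives in layout.values())

ceph_drives = count(LAYOUT, "ceph")
raw_ceph_gb = ceph_drives * CEPH_DRIVE_GB

print(fits(LAYOUT, PORTS_PER_CPU), ceph_drives, raw_ceph_gb)
```

This gives eight Ceph drives and 7,680GB of raw Ceph capacity per node, compared with 36TB per node in the beta cluster.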
Last edited by 5ms; 03-10-2016 at 14:57.
03-10-2016, 14:56 #3
September 26, 2016
In this post, we are going to be looking at what processor we are using in the new DreamCompute cluster, and how we picked it!
A processor is one of the most crucial components of a machine. The processor, also known as the CPU, is the metaphorical brain of the computer. It does all the “thinking” for the computer. A CPU can have one or more cores. A core is the part of the processor that does the actual computing. The first thing to consider is the instruction set. This is the language that the processor speaks. The popular instruction sets for use in servers are SPARC, ARM, and x86-64. Processors using the SPARC instruction set are made by Oracle. While Linux can run on them, the processors are designed to run Solaris, an operating system that is also made by Oracle. ARM is an instruction set that has been around for 30 years and has primarily been used in embedded devices. Cell phones, TVs, and tablets are some of the many places you will find ARM processors.
Recently, we’ve been testing some servers with ARM processors. The advantage of these servers is that they use very little power and have many low-power cores, which is useful in environments where you have many processes running at the same time. The most common instruction set you will find in a data center is x86-64. This is the 64-bit version of the x86 instruction set that has been in use since 1978. Almost all consumer laptops, desktops, and enterprise servers are made using processors based on x86-64. Because of this, we decided to use a processor based on x86-64.
The server x86-64 processor manufacturers are AMD and Intel. AMD, which we used in our beta cluster, last released a major x86-64 server processor in 2012. Since that release, AMD has been working on Zen, a new architecture. Unfortunately, Zen has not yet been released. Because of this, the AMD processors currently for sale are much higher wattage and slower than others on the market.
In the time since the last major AMD update, Intel has released three new generations of chips, each faster and more power efficient. This makes Intel the best choice right now, so we decided to use them. Every year or so, Intel releases a new “generation” of processors. With each generation, Intel does one of two things. They either change the transistor size, making the processor smaller and more power efficient, or they keep the transistor size the same and focus on adding features.
Within the server line of processors, there are four generations currently being produced. Ivy Bridge, which was released in 2012, is an upgrade of the older Sandy Bridge processors with smaller transistors and is denoted with a v2 at the end of the processor name for most server processors. Haswell, released in 2013, is a refinement of Ivy Bridge and uses the same transistor size. For servers, it is denoted as v3. Broadwell, released in 2014, was a refinement of the Haswell processor with smaller transistors and is denoted with a v4. Finally, Skylake, released last year, is a refinement of Broadwell at the same transistor size and is denoted with a v5.
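The naming scheme described above amounts to a lookup table. The sketch below is one way to encode it; the model strings in the usage lines are only illustrative inputs, and as the post notes, the suffix convention holds for most but not all server processors.

```python
# Version suffix -> generation, as listed in the post (Skylake: "last year" = 2015).
XEON_GENERATIONS = {
    "v2": {"name": "Ivy Bridge", "released": 2012, "change": "smaller transistors"},
    "v3": {"name": "Haswell", "released": 2013, "change": "same transistor size"},
    "v4": {"name": "Broadwell", "released": 2014, "change": "smaller transistors"},
    "v5": {"name": "Skylake", "released": 2015, "change": "same transistor size"},
}

def generation_of(model: str):
    """Return the generation name for a model string, or None if it has no known suffix."""
    parts = model.split()
    if not parts:
        return None
    entry = XEON_GENERATIONS.get(parts[-1].lower())
    return entry["name"] if entry else None

print(generation_of("E5-2680 v3"))
print(generation_of("E3-1230 v5"))
print(generation_of("E5-2680"))  # no suffix, so not in the table
```

The alternation in the `change` field reflects the two kinds of release described earlier: a transistor shrink followed by a refinement at the same size.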
Individual product lines within the server category are upgraded at different times. Even though a generation has been released, it may be years before a certain product line begins using that generation, and certain product lines may skip over entire generations. The five current product lines within the server category are Atom, Xeon D, Xeon E3, Xeon E5, and Xeon E7. The Atom line is an ultra-low power line. These aren’t really powerful enough to use in our hypervisors, and the low number of cores would significantly limit the size of virtual machine we could offer. The next line is Xeon D. These are higher wattage and faster than the Atom processors, but still not quite the power we wanted to be able to give our customers.
The next line is the E3. The E3 line is the only server-based line that has been upgraded to Skylake. It has plenty of power, but you are limited to a single processor per system and four cores per processor. The E3 line lacks the density that would make it usable. At the time we were designing the new DreamCompute cluster, the E5 was Haswell-based, but we knew Broadwell was coming soon. As we can only use what is currently being produced, we only looked at the Haswell line. The E5 line is marketed to data centers. Within the E5 line, you have single, dual, and quad processor options, with many choices within each of those. The E5 line just might work! The E7 line is the last line we looked at. The E7 line was, at the time, Ivy Bridge-based, though we knew Haswell was coming soon. The E7 line is focused on density. E7s have both quad and octo processor options with up to 18 cores per processor. They are primarily used in environments where you need a single computer to be able to do a lot of work. That made E7s a possible, but not ideal, fit for DreamCompute, as we probably don’t want that much density.
Now that we knew what processor lines would work, we needed to consider two more factors. We needed to figure out how many processors we wanted in each system and how many cores each processor should have. Originally, processors had a single core, but modern x86-64 processors can have up to 22 cores on a single processor. This is especially useful in a shared resource environment (like DreamCompute) where you don’t want the processes of one user to affect everyone else on that hypervisor. Intel also has a technology called “hyperthreading,” where it presents each core to the operating system twice to allow for more efficient use of each core. You can also put multiple processors in each server. You can have up to eight 18-core processors in a system for a total of 144 cores or 288 threads. However, as we know from our previous cluster, density isn’t everything. We wanted to be able to balance power and density, and limit the single points of failure. Based on the RAM-to-core ratio we wanted and the maximum density we were willing to have, we decided to test dual and quad processor systems with eight to 14 cores in each system. We tested both the E5 dual and quad processor lines as well as the E7 quad processor line.
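The core and thread totals quoted above follow from simple multiplication. This minimal sketch reproduces them; the function name is illustrative, and two threads per core reflects hyperthreading as the post describes it.

```python
def core_counts(sockets: int, cores_per_socket: int, threads_per_core: int = 2):
    """Return (physical cores, hardware threads) for one system."""
    cores = sockets * cores_per_socket
    return cores, cores * threads_per_core

# The densest system mentioned in the post: eight 18-core processors.
max_cores, max_threads = core_counts(8, 18)
print(max_cores, max_threads)

# For comparison, a dual-socket system using the 22-core maximum per processor.
print(core_counts(2, 22))
```

The first call gives the 144 cores and 288 threads from the post; the second shows how quickly the totals drop once the socket count is reduced to limit single points of failure.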