The reason that the SSD vs. HDD debate is so critical for the enterprise is the sheer weight of today’s data. Part of the challenge is that this rapidly growing data threatens traditional computing infrastructure based on HDD, or hard disk drive, storage.
The problem isn’t only the growth. If that’s all there was to it, then data center administrators would simply slap in more spindles, install a tape library, and send secondary data to the cloud where it becomes the provider’s problem. The problem is also the speed at which applications operate. Processor and networking speeds have kept up with application velocity and growth, but production storage has not.
Granted, computing bottlenecks may exist in places other than the HDD. Switches fail, bandwidth overloads, VM hosts go down: nothing in the computing path is 100% reliable. But disk drives are the major slowdown culprit in high IO environments. The mechanical nature of the device is the offending party.
Very fast SSD performance is the increasingly popular fix for the problem. However, SSDs are not the automatic choice over HDDs. First, one-to-one, SSDs are a good deal more expensive than HDDs. There are certainly factors that narrow the purchasing gap between SSDs and HDDs, and in practice the cost for SSDs can be less. (For a detailed look at HDD and SSD cost comparisons, see Henry Newman’s article SSD vs. HDD Pricing: Seven Myths That Need Correcting.) A second factor is what to replace: SSDs will be faster than disk, but this does not necessarily mean that IT needs this performance level for secondary disk tiers.
A third factor that militates against universal replacement is reliability: are SSDs reliable enough to replace HDDs in the data center? In fact, that is a tricky question. SSD/HDD reliability depends on many factors: usage, physical environment, application IO, vendor, mean time between failures (MTBF), and more. This is a big discussion topic, so to keep this performance/reliability discussion usefully focused, let’s set some base assumptions:
- We’ll discuss SSDs in data centers, not in consumer products like desktops or laptops. SSDs have a big place there especially for devices carried into hostile environments. However, the enterprise has a distinct set of requirements for storage based on big application and data growth, and the to-use-or-not-to-use question is critical in these data centers.
- We’ll limit our discussion to NAND flash memory-based SSDs with the occasional foray into DRAM. Strictly speaking, that foray takes us outside flash technology, since DRAM is not a flash technology at all. And in the case of NAND SSDs, remember that while NAND is always flash, flash is not always NAND.
- We’re not covering other storage flash technologies, which leaves out all-flash arrays with ultra-performance flash module components, as well as server-side flash-based acceleration. These are big stories in and of themselves but do not represent the majority of the SSD market today, particularly in mid-sized business and SMB.
Performance: SSD Wins
Hands down, SSD performance is faster. HDDs have the inescapable overhead of physically scanning the disk for reads and writes. Even the fastest 15K RPM HDDs may bottleneck a high-traffic environment. Parallel disk, caching, and lots of extra RAM will certainly help. But eventually the high rate of growth will pull well ahead of the finite ability of HDDs to go faster.
DRAM-based SSD is the faster of the two but NAND is faster than hard drives by a range of 80-87% -- a very narrow range between low-end consumer SSDs and high-end enterprise SSDs. The root of the faster performance lies in how quickly SSDs and HDDs can access and move data: SSDs have no physical tracks or sectors and thus no physical seek limits. The SSD can reach memory addresses much faster than the HDD can move its heads.
The distinction is unavoidable given the nature of IO. In a hard disk array, the storage operating system directs the IO read or write requests to physical disk locations. In response, the platter spins and disk drive heads seek the location to write or read the IO request. Non-contiguous writes multiply the problem and latency is the result.
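To see why the mechanics matter, a back-of-the-envelope calculation helps. The sketch below uses the common rule of thumb that average rotational latency is half of one revolution. The average seek times are illustrative assumptions, not measurements from any particular drive, and the function names are made up for this example.

```python
def avg_rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency in milliseconds: half of one revolution."""
    seconds_per_revolution = 60.0 / rpm
    return 0.5 * seconds_per_revolution * 1000.0


def estimated_random_iops(rpm: float, avg_seek_ms: float) -> float:
    """Rough random IOPS for one spindle: one IO per (seek + rotational latency)."""
    service_time_ms = avg_seek_ms + avg_rotational_latency_ms(rpm)
    return 1000.0 / service_time_ms


# (rpm, assumed average seek time in ms); the seek values are hypothetical.
assumed_drives = [(7200, 8.5), (10000, 4.5), (15000, 3.5)]

for rpm, seek_ms in assumed_drives:
    latency = avg_rotational_latency_ms(rpm)
    iops = estimated_random_iops(rpm, seek_ms)
    print(f"{rpm} RPM: rotational latency {latency:.2f} ms, about {iops:.0f} random IOPS")
```

Under these assumed seek times, even a 15K RPM drive delivers fewer than 200 random IOs per second per spindle. That is the mechanical ceiling that parallel disk and caching try to work around.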
In contrast, SSDs are the fix for HDD bottlenecks in high IO environments, particularly in Tier 0, high IO Tier 1 databases, and caching technologies. Since SSDs have no mechanical movement, they service IO requests far faster than even the fastest HDD.
Reliability: HDD Scores Points
Performance may be a slam dunk but reliability is not. Granted, SSDs’ physical reliability in hostile environments is clearly better than that of HDDs, given their lack of mechanical parts. SSDs will survive extreme cold and heat, drops, and multiple G’s. HDDs… not so much.
However, few data centers will experience rocket liftoffs or sub-freezing temperatures, and SSDs have their own unique stress points and failures. Solid state architecture avoids the same type of hardware failures as the hard drive: there are no heads to misalign or spindles to wear out. But SSDs still have physical components that fail such as transistors and capacitors. Firmware fails too, and wayward electrons can cause real problems. And in the case of a DRAM SSD, the capacitors will quickly fail in a power loss. Unless IT has taken steps to protect stored data, that data is gone.
Wear and tear over time also enters the picture. As an SSD ages its performance slows. The processor must read, modify, erase and write increasing amounts of data. Eventually memory cells wear out. Cheaper TLC is generally relegated to consumer devices and may wear out more quickly because it stores more data in a reduced area. (Thus goes the theory; studies do not always bear it out.)
For example, since MLC stores multiple bits (electronic charges) per cell instead of SLC’s one bit, you would expect MLC SSDs to have a higher failure rate. (MLC NAND is usually two bits per cell but Samsung has introduced a three-bit MLC.) However, as yet there is no clear result that one-bit-per-cell SLC is more reliable than MLC. Part of the reason may be that newer and denser SSDs, often termed enterprise MLC (eMLC), have more mature controllers and better error checking processes.
So are SSDs more or less reliable than HDDs? It’s hard to say with certainty since HDD and SSD manufacturers may overstate reliability. (There’s a newsflash.) Take HDD vendors and reported disk failure rates. Understandably, HDD vendors are sensitive to disk failure numbers. When they share failure rates at all, they report the lowest possible numbers as the AFR, the annualized (verifiable) failure rate. This number is based on the vendor’s verification of failures: i.e., those attributable to the disk itself. Not environmental factors, not application interface problems, not controller errors: only the disk drive. Fair enough in a limited sort of way, although IT is only going to care that their drive isn’t working, verified or not. Reported AFRs for disk-only failures generally run between 0.55% and 0.90%.
However, what the HDD manufacturers do not report is the number of under-warranty disk replacements each year, or ARR, the annualized rate of return. If you substitute these numbers for reported drive failures, you get a different story. We don’t need to know why these warrantied drives failed, only that they did. These rates range much, much higher, from about 0.5% to as high as 13.5%.
Now, in practice those higher percentages are not earth shattering. Most modern storage has redundant technology that minimizes data damage from a failed disk and allows hot replacements. But when you are talking about drive reliability, clearly that number is worth talking about.
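To put that number in concrete terms, the arithmetic below applies the AFR and ARR ranges quoted above to a fleet of drives. The fleet size of 1,000 is a hypothetical figure chosen only to make the percentages easy to read, and the function name is an assumption for this sketch.

```python
def expected_annual_replacements(drive_count: int, annual_rate_percent: float) -> float:
    """Expected drives replaced per year at a given annualized rate (in percent)."""
    return drive_count * annual_rate_percent / 100.0


FLEET_SIZE = 1000  # hypothetical fleet, for illustration only

# Rates quoted in the article: vendor-verified AFR vs. warranty-return ARR.
quoted_rates_percent = {
    "AFR, low end": 0.55,
    "AFR, high end": 0.90,
    "ARR, low end": 0.5,
    "ARR, high end": 13.5,
}

for label, rate in quoted_rates_percent.items():
    drives = expected_annual_replacements(FLEET_SIZE, rate)
    print(f"{label} ({rate}%): about {drives:.1f} drives per year")
```

At the top of each range, that is the difference between swapping roughly 9 drives a year and swapping roughly 135.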
Again, small blame to the HDD vendors for putting their best foot forward. No one really expects them to publish reams of data on how often their products fail… especially since the SSD vendors do the same thing. And on the whole, HDDs tend to fail more gracefully in that there may be more warning than a suddenly failing SSD. This does not negate the huge performance advantages of SSD but does give one pause.
SSD Reliability Failures
Some SSD failures are common to any storage environment, but they do tend to have different causes than HDD failures. Common points of failure include:
- Bit errors: Random data bits stored to cells, although it sounds much more impressive to say that the electrons leaked.
- Flying or shorn writes: Correct writes written in the wrong location, or truncated writes due to power loss.
- Unserializability: A hard-to-pronounce term that means writes are recorded in the wrong order.
- Firmware: Ah, firmware. Firmware fails, corrupts, or upgrades improperly throughout the computing universe: SSD firmware is no exception.
- Electronic failures: In spite of no moving parts, physical components like chips and transistors fail, taking the SSD down right along with them.
- Power outages: DRAM SSDs have volatile memory and will lose data if they lack a battery power supply. NAND SSDs are also subject to damaged file integrity if they are reading/writing during power interruptions.
As SSDs mature, manufacturers are improving their reliability processes. Wear leveling is a controller-run process that tracks data movement and component wear across cells, and levels writes and erases across multiple cells to extend the life of the media. Wear leveling maps logical block addresses (LBA) to physical memory addresses. It then either rewrites data to a new block each time (dynamic), or reassigns low usage segments to active writes (static) in order to avoid consistent wear to the same segment of memory. Note that writes are not the only issue: so is deletion. HDDs can write and read from the same sector, and in case of modified data can simply overwrite the sector. SSDs don’t have it this easy: they cannot overwrite but must erase blocks and write to new ones.
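The dynamic approach can be sketched in a few lines. This is a toy model under stated assumptions, not how any vendor’s controller works: the class name `ToyWearLeveler`, the one-LBA-per-block mapping, and the "pick the least-erased free block" policy are all simplifications invented for illustration. Static wear leveling is not modeled.

```python
class ToyWearLeveler:
    """Toy dynamic wear leveling: every write of an LBA lands on a fresh block."""

    def __init__(self, physical_blocks: int):
        self.erase_counts = [0] * physical_blocks   # wear per physical block
        self.contents = [None] * physical_blocks    # data held by each block
        self.free_blocks = set(range(physical_blocks))
        self.lba_to_block = {}                      # logical -> physical map

    def write(self, lba: int, payload: str) -> int:
        if not self.free_blocks:
            raise RuntimeError("no free blocks left")
        # Choose the least-worn free block (ties broken by block number).
        target = min(self.free_blocks, key=lambda b: (self.erase_counts[b], b))
        self.free_blocks.remove(target)
        self.contents[target] = payload

        previous = self.lba_to_block.get(lba)
        self.lba_to_block[lba] = target
        if previous is not None:
            # Flash cannot overwrite in place: erase the old block and
            # return it to the free pool, counting the erase as wear.
            self.contents[previous] = None
            self.erase_counts[previous] += 1
            self.free_blocks.add(previous)
        return target

    def read(self, lba: int) -> str:
        return self.contents[self.lba_to_block[lba]]


leveler = ToyWearLeveler(physical_blocks=4)
for version in range(8):
    leveler.write(lba=0, payload=f"version-{version}")

print("current data:", leveler.read(0))
print("erase counts per block:", leveler.erase_counts)
```

Although the host rewrites the same logical address every time, the erases end up spread across all four physical blocks rather than hammering one of them.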
Data integrity checks are also crucial for data health. Error correction code (ECC) checks data reads and corrects hardware-based errors to a point. Cyclic Redundancy Check (CRC) checks written data to be sure that it is returned intact to a read request. Address translation guards against location-based errors by verifying that a read is occurring from the correct logical address, while versioning retrieves the current version of data.
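As a rough illustration of the CRC idea, the sketch below stores a checksum alongside the data and recomputes it on read. It uses CRC-32 from Python’s standard `zlib` module purely as a stand-in; the polynomial and checksum width inside a real SSD controller may differ, and the `store` and `verify` helpers are hypothetical names. Note that a CRC only detects corruption; correcting it is the job of ECC.

```python
import zlib
from typing import Tuple


def store(payload: bytes) -> Tuple[bytes, int]:
    """Return the payload together with the CRC-32 computed at write time."""
    return payload, zlib.crc32(payload)


def verify(payload: bytes, stored_crc: int) -> bool:
    """Recompute the CRC-32 at read time and compare it with the stored value."""
    return zlib.crc32(payload) == stored_crc


data, crc = store(b"enterprise database page")

# Simulate a single flipped bit in the first byte.
corrupted = bytes([data[0] ^ 0x01]) + data[1:]

print("clean read passes:", verify(data, crc))
print("corrupted read passes:", verify(corrupted, crc))
```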
Garbage collection helps to reclaim sparsely used blocks. A NAND SSD writes only to empty blocks, so free space on the drive can run out quickly. The firmware can analyze the cells for partially filled blocks, merge data into new blocks, and erase the old ones to free them up for new writes.
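The merge-and-erase step might look something like the toy model below. It assumes each block is divided into four pages and that a page is either valid data, stale, or empty; those details, the `collect_garbage` name, and the victim-selection rule are illustrative assumptions. A real controller copies live pages into a spare block before erasing, whereas this sketch simply holds them in a Python list in the meantime.

```python
STALE = object()   # marker for a page whose data has been superseded
EMPTY = None       # marker for an erased, writable page


def collect_garbage(blocks, pages_per_block=4):
    """Merge live pages from blocks containing stale pages, then erase those blocks."""
    # Victims: any block holding at least one stale page.
    victims = [i for i, block in enumerate(blocks) if any(p is STALE for p in block)]

    # Gather the pages that are still valid in the victim blocks.
    live_pages = [
        page
        for i in victims
        for page in blocks[i]
        if page is not STALE and page is not EMPTY
    ]

    result = [list(block) for block in blocks]
    for i in victims:
        result[i] = [EMPTY] * pages_per_block      # erase the whole block

    # Pack the live pages densely into as few erased blocks as possible.
    targets = iter(victims)
    for start in range(0, len(live_pages), pages_per_block):
        chunk = live_pages[start:start + pages_per_block]
        padding = [EMPTY] * (pages_per_block - len(chunk))
        result[next(targets)] = chunk + padding
    return result


before = [
    ["a", STALE, STALE, "b"],
    [STALE, "c", STALE, STALE],
    ["d", "e", "f", "g"],          # fully valid, left untouched
    ["h", STALE, EMPTY, EMPTY],
]

for number, block in enumerate(collect_garbage(before)):
    print(f"block {number}: {block}")
```

In this example four scattered live pages are consolidated into a single block, leaving two fully erased blocks ready for new writes.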
Data redundancy is also a factor. External redundancy of course occurs outside of the SSD with backup, mirroring, replication, and so on. Internal redundancy measures include internal batteries in DRAM SSDs, and striped data parity in NAND flash memory.
So Which Wins, SSD or HDD?
SSDs are clearly faster in performance, and if an HDD vendor argues otherwise then consider the source. However, reliability is an ongoing issue outside of hostile environments. We find that SSD reliability is improving and is commensurate with, or moving slightly ahead of, HDDs. SSD warranties have stretched from 3 to 5 years with highly reliable Intel leading the way. Intel and other top NAND SSD manufacturers like Samsung (at present, the world’s largest NAND developer), Kingston and OCZ are concentrating on SSD reliability by improving controllers, firmware, and troubleshooting processes.
The final score between NAND/DRAM SSDs and HDDs? Costs are becoming comparable. Reliability is about the same. SSD performance is clearly faster, and should rule the final decision between SSD and HDD. Hard drives will remain in place for a long time yet in secondary storage, but I believe that they have lost their edge in high IO computing. For that, look to SSDs.