Category Archives: Solid State Disk

Moore’s Law May Be The Death of NAND Flash

"It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so." - Mark Twain

I try and keep this quote in my mind whenever I’m teaching about new technologies. You often hear the same things parroted over and over again long after they quit being true. This problem is compounded by fast moving technologies like NAND Flash.

If you have read my previous posts about Flash memory you are already aware of NAND flash endurance and reliability. Just like CPU’s manufacturing processes flash receive boost in capacity as you decrease the size of the transistors/gates used on the device. In CPU’s you get increases in speed, on flash you get increases in size. The current generation of flash manufactured on a 32nm process. This nets four gigabytes per die. Die size isn’t the same as chip, or package size. Flash dies are actually stacked in the actual chip package giving us sixteen gigabytes per package. With the new die shrink to 25nm we double the size to eight gigabytes and thirty two gigabytes respectively. That sounds great, but there is a dark side to the ever shrinking die. As the size of the gate gets smaller it becomes more unreliable and has less endurance than the previous generation. MLC flash suffers the brunt of this but SLC isn’t completely immune.

Cycles And Errors

One of the things that always comes up when talking about flash is the fact it wears out over time. The numbers that always get bantered about are SLC is good for 100,000 writes to a single cell and MLC dies at 10,000 cycles. This is one of those things that just ain’t so any more. Right now the current MLC main stream flash based on the 32nm process write cycles are down to 5000 or so. 25nm cuts that even further to 3000 with higher error rates to boot.

Several manufactures has announced the transition to 25nm on their desktop drives. Intel and OCZ being two of the biggest. Intel is a partner with Micron. They are directly responsible for developing and manufacturing quite a bit of the NAND flash on the market. OCZ is a very large consumer of that product. So, what do you do to offset the issues with 25nm? Well, the same thing you did to offset that problem with 32nm, more spare area and more ECC. At 32nm it wasn’t unusual to see 24 bits of ECC per 512 bytes. Now, I’ve seen numbers as high as 55 bits per 512 bytes to give 25nm the same protection.

To give you an example here is OCZ’s lineup with raw and usable space listed.

Drive Model	Production Process	Raw Capacity (in GB)	Affected Capacity (in GB)
OCZSSD2‐2VTXE60G	25nm	64	55
OCZSSD2‐2VTX60G	32nm	64	60
OCZSSD2‐2VTXE120G	25nm	128	118
OCZSSD2‐2VTX120G	32nm	128	120

As you can clearly see the usable space is significantly decreased. There is a second problem specific to the OCZ drives as well. Since they are now using higher density modules they are only using half as many of them. Since most SSD’s get their performance from multiple read/write channels cutting that in half isn’t a good thing.

SLC is less susceptible to this issue but it is happening. At 32nm SLC was still in the 80,000 to 100,000 range for write cycles but the error rate was getting higher. At 25nm that trend continues and we are starting to see some of the same techniques used in MLC coming to SLC as ECC creeps up from 1 bit per 512 bytes to 8 bits or more per 512 bytes. Of course the down side to SLC is it is half the capacity of MLC. As die shrinks get smaller SLC may be the only viable option in the enterprise space.

It’s Non-Volatile… Mostly

Another side effect of shrinking the floating gate size is the loss of charge due to voltage bleed off over time. When I say “over time” I’m talking weeks or months and not years or decades anymore. The data on these smaller and smaller chips will have to be refreshed every few weeks. We aren’t seeing this severe an issue at the 25nm level but it will be coming unless they figure out a way to change the floating gate to prevent it.

Smaller Faster Cheaper

If you look at trends in memory and CPU you see that every generation the die gets smaller, capacity or speed increases and they become cheaper as you can fit double the chips on a single wafer. There are always technical issues to overcome with every technology. But NAND flash is the only one that gets so inherently so unreliable at smaller and smaller die sizes. So, does this mean the end of flash? In the short term I don’t think so. The fact is we will have to come up with new ways to reduce writes and add new kinds of protection and more advanced ECC. On the pricing front we are still in a position where demand is outstripping supply. That may change somewhat as 25nm manufacturing ramps up and more factories come online but as of today, I wouldn’t expect a huge drop in price for flash in the near future. If it was just a case of SSD’s consuming the supply of flash it would be a different matter. The fact is your cell phone, tablet and every other small portable device uses the exact same flash chips. Guess who is shipping more, SSDs or iPhones?

So, What Do I Do?

The easiest thing you can do is read the label. Check what manufacturing process the SSD is using. In some cases like OCZ that wasn’t a straight forward proposition. In most cases though the manufacturer prints raw and formatted capacities on the label. Check the life cycle/warranty of the drive. Is it rated for 50 gigabytes of writes or 5 terabytes of writes a day? Does it have a year warranty or 5 years? These are indicators of how long the manufacturer expects the drive to last. Check the error rate! Usually the error rate will be expressed in unrecoverable write or read errors per bit. Modern hard drives are in the 10^15 ~ 10^17 range. Some enterprise SSDs are in the 10^30 range. This tells me they are doing more ECC than the flash manufacturer “recommends” to keep your data as safe as possible.

#SQLRally is coming, Go vote!

We are in the final stages of selecting the speakers for the SQLRally May 11th through the 13th in sunny Orlando Florida. The program selection is a little different than what we have done with the Summit. The committee narrowed the number of selections and is putting the rest up to a public vote. This is your opportunity to voice your opinion on what you would like to hear at this inaugural event! I’ve been fortunate enough to have two of my sessions put up for a vote. If you follow my blog you know I have a passion for moving bits of data around as fast as possible. Both my sessions focus on storage. As much as I would love to have your votes to see my sessions at SQLRally, I would like it even more if you voted on what YOU want to learn about the most. Having served on the program committee for Summit last year I know just how hard it can be choosing what I think people would like to learn about. having the opportunity to make your choice known directly is just awesome. I am very excited to see PASS expand and have training events that cover the gambit. Starting with local user groups and SQL Saturdays now growing with SQLRally and finishing it off with the Summit, there is something for every budget.

With that said, here are my abstracts so you can get a better idea of what I’m speaking on. GO VOTE!

Title:
Solid State Storage Deep Dive
Speaker:
Wesley Brown
Category:
Storage
Level:
100

Abstract:
If you have ever wanted to know how SSD’s and Flash memory works this talk is for you. We will cover the fundamentals of Flash in detail. I will also highlight some of the specific vendor implementations and what makes a particular SSD enterprise-ready vs. consumer grade. We will also cover SQL Server usage patterns what is a good fit for SSD’s and when it may be better to go with hard disks. Solid State Storage isn’t a cure-all for every situation, this presentation will give you the tools you need to make the right choice for your SQL Server environment.

Session Goals

Understand the fundamental building block of Flash memory.
Get a clear explanation of what makes some SSD’s robust enough for enterprise use.
Learn where SSD will and won’t make a real difference in your SQL Server environment.

Title:
Understanding Storage Systems and SQL Server
Speaker:
Wesley Brown
Category:
Storage
Level:
100

Abstract:
The most important part of your SQL Server is also the slowest, Storage. This talk will take you through the fundamentals of your server’s Disk I/O System. From how hard drives work, through RAID configurations, and how to configure the file system. This session should give you a solid foundation over storage systems and help you understand why they are slow and how to overcome some of their limitations.

Session Goals

Understand the physical characteristics of IO hardware.
Understand the fundamentals of RAID.
Understand how to configure the file system.

Fusion-io, Flash NAND All You Can Eat

Fusion-io has announced general availability of the new Octal. This card is the largest single flash based device I’ve ever seen. The SLC version has 2.56 terabytes of raw storage and the MLC has a whopping 5.12 terabytes of raw storage. This thing is a behemoth. The throughput numbers are also impressive, both read at 6.2 Gigabytes a second using a 64KB block, you know the same size as an extent in SQL Server. They also put up impressive write numbers the SLC version doing 6 Gigabytes a second and the MCL clocks in at 4.4 Gigabytes a second.

There is a market for these drives but you really need to do your homework first. This is basically four ioDrive Duos or eight ioDrive’s using a single PCIe 2.0 16x slot. It requires a lot of power, more than the PCIe slot can provide. It needs an additional three power connectors two 6 pin and one 8 pin. EDIT: According to John C. You only need to use ether the two 6 pin OR the single 8 pin. These are pretty standard on ATX power supplies in your high end desk top machines but very rarely available in your HP, Dell or IBM server so check to see if you have any extra power leads in your box first.

Also, remember you have to have a certain amount of free memory for the ioDrive to work. They have done a lot of work in the latest driver to reduce the memory foot print but it can still be significant. I would highly recommend setting the drive up to use a 4K page instead of a 512 byte page. After that, you will still need a minimum of 284 megabytes of RAM per 80 gigabytes of storage. On the MLC Octal that comes to 18 gigabytes of RAM that you need to have available per card. To be honest with you, if you are slotting one of these bad boys into a server it won’t be a little dual processor pizza box. On the latest HP DL580G7’s you can have as much as 512 gigabytes of RAM so carving off 18 gigabytes of that isn’t such a huge deal.

Lastly, you will actually see several drives on your system each one will be a 640 gigabyte drive. If you want one monster drive you will have to stripe them at the OS level. The down side of that is loosing TRIM support which is a boon to the overall performance of the drive, but not a complete deal breaker. EDIT: John C is correct. You don’t loose TRIM for striping with the default Windows RAID stripe on Windows Server 2008 R2. I’m waiting for confirmation from Symantec if that is the case with Veritas Storage Foundation as well since that is what I am using to get software RAID 10 on my servers.

I don’t have pricing information on these yet, but I’m guessing its like a Ferrari, if you have to ask you probably can’t afford it.

SATA, SAS or Neither? SSD’s Get A Third Option

I recently wrote about solid state storage and its different form factor. Well, several major manufacturers have realized that solid state needs all the bandwidth it can get. Dell, IBM, EMC, Fujitsu and Intel have formed the SSD Form Factor Working Group bringing PCIe 3 to the same form factor that SATA and SAS use. Focusing on the same connector types and a 2.5” dive housing. I’m not sure how quickly it will make it’s way into the enterprise space but that is clearly it’s target. Reusing the physical form factor cuts down on manufacturing and R&D costs for all involved. They have an aggressive time scale for something like this. The specification hasn’t been published yet and I’ll take a deeper look into it when it becomes available. There are some key players missing though. HP and Seagate being the two in the enterprise space that give me pause. Both control a large segment of the storage space. On the controller side LSI is also absent. This could be a direct threat to their current market domination of the RAID controller chipset space if they aren’t on the ball.

Fusion-io got that early on and took a different route sticking with just PCIe to bypass the limitations of SAS/SATA and intermediate controllers. By going that route they opened up a whole other level of performance.

I asked David Flynn what he thought about the new standard. Fusion-io is a contributor to the working group.

It is quite validating that folks would be routing PCIe to the drive bays. For us it’s just another form factor that

we can easily support it.

Two things, though… First is that I believe it’s a hangover from the mechanical drive era to put such emphasis on form factors that allow easy servicing access. Solid state should not need to be serviced. It should be much more reliable than HDD’s. But, outside of Fusion-io failure rates for solid state is actually much worse than for mechanical disk drives.

The second point is that form-factor and even PCIe attachment isn’t really the key thing to higher performing, more reliable solid state. What makes the real difference is eliminating the embedded CPU bottleneck in the access path to the flash.

Fusion-io uses a memory controller approach to integrating flash. You don’t find CPU’s on DRAM modules. SSD’s (SATA or PCIe) from everyone else use embedded CPU’s and attach using storage controller methodologies.

In an upcoming post on my solid state storage series I will explore failure rates in detail. I do find it interesting that Fusion-io is one of the very few companies that have significantly higher error detection rates than a standard hard drive or other SSD’s, even enterprise branded SSD’s. Fusion-io claims 10^20 uncorrectable detectable error rate and 10^30 uncorrectable undetectable error rate. I have yet to see any hard disk or SSD with a rate better than 10^17. So, I agree with David about actually needing a form factor for ease of service if you build the device with enough error correction, which clearly you can with solid state.

Fundamentals of Storage Systems, Understanding Reliability and Performance of Solid State Storage

Solid state storage has come on strong in the last year. With that explosion of new products it can be hard to look at all the vendor information and decide which device is best for you. Between the different manufacturers using different methods to benchmark their products showing two different numbers for reads and writes using different methodologies it can be extremely confusing. If you haven’t read Solid State Storage Basics you may not understand all the terms used in this article.

SLC and MLC Characteristics and Differences

Right now there are two main flavors of NAND Flash that are in use. Single Level Cell(SLC) and Multi Level Cell(MLC). SLC stores a single bit cell while MLC can store two bits. There are flavors of MLC that can store three and four bits but are unsuitable at this time for mass storage like hard drives. They have very low endurance and wear out quickly.

SLC has several desirable characteristics that have made it the choice for enterprise applications for quite a while. It is more durable in every way over MLC. Where it loses out is on capacity and price.

Measure	SLC	MLC
Read Speed	25~ nanoseconds	50~ nanoseconds
Write Speed	220~ nanoseconds	900~ nanoseconds
P/E Cycles	100k to 300k	3k to 30k
Minimum ECC Bits required	1 bit per 512 bytes	12 bits per 512 bytes
Block Size	64KB	128KB

SLC can cost as much as five times as MLC. This alone is enough for many manufacturers to look at MLC over SLC. Couple that with the increased capacity makes MLC a compelling alternative for mass storage. The problem has been how to make MLC reliable in the enterprise.

Enterprise Reliability

As you can see, SLC is more robust requiring less error correcting code to fix data issues. Just a few years ago, MLC wasn’t considered good enough to be in even consumer grade drives. Over the last three years several manufacturers have focused on building NAND Flash controllers that could compensate for this using large amounts of error correction. In some cases several times the 12 bits per 512 bytes. This combined with better garbage collection and wear-leveling algorithms have finally extended MLC into the enterprise. This comes with a price though. ECC has to be stored somewhere, usually sacrificing storage space, and you need a much more powerful controller to handle the calculations without hurting performance. Another one of the techniques to extend the performance and endurance used is to put as many chips in a parallel arrangement with multiple channels. Think of it as RAID on a chip level instead of a hard disk level. This allows them to spread the IO load as wide as possible. The larger the capacity of the storage device the more area it has to use things like TRIM and it’s own internal garbage collection across multiple NAND chips keeping IO from stalling out due to write amplification. It also increases the life of the device as well since you can spread the wear-leveling out. There are standards bodies like JEDEC that help define endurance and longevity but you must still read the fine print. A good example is the Intel product manual for the X-25M SSD. If you look at page 6 you see the minimum useful life rated at 3 years. But, if you look at the write endurance you see that the 80 gigabyte drive is rated at 7.5 terabytes. That is 7.5 terabytes period, for the life of the drive. That means you shouldn’t write more than 21 gigabytes a day in changed data to the drive. For SQL Server that can be quite a low number. I’ve seen data warehousing processes load multiple terabytes over a 8 hour load window. Again, capacity equals endurance the 160 gigabyte drive can sustain 15 terabytes worth of data change. Intel will tell you that the X25-M is meant for enterprise workloads, they are wrong. In contrast, the X-25E SSD has a much longer life due to the SLC it uses instead of MLC. the 32 gigabyte version supports 1 petabyte of random writes and the 64 gigabyte drive supports 2 petabytes of random writes over the life of the drive. This makes the X-25E a better candidate for server work loads. Fusion-io rates their MLC based ioDrive at 5 terabytes a day. They also claim a life expectancy of 16 years. That is 28 petabytes of P/E cycles. This is to just show you that with enough engineering you can have an MLC based device still be very reliable.

SATA, SAS or Neither?

The interface for your solid state disk is also critical to the performance of the drive. We are quickly hitting a wall with SATA II and solid state where a single SSD can saturate a single SATA channel. SAS and SATA both have released the new third generation standard allowing up to 600 megabytes a second of through put but even that doesn’t offer much head room for growth. Several manufacturers are calling their SSD offerings enterprise even though they are on a SATA interface. If you are building a high performance IO subsystem SATA isn’t the best option. With SATA II and the addition of Native Command Queuing it did get a lot better but still falls short of SAS in several areas.

SATA Vs. SAS

Feature	SAS	SATA
Command Queuing	TCQ supports queue depths up to 216 usually capped at 64	NCQ supports queue depths up to 32
Error recovery and detection	Uses the SCSI command is more robust	SMART Proven to be in adequate. see Google Paper
Duplex	Full Duplex dual port per drive	Half Duplex single port
Multi-path IO	fully supported at drive level	supported in SATA II via expanders

Some of these features were nice but if you were choosing between a 7200 RPM SATA drive and a 7200 RPM SAS drive there wasn’t a huge difference. Add in flash though and SATA very quickly shows its short comings. I cannot stress how important command queuing is to flash storage. If the drive you have picked supports NCQ make sure your HBA supports NCQ and ACHI mode to get the most out of it, PC Perspective has a nice write up on this. Lastly, most SATA drives don’t honor the OS request to disable write caching on the drive. This is a big deal for SQL Server where protecting the data is very important. That alone usually keeps me from putting critical databases on SATA based storage. Most RAID HBA’s may let you toggle the drives write cache on or off on a per drive basis but there is still no guarantee that the drive will honor that request ether.

PCIe add in cards
If you aren’t limited to the standard 3.5” or 2.5” form factor and can choose a PCIe based flash device I would recommend starting with Fusion-io. I haven’t had any experience with the Texas Memory System PCIe card though. OCZ, Super Talent and others like them use a combination of bridge chips, RAID controller chips and flash controller chips to build up their SATA PCIe offerings. The form factor may be more convenient but they are ultimately the same as multiple SATA drives plugged into a RAID HBA.

The last thing to remember is TRIM doesn’t work through RAID HBAs SAS or SATA doesn’t matter.

Performance Characteristics

By the numbers
I see people quote performance numbers from different manufactures about just how fast their particular solid state storage is. The problem is, there is no real standard for measuring performance and it can be almost impossible to do an apples to apples comparison between two different devices. If you start at the product specification for the X25-M you see the what you expect. 4K read IOPS 35,000 at 100 percent span(using the entire drive). Write IOPS however are a little different. Using 100 percent span the IO/Sec drop to 350. If you only use one tenth of the drive it shoots up to 3300. The difference is startling. Using an old technique called short stroking, they are able to show the drive in a better light. Using this technique on hard disks yields higher IO’s per second at the cost of capacity and throughput. Applying this technique to a solid state disk limits the amount of data space used for writes and gives the maximum amount of free space for wear-leveling and garbage collection greatly reducing the write amplification effect. Rarely do you see the lower number quoted. On the X-25E all numbers are quoted at full span, showing again the higher performance of SLC. Also, if you look at the footnotes all write tests were done with drive caches enabled. For SQL Server this is a bad idea, if you have a power outage any data in the drive cache is lost. They perform these tests at the maximum queue depth for Native Command Queuing (NCQ) can handle. Again, this pushes the device to its peak throughput. This isn’t a bad thing for SSD’s, but most SQL Server setups have been engineered to keep queue depths low to decrease latencies from the IO system which is usually made up of spinning disks. If you don’t have latency issues now, you may not see a huge improvement by replacing your spinning disks with solid state ones. Size of the IO request is also very important Usually for number of IO’s they will use a sector sized request. On SSD’s that is normally 4 kilobytes. For throughput megabytes per second they use a 128 kilobyte request to get higher numbers. So, when you read the specifications you get the impression that a drive will do say 260 MB/sec at 35,000 IOs/Sec which just isn’t true. This isn’t a new game, hard drive benchmarks also do something similar. As you look at the 4k numbers you can effectively cut them in half since SQL Server works on an 8k page request size. SSDs also perform differently on random and sequential IO loads just like hard disks do. When you look at the specification make sure and note the IO mix, if they don’t give those numbers assume that you will have to do your own testing!

Previous Writes Effect Future Writes
Another issue with the performance numbers quoted has to do with the state of the drive. When a solid state disk is new, i.e. never been written to, it is at it’s peak. Performance will be the best it is ever going to be. When you test your solid state devices doing short duration tests can be very misleading. As I have already pointed out, if you only use a small section of the drives for writes you get inflated numbers. If you only do a short test on the entire drive you are effectively doing the same thing. You must test the entire drive. You must also understand your workload. If you don’t know what the workload will be don’t be afraid to test a wide range of IO sizes and types. Sequential writes tend to leave large contiguous blocks of free space making garbage collection faster. In contrast random writes typically leave lots of small blocks of free space forcing garbage collection to work overtime slowing writes down. As you move from one IO type to another you should add in extra time for the drive to settle into a new steady state before resuming valid samples. Your goal is to get the drive to perform in a predictable manor for your IO load. Realize you may need to discard a range of samples that cover the transition from one steady state to the other. It can lower or inflate your averages and cause you to under or over provision your storage to meet your IO requirements.

Performance over Time
Unlike a hard drive, as you use a solid state disks performance degrades over time for several reasons. In the case of the X-25M the first firmware suffered from poor garbage collection and IO pattern recognition on large volumes of small IO’s causing the drive to suffer as much as a five fold decrease in write performance. We aren’t just talking small files but small changes to large files, like SQL Server data files. This particular problem was partially fixed with a firmware update. In general, all solid state devices suffer As you use your drive over a longer period it will lose performance as part of the normal wear on the NAND Flash chips themselves. They develop more errors cause more write retries. These issues are corrected using ECC and bad block management, but it still leads to poorer performance. SLC has an advantage over MLC again due to it’s much higher endurance but isn’t 100 percent immune to this. If you replace your hardware on a three or five year cycle this may not be a huge issue for you, but it still pays to monitor the performance over time.

Summary

There is a lot to learn when it comes to solid state storage. Making sure you do your own testing and research can keep you from suffering from premature failure and poor performance down the road. Remember, NAND Flash has been around for a while but this new wave of solid state storage is only a few years old. Not having a large pool of these devices in the field for longer than their rated life span makes it hard to predict if they are truly as reliable as we all hope they are.