Monthly Archives: October 2010

Fundamentals of Storage Systems, Understanding Reliability and Performance of Solid State Storage

Solid state storage has come on strong in the last year. With that explosion of new products it can be hard to look at all the vendor information and decide which device is best for you. Between the different manufacturers using different methods to benchmark their products showing two different numbers for reads and writes using different methodologies it can be extremely confusing. If you haven’t read Solid State Storage Basics you may not understand all the terms used in this article.

SLC and MLC Characteristics and Differences

Right now there are two main flavors of NAND Flash that are in use. Single Level Cell(SLC) and Multi Level Cell(MLC). SLC stores a single bit cell while MLC can store two bits. There are flavors of MLC that can store three and four bits but are unsuitable at this time for mass storage like hard drives. They have very low endurance and wear out quickly.

SLC has several desirable characteristics that have made it the choice for enterprise applications for quite a while. It is more durable in every way over MLC. Where it loses out is on capacity and price.

Measure	SLC	MLC
Read Speed	25~ nanoseconds	50~ nanoseconds
Write Speed	220~ nanoseconds	900~ nanoseconds
P/E Cycles	100k to 300k	3k to 30k
Minimum ECC Bits required	1 bit per 512 bytes	12 bits per 512 bytes
Block Size	64KB	128KB

SLC can cost as much as five times as MLC. This alone is enough for many manufacturers to look at MLC over SLC. Couple that with the increased capacity makes MLC a compelling alternative for mass storage. The problem has been how to make MLC reliable in the enterprise.

Enterprise Reliability

As you can see, SLC is more robust requiring less error correcting code to fix data issues. Just a few years ago, MLC wasn’t considered good enough to be in even consumer grade drives. Over the last three years several manufacturers have focused on building NAND Flash controllers that could compensate for this using large amounts of error correction. In some cases several times the 12 bits per 512 bytes. This combined with better garbage collection and wear-leveling algorithms have finally extended MLC into the enterprise. This comes with a price though. ECC has to be stored somewhere, usually sacrificing storage space, and you need a much more powerful controller to handle the calculations without hurting performance. Another one of the techniques to extend the performance and endurance used is to put as many chips in a parallel arrangement with multiple channels. Think of it as RAID on a chip level instead of a hard disk level. This allows them to spread the IO load as wide as possible. The larger the capacity of the storage device the more area it has to use things like TRIM and it’s own internal garbage collection across multiple NAND chips keeping IO from stalling out due to write amplification. It also increases the life of the device as well since you can spread the wear-leveling out. There are standards bodies like JEDEC that help define endurance and longevity but you must still read the fine print. A good example is the Intel product manual for the X-25M SSD. If you look at page 6 you see the minimum useful life rated at 3 years. But, if you look at the write endurance you see that the 80 gigabyte drive is rated at 7.5 terabytes. That is 7.5 terabytes period, for the life of the drive. That means you shouldn’t write more than 21 gigabytes a day in changed data to the drive. For SQL Server that can be quite a low number. I’ve seen data warehousing processes load multiple terabytes over a 8 hour load window. Again, capacity equals endurance the 160 gigabyte drive can sustain 15 terabytes worth of data change. Intel will tell you that the X25-M is meant for enterprise workloads, they are wrong. In contrast, the X-25E SSD has a much longer life due to the SLC it uses instead of MLC. the 32 gigabyte version supports 1 petabyte of random writes and the 64 gigabyte drive supports 2 petabytes of random writes over the life of the drive. This makes the X-25E a better candidate for server work loads. Fusion-io rates their MLC based ioDrive at 5 terabytes a day. They also claim a life expectancy of 16 years. That is 28 petabytes of P/E cycles. This is to just show you that with enough engineering you can have an MLC based device still be very reliable.

SATA, SAS or Neither?

The interface for your solid state disk is also critical to the performance of the drive. We are quickly hitting a wall with SATA II and solid state where a single SSD can saturate a single SATA channel. SAS and SATA both have released the new third generation standard allowing up to 600 megabytes a second of through put but even that doesn’t offer much head room for growth. Several manufacturers are calling their SSD offerings enterprise even though they are on a SATA interface. If you are building a high performance IO subsystem SATA isn’t the best option. With SATA II and the addition of Native Command Queuing it did get a lot better but still falls short of SAS in several areas.

SATA Vs. SAS

Feature	SAS	SATA
Command Queuing	TCQ supports queue depths up to 216 usually capped at 64	NCQ supports queue depths up to 32
Error recovery and detection	Uses the SCSI command is more robust	SMART Proven to be in adequate. see Google Paper
Duplex	Full Duplex dual port per drive	Half Duplex single port
Multi-path IO	fully supported at drive level	supported in SATA II via expanders

Some of these features were nice but if you were choosing between a 7200 RPM SATA drive and a 7200 RPM SAS drive there wasn’t a huge difference. Add in flash though and SATA very quickly shows its short comings. I cannot stress how important command queuing is to flash storage. If the drive you have picked supports NCQ make sure your HBA supports NCQ and ACHI mode to get the most out of it, PC Perspective has a nice write up on this. Lastly, most SATA drives don’t honor the OS request to disable write caching on the drive. This is a big deal for SQL Server where protecting the data is very important. That alone usually keeps me from putting critical databases on SATA based storage. Most RAID HBA’s may let you toggle the drives write cache on or off on a per drive basis but there is still no guarantee that the drive will honor that request ether.

PCIe add in cards
If you aren’t limited to the standard 3.5” or 2.5” form factor and can choose a PCIe based flash device I would recommend starting with Fusion-io. I haven’t had any experience with the Texas Memory System PCIe card though. OCZ, Super Talent and others like them use a combination of bridge chips, RAID controller chips and flash controller chips to build up their SATA PCIe offerings. The form factor may be more convenient but they are ultimately the same as multiple SATA drives plugged into a RAID HBA.

The last thing to remember is TRIM doesn’t work through RAID HBAs SAS or SATA doesn’t matter.

Performance Characteristics

By the numbers
I see people quote performance numbers from different manufactures about just how fast their particular solid state storage is. The problem is, there is no real standard for measuring performance and it can be almost impossible to do an apples to apples comparison between two different devices. If you start at the product specification for the X25-M you see the what you expect. 4K read IOPS 35,000 at 100 percent span(using the entire drive). Write IOPS however are a little different. Using 100 percent span the IO/Sec drop to 350. If you only use one tenth of the drive it shoots up to 3300. The difference is startling. Using an old technique called short stroking, they are able to show the drive in a better light. Using this technique on hard disks yields higher IO’s per second at the cost of capacity and throughput. Applying this technique to a solid state disk limits the amount of data space used for writes and gives the maximum amount of free space for wear-leveling and garbage collection greatly reducing the write amplification effect. Rarely do you see the lower number quoted. On the X-25E all numbers are quoted at full span, showing again the higher performance of SLC. Also, if you look at the footnotes all write tests were done with drive caches enabled. For SQL Server this is a bad idea, if you have a power outage any data in the drive cache is lost. They perform these tests at the maximum queue depth for Native Command Queuing (NCQ) can handle. Again, this pushes the device to its peak throughput. This isn’t a bad thing for SSD’s, but most SQL Server setups have been engineered to keep queue depths low to decrease latencies from the IO system which is usually made up of spinning disks. If you don’t have latency issues now, you may not see a huge improvement by replacing your spinning disks with solid state ones. Size of the IO request is also very important Usually for number of IO’s they will use a sector sized request. On SSD’s that is normally 4 kilobytes. For throughput megabytes per second they use a 128 kilobyte request to get higher numbers. So, when you read the specifications you get the impression that a drive will do say 260 MB/sec at 35,000 IOs/Sec which just isn’t true. This isn’t a new game, hard drive benchmarks also do something similar. As you look at the 4k numbers you can effectively cut them in half since SQL Server works on an 8k page request size. SSDs also perform differently on random and sequential IO loads just like hard disks do. When you look at the specification make sure and note the IO mix, if they don’t give those numbers assume that you will have to do your own testing!

Previous Writes Effect Future Writes
Another issue with the performance numbers quoted has to do with the state of the drive. When a solid state disk is new, i.e. never been written to, it is at it’s peak. Performance will be the best it is ever going to be. When you test your solid state devices doing short duration tests can be very misleading. As I have already pointed out, if you only use a small section of the drives for writes you get inflated numbers. If you only do a short test on the entire drive you are effectively doing the same thing. You must test the entire drive. You must also understand your workload. If you don’t know what the workload will be don’t be afraid to test a wide range of IO sizes and types. Sequential writes tend to leave large contiguous blocks of free space making garbage collection faster. In contrast random writes typically leave lots of small blocks of free space forcing garbage collection to work overtime slowing writes down. As you move from one IO type to another you should add in extra time for the drive to settle into a new steady state before resuming valid samples. Your goal is to get the drive to perform in a predictable manor for your IO load. Realize you may need to discard a range of samples that cover the transition from one steady state to the other. It can lower or inflate your averages and cause you to under or over provision your storage to meet your IO requirements.

Performance over Time
Unlike a hard drive, as you use a solid state disks performance degrades over time for several reasons. In the case of the X-25M the first firmware suffered from poor garbage collection and IO pattern recognition on large volumes of small IO’s causing the drive to suffer as much as a five fold decrease in write performance. We aren’t just talking small files but small changes to large files, like SQL Server data files. This particular problem was partially fixed with a firmware update. In general, all solid state devices suffer As you use your drive over a longer period it will lose performance as part of the normal wear on the NAND Flash chips themselves. They develop more errors cause more write retries. These issues are corrected using ECC and bad block management, but it still leads to poorer performance. SLC has an advantage over MLC again due to it’s much higher endurance but isn’t 100 percent immune to this. If you replace your hardware on a three or five year cycle this may not be a huge issue for you, but it still pays to monitor the performance over time.

Summary

There is a lot to learn when it comes to solid state storage. Making sure you do your own testing and research can keep you from suffering from premature failure and poor performance down the road. Remember, NAND Flash has been around for a while but this new wave of solid state storage is only a few years old. Not having a large pool of these devices in the field for longer than their rated life span makes it hard to predict if they are truly as reliable as we all hope they are.

Fundamentals of Storage Systems, Solid State Storage Basics

Solid state storage is the new kid on the block. We see new press releases every day about just how awesome this new technology is. Like with any technology, you need a solid foundation in how it works before you can decide if it is right for you. Lets review what solid state storage is and where it differs from traditional hard disks. I will cover solid state storage in a general manor not favoring any specific flash manufacturer or specific type of Flash.

Outline:
Types of Flash
NAND Flash Structure
NAND Flash Read Properties
NAND Flash Write Properties
Wear-Leveling
Garbage Collection
Write Amplification
TRIM
Error Detection and Correction

Flash Memory

Flash is a type of memory like the RAM in your computer. There are several key differences though. First NAND is non-volatile, meaning it doesn’t require electricity to maintain the data stored in it. It also has very fast access times, not quite RAM’s access times but in between RAM and a spinning hard disk. It does wear out as you write to it over time. There are several types of Flash memory. The two most common type are NAND and NOR. Each has its benefits. NOR has the ability to write in place and has consistent and very fast read access times but very slow write access times. NAND has a slower read access time but is much faster to write to. This makes NAND more attractive for mass storage devices.

The Structure of NAND Flash

NAND stores data in a large serial array of transistors. Each transistor can store data. NAND Flash arrays are grouped first into pages. A page consists of data space and spare space. Spare space is physically the same as data space but is used for things like ECC and wear-leveling, which we will cover shortly. Usually, a page is 4096 bytes for data and 1 to 4 bits of spare for each 512 bytes of data space. Pages are grouped again into blocks of 64 to 128 pages, which is the smallest erasable unit. There can be quite a few blocks per actual chip, as many as 16 thousand blocks or 8 gigabytes worth, on a single chip. From there manufacturers group chips together usually in a parallel arrangement using controllers to make them look like one large solid state disk. They can vary in form factor. Most common are a standard 2.5” or 3.5” drive or a PCIe device.

NAND Read Properties

NAND Flash operates differently than RAM or a hard disk. Even though NAND is structured in pages like a hard disks, that is where the similarities end. NAND is structured to be access serially. As a type of memory, NAND flash is a poor choice for random write access patterns. A 15,000 RPM hard disk my have a random access seek time of 5.5 milliseconds. It has to spin a disk and position the read/write head. NAND on the other hand doesn’t actually seek. It does a look up and reads the memory area. It takes between 25 to 50 nanoseconds. It has the same read time no matter the type of operation random or sequential. A single NAND chip may be able to read between 25 and 40 megabytes a second. So, even though it is considered a poor performer for random IO, it is still orders of magnitude faster than a hard disk.

NAND Write Properties

NAND Flash has a much faster read speed than write speed. The same NAND chip that reads at 40 megabytes a second may only sustain 7 megabytes a second in write speed. Average write speed is 250~ nanoseconds. This figure only includes programming a page. Writing to flash can be much more complicated if there is already data in the page.

Program Erase Cycle

NAND does writes based on a program erase(P/E) cycle. When a NAND block is considered erased all bits are set to 1. As you program the bit you set it to 0. Program cycle writes page at a time and can be pretty quick. NAND doesn’t support a overwrite mode where a bit, page or even block can be overwritten without first being reset to a cleared state. The P/E cycle is very different from what happens on a hard disk where it can overwrite data without first having to clear a sector. Erasing a block takes between 500 nanoseconds to 2 milliseconds. Each P/E cycle wears on the NAND block. After so many cycles the block becomes unreliable and will fail to program or erase.

Wear-Leveling

To mitigate the finite number of P/E cycles a NAND chip has we use two different techniques to keep them alive or make sure we don’t use a possible bad block again. Lets take a single NAND MLC chip. It may have 16 thousand blocks on it. Each block may be rated between 3,000 to 10,000 P/E Cycles. If you execute a P/E cycle on one block per second it would take you over five years to reach the wear out rating of 10,000 cycles. If on the other hand you executed a P/E cycle on a single block you could hit the 10,000 rating in about 3 hours! This is why wear-leveling is so important. In the early days of NAND flash wearing out a block was a legitimate concern as applications would just rewrite the same block over and over. Modern devices spread that over not just a single chip but every available chip in the system. Extending the life of your solid state disk for a very, very long time. Ideally, you want to write to each block once before writing the second block. That isn’t always possible due to data access patterns.

It sounds simple enough to cycle through all available blocks before triggering a P/E cycle but in the real world it just isn’t that easy. As you fill the drive with data it is generally broken into two different categories, static data and dynamic data. Static data is something that is written once, or infrequently, and read multiple times. Something like a music file is a good example of this. Dynamic data covers things like log files that are written to frequently, or in our case database files. If you only wear-level the dynamic data you shorten the life of the flash significantly. Alternatively if you also include the static data you are now incurring extra write and read IO in the back ground that can effect performance of the device.

Background Garbage Collection
To defer the P/E cycle and mitigate the penalty of a block erase we rely on garbage collection running in the background of the device. When a file is altered it may be completely moved to clean pages and blocks, the old blocks are now marked as dirty. This tells the garbage collector that it can perform a block erasure on it at any time. This works just fine as long as the drive has enough spare area allocated and the number of write request is low enough for the garbage collector to keep up. Keep in mind, this spare area isn’t visible to the operating system or the file system and is independent of them. If you run out of free pages to program you start forcing a P/E cycle for each write slowing down writes dramatically. Some manufacturers off set this with a large DRAM buffer and also may allow you to change the size of the over provisioned space.

TRIM
Another technology that has started to gain momentum is the TRIM command. Fundamentally, this allows the operating system and the storage device to communicate about how much free space the file system has and allows the device to use that space like the reserve space or the over provisioned space used for garbage collection. The down sides are it is really only available in Windows 7 and Windows Server 2008 R2. Some manufacturers are including a separate TRIM service on those OS’es that don’t support it natively. Also, TRIM can only be effective if there is enough free space on the file system. If you fill the drive to capacity then TRIM is completely useless. Another thing to consider is an erasable block may be 256 KB and we generally format our file system for SQL Server at 64KB several times smaller than the erasable block. Last thing to remember, and it is good advice for any device not just solid state storage, is grow your files in large chunks to keep file fragmentation down to a minimum. Heavy file fragmentation also cuts down on TRIM’s performance and can’t be easily fixed since running a defragment may actually make the problem worse as it forces whole sale garbage collection and wears out the flash that much faster.

Write Amplification
Another pit fall of wear-leveling and garbage collection is the phenomenon of write amplification. As the device tries to keep up with write request and garbage collection it can effectively bring everything to a stand still. Again, writing serially and deleting serially in large blocks can mitigate some of this. Unfortunately, SQL Server access patterns for OLTP style databases means lots of little inserts, updates and deletes. This adds to the problem. There may be enough free space to accommodate the write but it is severely fragmented by the write pattern and a large amount of garbage collection is needed. TRIM can help with this if you leave enough free space available. This also means factoring free space into your capacity planning ahead of time. A full solid state device is a poor performing one when it comes to writes.

Error Detection and Correction

The nature of NAND Flash makes it susceptible to several types of data corruption. Like hard drives and floppy disks are at risk near magnetic sources. NAND has several vulnerabilities, some of them occur even when just reading data.

Write Disturb
Data in cells that aren’t being written to can be corrupted by writing to adjacent cells or even pages, this is called Program Disturb. Cells not being programmed receive elevated voltage causing them to appear weakly programmed. There isn’t any damage to the physical structure and can be cleared with a normal erase.

Read Disturb
Reading repeatedly from the same block can also have a similar effect call Read Disturb. Cells not being read collects a charge that causes it to appear to be weakly programmed. The main difference from Write Disturb is it is always on the block being read and always on pages not being read. Again, the physical cell isn’t damaged and an erase on the effected block clears the issue.

Charge Loss/Gain
Lastly, there is an issue with data data retention on cells over time. The charge on a floating gate over time may gain or lose charge, making them appear to be weakly programmed or in another invalid state. The block is undamaged and can still be reliably erased and written to.

All of this sounds just as catastrophic as it gets. Fortunately, error correcting code (ECC) techniques effectively deal with these issues.

Bad Block Management

NAND chips aren’t always perfect. Every chip may have defects of some sort and will ship from the factory with bad blocks already on the device. Bad block management is integrated into the NAND chip. When a cell fails a P/E cycle the data is written to a different block and that block is marked bad making it unavailable to the system.

Summary

As you can see, Flash in some respects is much more complicated that your traditional hard disk. There are many things you must consider when implementing solid state storage.

Takeaways
Flash read performance is great, sequential or random.
Flash write performance is complicated, and can be a problem if you don’t manage it.
Flash wears out over time. Not nearly the issue it use to be, but you must understand your write patterns.
Plan for over provisioning and TRIM support it can have a huge impact on how much storage you actually buy. Flash can be error prone. Be aware that writes and reads can cause data corruption.

Up next
We will talk about MLC vs SLC. What makes a device enterprise ready. How to effectively benchmark your solid state storage and not be caught off guard when you move into production.

At The End of the IO Road With C#? Pave New Road!

Not being one for letting a problem get the best of me, I took another look at the asynchronous overlapped IO problem. If you read my last post on the subject, you know I’ve done a lot of work on this already. None of the things I said last time have changed at all. If you want to do asynchronous and un-buffered IO in C# using the native file stream calls you can’t… So, I rolled my own. The kicker is, I don’t use any unmanaged code to do this. No call to VirtualAlloc() or anything else using DLL imports. Oh, and the speed is spectacular.

The Goal

My ultimate goal was to build a routine that would do un-buffered asynchronous IO. That means I don’t want the OS doing any buffering or funny stuff with the IO’s I issue. That goes for reads and writes. SQL Server uses this method to harden writes to the disk and it also performs well with excellent predictability. If you have ever use windows to do a regular copy you will see it eating up memory to buffer both reads and writes. If you copy the same file a couple of times you will notice that the first time it runs in about the speed you expect it, but the second time it may run twice as fast. This is all Windows, buffering as much data and holding on to that buffer. That’s great for smaller files but if you are pushing around multi-gigabyte files it is a disaster. As the system becomes starved for memory it pages then starts throttling back. Your 100MB/sec copy is now crawling along at 20MB/sec.

Where we left off..

I had settled on a simple routine that would allow me to do un-buffered reads from a file and write to a buffered file ether on disk or across the network.

internal class UnBufferedFileCopy
{
	public static int CopyBufferSize = 8 * 1024 * 1024;
	public static byte[] Buffer = new byte[CopyBufferSize];
	const FileOptions FileFlagNoBuffering = (FileOptions)0x20000000;

	public static int CopyFileUnbuffered(string inputfile, string outputfile)
	 {
		var infile = new FileStream(inputfile, FileMode.Open, FileAccess.Read
		, FileShare.None, 8, FileFlagNoBuffering | FileOptions.SequentialScan);
		var outfile = new FileStream(outputfile, FileMode.Create, FileAccess.Write
		, FileShare.None, 8, FileOptions.WriteThrough);

		int bytesRead;
		while ((bytesRead = infile.Read(Buffer, 0, CopyBufferSize)) != 0)
		{
			outfile.Write(Buffer, 0, bytesRead);
		}

		outfile.Close();
		outfile.Dispose();
		infile.Close();
		infile.Dispose();
		return 1;
	}
}

There are two problems with this routine. First off, only the read from source is truly un-buffered. C# offers the write through flag and I thought that would be enough. I fired up process monitor and watched the IO issued on writes and it wasn’t buffer sized requests, it was always broken up into 64k chunks. So, the read request would fetch say 16MB of data and pass that to the write request who would then break that up into chunks. This wasn’t the behavior I was going for! Doing some additional research I found adding the no buffering flag to the write through flag gave me the results I was after, almost. You can’t do un-buffered writes. Synchronous or asynchronous doesn’t matter. To do a un-buffered write the buffer area that you build from the byte array must be page aligned in memory and all calls must return a multiple of the page size. Again, this just isn’t possible in managed code. So, I investigated a horrible kludge of a solution. I do un-buffered writes until I get to the last block of data. Then I close and reopen the file in a buffered mode and write the last block. It isn’t pretty but it works. It also means that I can’t use write through and un-buffered on a file smaller than the buffer size. Not a huge deal but something to be aware of if you are doing a lot of small files. If you are going the small file route the first routine will probably be OK.

internal class UnBufferedFileCopy
{
	public static int CopyBufferSize = 8 * 1024 * 1024;
	public static byte[] Buffer1 = new byte[CopyBufferSize];
	const FileOptions FileFlagNoBuffering = (FileOptions)0x20000000;

	public static int CopyFileUnbuffered(string inputfile, string outputfile)
	{
		var infile = new FileStream(inputfile, FileMode.Open, FileAccess.Read
			, FileShare.None, 8, FileFlagNoBuffering | FileOptions.SequentialScan);
		//open output file set length to prevent growth and file fragmentation and close it.
		//We have to do it this way so we can do unbuffered writes to it later
		outfile = new FileStream(outputfile, FileMode.Create, FileAccess.Write
			, FileShare.None, 8, FileOptions.WriteThrough);
		outfile.SetLength(infilesize);
		outfile.Dispose();

		//open file for write unbuffered
		outfile = new FileStream(outputfile, FileMode.Open, FileAccess.Write
			, FileShare.None, 8, FileOptions.WriteThrough | FileFlagNoBuffering);
		long totalbyteswritten;
		int bytesRead1;
		//hold back one buffer
		while (totalbyteswritten &lt; infilesize - CopyBufferSize)
		{
			bytesRead1 = _infile.Read(Buffer1, 0, CopyBufferSize);
			totalbyteswritten = totalbyteswritten + CopyBufferSize;
			outfile.Write(Buffer1, 0, bytesRead1);
		}

		//close the file handle that was using unbuffered and write through
		outfile.Dispose();

		//open file for write buffered We do this so we can write the tail of the file
		//it is a cludge but hey you get what you get in C#
		outfile = new FileStream(outputfile, FileMode.Open, FileAccess.Write, FileShare.None, 8,
		FileOptions.WriteThrough);

		//go to the right position in the file
		outfile.Seek(infilesize - bytesRead1, 0);
		//flush the last buffer syncronus and buffered.
		outfile.Write(Buffer1, 0, bytesRead1);

		outfile.Dispose();
		infile.Dispose();
		return 1;
	}
}

This is as close to fully un-buffered IO on both the read and write side of things. There is a lot going on, but it is still synchronous all the way. If you look at performance monitor it will show you a saw tooth pattern as you read then write since you are only ever doing one or the other. Using this to copy a file across the LAN to another server never got better than 75MB/Sec throughput. Not horrible but a long way from the 105MB/Sec I get from something like FastCopy or TerraCopy. Heck, it’s not even close to the theoretical 125MB/Sec a gigabit connection could support. That leaves the last piece of the puzzle, going asynchronous.

Threading in C#, To Produce or Consume?

We know that using the asynchronous file IO built into C# isn’t an option. That doesn’t mean we can’t pattern something of our own like it. I’ve done quite a bit of threading in C#. It isn’t as difficult as C/C++ but you can still blow your foot off. It adds a whole other level of complexity to your code. This is where a little thought and design on paper and using a flow chart can help you out quite a bit. Also, it’s good to research design patterns and multi-threading. A lot of smart people have tackled these problems and have developed well designed solutions. Our particular problem is a classic producer consumer pattern, a simple one at that. We have a producer, the read thread, putting data in a buffer. We have a consumer, the write thread, that takes that data and writes it to disk. My first priority is to model this as simply as possible. I’m not worried with multiple readers or writers. I am concerned with locking and blocking. Keeping the time something has to be locked to a minimum is going to be key. That lead me to a simple solution. One read thread and the buffer it reads into, one write thread and the buffer it reads from and one intermediate buffer to pass data between them. Basically, an overlap buffer that is the same size as the read and write buffer. To give you a better visual example before showing you the code here are a couple of flow charts.

Read File
http://www.lucidchart.com/documents/view/4cac057f-d81c-472e-9764-52c00afcbe04

Write File
http://www.lucidchart.com/documents/view/4cac0726-dd14-46a6-8d44-53710afcbe04

There a few of things you need to be aware of. There is no guarantee of order on thread execution. That is why I’m using a lock object and a semaphore flag to let me know if the buffer is actually available to be written or read from. Keep the lock scope small. The lock can be a bottleneck and basically drop you back into a synchronous mode. Watch for deadlocks. With the lock and the semaphore flag in play if your ordering is wrong you can get into a deadlock between the two threads where they just sit and spin waiting for ether the lock or the flag to clear. At this point I’m confident I don’t have any race or deadlocking situations.

Here is a simplified sample, I’m serious this is as small a sample as I could code up.

internal class AsyncUnbuffCopy
{
	//file names
	private static string _inputfile;
	private static string _outputfile;
	//syncronization object
	private static readonly object Locker1 = new object();
	//buffer size
	public static int CopyBufferSize;
	private static long _infilesize;
	//buffer read
	public static byte[] Buffer1;
	private static int _bytesRead1;
	//buffer overlap
	public static byte[] Buffer2;
	private static bool _buffer2Dirty;
	private static int _bytesRead2;
	//buffer write
	public static byte[] Buffer3;
	//total bytes read
	private static long _totalbytesread;
	private static long _totalbyteswritten;
	//filestreams
	private static FileStream _infile;
	private static FileStream _outfile;
	//secret sauce for unbuffered IO
	const FileOptions FileFlagNoBuffering = (FileOptions)0x20000000;

	private static void AsyncReadFile()
	{
		//open input file
		_infile = new FileStream(_inputfile, FileMode.Open, FileAccess.Read, FileShare.None, CopyBufferSize,
		FileFlagNoBuffering);
		//if we have data read it
		while (_totalbytesread &lt; _infilesize)
		{
			_bytesRead1 = _infile.Read(Buffer1, 0, CopyBufferSize);
			lock (Locker1)
			{
				while (_buffer2Dirty)Monitor.Wait(Locker1);
				Buffer.BlockCopy(Buffer1, 0, Buffer2, 0, _bytesRead1);
				_buffer2Dirty = true;
				Monitor.PulseAll(Locker1);
				_bytesRead2 = _bytesRead1;
				_totalbytesread = _totalbytesread + _bytesRead1;
			}
		}
		//clean up open handle
		_infile.Close();
		_infile.Dispose();
	}

	private static void AsyncWriteFile()
	{
		//open output file set length to prevent growth and file fragmentation and close it.
		//We have to do it this way so we can do unbuffered writes to it later
		_outfile = new FileStream(_outputfile, FileMode.Create, FileAccess.Write, FileShare.None, 8,
		FileOptions.WriteThrough);
		_outfile.SetLength(_infilesize);
		_outfile.Close();
		_outfile.Dispose();
		//open file for write unbuffered
		_outfile = new FileStream(_outputfile, FileMode.Open, FileAccess.Write, FileShare.None, 8,
		FileOptions.WriteThrough | FileFlagNoBuffering);
		while (_totalbyteswritten &lt; _infilesize - CopyBufferSize)
		{
			lock (Locker1)
			{
				while (!_buffer2Dirty) Monitor.Wait(Locker1);

				Buffer.BlockCopy(Buffer2, 0, Buffer3, 0, _bytesRead2);
				_buffer2Dirty = false;
				Monitor.PulseAll(Locker1);
				_totalbyteswritten = _totalbyteswritten + CopyBufferSize;
			}
			_outfile.Write(Buffer3, 0, CopyBufferSize);
		}
		//close the file handle that was using unbuffered and write through
		_outfile.Close();
		_outfile.Dispose();
		lock (Locker1)
		{
			while (!_buffer2Dirty) Monitor.Wait(Locker1);
			//open file for write buffered We do this so we can write the tail of the file
			//it is a cludge but hey you get what you get in C#
			_outfile = new FileStream(_outputfile, FileMode.Open, FileAccess.Write, FileShare.None, 8,
			FileOptions.WriteThrough);
			//this should always be true but I haven't run all the edge cases yet
			if (_buffer2Dirty)
			{
				//go to the right position in the file
				_outfile.Seek(_infilesize - _bytesRead2, 0);
				//flush the last buffer syncronus and buffered.
				_outfile.Write(Buffer2, 0, _bytesRead2);
			}
		}
		//close the file handle that was using unbuffered and write through
		_outfile.Close();
		_outfile.Dispose();
	}
	
	public static int AsyncCopyFileUnbuffered(string inputfile, string outputfile, int buffersize)
	{
		//set file name globals
		_inputfile = inputfile;
		_outputfile = outputfile;
		//setup single buffer size, remember this will be x3.
		CopyBufferSize = buffersize * 1024 * 1024;
		//buffer read
		Buffer1 = new byte[CopyBufferSize];
		//buffer overlap
		Buffer2 = new byte[CopyBufferSize];
		//buffer write
		Buffer3 = new byte[CopyBufferSize];
		//get input file size for later use
		var f = new FileInfo(_inputfile);
		long s1 = f.Length;
		_infilesize = s1;

		//create read thread and start it.
		var readfile = new Thread(AsyncReadFile) { Name = &quot;ReadThread&quot;, IsBackground = true };
		readfile.Start();

		//create write thread and start it.
		var writefile = new Thread(AsyncWriteFile) { Name = &quot;WriteThread&quot;, IsBackground = true };
		writefile.Start();

		//wait for threads to finish
		readfile.Join();
		writefile.Join();
		Console.WriteLine();
		return 1;
	}
}

As you can see, we have gotten progressively more complex with each pass until we have finally arrived at my goal. With zero unmanaged code and only one undocumented flag I’ve built a C# program that actually does fast IO like the low level big boys. To handle the small file issue I just drop back to my old copy routine to move these files along. You can see a working sample at http://github.com/SQLServerIO/UBCopy It also has an MD5 verification built in as well.

So, how well does it work?

FastCopy 1.99r4
TotalRead = 1493.6 MB
TotalWrite = 1493.6 MB
TotalFiles = 1 (0)
TotalTime= 15.25 sec
TransRate= 97.94 MB/s
FileRate = 0.07 files/s

UBCopy 1.5.2.1851 — Managed Code
File Copy Started
%100
File Copy Done
File Size MB : 1493.62
Elapsed Seconds : 15.26
Megabytes/sec : 102.63
Done.

I think it will due just fine.