OT: Rewriting History

I’ve been blogging for a few months. During that time I’ve received comments about my grammar, sentence construction and paragraph construction, and I have taken this criticism to heart. I have only had two specific comments about my writing style, so I have a point to start from. Like most of you, I do not write for a living, but I do write every day of my life. I had basic writing instruction during my high school and college days, but nothing during my time as a professional. The most training I’ve ever had in communicating with others has been in the form of speech and theater. I have had some success as a speaker, and I feel very good about that end of my skill set.

So, I am attempting to correct this lack of education the only way I know how. I did some research and came up with three books to start my literary education.

The Elements of Style
My wife tells me this is a must.

 

On Writing Well, 30th Anniversary Edition: The Classic Guide to Writing Nonfiction
This one looked good and I’ve enjoyed reading it so far.

 

Grammar Girl’s Quick and Dirty Tips for Better Writing
I have listened to the podcast before and thought it was fine, so I got the book.

The great thing about a blog is that it isn’t as static as a printed book. As I learn, I can revise the articles to correct the broken bits and make them easier to read.

If you have any recommendations, leave them in the comments section and I’ll give them a read. I’ll also try to learn something in the process!

 

UPDATE:

A couple of friends who write also suggested:

Woe Is I (Expanded Edition)
Eats, Shoots & Leaves: The Zero Tolerance Approach to Punctuation
Merriam-Webster’s Everyday Language Reference Set: Vocabulary Builder/Thesaurus/Dictionary

All three are on the way!

Adventures in SQL CLR and C#

I’ve toyed with the CLR in SQL Server 2005 off and on since the first Yukon beta had it enabled. And I’ll be honest with you, I was not a fan. It wasn’t a “You got chocolate in my peanut butter!” kind of moment for me. I really thought it was going to be a disaster of biblical proportions. As SQL Server DBAs we caught a break: adoption wasn’t exactly stellar. There were enough restrictions and little gotchas to keep developers from abandoning Transact-SQL wholesale for something more familiar. Fast forward a few years and now I’m not so scared.

My biggest worry back then was memory usage. I’m still not very comfortable with it, but on a 64-bit platform you can mitigate those issues by adding more memory. On a 32-bit platform you could cause all kinds of damage by squeezing the lower 4GB memory space to the point where you could have connection and backup failures due to lack of memory. Oh, and the fix is usually restarting SQL Server. An example of this comes directly from http://msdn.microsoft.com/en-us/library/ms131075.aspx:

Scalable Memory Usage

In order for managed garbage collection to perform and scale well in SQL Server, avoid large, single allocation. Allocations greater than 88 kilobytes (KB) in size will be placed on the Large Object Heap, which will cause garbage collection to perform and scale much worse than many smaller allocations. For example, if you need to allocate a large multi-dimensional array, it is better to allocate a jagged (scattered) array.

88KB!

This memory thing is serious.

The other biggie is what you can, or cannot, do using the CLR.

Again, from MSDN: http://msdn.microsoft.com/en-us/library/ms131047.aspx

SAFE is the most reliable and secure mode with associated restrictions in terms of the allowed programming model. SAFE assemblies are given enough permission to run, perform computations, and have access to the local database. SAFE assemblies need to be verifiably type safe and are not allowed to call unmanaged code.

UNSAFE is for highly trusted code that can only be created by database administrators. This trusted code has no code access security restrictions, and it can call unmanaged (native) code.

EXTERNAL_ACCESS provides an intermediate security option, allowing code to access resources external to the database but still having the reliability guarantees of SAFE.

From most restrictive to least restrictive, the permission sets are SAFE, EXTERNAL_ACCESS and UNSAFE. This is something you don’t generally worry about as a C# programmer, but in the database it’s always an issue in some way.

What it boils down to:

If you are just talking to SQL Server using basic C# stuff, leave it in SAFE, which is the default.

If you need access to the file system, the registry and some other limited stuff, EXTERNAL_ACCESS is the way to go.

If you want to have the ability to completely tank a production SQL Server, UNSAFE puts it all into your hands. You can call unmanaged code via P/Invoke, and all bets are off.

 

Some additional light reading on what libraries can and can’t be called in the CLR.

http://msdn.microsoft.com/en-us/library/ms403273.aspx

Fun stuff: no finalizers or static fields, though read-only static fields are OK. You will see why this is important to me a little later on.

http://msdn.microsoft.com/en-us/library/ms403279.aspx

 

T-SQL vs. CLR

The other thing I had been promoting, and not always correctly, is putting complicated math functions in the CLR. Generally, I’ve found that most math problems run faster in the CLR than in native T-SQL, and for the most part that holds true for the core algorithm. Once you add data retrieval into the mix, things shift back in T-SQL’s favor for a lot of operations. Like everything else, test your ideas using real-world scenarios, or as close as you can get, before deciding on one technology over another. A prime example for me was coding up the Pythagorean and Haversine equations for the classic distance between two zip codes in T-SQL and in C# via the CLR. Running test data through an array, the C# solution ran rings around the T-SQL function I had coded up, but once it had to start pulling data from and pushing data back to the database, the T-SQL solution was the clear winner.
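To make the two distance calculations concrete, here is a minimal sketch of both. It is written in Python rather than T-SQL or C#, purely to show the math. The function names, the mean Earth radius of 3,958.8 miles and the sample coordinates are my assumptions for illustration, not code from either of the implementations I tested.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def pythagorean_miles(lat1, lon1, lat2, lon2):
    """Flat-plane approximation: treat the lat/lon differences as the two
    legs of a right triangle, scaling longitude by the cosine of the mean latitude."""
    mean_lat = math.radians((lat1 + lat2) / 2)
    x = math.radians(lon2 - lon1) * math.cos(mean_lat)
    y = math.radians(lat2 - lat1)
    return EARTH_RADIUS_MILES * math.hypot(x, y)


# Hypothetical sample points, roughly two Texas zip code centers.
a = (30.2672, -97.7431)
b = (29.4241, -98.4936)
print(f"Haversine:   {haversine_miles(*a, *b):.2f} miles")
print(f"Pythagorean: {pythagorean_miles(*a, *b):.2f} miles")
```

For short hops the two agree closely, and the flat-plane version is cheaper. The difference only starts to matter over long distances.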

Another aspect where the CLR can be much better is string manipulation. I’ve written a couple of small UDFs to handle some of this. Since using LIKE ‘%’ would cause a table scan anyway, the CLR UDF was faster internally when dealing with the string than T-SQL was using all of its string handling functions.

I’m also seeing quite a bit on using the CLR for rolling aggregates and other kinds of aggregation problems. I don’t have any personal experience in that yet with the CLR.

There are also some things that aren’t practical at all using T-SQL. Some would say you shouldn’t be using the database for some of this stuff in the first place, but that is an argument for a different post.

 

And Now for Something Completely Different…

I’ve recently started working on my most complex project using the CLR. Some aspects have been covered by other folks like Adam Machanic, Robin Dewson and Jonathan Kehayias, but there were some specific requirements that I needed to meet.

Thus was born….

SQL Server File System Tools

This is a CodePlex-hosted project and all the source code is available there for your viewing pleasure.

I’ve done a lot of C# stuff, but this was my first hardcore CLR app for SQL Server.

What the assembly does is pretty simple: store files in the database either native, encrypted or compressed. Yoel Martinez wrote up a nice UDF that does blob compression using the CLR. Between this and the examples in Pro SQL Server 2005 on storing files in the database, I knew I could do what I needed to do.

The wrinkle in my project was that I needed not just to read the file and store it compressed, but to put it back on disk compressed as well. Enter #ziplib (SharpZipLib). This library allows you to pretty easily create standard zip files that even Windows Explorer can open and extract from. So, with all the bits in place, I set out to build my little tool.

 

Development Cycle

The first thing I did was put together all the samples I’d found, build them up as a set of stored procedures instead of UDFs, and get the file in and out working. Next I added compression via C#’s DeflateStream to see what it would take to get the data flowing in and out, and what the performance hit in memory and time would start looking like. At this point I was pretty optimistic I could knock this thing out in a day or two, tops. That was all fine and dandy until I started integrating the #ziplib library. My initial goal was to have the assembly set to EXTERNAL_ACCESS, since that was the most restrictive security model that still allowed access to the file system.

Since the guys who wrote #ziplib didn’t have the CLR in mind, there are several things that break without UNSAFE set. As I mentioned earlier, the use of finalizers and static fields were the two big ones. I will at some point recode those parts, but for now they are still in place. The second thing is that the library covers a lot more functionality than I actually need, so I’ve removed the bits I can without refactoring the library. The resulting DLL isn’t horribly big at this point, but I figure when I get around to recoding the finalizers I’ll refactor down to what I need. One big plus for me, though, is that #ziplib is all managed code written in C#, so it is pretty easily added directly into my DLL. I don’t have to register two assemblies or call down to the file system to an unmanaged DLL. Encryption is handled by RijndaelManaged, which is built into the .NET 2.0 libraries.

The big downer for me was trying to debug the code in Visual Studio 2008. When it did work it was OK, but at other times it would fail to connect or fail to register the assemblies, so I just fell back to injecting debug messages and running tests manually in SSMS.

One thing I really like about programming languages like C# is method overloading. I really wish you could do that with stored procedures! Since I can’t, there were only two options: a stored proc with lots of flags and variables that may or may not be used, handling it all under the covers, or a separate proc for each option with simple callers and a descriptive name. I voted for option two. Some of the T-SQL procedures are used internally by the CLR procedures, while all the CLR procedures are called by the user.

Here is the list of procedures and what they do.

Called by CLR procedures as helpers

InsertFile
Is called by every proc that inserts a file into the database.

SavePassphrase
Called by StorePassPhrase to handle insert into database.

RetrievePassPhrase
Called by any proc that has to decrypt a file stream.

RetrieveFile
Called by any proc that retrieves a file from the database.

Called by Users

T-SQL Procedures

InsertTag
Called by the user, supplying a file id and a list of keywords or “tags” as a search helper other than the file name.

RetrieveFileDetails
Get details on a single file or every file stored in the database.

CLR Procedures

StorePassPhrase
Give it a password and it generates a secure hash, stored in the database, for symmetric encryption.

The procedures below all store a file from the file system into the database.
StoreFile
StoreFileEncrypted
StoreFileCompressed
StoreFileEncryptedCompressed

The procedures below all retrieve a file from the database back to the file system.
ExtractFileToDisk
ExtractFileToDiskEncrypted
ExtractFileToDiskCompressed
ExtractFileToDiskEncryptedCompressed

The procedures below all retrieve a file from the database but return a record set with the file name and the BLOB data.
ExtractFileToRecord
ExtractFileToRecordEncrypted
ExtractFileToRecordCompressed
ExtractFileToRecordEncryptedCompressed

And lastly, I put in an xp_getfiledetails clone, since I wanted a way to verify that the file is on disk and get its attributes. It seemed pretty straightforward, since I’m getting the same details when I write the file to the database anyway.

Final Thoughts

This project isn’t done yet. There are a few more things to be added besides the code cleanup I mentioned already.

An offline decryption tool, so files dumped to disk still encrypted can be worked with.

Additional stored procedures for searching for files by tag or by attributes like name, size, etc.

A real installer and not a zip file with T-SQL scripts.

After that it goes into maintenance mode, with no new features, just work on speeding it up, reducing the memory impact and fixing any bugs that are found. I really want to avoid this growing into a huge library. Keep it simple, do one thing and do it well.

Resources

Here are some things that helped me along the way.

Development Tools

Visual Studio 2008
I could have used Notepad, but hey, I’m getting lazy in my old age.

JetBrains ReSharper 4.5
If you are using Visual Studio, ReSharper is a must. I feel like I’m programming in the stone age without it.

GhostDoc
Free tool to help you document your C# code using XMLDoc. Yet something else I wish I could do with stored procedures.

Doxygen
If you are building documentation and have XMLDoc in your code this can make it easier to gather it all together. It isn’t perfect but it is free.

Books

Both are solid texts. Pro SQL Server 2005 has a chapter on the CLR.

Pro SQL Server 2005

 

This one is dedicated to just CLR and was also invaluable to me.

Pro SQL Server 2005 Assemblies

 

On The Web

http://www.sqlclr.net
Devoted to the CLR with some nice resources.

My Twitter buddies are always there to answer a question or two!

Until next time!

-Wes

SQL Server MVP Deep Dives

I’ve been reading through this book and it really does have something for everyone at any level. I think it is truly awesome that this many people have stepped up and donated not only their time but their knowledge to help http://www.WarChild.org!

I’ll have a more in-depth review later this week when I’ve read it cover to cover. From Paul Nielsen’s blog: http://sqlblog.com/blogs/paul_nielsen/archive/2009/09/29/53-mvps-warchild-org-and-sqlservermvpdeepdives-com.aspx

GO GET IT! 🙂

Wes

Fundamentals of Storage Systems – The Basics of Spinning Disks

Your servers are only as fast as their slowest part: hard drives. To feed the other parts of the system we have to add lots of drives to get the IO a single server can consume.

The basics of how hard drives work have been fundamentally static since the ’70s; only the techniques and the core technologies have been refined. You have a shaft or “spindle” attached to a motor. Disks or “platters” are attached to the spindle. The motor spins the spindle and the platters. Read/write heads controlled by actuator motors move across the surface with very precise motion and access the information stored on the platters. Generally, there is one read/write head per usable platter surface.

Simple.

This configuration has worked so well for the last 45 years that every claim to date that some new technology would unseat it has failed to pan out. That’s not to say it won’t happen, just that hard drives have been “good enough” for the bulk of our storage needs for a very, very long time. Since this is the core of permanent storage in our database world, it is important to have a basic understanding of them.

File: SixHardDriveFormFactors.jpg
Description: Six hard disk drives with cases opened showing platters and heads; 8, 5.25, 3.5, 2.5, 1.8, and 1 inch disk diameters are represented.
Date: 1 March 2008
Author: Paul R. Potts
Source: http://commons.wikimedia.org/wiki/File:SixHardDriveFormFactors.jpg

I love this picture. Smaller and faster yet still the same.

To give you an idea of what you are up against, let’s compare the growth rate of your hard drive vs. your CPU.

Our 1981 machine has the veteran Seagate ST-412 and an Intel 8088.
Our new computer has a Seagate Cheetah 15k.6 ST3146356SS and a Core i7 965, from Intel of course.

Metric       Circa 1981             Today                     Improvement
Capacity     10MB                   147GB                     14,700x (209,715x for a 2TB drive)
Seek time    85ms                   3.4ms                     25x (6x for a 2TB drive)
IO/sec       11.4                   303                       26x
Mbit/sec     5 (0.625 MB/sec)       1,000 (125 MB/sec)        200x
CPU          4.77MHz (0.33 MIPS)    3,200MHz (18,322 MIPS)    55,521x

At first glance we can say WOW, what an improvement! Right up until you see how far the processors have come. Everyone is familiar with Moore’s law (often quoted, rarely understood), which, loosely applied, says CPU transistor counts double roughly every 18 to 24 months. Up until recently, hard drive capacity has been growing at almost the same rate, doubling in size around every 18 months (Kryder’s Law). Hard disks haven’t come close to keeping up with that pace performance-wise. Again, the problem isn’t size, it’s speed.
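As a rough sketch of what those doubling rates imply, the snippet below computes the growth multiple over a span of years. The 28-year span (1981 to roughly 2009) and the function name are my assumptions for illustration.

```python
def growth_factor(years, doubling_period_months):
    """Growth multiple after `years` if a quantity doubles every `doubling_period_months`."""
    doublings = years * 12 / doubling_period_months
    return 2 ** doublings


YEARS = 28  # assumed span, 1981 to about 2009

for months in (18, 24):
    factor = growth_factor(YEARS, months)
    print(f"Doubling every {months} months for {YEARS} years: {factor:,.0f}x")
```

Either doubling rate gives a multiple in the tens or hundreds of thousands. Compare that with the couple of dozen times that seek time improved in the table above.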

The Makeup of A Modern Hard Drive

“You cannot change the laws of physics” – Scotty

As I stated in the previous section, hard drives have remained relatively unchanged since the IBM Winchester drive. Let’s take a closer look at the physical structure.

Heads, Sectors and Cylinders

So, we have a spindle, one or more platters and one or more read/write heads, all of that spinning and jittering about at a pretty good clip. So, just how does the computer know where your data is? The platter is broken up into a map of sorts.

Simplistic view:

image

The platter is broken up into concentric rings and pie slices that allow the drive controller to find the region where the data is.

image

The heads all move in unison and present a view through the platters that makes up a cylinder. I won’t go into great detail on advanced sector and track layouts or the advent of Logical Block Addressing; there are plenty of articles on the web that get into those nuts and bolts. What I’m after is to show you what physically has to happen to read the data from the disk, and why that is the limiting factor. With the disk spinning at 15,000 RPM the sectors are flying by pretty quickly, so the head has to be positioned above the sector and then read or write to it as the platter moves underneath it. Spinning the disk, moving the heads and waiting for the data to be read all add up to latency.

Rotational latency is how long it takes the sector we are after to move under the head to be read or written. Average rotational latency is expressed as half the time it takes for the platter to make one revolution. For our 15k hard drive that number is 2 milliseconds: 60 seconds divided by 15,000 revolutions per minute is 4 milliseconds per revolution, and half of that is 2 milliseconds.

Seek Time is how quickly the disk head can be positioned over a sector to start reading data.

There are two kinds of seek we are interested in: average random seek time and sequential, or track-to-track, seek time.

In our top-of-the-line Seagate Cheetah the random read seek time is 3.4ms. That is the time it takes to get from any one sector to any other sector, usually half the distance from the innermost track to the outermost track. Random write seek time is 3.9ms. It is longer due to the process of actually affecting the sector it’s at before moving on to the next random sector.

Sequential seek is much, much faster. If the head only has to move to the next track it can usually do so in under a millisecond.

All this adds up to an average access time. Basically, you take the rotational latency plus the average random seek time plus any command processing overhead, for which I usually throw in an additional millisecond. Our Cheetah has a random access time of 6.4ms. Sometimes it may be much faster and sometimes much slower, but this is a good number to work with when planning our storage needs.
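The arithmetic above is easy to turn into a couple of small helpers. This is a minimal sketch in Python. The function names are my own, and the 1ms overhead default is the rule-of-thumb figure I mentioned, not a number from a spec sheet.

```python
def avg_rotational_latency_ms(rpm):
    """Average rotational latency: half of one full revolution, in milliseconds."""
    ms_per_revolution = 60_000 / rpm
    return ms_per_revolution / 2


def avg_access_time_ms(seek_ms, rpm, overhead_ms=1.0):
    """Average access time: seek time plus rotational latency plus command overhead."""
    return seek_ms + avg_rotational_latency_ms(rpm) + overhead_ms


# Seek times for the 15k Cheetah, as quoted in the text.
read_ms = avg_access_time_ms(3.4, 15_000)
write_ms = avg_access_time_ms(3.9, 15_000)
print(f"Rotational latency: {avg_rotational_latency_ms(15_000):.1f} ms")
print(f"Random read access: {read_ms:.1f} ms")
print(f"Random write access: {write_ms:.1f} ms")
```

Plug in the spindle speed and seek times from your own drive’s data sheet to get a planning number for your hardware.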

The flip side of operations per second is throughput, usually expressed in megabytes per second.

This correlates directly with the amount of data that can be squeezed into a sector. As drive densities go up, so does the average megabytes per second. There is something you should know: the inner tracks are slower on throughput but higher on IOs, and the outer tracks are higher on throughput and lower on IOs. This is just a function of the diameter of the track getting larger the farther out you go.

It isn’t unusual to see average sequential throughput of around 110 MB/sec, and that is only getting better.

Random throughput is not so rosy a picture. I haven’t seen any drive manufacturers advertise these numbers. In my own testing it can be anywhere from as little as 15MB/sec up to 40MB/sec. You should test your system to get more accurate numbers.

What It All Means to Us

This boils down to how many I/O operations a single disk can give us. In SQL Server land random IO is king, and generally one of the biggest bottlenecks on our data files. For log files, things are a little better. Since logs are written to sequentially, you can effectively double the available I/Os a drive can provide, since you have cut out the random access and are much closer to sequential or track-to-track access.

To calculate the maximum number of random operations we use 1000ms / (seek time[ms] + latency[ms] + overhead[ms]) = input/output operations per second.

or

1000/(3.4+2+1) = 156 IOps

Sequential reads get much better since seek times go down from 3.4 to around 0.7.

1000/(0.7+2+1) = 270 IOps

Almost twice as much! Now you know why we keep our database log files separate from each other and from the data. The number of disks needed to get the same performance is about half. We do the same calculation for writes, and the numbers will be a little lower.
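Here is the same IOps formula as a minimal Python sketch, so you can swap in your own drive’s numbers. The function name is mine. The 0.7ms sequential seek is the figure used in the formula above, and the 1ms overhead is my rule-of-thumb allowance.

```python
def max_iops(seek_ms, latency_ms, overhead_ms=1.0):
    """Theoretical maximum I/O operations per second for a single drive."""
    return 1000 / (seek_ms + latency_ms + overhead_ms)


ROTATIONAL_LATENCY_MS = 2.0  # 15,000 RPM drive

random_read = max_iops(3.4, ROTATIONAL_LATENCY_MS)
random_write = max_iops(3.9, ROTATIONAL_LATENCY_MS)
sequential_read = max_iops(0.7, ROTATIONAL_LATENCY_MS)

print(f"Random read:     {random_read:.0f} IOps")
print(f"Random write:    {random_write:.0f} IOps")
print(f"Sequential read: {sequential_read:.0f} IOps")
```

Because the 2ms rotational latency and the overhead stay fixed, shaving the seek time is the only lever that changes the result. That is exactly why sequential access comes out so far ahead.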

Hard drives suffer from what is known as the “hockey stick” effect: the closer they get to 100% utilization, the more dramatically performance falls off.

image

Running a disk at 100% capacity for IOs introduces the maximum possible latency. The knee of the curve is around 80%; we back that off a little more to 75%, and that gives us the number of IOs we have available per hard drive in the storage system in general. This reduces queuing and keeps latency low, at the cost of the maximum number of IOs. Now our available read IOs are down to about 117 IOps for random access and about 203 IOps for sequential. This number will get better as seek times and command overhead improve. But remember, it will never, ever be better than the 2.0ms rotational latency allows. Physics can be a real bummer sometimes. Along with physical spindle speed, there have been large improvements in how the drive handles incoming and outgoing requests. Through IO prioritization and advanced command queuing algorithms (Tagged Command Queuing on SAS, Native Command Queuing on SATA), access times and latencies are kept predictable and as fast as possible.
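To turn the 75% rule into a spindle count, here is a minimal Python sketch. The function names and the 2,000 IOps workload are hypothetical, picked only for illustration. It also ignores RAID overhead and caching, so treat the result as a floor rather than a final answer.

```python
import math


def usable_iops(max_iops_per_disk, target_utilization=0.75):
    """IOps we plan on per disk after backing off from the knee of the curve."""
    return max_iops_per_disk * target_utilization


def disks_needed(required_iops, max_iops_per_disk, target_utilization=0.75):
    """Whole number of disks needed to serve a workload at the target utilization."""
    per_disk = usable_iops(max_iops_per_disk, target_utilization)
    return math.ceil(required_iops / per_disk)


random_read_max = 1000 / (3.4 + 2 + 1)  # random read maximum for the 15k drive
workload_iops = 2000                    # hypothetical workload

print(f"Usable random read IOps per disk: {usable_iops(random_read_max):.0f}")
print(f"Disks needed for {workload_iops} IOps: {disks_needed(workload_iops, random_read_max)}")
```

Rounding up matters here. A fraction of a disk short means some spindle is running past the knee of the curve.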

Series To Date:

  1. Introduction
  2. The Basics of Spinning Disks – You are here!
  3. The System Bus
  4. Disk Controllers, Host Bus Adapters and Interfaces
  5. RAID, An Introduction
  6. RAID and Hard Disk Reliability, Under The Covers
  7. Stripe Size, Block Size, and IO Patterns
  8. Capturing IO Patterns
  9. Testing IO Systems

The Fundamentals of Storage Systems – Introduction

At least once a year I give a large talk on disk subsystems, IO and SQL Server. It’s a ground-up tour, from the nuts and bolts of how a hard drive works through SANs and solid state disks. The reason I give this presentation so often is that it is one of the most requested topics and one of the most misunderstood. The problem often lies in the fact that DBAs may not know that much about different storage systems, but they do know that storage is very important to their jobs. With the rise of SAN, iSCSI and other storage solutions, DBAs have less and less control over the disk system that their SQL Server relies on. It’s my goal to give them, or you, the tools needed to effectively present your needs to the storage teams, hopefully without a major amount of fuss and argument. If you know how and why it works the way it works, you can make logical requests in the language that your storage folks understand.

The presentation is meant to lay a foundation that you can then build upon to expand your knowledge of all things I/O.

This article series will be slightly expanded over what my presentation normally covers, since I’m only restricted by your willingness to read what I write. It will still be a condensed version of storage systems but I’ll put up as many reference links as I can.

Series To Date:

  1. Introduction
  2. The Basics of Spinning Disks
  3. The System Bus
  4. Disk Controllers, Host Bus Adapters and Interfaces
  5. RAID, An Introduction
  6. RAID and Hard Disk Reliability, Under The Covers
  7. Stripe Size, Block Size, and IO Patterns
  8. Capturing IO Patterns
  9. Testing IO Systems
  10. Latency
  11. Solid State Storage Basics
  12. Understanding Reliability and Performance of Solid State Storage
  13. Shared Consolidated Storage Systems

Upcoming Posts:

Storage Area Networks
Network Attached Storage
iSCSI
SQL Server and The File System
Understanding Mean Time to Failure and Other Failure Metrics
Tools and Techniques To Monitor SQL Server and I/O

Some topics may be a single post and some may span several. I won’t know for sure until I get done writing them. As requests come in I may try to post on specific questions, or at a minimum point you in the right direction.

Stay Tuned….

-Wes