Monthly Archives: February 2011

My Upcoming Schedule, Come Get Your Learn On!

It’s official, I’ll be speaking at SQLSaturday #63 April 2nd. It is something near and dear to my heart, solid state storage. If you have seen me speak recently at any of the other SQLSaturdays you know I love talking about I/O. This time around I’ll be focusing on one of the new technologies, flash storage. It always surprises me just how little some people understand a technology before pushing it into service! Just deciding to implement solid state disks without doing your due diligence can be a very expensive mistake. I have spent quite a bit of time over the last couple of years getting to know solid state disks and the underlying technology of NAND flash. I’ve been writing about it and I’ve been fortunate enough to have first hand experience with it as well. I’m really exited to teach about this new technology with a SQL Server slant to it.

It’s also been officially announced that I’ll be speaking at SQLRally in Orlando! I cannot tell you how exited I am about being a speaker at the very first SQLRally. To be honest with you I was a little stunned and a lot humbled by the nomination committee and your votes. Just as I was getting use to the idea of a “regular” 60 minute session, I was asked if I could expand it to 90 minutes and do a “deep dive” session instead. This will be an extended version of Understanding Storage Systems and SQL Server with more coverage on SAN’s and monitoring IO. 

What Is PASS SQLRally?

PASS’ new regional event fills the gap between our free PASS SQLSaturday 1-day training events and PASS Summit, the world’s largest conference for the SQL Server community. At SQLRally you’ll meet with industry experts, authors, Microsoft MVPs, and other people just like you to learn about best practices, effective troubleshooting, how to prevent issues, save money, and build a better SQL Server environment for your company or clients.

Fundamentals of Storage Systems, IO Latency and SQL Server

A Thousand Men Marching Still Only March As Fast As One Man.

la·ten·cy – Computers . the time required to locate the first bit or character in a storage location, expressed as access timeminus word time.

Often when talking to people about performance they get rapped around the MB/Sec number and ignore a critical factor, latency. SQL Server is one of those few applications that is very sensitive to disk and network latency. Latency is what the end user sees. If your SQL Server is waiting around for disk or network they will start to complain. In an OLTP environment SQL Server accesses data on disk in a nonlinear fashion, aka random IO’s. The size of these IO request can be pretty small. In a good application you really try and limit the amount of data returned to keep things speedy. The down side of small Random IO’s is the system will never be faster than a single seek operation on your disk. So, if you have a 15k SAS drive that is around 2.5ms. Caching and buffering schemes aside for now, 2.5ms is your floor. You will never be faster than that. Depending on the size of the IO request you spend more time waiting for the seek operation than you do actually transferring the data from the disk. We group disks together in larger arrays to give us more aggregate throughout and higher operations per second, but you are ever only as fast as your slowest disk. You can always get more aggregate throughput up to several gigabytes a second but you still have to wait on that first bit to travel to you.

To get around these short comings, SQL Server buffers as much data as it can in memory. We also add large amounts of cache to our SAN’s and disk controllers. This can mask some of the problem, but at some point it still needs to come from your disk drives.

On the network side things are actually better. With latency under a millisecond on a LAN you usually waiting on disk to deliver the data. There are other factors like the speed of the network equipment and number of hops across interfaces you have to make can be more significant than the actual transmittion rate. TCP/IP can be a factor as well. Out of the box SQL Server is configured at 4KB(4096 byte) packet. This is a good general setting for most workloads. If you are working on a highly tuned OLTP system you may want to set that to something smaller or align it with the TCP packet size for your network, usually 1500 bytes. If it is a OLAP system with lots of streaming throughput latency will make up a very small part of the overall transmission time and going with a larger packet size possibly aligned to the 8KB page size with increase throughput and decrease the time to transmit overall. If you do set a large packet size you should consider enabling jumbo frames on your network card. Make sure that your network equipment can support the jumbo frame from end to end.

Another place where we are starting to see more latency issues is with database mirroring. In synchronous mode, the default setting, you are now adding network latency plus the disk latency of the other server to the overall transaction time.

Mirroring isn’t the only game in town. We have had SAN level replication for quite a while as well. In most of the scenarios where we were using SAN level replication it was site to site across several miles. To give you an idea of how much latency can be added in this situation go ping yahoo.com or google.com, I’ll wait….. Ok, from my location ether of them is 45ms~75ms, or 18 times slower than your spinning disks. All the sudden, your network is the major player in delaying transactions. I’ve used fibre optics to connect to sites and the latency can still be a killer for OLTP systems. The general rule of thumb is 7.5 microseconds for every  1 1/2 miles. If our target SAN is 125 miles away we just added 2ms of latency to the 4ms of latency the two sets of disks are providing. In reality, it is worse than that when you again figure in network equipment. I personally haven’t seen synchronous setups more than 50 miles apart.

Just something to keep in mind as you plan your SQL Server infrastructure. Latency in its myriad forms is king.

Series To Date:
  1. Introduction
  2. The Basics of Spinning Disks
  3. The System Bus
  4. Disk Controllers, Host Bus Adapters and Interfaces
  5. RAID, An Introduction
  6. RAID and Hard Disk Reliability, Under The Covers
  7. Stripe Size, Block Size, and IO Patterns
  8. Capturing IO Patterns
  9. Testing IO Systems
  10. Latency – You are here!

SQLDIY: Manage and Monitor SQL Server Yourself

As an old school DBA and always having cheap frugal employers, I have rolled my own management and monitoring solutions. I’ve shared bits and pieces of it over the years to colleagues and to the community but never the whole thing at one go. I did try and build a commercial version. With the birth of my son, doing contract work and building a company from scratch just wasn’t in the cards. So, my loss is your gain! I’ve decided to take what I have built up over the years, clean it up and post it as a series of articles. The code is hosted at github. It may take a while to unwind all the pieces, update it for SQL Server 2008 and clean it up for general consumption but when it is done you should have a solid foundation for a solution of your own.

My goal as a DBA is to provide secure, accurate, consistent and  efficient data access to all the databases under my care.

To meet those goals I should:

  1. Provide monitoring and alerting for critical errors.
  2. Provide reliable maintenance on all databases.
  3. Perform backup and recovery operations, including testing backups to guarantee recoverability.
  4. Analyze performance issues and be proactive about possible issues.
  5. Secure the databases and monitor for any breaches in security policies and procedures.

If you have a few SQL Server instances to manage a single person may be able to do these things without any level of automation. As you scale to dozens or hundreds of instances it becomes impossible to throw enough people at the problem. You get inconsistency across servers in configuration and management. Introduce inaccuracy in your monitoring and long term growth projections. Add a whole new level of unreliability as some servers just fall off the radar and your DBA team spends all their time reacting to problems.

My core philosophy for dealing with these issues is pretty straight forward.

Manage By Exception
This means setting up your monitoring and alerting systems to work for you not against you. To many alerts are just as deadly as too few. As you handle problems feed the solution back into your automated system.

Automate Everything
Make your servers work for you and not against you! There is always a cost up front to automate the management of your environment. The time you get back by not having your very valuable and highly skilled workers not doing menial tasks more than makes up for it.

Standardize Across Servers
As servers come on line you may find that they aren’t setup to the company policies. Fixing all of this by hand is also a huge time waster. Define a standard and modify it as needed. Through automation you can make a single change and then deploy it across your entire environment.

To meet these needs I’ve written several modules, a centralized system for managing them and the servers in my care. I will update this list with links as the articles come on line. If I use any code that I haven’t written I’ll be very clear who did write it, if I made changes to it and what the license to redistribute the code falls under.

Data Gathering

Monitoring And Alerting

  • Server Blocking
  • Check Backup Space
  • Server Database Transaction Rate
  • Server Poll
  • System Database Roles
  • Long Running Queries
  • Long Running Stored Procedures
  • Log Shipping

Administration

  • Backup Database
  • Update Statistics
  • Re-indexing
  • QA Refresh
  • Test Backups
  • Log Shipping

Reporting

  • Daily Status
  • Data Dictionary
  • Long Running Queries
  • Long Running Stored Procedures

Trend Analysis

  • Backup/Restore Failure Analysis
  • Backup Growth
  • Drive Space Detail
  • Server Blocking
  • Database I/O
  • Database Transaction Rate
  • Server Wait Statistics
  • Long Running Queries
  • Long Running Stored Procedures

Moore’s Law May Be The Death of NAND Flash

"It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so." -  Mark Twain

I try and keep this quote in my mind whenever I’m teaching about new technologies. You often hear the same things parroted over and over again long after they quit being true. This problem is compounded by fast moving technologies like NAND Flash.

If you have read my previous posts about Flash memory you are already aware of NAND flash endurance and reliability. Just like CPU’s manufacturing processes flash receive boost in capacity as you decrease the size of the transistors/gates used on the device. In CPU’s you get increases in speed, on flash you get increases in size. The current generation of flash manufactured on a 32nm process. This nets four gigabytes per die. Die size isn’t the same as chip, or package size. Flash dies are actually stacked in the actual chip package giving us sixteen gigabytes per package. With the new die shrink to 25nm we double the size to eight gigabytes and thirty two gigabytes respectively. That sounds great, but there is a dark side to the ever shrinking die. As the size of the gate gets smaller it becomes more unreliable and has less endurance than the previous generation. MLC flash suffers the brunt of this but SLC isn’t completely immune.

Cycles And Errors

One of the things that always comes up when talking about flash is the fact it wears out over time. The numbers that always get bantered about are SLC is good for 100,000 writes to a single cell and MLC dies at 10,000 cycles. This is one of those things that just ain’t so any more. Right now the current MLC main stream flash based on the 32nm process write cycles are down to 5000 or so. 25nm cuts that even further to 3000 with higher error rates to boot.

Several manufactures has announced the transition to 25nm on their desktop drives. Intel and OCZ being two of the biggest. Intel is a partner with Micron. They are directly responsible for developing and manufacturing quite a bit of the NAND flash on the market. OCZ is a very large consumer of that product. So, what do you do to offset the issues with 25nm? Well, the same thing you did to offset that problem with 32nm, more spare area and more ECC. At 32nm it wasn’t unusual to see 24 bits of ECC per 512 bytes. Now, I’ve seen numbers as high as 55 bits per 512 bytes to give 25nm the same protection.

To give you an example here is OCZ’s lineup with raw and usable space listed.

Drive Model Production Process Raw Capacity (in GB) Affected Capacity (in GB)
OCZSSD2‐2VTXE60G 25nm 64 55
OCZSSD2‐2VTX60G 32nm 64 60
OCZSSD2‐2VTXE120G 25nm 128 118
OCZSSD2‐2VTX120G 32nm 128 120

As you can clearly see the usable space is significantly decreased. There is a second problem specific to the OCZ drives as well. Since they are now using higher density modules they are only using half as many of them. Since most SSD’s get their performance from multiple read/write channels cutting that in half isn’t a good thing.

SLC is less susceptible to this issue but it is happening. At 32nm SLC was still in the 80,000 to 100,000 range for write cycles but the error rate was getting higher. At 25nm that trend continues and we are starting to see some of the same techniques used in MLC coming to SLC as ECC creeps up from 1 bit per 512 bytes to 8 bits or more per 512 bytes. Of course the down side to SLC is it is half the capacity of MLC. As die shrinks get smaller SLC may be the only viable option in the enterprise space.

It’s Non-Volatile… Mostly

Another side effect of shrinking the floating gate size is the loss of charge due to voltage bleed off over time. When I say “over time” I’m talking weeks or months and not years or decades anymore. The data on these smaller and smaller chips will have to be refreshed every few weeks. We aren’t seeing this severe an issue at the 25nm level but it will be coming unless they figure out a way to change the floating gate to prevent it.

Smaller Faster Cheaper

If you look at trends in memory and CPU you see that every generation the die gets smaller, capacity or speed increases and they become cheaper as you can fit double the chips on a single wafer. There are always technical issues to overcome with every technology. But NAND flash is the only one that gets so inherently so unreliable at smaller and smaller die sizes. So, does this mean the end of flash? In the short term I don’t think so. The fact is we will have to come up with new ways to reduce writes and add new kinds of protection and more advanced ECC. On the pricing front we are still in a position where demand is outstripping supply. That may change somewhat as 25nm manufacturing ramps up and more factories come online but as of today, I wouldn’t expect a huge drop in price for flash in the near future. If it was just a case of SSD’s consuming the supply of flash it would be a different matter. The fact is your cell phone, tablet and every other small portable device uses the exact same flash chips. Guess who is shipping more, SSDs or iPhones?

So, What Do I Do?

The easiest thing you can do is read the label. Check what manufacturing process the SSD is using. In some cases like OCZ that wasn’t a straight forward proposition. In most cases though the manufacturer prints raw and formatted capacities on the label. Check the life cycle/warranty of the drive. Is it rated for 50 gigabytes of writes or 5 terabytes of writes a day? Does it have a year warranty or 5 years? These are indicators of how long the manufacturer expects the drive to last. Check the error rate! Usually the error rate will be expressed in unrecoverable write or read errors per bit. Modern hard drives are in the 10^15 ~ 10^17 range. Some enterprise SSDs are in the 10^30 range. This tells me they are doing more ECC than the flash manufacturer “recommends” to keep your data as safe as possible.

SQLDIY: Gather Virtual File Statistics Using T-SQL #TSQL2sDay 15

It’s that time again, T-SQL Tuesday is here! This time Pat Wright (blog|twitter) is hosting and has put forth automating tasks using ether T-SQL or Powershell. I LOVE automating stuff. As a production DBA in some very large shops you can’t do your job unless you make your servers work for you. I’ve been using T-SQL and *GASP* xp_cmdshell, OSQL and file shares to gather stats and push configurations to servers for decades. Log before fancy things like C# and Powershell existed. These days I use a variety of home grown tools but doing things with just T-SQL can be just as powerful. I’m going to leverage this post to start a series on a pure T-SQL implementation of configuration management, data gathering and utility procedure deployment.

Where Is Your Management Database?

Every DBA should have two things, a utility, or management database on every server and a central repository where all the locally collected data in the management database is pulled back to. What surprises people when I talk about this methodology is I don’t always advocate using a licensed copy of SQL Server. By that I mean that SQL Express works just great as a central repository server. I usually put this on a virtual machine, Hyper-V or what ever flavor of virtual environment your company supports. This allows you to do things like enable CLR and xp_cmdshell on a server that technically is non-production and keep your security risk much lower. Every server that is deployed in my shop gets a local management database. From the central repository I can push out everything I need to manage a server and keep all my servers up to date as I make improvements or bug fixes to the management code. That’s all I’m really going to say about that in this post though I just wanted to give you an idea of just how deep the rabbit hole can go.

DMV’s give you the information, you have to use it.

Since SQL Sever 2005 Microsoft let of of the black box mentality and started providing crazy useful information via Dynamic Management Views. Virtual file statistics though have been around for quite a while. They got a touch up in the DMV but the basic information was available in SQL Server 2000 via function call.  The DMV I’m after is sys.dm_io_virtual_file_stats. It has a ton of information in it. It’s main problem though is it is an aggregation over time and doesn’t really tell you what is different from yesterday to today. To get around that we have to build our own sampling routine.

The VirtualFileStats Table

We create a table in our local management database to collect the daily numbers. I try to keep things simple.


CREATE TABLE dbo.VirtualFileStats
  (
     RecordID                             INT IDENTITY(1,1) NOT NULL,
     ServerName                           VARCHAR(255) NOT NULL,
     DBID                                 INT NOT NULL,
     FileID                               INT NOT NULL,
     Reads                                BIGINT NULL,
     ReadsFromStart                       BIGINT NULL,
     Writes                               BIGINT NULL,
     WritesFromStart                      BIGINT NULL,
     BytesRead                            BIGINT NULL,
     BytesReadFromStart                   BIGINT NULL,
     BytesWritten                         BIGINT NULL,
     BytesWrittenFromStart                BIGINT NULL,
     IostallInMilliseconds                BIGINT NULL,
     IostallInMillisecondsFromStart       BIGINT NULL,
     IostallReadsInMilliseconds           BIGINT NULL,
     IostallReadsInMillisecondsFromStart  BIGINT NULL,
     IostallWritesInMilliseconds          BIGINT NULL,
     IostallWritesInMillisecondsFromStart BIGINT NULL,
     RecordedDateTime                     DATETIME NULL,
     IntervalInMilliseconds               BIGINT NULL,
     FirstMeasureFromStart                BIT NULL
  )
GO
CREATE TABLE dbo.VirtualFileStatsHistory
  (
     RecordID                             INT NOT NULL,
     ServerName                           VARCHAR(255) NOT NULL,
     DBID                                 INT NOT NULL,
     FileID                               INT NOT NULL,
     Reads                                BIGINT NULL,
     ReadsFromStart                       BIGINT NULL,
     Writes                               BIGINT NULL,
     WritesFromStart                      BIGINT NULL,
     BytesRead                            BIGINT NULL,
     BytesReadFromStart                   BIGINT NULL,
     BytesWritten                         BIGINT NULL,
     BytesWrittenFromStart                BIGINT NULL,
     IostallInMilliseconds                BIGINT NULL,
     IostallInMillisecondsFromStart       BIGINT NULL,
     IostallReadsInMilliseconds           BIGINT NULL,
     IostallReadsInMillisecondsFromStart  BIGINT NULL,
     IostallWritesInMilliseconds          BIGINT NULL,
     IostallWritesInMillisecondsFromStart BIGINT NULL,
     RecordedDateTime                     DATETIME NULL,
     IntervalInMilliseconds               BIGINT NULL,
     FirstMeasureFromStart                BIT NULL
  )

This is what we need to gather, and later analyze the data. Since we are managing our samples we have to know when the sampling started and what the first sample is. FirstMeasureFromStart lets us know that it is the first base measurements the rest of the samples will delta off of.

GatherVirtualFileStats Stored Procedure

Next we need a stored procedure to do the sampling. One thing you will notice is the procedure executes continuously with a WAIT FOR DELAY built into it so you can get finer grained than the 1 minute limitation of the SQL Agent. Sometimes, I will do one off sampling for a short period, say 30 minutes at a 30 second interval but most often I just let it run and set the sample rate at 1 minute or larger depending on how busy the system is.


IF EXISTS (SELECT
             *
           FROM
             dbo.sysobjects
           WHERE
            id = Object_id(N'[dbo].[GatherVirtualFileStats]')
            AND Objectproperty(id, N'IsProcedure') = 1)
  DROP PROCEDURE [dbo].[GatherVirtualFileStats]

GO

--------------------------------------------------------------------------------------
--  GatherVirtualFileStats
--  by: Wesley D. Brown
--  date: 02/08/2011
--  mod:  04/14/2011
--  mod:  04/17/2013

--  description:
--	This stored procedure is used to sample sys.dm_io_virtual_file_stats to track
--	performance at a database file level. This is useful for finding
--	hotspots on SAN's or under performing IO systems.

--  parameters:
--    @Duration	 = '01:00:00' How long to run before exiting
--   @IntervalInSeconds = 120 Number of seconds between samples
--@DB			 = -1 DB_ID to monitor, -1 for all
--@DBFile		 = -1 File_ID of file to monitor, -1 for all
--  usage:
--  	DECLARE @RC         INT,
--			@StartTime  DATETIME,
--			@databaseID INT

--	SELECT @StartTime = Getdate(),
--		   @databaseID = Db_id()

--	EXEC @RC = Gathervirtualfilestats
--	  '00:45:30',
--	  30,
--	  10,
--	  -1

--	SELECT *
--	FROM   dbo.VirtualFileStats
--	WHERE  DBID = 10
--	ORDER  BY RecordID

--  platforms:
--  SQL Server 2005
--  SQL Server 2008
--  SQL Server 2008 R2
--  SQL Server 2012
--  tested:
--  SQL Server 2005 SP2
--  SQL Server 2012
---------------------------------------------------------------------------------------
--  *** change log		***
--	Added history table and perge on start up if there is data in the main table
--  *** end change log	***
-------------------------------------------------------------------------------------
CREATE PROC dbo.Gathervirtualfilestats
  @Duration          DATETIME = '01:00:00',
  @IntervalInSeconds INT = 120,
  @DB                INT = -1,
  @DBFile            INT = -1
AS
  SET nocount ON

  DECLARE
    @StopTime                 DATETIME,
    @LastRecordedDateTime     DATETIME,
    @CurrentDateTime          DATETIME,
    @ErrorNumber              INT,
    @NumberOfRows             INT,
    @ErrorMessageText         NVARCHAR(4000),
    @CurrentServerName        VARCHAR(255),
    @DifferenceInMilliSeconds BIGINT,
	@SQLVersion				  VARCHAR(50)

	select @SQLVersion =
		substring(version_number,1,charindex(' ',version_number))
		from
		(
		select
			substring(version,charindex('-',version)+2, charindex('(',version)-patindex('%.[0-9]',version)) as version_number from
			(
				select @@version as version
			) as t
	) as t

	if (charindex('11.',@SQLVersion,0) > 0)
	BEGIN
		IF (@DB = -1)
		BEGIN
			set @DB = NULL
		END
		IF (@DBFile = -1)
		BEGIN
			set @DBFile = NULL
		END
	END

  IF EXISTS (SELECT
               1
             FROM
               dbo.VirtualFileStats)
    BEGIN
        IF EXISTS (SELECT
                     *
                   FROM
                     dbo.sysobjects
                   WHERE
                    id = Object_id(N'[dbo].[VirtualFileStats]')
                    AND Objectproperty(id, N'IsTable') = 1)
          BEGIN
              INSERT INTO dbo.VirtualFileStatsHistory
              SELECT
                *
              FROM
                VirtualFileStats;

              TRUNCATE TABLE dbo.VirtualFileStats;
          END
    END

  SELECT
    @CurrentServerName = Cast(Serverproperty('servername') AS VARCHAR(255))

  SET @DifferenceInMilliSeconds = Datediff(ms, CONVERT(DATETIME, '00:00:00', 8), @Duration)

  SELECT
    @StopTime = Dateadd(ms, @DifferenceInMilliSeconds, Getdate())

  WHILE Getdate() <= @StopTime
    BEGIN
        SELECT
          @LastRecordedDateTime = @CurrentDateTime

        SELECT
          @CurrentDateTime = Getdate()

        INSERT INTO dbo.VirtualFileStats
                    (ServerName,
                     DBID,
                     FileID,
                     Reads,
                     ReadsFromStart,
                     Writes,
                     WritesFromStart,
                     BytesRead,
                     BytesReadFromStart,
                     BytesWritten,
                     BytesWrittenFromStart,
                     IostallInMilliseconds,
                     IostallInMillisecondsFromStart,
                     IostallReadsInMilliseconds,
                     IostallReadsInMillisecondsFromStart,
                     IostallWritesInMilliseconds,
                     IostallWritesInMillisecondsFromStart,
                     RecordedDateTime,
                     IntervalinMilliseconds,
                     FirstMeasureFromStart)
        SELECT
          @CurrentServerName,
          vfs.database_id,
          vfs.[file_id],
          vfs.num_of_reads - dbaf.ReadsFromStart                            AS Reads,
          vfs.num_of_reads                                                  AS ReadsFromStart,
          vfs.num_of_writes - dbaf.WritesFromStart                          AS Writes,
          vfs.num_of_writes                                                 AS WritesFromStart,
          vfs.num_of_bytes_read - dbaf.BytesReadFromStart                   AS BytesRead,
          vfs.num_of_bytes_read                                             AS BytesReadFromStart,
          vfs.num_of_bytes_written - dbaf.BytesWrittenFromStart             AS BytesWritten,
          vfs.num_of_bytes_written                                          AS BytesWrittenFromStart,
          vfs.io_stall - dbaf.IostallInMillisecondsFromStart                AS IostallInMilliseconds,
          vfs.io_stall                                                      AS IostallInMillisecondsFromStart,
          vfs.io_stall_read_ms - dbaf.IostallReadsInMillisecondsFromStart   AS IostallReadsInMilliseconds,
          vfs.io_stall_read_ms                                              AS IostallReadsInMillisecondsFromStart,
          vfs.io_stall_write_ms - dbaf.IostallWritesInMillisecondsFromStart AS IostallWritesInMilliseconds,
          vfs.io_stall_write_ms                                             AS IostallWritesInMillisecondsFromStart,
          @CurrentDateTime,
          CASE
            WHEN @LastRecordedDateTime IS NULL THEN NULL
            ELSE Datediff(ms, dbaf.RecordedDateTime, @CurrentDateTime)
          END                                                               AS IntervalInMilliseconds,
          CASE
            WHEN @LastRecordedDateTime IS NULL THEN 1
            ELSE 0
          END                                                               AS FirstMeasureFromStart
        FROM
          sys.Dm_io_virtual_file_stats(@DB, @DBFile) vfs
        LEFT OUTER JOIN VirtualFileStats dbaf
          ON vfs.database_id = dbaf.dbid
             AND vfs.[file_id] = dbaf.fileid
        WHERE
          ( @LastRecordedDateTime IS NULL
             OR dbaf.RecordedDateTime = @LastRecordedDateTime )

        SELECT
          @ErrorNumber = @@ERROR,
          @NumberOfRows = @@ROWCOUNT

        IF @ErrorNumber != 0
          BEGIN
              SET @ErrorMessageText = 'Error ' + CONVERT(VARCHAR(10), @ErrorNumber) + ' failed to insert file stats data!'

              RAISERROR (@ErrorMessageText,
                         16,
                         1) WITH LOG

              RETURN @ErrorNumber
          END

        WAITFOR DELAY @IntervalInSeconds
    END

I Have Data, Now What?

This is where the fun begins! If you just query the table the data doesn’t make much sense.

image

We need to do some simple math and get the data into a metric that is meaningful.


SELECT TOP 10 Db_name(dbid)                                          AS 'databasename',
              File_name(fileid)                                      AS 'filename',
              Reads / ( IntervalInMilliSeconds / 1000 )              AS 'readspersecond',
              Writes / ( IntervalInMilliSeconds / 1000 )             AS 'writespersecond',
              ( Reads + Writes ) / ( IntervalInMilliSeconds / 1000 ) AS 'iopersecond',
              CASE
                WHEN ( Reads / ( IntervalInMilliSeconds / 1000 ) ) > 0
                     AND IostallReadsInMilliseconds > 0 THEN IostallReadsInMilliseconds / Reads
                ELSE 0
              END                                                    AS 'iolatencyreads',
              CASE
                WHEN ( Reads / ( IntervalInMilliSeconds / 1000 ) ) > 0
                     AND IostallWritesInMilliseconds > 0 THEN IostallWritesInMilliseconds / Writes
                ELSE 0
              END                                                    AS 'iolatencywrites',
              CASE
                WHEN ( ( Reads + Writes ) / ( IntervalInMilliSeconds / 1000 ) > 0
                       AND IostallInMilliseconds &gt; 0 ) THEN IostallInMilliseconds / ( Reads + Writes )
                ELSE 0
              END                                                    AS 'iolatency',
              RecordedDateTime
FROM   management.dbo.VirtualFileStats
WHERE  DBID = 10
       AND FirstMeasureFromStart = 0
ORDER  BY RecordID

This gives us reads, writes and io latency per second results.

image

Now we are cooking! We can now see that on this database we are seeing some spikes in latency, the number of milliseconds it takes to complete a single IO request, and may warrant investigation. As a general rule of thumb if I see IO latency above 20 milliseconds consistently I start looking deeper into the IO system to see what is wrong. A single modern hard disk is capable of 130 random IO’s a second. Another thing to consider is how many databases are on the same disk. I will pull the database files together with the volume they are on to get a true aggregate number of IO’s and latencies. You may find that a single database is dominating the disks and causing other databases to slow down even when the number of IO’s for those databases is small. On a SAN being able to get to the file level can help you locate the physical disks for that LUN and help your SAN administrators look at a very specific spot instead of the overall health of the SAN and actually fix your performance issues.

Lastly, I run a nightly job that performs these aggregations for me and moves them into a table that I keep long term so I can see performance over time. This is a great way to see if you are getting near your IO capacity and if the steps you have done correct the issues.

So, That’s it?

Not by a long shot. Measuring IO latencies is just one piece of a much larger puzzle for troubleshooting performance problems. It is a valuable one though, and keeping this information over time is just priceless. Now that you have the tools to gather virtual file statistics on your SQL Server instance I’d love to hear how you are using the data.

Update 4/14/2011

I added a history table an a check at the beginning to move all data to history before starting up. Mike Kaplan reported below that multiple runs caused issues if there was data in the table from a previous run.

Update 4/17/2013

Added a check for the version of SQL Server. In 2012 they changed the syntax so calling the dmv with a -1,-1 simply gives you no data. Previously, it returned all database files.