SQLGeordie's Blog

Helping the SQL Server community……where I can!

NetApp Dynamic Backup Script — June 9, 2011

For those of you who've worked with NetApp and SnapManager for SQL Server, you will know it can generate a scheduled backup as a SQL Agent job. Simple: click through the GUI, select the databases you want to back up, the schedule etc., and a job is created containing a command-line script similar to the one below:

"C:\Program Files\NetApp\SnapManager for SQL Server\SmsqlJobLauncher.exe" new-backup
–svr 'ServerToBackupName'
-d 'ServerToBackupName', '4', 'DBName1', 'DBName2', 'DBName3', 'DBName4'
-ver –verInst 'ServerToVerifyOnName' -mp
–mpdir 'C:\Program Files\NetApp\SnapManager for SQL Server\SnapMgrMountPoint'
-RetainBackups 7 -lb -bksif -RetainSnapofSnapInfo 0 -trlog –mgmt standard

This script specifies the server on which the databases to be backed up reside, the number of databases to back up, and the list of database names you selected earlier in the GUI.

Now, this all looks relatively straightforward and makes perfect sense, right up until you create or drop a database on the server. The script does not pick up that this has happened: it will fail to snap any new databases and will no doubt fail trying to snap a database that no longer exists. So, every time you do this administration work you have the additional step of amending the job to match the current list of databases, as it cannot update itself – how crazy is that!?!

I'm not sure about you, but I don't like giving myself additional work, so I set about rectifying this by creating a script that lets you create this job once and never worry about it again (unless you specifically need to exclude a certain database), by dynamically populating the database list based on what is in sys.databases. This can be modified to include or exclude certain databases, e.g. the system databases.

The script itself is straightforward: it assigns the list of databases to a variable and builds up a command string to be run via xp_cmdshell. Note: there are certain security risks associated with enabling the xp_cmdshell feature, so please make sure you've read and understood these before proceeding.
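As a reminder (and an assumption on my part that it isn't already on in your environment), xp_cmdshell is disabled by default and has to be switched on by a sysadmin before the script below will run:

```sql
-- Enable xp_cmdshell (requires sysadmin); 'show advanced options' must be on first
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'xp_cmdshell', 1;
RECONFIGURE;
```
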
I’ve added comments to the script where necessary:


-- Amend variables where necessary depending on how many DBs you have
-- Remembering the maximum recommendation for DBs on a LUN (from NetApp) is 35
DECLARE @strSQL         VARCHAR(2000) ,
        @strDBNames     VARCHAR(1000) ,
        @intDBCnt       VARCHAR(5) ,
        @strVerifServer VARCHAR(50) ,
        @intRetainBkups INT

SELECT  @strSQL         = '' ,
        @strDBNames     = '' ,
        @strVerifServer = 'VerificationServerName' ,
        @intRetainBkups = 8

--Get DB names
SELECT  @strDBNames = @strDBNames + '''' + name + ''', '
FROM    sys.databases
--WHERE database_id > 4 -- uncomment to exclude system databases

--Get DB count
SELECT  @intDBCnt = CONVERT(VARCHAR(5), COUNT(*))
FROM    sys.databases
--WHERE database_id > 4 -- keep in step with the filter above

--Remove trailing comma; probably a fancier way but it's quick and works
--(LEN ignores the trailing space, so -1 drops the comma)
SET @strDBNames = LEFT(@strDBNames, LEN(@strDBNames) - 1)

--Make sure this string is all on one line when it runs!!!!
--Multi-line here for readability
SET @strSQL = @strSQL + '
"C:\Program Files\NetApp\SnapManager for SQL Server\SmsqlJobLauncher.exe" new-backup
-svr ''' + @@SERVERNAME + '''
-d ''' + @@SERVERNAME + ''', ''' + @intDBCnt + ''', ' + @strDBNames + '
-ver -verInst ''' + @strVerifServer + ''' -mp
-mpdir ''C:\Program Files\NetApp\SnapManager for SQL Server\SnapMgrMountPoint''
-RetainBackups ' + CONVERT(VARCHAR(5), @intRetainBkups) + ' -lb -bksif -RetainSnapofSnapInfo 0 -trlog -mgmt standard'

SELECT @strSQL --Output the command being run; can comment out
EXEC xp_cmdshell @strSQL

There we have it. A NetApp snap backup script which will continue to work even when you go crazy and decide to add or drop as many databases as you like.

Hopefully there are others out there who can benefit from this script 🙂

Feel free to leave constructive criticism or even a positive comment, but please don't comment with things like "why have you used xp_cmdshell?" or "why do you run a separate SET statement to remove the trailing comma?"; everyone has their own way and this invariably is mine.

UPDATE (30th Nov 2011):
The latest versions of SnapManager for SQL have addressed this issue: if you now select all DBs to back up, it no longer outputs all the DB names into the job script as per the above. You can still use the script above to dynamically back up a selected set of databases by amending the SELECT from the sys.databases query, so it's not completely redundant!!!
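For example, the sys.databases query can be amended along these lines (the database name in the exclusion list is purely hypothetical, substitute your own):

```sql
DECLARE @strDBNames VARCHAR(1000)
SET @strDBNames = ''

--Only user databases, minus any you want to leave out of the snap
SELECT  @strDBNames = @strDBNames + '''' + name + ''', '
FROM    sys.databases
WHERE   database_id > 4               -- skip master, tempdb, model, msdb
        AND name NOT IN ('ScratchDB') -- hypothetical exclusion list

SELECT @strDBNames
```
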

SnapManager Epic Fail – oh dear! — April 7, 2011

An ex-colleague and good friend of mine has recently been having trouble with SnapManager for SQL – has anyone had similar troubles? It would be interesting to hear people's thoughts on this as I share his concerns….

Bearing in mind that this may very well be down to human (sysadmin) error and a requirement to RTFM, their tech support neglected to mention this issue when it was discussed before the work was carried out! I would have thought something as big as this would be mentioned by them during that call…….

UPDATE:- From some of the responses I've had on #sqlhelp, this issue is actually by design and not a bug. I still can't understand why a tool that claims to do this job fully automatically doesn't make some of the fundamental changes required when moving a database. Thanks to @GiulianoM and @BrentO for their words of wisdom and, in Brent's case (as per usual), sarcasm :o)

Why I don’t like NULLs and Nullable Columns — March 25, 2011

As I imagine (or hope), the majority of us adhere to some form of standards in the environment we work in, whether that be having to wear a shirt and tie, being in for 9am or, for those of us in a database or development environment, coding standards.

Everyone who has worked with me knows that one of my standards is NO NULLABLE columns in any new development unless there is an extremely good, valid reason. To date, only one developer has ever convinced me that their column should be nullable, and that was because the current system design meant we had to have it, short of a complete re-write of that part of the system. I don't really want this blog to turn into a big debate as to whether I'm right or wrong on this; it's purely my view based on the many years I've been working with SQL Server and development teams.

There are many blogs out there where DBAs share the same or similar views, such as Thomas LaRock (Blog|Twitter), in which he talks about the harm NULLs cause when a person doesn't know they are there. What I'm going to focus on in this blog is more where people know the NULL values are there and tailor their T-SQL around them, specifically when the value is NULL and they don't want to show NULL on screen, so they replace it with another value.

I've knocked up a quick script to populate a table EMPLOYEE with dummy data for the purposes of this blog, and yes, I've purposely left MODIFIEDBY_ID and MODIFIEDBY_DATE as NULLable, and no, I'm not saying this is the table design anyone should be using in their systems, especially later on.

--Create the EMPLOYEE table; MODIFIEDBY_ID and MODIFIEDBY_DATE purposely NULLable
--(non-key column names inferred from the queries later in this post)
CREATE TABLE EMPLOYEE
    (
      EMP_ID INT IDENTITY(1, 1)
                 PRIMARY KEY ,
      DATE_OF_BIRTH DATETIME NOT NULL ,
      SALARY DECIMAL(16, 2) NOT NULL ,
      EMP_NAME VARCHAR(200) NOT NULL ,
      INSERTEDBY_ID INT NOT NULL ,
      INSERTED_DATE DATETIME NOT NULL ,
      MODIFIEDBY_ID INT NULL ,
      MODIFIEDBY_DATE DATETIME NULL
    ) ;

DECLARE @counter INT ,
    @NumberOfRows INT

--70,000 rows with no modification details
SELECT  @counter = 1
       ,@NumberOfRows = 70000
WHILE ( @Counter < @NumberOfRows )
      BEGIN
            INSERT  INTO EMPLOYEE
                    ( DATE_OF_BIRTH , SALARY , EMP_NAME , INSERTEDBY_ID , INSERTED_DATE )
                    SELECT  CONVERT(VARCHAR(10) , getdate() - ( ( 18 * 365 ) + RAND() * ( 47 * 365 ) ) , 103)
                           ,CONVERT(DECIMAL(16,2) , ( 50000 + RAND() * 90000 ))
                           ,'Dummy Name' + CONVERT(VARCHAR(100) , @Counter)
                           ,CONVERT(INT , rand() * 10000)
                           ,CONVERT(VARCHAR(10) , getdate() - ( ( 18 * 365 ) + RAND() * ( 47 * 365 ) ) , 103)

            SET @Counter = @Counter + 1
      END ;

--25,000 rows with modification details populated
SELECT  @counter = 1
       ,@NumberOfRows = 25000
WHILE ( @Counter < @NumberOfRows )
      BEGIN
            INSERT  INTO EMPLOYEE
                    ( DATE_OF_BIRTH , SALARY , EMP_NAME , INSERTEDBY_ID , INSERTED_DATE , MODIFIEDBY_ID , MODIFIEDBY_DATE )
                    SELECT  CONVERT(VARCHAR(10) , getdate() - ( ( 18 * 365 ) + RAND() * ( 47 * 365 ) ) , 103)
                           ,CONVERT(DECIMAL(16,2) , ( 50000 + RAND() * 90000 ))
                           ,'Dummy Name' + CONVERT(VARCHAR(100) , @Counter)
                           ,CONVERT(INT , rand() * 10000)
                           ,CONVERT(VARCHAR(10) , getdate() - ( ( 18 * 365 ) + RAND() * ( 47 * 365 ) ) , 103)
                           ,CONVERT(INT , rand() * 10000)
                           ,CONVERT(VARCHAR(10) , getdate() - ( ( 18 * 365 ) + RAND() * ( 47 * 365 ) ) , 103)

            SET @Counter = @Counter + 1
      END ;

I also create a couple of non-clustered indexes that should be utilised by the sample queries below:

CREATE INDEX idx1 ON employee ( MODIFIEDBY_DATE , inserted_date ) INCLUDE ( emp_id ,salary )
CREATE INDEX idx2 ON employee ( MODIFIEDBY_DATE ) INCLUDE ( emp_id ,salary )

So, the table is there and the data populated, now for one of my biggest bugbears:

Query 1:

SELECT  emp_id
       ,isnull(MODIFIEDBY_DATE , inserted_date)
FROM    employee
WHERE   isnull(MODIFIEDBY_DATE , inserted_date) > dateadd(yy , -20 , getdate())

Arrrrgggghhhh!!!! Don't get me wrong, I do understand why tables and columns like this exist, but for a DBA it's a nightmare to tune, as any index you may put on these columns will ultimately end in a Scan as opposed to a Seek.

Here is the execution plan:

As a comparison here is the query and execution plan for the query without the ISNULL function:

Query 2:

SELECT  emp_id
FROM    employee
WHERE   MODIFIEDBY_DATE > dateadd(yy , -20 , getdate())

Execution Plan:

And a comparison of the two:

SELECT  emp_id
       ,isnull(MODIFIEDBY_DATE , inserted_date)
FROM    employee
WHERE   isnull(MODIFIEDBY_DATE , inserted_date) > dateadd(yy , -20 , getdate())

SELECT  emp_id
FROM    employee
WHERE   MODIFIEDBY_DATE > dateadd(yy , -20 , getdate())

That's 95% against 5%!! What a difference!

Now, for bugbear #2 I'll need to add a new column with new values. Again, I'm not saying you should or would ever do this, but for the purposes of the blog:

ALTER TABLE employee
ADD   PreviousSalary DECIMAL(16,2)

So I've now added a new column with all NULL values; however, I don't want my users seeing the word NULL on the front end. Easy, I'll replace it with a zero:

SELECT  emp_id
       ,isnull(MODIFIEDBY_DATE , inserted_date) AS LastModifiedDate
       ,isnull(PreviousSalary , 0) AS PreviousSalary
FROM    employee
WHERE   isnull(MODIFIEDBY_DATE , inserted_date) > dateadd(yy , -20 , getdate())

From this you can see that the SQL optimiser will now ignore the index created and scan the clustered index:

Obviously I could amend idx1 to factor in this column, or create a new index:

CREATE INDEX idx3 ON employee ( MODIFIEDBY_DATE , inserted_date ) INCLUDE ( emp_id ,salary, PreviousSalary )

And sure enough it will choose that index, but again, it's a scan!

Ok, so you may well be thinking that this doesn't happen often or wouldn't cause too much of an issue on your system, which may very well be correct. However, what about a highly transactional financial system with hundreds of millions of rows? Yes, it does happen! What makes matters even worse is if you decide you want to search for (in this example) PreviousSalary > 10000: you'd then have to handle the NULL values and convert them to another value (typically 0), which begs the question, why was the column not set to an arbitrary default value in the first place?

I’ve updated a random selection of PreviousSalary records to now have non-NULL values and added a new index:

CREATE INDEX idx4 ON employee ( PreviousSalary ) INCLUDE ( emp_id ,salary )

Running the query to handle the NULLs will still produce an Index Scan:

SELECT  emp_id
FROM    employee
WHERE   isnull(PreviousSalary , 0) > 10000

If I now update all the NULL values to 0 (I could have used an arbitrary figure such as -1, -100 or even -99999 to indicate a NULL value) and amend the query above, we now get this execution plan:

SELECT  emp_id
FROM    employee
WHERE   isnull(PreviousSalary , 0) > 10000

SELECT  emp_id
FROM    employee
WHERE   PreviousSalary > 10000

I think by now everyone will be getting the picture. So how do you rectify this? First and foremost, it's education. Speak with your developers, do "Lunch 'n' Learn" sessions, anything you can to express how much of an issue this can potentially be, and make the time to sit with them and (re)design that part of the system so that you don't have to handle NULLs at either the query level or the front end. If they understand that using the ISNULL function in this way makes their query non-SARGable and ultimately hurts performance, I'm confident that you won't have any issues converting them to this method.
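To make that conversation concrete, this is the sort of rewrite I'd walk a developer through (a sketch against the EMPLOYEE table above; it assumes NULLs are being kept rather than defaulted away): move the ISNULL out of the predicate so each column can be tested directly and the indexes on MODIFIEDBY_DATE and inserted_date remain usable.

```sql
--Non-SARGable: wrapping the column in ISNULL forces a scan
SELECT  emp_id ,
        ISNULL(MODIFIEDBY_DATE, inserted_date) AS LastModifiedDate
FROM    employee
WHERE   ISNULL(MODIFIEDBY_DATE, inserted_date) > DATEADD(yy, -20, GETDATE())

--SARGable alternative: test the columns directly, same result set
SELECT  emp_id ,
        ISNULL(MODIFIEDBY_DATE, inserted_date) AS LastModifiedDate
FROM    employee
WHERE   MODIFIEDBY_DATE > DATEADD(yy, -20, GETDATE())
        OR ( MODIFIEDBY_DATE IS NULL
             AND inserted_date > DATEADD(yy, -20, GETDATE()) )
```
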

I know that the majority of people who may read this may not take anything away from it but even if I get a handful of people who can take this and speak with their development teams on this very simple concept of database design then you could save yourself major headaches in the future!

Database Mail – Collection of SQL Server Scripts — March 21, 2011

As part of my blogging I’ll be periodically providing readers with a collection of my scripts which I use on a daily basis (collected over the years) and hopefully others can gain from these and utilise them going forward.

Today I'm providing a list of Database Mail scripts I use when troubleshooting issues. These are predominantly for SQL Server 2005; I believe they will also work on SQL Server 2008, but as I don't have a 2008 instance available to test, I can't guarantee it!

Firstly, a good place to start is to check the status of your Database Mail:

EXEC msdb.dbo.sysmail_help_status_sp

It may be possible that it isn't sending mail simply because it hasn't been started 🙂

Following on from this, if you discover that Database Mail has in fact stopped then you can start it again with:

EXEC msdb.dbo.sysmail_start_sp

NOTE: It can also be stopped by simply running

EXEC msdb.dbo.sysmail_stop_sp

If Database Mail is running then it may be that Service Broker has not been enabled:

SELECT is_broker_enabled FROM sys.databases WHERE name = 'msdb';
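If that query returns 0, Service Broker can be switched back on for msdb. One caveat from my own experience (rather than anything official): the database needs to be free of other active connections, so stop anything using msdb (e.g. SQL Server Agent) first.

```sql
-- Re-enable Service Broker on msdb; ROLLBACK IMMEDIATE kicks out
-- any open transactions so the ALTER can take the lock it needs
ALTER DATABASE msdb SET ENABLE_BROKER WITH ROLLBACK IMMEDIATE;
```
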

Another technique I use when troubleshooting mail issues (if it's running and the settings are correct) is to check whether anything is actually stuck in the SQL Server mail system; mail can sit queued but unable to send. A quick and easy way to determine this is to run:

EXEC msdb.dbo.sysmail_help_queue_sp @queue_type = 'mail';

If the value is 0 then you can determine that there isn’t actually anything in the queue. If the value is anything other than 0 then you can investigate this further by running one or all of the below:

Test to see if your mail has actually been sent

SELECT * FROM msdb.dbo.sysmail_sentitems where sent_date > dateadd(d,-1,getdate());

Check the mail eventlog for further errors

SELECT * FROM msdb.dbo.sysmail_event_log;

Check for any mail item which is sitting at the unsent status

SELECT * FROM msdb.dbo.sysmail_allitems WHERE sent_status = 'unsent';

A favourite of mine, determine the reasons why a mail item has actually failed

SELECT  items.subject
       ,l.description
       ,*
FROM    msdb.dbo.sysmail_faileditems AS items
INNER JOIN msdb.dbo.sysmail_event_log AS l
        ON items.mailitem_id = l.mailitem_id
WHERE   items.last_mod_date > getdate() - 1

The script above tends to be the ideal one to run in order to find out why an email isn't being sent; I tend to find 90% of my issues by running it!

Something else worth checking is whether the user actually has permissions to be sending emails via SQL Server:

Check members of the DatabaseMailUserRole

EXEC msdb.sys.sp_helprolemember 'DatabaseMailUserRole'
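And if the user turns out not to be a member, they can be added to the role (the user name below is a placeholder, substitute the actual login/user):

```sql
USE msdb;
-- Grant a user permission to send Database Mail
EXEC sp_addrolemember
    @rolename   = 'DatabaseMailUserRole' ,
    @membername = 'YourDomain\YourUser' ; -- placeholder user name
```
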

These are just a handful of queries that can be run when troubleshooting Database Mail issues. By no means is it a complete list, but from my experience I rarely have to go elsewhere. There have been times when I've had to venture into amending retry attempts and querying sys.dm_broker_queue_monitors and sys.service_queues, but only in very rare circumstances, so I've not included the queries for these.

If anyone has anything they feel worthwhile adding to this then feel free to email me or comment and I’ll add them if relevant.

Poll: – Index Rebuild / Defrag, what page count filter do you set? — March 11, 2011

After reviewing a number of clients' overnight index defrag/rebuild scripts over the years, I'm intrigued to know how many of us actually set the page count filter when capturing the fragmentation levels.

I know there's a lot of debate around the value to set. Personally I set mine to 100, mainly because (in most circumstances) the client has no requirement for a separate update statistics job, so it's down to the auto stats update (if we've determined it's required to be on) or indeed the overnight index defrag/rebuild job to rebuild them. Even though 10000 pages isn't a great deal on some systems for the stats to be out of date, as a rule of thumb I'd rather play it safe and defrag/update stats on tables with pages > 100.
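For context, this is the sort of capture query I mean, with the page count filter applied (a sketch rather than my actual maintenance script):

```sql
--Capture fragmentation for the current database, ignoring tiny indexes
SELECT  OBJECT_NAME(ips.object_id) AS table_name ,
        i.name AS index_name ,
        ips.avg_fragmentation_in_percent ,
        ips.page_count
FROM    sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
INNER JOIN sys.indexes AS i
        ON i.object_id = ips.object_id
           AND i.index_id = ips.index_id
WHERE   ips.page_count > 100   -- the filter this poll is about
ORDER BY ips.avg_fragmentation_in_percent DESC ;
```
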

So, I'm interested to see what everyone sets, and if there's enough interest I'll blog the results 🙂

I’d appreciate any comments to supplement your selection too….

Product Review – Redgate SQL Virtual Restore — March 9, 2011

1. Introduction

After reading Brad McGehee’s guide to SQL Virtual Restore I was really intrigued by the capabilities of the product. So much so I downloaded it and thought I’d give it a whirl. This article is not so much a guide (you can read Redgate’s website or Brad’s blog for that) but more my experiences with the product and the conclusions I came to about its uses.
Please note, I'm not employed by Redgate, nor do I endorse any of their products; this is simply feedback from my early experiences. I will go through the good and bad times I've spent with the product and have tried to be as honest as possible in showing my findings. Along with this I'll provide one or two recommendations which I feel would further enhance the product.
As a side note, this blog was originally written 5 months ago as an article for SQLServerCentral.com, so any issues still outstanding at the end of this document may now actually have been fixed, but I don't have access to the software anymore to prove or disprove this as I'm now with a different client.

2. Background

We're currently in the process of moving our SQL environment to SQL 2008 and decided to go crazy and go virtual (VMware) and onto a new SAN technology (NetApp) all at the same time. With this in mind, we have the facility of SnapMirror and FlexClone, but we cannot fully utilise these technologies until we've migrated our live environment, and it would be an even more crazy idea to throw that in first, so we're moving the non-production environment first. Having a Dev/UAT/Support environment running on an expensive SAN is something a lot of DBAs (including myself) can only dream of, but there is a cost implication with this: a massive cost in fact. Our non-production environments are (almost) exact replicas of our live environment, so each server requires 1TB of storage, and bearing in mind we only have 19TB available for all servers, it'll be utilised very quickly!
The live migration is planned for March 2011, so we've got at least 6 months in this situation. This is why I decided to delve into the realms of SQL Virtual Restore (SQLVR from now on) and test whether it actually can do what it states and whether it would work in our environment. If I could get it to work then the cost savings could be enormous.

3. Software (initial)

We currently back up our databases using Redgate's SQL Backup v5.4, which unfortunately is not supported by SQL Virtual Restore as a known media type, so for the purposes of testing I had to download Redgate's SQL Backup v6.4.0.56, with a requirement to upgrade to this if all went well.
The version of SQLVR we downloaded and installed was v2.0, and with that came Hyperbac.

4. Testing Environment

The server I used was a virtual server running on 2 vCPU’s and 4GB of RAM and running on a NetApp SAN (12 SAS disks, DP RAID). I won’t bore you too much with the full spec of the ESX hosts, speed or configuration of the SAN. If you want that information I could supply it at a later date.

4.1. Uncompressed DB Sizes

Data – 750GB (626GB not including Archives)
Logs – 150GB (120GB not including Archives)

4.2. Redgate SQL Backup Backupset

173GB compressed .sqb’s (including Archives and compression 1)
140GB compressed .sqb’s (excluding Archives and compression 1)

5. Testing of Virtual vs Physical

I wanted to test space and time savings we’d get out of this product as well as the performance impact of implementing virtual databases.

5.1. Restore Sizes

The initial mount of the vmdf's and vldf's came to a total size of < 1GB, which gave a total environment size of 173GB for the backups + 1GB for the virtual files = 174GB. This equates to a whopping 726GB disk-space saving from the outset:
(750GB + 150GB) – 174GB = 726GB

5.2. Restore Timings

To fully restore all the required databases in their uncompressed state took 3hrs 02mins.
To fully restore (mount) the .sqb backups took 1hr 19mins, a saving of 1hr 43mins; the mount takes just 43% of the full restore time, a 57% saving on restore time alone.
Fully Restored Size = 626GB / 48GB (log) (not including bankingarchive) and took 3hrs 2m
Fully Mounted Size = 1.37GB + 173GB (Backupset) = 174.37GB (including bankingarchive)

5.3. Large overnight process

I decided to run one of our long-running overnight processes to test the capabilities of the virtual files and stress them to see how they cope under heavy process load. This would also show me the resulting file size increases as well as the performance of such a process.
The sizes of the virtual data files before the process ran are shown below:

The size of the database updated by this process was 139,776KB, and after the process completed there was a 2.5GB increase in file size (1.67GB vmdf and 830MB vldf).
This process would be run every day in UAT and would therefore rack up 17.5GB of data-space increase over the course of the week. This is not including any of the other testing / processes that would be running during that time. 17.5GB is not a great deal of space in the grand scheme of things, and the 726GB saving more than makes up for that increase; even if I were to multiply it by 10, a 175GB increase is still a significant saving.
I then ran the same process using physical databases, and the timings showed that the process was 30% quicker on physical databases: it took 50 minutes as opposed to 72 when virtual. I believe this to be purely down to the fact that the virtual data file sizes cannot be pre-allocated, so every single change requires the file to expand, causing an IO overhead for every change written to the database.

6. Limitations Experienced

6.1. Memory Being Consumed

One of the first things I noticed when testing SQLVR was the large amount of RAM it utilised. In order to get the relevant syntax, I launched the GUI and selected a database backup file to mount; simply selecting the backup file caused HyperbacSrv.exe to consume all available RAM on the server…..but worst of all, it wouldn't release it after closing the GUI.

Black line being the % Committed Bytes in Use:

Almost 4GB of RAM gone in about 20 seconds!!! The only way I could release the memory was to restart the HyperbacSrv.exe service, as shown below:

What I didn't realise was that if you restart HyperbacSrv.exe then YOU MUST also restart the SQL Server service. If you don't, the databases will not reconnect. Depending on their size, the databases can take several minutes (over 10 in my case) to recover and come back online.
NOTE: This bug has now been fixed by Redgate in their latest version

6.2. Fix for Memory issue

After liaising with Jeff Aven about this, we found it was due to the process doing a RESTORE HEADERONLY on the backup file as opposed to a RESTORE FILELISTONLY. I was supplied with a couple of patched versions (Hyperbac v5.0.24.1 then v5.0.24.3) which addressed and rectified the issue.
NOTE: This bug has now been fixed by Redgate in their latest version

6.3. Large virtual files if multiple restore threads ran

I ran a test to see if multiple restores could be run simultaneously, which worked, with the obvious IO overhead and something a bit strange which I didn't notice at first: the virtual files are huge! Some around 9GB!!!

I believe this to be a memory resource issue, as I've also tested restoring a single DB on a server with only 2GB of RAM and found that it produced a vmdf file of 13GB!!! I'm currently liaising with Redgate again to investigate this issue.

7. Other intermittent issues

7.1. Incorrectly formed backup files

Msg 3241, Level 16, State 0, Line 1
The media family on device ‘d:\Backup\DB.sqb’ is incorrectly formed. SQL Server cannot process this media family.
Msg 3013, Level 16, State 1, Line 1
RESTORE DATABASE is terminating abnormally.

This happened every now and again and I couldn’t pinpoint why. Quite frustrating when trying to build a dynamic restore of all files with their relevant filegroups/files.
If you keep running the script, eventually you'll get this:

Msg 10054, Level 20, State 0, Line 0
A transport-level error has occurred when sending the request to the server. (provider: TCP Provider, error: 0 – An existing connection was forcibly closed by the remote host.)

Run the script again and then it’ll work!!!
NOTE: This bug has now been fixed by Redgate in their latest version

7.2. Replace command doesn’t actually seem to replace

I found that when restoring a database using REPLACE, it didn't reset the virtual file sizes; it seems to append to the existing vmdf and vldf files as opposed to actually replacing them.
NOTE: This bug has now been fixed by Redgate in their latest version

8. 500k row insert performance test

I decided to run some quick performance tests based on Glenn Berry's 1 million row insert test (http://www.sqlservercentral.com/blogs/glennberry/archive/2010/02/07/some-initial-insert-test-benchmarks-on-sql-server-2008-r2.aspx). I shortened this to 500k rows, compared it to a physical database on the same server, and truncated BigTable before each run.

8.1. Looping insert (exec spAddDataToBigTable 500000, 0)

Database Type   Start Time           End Time             Elapsed (secs)   No. Of Rows   Rows Per Sec
Physical        Sep 10 2010 3:14PM   Sep 10 2010 3:15PM   58               500000        8620
Virtual         Sep 10 2010 1:45PM   Sep 10 2010 2:04PM   1083             500000        461

8.2. Bulk insert (exec spAddDataToBigTable 500000, 1)

Database Type   Start Time           End Time             Elapsed (secs)   No. Of Rows   Rows Per Sec
Physical        Sep 10 2010 3:16PM   Sep 10 2010 3:17PM   79               500000        6329
Virtual         Sep 10 2010 1:29PM   Sep 10 2010 1:31PM   109              500000        4587

From these results you can see the impact that having to autogrow the virtual files for every single record has on the timings. The batched insert still shows a performance impact, but not as significant a one as the previous test.

9. Conclusion

Although during the short time I've spent with SQLVR I've had a number of issues with memory, file sizings and actually getting a file to restore, I can honestly say I'm extremely impressed with this product. Not only are we saving on average £3000 per virtual SQL Server we create, due to the reduction in SAN storage utilised, but I can also get a Dev/UAT environment up and running significantly quicker than before. Couple this with all the benefits of using it as a quick and easy way to check your databases for corruption, or to test a procedure / index change without impacting other environments, and I'd recommend this product to anyone.
So, to answer the question of whether SQLVR is truly a friend or indeed a foe: from personal experience I'd like to think of it as a friend. With a few bumps 'n' scrapes along the way, it has certainly helped me out in my time of need.
One thing I'd like to add: in no way, shape or form am I suggesting everyone utilises this product in the same way we have; for some, the performance impact could be too great to implement in a UAT or even Dev environment. But I've thrown into the mix the possibility that it can be done, and the benefits you can get from doing so.

10. Recommendations

I think the biggest improvement I'd like to see in later releases is the ability to pre-allocate virtual file sizes. I'm not sure if this is actually possible, but I believe that if it is then the one major drawback to this product (autogrow / IO performance) could be alleviated somewhat. This would also help with the fact that at present, every time I restore a database using SQLVR I cannot guarantee how much space it will utilise; every restore so far has been different, in some cases massively different!
I'd also like to see the GUI updated to include a simple option to place the data and log files on different drives. I'm not sure how much performance impact there is (I never tested it) from having the vmdf and vldf files on the same drive. You can get around this by scripting the restore.

11. Thank you’s

I'd like to give a big shout out to Jeff Aven and the support/development teams for working so swiftly to rectify any issues that I uncovered. Jeff is a man who doesn't seem to sleep and was always on hand, any time of the day or night, to have issues addressed ASAP.

SQL Server CPU’s at 100%? Anyone checked the VAS (Virtual Address Space)??? – Part 2 (The Fix!) —

First of all I have to make an apology to everyone on SQLServerCentral.com for the lateness of this follow-up blog, as I know a lot of you have been requesting a follow-up to Part 1, which was originally blogged there. I wrote Part 1 over a year ago (I recently moved to a new blog site) and started work on Part 2, but never got to the point of actually finishing it.

In Part 1, I wrote about how every now and again our live production SQL Server's CPUs would flatline at 100%. This could happen pretty much any time of the day; there was no consistency in when it would happen. It was becoming a major issue for the company (and also myself!) because we could not afford any downtime at all, and I was running out of ideas as to how to troubleshoot and ultimately resolve the issue. Then, as if from nowhere, Slava Ok came to the rescue when I came across his blogs on VAS in SQL 2005. And this is where I'll start my follow-up, by attempting to go through the steps to track, find and resolve the process causing the 100% CPU.

As stated in Part 1, I'd been through all the usual checks a DBA would go through when troubleshooting any spikes or flatlines in CPU, with no real luck. After reading Slava's blogs on VAS and tying some of that up to the errors we were receiving in the SQL logs, I thought there was a very good chance that VAS was the issue and set about proving that fact.

NOTE: This VAS issue was on a 32bit Windows 2003 Enterprise server with SQL Server 2005 Enterprise (Service Pack 2 CU9) with 4 x 3ghz Xeon processors and 16GB RAM.

First off, in order to try and give a bit of leeway between outages, I looked into the startup parameter -g, which sets the amount of MemToLeave (MTL) for SQL Server. By default this is 384MB (256MB + (0.5MB x max worker threads [default: 255]) = 384MB; calculation taken from Era of Data), so I took the decision to add the startup parameter -g512 to give us a bit more to play with and hopefully allow longer periods between these outages. I'm not for one minute suggesting this is best practice or the right thing to do on your systems, but for us at the time it was a decision we made, and it ultimately paid off: our outages went from a weekly occurrence back to about 1 in 3 weeks. The time between outages had been getting shorter and shorter; it started as 1 in 4-5 weeks and slowly became more and more frequent, which added to the confusion.

So, with the outages now less frequent, I could start to track the VAS usage and see if I could pinpoint what was causing the issue. First of all I had to capture what the total and maximum free VAS was during the day and look for trends. The script below creates the table to capture the results, plus the view (it could be a CTE if you wish) used to get the VAS, which was taken from Slava's blog, and a modified script taken from Christian Bolton's blog which captures the server name, the max free size, the total available and the capture date.


--Create table to capture results
IF EXISTS (SELECT 1 FROM sys.objects WHERE object_id = OBJECT_ID(N'VASAvailability') AND type = N'U')
    DROP TABLE dbo.VASAvailability
GO
CREATE TABLE dbo.VASAvailability
(
    VASAvailabilityID INT IDENTITY(1,1),
    ServerName VARCHAR(100),
    TotalAvailMem_KB INT,
    MaxFreeSize_KB INT,
    CreatedDate DATETIME
)
GO

--Create VAS view
IF EXISTS (SELECT 1 FROM sys.objects WHERE object_id = OBJECT_ID(N'vwVas_Summary') AND type = N'V')
    DROP VIEW dbo.vwVas_Summary
GO
CREATE VIEW dbo.vwVas_Summary
AS
SELECT
    Size     = Vas_Dump.Size,
    Reserved = SUM(CASE (CONVERT(INT, Vas_Dump.Base) ^ 0) WHEN 0 THEN 0 ELSE 1 END),
    Free     = SUM(CASE (CONVERT(INT, Vas_Dump.Base) ^ 0x0) WHEN 0 THEN 1 ELSE 0 END)
FROM
(
    --- combine all allocations according with allocation base, don't take into
    --- account allocations with zero allocation_base
    SELECT CONVERT(VARBINARY, SUM(region_size_in_bytes)) AS Size,
           region_allocation_base_address AS Base
    FROM sys.dm_os_virtual_address_dump
    WHERE region_allocation_base_address <> 0x0
    GROUP BY region_allocation_base_address
    UNION ALL
    --- we shouldn't be grouping allocations with zero allocation base
    --- just get them as is
    SELECT CONVERT(VARBINARY, region_size_in_bytes), region_allocation_base_address
    FROM sys.dm_os_virtual_address_dump
    WHERE region_allocation_base_address = 0x0
) AS Vas_Dump
GROUP BY Vas_Dump.Size

Now that the relevant objects were in place to start tracking, I set about creating a SQL Agent job to run every minute and capture these results. This may seem a little extreme, and some people have asked whether every 5 or 10 minutes would have been sufficient, but due to the amount of traffic passing through this system every minute may not actually have been enough; I stuck with it as a starter. Below is the script I used to insert the results of the VAS (total and max); all that remained was to create the job and schedule it to run every minute.

INSERT INTO dbo.VASAvailability (ServerName, TotalAvailMem_KB, MaxFreeSize_KB, CreatedDate)
SELECT  @@SERVERNAME
       ,SUM(CONVERT(BIGINT, Size) * Free) / 1024 AS [TotalAvailMem_KB]
       ,CAST(MAX(Size) AS BIGINT) / 1024 AS [MaxFreeSize_KB]
       ,GETDATE()
FROM    dbo.vwVas_Summary
WHERE   Free <> 0
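If you prefer to script the job rather than click through the GUI, something along these lines would do it. This is a sketch of how such a job might be created, not the exact job we used; the job name and database name are placeholders to adjust for your environment:

```sql
-- Sketch: create a SQL Agent job that runs the capture insert every minute.
-- 'DBA' is a placeholder for wherever dbo.VASAvailability lives.
USE msdb;
EXEC dbo.sp_add_job @job_name = N'Capture VAS Availability';
EXEC dbo.sp_add_jobstep
     @job_name      = N'Capture VAS Availability',
     @step_name     = N'Insert VAS snapshot',
     @subsystem     = N'TSQL',
     @database_name = N'DBA',
     @command       = N'INSERT INTO dbo.VASAvailability (ServerName, TotalAvailMem_KB, MaxFreeSize_KB, CreatedDate)
                        SELECT @@SERVERNAME,
                               SUM(CONVERT(BIGINT, Size) * Free) / 1024,
                               CAST(MAX(Size) AS BIGINT) / 1024,
                               GETDATE()
                        FROM dbo.vwVas_Summary
                        WHERE Free <> 0;';
EXEC dbo.sp_add_jobschedule
     @job_name             = N'Capture VAS Availability',
     @name                 = N'Every minute',
     @freq_type            = 4,   -- daily
     @freq_interval        = 1,
     @freq_subday_type     = 4,   -- units of minutes
     @freq_subday_interval = 1;
EXEC dbo.sp_add_jobserver @job_name = N'Capture VAS Availability';
```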

Over the next couple of weeks I monitored the results of the VAS, trying to look for any trends or patterns as to when the VAS would drop and not be reclaimed. The nature of VAS is that a process will claim a portion of it but should, in theory, release it once finished. At first I couldn't find a particular day or even time when the total VAS reduced significantly and never came back. As with the flatlined CPUs, it seemed to be happening on any day and at any time. By this point I'd lost most of my hair through ripping it out and was on the verge of launching our system out of the window!

All I could do from here was take sample times from when the VAS dropped and investigate what processes were running at those times. The godsend at the time was having SQL Sentry's Performance Advisor for SQL Server, as it allowed me to quickly interpret what was running at any given moment and saved me god knows how long in setting up server-side traces etc to monitor this. A few of the time periods I chose were inconclusive due to the amount of traffic going through our system at those times; it could have been any one of over 1000 different procedures – gutted! Over the next few weeks I continued to monitor the results and finally started to see a trend becoming apparent. We seemed to lose a large portion of VAS on a Monday evening, which I had at first half discredited because it wasn't every Monday evening and there wasn't anything major running at that time…………..or so I thought.
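To spot that kind of trend without eyeballing thousands of rows, a simple aggregate over the capture table helps. This is a sketch of the sort of query I mean, not a script from the original troubleshooting:

```sql
-- Sketch: minimum free VAS per day and hour, to show when the largest
-- free block drops and fails to recover.
SELECT  CONVERT(CHAR(10), CreatedDate, 120) AS CaptureDay,
        DATEPART(HOUR, CreatedDate)         AS CaptureHour,
        MIN(MaxFreeSize_KB)                 AS MinFreeBlock_KB,
        MIN(TotalAvailMem_KB)               AS MinTotalAvail_KB
FROM    dbo.VASAvailability
GROUP BY CONVERT(CHAR(10), CreatedDate, 120), DATEPART(HOUR, CreatedDate)
ORDER BY CaptureDay, CaptureHour;
```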

Every evening we had a large overnight XML file import process which ran at various times depending on when we received the file from a 3rd party; this could be anywhere from 8pm until 2am the following morning. What I hadn't thought of is that this process didn't run over a weekend, so on a Monday we had a file 3 times the normal size to cater for 3 days' worth of data; bearing in mind that a normal file was around 250MB, the Monday evening file would be ~700MB. Now, this normally wouldn't present an issue, as the process ran a procedure using sp_xml_preparedocument and sp_xml_removedocument to clean up after itself, but what I wasn't aware of until speaking with the development team was that their service sat waiting for this file and would begin processing it as soon as it saw it – even though it hadn't fully downloaded! Being XML, the import would obviously fail on an incomplete file, and this could potentially happen 3 or 4 times during the 20-30 minutes it took to download the file (very slow uplink from the 3rd party).

Can you see where I’m going with this?

What I uncovered was that each time an import of the file was attempted, the procedure would run sp_xml_preparedocument, try to process the incomplete file and ultimately bomb out without running sp_xml_removedocument, because error handling was done in COM+ with nothing in the stored procedure itself. Something I actually learned from this was that rolling back via COM+ doesn't "unprepare" the XML document; the only way to do that is to deallocate it via sp_xml_removedocument. We amended the procedure to handle reading the XML file by using a simple TRY…CATCH, along these lines:

BEGIN TRY
    -- Do import / sp_xml_preparedocument
    EXECUTE sp_xml_removedocument @iDoc
END TRY
BEGIN CATCH
    -- Deallocate the prepared document even when the import fails part-way
    EXECUTE sp_xml_removedocument @iDoc
    -- Raise an error with the details of the exception
    DECLARE @ErrMsg nvarchar(4000), @ErrSeverity int
    SELECT @ErrMsg = ERROR_MESSAGE(), @ErrSeverity = ERROR_SEVERITY()
    RAISERROR(@ErrMsg, @ErrSeverity, 1)
END CATCH

And that was that! At the time of writing part 1, we hadn't experienced the issue in the 2 months since implementing this error handling. At the time of writing this (well, when I started it) it was over 6 months, and we never experienced the same issue of 100% CPU due to low VAS again. We continued to monitor VAS by keeping the agent job running (at a reduced frequency) and having our bespoke proactive monitoring system alert us when it fell below 30MB, warning us that we might be experiencing memory pressure caused by low VAS and that our systems were potentially going to go skits!
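The alert itself was part of our bespoke monitoring, but the core check can be sketched against the same view; the threshold and message below are illustrative only:

```sql
-- Sketch: warn when the largest free VAS block drops below 30MB.
DECLARE @MaxFree_KB BIGINT;

SELECT @MaxFree_KB = CAST(MAX(Size) AS BIGINT) / 1024
FROM   dbo.vwVas_Summary
WHERE  Free <> 0;

IF @MaxFree_KB < 30 * 1024
    RAISERROR('Warning: largest free VAS block is %I64d KB - possible VAS pressure',
              10, 1, @MaxFree_KB) WITH LOG;
```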

SQL Server CPU’s at 100%? Anyone checked the VAS (Virtual Address Space)??? —

SQL Server CPU’s at 100%? Anyone checked the VAS (Virtual Address Space)???

As a DBA, you can almost guarantee that at some point you'll experience a server maxing out its CPUs at 100%. Here is my latest experience with this situation, which I've decided to blog because whilst researching it I found there isn't much in the way of documentation for this. Hope it helps many others out there……

This issue was on a Windows 2003 Enterprise server with SQL Server 2005 Enterprise (Service Pack 2 CU9) with 4 x 3GHz Xeon processors and 16GB RAM.


Every now and again (roughly every 2 weeks) our live production SQL server's CPUs would flatline at 100%. This could happen pretty much any time of the day; there was no consistency in when it would happen – it could be first thing in the morning, mid-afternoon or late evening.


Instinct and experience told me that CPUs at 100% would no doubt be down to a big process that probably shouldn't be running during peak times (an AV scan or the like), or a query with a lot of big table / index scans or hash joins. As I knew we don't run AV scans during peak times, and task manager showed that the biggest CPU hogger was SQL Server itself, my primary focus was to investigate the procedures currently running. Knowing that only myself and the technical architect have access to the live production environment, and that neither of us was running anything, it had to be something being run from our in-house VB.net system.

Starting from the beginning, I started simple with sp_who2 and sp_lock. These themselves could take up to 30 seconds to return any output! When they did, the number of processes running was about normal, but the one thing that did stand out was the number of blocked processes. Since a lot of procedures were blocking a number of other processes, I decided to look into sysprocesses for long-running queries, open transactions and high CPU time, to see if anything stood out as the problem. However, with the CPUs at 100%, EVERYTHING was running slow. Queries that normally took less than 500ms were taking up to a minute, which as you can imagine is not acceptable in a live environment. Nothing stood out as an issue; those that seemed like potential candidates were killed (risky, I know). The daily Profiler trace we run was difficult to analyse because of the above-mentioned issue – every single query going through the system was off the scale. With this in mind, I stopped the Profiler and even tried restarting the agent to see if it was a transactional replication issue – but still the CPUs sat at 100%.
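For reference, this is the sort of sysprocesses check I mean; the column choice and spid filter are my own sketch (on SQL 2005 the sys.dm_exec_requests DMV would do the same job, but sysprocesses still works):

```sql
-- Sketch: look for long runners, blocking, open transactions and high CPU time.
SELECT  spid, blocked, open_tran, cpu, physical_io,
        DATEDIFF(SECOND, last_batch, GETDATE()) AS SecondsSinceLastBatch,
        status, loginame, hostname, program_name
FROM    master.dbo.sysprocesses
WHERE   spid > 50            -- ignore system spids
ORDER BY cpu DESC;
```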

So, to re-cap.

  • There were no external / 3rd party processes running on this box as it's a dedicated SQL server. The process with high CPU was SQL Server itself.
  • There were no long-running queries / high CPU times that stood out; almost every procedure had an open transaction because they were taking so long to process.

The only option I had was to fail the server over, as I now had department heads lined up at my desk asking "is it fixed yet? is it fixed yet? is it fixed yet?". Sure enough, fail the servers over or restart the SQL service and all was well again.

Checking the SQL logs showed nothing other than the fact that I'd failed the server over and that the every-30-minutes (Redgate) SQL log backups had failed, so no luck there…..or so I thought (see further down). I tried trawling the Profiler trace around the time the high CPU issue began, but nothing jumped out. After this had happened a couple of times it was becoming a very big problem. I was in a very difficult situation: nothing I searched after the incidents showed anything, and I had roughly 5 minutes to investigate whilst the issue was apparent.

My next train of thought was a memory leak. I trawled Microsoft and Google looking for any patches or KBs referring to memory leaks, but everything I found was either related to SQL 2000 or fixed in cumulative updates earlier than the one we had installed, and the one relating to McAfee turned out to be a dead end as we had a later version than the one mentioned as having the memory leak issues.

I decided to take the slight performance hit and run disk, CPU and memory perfmon counters, as well as running Quest's Spotlight to monitor the situation. Once again, when the high CPU issue arose all I got was basically that: CPUs maxed out. Matching perfmon to the Profiler runs didn't bring anything to light. By now I was running out of ideas. I tried turning hyper-threading off and restricting affinity to 2 CPUs instead of 4, but still no luck – every couple of weeks all the CPUs would still max out!

So, I once again returned to the log files and delved further into the one common theme I could find whenever the issue arose – Redgate VDI / log backup failures.

The error log (cut down) showed:

2009-01-19 17:15:00.59 spid435 Error: 18210, Severity: 16, State: 1.
2009-01-19 17:15:00.59 spid435 BackupVirtualDeviceSet::Initialize: Request large buffers failure on backup device ‘SQLBACKUP_C9D0A658-E6F1-4A03-8CC7-AC8466A5B202’. Operating system error 0x8007000e(Not enough storage is available to complete this operation.)

This was initially seen (by myself and colleagues) as a red herring, as there was more than enough disk space available to initiate the backups. Looking a bit further (1000s of errors were output during this failure), I found errors such as:

Event Type: Error
Event Source: SQLVDI
Event Category: None
Event ID: 1
Date: 14/02/2009
Time: 2:10:03 PM
User: N/A
Computer: SqlBox1
SQLVDI: Loc=BufferAreaManager::MapBuffers. Desc=Out Of Address Space. ErrorCode=(8)Not enough storage is available to process this command.
. Process=512. Thread=2500. Server. Instance=SQL2000. VD=Global\SQLBACKUP_690D9E54-3147-4530-B037-7BA589B1B005_SQLVDIMemoryName_

Virtual Address Space? Aha! Now we’re getting somewhere. I searched Redgate for this VDI error and came up with this forum post: http://www.red-gate.com/messageboard/viewtopic.php?t=4240&postdays=0&postorder=asc&start=0

Now, although I was having similar issues relating to VDI, we didn't seem to be having the common issue of 100% CPUs, but I did decide to monitor the server using EXEC master..sqbmemory and this proved to be very interesting. When our server was maxing out at 100%, I ran this procedure and lo and behold, the maximum free block of VAS was 3MB! I restarted the SQL service and checked it again; it was back to circa 300MB.

This led me to believe that it was almost certainly VAS pressure we were experiencing, possibly created by Redgate SQL Backup. I decided to research VAS further and found a brilliant blog by Slava Oks, which can be found at: http://blogs.msdn.com/slavao/

I decided to monitor VAS by capturing the free blocks and the total available every minute, to see if I could trap what was actually eating it up. Over the next few weeks it became apparent that we had a large XML file (> 500MB) which was imported (using sp_xml_preparedocument and sp_xml_removedocument) every evening, and at some point during this import the total VAS available dropped drastically. Taking this a step further, I decided to capture the VAS at every step of the import and found that one section seemed to be the culprit. The ins and outs of this are beyond the scope of this post, but I may blog it in the future.
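Capturing the VAS at every step amounts to bracketing each statement of the import with a snapshot and diffing consecutive rows. A sketch, assuming a capture table and view like the dbo.VASAvailability / dbo.vwVas_Summary pair shown earlier in this post:

```sql
-- Snapshot the free VAS before a step of the import
INSERT INTO dbo.VASAvailability (ServerName, TotalAvailMem_KB, MaxFreeSize_KB, CreatedDate)
SELECT @@SERVERNAME,
       SUM(CONVERT(BIGINT, Size) * Free) / 1024,
       CAST(MAX(Size) AS BIGINT) / 1024,
       GETDATE()
FROM   dbo.vwVas_Summary
WHERE  Free <> 0

-- ...run one step of the import here (e.g. the sp_xml_preparedocument call)...

-- Snapshot again; the drop between the two rows is what that step consumed
```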

So, amend the procedure and all VAS issues were resolved – we've not had a 100% CPU issue in over two months now (touch wood!). I am still monitoring the VAS (with alerts in place for if it drops too low), but I'm now 99.9% certain (there's always room for doubt 🙂 ) that this issue is resolved. Took a while but got there in the end!!!

Anyone else had similar issues?

Below I have listed a couple of links to descriptions of VAS for you to either brush up on or even learn a thing or two.

Virtual Address Space (VAS):