As a DBA, you can almost guarantee that at some point you’ll experience a server maxing out its CPU’s at 100%. Here is my latest experience with this situation which I’ve decided to blog as I found whilst researching that there isn’t that much in the way of documentation for this. Hope it helps many others out there……
This issue was on a Windows 2003 Enterprise server with SQL Server 2005 Enterprise (Service Pack 2 CU9) with 4 x 3ghz Xeon processors and 16GB RAM.
Every now and again (roughly 2 weeks) our live production sql server’s CPU’s would flatline at 100%. This could happen pretty much anytime of the day, there was no consistency in when it would happen, could be first thing in the morning, mid-afternoon or late evening.
Instinct and experience told me that CPU’s at 100% will no doubt be an issue with a big process that probably shouldn’t be running during peak times (AV scan or the likes), a query with possibly a lot of big table / index scans or hash joins. As I knew we don’t run AV scans during peak times and that from task manager I could see that the biggest CPU hogger was sql server itself, my primary focus was to investigate the procedures currently running. So, knowing that only myself and the technical architect have access to the live production environment and I knew that I wasn’t running anything and neither was he, it had to be something being ran from our in-house VB.net system.
Starting from the beginning, I started simple with a sp_who2 and sp_lock. These themselves could take up to 30seconds to return any output! When they did the number of processes running was about normal but the one thing that did stand out was the number of blocked processes. Since there were a lot of procedures blocking a number of other processes I decided to look into sysprocesses for processes and queries that were long running, open transactions and high CPU time to see if anything stood out as being the problem. However, due to the cpu’s being at 100%, EVERYTHING was running slow. Queries that normally took less than 500ms were taking up to 1min which as you can imagine is not acceptable in a live environment. Nothing seemed to be standing out as an issue, those that seemed like potential candidates were killed (risky I know). The daily profiler we run was difficult to research because of the above mentioned issue, every single query that was going through the system was off the scale. With this in mind, I stopped the Profiler and even tried restarting the agent to see if it were a transactional replication issue – but still the cpu’s at 100%.
So, to re-cap.
- There are no external / 3rd party processes running on this box as its a dedicated sql server. The process with high cpu was sql server itself.
- There were no long running queries / high cpu time that stood out, almost every procedure had an open transaction as they were taking so long to process.
The only option I had was to fail the server over as I now had department heads lined up at my desk asking “is it fixed yet?, is it fixed yet?, is it fixed yet?”. Sure enough, fail the servers over or restart the sql service and all is well again.
Checking the sql logs show nothing other than the fact that I’ve failed the server over and that the every 30min (Redgate) sql log backup’s failed, so no luck there…..or so I thought (see further down). I tried trailing the profiler around the time of high cpu issue beginning but nothing jumping out. After this had happened a couple of times this was becoming a very big problem. I was in a very difficult situation, nothing I searched after the incidents showed anything and I had roughly 5minutes to investigate whilst the issue was apparent.
My next train of thought was a memory leak. I trawled Microsoft and google looking for any patches or kb’s referring to memory leaks but all I found were either related to SQL 2000, fixed in earlier sql server cumulative updates to what we had installed and the one I found relating to McAfee turned out to be a dead end as we had a later version to that mention as having the memory leak issues.
I decided to take the slight performance hit and run Disk, CPU and memory perfmon’s as well as running quests spotlight to monitor this situation. Once again, when the high CPU issue arose all I got was basically that; CPU’s maxed out. Matching perfmon to the profiler ran didn’t bring anything to light. By now I was running out of ideas. I tried turning Hyper Threading off, restricting affinity to 2cpu’s instead of 4 but still no luck – every couple of weeks all the cpu’s still maxing out!
So, I once again returned to the log files and delved further into the one common theme I could find when this issue arose – RedgateVDI / log backup failure’s.
The error log (cut down) showed:
2009-01-19 17:15:00.59 spid435 Error: 18210, Severity: 16, State: 1.
2009-01-19 17:15:00.59 spid435 BackupVirtualDeviceSet::Initialize: Request large buffers failure on backup device ‘SQLBACKUP_C9D0A658-E6F1-4A03-8CC7-AC8466A5B202’. Operating system error 0x8007000e(Not enough storage is available to complete this operation.)
This was initially seen (by myself and colleagues) as a red-herring, due to there being more than enough storage available to initiate the backups. Looking a bit further (1000’s of errors output during this failure) I found errors such as:
Event Type: Error
Event Source: SQLVDI
Event Category: None
Event ID: 1
Time: 2:10:03 PM
SQLVDI: Loc=BufferAreaManager::MapBuffers. Desc=Out Of Address Space. ErrorCode=(8)Not enough storage is available to process this command.
. Process=512. Thread=2500. Server. Instance=SQL2000. VD=Global\SQLBACKUP_690D9E54-3147-4530-B037-7BA589B1B005_SQLVDIMemoryName_
Virtual Address Space? Aha! Now we’re getting somewhere. I searched Redgate for this VDI error and came up with this forum post: http://www.red-gate.com/messageboard/viewtopic.php?t=4240&postdays=0&postorder=asc&start=0
Now, although I was having similar issues relating to VDI, we didn’t seem to be having the common issue of 100% cpu’s but I did decide to monitor the server using EXEC maser..sqbmemory and this proved to be very interesting. When our server was maxing out at 100%, I ran this procedure and low and behold, the maximum free block of VAS was 3mb! I restarted the sql service and checked it again and it was back to circ 300mb.
This lead me to believe that it was almost certainly a VAS pressure issue we were experiencing and possibly created by redgate sql backup. I decided to research VAS further and found a brilliant blog by Slava Oks which can be found at: http://blogs.msdn.com/slavao/
I decided to monitor VAS by capturing the free blocks and total available every minute to see if I can trap what is actually causing the VAS to be eaten up. Over the next few weeks it became apparent that we had a large xml file (> 500mb) which was imported (using sp_xml_preparedocument and sp_xml_removedocument) every evening and sometime during this import the total VAS available dropped drastically. Taking this a step further, I decided to capture the VAS at every step of this import and found that one section seemed to be the culprit. The ins and outs of this are beyond the scope of this post but I may blog it in the future.
So, amend the procedure and all VAS issues are now resolved – we’ve not had a 100% cpu issue in over two months now (touch wood!). I am still monitoring the VAS (alerts put in place for if it drops too low) but I’m now 99.9% certain (there’s always room for doubt 🙂 ) that this issue is now resolved. Took a while but got there in the end!!!
Anyone else had similar issues?
Below I have listed a couple of links to descriptions of VAS for you to either brush up on or even learn a thing or two.
Virtual Address Space (VAS):