The Problem

Late last week (20th) we had an emergency call from a company that had a production database go into Suspect mode and needed help. Now this isn’t a great situation to be in so when they then go on to tell us that the last valid backup they had was from the 12th and the backup job had been failing since then – even less of a great situation to be in 😢

There are many blogs and forums posts out there showing the steps to rectify this situation with the main options being

  1. Restore the last valid backup or
  2. Put the DB into Emergency mode and run CHECKDB with REPAIR_ALLOW_DATA_LOSS
  3. Create a new DB and migrate everything to it
    1. The data was in a readable state from the DB in Emergency mode – we were fortunate!
  4. Other options are available in certain scenarios

Depending on your backup strategy, options 1 and 2 can put you in the situation where data loss could occur but with this company the first option isn’t really an option as they would be losing 8 days worth of data.

This is where it started to get interesting, as we discussed what they had tried already they mentioned that they had tried some of the steps in blogs / forums to do option 2 but they were getting an error relating to In-Memory OLTP preventing it. Like many many others, these guys have an In-Mem OLTP filegroup from trying it out once and not being able to remove it but it had been like that for a number of years without causing an issue……..until today.

Aha, so we’re now in an even less of a great situation than the less of a great situation a few minutes earlier 👀. Unfortunately, In-Memory OLTP objects are not checked as part of a DBCC CHECKDB process so option 2 is again not an option. A standard CHECKDB had been run by the guys and returned no errors which helped narrow down the issue to being with the In-Memory OLTP structures as they’re not part of the check.

Another option we did explore as a last ditch effort before option #3 was a slightly modified version of the process from one of Paul Randal’s blog’s on “Creating, detaching, re-attaching, and fixing a SUSPECT database” to try and re-attach the mdf and rebuilding a new log file. That day I learned something new and that this is also not an option for DBs with In-Memory OLTP. Below is a snippet of code to show what I mean for attaching a DB and attaching a new log file.

USE [master]
( FILENAME = N'D:\DATA\DBNameHere.mdf' )

-- If the above doesn't work then try forcing...
    ON  ( FILENAME = N'D:\DATA\DBNameHere.mdf' )

The error that you will get is:

Msg 41316, Level 16, State 0, Line 7
Restore operation failed for database ‘DBNameHere‘ with internal error code ‘0x88000001’.
Msg 41836, Level 16, State 1, Line 10
Rebuilding log is not supported for databases containing files belonging to MEMORY_OPTIMIZED_DATA filegroup.
DBCC results for ‘DBNameHere‘.
CHECKDB found 0 allocation errors and 0 consistency errors in database ‘DBNameHere‘.
Msg 7909, Level 20, State 1, Line 10
The emergency-mode repair failed. You must restore from backup.

The “Fix

So really, the only option now is #3, to copy all the data to a new database following the steps below as a guide:

  • disable logins so applicaiton will not be able to connect
  • create new database (without in-memory OLTP filegroup)
  • script schema / procs / users etc from Emergency mode db
  • disable foreign keys disable nonclustered indexes
  • migrate the data
    • We used the import method from SSMS to quickly / easily utilise the identity seed management but there are other methods available
  • enable foreign keys (WITH CHECK)
  • rebuild nonclustered indexes
  • drop emergency mode db
  • rename new database back to old name
  • enable logins

Although this took a while, most of it on making sure the data migrated was indeed correct, they managed to recover their data and are back up and running, albeit with a 1 day outage. These guys were extremely lucky and have learned a valuable lesson about ignoring failed backup errors.

The Cause?

Although we couldn’t 100% say with certainty what exactly happened to cause this, from the error log we could see that a restore of the DB was attempted on a secondary instance and the script did not include a WITH MOVE and it attempted to overwrite the MEM folder for the production DB. These files were locked by SQL Server and the log indicated this also but the problems were being seen not long after and the error from the failed backups relates to missing objects from the MEM folder so it is a likely cause.

Couple of things to take away from this:

  • Always check and validate your backups
    • These guys did that every night by restoring this DB to another instance for reporting, their problem was ignoring the backup failures / errors
  • Be very wary when looking to implement / try out In-Memory OLTP, especially if you’re throwing it straight into production. Once the filegroup is created then it can’t be got rid of and if you are unfortunate enough to encounter corruption and don’t have valid backups then you are in a more difficult situation compared to if you weren’t using In-Memory OLTP