Consolidating Snapshots: Failed Backups and Consolidating via PowerCLI

I began my day investigating some failed backups and found errors stating that a snapshot of one VM could not be taken.  So I began testing.
  • Let's try a snapshot of the affected VM.  -- Success!
Okay, no problem, remove that snapshot. Um, now it is 6 hours later and the snapshot removal is at 31%.  (I had begun looking into why at about the 3-hour mark.)
  • Warning message from <generic vmhostname>: This virtual machine has more than 100 redo logs in a single branch of its snapshot tree. Deleting some of the snapshots or consolidating the redo logs will improve performance. The maximum number of redo logs supported is 255.
That would do it.  It appears that each time the backup mechanism ran, it created a redo or delta file (a snapshot) and attempted to back up the disk, apparently without the snapshot ever being merged back afterwards.  Once the pile of redo logs became too onerous, the snapshot step itself began to fail.  This is one of those gotchas you don't see until it is too late (for example, when the LUN fills up).
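To make the thresholds in that warning concrete, here is a minimal Python sketch that counts redo logs per disk from a plain list of file names. It assumes the common "<disk>-NNNNNN-delta.vmdk" (or "-sesparse.vmdk") naming for snapshot data files; the function names and the sample VM name are hypothetical, and this is not something the backup product or vSphere provides.

```python
import re
from collections import Counter

# Assumption: snapshot data files are named "<disk>-NNNNNN-delta.vmdk"
# or "<disk>-NNNNNN-sesparse.vmdk". Descriptor and flat files are ignored.
DELTA_PATTERN = re.compile(r"^(?P<disk>.+)-\d{6}-(?:delta|sesparse)\.vmdk$")

WARN_THRESHOLD = 100  # "more than 100 redo logs" in the warning message
MAX_REDO_LOGS = 255   # "maximum number of redo logs supported is 255"


def count_redo_logs(filenames):
    """Return a Counter mapping each base disk name to its number of delta files."""
    counts = Counter()
    for name in filenames:
        match = DELTA_PATTERN.match(name)
        if match:
            counts[match.group("disk")] += 1
    return counts


def classify(count):
    """Classify a redo log count against the two numbers from the warning."""
    if count >= MAX_REDO_LOGS:
        return "at limit"
    if count > WARN_THRESHOLD:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Hypothetical directory listing for a VM called "web01" with 120 redo logs.
    listing = ["web01.vmdk", "web01-flat.vmdk"]
    listing += [f"web01-{i:06d}-delta.vmdk" for i in range(1, 121)]
    for disk, count in count_redo_logs(listing).items():
        print(disk, count, classify(count))
```

Feeding it a real datastore directory listing would be one way to spot a VM quietly piling up redo logs before the backup starts failing.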

I know many people have harped on the fact that snapshots are not backups, even though many storage vendors use the term for their backup features. That point comes from hard experience.  A snapshot is a means to stop I/O to the virtual hard drive file so that it can be moved or copied. The VM's pointer redirects writes to a snapshot or "delta" file, which records every write in sequence until it is merged back into the original.  That file keeps growing, and growing, until the merge happens.

Technically, you can do this up to 255 times or until the LUN fills up, but it is generally bad practice because it affects the performance of the VM, which has to search across the main vmdk file and the deltas for the required block of data.
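The redirect-and-search behaviour described above can be illustrated with a deliberately simplified toy model in Python. This is only a sketch of the idea, not how ESXi actually implements its disk chain; the class and method names are made up for illustration.

```python
class SnapshotChain:
    """Toy model: a base disk plus a stack of delta files (oldest first)."""

    def __init__(self, base):
        self.base = dict(base)  # block number -> data
        self.deltas = []        # each delta is a dict of blocks written since it was created

    def take_snapshot(self):
        # New writes are redirected to a fresh, empty delta.
        self.deltas.append({})

    def write(self, block, data):
        # Writes always land in the newest delta, or in the base if there is none.
        target = self.deltas[-1] if self.deltas else self.base
        target[block] = data

    def read(self, block):
        """Return (data, layers_checked), searching newest delta first."""
        layers_checked = 0
        for delta in reversed(self.deltas):
            layers_checked += 1
            if block in delta:
                return delta[block], layers_checked
        layers_checked += 1  # finally fall through to the base disk
        return self.base.get(block), layers_checked

    def consolidate(self):
        # Merge deltas oldest first so that newer writes win, then drop them.
        for delta in self.deltas:
            self.base.update(delta)
        self.deltas = []


if __name__ == "__main__":
    chain = SnapshotChain({0: "a", 1: "b"})
    for _ in range(5):
        chain.take_snapshot()
    chain.write(1, "B")
    print(chain.read(0))  # an old block is only found after walking every delta
    chain.consolidate()
    print(chain.read(0))  # after consolidation there is only the base to check
```

In this model a block that was never rewritten has to be looked for in every delta before the base is reached, which is the kind of overhead that grows with each extra redo log and disappears once the chain is consolidated.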

Robert van den Nieuwendijk has a great little one-liner that would help tremendously in checking the system for this scenario, for example through a scheduled task or similar.

One of the links Robert points to shows the GUI method and gives some more detail around the issue.

One thing we considered was shutting the VM's OS down to do some other maintenance while the snapshot removal was still in progress.  DO NOT DO IT!  Numerous community forums point out that this can cause corruption in the guest OS.