My name is Sean Dwyer and I am a Support Escalation Engineer with the Microsoft CORE team.
I’d like to share a quick tip for Windows Server 2008 Cluster admins.
There may come a time, for whatever reason, that a cluster managed volume is flagged as dirty and you will see an event ID message indicating that CHKDSK needs to run against the volume.
In a best-case scenario, you can take the volume out of production, run CHKDSK on the volume if needed (refer to: http://technet.microsoft.com/en-us/library/cc772587.aspx), and then put the volume back into production.
In most situations though, the volume that needs attention is a heavily utilized production volume and it will be extremely disruptive to have the volume offline for any length of time.
For example, a recent case I was involved with had a 14 TB* (see note 1 below) volume that was being flagged for CHKDSK to run on it about once a month. The volume had about 9 TB of data on it. Apart from the concern of why the volume was continually being flagged as corrupt, the length of time that CHKDSK took to run on the volume was extremely painful for the customer’s business. When it ran initially, it took roughly 80 hours to complete a run on the volume.
It may be necessary to temporarily configure a problem volume to block CHKDSK from running against it while troubleshooting continues to determine why the volume is being flagged for CHKDSK to run.
I stress the word temporary here.
Turning off the health monitoring tool for the file system as a permanent solution will only lead to more downtime in the future, and you may end up on the phone with one of the File Systems experts on my team, such as Robert Mitchell.
Ok – so let’s talk specifics about temporarily blocking CHKDSK from doing work on a Cluster volume.
Say we’ve determined that we need to suspend CHKDSK from running on a problem volume. For you old school Cluster admins, the first command that probably jumps to mind is SKIPCHKDSK=1.
This works just fine for 2003 Clusters, but will NOT work for 2008.
If SKIPCHKDSK is used for a 2008 volume, it will be ignored when the disk is next brought online and CHKDSK will run against the volume. In a situation where the volume is 18 TB, the volume will remain unavailable for use until CHKDSK finishes* (see note 2 below).
The correct way to configure a volume to block CHKDSK from running on it is to use the DiskRunChkdsk switch.
Keep in mind that these two switches we’re discussing only apply to the Cluster environment.
If the machine is restarted, the OS will prompt for CHKDSK to run on the affected volumes.
For information on how to configure the OS to ignore the dirty bit, refer to:
158675 How to Cancel CHKDSK After It Has Been Scheduled
http://support.microsoft.com/default.aspx?scid=kb;EN-US;158675
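As a rough illustration of the OS-level side, the sketch below builds the `chkntfs` command lines that exclude volumes from the boot-time dirty check and later restore the default behavior. It is a minimal sketch, not the procedure from the KB article itself: the helper names and the drive letters `E:` and `F:` are hypothetical, and the commands are only assembled here, not executed.

```python
def chkntfs_exclude_command(volumes):
    """Build the argv for `chkntfs /x`, which excludes volumes from the boot-time check."""
    if not volumes:
        raise ValueError("at least one volume is required")
    # chkntfs /x replaces any earlier exclusion list instead of adding to it,
    # so every volume to exclude has to go into one invocation.
    return ["chkntfs", "/x", *volumes]


def chkntfs_restore_default_command():
    """Build the argv for `chkntfs /d`, which restores the default boot-time behavior."""
    return ["chkntfs", "/d"]


if __name__ == "__main__":
    # E: and F: are hypothetical drive letters used only for illustration.
    print(" ".join(chkntfs_exclude_command(["E:", "F:"])))
    print(" ".join(chkntfs_restore_default_command()))
```

The resulting lists can be passed to `subprocess.run` from an elevated prompt on the affected node. As with the Cluster setting, treat the exclusion as temporary and restore the default once troubleshooting is done.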
Let’s walk through an example of configuring this Cluster-specific switch for a volume, to give you a better idea of how to do it should you need to one day.
Step 1: Determine which disk to work with
(I’ll pick Disk 8 for this example)
Step 2: Determine the resource name as seen by Cluster
Step 3: Open an Admin command prompt and run the command
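The original screenshot of the command is not available here, so the following is a minimal sketch of what the command line might look like, under stated assumptions. It assumes the resource is named `Cluster Disk 8` (a hypothetical name for the Disk 8 example) and that `4` is the DiskRunChkdsk value that skips CHKDSK. Confirm both against your own cluster and the DiskRunChkdsk documentation before using it.

```python
import subprocess

# Assumption: 4 is the DiskRunChkdsk value that tells Cluster to skip CHKDSK.
# Verify this against the DiskRunChkdsk property documentation.
SKIP_CHKDSK = 4


def set_disk_run_chkdsk_command(resource_name, value=SKIP_CHKDSK):
    """Build argv for: cluster res "<resource>" /priv DiskRunChkdsk=<value>"""
    if not resource_name.strip():
        raise ValueError("resource name must not be empty")
    return ["cluster", "res", resource_name, "/priv", f"DiskRunChkdsk={value}"]


if __name__ == "__main__":
    # "Cluster Disk 8" is a hypothetical resource name for illustration.
    argv = set_disk_run_chkdsk_command("Cluster Disk 8")
    # list2cmdline shows the argv as it would be typed at a Windows prompt.
    print(subprocess.list2cmdline(argv))
```

Building the command as a list keeps the quoting of resource names with spaces out of your hands. Pass the list to `subprocess.run` from an elevated prompt when you are ready to apply it.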
Note: For the setting to WORK, the disk must be brought offline, and then online.
Step 4: Bring the disk offline, then online again.
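If you prefer to script the offline/online cycle rather than use the management console, a sketch along these lines could work. It assumes `cluster.exe` is on the path and that the resource is named `Cluster Disk 8`. The helper names are hypothetical, and the `run` parameter exists only so the sequence can be exercised without touching a real cluster.

```python
import subprocess


def cycle_resource_commands(resource_name):
    """Return the two cluster.exe command lines: offline first, then online."""
    return [
        ["cluster", "res", resource_name, "/offline"],
        ["cluster", "res", resource_name, "/online"],
    ]


def cycle_resource(resource_name, run=subprocess.run):
    """Take the resource offline and bring it back online, stopping on the first failure."""
    for argv in cycle_resource_commands(resource_name):
        # check=True raises CalledProcessError if cluster.exe reports an error,
        # so the online step is not attempted after a failed offline.
        run(argv, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them; "Cluster Disk 8" is hypothetical.
    for argv in cycle_resource_commands("Cluster Disk 8"):
        print(subprocess.list2cmdline(argv))
```

Remember that taking the disk offline interrupts access to the volume, so schedule this step accordingly.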
Step 5: Verify the setting is applied
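One way to check the value from a script is to list the private properties of the resource (`cluster res "<resource>" /priv` with no assignment) and look for the DiskRunChkdsk row. The sketch below only parses text. The sample output layout is an assumption made for illustration, so adjust the pattern to whatever your cluster actually prints.

```python
import re
from typing import Optional

# Assumed layout of the private-property listing; the real output may differ.
SAMPLE_OUTPUT = """
T  Resource             Name                           Value
-- -------------------- ------------------------------ -----------------------
D  Cluster Disk 8       DiskSignature                  0 (0x0)
D  Cluster Disk 8       DiskRunChkDsk                  4 (0x4)
"""


def parse_disk_run_chkdsk(listing: str) -> Optional[int]:
    """Return the decimal DiskRunChkdsk value found in the listing, or None if absent."""
    # Property names are matched case-insensitively (DiskRunChkdsk vs DiskRunChkDsk).
    match = re.search(r"DiskRunChkdsk\s+(\d+)", listing, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    print(parse_disk_run_chkdsk(SAMPLE_OUTPUT))
```

A result of `None` means the property row was not found in the text, which is a cue to inspect the raw output by hand rather than assume the setting was applied.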
Step 6: Actively start troubleshooting what could cause the volume to end up flagged dirty and needing CHKDSK.
Footnotes:
Note 1: It’s not suggested to run with volumes this large. In my experience, once they exceed 2 TB in size, they rapidly become an administrative liability, especially in a situation where CHKDSK has to run against the volume. We strongly suggest that mount points be used to carve up larger volumes like this into more administratively friendly chunks. CHKDSK runs against mount points just fine, too.
Note 2: While it’s not suggested to interrupt CHKDSK while it’s running, an admin is not locked into having to let CHKDSK finish once it starts. The process can be terminated if absolutely required.
However, we cannot guarantee that the end result will be positive. If the process is interrupted during the “magic moment” when CHKDSK is making changes, the results may be worse than the initial reason for the volume being flagged as corrupt.
Additional reading material related to the components and tools mentioned in this post:
300415 A Description of the Diskpart Command-Line Utility
http://support.microsoft.com/default.aspx?scid=kb;EN-US;300415
947021 How to configure volume mount points on a server cluster in Windows Server 2008
http://support.microsoft.com/default.aspx?scid=kb;EN-US;947021
2517696 The shared disk on Windows Server 2008 cluster fails to come online
http://support.microsoft.com/default.aspx?scid=kb;en-US;2517696
FSUTIL utility; marking a volume dirty for testing
http://technet.microsoft.com/en-us/library/bb490641.aspx
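For testing the behavior on a non-production volume, the FSUTIL commands referenced above can be wrapped as in this minimal sketch. The helper names and the drive letter `T:` are hypothetical, and the parser assumes the English output strings "is Dirty" and "is NOT Dirty". Never set the dirty bit on a volume you cannot afford to have checked.

```python
def dirty_query_command(volume):
    """Build argv for `fsutil dirty query <volume>`."""
    return ["fsutil", "dirty", "query", volume]


def dirty_set_command(volume):
    """Build argv for `fsutil dirty set <volume>` (test volumes only)."""
    return ["fsutil", "dirty", "set", volume]


def is_dirty(fsutil_output):
    """Interpret the text printed by `fsutil dirty query`."""
    text = fsutil_output.lower()
    # Check the negative phrase first, since "is dirty" alone would not match it
    # but the ordering keeps the intent explicit.
    if "is not dirty" in text:
        return False
    if "is dirty" in text:
        return True
    raise ValueError("unrecognized fsutil output: %r" % fsutil_output)


if __name__ == "__main__":
    # T: is a hypothetical test volume.
    print(" ".join(dirty_query_command("T:")))
    print(is_dirty("Volume - T: is NOT Dirty"))
```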
In summary: try to keep the size of your production volumes under control, be aware that command-line switches may not persist through all versions of a product, and continue being successful with Windows Server 2008!
I hope this post has been helpful!
Sean Dwyer
Support Escalation Engineer
Windows CORE Team