Wednesday, July 11, 2007

Lesson Learned - Stopping Exchange services

I inadvertently corrupted a couple of mailbox stores (like anyone actually causes corruption intentionally) because of a lack of understanding. I hope this story will save someone else some pain.

The story started when an Exchange server's CPU had been running at 100% for a long time, causing my monitor to alert us. After some initial troubleshooting, we decided to restart the Exchange services. The restart failed, and the CPU continued to run at 100%.

We waited an hour, and the status had not changed. Figuring a reboot would resolve any ailments, we did just that. The server took a long time, but finally rebooted after 15 minutes. When the server restarted, some of the mailbox stores did not mount. After spending some time to try to fix things, we put in a call to Microsoft PSS.

The bad news

We discovered that our actions corrupted several mailbox stores (this was on an Enterprise Edition server). In talking to Microsoft, we learned that a shutdown or restart of the operating system does not necessarily wait for all services to stop. Apparently the Information Store service had not stopped completely when, after a timed delay, Windows shut itself down anyway. We were told that the corruption happened because the store was actively being written to when the service was cut off.

Lesson learned

With a new understanding of the Information Store service, whenever maintenance is performed on our Exchange servers, we always dismount all of the mailbox stores first. This ensures that all "in flight" transactions are complete before the service is stopped.
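
For anyone who wants to script that ordering, here is a minimal sketch in Python of the sequence described above: ask every store to dismount, wait until none of them reports as mounted, and only then stop the services. The function name safe_shutdown and the callables dismount, is_mounted and stop_services are hypothetical placeholders, not real Exchange APIs; you would have to supply your own implementations for your environment. The timeout and polling interval are also assumptions, not recommended values.

```python
import time
from typing import Callable, Iterable


def safe_shutdown(
    stores: Iterable[str],
    dismount: Callable[[str], None],      # hypothetical: requests a dismount
    is_mounted: Callable[[str], bool],    # hypothetical: reports mount state
    stop_services: Callable[[], None],    # hypothetical: stops the services
    poll_seconds: float = 5.0,            # assumed polling interval
    timeout_seconds: float = 3600.0,      # assumed upper bound on the wait
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Dismount every store, wait for all dismounts, then stop services.

    Returns True if the services were stopped, or False if the stores
    did not finish dismounting in time (the services are left alone).
    """
    stores = list(stores)

    # Step 1: ask every store to dismount before touching any service.
    for store in stores:
        dismount(store)

    # Step 2: wait until no store reports itself as mounted.
    deadline = clock() + timeout_seconds
    pending = set(stores)
    while pending:
        pending = {s for s in pending if is_mounted(s)}
        if not pending:
            break
        if clock() >= deadline:
            # Give up without stopping the services, so a store that is
            # still flushing transactions is not interrupted.
            return False
        sleep(poll_seconds)

    # Step 3: only now is it reasonable to stop the services.
    stop_services()
    return True
```

The sketch deliberately refuses to stop the services when the timeout expires. That mirrors the lesson here: a slow dismount is a reason to keep waiting or investigate, not a reason to force the service down.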

4 comments:

Andy Grogan said...

The problem is, how do you know whether or not the store is being written to? I have seen a similar problem before: CPU at 100%, nothing much responding in terms of Exchange, so you naturally try to stop the service. Then you get the dreaded "The Microsoft Information Store Service did not respond to the stop request in a timely fashion" and it hangs on stopping. This then raises the question - how long do you wait - an hour, two hours - a day?
I have in the past opened up Task Manager and added the I/O and Bytes counters to see if the store process is writing, but it's not foolproof. Personally I would love Microsoft to put some options into ending processes like there are in Unix, where you have multiple levels of killing a process - sorry for the ramble - I just don't think there was much you could have done. You were in a hole that most of us Exchange folks face at some point - do I pull the plug or don't I?

Dean T. Uemura said...

Those are exactly the same questions and issues that haunted me. That's why I've adopted this new strategy which seems to work. Dismount the stores first - don't try to do anything with the services. In fact, don't do anything with the services until the stores dismount.

In an emergency, I'd pull the plug on new messages and cause bounces rather than risk corruption. That will allow the stores to eventually catch up and quiet down. Then dismount the stores, then finally stop the services.

Wasim said...

Hello,
I am a user of msexchange.org and found your blog from one of your posts.
In this case, when we dismount the stores, are all the transactions that are taking place written to the database? And will the store not dismount until this is done?

Regards,
Wasim.

Dean T. Uemura said...

Wasim,

Sorry about the tardiness of the response. As you can tell by the infrequency of my posts, I don't spend a lot of time on my blog.

In response to your question, when you dismount a store it does complete all the transactions in progress. That's why you'll notice that sometimes it dismounts quickly and sometimes it can take a while.