Andy Grogan left an excellent comment on my previous post, and I replied. I don't know how many people actually read the comments, and the points made are important enough that I wanted them to be more than a footnote. I left Andy's comments verbatim, and added a couple of notes to mine.
Andy wrote:
"The problem is, how do you know when the store is being written to? or not, I have seen a similar problem before CPU 100 % nothing much responding in terms of Exchange so you naturally try to stop the service. Then you get the dreaded "The Microsoft Information Store Service did not respond to the stop request in a timely fashion" and hang on stopping. This then raises - how long do you wait - and hour, two hours - a day?"
"I have in the past opened up Task Manager and added in the IO, and Bytes counters to see if the store process is writing, but its not fool proof. Personally I would love Microsoft to put some options into ending processes like there are in Unix where you have multiple levels to killing a process - sorry for the ramble - I just don't think there was much you could have done your were in a hole that most of us Exchange folks face at some point - do I pull the plug or don't I?"
The issues Andy raises in his comment are exactly the same questions and issues that haunted me. That's why I've adopted this new strategy which seems to work (at least it alleviates my fears). Dismount the stores first - don't try to do anything with the services. In fact, don't do anything with the services until the stores dismount.
In an emergency, I'd pull the plug on new messages (e.g. block port 25) and cause bounces rather than risk corruption. That will allow the stores to eventually catch up and quiet down. Then dismount the stores, then finally stop the services.
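One rough way to judge whether the stores have "quieted down" is to sample the store process's cumulative write-bytes counter (the same kind of counter Andy describes adding in Task Manager) and check that it has stopped growing. The sketch below only does the comparison; collecting the samples is left out, and the function name, the threshold and the number of intervals are my own assumptions for illustration. As Andy points out, this kind of check is not foolproof.

```python
from typing import Sequence


def store_looks_quiet(
    write_bytes_samples: Sequence[int],
    max_delta: int = 0,
    quiet_intervals: int = 3,
) -> bool:
    """Guess whether a process has stopped writing.

    write_bytes_samples: cumulative "bytes written" readings, oldest first,
        taken at regular intervals (hypothetical input, gathered elsewhere).
    max_delta: largest growth between two readings still treated as idle.
    quiet_intervals: how many consecutive idle intervals are required.
    """
    # Not enough readings to cover the required number of intervals.
    if len(write_bytes_samples) < quiet_intervals + 1:
        return False

    # Look only at the most recent readings.
    recent = write_bytes_samples[-(quiet_intervals + 1):]
    deltas = [later - earlier for earlier, later in zip(recent, recent[1:])]

    # Every recent interval must show (almost) no new writes.
    return all(0 <= delta <= max_delta for delta in deltas)
```

A result of True only means no writes were seen during the sampled window. I would still dismount the stores before touching the services.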
Friday, July 13, 2007
Wednesday, July 11, 2007
Lesson Learned - Stopping Exchange services
I inadvertently corrupted a couple of mailbox stores (like anyone actually causes corruption intentionally) because of a lack of understanding. I hope this story will save someone else some pain.
The story started when an Exchange server's CPU was running at 100% for a long time, causing my monitor to alert us. After some initial troubleshooting, we decided to restart the Exchange services. The restart failed, and the CPU continued to run at 100%.
We waited an hour, and the status had not changed. Figuring a reboot would resolve any ailments, we did just that. The server took a long time, but finally rebooted after 15 minutes. When the server restarted, some of the mailbox stores did not mount. After spending some time trying to fix things, we put in a call to Microsoft PSS.
The bad news
We discovered that our actions corrupted several mailbox stores (this was on an Enterprise Edition server). In talking to Microsoft, we discovered that a shutdown or restart of the operating system does not necessarily wait for all services to stop. The Information Store service apparently did not stop completely, and after a timed delay, Windows shut itself down anyway. We were told that the corruption happened because the store was actively being written to when the service stopped.
Lesson learned
With a new understanding of the Information Store service, whenever maintenance is performed on our Exchange servers, we always dismount all of the mailbox stores first. This ensures that all "in flight" transactions are complete before the service is stopped.
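If you script your maintenance, the same rule can be written as a guard: dismount every store, confirm that none is still mounted, and only then stop the service. This is a minimal sketch of that ordering, not a working Exchange tool. The dismount, is_mounted and stop_service callables are hypothetical placeholders for whatever your environment actually uses to do those jobs.

```python
from typing import Callable, Iterable


class StoresStillMounted(RuntimeError):
    """Raised when a store is still mounted, so the service must not be stopped."""


def stop_after_dismount(
    stores: Iterable[str],
    dismount: Callable[[str], None],     # hypothetical: asks one store to dismount
    is_mounted: Callable[[str], bool],   # hypothetical: reports a store's mount state
    stop_service: Callable[[], None],    # hypothetical: stops the Information Store service
) -> None:
    names = list(stores)

    # Step 1: ask every store to dismount before touching any service.
    for name in names:
        dismount(name)

    # Step 2: verify the result; refuse to continue if anything is still mounted.
    still_mounted = [name for name in names if is_mounted(name)]
    if still_mounted:
        raise StoresStillMounted(
            "not stopping the service; still mounted: " + ", ".join(still_mounted)
        )

    # Step 3: only now is it reasonable to stop the service.
    stop_service()
```

The point of raising an error instead of pressing on is that a failed dismount should stop the maintenance and get a human involved, rather than fall through to a service stop or a reboot.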