Saturday, July 14, 2007

Lesson learned - Recovering from corruption

As if the previous lesson learned wasn't enough, it led to another painful lesson.

Our Microsoft PSS engineer had us take a backup of the current mailbox stores that wouldn't mount along with the transactional logs so that we'd have something to fall back on should things not go well. Our mailbox stores are large enough that the process would take several hours. Rather than have new mail bounce he had us create empty databases by configuring the file names different from the original files. This gave the new messages a place to go.

After the backup completed, we restored one mailbox store files back to the previous day (which took a couple of hours). We then went through several rounds of replaying the logs until we figured out where the corruption started. When that process finally completed, we had a restored mailbox store as good as we could get it, plus a newly created mailbox store with a couple of days worth of messages. The final step is to merge the two. This is done by making one of the stores the Recovery Storage Group and merging the new data with the old.

I'll interrupt the story here to say that because of the length of time involved with the process, we ended up doing this with two different groups of people. The first group completed the merge and all appeared well. The second group went through the same process with another mailbox store with a different PSS engineer.

The bad news

When the second group got to the merge, it was taking a very long time. Much longer than it took the first group. We could do nothing but watch the merge wizard slowly process each mailbox. Our mailbox store was large enough that the entire process took over a day to complete. Our users were patient and understanding, but at the same time displeased.

Lesson learned

Some of you may have recognized that what we were doing was performing a Dial-tone Restore. Henrik Walther wrote an excellent set of articles on this subject at MSExchange.org. The first group did exactly what Henrik described, and the second group had left out one important step which would have saved us many hours of frustration. Before performing the merge, you need to swap the mailbox stores so that you are merging the small into the large. The mailbox store was somewhere between 75GB and 100GB and therefore took a very long time!

I strongly urge everyone to read through Henrik's articles to familiarize yourself with the process. You never know when you'll need it!

Friday, July 13, 2007

Addendum - Stopping Exchange services

Andy Grogan wrote an excellent commentary to my previous post, and I wrote in return. I don't know how many people would actually read the comments, and the points made are important enough that I wanted them to be more than a footnote. I left Andy's comments verbatim, and added a couple of notes to mine.

Andy wrote:

"The problem is, how do you know when the store is being written to? or not, I have seen a similar problem before CPU 100 % nothing much responding in terms of Exchange so you naturally try to stop the service. Then you get the dreaded "The Microsoft Information Store Service did not respond to the stop request in a timely fashion" and hang on stopping. This then raises - how long do you wait - and hour, two hours - a day?"

"I have in the past opened up Task Manager and added in the IO, and Bytes counters to see if the store process is writing, but its not fool proof. Personally I would love Microsoft to put some options into ending processes like there are in Unix where you have multiple levels to killing a process - sorry for the ramble - I just don't think there was much you could have done your were in a hole that most of us Exchange folks face at some point - do I pull the plug or don't I?"

The issues Andy raises in his comment are exactly the same questions and issues that haunted me. That's why I've adopted this new strategy which seems to work (at least it alleviates my fears). Dismount the stores first - don't try to do anything with the services. In fact, don't do anything with the services until the stores dismount.

In an emergency, I'd pull the plug on new messages (e.g. block port 25) and cause bounces rather than risk corruption. That will allow the stores to eventually catch up and quiet down. Then dismount the stores, then finally stop the services.

Wednesday, July 11, 2007

Lesson Learned - Stopping Exchange services

I inadvertently corrupted a couple of mailbox stores (like anyone actually causes corruption intentionally) because of a lack of understanding. I hope this story will save someone else some pain.

The story started when an Exchange server's CPU was running at 100% for a long time, causing my monitor to alert us. After some intial troubleshooting, it was decided to restart the Exchange services. The restart process failed, the CPU continued to run at 100%.

We waited an hour, and the status had not changed. Figuring a reboot would resolve any ailments, we did just that. The server took a long time, but finally rebooted after 15 minutes. When the server restarted, some of the mailbox stores did not mount. After spending some time to try to fix things, we put in a call to Microsoft PSS.

The bad news

We discovered that our actions corrupted several mailbox stores (this was on an Enterprise Edition server). In talking to Microsoft, we discovered that a shutdown or restart of the Operating System does not necessarily wait for all services to stop. The Information Store service apparently did not stop completely and after a timed delay, Windows shut itself down. We were told that the corruption happened because the store was actively being written to when the service stopped.

Lesson learned

With a new understanding of the Information Store service, whenever maintenance is performed on our Exchange servers, we always dismount all of the mailbox stores first. This assures that all "in flight" transactions are complete before the service is stopped.

Monday, July 9, 2007

Microsoft MVP!

I got to spend a little time away from everything Exchange related (a.k.a. vacation) and returned to find a nice surprise amongst my Email. I have been named a Microsoft MVP! I will be spending the next few days seriously studying my privileges and responsibilities. Thanks to James Chong and everyone else who was involved in my obtaining this honor. I will do everything in my power to prove myself worthy of your support!