UC Berkeley CalMail Substantially Impacted During 36 Hours
After failed disk and insufficient capacity
By Jean Jacques Maleval | December 5, 2011 at 3:26 pm
This text has been written by Shel Waggener, Associate Vice Chancellor and CIO, UC Berkeley, on November 29, 2011:
I very much appreciate that it is unacceptable to have a core campus system like CalMail offline for any length of time. Today’s continued problems with CalMail have been particularly difficult for all involved. We work hard to design and operate systems that can handle the needs of our community and when we fail to meet that standard we bring in outside experts to help us improve. We have added additional outside experts from other campuses, vendors and new team members from the Berkeley technical community as we work through this crisis. This is our highest priority and will remain so until we have the environment fully stabilized. The following message provides current information. You can also check for the latest status.
Current Status Summary
- The CalMail system is available only through web clients at http://calmail.berkeley.edu. All messages are being sent and received but can only be accessed via webmail. Webmail sessions may be slower than normal due to volume.
- Students are strongly encouraged to forward their email to alternate email accounts. Instructions for doing so can be found on the CalMail site.
Details
CalMail has been substantially impacted during the last 36 hours, following the successful recovery from database corruption this weekend. The load on the system has remained extremely high as the millions of backlogged messages are delivered. The load situation worsened considerably Monday morning as tens of thousands of campus community members returned from the holidays and connected for the first time, pushing the load above operating limits. Normal processes that usually run in the background unnoticed, including copying data off a failed hard drive, effectively shut down the system for many people. Attempts to keep email moving during the day Monday were minimally successful. Monday night, during off hours, work was undertaken to accelerate the repair of the failed disk and to prepare for anticipated continued high load through this week. On Tuesday morning, the load exceeded even the unusually high levels experienced on Monday and ultimately caused the entire system to become unusable.
Unfortunately, the root cause of this problem, insufficient capacity on the legacy CalMail environment, cannot be resolved safely without the new storage array, which is not expected to be available until the weekend in spite of overnight delivery of key components and around-the-clock work by staff and vendors.
We recognize how critical email is, this week in particular, to our ability to perform our work and have taken the following immediate actions to assist in lowering the load to allow email to continue to flow:
- While faculty and staff email must remain on a university-provisioned email service, students with external email accounts are encouraged to forward their CalMail messages there. Doing so will lower both email volume and login attempts and thereby reduce load on the CalMail system.
- Access to email from anything but the CalMail web clients has been disabled. This dramatically reduces the number of simultaneous connections from cell phones, iPads, and clients such as Outlook and MacMail, which are often configured to maintain persistent connections and in doing so place extremely heavy load on the CalMail platform.
- Some users are being moved to other campus email services temporarily to further reduce load.
These are drastic actions, and none was undertaken lightly; each followed extensive consultation with technical experts from campus central and departmental staff as well as vendors we have enlisted to work on the problem. Once the new storage system is installed, configured and tested, we will begin the migration process from the legacy storage system to one more than double in size and expected to handle at least several times our current load. That process itself will put substantial load on the legacy storage environment as data is copied, so we are planning to do this work during off hours.