Message in a bottleneck

Software bugs revealed during upgrades this summer caused unusual interruptions in e-mail service

by TALLEY HENNING BROWN

The two Sun Microsystems servers that process and deliver the university’s e-mail look and perform a lot like any other desktop computer: a plastic box, a few fans, a processor, some memory. But they have an enormous workload: Together they handle more than 400,000 new e-mail messages every day and store about 500 gigabytes of e-mail. The numbers are average for an operation this size, but they are 25 times higher than the figures from just six years ago.

For those six years, the system handled the steep increase nearly flawlessly. With routine software and hardware upgrades, these two boxes have been running more or less nonstop since 2001. When one needed to be taken offline for maintenance, the other automatically absorbed its workload. An additional backup in the university’s disaster recovery site provided another layer of redundancy.

Then, this summer, the streak came to an end.

The series of campus-wide interruptions in e-mail service, which started June 17, began during a scheduled overhaul, when Information Technology personnel replaced those servers and upgraded two software applications that make up the systems that receive, store and deliver Rockefeller’s e-mail. Despite extensive predeployment testing, bugs in the updated software caused first severe slowdowns in system response time and later intermittent full system freezes.

What followed was an intensive investigation by IT, as well as the vendors that provide the software, to pinpoint and patch the flawed code. The work took several weeks, but the system has been running without any unscheduled interruptions since August 16. “I am highly confident that the software is now back to the point where it is every bit as stable as it has been in the past, and in addition the hardware will have the capacity to allow for inevitable growth in our e-mail communications,” says Associate Vice President for IT Gerald Latter.

Rockefeller University had a record of remarkably few service outages before this summer. A steady rise in e-mail volume each year, however, necessitated upgrades to hardware with more horsepower and software that can parse higher loads. Though the number of Rockefeller e-mail accounts has increased only 33 percent over the last six years, from 1,800 users in 2001 to 2,400 now, the volume of mail has grown far faster, and spam now accounts for about 70 percent of weekday incoming e-mails. Two years ago, that figure was only 30 percent. In 2001, it was zero.

Rockefeller’s e-mail system is run on three components: Solaris computers from Sun Microsystems, Veritas storage management software from Symantec and Sendmail, an e-mail management system. All three companies have above-average track records with reliability, and following preinstallation testing, consultants from the companies predicted a smooth transition. “Predeployment testing is slightly less effectual than it might sound, however, because the testing environment cannot efficiently simulate actual maximum-load situations,” says Armand Gazes, director of IT operations and network security. “Upgrades add a higher level of complexity to the system, and with that comes not only greater power but greater chance for problems.”

With each stage of the upgrade, a new bug appeared; bugs from each software upgrade interacted with one another, magnifying their individual effects and creating an avalanche of glitches.

The first service slowdowns, which for many users caused delays long enough to effectively prevent the delivery of e-mail, occurred during the first stage of the overhaul, when the Veritas upgrade was applied to the new Solaris computers with the old Sendmail application still in place. Using a process of elimination to locate the problem, IT went back to the old hardware (with the new Veritas software). That didn’t work, and neither did a subsequent software patch from Veritas, but Sendmail’s consultants were confident that the problem would disappear with the completion of the Sendmail upgrade, the last step in the overhaul.

But when the upgrade was finished, on the new hardware with all new software, the situation instead worsened, causing the system to occasionally freeze entirely. All three vendors then began to investigate more closely, comparing snapshots of data collected from each of their systems over the same time periods to see how the systems were interacting. “That kind of analysis is extraordinarily complex, as the number of factors to be cross-checked is simply enormous,” says Mr. Gazes. The problem eluded them for weeks, during which time IT staff was working at full capacity, even enlisting personnel who were on vacation to monitor the system remotely and help fix outages as they happened, so that downtime was kept to a minimum.

Then, on July 18, a bug in Rockefeller’s antispam program, Cloudmark, caused Sendmail to mark a portion of the e-mail coming in from outside the university as spam. IT handled damage control and was able to recover the names and time stamps of lost e-mails for everyone whose e-mail clients were set to automatically delete spam. “That problem is a rare one, and added insult to injury, but it does illustrate why we suggest that people quarantine their spam instead of having it automatically deleted,” says Mr. Latter. The problem turned out to be unrelated to the upgrade, but it added another layer of confusion for many users.

By the end of July, things began looking up. Sendmail and Veritas both discovered additional bugs in their software and worked to provide patches to address them. The patches worked, and the final repairs were completed in the early morning hours of August 16. There has been no unplanned downtime since.

“Unfortunately, you never know exactly how new software will interact with the full production environment you’ve got until you put it all together and run under load. We experienced an unusual amount of bad luck during the course of what should have been a routine upgrade, but encountering bugs is itself not unusual, and the vendors were all very diligent in working to resolve them,” says Mr. Latter. “We’ve also gained some valuable lessons that will enable us to stress-test the system before future upgrades.”

Over the next several months, IT will proceed with existing plans to evaluate and test additional enhancements to help ensure the university’s e-mail systems remain stable. Under consideration: a migration to an open-source Linux operating system running on Intel-powered computers. “These weren’t products we would have felt entirely comfortable with earlier this year, but now, it looks very promising,” says Mr. Gazes. “This combination offers several advantages in terms of future upgradeability and improved support conditions, because we’ll be leveraging hardware and software that are both standard and cutting edge.”