It was August 14, 2003. A software bug triggered a blackout spanning the Northeast, the Midwest, and parts of Canada. Subways shut down. Hospital patients suffered in stifling heat. And police evacuated people trapped in elevators.
What should have been a manageable, local blackout cascaded into widespread distress on the electric grid. The lack of an alarm left operators unaware of the need to redistribute power after overloaded transmission lines hit unpruned foliage, which triggered a race condition in the control software.*
Ali Ebnenasir is working to prevent another Northeast Blackout. He’s creating and testing new design methods for more dependable software in the presence of unanticipated environmental and internal faults. “What software does or doesn’t do is critical,” Ebnenasir explains. “Think about medical devices controlled by software. Patient lives are at stake when there’s a software malfunction.”
How do you make distributed software more dependable? In the case of a single machine—like a smartphone—it’s easy. Just hit reset. But for a network, there is no centralized reset. “Our challenge is to design distributed software systems that automatically recover from unanticipated events,” Ebnenasir says.
The problem—and some solutions—has been around for nearly 40 years, but no uniform theory for designing self-stabilizing systems exists. “Now we’re equipping software engineers with tools and methods to design systems that autonomously recover.”
Ebnenasir’s work has been funded by the National Science Foundation.
*Source: Wikipedia