On Dec. 21, 2022, just as peak holiday season travel was getting underway, Southwest Airlines experienced a cascading series of scheduling failures, initially triggered by severe winter weather in the Denver area. The problems spread through their network, and over the course of the following 10 days the crisis ended up stranding over 2 million passengers and causing $750 million in losses for the airline.
How did a localized weather system end up triggering such a widespread failure? Researchers at MIT have examined this widely reported failure as an example of cases where systems that work smoothly most of the time suddenly break down and cause a domino effect of failures. They have now developed a computational system that uses the combination of sparse data about a rare failure event, together with much more extensive data on normal operations, to work backwards and try to pinpoint the root causes of the failure, and hopefully find ways to adjust the systems to prevent such failures in the future.
The findings were presented at the International Conference on Learning Representations (ICLR), which was held in Singapore from April 24-28, by MIT doctoral student Charles Dawson, professor of aeronautics and astronautics Chuchu Fan, and colleagues from Harvard University and the University of Michigan.
“The motivation behind this work is that it’s really frustrating when we have to interact with these complicated systems, where it’s really hard to understand what’s going on behind the scenes that’s creating these issues or failures that we’re observing,” says Dawson.
The new work builds on previous research from Fan’s lab, where they looked at hypothetical failure-prediction problems, she says, such as with groups of robots working together on a task, or complex systems such as the power grid, looking for ways to predict how such systems may fail. “The goal of this project,” Fan says, “was really to turn that into a diagnostic tool that we could use on real-world systems.”
The idea was to provide a way that somebody could “give us data from a time when this real-world system had an issue or a failure,” Dawson says, “and we can try to diagnose the root causes, and provide a little bit of a look behind the curtain at this complexity.”
The intent is for the methods they developed “to work for a pretty general class of cyber-physical problems,” he says. These are problems in which “you have an automated decision-making component interacting with the messiness of the real world,” he explains. Tools are available for testing software systems that operate on their own, but the complexity arises when that software has to interact with physical entities going about their activities in a real physical setting, whether it’s the scheduling of aircraft, the movements of autonomous vehicles, the interactions of a team of robots, or the control of the inputs and outputs on an electric grid. In such systems, what often happens, he says, is that “the software might make a decision that looks OK at first, but then it has all these domino, knock-on effects that make things messier and much more uncertain.”
One key difference, though, is that in systems like teams of robots, unlike the scheduling of airplanes, “we have access to a model in the robotics world,” says Fan, who is a principal investigator in MIT’s Laboratory for Information and Decision Systems (LIDS). “We do have some good understanding of the physics behind the robotics, and we do have ways of creating a model” that represents their activities with reasonable accuracy. But airline scheduling involves processes and systems that are proprietary business information, so the researchers had to find ways to infer what was behind the decisions, using only the relatively sparse publicly available information, which essentially consisted of just the actual arrival and departure times of each plane.
“We have grabbed all this flight data, but there is this whole scheduling system behind it, and we don’t know how the system is working,” Fan says. And the amount of data relating to the actual failure is just several days’ worth, compared to years of data on normal flight operations.
The impact of the weather events in Denver during the week of Southwest’s scheduling crisis clearly showed up in the flight data, just from the longer-than-normal turnaround times between landing and takeoff at the Denver airport. But the way that impact cascaded through the system was less obvious, and required more analysis. The key turned out to have to do with the concept of reserve aircraft.
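The article doesn’t reproduce the team’s analysis code, but the turnaround-time signal can be illustrated with a minimal pandas sketch. Everything below is a hypothetical stand-in: the flight records, column names, and the 45-minute baseline are invented for illustration, not drawn from the actual Southwest data.

```python
import pandas as pd

# Hypothetical flight records: one row per aircraft turnaround, with the
# arrival ("on-block") time and that same aircraft's next departure time.
flights = pd.DataFrame({
    "tail": ["N8301J", "N7825A", "N8301J"],
    "airport": ["DEN", "DEN", "LAS"],
    "arrival": pd.to_datetime([
        "2022-12-21 09:10", "2022-12-21 10:30", "2022-12-21 16:05"]),
    "next_departure": pd.to_datetime([
        "2022-12-21 12:40", "2022-12-21 15:45", "2022-12-21 17:00"]),
})

# Turnaround = time on the ground between landing and the next takeoff.
flights["turnaround_min"] = (
    flights["next_departure"] - flights["arrival"]
).dt.total_seconds() / 60

# Compare each airport's mean turnaround during the disruption with a
# made-up 45-minute historical norm; large positive values flag trouble.
BASELINE_MIN = 45
excess = flights.groupby("airport")["turnaround_min"].mean() - BASELINE_MIN
print(excess.sort_values(ascending=False))  # DEN stands out
```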
Airlines typically keep some planes in reserve at various airports, so that if problems are found with one plane that is scheduled for a flight, another plane can be quickly substituted. Southwest uses only a single type of plane, so they are all interchangeable, making such substitutions easier. But most airlines operate on a hub-and-spoke system, with a few designated hub airports where most of those reserve aircraft may be kept, whereas Southwest does not use hubs, so their reserve planes are more scattered throughout their network. And the way those planes were deployed turned out to play a major role in the unfolding crisis.
“The challenge is that there is no public data available in terms of where the aircraft are stationed throughout the Southwest network,” Dawson says. “What we’re able to find using our method is, from the public data on arrivals, departures, and delays, we can use our method to back out what the hidden parameters of those aircraft reserves might have been, to explain the observations that we were seeing.”
What they found was that the way the reserves were deployed was a “leading indicator” of the problems that cascaded into a nationwide crisis. Some parts of the network that were affected directly by the weather were able to recover quickly and get back on schedule. “But when we looked at other areas in the network, we saw that these reserves were just not available, and things just kept getting worse.”
For example, the data showed that Denver’s reserves were rapidly dwindling because of the weather delays, but then “it also allowed us to trace this failure from Denver to Las Vegas,” he says. While there was no severe weather there, “our method was still showing us a steady decline in the number of aircraft that were able to serve flights out of Las Vegas.”
He says that “what we found was that there are these circulations of aircraft within the Southwest network, where an aircraft might start the day in California and then fly to Denver, and then end the day in Las Vegas.” What happened in the case of this storm was that the cycle got interrupted. As a result, “this one storm in Denver breaks the cycle, and suddenly the reserves in Las Vegas, which is not affected by the weather, start to deteriorate.”
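That broken-cycle dynamic can be captured in a toy bucket model. This is not the paper’s network model, and every number below is invented; it simply shows how holding a couple of aircraft per day in Denver steadily drains a downstream station like Las Vegas even though the weather there is fine.

```python
# Toy circulation: aircraft rotate CA -> DEN -> LAS -> CA each day, but the
# storm holds some back in Denver. All quantities are invented for
# illustration; this is not Southwest's actual network or the paper's model.
reserves = {"CA": 10, "DEN": 4, "LAS": 6}
ROTATION = 3      # aircraft moving one step along the cycle each day
STORM_HOLD = 2    # aircraft the storm keeps grounded in Denver daily

for day in range(1, 6):
    ca_to_den = min(ROTATION, reserves["CA"])
    den_to_las = max(0, min(ROTATION, reserves["DEN"]) - STORM_HOLD)  # storm
    las_to_ca = min(ROTATION, reserves["LAS"])

    reserves["CA"] += las_to_ca - ca_to_den
    reserves["DEN"] += ca_to_den - den_to_las
    reserves["LAS"] += den_to_las - las_to_ca
    print(f"day {day}: {reserves}")
# Las Vegas drains because it sends more aircraft out than the storm lets
# through from Denver, while grounded aircraft pile up in Denver itself.
```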
In the end, Southwest was forced to take a drastic measure to resolve the problem: They had to do a “hard reset” of their entire system, canceling all flights and flying empty aircraft around the country to rebalance their reserves.
Working with experts in air transportation systems, the researchers developed a model of how the scheduling system is supposed to work. Then, “what our method does is, we’re essentially trying to run the model backwards.” Taking the observed outcomes, the model allows them to work back to see what kinds of initial conditions could have produced those outcomes.
While the data on the actual failures were sparse, the extensive data on typical operations helped in teaching the computational model “what is possible, what is feasible, what is the realm of physical possibility here,” Dawson says. “That gives us the domain knowledge to then say, in this extreme event, given the space of what’s possible, what is the most likely explanation” for the failure.
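The team’s actual tool, CalNF, is built on learned generative models, but the run-the-model-backwards idea can be sketched much more simply: fit a prior over a hidden parameter from normal-operations data, then optimize for the value that best explains the sparse failure observations. In the hypothetical sketch below, the forward model `predicted_delay`, the reserve and delay numbers, and the noise model are all invented stand-ins, not the paper’s formulation.

```python
import numpy as np
from scipy import optimize, stats

# A deliberately simplified stand-in for the scheduling model: given a
# hidden reserve level r at an airport, predict the mean departure delay
# in minutes. (The real model is a full network model; this toy just says
# delays blow up as reserves run out.)
def predicted_delay(r):
    return 200.0 / (r + 1.0)

# Domain knowledge from years of normal operations: fit a prior over the
# hidden parameter. These normal-day reserve levels are invented.
normal_day_reserves = np.array([5.0, 6.0, 4.0, 5.0, 7.0, 6.0, 5.0])
prior = stats.norm(normal_day_reserves.mean(), normal_day_reserves.std())

# Sparse evidence from the failure itself: a few observed mean delays.
observed_delays = np.array([95.0, 110.0, 120.0])   # hypothetical, minutes
noise = stats.norm(0.0, 15.0)                      # assumed observation noise

# "Run the model backwards": find the reserve level that best explains the
# observed delays while staying plausible under normal operations.
def neg_log_posterior(r):
    log_likelihood = noise.logpdf(observed_delays - predicted_delay(r)).sum()
    return -(log_likelihood + prior.logpdf(r))

result = optimize.minimize_scalar(neg_log_posterior, bounds=(0.0, 20.0),
                                  method="bounded")
print(f"most plausible hidden reserve level: {result.x:.1f} aircraft")
```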
This could lead to a real-time monitoring system, he says, where data on normal operations are constantly compared with the current data to determine what the trend looks like. “Are we trending toward normal, or are we trending toward extreme events?” Seeing signs of impending issues could allow for preemptive measures, such as redeploying reserve aircraft in advance to areas of anticipated problems.
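As a rough illustration of what such a monitor might compute, the hypothetical sketch below compares the latest inferred reserve levels against a normal-operations baseline and flags when the numbers are both far below normal and trending downward. The baseline and recent values are invented, and the two-sigma threshold is an arbitrary choice, not anything from the paper.

```python
import numpy as np

# Hypothetical inferred reserve levels at one airport: a normal-operations
# baseline and the most recent week of live estimates (all invented).
baseline = np.array([6, 5, 7, 6, 5, 6, 7, 5, 6, 6])
recent = np.array([6, 5, 4, 3, 3, 2, 1])

mean, std = baseline.mean(), baseline.std()
z_latest = (recent[-1] - mean) / std                      # distance from normal
slope = np.polyfit(np.arange(len(recent)), recent, 1)[0]  # trend per day

# Arbitrary two-sigma threshold: flag when reserves are both abnormally
# low and still falling, suggesting preemptive repositioning.
if z_latest < -2 and slope < 0:
    print(f"warning: reserves {z_latest:.1f} sigma below normal and "
          f"falling {abs(slope):.2f} aircraft/day")
```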
Work on developing such systems is ongoing in her lab, Fan says. In the meantime, they have produced an open-source tool for analyzing failure systems, called CalNF, which is available for anyone to use. Meanwhile, Dawson, who earned his doctorate last year, is working as a postdoc applying the methods developed in this work to understanding failures in power networks.
The research team also included Max Li from the University of Michigan and Van Tran from Harvard University. The work was supported by NASA, the Air Force Office of Scientific Research, and the MIT-DSTA program.