On the Saturday afternoon of July 4, 2015, NASA’s New Horizons Pluto mission leader Alan Stern was in his office near the project Mission Control Center, working, when his cell phone rang. He was aware of the Independence Day holiday but was much more focused on the fact that the date was “Pluto flyby minus 10 days.” New Horizons, the spacecraft mission that had been the central focus of his career for 14 years, was now just 10 days from its targeted encounter with the most distant planet ever explored.
Immersed in work that afternoon, Alan was busy preparing for the flyby. He was used to operating on little sleep during this final approach phase of the mission, but that day he’d gotten up in the middle of the night and gone into the Mission Operations Center (MOC) for the upload of the crucial, massive set of computer instructions to guide the spacecraft through its upcoming close flyby. That “command load” represented nearly a decade of work, and that morning it had been sent by radio transmission hurtling at the speed of light to reach New Horizons, then on its approach to Pluto.
Glancing at his ringing phone, Alan was surprised to see the caller was Glen Fountain, the longtime project manager of New Horizons. He felt a chill because he knew that Glen was taking time off for the holiday, at his nearby home, before the final, all-out intensity of the upcoming flyby. Why would Glen be calling now?
Alan picked up the phone. “Glen, what’s up?”
“We’ve lost contact with the spacecraft.”
Alan replied, “I’ll meet you in the MOC; see you in five minutes.”
Alan hung up his phone and sat down at his desk for a few seconds, stunned, shaking his head in disbelief. Unintentional loss of contact with Earth should never happen to any spacecraft. It had never before happened to New Horizons over the entire nine-year flight from Earth to Pluto. How could this be happening now, just 10 days out from Pluto?
Throughout the nine long years of travel toward the ninth planet, the radio link to New Horizons was the lifeline that allowed its team to contact and control the craft and to receive spacecraft status and data from its observations. As New Horizons kept going farther to the outer reaches of the solar system, the time delays to communicate with it increased, and the link had lengthened to what was now a nine-hour round trip for radio signals, traveling at the speed of light.
To stay in touch, New Horizons depends, as do all long-distance spacecraft, on a largely unknown and unsung marvel of planetary exploration: NASA’s Deep Space Network. This trio of giant radio-dish complexes in Goldstone, California; Madrid, Spain; and Canberra, Australia, seamlessly hands off communication duties between one another as the Earth rotates on its axis every 24 hours. The three stations are spread around the world so that no matter where an object is in deep space, at any time at least one of the antenna complexes can point in its direction.
But now … the DSN had lost contact with one of its most precious assets, New Horizons.
If this were an orbiter, or a rover safely on an alien surface, the team could take its time to analyze the problem, make recommendations, try different courses of action. But New Horizons was a flyby mission. The spacecraft was rushing toward Pluto at over 750,000 miles per day—more than 31,000 miles per hour. Back to working order or not, it would fly by the planet on July 14, never to return. There was no stopping New Horizons as they sorted the problem out. There was only one shot at getting the goods at Pluto—New Horizons had no backup, no second chance, no way to delay its date with Pluto.
Earlier that Fourth of July afternoon, Alice Bowman, the mission’s coolheaded and enormously competent 14-year veteran Mission Operations Manager (hence her nickname: “MOM”), was in the MOC with a handful of other mission operations personnel, waiting to see the report come back from New Horizons indicating it had received and stored the Core load. This was the long command script that New Horizons would follow to execute its many hundreds of scientific observations during the nine days surrounding closest approach and flyby. Aptly named, this exhaustively tested command script would literally perform the core of the mission, and its faithful transmission and execution would direct New Horizons through every twist and turn, every computer memory assignment, every communication with Earth, every camera shot, and so forth.
At about 1 p.m., and right on time, the first signals started coming back confirming the reception of the command script. Alice:
Everything was going fine until we hit about 1:55 in the afternoon. Suddenly, we lost all communication with the spacecraft. Dead silence. Nothing. We’d lost comm. And it didn’t come back.
Nine times out of ten, when we lose signal, it’s a problem with the ground station: Something’s out of configuration, or whatever. Because this upload was so important, we had our network operations engineers online. We call them NOPEs: That’s their acronym. We also had our Pluto Aces—which are the controllers there in our ops center. So we had the Pluto Aces ask the NOPEs at the ground station in Australia to check their system configuration. All those checks came back that everything was nominal with the ground system.
That meant that the problem was not down here on Earth—not in Maryland where Alice and her team of Pluto Aces were gathered, nor in Australia where the NOPEs were at the Canberra station of the Deep Space Network receiving the signal from New Horizons. The loss of signal was due to a problem with the spacecraft itself.
Loss of communications is about as bad a thing as a mission control team can experience—it means the link to Earth is broken. But that’s not the worst of it. It could mean the spacecraft had suffered a catastrophic failure. Alice felt a ripple of unfamiliar fear:
You know that feeling in the pit of your stomach when something is occurring, and you can’t believe it’s happening? We’d come nine and a half years on this journey, and I couldn’t believe this—we’d never lost communications. You allow yourself that 5, 10 seconds of feeling that fear and disbelief, but then everything we trained for started to kick in.
New Horizons was still millions of miles from Pluto, and any hazards it posed. The chances of striking anything there in interplanetary space were absurdly low. But, nonetheless, everyone on the team had the passing nightmare thought: Could we have just hit something?
Because the team had telemetry from the spacecraft before it went silent, Chris Hersman (who led the spacecraft engineering team) and his engineers, already arriving as well, had some clues to work with. Something key they discovered very quickly was that just before the spacecraft’s signal stopped, the main computer had been doing two things at once, both of which were computationally demanding. One of these tasks was compressing 63 Pluto images taken previously, in order to free up memory space for the close flyby imaging soon to begin. At the same time, the computer was also receiving the Core load from Earth and storing it in its memory. Could the computer have become overloaded by this intense combination of computational tasks, and as a result rebooted?
This was Brian Bauer’s theory. Brian was then the mission’s autonomy system engineer, who had coded the recovery procedure that the spacecraft would automatically go through in just this situation. Brian told Alice, “If that is what happened, then the spacecraft will restart using the backup computer, and 60 to 90 minutes from now we’ll get a radio signal with New Horizons operating on the backup computer.”
There is a phrase from World War I describing warfare as “months of boredom punctuated by moments of terror.” The same applies to long spacecraft missions. And it was a long and frankly terrifying hour as they awaited the hoped-for signal to return from New Horizons. The engineers, the Aces, along with Alice, Glen, and Alan waited out those long minutes, making contingency plans in case Brian’s hypothesis was incorrect. But sure enough, after 90 minutes, a signal arrived from New Horizons indicating it had switched to the backup computer.
Communication had been restored, and with that, the fear of a catastrophic loss of the spacecraft evaporated. But the crisis wasn’t over; it had just entered a new phase.
The MOC and its surrounding offices were quickly filling up with engineers, more flight-control team members, and others on the project who had cut short their holiday weekend to come in and assist. People were arriving in shorts and flip-flops, in their picnic clothes, having dropped everything to get to the MOC.
As more telemetry came back from the bird, they learned that all of the command files for the flyby that had been uploaded to the main computer had been erased when the spacecraft rebooted to the backup computer. This meant that the Core flyby sequence sent that morning would have to be reloaded. But worse, numerous supporting files needed to run the Core sequence, some of which had been loaded as far back as December, would also need to be sent again. Alice recalls, “We had never recovered from this kind of anomaly before. The question was, could we do it in time to start the flyby sequence, scheduled to begin on July 7?”
That meant the team had just three days to put Humpty Dumpty back together again, from 3 billion miles away. If they couldn’t, then with every passing day they would lose dozens of unique, close-up Pluto system observations that were part of the exquisitely constructed Core load flyby plans. The mission team suddenly found itself in a three-day race to salvage everything they had spent years planning and months uploading.
The New Horizons process to get back on track after any spacecraft anomaly is shaped around a series of formal meetings called ARBs, or Anomaly Review Boards. Soon after 4 p.m., only 45 minutes after spacecraft re-contact, the July 4 anomaly’s first ARB was convened in the meeting room adjacent to the MOC.
At that kickoff ARB meeting the team members had to assess what had happened, how to restore the flyby plan, and how to make sure they wouldn’t accidentally do anything during the recovery that would cause another problem on the spacecraft. The scope of how far they had been set back by the reboot onto the backup computer was stupefying. It was quickly estimated that they would have to perform the equivalent of several weeks of work in just three days to start the flyby Core sequence on time on July 7. And it would all have to be done flawlessly.
What made this even worse was that every move had to be done by remote control with a nine-hour round-trip radio communication time between mission control and the spacecraft. Science classes teach how the speed of light is incredibly fast, how a signal moving at that speed can travel around the world in an eighth of a second or to the moon and back, a half-million-mile trip, in just two and a half seconds. But for the New Horizons team trying to get their spacecraft back on track as it closed in on Pluto, the great distance between Earth and New Horizons made the speed of light seem excruciatingly slow.
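For readers who want to check the light-time figures quoted above, here is a minimal sketch of the arithmetic. It assumes the rounded distances given in the text (3 billion miles to the spacecraft, a half-million miles to the moon and back) and an approximate speed of light of 186,282 miles per second. The function and constant names are illustrative only and are not part of any mission software.

```python
# Rough check of the light-travel times quoted in the text.
# All distances are the rounded figures from the narrative, not mission data.

SPEED_OF_LIGHT_MILES_PER_S = 186_282  # approximate speed of light in vacuum

EARTH_TO_SPACECRAFT_MILES = 3e9   # "3 billion miles away" (rounded)
MOON_ROUND_TRIP_MILES = 500_000   # "a half-million-mile trip" (rounded)


def light_time_seconds(distance_miles: float) -> float:
    """Time in seconds for a radio signal to cover the given distance."""
    return distance_miles / SPEED_OF_LIGHT_MILES_PER_S


def round_trip_hours(one_way_miles: float) -> float:
    """Round-trip signal time in hours for a given one-way distance."""
    return 2 * light_time_seconds(one_way_miles) / 3600


if __name__ == "__main__":
    moon = light_time_seconds(MOON_ROUND_TRIP_MILES)
    pluto = round_trip_hours(EARTH_TO_SPACECRAFT_MILES)
    print(f"Moon and back: about {moon:.1f} seconds")
    print(f"Earth to New Horizons and back: about {pluto:.1f} hours")
```

With these rounded inputs the round trip to the spacecraft comes out just under nine hours, consistent with the figure the team was working around.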
Those assembling for the ARB knew that with all the press attention, the world would soon be aware that New Horizons had tripped over itself on the verge of its flyby. In just 10 days, the spacecraft would hurtle through the Pluto system, and nothing in celestial mechanics could stop that, but whether it would be gathering the data it had journeyed almost a decade to collect was something else.
Alan and Glen opened the meeting, telling the ARB that there was no finer spacecraft team they’d ever known than on New Horizons, and that if any team could pull off this recovery, it was the group in that room. Then Alice took the floor and began architecting how they would effect a recovery.
Alice immediately asked Alan about the science observations being lost that day and in the next three days before the close flyby sequence was to kick off on July 7. She wanted to know, from the mission leader, if her team should also attempt to recover those observations, in addition to reconfiguring the spacecraft and getting all the files and command load up to the bird for the close flyby. Alan:
I didn’t call for any discussion of it from the other science team members in the room. I didn’t even let my flyby-planning czar, Leslie, weigh in. I knew for a fact that Alice’s team needed crisp direction, with no fuzz on it, and that they needed to focus on saving the main event, rather than the preliminary observations we were losing with the spacecraft idled due to the reboot. I told Alice that anything beyond getting us back on track to initiate the close flyby itself, on time, would be a distraction.
With that, Alice had her marching orders. Her sole job now was to save the Core flyby sequence; everything else was expendable. But could it be done in time?
Alice and her team quickly but methodically devised a recovery plan. In the next three days, they had to design and build all the command procedures to get the spacecraft back onto its primary computer, then to resend all the lost command and support files for the Core load, and they would have to test all of this on the NHOPS (New Horizons Operations Simulator) before any actions were taken, to ensure that each step would work on the first try—there was no leeway for repeats. They knew when the flyby Core sequence needed to engage, which would be noontime on July 7. So Alice’s team took the total time available until then and divided it up into nine-hour round-trip light-times—the amount of time it would take to send each set of procedures to run on the spacecraft and receive confirmation that it had performed successfully. Counting everything else that had to be done on the ground, they found there was time for only three of these communications cycles before the Core load would need to engage mid-day on July 7.
Thus, the recovery would be split into three steps. First, the team would command the spacecraft to restore normal, rather than emergency, communications. That would up the communications bit rates by a factor of 100, making the rest of the recovery possible to do in time. That first step alone, they estimated, would take about half a day to code, test, send to New Horizons, and get confirmation back that it had succeeded. Tick, tock.
Next, the team would command the spacecraft to reboot onto its primary computer. This was needed in order to use the flyby command load as coded. A reboot from the backup to the prime computer had never been done in flight. So a procedure had to be designed and coded for that, and tested on NHOPS, and the test results then had to be checked before that procedure could be sent to New Horizons. Finally, the team would have to methodically restore all the Core flyby files and engage the flyby timeline. It was nearly midnight by the time this plan was architected, and there was no time to spare: The clock had already bled down over 10 hours since the loss of contact that afternoon. Tick, tock.
Alice’s mission operations team, working closely with Chris Hersman’s spacecraft systems team, wrote, tested, and then sent up the first set of commands about 12 hours after they had re-established spacecraft comm, at about 3:15 a.m. on July 5.
Nine hours later, midday on the fifth, the MOC received confirmation that normal communications had been restored! But a day had passed, and New Horizons had swept nearly another million miles toward its destiny at Pluto. Recovery step 1 was complete, but now only two days remained until the Core flyby sequence needed to engage. Tick, tock.
The New Horizons team organized their work, and their lives, for the next few days around the nine-hour communications cycles to the spacecraft and back. They ran on very little sleep and lots of adrenaline. They had worked together for over a decade and had encountered problems on the spacecraft before, but no problem of this scope or with such high stakes had ever occurred. It demanded a round-the-clock existence in mission control, and the team delivered.
Glen recalls, “The team just did what they needed to do. I started searching for places for people to sleep, trying to find something more comfortable than their office floors.” And Alice remembers, “We found cots, blankets, and pillows, and someone brought in an air mattress. There weren’t enough, so we were sharing.” Alan:
You should have seen it. Without a single complaint, people worked day and night—without so much as changes of clothes or places to properly sleep or shower, in some cases for four days straight. Some people were sleeping on desks. Some were living on just two or three hours of catnaps per day. There was no time for restaurant meals. We brought in people just to find takeout and keep the team fed.
In order to ensure that this and every step of the recovery was going to work as intended, it was essential that each of the recovery procedures be tested on NHOPS. Because NHOPS so faithfully simulated the spacecraft, command-load testing on it could be used to work out bugs and certify that the instructions that would be sent to New Horizons itself would be error free.
As it turned out, a decision made years earlier proved to be a lifesaver during the recovery. Alan had become so concerned that the team did not have a complete backup to NHOPS that a second one was built. Well, during the weekend of July 4, there simply was not enough time to test all the new command loads needed to recover using only a single NHOPS. So they doubled up, using that second NHOPS to fit in more test runs. Had there been no NHOPS-2, the recovery would have taken days longer, and whole swaths of unique Pluto science would have been lost forever.
Using procedures tested on NHOPS-1 and NHOPS-2, the middle step of taking New Horizons out of safe mode and getting it back on the primary flight computer succeeded and was confirmed by telemetry sent by the spacecraft on July 6.
Next, the spacecraft had to be configured just as it had been prior to the attempt to upload the flyby script on July 4, and then, as a final step, the Core load had to be sent back up again, and with it all the dozens of associated support files that had been lost when the anomaly rebooted the primary computer. Those steps and all the NHOPS testing for them, including many Anomaly Review Board meetings to plan and certify each step, took round-the-clock work on the sixth.
But somehow, by late morning on July 7 all the recovery work was complete. Exhausted, the team had managed to get the spacecraft back on track and ready to go for the flyby. They had completed it with just four hours to spare before the Core load needed to engage.
What science was lost because of the July 4th anomaly and recovery reworks? In saving the day for New Horizons, Alice and her team did follow Alan’s directive to do “whatever it takes” to save the Core flyby. So in the end they did trash all the observations that would have taken place during the three days of the anomaly recovery, because there was simply no way to re-plan them and also get the spacecraft out of safe mode and ready to start the close flyby on time.
But Alice’s team did manage to save the 63 images that were in the process of being compressed when the anomaly occurred. Those images had to be compressed to fit in storage because the larger, raw images had to be deleted to open up more memory space for flyby data. During the recovery operations, Alice’s team spotted an open window in the spacecraft operations timeline and managed to get that compression rescheduled, saving every single one of those precious 63 images.
What about all the approach observations that were trashed during the July 4 weekend recovery of the spacecraft? Alan assigned flyby planning czar Leslie Young the task of forming a tiger team to analyze just that. Leslie and her troops worked during the three days of the spacecraft recovery to look at every lost observation and its impact on the overall science return at Pluto. They found that each one had a later observation that was at higher resolution or closer range, meaning no objectives had been lost—except in one case. That was the final satellite search around Pluto that had been planned to take place on July 5 and 6, when New Horizons was still far enough out to blanket the space around Pluto with images. That sequence would have searched with several times the sensitivity of the previous search made just days before the anomaly occurred. When all the satellite search images were later scoured carefully by the New Horizons science team, no new satellites were found. This surprised many on the science team, since every time the Hubble Space Telescope had looked harder, it had found more moons. Would New Horizons have discovered satellites in that trashed, final, better search? No one knows, or will know, perhaps, until some future Pluto orbiter mission arrives, to search again.
Alan Stern is the principal investigator of the New Horizons mission, leading NASA’s exploration of the Pluto system and the Kuiper belt. A planetary scientist, space program executive, aerospace consultant, and author, he has participated in over two dozen scientific space missions.
David Grinspoon is an astrobiologist, award-winning science communicator, and prize-winning author. His previous books include Earth in Human Hands, and his writing has appeared in The New York Times, Slate, Scientific American, and others.
Excerpted from Chasing New Horizons: Inside the Epic First Mission to Pluto by Alan Stern and David Grinspoon. Published by Picador, May 1, 2018. Copyright © 2018 by Alan Stern and David Grinspoon. All rights reserved.