We have previously written about poor data quality having a real-world negative impact with the recent UK government COVID data loss. There are numerous other examples where avoidable issues with data quality have caused significant losses and disruptions. In this post we’re going to look at another high-profile case, when NASA lost a multimillion-dollar space vehicle leading to the failure of an entire Mars project, simply because of a data conversion error which was not detected.
In December 1998, the Mars Climate Orbiter was launched, set to reach the red planet by September 1999. Weighing over half a tonne, it would travel 669 million kilometers carrying two main scientific instruments that would help scientists better understand the Mars climate. Its total cost added up to $125 million.
These two instruments would analyse the surface of the planet from orbit, tracking variations in temperature, ice and gases throughout the Martian year. The Orbiter would also set up a communication nexus for the Mars Polar Lander project, the subsequent phase of the larger Mars mission, set to launch not long after the Orbiter.
On September 23, 1999, the spacecraft prepared to enter its orbit around Mars. Unexpectedly, its trajectory was 105 miles lower than planned. Contact was lost and the Orbiter burned up in the Mars atmosphere, with the entire project considered a failure. An investigation by the Jet Propulsion Laboratory discovered the reason for the catastrophic failure: the velocity changes performed to align the craft into orbit were off by a factor of 4.45. But why did this happen?
How Did Poor Data Quality Cause the Outcome?
The culprit was a single piece of software called SM_FORCES (Small Forces) used by the ground crew to process trajectory models. The expected output from this code, according to the mission's Project Software Interface Specification, was in metric Newton-seconds. It was reasonably assumed that any value being derived from that file would be in metric.
However, due to the way work on the project was distributed, an error was introduced into this file and went undetected. The propulsion engineer contractors from Lockheed Martin Astronautics, who built the thruster systems, worked in their own preferred unit, pound-force seconds. One pound-force is roughly 4.45 Newtons, which is where the factor of 4.45 comes from. Since NASA generally uses metric for most of its systems, a conversion of the pound-force values should have been made and accounted for. It was not.
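To make the size of the error concrete, here is a minimal sketch of what carrying the unit alongside the value could look like. It is not the actual SM_FORCES code, which the source does not describe; the `Impulse` class and the unit labels `"N*s"` and `"lbf*s"` are hypothetical names chosen for illustration. The conversion constant is the standard definition of the pound-force.

```python
from dataclasses import dataclass

# 1 lbf is defined as exactly 4.4482216152605 N, so the same factor
# converts pound-force seconds to Newton-seconds.
LBF_S_TO_N_S = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """A thruster impulse that always travels with its unit (hypothetical type)."""
    value: float
    unit: str  # "N*s" or "lbf*s" (labels assumed for this sketch)

    def to_newton_seconds(self) -> float:
        # Convert explicitly instead of assuming the value is already metric.
        if self.unit == "N*s":
            return self.value
        if self.unit == "lbf*s":
            return self.value * LBF_S_TO_N_S
        raise ValueError(f"unknown impulse unit: {self.unit!r}")


# A value produced in pound-force seconds but read as Newton-seconds
# understates the impulse by the same factor of about 4.45.
reported = Impulse(10.0, "lbf*s")
print(round(reported.to_newton_seconds(), 2))
```

The design choice is simply that a bare number never crosses a team boundary: whoever consumes the value has to ask for it in a named unit, and an unknown unit fails loudly.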
How Did It Go Unnoticed?
Navigation in space was being performed by measuring the Doppler effect on the radio link and converting it into coordinates. By plotting a number of these coordinates, a flight path was calculated. Based on this, trajectory correction burns could be made. These corrections, as they were being made, were putting the orbiter off course by a factor of 4.45. This was affecting all calculations, as they were all being sourced from the same incorrect file.
No single deviation would have alerted the ground crew to the error, since all the measurements were still perfectly in scale with each other. Also, the results of a correction could not be known for several weeks, until a new set of points could be plotted. Finally, the Doppler effect calculation was at its least accurate and least reliable during the final approach to the planet. A major corrective thruster fire was initiated but could not rescue the spacecraft because all the force metrics used in the correction were compromised.
How Could It Have Been Avoided?
The specifics of any data format in use by NASA are documented in what is called the Software Interface Specification (SIS). This documentation exhaustively explains every element, the relationships between them and any calculations that take place, including the fact that the expected units were metric. However, problems with rushed development, improper testing and inadequate training meant that even when this documentation could have been useful, it was overlooked, and it was incomprehensible to anyone who hadn’t worked on the software directly.
The description of “Contributing Cause No. 1” in the official Mishap Investigation Board report states that for the first four months of the flight, the flawed SM_FORCES software was not even in use by the flight control team due to several other file-related issues. As a stopgap measure, the crew were relying on emails from Lockheed Martin warning them about relevant trajectory correction events.
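A written specification only helps if something enforces it. As a rough sketch, and assuming a hypothetical record layout with `impulse` and `unit` fields (the real SM_FORCES file format is not described here and may have carried no unit field at all), a check at the point of ingestion could look like this:

```python
import math

# Unit required by the interface specification (label assumed for this sketch).
EXPECTED_UNIT = "N*s"


class UnitMismatchError(ValueError):
    """Raised when a record does not declare the unit the specification requires."""


def validate_record(record: dict, expected_unit: str = EXPECTED_UNIT) -> float:
    """Return the impulse value only if the record declares the expected unit."""
    unit = record.get("unit")
    if unit is None:
        # A missing unit is treated as an error rather than assumed to be metric.
        raise UnitMismatchError("record carries no unit; refusing to assume metric")
    if unit != expected_unit:
        raise UnitMismatchError(f"expected {expected_unit}, got {unit}")
    value = float(record["impulse"])
    if not math.isfinite(value):
        raise ValueError("impulse must be a finite number")
    return value


print(validate_record({"impulse": 1.5, "unit": "N*s"}))
```

The point of the sketch is that the metric requirement stops being a sentence in a document and becomes a condition that data has to pass before anything downstream can use it.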
The Impact of Software Development Culture
Eventually, SM_FORCES was enabled and allowed the ground controllers to independently calculate trajectory. Immediately, incorrect data started to put the spacecraft off course. The haphazard introduction of software fixes without proper validation would lead to disaster. The IEEE report on this issue highlights that the Jet Propulsion Laboratory had a “cowboy programming” culture, which involved only a small group of people being familiar with the 30-year-old trajectory software code. This meant that validation, testing, configuration and control could not be done effectively.
Untested software was producing data which was used without any subsequent validation. Furthermore, the ground control crew had only one other point against which to check the validity of the trajectory: the increasingly unreliable Doppler effect. Even though many engineers started to suspect that something was off, there was no single major incident which sparked a major re-evaluation. The all-clear was given based on the very same data that was driving the orbiter to its doom.
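One inexpensive form of subsequent validation is to compare the same quantity from two independent sources and ask whether they differ by a suspiciously familiar ratio. The sketch below is a generic illustration, not a reconstruction of any NASA procedure; the function name, the tolerance and the choice of conversion factors are all assumptions.

```python
from statistics import median

# Well-known conversion factors worth checking a ratio against.
KNOWN_FACTORS = {
    "lbf vs N": 4.4482216152605,
    "ft vs m": 0.3048,
    "mi vs km": 1.609344,
}


def suspect_unit_mismatch(series_a, series_b, rel_tol=0.02):
    """Return the name of a conversion factor that explains a/b, or None.

    The median ratio is used so that a few noisy points do not hide
    a consistent scale difference between the two sources.
    """
    ratios = [a / b for a, b in zip(series_a, series_b) if b != 0]
    if not ratios:
        return None
    typical = median(ratios)
    for name, factor in KNOWN_FACTORS.items():
        # Check both directions, since either source may be the non-metric one.
        for candidate in (factor, 1.0 / factor):
            if abs(typical - candidate) <= rel_tol * candidate:
                return name
    return None


# Two sources that agree in shape but differ by a constant scale.
print(suspect_unit_mismatch([4.45, 8.9, 13.3], [1.0, 2.0, 3.0]))
```

A check like this only works when a second, independent source exists, which is exactly what the ground crew lacked.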
A Failure of Data Governance
Edward Weiler, the NASA associate administrator for space science, had this to say about the incident:
“The problem here was not the error; it was the failure of NASA's systems engineering, and the checks and balances in our processes, to detect the error. That's why we lost the spacecraft.”
Ultimately, while NASA uses the metric International System of Units (SI) for most of its operations, it still has a substantial amount of data in imperial. Some new projects are inheriting specifications from old designs which are perpetuating the use of imperial measurements, despite an agency-wide commitment to gradually shift to metric. An estimate of what it would take to move all of NASA to SI in a single go came up with the eye-watering cost of $370 million.
This poses an ongoing headache for NASA, which often relies on close collaboration with other international agencies and aerospace corporations in order to achieve its missions. Interoperability in data systems is an important part of this. Short of an expensive complete revamp of its data landscape, it will need to continuously monitor and adjust for the fact that there are imperial remnants still to be found within the organisation. With this situation, another imperial-to-metric data conversion issue seems possible in the future of space travel.
Although this particular data quality problem involves astrophysics and a multimillion-dollar spacecraft, it ultimately still comes down to a few very relatable causes. Poorly implemented software leading to knock-on effects that went unnoticed. Lack of validation for vital data. Rigid processes that could not identify a mission-sinking error unless it was glaringly obvious. Costs to modernise a data environment that seem high, until a major incident reveals that the investment would probably have been worth it.
When you are attempting to implement good data governance in your organisation, keep this story in mind and use a tool like Loome in order to keep your data quality from causing a disruption of astronomical proportions.