Space Technology 8


Technologies

Dependable Multiprocessor

>> An Approach to a Solution

The approach embodied in the Dependable Multiprocessor experiment is to use a redundant-hardware architecture and advanced-technology, fault-tolerant software to provide a computing system able to operate in the space environment, provide reliable computing, and offer performance like that available on Earth. Fault tolerance, as the term implies, is the ability to continue to provide correct computation in the presence of faults or errors. In other words, instead of taking the usual approach of fault avoidance—shielding the system and/or designing custom electronics that will not be affected by the radiation—we would like to use standard, low-cost, high-performance, low-power components to build machines that recognize that an error has occurred and autonomously take action to correct the error. Similarly, if radiation damage breaks part of the computer, we would like the machine to identify the damage and fix itself or, if this is not possible, signal its problem and halt.

We can categorize faults into two types: permanent faults—that is, faults that break computer components—and soft errors, which cause an error but do not cause permanent damage. Techniques have been developed to deal with both types of faults. Unfortunately, these techniques, especially those for fixing soft errors, rob the computer of much of its efficiency. They usually consist of a combination of replication and voting, that is, performing the computation multiple times and then voting on the result.

If we have, for instance, three computers, each performing the same computation and feeding its result to a similarly triplicated voter, the computer system could tell whether one of the computers had experienced an error and could identify the incorrect answer, as it would be out-voted by the other two. Alternatively, we could use just two computers and voters to detect an error, but then we would have to repeat the computation, because we could not tell which answer was correct.
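The two schemes described above can be sketched in a few lines of Python. This is only an illustration of the voting logic, not the Dependable Multiprocessor's software: the function names `tmr_vote` and `duplex_run` and the `max_attempts` parameter are hypothetical, and a single software voter stands in for the triplicated voter.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Majority vote over three replica results.

    Returns (value, faulty_index), where faulty_index is the position of the
    out-voted replica, or None if all three agree.
    """
    results = [a, b, c]
    value, count = Counter(results).most_common(1)[0]
    if count == 3:
        return value, None
    if count == 2:
        faulty = next(i for i, r in enumerate(results) if r != value)
        return value, faulty
    # All three disagree: no majority exists, so the error cannot be masked.
    raise RuntimeError("no majority among replicas")


def duplex_run(compute_a, compute_b, max_attempts=3):
    """Two replicas can detect a mismatch but not say which one is wrong,
    so the computation is repeated until both agree.

    Returns (value, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        a, b = compute_a(), compute_b()
        if a == b:
            return a, attempt
    raise RuntimeError("replicas never agreed")


if __name__ == "__main__":
    # Replica 1 suffered a (simulated) soft error and is out-voted.
    print(tmr_vote(42, 7, 42))
```

With three replicas the wrong answer is masked immediately, while the two-replica version saves hardware and pays for it with repeated computation whenever an error occurs.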

Other techniques use redundancy in the form of codes or computational algorithms that can detect that an error has occurred by checking the results for certain characteristics. This type of redundancy is often called “information redundancy,” and relies not on replicating a computation, but instead on using additional information that can be used to detect an error. A simple example of such a coding technique is parity. Here an extra bit is inserted into each digital word and is used to make the number of ones in the word either even (even parity) or odd (odd parity), depending on the design of the machine (an even parity or odd parity machine). To check for an error, a parity checker simply counts the number of ones in a word every time the word is accessed and determines whether the count is even or odd. If the parity is incorrect, an error has occurred and that data word is incorrect.
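The parity scheme just described can be illustrated with a short Python sketch. The helper names (`parity_bit`, `encode`, `check`), the 8-bit word width, and the choice to store the parity bit above the data bits are assumptions made for the example, not details of any particular machine.

```python
def parity_bit(word: int, even: bool = True) -> int:
    """Extra bit that makes the total number of ones even (or odd)."""
    ones = bin(word).count("1")
    bit = ones % 2              # 1 only if the data has an odd number of ones
    return bit if even else bit ^ 1


def encode(word: int, width: int = 8, even: bool = True) -> int:
    """Attach the parity bit just above the `width` data bits."""
    if not 0 <= word < (1 << width):
        raise ValueError("word does not fit in the given width")
    return word | (parity_bit(word, even) << width)


def check(codeword: int, even: bool = True) -> bool:
    """Count the ones in the stored word and compare with the expected parity."""
    ones = bin(codeword).count("1")
    return ones % 2 == (0 if even else 1)


if __name__ == "__main__":
    stored = encode(0b10110010)          # data word with four ones
    print(check(stored))                 # intact word passes the check
    print(check(stored ^ (1 << 3)))      # one flipped bit is detected
```

Note that a single parity bit only detects an error; it cannot say which bit flipped, and an even number of flipped bits in the same word passes the check unnoticed.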

Other techniques exist for error detection and correction in data and in computational algorithms. While these techniques require less overhead than straight replication, they still incur overhead and are limited in their application—that is, they cannot cover all machine operations.

The Dependable Multiprocessor architecture uses several commercial off-the-shelf (COTS) processors, a COTS-based mass memory device, a COTS intra-communication system, and a radiation-hardened processor as a controller. Advanced-technology fault-tolerant software is used to detect and correct faults. In effect, extra hardware and additional computations are the costs of using COTS equipment in space. This additional overhead diminishes the gain in performance that the use of COTS equipment would otherwise provide. Even with this overhead, however, a fault-tolerant, COTS-based, high-performance computer provides an orders-of-magnitude performance increase over traditional radiation-hardened processors.

Although it is not incorporated in the Dependable Multiprocessor’s design, there is a way to gain back that lost efficiency, or even to improve on standard COTS computing capabilities. It turns out that much of the science data processing we wish to do makes extensive use of mathematical routines such as linear algebra. These routines could be performed very efficiently if we had special-purpose hardware to carry out these calculations. Unfortunately, we don't know beforehand which algorithms will be required, and custom chips designed to perform these algorithms would be prohibitively expensive.

There is, however, another solution. If we had a component made up of reconfigurable hardware elements, i.e., sets of digital hardware elements whose wiring could be "programmed" as needed, we could use these reconfigurable logic parts to implement the required algorithms on an as-needed basis. To do this efficiently, we would need to design our computer system so that these parts could be re-programmed on the fly, in space, as needed, perhaps from a library of pre-designed configurations, one for each algorithm we might want to use over the course of the mission.
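The idea of re-programming a part on demand from a library of configurations can be pictured with a small Python model. Everything here is hypothetical: the class `ReconfigurableCoprocessor`, the library entries, and the use of ordinary Python functions as stand-ins for hardware configurations are assumptions made purely for illustration.

```python
class ReconfigurableCoprocessor:
    """Toy model of a part that is re-programmed only when the algorithm changes."""

    def __init__(self, library):
        # algorithm name -> configuration (a callable stands in for the real thing)
        self.library = dict(library)
        self.loaded = None
        self.reconfigurations = 0

    def run(self, algorithm, *args):
        if algorithm not in self.library:
            raise KeyError(f"no configuration for {algorithm!r}")
        if self.loaded != algorithm:
            # In a real system this step would load a new hardware configuration.
            self.loaded = algorithm
            self.reconfigurations += 1
        return self.library[algorithm](*args)


if __name__ == "__main__":
    library = {
        "dot": lambda x, y: sum(a * b for a, b in zip(x, y)),
        "scale": lambda k, x: [k * a for a in x],
    }
    coproc = ReconfigurableCoprocessor(library)
    print(coproc.run("dot", [1, 2, 3], [4, 5, 6]))
    print(coproc.run("scale", 2, [1, 2, 3]))
    print(coproc.reconfigurations)
```

The point of the sketch is the bookkeeping: consecutive requests for the same algorithm reuse the loaded configuration, and only a change of algorithm triggers a reconfiguration.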

The readily available Field Programmable Gate Array (FPGA) is such a part. The use of FPGAs as "accelerators" or "reconfigurable co-processors" can give us as much as a 100x to 1000x improvement over standard processors for some types of science data analysis programs. Unfortunately, because this part is full of memory elements (which hold its configuration program), it is also subject to soft errors. It too must therefore be made fault tolerant, or the computing system of which it is a part must be made aware of its susceptibility and taught to deal with it.

While an FPGA-based reconfigurable fault-tolerant co-processor was originally included in the Dependable Multiprocessor technology advance, it was later removed because of funding and schedule constraints. The University of Florida, which was originally contracted by Honeywell to develop the fault-tolerant FPGA-based co-processor for the Dependable Multiprocessor, has continued this work under separate funding, some of which comes from Honeywell and some from other sources. So, although the Dependable Multiprocessor technology advance does not currently include the FPGA co-processor, this technology is being pursued and may become part of a follow-on Dependable Multiprocessor or Dependable Multiprocessor-like spaceborne high-performance computer system.

Figure: Hardware architecture of the Dependable Multiprocessor.



Webmaster: Diane K. Fisher
JPL Official: Nancy J. Leon
Last updated: 2/13/08
JPL Clearance #: 06-1093
