Reliability Issues and Fault Tolerance Notes

Reliability Issues

Terminology

System: The whole being discussed.
Environment: The surroundings of the system.
Boundaries: The defined limits of the system.
Services: Functions provided by the system.
Real-time service: A service delivered within finite time intervals dictated by the environment.
Real-time system: A system that delivers at least one real-time service.

Main Attributes of a System

Availability: The extent to which a system is ready for usage.
Reliability: The extent to which a system continuously provides its service, indicating confidence in accurate and consistent results.
Safety: The extent to which a system avoids catastrophic consequences on the environment, indicating confidence in preventing accidents.
Security: The extent to which a system prevents unauthorized access and handling of information, indicating confidence in resisting attempts to modify its behavior.

Relationships Between Attributes

Security and reliability are necessary but not sufficient conditions for safety.
System Failure: When a system no longer delivers a service that complies with its specification.
Error: A system state that is liable to lead to a subsequent failure.
Fault: The adjudged or hypothesized cause of an error.

Components of Computers

Computers consist of hardware, software, and data.
- Software defines operations.
- Hardware performs operations.
- Data records the results of operations.

Comparison of Hardware, Software, and Data in Terms of Failure

Hardware:
- Cause of Failure: Deficiencies in design, production, or maintenance; Occurrences; Will eventually fail.
- Failure Rates: Can be predicted in theory from physical principles.
- Redundancy: Improves reliability but is susceptible to common cause failures.
- Diversity: Improves reliability and is less susceptible to common cause failures.
- Environmental Factors: Dependent on temperature, humidity, stress, etc.
- Time Dependence: Time-dependent; failures can be increasing, constant, or decreasing.
- Wear-Out: Responsible for some failures, may be preceded by a warning.
- Maintenance: Can improve reliability.
Software:
- Cause of Failure: Design (logic) errors; Transient Events; May never fail.
- Failure Rates: Cannot be predicted from physical principles.
- Redundancy: Will not improve reliability, since this will only replicate the same failure.
- Diversity: Will improve reliability, minimizing the possibility of the same error occurring in separate modules.
- Environmental Factors: Dependent on internal environment of computer (memory, clock speed, etc.).
- Time Dependence: Not time-dependent; failures occur when a path that contains an error is executed.
- Wear-Out: Not responsible for any failures.
- Maintenance: Will not improve reliability, and may actually worsen it.
Data:
- Cause of Failure: Transient Events; May never fail.
- Failure Rates: Some upset rates can be predicted from tests.
- Redundancy: May improve reliability.
- Diversity: Will improve reliability, minimizing the possibility of the same error occurring in separate modules.
- Environmental Factors: Dependent on both external (radiation, EMI, etc.) and internal environment (memory, clock speed, etc.).
- Time Dependence: Time-dependent; failures can be increasing, constant, or decreasing.
- Wear-Out: Not responsible for any failures.
- Maintenance: Will not improve reliability.

Fault, Error, Failure, and Hazard

Fault → Error → Failure → Hazard → (leads to) Accident
Error Detection and Correction
Fail Safe
Fault Tolerance

Fail-Safe State

An unreliable system can be made safe upon a failure by reverting to a fail-safe state.
Example: The fail-safe state of a word processing program can be that the document being processed has been saved onto the disk.
Fail-safe states help separate the issues of safety and reliability.
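The word-processor example can be sketched in a few lines of Python. This is only an illustration: the `Editor` class, its methods, and the simulated failure are hypothetical names invented for the sketch, and the "disk" is just an attribute. The point is that the program gives up on delivering its service but still ends in a safe state.

```python
class Editor:
    """Toy word processor; every name here is hypothetical."""

    def __init__(self):
        self.text = ""
        self.saved = ""       # stands in for the copy on disk
        self.running = True

    def save(self):
        self.saved = self.text

    def enter_fail_safe_state(self):
        # Fail-safe state: the document is saved and editing has stopped.
        self.save()
        self.running = False

    def apply(self, operation):
        try:
            self.text = operation(self.text)
        except Exception:
            # The service has failed; reliability is lost, safety is kept.
            self.enter_fail_safe_state()


def faulty_operation(text):
    raise RuntimeError("simulated internal failure")


editor = Editor()
editor.apply(lambda t: t + "hello")
editor.apply(faulty_operation)
print(editor.running, editor.saved)
```

After the simulated failure the editor is no longer running, but the last good document content has been saved.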

Safety-Critical Systems

A safety-critical system is one where any failure can cause severe damage, and no fail-safe state exists.
We cannot revert to a fail-safe state.
Example: An autopilot system; if it fails, there is no fail-safe state comparable to shutting down an engine.
For safety-critical systems, even asking the human pilot to take over may not be sufficient.
The only way to make it safe is by making it extremely reliable.
Safety can be ensured only through increased reliability in safety-critical systems.

Methods for Reliable Real-Time Systems (RTS)

Fault Prevention: How to prevent fault occurrence or introduction.
Fault Tolerance: How to provide a service complying with the specifications in the presence of faults.
Fault Removal: How to reduce the presence of faults, both regarding the number and seriousness of faults.
Fault Forecasting: How to estimate the creation and the consequences of faults.

Implementation of Methods

Fault prevention and fault tolerance should be embedded during the construction of the system.
Fault removal and fault forecasting may be seen as the methodology used to ensure the safety of a system.
Fault Avoidance: The close association between fault removal and fault prevention.
Fault avoidance aims at a fault-free system.

Fault Avoidance and Tolerance

Fault Avoidance:
- By using the most reliable components.
- Implementing the best techniques for the interconnections of components.
- Carrying out comprehensive testing to eliminate hardware and software faults.
Fault Tolerance:
- The ability of an operational system to tolerate the presence of faults is achieved mainly through error processing and fault treatment.

Error Processing and Fault Treatment

Error processing aims at removing errors from the state of the system, and fault treatment aims at preventing faults from being activated again.
In order to undertake error processing, the system must have detected the error and assessed the damage done by it.
The phases to be undertaken in order to tolerate a fault are therefore: error detection, damage assessment, error processing, and fault treatment.

Error Detection

Error detection is the detection of an erroneous state, i.e., a state that is liable to lead to a subsequent failure.
Since an error is a manifestation of a fault, the effectiveness of the techniques for error detection is crucial for the success of any fault-tolerant system.

Damage Assessment

Damage assessment is carried out once an error has been detected, in order to establish more precisely the extent to which the system is damaged.
This assessment is highly dependent on decisions made by the system designer to limit the propagation of errors.

Error Processing (Recovery or Compensation)

When the damage to the system has been assessed, the error may be processed in two ways:
- Error Recovery: An attempt to substitute the erroneous system state with one which is error-free.
  - Backward Recovery: The system is brought back to an error-free state recorded in a recovery point - a "snapshot" of the system state - prior to the erroneous state.
  - Forward Recovery: The transformation of the erroneous state consists of finding a new state from which the system can operate (often in degraded mode).
- Error Compensation: The erroneous state contains enough redundancy to enable the delivery of an error-free service from the erroneous state.
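Backward recovery can be illustrated with a minimal sketch. The `Account` class, its invariant (a non-negative balance), and the method names are assumptions made for this example only; the recovery point is simply a deep copy of the state taken before the operation.

```python
import copy


class Account:
    """Hypothetical component used only to illustrate backward recovery."""

    def __init__(self, balance):
        self.state = {"balance": balance}
        self._recovery_point = None

    def checkpoint(self):
        # Record a recovery point: a snapshot of the current state.
        self._recovery_point = copy.deepcopy(self.state)

    def rollback(self):
        # Backward recovery: restore the state saved in the recovery point.
        self.state = copy.deepcopy(self._recovery_point)

    def withdraw(self, amount):
        self.checkpoint()
        self.state["balance"] -= amount
        if self.state["balance"] < 0:   # error detection: invalid state
            self.rollback()
            return False
        return True


account = Account(100)
print(account.withdraw(30), account.state["balance"])
print(account.withdraw(500), account.state["balance"])
```

The second withdrawal produces an erroneous state, which is detected and replaced by the state recorded in the recovery point.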

Fault Treatment (Diagnosis and Passivation)

The first step in fault treatment is fault diagnosis, which consists of determining the cause(s) of the error(s), with regard to both location and nature.
When this is done, actions can be carried out in order to prevent the fault(s) from being activated again, i.e., fault passivation.
The process of fault treatment is seldom done with the failed system in operation.
For instance, a commission of inquiry performs fault diagnosis on the plane that actually crashed in order to gather information about the fault.
This information can hopefully be used to passivate the fault that caused the crash - preferably by removing the fault - in other planes (of the same type) which are still in operation.

Fault Tolerance and Redundancy

Redundancy consists of additional components and algorithms attached to the system.
Protective redundancy can be divided into the following domains:
- Space: Hardware redundancy (H).
- Information: Software redundancy (S).
- Repetition: Time redundancy (T).

Hardware Redundancy

Hardware redundancy may be divided into static redundancy and dynamic redundancy.
Static Redundancy: Is employed to mask the effect of hardware failures within a given hardware module and is also called masking redundancy.
Dynamic Redundancy: An error is allowed to appear in a module, and an attempt is then made to recover the failed module.

Software and Time Redundancy

Software Redundancy: Includes all additional software not needed in a fault-free computer system.
- This additional software serves to provide, for example, error detection and recovery.
- This form of redundancy is often used together with dynamic hardware redundancy.
Time Redundancy: Incorporates two major strategies:
- Restart of programs after an error has been detected, and
- Repeated execution for error detection.
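Both time-redundancy strategies can be combined in one small sketch: a task is executed repeatedly, a disagreement between the runs counts as a detected error, and the whole attempt is then restarted. The function names, the default parameters, and the simulated transient fault are assumptions made for illustration.

```python
def run_with_time_redundancy(task, repeats=2, max_restarts=3):
    """Execute `task` `repeats` times and accept the result only if all
    executions agree; on disagreement, restart up to `max_restarts` times."""
    for _ in range(max_restarts + 1):
        results = [task() for _ in range(repeats)]
        if all(r == results[0] for r in results):
            return results[0]
    raise RuntimeError("executions never agreed")


def make_flaky_task():
    calls = {"n": 0}

    def task():
        calls["n"] += 1
        # Simulated transient fault: only the first execution is corrupted.
        return 41 if calls["n"] == 1 else 42

    return task


print(run_with_time_redundancy(make_flaky_task()))
```

Note that repeated execution only detects errors that differ between runs; a fault that corrupts every execution in the same way would pass unnoticed.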

Redundancy and Diversity Denotations

Redundancy in a domain is denoted with an $N$, e.g., hardware redundancy is denoted $NH$.
If the redundancy is diversified, this is denoted by $d$, as in $NdH$, which denotes that the redundancy in hardware is comprised of several diverse hardware channels.
Using the definitions of these domains and their respective denotations, a description of a non-fault-tolerant system (also called a simplex system) is 1H/1S/1T, meaning that the system has 1 hardware channel, 1 program, and runs in 1 execution.

Software Diversity

Frequently used in order to tolerate design faults.
Logical independence of software components.
These diverse software components are functionally equivalent but are implemented in different ways.
Diversity may be obtained by:
- Different specifications,
- Different algorithms for the same specification,
- Different implementations of the same algorithm, and so on.
However, experimental data has shown that independence between diverse software components is very hard to achieve.
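As a rough illustration of diverse, functionally equivalent components, the sketch below runs three versions of an integer square root and takes a majority vote. The three functions and the deliberately seeded design fault in the third are invented for this example; real diverse versions would be developed independently.

```python
import math
from collections import Counter


def isqrt_builtin(n):
    return math.isqrt(n)


def isqrt_newton(n):
    # Different algorithm for the same specification (integer Newton steps).
    if n < 2:
        return n
    x = n
    while x * x > n:
        x = (x + n // x) // 2
    return x


def isqrt_faulty(n):
    # Linear search with a seeded design fault: wrong for perfect squares.
    r = 0
    while (r + 1) * (r + 1) < n:   # should be <=
        r += 1
    return r


def vote(versions, n):
    results = [version(n) for version in versions]
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("no majority among versions")
    return value


versions = [isqrt_builtin, isqrt_newton, isqrt_faulty]
print(vote(versions, 16), vote(versions, 10))
```

The vote masks the design fault only because the other two versions do not share it, which is exactly the independence assumption that is hard to achieve in practice.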

Fault Removal

Fault removal consists of three steps: verification, diagnosis, and fault correction.
- Verification is the process of checking whether the system adheres to certain properties, the so-called verification conditions.
- If it does not, the other two steps have to be undertaken: diagnosing the fault(s) that prevented the verification conditions from being fulfilled, and then performing the necessary corrections.
- After correction, the process of verification is started again from the beginning.

Steps of Fault Removal

Verification
Diagnosis
Fault Correction

Fault Forecasting

When a system has been constructed, it is often desired to make an evaluation of its behavior with respect to fault occurrence and activation.
This evaluation is called fault forecasting and can be carried out in two ways:
- Non-probabilistic, e.g., determining the minimal cut set or path set of a fault tree, or conducting a failure mode and effects analysis (FMEA); and
- Probabilistic, which aims at determining the conformance of the system to its objectives expressed in terms of probabilities associated with the attributes of dependability, which may then be defined as measures of dependability.
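For the non-probabilistic case, the minimal cut sets of a small fault tree can be computed directly. The sketch below assumes a tree encoded as nested tuples of the form `("AND", ...)` or `("OR", ...)` with basic events as strings; the encoding and the example events are assumptions made for illustration, not a standard format.

```python
from itertools import product


def cut_sets(node):
    """Return all cut sets (sets of basic events) that trigger `node`."""
    if isinstance(node, str):          # basic event
        return {frozenset([node])}
    gate, *children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":
        # Any cut set of any child triggers an OR gate.
        return set().union(*child_sets)
    if gate == "AND":
        # One cut set from every child is needed to trigger an AND gate.
        return {frozenset().union(*combo) for combo in product(*child_sets)}
    raise ValueError(f"unknown gate: {gate}")


def minimal_cut_sets(node):
    sets = cut_sets(node)
    # Drop every cut set that strictly contains another cut set.
    return {s for s in sets if not any(t < s for t in sets)}


tree = ("OR", "power", ("AND", "cpu_a", "cpu_b"), ("AND", "power", "cpu_a"))
for cut in sorted(minimal_cut_sets(tree), key=sorted):
    print(sorted(cut))
```

Here the loss of power alone, or the loss of both CPUs together, is enough to cause the top event; the third branch adds nothing because it already contains the power failure.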

Fault Injection

Evaluation of fault-tolerant systems often involves the measuring of the coverage of error processing and fault treatment, i.e., a measure of the ability of these mechanisms in the system to process the error and treat the fault.
This evaluation may be done through testing using fault injection.
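A coverage measurement of this kind can be sketched with a deliberately simple detection mechanism. The example assumes an 8-bit data word protected by one even-parity bit and injects every possible combination of bit flips of a given size; the function names and the choice of parity as the mechanism under test are assumptions for illustration.

```python
from itertools import combinations


def parity(word):
    return bin(word).count("1") % 2


def encode(data):
    # Append an even-parity bit (bit 8) to an 8-bit data word.
    return data | (parity(data) << 8)


def error_detected(codeword):
    # A valid codeword contains an even number of 1 bits.
    return parity(codeword) != 0


def coverage(data, flips):
    """Fraction of injected `flips`-bit faults that the parity check detects."""
    codeword = encode(data)
    injected = detected = 0
    for positions in combinations(range(9), flips):
        corrupted = codeword
        for position in positions:
            corrupted ^= 1 << position     # inject a bit flip
        injected += 1
        detected += error_detected(corrupted)
    return detected / injected


print(coverage(0b10110010, flips=1), coverage(0b10110010, flips=2))
```

The measured coverage is complete for single bit flips and zero for double bit flips, which shows how fault injection exposes the limits of a detection mechanism.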

Dependability

Dependability is defined as the quality of being able to be counted on or relied upon.
When you always do everything that you say you will and never make promises you cannot keep, this is an example of dependability.
Software dependability is useful for planning and controlling resources during the design and development process, which supports the development of a high-quality, dependable software system.
It is also useful for building users' confidence in the reliability of the software.

Dependability Attributes

These attributes can serve as measures of dependability, and may be more or less emphasized depending on the application intended for the computer system under consideration.
The main dependability attributes are:
- Reliability
- Availability
- Safety
- Security

Aspects of Dependability

Dependability
Available: Readiness for Usage
Reliable: Continuity of Service Delivery
Safe: Non-occurrence of Catastrophic Consequences
Confidential: Non-occurrence of unauthorized disclosure of information
Integral: Non-occurrence of improper alteration of information
Maintainable: Aptitude to undergo repairs or evolutions

Dependability Terminology

Attributes: Availability, Confidentiality, Reliability, Safety, Integrity, Maintainability
Means: Fault Prevention, Fault Tolerance, Fault Removal, Fault Forecasting
Impairments: Faults, Errors, Failures

Fault-Tolerant Architectures: TMR Systems

Triple Modular Redundancy (TMR)

Fault-tolerance in hardware is very intuitive.
Hardware engineers have developed standard fault-tolerance techniques, notably built-in self-test and triple modular redundancy (TMR).
If errors are found, then either
- a spare module is invoked, or
- a vote is taken among three redundant modules that all work at the same time: if one of them fails, its result differs from the other two, so the failed module is removed and the majority result is used.
TMR is the standard hardware fault-tolerance technique.
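The voting step can be sketched as a bitwise majority function, which is the usual logic form of a three-input voter. The function names and the example output words are assumptions made for illustration.

```python
def tmr_vote(a, b, c):
    """Bitwise majority of three module outputs: each output bit takes
    the value that at least two of the three modules agree on."""
    return (a & b) | (a & c) | (b & c)


def disagreeing_modules(outputs):
    """Indices of modules whose output differs from the voted result."""
    voted = tmr_vote(*outputs)
    return [i for i, out in enumerate(outputs) if out != voted]


outputs = (0b1011, 0b1011, 0b0011)   # module 3 has a faulty top bit
print(bin(tmr_vote(*outputs)), disagreeing_modules(outputs))
```

A single faulty module is masked by the other two. If two modules fail in the same way, the voter follows the wrong majority.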

Problems with Voters

Module outputs may become valid at slightly different times due to differences in hardware paths as well as in sensor locations.

TMR-Architecture

Input -> Module 1, Module 2, Module 3 (in parallel) -> Voting Element -> Output

Multi-Stage TMR

Input Module -> Voter Module -> Module -> Voter

Detecting Faults

Functionality Checks: Periodically execute code and check the results. For example, write to RAM and read the value back.
Consistency Checking: For example, range checking.
Signal Comparison: In some systems, signals can be compared at various points within a module.
Information Redundancy: Checksums, parity checking.
Instruction Monitoring: If the processor fetches an illegal instruction, something is probably wrong.
Loop-back Testing: Useful for testing communication channels. Make sure that what is received is the same as what was sent.
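Two of these checks are easy to show in code. The sketch below assumes a simple additive checksum modulo 256 appended to each message (information redundancy) and a plain bounds test (consistency checking); the frame layout, function names, and limits are hypothetical.

```python
def checksum(payload: bytes) -> int:
    # Simple additive checksum, modulo 256.
    return sum(payload) % 256


def make_frame(payload: bytes) -> bytes:
    # Information redundancy: append the checksum to the payload.
    return payload + bytes([checksum(payload)])


def frame_is_valid(frame: bytes) -> bool:
    return len(frame) >= 1 and checksum(frame[:-1]) == frame[-1]


def in_range(value, low, high) -> bool:
    # Consistency check: a reading outside its plausible range is an error.
    return low <= value <= high


frame = make_frame(b"\x10\x20\x30")
corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]   # flip one bit
print(frame_is_valid(frame), frame_is_valid(corrupted))
print(in_range(72, 0, 150), in_range(900, 0, 150))
```

An additive checksum misses some error patterns, for example two changes that cancel each other out, so it only detects a subset of possible faults.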

More Ways of Detecting Faults

Bus Monitoring: Watch the bus and make sure that the program accesses memory within an allowable range.
Power Supply Monitoring: A dead system may draw less power. The power supply itself may also fail, which would cause a major system failure; there might be a warning sign that can be monitored.
Watchdog Timers: Detect the crash of a microprocessor by arranging a timer such that it causes a reset (or an error condition) if it is allowed to time out. While the processor is operating normally, it periodically reloads the timer with a value.
- Problem: Time delay until the fault is detected.
- Problem: It is conceivable that the system could crash in such a way that the timer is still reloaded with a value, or that the watchdog timer itself fails.
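The watchdog mechanism and its detection delay can be simulated without real hardware. The sketch below uses a hypothetical `Watchdog` class driven by a simulated clock tick; the timeout value and the point at which the processor hangs are assumptions chosen for the example.

```python
class Watchdog:
    """Simulated watchdog timer counting down in abstract clock ticks."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.expired = False

    def kick(self):
        # Called by a healthy processor to reload the timer.
        self.remaining = self.timeout_ticks

    def tick(self):
        if self.expired:
            return
        self.remaining -= 1
        if self.remaining <= 0:
            self.expired = True    # would trigger a reset or error condition


def simulate(hang_at_tick, total_ticks, timeout_ticks=3):
    """Return the tick at which the watchdog expires, or None if it never does."""
    watchdog = Watchdog(timeout_ticks)
    for t in range(total_ticks):
        if t < hang_at_tick:       # processor is alive and reloads the timer
            watchdog.kick()
        watchdog.tick()
        if watchdog.expired:
            return t
    return None


print(simulate(hang_at_tick=5, total_ticks=20))
print(simulate(hang_at_tick=100, total_ticks=20))
```

The gap between the tick at which the processor hangs and the tick at which the watchdog expires is the detection delay named in the first problem above; a longer timeout makes false alarms less likely but increases that delay.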