Notes: System failures and errors

Introduction to System Failures and Errors

This lecture aims to identify causes of system failure through case studies, understand theories on why errors and failures occur, and consider how systems can be made more dependable.

Case Study 1: Titanic

The Titanic was a catastrophic failure of a large system, resulting in significant costs in terms of money, human life, and organizational reputation.
Many mistakes were made during all phases of its design and development.
The Titanic was a very complex socio-technical system, including safety-critical control systems.
It involved cutting-edge technology for its time, such as data communications and engineering technologies.
The system also featured complex management structures and a complex political and organizational context.
Understanding what went wrong requires a whole-system perspective that considers the technical components, the people, knowledge, and processes, the organizational context, and the environment.

Case Study 2: Post Office (Horizon Scandal)

Described as "the most widespread miscarriage of justice in UK history" by the BBC.
In 1999, Fujitsu's new accounting software system, Horizon, was installed at Post Office branches.
Between 2000 and 2014, over 700 Post Office branch managers received criminal convictions for false accounting and theft.
In reality, the Horizon system was faulty and had falsely reported cash shortfalls, with severe consequences that included wrongful imprisonment.
Technical Components: The Horizon system was faulty with many errors and bugs, and Lord Justice Holroyde stated "there were serious issues about the reliability of Horizon".
People, Knowledge, and Processes: Post Office staff complained of bugs but were not taken seriously; the Post Office instead concluded that the Horizon software must be correct and that staff had stolen money.
Organizational Context and Environment: Factors included over-trust in technology, lack of respect for workers, embarrassment about an expensive failing tech contract, and failings in the legal system (e.g., a legal presumption of proper functioning of computers).

Case Study 3: Boeing 737 MAX

In October 2018 and March 2019, two Boeing 737 MAX aircraft crashed (Lion Air Flight 610 and Ethiopian Airlines Flight 302), killing all passengers, pilots, and cabin crew on board: 346 people in total.
Technical Components: Designers used larger engines, which had to be repositioned forward and higher. This caused unwanted extra lift and pitch-up at a high angle of attack (AoA). To reduce pitch-up and the risk of stall, and to make the aircraft behave like the older version, Boeing added software, the Maneuvering Characteristics Augmentation System (MCAS), which used AoA sensor data to push the nose down automatically by adjusting the angle of the stabiliser. The system was covert, forceful, and persistent.
People, Knowledge, Processes & Organizational Context: A software solution was chosen for what was fundamentally a hardware problem. There was little open communication about system risks, and pilots' concerns were not listened to. Some pilots were unaware of the new MCAS system and its operation.
Environment: Market forces pushed airline companies to demand larger, faster, and cheaper planes.

Theories and Models to Understand System Failures

Different Levels of Failure (Multi-Causal Approach)

Regulatory Failures: Can stem from lack of information, undertrained personnel, or lack of regulation.
Managerial Failures: Related to safety climate, lines of command and responsibility, and quality control.
Hardware Failures: Can be due to design failure, requirements failure, or implementation failure.
Software Failures: Include requirements failures and specification failures.
Human Failures: Involve slips, lapses, mistakes, team factors, and human error.

Failure in Complex Systems

Failure in one part may coincide with the failure of a different part.
This combination can cause cascading failures of other parts.
In complex systems, there are many possible combinations of failures.
Complex Interactions: Unfamiliar, unplanned, or unexpected sequences that are not visible or immediately comprehensible.
Tightly Coupled Systems: Characterized by time-dependent processes, rigidly ordered processes, and very little slack.
Systems with interactive complexity and tight coupling are particularly prone to failure.
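
The way coinciding failures cascade can be sketched in a few lines. This is only an illustration under stated assumptions: the component names, the dependency graph, and the rule that a component fails once more of its dependencies have failed than it can tolerate are all hypothetical, not taken from any of the case studies.

```python
def cascade(depends_on, tolerance, initial_failures):
    """Return the set of failed components once the cascade has stopped.

    depends_on[c] lists the components that c relies on.
    tolerance[c] is how many failed dependencies c can survive.
    """
    failed = set(initial_failures)
    changed = True
    while changed:
        changed = False
        for component, deps in depends_on.items():
            if component in failed:
                continue
            failed_deps = sum(1 for d in deps if d in failed)
            if failed_deps > tolerance[component]:
                failed.add(component)
                changed = True
    return failed


# Hypothetical system: two redundant pumps feed a cooler that a reactor needs.
DEPENDS_ON = {
    "pump_a": [],
    "pump_b": [],
    "cooler": ["pump_a", "pump_b"],
    "reactor": ["cooler"],
}
# The cooler survives the loss of one pump; the reactor has no slack at all.
TOLERANCE = {"pump_a": 0, "pump_b": 0, "cooler": 1, "reactor": 0}

single = cascade(DEPENDS_ON, TOLERANCE, {"pump_a"})
double = cascade(DEPENDS_ON, TOLERANCE, {"pump_a", "pump_b"})
print(sorted(single))  # ['pump_a']
print(sorted(double))  # ['cooler', 'pump_a', 'pump_b', 'reactor']
```

In this toy system a single pump failure is contained, but two failures that coincide bring down every component, because the reactor has no slack once the cooler is lost.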

Reason’s Swiss Cheese Model

Illustrates that accidents occur when holes in successive layers of defenses, barriers, and safeguards align.
Some holes are due to active failures, while others are due to latent conditions.
Limitations:
- Assumes independence of barriers and randomness in holes lining up.
- Layers of defense are not static, constant, or independent; they can interact, support, or erode one another.
- Doesn’t explain what the holes are, how and why they got there, or how they line up.
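
The independence limitation can be made concrete with a small worked calculation. The sketch below is illustrative only: the hole probabilities and the idea of a single shared condition that opens a hole in every layer at once are assumptions chosen for the arithmetic, not part of Reason's model.

```python
from math import prod


def p_accident_independent(hole_probs):
    """Probability that all layers have a hole at once, if layers are independent."""
    return prod(hole_probs)


def p_accident_common_cause(hole_probs, p_common):
    """Same, but a shared condition (probability p_common) opens every layer at once."""
    return p_common + (1 - p_common) * prod(hole_probs)


# Hypothetical chance of a hole in each of three defensive layers.
LAYERS = [0.1, 0.1, 0.1]

independent = p_accident_independent(LAYERS)
coupled = p_accident_common_cause(LAYERS, p_common=0.01)
print(f"independent layers: {independent:.4f}")  # 0.0010
print(f"with common cause:  {coupled:.4f}")      # 0.0110
```

With these made-up numbers, a shared condition that is present only 1% of the time makes alignment of the holes about eleven times more likely than the independent model predicts, which is why treating the layers as independent can badly understate risk.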

Understanding Dependability

The Concept of Dependability

For most complex socio-technical systems, dependability is the most important property.
It is a judgment about the user’s trust in a system.
It reflects the extent of the user’s confidence that the system will operate as expected and will not ‘fail’ in normal use.
Dependability is defined as "that property of a computer system such that reliance can justifiably be placed on the service it delivers" (Mellor).

Laprie’s Model

Impairments (Faults, Errors, and Failures):
- System failure: When the system does not deliver the service its users expect.
- System error: Where the behavior of the system does not conform to its specification.
- System fault: Incorrect system state not expected by the designers of the system.
- Human error or mistake: Human behavior that results in faults being introduced into a system.
Means (How dependability is achieved):
- Fault avoidance: Preventing the occurrence or introduction of faults.
- Fault tolerance: Delivering correct service even though faults are present.
- Fault removal: Reducing the number or severity of faults.
- Fault forecasting: Estimating the number of faults, future occurrence, and consequences.
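
As one possible illustration of fault tolerance, the sketch below uses majority voting across redundant replicas, a common technique that is not described in these notes. The function name and the readings are hypothetical, and comparing readings for exact equality is a simplification; real sensor values would need a tolerance band.

```python
from collections import Counter


def majority_vote(readings):
    """Return the value reported by a strict majority of replicas.

    Raises ValueError when there is no strict majority, so the caller can
    fall back to a safe state rather than trust a possibly faulty reading.
    """
    if not readings:
        raise ValueError("no readings supplied")
    value, count = Counter(readings).most_common(1)[0]
    if 2 * count <= len(readings):
        raise ValueError("no strict majority among replicas")
    return value


# One of three hypothetical replicas is faulty; the service is still correct.
print(majority_vote([42, 42, 17]))  # 42
```

The design choice worth noting is the explicit failure path: when the replicas disagree with no majority, the function refuses to answer instead of silently picking a value.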
Primary Attributes of Dependability:
- Availability: Ability of the system to deliver services when requested.
- Reliability: Ability of the system to deliver services as specified.
- Safety: Ability of the system to operate without catastrophic failure.
- Security: Ability of the system to protect itself against accidental or deliberate intrusion.
Secondary Attributes of Dependability:
- Timeliness: The ability of the system to respond in a timely way to user requests.
- Survivability: The ability of a system to continue to deliver its services to users in the face of deliberate or accidental attack.
- Recoverability: The ability of the system to recover from user or system errors.
- Maintainability: The ease of repairing the system after a failure has been discovered or changing the system to include new features.
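
Availability is often quantified, outside these notes, as the fraction of time a system is able to deliver service: mean time to failure divided by the sum of mean time to failure and mean time to repair. The sketch below applies that standard steady-state formula; the figures of 999 hours and 1 hour are hypothetical.

```python
def availability(mttf_hours, mttr_hours):
    """Steady-state availability: uptime as a fraction of total time."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(avail, hours_per_year=8760):
    """Expected hours per year during which the service is unavailable."""
    return (1 - avail) * hours_per_year


# Hypothetical system: fails on average every 999 hours, takes 1 hour to repair.
a = availability(mttf_hours=999, mttr_hours=1)
print(f"availability: {a:.3f}")
print(f"expected downtime: {downtime_hours_per_year(a):.2f} hours/year")
```

The same availability can come from rare, long outages or frequent, short ones, so a figure like this says nothing on its own about reliability or safety.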

Key Points on System Errors and Failures

System failures are the result of many compounding factors.
Failures are more likely in complex systems.
Ensuring dependability is crucial for complex systems.