Reliability Engineering – Comprehensive Notes
Software Reliability: Core Ideas
Expectation of Dependability
• Users normally assume all software is dependable.
• Non-critical apps may tolerate occasional failures; critical systems demand very high reliability (medical, telecom/power, aerospace, etc.).
The Fault–Error–Failure Chain (using the wilderness-weather example)
• Human error → Fault injected (adding 1 h to the time without midnight check).
• Fault executed → System error (the clock reaches an invalid time after midnight, e.g., 24:00 instead of rolling over to 00:00).
• Unhandled error → System failure (no data transmitted).
• Not every fault reaches failure because:
– Faulty code might never execute.
– Error might be transient or corrected (recovery/protection).
Fault-Management Strategies
- Fault avoidance – prevent or remove faults during development.
- Fault detection & removal – V&V before deployment.
- Fault tolerance – run-time design so faults do not cause failure.
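The fault in the wilderness-weather example above can be sketched as a one-line bug; the hour-counter representation here is a hypothetical simplification:

```python
# Faulty clock update: adds an hour with no midnight rollover check,
# so 23 becomes 24 -- an invalid hour that later surfaces as a failure.
def advance_hour_faulty(hour: int) -> int:
    return hour + 1           # fault: no midnight check; 23 -> 24 (invalid)

# Fault avoidance: the modulo makes the invalid state unreachable.
def advance_hour_fixed(hour: int) -> int:
    return (hour + 1) % 24    # 23 rolls over to 0

assert advance_hour_faulty(23) == 24   # erroneous system state
assert advance_hour_fixed(23) == 0     # correct rollover
```

Note that the faulty version only misbehaves for `hour == 23`; for all other inputs both versions agree, which is why such faults can survive testing.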
Cost of Late Fault Removal
• Residual-error removal cost escalates rapidly as the number of remaining errors decreases (graph: “Many → Few → Very Few”).
Availability & Reliability
Formal Definitions
• Reliability: the probability of failure-free operation over a specified time, in a given environment, for a specific purpose.
• Availability: the probability that the system, at a point in time, is operational and able to deliver the requested services.
• Example: an availability of 0.999 means the system is up and delivering service for 999 out of every 1,000 time units.
User-Centric Perceptions
• Users seldom read specs; perceived reliability matters more.
• Environment & consequence dependent (e.g., wiper failure in dry vs. wet climate).
• Availability perception depends on:
– Number of users affected (peak vs. night).
– Outage length (many short outages are tolerated better than one long one).
Input/Output View of Systems (I/O mapping diagram)
• Reliability concerns the subset of inputs that drive the program into erroneous states, producing erroneous outputs.
Reliability-in-Use Paradoxes
• Removing a large share of faults (e.g., 60%) may yield only a small reliability gain (~3%), because many residual defects sit on rarely executed paths.
• Users adapt: they avoid flaky features; software with known bugs can still be “reliable enough.”
Reliability Requirements & Metrics
System-Level Reliability Requirements
• Functional: detect/avoid/tolerate faults.
• May cover hardware failure & operator error.
• Non-functional: quantitative targets (failures allowed or availability window).
Metrics
- POFOD – Probability of Failure on Demand (for event-driven demands, e.g., a chemical-plant emergency shutdown).
- ROCOF – Rate of Occurrence of Failure (e.g., 2 failures per 1,000 operational time units).
  • Reciprocal of ROCOF = MTTF (mean time to failure).
- AVAIL – Fraction of time the system is up, taking repair/restart time into account.
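With illustrative numbers (a sketch, not real measurements), the metrics above reduce to simple arithmetic:

```python
# Hypothetical failure log for one system over a test period.
demands = 1000
failures_on_demand = 2
pofod = failures_on_demand / demands        # 0.002: 2 failures per 1000 demands

hours_of_operation = 500.0
failures = 10
rocof = failures / hours_of_operation       # 0.02 failures per hour
mttf = 1 / rocof                            # ~50 hours mean time to failure

uptime, downtime = 990.0, 10.0              # hours; downtime includes repair
avail = uptime / (uptime + downtime)        # 0.99
daily_downtime_min = (1 - avail) * 24 * 60  # ~14.4 minutes per day

print(pofod, rocof, mttf, avail, daily_downtime_min)
```

The last line shows why small availability improvements matter: each extra "9" divides the downtime budget by ten.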
Availability Table (24 h context)
• 0.9 → 144 min down
• 0.99 → 14.4 min
• 0.999 → 84 s
• 0.9999 → 8.4 s (≈1 min/week)
Benefits of Quantitative Specs
• Clarify stakeholder needs, define stop-testing criteria, support design trade-off assessment, and supply certification evidence.
Spec Tactics
• Assign stricter targets to high-cost failures/services.
• Question whether extreme reliability is necessary; alternative mechanisms may suffice.
Illustrative Specifications
ATM Network
• DB availability of 0.9999 (7 a.m.–11 p.m.) → <1 min/week downtime.
• Individual ATM software availability of 0.999 (~1–2 min/day downtime).
Insulin Pump
• Transient fault POFOD < 0.002 (≤1 in 500 demands).
• Permanent fault POFOD < 0.00002 (≤1 failure/year).
Functional Reliability Requirement Examples
• RR1 – range checks (checking).
• RR2 – DB replicas maintained in two separate buildings (recovery + redundancy).
• RR3 – N-version braking control (redundancy).
• RR4 – Safe subset of Ada, static analysis (process).
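A requirement like RR1 can be sketched as a simple guard; the temperature bounds here are assumed for illustration:

```python
# Range check in the spirit of RR1: reject implausible sensor readings
# instead of letting them corrupt downstream state.

VALID_TEMP_RANGE_C = (-40.0, 60.0)   # named real-world constants (assumed bounds)

def accept_reading(celsius: float) -> bool:
    lo, hi = VALID_TEMP_RANGE_C
    return lo <= celsius <= hi

assert accept_reading(21.5)
assert not accept_reading(999.0)     # implausible value is rejected, not stored
```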
Fault-Tolerant Architectures
Rationale
• Mandatory when availability/impact stakes are high; even a system that conforms to its specification needs fault tolerance, because the specification itself may contain errors.
Protection Systems
• Separate emergency controller monitoring environment/control system.
• Act on anomalies (train stop, reactor SCRAM).
• Design goals: redundancy, diversity, simplicity → low POFOD.
Self-Monitoring (Multi-Channel) Architectures
• Identical function executed on diverse channels; comparator checks congruence.
• Discrepancy → failure exception.
• Diversity in HW and SW avoids common-mode errors.
• Used in the Airbus FCS: 5 computers with different processors, chipsets, and languages, plus a functionality partition (primary vs. secondary).
N-Version Programming
• Odd number (usually 3) independent SW versions, majority voting.
• Origin: Triple Modular Redundancy (TMR) in hardware.
• Assumes independent faults; evidence shows correlated spec misunderstandings reduce gain.
• Observed reliability improvement ≈5–9×; cost–benefit must be justified.
Achieving Diversity
• Different languages, design methods, algorithms, teams.
• Problems: cultural similarity, difficult code regions, shared spec errors.
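The majority-voting idea behind N-version programming can be sketched as follows; the three "versions" are stand-ins for independently developed implementations of the same spec:

```python
from collections import Counter

def version_a(x: float) -> float:
    return x * x

def version_b(x: float) -> float:
    return x ** 2

def version_c(x: float) -> float:
    return x * x + 1   # injected fault in one version

def vote(x: float) -> float:
    """Run all three versions and return the majority result."""
    results = [version_a(x), version_b(x), version_c(x)]
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: all versions disagree")
    return value

assert vote(3.0) == 9.0   # two correct versions outvote the faulty one
```

Note the scheme's core assumption, flagged in the text above: if versions share a spec misunderstanding, the majority can agree on the *wrong* answer, and voting provides no protection.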
Dependable Programming Practices
Guideline Overview
- Limit visibility of information
- Validate all inputs
- Handle all exceptions
- Minimize error-prone constructs
- Provide restart capability
- Check array bounds
- Use timeouts on external calls
- Name real-world constants
Key Details & Implications
Visibility Limitation
• Use abstract data types; prevent accidental state corruption.
Input Validation Checks
• Range, size, representation, reasonableness (security + reliability).
• Avoids undefined behaviour under unusual inputs.
Exception Handling Strategies
- Signal caller with exception info.
- Alternative processing (local recovery).
- Delegate to run-time support.
• Provides in-program fault tolerance (separates normal vs. exceptional flow).
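The first two strategies can be sketched in one function; the parse step and fallback value are hypothetical:

```python
class BadRecord(Exception):
    """Signals the caller that a record could not be processed."""

def parse_reading(text: str) -> float:
    try:
        return float(text)
    except ValueError:
        # Strategy 2: alternative processing (local recovery) --
        # a blank field falls back to a neutral default value.
        if text.strip() == "":
            return 0.0
        # Strategy 1: signal the caller with exception information.
        raise BadRecord(f"unparseable reading: {text!r}")
    # Strategy 3 (not shown): catch nothing and let the run-time
    # support system handle the exception.

assert parse_reading("12.5") == 12.5
assert parse_reading("   ") == 0.0
```

The key point is the separation of flows: the normal path is the bare `float(text)`, while all exceptional processing lives in the handler.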
Error-Prone Constructs
• goto, floating-point comparisons, pointers & aliasing, dynamic memory, parallelism, recursion, interrupts, inheritance, unbounded arrays, default input processing.
Restart Capabilities
• Persist forms/state; periodic checkpoints; critical for long interactions.
Array-Bound Checking
• Prevent buffer-overflow vulnerabilities; mandatory in languages without built-in bounds checking (C, C++).
Timeouts on RPC / External Calls
• Detect silent remote failures; trigger fail-over or recovery.
Named Constants
• Avoid magic numbers; single-point update when real-world values change.
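The timeout and named-constant guidelines can be combined in one sketch; `slow_service` is a stand-in for a remote call that has hung:

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as CallTimeout

CALL_TIMEOUT_S = 0.2     # named constant instead of a magic number

def slow_service() -> str:
    time.sleep(1.0)      # simulates a remote service that never answers in time
    return "reply"

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(slow_service)
    try:
        reply = future.result(timeout=CALL_TIMEOUT_S)
    except CallTimeout:
        reply = "fallback"   # fail-over / recovery path

print(reply)
```

Without the timeout, the caller would block indefinitely on a silent remote failure; with it, control returns after `CALL_TIMEOUT_S` and recovery can run.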
Reliability Measurement & Statistical Testing
Data Required
• Number of failures vs. service requests → POFOD.
• Time/transactions between failures → ROCOF, MTTF.
• Repair/restart duration → Availability (includes downtime).
Statistical (Reliability) Testing Process
- Identify operational profiles (usage distribution).
- Prepare matching test data.
- Execute tests, collect failure data.
- Compute observed reliability; iterate until target reached.
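The four steps above can be simulated end to end; the operational profile, failure rate, and input classes here are all assumed for illustration:

```python
import random

random.seed(42)

# Step 1: assumed operational profile -- 90% routine requests, 10% rare ones.
PROFILE = [("routine", 0.9), ("rare", 0.1)]

def system_under_test(kind: str) -> bool:
    """Returns True on success; a defect hides on the rare path."""
    if kind == "rare":
        return random.random() > 0.2   # 20% failure rate on rare inputs
    return True

def draw_input() -> str:
    # Step 2: generate test data matching the profile.
    kinds, weights = zip(*PROFILE)
    return random.choices(kinds, weights=weights, k=1)[0]

# Steps 3-4: execute tests, collect failures, compute observed reliability.
demands = 10_000
failures = sum(1 for _ in range(demands)
               if not system_under_test(draw_input()))
pofod = failures / demands
print(f"observed POFOD ≈ {pofod:.4f}")   # expected around 0.1 * 0.2 = 0.02
```

This also illustrates the operational-profile issues listed below: if the real usage distribution differs from `PROFILE`, the estimated POFOD is biased, and for highly reliable systems the failure count stays too small for statistical confidence.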
Operational Profile Issues
• Uncertainty & evolution over time.
• High cost of generating realistic test data (esp. unlikely inputs).
• Highly reliable systems make statistical significance hard (few failures).
• Failure recognition ambiguity (spec interpretation).
Ethical, Practical & Certification Implications
- Quantifying reliability guides ethical responsibility for safety-critical software.
- Regulators rely on evidence (measured availability, POFOD) for certification (e.g., avionics, medical devices).
- Over-engineering reliability can waste resources; under-specifying endangers lives/economies.
Synthesis of Key Points
- Reliability stems from three pillars: avoiding, removing, and tolerating faults.
- Quantitative metrics (POFOD, ROCOF/MTTF, AVAIL) enable measurable requirements, stopping rules for testing, and trade-off evaluation.
- Architectural redundancy (protection, self-monitoring, N-version) and diversity combat single-point or common-mode failures.
- Dependable programming embeds defensive checks, structured exception handling, and safe language/construct choices.
- Statistical testing with realistic operational profiles validates that implemented reliability meets or exceeds targets.