Reliability Engineering – Comprehensive Notes

Software Reliability: Core Ideas

  • Expectation of Dependability
    • Users normally assume all software is dependable.
    • Non-critical apps may tolerate occasional failures; critical systems demand very high reliability (medical, telecom/power, aerospace, etc.).

  • The Fault–Error–Failure Chain (using the wilderness-weather example)
    • Human error → Fault injected (one hour added to the clock with no midnight-rollover check).
    • Fault executed → System error (clock reads an invalid time such as 24.XX).
    • Unhandled error → System failure (no data transmitted).
    • Not every fault reaches failure because:
    – Faulty code might never execute.
    – Error might be transient or corrected (recovery/protection).

  • Fault-Management Strategies

    1. Fault avoidance – prevent or remove faults during development.
    2. Fault detection & removal – V&V before deployment.
    3. Fault tolerance – run-time design so faults do not cause failure.
  • Cost of Late Fault Removal
    • Residual-error removal cost escalates rapidly as the number of remaining errors decreases (graph: “Many → Few → Very Few”).


Availability & Reliability

  • Formal Definitions
    • Reliability: P(no failure over time t | environment, purpose).
    • Availability: P(system operational at time t).
    • Example: AVAIL = 0.999 → 99.9% uptime.

  • User-Centric Perceptions
    • Users seldom read specs; perceived reliability matters more.
    • Environment & consequence dependent (e.g., wiper failure in dry vs. wet climate).
    • Availability perception depends on:
    – Number of users affected (peak vs. night).
    – Outage length (many short outages are usually less disruptive than one long one).

  • Input/Output View of Systems (I/O mapping diagram)
    • Reliability concerns the subset of inputs that drive the program into erroneous states, producing erroneous outputs.

  • Reliability-in-Use Paradoxes
    • Removing X% of faults ≠ X% reliability gain (remaining defects often sit on rarely executed paths).
    • Users adapt: they avoid flaky features; software with known bugs can still be “reliable enough.”


Reliability Requirements & Metrics

  • System-Level Reliability Requirements
    • Functional: detect/avoid/tolerate faults.
    • May cover hardware failure & operator error.
    • Non-functional: quantitative targets (failures allowed or availability window).

  • Metrics

    1. POFOD – Probability of Failure on Demand (event-driven, e.g., chemical shutdown).
    2. ROCOF – Rate of Occurrence of Failure (e.g., 0.002 failures/hr).
      • Reciprocal: MTTF = 1/ROCOF (here, 500 hr between failures).
    3. AVAIL – Fraction of uptime including repair time.
  • Availability Table (24 h context)
    0.9 → 144 min down
    0.99 → 14.4 min
    0.999 → 1.44 min (≈86 s)
    0.9999 → 8.6 s (≈1 min/week)
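The downtime figures and the MTTF relation above can be reproduced with a short calculation (a sketch; the function names are illustrative, not from the notes):

```python
# Downtime implied by an availability target, and MTTF as the reciprocal of ROCOF.
def downtime_seconds(availability: float, window_hours: float = 24.0) -> float:
    """Seconds of allowed downtime per window at the given availability."""
    return (1.0 - availability) * window_hours * 3600.0

def mttf_hours(rocof_per_hour: float) -> float:
    """Mean time to failure as the reciprocal of ROCOF."""
    return 1.0 / rocof_per_hour

print(round(downtime_seconds(0.999), 1))   # ≈86.4 s per 24 h day
print(round(mttf_hours(0.002)))            # ≈500 h between failures
```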

  • Benefits of Quantitative Specs
    • Clarify stakeholder needs, stop-testing criteria, design trade-off assessment, and certification evidence.

  • Spec Tactics
    • Assign stricter targets to high-cost failures/services.
    • Question whether extreme reliability is necessary; alternative mechanisms may suffice.

Illustrative Specifications
  • ATM Network
    • DB availability ≈ 0.9999 (7 a.m.–11 p.m.) → <1 min/week downtime.
    • Individual ATM software availability ≈ 0.999 (≈1–2 min/day downtime).

  • Insulin Pump
    • Transient fault POFOD ≤ 0.002 (≤1 failure in 500 demands).
    • Permanent fault POFOD < 0.00002 (≤1/year).

  • Functional Reliability Requirement Examples
    • RR1 – range checks (checking).
    • RR2 – DB replicas on two buildings (recovery + redundancy).
    • RR3 – N-version braking control (redundancy).
    • RR4 – Safe subset of Ada, static analysis (process).


Fault-Tolerant Architectures

  • Rationale
    • Mandatory when availability/impact stakes are high; even spec-correct systems need tolerance (spec errors possible).

  • Protection Systems
    • Separate emergency controller monitoring environment/control system.
    • Act on anomalies (train stop, reactor SCRAM).
    • Design goals: redundancy, diversity, simplicity → low POFOD.

  • Self-Monitoring (Multi-Channel) Architectures
    • Identical function executed on diverse channels; comparator checks congruence.
    • Discrepancy → failure exception.
    • Diversity in HW and SW avoids common-mode errors.
    • Used in Airbus FCS: 5 computers, different processors, chipsets, languages, functionality partition (primary vs. secondary).
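The comparator idea can be sketched in a few lines (hypothetical names; real systems use diverse hardware and software channels, here two algorithms stand in for them):

```python
# Two "channels" compute the same function via diverse implementations;
# a comparator raises a failure exception if their outputs disagree.
class ChannelMismatch(Exception):
    pass

def channel_a(x: float) -> float:
    return x * x          # direct multiplication

def channel_b(x: float) -> float:
    return x ** 2         # diverse implementation of the same function

def self_checking_square(x: float, tolerance: float = 1e-9) -> float:
    a, b = channel_a(x), channel_b(x)
    if abs(a - b) > tolerance:
        raise ChannelMismatch(f"channels disagree: {a} vs {b}")
    return a

print(self_checking_square(3.0))  # 9.0
```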

  • N-Version Programming
    • Odd number (usually 3) independent SW versions, majority voting.
    • Origin: Triple Modular Redundancy (TMR) in hardware.
    • Assumes independent faults; evidence shows correlated spec misunderstandings reduce gain.
    • Observed reliability improvement ≈5–9×; cost–benefit must be justified.
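The majority-voting step can be sketched as follows (a toy voter, not a production design; in practice the three versions would be independently developed programs):

```python
# N-version voter: the majority value among independently computed
# results wins; no majority is itself treated as a failure.
from collections import Counter

class NoMajority(Exception):
    pass

def vote(results):
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise NoMajority(f"no majority among {results}")
    return value

# One faulty version is outvoted by the other two.
print(vote([42, 42, 41]))  # 42
```

Note that correlated faults (all versions misreading the same spec) defeat the voter, which is why the observed gain is far below what independence would predict.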

  • Achieving Diversity
    • Different languages, design methods, algorithms, teams.
    • Problems: cultural similarity, difficult code regions, shared spec errors.


Dependable Programming Practices

  • Guideline Overview
    1. Limit visibility of information
    2. Validate all inputs
    3. Handle all exceptions
    4. Minimize error-prone constructs
    5. Provide restart capability
    6. Check array bounds
    7. Use timeouts on external calls
    8. Name real-world constants

Key Details & Implications
  • Visibility Limitation
    • Use abstract data types; prevent accidental state corruption.
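A minimal sketch of the idea (illustrative class and bounds — the check uses recorded terrestrial temperature extremes as an example invariant): state changes go through one method that enforces the invariant, so callers cannot silently corrupt it.

```python
# Abstract data type limiting visibility of its internal state.
class Temperature:
    def __init__(self, celsius: float):
        self._celsius = None          # internal state, not part of the API
        self.set(celsius)

    def set(self, celsius: float) -> None:
        if not -89.2 <= celsius <= 56.7:   # plausibility invariant
            raise ValueError(f"implausible temperature: {celsius}")
        self._celsius = celsius

    def get(self) -> float:
        return self._celsius

t = Temperature(21.5)
print(t.get())  # 21.5
```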

  • Input Validation Checks
    • Range, size, representation, reasonableness (security + reliability).
    • Avoids undefined behaviour under unusual inputs.
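The listed checks can be applied in order before any processing (a sketch; the function name and limits are illustrative):

```python
# Input validation: size, representation, and range/reasonableness checks.
def parse_reading(raw: str) -> float:
    if len(raw) > 16:                      # size check
        raise ValueError("input too long")
    try:
        value = float(raw)                 # representation check
    except ValueError:
        raise ValueError(f"not a number: {raw!r}")
    if not 0.0 <= value <= 100.0:          # range / reasonableness check
        raise ValueError(f"out of range: {value}")
    return value

print(parse_reading("42.5"))  # 42.5
```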

  • Exception Handling Strategies

    1. Signal caller with exception info.
    2. Alternative processing (local recovery).
    3. Delegate to run-time support.
      • Provides in-program fault tolerance (separates normal vs. exceptional flow).
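All three strategies can appear in one function (a sketch with an illustrative config-loading example): recover locally where an alternative exists, signal the caller with exception information where it does not, and let anything else propagate to the run-time support.

```python
import json

def read_config(path: str) -> dict:
    try:
        with open(path) as f:
            return json.load(f)               # normal flow
    except FileNotFoundError:
        return {"mode": "defaults"}           # strategy 2: local recovery
    except PermissionError as e:              # strategy 1: signal caller
        raise RuntimeError(f"config unreadable: {path}") from e
    # any other exception propagates to the run-time support (strategy 3)

print(read_config("no-such-file.json"))  # {'mode': 'defaults'}
```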
  • Error-Prone Constructs
    • goto, floating-point comparisons, pointers & aliasing, dynamic memory, parallelism, recursion, interrupts, inheritance, unbounded arrays, default input processing.

  • Restart Capabilities
    • Persist forms/state; periodic checkpoints; critical for long interactions.
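A minimal checkpoint/restart sketch (the file name and state shape are hypothetical): progress is persisted after each unit of work so a crashed run resumes where it stopped rather than starting over.

```python
import json
import os

CHECKPOINT = "task.checkpoint.json"

def load_state() -> dict:
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)        # resume from last checkpoint
    return {"next_item": 0}            # fresh start

def save_state(state: dict) -> None:
    with open(CHECKPOINT, "w") as f:
        json.dump(state, f)            # persist progress

state = load_state()
for i in range(state["next_item"], 10):
    # ... process item i ...
    state["next_item"] = i + 1
    save_state(state)                  # checkpoint after each item

print(load_state()["next_item"])       # 10 after a complete run
```

A production version would write to a temporary file and rename it, so a crash mid-write cannot corrupt the checkpoint.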

  • Array-Bound Checking
    • Prevent buffer-overflow vulnerabilities; checks must be written explicitly in languages that do not check bounds (C, C++).
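Python raises IndexError on out-of-range access, but an explicit check lets the program handle the fault instead of failing (illustrative helper; it also rejects negative indices, which Python would otherwise wrap around):

```python
# Explicit bounds check before indexing: a handled fault, not a crash.
def safe_get(buf, i, default=None):
    if 0 <= i < len(buf):
        return buf[i]
    return default   # out-of-range access handled gracefully

print(safe_get([10, 20, 30], 5))  # None
```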

  • Timeouts on RPC / External Calls
    • Detect silent remote failures; trigger fail-over or recovery.
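One way to impose a deadline on an external call (a sketch with hypothetical names; the sleeping function stands in for a hung RPC): run the call in a worker and give up after the timeout instead of blocking forever.

```python
# Timeout on an external call via a worker thread and a bounded wait.
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

def slow_remote_call():                 # stands in for a silently hung RPC
    time.sleep(1.0)
    return "reply"

def call_with_timeout(fn, timeout_s: float):
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn).result(timeout=timeout_s)
    except TimeoutError:
        return None                     # detected: trigger fail-over/recovery
    finally:
        pool.shutdown(wait=False)

print(call_with_timeout(slow_remote_call, 0.1))  # None
```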

  • Named Constants
    • Avoid magic numbers; single-point update when real-world values change.


Reliability Measurement & Statistical Testing

  • Data Required
    • Number of failures vs. service requests → POFOD.
    • Time/transactions between failures → ROCOF, MTTF.
    • Repair/restart duration → Availability (includes downtime).

  • Statistical (Reliability) Testing Process

    1. Identify operational profiles (usage distribution).
    2. Prepare matching test data.
    3. Execute tests, collect failure data.
    4. Compute observed reliability; iterate until target reached.
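Step 4 can be sketched as a simulation (all numbers and names are illustrative; a real operational profile comes from measured usage, not a uniform distribution):

```python
# Estimating POFOD from statistical test results drawn from a profile.
import random

random.seed(1)

def system_under_test(demand: int) -> bool:
    """True = correct response; one rare input class fails (injected fault)."""
    return demand % 1000 != 0          # fails on 1 in 1000 demand classes

def estimate_pofod(n_demands: int) -> float:
    profile = (random.randint(0, 9999) for _ in range(n_demands))
    failures = sum(1 for d in profile if not system_under_test(d))
    return failures / n_demands

print(estimate_pofod(100_000))  # close to the true POFOD of 0.001
```

Note how many demands are needed to see even ~100 failures at POFOD 0.001 — this is the statistical-significance problem listed below for highly reliable systems.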
  • Operational Profile Issues
    • Uncertainty & evolution over time.
    • High cost of generating realistic test data (esp. unlikely inputs).
    • Highly reliable systems make statistical significance hard (few failures).
    • Failure recognition ambiguity (spec interpretation).


Ethical, Practical & Certification Implications

  • Quantifying reliability guides ethical responsibility for safety-critical software.
  • Regulators rely on evidence (measured availability, POFOD) for certification (e.g., avionics, medical devices).
  • Over-engineering reliability can waste resources; under-specifying endangers lives/economies.

Synthesis of Key Points

  • Reliability stems from three pillars: avoiding, removing, and tolerating faults.
  • Quantitative metrics (POFOD, ROCOF/MTTF, AVAIL) enable measurable requirements, stopping rules for testing, and trade-off evaluation.
  • Architectural redundancy (protection, self-monitoring, N-version) and diversity combat single-point or common-mode failures.
  • Dependable programming embeds defensive checks, structured exception handling, and safe language/construct choices.
  • Statistical testing with realistic operational profiles validates that implemented reliability meets or exceeds targets.