Chapter 11 – Reliability Engineering (Ian Sommerville, 10th Ed.)

A set of question-and-answer flashcards covering definitions, metrics, requirements, architectures, dependable programming practices, and measurement techniques from Chapter 11 (Reliability Engineering).

44 Terms

What is the definition of software reliability?

The probability of failure-free system operation for a specified time, in a given environment, for a given purpose.

How is availability defined?

The probability that a system, at a point in time, is operational and able to deliver requested services.

What does an availability of 0.999 mean?

The system is up and running 99.9 % of the time (about 86 s of downtime per 24 h).
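
As a rough check of the arithmetic, the downtime implied by an availability figure is (1 - availability) × period. A minimal Python sketch follows; the function name `downtime_seconds` is illustrative and not taken from the chapter.

```python
def downtime_seconds(availability: float, period_seconds: float) -> float:
    """Expected downtime within a period for a given availability (0 to 1)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_seconds


DAY = 24 * 60 * 60  # seconds in 24 hours

# Print the implied daily downtime for a few common availability levels.
for a in (0.9, 0.99, 0.999, 0.9999):
    print(f"availability {a}: {downtime_seconds(a, DAY):.1f} s of downtime per day")
```

For an availability of 0.999 this gives 86.4 s per day, since 0.001 × 86,400 s = 86.4 s.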

In reliability engineering, what is a human error or mistake?

Human behaviour that introduces faults into a system during development or operation.

Define a system fault.

A characteristic of a software system that can lead to a system error (e.g., defective code).

What is a system error?

An erroneous system state that can lead to behaviour unexpected by users.

Define system failure.

An event where the system does not deliver the service expected by its users at some point in time.

Do all faults cause failures? Why or why not?

No. Faulty code may never be executed or an erroneous state may be corrected before failure occurs.

Name the three broad strategies for reliability achievement.

Fault avoidance, fault detection & removal, and fault tolerance.

What is fault avoidance?

Developing a system so that human error is avoided and system faults are minimised before delivery.

Explain fault tolerance.

Designing the system so that faults do not lead to system errors or errors do not cause failures during operation.

Why are the costs of residual fault removal high late in development?

Because fewer faults remain and each is harder to find and fix, increasing cost per detected error.

Why can reliability be formally defined only with respect to a specification?

Because a failure is a deviation from the specified behaviour; without a spec, failure cannot be objectively stated.

Why is perceived reliability often more important than formal reliability?

Users rarely read specifications and judge reliability based on their experience and the consequences of failures.

Give two factors that simple availability percentages ignore.

(1) Number of users affected by an outage; (2) Length of the outage.

Why does removing X % of faults not guarantee X % reliability improvement?

Remaining faults may reside in frequently executed code, while removed faults may be in rarely used paths.
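
A small numeric illustration may help here. The sketch below assumes, purely for illustration, that each fault contributes an independent, additive share of the overall failure rate; the fault names and rates are invented.

```python
# Hypothetical failure-rate contribution of each fault (failures per 1000 hours).
fault_rates = {
    "rare_path_a": 0.1,
    "rare_path_b": 0.1,
    "rare_path_c": 0.2,
    "hot_loop": 9.6,  # fault in frequently executed code
}


def remaining_rate(rates: dict, removed: set) -> float:
    """Total failure rate after removing the named faults (additive model)."""
    return sum(rate for name, rate in rates.items() if name not in removed)


removed = {"rare_path_a", "rare_path_b", "rare_path_c"}
before = remaining_rate(fault_rates, set())
after = remaining_rate(fault_rates, removed)

faults_removed = len(removed) / len(fault_rates)  # fraction of faults removed
rate_improvement = 1 - after / before             # fraction of failures removed

print(f"faults removed: {faults_removed:.0%}, failure rate reduced by: {rate_improvement:.0%}")
```

Under these assumed numbers, removing 75 % of the faults cuts the failure rate by only about 4 %, because the one remaining fault sits on the most heavily used path.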

List the three main reliability metrics.

Probability of Failure on Demand (POFOD), Rate Of Occurrence Of Failures (ROCOF)/Mean Time To Failure (MTTF), and Availability (AVAIL).

When is POFOD the most appropriate metric?

For systems with intermittent, infrequent demands where failures have serious consequences (e.g., emergency shutdown).

Interpret a ROCOF value of 0.002.

About 2 failures are expected in 1000 operational time units (e.g., hours).

What is the relationship between ROCOF and MTTF?

MTTF is the reciprocal of ROCOF (MTTF = 1 ⁄ ROCOF).
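
The reciprocal relationship can be checked with a few lines of Python. This is a minimal sketch; the helper names `rocof` and `mttf` are illustrative.

```python
def rocof(failures: int, operating_time: float) -> float:
    """Rate of occurrence of failures: failures per unit of operating time."""
    if operating_time <= 0:
        raise ValueError("operating_time must be positive")
    return failures / operating_time


def mttf(failure_rate: float) -> float:
    """Mean time to failure, taken as the reciprocal of ROCOF."""
    if failure_rate <= 0:
        raise ValueError("failure_rate must be positive")
    return 1.0 / failure_rate


rate = rocof(failures=2, operating_time=1000)  # 2 failures in 1000 hours
print(f"ROCOF = {rate}, MTTF = {mttf(rate):.0f} hours")
```

With 2 failures in 1000 hours, ROCOF is 0.002 and the MTTF is 500 hours.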

For which kind of systems is high availability (e.g., 0.998+) most relevant?

Continuously running, non-stop systems like telephone switching or railway signalling.

State one benefit of specifying quantitative reliability requirements.

They provide an objective criterion for deciding when to stop testing.

Why might ATM designers prioritise availability over reliability?

Database mechanisms can correct transaction problems; keeping machines operational for customers is more critical.

What availability was specified for an ATM network’s database service?

About 0.9999 between 7 am and 11 pm, which is less than 1 minute of downtime per week (16 h × 7 days × 0.0001 ≈ 40 s).

What is an example POFOD requirement for an insulin pump transient failure?

POFOD = 0.002 for transient failures that the user can correct (for example, by recalibration): no more than 1 failure in 500 demands.

Name the four categories of functional reliability requirements.

Checking, Recovery, Redundancy, and Process requirements.

Give an example of a redundancy requirement from the notes.

"Copies of the patient database shall be maintained on two separate servers not housed in the same building."

What is the purpose of a protection system?

To independently monitor a controlled system and take emergency action (e.g., shut-down) if a problem is detected.
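
The idea can be sketched in Python as a small monitor that sits apart from the control logic. Everything here (the class name, the callbacks, the 100.0 limit, and the sensor readings) is hypothetical and only illustrates the pattern.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProtectionSystem:
    """Monitors a sensor independently of the control system."""
    read_sensor: Callable[[], float]   # reads the monitored value
    shut_down: Callable[[], None]      # emergency action
    limit: float                       # safe upper bound for the value
    tripped: bool = False

    def check(self) -> bool:
        """Trip once when the reading exceeds the limit; stay tripped after."""
        if not self.tripped and self.read_sensor() > self.limit:
            self.tripped = True
            self.shut_down()
        return self.tripped


# Simulated sensor readings and a recorded shutdown action.
readings = iter([80.0, 95.0, 120.0, 90.0])
events = []
monitor = ProtectionSystem(
    read_sensor=lambda: next(readings),
    shut_down=lambda: events.append("shutdown"),
    limit=100.0,
)
results = [monitor.check() for _ in range(4)]
print(results, events)
```

The monitor latches once tripped, so a later in-range reading does not silently cancel the emergency action.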

Why should protection systems be diverse?

To avoid common-mode failures by using different technologies from the control system.

Describe a self-monitoring (multi-channel) architecture.

Multiple diverse channels perform the same computation; outputs are compared, and discrepancies raise a failure exception.
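
A minimal Python sketch of the comparison step, assuming numeric outputs and a small tolerance. The channel functions and the `ChannelMismatch` exception are invented for illustration; in a real system the channels would run on diverse hardware and software.

```python
class ChannelMismatch(Exception):
    """Raised when the channels disagree, signalling a failure."""


def self_monitoring(channels, value, tolerance=1e-9):
    """Run every channel on the same input and compare the outputs."""
    if len(channels) < 2:
        raise ValueError("at least two channels are needed for comparison")
    results = [channel(value) for channel in channels]
    first = results[0]
    if any(abs(r - first) > tolerance for r in results[1:]):
        raise ChannelMismatch(f"channel outputs differ: {results}")
    return first


# Two differently written Celsius-to-Fahrenheit conversions, plus a faulty one.
def channel_a(celsius):
    return celsius * 9 / 5 + 32


def channel_b(celsius):
    return celsius * 1.8 + 32


def faulty_channel(celsius):
    return celsius * 1.8 + 23  # deliberate bug: digits transposed


print(self_monitoring([channel_a, channel_b], 100.0))
```

Passing `faulty_channel` in place of `channel_b` raises `ChannelMismatch`, which the surrounding system could treat as the failure exception described on the card.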

How many computers are in the Airbus flight-control system, and why use diversity?

Five; diversity of processors, chipsets, languages, and teams reduces common-mode failures.

What is N-version programming?

Running an odd number (usually three) of independently developed software versions in parallel and voting on results.
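
The voting step can be sketched as a simple majority vote. In this Python sketch the three "versions" are toy functions run sequentially rather than in parallel, and `NoMajorityError` is an invented name.

```python
from collections import Counter


class NoMajorityError(Exception):
    """Raised when no output is produced by a strict majority of versions."""


def vote(outputs):
    """Return the output produced by more than half of the versions."""
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise NoMajorityError(f"no majority among outputs: {outputs}")
    return value


# Three independently written "largest element" versions; one is faulty.
def version_a(xs):
    return max(xs)


def version_b(xs):
    return sorted(xs)[-1]


def version_c(xs):
    return xs[0]  # deliberate bug: returns the first element


data = [3, 9, 4]
result = vote([v(data) for v in (version_a, version_b, version_c)])
print(result)
```

With three versions, a single faulty version is outvoted; if all three disagree, the voter reports that no majority exists rather than guessing.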

Why can design diversity fail to deliver full independence?

Teams share similar cultures and may misinterpret specifications in the same way, leading to common errors.

State two strategies to promote software diversity.

Use different programming languages and different algorithms/design methods.

List any four dependable programming guidelines.

Any four of: (1) limit the visibility of information; (2) check all inputs for validity; (3) provide a handler for all exceptions; (4) minimise the use of error-prone constructs; (5) provide restart capabilities; (6) check array bounds; (7) include timeouts when calling external components; (8) name all constants that represent real-world values.

Why should visibility of data be limited in a program?

To prevent accidental corruption of program state by components that don't need access.

Give two examples of validity checks on inputs.

Range checks and representation checks (e.g., no numerals in a name field).
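
Both kinds of check are easy to express in code. A minimal Python sketch, with illustrative function names and example limits:

```python
def in_range(value: float, low: float, high: float) -> bool:
    """Range check: the value must lie within [low, high]."""
    return low <= value <= high


def is_valid_name(name: str) -> bool:
    """Representation check: a name must be non-empty and contain no digits."""
    return bool(name.strip()) and not any(ch.isdigit() for ch in name)


print(in_range(37.2, 30.0, 45.0))     # plausible body temperature in Celsius
print(is_valid_name("Ada Lovelace"))  # letters and a space only
print(is_valid_name("R2D2"))          # rejected: contains numerals
```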

Name three error-prone language constructs.

Unconditional branches (goto), unchecked pointers, and dynamic memory allocation. Others include recursion, parallelism, and interrupts.

Why include timeouts when calling external components?

To detect ‘silent’ failures of remote services and allow the system to recover after a defined period.
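
One way to sketch this in Python is to run the external call in a worker thread and stop waiting after a fixed period. The helper `call_with_timeout`, the simulated services, and the fallback value are all assumptions made for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def call_with_timeout(fn, timeout_s, fallback):
    """Call fn(); if no result arrives within timeout_s, return fallback."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The remote component is assumed to have failed silently.
        return fallback
    finally:
        # Do not block on the worker; note the thread itself is not killed.
        pool.shutdown(wait=False)


def slow_service():
    time.sleep(0.3)  # simulates a remote service that does not answer in time
    return "ok"


def fast_service():
    return "ok"


print(call_with_timeout(fast_service, 0.05, "service unavailable"))
print(call_with_timeout(slow_service, 0.05, "service unavailable"))
```

The caller regains control after the timeout and can then retry, switch to a backup, or report the problem, which is the recovery step the card refers to.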

What operational data is needed to measure POFOD?

Number of service requests and number of failures on those requests.
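
Given those two counts, a point estimate of POFOD is simply failures divided by demands. A minimal Python sketch follows; the function names are illustrative, and the 0.002 target is used only as an example threshold.

```python
def estimate_pofod(failures: int, demands: int) -> float:
    """Point estimate of the probability of failure on demand."""
    if demands <= 0:
        raise ValueError("demands must be positive")
    if not 0 <= failures <= demands:
        raise ValueError("failures must be between 0 and demands")
    return failures / demands


def meets_target(failures: int, demands: int, target_pofod: float) -> bool:
    """True when the observed estimate does not exceed the target."""
    return estimate_pofod(failures, demands) <= target_pofod


print(estimate_pofod(failures=3, demands=2000))                 # observed estimate
print(meets_target(failures=3, demands=2000, target_pofod=0.002))
```

A point estimate says nothing about statistical confidence, so a small number of demands gives only weak evidence either way.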

Why is statistical (reliability) testing separate from defect testing?

It requires input data reflecting real operational profiles, whereas defect testing often uses atypical or extreme inputs.

Define an operational profile.

A specification of the classes of input and their probability of occurrence in normal use; test data for reliability testing is generated so that its input frequencies match this profile.
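
In code, a profile can be represented as input classes with expected usage probabilities, and test inputs drawn to match them. The Python sketch below uses an invented profile for a cash machine; the class names and probabilities are hypothetical.

```python
import random
from collections import Counter

# Hypothetical operational profile: input class -> probability in normal use.
profile = {
    "balance_enquiry": 0.50,
    "withdrawal": 0.40,
    "transfer": 0.09,
    "pin_change": 0.01,
}


def generate_tests(profile: dict, n: int, seed: int = 0) -> list:
    """Draw n test input classes with frequencies that follow the profile."""
    rng = random.Random(seed)  # seeded so the test set is reproducible
    classes = list(profile)
    weights = list(profile.values())
    return rng.choices(classes, weights=weights, k=n)


tests = generate_tests(profile, n=1000)
print(Counter(tests))
```

Rarely used classes such as the assumed `pin_change` appear only a handful of times, which is the intent: statistical testing exercises the system the way users are expected to.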

List two problems in reliability measurement.

(1) Operational profile may not match real use; (2) Highly reliable systems rarely fail, so many tests are needed for statistical significance.

Summarise the three pillars of software reliability from the key points.

Fault avoidance, fault detection/removal, and fault tolerance.