Errors, Failures and Risks in Computer Systems

Failures and Errors in Computer Systems

  • Applications are complex, making error-free programs "impossible."

  • Applications are used in unintended ways.

  • Failures stem from multiple factors.

  • Study failures to:

    • Avoid future errors.

    • Understand the impacts of poor work.

Impact of System Failures

  • Doctors Without Borders (Afghanistan, Oct 2015):

    • Air Force AC-130 gunship mistakenly hit MSF Hospital.

    • 30 dead, 34 injured in 30-minute attack.

    • Contributing factors:

      • Networking failure (loss of video transmission, email, and electronic messages).

      • Information systems failed to alert that the target was on a no-strike list.

  • AirAsia Flight 8501:

    • 162 fatalities.

    • Faulty flight computer sent false warnings (possibly due to a loosely soldered wire).

    • Pilot removed circuit breaker to reset computer (following maintenance precedent).

    • Side-effect: disengaged autopilot and auto-thrust.

    • Aircraft climbed rapidly, stalled, and crashed.

Service System Failures

  • AT&T:

    • Lost voice and data service.

    • Cause: software error in a four-million-line program.

    • A three-line change was not tested after implementation.

  • Galaxy IV Satellite Failure:

    • Pager service interrupted for 85% of the U.S.

    • Weather information delays impacted flights.

    • Credit card verification failures for some store chains.

  • Amtrak Reservation and Ticketing System Failure:

    • Occurred during Thanksgiving weekend.

    • Caused delays due to lack of printed schedules and fare lists.

Unintended Acceleration Incidents

  • Drivers blamed accidents on unintended acceleration due to electronic throttle flaws.

  • NASA study found no fault with the electronic throttle system.

  • Event recorders indicated some drivers stepped on the accelerator instead of the brake.

  • Some accidents involved pedals caught on floor mats.

  • A high percentage of drivers involved were 65 or older.

  • Software experts criticized the source code for the electronic throttle, suggesting it could cause unintended acceleration in some circumstances.

  • Questions raised:

    • What level of proof is necessary to win a lawsuit against the car company?

    • What are the implications for other products with complex software systems?

Problems for Individuals

  • Billing errors.

  • Poor information for law enforcement.

  • Inaccurate and misinterpreted data in databases.

  • Challenges due to large populations with shared names.

  • Automated processing may struggle with special cases.

  • Overconfidence in data accuracy.

  • Errors in data entry.

  • Lack of accountability for errors.
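
The shared-name problem listed above can be made concrete with a minimal sketch. The record layout, names, and data below are hypothetical. The sketch only shows that matching on a name alone returns every person with that name, while also requiring a second field narrows the result.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Person:
    """Hypothetical record layout; the field names are illustrative only."""
    name: str
    birth_date: date
    postal_code: str


def match_by_name(records, query):
    """Naive matching: everyone who shares the name is returned."""
    return [r for r in records if r.name.lower() == query.name.lower()]


def match_by_several_fields(records, query):
    """Stricter matching: name and birth date must both agree."""
    return [
        r for r in records
        if r.name.lower() == query.name.lower()
        and r.birth_date == query.birth_date
    ]


# Made-up sample data.
records = [
    Person("Maria Garcia", date(1980, 5, 17), "80014"),
    Person("Maria Garcia", date(1992, 11, 2), "80203"),
    Person("John Smith", date(1975, 1, 30), "80014"),
]
query = Person("maria garcia", date(1992, 11, 2), "80203")

name_only = match_by_name(records, query)        # two different people
strict = match_by_several_fields(records, query)  # one person
```

Even the stricter match can fail when a field was entered incorrectly, which is why the notes also list data-entry errors and overconfidence in data accuracy.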

Airport System Failures

  • Hong Kong’s Chek Lap Kok Airport:

    • Misrouting of cleaning crews, fuel trucks, baggage, passengers, and cargo.

  • Kuala Lumpur:

    • Manual boarding passes and luggage handling.

    • Flight delays.

    • Food cargo spoilage.

    • System failures attributed to inadequate consideration of user input errors.

Denver Airport

  • Baggage system failure due to real-world issues, problems in other systems, and software errors.

  • Causes:

    • Insufficient development time.

    • Significant specification changes after project commencement.

General System Failure Causes

  • Lack of clear goals and specifications.

  • Poor management and communication between stakeholders.

  • Institutional and political pressures leading to low bids and underestimated timeframes.

  • Use of new technology with unknown reliability.

  • Refusal to acknowledge project problems.

Legacy Systems

  • Reliable but inflexible.

  • Expensive to replace.

  • Limited or no documentation.

Abandoned Systems Development

  • Extreme flaws lead to discarding systems after significant investment.

  • Examples:

    • FBI Virtual Case File System ($170M).

    • British Retailer J Sainsbury ($526M).

    • Ford Motor Purchasing System ($400M).

    • California Child-Support System ($100M).

Voting Systems

  • Technical Failures:

    • Machines failing to count votes (e.g., North Carolina).

    • Memory capacity issues causing vote loss.

    • Programming errors generating extra votes or misallocating votes.

  • Risks:

    • Software rigging by programmers or hackers.

    • Vulnerability to viruses.

Ariane 5 Rocket

  • Software reuse from an older rocket model.

  • Newer rocket was faster, leading to an error in code converting a 64-bit floating-point number to a 16-bit signed integer.

  • Error: 64-bit float → 16-bit signed integer.

  • Rocket veered off course less than 40 seconds after launch.

  • Destroyed as a safety precaution.

  • Importance of re-examining specifications, design, implications, risks, and retesting software in new environments.
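
A minimal sketch of the conversion problem, assuming Python and hypothetical function names. It is not the rocket's flight code. It only shows that a value which fits easily in a 64-bit float can fall outside the range of a 16-bit signed integer (-32768 to 32767), first with a conversion that wraps silently and then with one that checks the range.

```python
import struct

INT16_MIN, INT16_MAX = -32768, 32767


def unchecked_to_int16(value: float) -> int:
    """Keep only the low 16 bits of the truncated value, as a wrapping
    conversion would. Out-of-range inputs silently become wrong numbers."""
    low_bits = int(value) & 0xFFFF
    # Reinterpret the 16 bits as a signed integer.
    return struct.unpack("<h", struct.pack("<H", low_bits))[0]


def checked_to_int16(value: float) -> int:
    """Refuse to convert a value that does not fit in 16 signed bits."""
    truncated = int(value)
    if not INT16_MIN <= truncated <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a 16-bit signed integer")
    return truncated


small = unchecked_to_int16(1000.0)     # fits: stays 1000
wrapped = unchecked_to_int16(40000.0)  # does not fit: becomes -25536
```

A range check of this kind is only useful if the surrounding system also decides what to do when the check fails, which is part of re-examining reused software in its new environment.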

The Therac-25 Incident

  • Radiation overdoses were administered, with the machine indicating no dose was given.

  • Caused severe injuries and deaths.

  • Responsibility lies with the manufacturer, programmer, and hospitals/clinics.

Therac-25: Software and Design Problems

  • Reused software from older systems, unaware of previous bugs.

  • Weaknesses in operator interface design.

  • Inadequate test plan.

  • Software bugs allowed beam deployment when the table was not in position.

  • Software ignored operator corrections made at the console.
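
The last two problems suggest the kind of check that was missing. The following is a hedged sketch of a software interlock, with hypothetical class, field, and mode names. It does not reflect the Therac-25's actual code or hardware. It refuses to fire unless the table is in position and the configured mode matches the operator's latest entry.

```python
from dataclasses import dataclass


class InterlockError(Exception):
    """Raised when a safety precondition for firing the beam is not met."""


@dataclass
class MachineState:
    # Hypothetical fields; a real machine would read these from sensors.
    table_in_position: bool
    prescribed_mode: str   # what the operator last entered at the console
    configured_mode: str   # what the machine is actually set up to deliver


def fire_beam(state: MachineState) -> str:
    """Fire only if every precondition holds at the moment of firing."""
    if not state.table_in_position:
        raise InterlockError("table is not in position")
    if state.prescribed_mode != state.configured_mode:
        raise InterlockError("operator's latest entry is not what is configured")
    return f"beam fired in {state.configured_mode} mode"
```

The design choice illustrated here is that the checks run immediately before the hazardous action rather than earlier in the setup sequence, so a late correction by the operator cannot be skipped.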

Therac-25: Hospital and Manufacturer Response

  • Hospitals were unfamiliar with such massive overdoses and unsure of the cause.

  • Manufacturer initially denied the possibility of machine malfunction and claimed no other incidents were reported (which was false).

  • Changes to the turntable, claimed to improve safety, did not address the identified causes.

Therac-25: Regulatory Action

  • Recommendations for further safety enhancements were ignored.

  • FDA declared the machine defective after the fifth accident.

  • A sixth accident occurred during FDA negotiations with the manufacturer.

Therac-25: Systemic Irresponsibility

  • While minor errors are expected in complex systems, the Therac-25 issues were significant and indicative of irresponsibility.

  • Accidents occurred with other radiation treatment equipment due to technician errors (e.g., leaving a patient unattended, incorrect measurement of radioactive drugs).

  • Ethical Question: How much responsibility should be assigned to the programmer, manufacturer, and hospital/clinic?

Affordable Care Act (ACA)

  • Signed into law March 23, 2010.

  • Expanded Medicaid and CHIP coverage for states.

  • Introduced MAGI (Modified Adjusted Gross Income) for eligibility.

  • Established Health Insurance Marketplaces or Exchanges.

  • Provided federal matching funds for implementation.

  • Incentivized medical providers to adopt Electronic Health Records (EHR).

  • Required "meaningful use" of EHR to receive reimbursement.

Electronic Health Records (EHR) Controversy

  • eClinicalWorks LLC paid $155 million to resolve civil False Claims Act allegations.

  • Modified software to pass certification testing by hardcoding drug codes.

  • Did not reliably record diagnostic imaging orders or perform drug interaction checks.

  • Resulted in false claims for federal incentive payments.

  • eClinicalWorks faced a second class-action suit for "deceptive" practices.

  • Plaintiffs forfeited meaningful use incentive payments due to non-compliant EHR.

  • A lawsuit claiming millions had compromised patient EHRs sought nearly $1 billion in damages.

State Health Insurance Marketplaces

  • Successful examples:

    • Rhode Island: HealthSource RI

    • Kentucky: Kynect

    • Massachusetts: The Health Connector

    • Connecticut: Access Health CT

    • California: Covered California

Cover Oregon Failure

  • Oregon's failed $305 million health insurance exchange.

  • Cited as the worst online marketplace in the nation.

  • Insurance applications processed manually.

  • Built by Oracle, leading to a lawsuit where Oregon sought $1 billion, settling for $100 million.

  • Politics may have contributed to the failure.

Kentucky Kynect and Benefind

  • 2013: Kynect, a successful $330 million health insurance exchange (built by Deloitte).

  • 2016: Benefind—aimed to integrate Medicaid enrollment with SNAP eligibility.

  • First system to integrate Medicaid with other benefits.

  • $101 million original contract, resulting in a major failure.

  • Also built by Deloitte.

Kentucky Benefind Problems

  • Delayed two months (Dec 2015 to Feb 2016) at a cost of $7 million.

  • Outgoing administration limited the incoming governor's options.

  • Significant disruption of public aid for thousands of Kentuckians, leaving many without essential services.

Healthcare.gov

  • Health Care Exchange Website.

  • Part of the Affordable Care Act (ACA).

  • Aimed to allow comparison shopping between health insurance options.

  • Go-live date: October 1, 2013.

  • Sign-up deadline: December 23, 2013.

  • Insurance start: January 1, 2014.

Healthcare.gov Launch Issues

  • Day 1:

    • 4.7 million unique site visitors.

    • 250,000 simultaneous users (expected 50,000-60,000).

    • Only 6 registered for insurance.

  • Day 2:

    • 248 people registered for insurance.

  • Day 10:

    • 7 million unique site visitors.

    • A few thousand registered for insurance.
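
A short arithmetic sketch puts the day-one figures in proportion. It uses only the numbers listed above, reading the visitor count as 4.7 million. The variable names are illustrative.

```python
# Figures taken from the launch notes above.
day1_visitors = 4_700_000
day1_registrations = 6
peak_simultaneous = 250_000
expected_low, expected_high = 50_000, 60_000

# How far the actual peak load exceeded the planning range.
overload_vs_high = peak_simultaneous / expected_high
overload_vs_low = peak_simultaneous / expected_low

# How many visitors there were for each completed registration.
visitors_per_registration = day1_visitors / day1_registrations

print(f"Load was {overload_vs_high:.1f}x to {overload_vs_low:.1f}x the plan")
print(f"About {visitors_per_registration:,.0f} visitors per registration")
```

The peak load was roughly four to five times the planned capacity, and fewer than one visitor in 780,000 completed a registration on the first day.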

Healthcare.gov Recovery

  • 8 Weeks after launch:

    • Experts identified technical and leadership issues.

    • Hundreds of software fixes and hardware upgrades implemented.

    • Site able to handle more than 80,000 simultaneous users.

    • Individual Mandate for Insurance extended to March 31, 2014.

Healthcare.gov System Components

  • Systems developed by various contractors.

  • Integration with federal agency databases (IRS, SSA, DHS).

  • Online services of over 170 insurance carriers in 36 states using the Federally Facilitated Marketplace (FFM).

Healthcare.gov Data Services Hub

  • Validated applicant information against existing federal databases.

  • Acted as a "sophisticated switch and mediator".

  • Avoided storing redundant copies of user information.

  • Assumed role of enterprise service bus, performing message routing, protocol conversion, data transformation, security checks, and transaction management.

  • Facilitated enrollment by validating information and reporting back to the applicant.
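
The "switch and mediator" role described above can be sketched as a router that forwards each check to the right service and keeps nothing itself. This is a minimal sketch: the class name, the check functions, and the mapping of agencies to checks are assumptions made for illustration, not the hub's real interfaces.

```python
from typing import Callable, Dict


# Hypothetical stand-ins for agency services; the real interfaces are not
# described in these notes.
def check_identity(applicant: dict) -> bool:
    return bool(applicant.get("ssn"))


def check_income(applicant: dict) -> bool:
    return applicant.get("income", -1) >= 0


def check_residency(applicant: dict) -> bool:
    return applicant.get("lawfully_present", False)


class DataHub:
    """Routes each check to the matching service and keeps no applicant data."""

    def __init__(self) -> None:
        self._routes: Dict[str, Callable[[dict], bool]] = {}

    def register(self, check_name: str, service: Callable[[dict], bool]) -> None:
        self._routes[check_name] = service

    def validate(self, applicant: dict) -> Dict[str, bool]:
        # Results go straight back to the caller; nothing is stored on the hub.
        return {name: service(applicant) for name, service in self._routes.items()}


hub = DataHub()
hub.register("SSA identity", check_identity)
hub.register("IRS income", check_income)
hub.register("DHS residency", check_residency)

report = hub.validate(
    {"ssn": "000-00-0000", "income": 32000, "lawfully_present": True}
)
```

Because the hub holds only routing information, a failure in one registered service shows up as a failed check in the report rather than as corrupted stored data. The trade-off is that enrollment depends on every external service being available.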

Healthcare.gov Costs

  • Original cost estimate: $292M.

  • By February 2014: $834M.

  • By Summer 2014: $1.7 Billion.

  • Involved 55 contractors.

  • Key Contractors:

    • CGI Federal ($88M) - Prime Contractor.

    • Quality Software Services ($55M) - Data Hub.

  • January 2014: Accenture awarded a $45 million one-year contract to fix the website.

  • Contract awarded with only two years to implement a project with unclear specifications.

  • Cost-reimbursable nature of the original contract due to undefined requirements.

Healthcare.gov Development Issues

  • Software coding began in March 2013.

  • As late as September 2013, no end-to-end testing had been conducted.

  • Debate continued regarding registration requirements before shopping for health plans.

Healthcare.gov Security Concerns

  • Security risks for a large collection of personally identifiable information.

  • Flaw discovered allowing hackers to take over user insurance accounts.

  • Failure to register misspelled or similar domain names, leading to over 200 imitation websites.

  • Required personal information for account creation before plan review.

  • Captured personal information even from those who didn't buy insurance.
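
The imitation-website problem can be explored with a small sketch that lists look-alike domain names a site owner might register defensively. The function name and the suffix list are assumptions. The sketch generates only single-character omissions and adjacent-character swaps, not every kind of misspelling.

```python
def lookalike_domains(name, suffixes=("com", "org", "net")):
    """Return look-alike domains built from the name itself, the name with one
    character dropped, and the name with two adjacent characters swapped."""
    names = {name}
    for i in range(len(name)):
        names.add(name[:i] + name[i + 1:])                            # omission
    for i in range(len(name) - 1):
        names.add(name[:i] + name[i + 1] + name[i] + name[i + 2:])    # swap
    return {f"{n}.{suffix}" for n in names for suffix in suffixes}


candidates = lookalike_domains("healthcare")
```

The resulting set contains entries such as "helthcare.com" and "haelthcare.org". Even this narrow generator yields dozens of candidates, which gives a sense of how a few hundred imitation sites could appear.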

Healthcare.gov Project Review (Early 2013)

  • Conducted by an outside firm.

  • Identified threats:

    • Evolving requirements.

    • Multiple definitions of success.

    • Dependency on external parties.

    • Parallel execution of project phases.

    • Insufficient time for end-to-end testing.

    • Launch at full volume on day one.

Healthcare.gov Project Risks (Early 2013)

  • Federal Exchange unavailability due to system failure.

  • Forced manual processing.

  • Failure to resolve post-launch issues rapidly.

  • Healthcare plan data not loaded on time.

  • Several healthcare providers unable to offer plans.

Healthcare.gov Project Recommendations (Early 2013)

  • Align on the scope of the initial release.

  • Lock down the scope of HealthCare.gov version 1.0 by April 8, 2013.

  • Thoroughly test version 1.0.

  • Appoint a single implementation leader and establish a governance process.

  • Project manager Henry Chao did not review the final report.

Healthcare.gov Launch Decision

  • Despite warnings from programmers about bugs and security holes, the administration launched the website on October 1, 2013.

  • Testing period ideally would have lasted months rather than weeks.

  • Contractors stated that recommending a delay was outside their scope of work.

Healthcare.gov Usability

  • Remote servers returned HTTP 503 (Service Unavailable) status codes.
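
A 503 response tells the client that the server is temporarily unable to handle the request. One common client-side response is to retry with increasing delays. The following is a minimal sketch of that idea with a simulated server. The function name, parameters, and delays are assumptions, and it does not describe how the site's own clients behaved.

```python
import time


def fetch_with_retry(fetch, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fetch() until it returns a status other than 503.

    fetch is any zero-argument callable returning (status_code, body).
    The delay doubles after each 503: base_delay, then 2x, then 4x, ...
    """
    for attempt in range(max_attempts):
        status, body = fetch()
        if status != 503:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    # Every attempt returned 503; report the last response.
    return status, body


# Simulated server: unavailable twice, then healthy.
responses = iter([(503, ""), (503, ""), (200, "plans page")])
delays = []  # record the requested delays instead of actually sleeping
result = fetch_with_retry(lambda: next(responses), sleep=delays.append)
```

Backing off spreads retries out over time. If hundreds of thousands of users instead reload immediately, the retries themselves add to the overload.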

What Goes Wrong?

  • Computer systems must interact with the real world.

  • The job they do is inherently difficult.

  • Sometimes the job is done poorly.

  • Computer software is “nonlinear”.

  • A single typo in a computer program can cause a dramatic difference in behavior.
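
A minimal illustration of this "nonlinear" behavior, using two hypothetical functions that differ by a single character:

```python
def doubled(value):
    return value * 2     # intended: multiply by two


def doubled_with_typo(value):
    return value ** 2    # one extra "*": raises to the power of two


correct = doubled(1000)              # 2000
wrong = doubled_with_typo(1000)      # 1000000
```

Both versions run without any error message, and they even agree for an input of 2, so a test that happens to use that value would not reveal the typo.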

Management and Use Problems

  • Data-entry errors.

  • Inadequate training of users.

  • Errors in interpreting results or output.

  • Failure to keep information in databases up to date.

  • Overconfidence in software by users.

  • Misrepresentation, hiding problems, and inadequate response to reported problems.

  • Insufficient market or legal incentives to do a better job.

What Goes Wrong? (Cont.)

  • Lack of clear, well-thought-out goals and specifications.

  • Poor management and poor communication among customers, designers, programmers, etc.

  • Institutional and political pressures that encourage unrealistically low bids, low budget requests, and underestimates of time requirements.

  • Use of very new technology, with unknown reliability and problems.

  • Refusal to recognize or admit a project is in trouble.

Design and Development Problems

  • Inadequate attention to potential safety risks.

  • Interaction with physical devices that do not work as expected.

  • Incompatibility of software and hardware, or of application software and the operating system.

  • Not planning and designing for unexpected inputs or circumstances.

  • Confusing user interfaces.

  • Insufficient testing.

  • Reuse of software from another system without adequate checking.

  • Overconfidence in software.

  • Carelessness.

User Interfaces and Human Factors

  • User interfaces should:

    • Provide clear instructions and error messages.

    • Be consistent.

    • Include appropriate checking of input to reduce major system failures caused by typos or other errors a person will likely make.

  • User needs feedback to understand what the system is doing at any time.

  • System should behave as an experienced user expects.

  • A workload that is too low can be dangerous: a bored or inattentive operator may miss problems.
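
The input-checking point above can be sketched as a small validation function. The function name and the limits are hypothetical. The idea is to reject entries that are not numbers or that fall outside a plausible range, and to say why.

```python
def parse_bounded_number(text, low=0.0, high=500.0):
    """Parse an operator-entered number and reject values outside (low, high].

    The default limits are placeholders; a real system would take them
    from its specification.
    """
    try:
        value = float(text)
    except ValueError:
        raise ValueError(f"'{text}' is not a number; enter a value such as 200") from None
    if not low < value <= high:
        raise ValueError(f"{value} is outside the allowed range {low} to {high}")
    return value


accepted = parse_bounded_number("200")
```

An entry such as "2000" (one extra zero) or "2oo" (letters typed in place of digits) is rejected with a message that states the problem, which also serves the first point in the list: clear error messages.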

Increasing Reliability and Safety

  • Specifications:

    • Learn the needs of the client.

    • Understand how the client will use the system.

  • Testing:

    • Even small changes need thorough testing.

    • Independent verification and validation (IV&V).

    • Beta testing.

Increasing Reliability and Safety (Cont.)

  • Safety-critical applications:

    • Identify risks and protect against them.

    • Build a convincing case for safety.

    • Avoid complacency.

  • Redundancy and self-checking:

    • Multiple computers capable of the same task; if one fails, another can do the job.

    • Voting redundancy (multiple teams develop the same module, and the system compares their results).
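
Voting redundancy can be sketched as a majority vote over the results of independently written versions of the same module. The three versions below are hypothetical, and one of them contains a deliberate bug so the vote has something to outvote.

```python
from collections import Counter


def majority_vote(results):
    """Return the value reported by more than half of the modules, or raise
    if no value has a majority."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError(f"no majority among results: {results}")
    return value


# Three hypothetical implementations of one specification (absolute value).
def version_a(x):
    return abs(x)


def version_b(x):
    return x if x >= 0 else -x


def version_c(x):
    return x  # deliberate bug: wrong for negative inputs


def redundant_abs(x):
    return majority_vote([version_a(x), version_b(x), version_c(x)])


answer = redundant_abs(-5)   # versions report 5, 5, -5; the vote returns 5
```

Voting protects against independent faults only. If all teams misread the same specification, every version can agree on the wrong answer.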

Trust the Human or the Computer System?

  • Traffic Collision Avoidance System (TCAS).

  • Computers in some airplanes prevent certain pilot actions.

  • A German plane and a Russian plane collided after one of the pilots followed an air traffic controller’s instructions rather than TCAS instructions.

  • A pilot of a Lufthansa 747 ignored instructions from an air traffic controller and instead followed instructions from the computer system, avoiding a midair collision.

Law, Regulation, and Markets

  • Criminal and civil penalties:

    • Provide incentives to produce good systems, but shouldn't inhibit innovation.

  • Regulation for safety-critical applications.

  • Professional licensing:

    • Arguments for and against.

  • Taking responsibility.

Dependence, Risk, and Progress

  • Are We Too Dependent on Computers?

    • Computers are tools.

    • Computers are not our only dependence; electricity is another.

  • Risk and Progress:

    • Many new technologies were not very safe when they were first developed.

    • We develop and improve new technologies in response to accidents and disasters.

    • We should compare the risks of using computers with the risks of other methods and the benefits to be gained.