Errors, Failures and Risks in Computer Systems
Failures and Errors in Computer Systems
Applications are complex, making error-free programs "impossible."
Applications used in unintended ways.
Failures stem from multiple factors.
Study failures to:
Avoid future errors.
Understand the impacts of poor work.
Impact of System Failures
Doctors Without Borders (Afghanistan, Oct 2015):
Air Force AC-130 gunship mistakenly attacked the Médecins Sans Frontières (MSF) hospital in Kunduz.
30 dead, 34 injured in 30-minute attack.
Contributing factors:
Networking failure (loss of video transmission, email, and electronic messages).
Information systems failed to alert that the target was on a no-strike list.
AirAsia Flight 8501:
162 fatalities.
Faulty flight computer sent false warnings (possibly due to a loosely soldered wire).
Pilot removed circuit breaker to reset computer (following maintenance precedent).
Side-effect: disengaged autopilot and auto-thrust.
Aircraft climbed rapidly, stalled, and crashed.
Service System Failures
AT&T:
Lost voice and data service.
Cause: software error in a four-million-line program.
A three-line change was not tested after implementation.
Galaxy IV Satellite Failure:
Pager service interrupted for 85% of the U.S.
Weather information delays impacted flights.
Credit card verification failures for some store chains.
Amtrak Reservation and Ticketing System Failure:
Occurred during Thanksgiving weekend.
Caused delays due to lack of printed schedules and fare lists.
Unintended Acceleration Incidents
Drivers blamed accidents on unintended acceleration due to electronic throttle flaws.
NASA study found no fault with the electronic throttle system.
Event recorders indicated some drivers stepped on the accelerator instead of the brake.
Some accidents involved pedals caught on floor mats.
A high percentage of drivers involved were 65 or older.
Software experts criticized the source code for the electronic throttle, suggesting it could cause unintended acceleration in some circumstances.
Questions raised:
What level of proof is necessary to win a lawsuit against the car company?
What are the implications for other products with complex software systems?
Problems for Individuals
Billing errors.
Poor information for law enforcement.
Inaccurate and misinterpreted data in databases.
Challenges due to large populations with shared names.
Automated processing may struggle with special cases.
Overconfidence in data accuracy.
Errors in data entry.
Lack of accountability for errors.
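The shared-name problem listed above can be illustrated with a minimal sketch. The records, field names, and both matching helpers are hypothetical and invented for illustration; real systems use more identifiers and fuzzier matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    name: str
    birth_date: str  # ISO date string, e.g. "1980-04-12"


# Hypothetical list of people flagged in some database.
FLAGGED = [Person("John Smith", "1975-02-03")]


def match_by_name(person, records):
    """Naive match: any record with the same name counts as a hit."""
    return [r for r in records if r.name == person.name]


def match_by_name_and_dob(person, records):
    """Stricter match: name and birth date must both agree."""
    return [
        r for r in records
        if r.name == person.name and r.birth_date == person.birth_date
    ]


# A different person who happens to share the name.
innocent = Person("John Smith", "1990-11-20")
print(len(match_by_name(innocent, FLAGGED)))          # 1 (a false match)
print(len(match_by_name_and_dob(innocent, FLAGGED)))  # 0
```

Adding a second identifier reduces false matches but does not eliminate them, which is one reason overconfidence in database results is itself a risk.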
Airport System Failures
Hong Kong’s Chek Lap Kok Airport:
Misrouting of cleaning crews, fuel trucks, baggage, passengers, and cargo.
Kuala Lumpur:
Manual boarding passes and luggage handling.
Flight delays.
Food cargo spoilage.
System failures attributed to inadequate consideration of user input errors.
Denver Airport
Baggage system failure due to real-world issues, problems in other systems, and software errors.
Causes:
Insufficient development time.
Significant specification changes after project commencement.
General System Failure Causes
Lack of clear goals and specifications.
Poor management and communication between stakeholders.
Institutional and political pressures leading to low bids and underestimated timeframes.
Use of new technology with unknown reliability.
Refusal to acknowledge project problems.
Legacy Systems
Reliable but inflexible.
Expensive to replace.
Limited or no documentation.
Abandoned Systems Development
Extreme flaws lead to discarding systems after significant investment.
Examples:
FBI Virtual Case File System ($170M).
British Retailer J Sainsbury ($526M).
Ford Motor Purchasing System ($400M).
California Child-Support System ($100M).
Voting Systems
Technical Failures:
Machines failing to count votes (e.g., North Carolina).
Memory capacity issues causing vote loss.
Programming errors generating extra votes or misallocating votes.
Risks:
Software rigging by programmers or hackers.
Vulnerability to viruses.
Ariane 5 Rocket
Software reuse from an older rocket model.
The newer rocket was faster, so a velocity-related value overflowed when the code converted a 64-bit floating-point number to a 16-bit signed integer.
Rocket veered off course less than 40 seconds after launch.
Destroyed as a safety precaution.
Importance of re-examining specifications, design, implications, risks, and retesting software in new environments.
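The conversion hazard described above can be sketched as follows. This is not the flight code: the function names and sample values are assumptions chosen for illustration. `wrap_to_int16` mimics what an unchecked conversion can do in some languages (keep only the low 16 bits), while `checked_to_int16` refuses values that do not fit.

```python
INT16_MIN, INT16_MAX = -2**15, 2**15 - 1


def wrap_to_int16(value: float) -> int:
    """Mimic an unchecked conversion: keep only the low 16 bits."""
    n = int(value)
    return (n - INT16_MIN) % 2**16 + INT16_MIN


def checked_to_int16(value: float) -> int:
    """Convert only if the value fits; otherwise fail loudly."""
    n = int(value)
    if not INT16_MIN <= n <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a 16-bit signed integer")
    return n


# A value in the range the older system was designed for converts safely.
print(checked_to_int16(20000.0))  # 20000

# A larger value (hypothetical, standing in for the faster rocket) does not.
print(wrap_to_int16(40000.0))     # -25536, silently wrong
try:
    checked_to_int16(40000.0)
except OverflowError as err:
    print("rejected:", err)
```

Detecting the out-of-range value is only half of the design; the caller also needs a safe way to handle the rejection, which is why reused code has to be retested against the new environment's value ranges.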
The Therac-25 Incident
Radiation overdoses were administered, with the machine indicating no dose was given.
Caused severe injuries and deaths.
Responsibility lies with the manufacturer, programmer, and hospitals/clinics.
Therac-25: Software and Design Problems
Reused software from older systems, unaware of previous bugs.
Weaknesses in operator interface design.
Inadequate test plan.
Software bugs allowed beam deployment when the table was not in position.
Ignored operator corrections at the console.
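One way to picture the missing safeguard is a software interlock that checks independently measured state at the moment of firing. The sketch below is purely illustrative and is not based on the Therac-25 source code; `Mode`, `InterlockError`, and `fire_beam` are hypothetical names.

```python
from enum import Enum


class Mode(Enum):
    ELECTRON = "electron"
    XRAY = "xray"


class InterlockError(Exception):
    """Raised when a safety precondition is not met."""


def fire_beam(requested: Mode, measured_position: Mode, settings_current: bool) -> str:
    """Enable the beam only if every precondition holds right now.

    measured_position: what an independent sensor reports, not what the
        software believes it last commanded.
    settings_current: True only if the operator's latest console edits
        have been applied.
    """
    if not settings_current:
        raise InterlockError("operator edits not yet applied")
    if measured_position is not requested:
        raise InterlockError(
            f"table set for {measured_position.value}, "
            f"but {requested.value} was requested"
        )
    return f"beam on ({requested.value})"


print(fire_beam(Mode.XRAY, Mode.XRAY, settings_current=True))  # beam on (xray)
```

A software check like this complements rather than replaces independent hardware interlocks.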
Therac-25: Hospital and Manufacturer Response
Hospitals were unfamiliar with such massive overdoses and unsure of the cause.
Manufacturer initially denied the possibility of machine malfunction and claimed no other incidents were reported (which was false).
Changes to the turntable, claimed to improve safety, did not address the identified causes.
Therac-25: Regulatory Action
Recommendations for further safety enhancements were ignored.
FDA declared the machine defective after the fifth accident.
A sixth accident occurred during FDA negotiations with the manufacturer.
Therac-25: Systemic Irresponsibility
While minor errors are expected in complex systems, the Therac-25 issues were significant and indicative of irresponsibility.
Accidents occurred with other radiation treatment equipment due to technician errors (e.g., leaving a patient unattended, incorrect measurement of radioactive drugs).
Ethical Question: How much responsibility should be assigned to the programmer, manufacturer, and hospital/clinic?
Affordable Care Act (ACA)
Signed into law March 23, 2010.
Expanded Medicaid and CHIP coverage for states.
Introduced MAGI (Modified Adjusted Gross Income) for eligibility.
Established Health Insurance Marketplaces or Exchanges.
Provided federal matching funds for implementation.
Incentivized medical providers to adopt Electronic Health Records (EHR).
Required "meaningful use" of EHR to receive reimbursement.
Electronic Health Records (EHR) Controversy
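eClinicalWorks LLC paid $155 million in 2017 to settle U.S. government claims about its EHR software; a later class action sought nearly $1 billion in damages.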
State Health Insurance Marketplaces
Successful examples:
Rhode Island: HealthSource RI
Kentucky: Kynect
Massachusetts: The Health Connector
Connecticut: Access Health CT
California: Covered California
Cover Oregon Failure
Oregon's failed $1 billion, settling for $330 million, health insurance exchange (built by Deloitte).
2016: Kentucky's Benefind aimed to integrate Medicaid enrollment with SNAP eligibility.
First system to integrate Medicaid with other benefits.
7 million.
Outgoing administration limited the incoming governor's options.
Significant disruption of public aid for thousands of Kentuckians, leaving many without essential services.
Healthcare.gov
Health Care Exchange Website.
Part of the Affordable Care Act (ACA).
Aimed to allow comparison shopping between health insurance options.
Go-live date: October 1, 2013.
Sign-up deadline: December 23, 2013.
Insurance start: January 1, 2014.
Healthcare.gov Launch Issues
Day 1:
4.7 million unique site visitors.
250,000 simultaneous users (expected 50,000-60,000).
Only 6 registered for insurance.
Day 2:
248 people registered for insurance.
Day 10:
7 million unique site visitors.
Few thousand registered for insurance.
Healthcare.gov Recovery
8 weeks after launch:
Experts identified technical and leadership issues.
Hundreds of software fixes and hardware upgrades implemented.
Handling more than 80,000 simultaneous users.
Individual Mandate for Insurance extended to March 31, 2014.
Healthcare.gov System Components
Systems developed by various contractors.
Integration with federal agency databases (IRS, SSA, DHS).
Online services of over 170 insurance carriers in 36 states using the Federally Facilitated Marketplace (FFM).
Healthcare.gov Data Services Hub
Validated applicant information against existing federal databases.
Acted as a "sophisticated switch and mediator".
Avoided storing redundant copies of user information.
Assumed role of enterprise service bus, performing message routing, protocol conversion, data transformation, security checks, and transaction management.
Facilitated enrollment by validating information and reporting back to the applicant.
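The "switch and mediator" role described above can be sketched as a small router that forwards an applicant's details to each agency's check and reports the answers back without keeping a copy. The class, its methods, and the stand-in verifier rules are all assumptions for illustration; the real hub's interfaces are not described in these notes.

```python
from typing import Callable, Dict

# A verifier takes applicant data and answers yes or no.
Verifier = Callable[[dict], bool]


class DataServicesHub:
    """Routes validation requests; stores no applicant data itself."""

    def __init__(self) -> None:
        self._verifiers: Dict[str, Verifier] = {}

    def register(self, agency: str, verifier: Verifier) -> None:
        """Attach one agency's check under its name."""
        self._verifiers[agency] = verifier

    def validate(self, applicant: dict) -> Dict[str, bool]:
        """Ask every registered agency and return the results."""
        return {
            agency: verify(applicant)
            for agency, verify in self._verifiers.items()
        }


hub = DataServicesHub()
# Stand-in rules; real checks would query each agency's own database.
hub.register("SSA", lambda a: len(a.get("ssn", "")) == 9)
hub.register("IRS", lambda a: a.get("income", -1) >= 0)

report = hub.validate({"ssn": "000000000", "income": 42000})
print(report)  # {'SSA': True, 'IRS': True}
```

Because each answer depends on an external service, a hub of this kind is only as available as its slowest dependency, which ties in with the "dependency on external parties" threat noted in the project review.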
Healthcare.gov Costs
Original cost estimate: $834M.
By summer 2014: $88M (prime contractor).
Quality Software Services: $45 million one-year contract to fix the website.
Contract awarded with only two years to implement a project with unclear specifications.
Cost-reimbursable nature of the original contract due to undefined requirements.
Healthcare.gov Development Issues
Software coding began in March 2013.
As late as September 2013, no end-to-end testing had been conducted.
Debate continued regarding registration requirements before shopping for health plans.
Healthcare.gov Security Concerns
Security risks for a large collection of personally identifiable information.
Flaw discovered allowing hackers to take over user insurance accounts.
Failure to register misspelled or similar domain names, leading to over 200 imitation websites.
Required personal information for account creation before plan review.
Captured personal information even from those who didn't buy insurance.
Healthcare.gov Project Review (Early 2013)
Conducted by an outside firm.
Identified threats:
Evolving requirements.
Multiple definitions of success.
Dependency on external parties.
Parallel execution of project phases.
Insufficient time for end-to-end testing.
Launch at full volume on day one.
Healthcare.gov Project Risks (Early 2013)
Federal Exchange unavailability due to system failure.
Forced manual processing.
Failure to resolve post-launch issues rapidly.
Healthcare plan data not loaded on time.
Several healthcare providers unable to offer plans.
Healthcare.gov Project Recommendations (Early 2013)
Align on the scope of the initial release.
Lock down the scope of HealthCare.gov version 1.0 by April 8, 2013.
Thoroughly test version 1.0.
Appoint a single implementation leader and establish a governance process.
Project manager Henry Chao did not review the final report.
Healthcare.gov Launch Decision
Despite warnings from programmers about bugs and security holes, the administration launched the website on October 1, 2013.
Testing period ideally would have lasted months rather than weeks.
Contractors stated that recommending a delay was outside their scope of work.
Healthcare.gov Usability
Remote servers returning HTTP 503 - Service Unavailable status code.
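A 503 response tells the client that the server is temporarily unable to handle the request. A common client-side response is to retry with exponentially increasing delays. The sketch below simulates this without any network access; `ServiceUnavailable`, `fetch_with_retry`, and `make_flaky_fetch` are hypothetical names, and the delay values are arbitrary.

```python
import time


class ServiceUnavailable(Exception):
    """Stands in for an HTTP 503 response."""


def fetch_with_retry(fetch, max_attempts=4, base_delay=0.01, sleep=time.sleep):
    """Call fetch(); on a 503, wait base_delay * 2**attempt and try again."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except ServiceUnavailable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            sleep(base_delay * 2 ** attempt)


def make_flaky_fetch(failures):
    """Build a fake fetch that fails `failures` times, then succeeds."""
    state = {"left": failures}

    def fetch():
        if state["left"] > 0:
            state["left"] -= 1
            raise ServiceUnavailable("503 Service Unavailable")
        return "200 OK"

    return fetch


print(fetch_with_retry(make_flaky_fetch(2)))  # 200 OK
```

Retrying helps with brief overloads only; if the server is fundamentally under-provisioned, large numbers of retrying clients can add to the load.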
What Goes Wrong?
Computer systems must interact with the real world.
The job they do is inherently difficult.
Sometimes the job is done poorly.
Computer software is “nonlinear”.
Typo in a computer program can cause a dramatic difference in behavior.
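A small illustration of this nonlinearity: transposing two characters turns a running sum into something quite different, and the program still runs without any error message. The function names are made up for this example.

```python
def total_correct(values):
    total = 0
    for v in values:
        total += v  # add v to the running sum
    return total


def total_typo(values):
    total = 0
    for v in values:
        total =+ v  # typo: assigns +v, discarding the running sum
    return total


print(total_correct([10, 20, 30]))  # 60
print(total_typo([10, 20, 30]))     # 30
```

Both versions are valid Python, so neither the interpreter nor a casual read of the output for a one-element list would reveal the mistake.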
Management and Use Problems
Data-entry errors.
Inadequate training of users.
Errors in interpreting results or output.
Failure to keep information in databases up to date.
Overconfidence in software by users.
Misrepresentation, hiding problems, and inadequate response to reported problems.
Insufficient market or legal incentives to do a better job.
What Goes Wrong? (Cont.)
Lack of clear, well-thought-out goals and specifications.
Poor management and poor communication among customers, designers, programmers, etc.
Institutional and political pressures that encourage unrealistically low bids, low budget requests, and underestimates of time requirements.
Use of very new technology, with unknown reliability and problems.
Refusal to recognize or admit a project is in trouble.
Design and Development Problems
Inadequate attention to potential safety risks.
Interaction with physical devices that do not work as expected.
Incompatibility of software and hardware, or of application software and the operating system.
Not planning and designing for unexpected inputs or circumstances.
Confusing user interfaces.
Insufficient testing.
Reuse of software from another system without adequate checking.
Overconfidence in software.
Carelessness.
User Interfaces and Human Factors
User interfaces should:
Provide clear instructions and error messages.
Be consistent.
Include appropriate checking of input to reduce major system failures caused by typos or other errors a person will likely make.
User needs feedback to understand what the system is doing at any time.
System should behave as an experienced user expects.
A workload that is too low can be dangerous.
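The point about checking input and giving clear error messages can be sketched with a small validation helper. `parse_quantity` and its limits are hypothetical; the idea is simply that an obviously wrong entry is rejected with a message the user can act on, instead of being passed along to the rest of the system.

```python
def parse_quantity(text: str, minimum: int, maximum: int) -> int:
    """Turn typed text into a whole number within [minimum, maximum]."""
    try:
        value = int(text.strip())
    except ValueError:
        raise ValueError(f"'{text}' is not a whole number") from None
    if not minimum <= value <= maximum:
        raise ValueError(
            f"{value} is outside the allowed range {minimum}-{maximum}"
        )
    return value


print(parse_quantity(" 12 ", 1, 100))  # 12

for bad in ["1200", "12o"]:
    try:
        parse_quantity(bad, 1, 100)
    except ValueError as err:
        print("rejected:", err)
```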
Increasing Reliability and Safety
Specifications:
Learn the needs of the client.
Understand how the client will use the system.
Testing:
Even small changes need thorough testing.
Independent verification and validation (IV&V).
Beta testing.
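One inexpensive way to test small changes is to keep a table of known input/output pairs and re-run it after every edit. The billing rule and the cases below are invented for illustration.

```python
def late_fee(days_overdue: int) -> float:
    """Hypothetical rule: 1.50 per overdue day, capped at 30.00."""
    if days_overdue <= 0:
        return 0.0
    return min(1.5 * days_overdue, 30.0)


# (input, expected output) pairs, including boundary values.
REGRESSION_CASES = [(-1, 0.0), (0, 0.0), (1, 1.5), (20, 30.0), (21, 30.0)]


def run_regression():
    """Return the cases whose actual result differs from the expected one."""
    return [
        (days, expected, late_fee(days))
        for days, expected in REGRESSION_CASES
        if late_fee(days) != expected
    ]


print(run_regression())  # [] when every case still passes
```

A table like this catches only the cases someone thought to write down, so it supplements independent verification and beta testing and does not replace them.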
Increasing Reliability and Safety (Cont.)
Safety-critical applications:
Identify risks and protect against them.
Convincing case for safety.
Avoid complacency.
Redundancy and self-checking:
Multiple computers capable of same task; if one fails, another can do the job.
Voting redundancy: multiple teams independently develop the same module, and the system compares their outputs and takes the majority result.
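Voting redundancy can be sketched as a majority voter over the outputs of independently written versions of one function. The three "team" implementations below are hypothetical, and one contains a deliberate bug to show the voter masking it.

```python
from collections import Counter


def majority_vote(results):
    """Return the value more than half of the versions agree on."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among redundant results")
    return value


# Three independently written versions of "absolute value".
def abs_team_a(x):
    return abs(x)


def abs_team_b(x):
    return x if x >= 0 else -x


def abs_team_c(x):
    return x  # deliberate bug: wrong for negative inputs


versions = [abs_team_a, abs_team_b, abs_team_c]
print(majority_vote([f(-5) for f in versions]))  # 5
```

The scheme assumes the versions fail independently; if every team misreads the same specification in the same way, the vote will confidently return the wrong answer.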
Trust the Human or the Computer System?
Traffic Collision Avoidance System (TCAS).
Computers in some airplanes prevent certain pilot actions.
A German and a Russian plane collided after one of the pilots followed an air traffic controller’s instructions rather than TCAS instructions.
A pilot of a Lufthansa 747 ignored instructions from an air traffic controller and instead followed instructions from the computer system, avoiding a midair collision.
Law, Regulation, and Markets
Criminal and civil penalties:
Provide incentives to produce good systems, but shouldn't inhibit innovation.
Regulation for safety-critical applications.
Professional licensing:
Arguments for and against.
Taking responsibility.
Dependence, Risk, and Progress
Are We Too Dependent on Computers?
Computers are tools.
Computers are not our only dependence; electricity is another.
Risk and Progress:
Many new technologies were not very safe when they were first developed.
We develop and improve new technologies in response to accidents and disasters.
We should compare the risks of using computers with the risks of other methods and the benefits to be gained.