Software Disasters and How to Avoid Them

It’s every computer scientist’s nightmare. You work – hard – to make a project successful, but some little thing goes awry, unnoticed, and at some point after the launch that little thing causes problems. Maybe big problems. Like the time one error, in a single line of code, set off a chain of events that knocked out AT&T’s service for 75 million calls. Talk about a wrong number.

The following are more classic examples – and some lessons they teach to help avoid similar mistakes in other projects.
 


October 2008: Overstock.com Loses Big From Rushed ERP Implementation

 

It started innocently enough. Back in 2005 Overstock.com was outgrowing its homespun ERP system and wanted a new Oracle system in place in time for that year’s holiday shopping season.

It was too much, too fast. The new ERP was unable to provide customers with shipping information or order statuses, earning the company a place on many shoppers’ “naughty lists.”

And the Ghost of Rushed Implementations Past continued to haunt Overstock.com. In October 2008 the company had to revise its earnings statements for several years – showing a $12.9 million reduction in revenue and a $10.3 million increase in net loss – because the ERP was improperly integrated with the firm’s accounting system.

  LESSON: Rushed implementations are dangerous. And it’s a good idea to keep running the old system in parallel with a new system for a while.
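
One way to picture a parallel run is as a nightly reconciliation: both systems process the same orders, and any disagreement is flagged before the old system is retired. The following is only a minimal sketch under assumed conditions. The function name `reconcile`, the order IDs, and the idea of comparing per-order totals are illustrative assumptions, not a description of what Overstock.com actually ran.

```python
from decimal import Decimal


def reconcile(legacy_totals, new_totals, tolerance=Decimal("0.00")):
    """Compare per-order totals from two systems (hypothetical helper).

    Returns a dict mapping order ID to (legacy_value, new_value) for every
    order that is missing from one system or differs by more than `tolerance`.
    """
    mismatches = {}
    for order_id in sorted(set(legacy_totals) | set(new_totals)):
        old = legacy_totals.get(order_id)
        new = new_totals.get(order_id)
        if old is None or new is None or abs(old - new) > tolerance:
            mismatches[order_id] = (old, new)
    return mismatches


# Made-up sample data: the same day's orders as reported by each system.
legacy = {"A1": Decimal("19.99"), "A2": Decimal("5.00")}
new = {"A1": Decimal("19.99"), "A2": Decimal("5.50"), "A3": Decimal("1.00")}

for order_id, (old_value, new_value) in reconcile(legacy, new).items():
    print(f"order {order_id}: legacy={old_value} new={new_value}")
```

An empty result over a sustained period is one reasonable signal that the new system can take over; a growing list of mismatches is a reason to wait.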

 


October 2004: One Third of the UK Dept. for Work and Pensions’ Computer Network Goes Down

 

For several days, new pension and benefits claims could not be processed. The failure occurred while the agency was conducting a routine software upgrade.   LESSON: When upgrading key systems, rollback ability can be valuable.

 


December 25, 2004: Software Failure Forces Comair to Cancel 1000+ Christmas-Day Flights

 

Severe weather led to a surge of crew reassignments which overwhelmed the airline’s crew-scheduling system. The resulting snarl entangled passengers throughout the cheerless holiday weekend.   LESSON: Sites that rely on a single database server, yet need instant scalability, need special architecture. We have architected systems that support millions of simultaneous users, in part by avoiding database bottlenecks.

 


1962: Mariner I Gets Lost in Space

 

The Mariner I rocket was supposed to carry a space probe to Venus. But it went off course soon after launch. A programmer had made a tiny mistake in coding a complex formula. Danger, Will Robinson! Just under five minutes after liftoff, Mission Control was obliged to destroy the rocket and probe (an $18.5 million package). The pain, the pain of it all.   LESSON: Automated “unit tests” that compare the output of important functions in a software program with real-world data help ensure the system is working as it is supposed to.
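
To make the lesson concrete, here is a minimal sketch of a test that checks a function against reference data. Everything in it is hypothetical: the `smoothed` function, its window size, and the reference values were invented for illustration and are not drawn from the Mariner guidance software.

```python
import math


def smoothed(values, window=3):
    """Return the trailing moving average of `values` (hypothetical example)."""
    result = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        result.append(sum(chunk) / len(chunk))
    return result


# Reference cases: inputs paired with outputs worked out by hand.
# In a real project these would come from recorded, trusted measurements.
REFERENCE_CASES = [
    ([3.0, 3.0, 3.0], [3.0, 3.0, 3.0]),
    ([1.0, 2.0, 3.0, 4.0], [1.0, 1.5, 2.0, 3.0]),
]


def check_against_reference():
    """Return a list of (inputs, expected, actual) for every failing case."""
    failures = []
    for inputs, expected in REFERENCE_CASES:
        actual = smoothed(inputs)
        matches = len(actual) == len(expected) and all(
            math.isclose(a, e, rel_tol=1e-9) for a, e in zip(actual, expected)
        )
        if not matches:
            failures.append((inputs, expected, actual))
    return failures


if check_against_reference():
    print("reference check FAILED")
else:
    print("reference check passed")
```

Run automatically on every change, a check like this turns a silent formula error into an immediate, visible failure.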

 


1983: Software Bug Nearly Causes World War III

 

The Soviet Union’s early warning system had software that did not filter out false missile detections caused by sunlight reflecting off clouds – leading to a warning that the U.S. had launched five ballistic missiles at Mother Russia! Fortunately for all humanity, the Soviet officer on duty realized that if the U.S. really were attacking, it would have launched far more than five missiles. So he reported the missiles as a false alarm, confirming a widely held supposition that when it comes to global thermonuclear war, the only way to win is not to play.   LESSON: Smart, well-trained, experienced staff remain indispensable.

 


September 8, 2008: London Stock Exchange System Breaks Down

 

It would have broken Gordon Gekko’s heart. For nearly seven hours, LSE members could not trade. Worse yet, this occurred just as the world’s equities rallied on news of the U.S. government’s takeover of Fannie Mae and Freddie Mac. Talk about bad market timing.   LESSON: When an outage’s costs would be prohibitive, have redundant systems in place.

 


1999: Mars Climate Orbiter Crashes

 

A software miscalculation caused the Orbiter to misfire its engines, so instead of going into orbit the $125 million craft crashed into Mars. What happened? The software used to control the Orbiter’s thrusters was calibrated in imperial units, not the metric units NASA had specified.   LESSON: In addition to testing a system’s components, it is effective to conduct system-wide integration testing, using real data whenever possible.
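
A unit mix-up of this kind is easier to catch when every value carries its unit with it. The sketch below assumes the mismatch was between pound-force seconds and newton-seconds, as the incident is widely reported. The `Impulse` class and its unit labels are hypothetical and have nothing to do with the actual flight or ground software.

```python
from dataclasses import dataclass

# Newton-seconds per pound-force second (1 lbf = 4.4482216152605 N).
LBF_S_TO_N_S = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """A thruster impulse tagged with its unit (hypothetical type)."""

    value: float
    unit: str  # either "N*s" or "lbf*s"

    def to_newton_seconds(self) -> float:
        """Convert to newton-seconds, refusing units we do not recognize."""
        if self.unit == "N*s":
            return self.value
        if self.unit == "lbf*s":
            return self.value * LBF_S_TO_N_S
        raise ValueError(f"unknown impulse unit: {self.unit!r}")


def total_impulse_newton_seconds(readings):
    """Sum readings from mixed sources after converting each to N*s."""
    return sum(reading.to_newton_seconds() for reading in readings)


# Made-up readings from two components that report in different units.
readings = [Impulse(10.0, "N*s"), Impulse(2.0, "lbf*s")]
print(total_impulse_newton_seconds(readings))
```

The design choice here is that a bare number can never cross a component boundary. An integration test that feeds real output from one component into the other would then fail loudly on an unlabeled or unexpected unit instead of producing a plausible but wrong trajectory.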

 


1999: British Passport Delay

 

The British flag may fly around the world, but for a time Britons didn’t. The U.K. Passport Agency implemented a new Siemens computer system, but did not fully test it or effectively train staff in its use. As fate would have it, right at that time the law changed so that children under 16 needed a passport to travel abroad, precipitating a spike in demand. Compounding the bugs, agency staff could not figure out how to operate the confusing system.   LESSON: Interfaces should be intuitive enough to require little system training. And when designed and developed by user experience experts, they can be that easy to work with.

 


2006: Airbus’s Incompatible Software

 

Construction of the Airbus A380 hit some turbulence. It turned out that different subcontractors were using different versions of the design and assembly software. Incompatible versions, needless to say. This ended up delaying construction for about a year.   LESSON: Quality control applies to project management, too. WebINTENSIVE, for example, has in place an extensive system of procedures and protocols to help detect oversights.

 


2007: LAX Flightless

 

For eight hours no passengers could be authorized to enter or leave the U.S. through Los Angeles International Airport. That stranded about 17,000 passengers. All because one little network card kept sending out inaccurate data, which led the whole system used by U.S. Customs and Border Protection to shut down.   LESSON: Error-tracking procedures that record source and environment details can make broader problems faster to track down.
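
As a rough illustration of what recording source and environment details can look like, the sketch below builds a structured record for each error. The function name `build_error_record`, the field names, and the sample source label are assumptions made for this example; they do not describe the system used at LAX.

```python
import json
import platform
import socket
import traceback
from datetime import datetime, timezone


def build_error_record(exc, source):
    """Describe an exception along with where and when it occurred.

    `source` is a caller-supplied label for the component or device that
    produced the error (a hypothetical naming scheme).
    """
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "source": source,
        "host": socket.gethostname(),
        "platform": platform.platform(),
        "error_type": type(exc).__name__,
        "message": str(exc),
        "traceback": traceback.format_exception(type(exc), exc, exc.__traceback__),
    }


try:
    raise ValueError("malformed packet")  # stand-in for a real failure
except ValueError as error:
    record = build_error_record(error, source="network-card-07")
    print(json.dumps(record, indent=2))
```

When many records share the same `source` value, the faulty component stands out quickly, which is the point of the lesson.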

 


2009: Oak Park Abandons PeopleSoft

 

The city of Oak Park, Illinois spent close to $2 million over about five years to implement a PeopleSoft system to automate its finance and payroll processes, only to abandon the system as too complex for its needs. The program was never fully adopted; many city employees still used the old manual process of cranking on the adding machine.   LESSONS:

  1. Fix the process first, then automate.
  2. Clearly defining needs is crucial to success.
  3. Constituent buy-in (“ownership”) is necessary for widespread system changes to be adopted quickly and easily.

Do you know of any out-and-out catastrophes we should add?

© Copyright 2011 WebINTENSIVE Software. All rights reserved.