Dieter Kranzlmüller: Event Graph Analysis for Debugging Massively Parallel Programs

3.3 Notorious Errors
The reasons for testing have been defined above. Obviously, errors in software are unavoidable and, in general, very likely to occur. The importance and implications of errors may still be underestimated, and often the financial consequences are overlooked. Porter reports on a survey by the Standish Group, which examined 175,000 software development projects in 1995 [Port 98]. The results of this research revealed that 31% of these projects were cancelled before completion, with a financial loss of US$81 billion. Of the remaining projects, more than 52% suffered significant cost overruns, with an average of 189% more than estimated, resulting in a US$59 billion loss. Additionally, there is the possibility of other consequences for humanity, sometimes even leading to personal harm. The following section contains an overview of some of the most critical errors. The first example of a bug represents the very origin of the term "debugging".
3.3.1 History of Debugging
The story is reported to have taken place on September 9, 1945 at the Computation Laboratory of Harvard University [Stit 92]. Grace M. Hopper, assigned to the Bureau of Ordnance Computation, was working on Mark II, the successor of the Mark I computer engine, when suddenly the machine stopped for no apparent reason. Upon inspection, Hopper and her team detected a moth, which had flown into a relay through an open window; the moth had been pulverized by the relay and consequently had caused the device to fail. Her entry in the log book read: "First actual case of bug being found". Afterwards, the term "bug" was popularized to signify any system malfunction. Yet, it has been noted that the term itself had been used as far back as the end of the 19th century, when it was applied to electrical equipment.
More recently, in 1995, the twentieth anniversary special issue of the Byte computer magazine devoted one chapter to a list of the 20 most notorious bugs [Need 95]. Some of these errors, as well as a few other important examples, are described in the following sections:
- Lethal errors
- Airplane crashes
- Governmental and commercial mishaps
- Internet worm/hackers
- Millennium bug
3.3.2 Lethal Software Errors
Probably the most critical errors occurred in situations where human beings have been harmed or even killed. The best-known example in that domain is the story of the Therac-25 linear accelerator machines between 1985 and 1987. The tragedy happened at the East Texas Cancer Center in Tyler, where at least four people received lethal doses of radiation from Therac-25 machines during their medical treatment for cancer [Need 95]. There were several errors, among them the failure of the programmer to detect a race condition¹ (i.e., a miscoordination between concurrent tasks) [Joch 95].
Another group of errors, which is seldom made public and more often hidden by governmental agencies, is related to software systems in warfare equipment. Since errors are unavoidable, even these systems must be affected from time to time. One example is a critical bug observed in the American Patriot missile system during the Persian Gulf War. It is reported that a software glitch in this system led to the death of 28 American soldiers in their barracks in Dhahran, Saudi Arabia [Need 95].
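The notion of a race condition as a miscoordination between concurrent tasks can be illustrated with a small sketch. It is not related to the actual Therac-25 code; the names `unsafe_increment`, `run_interleaved`, and `run_sequential` are hypothetical, and the task switch is simulated deterministically with Python generators instead of real threads, so that the unlucky interleaving is reproduced on every run.

```python
def unsafe_increment(shared):
    """Read-modify-write in two steps; the yield marks a possible task switch."""
    value = shared["counter"]       # step 1: read the shared value
    yield                           # another task may run at this point
    shared["counter"] = value + 1   # step 2: write back a possibly stale value


def run_sequential(shared):
    """Run two tasks one after the other: no update is lost."""
    for task in (unsafe_increment(shared), unsafe_increment(shared)):
        for _ in task:
            pass
    return shared["counter"]


def run_interleaved(shared):
    """Let both tasks read before either one writes: one update is lost."""
    task_a = unsafe_increment(shared)
    task_b = unsafe_increment(shared)
    next(task_a)                    # A reads 0
    next(task_b)                    # B reads 0 as well
    for task in (task_a, task_b):
        try:
            next(task)              # each task writes 0 + 1
        except StopIteration:
            pass
    return shared["counter"]


print(run_sequential({"counter": 0}))   # 2
print(run_interleaved({"counter": 0}))  # 1, although two increments were executed
```

Both runs execute exactly the same statements; only the order of the steps differs, which is why such errors are hard to detect by testing alone.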
3.3.3 Airplane Crashes
A topic related to software errors that always gets the media's attention is airplane crashes. Although travelling with commercial aircraft is often declared to be the safest way to make any kind of journey (accident rates are given as one accident per 1 million flights), an aircraft accident attracts far more media attention than accidents involving most other means of public and private transportation. Many investigations in this area have been performed by Ladkin [Ladk 99], who studies the reasons for computer-related incidents with commercial aircraft. He states that until now programming errors have never been mentioned as the main reason for any critical accident. In fact, most of the breakdowns, incidents, and accidents do not happen due to one single source in one independent component, but due to the interaction of several factors.
An example is the crash-landing of one of Lufthansa's Airbus A320s in September 1993 in Warsaw, which killed two people. The A320 is the first commercial aircraft that connects the pilot's joystick (in contrast to other planes, which have a control lever) and the rudder only via computers, in order to save weight, increase the reliability of the system, and block human mistakes. The problem in Warsaw appeared when the plane was preparing to land and the pilots were waiting for a change in wind direction that never happened. As a consequence, the landing took place with the wind instead of against it. In addition, the pilots had increased the plane's speed in order to fight the high winds. This resulted in a far too high landing speed, and due to additional upwind, the sensors at the plane's wheels did not signal that the machine had landed. As a result, the brakes were not activated to stop the plane before it left the runway and hit a hill.
Other problems with planes and computers are known, e.g. the landing of one of Dragonair's A320s in June 1994 in Hong Kong. In that case, a gust of wind prevented a change of the flaps, while the computer believed that the flaps were in the correct position, which induced heavy problems during landing. Other computer problems led to a major catastrophe with 160 victims, when a Boeing B757 of American Airlines hit a hill in Colombia. In that case, the flight management computer (FMC) was operating with an erroneous navigational database and two corrupt navigational transmitters, which resulted in wrong positional information for the flight crew.
Besides the planes, an airport itself may be hit by a malfunctioning system. For example, in 1994 bugs in the computerized baggage-handling system delayed the opening of the new Denver airport. Automated baggage carts drove into walls or deposited bags at the wrong airport destination. Fixing the system cost some $80 million, before the airport finally opened in February 1995 with a manual baggage-handling system [Need 95].
Another well-known example apart from commercial air travel is the termination of the Ariane-5 rocket on June 4th, 1996 [Ladk 98]. Due to a conversion problem from 64-bit floating point numbers to 16-bit integers, an exception was raised that was never handled. As a result, both computers on board the rocket shut the system down, leading to navigational errors of the rocket only 40 seconds after its start. In order to avoid further damage, the self-destruction mechanism was initiated and destroyed the rocket. Similarly, in 1993 a software error caused the thruster rockets of the satellite Clementine to fire continually, which consumed all its fuel before it could complete its asteroid-rendezvous mission [Need 95].
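The kind of conversion problem described for Ariane-5 can be sketched as follows. This is only an illustration of narrowing a floating point value to a 16-bit signed integer, not a reconstruction of the flight software; the function names and the test value 40000.0 are assumptions chosen for the example.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_wrapping(value: float) -> int:
    """Truncate toward zero and keep only the low 16 bits (two's complement)."""
    raw = int(value) & 0xFFFF
    return raw - 0x10000 if raw >= 0x8000 else raw


def to_int16_checked(value: float) -> int:
    """Convert only if the truncated value fits, otherwise raise OverflowError."""
    number = int(value)
    if not INT16_MIN <= number <= INT16_MAX:
        raise OverflowError(f"{value!r} does not fit into a 16-bit signed integer")
    return number


def to_int16_saturating(value: float) -> int:
    """Clamp to the representable range instead of failing."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


print(to_int16_checked(1234.9))       # 1234: the value fits
print(to_int16_wrapping(40000.0))     # -25536: silently wrong
print(to_int16_saturating(40000.0))   # 32767: clamped to the maximum
try:
    to_int16_checked(40000.0)
except OverflowError as error:
    # The caller has to handle this case; an unhandled exception stops the program.
    print("conversion failed:", error)
```

None of the three variants is correct by itself for an out-of-range value: wrapping yields a wrong number, saturation yields an approximate one, and the checked conversion is only safe if the exception is actually handled by the caller.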
3.3.4 Governmental and Commercial Mishaps
As we have seen above, computer programs often don't work as they should, and too much buggy software reaches the end user [Lieb 97]. An important area of computer application is the governmental and commercial domain of everyday computing. Errors in these systems often affect many people, but fortunately, their consequences are not lethal. An example happened in 1985, when a tax computer delivered warning notices about withheld taxes to 27,000 companies that had already paid their fees [Need 95]. Another story is reported for 1989 in Paris, where 41,000 surprised traffic offenders received letters charging them with crimes like murder, drug trafficking, extortion, and prostitution instead of their traffic violations [Need 95].
One example of a big software error in a commercial company happened in 1988, when an automated Black & Decker distribution center in Northampton, England, restored corrupted backup data over all of the main system's data [Need 95]. The destroyed data could only be corrected by entering the complete inventory of the depot manually. While this was already an expensive consequence of a software error, failures in the computer systems of banks may be even more costly. In 1989 a bug in the system of a British bank transferred an extra 2 million pounds to customers within one hour, because payment orders were permitted to be issued twice [Need 95]. The problem in this case was that the bank had to trust its customers to return the money, because transaction logs were missing.
Yet, it has to be mentioned that these official cases of software bugs are probably only a small fraction compared to the estimated number of unknown cases. The reason is that companies, and especially banks, rely on establishing a basis of trust with their customers, which would certainly be harmed if such errors were published. Thus, it must be accepted that many more bugs happen than are reported in the media.
3.3.5 Internet Worm/Hackers
In contrast to the governmental and commercial mishaps reported above, increased media attention is focused on reports about the scene of hackers and phreaks. The high potential for devastating damage is a consequence of the massive expansion of the networking infrastructure and, at the same time, of the vulnerability of networked computers. The connection between software errors and hackers is the fact that the former are usually exploited as a starting point for illegal intrusions into computer systems.
The Internet Worm of November 2-3, 1988, created by Cornell grad student Robert T. Morris, was to be the largest and best-publicized computer-intrusion scandal to date [Ster 92]. Morris said that his ingenious "worm" program was meant to explore the Internet harmlessly, but due to bad programming and a math error in the code, the worm replicated itself 14 times faster than intended and slipped out of control [Need 95]. Within a few hours, some six thousand Internet computers crashed, and some systems were affected for weeks before fully recovering from the damage [Spaf 89].
A similar example is the system crash of AT&T's long distance telephone network, which happened on January 15th, 1990 [Ster 92]. Its cause was an improvement, or rather an attempted improvement, in the System 7 software for AT&T's 4ESS switching station, the "Generic 44E14 Central Office Switch Software". This software had been extensively tested and was considered very stable. Originally, the stations with System 7 were programmed to switch over to a backup net in case of any problems. However, in mid-December 1989, a new high-velocity, high-security software patch was distributed to each of the 4ESS switches that would enable them to switch over even more quickly, making the total system much more secure. Yet, at 2:25 P.M. EST on Monday, January 15th, 1990, one of AT&T's 4ESS toll switching systems in New York City had an actual, legitimate, minor problem (a missing break in a C switch statement) and went into fault recovery routines. This was the starting point for a series of failures, and as in a chain reaction one machine after the other went down, leading to an immense collapse of the US telephone system. The shutdown lasted for 9 hours and led to some 74 million uncompleted calls, making it the most severe breakdown in the US telephone network ever [Need 95].
The story of this error was rather dubious, because initially hackers were held responsible for this failure and were blamed by the media and US law enforcement agencies. This is a common way to cover tracks. Quite often, big companies do not accept the blame for software errors themselves, but hold other people responsible for their failures instead. At the same time, some companies cannot afford to make system failures publicly known, simply because their relationship with customers relies on trust. This applies to banks, which have surely been hit by hackers exploiting software bugs in the past, but would never confirm that these incidents ever happened.
Similar events are documented from time to time, like one on July 1st, 1991, which disrupted the telephone service in Washington, D.C., Pittsburgh, Los Angeles, and San Francisco. Once again, a seemingly minor maintenance problem had crippled the digital System 7 and affected about 12 million people. This time, it was a single mistyped character, one tiny typographical flaw in one single line of the 10 million lines of code.
3.3.6 The Millennium Bug
The bug that has attracted the most public interest, and therefore the most media presence, is undoubtedly the millennium bug or Y2k (Year 2000) error. This error even led to the creation of the International Y2K Cooperation Center under the auspices of the United Nations in order to minimize Y2K impact throughout the world (see http://www.iy2kcc.org). Besides that, it has been in the media for quite a long time, and many self-proclaimed prophets predicted the end of the world due to computer failures. The whole problem is based on the fact that some legacy mainframe computer programs were hard-coded to represent years by their last two digits. Consequently, they cannot handle 4-digit year dates, and dates from the year 2000 may be interpreted as 1900 [Need 95]. Thus, unique problems and bottlenecks were expected, and weaknesses of our infrastructure were expected to be exposed. Worst-case scenarios included examples like disruption of public transportation, banking and finance, hospitals, telephone and mail services, and various kinds of governmental services, e.g. social security, tax offices, and worst of all military installations [YoYo 99]. Besides these immediate failures, many worst-case scenarios were derived for the long term, like a breakdown of the international economy and a plague of lawsuits filed by shareholders, the families of deceased patients, and swarms of other people harmed by Y2k failures [Hyat 99].
Most consequences of the millennium bug had been predicted for the turn of the year, when the clock struck midnight on January 1st, 2000. However, at the time of writing only a few, and so far harmless, errors have been observed. Some examples are given in the following list (Sources: CNN, Reuters, CNET, ZDNet, Der Standard, ORF, Heise Online, IY2KCC, USENET comp.risks and RISKS Forum):
- The radiation controllers in two Japanese nuclear power-plants failed to operate correctly, requiring the correct data to be entered manually.
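The effect of storing only the last two digits of a year can be sketched in a few lines. The function names are hypothetical, and the "windowing" repair with a pivot of 50 is only one common remediation technique, shown here under the assumption that all stored years fall between 1950 and 2049.

```python
def age_from_two_digit_years(birth_yy: int, current_yy: int) -> int:
    """Naive legacy computation on two-digit years."""
    return current_yy - birth_yy


def expand_fixed_century(yy: int) -> int:
    """Legacy assumption: every two-digit year belongs to the 1900s."""
    return 1900 + yy


def expand_windowed(yy: int, pivot: int = 50) -> int:
    """Windowing: years below the (assumed) pivot are placed in the 2000s."""
    return (1900 if yy >= pivot else 2000) + yy


# A person born in 1965 ("65"), evaluated in the year 2000 ("00"):
print(age_from_two_digit_years(65, 0))                 # -65
print(expand_fixed_century(0))                         # 1900
print(expand_windowed(0) - expand_windowed(65))        # 35
```

Windowing only postpones the problem, since the chosen pivot becomes wrong once dates outside the assumed 100-year window have to be stored.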
- Seven US nuclear power-plants reported minor problems, although none of them affected safety systems.
- A US spy satellite and a French military satellite encountered minor glitches, but were reportedly never out of control.
- The atomic weapon laboratory in Oak Ridge, Tennessee, which is also the biggest US uranium depot, faced allegedly harmless date problems.
- Egypt's national news wire service briefly stopped filing.
- In three Swedish hospitals cardiac control systems shut down operation.
- A Stockholm bank refused electronic access to customers' accounts.
- A bank in Cologne incorrectly transferred several millions to some customers.
- Several hundred slot machines at the Delaware horse track shut down operation.
- A CyberCash credit-card verification caused duplicate transactions.
- Some jail sentences in Italy were increased by a century due to Y2k-bugs.
- The South-Korean Millennium Baby was registered with an age of 100 years.
These news items are also confirmed by official authorities. The US government (President Clinton's Y2k-representative John Koskinen) announced that only minor incidents have been observed, which have been solved immediately. A similar statement was issued by Bruce McConnell, the Year 2000 representative of the United Nations, who stated that globally no major disturbances have been perceived. Yet, the French magazine "01-Informatique" reported that more than 15 percent of French industry observed glitches due to the Y2k-bug, especially in the areas of accounting and finance, but also in personnel, payment, and production.
However, all in all it seems that the preparations for the Y2k-bug have paid off. Considering the amount of money, this was not a cheap effort. For example, the US government estimates that governments and companies worldwide spent US$200 billion to prevent a Year 2000 catastrophe. These numbers are even exceeded by the International Data Corporation (IDC), which reckoned up to US$280 billion, while the Gartner Group estimated as much as US$300 to US$600 billion in expenses. Of course, these amounts and the actually observed effects of the Y2k bug have led to some critical voices concerning a waste of money. There are some critical issues that have to be mentioned in this context. Firstly, it remains to be seen whether the upcoming months will reveal any major Y2k failures. Secondly, it will forever be unknown what would have happened if these investments in the information technology sector had not been made in advance. Thirdly, there is always the problem of concealment, and some (probably critical) Y2k-issues will never be revealed in public.
Related to the Y2k bug there are other notorious dates, like February 29th, 2000, which is an intercalary day that may be missed by some systems. An earlier example is the "Nines Problem", which was predicted for September 9th, 1999. Concerns over this date arose from a programming convention that used four nines in a row - 9999 - to tell computers to stop processing data or to prepare for maintenance. Some thought that 9/9/99 would be a precursor to the millennium bug, although the latter was expected to be much more critical. Yet, September 9th, 1999 has passed without any major incidents or disruption of computer systems in Asia, Europe, and the US. One reason for this non-appearance is that dates are usually stored in BCD or ASCII format, so that September 9th, 1999 equals "090999" or "990909" rather than "9999".
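Both date issues can be checked with a short sketch. The first part contrasts the complete Gregorian leap year rule with an incomplete rule that omits the 400-year exception; that such an incomplete rule is the reason why a system might miss February 29th, 2000 is an assumption made for illustration. The second part shows that the common six-digit encodings of September 9th, 1999 do not produce the sentinel value 9999. All function names are hypothetical.

```python
from datetime import date


def is_leap_gregorian(year: int) -> bool:
    """Complete rule: divisible by 4, except centuries not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def is_leap_without_400_rule(year: int) -> bool:
    """Incomplete rule (assumed bug): the 400-year exception is missing."""
    return year % 4 == 0 and year % 100 != 0


def ddmmyy(day: date) -> str:
    """Six-digit day-month-year encoding with leading zeros."""
    return day.strftime("%d%m%y")


def yymmdd(day: date) -> str:
    """Six-digit year-month-day encoding with leading zeros."""
    return day.strftime("%y%m%d")


print(is_leap_gregorian(2000))          # True
print(is_leap_without_400_rule(2000))   # False: February 29th, 2000 is missed
print(is_leap_gregorian(1900))          # False

nines_day = date(1999, 9, 9)
print(ddmmyy(nines_day))                # 090999
print(yymmdd(nines_day))                # 990909
```

Only a representation without leading zeros, such as "9999" for day 9, month 9, year 99, would collide with the sentinel value.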
¹ Race conditions are some of the most critical errors in parallel programs. They are extensively discussed in Section 4.2.4.
GUP Linz http://www.gup.uni-linz.ac.at