Event Graph Analysis
for Debugging
Massively Parallel Programs

Dieter Kranzlmüller

 



Event Graph Analysis
for Debugging
Massively Parallel Programs

(Ereignisgraph-Analyse zur Fehlersuche
in massiv-parallelen Programmen)

Dissertation

zur Erlangung des akademischen Grades
Doktor der technischen Wissenschaften

von
Dipl.-Ing. Dieter Kranzlmüller

angefertigt am

Institut für Technische Informatik und Telematik
(Abteilung für Graphische und Parallele Datenverarbeitung)
der Technisch-Naturwissenschaftlichen Fakultät der
Johannes Kepler Universität Linz


eingereicht bei
o.Univ.-Prof. Dr. Jens Volkert (Betreuer)
Prof. Dr. Bernard Tourancheau (2. Begutachter)
o.Univ.Prof. Dr. Franz Pichler (2. Prüfer)

Linz, September 2000

 

 

 

 

Danksagung

An dieser Stelle möchte ich mich bei all jenen bedanken, die zum Gelingen dieser Arbeit durch ihre fachliche und persönliche Unterstützung beigetragen haben.

 

Mein besonderer Dank gebührt meinem Betreuer Herrn Prof. Dr. Jens Volkert, der mir im Rahmen meiner wissenschaftlichen Arbeiten einerseits die größtmögliche Freiheit zur Entfaltung meiner Ideen, andererseits aber jede benötigte Hilfestellung gewährt hat.

 

Weiters danke ich Herrn Prof. Dr. Bernard Tourancheau für die Zweitbegutachtung meiner Arbeit, sowie Herrn Prof. Dr. Franz Pichler für seine Bereitschaft als Zweitprüfer beim Rigorosum zu fungieren.

 

Auch meinen Kollegen an der Abteilung für Graphische und Parallele Datenverarbeitung möchte ich hiermit für die anregenden Diskussion und ihre Hilfsbereitschaft danken, allen voran Herrn Dr. Rainer Koppler der mit mir während der letzten sechs Jahre ein Büro geteilt hat. Erwähnt seien auch die unzähligen Studenten, die mich durch ihr Engagement in Diplomarbeiten, Praktikas und meinen Lehrveranstaltungen motiviert und damit auf immer neue Ideen gebracht haben.

 

Von den Freunden möchte ich mich bei Dipl.-Ing. Edith Spiegl, Dipl.-Inform. Ralf Reussner, und Christian Schaubschläger bedanken, welche die Mühen des Korrekturlesens auf sich genommen und damit zur vorliegenden Arbeit tatkräftig beigetragen haben.

 

Schließlich gebührt wesentlicher Dank meiner Familie, hauptsächlich meiner Frau Judith, deren Verständnis und Geduld eine Grundlage für die Vervollständigung der Arbeit darstellten. Durch die Geburt meiner Tochter Miriam habe ich zusätzliche Energie und Kraft für den Abschluss des Studiums bekommen.

Abstract

Error detection and performance analysis of parallel programs is tedious and difficult, especially in massively-parallel environments. Many concurrently executing and communicating tasks result in highly complex applications, which generate huge amounts of state information for the debugging process. Managing these data requires abstraction and steering support from program analysis tools, which can only be achieved with graphical representations of the program execution.

A widely used model of parallel program behavior is the event graph, which can be displayed as a state-time diagram indicating the occurrence of events and their connections in a parallel system. Based on this abstract model, various activities of error detection can be performed, which qualifies the event graph as a suitable debugging interface. Among the debugging techniques are simple analysis features for detecting incorrect communication pairs as well as traditional methods like breakpointing and inspection.

Another important set of event graph analysis techniques covers nondeterministic parallel programs, which exhibit different program behavior in successive executions even if the same input data are provided. Consequences are the irreproducibility effect, the completeness problem, and the probe effect, which require special attention from users and corresponding tool support. The debugging strategy integrated in the event graph model permits to solve these difficulties with sophisticated event manipulation and artificial replay techniques, that allow to steer a program's execution at nondeterministic choices.

A major problem of massively parallel computing is that debugging data grow rapidly with the execution time of a program and the number of participating processes. As a consequence it may often be difficult or even impossible to analyze massively-parallel applications, and many tools are limited to small numbers of processes. With the event graph model, corresponding solutions are provided by automatization and abstraction. Firstly, every debugging approach should be capable of automatically performing as many debugging tasks as possible. This should relieve the users from repeated, time-consuming low-level activities. Secondly, higher levels of abstraction should reduce the complexity of the debugging data and improve the user's understanding of the program's behavior. Since both goals can be achieved with the event graph model, which therefore represents a significant improvement for users during debugging of high performance computing applications.

Kurzfassung

Fehlersuche und Leistungsanalyse in parallelen Programmen ist eine ermüdende und schwierige Aufgabe, besonders in massiv-parallelen Umgebungen. Solche hochkomplexen Anwendungen, in denen eine große Menge an Prozessen miteinander kommuniziert, erzeugen riesige Mengen an Zustandsinformationen für die Fehlersuche. Um diese Daten verwalten zu können werden Werkzeuge benötigt, die entsprechende Abstraktions- und Steuerungsmöglichkeiten basierend auf einer graphischen Darstellung der Programmausführung bereitstellen.

Ein entsprechendes Modell für paralleles Programmverhalten ist der Ereignisgraph, welcher das Eintreten von Ereignissen und deren Zusammenhänge beinhaltet und auf einfache Weise als Raum-Zeit Diagramm dargestellt werden kann. Basierend auf diesem Modell können verschiedenste Aktionen zur Fehlerentdeckung und -verfolgung durchgeführt werden, wodurch sich der Ereignisgraph auch als geeignete Schnittstelle zur Fehlersuche anbietet. Zu den entsprechenden Techniken zählen sowohl einfache Analysen zur Entdeckung von fehlerhaften Kommunikationspaaren, als auch traditionelle Methoden wie Haltepunkte und Inspektionen.

Eine zusätzliche, wichtige Menge an Ereignisgraph-Analysetechniken beschäftigt sich mit nichtdeterministischen parallelen Programmen, die in wiederholten Programmausführungen trotz gleicher Eingabedaten unterschiedliche Ergebnisse liefern können. Die daraus folgenden Konsequenzen sind das Reproduzierbarkeitsproblem, das Vollständigkeitsproblem und der Beobachtungseffekt. Diese erfordern spezielle Aufmerksamkeit von BenutzerInnen und entsprechende Werkzeugunterstützung. Mit einer auf den Ereignisgraph aufbauenden Fehlersuch-Strategie ist es möglich, diese Probleme zu lösen. Durch Ereignismanipulation und künstlicher Wiedergabe können nichtdeterministische Wahlmöglichkeiten während eines Programmlaufes gesteuert werden.

Ein weiteres Hauptproblem des massiv-parallelen Rechnens ist, dass die Menge der zur Fehlersuche benötigten Daten mit der Programmausführungszeit und der Anzahl der teilnehmenden Prozesse schnell ansteigt. Als Folge daraus ist es oft schwierig, wenn nicht gar unmöglich, eine massiv-parallele Anwendung zu analysieren. Viele Werkzeuge sind daher auf kleine Prozesszahlen beschränkt. Mit dem Ereignisgraph stehen entsprechende Lösungen durch Automatisierung und Abstraktion zur Verfügung. Erstens sollte ein Ansatz zur Fehlersuche so viele Fehlersuche-Aufgaben als möglich automatisch durchführen. Dadurch können BenutzerInnen von wiederholten, zeitaufwändigen, niederwertigen Aktivitäten befreit werden. Zweitens sollten höhere Ebenen der Abstraktion die Komplexität der Fehlersuch-Daten verringern und das Verständnis der BenutzerInnen für das Programmverhalten erhöhen. Da beide Ziele mit dem Ereignisgraph-Modell und den dazugehörigen Fehlersuchstrategien erreicht werden können, repräsentiert dieses eine signifikante Verbesserung für BenutzerInnen während der Fehlersuche in Hochleistungsanwendungen.

Table of Contents

Abstract 9

Kurzfassung 11

Table of Contents 13

Symbols, Variables, and Constants 19

List Of Figures 23

List of Tables 27

1. Introduction 29

2. Area of Application: Parallel Computing 33

2.1 The Need for High-Performance Computing 34

2.1.1 Grand Challenge Problems 34

2.1.2 Requirements on Computer Systems 36

2.1.3 Teraflop Computing 36

2.2 Supercomputing 37

2.2.1 History of Parallel Computing 37

2.2.2 What is a Supercomputer? 39

2.2.3 The Top 500 Supercomputers 39

2.2.4 Moore's Law 41

2.2.5 Supercomputer Architectures 43

2.2.6 Massively Parallel Processing 45

2.3 Parallel Programming 46

2.3.1 Basic Approaches 46

2.3.2 Possibilities for Parallelization and Parallel Programming Models 47

2.3.3 Approaches to Implicit Parallel Programming 49

2.3.4 Explicit Parallel Programming in Practice 50

2.3.5 Message-Passing Programs 52

2.4 Summary: Programming the World's Most Powerful Machines 53

3. Problem Domain: Debugging Program Errors 55

3.1 Software Life-Cycle Models 56

3.2 Testing and Tuning 56

3.2.1 Characteristics 56

3.2.2 Problems of Testing 57

3.2.3 Software Quality 59

3.2.4 The Testing Process 60

3.2.5 Debugging Basics 60

3.2.6 Definition and Classification of Errors 63

3.3 Notorious Errors 67

3.3.1 History of Debugging 67

3.3.2 Lethal Software Errors 68

3.3.3 Airplane Crashes 68

3.3.4 Governmental and commercial mishaps 69

3.3.5 Internet Worm/Hackers 70

3.3.6 The Millennium Bug 71

3.4 Errors in Highly-Complex Systems 72

3.5 An Example of Bug Potential 73

3.6 Summary: Testing and Debugging 76

4. Debugging Parallel Programs - Related Work 77

4.1 Traditional Testing and Debugging Cycles 78

4.1.1 Simple Checking 78

4.1.2 The Testing Cycle 78

4.1.3 The Debugging Cycle 80

4.2 Obstacles During Program Debugging 83

4.2.1 Problems of Sequential Debuggers 83

4.2.2 Fundamental Differences between Sequential and Parallel Debuggers 84

4.2.3 Errors from Process Communication and Synchronization 86

4.2.4 Nondeterministic Parallel Programs 87

4.2.5 Race Conditions 91

4.2.6 The Irreproducibility Effect 93

4.2.7 The Completeness Problem 96

4.2.8 The Probe Effect 99

4.3 Extended Parallel Testing and Debugging Approach 101

4.4 Requirements on a Parallel Debugging Tool 104

4.4.1 Architecture and Functionality of an Arbitrary Debugger 104

4.4.2 Criteria for Developing a Parallel Debugger 105

4.5 Parallel Program Analysis and Visualization 107

4.5.1 Collaborative and Standardization Efforts to Debugging 107

4.5.2 Models of Generic Program Analysis Tool 108

4.5.3 Monitoring Tools 110

4.5.4 Software Visualization 113

4.5.5 Visualization Tools for Program Analysis 115

4.5.6 Additional Levels of Abstraction 117

4.6 Debugging Activities at GUP Linz 120

4.7 Summary 123

5. Modelling Programs with the Event Graph 125

5.1 Inspecting Program States during Execution 126

5.1.1 Program State Changes 126

5.1.2 Objects of Interest 127

5.2 Events and Event Attributes 128

5.2.1 Basic Event Definition 128

5.2.2 Characteristics of Events 130

5.3 Event Ordering and Relations 131

5.3.1 The Observability Problem 131

5.3.2 Methods for Event Ordering 134

5.3.3 The "Happened Before" Relation 140

5.3.4 Event Time Zones 141

5.4 The Event Graph Model 143

5.4.1 An Abstract Definition 143

5.4.2 Event Graph Visualization 145

5.5 Summary 148

6. Example: A Simple Event Graph Model 149

6.1 Overview of Event Modelling Approach 150

6.2 Selection of Events 150

6.2.1 Interesting Event Classes 150

6.2.2 Communication Events 151

6.2.3 Mapping Events to MPI Statements 152

6.2.4 Communication Event Data 154

6.2.5 Other Events of Interest 156

6.3 Instrumentation of the Code 157

6.3.1 Applying C-Preprocessor Macros 157

6.3.2 Possibilities for Program Instrumentation 158

6.4 Monitoring the Program's Execution 160

6.4.1 Basic Operation 160

6.4.2 Monitor Operations 162

6.5 Constructing the Event Graph Model 163

6.5.1 Event Record Structure 163

6.5.2 Partial Event Ordering 165

6.6 Visualization of the MPI Program 168

6.7 Summary 170

7. Basic Debugging Activities with the Event Graph 171

7.1 The Debugging Strategy Based on the Event Graph 172

7.2 Automatic Event Graph Analysis for Error Detection 172

7.2.1 Program Failures 172

7.2.2 Computational Errors 173

7.2.3 Example: Communication Errors 174

7.3 Visualization of the Detected Behavioral Characteristics 178

7.3.1 Error Detection with the Space-Time Diagram 178

7.3.2 Performance Analysis with the Space-Time Diagram 179

7.4 Interactive Event Graph Inspection 181

7.4.1 Inspecting Communication Events 181

7.4.2 Variable Inspection Events 182

7.4.3 Array Inspection 182

7.4.4 Control Flow Analysis 185

7.5 Breakpointing in the Event Graph 185

7.5.1 Characteristics of Breakpoints for Parallel Program Debugging 185

7.5.2 Setting Breakpoints in Parallel Programs 187

7.5.3 Generating Cuts in the Event Graph 189

7.5.4 Additional Notes on Breakpoint Cuts 191

7.6 Summary 192

8. Analysis of Nondeterministic Behavior 195

8.1 Modeling Nondeterministic Behavior with the Event Graph 196

8.1.1 Nondeterministic Program Behavior 196

8.1.2 Responsible Events 197

8.1.3 Random Number Generation 198

8.1.4 Nondeterministic Receives 198

8.1.5 Additional Nondeterministic Events in Parallel Programs 199

8.1.6 Problems During Debugging 200

8.2 Detection of Nondeterministic Events 201

8.2.1 Evaluation of Responsible Events 201

8.2.2 Detecting Nondeterministic Receive Events 201

8.2.3 Some Remarks on Nonblocking Receives 204

8.2.4 Evaluation of Racing Messages 205

8.2.5 Simplifications for Evaluation of Race Condition Candidates 207

8.2.6 Visualization of Nondeterministic Behavior 210

8.3 Re-execution of Nondeterministic Programs 212

8.3.1 Equivalent Execution 212

8.3.2 Record Phase 213

8.3.3 Replay Phase 214

8.4 Debugging in the Presence of Nondeterminism 217

8.4.1 Evaluation of Race Condition Candidates 217

8.4.2 Event Manipulation 218

8.4.3 Cut Placement for Event Manipulation 219

8.4.4 Additional Notes on Exchange Cuts 222

8.4.5 Artificial Program Re-execution 225

8.5 Summary 226

 

 

9. Automatization of Nondeterminism Analysis 227

9.1 Problems of Manual Nondeterminism Analysis 228

9.1.1 Complete Nondeterminism Testing 228

9.1.2 Example Problem 1: All-To-One Communication 229

9.1.3 Example problem 2: Data Circulation 231

9.1.4 Example Problem 3: Distance Doubling 235

9.2 Automatic Nondeterminism Analysis Strategy 238

9.2.1 Principal Approach 238

9.2.2 Equivalence of Executions 238

9.2.3 Finger-printing of Executions 240

9.2.4 Possible Optimizations 243

9.3 Ideas for Test Reduction 244

9.3.1 Overview of Ideas 244

9.3.2 Integrating Initial User Feedback 246

9.3.3 Exploiting Control- and Data-Flow Characteristics 246

9.3.4 Estimating Message Arrival Order with Observed Timings 248

9.3.5 Estimating Message Arrival Order with Hardware Characteristics 250

9.3.6 Exploiting Observed Event Graph Patterns 254

9.4 Summary 255

10. Advanced Event Graph Techniques 257

10.1 Requirement and Ideas 258

10.1.1 Problem: Numbers of Processes and Execution Time 258

10.1.2 Principal Idea: Focus Only on Important Aspects 259

10.2 Classification of Event Graph Abstraction Techniques 261

10.3 Basic Techniques for Post-Mortem Abstraction 263

10.3.1 Abstraction Along Time-Axis 263

10.3.2 Abstraction along Process-Axis 265

10.3.3 Combined Post-Mortem Abstraction 267

10.4 Pattern Matching 268

10.4.1 Overview and Principal Idea 268

10.4.2 Specification of Patterns 268

10.4.3 Future Perspectives and Benefits 273

10.5 On-line Abstraction Technique 273

10.5.1 On-line Abstraction Along Time-Axis 273

10.5.2 On-Line Abstraction Along Process-Axis 275

10.6 Summary 277

11. Summary and Conclusions 279

12. References 283

People 311

Program Analysis Tools 315

Index 317

Symbols, Variables, and Constants

 

exclusive-or operation

p. See The communication between neighboring processes in a hypercube is established by the communication partners p and , where is the exclusive-or operation performed on the binary representation of the address of process p. In concrete, means that bit of the address of process p is inverted, revealing the address of the required communication partner.

 

concurrent relation

p. See Two distinct and independent events and are concurrent denoted as if

 

"happened before" relation

p. See The "happened before" relation denoted as "" on a set of events of a system is the smallest transitive, irreflexive relation satisfying the following two conditions for arbitrary events :

a t

instruction at time t

p. See For a given view onto a program, let t represent the instant of time at the terminal state of instruction at, and let st(v) be the value of variable at time t.

B (0)

defect potential of software

p. See The defect potential , where FP is the number of function points in the new software product. For enhanced software, an exponent of 1.27 is used instead of 1.25.

B ( k )

defect potential after k test-and-fix iterations

p. See Defect removal rule [Jone 96]

 

breakpoint on process p

p. See A complete cut C in a parallel program executing on n processes is a set of breakpoints on each process p, so that

C

cut (= set of breakpoints)

p. See A complete cut C in a parallel program executing on n processes is a set of breakpoints on each process p, so that

 

concurrent order

p. See Condition (1) is simply the sequential order of events as defined in Definition 5-2, while condition (2) describes the relation between corresponding events on different processes. The latter may also be called the concurrent order , which is established by communication and synchronization between parallel processes.

d

hypercube dimension

p. See Consider a MPI program that is executed on n processes of a hypercube multiprocessor [HwXu 98], where n=2d with d the dimension of the hypercube, so that each process is executed on one node of this architecture. Each process holds an arbitrary data item, which is used for some computation. After the computation, the data items are exchanged with the other processes, and the received data items are used for another computation. The goal of the data circulation algorithm is to exchange the data, so that each process receives each data once and communication between the process is as efficient as possible.

D

failure density

p. See The failure density is used as a measurement of failures per thousand lines of code.

 

dimension exchange pattern along dimension d starting at event i

p. See Dimension exchange pattern

 

set of events in dimension exchange pattern

p. See Dimension exchange pattern

 

relations between events in dimension exchange pattern

p. See Dimension exchange pattern

DO

debugger operations: breakpoints, trace, state-inspection, state-manipulation

p. See DO = breakpoints trace state-inspection state-manipulation

DS

debugging system

p. See Any tool for error detection, a so-called debugging system, DS = (PD,DO) is defined by

E

set of events e

p. See An event graph is a directed graph G = (E,), where E is the non-empty set of events e of G, while is a relation connecting events, such that x y means that there is an edge from event x to event y in G with the "tail" at event x and the "head" at event y.

 

i th event on any process p

p. See The sequential order of events defines, that the ith event on any process p, denoted as , occurred before the i+1th event .

 

simple exchange pattern between process p and q starting at event i

p. See Simple exchange pattern

 

set of events in simple exchange pattern

p. See Simple exchange pattern

 

relations between events in simple exchange pattern

p. See Simple exchange pattern

f ( )

result at nondeterministic event

p. See is the finger-print on process p for all nondeterministic events , with f() being the result that occurred at each of these nondeterministic events during execution.

f ( x )

arbitrary mathematical function over input x

p. See Let f be an arbitrary mathematical function. Then consider the following task: given an input x and an output y, determine whether or not y = f(x); that is, determine whether or not the input/output pair represents a correct computation of f. This task is called simple checking.

f 0 , f 1 , ..., f n

subfunctions of f

p. See In addition there are other problems which exist on smaller program scales, too. An example is the set of errors in parallel programs, which consists of multiple sequential bugs and errors established by synchronization and communication of concurrently executing tasks. While the multiplication of sequential bugs may already represent a big obstacle, the existence of anomalous effects due to concurrency makes parallel debugging even worse. The reason for these effects in parallel programs stems from the increased influence of the underlying parallel hardware system. Consider the debugging cycle for a simple program, consisting of the subfunctions f0, f1,... fn as sketched in the left diagram of Figure 4-3. Applying the simple checking method, we would use repeated executions in order to determine the correct intermediate states between these subfunctions. If the sequence of subfunctions f0, f1, and f2 describes the desired computation, the resulting output y is determined by the provided input x and the interaction with the system environment. If the input is the same and system-interactions can be reproduced, each iteration of the debugging cycle would yield the same output, and intermediate results can be analyzed with breakpoints and single-stepping during iterative re-executions.

fp p

finger-print of process p

p. See is the finger-print on process p for all nondeterministic events , with f() being the result that occurred at each of these nondeterministic events during execution.

FP( x )

finger-print of a nondeterministic execution

p. See The finger-print FP(x) of a nondeterministic parallel program execution with a given set of input data x is the set

FP

number of function points

p. See The defect potential , where FP is the number of function points in the new software product. For enhanced software, an exponent of 1.27 is used instead of 1.25.

first( E , p )

function to determine first event in set E on process p

p. See First event

 

time zone for a given event

p. See Every event establishes three time zones PAST, FUTURE, and PARALLEL for all other events occurring in a system, which may be defined as follows:

G

event graph

p. See An event graph is a directed graph G = (E,), where E is the non-empty set of events e of G, while is a relation connecting events, such that x y means that there is an edge from event x to event y in G with the "tail" at event x and the "head" at event y.

 

global exchange pattern along dimension d starting at event i

p. See Global exchange pattern

 

set of events in global exchange pattern

p. See Global exchange pattern

 

relations between events in global exchange pattern

p. See Global exchange pattern

k

required test-and-fix iterations

p. See Required test-and-fix iterations

last( E , p )

function to determine last event in set E on process p

p. See Given the set of events E and a process p, the function last(E,p) is defined as

messageLength( )

function to determine message length of communication event

p. See Two events and are called events with different message length, if

Mflop/s

million floating-point operations per second

p. See One of the problems of computer science is that often many definitions are used out of their original context with confusing notations and ambiguous meanings, while others are intentionally fuzzy to avoid critical arguing. To cope with this problem, we adapt the notations and conventions as defined by [HwXu 98]. In particular, readers should not be confused with the shorthand notations for basic units of time, byte and bit respectively. The basic unit of time is second, abbreviated as s, while the basic information units are byte and bit, where one byte is 8 bits. A frequently used workload unit is the number of floating-point operations, abbreviated as flop. A unit for computing speed is the number of floating-point operations per second (flop/s), with the prefix indicating the scale of the operation. Thus, we use 1 Mflop/s as the proper notation for the speed of executing 1 million floating-point operations per second. Consequently, we will avoid using 1 Mflops/s, 1 MFLOPS, or 1 Mflops, even though they are often used in the field of computing. Another remark concerns the logarithmic function, which will always be used with base 2 unless otherwise specified.

M

set of racing messages

p. See A program P with a valid input x is defined to be nondeterministic, if

n

number of processes

p. See Assuming that the number of processes for Example 9-1 is n, then the number of nondeterministic events is n-1, and each of these events is a blocking wild card receive. Consequently, the possibilities for each of these nondeterministic events is the corresponding set of racing messages, which can be given as follows:

next()

method to determine next successive event on same process

p. See Algorithm 7-1 provides some more details about the implementation of the cut mechanism. With this simple algorithm, an cut is generated with its breakpoint on each process in the vector cut[]. The breakpointing event is stored in the object breakpoint, which is an instance of class event. Three methods of this class are required in our algorithm, process_id(), vector(), and next(). The method process_id() delivers the process identifier of its event object, while vector() provides the vector clock as produced by Fidge's algorithm (see Section 5.3.2). With the next() function, the successive event on the same process can be selected. Besides these three methods of class event, our algorithm also requires an operation ("<") to compare vector clocks, which effectively checks the precedence condition (see Condition 5-1 on page 139). Additionally, a function to access the first event on a process (= first()) is used.

N max

Linpack : problem size

p. See For the Top 500, Dongarra et al used the version of the Linpack benchmark, that allows the user to scale the size of the problem and to optimize the software in order to achieve the best performance for a given machine. Then, this number reflects "... the performance of a dedicated system for solving a dense system of linear equations." Due to the regularity of the problem the achieved performance is expected to be rather high and give a good correction of a machine's peak performance. By measuring the performance for different problem sizes, the maximal possible performance Rmax for problem size Nmax as well as the problem size N1/2 for half of the performance Rmax can be determined. These three numbers are included in the list together with the theoretical peak performance of each machine Rpeak. Table 2-1 shows an excerpt from the June 1999 list with the data for the top 10 supercomputers. The units for Rmax and Rpeak are given in Mflop/s.

numProcesses

number of processes (constant)

p. See for (i=0; i < numProcesses; i++) {

N 1/2

Linpack : problem size for half of R max

p. See For the Top 500, Dongarra et al used the version of the Linpack benchmark, that allows the user to scale the size of the problem and to optimize the software in order to achieve the best performance for a given machine. Then, this number reflects "... the performance of a dedicated system for solving a dense system of linear equations." Due to the regularity of the problem the achieved performance is expected to be rather high and give a good correction of a machine's peak performance. By measuring the performance for different problem sizes, the maximal possible performance Rmax for problem size Nmax as well as the problem size N1/2 for half of the performance Rmax can be determined. These three numbers are included in the list together with the theoretical peak performance of each machine Rpeak. Table 2-1 shows an excerpt from the June 1999 list with the data for the top 10 supercomputers. The units for Rmax and Rpeak are given in Mflop/s.

p

process

p. See The sequential order of events defines, that the ith event on any process p, denoted as , occurred before the i+1th event .

PD

program behavior data: arbitrary connections of control flow-data flow, and concurrency events

p. See PD = arbitrary connections of control flow-data flow, and concurrency events with operators AND, OR, and a suitable ordering relation, as well as

P ( X )

program P with valid input x

p. See In this case, the function f(x) describes the behavior of some program P(x) as intended by the user. Therefore it serves as a specification of intended program behavior, which is required for evaluating the correctness of the program as defined in Section 3.2.3. To recapitulate, a program is always correct relative to its specification. This correctness is verified during testing cycles, which apply simple checking iteratively for a set of inputs .,

p. See A program P with a valid input x is defined to be nondeterministic, if

 

time zone for a given event

p. See Every event establishes three time zones PAST, FUTURE, and PARALLEL for all other events occurring in a system, which may be defined as follows:

 

time zone for a given event

p. See Every event establishes three time zones PAST, FUTURE, and PARALLEL for all other events occurring in a system, which may be defined as follows:

process_id()

method to determine process identifier of current event

p. See Algorithm 7-1 provides some more details about the implementation of the cut mechanism. With this simple algorithm, an cut is generated with its breakpoint on each process in the vector cut[]. The breakpointing event is stored in the object breakpoint, which is an instance of class event. Three methods of this class are required in our algorithm, process_id(), vector(), and next(). The method process_id() delivers the process identifier of its event object, while vector() provides the vector clock as produced by Fidge's algorithm (see Section 5.3.2). With the next() function, the successive event on the same process can be selected. Besides these three methods of class event, our algorithm also requires an operation ("<") to compare vector clocks, which effectively checks the precedence condition (see Condition 5-1 on page 139). Additionally, a function to access the first event on a process (= first()) is used.

R

learning rate during one test-and-fix step

p. See The defect potential , where R is the learning rate during one test-and-fix step, while k is the number of applied test-and-fix iterations.

receive

blocking receive event

p. See Blocking receive (receive)

R max

Linpack : maximal possible performance for N max

p. See For the Top 500, Dongarra et al used the version of the Linpack benchmark, that allows the user to scale the size of the problem and to optimize the software in order to achieve the best performance for a given machine. Then, this number reflects "... the performance of a dedicated system for solving a dense system of linear equations." Due to the regularity of the problem the achieved performance is expected to be rather high and give a good correction of a machine's peak performance. By measuring the performance for different problem sizes, the maximal possible performance Rmax for problem size Nmax as well as the problem size N1/2 for half of the performance Rmax can be determined. These three numbers are included in the list together with the theoretical peak performance of each machine Rpeak. Table 2-1 shows an excerpt from the June 1999 list with the data for the top 10 supercomputers. The units for Rmax and Rpeak are given in Mflop/s.

R peak

Linpack : theoretical peak performance

p. See For the Top 500, Dongarra et al used the version of the Linpack benchmark, that allows the user to scale the size of the problem and to optimize the software in order to achieve the best performance for a given machine. Then, this number reflects "... the performance of a dedicated system for solving a dense system of linear equations." Due to the regularity of the problem the achieved performance is expected to be rather high and give a good correction of a machine's peak performance. By measuring the performance for different problem sizes, the maximal possible performance Rmax for problem size Nmax as well as the problem size N1/2 for half of the performance Rmax can be determined. These three numbers are included in the list together with the theoretical peak performance of each machine Rpeak. Table 2-1 shows an excerpt from the June 1999 list with the data for the top 10 supercomputers. The units for Rmax and Rpeak are given in Mflop/s.

send

nonblocking send event

p. See Nonblocking send (send)

 

sequential order of events

p. See The sequential order of events defines, that the ith event on any process p, denoted as , occurred before the i+1th event .

S0 , S1 , ..., Sn

program states

p. See The state data of a program during execution are the main information for debugging. Based on these state data, the user tries to locate the place in the program, where an incorrect program state first occurred. This idea is sketched in Figure 5-1, which contains a diagram describing the state changes of a program. The selected program consists of 8 program states, the initial state S0, the final state S7, and the intermediate states S1 to S6. The final state S7 describes the expected behavior at the program's termination point together with the expected results as defined by the program's specification. Therefore, these state changes may express a valid program execution up to an expected termination point which delivers correct computation results.

s t

state of the program at time t

p. See Then, the state st of the program at time t is the set of the values of its variables st(v), so that: .

s t ( v )

value of variable at time t

p. See For a given view onto a program, let t represent the instant of time at the terminal state of instruction at, and let st(v) be the value of variable at time t.

test

nonblocking receive event

p. See Nonblocking receive (test)

Tflop/s

10 12 floating-point operations per second

p. See The above mentioned requirements have also been visible in the promises of computer manufacturers, who specified their goal for 1995 to be the magical "3 T's" - 1 Tflop/s (1012 floating-point operations per second) in execution rate, 1 Tbyte of main memory, and 1 Tbyte/s interconnection bandwidth [Hoss 99]. This goal lead to the definition of key attributes for such a powerful computing system, which have been specified by Hossfeld as [Hoss 99]:

V

set containing all the variables of a program

p. See In this definition, at is an instruction, whose operation assigns a new value to variable . Although the set V contains all the variables of a program, the user's main interests are those variables, that are modified by instruction at [Grab 99]. Figure 3-1 describes the relation between the instruction and its variables in a small diagram. In [Pine 99], the author introduces another distinction by classifying those variables as "noncurrent", whose actual values as computed by instruction at differ from the values expected by the user.

vector()

method to return vector clock of current event

p. See Algorithm 7-1 provides some more details about the implementation of the cut mechanism. With this simple algorithm, an cut is generated with its breakpoint on each process in the vector cut[]. The breakpointing event is stored in the object breakpoint, which is an instance of class event. Three methods of this class are required in our algorithm, process_id(), vector(), and next(). The method process_id() delivers the process identifier of its event object, while vector() provides the vector clock as produced by Fidge's algorithm (see Section 5.3.2). With the next() function, the successive event on the same process can be selected. Besides these three methods of class event, our algorithm also requires an operation ("<") to compare vector clocks, which effectively checks the precedence condition (see Condition 5-1 on page 139). Additionally, a function to access the first event on a process (= first()) is used.

x

selected input

p. See Let f be an arbitrary mathematical function. Then consider the following task: given an input x and an output y, determine whether or not y = f(x); that is, determine whether or not the input/output pair represents a correct computation of f. This task is called simple checking.

X

set of possible (and valid) inputs x

p. See In this case, the function f(x) describes the behavior of some program P(x) as intended by the user. Therefore it serves as a specification of intended program behavior, which is required for evaluating the correctness of the program as defined in Section 3.2.3. To recapitulate, a program is always correct relative to its specification. This correctness is verified during testing cycles, which apply simple checking iteratively for a set of inputs .

X p

exchange sequence (for hypercube data circulation)

p. See The exchange sequence Xd for data circulation in a hypercube multiprocessor with processes is defined as

 

input/output pair

p. See Let f be an arbitrary mathematical function. Then consider the following task: given an input x and an output y, determine whether or not y = f(x); that is, determine whether or not the input/output pair represents a correct computation of f. This task is called simple checking.

y

computed output or result

p. See Let f be an arbitrary mathematical function. Then consider the following task: given an input x and an output y, determine whether or not y = f(x); that is, determine whether or not the input/output pair represents a correct computation of f. This task is called simple checking.

Y

set of possible outputs y

p. See Formally, this means that the testing approach of Section 4.1.2 has to be extended for nondeterministic parallel programs. Again, we apply the simple checking approach of Blum and Wasserman [BlWa 96] (see Definition 4-1). After choosing an arbitrary mathematical function f(x) that describes the behavior of some program P(x), and a valid input x, we have to determine whether or not y = f(x) in order to confirm if the input/output pair represents a correct computation of f. The problem is, that there may be more than one correct output y for each set of inputs . Thus, for satisfying the test completion criterion with nondeterministic programs it is necessary to compute each output y from a set of possible outputs for each input of the set of valid inputs in X. As a consequence, some means of steering the choices at nondeterministic events have to be provided. The goal is to obtain alternate versions of particular test cases during replay, based on some previously recorded program execution.

 

List Of Figures

1. Introduction 30

2. Area of Application: Parallel Computing 34

2-1 Overlapped design space of clusters, MPPs, SMPs,
and distributed computer systems [HwXu 98] 45

3. Problem Domain: Debugging Program Errors 56

3-1 State of a program at a given instruction 64

3-2 Conceptual overview of notions and terms 66

4. Debugging Parallel Programs - Related Work 78

4-1 Traditional testing and debugging cycle 79

4-2 Iterations of the debugging cycle 81

4-3 Comparison of sequential and parallel program influences [KrVo 98] 86

4-4 Testing and debugging cycle for nondeterministic parallel programs 103

5. Modelling Programs with the Event Graph 126

5-1 State changes in a program 126

5-2 Observability effect (compare [Fidg 93]) 132

5-3 Observability effect for multiple observers (compare [Fidg 93]) 133

5-4 Observability effect for incorrect perceived orders (compare [Fidg 93]) 134

5-5 Logical clock ordering with Lamport's approach [Fidg 93]) 136

5-6 Possible total order for Lamport's approach of Figure 5-5 137

5-7 (Partial) logical vector clock ordering [Fidg 93]) 138

5-8 Event time-zones 143

5-9 Example event graph [Kran 96a] 145

5-10 Time tunnel visualization [Reed 95] 146

5-11 Example event graph visualization of a program execution 147

6. Example: A Simple Event Graph Model 150

6-1 Monitoring functionality during program execution 161

6-2 Monitoring activities [Kran 96a] 163

6-3 Simplified event ordering mechanism 168

6-4 Visualization of events observed during program execution 169

7. Basic Debugging Activities with the Event Graph 172

7-1 Basic event graph analysis (communication errors) 179

7-2 Visualization of blocking times 180

7-3 Interactive event inspection (communication errors) 181

7-4 Variable inspection 183

7-5 Array inspection 184

7-6 Errors detected with array inspection 186

7-7 Setting a cut (breakpoints on all processes) 190

7-8 Setting a cut in the event graph display 192

8. Analysis of Nondeterministic Behavior 196

8-1 Nondeterministic receive events 203

8-2 Nondeterministic nonblocking receive events 204

8-3 Nondeterministic receive without effect 206

8-4 Wild card receives and racing messages 211

8-5 Distance doubling: incorrect nondeterministic executions 212

8-6 Testing and debugging in case of nondeterminism 218

8-7 Placement of exchange cut for event manipulation 220

8-8 Cut for exchange (above) and cut for breakpointing (below) 221

8-9 Placing cuts for event manipulation 223

8-10 Possible program executions (resulting from Figure 8-9) 224

9. Automatization of Nondeterminism Analysis 228

9-1 Four executions of a nondeterministic all-to-one communication program 229

9-2 Three executions of a nondeterministic data circulation program 233

9-3 Executions of a nondeterministic distance doubling program 236

9-4 Tree of executions 237

9-5 Testing cycle for automatic nondeterminism analysis 239

9-6 Complete set of executions of distance doubling program 242

9-7 Control- and data-flow analysis and event graph [Kran 97b] 248

9-8 Probability of message arrival 249

9-9 Results on the nCUBE 2 multiprocessor [Kran 99a] 251

9-10 Cumulative results on the nCUBE 2 multiprocessor [Kran 99a] 252

9-11 Results on the Origin 2000 without process placement [Kran 99a] 253

9-12 Origin 2000 hypercube architecture [Kran 99a] 253

9-13 Results on the Origin 2000 with process placement 254

10. Advanced Event Graph Techniques 258

10-1 Complex event graph displays observed in massively parallel programs 259

10-2 Detecting loop deviations 264

10-3 Process grouping 266

10-4 Process extraction and hiding 266

10-5 Basic event graph patterns 269

10-6 Simple event patterns (round robin, tree) 270

10-7 Scalable patterns 271

10-8 More complex pattern 272

10-9 Detecting almost correct patterns 273

10-10 Debugging and checkpointing 275

10-11 Process isolation 275

10-12 Combined on-line abstraction 277

11. Summary and Conclusions 280

12. References 284

List of Tables

1. Introduction 30

2. Area of Application: Parallel Computing 34

2-1 Excerpt from the Top 500 supercomputer list [Dong 99] 40

2-2 Top 500 position N=500 [Dong 99] 42

2-3 Top 500 position N=1 [Dong 99] 42

2-4 Top 500 supercomputers: share of architectures over time [Dong 99] 45

3. Problem Domain: Debugging Program Errors 56

3-1 Reliability of commercial UNIX utilities 74

4. Debugging Parallel Programs - Related Work 78

5. Modelling Programs with the Event Graph 126

5-1 Event types and event data 129

6. Example: A Simple Event Graph Model 150

6-1 Example places for event occurrence 155

6-2 Example event record structure for MPI_Isend 163

6-3 Example data of simplified event ordering mechanism 166

7. Basic Debugging Activities with the Event Graph 172

7-1 Communication event data 176

8. Analysis of Nondeterministic Behavior 196

8-1 Receive parameters (standard case) 202

8-2 Receive parameters (non-overtaking case) 209

8-3 Example for record&replay mechanism 216

9. Automatization of Nondeterminism Analysis 228

9-1 Number of executions corresponding to number of processes 230

9-2 Experimental results for 2 iterations of distance doubling 237

9-3 Example finger-print file 242

10. Advanced Event Graph Techniques 258

10-1 Classification of abstraction mechanisms 263

11. Summary and Conclusions 280

12. References 284

Introduction

Introduction

Mankind considers itself as the toolmaker in the Theory of Evolution and consequently the computer is often considered as the most superior tool developed by mankind1. Today computers are part of everyday life, and it would be difficult or probably impossible to continue living if computers would somehow be removed from our way of life.

Computers are accepted and ubiquitous2. They occur in more and more areas, occupying regions which we may not have thought about applying a "calculating machine" before. Obvious examples are the Internet in every household, personal digital assistants (PDAs), and various kinds of entertainment machines that rely heavily on high processing power. For instance, consider how much computing power is required to produce cinema films today.

Of course, there is also the backside of the medal, and some people believe that there are already too many computers around and many machines are applied for more activities than useful. An example is the Millennium Bug and the related Y2k-discussion (year 2000), where one could ask, why an elevator needs to handle dates and thus may even be affected by the Y2k-problem. More questions about usefulness are raised when people read about refrigerators with display screens and Internet-connection.

Yet, besides "everyday"-computing there is still the traditional field of computing, which is often called "Computational Science & Engineering". In this area, computer simulation is used as the third category of the scientific methodological tripod besides the traditional categories theory and experiment. With computer simulation, theories can be verified without performing expensive and probably impossible experiments. Furthermore, this tripod of science and engineering is often designated as the only stable methodological basis and instrumental laboratory to approach the many existing complex problems, which are crucial to the future of science, technology, and society.

At present, these problems are either unsolved or only insufficiently solved due to the limited existing computing power, and while it might be possible to solve some of them in the near future, it is also obvious that new problems will emerge with even higher demands to computer performance. This group of problems is called the "Grand Challenges", and some examples are climate research and weather forecast, chemical reactions and combustions, astrophysics and cosmology.

As a consequence, there is an ever increasing demand for computing power in this domain, which also drives the high-performance computing research in order to provide faster computers with more memory and disk space. An example is the USA's Department of Energy ASCI Program (Accelerated Strategic Computing Initiative), which seeks to provide Teraflop computing systems far beyond the current level of available performance to meet the needs of nuclear stockpile stewardship. One of the next steps of this program is the ASCI Option White supercomputer, which is being developed by IBM and the Lawrence Livermore National Laboratory to be installed in the year 2000.

This scalable parallel machine is basically a highly improved version of IBM's "Deep Blue", which defeated the world's leading human chess master in a publicized series of games. It is expected to achieve up to 10 Tflop/s (i.e., 10 trillion floating-point operations per second), will require 17,000 square feet of floor space and draws over 6.2 megawatts for power, cooling, and mechanical equipment - enough power to supply a small town full of air-conditioned homes.

Clearly, the users of this field of computing differ significantly from the users of everyday applications. Their goals are efficiency and feasibility. They know how to program their applications and they know how to target the complete computing system most efficiently at one particular problem. Furthermore, these computers are working days, weeks or even months on solving only one particular problem.

On the other hand, these large scale computers together with their applications introduce new problems, which are established mainly by their complexity, the number of processes, and the amount of data to be processed. Additionally, programming high-performance parallel computers is still rather difficult compared to sequential programming. Often, message-passing computing is compared with assembly language from the old days. However, to achieve a certain level of performance, this low-level programming is often required.

Many problems are observed during the parallel software engineering life-cycle, especially during the debugging and tuning phase. The goal of this phase is to detect and correct critical errors and performance bottlenecks in order to improve the reliability and efficiency of the code, and thus the overall quality of the application. This is achieved by analyzing program states during the execution and interpreting the contents of execution traces. Obviously, the amount of data describing a program state depends on the number of participating processes. Consequently, the size of the traces depends on the number of supervised program states and the overall execution time of the program. Therefore, parallel programs tend to generate huge amounts of trace data and program states from massively-parallel programs can easily exhaust the available memory and thus limit the debugging activities.

The traditional solution to this problem was to down-scale the application to small numbers of processes, smaller problem sizes, and thus shorter execution times. While this may be successful in many cases, it may also be a pitfall with serious consequences for other cases. Parallel debugging tools should provide software-engineers with efficient functionality to cope with the amount of data in debugging traces.

At present, the biggest problem of debugging tools for parallel programs is, that they usually combine several instances of a sequential debugger and integrate them under a common user interface. Thus, they still rely on sequential functionality, which may be insufficient in many cases. Besides that, many tools are base on textual representations which may be inappropriate in many cases to display and manage the inherent complexity of parallel programs. This loads a heavy burden onto the debugging user.

The work described in this thesis differs from existing approaches due to the fact, that debugging activities are based on the event graph model instead of the underlying source code. This parallel program model describes a program's execution by occurring state changes and their interactions on concurrently executing processes, which allows to cope equally with programs based on message-passing and the shared memory paradigm. In addition, this model allows to integrate sophisticated techniques for analyzing small-scale parallel applications as well as massively-parallel programs.

The event graph model also serves as the basis for two other important ideas - automatization and abstraction. With automatic error detection facilities, repeated, time-consuming debugging activities can be carried out by the debugging tool without user interaction, allowing the analysis of even large program traces. Users are relieved from tedious activities and their focus is drawn onto the important results of this analysis, which should cover the most critical sections of the program.

Another attempt to improve the debugging process is automatic investigation of nondeterministic behavior. Nondeterminism in parallel programs can be observed, when successive executions of a single parallel program delivers different results, although the same input data has been provided. This can be attributed to different orders of memory accesses or variations in the message-delivery order and may often be the source of critical bugs and runtime failures.

The goal of abstraction is to reduce the amount of debugging data presented to the users. With ideas like grouping and hiding of processes, extraction of failures, and pattern matching, the size of the event graph can be reduced without reducing the data in the traces. This allows to display the data without compromising the information contents of the traces. Another idea is the integration of checkpointing mechanisms, which store a snapshot of a parallel program's execution to disk, in order to allow restarting the program from that consistent state. For the debugging cycle this means, that the complete program execution can be divided into smaller time slices, where only some of the slices need to be analyzed.

This thesis is organized as follows. Firstly, we will introduce the target of our approach, which is the high performance computing domain, where supercomputers are trying to solve grand challenge applications. We will briefly describe the currently preferred architectures and how they are programmed. This leads to the general description of debugging parallel programs, and an overview of the current state of work in this area. Most important is the specification of requirements to a parallel debugger, which have not been met by any of the existing tools. We describe the modelling process for the event graph, and how an event graph is generated from the observation of a parallel program run. Based on this simple model, basic debugging activities will be described which have already been implemented in an existing debugging environment. These basic features are improved, leading to extensions required for debugging massively parallel programs, which have been described above as automatization and abstraction.

Notations and conventions

One of the problems of computer science is that often many definitions are used out of their original context with confusing notations and ambiguous meanings, while others are intentionally fuzzy to avoid critical arguing. To cope with this problem, we adapt the notations and conventions as defined by See [HwXu 98]. In particular, readers should not be confused with the shorthand notations for basic units of time, byte and bit respectively. The basic unit of time is second, abbreviated as s, while the basic information units are byte and bit, where one byte is 8 bits. A frequently used workload unit is the number of floating-point operations, abbreviated as flop. A unit for computing speed is the number of floating-point operations per second (flop/s), with the prefix indicating the scale of the operation. Thus, we use 1 Mflop/s as the proper notation for the speed of executing 1 million floating-point operations per second. Consequently, we will avoid using 1 Mflops/s, 1 MFLOPS, or 1 Mflops, even though they are often used in the field of computing. Another remark concerns the logarithmic function, which will always be used with base 2 unless otherwise specified.

Area of Application: Parallel Computing

This chapter introduces the environment for the work described later in this thesis. The problem domain is part of parallel computing, and it is therefore important to describe the reasons for establishing parallel computing hardware, which can be attributed to the largest problems of science and engineering. This area depends on high-performance architectures in order to solve the memory-intensive problems within a certain amount of time and accuracy. We will try to illustrate the term supercomputer and will give an overview of the current situation. Afterwards, we will briefly introduce the software engineering process in order to direct the reader's attention to the field of parallel program debugging.

Huge Problems: Grand Challenges
  • High Performance Computing: Parallel Computing
  • What is a supercomputer?
  • How to program: SMP/MPP

Area of Application: Parallel Computing

The Need for High-Performance Computing

Grand Challenge Problems

Parallel computing is sometimes called the "key enabling technology" of modern computers that allows to fulfil the ever-increasing demand for higher and higher performance at lower production costs, which allows sustained productivity in real-life applications See [Hwan 93]. Therefore, parallel computing is integrated wherever performance is one of the key issues, and parallelism appears in various forms, such as pipelining, vectorization, concurrency, data parallelism, time sharing, multitasking, multiprogramming, multithreading, and distributed computing at different processing levels. Yet, while parallelism is widely available in every computing platform, from powerful game consoles to mainframes in companies, some problems still continue to require more processing power than available.

At the foremost front of problems are the so-called "Grand Challenges" See [Gran 89], See [Gran 92], and See [Wald 95], which can be defined as follows:

Grand challenge problems See [WiAl 99]

A grand challenge problem is one that cannot be solved in a reasonable amount of time with today's computers.

These fundamental problems generate increasingly complex data, require realistic simulations of the processes under study, and demand intricate visualizations of the results See [Gran 99a]. These problems often require numerous large-scale calculations and have significant elements of multi-discipline and multi-geographic collaborations See [Gran 98]. Their solutions will have a broad economic and scientific impact, not only on the computing industry but probably on the human life in general See [Gran 99b]. The following list contains an overview of some problems in the Grand Challenges category as summarized in See [Gran 96], which can only give an impression of today's situation, since yet unrecognized application might emerge soon, for example from the expanding multimedia field See [Hoss 99].

Applied Fluid Dynamics
  • Meso- to Macro-Scale Environmental Modeling
  • Ecosystem Simulations
  • Biomedical Imaging and Biomechanics
  • Molecular Biology
  • Molecular Design and Process Optimization
  • Cognition
  • Fundamental Computational Sciences
  • Grand-Challenge-Scale Applications

Another idea would be to identify the equations, that have to be solved in order to get results for the Grand Challenge problems. These so-called Grand Challenge equations have been collected by See [Gran 99a]:

Newton's Equations
  • Schrödinger Equation (time dependent)
  • Navier-Stokes Equation
  • Poisson Equation
  • Heat Equation
  • Helmholtz Equation
  • Discrete Fourier Transform
  • Maxwell's Equations
  • Partition Function
  • Population Dynamics
  • Combined 1st and 2nd Laws of Thermodynamics
  • Radiosity
  • Rational B-Spline

To be more concrete, we will now give two representative examples of Grand Challenge problems in more detail:

Grand Challenge problem See [WiAl 99]:
Numerical weather prediction (weather forecasting by computer)

The global atmosphere is modeled by dividing it into three-dimensional regions or cells, e.g. cells of size 1 mile x 1 mile x 1 mile which would lead to roughly 5 x 10 8 cells. The conditions in each cell (temperature, pressure, humidity, wind speed and direction, etc.) are computed at time intervals with rather complex mathematical equations and using various conditions from the previous time intervals. If we assume 200 floating-point operations for each calculation, one time step would require 10 11 floating-point operations. In addition, the calculations for each cell have to be repeated iteratively in order to model the passage of time. Thus, predicting the weather over a time period of 10 days using 10-minute intervals would require 10 4 time steps and 10 15 floating-point operations in total. Using a computer that operates at 100 Mflop/s (10 8 floating-point operations per second) would than lead to a computation time of 107 seconds or approximately 116 days to perform this calculation. On contrary, to perform this calculation within 10 minutes would require a computer operating at 1.7 Tflop/s (1.7 x 10 12 floating-point operations per second). If you compare this with the current Top 500 supercomputer list See [Dong 99] as described in Section See The Top 500 Supercomputers, only the world's fastest supercomputer, Intel's ASCI Red, achieves more computational speed for the Linpack benchmark See [Dong 94].

Grand Challenge problem See [WiAl 99]:
Astrophysical N-body simulation
(motion prediction of astronomical bodies in space)

Each astronomical body in space is attracted to each other body by gravitational forces. These are long-range forces that can be calculated with a simple formula (see See [WiAl 99], page 126), and the movement of each body can be predicted by calculating the total force experienced by the body. If there are N bodies, there will be N-1 forces to calculate for each body, or approximately N 2 calculations in total. Thus, if we assume that a galaxy might contain 10 11 stars, we would require 10 22 calculations. Again, these are only the calculations for one time step, while scientists are usually interested in simulating a certain period of time, which means that these calculations will have to be performed iteratively. If we provide a powerful computer, that could perform each calculation within 1μs (10 -6 seconds), it would still take approximately 10 9 years for one iteration of this N 2 algorithm.

Requirements on Computer Systems

These two examples indicate, that there exist problems, which can only be solved with very powerful architectures that may not even exist today. Some of them require systems of unheralded capability See [Sieg 92]. Hence, besides the problems of feasibility, Grand Challenge applications also serve as the representative guides to the development of system software and hardware See [Gran 98]. Furthermore, workstations, however powerful they are or will be in the near future, cannot replace the potential of parallel computers, which are basically built upon the technology of powerful microprocessor chips by tying them together via sophisticated broadband interconnection networks in order to support massive parallelism See [Hoss 99].

The following three requirements manifest the need for parallel computing:

Computational speed
  • Amount of memory
  • Accuracy of results

Clearly, computational speed is the most obvious characteristic required from a parallel computing system. Often, computations must be completed within a "reasonable" time period or have a specific deadline for the computations See [WiAl 99]. For example, the predictions of a weather forecasting program that requires two days to compute next days weather is certainly useless.

A second requirement is the amount of memory consumed by this large and complex problems, which can easily exhaust conventional computing systems. However, by coupling large numbers of processors which have access to their local memory, the total amount of memory can be scaled up. Please note, that the amount of memory also affects the computational speed, because obviously the data to be processed has to be transported from memory to CPU and return. This defines the need for a sufficient degree of interconnection bandwidth.

The last requirement is the accuracy of the results, that have to be computed. Yet, this is strongly connected to the other requirements, simply because the higher the computational speed and the amount of memory available, the higher the degree of accuracy that can be achieved. On contrary, restrictions in the available computing power and/or the available memory certainly reduce the obtainable accuracy, and many actual problems cannot be solved with sufficient degree of accuracy by today computers.

Teraflop Computing

The above mentioned requirements have also been visible in the promises of computer manufacturers, who specified their goal for 1995 to be the magical "3 T's" - 1 Tflop/s (10 12 floating-point operations per second) in execution rate3, 1 Tbyte of main memory, and 1 Tbyte/s interconnection bandwidth See [Hoss 99]. This goal lead to the definition of key attributes for such a powerful computing system, which have been specified by Hossfeld as See [Hoss 99]:

Multiple high-performance commodity priced compute nodes from regular commercial product lines (not special-purpose designs)
  • Hierarchical memory systems (e.g. cache-only memory architectures and distributed shared memory systems with low-latency memory access)
  • Very high performance storage and parallel I/O systems
  • Scalable programming environments and operating systems
  • A universal programming paradigm

Today, four years later than expected, the 1995 goal in execution rate was achieved, which can be attributed to the Accelerated Strategic Computing Initiative (ASCI) project See [ASCI 99]. This $1-billion program launched in 1994 by the US Department of Energy (DOE) seeks to build Tflop/s computer systems within 10 years See [Craw 96]. The participating government laboratories - Sandia National Laboratories, Lawrence Livermore National Laboratory, Los Alamos National Laboratory - collaborate in creating the leading-edge computations modeling and simulation capabilities required for maintaining safety, reliability, and performance of US nuclear stockpile and reducing the nuclear danger See [Gust 98]. Besides nuclear weapons research, simulation of other areas like biotechnology, medical and pharmaceutical research, weather prediction, aircraft and automobile design, industrial process improvement, environment projection, etc. are targeted See [HwXu 98]

The strategies adopted by the ASCI program have two notable features: accelerated development and balanced scalable design. Thus, ASCI aims not just at the peak speed of a particular computer system, but at a total system solution that can deliver sustained application performance at five orders of magnitude better than that in 1994. The scalable design approach defines the following steps See [HwXu 98]:

Focus on high-end platforms for scientific computing
  • Use of commodity off-the-shelf (COTS) hardware and software components
  • Use of massively parallel architectures

A step further than the ASCI program is targeted by the PetaFlops project See [NASA 97]. This project aims at the long term goal to achieve a supercomputing speed of 1 Pflop/s (10 5 flop/s). Recent studies suggest, that such a system could eventually be built in 2007 with exotic technologies and unknown costs. However, the more realistic idea is to build a massively parallel processor with conventional technology and COTS components by year 2015 at a cost of US$ 1 billion See [HwXu 98]. Such a system must effectively exploit massive parallelism in the range of thousands of processors to over a million of threads concurrently. In the following section, we will briefly summarize the historical steps that led to the current situation and have been performed to achieve these goals, before presenting an overview of the current state of supercomputer developments.

Supercomputing

History of Parallel Computing

The history of parallel computing is clearly part of the historical milestones in the development of computers. The earliest machine can be tracked back at least as early as to the Abacus used in China 500 BC See [Hwan 93]. Later, machines capable of performing arithmetic operations were developed in the seventeenth century with the mechanical calculator of Wilhelm Schickhard in Germany 1623, the mechanical counter of Blaise Pascal in France 1642, and the machine of Gottfried Leibnitz again in Germany 1671 See [Haye 88]. These machines remained academic curiosities until the nineteenth century, where fundamental advances were achieved by Charles Babbage with the Difference Engine (1823) and the Analytical Engine (1834), although both remained uncompleted See [Haye 88].

No significant attempts to build general-purpose digital computers were made after Babbage's death until the 1930s, when independently Konrad Zuse in Germany and Howard Aiken in the United States developed their machines. Zuse started with first theoretical ideas in 1934, which lead to the "test models" of mechanical computers Z1 and Z2 in 1936. See [Zuse 76]. The subsequent Z3 binary mechanical computer, accepted as the first fully operational general-purpose program-controlled computer, was completed in 1941 See [Zuse 86]. On contrary, Aiken proposed a decimal mechanical machine in 1937, which finally lead to the Harvard Mark I computer that was completed by International Business Machines Corp. (IBM) in 1944 See [Haye 88].

Since these mechanical computers suffered from limited computing speed as well as cumbersome and unreliable information transmission, electronic machines were proposed as a replacement. The first widely known general-purpose electronic computer was the ENIAC (Electronic Numerical Integrator and Calculator) built in the United States under the direction of John W. Mauchly and J. Presper Eckert in 1946 See [Haye 88]. Associated with this project is also the Hungarian-born mathematician John von Neumann, who described the principle architecture of sequential computers.

Starting in 1945, electromechanical computers can be divided into five generations See [Hwan 93]. The first generation (1945-1954) consisted of vacuum tubes and relay memories, CPUs driven by a program counter (PC) and an accumulator with fixed-point arithmetic, while the second generation (1955-1964) already integrated discrete transistors and core memories, as well as floating-point arithmetic, input/output processors, and multiplexed memory access. During the third generation (1964-74) integrated circuits, microprogramming, pipelining, caches and look-ahead processors already provided the technology and architecture to perform multiprogramming and time-sharing by the operating system, obviously a start for parallel computing.

However, a parallel computer was not a novel idea at that time See [WiAl 99]. In fact, for computing terms it is already a rather old idea, since Gill already wrote about parallel computing in 1958 See [Gill 58]. Similarly, Holland wrote about a "computer capable of executing an arbitrary number of subprograms simultaneously" in 1959 See [Holl 59], while Conway described the design of a parallel computer and its programming in 1963 See [Conw 63]. Around that time, Zuse also developed some ideas concerning parallel processing in his work on the "calculating cosmos" (sometimes also called "calculating space", in German: "Rechnender Raum", See [Zuse 69]), which are nowadays mostly related to cellular automata and massively parallel computers. In 1960 Burroughs announced the D825, a four processor machine that was finally delivered two years later to serve in military command and control applications See [LeRu 93]. Finally, in 1967 the 64-processor ILLIAC IV, which is often called "The First Supercomputer" See [AlGo 94], was designed.

After this initial steps of parallel computing, further technological progress was achieved with developments to support parallelism in many ways. The fourth generation computer systems (1975-90) contained LSI/VLSI and semiconductor memory, multiprocessors, vector supercomputers, and multicomputers. These systems provided multiprocessor operating systems, high-level programming languages, compilers and environments for parallel processing.

The history can be tracked until today with the fifth generation computers (1991-present) which consist of ULSI/VHSIC processors, memory, switches, high-density packaging, and scalable architectures. With these components, massively parallel processors and heterogeneous processing clusters can be built to target Grand Challenge applications. Flynn and Rudd wrote in 1996 See [FlRu 96]: "... the continued drive for higher- and higher-performance systems... leads us to one simple conclusion: the future is parallel".

What is a Supercomputer?

Supercomputer is a term, that is often closely associated with parallel processing. A straight forward definition would identify it as one of the fastest computers available. Obviously, this definition is time dependent and last decade's supercomputers are less powerful than current personal computers See [LeRu 93]. Today, all supercomputers employ parallelism in one way or another to attain some of their high performance See [Quin 87], although it may sometimes be hidden on low levels of the hardware-architecture (e.g. pipelining in the processor).

Another definition of a supercomputer is given by See [Morr 92]:

Supercomputer See [Morr 92]

Computer Technology: 1. Any of a category of extremely powerful, large-capacity mainframe computers that are capable of manipulating massive amounts of data in an extremely short time. 2. Any computer that is one of the largest, fastest, and most powerful available at a given time.

More characteristics are added by See [HwXu 98], who place supercomputers at the top of the pyramid of computer classes, because their sales volume is very low due to their high price. In addition, a supercomputer integrates many resources into a single system, employs cutting-edge technology, and has the highest performance.

Therefore, we can identify the following characteristics for a supercomputer at any given period of time:

Fastest available processing speed
  • Largest possible memory size
  • Biggest physical dimensions
  • Greatest monetary cost

Please note, that these characteristics may be appropriate in most cases, but are also too vague to be exact. Thus, we need some kind of metric to establish a list of supercomputers, that must be based on one of the characteristics mentioned above. Since processing speed is obviously the most important identifier in many cases, we will now describe one approach to rank existing supercomputer architectures.

The Top 500 Supercomputers

Based on their performance, the team around Jack J. Dongarra, Hans W. Meuer, and Erich Strohmeier are maintaining the Top 500 list See [Dong 99], which is published twice a year in June at the Mannheim Supercomputer conference and in November at the Supercomputing conference series since 1993. The objective of this list is to provide manufacturers, users, and potential users with statistics on high-performance computers. The ranking of the machines is based on the Linpack benchmark, which belongs to the group of micro benchmarks in the area of numerical computing (linear algebra) See [HwXu 98]. A benchmark is a testing program that supposedly captures processing and data movement characteristics for a class of applications. The Linpack benchmark was created and maintained by Dongarra at the University of Tennessee See [Dong 94]. Besides its usage as a benchmark, it is still widely used for real computation work, since it provides functions to analyze and solve linear equations and linear least-squares problems for general, banded, symmetric indefinite, symmetric positive definite, triangular, and tridiagonal square matrices.

For the Top 500 , Dongarra et al used the version of the Linpack benchmark, that allows the user to scale the size of the problem and to optimize the software in order to achieve the best performance for a given machine. Then, this number reflects "... the performance of a dedicated system for solving a dense system of linear equations." Due to the regularity of the problem the achieved performance is expected to be rather high and give a good correction of a machine's peak performance. By measuring the performance for different problem sizes, the maximal possible performance R max for problem size N max as well as the problem size N 1/2 for half of the performance R max can be determined. These three numbers are included in the list together with the theoretical peak performance of each machine R peak . Table See Excerpt from the Top 500 supercomputer list [Dong 99] shows an excerpt from the June 1999 list with the data for the top 10 supercomputers. The units for R max and R peak are given in Mflop/s.

Excerpt from the Top 500 supercomputer list See [Dong 99]

N

Manufacturer

Computer

Installation Site

Location/Year

#

Proc.

R max

R peak

N max

N 1/2

1

Intel

ASCI Red

Sandia Natl. Labs

Albuquerque USA/1999

9472

2121.3

3154

251904

66000

2

SGI

ASCI Blue Mountain

Los Alamos Natl. Lab.

Los Alamos USA/1998

6144

1608

3072

374400

138000

3

SGI

T3E 1200

Government

USA/1998

1084

891.5

1300.8

259200

26400

4

Hitachi

SR 8000/128

University of Tokyo

Tokyo Japan/1999

128

873.6

1024

120000

16000

5

SGI

T3E 900

Government

USA/1997

1324

690.90

1191.6

229248

26880

6

SGI

Origin 2000/250

Los Alamos Natl. Lab./ACL

Los Alamos USA/1999

2048

815.1

1024

134400

80640

7

SGI

T3E 900

UK Meteorological Office

Bracknell UK/1997

876

552.92

788.4

-

-

8

IBM

SP Silver

IBM

Poughkeepsie USA/1998

1952

547

1296

244000

58000

9

SGI

T3E 900

Naval Oceanographic Office

Bay Saint Louis USA/1999

812

515.1

730.8

-

-

10

SGI

T3E 1200

UK Centre for Science

Manchester UK/1998

612

509.9

734.4

-

-

The most interesting entries in this table are the number of processors and of course the achieved performance. As can be seen, there are only two machines (N=1 and N=2) with more than 6000 processors, while all other machines are less than 2048 processors. Accordingly, only the first two machines achieve a performance above 2 and 1 Tflop/s respectively, while all other machines are still below this barrier.

In terms of architectures, the Top 500 list contains only the following machines:

One Intel: ASCI Red
  • One Hitachi: SR8000
  • One IBM: SP Silver
  • One SGI: ASCI Blue Mountain
  • One SGI: Origin 2000/250
  • Five SGI: T3E 900/1200

Please note, that the SGI ASCI Blue Mountain machine is actually composed of 48 Origin 2000 machines with 128 nodes each. Thus, it would be possible to identify this machine as a superscale SGI Origin 2000. As can be seen, SGI (Silicon Graphics) is currently the company with the most supercomputer architectures, which can be attributed to the acquisition of Cray Research, one of the traditional supercomputer manufacturers.

Another interesting fact, that is only indirectly given is the achieved level of peak performance, which is 67% for the Intel ASCI Red machine, but only 52% for the SGI ASCI Blue Mountain. The best result compared to the peak performance is reached by the Hitachi SR8000/128 (N=4), which achieves 85% although it consists of the smallest number of processors (128). The worst level of utilization for the Linpack benchmark is obtained with the IBM SP Silver (N=8), that achieves only 42% of its possible peak performance with a very high number of processors (1952).

Moore's Law

The list of Table See Excerpt from the Top 500 supercomputer list [Dong 99] only gives an impression of the situation at present, while an overview over a few years confirms, that the term supercomputer depends on the developments of the computer architecture in general. For example, while in June 1993 the SNI VP200 with 0,4 Gflop/s was a supercomputer, today's last position in the supercomputer list is occupied by the SGI T3E900 with 24.73 Gflop/s on 38 processors (see Table See Top 500 position N=500 [Dong 99]). This speedup factor of 62 is even bigger than for the number one supercomputer, which offers a speedup of almost 36 for the same time period (see Table See Top 500 position N=1 [Dong 99]).

 

Top 500 position N=500 See [Dong 99]

Date

Performance

Computer

Installation Site

06/93

0,4 Gflop/s

SNI VP200

Debis Munich, Germany

06/96

3,3 Gflop/s

Convex SPP 1000/XA-32

Uni Michigan, USA

06/99

24,73 Gflop/s

SGI T3E900

CIEMAT, Spain

Top 500 position N=1 See [Dong 99]

Date

Performance

Computer

Installation Site

06/93

59,7 Gflop/s

TM CM-5/1024

Los Alamos, USA

06/96

220,4 Gflop/s

Hitachi SR2201/1024

Uni Tokyo, Japan

06/99

2.121,3 Gflop/s

Intel ASCI Red

Sandia, USA

The phenomenon of exponential improvement was already observed as early as 1979 by Intel cofounder Gordon Moore See [Moor 96], and can be interpreted in three ways See [HwXu 98]:

Every 18 to 24 months delivers

the double number of transistors on a microchip for the same price of the chip.
  • the double speed of the microprocessor for the same price of the chip.

while at the same time

the price of the microchip drops about 48% assuming the same processor speed and on-chip memory capacity.

One of the consequences of Moore's law is, that predictions about "the future supercomputer" are difficult, because technology changes constantly and will be completely different within a few years See [Simo 99]. Considering the performance of the Top 500 list, Moore's law would have expected an increase in performance of about 8 to 16 for the 6 years between June 1993 and June 1999. This has clearly been outdone by reality, with a speedup factor of 36 for that period. For the future, the ASCI program expects to extend from 1 Gflop/s in 1994 to 100 Tflop/s in 2004, which calls for a performance increase of 10 5 within this 10 years See [HwXu 98]. Yet, according to Moore's law, such a 10 5 -fold performance increase with constant hardware costs would require 26 to 32 years, which implies that 100 Tflop/s could not be achieved until year 2025.

Please note, that benchmarking is only one popular method to characterize the performance of applications and systems. Other performance metrics may be more appropriate for other areas of computing. Some well-known benchmarks are SPEC, NAS, STAP, LMBENCH, STREAM, PARKBENCH, Splash, TPC (Transaction Processing Performance Council) See [HwXu 98], and SKaMPI See [Reus 98].

Supercomputer Architectures

After showing one of the most well-known approaches to identify the world's most powerful supercomputers and their current representatives, we will now analyze their hardware architecture and operation. Of course, there are different possibilities to characterize these machines and clearly, there are several different classification schemes See [Volk 97]. One of the best-known schemes is the Flynn classification See [Flyn 72]See [Flyn 96], that divides computers by identifying the number of instruction streams and the number of data streams. The four basic classes of Flynn are:

SISD: Single Instruction stream, Single Data stream
  • SIMD: Single Instruction stream, Multiple Data streams
  • MISD: Multiple Instruction streams, Single Data stream
  • MIMD: Multiple Instruction streams, Multiple Data streams

Another approach is the Erlanger Classification Scheme (ECS) developed by Händler See [BoHä 83]. It describes parallel computer architectures based on the number of control and processing elements that are available for problem solving. The notation divides the architectures into pipelining and concurrency principles, as well as mixtures between these two categories See [Quin 87].

These first two approaches to classify parallel systems are still widely used. Yet, today another way of distinguishing supercomputers seems to be equally appropriate. Kai Hwang and Zhiwai Xu identify five different classes of parallel architectures See [HwXu 98]4:

PVP - Parallel Vector Processors
  • SMP - Symmetric Multiprocessors
  • MPP - Massively Parallel Processors
  • DSM - Distributed Shared-Memory Machines
  • COW - Clusters of Workstations

Most of today's parallel machines are built with commercially available, off-the-shelf commodity hardware and software components. The only exception are PVP machines, where many building blocks are custom-made. Such systems contain a small number of powerful custom-designed vector processors, each capable of at least 1 Gflop/s performance. Examples for PVPs are the Cray C-90, the Cray T-90, and the NEC SX-4.

In contrast to the PVPs, SMPs apply commodity microprocessors with on-chip and off-chip caches, and access a shared memory through a high-speed bus architecture. These systems are symmetric, because every processor has equal access to the shared memory, the I/O devices, and the operating system services. A problem of SMP architectures is scalability, which is limited mainly due to the use of a centralized shared memory and a bus or crossbar system interconnect, which are both difficult to scale once built. Examples for SMPs are the IBM R50, the SGI Power Challenge, and the DEC Alpha server 8400.

The last three groups of parallel architectures, MPPs, DSMs, and COWs, are distributed memory machines, which allow to overcome the limitations of scalability and to take advantage of higher parallelism available in some applications of scientific computing, engineering simulation, signal processing, and data warehousing.

The group of MPPs generally refers to very large-scale computer systems. They apply commodity microprocessors with distributed memory as processing nodes, which are interconnected with a high communication bandwidth and low latency network, that can be scaled up to hundreds or even thousands of processors. Programs on these machines consist of multiple processes, each with its private address space, that communicate via message-passing routines. Synchronization takes place through blocking message-passing operations instead of shared-variable synchronization operations.

The DSM machines provide special hardware and/or software extensions, that provide the application user with a single address space although the memory is physically distributed among different nodes. Examples for DSM machines are the Stanford DASH architecture and the Cray T3D.

The last group of distributed memory machines is based on the cluster concept as implemented in Digital's TruCluster, IBM's SP2, and the Berkeley NOW. In a certain sense, Clusters are the low-cost variation of MPPs, which makes them an interesting choice for many research institutions (See [Buyy 99a] and See [Buyy 99b]). One important characteristic of COWs is that each node is a complete workstation (probably minus peripherals like monitor, keyboard, etc.) which is connected to the other nodes through a low-cost commodity network (e.g. Ethernet, FDDI, ATM, or Myrinet See [Pryl 99]). In contrast to MPPs, the network interface is only loosely coupled to the I/O bus of a node, and each node in a COW usually contains a local disk as well as a complete operating system, while MPPs may not provide a local disk at each node and only a microkernel.

However, it has to be noted, that the boundaries between these classes of supercomputer architectures are fuzzy, since it is often difficult to assign a selected hardware architecture to only one group. For example, the node of a COW may be an SMP itself, while DSM machines can also be implemented with software techniques on networks of workstations. Other examples are highly scalable clusters, which are often considered MPPs. Therefore, SMPs, MPPs, DSMs, and COWs are four overlapped architectural concepts as shown in Figure See Overlapped design space of clusters, MPPs, SMPs, and distributed computer systems [HwXu 98] See [HwXu 98]. For a more detailed discussion about the differences and similarities between these architectures, please take a look at See [HwXu 98] pages 31 to 36.

Another remark concerns the possibility extend the clustering idea into large-scale systems by connecting distributed supercomputer systems into so-called "Metacomputers" See [Brun 99], "Metasystems" See [Grim 98], "Computational Grids" See [FoKe 99], or the "Information Power Grid" See [LeKu 99]. A well-known example is the multi-institutional Globus project See [FoKe 98], which tries to enable computational grids that provide pervasive, dependable, and consistent access to high-performance computational resources for distributed users and resources. Such systems are composed from loosely coupled hardware architectures in order to perform computation on one particular problem See [FoKe 99]. Therefore, they are often addressed as the most-powerful computing platforms around.

Massively Parallel Processing

Analysis of the Top 500 list according to the classification of See [HwXu 98] shows, that the most important supercomputers are certainly MPPs, followed by SMPs. Yet, this was not always the case, as is indicated by Table See Top 500 supercomputers: share of architectures over time [Dong 99]. In fact, in June 1993 vector processor machines made up over two thirds of the most powerful machines worldwide. During the last 6 years the situation changed and PVPs vanished completely from the Top 500 list, which is now determined by MPPs (and clusters of SMPs arranged as MPPs). This is also reflected by the top 10 entries, which are all considered massively parallel architectures.

Top 500 supercomputers: share of architectures over time See [Dong 99]

Date

MPPs

SMPs

PVPs

Others

06/93

156

31,2%

 

 

344

68,8%

 

 

06/96

265

53,0%

101

20,2%

134

26,8%

 

 

06/99

358

71,6%

118

23,6%

 

 

24

4,8%

An example is the current leader of the supercomputing list, the Intel ASCI Red machine installed at the USA's Sandia National Labs. This machine is a successor to the Intel Paragon MPP system and follows therefore the traditional distributed-memory MPP approach, with small nodes, a tightly coupled network interconnection, and a microkernel operating system (LWK - Light-weighted kernel) on the compute nodes. The compute nodes consist of two Pentium class processors with up to 256MB of memory each.

Other example MPP architectures in the top 10 are the SGI Cray T3E series, which shipped in 1995 as a successor to the Cray T3D. This machine is considered a distributed shared-memory multiprocessor, where the processing elements (6 to 2048 DEC Alpha 21164 microprocessors) are interconnected by a 3D bidirectional torus network for fast communication. The processing elements apply the UNICOS/mk operating system and can address between 64 MB and 2 GB memory space. The main difference between the T3E-900 and the T3E-1200 is a faster processor clock (450 Mhz vs. 600 MHz Alpha processors), thus offering higher processing speeds for the machines (900 Mflop/s vs. 1200 Mflop/s) See [HwXu 98].

Another approach to constructing MPP-class architectures is offered by the SGI Cray Origin 2000. Again, the memory is physically distributed among the processing nodes, but it is globally accessible with hardware support for cache coherency (while the T3E supports distributed shared memory without hardware support for cache coherency). Another difference are the processing nodes on the Origin 2000, which consist of 2 MIPS processors combined on one board. In addition, the interconnection network resembles a hypercube architecture. SGI calls the Origin 2000 approach scalable shared-memory multiprocessing (S 2 MP) See [LaLe 97].

However, the differences between the machines are blurry. For example, the Hitachi SR8000 Super Technical Server See [Hita 99] is intended as a replacement system for former vector supercomputers. Yet, the traditional vector processing units have been replaced with pseudo vector processing facilities, that allow to fetch data from memory in a pipelined manner without stalling the succeeding instructions in order to fed the data effectively into the arithmetic units. Each of the 128 SR8000's nodes is formed from multiple RISC microprocessors, which operate on a single address space.

Another example is IBM's SP Silver at position 8 of the Top 500 list. This machine is at present the most powerful SP (Scalable Powerparallel) architecture, which became famous by defeating World Champion Garry Kasparov in a chess match See [HaGa 97]. While it is often considered an MPP, it is built upon a cluster architecture with a proprietary high performance switch as the communication network See [HwXu 98].

This small discussions about only the top 10 supercomputers already revealed some of the problems when analyzing parallel machines. A clear classification is difficult if not impossible to achieve. In the next section we will try to identify the paradigms needed to program these architectures, and we will eventually see, that similar problems occur on the software side.

Parallel Programming

Basic Approaches

Building powerful machines as mentioned above is only one part of the story. Obviously, such supercomputers must be programmed and dedicated software must be developed to obtain the desired performance. Software development for any system is difficult and time consuming, but it is even more difficult when dealing with the details of parallelism See [ElRe 93]. Quinn suggests the following three ways to design a parallel algorithm to solve a particular problem See [Quin 87]:

Detect and exploit any inherent parallelism in an existing sequential algorithm
  • Invent a new parallel algorithm
  • Adapt another parallel algorithm that solves a similar problem

While sequential programming is already widely accepted and many people have been practicing it for a long time, parallelism is sometimes viewed as a rare and exotic subarea of computing, interesting and intellectually challenging but of little relevance to the average programmer See [Fost 94]. Furthermore, it is generally agreeable that parallel programming is in a sorry state, and we can identify the following reasons for its inferiority See [HwXu 98]:

Parallel programming involves all issues of sequential programming plus many more issues for managing the parallel complexity.
  • Sequential programmers rely only on the von Neumann model, where a processor executes one sequence of instructions, while there exist many different parallel programming models See [Fost 94].
  • Software environments (compilers, debuggers, profilers) are much more advanced for sequential program development.
  • Sequential programming is more mature, with a huge base of accumulated knowledge, including mistakes discovered and lessons learned in the past.

Foster reports about studies of trends in applications, computer architecture, and networking which indicate that this view is changing. He believes that "parallelism is becoming ubiquitous, and parallel programming is becoming central to the programming enterprise" See [Fost 94]. This view is supported by Hwang and Xu, who describe recent advances in this area See [HwXu 98]:

Many parallel algorithms have already been developed or are currently being investigated.
  • A small set of simple parallel algorithmic paradigms has emerged and is becoming accepted (e.g. phase parallelism, divide and conquer, pipelining, etc.).
  • Native models are converging either towards the single-address space, shared-variable model for PVPs, SMPs, and DSMs, or towards the multiple-address space, message-passing model for MPPs and COWs.
  • High-level models are converging toward three standard models, data parallel, message-passing, and shared-variable.

Before discussing these models, we will briefly identify the fundamental requirements for parallel software. Regardless of the architecture of the target parallel computer the challenge of parallel programming is to harmoniously coordinate several program segments while assuring correctness as well as high performance See [Lewi 93]. Due to the fact that good performance can only be achieved if an algorithm fits the architecture See [Quin 87], extensive interaction between algorithm designers and computer system architects is required See [Sieg 92]. Thus, it is necessary to investigate parallel programming paradigms in connection with the intended hardware architectures.

Possibilities for Parallelization and Parallel Programming Models

In principle, every parallel program is simply a collection of concurrently executing processes, which may be connected to one another through either message-passing or access to shared data. If these processes operate independently of each other, the program is called "trivially parallel" See [Lewi 93]. This can be desired if a high-performance parallel computer is viewed as a huge workstation with a single system image. Then, users may want to take advantage of the enhanced processing and storage capabilities to run multiple sequential jobs. This mode of operation is called "throughput processing" See [HwXu 98] or the serial program parallel system (SPPS) model See [Pfis 97].

Although the "trivially parallel"-model is certainly important for many different applications, it is less interesting in the scope of this work because the actual program is still sequential. More challenging are programs with processes that communicate and coordinate among themselves. For such programs, we distinguish the following characteristic items:

Programming model
  • Data-parallel programming
  • Task-parallel programming
  • Parallelization approach
  • Implicit parallelism
  • Explicit parallelism

The parallel programming model identifies, how a program synchronizes its parallel parts. In the data-parallel or Single-Program-Multiple-Data (SPMD) model, the execution of the code is driven by the program's data, which means that the same code is executed on different data domains See [Hwan 93]. Therefore, the application's data are partitioned into separate sets, and identical operations are performed on all sets at the same time whenever possible. Obviously, the data-parallel approach is only meaningful, if large arrays of data are processed, e.g. with vectors and matrices See [Lewi 93]. This limitation does not exist in the task-parallel (also called control-parallel) or Multiple-Program-Multiple-Data (MPMD) model, which supports more than one thread of control. A single program can perform different operations in the same time interval, and different programs may operate on different data within one application. Additionally, the order in which concurrent parts are executed is governed by program control rather than by the availability of data.

Both types of programs, SPMD and MPMD, can be executed on MIMD architectures See [Flyn 72], because different instructions can be executed by different processes at the same time. However, SPMD programs may also be more restrictive, if they permit not only multiple processes executing the same code, but also require them to execute the same instruction at the same time. Therefore, such restrictive SPMD programs may also be executed on SIMD architectures See [HwXu 98].

In addition to the parallelism model, the parallelization approach describes different methods to develop corresponding programs for parallel machines. We distinguish two different ways of expressing the programs parallelism in the source code, implicit and explicit parallelism See [Bemm 92]. Implicit parallelism means that the program's parallelism is hidden from the programmer, who applies accustomed abstractions as available in sequential programming languages. The explicit parallelization approach is characterized by the active role of the programmer in exploiting the parallelism; the applied programming models offer explicit constructs for describing the concurrent execution as well as communication and synchronization between the processes.

Please note, that it is often difficult to draw a clear line between the implicit and the explicit parallelization approach. In fact, for some people implicit parallelism would mean to write purely sequential code and leave all the parallelization work to the compiler and the optimizer (see See [Bode 95]). For them, even adding declarative statements to the code would mean to introduce some explicit means of parallelization. This is for example the case in HPF-dialects (see below), which offer declarations to improve the compiler's results during data partitioning. Yet, these declarations are added in comments, so that every HPF program is still applicable to sequential machines. For that reason, the usual interpretation of explicit parallelism is, when the control flow of the program is actually determined by dedicated statements (e.g. for sending and receiving messages).

Another more practically oriented approach to characterize parallel programming is provided by Foster See [Fost 94] as well as by Hwang and Xu See [HwXu 98]. While Foster concentrates on the explicit parallelization approaches, Hwang and Xu also include the implicit approach. In concrete, they identify the following four models that are used on todays existing parallel computers:

Implicit
  • Data-parallel
  • Message-passing
  • Shared variable

Although this list is somehow contrary to the classification described above and mixes some areas, it is well-established in the parallel computing community due to its relevance in practice. We will also stick to these items and describe them in more detail below.

Approaches to Implicit Parallel Programming

In the implicit model, parallelism is automatically exploited by the compiler and the run-time support system See [HwXu 98]. The programmer does not have to explicitly specify parallelism using special language constructs, compiler directives, or library function calls. The underlying system is concealed from the user, which shields programmers from the increased complexity of parallelism and shifts the burden to the compiler writer See [ElRe 93]. The most popular approach of implicit parallelism is automatic parallelization of sequential programs. Based on dependence analysis a parallelizing compiler or restructuring compiler performs a suite of program transformation techniques on a sequential code in order to derive the parallel program.

The problem of parallelizing compilers is that their effectiveness is often questioned. Several performance studies indicate, that the generated code derived from real sequential programs is often not very effective See [BlEi 92]See [HwXu 98]. Thus, a common approach is to include some user knowledge into the parallelizing process by providing the compiler with more information. This can be done with compiler directives or based on interaction between the compiler and the programmer. Another idea is run-time parallelization, where some of the parallelism, that cannot be recognized during compile time, is revealed at runtime.

The implicit parallelism approach is an interesting research issue due to its many advantages. The huge repository of existing sequential software can be reused for parallel computers. Programmers familiar with sequential languages do not need to know about parallel programming or parallel architectures to exploit their parallelism. In addition, it is easier to understand the semantics of implicit programs than of explicit ones. The resulting programs are usually portable among different parallel computers.

However, there are also many people who believe, that explicit parallelism is needed and comparable results can never be reached by parallelizing compilers. Their belief stems partially from the achieved performance of current automatic compilers See [HwXu 98], and partially by the theory of Bernstein, who showed that the general problem, whether two segments of a sequential program can be executed concurrently, is undecidable See [ZiCh 90]:

Bernstein's theorem See [Bern 66]

In the general case it is undecidable whether two arbitrary operations in an imperative sequential program can be executed in parallel.

This theorem implies, that there is no automatic technique, neither compile time nor run time, that can fully exploit all potential parallelism in a sequential program. As a consequence to overcome this theoretical limitations, two solutions are suggested: Either apply explicit parallelism as discussed below or use a programming language which makes parallelism recognition easier and abolish the imperative style altogether. The latter is an important research issue, which includes besides others the following computational styles See [HwXu 98]:

Functional programming
  • Logic programming
  • Computing-by-learning
  • Object-oriented programming

There are some advantages of these approaches, for example a purely functional program is coherent, meaning it is guaranteed to be determinate, deadlock-free, and compositional, and it is abstract, meaning there is no need to explicitly specify parallelism, communication, synchronization, scheduling, etc.

However, there are also advantages in the imperative style. Firstly, it is mature, time-proven and familiar to most programmers. Secondly, it is derived from the von Neumann model with commodity hardware and can therefore be implemented efficiently. Thirdly, the imperative style dominates programming of today's sequential and parallel computers. As a consequence, we will now focus on the explicit parallel programming models with imperative computational style.

Explicit Parallel Programming in Practice

In terms of explicit parallel programming models, many different approaches and languages have been proposed in the past, with different intentions and advantages. From a practical point of view especially three groups of languages became dominant over all others. These groups and some of their representatives are See [HwXu 98]:

HPF-like: Fortran 9x, HPF, CM C*
  • Message-passing: PVM, MPI, SP2 MPL, nCUBE 2
  • Shared-variable: X3H5, OpenMP, Cray Craft, SGI Power C

The HPF-like group (often called the data-parallel languages) assumes a single address space, removing the need for data allocation. However, many of these languages allow to apply data allocation with explicit directives to improve the memory accesses. In addition, HPF-like programs are single-threaded, which ensures that they are determinate and deadlock-free, and loosely synchronous, because there is no need for explicit synchronization.

For the HPF-like group, there are two influential standard languages: Fortran 9x and High-Performance Fortran (HPF). Programs written in these languages are portable on a wide range of machines (MPPs, DSMs, SMPs, COWs, and PVPs). Furthermore, there exist machine-specific languages, such as C* for the CM-5 (Thinking Machines Connection Machine), which are not standardized and thus not portable.

In message-passing programs, users must explicitly allocate data and workload to processes. Such programs are multithreading because multiple threads of control exist that may possibly execute different code. Therefore, both control parallelism (MPMD) and data parallelism (SPMD) are supported. In order to reduce designing and coding complexity, many parallel message-passing programs are implemented as SPMD programs, where only a single code is executed on all processes. (Consider an MPP with over 1000 processes, where each of them would execute a different program!) Furthermore, every MPMD program can be formulated as an SPMD program.

These programs execute asynchronously, which requires special operations to synchronize processes for correct execution order (e.g. barrier, blocking communication). Due to separate address spaces of the processes, the data variables in one process are not visible to other processes. Explicit interaction operations, including data mapping, communication, synchronization, and aggregation must be resolved by the user. This introduces the "owner-compute" rule See [HwXu 98], where the process that owns a piece of data performs the computations associated with it.

An advantage of the message-passing over the HPF-like languages is its flexibility. Yet, it still lacks support to manage a global data structure efficiently. In addition, message-passing programs usually exploit large-grain parallelism, which can be attributed to high communication latencies occurring during message transfer. For programming, there exist two widely used libraries, PVM and MPI, which will be discussed in more detail in Section See Message-Passing Programs. Both of them are implemented in virtually all types of parallel computers, thus ensuring a high level of portability for message-passing programs.

The last of the three explicit parallel models is the shared-variable model, often called shared-memory model. It is comparable to the HPF-like approach because it offers a single address space with global naming where all shared data reside. Consequently, there is no need for data allocation. Furthermore, it is similar to the message-passing model due to its multi-threading and asynchronous behavior, which again requires explicit synchronization methods for maintaining correct execution order. Yet, communication can take place implicitly through variable writes and reads in the shared memory.

A widespread misconception is that the shared-variable model is generally and in any case better suited for fine-grain parallelism than the message passing model See [HwXu 98]. This is not true because as a model it can be implemented on any parallel architecture, be it PVP, SMP, DSM, MPP, or COW. On contrary the only requirement for fine-grain parallelism is the existence of efficient communication and synchronization methods on the underlying platform. This includes, that a shared-variable program may incur higher interaction overhead and run more slowly than a message-passing program on a COW, an MPP, or even on an SMP.

The advantage of the shared-memory model is that it is assumed to be easier to program than message-passing programs See [Karp 87]. Some people consider message-passing as the low-level approach to parallel programming, while shared-memory is on a higher-level. While this popular statement is not wrong, it is also not an established fact backed by scientific research. Fact is that there are many more existing applications for SMPs and PVPs, than there are for MPPs and COWs See [HwXu 98]. Developing an efficient parallel program which is loosely synchronous and operates in a regular communication pattern does not necessarily favor the shared-variable approach, but may also be implemented with message-passing. The shared-variable model has an edge for irregular parallel applications, and wherever global pointer operations are required. Shared-variable programs do not require explicit partitioning and data allocation, although their absence may affect performance badly.

One big disadvantage of shared-memory programs is, that subtle synchronization mistakes may probably occur easily and more frequently than in message-passing programs (see Section See Errors from Process Communication and Synchronization) See [Karp 87]. In addition, there is no widely accepted standard yet, that allows to port programs easily between architectures See [HwXu 98]. Several solutions to this problem are in the line, most notable the three shared-memory programming standards ANSI X3H5, OpenMP, and POSIX Threads.

The ANSI X3H5 standard has not gained wide acceptance, but has substantially influenced the design of several commercial shared-memory languages. In fact, no commercial shared-memory systems adhere to X3H5 completely, although some parts are supported See [HwXu 98]. The X3H5 definition consists of a conceptual standard programming model and bindings for C, Fortran 77, and Fortran 90. Its main ideas are parallelism constructs, parallel blocks, parallel loops, implicit barriers, thread interaction, and synchronization.

Furthermore, it was the basis for the OpenMP standard See [Thro 99], which is now being supported by a group of hardware and software vendors, such as DEC, Intel, IBM, Kuck&Associates, SGI, Portland Group, Numerical Algorithms Group, US DOE ASCI program, etc. See [HwXu 98]. It is a set compiler directives, library routines, and environment variables that collectively provide a shared-memory API (Application Programming Interface) for different platforms and operating systems. The key features of OpenMP are compiler directives, called orphan directives, to enable parallel execution of major portions of the code with small code modifications. Besides that, OpenMP provides a set of run-time library routines with associated environment variables, which are used to control and query the parallel execution environment, to provide general-purpose lock functions, set execution mode, etc.

The POSIX Threads, also known as Pthreads, stands for the official IEEE POSIX 1003.1c-1995 thread standard, which was established by the IEEE standards committee. It offers function for thread management (create, exit, join,...), thread synchronization (lock, unlock, wait, post,...), and others. It is widely available as an influential standard for multithreading at the Unix operating system layer. An example are Solaris Threads, which offer many common features of Pthreads and are a mature commercial product.

Message-Passing Programs

After providing an overview of current parallel programming approaches, we will now briefly describe the message-passing standards MPI and PVM. We choose message passing because it is probably the most widely used parallel programming model today See [Fost 94]. In addition, the work described in the later sections builds upon message-passing systems.

The group of existing message-passing systems is very large, for example nCUBE's Vertex, Thinking Machines CMMD, Chameleon, Intel's Nx, P4, PICL, PARMACS, and Zipcode. Most of them have been implemented explicitly for one dedicated hardware architecture or series. Thus, programs are not very portable. This was noted before and inspired Geist, Heath, Peyton, and Worley at Oak Ridge National Laboratory to implement PICL, the Portable Instrumented Communication Library See [Geis 90]See [Worl 92]. A similar idea was followed by Butler and Lusk at Argonne National Laboratory with the p4-library (Portable Programs for Parallel Processors) See [BuLu 94]. Both libraries offer function calls for programming parallel machines, with a focus on portability between different hardware architectures. The difference between PICL and p4 is, that the former defines a generic message-passing interface, while the latter implements various features for shared-memory and message-passing programming.

While the message-passing systems described above where intended mainly for real parallel architectures, the Parallel Virtual Machine PVM has been started in 1989 as a research project on heterogeneous distributed computing at the Oak Ridge National Laboratory See [Geis 96]. Its initiators Geist and Sunderam See [GeSu 92] have designed the notion of a "virtual machine", which is a set of heterogeneous hosts connected by a network that appears to the user logically as a single large parallel computer. The exchange of data between the parallel tasks was accomplished using simple message-passing constructs. Two goals defined the original path of PVM See [Sund 90]: Firstly, it should provide a simple interface that is easy to use and understand. Secondly, performance was not considered as important as portability, because communication across the Internet was slow anyway and some of the research goals were scalability, fault tolerance, and heterogeneity. The use of PVM grew rapidly worldwide up to its current status as a pseudo-standard.

In contrast to PVM, which has started and continues to evolve as a research project, the Message Passing Interface library MPI has been specified by a committee - the MPI Forum - of about 40 high performance computing experts from research, academics, and industry around 1993 See [MPI 95]. Its intention was to define a standard message-passing API, that allows to write portable parallel applications See [Snir 98]. The vendor participants of the MPI Forum intended to implement it for their particular parallel computer architecture. Since hardware vendors always try to provide the highest possible performance, this was also one of the main goals of the MPI design, and it is expected that MPI is faster than PVM on dedicated MPP architectures See [Geis 96]. Furthermore, MPI provides the ability to specify a logical communication topology and a richer set of point-to-point and collective communication operations, which may be important, if an algorithm is executed on a machine with a special communication feature. Yet, even though MPI's functionality is very rich (Version 1.1 contains around 120 routines), it has been noted before, that successful programs could be written with only six functions See [Fost 94]See [Grop 94].

There exist several free implementations of the MPI standard, including MPICH from the Argonne National Laboratories and the Mississippi State University, LAM from the Ohio Supercomputing Center, CHIMP from the Edinburgh Parallel Computing Center, and UNIFY, a subset of MPI within a PVM environment, from Mississippi State University. In addition, there are specialized versions for high-performance networks like Myrinet See [Pryl 99] and vendor implementations for almost every multiprocessor architecture currently available. Similar to PVM all the function calls are available for both C and Fortran See [WiAl 99].

Each of these message-passing libraries has its unique strength and programmers often wonder, if they should stick to the existing de facto standard, PVM, or whether they should shift their codes to the MPI standard. Yet, this question cannot be answered without further examination of the application, its functional requirements, and the running environment.

Summary: Programming the World's Most Powerful Machines

After the brief discussion of existing high-performance machines and some explanations about how to program the wide variety of parallel architectures, we will now summarize the current situation. In Section See Massively Parallel Processing we have shown, that the most powerful computers today are Massively Parallel Processors with up to several thousands of processors. Their share in the Top 500 list is currently more than twice that of the second contenders, the Symmetric Multiprocessors. Yet, outlooks about future supercomputer architectures predict an increased amount in clustered SMPs (See [Simo 99] and See [Geof 99]), which means that they must be considered as potential targets.

Considering the architecture, MPPs are almost always distribute memory machines, which suggest that their main model of programming is the message-passing paradigm. On contrary, SMPs support both, the message-passing and the shared-variable approach, and any message-passing algorithm can be executed on a shared-memory architecture with no additional running time See [AlGo 94]. Yet, if we take the number of applications into account, message-passing will still cover the majority of parallel machines today.

The remaining question is, if MPI or PVM is more suitable for the most powerful machines today. Clearly, both message-passing libraries may be suitable for users in this area. However, in most cases MPI will be given the preference, simply because of its efficient implementation and its inherent goal for high performance. We should not forget, that the target problems discussed in Section See Requirements on Computer Systems are Grand Challenges, which require the highest possible performance from a given computing system.

As a summary, we can identify the key attributes of the most-powerful systems suggested by the survey of the current Top 500 list as follows:

Architecture: Massively Parallel Processor (MPP)
  • Programming Paradigm: Message Passing
  • Message-Passing Library: MPI standard (Message-Passing Interface)

Such machines are the basic background of the research described in the following chapters. However, the actual situation also points to a change for either SMP or COW systems. Therefore, we tried to operate with an open mind onto these other representative supercomputer architectures and all our suggestions may also be used for other machines with minor modifications, e.g. for SMPs with shared-variable programming under OpenMP.

Problem Domain: Debugging Program Errors

In this chapter, we will review the software engineering process with a special focus on the areas error detection and performance analysis. This will lead to a discussion about the debugging task itself, with its goals and main problems. Furthermore, we will study this history of bugs and the origin of bug-hunting, and describe some of the most notorious errors in computing history. The results of this digression will allow us to summarize the importance of the debugging process for the quality of the resulting program and its problems for parallel computing.

Software Engineering/Debugging/Performance Analysis
  • Debugging: Goals/Problems
  • Bug Hunting History
  • Overview of most notorious Bugs
  • Importance of Debugging

Problem Domain: Debugging Program Errors

Software Life-Cycle Models

The professional development process of software systems is called software engineering and covers the application and usage of methods and tools by teams of software developers in order to create software systems of different size See [FlZü 97]. Main aspects are the management and coordination of such projects, its components and their relations, as well as sufficient understanding of the software development process. This knowledge must consist of life-cycle models, working conventions, and quality assurance capabilities. As a consequence, software is often considered to include not only computer programs, but also the associated documentation required to develop, operate, and maintain the programs See [Boeh 76].

Software life-cycle models define the order between different activities and their relations in the software development process. Three life-cycle models can be distinguished See [FlZü 97]: the linear phases model or waterfall model, the spiral model, and the cyclic models. The most basic approach is represented by the phases model See [Boeh 76]See [Enge 88], which often contains the following activities:

Problem analysis and definition
  • Design and specification
  • Implementation and documentation
  • Testing and tuning
  • Installation and maintenance

Although the phases model exists in different variations, it always consists of similar tasks that are carried out in chronological sequence. Intermediate result (milestones) between the tasks determine the status of the project and may require and thus initiate reiteration of earlier phases. Therefore, the intermediate results at the end of each phase must be thoroughly understood, and preferably documented, before continuing with subsequent phases. If this isn't done, critical and costly errors may be detected too late, usually during testing, but frequently even not before operation. Correcting these errors later involves much greater expenses and may delay project schedules due to the required reworking (and retesting) of design, code, documentation, training material, etc. See [Boeh 83].

While the phases model is rather simple, its main problem is flexibility which is important since initial specifications and goals may be altered. Therefore, its importance is often questioned and other alternatives are proposed for the handling of projects See [FlZü 97]. However, it is sufficient for our purposes.

Testing and Tuning

Characteristics

In the scope of this work, we are interested mainly in the "testing and tuning" phase, which consists of two main subtasks, the testing of functionality and the evaluation of achieved performance See [Enge 88]. The former is used to examine the input/output behavior of a program corresponding to its specification. The latter is applied, whenever a functionally correct program has been achieved, and its runtime performance and efficiency are investigated. In the following, we will focus mainly on the testing phase. We will also show, that testing and tuning are closely related and that their activities are often connected.

The testing phase is usually initiated, whenever a module or a complete program has been implemented and its correct behavior must be reviewed. For that reason, testing is closely related to the two activities verification and validation. While the former is used to proof the target's correctness according to its behavioral specifications, the latter is applied to check to target's correct execution for a limited sets of inputs.

Verification means that the characteristics of modules or programs are thoroughly checked according to their behavioral specifications. A program is defined to be correct, whenever it complies fully to the specifications, which means that for all given inputs it reacts with correct outputs. We distinguish partial and total correctness of a program, where the former means that it terminates for a given input and delivers the correct results. A program is totally correct, if it terminates and it is partially correct for all inputs See [Enge 88].

In contrast to verification, the process of validation is applied to check, if the system or one particular module corresponds to the given requirements with correct results according. To achieve this, the program is either executed or simulated. Two basic approaches can be distinguished: the validation of the complete system (according to the user's requirements) and the validation of the functional requirements. Testing may also be described as the iterative application of validation.

The testing phase can be divided into the module test, the integration test, the installation test, and the acceptance test. For each of them, we distinguish two different approaches, black-box testing and white-box testing See [Paul 99]. During black-box testing, the program is only verified according to its specification. Often, such tests are applied, when the input space and output space of a program are finite. In that case, a set of black-box tests can be constructed, that can show conclusively that the program is correct and its behavior matches its specification See [Boeh 76]. White-box testing also considers the internal structure of a program, in order to cover certain instructions or execution paths. White-box tests are especially important for the respective developer during component testing See [FlZü 97].

Problems of Testing

A principle problem of testing is that in the general case and for a program of arbitrary size and complexity, it can always only demonstrate the existence of errors, but it can never prove the absence of errors See [Paul 99]. Furthermore, exhaustive testing is usually impossible due to the amount of possible inputs a program states in most software systems See [FlZü 97]. Yet, since every test suite feeds a program under consideration only with selected inputs out of the often enormous space of all possibilities, certain bugs in the code may never be evidenced See [WaBl 97]. This means, that the total elimination of bugs is an unrealistic goal See [Lieb 97].

Usually, programs are not correct at the first place. In contrast, it is even highly probable, that any developed program contains some errors See [Enge 88]. The reason for this is partially the complexity of the program development process itself, partially the difficulty to define solutions and realization as software, but also the high precision that is required from software engineering in general.

Another main problem of testing is, that it is surprisingly often not considered until the code has been run the first time and is found not to work See [Boeh 76]. Statistics show that many errors have already been made before coding begins, and the later the errors are detected, the higher are the associated costs to rework the system in order to function correctly. Boehm states that the most important message about developing software is to get the errors out early See [Boeh 83]. In addition, the testing phase is still a highly tedious task, that is mostly carried out manually, which is therefore error-prone in itself. A related question is the amount of required testing, which is more often defined by the amount of available budget or the given schedule.

And finally, it is very difficult to determine the sufficient amount of testing. It is clear, that exhaustive testing is impossible in both principle and testing. In principle, there are two possibilities to decide about test completion: statistical, when error statistics attest a sufficient degree of reliability, and systematic, which tries to determine methods for identifying the functional properties and classes of possible errors in programs See [KrWi 98]. A very common approach is to fix every detected bug and perform testing until the project management orders a product release. Bach tries to articulate and perhaps quantify the amount of testing See [Bach 98], that is sufficient for different projects depending on the circumstances and may be useful for everyday projects as well as mission- or life-critical projects.

Good Enough testing See [Bach 98]

Good Enough testing is the process of developing a sufficient assessment of quality, at a reasonable cost, to enable wise and timely decisions to be made concerning the product.

This definition identifies several characteristics of the professional software engineering task related to the quality of a product and its evaluation. The following questions are of major importance:

How accurate and complete is the product's quality assessment?
  • What level of testing is possible within the project costs constraints?
  • How well does the assessment serve the project and the business in terms of potential product risks?
  • How to time all these activities?
  • When has enough testing been performed?

Similarly, Dubois argues that the verification of a program is often based on a series of problems with known results See [Dubo 99]. Since it must be prepared to solve a general problem, its results can only be taken as an approximate answer and it requires good judgement to decide about the program's correctness and reliability. These differences between the actual and the desired answers arise from numerical noise, chosen approximations, inaccurate models of physical properties, or actual coding errors See [Dubo 99]. Therefore, it is often difficult to judge whether an answer is a bug or a modeling problem. Many bugs are indistinguishable from errors in modeling or deficiencies in the applied numerical techniques. At worst, a bug may cause a model's revision or a complete rejection of an approach. At best, a bug can only be detected through considerably high efforts of testing5.

Software Quality

As defined above, the goal of testing is correctness or reliability, respectively. Software reliability is defined as the probability of future satisfactory operation of software See [Boeh 83]. Software development using formal methods faces two problems in achieving reliability See [Boyl 99]:

Obtaining a formal specification of an informally described problem.
  • Based on this formal specification, selection of a mathematical verification method to prove the software's correctness with effective, affordable methods

The quality of software is defined by its characteristics and features offered during application and development to achieve the demanded requirements See [FlZü 97]. Most frequently, these quality criteria are correctness, reliability and portability See [Enge 88]. The correctness of a program can be defined as follows:

Program correctness See [Schn 91]

A program is correct, if it exactly meets its given specification.

Consequently, the correctness of a program is always expressed relative to its specification. A major problem in determining the correctness of a program is the difficulty of finding a complete and adequate specification, which is often described in abstract mathematical or physical statements See [Dubo 99].

Any method applied to achieve a high level of quality is summarized under quality assurance techniques, which can be divided into constructive quality assurance and analytical quality assurance. Constructive quality assurance is based on the application of methods, tools, construction principles, formal procedures, and life-cycle models, and covers the area of application quality, product quality and process quality See [FlZü 97].

Analytical quality assurance techniques verify, if predefined quality characteristics are fulfilled by an existing software component. This contains static and dynamic analysis. During static analysis, the object under consideration is not executed. Examples are the application of metrics, various informal methods like review and audit, but also the usage of analysis tools and formal verification methods. Dynamic analysis is performed by testing. In that case, the program is executed in order to detect errors See [FlZü 97].

The best software is that which has been tested by thousands of users under thousands of different conditions over years. Then it is known as "stable". Yet, this does not mean, that the software is now flawless, free of bugs. It generally means that there are plenty of bugs in it, but the bugs are well identified and fairly well understood See [Ster 92]. There is simply no way to assure that software is free of flaws. Though software is mathematical in nature, it cannot be "proven" like a mathematical theorem; software is more like language, with inherent ambiguities, with different definitions, different assumptions, different levels of meaning that can conflict.

Bach defines requirements as the set of ideas that collectively characterize the quality for a particular product, and testing as a process of developing an assessment of product quality See [Bach 99]. There are at least four alleged truisms about testing a product against requirements, and each of these principles reflects some truth about the dynamics of testing:

Without stated requirements, no testing is possible.
  • A software product must satisfy its stated requirements.
  • All test cases should be traceable to one or more stated requirements and vice versa.
  • Requirements must be stated in testable terms.
The Testing Process

Boehm defines the main types of automated aids for the steps of testing between the point the code has been written and the point it is pronounced acceptable for being used, as follows See [Boeh 76]:

Static code analysis
  • Test case preparation
  • Test monitoring and output checking
  • Fault isolation, debugging
  • Retesting (once a presumed fix has been made)
  • Integration of routines into (larger) systems
  • Stopping

Static code analysis starts with the usual compiler diagnostics and detailed data-type checking. In addition control flow and reachability analysis are applied with structural analysis programs and others. As an extension to structural analysis tools, test cases preparation aids assist the engineers in choosing (input) data values to make the program execute along a desired path. Yet, these tools are only sufficient to generate a limited amount of inputs, for which the outputs must still be calculated manually or with other means of computer-support.

For this work, the more important parts are test monitoring and output checking, as well as fault isolation and debugging. In order to perform a maximum of testing automatically, techniques have been developed for various kinds of dynamic data-type checking and assertion checking, and for timing and performance analysis. In addition, test post-processing tools aid the users with output comparators and exception report capabilities, as well as test-oriented data reduction and report generation packages. For the area of fault isolation and debugging, we may distinguish well-established methods like core dumps, traces, snapshots, and breakpointing, from highly interactive capabilities like replay or backtracking.

After fixing the code, retesting is used to compare previous test cycles with the results of the corrected code. Again comparators to check the differences in the code, inputs, and outputs between the original and the modified program are applied. These test management tools are further extended for integrating the routines into bigger systems, where additional activities like interface consistency checking have to be performed. Finally, the halting problem of testing as defined above with Good Enough testing may also be supported by tools, which means that tools offer a means of quantifying the amount of testing that has been performed.

Debugging Basics

Probably the most important part of testing is debugging or error location and correction. It is an essential part of the software life-cycle, because a program is obviously only useful if it executes correctly and is sufficiently reliable See [WiAl 99]. In standard software engineering terms, this activity can be defined as follows:

Debugging See [McHe 89]

Debugging is the process of locating, analyzing, and correcting suspected errors.

Today, a lot of a programmer's development time is spent on debugging See [Agra 91]. Therefore, it is often ranked as one of the most significant bottlenecks while at the same time the weakest link in the software life-cycle See [Stit 92]. The debugging phase is applied after runtime failures or incorrect results have been observed during testing. In that case, the errors and their origin must be detected and corrected See [Enge 88]. Whenever bugs arise, programmers should be able to locate and correct them with a minimum amount of effort, leaving more time and energy available for progressive work See [Stit 92].

While the importance of testing and debugging is highly accepted in the software engineering domain, there is still a surprising lack of books that thoroughly discuss software debugging at a practical level See [Stit 92]. Of course, there are books describing the features of one or more debugging tools from a user's view, but these do not provide a general look onto the problem of solving program bugs. From the scientific point of view, debugging is still studied very little, compared, for example, to compilers See [Rose 96]. In addition, debugging is still absent from current-day computer science curriculums, which leaves students and programmers without much guidance.

In See [Lieb 97], Lieberman defines debugging as "... the dirty little secret of computer science". He argues that the computer science community as a whole has largely ignored the debugging problem. Today's commercial programming environments provide debugging tools that are little better than the tools that came with programming environments 30 years ago. The most difficult problem is to figure out exactly what went wrong. This results from the fact, that programs are complex artifacts, and consequently, debugging often struggles against this complexity See [Lieb 97]. As a consequence, debugging is still a matter of trial and error.

Bug-hunting is often described as detective work. Stitt defines the phases of the debugging process to be arranged around the "research, gain knowledge and deduce phase". Furthermore, he identifies five major tactics of debugging, which may be used individually or in combination with othersSee [Stit 92]:

Preliminary investigations: is it really a bug?
  • Static debugging: verification of requirements and design as well as source code analysis
  • Run-time observation: testing, experimenting and analysis of program results
  • Source code manipulation: instrumentation and partial execution
  • Dynamic debugging: breakpointing, single-stepping, monitoring and testing

A tool that supports the user during this process is called a debugger. Two possible definitions are given below:

Debugging tool See [Stit 92]

A debugging tool is anything that provides useful knowledge about a program's execution and its occurring program state changes. Different forms of evidence are given by software debuggers, source language instrumentation, instrumentation drivers, interrupt interception drivers, and hardware monitoring aids.

Debugger See [Rose 96]

A debugger is a tool to help track down, isolate, and remove bugs from software programs. In truth, debuggers are tools to illuminate the dynamic nature of a program in order to understand it as well as find and fix defects.

Rosenberg also compares debuggers to other tools like the magnifying glass, the microscope, the logic analyzer, the profiler, and the browser with which a program can be examined. In principle, a debugger is something like a sophisticated software analyzer.

Such debugging tools are as necessary for the incremental evolution of software as they are for finding errors See [Lieb 97]. The basic capabilities of a debugger are:

to follow the executable program step by step (tracing)
  • to stop and continue the programs execution at breakpoints (breakpointing)
  • to display and manipulate the contents of variables, registers, and memory locations (inspection)

In addition to these basic features, debuggers may also include support for more advanced or specialized activities. An example advanced technique often anticipated by users is program slicing, a technique which extracts only those lines of code that are related to the current track of investigations (see See [Weis 82] and See [Weis 84]). Other examples are debugging capabilities tailored to one particular set of errors, like memory leaks or incorrect pointer references.

Probably the most important activity of debugging is breakpointing. It is based on breakpoints, which can be defined as follows:

Breakpoints See [Stit 92]

A breakpoint is a controlled way to force a program to stop its execution. Breakpointing may occur on software interrupt calls, on calls to a program subfunction or on selected points within the program.

Single-stepping is a special kind of breakpointing, which may occur on either machine instruction and assembler level or on source lines in a higher-level language. Other convenient functions may be based on breakpoints, like watchpoints, "run-to-here"-features, and conditional program control See [Rose 96].

Please note, that breakpointing is a difficult technique and implementing an accurate breakpointing mechanism is a delicate challenge. In principle, a breakpoint is a special instruction inserted into the executable by the debugger, which causes the target to halt its execution immediately and to call a matching service routine (the breakpoint exception handler) of the operating system See [Rose 96]. These special "trap"-instructions may be either general interrupt instructions with corresponding return values (e.g. on Intel's x86 architecture) or dedicated breakpoint instruction (e.g. on the MIPS RISC architectures or the Digital Alpha processors). Dealing with breakpoints always means handling of exceptions, which is a big obstacle during processor design. Some delicate examples are overlapping of instructions and branch prediction on pipelined-architectures. In these cases, an instruction is executed piece by piece and is not completed for several clock cycles. Thus, exceptions may require to abort instructions in the pipeline before they complete, or predicted branches may have to be discarded altogether. For a more detailed discussion of these critical issues, please refer to the corresponding literature, e.g. See [HePa 96]. Breakpoints in parallel and distributed systems will be addressed in Section See Breakpointing in the Event Graph.

Definition and Classification of Errors

Diagnosing and troubleshooting a software failure is a difficult activity. It starts with the problem, that most of the terms and notions in this area have a common sense appeal and lack precision See [Brit 98]. Thus, it is important to try to exactly distinguish the following words:

Error
  • Failure
  • Fault
  • Bug

We will provide corresponding definitions in the following paragraphs, which have been derived from existing literature, in this case especially the opinions given in See [Brit 98], See [Choi 91], See [Enge 88], See [KrWi 98], See [McHe 89]. Probably the most precisely and formally defined approach is offered by Grabner's view of errors and associated program states See [Grab 99], which is also included in this section.

Remark

Although the definitions of Grabner are preferred in most cases, we do not need this high precision throughout this work. Therefore, we will stick to the common usage of the term "error", which covers all different kinds of errors, bugs, failures, or faults. This view is sufficiently accurate and precise for our needs.

A very informal definition of the most basic term "error" corresponding to computer science is as follows:

Error (informal definition) See [Enge 88]

An error is defined as the computation (calculation and outputting) of one or more incorrect results by a computer.

This definition of an error is the most commonly used, and the term "error" has also been used in Definition See Debugging [McHe 89] for debugging. The noted lack of precision is established by two issues: On the one hand, there needs to be a distinction between the observed erroneous behavior and the original error itself. On the other hand, the term "results" identifies not only the output of a program, but also incorrect states during execution.

Grabner discusses these problems in more detail by focusing on the states of application programs, which consist of simple and compounded instructions See [Grab 99]. For each of these instructions there is a well-defined initial state - before the instruction is executed - and a terminal state - after the instruction has been executed. A compounded instruction may consist of several intermediate states. During debugging, the user can investigate these states and apply corresponding activities. Grabner defines the state of a program during execution as follows:

State of program during execution See [Grab 99]

For a given view onto a program, let t represent the instant of time at the terminal state of instruction a t , and let s t ( v ) be the value of variable at time t .

Then, the state s t of the program at time t is the set of the values of its variables s t ( v ), so that: .

In this definition, a t is an instruction, whose operation assigns a new value to variable . Although the set V contains all the variables of a program, the user's main interests are those variables, that are modified by instruction a t See [Grab 99]. Figure See State of a program at a given instruction describes the relation between the instruction and its variables in a small diagram. In See [Pine 99], the author introduces another distinction by classifying those variables as "noncurrent", whose actual values as computed by instruction a t differ from the values expected by the user.

An important issue of Definition See State of program during execution [Grab 99] is the view onto the program. This view is usually defined by the user by selecting a corresponding granularity for the analysis activities. The most coarse granularity is certainly the program itself. In this case, the initial state is defined by the initial state of the variables, while the final state represents the results of the program. In most cases, the user selects a finer level of granularity. Typical examples for debugging are the instructions of the program in its high-level programming language or in assembler language See [Grab 99]. In Section See Program State Changes, we will review and adapt this definition of a program's state for our event graph.

Based on the definition of a program state, an error can formally be defined as follows:

Error See [Grab 99]

An error in a program is observed through an incorrect program state s t , which means that one or more variables contain a wrong value at time t .

In terms of errors, two kinds can generally be distinguished: hardware errors and software errors. The former result from erroneous functions of electronic elements, which have been initiated e.g. by dust, heat, variations in electrical currents, and others, or have been introduced through corrupted connections and cables or by other faulty components. The appearance of hardware errors leads to interruptions of service, which often result in complete breakdowns or system shutdowns. An improvement to this situation is offered by fault-tolerant systems, which increase the systems hardware reliability by incorporating replicated and redundant system components. In the scope of this work, we will not consider hardware errors any further, but instead concentrate on the second kind of errors, that manifest themselves in software.

Software errors (often called logical errors) are based on an incorrect sequence of statements, which can further be attributed to errors in the implementation or in the basic algorithm. These kinds of errors are related to program behavior, that is unexpected, unintended, or otherwise wrong due to some incorrect activity of a human or the environment See [KrWi 98].

Additionally, Britcher believes, that a fault is rarely the work of a single operation. More often than not is it a logical sequence that runs to dozens of lines of code and crosses modules See [Brit 98]. In this context, Grabner distinguishes between the following kinds of errors See [Grab 99]:

Errors of origin
  • Subsequent errors

An error of origin is one, that is observed at the program state where it occurs initially. Therefore, such an error manifests the first occurrence of an incorrect operation. This is also the place, where the correction should be applied. However, since errors are seldom detected immediately, users are usually confronted with subsequent errors first. Such subsequent errors exist always only as a consequence of an error of origin. They reveal an incorrect program state, that results from both, their operation and incorrect input variables. This relation between several states of a program is also expressed in the term fault, which can be defined as follows:

Fault See [Brit 98], See [McHe 89]

A fault is a logical sequence of statements (or rather state changes) or an accidental condition that produces a result other than that which the programmer intended or causes a program to fail to perform its required function. Therefore, a fault gives rise to errors and inversely, an error is the result of a fault during the creation of the program.

Probably the most common faults are typos in the source code text, that have not been detected by the compiler, or some other incorrect statements which do not comply with the intended semantics of a program.

The worst kind of fault is a failure, which can be defined as follows:

Failure See [KrWi 98]

A failure is the inability of a system (or one of its components) to perform a given function within given limits.

Additionally, a failure is often defined as a loss of some service to the user See [Brit 98]. Therefore, it represents one special kind of manifestation of erroneous program behavior See [Choi 91], because a program fails to produce expected results and may even terminate abnormally.

The missing term in this context is "bug", which is often used as an equivalent replacement for error. However, in practice most people identify some additional characteristics for a bug as specified in the following definition:

Bug See [KrWi 98]

A bug is the commonly used term for an error with originally unknown location and reason. A bug can also be attributed to a poorly prepared and executed test, as well as a bad program description that may lead users to mistake correct behavior for bugs.

The notions and terms described above can also be described with a conceptual overview as presented in Figure See Conceptual overview of notions and terms. A special group are the false errors, which occur at state changes, that are incorrectly assumed to be wrong.

During debugging, we want to pinpoint the very origin of an error, i.e. the first incorrect operation, which is often a difficult task. For that reason, we may identify several more characteristics of bugs.

In principal, if the software's design is logically and mathematically correct, it should be so into infinity (with exception of certain types of calendar calculations - see Section See The Millennium Bug, "See The Millennium Bug"). Consequently, if a software error arises it may only originate from the following set of possibilities See [Stit 92]:

Included in the original design or coding.
  • Introduced through program modification.
  • Introduced as a side effect of another bug fix.

Additionally, there are those errors related to failures of the hardware on which the program is executed, and to the corruption of the media on which the program is stored. Software components differ fundamentally from hardware components, because they don't suffer from aging or transient failures. Instead, when software components fail, the reasons are See [Boyl 99]:

Inaccuracies of the problem specification, which results in programs that behave differently than intended by the specifier.
  • Errors introduced during software development, which results in programs that behave differently than intended by the programmer.
  • Errors introduced during compilation, which results in programs that behave differently than intended by the compiler.

The so-called group of "occasional errors" may reside undetected in a program for very long time, even several years. They manifest themselves sporadically only on certain rare combinations of input and system state See [BlWa 96]. These malicious errors may only surface, when other factors of the system environment change (e.g. marginal stack allocation, conflicting use of resources between system components, timing failures, and other results from heavy system load) See [Stit 92].

Stitt offers some more characteristics to establish a classification of errors with the following three classes See [Stit 92]:

Consistency: errors may occur either with consistency or with a frustrating lack of consistency
  • Intermittence: both, consistent and variable type of failures may be intermittent
  • Fuzzy cases: is the problem really a result of a errors?

These different characteristic are very useful during debugging. Only if users are aware about all the different possibilities, they are able to react to incorrect program behavior and perform corresponding error detection activities. To underline the importance of these tasks, the following section describes actual cases of erroneous behavior observed in computer systems.

Notorious Errors

The reasons for testing have been defined above. Obviously, errors in software are unavoidable and very likely to take place in general. The importance and implications of errors may still be underestimated, and often the financial consequences are overlooked. Porter reports about a survey by the Standish Group, which examined 175.000 software development projects in 1995 See [Port 98]. The results of this research revealed, that 31% of these projects were cancelled before completion with a financial loss of US$81 billion. Of the remaining projects, more than 52% suffer significant cost overruns, with an average of 189% more than estimated, resulting in a US$59 billion loss. Additionally, there is the possibility of other consequences for humanity, sometimes even leading to personal harm. The following section contains an overview of some of the most critical errors. The first example of a bug represents the very origin of the term "debugging".

History of Debugging

The story is reported to have taken place on September 9, 1945 at the Computation Laboratory of Harvard University See [Stit 92]. Grace M. Hopper, assigned to the Bureau of Ordinance Computation, was working on Mark II, the successor of the Mark I computer engine, when suddenly the machine stopped for no apparent reason. Upon inspection, Hopper and her team detected a moth, which had flown into a relay from an open window; the moth had been pulverized by the relay and consequently had caused the device to fail. Her entry in the log book read: "First actual case of bug being found". Afterwards, the term "bug" was popularized to signify any system malfunction. Yet, it has been noted that the term itself had been used as far back as the end of the last century, when it was applied to electrical equipment.

More recently in 1995, the twentieth anniversary special issue of the Byte computer magazine devoted one chapter to a list of the 20 most notorious bugs See [Need 95]. Some of these errors as well as a few other important examples are described in the following.

Lethal errors
  • Airplane crashes
  • Governmental and commercial mishaps
  • Internet worm/hackers
  • Millennium bug
Lethal Software Errors

Probably the most critical errors occurred in situations, where human beings have been harmed or eventually killed. The best-known example in that domain is the story about the Therac-25 linear accelerator machines between 1985 and 1987. The tragedy happened at the East Texas Cancer Center in Tyler, where at least four people received lethal doses of radiation from Therac-25 machines during their medical treatment for cancer See [Need 95]. There were several errors, among them the failure of the programmer to detect a race condition6 (i.e., a miscoordination between concurrent tasks) See [Joch 95].

Another group of errors, that is seldom published for the public's knowledge but rather more often hidden by governmental agencies, is related to software systems in warfare equipment. Since errors are unavoidable, even these system must be affected from time to time. One example is a critical bug observed on American Patriot missiles during the Persian Gulf War. It is reported that one of these missiles erroneously killed 28 American soldiers in their barracks in Dahran, Saudia Arabia, due to a software glitch in the system See [Need 95].

Airplane Crashes

A topic related to software errors that always gets the media's attention is airplane crashes. Although travelling with commercial aircrafts is often declared to be the safest possibility for any kind of journey (accident rates are given with one accident per 1 million flights), an accident with an aircraft is much more media-important than most other means of public and private transportation. A lot of investigations in this area has been performed by Ladkin See [Ladk 99], who studies the reasons for computer-related incidents with commercial aircrafts. He states, that until now programming errors have never been mentioned as the main reason for any critical accident. In fact, most of the breakdowns, incidents, and accidents do not happen due to one single source in one independent component, but always due to the cooperation of several factors.

An example is the crash-landing of one of Lufthansa's Airbus A320 in September 1993 in Warsaw, which killed two persons. The A320 is the first commercial aircraft, which connects pilot's joystick (in contrast to other planes that have a control lever) and rudder only via computers, in order to save weight, increase the reliability of the system, and to block human mistakes. The problem in Warsaw appeared, when the plane was preparing to land and the pilots were waiting for a direction change of the wind that never happened. As a consequence, the landing took place with the wind instead of against it. In addition, the pilots had increased the planes speed in order to fight the high winds. This resulted in a far to high landing speed, and due to additional up-wind, the sensors at the planes wheels didn't signal that the machine has landed, which in turn didn't activate the brakes to stop the plane before leaving the runway and hitting a hill.

Other problems with planes and computers are known, e.g. the landing of one of Dragonair's A320 in June 1994 in Hong-Kong. In that case, a blast of wind prevented a change of the flaps, while the computer believed that the flaps were in the correct position, which induced heavy problems during landing. Other computer problems lead to a big catastrophe with 160 victims, when a Boeing B757 of American Airlines hit a hill in Colombia. In that case, the flight management computer (FMC) was operating with an erroneous navigational data base and two corrupt navigational transmitters, which resulted in wrong positional informations for the flight crew.

Besides the planes, an airport itself may be hidden by malfunctioning system. For example, in 1994 bugs in the computerized baggage-handling system delayed the opening of the new Denver airport. Automated baggage carts drove into walls or deposited bags at the wrong airport destination. The financial consequences were some $80 million to fix the system, before finally the airport opened in February 1995 with a manual baggage-handling system See [Need 95].

Another well-known example besides commercial airplane travels is the termination of the Ariane-5 rocket on June 4th, 1996 See [Ladk 98]. Due to a conversion problem from 64-bit floating point numbers to 16-bit integers, an exception was called that was never taken. As a result, both computers on board the rocket shut the system down, leading to navigational errors of the rocket only 40 seconds after its start. In order to avoid further damage, the self-construction mechanism was initiated and destroyed the rocket. Similarly, in 1993 a software error caused the thruster rockets of satellite Clementine to fire continually, which consumed all its fuel before completing its asteroid-rendezvous mission See [Need 95].

Governmental and commercial mishaps

As we have seen above, computer programs often don't work as they should, and too much buggy software reaches the end user See [Lieb 97]. An important area of computer application is the governmental and commercial domain of everyday computing. Errors in these system often affect many people, but fortunately, their consequences are not lethal. An example happened in 1985, when a tax computer delivered warning notices about withheld taxes to 27.000 companies, which had already paid their fees See [Need 95]. Another story is reported for 1989 in Paris, where 41.000 surprised traffic offenders received letters charging crimes like murder, drug trafficking, extortion, and prostitution instead of their traffic violations See [Need 95].

One example for a big software error in a commercial company happened in 1988, when an automated Black & Decker distribution center in Northampton, England, restored corrupted backup data over all the main systems data See [Need 95]. The destroyed data could only be corrected by inputting the complete inventory of the depot manually. While this was already an expensive consequence of a software error, failures in computer systems of banks may even be more costly. In 1989 a bug in the system of a British bank transferred an extra 2 million pound to customers within one hour, because payment orders were permitted to be issued twice See [Need 95]. The problem in this case was, that the bank had to trust their customers in returning the money, because of missing transaction protocols.

Yet, it has to be mentioned, that these official cases of software bugs are probably only a minority of the estimated number of unknown cases. The reason is that companies and especially banks rely on establishing a basis of trust with their customers, which may certainly be harmed if such errors would be published. Thus, it must be accepted, that many more bugs happen than are reported in the media.

Internet Worm/Hackers

In contrast to the governmental and commercial mishaps reported above, increased media attention is focused on reports about the scene of hackers and phreaks. The high potential of devastating damages is a consequence from the massive expansion of the networking infrastructure and at the same time from the vulnerability of networked computers. The connection between software errors and hackers is the fact, that the former is usually misused as a starting point for illegal intrusions into computer systems.

The Internet Worm of November 2-3, 1988, created by Cornell grad student Robert T. Morris, was to be the largest and best-publicized computer-intrusion scandal to date See [Ster 92]. Morris said, that his ingenious "worm" program was meant to explore the Internet harmlessly, but due to bad programming and a math error in the code, the worm replicated itself 14 times faster than intended and slipped out of control See [Need 95]. Within a few hours, some six thousand Internet computers crashed, affecting systems for weeks before fully recovering from the damages wrought See [Spaf 89].

A similar example is the system crash of AT&T's long distance telephone network happened on January 15th, 1990 See [Ster 92]. Its cause was an improvement or rather an attempted improvement in the System 7 software for AT&T's 4ESS switching station, the "Generic 44E14 Central Office Switch Software". This software has been extensively tested, and was considered very stable. Originally, the station with System 7 were programmed to switch over to a backup net in case of any problems. However, in mid-December 1989, a new high-velocity, high-security software patch was distributed to each of the 4ESS switches that would enable them to switch over even more quickly, making the total system much more secure. Yet, at 2:25 P.M. EST on Monday, January 15th, 1990, one of AT&T's 3ESS toll switching systems in New York City had an actual, legitimate, minor problem (a missing break in a C switch statement) and went into fault recovery routines. This was the kickoff for a series of fallouts, and like in a chain reaction one machine after the other went down leading to an immense collapse of the US telephone system. The shut down lasted for 9 hours, leading to some 74 million uncompleted calls, which resulted in the most severe breakdown in the US telephone network ever See [Need 95].

The story of this error was rather dubious, because initially hackers were made responsible for this failure and were blamed by the media and US law enforcement agencies. This is a usual way to cover tracks. Quite often, big companies do not blame themselves for software errors, or blame other people to be responsible for their failures instead. At the same time, some companies cannot effort to make system failures publicly known, simply because their relation with customers relies on trust, as in case of banks, which have surely been hit by hackers following software bugs in past, but would never confirm these incidents ever happened.

Similar events are documented from time to time, like one on July 1st, 1991, which disrupted the telephone service in Washington, D.C., Pittsburgh, Los Angeles, and San Francisco. Once again, a seemingly minor maintenance problem had crippled the digital System 7 and affected about 12 million people. This time, it was a single mistyped character, one tine typographical flaw in one single line of the 10 million lines of code.

The Millennium Bug

The most publicly interesting and therefore media-present bug is undoubtedly the millennium bug or Y2k (Year 2000) error. This error even lead to the creation of the International Y2K Cooperation Center under the auspices of the United Nations in order to minimize Y2K impact throughout the world (see http://www.iy2kcc.org). Besides that it has been in the media for quite a long time, and many self-proclaimed prophets predicted the end of the world due to computer failures. The whole problem is based on the fact, that some legacy mainframe computer programs were hard-coded to treat the years by their last two digits. Consequently they cannot handle 4-digit year dates, and dates from the year 2000 may possibly be interpreted as 1900 See [Need 95]. Thus, unique problems and bottlenecks were expected and weaknesses of our infrastructure should have been exposed. Worst-case scenarios included examples like disruption of public transportation, banking and finance, hospitals, telephone and mail services, and various kinds of governmental services, e.g. social security, tax offices, and worst of all military installations See [YoYo 99]. Besides these immediate failures, many worst-case scenarios were derived for the long-term range, like breakdown of international economy, a plague of lawsuits filed by shareholders, the families of deceased patients, and swarms of other people harmed by Y2k failures See [Hyat 99].

Most consequences of the millennium bug have been predicted for the turn of the year, when the clock struck midnight on January 1st, 2000. However, at the time of writing only few and so far harmless errors have been observed. Some examples are given in the following list (Sources: CNN, Reuters, CNET, ZDNet, Der Standard, ORF, Heise Online, IY2KCC, USENET comp.risks and RISKS Forum):

The radiation controllers in two Japanese nuclear power-plants failed to operate correctly, requiring to manually input the correct data.
  • Seven US nuclear power-plants reported minor problems, although none of them affected safety systems.
  • A US spy satellite and a French military satellite encountered minor glitches, but were reportedly never out of control.
  • The atomic weapon laboratory in Oak Ridge, Tennessee, which is also the biggest US uranium depot, faced allegedly harmless date problems
  • Egypt's national news wire service briefly stopped filing.
  • In three Swedish hospitals cardiac control systems shut down operation.
  • A Stockholm bank refused electronic access to customer's accounts.
  • A bank in Cologne incorrectly transferred several millions to some customers.
  • Several hundred slot machines at the Delaware horse track shut down operation.
  • A CyberCash credit-card verification caused duplicate transactions.
  • Some jail sentences in Italy were increased by a century due to Y2k-bugs.
  • The South-Korean Millennium Baby was registered with an age of 100 years.

These news items are also confirmed by official authorities. The US government (President Clinton's Y2k-representative John Koskinen) announced, that only minor incidents have been observed, which have been solved immediately. A similar statement was issued by Bruce McConnel, the Year 2000 representative of the United Nations, who stated, that globally no major disturbances have been perceived. Yet, the French magazine "01-Informatique" reported, that more than 15 percent of the French industry observed glitches due to the Y2k-bug, especially in the areas accounting and finance, but also in personnel, payment, and production.

However, all in all it seems, that the preparations for the Y2k-bug have paid off. Considering the amount of money, this was not a cheap effort. For example, the US government estimates, that governments and companies worldwide spent US$200 billion to prevent a Year 2000 catastrophe. These numbers are even exceeded by the International Data Corporation (IDC), who reckoned up to US$280 billion, while the Gartner Group evaluated even US$300 to US$600 billion in expenses. Of course, these amounts and the actually observed effects of the Y2k bug lead to some critical voices concerning waste of money. There are some critical issues, that have to be mentioned in this context. Firstly, it remains to be seen if the upcoming months still do not reveal any major Y2k failures. Secondly, it will forever be unknown, what would have happened if these investments in the information technology sector would not have been placed in advance. Thirdly, there is always the problem of concealment, and some (probably critical) Y2k-issues will never be revealed in public.

Related to the Y2k bug there are other notorious dates, like February 29th, 2000, which is an intercalary day that may be missed by some systems. Another previous example is the "Nines Problem", which was predicted for September 9th, 1999. Concerns over this date arose from a programming convention that used four nines in a row - 9999 - to tell computers to stop processing data or to prepare for maintenance. Some thought, that the 9/9/99 would be a precursor to the millennium bug, although the latter is expected to be much more critical. Yet, September 9th, 1999 has passed without any major incidents or disruption of computer systems in Asia, Europe, and the US. Some reasons for this unexpected non-appearance are, that the date would have to be stored in BCD or ASCII-format and would equal "090999" or "990909" as usual.

Errors in Highly-Complex Systems

While these examples of bugs have shown, that there is a great impact of software errors on mankind, there may be even more upcoming difficulties.

A system, that was cancelled before it could cause any dramatic errors and failures, was the proposed US Strategic Defense Initiative (SDI) in 1983. It was intended to defend the US against nuclear missile attack with computer-aimed weapons. Estimations predicted it to require up to 100 millions lines of code, a complexity that hasn't been reached by existing systems until now. Yet, testing was impossible, since it would have required to fire missiles onto US targets, which meant, that the system was expected to work bug-free the first time it was ever used.

However, testing and debugging has already to face problems of this magnitude. Due to the ever increasing number of elements in computer systems, the probability of failures increases steadily, even though the reliability of the system components is improved See [Feyn 96]. For example, the scale of massively parallel systems will also mean drastically reduced system stability and reliability See [Gust 98].

Reduced system stability and reliability (compare with See [Hoss 99])

Consider a parallel system that consists of 100 SMP clusters with 128 processing elements each, making 12800 processing elements in total. If we assume the mean time between interrupt (MTBI) for hardware failures of one such cluster is 500 hours, than we would obtain an overall MTBI of 5 hours for the total system. If we assume that a processor-memory module yields a mean time between failure (MTBF) of 3 years, than the MTBF of the whole 12800 modules would be approximately 2 hours. Furthermore, if we assume a system being capable of 1 Tflop/s (10 12 operations per second) operational speed, only 10 16 operations could be performed between two failures, which is not really much measured at the criteria of Grand Challenge applications.

This means, that parallel systems offer a high degree of possible errors. The situation is getting worse when programming and other software bugs are taken into account See [Hoss 99]. Dealing with Computational Science & Engineering (CSE) problems of Grand Challenge scale, one has to face the fundamental responsibility to verify the results of numerical computer simulations, since these will be used as replacements of experimental explorations See [Hoss 99]. During testing, the expected results of a program are often compared with the output of an older, slower program, which is believed to be more reliable See [WaBl 97]. Yet, there may not exist a program for problems of that scale.

The Advanced Strategic Computing Initiative aims to replace physical nuclear-weapons testing with computer simulations. This focuses much-needed attention on an issue common to all Grand Challenge computation programs See [Gust 98]: How do you get confidence - opposed to mere guidance or suggestion - out of a computer simulation? Making the simulation of some physical phenomena rigorous and immune to the usual discretization and rounding errors is computationally expensive and practiced by a vanishing, small fraction of the high performance computing community. Some of ASCI's proposed computing power should be placed not into brute-force increases in particle counts, mesh densities, or finer time steps, but into methods that increase confidence in the answers it produces. This will place the debate over ASCI's validity on scientific instead of political grounds See [Gust 98].

An Example of Bug Potential

In order to understand the implications of bugs for large scale highly-complex computer systems, we will now describe a well-known example of a widely available operating system, Microsoft's Windows. Of course, the discussion below may be biased toward's one operating system or another, but it is sufficient for our intention. In addition, this software product is the most widely used operating system currently available, and almost everybody in the modern world is confronted with it from time to time.

The operating system of Redmond's software manufacturer Microsoft has always been the target of error reports, which are very often presented in everyday's media. Early reports are already presented in 1991, when the "unrecoverable application error" of Windows 3.0 and later the "general protection fault" of version 3.1 became a household phrase after the introduction of Microsoft's new operating system See [Need 95]. Another example is an error in the calculator of MS Windows, which did not display correct answers in certain cases. The interesting thing about this errors is, that it was admitted by Microsoft only after the detection of the Pentium Bug. This glitch in Intel's flagship processor architecture, which is the main hardware platform for Windows, is one of the most widely reported bug in history See [Need 95]. Its results were incorrect floating-point divisions on Intel's chip, that were estimated to have an occurrence frequency from days to millennia See [BlWa 96]. Although this bug happens only occasionally, it still is embarrassing and led to a flurry of controversy and an unexpected public-relations debacle for Intel. And yet, a stunned Intel management pointed out, that occasional bugs are the rule rather than the exception in released microprocessors See [BlWa 96].

Of course, there are comparable examples from other software vendors. Even compilers contain lots of bugs that may seriously affect the program development process. Evidence is given in See [Boyl 99], which describe experiences of programmers with bugs, that would only go away after compiler patching or upgrading. Furthermore, projects like the GNU C compiler, a de facto standard on UNIX machines, have evolved over a long history of problems and bugs. Sun Microsystems' Java Development Kit, JDK 1.1, was originally shipped with bug fixes planned for future versions. All these problems introduced the construction of verified compilers, which are compilers that have been proven to correctly compile any correct program into assembly language.

Another example is the failure rate of utilities on commercial versions of Unix, as indicated by Table See Reliability of commercial UNIX utilities See [Lewi 99]. The operating systems reliability is used as a metric to compare different software products, in concrete the reliability of a large collection of basic UNIX utility programs, X-Window applications and servers, and network services. The study was conduced by Miller et al in 1990 See [Mill 90b] and repeated in 1995 See [Mill 95b]. The applied testing methods were largely automatic and subjected these programs to random input streams. For this test, the results delivered the failure rate of these tools in terms of crashes and hangs (infinite loops).

Reliability of commercial UNIX utilities

 

lines of code

failure rate

Commercial UNIX

10 million

15-43%

Linux

1,5 million

9%

GNU utilities

 

7%

Neglecting the commercial and public consequences of this comparison as well as its accuracy, we can identify the defect rate as a possible estimation of software quality. Yet, the defect rate as a single-point estimate is less important as the defect density, which can be derived from the failure rate and the lines of code. In that case, the defect density of the Linux utilities is still higher than that of other commercially available UNIX utilities. This would lead to the conclusion, that commercial UNIX systems are more reliable than the Linux counterpart by a factor of 30 to 400 percent. Again, the validity of this comparison may be questioned and still, defect density is only a single-point estimate of software reliability.

A well-known story is based on Microsoft's Windows 98, which has been declared to have fixed 5.000 bugs that plagued Windows 95. In response, many journalists reacted by indicating that an operating system that contained 5.000 well-known bugs has been shipped in the estimated number of 50 million copies of Microsoft's product See [Lewi 98]. Lewis asked, if these problems of Windows 95 indicate, that the problem of software quality is unsolvable. His estimations are based on an approach by Jones, who defined the defect potential of software as follows See [Jone 96]:

Defect potential See [Jone 96]

The defect potential , where FP is the number of function points in the new software product. For enhanced software, an exponent of 1.27 is used instead of 1.25.

To determine the number of function points of a particular program, Jones recommends to divide the lines-of-code (LOC) by 100. Thus, for Windows 95, which has reportedly 15 million lines of code, we estimate FP = 150.000 function points. This yields a defect rate of potentially 2.95 million bugs in Windows 95! Please note, that bug potential is not the same as number of bugs, and the inaccuracy of the term "bug" itself (see Section See Definition and Classification of Errors), which is often used ambiguously. Not every bug is always a "failure" or "fault". Thus, 5000 bugs are obviously not 5000 critical failures, but many may be occurrences of rarely used commands exceeding their response time marginally.

Another number often used for comparing the reliability of software is failure density, which is defined as follows See [Brit 98]:

Failure density See [Brit 98]

The failure density is used as a measurement of failures per thousand lines of code.

For the data above, we may thus calculate a failure density of 0.33 per thousand lines of code, which is comparable to other products of that size according to Britcher See [Brit 98]. Please be aware, that these numbers should not be taken too seriously, because all the data is based on estimations. Additionally, the number of faults in any large system is virtually unknowable, and many of these faults may never cause failures. Some faults go undetected simply because that part of the system is never executed. Other faults may influence or even nullify other effects. In addition, a system could have been developed fault tolerant in the first place and may be able to react to the program's behavior See [Brit 98].

This leads us to the learning curve, a method to describe the improvement of software between subsequent testing iterations. Obviously, the bug potential as defined above is not the same as the number of bugs in the code. It may only serve as a way to estimate the amount of required testing. In this case, we introduce the defect-removal rule of Jones, which is defined as follows See [Jone 96]:

Defect removal rule See [Jone 96]

The defect potential , where R is the learning rate during one test-and-fix step, while k is the number of applied test-and-fix iterations.

If we assume a learning rate of 30% for each inspection or test step, which means that 30% of the errors are detected and removed, we would achieve a remaining defect rate of 5000 after 18 test iterations. Since Jones recommends 6 to 12 iterations, Microsoft's software practice appears to exceed the industry norm. In order to evaluate the number of test iterations required to remove the defect potential from 2.95 million to 1, we may calculate the number of required test-and-fix iterations as follows See [Lewi 98]:

Required test-and-fix iterations

 

This means, that a total number of 42 steps would be required to reduce the defect potential from 2.95 million to one See [Lewi 98]. It is clear, that Microsoft Windows is only one example. The estimations described above are similar for other projects, and the potential consequences of malfunctioning computer systems increase with the integration of computers in everyday life operation. The systems themselves grow more and more complex, which drives the need for new assurances of reliability See [BlWa 96].

Summary: Testing and Debugging

This chapter described the testing and debugging phases of the software life-cycle, which are applied to ensure the quality of software products. Yet, even the best testing and debugging approaches can only target at stable software, which is not completely error-free, simply because not every bug may be detected. This is also explained with some of the most notorious errors, which indicate that errors in software packages are ubiquitous and may happen anytime and anywhere.

We have also shown that bugs exist on different levels of severity, indicating that some bugs are more critical than others. In practice this means, that debuggers should first help to eliminate the most critical bugs and then to solve minor problems, although this has been proven difficult in many cases.

Another critical problem is the nature of some programs itself. Due to their complexity, some programs require more testing than can ever be applied. Some applications even offer solutions for problems (e.g. on the Grand Challenge scale), which cannot be verified in reality. How can we be sure that the results are correct? How can we trust the computation of large scale problems?

 

Debugging Parallel Programs - Related Work

After describing the area of application - massively parallel computing - and the necessity to support users during the tedious error detection phase, we will now describe the testing and debugging cycles for sequential and parallel programs in more detail. Starting with a traditional approach of testing, we will introduce the problems and difficulties that programmers face when analyzing arbitrary parallel programs. This leads us to an overview about debugging obstacles due to nondeterministic program behavior and the necessary extensions of the traditional cyclic approach. Throughout this chapter, we will describe the relevant theoretical background as well as some of the important solutions and tool implementations related to this work.

Traditional sequential testing and debugging cycles
  • Obstacles during parallel program debugging
  • Extended parallel testing and debugging approach
  • Requirements on a parallel debugging tool
  • Parallel program analysis and visualization
  • Comparison with existing related work

Debugging Parallel Programs - Related Work

Traditional Testing and Debugging Cycles

Simple Checking

The testing and debugging phase of the software life-cycle is applied, whenever a stage of the implementation is reached that allows execution of the target, which is either the complete application or an independent module. This means, that all syntactical errors have been removed and the source code was compiled successfully.

Testing and debugging always operates in two cycles, during which the program is executed again and again until a sufficient degree of reliability has been obtained (see Section See Problems of Testing, "See Problems of Testing" for a detailed discussion). The testing cycle is used to detect incorrect computations, while the debugging cycle allows the user to gain more information about program states and intermediate results during execution. The idea is to identify, locate, and correct program bugs based on comparing achieved results with expected results See [WaBl 97]. In this context, Blum and Wasserman formally define the term "simple checking" as follows:

Simple checking See [BlWa 96]

Let f be an arbitrary mathematical function. Then consider the following task: given an input x and an output y , determine whether or not y = f(x) ; that is, determine whether or not the input/output pair represents a correct computation of f . This task is called simple checking .

In this case, the function f ( x ) describes the behavior of some program P ( x ) as intended by the user. Therefore it serves as a specification of intended program behavior, which is required for evaluating the correctness of the program as defined in Section See Software Quality. To recapitulate, a program is always correct relative to its specification. This correctness is verified during testing cycles, which apply simple checking iteratively for a set of inputs .

The Testing Cycle

The traditional approach to testing and debugging is sketched in Figure See Traditional testing and debugging cycle (compare with See [KrWi 98]). It consists of two cycles, each represented by a flow-diagram: the left flowchart represents the testing cycle, while the right diagram shows the debugging cycle. The testing phase starts with selection of appropriate input data. Therefore, the user chooses an input x from the set of valid inputs X . This input is provided for a program execution, which should compute the corresponding set of results y . Due to the fact that exhaustive testing would require an almost infinite number of inputs, the end-marker of the testing cycle has been omitted deliberately.

After the execution of the program terminated, the program's execution is checked whether an error occurred or not. As specified in Definition See Error (informal definition) [Enge 88] and See Failure [KrWi 98], an error can either be a program failure or incorrect results. Thus, these two characteristics have to be inspected. Immediately after program termination, two distinct incidents may be observed. Either the program's execution failed before the program arrived at its intended destination, or it terminated correctly. Obviously, a failure is always an indication of incorrect behavior, no matter if it produced incomplete results or no output at all. Consequently, the debugging cycle can be initiated immediately afterwards.

In case the program terminated correctly, the situation is more complicated because the program produced some amount of output, which may even be appropriate at first sight. The problem is that the validity and correctness of these results have to be verified. This may be difficult in many cases, because correct results have to be obtained somehow. For many functions, it is easier to check a completed computation than it is to carry out the computation itself. For example, it is difficult to factor an integer, but, given a, b, it is trivial to determine whether or not b is a factor of a. Various methods have been developed for checking a broad variety of functions asymptotically faster than they can be computed See [BlWa 96].

An interesting approach in this domain is offered by the testing and debugging tool Guard See [AbSo 95], which has been developed at the Griffith University, Australia. Guard emphasizes the fact that software is developed with evolutionary engineering techniques. This means, that the latest incarnation of an arbitrary application is always somehow related to its predecessors, a fact, that is also integral to source code control systems like SCCS and RCS, as well as program dependency tools like Make. For testing and debugging this means, that Guard enhances the traditional approach by automating the comparison of data structures between two running programs. With this technique, different iterations of the testing cycle can be compared automatically and Guard will deliver the variations between these executions. Furthermore, if a predecessor version of the target program is known to operate correctly, incorrect values of an arbitrary follow-up version can be detected easily.

A comparable approach, albeit based on statistics, is described with statistical testing in See [Walt 95]. By introducing a software usage model, the authors claim to detect errors that occur frequently in operational use of the software. Abrams et al have followed another comparable idea in the CHITRA trace analysis tool for performance analysis See [Abra 96]. Their idea is to use statistical methods to characterize parallel program executions, and to view the obtained executions always only as samples from the set of all possible program execution, thus applying statistical methods for program analysis See [Abra 95].

The Debugging Cycle

As mentioned above, the connection between the two cycles is established, whenever testing detects an anomaly. This may either be a program failure or incorrect results. If the observed results are incorrect or a runtime failure occurs, the program is subject to debugging.

A main difference between the testing cycle and the debugging cycle is, that testing operates on a set of individually different inputs, while debugging performs analysis of program execution based on one particular selected input. The input under consideration is obviously the input x , that is responsible for the anomaly and initiated the debugging cycle in the first place. Thus, the debugging cycle is always carried out with a particular set of input data selected during testing. Please note, that in large projects the testing and debugging cycle is often carried out by distinct teams or team members. The interface between these two collaborators is then the description of the anomaly and its corresponding input data.

The debugging cycle is presented in the right flow diagram of Figure See Traditional testing and debugging cycle. It starts with instrumenting the target program. This means, that debugging code is added to the source code which observes the program's execution. Afterwards the program is analyzed iteratively to detect the lines of faulty code. Although it depends on the applied tool, there are usually three activities taking place (compare Definition See Breakpoints [Stit 92]):

Setting a breakpoint7
  • Executing the program until the breakpoint
  • Analyzing the process state at the breakpoint

The connections between these three activities are described with Figure See Iterations of the debugging cycle. The first time-line shows one iteration of the testing cycle from Figure See Traditional testing and debugging cycle. Given some arbitrary input, the program is executed until it arrives at an observation point. At this point, an error is observed, which means that either a program failure occurs or incorrect results are detected. Obviously, the reason for this incorrect behavior must be placed at the observation point or somewhere before the observation point. Thus, the problem of the debugger is to deduce the conditions under which the program produced the incorrect output See [Agra 91]. This characterizes debugging as a "backward" process. In practice, this activity is called "flowback" analysis See [Balz 69] and there are basically two solutions to this problem:

Reverse execution
  • Backtracking with re-execution

The best solution would be to execute the program backwards one step at a time, starting at the observation point up to the point of the bug's origin See [Choi 91]. By comparing the states of the program at each of these steps, the error would be localized between the point where the state of the program was what was expected and the first point where the state was not what was expected See [Paul 99]. However, while this optimal solution could certainly improve the situation of many debugging users, it is not feasible in reality. The reason is that it is practically difficult and most times impossible to execute a program reverse in time, because many statements in a program and therefore the computing processes are irreversible8. One obvious example is the manipulation of file pointers and I/O descriptors See [Feld 88].

As a consequence, debugging is always based on backtracking by re-execution, where the program's statements are executed using their original order. The goal is the same as for reverse execution, to track the processes' history backwards in time from the manifestation of the error (the observation point) to the point at which the bug initially caused the erroneous behavior See [Choi 91].

In order to investigate the history of the process, the user is provided with the illusion of reverse execution See [PaLi 88]. To achieve this, a breakpoint is positioned somewhere in the code before the observation point as visualized with the second time-line diagram of Figure See Iterations of the debugging cycle. Since it is a priori unknown, if the breakpoint is really adequate, it can only be positioned at a suspected error location. Afterwards the program is executed until this breakpoint is reached. At this suspected error location the process state can be analyzed and information about the program's behavior can be gained. If this information is sufficient, and the bug has been detected, the debugging cycle is stopped and the faulty lines of code can be corrected. Otherwise, the error may either be located before or after the selected breakpoint location, which defines the new suspected error location. If the current breakpoint is placed after the new suspected error location, the program has to be restarted and executed again up to the new breakpoint. On contrary, if the current breakpoint is before the new suspected error location, it is usually easy to continue execution from the current position.

Some related work in this context are the prototype implementations of IGOR See [Feld 88], Spyder See [Agra 91], and PPD See [Choi 91]. IGOR is a tool that supports reversible execution with a mechanism available on arbitrary virtual memory systems See [Feld 88]. The basic idea is to peek at the virtual memory page tables and trace the reasons why these pages have been dirtied since the last checkpoint. A different approach is taken by Spyder, a debugging tool for monoprocessor C-programs which offers dynamic program slicing See [Weis 84] for execution-backtracking See [Agra 91]. Initially, an arbitrary variable and a corresponding program location is selected, for which Spyder automatically determines the statements that affect its value up to that position in the program. After executing the program with different test cases, Spyder helps to decide which statements in this dynamic slice to examine first. And finally, it allows to restore the program state at any location by backtracking the program execution to that location, without requiring to re-execute the program from the beginning.

A prototype debugging system dedicated to parallel programs of the Sequent Symmetry shared-memory multiprocessor is the Parallel Program Debugger PPD See [Choi 91]. It provides information on the data and control flow of a program's execution, while trying to keep both execution-time and debug-time overhead low. This is achieved by combining incremental tracing and program re-execution. With incremental tracing, the amount of tracing is selected according to the needs of the debugging user. Initially, only a small amount of trace data is generated during program execution. Afterwards, this initial data is supplemented during follow-up debugging cycles by detailed information, which is generated by tracing only selected parts of the program that are re-executed on demand.

The biggest problems of these approaches are the costs associated with restarting the program's execution, which is especially unacceptable for larger programs See [PaLi 88]. As a consequence, a variety of solutions is proposed to restart the program from intermediate code positions. In 1969, Balzer introduced the "history tape" within the EXtendable Debugging and Monitoring System See [Balz 69]. This is a observation capability that collects and stores all the data about a program's execution in order to format and present it to the user during debugging. Based on these data, flowback analysis can be performed by constructing an inverted tree, that describes how information flowed through the program to produce the value at the bottom node of the tree. In addition, a function for debug-time history-playback allows to retrieve information from the history tape, at any time of the program's execution.

The problem of the history tape is, that it requires enormous amounts of memory and is practically useless for present-day applications. In order to decrease these memory requirements while at the same time omitting the need to restart the program's execution at the beginning is to store required execution data periodically during the program's execution See [PaLi 88]. This is called "checkpointing" and provides the possibility to restart the program at all the places, where a snapshot of the program has been saved See [Feld 88]. While this approach is especially useful for debugging, it has also been used simply to ensure efficient usage of expensive computing environment, whenever long-lasting programs are executed and require a certain degree of fault-tolerance. In contrast to pure debugging, the main goal of checkpointing is to resume the execution, whenever a program's execution fails due to the loss of some external service See [Plan 95].

The traditional approach to testing and debugging as described above provides a brief overview of these activities and how they are usually carried out for sequential programs. Several testing and debugging tools have been developed that support users in that area. In addition, programmers tried to apply the same kind of techniques to parallel programs, and expected similar results. Yet, there are some fundamental difficulties that have to be solved for parallel program debugging, which cannot be mastered with this traditional debugging cycle. These problems will be discussed in the next section, before proposing a revised debugging approach that is also sufficient for testing and debugging of parallel programs.

Obstacles During Program Debugging

Problems of Sequential Debuggers

In principle, a parallel program is just a collection of several sequential processes, that execute concurrently and communicate in some way with each other. Consequently, the principle problems of sequential debugging are also evident for any parallel debugging tool. Rosenberg defines four key issues for designing and developing a debugging tool See [Rose 96]:

The Heisenberg Principle: a debugger should intrude the debuggee only in a minimal way.
  • Truthful debugging: at all costs, the debugger must be truthful so that the programmer may always trust it.
  • Program context information: the most important issue is the presentation of content information, to know where one is.
  • Debugging trails system development: any debugger that may be applied is almost always behind the technology one is using.

The Heisenberg (Uncertainty) Principle describes the intrusiveness of the probes inserted into any system being measured or tested See [LePa 85]. Originating from quantum mechanics, it states that one can never know everything about a quantum state See [Feyn 96]. Applied to software, another name for this behavior is the "Probe Effect" See [Gait 86], while errors established by these effect are often called "Heisenbugs" See [LePa 85]See [Rons 00]. Obviously, the intrusion should be kept a minimum in order to guarantee analysis data as correct as possible. In particular, the act of debugging an application should not change the behavior of the application. Otherwise, the usefulness of the debugger falls into question See [Rose 96].

In software debugging, a debugger violates the pure Heisenberg Principle in many ways, simply because it is in memory and controlled by the same operating system as the application being observed. This can affect the application in many ways, especially if the program is parallel and nondeterministic. For that reason, we will describe the Probe Effect in more detail in Section See The Probe Effect.

Bugs that are affected by this sort of thing are rare but extremely challenging and time-consuming because they are now so elusive. For sequential programs, bugs of this sort may also disappear when print statements are added to the source code of the debugged application, because the addition of the print statement may shift objects around in memory just enough to move the elusive bug somewhere else or even masks its existence. Another example is the disappearance of bugs when run under the auspices of a debugger, which leaves developers with no effective way to proceed See [Rose 96].

The most simplest solution for the Probe Effect would be to constantly perform program observation. In this case, the observation data describes the program's behavior accurately, but can be thrown away if not needed. This approach has been proposed many times, but isn't relevant in practice. Its main disadvantage is that observation activities need to be pinpointed well in advance, which makes this solution to rigid for practical usage. In addition, it requires some mechanism for managing observation data, which would be especially crucial on multitasking, multiuser systems.

Another critical issue of debugging is to always provide truthful information during debugging, which means, that the debugger must never mislead the user because the user is frequently testing out theories of how the observed failure may be cause See [Rose 96]. Obviously, any misinformation will direct the user in a potentially wrong direction, may cause a general lack of trust into the debugging tool and will dramatically hamper effective progress simply by devastating the user. Some big problems of debuggers are observed with optimized code, simply because the optimizing compiler changes the code radically and may prevent accurate information about how the optimized code maps back to the original source code. Another problem is observed, when variables are inspected which are currently being transferred between their actual memory location and the processors registers. Furthermore, compilers may eliminate unused variables from the code, which may be difficult to be recognized by the debugger See [Rose 96].

The most important issue for the debugging user is program context information, which may include several different types of information about source code, stack back-trace, variable values, thread information, and others. Therefore, a debugging tool must always relate the manifestation of the bug with the original lines of source code See [Rose 96]. In principle, programs can only be debugged, if the connection between their behavior and the original source code can be established See [Baec 97]. Yet, these lines may not contain the actual cause of the bug, because many bugs occur in a place much earlier to the manifested effects. Thus, the next important issue is to backtrack the code to the original place of the bug, the root-cause.

And finally, one of the biggest problems of debuggers is, that system developments occur long before any corresponding strong debugging support for the new systems developments is available. System vendors react to the mass market, which in turn requires certain technologies. Of course, debugging is only one of these technologies, but only with sufficient support for debugging, reliable systems may be developed. Therefore, Rosenberg suggests, that debugger developers need to push the systems vendors to provide the necessary infrastructure in order to enable support of the latest debugging technologies, while at the same time application developers need to push for debuggers in order to debug their complicated applications on the new architectures with its next great features See [Rose 96].

Fundamental Differences between Sequential and Parallel Debuggers

The description of the debugging and testing phase is generally valid for the sequential and the parallel case. Similar to parallel programming, where many approaches start by extending sequential codes, many parallel debuggers are simply combinations of existing sequential tools. However, the nature of parallel programs requires some fundamental extensions from parallel debuggers. As a consequence, some existing solutions may be insufficient for some problems of parallel error detection.

In principle, there are mainly three reasons why parallel debugging differs from its sequential counterpart:

Increased complexity
  • Amount of debugging data
  • Additional anomalous effects

One reason is the increased complexity of parallel programs compared to their sequential counterparts. While most sequential debuggers perform symbolic debugging9 on source code level, this may be hard in the parallel case.

For example, consider opening a source code window for each participating process, and perform debugging a program with more than a handful of processes. It has been noted before that text is inappropriate in many cases to display and manage the inherent complexity of parallel programs See [Panc 96a]. Therefore, some means of abstraction seems useful to establish a layer above the source code level. However, it has also been noted above, that some connection to the original lines of code is required in order to detect the corresponding location of faulty statements.

Another reason is the possible huge amount of storage required for debugging data, that has to be provided and analyzed. Since debugging is based on the program's results and the states of the processes during execution, analyzing these data from massively parallel programs may impose a heavy burden onto developers. Often immense and uncontrollable amounts of data are obtained See [Ober 98], and users are often overwhelmed with potentially irrelevant data See [Reed 97]. The result is the so-called "maze-effect" See [Damo 94], which means that observers are completely lost. Although this is also possible in the sequential case, it is much more likely in parallel machines, simply because of the increased amount of debugging data. Therefore, a debugging tool must allow to concentrate on the essential details, instead of flooding the user with all the possible data See [Netz 96].

In addition there are other problems which exist on smaller program scales, too. An example is the set of errors in parallel programs, which consists of multiple sequential bugs and errors established by synchronization and communication of concurrently executing tasks. While the multiplication of sequential bugs may already represent a big obstacle, the existence of anomalous effects due to concurrency makes parallel debugging even worse. The reason for these effects in parallel programs stems from the increased influence of the underlying parallel hardware system. Consider the debugging cycle for a simple program, consisting of the subfunctions f0, f1,... fn as sketched in the left diagram of Figure See Comparison of sequential and parallel program influences [KrVo 98]. Applying the simple checking method, we would use repeated executions in order to determine the correct intermediate states between these subfunctions. If the sequence of subfunctions f0, f1, and f2 describes the desired computation, the resulting output y is determined by the provided input x and the interaction with the system environment. If the input is the same and system-interactions can be reproduced, each iteration of the debugging cycle would yield the same output, and intermediate results can be analyzed with breakpoints and single-stepping during iterative re-executions.

In parallel programs the situation is more difficult, because the subfunctions are executed on concurrent processes and all process-interactions are influenced by the computer system. Consider the parallel program of Figure See Comparison of sequential and parallel program influences [KrVo 98], consisting of subfunctions f0, f1, and f2 on process 0, as well as g0, g1, and g2 on process 1. Again the output is determined by the computation of these subfunctions, the input x, and all system interactions. Since communication and synchronization are carried out by the system, even small changes in the system state may trigger different behavior See [Netz 96]. Some of these system characteristics are (compare with See [McHe 89] and See [Kran 97a]):

Processor speed and load imbalance
  • Scheduling decisions of the processor and the operating system
  • Cache contents, cache conflicts, and memory access patterns
  • Network throughput and network conflicts (latencies, contention)
  • Nondeterminism on the interconnection network

Unfortunately several of these environmental parameters cannot be controlled by the user, neither during subsequent executions of the program in the testing cycle, nor during the debugging cycle with the debugging tool. Therefore, these small variations in the system environment may vigorously influence the execution path of a program and introduce lots of difficulties for program testers and debuggers.

Errors from Process Communication and Synchronization

The influences described above are experienced whenever concurrent processes interact, because there are always some kind of system services involved. Therefore, the necessity of parallel programs to exchange data or to synchronize the communication partners introduces effects, that do not exist in sequential programs and may therefore be ignored during corresponding debugging cycles. Yet, for parallel debuggers additional functionality for handling these communication effects is a basic requirement.

Since the main reasons for these obstacles are communication and synchronization operations, the amount of observable effects is determined by how this process interaction is carried out during the program's execution. As described in Section See Explicit Parallel Programming in Practice, the two contrary paradigms are shared variable and message-passing. Due to the fact, that processes in shared variable programs are tightly coupled, interprocess communication is usually less expensive than in message-passing programs. The latter suffer high communication costs due to communication latency and achievable bandwidth. For that reason, the amount of interprocess communication and therefore the probability to encounter critical effects is usually higher in shared variable programs.

Although there are differences between these two paradigms in terms of occurrence frequency, the types of possible effects are similar in both cases. Some of the most tedious difficulties are:

Deadlocks and livelocks
  • Stampede and bystander effects
  • Irreproducibility effects
  • Completeness problems
  • Probe effects

Deadlocks and livelocks are well-known effects arising from concurrently executing and communicating or synchronizing processes (e.g. See [GlSh 80], and See [Nata 86]). The former describes the situation, where each of the participating processes is mutually waiting or blocked until one of the others does something See [MaJa 97]. The latter characterizes a group of processes that is continuously changing their state in response to changes of other processes without doing any useful work See [Buth 99]. In both cases, the program's progress is suspended until something resolves the conflict. As a side effect, the state where each of the processes persists, is usually a good starting point for debugging, and errors of this type can usually be corrected easily See [SnHo 88]. Several tool exists for deadlock and livelock analysis. A recent example is the Deadlock Checker See [MaJa 97], which uses various innovative techniques to prove deadlock and livelock-freedom for communicating sequential processes. Other solutions are described in See [Chan 83], See [Germ 84], and See [Sing 89].

In contrast, stampede and bystander effects are much more difficult to debug. The stampede effect See [SnHo 88] may emerge after an error has already been encountered on one process. However, since it is rarely possible to stop all the other processes immediately, they may continue to execute and stampede over the evidence of this original error. As a consequence, only limited useful information is left for the subsequent debugging process. In case of the bystander effect See [SnHo 88], the data space of one process is corrupted by another one. Obviously, the corrupted process is the first candidate for debugging, when in fact a different process originally caused the damage.

The effects described so far exist in any system composed from concurrently executing processes. In contrast, the irreproducibility effect and the completeness problem may only emerge in nondeterministic parallel programs, while the consequences of the probe effect are much more critical in this type of programs. Due to this fact, understanding, writing, and debugging nondeterministic programs is accepted to be one of the most difficult activities in parallel software development See [McHe 89]. Consequently, a debugging tool should be able to cope with this group of effects, if it is intended to provide sufficient support during program analysis and error detection.

Nondeterministic Parallel Programs

The origin of determinism and nondeterminism can be traced to philosophy and ethics See [Broc 96]. The term determinism defines the causal certainty, that every event in human life is predefined by the laws of nature and its preceding events. In contrast, nondeterminism describes the doctrine, that our lives are only partially determined by the laws of nature, and some parts cannot be predicted by the principle of causality. Therefore, nondeterminism is much less restrictive than determinism and offers a higher degree of freedom. In reality, especially microphysics, many natural events have been observed (e.g. quantum theory), which contradict the deterministic school.

In computer science, these terms are especially important in the theoretical domain, where computational complexity theory and algorithm analysis are performed. An example is the classification of problems based on how hard they are to be solved. The class P describes problems that can be solved in polynomial time, while the class NP contains all nondeterministic polynomial time problems. The latter contains the group of NP-complete problems, for which all known solutions are essentially of the type "try all possibilities" (compare "See The Completeness Problem", Section See The Completeness Problem), which can be impossible in some cases See [Aho 83].

Besides this classification, the terms determinism and nondeterminism are used to characterize the relations between process states in a program. (Program states related to debugging have been defined in Definition See State of program during execution [Grab 99] on page See State of program during execution [Grab 99].) The interesting question is, if a unequivocal well-defined order of program states exists or not. This is also very important for debugging, since it operates by analyzing the programs states during execution. A possible definition of determinism is as follows:

Determinism See [Enge 88]

A program executed by a machine is called deterministic, if there exists exactly and only one follow-up state after each programming statement. This means, that the follow-up statement of any statement is always uniquely defined. In addition, every deterministic program must terminate10.

Obviously, a program with deterministic behavior offers some desirable qualities, and developers believe, that consistent and reliable software products must be deterministic See [Stei 98]. Yet, it also restricts the programmer's and their code too much for many applications. This is also expressed in some research papers, which believe that the importance of nondeterminism is steadily increasing, especially in the domain of parallel and distributed programming. An important goal is to obtain a better level of performance. If the usage of nondeterministic communications is increased, the degree of parallelism can be raised. As a consequence, potential parallelism in parallel or distributed computing systems is better utilized than in deterministic programs.

A simple case is a FIFO queue (First In - First Out), where data is processed based on their order of arrival See [NeMi 92b]. An actual example is offered by parallel ray tracing algorithms (e.g. See [CaSc 89], and See [JeRe 92]), which compute the final results by dividing the scene onto concurrent processes. The load is distributed between the processes by a master-worker scheme, that handles requests on a first come first served basis. Due to the nature of these algorithms even the same scene may be computed by different processes with a different order of the traced rays.

Although this is obviously the most important reason for using nondeterministic communication features, there are other substantial reasons, for instance when modelling reality. If the computing system simulates a natural phenomenon with nondeterministic characteristics, the simulation program must reproduce the same behavior. Please note, that this is usually implemented with random number generators, but it is also possible to achieve similar results with nondeterministic communication statements. Typical applications in this domain are chaotic, physical computations, which may be inherent nondeterministic. Examples are crashtest simulations for car manufacturers (e.g. PAM-CRASH See [Lons 94] and See [Lons 95]), where different results may be observed although the same kind of impact is used as an input. This represents reality, because the deformation of objects is determined by the applied forces and variations in the environment that are to inconceivable to be modelled. Thus, there is a certain degree of chance involved.

Another reason for using nondeterminism, which is probably more error-prone than the reasons described above, is "programmer's laziness". Often, determining the correct order of nondeterministic events is much more difficult than choosing one of the possible events. Thus, the program is implemented to react to the most likely program states in a way that is assumed to be correct. Yet, this method contains a major pitfall, because introducing nondeterminism often introduces unexpected behavior under certain conditions, which results in hazardous, occasional bugs (see See Definition and Classification of Errors). Therefore, this kind of using nondeterminism should clearly be avoided as much as possible.

Connected to programmer's laziness is our last example of nondeterminism in software, which may be labeled "accidental nondeterminism" or unintended nondeterminism (see See [DaFr 94], See [Kran 96d], and See [KrWi 98]). While all the reasons above apply nondeterminism intentionally, some programs contain nondeterminism without the user's knowledge. In this case, unintentional nondeterminism is creeping into the code due to misplaced or misused synchronization or communication functions. This is very dangerous, because users are unaware about the potential havoc and may therefore miss to check for this possibility.

As a next step, we will now define nondeterminism based on the definition of determinism above, because obviously any program that is not deterministic is nondeterministic. At this point, we should remark that nondeterminism is sometimes called nondeterminacy (e.g. See [EmPa 88], See [Emra 92], and See [DaFr 93]) or even indeterminacy (e.g. See [LeMe 87]), but all these terms describe the same characteristics. We will stick to nondeterminism, since it is the most widely used of all these alternatives.

Nondeterminism

A program is nondeterministic, if - for a given input - there may be situations where an arbitrary programming statement is succeeded by one of two or more follow-up states. This freedom of choice may be determined by pure chance or unawareness of the complete state of the execution environment.

This definition shows that the key problem of nondeterminism is the absence of sufficient knowledge about the actual, complete state and its follow-up states in the computing environment. This leads us back to Figure See Comparison of sequential and parallel program influences [KrVo 98], which described the differences between sequential and parallel programs, and the influence of the computing system, respectively. As mentioned before, it is difficult if not impossible to describe the complete state of a parallel computing system at any time during a program's execution.

Based on Definition See Nondeterminism, we will now try to characterize a nondeterministic program with the following definition:

Nondeterministic programs

Nondeterministic programs do not fully specify all possible execution sequences, but allow a degree of freedom in selecting subsequent program states. Consequently, a program is nondeterministic, if successive executions of the same program may yield different results although the same input is provided See [NeMi 92b].

There are two remarks to this view about nondeterministic programs. Firstly, the input must be valid according to the program's specifications. This means, that only input data that may actually be processed by the application may be considered. The reason is, that illegal input may affect the program's execution in an undefined, unexpected way and may thus also be responsible for different results during successive program executions.

Secondly, the appearance and composition of the program's results have to be discussed in more detail. The question is, what are the actual results of a nondeterministic program? Of course, most important are the final results of the computation, which (hopefully) represent the purpose of the application and intent to provide a solution for the user's primary goal. Yet, in the sense of testing and tuning, it may also be interesting to determine those factors besides the input, that influenced the results of the program. Therefore we include the following items in the term "results" of Definition See Nondeterministic programs:

Final results
  • Intermediate results
  • Interaction patterns between concurrent processes

All these items are influenced by the nondeterminism of the program and may thus vary in subsequent executions although the same valid input is provided. This distinction is necessary if the correctness of the program is investigated. The reason is, that it may be possible to obtain correct final results with different intermediate results for any nondeterministic program. In addition, we may observe the same final results from a parallel program, whose execution varies in the interaction pattern of its processes. As a consequence, the possible nondeterministic behavior of the program is often concealed for the user, which means a high potential for malicious occasional errors. Such programs may even operate correctly for long periods of time, until suddenly incorrect results are revealed.

Please note, that most research papers in this domain (e.g. See [SnHo 88], See [McHe 89], See [NeMi 92b]) consider only the results of the program as being important. Only recently, Ronsse et al distinguish between external nondeterminism and internal nondeterminism See [Rons 00]. External nondeterminism means that an application returns different results for repeated executions with the same input data, while internal nondeterminism means that repeated executions with the same input yield the same results, but the internal execution path is different.

In addition, See [DaFr 94] distinguish intended and unintended nondeterminism, and propose to leave intended nondeterminism without investigation. Unfortunately, this leaves the problem, that even intended nondeterminism may reveal program errors. Similar ideas are described in See [NeDa 94] and See [Netz 96], who characterize feasible nondeterministic events as well as artifacts occurring due to nondeterminism, and propose to detect the set of artifacts and simply ignore them. Although these viewpoints may contain a critical pitfall as described above, we found only Oberhuber See [Ober 98] to cover variations in the communication/synchronization pattern of concurrently executing processes.

Race Conditions

As described above, there are different sources for nondeterminism to occur. In principle, the basic requirement is to allow two or more activities to progress in parallel See [McHe 89]. More detailed analysis allows to identify the following main reasons See [Ober 98]:

External sources:
  • Interrupt service routines for handling requests from I/O peripherals
  • Internal sources (defined by the program itself):
  • Return values from system calls like random number generators See [NeMi 92b]
  • (Internal) communication or synchronization functions

The main distinction between external sources and internal sources identifies those items that can be controlled by the program and those that are completely independent from the user's code. In principle, external sources should be handled completely by the operating system, and should not interfere with the program in any case. Although these reasons may introduce very problematic situations, we may assume that their exceptions are covered by the operating system. Thus, we will not discuss them in more detail. (Please remember, that operating systems are also software and many critical errors are included in them, too - see Section See An Example of Bug Potential)

In addition, external sources like interrupt service routines and return values from the random number generator, do exist in sequential as well as in parallel programs. Thus, a parallel debugger must handle each of these cases, but there are solutions for covering internal sources in sequential tools. (In Section See Random Number Generation we will briefly discuss the integration of functionality for random number generators in our approach).

The remaining item - (internal) communication and synchronization functions - only exists in parallel programs, which contain concurrently executing and interacting processes. The question is, whether or not some kind of ordering is enforced onto the process communication and synchronization. If the interaction between the processes of one program is strictly ordered for all possible inputs, the interaction pattern will be the same for each execution. In that case, communication and synchronization does not introduce nondeterminism in the program's behavior. On contrary, if the parallel environment permits different orders of process interaction, the results of the program may possibly be influenced by the order that occurs during program execution. Such a place in the code, that allows different orders during program execution, is called a "race condition". A possible formal definition to race conditions is offered by Helmbold and McDowell:

Race conditions See [HeMc 96]

Fix an input to the program. If two (or more) conflicting events are unordered (with respect to the same input) then there is a race between the two events on the input.

In order to discuss this term in more detail, we have to take a closer look at the kind of interaction, that takes place during program execution. Again we have to distinguish shared variable and message-passing as described in Section See Explicit Parallel Programming in Practice.

If communication and synchronization are applied with the shared variable paradigm, Definition See Race conditions [HeMc 96] describes two or more independent events accessing the same shared resource in a conflicting way. Furthermore, at least one of them modifies the resource and their execution order depends on the scheduling of the corresponding processes. Thus, for shared variable programs we may refine Definition See Race conditions [HeMc 96] as follows:

Shared variable race conditions

Whenever two or more processes are allowed simultaneous access to a shared variable, a race condition may arise in which the value of the shared variable can become indeterminate. Thus, a race condition is the process that leads to an indeterminate value See [Lewi 93].

These kinds of races are called "data races", and have been discussed extensively in literature. Some of the work on shared variables includes See [AlPa 87], See [MiCh 88], See [DiSc 90], See [Helm 90], See [Hood 90], See [ChMi 91], See [NeMi 91], See [Adve 91], See [Emra 92], and See [JuMc 00]. Recently, See [RiLa 98] described how to apply protocol-based race detection to DSM (Distributed Shared-Memory) architectures. Furthermore, races in shared memory programs can be identified on different levels of synchronization, from mutual exclusion to weaker synchronization such as Post/Wait. Of course, the more low level the synchronization, they more difficult is it to detect the races See [RiLa 98].

In the scope of their work, Helmbold and McDowell also identified four disjoint kinds of races depending on the ordering relations that may be observed during execution: concurrent races, general races, unordered races, and omission races. For a detailed discussion about these shared variable races, please take a look at See [HeMc 96]. In addition, Netzer and Miller See [NeMi 92a] recognized three different types of data races: actual, apparent, and feasible. Actual data races occur when two conflicting events execute concurrently in a given parallel execution. Apparent data races capture the notion that, although a pair of conflicting events might not have executed concurrently they could have, given a program's synchronization. Thus, apparent races may contain data races that are prevented from occurring by the program's semantics. On contrary, feasible races are permitted by both a program's synchronization and its control and data dependences.

Similarly to the shared variable paradigm, race conditions may also occur in programs based on the message-passing model, although there exists no shared memory. In this case, process interaction takes place via messages, and race conditions emerge only at communication and synchronization statements. The resulting race conditions are expressed as racing messages.

Message-passing race conditions - message races See [Netz 96]

A race condition in message-passing programs occurs, when a receive operation is pending, while two or more messages may be simultaneously in transit to arrive at this particular receive operation. Therefore, messages race with each other when their order of arrival at a process is not guaranteed and may be affected nondeterministically (by system characteristics as defined in See Fundamental Differences between Sequential and Parallel Debuggers).

In contrast to data races, only few research papers exist for message races. In addition, the results obtained for shared variable programs cannot be applied directly to message-passing, although some ideas of the former may have inspired the latter See [Netz 96]. Since we are mainly interested in message-passing programs, we will describe some of the related work in more detail.

One well-known example is the approach by Netzer and Miller See [NeMi 92b], who characterized racing messages and offered an on-the-fly detection algorithm (compare also with See [AlVe 93]). Yet, their technique only indirectly detects the existence of racing messages, but does not uncover all racing messages in the program. On contrary, the approach by Damodaran-Kamal and Francioni actually reports message races on-the-fly using a technique called controlled execution for testing and debugging See [DaFr 94]. They use receive semantics in one process at a time and try to distinguish between intended and unintended races. Intended races are those that have been deliberately introduced by a programmer, while unintended races occur without prior knowledge and usually represent the actual manifestation of a bug.

Based on these preliminary approaches, Netzer and Damodaran-Kamal joined forces and combined their ideas on message race detection. Their first combined results about message races have been published in 1994 See [NeDa 94], which have been revised together with Brennan in 1996 See [Netz 96]. One of their addressed problems is diagnostic accuracy, because they believe that overwhelming the programmer with too much information should be avoided (see "maze effect", Section See Fundamental Differences between Sequential and Parallel Debuggers). This opinion is derived from the fact, that message passing programs can exhibit thousands of races, and sifting through race reports this large is too difficult. As a consequence, they distinguish artifact and nonartifact races.

The critical races are the nonartifact races, which represent those places in the code, where unintended nondeterminism is actually being introduced. On contrary, artifact races are only side effects of nonartifact races or nondeterminism, respectively. Thus, a nonartifact race can cause many subsequent artifact races, and all the bug hunting support of a debugging tool must lead to the nonartifact races. As soon as a nonartifact race is fixed, the artifacts will disappear See [Netz 96]. This also means, that there is no need to investigate artifacts, and they should be ignored during debugging. Instead, only races that can be guaranteed nonartifact should be reported See [Netz 96].

While all the above approaches to message race detection apply runtime techniques, the approach of Tai presents post-mortem algorithms to detect all message races for certain message-passing implementations, such as fully asynchronous, synchronous, FIFO, and causal See [Tai 95]. In contrast to Netzer and Damodaran-Kamal, his goal is to find all the races in a program, whether or not they are artifacts.

For Cypher and Leu See [CyLe 93], race condition detection is just a side effect of modelling the properties of message passing programs. Their primary intention was not to present a race detection algorithm, but to discuss different types of message-passing semantics and prove results about each.

Similarly, Oberhuber See [Ober 98] modelled the properties of message-passing programs, but his focus was again on nondeterministic behavior and thus on racing messages. His main contribution is a set of formalisms that have not been described in such detail before. He provides detailed analysis of the reasons for racing messages and an approach to control the execution of such nondeterministic programsSee [Ober 95].

The Irreproducibility Effect

After introducing the characteristics of nondeterministic programs, we will now return to look at the remaining communication and synchronization errors: irreproducibility effects, completeness problems, and probe effects.

The irreproducibility effect See [SnHo 88], sometimes called non-repeatability See [NeMi 92b], is the most obvious problem occurring when testing and debugging nondeterministic programs. Since these programs may yield different results (again, final results, intermediate results, or interaction patterns, see Section See Nondeterministic Parallel Programs) in successive program runs although the same input is provided, it cannot be guaranteed that subsequent executions of the program exhibit the same behavior. Yet, this is a basic requirement for the debugging cycle.

The consequence of the irreproducibility effect are twofold. Firstly, program executions observed during testing may not be reproducible during debugging. Thus, even if an error is detected during the testing cycle, it may be difficult or even impossible to reveal the original cause during the debugging cycle. Secondly, the error observed during testing may be replaced by a completely different error during (cyclic) debugging. Although it is certainly important to remove both errors, recognizing the dissimilarity between the faulty behavior (and also between the original sources) may be confusing. Therefore, a debugging tool for nondeterministic parallel programs must offer a way to reproduce arbitrary test cases.

A solution for these problems is provided by "record&replay" mechanisms, that provides any number of equivalent executions based on some previously observed program execution. According to Leu et al, an equivalent execution can be defined as follows See [Leu 90]:

Equivalent execution See [Leu 90]

Two executions of a process p are considered to be equivalent, if the process p receives the same information from the other processes at the same instants. The instant of an arbitrary event is defined by the interval in which only this event takes place. This means that two executions of a parallel program will be considered to be equivalent if the execution of each of its processes is equivalent

In order to achieve such an equivalent execution, all the elements that influence a process have to be determined. These are the input data provided by the user and the items identified as reasons for race conditions in Section See Race Conditions. Since equivalent input is a basic requirement for the debugging cycle, any method offering equivalent execution must only reproduce the return values from system calls and equivalent communication and synchronization functions.

The solution of record&replay methods is based on a two-step approach. During an initial record phase, the order of critical elements is stored in tracefiles, so that the tracefiles contain sufficient data to describe all the choices taken at nondeterministic events in the code. Afterwards, if the program has to be re-executed and an execution equivalent to the initial record phase is needed, these tracedata are used as a constraint for the program. This is performed by the replay tool, that forces the target program to take the same choice at any nondeterministic point as has been taken during the record phase. If the order of all elements is the same for every program run, their behavior is equivalent and the same errors and results are observed. Thus, record&replay methods allow nondeterministic programs to be tested and optionally re-executed according to the users needs.

The history of record&replay techniques - sometimes called deterministic execution See [CaTa 91], trace-driven simulation See [Fren 95], or instant replay See [LeMe 87] - dates back to an approach by Curtis and Wittie See [CuWi 82], who proposed a technique to record the contents of each message as it is received by the corresponding process. Similar approaches are described in See [Smit 84] and See [LeRo 85] with the advantage that selected processes could be replayed in isolation. Yet, the biggest drawback of these approaches was the requirement of significant monitor and storage overhead. Thus, this data-driven approach to record&replay techniques was dropped for control-driven approaches. The latter only preserve the order of occurring operations, and generate the transferred data again during each replay execution.

Probably the first control-driven approach of record&replay was Instant Replay by LeBlanc and Mellor-Crummey See [LeMe 87]. Their idea was to trace the relative order of significant events (usually communication events occurring at send and receive function calls) instead of the data associated with these events. Since the contents of all process interactions do not need to be stored, this approach requires less time and space to save the information needed for replay. In addition, Instant Replay does not depend on a particular form of interprocess communication. Indeed it was originally intended to be applied for debugging parallel programs on tightly coupled multiprocessers, where processes communicate via shared variables and more frequently than in message-passing systems See [LeMe 87].

Several other record&replay methods have been implemented as an extension of Instant Replay. In fact, most existing prototype tools that allow analysis of nondeterministic parallel programs are based on this approach. Some example implementations of record&replay tools are IVD See [Mack 93], Panorama See [MaBe 93], PDT See [Clem 95b], Buster See [Xion 96], and DDB See [SiRa 97].

In See [Mack 93], a program replay mechanism for the message passing library PVM is described, that has been developed and integrated in the Interactive Visualization Debugger IVD See [Hao 93]. Another integrated parallel debugger for PVM programs that includes record&replay capabilities is Buster, which is described in See [Xion 96]. On contrary, the Panorama debugger has been implemented for the Intel iPSC/860 and the nCUBE architectures See [MaBe 93], and supports both on-line and post-mortem debugging. In this case, on-line means that the debugger is running without any constraints, while during post-mortem re-execution the program's behavior is controlled by a simple record&replay facility.

The Parallel Debugging Tool PDT is announced as an interactive, source-level debugger for distributed-memory parallel programs within the Annai programming environment See [Clem 95a]. It offers debugging support for high-level data-parallel programs based on High Performance Fortran (HPF) as well as message-passing programs based on the MPI standard See [Clem 95b]. For MPI programs, PDT also includes a record&replay mechanism, which represents an implementation of the approach described in See [NeMi 92b] (see below) with an extension for asynchronous probe primitives. The main problem of this approach is its high amount of monitor overhead, which is generated by maintaining extensive timing information during the record phase.

Another implementation of Instant Replay is included in the Distributed DeBugger DDB See [SiRa 97], which targets at the multihreaded Mach distributed operating system developed at Carnegie Mellon University. One of the key goals of Mach is to support a distributed system on top of heterogeneous hardware platforms. This is only one of the characteristics addressed by DDB, which supports the minimal requirements for debugging distributed programs as identified by its authors. In their paper See [SiRa 97], they authors also cover 29 other distributed debuggers and classifies them according to the following criteria: What kind of replay is provided? What types of breakpoints are supported? What effects can be controlled with breakpoints? Which event types are observed by the system?

Of special interest is the proposal of Netzer and Miller, who developed an optimal tracing technique which tries to minimize the amount of necessary trace data, both for message passing systems See [NeMi 92b] and shared variable programs See [Netz 93]. Their idea is based on the fact, that only those events affecting the race conditions do have to be traced. By making such run-time decisions, only a fraction of the total number of events (as required with Instant Replay) have to be stored in the tracefiles.

Although this on-the-fly trace reduction approach of Netzer and Miller is optimal in the common case and offers a good performance in the worst case, it could be further improved for space and time efficiency using certain characteristics of the underlying system. Such improvements for time efficiency have been described in See [FaCh 95] and See [FaCh 96], while improvements for space efficiency can be found in See [LeSc 92] and See [LeSc 93]. Several combined efforts for shared variable programs have been proposed by Audenaert, Levrouw, and Ronsse in See [AuLe 95], See [Levr 94], See [RoLe 95], and See [RoLe 96]. Additionally, they show how to apply very simple compression algorithms to traces in order to further minimize tracefile size See [Rons 95]. In cooperation with Ronsse we developed a technique for message-passing parallel programs that produces even smaller traces and is less intrusive than the others described above (for details, please refer to See [RoKr 98]). In Section See Possible Optimizations, we will show that this approach can be further improved by exploiting certain characteristics of most existing message passing libraries. An approach that has been influenced by this work and is therefore comparable to ours is MPL*, which described by Chassin et al in See [Chas 99].

From the approaches that are not based on Instant Replay, some worth-mentioning are Deterministic Execution See [TaCa 91], Demand-Driven Replay See [MiTa 95] and On-the-Fly Replay See [Gers 94]. In Deterministic Execution, a sequence of synchronization constructs is generated based on an initial execution of the program. By inserting these synchronization constructs into the replayed execution, the irreproducibility effect can be eliminated See [CaTa 91]. The trace of the initial execution is also required for Demand-Driven Replay, which replays a program between breakpoints by allowing only those interprocess activities, that are needed to reach the next breakpoint See [MiTa 95]. The only technique that does not require a dedicated record phase is On-the-Fly Replay, because record and replay phase are executed simultaneously. The replay phase is steered by the record phase by exchanging ordering information between both executions See [Gers 94].

The Completeness Problem

A related crux observed during testing nondeterministic parallel programs is the completeness problem, which is a kind of observability problem11. In this context, observability characterizes the amount of information that must be retrieved from a program code or specification in order to correctly specify, select, and evaluate test cases See [KrWi 98]. Yet, obtaining all possible executions of nondeterministic parallel programs is difficult or even impossible. However, complete testing of such programs requires to consider every possible execution. In principle, the following items have to be addressed:

Several valid results may exist for one set of inputs.
  • Critical errors may occur only sporadically.

These problems may occur in nondeterministic programs, because different results may be computed for one set of inputs. Furthermore, different execution paths of the program may still compute the same results. Thus, the observed results may not necessarily represent the only valid executions of the program See [DaFr 94]. For example, in programs modeling nondeterministic scenarios from reality, several valid executions may exist and may be of interest to the user. In addition, some errors may be hidden in executions that may seldom or never occur during testing. Yet, since environmental parameters may affect the program unpredictably, every possible execution of a program must theoretically be considered during testing.

Formally, this means that the testing approach of Section See The Testing Cycle has to be extended for nondeterministic parallel programs. Again, we apply the simple checking approach of Blum and Wasserman See [BlWa 96] (see Definition See Simple checking [BlWa 96]). After choosing an arbitrary mathematical function f ( x ) that describes the behavior of some program P ( x ), and a valid input x , we have to determine whether or not y = f ( x ) in order to confirm if the input/output pair represents a correct computation of f . The problem is, that there may be more than one correct output y for each set of inputs . Thus, for satisfying the test completion criterion with nondeterministic programs it is necessary to compute each output y from a set of possible outputs for each input of the set of valid inputs in X . As a consequence, some means of steering the choices at nondeterministic events have to be provided. The goal is to obtain alternate versions of particular test cases during replay, based on some previously recorded program execution.

Although this problem is well-known in the parallel debugging community, there are only few solutions that address this completeness problem. In fact, most of these solutions try to overcome this problem by eliminating or controlling the nondeterminism occurring due to communicating processes. A simple approach would be serialization of the code, which means that no process is allowed to proceed concurrently to any other existing process. If the processes are executed in a pre-defined sequential order, race conditions simply cannot occur.

One example of this approach is "controlled execution" See [TaCa 96], which also offers a method to perform automatic testing. Based on a technique for describing the desired communication behavior, the replay tool forces the program to execute the communication as specified. Therefore, the order of all communication events is predefined and races are not permitted. In addition, selected events of the communication description can be varied, allowing to perform restricted permutation testing. An extended approach to this technique is offered by Oberhuber See [Ober 98], who proposes to specify not only input data for testing, but also the sequences of interprocess communication See [Ober 95]. In this approach control patterns are applied for establishing the order between the communication events. Such control patterns are dynamic rules to define the order of process interaction. They can be grouped hierarchically to express more and more complex patterns, which should then sufficiently describe the expected communication structure of a parallel program. This eliminates nondeterminism and guarantees reproducibility, if the predefined behavior is enforced by the debugging tool during execution and re-execution See [Ober 95].

There are several problems to all these proposals. Firstly, the user is confronted with an additional means of specifying the program's communication behavior. It may be difficult to demand this from the average user, who already described the program's behavior by specifying the communication statements in the original source code. Not only does it require to learn a new description language, but also a correct relation between the program code and the communication specification. Secondly, the correctness and completeness of the communication specification must be guaranteed in order to mask all possible executions for all possible inputs. This may be possible in some cases, but is certainly limited in case of highly complex and large programs, which may exhibit totally unexpected behavior. Thirdly, these solutions do not exhaustively test the program's execution, but instead pretend to test the most feasible cases. Therefore, the user may be mislead and trust in a program, that is actually not completely tested and may exhibit incorrect behavior whenever executed without this controlling mechanisms.

In contrast to the solutions described so far, our proposal addresses the completeness problem more directly. The first description of our idea is presented by Grabner and Volkert in See [GrVo 93], who nicknamed this approach "event manipulation". An improved version of their basic idea together with a corresponding replay mechanism can be found in See [Kran 95a] and See [GrVo 96]. Our basic idea was also extended in See [Kran 97b], where we show that a combination of race condition detection with control and data flow analysis can provide vital information for selecting appropriate race condition candidates. Recently, the original manual approach has been automated as described in See [KrVo 98]. This latest instance of automatic event manipulation will be described in more detailed in Chapter See Automatization of Nondeterminism Analysis. Therefore, the following paragraph gives only a brief summary of our proposed approach.

The main idea of our approach can be formulated as follows: Firstly, an initial execution of the target program is observed during a record phase. The observed data is analyzed and encountered race conditions are identified. Based on these race conditions, the following question is explored: What would have happened, if the order of the communication events at some arbitrary race condition would be different from the observed order? The answer to this question for any selected race condition in the observed execution is derived with two steps:

Event manipulation
  • Artificial replay

Event manipulation means, that the observed order of communication events is changed. In the original approach, this change is applied graphically in a display that represents the program's communication behavior. After the exchange has been performed, a replay of the program is initiated, that enforces the artificially generated new behavior. This re-execution has to perform two distinct tasks to solve the question stated above: Up to the point of the event manipulation, the program is replayed exactly as observed during the record phase. After the event manipulation has been performed, the program is executed without any constraints. This is necessary, because the future of the program after this exchange is unknown due to the nondeterminism and cannot be derived from the behavior stored during the record phase.

With this event manipulation and replay approach, we can easily investigate different program executions for one particular set of inputs. Furthermore, all possible permutations of the observed program execution can be derived, if all combinations of event orders at each of the racing receives are tested. For the original approach described in See [GrVo 93] and See [GrVo 96], this would mean that each possible event order has to be generated manually by the user. Since this could represent a major drawback for many parallel programs, we revised this manual approach to perform the event manipulation and artificial replay automatically See [KrVo 98]. In this case, an initial execution is analyzed by a testing tool, that identifies all the race conditions in the program and generates the complete set of follow-up executions automatically without user interaction. Then, the resulting executions and their corresponding outputs represent all possible executions of the target program for the selected input.

The few comparable approaches for solving the completeness problem are described with the Hindsight prototype debugger in See [Spie 93], with the Time Controlled Environment TCE in See [Kraw 00], and with the macrostep approach in See [Kacs 98]. Hindsight targets ordering errors occurring due to missing synchronization operations by exchanging the order in which synchronization operations are carried out See [Spie 93]. The TCE (together with the Race Detection Testing Strategies RDS) tries to modify a program run during subsequent execution by introducing artificial delays or changes to the process allocation scheme See [Kraw 00].

In contrast to these approaches, macrostep debugging as proposed by Kacsuk in See [Kacs 98] can be used to perform complete testing. The main idea is, that the user describes the minimal relevant test sequence for a given problem, and the tool automatically derives each possible execution for this test sequence. This is achieved by exhaustively traversing the execution tree of a program, where the debugger takes a different choice at each race condition during subsequent re-executions of the program See [Kacs 99]. The only limitation of macrostep debugging is, that it is based on the GRAPNEL visual programming language See [Kacs 97]. Although this language uses PVM for low-level communication routines, the visual objects applied in GRAPNEL are much more restrictive than what would be possible with the underlying message passing library. While this is certainly an advantage for the inexperienced user, it also reduces the number of possibilities at race conditions vigorously.

In practice, the number of possible combinations would probably prohibit exhaustive testing See [DaFr 94]. One important reason is, that these possible combinations and the corresponding replay phases may deliver some totally new executions with potentially new race conditions, which of course have to be treated by the testing program in the same way as the original execution. In certain cases it may even be imaginable, that the program delivers new executions endlessly, which prohibits to perform the complete tests of the program. An example is a program with a loop, that requires a specific ordering of events for loop termination.

Therefore, the debugging tool has to provide certain precautions in order to limit the testing cycle to the most probable executions (compare with Definition See Good Enough testing [Bach 98]: See Good Enough testing [Bach 98]). Some of our solutions to this problem are described in detail in Section See Ideas for Test Reduction. Another solution based on our basic event manipulation approach can be found in See [KiCh 97], which is intended to detect occasional bugs (see Section See Definition and Classification of Errors) that occur only sporadically due to racing messages. Their idea is to re-execute the program while imposing a strictly different message ordering than observed. By maximizing the number of reordered pairs of message deliveries, errors that would otherwise occur only sporadically should materialize.

The Probe Effect

Another problem that occurs during debugging nondeterministic parallel programs is the Heisenberg (Uncertainty) Principle applied to software (See [LePa 85], See [Rose 96]), the so-called "Probe effect" (See [McHe 89], See [Ober 98]). The symptoms observed due to this effect are called "Heisenbugs" as described in See [LePa 85] and See [Rons 00]. This problem has already been described for the sequential case in Section See Problems of Sequential Debuggers, and is even more critical for parallel programs, especially in the nondeterministic case. In 1986, Gait formulated this problem as follows:

Probe effect See [Gait 86]

The probe effect is an alteration in the frequency of run-time computational errors observed when delays are introduced into concurrent programs.

This means that the behavior of a program is influenced when something slows down the execution of one or more processes. Obviously, adding a debugging tool represents a certain amount of delay, and that delay may vigorously affect the debuggee's execution. The critical issue is the size of the observation functions which are added to the code during the instrumentation phase. This size determines the monitor overhead, which is on the one hand the time spent in the instrumented code, and on the other hand the amount of memory needed by the observation functions.

Therefore, a primary goal of any debugging tool must be to generate very little overhead. Yet, even the smallest overhead consumes some amount of memory and there will always be an execution delay of the target program, which modifies the time of occurrence of events. Thus, the data that is generated by the observer describes events that occurred later in time, compared to an execution where debugging is turned off. An additional problem to the displacement of observed events is introduced, if the target program reveals nondeterministic behavior. In that case, the order of events at race conditions may be affected, leading to a completely different execution that would have been observed without the debugging tool. In fact, any attempt to debug a program may change that behavior so much, that the collected data will not be adequate See [KrWi 98]. Furthermore, new errors may occur while others may be hidden in program branches not taken during the observed execution, which would otherwise have occurred if the program had been executed without the debugger. In more pathological cases the observed application may be prevented from exhibiting any sensible behavior at all.

These problems have been extensively addressed by Malony, Reed, and Wijshoff in See [Malo 91a], See [MaRe 91], and See [Malo 92]. They distinguish between direct intrusion and indirect intrusion, the latter being very difficult to estimate and correct. In their model, direct intrusion is corrected post-mortem by estimating the amount of tracing and subtracting this overhead from the measured event data. This approach works well for programming models with deterministic communication (e.g. fork-join in See [MaRe 91], and rendezvous in See [Mail 95]). However, it is insufficient for nondeterministic behavior of asynchronous communication as offered by the standard message passing interface MPI, because the latter would require post-mortem changes in the order of traced events. As a consequence, Maillet introduces conservative approximation as described in See [Mail 95], which applies only corrections that do not interfere with the traced event order. Another idea for nondeterministic programs was described by Leu and Schiper See [LeSc 92], and later by Teodorescu and Chassin de Kergommeaux See [TeCh 97]. Both ideas focus on tracing only the minimum amount of information required to allow re-execution of programs. Thus, this initial execution is assumed to be only slightly perturbed and is therefore considered as an approximation of a unperturbed execution See [FaCh 96]. Afterwards the replayed executions are used to apply performance tracing and generate additional timing information.

Other contributions in this domain have been described in See [Will 92], See [SaMa 93], See [Gann 94a], See [Gann 94b] and See [Jaki 95]. In contrast to all these correction algorithms, there also exists a group of monitor overhead prevention algorithms. One example is the "logical clock approach" of Cai and Turner, which uses a virtual time to reflect the real time execution of each process when running without monitoring See [CaTu 94]. During execution, the interprocess communication is controlled by this virtual time rather than the real time in order to avoid the probe effect See [TuCa 93]. A similar approach is described by Zhang et al (see See [LiZh 95] and See [Zhan 98]), who correct the monitor intrusion with artificial delays, so that the relative speed of all the processes remains the same. Similar ideas are proposed in See [Wu 96] and See [Wu 98], where different communication protocols are used to distinguish the monitoring activities from other computational tasks in order to allow automatic removal of monitor intrusion. In See [HoMi 96], Hollingsworth an Miller describe a data collection cost system, that provides users with feedback about the impact of the data collection on the running program. In addition, their system allows to define the level of perturbation that affects the program's execution, which is then controlled by the monitor.

The problems of all these previous approaches to correct the probe effect are, that either they do not treat the possibility of actual changes in event order, or they pretend that changes in event order did not occur in the first place. Yet, even the smallest overhead may lead to such program perturbations and different program behavior. Thus, a complete overhead removal strategy must deal with both types of program perturbation, changes in event timing and event ordering. In See [Kran 99b] and See [Scha 99], we describe a possible solution to this problem that includes necessary clock synchronization algorithms, calculation of corrected event occurrence timings, and automatic event manipulation and replay at places where changes in event ordering are detected.

One of the problems of our approach is the accuracy of the correction algorithms, which prohibits to identify the one and only execution that would occur without the debugging tool attached to the code. This problems may be attributed partially to the achievable accuracy of clock synchronization algorithms, partially to the measurements of communication and overhead. In our case, we implemented a simple solutions based on Lamport's condition that sends must happen before receives, as well as the more sophisticated synchronization method proposed by Maillet et al in See [MaTr 95] and See [Mail 96]. For the overhead measurements, we applied the SKaMPI benchmark developed by Reussner et al See [Reus 98], which provided detailed timings for the complete MPI standard functionality See [Kran 99c].

Another problem is that in many cases there exists not only one possible execution due to the nondeterminism, but several executions with almost equal probability. In the worst case, every possible execution path may occur in reality, which connects the probe effect to the irreproducibility problem and the difficulties of exhaustive testing described above. As a consequence, our correction algorithm derives the set of possible executions with the highest probability, where each of them may most likely occur if debugging is turned off. This issue will be discussed in combination with automatic nondeterminism analysis in Section See Ideas for Test Reduction.

Extended Parallel Testing and Debugging Approach

The problems briefly described in the previous section emphasize the fact, that parallel debugging differs from sequential debugging in many aspects, and parallel debugging tools need to offer additional functionality for handling the increased complexity, the high amount of debugging data and these anomalous effects.

A possible approach to parallel program debugging is suggested by Geist et al in See [Geis 94a], who identify the following three steps:

Try to run the program sequentially as a single process and perform sequential debugging.
  1. Execute the program with two to four processes on one single computer and verify the correctness of the message-passing structure determined by message tags and destinations.
  2. Execute the program with the same two to four processes on more than one computers to detect possible problems introduced by network delays related to synchronization and timing.

These propositions are often referred to in parallel programming literature (e.g. in See [WiAl 99]), and are widely accepted as a possible way to perform parallel program debugging. In fact, most existing debugging solutions apply down-scaling as an inherent functionality, because these debuggers are not prepared to cope with really large programs. In this case, down-scaling means that the program's size is reduced in terms of participating processes and problem size or executed numbers of iterations, respectively. Yet, while this suggested down-scaling of the program size may hopefully be sufficient in many cases, it may be inappropriate in other cases due to the reasons described before. In principle, some errors may occur only with many processes or on actual amounts of data, and it is therefore unavoidable that suitable testing and debugging must also consider the final production scale of the program and its associated data.

For that reason, we propose a different debugging strategy, which has been developed based on the flow-diagram of Figure See Traditional testing and debugging cycle. The basic idea was to integrate the functionality required to treat the obstacles described in Section See Obstacles During Program Debugging. Firstly, it is necessary to obtain the order of events already during the testing cycle. Otherwise, the debugger would suffer from the irreproducibility effect and it would be impossible to guarantee debugging of the same execution that was responsible for the incorrect behavior. Therefore, some instrumentation has to be performed before testing is initiated.

In addition, the probe effect should be avoided as much as possible, which means that the amount of observation overhead should be as small as possible. The smaller this overhead, the lower the probability of program perturbations. This fact can be endorsed, because the functionality required for obtaining event ordering is far less than necessary for debugging, which means, that testing requires only reduced instrumentation. Additionally to adding instrumentation for the record phase to the testing cycle, some replay functionality has to be added to the debugging phase. Only with this replay functionality the same event order as obtained during testing can be enforced during debugging.

Another extension to the diagrams of Figure See Traditional testing and debugging cycle concerns the completeness problem and its solution, event manipulation and artificial replay. The necessary changes have to be made in the testing cycle and before the debugging cycle. This time, another iterative cycle is added after the input has been selected. Then, each possible program execution for that selected input can be tested.

The resulting testing and debugging cycles for nondeterministic parallel programs are sketched in Figure See Testing and debugging cycle for nondeterministic parallel programs. This revised strategy consists of the following three cycles:

Input testing cycle
  • Nondeterminism testing cycle
  • Debugging cycle

The left-most flow-diagram represents the input testing cycle, which is used to iteratively test all possible inputs for the program. As mentioned above, the only difference to the traditional testing cycle is the additional instrumentation phase that is needed for adding the observation code of the record mechanism. Again, the end-marker of the flow-diagram has been omitted deliberately, because of the possible high number of possible inputs that have to be tested, and the difficulties to determine the correct test completion criterion (see Section See The Completeness Problem).

The central cycle of Figure See Testing and debugging cycle for nondeterministic parallel programs is needed to overcome the completeness problem. Its intention is to evaluate the program's behavior and to compute each output y from a set of possible outputs for the input x selected in the input testing cycle. For that reason, the program's execution is observed and the results of nondeterministic events (e.g. the order of incoming messages at race conditions) is recorded. Afterwards, the termination of the program and its results are inspected similar to the testing cycle of Figure See Traditional testing and debugging cycle. Again, the detection of a program failure or incorrect results leads to the initiation of the debugging cycle. If no erroneous behavior is encountered, event manipulation and artificial replay are used to evaluate other program executions for the same selected input. This is iteratively repeated as long as further combinations of racing events can be derived. Since the number of necessary iterations is also unpredictable, the end-marker has been omitted again.

The remaining right-most cycle of Figure See Testing and debugging cycle for nondeterministic parallel programs is the debugging cycle, which almost resembles its equivalent from Figure See Traditional testing and debugging cycle. The main difference is, that this time the program is re-executed under the control of the replay mechanism, because of the irreproducibility effect. Furthermore, another instrumentation phase is needed that differs from the input testing cycle, because the debugger requires much more information than the nondeterminism testing cycle. Please note, that this increase of observation overhead does not introduce any perturbation of the event ordering, because the event order is determined by the record phase of the nondeterminism cycle and is therefore fixed during debugging. Thus, the only effects of the observation overhead are experienced for the event timings, which are delayed depending on the amount of instrumentation code. However, since we are mainly interested in the programs correctness, this increase in timing represents only a minor drawback and is much more important during performance analysis.

Requirements on a Parallel Debugging Tool

Architecture and Functionality of an Arbitrary Debugger

The debugging strategy described in the previous section is one possible approach offered for parallel programming. Indeed, many investigations and research projects have already targeted error detection in parallel programs, and some of the aspects described above have actually been implemented in existing tools and tool prototypes. Yet, a complete debugging tool should be valid for any scale of parallel program, from small-scale applications to massively parallel programs. Especially the latter have not been covered sufficiently, although it is accepted that some means of testing and debugging on the final scale of the applications are indispensable. In the following subsections, we will try to derive the requirements on a parallel debugging tool, that operates with the debugging strategy of Figure See Testing and debugging cycle for nondeterministic parallel programs, and provides support for large-scale parallel programs.

Starting with the architecture, we will again take a glimpse on the sequential counterparts. In this context, Rosenberg defines the architecture of a debugger to consist of See [Rose 96]:

A (graphical) user interface with features for
  • Displaying source code
  • Inspecting (and manipulating) the processes stack, the hardware registers, and the applications variables.
  • Handling breakpoints
  • Controlling the execution of the target program
  • A debugging kernel with functionality for
  • Managing the program's symbol table
  • Evaluating expressions
  • Controlling and executing processes
  • An operating system for accessing
  • Operating system functions
  • The debuggee

A similar architecture is proposed by Bemmerl, who formally defines the key features of an arbitrary debugging tool as follows:

Functionality of a generic debugging system See [Bemm 92]

Any tool for error detection, a so-called debugging system, DS = (PD,DO) is defined by

PD = arbitrary connections of control flow-data flow, and concurrency events with operators AND, OR, and a suitable ordering relation, as well as
  • DO = breakpoints
  • trace state-inspection state-manipulation
Criteria for Developing a Parallel Debugger

The functionality and architecture defined above is certainly a good starting point for the development of a parallel debugging tool. In addition, the following two items have to be investigated by developers of parallel debuggers:

Heterogeneity and portability: supporting different programming models and hardware architectures (possibly in the same environment)
  • Usability: handling the characteristics of parallel programs like
  • Increased complexity
  • Amount of debugging data
  • Additional anomalous effects

One of the most important criteria for any tool developer is a definition of target systems, which in our case are high performance supercomputers. Since there exist different architectures and possibilities of programming them (as described in Section See Supercomputing and Section See Parallel Programming), a wide variety of strategies and tools with different functionality have already been proposed, leading to some nice tool prototypes or even commercial products. Furthermore, every hardware vendor provides its own native support for debugging on their respective architectures, like for example nCUBE with the ndb Debugger, CONVEX with its CXdb, or Intel with the Paragon Debugger ipd on its iPSC/860 architectures.

This diversity is an issue of some on-going projects in this domain, which address the problems of portability and heterogeneity in parallel debuggers and performance analysis tools See [Panc 95]. Corresponding solutions are usually based on the fact, that all vendor-specific tools provide similar functionality under a different user interface. Consequently, it seems useful to develop portable debuggers which could be used on different hardware platforms, thus dismissing the need to learn several debugging tools when switching between different platforms.

Some examples in this context are P2D2 See [Hood 96], Panorama See [MaBe 93], PDBG See [Cunh 98b], PDT See [Clem 95b], and DCBD See [Wang 99], as well as the best-known commercial debugging tool Totalview See [Dolp 98]. These tools offer debugging support on several multiprocessor architectures and networks of workstations. An important characteristic of almost all these systems is, that every approach applies several instances of an existing sequential debugger in order to perform the debugging task on the participating processes. Thus, their idea is to offer a graphical interface for managing the features of these "base debuggers" See [MaBe 93], which are in most cases variants of the GNU debugger gdb See [Stal 93]. In these cases, the parallel debuggers adopt a client-server architecture to provide a uniform interface for different platforms, communication libraries and programming models See [ChHo 94].

Although this portability may be useful in many cases, it introduces some obstacles especially on large scale computing systems like massively parallel machines or heterogeneous clusters of supercomputers, the so-called metacomputers. In fact, one of the biggest problems experienced during debugging of supercomputer applications is connected to the amount of data that has to be analyzed See [Frum 98]. Firstly, these programs tend to be long-lasting, from some hours to several days or even more. As a consequence, many state changes occur that have to be observed, transferred, stored, and processed by the debugging tool. In the worst case, debugging a supercomputer may require another supercomputer to perform the analysis task. Secondly, the execution time and the large number of participating processes leads to enormous interprocess relations, which cannot be comprehended by the user. Thirdly, a great amount of debugging activities have to be performed equally for different processes and repeated iterations of the target application.

Thus, the basic requirements heterogeneity and usability have to be supported by methods for managing parallel programs in order to increase the lucidity and applicability. For that reason, we identified the following optional features:

Abstraction
  • Automatization

Abstraction means, that the amount of debugging data presented to the user is reduced to a suitable minimum. Besides the regular meaning of forming some kind of mental generalization, this also includes related techniques like transformation, reduction, extraction, isolation, and filtering. Error detection can only be applied successfully, if the user can comprehend the observed activities See [Kime 95]. This means, that a debugging tool should offer capabilities allowing the programmer to focus the scope of attention to limited regions within the program See [BeSt 94]. Furthermore, it may be equally important to permit different levels of abstraction in order to select the most appropriate level for the conducted debugging operations, so that only the data relevant to the user is presented See [From 95c]. Sometimes it may be useful to see the execution of the program in a coarse graphical overview, while at other times it may be beneficial to display the program on the level of machine instructions. Thus, a most important feature is to provide connections for dynamically switching between different levels of abstraction. It's the nature of debugging, that one doesn't know just what to look for in advance, so it is certainly important to provide access to the vast amount of debugging data while at the same time offering functionality for managing these data.

Besides abstraction, a lot of debugging efforts can be supported by automatization if repeated debugging activities are performed without user interaction. Some examples are the automatic detection of program failures, abstraction of detectable, for the user probably invisible properties, and iterative testing for the completeness problem. For example, after reading the core file, users can be directed to the place in the code where the program exited. Or, race conditions and corresponding memory access patterns or racing messages can be evaluated automatically. Some sophisticated methods consider statistical testing See [Walt 95] and automatic checking of dynamic properties using regular grammars or finite automata See [Baba 95].

Of course, due to the nature of debugging and its strong relation to the user's knowledge, a complete automatization of the debugging process is certainly impossible. For instance, the correction of a detected anomaly must always be carried out by the user. Therefore, the goal of powerful debugging tools must be to perform as many debugging activities automatically or semiautomatically with the main directions given by the user.

Based on these requirements, we may now try to classify existing work and identify the drawbacks of previous approaches. Only a strategy or tool supporting these characteristics may then be applicable to full-scale applications, and may allow to perform debugging activities impossible with existing solutions. In addition, it may also improve the error detection task on smaller-scale application sizes, especially if it can be combined with other tools in this area.

Parallel Program Analysis and Visualization

This section tries to offer a brief overview of related work in the area of parallel program analysis and visualization. It starts with an overview of existing collaborative, standardization approaches for parallel tools and tool developers. Afterwards, a model of a generic program analysis tool will be introduced, which is used to classify existing tools in the domain error detection and performance tuning. Such surveys are interesting for potential users as well as tool developers. At the same time, it is probably impossible to capture all existing techniques and approaches, or even to distinguish between pure theoretical achievements, actual tool prototypes, and tools available for real end users. Some overview papers with similar content in this domain are in chronological order: See [McHe 89], See [PaNe 93], See [Gu 94], See [Hond 95], See [TsYa 95], See [Fran 96], and See [Brow 99].

Collaborative and Standardization Efforts to Debugging

In the following sections, we describe a lot of different approaches and strategies for debugging parallel programs, and a variety of tools and software environments that have been developed with the goal to satisfy the user's need during error detection and performance tuning. As a starting point, we will briefly introduce some standardization efforts with the goal of leading to substantial improvements for both, tool developers and their consumers. The most important examples of concentrated efforts are

the OMIS project,
  • the Ptools consortium,
  • and the HPD forum.

The first attempt, which will also be mentioned in Section See Monitoring Tools, is the Online Monitoring Interface Specification OMIS12 See [LuWi 97]. This effort was initiated in 1996 by Ludwig and Wismüller at the LRR (Lehrstuhl für Rechnertechnik und Rechnerorganisaton) of the Technical University Munich together with Sunderam from Emory University, Atlanta, one of the creators of the Parallel Virtual Machine PVM. Since then, the authors of OMIS were able to dissipate it to a broad community, and several compliant tools have already matured at various research institutions around the world. Proposals for extensions to the standard, are coordinated by the OMIS group, while software products being developed in the framework of this project will be released under the GNU public license conditions to deliver a maximum profit for the user community.

In contrast to the OMIS standard, the Parallel Tools Consortium Ptools13, established in 1993, seeks to increase the cooperation between representatives from federal, industrial, and academic sectors in order to address the factors that inhibit tool usage and tool usability on parallel computers See [Panc 96b]. Their self-imposed mission is to take a leadership role in defining, developing, and promoting parallel tools that meet the specific requirements of users who develop scalable applications on a variety of platforms. This is attempted with two activities: On the one hand, Ptools organizes meetings for members of the above mentioned organizations in order to provide a forum, where tool developers interact directly with potential users in order to help formalizing and prioritizing user requirements for tool support. On the other hand, Ptools promotes tool portability and consistency by sponsoring projects to develop general tools that can execute reliably across a broad range of parallel and clustered computing systems. This is supported by the Ptools infrastructure, that includes channels for acquiring input from the high-performance computing community, working groups involving tool users, researchers and developers, and mechanisms for formalizing and distributing tool components for adoption by the industry. Although Ptools is operated from the United States, its membership is open to all organizations and persons interested in parallel tools without any membership fees.

One of the activities sponsored by the Ptools consortium is the High Performance Debugging Forum HPDF14, which was officially established in 1997. Its goal is the definition of a useful and appropriate set of standards relevant to debugging tools for high-performance computing See [Brow 97]. This base level standard has to meet the sometimes conflicting constraints of being useful to users, realistically implementable by developers, and architecturally independent across multiple platforms. In order to meet the criterion for timeliness, the HPD forum restricted itself to two years, where the first year should be used to develop a standard that could then be implemented during the second year. And in fact, Version 1 of the HPD standard was released in 1998. However, while the standard was developing fast and is in some parts still being discussed, actual implementations of its functionality are still not available by now. However, there are notably some organizations working on it, and it will be interesting to see, whether their efforts can lead to substantial benefits for the group of debugging users.

Another interesting approach with some nice potential, although actually not termed as a standard, is collaborative debugging See [DoMu 97]. Its idea is to create distributed debugging communities around the globe based on the World Wide Web technology. These groups of programmers will then describe and fix bugs in collaboration, with probably high potential benefits for the quality of the resulting software. A first solution is the Internet Software Visualization Laboratory (ISVL), which uses a client/server architecture to deliver visualizations of program behavior on any Java-enable Web browser See [DoMu 97]. Another approach is offered by Kansas, a multiuser virtual world, which allows people to experiment with building programs and collaboratively hunting bugs See [Smit 97].

Models of Generic Program Analysis Tool

After identifying standardization and collaborative efforts to program debugging, we will now characterize tools for debugging and performance analysis. The latter are included in this section due to the many similarities of tools in these two areas. The reason is, that both analysis activities rely on program state information and some data are equally useful for error detection and tuning (see Section See Visualization of the Detected Behavioral Characteristics for more details and suggestions about this similarity). In fact, some existing tools actually try to fulfill the users' requirements for combined debugging and performance analysis tools See [AgCh 96].

Any program analysis tool always consists of two principal items, an observation component and an analysis component, often referred to as monitoring and visualization component, respectively See [Hond 95]. An even more detailed model is offered by Mohr et al (See [Mohr 92], See [Klar 95]), who identify six distinct layers of an arbitrary analysis tool, which are hierarchically established on top of the analysis target. These layers are See [Klar 95]:

Monitor oriented layer: the monitor
  • Analysis oriented layers:
  • Data access layer
  • Selection layer
  • Tool support layer
  • Tool layer
  • Application oriented layers: application support

The bottom layer of this model represents the connection between the analysis tools and the target system. It denotes the observation component in See [Hond 95], and is responsible for the data collection. Its main tasks are to record the dynamic course of state-changes that constitute the execution of the actually observed program. Obviously, there is a strong connection between the observer and the observed program, as well as the underlying system.

The next four layers above the monitor oriented layer are called the analysis oriented layers. The data access layer provides access functionality for obtaining the observation data from the monitor layer. In order to reduce this exhaustive data and relate it to corresponding analysis tools, the selection layer provides functionality for filtering and compounding the data. In the tool support layer, this data can be accessed by generic functions without any semantics and utility functions like statistics or other operations. The actual semantic oriented analysis tasks are provided by the tool layer, which offers a wide variety of activities: trace validation, statistical evaluation, visualization of execution behavior, animation, as well as processing the data for other tools on the next layer, like modelling tools, debugging and performance analysis tools, and tools for load balancing. This top-most layer of the hierarchy, the application oriented layer, constitutes the interface to the user, and is therefore the connecting link between the analysis tool and its client.

In See [Hond 95], the four analysis oriented layers and the application oriented layers are combined in the analysis or visualization component. Its most important property is the way of displaying the program to the user. In general, we may distinguish textual representation (e.g. source code) or graphical representation (with many different ways to visualize the monitored data). As mentioned before, the advantages of the latter can only be acquired if a combination of both approaches is included through some kind of source code reference.

Tools can also be divided by whether analysis and visualization takes place simultaneously with the program execution or after it, which establishes the following two groups:

On-line analysis tools
  • Post-mortem analysis tools

In the on-line case, the data of the monitor is transported to the analysis component, while the target program is still executing. In the post-mortem case, the program is executed and analysis data is stored in tracefiles, which are investigated only after the program's execution terminated. Please note, while post-mortem allows to remove the analysis components (the five upper layers in Mohr's approach See [Mohr 92]) from the executing program, it still requires to include at least the monitoring component, because this component can only retrieve state information when the program is actually running.

Both approaches have different advantages, with on-line being the primary choice for standard error detection tools. The reason is, that on-line analysis tools allow a much more flexible approach to program state inspection than post-mortem, because the user can determine what to inspect at each point during the execution. On the other hand, the post-mortem approach allows some analysis techniques, that can not be performed with on-line tools, primarily because on-line tools can only rely on the past and present of a program's execution, while post-mortem approach may analyze the program from its beginning to its termination. An example, where post-mortem approaches are superior to on-line approaches is race condition detection, because on-line methods may not be able to identify the complete set of race conditions at any point during the execution.

Monitoring Tools

As mentioned above, every tool must contain at least a small monitoring component to observe the program's execution and gather state-information for the follow-up analysis phases. Therefore, most monitoring tools are an integral part of the analysis tool, especially in the on-line case, and cannot be examined in isolation. In fact, the monitor part is often invisible to the user, and therefore indistinguishable as a stand-alone component. Examples for this statement are most traditional sequential debuggers and the parallel debuggers, that are composed from sequential debuggers. For instance, when using the GNU debugger gdb See [Stal 93], which is often used as the lowest level of parallel debuggers like in P2D2 See [Hood 96], PDBG See [Cunh 98b], and PDT See [Clem 95b], the activities of the monitor are completely integrated in the debugging environment and cannot be evaluated in isolation. A similar model is offered by one of the few commercial parallel debuggers Totalview See [Dolp 98].

However, there are also some other approaches with explicit monitoring components. Based on their monitoring method, these approaches can be divided into the following three classes (see See [GrVo 93], See [Klar 93], and See [Grab 99]):

Hardware monitoring
  • Hybrid monitoring
  • Software monitoring

Hardware monitors represent an extension of the target system on the hardware architecture level. Thus, their usage is generally restricted to vendors or hardware experts, and requires a lot of expertise and knowledge during installation and operation. The biggest problem of these monitors is, that even the most advanced tools are restricted to a few observation points and provide only low-level data as well as a low-level interface. In addition, such monitors offer little portability because they address only one particular hardware architecture. Yet, their advantage is the very little overhead that is produced during program observations, which denotes a lot of benefits in terms of the probe effect.

One of the few widely-known hardware monitors is ZM4 (german: "Zählmonitor 4") See [Hofm 87], the fourth generation of hardware monitors developed at the University Erlangen-Nürnberg See [Klar 95]. ZM4 tries to overcome the portability problem by offering a variety of interfaces for several target systems, from Transputers via the networks of personal computers to the MEMSY system (Modular Expandable Multiprocessor System), which is characterized by its extensive support for measurements See [Hofm 93]. Additionally, the interpretation of ZM4's results is supported by the post-mortem software environment SIMPLE (Source related Integrated Multiprocessor and -computer Performance evaluation, visualization and modeLing Environment) See [Daup 92], which tries to relate the observed data with the original source code in order to compensate for the low-level data obtained by the monitor.

Another kind of hardware monitors are so-called hardware counters, which are directly included in the microprocessor's hardware. These counters can be incremented whenever predefined events occur, and thus may be used to obtain heuristics about process activities. At present, there are two efforts worth mentioning in this domain. One is PCL (Performance Counter Library) See [BeZi 98] developed at the Research Center Jülich, the other is PerfAPI or PAPI (Performance Application Programming Interface) See [Mucc 99] developed in the frame of the Parallel Tools Consortium (see Section See Collaborative and Standardization Efforts to Debugging). An example of using these hardware performance counters is described in See [Rabe 99], where the PCL library was integrated on the SGI Cray T3E architecture to observe the frequency of calls to routines of the standard Message Passing Interface library.

The second monitoring technique, hybrid monitoring, is a mixture between software monitoring and hardware monitoring. Mohr characterized hybrid monitoring as follows See [Mohr 93]: the recognition of events is done in software, but the recording and time stamping is done in hardware. It is still rather low-level in order to generate as little overhead as possible, while at the same time supports software probes and other functional characteristics from software monitors. Its main advantages compared to pure hardware monitors are, that it is much more flexible and portable, while at the same time providing an improved interface to the user.

An example hybrid monitor has been developed by Malony and Reed for the iPSC/2-Hypercube system See [MaRe 90]. Their tool operates on a 5-bit parallel interface of the hardware architecture, which connects each of the processors to one common connector. With this approach, the 32 nodes hypercube is divided into two 16 nodes hypercubes, where one of them performs the computational task, while the other performs the observation See [Klar 95].

A similar approach has been implemented in our EMU (Event Monitoring Utility) See [Kran 96a], which offers a monitoring cube strategy for hypercube multiprocessor architectures (such as the nCUBE 2 multiprocessor and the SGI Origin 2000). Again, the available hypercube is divided into two smaller hypercubes with equal dimension, where one performs the computation while it is being observed by the other cubes. In principle, the functionality of the monitor is divided into two parts, with a very small part for accessing the process state on the computational nodes and a remaining larger part with the functionality for processing and storing these data on the monitoring nodes. In See [Kran 96a] we describe the advantages of this hybrid monitoring approach compared to software monitoring approaches in terms of monitor overhead. Other hybrid monitoring approaches are described in See [Gorl 91], See [HaWy 90], See [Lump 90], See [Malo 89], and See [Mari 90].

The last group of monitoring tools is probably the most widely used, the software monitoring approaches. As a consequence, there exists a wide range of different techniques for different hardware architectures, e.g. See [Gris 91], See [Joyc 87], See [LeRo 85], See [LePa 85], See [Mail 95], See [Mill 86], and See [Snod 88]. Most of these tools are solely intended for one specific parallel system, prohibiting its usage on other operating systems and hardware architectures. Instead, it seems that every tool developer restarts with another implementation of a monitoring component, because each system has its own design goals and philosophy for solving one particular class of problems. Yet, this is a minor discrepancy which has been noted before, because the basic characteristics of all these approaches are comparable.

In See [Mohr 93] Mohr tries to emphasize this fact and describes a possible way for a target-independent way of monitoring and analyzing parallel systems. The latest results of these ideas have been implemented in EARL, the Event Analysis and Recognition Language, which allows to construct new trace analysis tools by writing scripts in the EARL language See [WoMo 98]. Similar goals are pursued by several standardization approaches for the monitoring layer of debugging and performance analysis tools. One of the earlier representatives in this category is PICL, the Portable Instrumented Communication Library developed at the Oak Ridge National Laboratory See [Geis 90]. One of the central ideas of PICL was to offer the user a communication library that is available on different parallel computer architectures and offers the same kind of communication operations. In addition, PICL allows to generate traces for post-mortem analysis, which report the calls to the PICL functions. Consequently, an analysis tool like Paragraph See [HeEt 91a] can be used to evaluate the contents of the PICL traces and derive valuable information about the program's execution.

A comparable solution was developed by Reed at al within the Pablo performance tuning environment See [Reed 93], which consists of several components for instrumenting, tracing, and analyzing parallel MPI programs. The trace file format of this toolset is the Pablo Self-Defining Data Format (SDDF) See [Aydt 92], which is a data description language for specifying both data record structures and data record instances. In a certain sense it is a data meta-format, because it can describe general data records as opposed to a predefined set of records. Originally developed to link the Pablo instrumentation software to the other analysis components of pablo, like Autopilot See [Ribl 98], Virtue See [Reed 95], and svPablo See [DeRo 98], it has proven to be even more general and extensible than first envisioned by its authors. By providing features like compactness, portability, generality, and extensibility, it is nowadays used by several other monitoring and analysis tools.

For the sake of completeness, we also require one remark concerning these three tracefile descriptions, EARL, PICL, and SDDF. Each of this formats is placed on ASCII representations, which offers a high degree of flexibility and readability. In fact, in most cases, users may even be able to understand the contents of the traces without any analysis tool. Yet, this feature includes also the drawback, that tracefile size may grow easily leading to enormous requirements of trace storage. For that reason, most comparable approaches try to establish binary representations, which allow to store a program's behavioral data more efficiently. In addition, there are approaches that try to apply compression mechanisms to traces as described in See [Rons 95]. In See [Frum 98] an interesting idea is described, which compresses the trace data based on information-theoretic techniques. Comparable statistical methods are followed in See [Reed 97] and See [Yan 98].

The problem of tracefile size is also less critical in relation to on-line monitoring tools, e.g. the Online Monitoring Interface Standard OMIS See [Ludw 96] by Ludwig et al, which has already been described in Section See Collaborative and Standardization Efforts to Debugging and is currently available as Version 2.0 See [LuWi 97]. With this interface description, various types of run-time tools for parallel and distributed systems and the systems themselves can be interconnected. By using a monitoring system with a standardized interface, arbitrary target systems could be supported with the same powerful set of tools. Furthermore, tool developers can easily attach new tools to already existing implementations of OMIS compliant monitoring (OCM) systems on different target architectures, which in term can concurrently serve several compliant tools, thus offering a means for tool interoperability See [Wism 98a].

Another interesting idea related to monitoring tools is the integration of record&replay mechanisms. Certainly, the record phase is performed by a monitoring system which retrieves all the necessary information to guarantee an equivalent execution during the follow-up replay phases. Of course, this monitor requires only the most basic features and needs not generate complete process state information for the analysis tasks. Instead, this process state information may be generated during the replay phases, when the amount of monitor overhead is harmless. Therefore, it seems useful to include the replay functionality directly in the monitoring code, so that for each record function a corresponding replay function is provided.

In order to evaluate the benefits of this approach, we have reworked our original approach to monitoring - EMU (Event Monitoring Utility) - and its corresponding replay tool PARASIT (PARAllel SImulation Tool) into the NOndeterministic Program Evaluator NOPE. The advantages of this approach are manifold: Firstly, it improves the usability, because only one interface is provided and both the record and the replay component are steered with similar commands. Secondly, the code is more compact and additional instrumentation functions can be added for the record and replay phases simultaneously. Thirdly, and probably most important, it is very easy to switch between the monitor function and the replay function, which is especially important when event manipulation has been carried out.

In addition, we have studied the documentation of OMIS, and we believe that a similar approach could be applied by simply using the intrinsic steering possibilities of such a monitoring tool. The resulting OMIS compliant record&replay tool could profit from all the advantages of every OMIS compliant tool, while at the same time provides the same features as our own record&replay tool NOPE.

Software Visualization

After this brief discussion of the monitoring component, we will now present an overview of existing program analysis techniques. The simplest method of program state analysis is the print statement, that is added to the code. In fact, it is an accepted reality that print statements are the most widely used debugging tool for software developers See [Lieb 97]. The reasons for this situation are diverse, and may be attributed to the simplicity of adding print statements to code compared to the learning process for any debugging tool, as well as the inadequacy of many error detection and performance tuning strategies and their corresponding user interfaces. This dilemma and the reasons for refusal of users to accept the products of tool developers are an focal point for the several studies of Francioni and Pancake et al as described in See [Fran 96], See [Panc 94], See [Panc 95] and See [Panc 99]. The suggested solution is a more user-oriented approach in designing the human computer interface of upcoming tools in this area.

A similar point of criticism is targeted at the standard debugging abstraction, the source code. In fact, most of the current error detection approaches confront the user with the original lines of code, and offer additional functionality for steering the program through this code and inspecting and manipulating the states of the processes based on the original code. However, as we have briefly discussed in Chapter See Area of Application: Parallel Computing, parallel computing is an area of high complexity, where programs consist of multiple concurrently executing and interacting tasks which possibly handle huge amounts of data over long periods of time. Yet, it has been noted before, that textual representations are inadequate to express the complexity established in such programs See [Panc 96a]. Therefore it seems useful to investigate techniques for debugging and performance tuning in the area of software visualization See [KrSt 93].

Visualization is widely accepted as a powerful technique to manage various stages of the software life cycle, especially to improve specification, design and development, programming, and program analysis See [ZhMa 94]. Usually, two different fields of visualization are distinguished, visual programming and program visualization, which are introduced with the following two definitions:

Visual programming See [Pric 98]

Visual programming is the use of "visual" techniques to specify a program.

Software visualization See [Baec 97]

Software visualization is the systematic and imaginative use of the technology of interactive computer graphics and the disciplines of graphic design, typography, color, cinematography, animation, and sound design to enhance the comprehension of algorithms and computer programs.

The main difference between these two areas is their goal. While visual programming seeks to make program specification easier by using a graphical (or "visual") notation, software visualization seeks to improve the understanding of programs and algorithms using various techniques. Yet, there are also some similarities which arise from the visual techniques being applied. For example, many visual programming tools can be cited that include both, a mechanism for specifying the program and a mechanism for investigating the actual behavior of the program (e.g. HeNCE See [Begu 93b]). Often, different kinds of representations are used for these two tasks, but there exists some notable exceptions which offer the same visual representation during program construction as well as during error detection and tuning (e.g. Visper See [Stan 98]).

The similarity between both fields, software visualization and visual programming, is also recognizable from their main goal. In both cases visualization is used in the sense of forming a mental image of something that can be aided by graphical, auditory, and other sensory modalities. These means are utilized to aid the reasoning and understanding of programs and their executions, respectively See [Zhan 99]. This is underlined by Miller, who believes that visualization should guide, not rationalize See [Mill 93]. In this context, to guide means that the visualization supports the user in discovering things, that were unknown before. In contrast, to rationalize means that things are presented, that are already well-known.

Since we are mainly interested in software visualization, we will take a closer look at the classification of Price, Baecker, and Small See [Pric 98]. They distinguish the following different types of techniques, which are derived from the distinction between programs and algorithms, static and dynamic representation, and code and data values See [Pric 98]:

Program visualization
  • Static code visualization
  • Static data visualization
  • Data animation
  • Code animation
  • Algorithm visualization
  • Static algorithm visualization
  • Algorithm animation

The main distinction is between programs and algorithms, where program visualization is used to display actual program code or data structures, while algorithm visualization operates on higher-level descriptions of software. Both kinds of visualization support either representations of code segments or data values, and are performed either statically, where the display contains a snapshot of the data, or dynamically, where the display is changed over a period of time and portrays data processing and the occurrence of state changes.

Visualization Tools for Program Analysis

According to this simple classification, a lot of different visualization tools using various techniques for program analysis have been produced. Much of this software is specialized or experimental, focusing only on certain applications and architectures. Only a minority of work has addressed the issues of reusability and configurability for analysis tools See [MaWi 93]. In addition, there is a high degree of discrepancy for the acceptance of visualization when comparing performance analysis with testing and debugging. While the former already relies heavily on program visualization, the latter still concentrates on textual representations.

An interesting facet related to this domain is interactive auralization of code, which may be applied to inform programmers about key control and dataflow events See [Baec 97]. Its idea is to simultaneously interact with the graphical interface, while sound is generated by the program's state changes. Although this may seem a little bit odd at first glimpse, there are actually some implementation for program analysis with auralization (e.g. See [BrHe 91] and See [FrJa 93]).

There are many different papers introducing a wide variety of taxonomies and classifications for program visualization and corresponding tools. In See [Pric 92] and See [Pric 93], a tree-based taxonomy is proposed that includes categories for input, output, and processing, along with some metrics. The classification of See [RoCo 92] differentiates and categorizes the elements and methods of program visualization based on the mapping technique between the code and the graphical representations. Yet another approach is offered by See [KrSt 93] which deals with the transformation of the abstract execution of a program into its graphical representation. Contrary to these classifications, which try do describe how to present program abstractions to the user, the paper See [Ouds 96] deals with the question of what to present to the user in terms of program abstraction and data representations. In See [BaEi 96], the authors address specifically the problem of visualizing large software systems, while in See [ReRi 99] the suitability of visualization systems for program analysis in computational grids is discussed. Other overview papers are See [MaRe 89], See [PaUt 89], See [Appe 90], See [HeEt 91b], See [Begu 93a], See [Heat 95a], See [Heat 95b], See [Hyrs 95], and See [Zhan 99]. In addition, several books covering this topic have been released, like See [SiKo 90], See [EaZh 96], and most recently See [Stas 98].

Another remark in this context concerns complete testing and debugging environments as well as programming environments that include debugging as an integral part of their program development strategy. The usefulness of such a combined approach has been described in See [ApMc 88]. In See [Lour 97], Lourenco et al describe the combination of the distributed debugging engine DDBG See [Cunh 98a], which is the predecessor to PDBG See [Cunh 98b] that has already been mentioned above, with the structural testing tool STEPS See [KrWi 96]. This combined approach allows to locate errors through symbolic source code analysis and controlled execution and inspection of the target program. A comparable solution is offered by the TOOL-SET, which aims to provide and integrated set of interactive and automatic tools for analysis of PVM programs on networks of workstation See [Wism 97b].

Some examples for complete programming environments with integrated debugging capabilities are the Annai programming environment See [Clem 95a], which has already been mentioned before with its PDT debugger, the Graphical Application Development Environment GRADE See [Kacs 97], the EDPEPPS environment See [Zeme 99], and the TRAPPER environment See [AhBa 99]. The basic idea of GRADE is to support high-level graphical programming with the graphical language GRAPNEL for the PVM environment. Therefore, it offers modules for constructing, executing, debugging, monitoring, and visualizing message-passing programs, which can be accessed by the user within an integrated graphical user interface. As a debugging tool, GRADE integrates the distributed debugging engine DDBG See [Cunh 98a], which is the predecessor to PDBG See [Cunh 98b] that has already been mentioned above. Furthermore, GRADE includes Tape/PVM and PROVE to support performance monitoring and visualization. The EDPEPPS environment is in many ways comparable to GRADE See [Zeme 99]. Their key differences are, that GRADE applies its own pure graphical language, which does not model exactly all the semantics of PVM programs, while EDPEPPS combines graphical and textual representations. In addition to the tools above, the TRAPPER environment supports both, PVM and MPI, and runs on Windows NT See [AhBa 99].

The same situation is applicable for the different types of applications and the tools incorporating one or several of them. One of the best known approaches for modelling parallel programs are Petri Nets, which are applied on different levels of program abstraction in several different characteristics (See [Vaut 87], See [Fers 92]). Another kind of graph is the dependence graph, which explicitly displays the control and data dependencies for the operations executed by a program (See [DaKe 82], See [Begu 93a]). Additional displays are the program graph (a static call graph), the Gantt chart (process state changes) and the critical path (responsible for program execution time). Furthermore there exist data access displays, histogram-bargraphs, xy-contour plots, memory access patterns, data-machine-processor views and a lot more See [Jaki 95]. Some remarkable approaches are three-dimensional representations as in See [ViSa 94] and parallel program analysis with Virtual Reality techniques See [Reed 95].

A complete list of visualization techniques as well as its corresponding tools cannot possibly be covered in this work (and probably in no other paper as well). Thus, we concentrated on some representative examples instead of trying to achieve a higher level of completeness. Especially one technique is interesting, because it is used in many different tools under various aspects. This is the space-time diagram, which was introduced by Leslie Lamport to express the order of events in distributed systems See [Lamp 78] and is therefore sometimes called Lamport diagram. Other synonyms for the same representation are Feynman display See [Hond 95] or Hasse diagram See [Klar 95]. However, the term Feynman display may be misleading, because it originated from quantum mechanics and to our knowledge, Feynman never used it referring to computer science (see See [Feyn 96]). Similar representations with comparable semantics are also used in other domains, e.g. the message flow graphs as a means for performing deadlock analysis and other optimizations at compile time See [LaLe 95].

Such a space-time diagram is used in many tools for parallel debugging and performance tuning (See [McHe 89], See [SiKo 90], See [RoWr 93], See [Brow 95]). One of the best-known examples is the already mentioned Paragraph (See [HeEt 91a], See [Heat 93]), which includes the space-time diagram and many other graphical displays. Following its success, many computer vendors announced a similar set of performance visualization tools, e.g. ParAide for Intel's Paragon, MPP Apprentice for Cray's T-3D, Prism for TMC's CM-5 See [Alle 91], PV for IBM's SP2, and CXTrace for Convex's SPP1 See [YaSa 96]. In ParaGraph the execution thread for each process is represented by a horizontal line, drawn from left to right, which changes color to indicate the state of the process (active, idle, blocked). Message operations are depicted by lines which connect the two communicating processes. The display reveals processors waiting for a blocking communication to complete.

Another tool that incorporates a space-time display is the AIMS toolkit (See [Yan 95], See [Yan 98]). AIMS is based on source-code-level instrumentation and enhances portability across platforms from different hardware vendors. One of the many graphical views in AIMS is the view kernel VK. It displays the dynamics of program execution using an animated view. This view presents the information about the execution of programming constructs. The constructs are source code statements, whose execution are displayed as graphical objects.

Similar to AIMS is XPVM See [Geis 94b], a graphical user interface for analyzing PVM programs. The space-time view in XPVM shows computing time, communication overhead and waiting time. The animation within XPVM can be done on-line or post-mortem. A similar functionality compared to XPVM but for MPI programs is offered by the tools Upshot See [HeLu 91] and its successor Jumpshot See [Zaki 99]. In Falcon See [GuEi 94] a variation of the space-time display animates thread creation and completion. When a thread is created a horizontal bar is added in the display. This bar is connected with the parent thread with a vertical line. The color of the horizontal bar represents the state of the thread. When a thread finishes its execution, the right end of the horizontal bar is connected to the parent thread with a vertical line.

The concurrency map See [Ston 88a] is an alternative representation to the time-line diagrams, emphasizing potential concurrency at task level instead of interprocess dependencies. A variation of the basic diagram comes from the causality graph See [Zern 92] or from the Animation Choreographer of the PARADE system See [KrSt 94]. Other examples with similar views include Paradyn (See [Mill 94] and See [Mill 95a]) with the time histogram, PVaniM See [Topo 95] with the Lamport-timeline view, Trapper See [Born 95] with another state-time animation, and OrWell with the communication chart See [WeKu 98]. Instead of displaying only discrete events, many representations choose to display process states as time-bars, where possibly different colors indicate different activities of the process, e.g. in VAMPIR See [Nage 96] or Paje See [ChSt 00a]. The latter is also noteworthy due to its capabilities of dealing with dynamically created threads of various life-times during program execution See [ChSt 00b]. Other related tools may choose a more general form of the directed graph representations, e.g. the Computationally Oriented Display Environment CODE and the Heterogeneous Network Computing Environment Hence See [Brow 95].

Additional Levels of Abstraction

From all the examples provided above it seems appropriate to use such a space-time diagram for displaying concurrently executing and interacting processes. Furthermore, we believe, that it is not only most appropriate for performance analysis, but it is also suitable for debugging large scale application. Therefore, we will introduce a formal model - the event graph - in the next chapter, that can easily be displayed as a space-time diagram and provides a useful program abstraction for error location and execution steering within a debugging tool.

Although space-time diagrams are obviously a good starting point for any kind of program analysis tool, it may be appropriate to combine it with representations on other levels of abstraction by offering a scalable approach. Especially the number of nodes and edges may easily be too demanding for conventional workstation computational resources See [Kime 95]. An arbitrary, scalable visualization tool would then be adaptable to the user's needs in order to increase its efficient usage. One example is VISTOP, the VISualization TOol for Parallel systems, as part of the TOPSYS project See [BeBr 93]. VISTOP offers scalability by providing selection mechanisms for "interesting" objects and "interesting" phases of the program. By limiting the display to the certain objects and key scenes of the program, the user's distraction due to uninteresting parts is decreased. A simple example of additional abstraction is the process grid feature of P2D2 See [ChHo 94], which provides a graphical overview of all participating processes. A similar display is available in the Mantis debugger See [LuCu 96] with the status window, where activities of the processes are displayed with colors and serve as an overview of program execution. Both of these approaches improve the program analysis task because they offer an overview of the processes state.

The benefits of abstraction mechanisms are also very useful in many other domains, e.g. for visual programming See [Wirt 94], when graphs and icons on different levels of abstraction need to be displayed with a suitable layout, or for performance tuning, when bottlenecks are extracted as a result of abstraction See [HaMa 94]. In See [Kess 96], a pattern-driven approach for automatic parallelization is described, that exploits frequently-occurring computations and programming concepts based on a programs source code.

The IPS-2 parallel program measurement system of Miller et al provides a hierarchical model of parallel and distributed programs with different views of performance data on multiple levels of abstraction See [Mill 90a]. Similar strategies for performance analysis are proposed by Yan and Sarukkai See [YaSa 96] and independently by Kohn and Williams See [KoWi 93]. The idea of Yan and Sarukkai is to provide key strategies for possible time reduction, where the appropriate strategy is selected based an traversing through a hierarchical selection tree. The nodes of the tree identify different characteristics of the program, while the leaves indicate the actual reason for the problem. As soon as the actual reason is detected, an appropriate strategy is provided to solve that particular problem See [YaSa 96]. The ATExpert system of Kohn and Williams is even more enhanced, because it (textually) explains the cause of performance problems in detail and proposes concrete actions to actually solve them See [KoWi 93].

Similar ideas are provided in TraceView See [Malo 91b], where the user can define views with an appropriate filtering of the data that are visualized. Another idea is to introduce different levels of event abstraction. Cai et al describe how to abstract events on lower levels of abstraction (low level events) into events on higher levels of abstraction (high level events), and display only high level events See [Cai 93]. Bates distinguishes between primitive events and high-level events See [Bate 95], where primitive events represent observed nondecomposed fundamental behavior, while the abstraction of primitive events into other, probably more representative events is achieved with high-level events. Thus, high-level events cannot be observed from the program's execution, but manifest only a way to abstract the observed primitive events into more useful abstractions on higher levels (compare also See [From 95c]). This approach is not limited to parallel programs, but is also useful for other areas, like object oriented programs as described in See [Nobl 95] and also in See [Jerd 97]. Other useful approaches like center of interest See [Furn 86] or the fish eye view See [StBo 89] display the complete information without losing any global relations by concentrating on areas of major interest and displaying remote regions of the diagram in successively less detail. A fundamental idea of the fish eye view is to provide a balance of local detail and its global context.

This basic event abstraction mechanisms are certainly extensible. In EDL, the Event Definition Language introduced by Bates and Wileden See [BaWi 83b], there exist two essential mechanisms for event abstraction: filtering and clustering. With filtering, all but a designated subset of events can be deleted from the original event stream. Clustering means, that one or more primitive events are gathered together into a higher level event, which requires the availability of a mechanism for constructing an abstracted view. This kind of abstraction can be performed automatically by the corresponding debugging tool as described in See [Tayl 93], See [KuBl 95], See [Augu 95], See [From 94], See [From 95b], and See [Reed 97].

The development of EDL has also lead to the high-level debugging approach EBBA, Event-Based Behavioral Abstraction See [Bate 95] and the program behavior models of FORMAN See [Augu 98]. Both models follow the idea, that the behavior observed in parallel programs may reveal useful patterns, which can be evaluated during program analysis. With EBBA, debugging is treated as the process of creating models of expected program behavior and comparing these models to the actual behavior exhibited by the program (see See [Bate 88] and See [Bate 95]). The main problem of EBBA is, that cyclic debugging is not supported, which requires to exhaustively describe all the interesting events before program execution See [LeMe 87]. Similarly, in the FORMAN language users may specify event relations (e.g. partial order, hierarchy) as exhibited by patterns with assertions, that are generated during program execution and can be checked post-mortem See [Augu 98]. One of these relations is the inclusion relation, which allows to consider program behavior at appropriate levels of granularity See [Augu 99].

Among these tools that have been inspired by the work of Bates and Wileden is POET See Please note, that the name Poet is used for two similar tools. One is the Poet event trac