|
Dieter Kranzlmüller Event Graph Analysis for Debugging Massively Parallel Programs |
|||

2. Area of Application: Parallel Computing
2.1 The Need for High-Performance Computing
2.1.1 Grand Challenge Problems
Parallel computing is sometimes called the "key enabling technology" of modern computers that allows to fulfil the ever-increasing demand for higher and higher performance at lower production costs, which allows sustained productivity in real-life applications [Hwan 93]. Therefore, parallel computing is integrated wherever performance is one of the key issues, and parallelism appears in various forms, such as pipelining, vectorization, concurrency, data parallelism, time sharing, multitasking, multiprogramming, multithreading, and distributed computing at different processing levels. Yet, while parallelism is widely available in every computing platform, from powerful game consoles to mainframes in companies, some problems still continue to require more processing power than available.
At the foremost front of problems are the so-called "Grand Challenges" [Gran 89], [Gran 92], and [Wald 95], which can be defined as follows:
Definition 2-1: Grand challenge problems [WiAl 99]
- A grand challenge problem is one that cannot be solved in a reasonable amount of time with today's computers.
These fundamental problems generate increasingly complex data, require realistic simulations of the processes under study, and demand intricate visualizations of the results [Gran 99a]. These problems often require numerous large-scale calculations and have significant elements of multi-discipline and multi-geographic collaborations [Gran 98]. Their solutions will have a broad economic and scientific impact, not only on the computing industry but probably on the human life in general [Gran 99b]. The following list contains an overview of some problems in the Grand Challenges category as summarized in [Gran 96], which can only give an impression of today's situation, since yet unrecognized application might emerge soon, for example from the expanding multimedia field [Hoss 99].
- Applied Fluid Dynamics
- Meso- to Macro-Scale Environmental Modeling
- Ecosystem Simulations
- Biomedical Imaging and Biomechanics
- Molecular Biology
- Molecular Design and Process Optimization
- Cognition
- Fundamental Computational Sciences
- Grand-Challenge-Scale Applications
Another idea would be to identify the equations, that have to be solved in order to get results for the Grand Challenge problems. These so-called Grand Challenge equations have been collected by [Gran 99a]:
- Newton's Equations
- Schrödinger Equation (time dependent)
- Navier-Stokes Equation
- Poisson Equation
- Heat Equation
- Helmholtz Equation
- Discrete Fourier Transform
- Maxwell's Equations
- Partition Function
- Population Dynamics
- Combined 1st and 2nd Laws of Thermodynamics
- Radiosity
- Rational B-Spline
To be more concrete, we will now give two representative examples of Grand Challenge problems in more detail:
Example 2-1: Grand Challenge problem [WiAl 99]:
Numerical weather prediction (weather forecasting by computer)
- The global atmosphere is modeled by dividing it into three-dimensional regions or cells, e.g. cells of size 1 mile x 1 mile x 1 mile which would lead to roughly 5 x 108 cells. The conditions in each cell (temperature, pressure, humidity, wind speed and direction, etc.) are computed at time intervals with rather complex mathematical equations and using various conditions from the previous time intervals. If we assume 200 floating-point operations for each calculation, one time step would require 1011 floating-point operations. In addition, the calculations for each cell have to be repeated iteratively in order to model the passage of time. Thus, predicting the weather over a time period of 10 days using 10-minute intervals would require 104 time steps and 1015 floating-point operations in total. Using a computer that operates at 100 Mflop/s (108 floating-point operations per second) would than lead to a computation time of 107 seconds or approximately 116 days to perform this calculation. On contrary, to perform this calculation within 10 minutes would require a computer operating at 1.7 Tflop/s (1.7 x 1012 floating-point operations per second). If you compare this with the current Top 500 supercomputer list [Dong 99] as described in Section 2.2.3, only the world's fastest supercomputer, Intel's ASCI Red, achieves more computational speed for the Linpack benchmark [Dong 94].
Example 2-2: Grand Challenge problem [WiAl 99]:
Astrophysical N-body simulation
(motion prediction of astronomical bodies in space)
- Each astronomical body in space is attracted to each other body by gravitational forces. These are long-range forces that can be calculated with a simple formula (see [WiAl 99], page 126), and the movement of each body can be predicted by calculating the total force experienced by the body. If there are N bodies, there will be N-1 forces to calculate for each body, or approximately N2 calculations in total. Thus, if we assume that a galaxy might contain 1011 stars, we would require 1022 calculations. Again, these are only the calculations for one time step, while scientists are usually interested in simulating a certain period of time, which means that these calculations will have to be performed iteratively. If we provide a powerful computer, that could perform each calculation within 1ms (10-6 seconds), it would still take approximately 109 years for one iteration of this N2 algorithm.
2.1.2 Requirements on Computer Systems
These two examples indicate, that there exist problems, which can only be solved with very powerful architectures that may not even exist today. Some of them require systems of unheralded capability [Sieg 92]. Hence, besides the problems of feasibility, Grand Challenge applications also serve as the representative guides to the development of system software and hardware [Gran 98]. Furthermore, workstations, however powerful they are or will be in the near future, cannot replace the potential of parallel computers, which are basically built upon the technology of powerful microprocessor chips by tying them together via sophisticated broadband interconnection networks in order to support massive parallelism [Hoss 99].
The following three requirements manifest the need for parallel computing:
Clearly, computational speed is the most obvious characteristic required from a parallel computing system. Often, computations must be completed within a "reasonable" time period or have a specific deadline for the computations [WiAl 99]. For example, the predictions of a weather forecasting program that requires two days to compute next days weather is certainly useless.
A second requirement is the amount of memory consumed by this large and complex problems, which can easily exhaust conventional computing systems. However, by coupling large numbers of processors which have access to their local memory, the total amount of memory can be scaled up. Please note, that the amount of memory also affects the computational speed, because obviously the data to be processed has to be transported from memory to CPU and return. This defines the need for a sufficient degree of interconnection bandwidth.
The last requirement is the accuracy of the results, that have to be computed. Yet, this is strongly connected to the other requirements, simply because the higher the computational speed and the amount of memory available, the higher the degree of accuracy that can be achieved. On contrary, restrictions in the available computing power and/or the available memory certainly reduce the obtainable accuracy, and many actual problems cannot be solved with sufficient degree of accuracy by today computers.
2.1.3 Teraflop Computing
The above mentioned requirements have also been visible in the promises of computer manufacturers, who specified their goal for 1995 to be the magical "3 T's" - 1 Tflop/s (1012 floating-point operations per second) in execution rate1, 1 Tbyte of main memory, and 1 Tbyte/s interconnection bandwidth [Hoss 99]. This goal lead to the definition of key attributes for such a powerful computing system, which have been specified by Hossfeld as [Hoss 99]:
- Multiple high-performance commodity priced compute nodes from regular commercial product lines (not special-purpose designs)
- Hierarchical memory systems (e.g. cache-only memory architectures and distributed shared memory systems with low-latency memory access)
- Very high performance storage and parallel I/O systems
- Scalable programming environments and operating systems
- A universal programming paradigm
Today, four years later than expected, the 1995 goal in execution rate was achieved, which can be attributed to the Accelerated Strategic Computing Initiative (ASCI) project [ASCI 99]. This $1-billion program launched in 1994 by the US Department of Energy (DOE) seeks to build Tflop/s computer systems within 10 years [Craw 96]. The participating government laboratories - Sandia National Laboratories, Lawrence Livermore National Laboratory, Los Alamos National Laboratory - collaborate in creating the leading-edge computations modeling and simulation capabilities required for maintaining safety, reliability, and performance of US nuclear stockpile and reducing the nuclear danger [Gust 98]. Besides nuclear weapons research, simulation of other areas like biotechnology, medical and pharmaceutical research, weather prediction, aircraft and automobile design, industrial process improvement, environment projection, etc. are targeted [HwXu 98]
The strategies adopted by the ASCI program have two notable features: accelerated development and balanced scalable design. Thus, ASCI aims not just at the peak speed of a particular computer system, but at a total system solution that can deliver sustained application performance at five orders of magnitude better than that in 1994. The scalable design approach defines the following steps [HwXu 98]:
- Focus on high-end platforms for scientific computing
- Use of commodity off-the-shelf (COTS) hardware and software components
- Use of massively parallel architectures
A step further than the ASCI program is targeted by the PetaFlops project [NASA 97]. This project aims at the long term goal to achieve a supercomputing speed of 1 Pflop/s (105 flop/s). Recent studies suggest, that such a system could eventually be built in 2007 with exotic technologies and unknown costs. However, the more realistic idea is to build a massively parallel processor with conventional technology and COTS components by year 2015 at a cost of US$ 1 billion [HwXu 98]. Such a system must effectively exploit massive parallelism in the range of thousands of processors to over a million of threads concurrently. In the following section, we will briefly summarize the historical steps that led to the current situation and have been performed to achieve these goals, before presenting an overview of the current state of supercomputer developments.
1 Please note, that there is often a discrepancy when specifying actual performance rates of arbitrary computer systems, which mostly stems from the difference between peak performance and sustained performance. While peak performance pictures the best performance results a computer system can gain under optimal conditions, sustained performance represents performance numbers that can be observed during operation.
![]() GUP Linz http://www.gup.uni-linz.ac.at |