EPODE

Parallelism across the method

Each processor executes a different part of a method. For example, each processor may produce a solution at different time step for each pass through the algorithm.

This approaches has the triple advantage of being coarse-grained, of avoiding load balancing problems, and of being application-independent. It has the disadvantage of giving a limited speedup. Unfortunately, many existing algorithms that perform well on a sequential computer can take hardly profit from a parallel configuration. This feature necessitates us to construct new methods, specially designed for parallel configuration. A small number of concurrent subtasks of considerable computational complexity can be distinguished (methods are called coarse-grain parallelism).

Existing stiff ODE methods have opportunities for parallelism in the linear algebra and in the Jacobian formation. Efficient nonstiff methods are all explicit and their computing time is spent mainly on the function evaluation.

Parallelism in method has the benefit of requiring no user intervention or special properties of the given ODE. Even in the case one can find some parallelism within the function evaluation, parallelism in method can give a speedup an ton of this. The parallelism within a function evaluation is problem and user dependent.

Compared to serial integration methods, most parallel integration methods show poor stability characteristics. Improvement may well cause the advantages of the parallel solution to be lost. If the error and stability characteristics are satisfactory, the effectiveness of the algorithms concerned is crucial. One or other algorithm may be more effective, dependent on the problem being solved and on the parallel machine being used, but also on the implementation of the solution method.

Dependent on the problem and processor characteristics, a parallel integration method may show unfavorable error/performance relations, i.e. a (comparable) sequential method, implemented on a uniprocessor, gives equally results in a smaller solution time.

The parallel integration methods do not attack the function (derivative) evaluations properly. Combining these methods and methods for effective parallel function evaluation may lead to considerably higher performances.

Parallelism in separate statements or several subintervals

Parallelism across method for the solution of ODE can be subdivided primarily into two categories:

the algorithm for solving ODE is built from separate statements and steps which can be executed in parallel to some extent;
parallelization is carried out in such a way that several subintervals can be treated in parallel.

There are no clear-cut boundaries to the other types of parallelism (for example, parallel predictor-corrector methods can be counted in the first category as well as in the second one). Mixed types of parallelism, of the two categories, are possible.

Parallelism in separate statements. The level of abstraction of this form of parallelism is high. The advantage of parallel methods is their independence of the degree of parallelism of the size of the system of ODE to be solved. On the other hand only modest degree of parallelism are attainable in this way, depending greatly on the method (in a high redundancy of calculations is accepted, high degrees of parallelism can be reached, but the time saved with respect to a comparable sequential procedure will be relatively small). This is a result of the fact that the solution of ODE is an inherently sequential problem.

Parallelism in several subintervals. Among others, block methods and parallel predictor-corrector methods can be subsumed under this type of parallelism. It is difficult to obtain high degrees of parallelism by this principle, unless considerable additional information on the problem is available or an extreme degree of inefficiency due to may redundant calculations is accepted. Procedure of this type have been introduced by Nivergelt.

Parallelism and computational overhead

Parallelism through computational redundancy. A class of parallel methods is dealt with, where a certain computational overhead is introduced to recast an essentially serial algorithm into a form in which it consists of several extensive (compound) tasks, which can be performed in parallel. The methods only make sense if the speed advantage of the parallel treatment exceeds the disadvantage of the included redundancy.

Examples of this kind of parallel method are among others:

domain decomposition method proposed by Nivergelt;
block methods are easily adapted to a parallel mode with little apparent degradation in the solution;

Parallelism without introducing computational overhead. The idea of parallel predictor-corrector methods is to compute the predictor and corrector simultaneously, but for different solution points. There were derived specific examples of method for different number of processors with stability and convergence theory for their classes of methods. The processors are allocated to solving either predictor or corrector equations.

Stabbing approach

Nivergelt has proposed a stabbing approach in which the interval of integration is subdivided and a number of different problems are solved on each interval concurrently. Interpolation is then performed in parallel. A bounded analysis has been developed in the linear case. This approach is now considered not to be practical, but it is the first instance of a specific parallel implementation for a restricted class of problems.

Direct methods: Runge-Kutta methods

Direct methods for ODEs do not have much opportunity for massive parallelism unless explicit methods can be used for large system of equations as typically arise in the semidiscretization across space of hyperbolic PDEs. In all other cases, we have to look for modifications of existing methods to introduce parallelism.

One way of exploiting parallelism across the method is to perform several function evaluations concurrently on different processors. This approach is possible for multistage methods such as Runge-Kutta (RK) methods.

An explicit multistage method (an (A,B) method) is essentially sequential. However, is some cases two or more stages can be performed in parallel if there is any appropriate subset of the coefficients of the strictly lower triangular matrix A which are zero. This approach has been formalized in an attempt to exploit the structure of the RK matrix using a digraph analysis of this matrix, but the conclusion is that there is little to be gained in this respect. The concept of a block explicit method has been introduced in which the RK matrix can be written in block lower triangular form. If the number of such blocks is k and the number of internal stages in the largest block is p, then such a method is suitable for a p-processor machine in which the p internal solution values can be computed concurrently within a block. In order to achieve load balancing, it makes sense to have all blocks approximately the same size. The maximum order of such a method is k and so is independent of p.

Parallelism in explicit Runge-Kutta methods (ERK). Formula RK is a s-stage p-parallel q-processor RK formula iff p is the smallest integer for which the s internal stage values Y_n,i can be evaluated in p time units and q is the smallest number of processors for which this value of p can be attained. The require to retain the explicit nature of an RK formula while computing block of internal stage values in parallel supposes requires that each Y_n,i in the first block of interval stage values cannot depend on any Y_n,j j # i, and, similarly, the second block of internal stage values to be computed in parallel can depend on the first block only, and, respectively, the k block of internal stage values to be computed in parallel can depend on the Y_n,i in blocks 1, ... ,k-1 only. If p is the number of blocks and q is the number of interval stages values in the largest block, then the formula with internal stage values grouped as described above in an s-stage p-parallel q-processor ERK formula. For good load balancing, formulas with s=pq are preferred. The order of such a scheme is at most p.

Exploiting the sparsity of the generating matrix: the digraph method. For parallel integration, one will try to find RK formulas in which the generating matrix possesses a minimum length of the leading non-zero elements in each row.

The inherent potential for parallelism of RK methods can be investigated by considering the associated digraphs and highlighting the important role of the underlying sparsely system.

The couple G=(V,E) is the digraph (directed graph) of the matrix A if the set of vertices is V= {1,...,m}, whereas the pair (i,j) belongs to the set of directed edges E is i,j in V and a_ij # 0. In the plots of digraphs, if (i,j) in E, then the arrow in the line linking these vertices points at i. This has been done to make explicit the connection between the digraph and the flow of information in a parallel implementation. The digraph models the sparsity of A. The vertices v₁, ... ,v_s, where s>= 2 form an s-loop if v_i # v_i+1 , i=1,...,s : v_i,v_i+1; i=1,...,s is a subset of E or (v_i+1 ,v_i):i=1,...,s is a subset of E, where v_s+1 =v₁. The vertices are not assumed to be distinct. The vertex v is a 1-loop if a_v,v #0, otherwise it is a 0-loop. Let s be the length of the largest loop in the digraph of a RK matrix A (it means that s distinct stages are linked together in A into an implicit irreducible block, and that no larger number of stages possesses this feature). In this case, the underlying method is called s-implicit. An explicit method corresponds to a 0-implicit RK method, and diagonally implicit corresponds to 1-implicit. Assume that the set V is partitioned into n disjoint non-empty subsets: {1, ... ,m} = union _i=1 ⁿ P _i, P ₁, ... , P _n # 0, P _iintersection P _j, for all i # j. We call this a processor partitioning of the method. We further partition each P _l, P _l= union _i=1^m L _li , i=1, ...m , where L _li intersection L _lj=0 for all i # j ( L _li can allowed to be empty sets, but for some r in 1, ... ,n it is true that L_rj # 0, j=1, ...,n for some m>= 1). This is called a level partitioning of the method if the following conditions hold:

For each l=1, ... ,n and j=1, ...m , all the vertices in L _lj belong to a loop.
If a_ij # 0 then either there exist p in 1, ... ,m and r,s in 1, ... ,m such that r<= s, i in L _ps, j in L _pr or there exist q,t in 1, ... ,n and r,s in 1, ... ,m such that q # t, r < s, i in L _ts, j in L _qr.

A combination of processor and level partitioning of a s-implicit RK method will be called a n,m,s implementation. Any n,m,s implementation can be executed on n processors in m cycles that entail solutions of implicit systems each with no more than s copies of the derivative vector.

It was stated that the order p of an explicit RK method with a n,m,s implementation cannot exceed m, regardless of the number of processors n. It is well known that, for 1 <= m <= 4, there are m-stage ERK method that are of order m. Thus, parallel implementation cannot increase the order/stage ratio for such methods. For more than five stages, the order of an explicit RK method is always smaller than the number of stages, and an improvement via parallelization is possible, but it seems that explicit RK methods are not facilitated much by parallelism at the method level. Consequently, the attention must be focused on implicit methods, inclusive of diagonally implicit methods.

Parallelism in diagonally implicit Runge-Kutta methods (DIRK). In this case a_ij=0, j>i. The parallelism comes about because the stages within a block are shared amongst the processors. One of the drawbacks of the above approach seems the fact that the stage order is at most 2 (the stage order is defined to be the minimum order of all the interval components). The order behaviour of a method when applied to many classes of stiff problem is more like the stage order of the method than the classical order of consistency.

Parallelism in singly implicit and multi-implicit Runge-Kutta methods (SIRK and MIRK). Since parallel computation is becoming widely available it is natural to ask how to construct such methods which are suitable for implementation in a parallel environment. One possible way to do this is by requiring that the method is a multimplicit Runge-Kutta method (MIRK). In a MIRK method, the matrix A has all eigenvalues real and distinct, so that A matrix is similar to a diagonal matrix with the eigenvalues on the diagonal, which makes it suitable for a parallel implementation.

Direct methods: extrapolation schemes

Many times it has been remarked that extrapolation methods possess a high degree of parallelism and offer an extremely simple technique for generating high-order methods. Extrapolation algorithms can be applied efficiently especially in two situations, where the function f is approximated by either polynomials or rational functions. A detailed analysis shows that a sophisticated load balancing scheme is required in a parallel system to achieve good speedup.

The basic idea is that any differential method can be applied over a time interval [t_n,t_n+H] with a series of different constant stepsizes h_i=H/n_i, i=1, ..., p . If the underlying method has an even power series expansion of the stepsize in the global error, then Richardson extrapolation can be performed to generate high-order methods of any desired accuracy. An efficient parallel implementation requires the grouping of the p methods to ensure that each processor has approximately the same amount of work; this approach is well suited for a parallel implementation when the problem size is large or function evaluations are very costly, but the parallelism is strictly limited.

Direct methods: block methods The main reason for the popularity of BDF methods is the relatively low computational effort per step, at least when compared with other suitable methods for stiff equations, such as implicit RK methods. However, the FLM have one serious disadvantage: they are subject to the so-called second Dahlquist barrier, which says that the order cannot exceed two if the method has to be A-stable.

Block methods are easily adapted to a parallel mode with little apparent degradation in the solution. Stability boundaries are not significantly altered and estimation of error is unaffected by the change to a parallel system. The obvious drawback to these schemes is that their parallel performance is inherently limited, i.e. the performance is dependent on the number of nodes in the scheme.

In a r-point block method each pass through the algorithm simultaneously produces r new equally spaced solution values. In a k-block, r-point method, time is divided into a series of blocks with each block containing r steps at which solutions to differential systems are to be found; the computation proceeds in blocks, i.e., the computation is based on the computed values at the earlier k blocks. Within a block it is possible to assign the computational tasks at each time step to a single processor, and, thus, the computational at all future steps can be performed simultaneously in parallel.

Block implicit methods can be applied in a one-step mode, in which only the last point in the block is used to compute the first approximations to the r values of the next block. Then, implicit formulas are applied iteratively until convergence is achieved to the maximum order of accuracy obtainable.

Iterative methods: parallel predictor-corrector schemes

In the design and construction of methods suitable for nonstiff problems in parallel environment, much interest has been recently shown on the use of general class of PC methods. In designing efficient PC methods for parallel architectures we must consider such properties as order, stability and error behaviour. This analysis can be facilitated by studying the very general class of explicit multivalue methods which is wide enough to encompass most methods suitable for efficient implementation in a parallel environment. Explicit linear multistep methods have the advantage that they require a minimal amount of work per integration step. However, their asymptotic stability is strongly restricted, and they force us to use a relatively small h. Therefore, the over-all amount of work in carrying the integration over a given interval may be large, nevertheless. Implicit formulas are more stable and permit the use of larger h's but, in general, require the solution of the dependent variable. A predictor-corrector PC-scheme appears to give a good compromise between the two types of formulas: the amount of work per step is larger than for the open formula (at least two evaluation of f(t,y)), it is much normally, much more stable and, with the some step member, somewhat more accurate than an explicit formula.

In general, the following drawbacks of the parallel PC schemes can be identified:

data communications with these schemes are required at each step - this will definitely reduce the performance of the methods;
the predictor equations depend on the predicted values;
parallel PC schemes are of smaller stability region than the corresponding serial PC schemes;
the evaluation of the derivative function many times at the same time point is inefficient.

Concurrent corrections: the frontal methods. The main point of the frontal techniques lies in the simultaneous computation of the predicted value and of all the iterates, each at a different solution point. We present frontal methods where the corrector iteration is performed m times by using both the fixed point method and the Newton method. Their degree of parallelism is m+1. These methods are not suitable for stiff problems because they loss of stability.

Splitting methods Numerical tests and theoretical analysis suggest that the standard approach to predictor-correction, results in methods having relatively small stability regions (unless a large number of corrections are performed) and, in general, large error coefficients because of the complex mixing of predictor and corrector errors. A possible way to overcome these difficulties is by the use of splitting techniques, which allows more efficient ways of solving the implicit equations than by fixed-point iteration.

Runge-Kutta correctors. One of the classes of parallel IVP solvers for nonstiff problems that received relatively much attention is the class of predictor-corrector methods based on RK correctors.

Iterative methods: implicit block methods

One way of circumventing the somewhat negative features associated with paralleling block explicit RK methods is to consider a predictor-corrector approach. Predictor-corrector methods were originally proposed as examples of block methods. The underlying feature of a block method is that each application of the method generates a collection of approximation to the solution within a block. In a parallel environment individual processors can compute, independently, the approximation values to the solution within the blocks.

Since, in general, relatively few corrections are needed to get suitable behaviour this parallel wavefront approach appears to be of limited value. Since block predictor-corrector methods can be placed in the format of a multivalue method, the order and stability theories developed for such methods can be directly applied.

Block implicit methods can also be adapted to a predictor-corrector mode; in this case all the solution values of a block may be used to predict a solution at each node at the next block.

In predictor-corrector block methods the value of the unknown vector is computed ahead simultaneously at a predetermined number of future points. The computation is based on the computed values of the vector at earlier points. The computation proceeds in blocks. Within a block it is possible to assign both the predictor and the corrector computations at each future point to a single computer, and to perform the computations at all future points simultaneously in parallel.