This article was originally published in Computers in Physics, Volume 7, Number 4, 1993, pp. 418-424. Revised for MPI, Nov., 1998


How to Write (Nearly) Portable Fortran Programs for Parallel Computers

by Victor K. Decyk
Department of Physics and Astronomy
University of California, Los Angeles
Los Angeles, CA 90095-1547


Abstract

Techniques are introduced for writing (nearly) portable Fortran programs for parallel processors. Automatic parallelizing compilers can help the user first write code for a shared memory machine. This process is similar to vectorizing code for vector processors, and debugging can be done even in a serial environment. If certain constraints are observed, the code can also run on distributed memory computers with minor changes.

I. Introduction

Parallel computers promise previously unimaginable computational speeds for scientific calculations. One reason their use has increased only slowly, however, is that each parallel computer has been different and has required learning a new system and programming environment. Some computers used a shared memory model, others a distributed memory model. Some used a special parallel language or language extensions, while others used Fortran 77 with message-passing subroutines. Writing parallel code was usually difficult, and writing portable parallel code seemed impossible. And even if one did invest the time to write a parallel program for some computer, it often happened that before one got much use out of it, either the environment changed or the computer company vanished altogether.

Recently, a number of computer vendors have provided automatic parallelizing compilers for shared memory computers. This article describes how to use such compilers to design and write parallel code that can be used on shared memory as well as distributed memory computers. The process of writing parallel code then becomes similar to the process of writing vectorized code, even for distributed memory computers. A number of guidelines for producing portable code are suggested and illustrated with examples. Portability is an important consideration if programming time is valuable. Furthermore, the parallel code can for the most part be debugged serially, which simplifies the process.

II. Simple Example

The general idea behind these automatic parallelizing compilers is that they will parallelize the outermost loop, and on vector machines, vectorize the inner loop(s), so long as there are no data dependencies. Examples of such compilers include the fpp precompiler to cft77 from Cray Research and the automatic parallel option of VS Fortran 2.5 from IBM. Let us illustrate this with a very simple example. Suppose one has the following Fortran loop:
        PARAMETER (NX=32)
        DIMENSION F(NX)
        DO 10 J = 1, NX
        F(J) = J
     10 CONTINUE
This can be parallelized by breaking up the single loop of length NX into NVP loops of length NXP as follows:
        PARAMETER (NX=32,NVP=4)
        PARAMETER (NXP=NX/NVP)
        DIMENSION F(NXP,NVP)
        DO 30 K = 1, NVP
        JOFF = NXP*(K - 1)
        DO 20 J = 1, NXP
        F(J,K) = J + JOFF
     20 CONTINUE
     30 CONTINUE
where for simplicity I have assumed that NX is an exact multiple of NVP. Although the array F now has a different dimension, it contains the same number of words and the same data as before (the formula J+NXP*(K-1) is just the location in linear storage of the element (J,K) of the two dimensional Fortran array F). The compiler can now run the outer index K in parallel. In general, compilers will give messages when they can parallelize the outer loop and give reasons when they cannot. For example, here is a sample output from the IBM compiler for this loop:
         DIMENSION F(NXP,NVP)
   PARA +-------     DO 30 K = 1, NVP
   VECT|+-------     DO 20 J = 1, NXP
       ||_______     F(J,K) = J + JOFF
       | _______
The Cray preprocessor gives similar information:
            dimension f(nxp,nvp)
   P---     do 30 k = 1, nvp
   P        joff = nxp*(k - 1)
   P V-     do 20 j = 1, nxp
   P V      f(j,k) = j + joff
   P V-  20 continue
   P-    30 continue
One can thus mentally associate each index K with one of NVP parallel processors. Note that the variable JOFF is special: if the outer loop is run in parallel, there must be NVP copies of it, one for each processor, since its value depends on the index K. Such variables are known as private variables. Variables such as NXP which do not depend on K are known as shared variables, since each processor has the same value for that variable. Some years ago, extensions to Fortran were defined to help compilers recognize such parallelism. (Cray microtasking is an example.) These extensions amounted to explicitly declaring the loop over K a parallel loop and declaring the variable JOFF a private variable. Such extensions are generally no longer needed, as the compilers are usually capable of recognizing such parallelism without user intervention. In reality, private variables are actually arrays of size NVP, and to avoid confusion, one can always declare them as such:
        PARAMETER (NX=32,NVP=4)
        PARAMETER (NXP=NX/NVP)
        DIMENSION F(NXP,NVP), JOFF(NVP)
        DO 30 K = 1, NVP
        JOFF(K) = NXP*(K - 1)
        DO 20 J = 1, NXP
        F(J,K) = J + JOFF(K)
     20 CONTINUE
     30 CONTINUE
On a distributed memory computer, one can think of each index K as residing on a different computer, so it is no longer an explicit loop index or array dimension. Instead, K comes from a function call which returns a different processor ID (0 <= ID <= NVP-1) on each node:
        PARAMETER (NX=32,NVP=4)
        PARAMETER (NXP=NX/NVP)
        DIMENSION F(NXP)
        CALL MPI_COMM_RANK(MPI_COMM_WORLD,ID,IERR)
        K = ID + 1
        JOFF = NXP*(K - 1)
        DO 40 J = 1, NXP
        F(J) = J + JOFF
     40 CONTINUE
Here we are using a call to the MPI standard message-passing library to determine the processor ID. I will use the index K to identify processors; it is one larger than the processor ID, the usual convention on distributed memory computers.

Finally, it is possible to combine the two versions of this loop into one version that would run on both computers. To do this, we keep the loop index K, but its extent is now given by a new variable NBLOK, which is equal to NVP on a shared memory machine but has the value 1 on a distributed memory machine. If the value of K is needed other than for indexing, then it is given by a new combination K + KOFF. On a shared memory machine, KOFF is always zero and K varies, while on a distributed memory machine, K is always 1 and KOFF varies, but the combination K + KOFF always has the same value on both machines. Whether a machine is a shared or distributed memory computer is controlled by a new parameter MSHARE, which has the value 1 for shared memory and 0 for distributed memory. The result is:

        PARAMETER (NX=32,NVP=4,MSHARE=1)
        PARAMETER (NXP=NX/NVP,NBLOK=1+MSHARE*(NVP-1))
        DIMENSION F(NXP,NBLOK)
        IF (MSHARE.EQ.1) THEN
           KOFF = 0
        ELSE
           CALL MPI_COMM_RANK(MPI_COMM_WORLD,KOFF,IERR)
        ENDIF
        DO 60 K = 1, NBLOK
        JOFF = NXP*(K + KOFF - 1)
        DO 50 J = 1, NXP
        F(J,K) = J + JOFF
     50 CONTINUE
     60 CONTINUE
On a shared memory computer we can use a dummy version of MPI_COMM_RANK, since it is never called there and may not be available.
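For example, a minimal dummy stub might look like the following. This is only an illustrative sketch: the body, which simply returns processor ID 0, is arbitrary, since the stub is never executed when MSHARE is 1 and merely has to satisfy the loader. The same trick applies to the other MPI routines used later in this article:
        SUBROUTINE MPI_COMM_RANK(ICOMM,ID,IERR)
c dummy version of MPI_COMM_RANK for shared memory machines:
c it is never executed, so it only needs to exist at link time
        ID = 0
        IERR = 0
        RETURN
        END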

One important advantage of this approach is that the program can be run on serial computers or run on some parallel computers with the automatic parallelization turned off. This makes debugging much easier. Up to now, we have associated the outer loop index with one of NVP processors. On many shared memory computers, NVP is not necessarily the actual number of physical processors. But on distributed memory computers, it often is. Thus for portability, it is best to design code so that NVP can equal the number of physical processors.

III. Example with Reduction

Let us try this technique on another simple example which contains a summation reduction. First consider the following loop:
        PARAMETER (NX=32)
        DIMENSION F(NX)
        FSUM = 0.
        DO 10 J = 1, NX
        FSUM = FSUM + F(J)
     10 CONTINUE
This can be written as a double loop for a shared memory computer trivially:
        PARAMETER (NX=32,NVP=4)
        PARAMETER (NXP=NX/NVP)
        DIMENSION F(NXP,NVP)
        FSUM = 0.
        DO 30 K = 1, NVP
        DO 20 J = 1, NXP
        FSUM = FSUM + F(J,K)
     20 CONTINUE
     30 CONTINUE
On the distributed memory computer, the outer loop is implicit, as in the previous example, and one has:
        PARAMETER (NX=32,NVP=4)
        PARAMETER (NXP=NX/NVP)
        DIMENSION F(NXP)
        FSUM = 0.
        DO 40 J = 1, NXP
        FSUM = FSUM + F(J)
     40 CONTINUE
        CALL MPI_ALLREDUCE(FSUM,TEMP,1,MPI_REAL,MPI_SUM,MPI_COMM_WORLD,IERR)
        FSUM = TEMP
The call to subroutine MPI_ALLREDUCE sums all instances of the variable FSUM across processors and distributes the answer to all processors in the variable TEMP. Thus on the distributed memory computers, a partial sum over the index J is first performed by each processor. Then the partial sums are added by the subroutine MPI_ALLREDUCE. Finally, the portable version of the loop looks like:
        PARAMETER (NX=32,NVP=4,MSHARE=1)
        PARAMETER (NXP=NX/NVP,NBLOK=1+MSHARE*(NVP-1))
        DIMENSION F(NXP,NBLOK)
        FSUM = 0.
        DO 60 K = 1, NBLOK
        DO 50 J = 1, NXP
        FSUM = FSUM + F(J,K)
     50 CONTINUE
     60 CONTINUE
        IF (MSHARE.EQ.0) THEN
           CALL MPI_ALLREDUCE(FSUM,TEMP,1,MPI_REAL,MPI_SUM,MPI_COMM_WORLD,IERR)
           FSUM = TEMP
        ENDIF
On a shared memory computer we can similarly use a dummy version of MPI_ALLREDUCE, since it is never called there and may not be available.

IV. Example with Communication

As long as expressions can be written so that array elements are referenced everywhere with the same value of the parallel loop index K, the loops can obviously be run in parallel. As much as possible, one strives to write code with such expressions. If this is not possible, however, and array elements must be referenced with different values of K in an expression, then things are more complicated, because one parallel process must communicate with another. Let's consider another example, which calculates a finite difference with periodic boundary conditions:
         PARAMETER (NX=32)
         DIMENSION F(NX), R(NX)
         DO 10 J = 2, NX
         R(J) = F(J) - F(J-1)
      10 CONTINUE
         R(1) = F(1) - F(NX)
When one breaks this up into a double loop as before, one obtains:
        PARAMETER (NX=32,NVP=4)
        PARAMETER (NXP=NX/NVP)
        DIMENSION F(NXP,NVP), R(NXP,NVP)
        DO 30 K = 1, NVP
        DO 20 J = 2, NXP
        R(J,K) = F(J,K) - F(J-1,K)
     20 CONTINUE
        IF (K.GT.1) THEN
           R(1,K) = F(1,K) - F(NXP,K-1)
        ELSE
           R(1,K) = F(1,K) - F(NXP,K+NVP-1)
        ENDIF
     30 CONTINUE
Notice that we now have expressions such as:
        R(1,K) = F(1,K) - F(NXP,K-1)
where both K and K-1 appear as indices in the same expression. That means that the processor associated with K needs information from the processor associated with K-1. On a shared memory machine this presents no problem, and both the Cray and IBM compilers will parallelize loop 30 above. However, on a distributed memory machine one processor must physically send data to another processor, so this loop must be written differently.

First, it should be noted that on a distributed memory computer one cannot write expressions which involve mixed data from different processors (that is, involving different K indices), other than simply copying the data. Second, on a distributed memory computer a given processor does not fetch data from another processor; rather, it sends data to another. That means that when copying data, one should have the value K on the right hand side rather than on the left hand side as in our example. Using these restrictions, we can make this example more portable by writing it as follows:

        PARAMETER (NX=32,NVP=4)
        PARAMETER (NXP=NX/NVP)
        DIMENSION F(NXP,NVP), R(NXP,NVP)
        DO 50 K = 1, NVP
        DO 40 J = 2, NXP
        R(J,K) = F(J,K) - F(J-1,K)
     40 CONTINUE
        IF (K.LT.NVP) THEN
           R(1,K+1) = F(NXP,K)
        ELSE
           R(1,K-NVP+1) = F(NXP,K)
        ENDIF
     50 CONTINUE
        DO 60 K = 1, NVP
        R(1,K) = F(1,K) - R(1,K)
     60 CONTINUE
Note that in this new version we now have two loops over K (loops 50 and 60), where we had only one before (loop 30). In loop 50, we copy the data involving different K indices. Then in loop 60, we perform the actual arithmetic, where every reference has the same index K. In some cases, loop 50 may require a compiler directive to tell the compiler there is no recursion. Now let's look at how one would write this example on a distributed memory computer:
        PARAMETER (NX=32,NVP=4)
        PARAMETER (NXP=NX/NVP)
        DIMENSION F(NXP), R(NXP)
        DIMENSION ISTATUS(MPI_STATUS_SIZE)
        CALL MPI_COMM_RANK(MPI_COMM_WORLD,ID,IERR)
        K = ID + 1
        DO 70 J = 2, NXP
        R(J) = F(J) - F(J-1)
     70 CONTINUE
        KLEFT = K - 1
        IF (KLEFT.LT.1) KLEFT = KLEFT + NVP
        KRIGHT = K + 1
        IF (KRIGHT.GT.NVP) KRIGHT = KRIGHT - NVP
        CALL MPI_IRECV(R,1,MPI_REAL,KLEFT-1,1,MPI_COMM_WORLD,MSID,IERR)
        CALL MPI_SEND(F(NXP),1,MPI_REAL,KRIGHT-1,1,MPI_COMM_WORLD,IERR)
        CALL MPI_WAIT(MSID,ISTATUS,IERR)
        R(1) = F(1) - R(1)
The loop index K is now implicit, as in the previous distributed memory examples. The first loop (50) has been replaced by calls to MPI_IRECV and MPI_SEND, which is one way distributed memory computers send data between processors. In this example, the processor associated with index K first uses MPI_IRECV to tell the system where to place data arriving from processor K-1 or K+NVP-1. (The data has not necessarily been received yet.) Processor K then uses MPI_SEND to send the contents of F(NXP) either to processor K+1 or to processor K-NVP+1. Finally, processor K calls MPI_WAIT to make sure that the data from processor K-1 or K+NVP-1 has actually arrived in R(1).

Finally, it is possible to combine both versions of this loop into one:

        PARAMETER (NX=32,NVP=4,MSHARE=1)
        PARAMETER (NXP=NX/NVP,NBLOK=1+MSHARE*(NVP-1))
        DIMENSION F(NXP,NBLOK), R(NXP,NBLOK)
        DIMENSION ISTATUS(MPI_STATUS_SIZE)
        IF (MSHARE.EQ.1) THEN
           KOFF = 0
        ELSE
           CALL MPI_COMM_RANK(MPI_COMM_WORLD,KOFF,IERR)
        ENDIF
        DO 90 K = 1, NBLOK
        DO 80 J = 2, NXP
        R(J,K) = F(J,K) - F(J-1,K)
     80 CONTINUE
        KK = K + KOFF
        KLEFT = KK - 1
        IF (KLEFT.LT.1) KLEFT = KLEFT + NVP
        IF (MSHARE.EQ.1) THEN
           R(1,K) = F(NXP,KLEFT)
        ELSE
           KRIGHT = KK + 1
           IF (KRIGHT.GT.NVP) KRIGHT = KRIGHT - NVP
           CALL MPI_IRECV(R(1,K),1,MPI_REAL,KLEFT-1,1,MPI_COMM_WORLD,MSID,IERR)
           CALL MPI_SEND(F(NXP,K),1,MPI_REAL,KRIGHT-1,1,MPI_COMM_WORLD,IERR)
           CALL MPI_WAIT(MSID,ISTATUS,IERR)
        ENDIF
        R(1,K) = F(1,K) - R(1,K)
     90 CONTINUE
As before, on a shared memory computer we can use dummy versions of MPI_IRECV, MPI_SEND, and MPI_WAIT, since they are not called there and may not be available.

I have also made one other change in the shared memory segment of loop 90. I have not used the form shown in loop 50 above, where the index K is on the right hand side, but rather the form in loop 30 above, where the index K is on the left hand side. The reason is that some parallel compilers have difficulty recognizing the absence of recursion without compiler directives when the index on the left hand side is not K. So to avoid having to add compiler directives, I have written the loop for the shared memory version in the form shown.

V. Example with Recursion

Another kind of example that is often confusing when writing parallel code is one that has recursion. Consider the following loop, which finds which elements of the array F satisfy a certain condition and stores the corresponding indices:
        PARAMETER (NX=32)
        DIMENSION F(NX), N(NX)
        L = 0
        DO 10 J = 1, NX
        IF (F(J).GT.BOUND) THEN
           L = L + 1
           N(L) = J
        ENDIF
     10 CONTINUE
In general, this loop cannot be vectorized (except by the IBM compiler), because the value of the counter L depends on its previous value and is subsequently used as an address inside the loop. Following our usual procedure, one can write this loop in the following portable fashion:
        PARAMETER (NX=32,NVP=4,MSHARE=1)
        PARAMETER (NXP=NX/NVP,NBLOK=1+MSHARE*(NVP-1))
        DIMENSION F(NXP,NBLOK), N(NXP,NBLOK)
        DO 30 K = 1, NBLOK
        L = 0
        DO 20 J = 1, NXP
        IF (F(J,K).GT.BOUND) THEN
           L = L + 1
           N(L,K) = J
        ENDIF
     20 CONTINUE
     30 CONTINUE
Note that we have not actually removed the recursion. We have just replaced it with NVP separate recursions which one can then run in parallel.

For more complex recursions than this one, such as random number generators, one must exercise more care in initializing the separate recursions before being able to run them in parallel. Note that since the index K has no data dependencies, one could also use this technique to vectorize the outer loop instead of running it in parallel.
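As an illustration of the care required for a recursion like a random number generator, the following sketch uses the standard leapfrog technique to split a simple linear congruential generator into NVP independent recursions, so that processor K generates every NVP-th member of the original serial sequence. The generator constants IA, IC, and IM and the variable names are illustrative only (and make a poor generator); they are not taken from any particular library:
        PARAMETER (NVP=4,IA=1229,IC=1,IM=2048)
        DIMENSION ISEED(NVP)
c serial generator: ix = mod(IA*ix + IC,IM)
c first find the multiplier and increment for taking NVP steps at once,
c so that x(n+NVP) = mod(IANVP*x(n) + ICNVP,IM)
        IANVP = 1
        ICNVP = 0
        DO 10 J = 1, NVP
        IANVP = MOD(IA*IANVP,IM)
        ICNVP = MOD(IA*ICNVP+IC,IM)
     10 CONTINUE
c give processor K the K-th member of the serial sequence as its seed,
c so that the NVP separate recursions interleave to reproduce it
        IX = 1
        DO 20 K = 1, NVP
        IX = MOD(IA*IX+IC,IM)
        ISEED(K) = IX
     20 CONTINUE
c thereafter each processor K advances its own private recursion:
c ISEED(K) = MOD(IANVP*ISEED(K)+ICNVP,IM)
On a distributed memory computer, each node would keep only the seed and the NVP-step constants corresponding to its own value of K, as in the earlier examples.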

VI. MIMD Versus SIMD Compilers

For all the examples I have given until now, both the IBM and Cray parallelizing compilers behaved similarly. There is a significant difference, however, illustrated by the following example:
        PARAMETER (NX=32,NVP=4,MSHARE=1)
        PARAMETER (NXPMAX=NX/NVP,NBLOK=1+MSHARE*(NVP-1))
        DIMENSION F(NXPMAX,NBLOK), R(NXPMAX, NBLOK), NXP(NBLOK)
        DO 20 K = 1, NBLOK
        DO 10 J = 1, NXP(K)
        F(J,K) = R(J,K)
     10 CONTINUE
     20 CONTINUE
In this example, the range of the inner loop, NXP(K), depends on the value of the outer loop index. In other words, the length of the inner loop is different for each processor. The Cray precompiler can parallelize this, but the IBM compiler cannot. The IBM compiler must have all inner loops of the same length; that is, it has a SIMD view of the world, where all processors execute the same instruction. The Cray precompiler, on the other hand, has a MIMD view of the world, where different processors can execute different instructions. Generally, such differences are resolved by having all processors execute the same instructions and masking out the unneeded iterations with an IF statement. Thus a more portable version of this example is:
        PARAMETER (NX=32,NVP=4,MSHARE=1)
        PARAMETER (NXPMAX=NX/NVP,NBLOK=1+MSHARE*(NVP-1))
        DIMENSION F(NXPMAX,NBLOK), R(NXPMAX, NBLOK), NXP(NBLOK)
        DO 40 K = 1, NBLOK
        DO 30 J = 1, NXPMAX
        IF (J.LE.NXP(K)) THEN
           F(J,K) = R(J,K)
        ENDIF
     30 CONTINUE
     40 CONTINUE
A similar situation occurs if one or more processors need to skip a loop altogether:
        PARAMETER (NX=32,NVP=4,MSHARE=1)
        PARAMETER (NXPMAX=NX/NVP,NBLOK=1+MSHARE*(NVP-1))
        DIMENSION F(NXPMAX,NBLOK), R(NXPMAX, NBLOK), NXP(NBLOK)
        DO 70 K = 1, NBLOK
        IF (NXP(K).LE.NXPMAX/2) THEN
           DO 50 J = 1, NXP(K)
           F(J,K) = R(J,K)
     50    CONTINUE
        ELSE
           DO 60 J = 1, NXP(K)
           F(J,K) = -R(J,K)
     60    CONTINUE
        ENDIF
     70 CONTINUE
The solution is to put all the IF tests inside the inner loop:
        PARAMETER (NX=32,NVP=4,MSHARE=1)
        PARAMETER (NXPMAX=NX/NVP,NBLOK=1+MSHARE*(NVP-1))
        DIMENSION F(NXPMAX,NBLOK), R(NXPMAX, NBLOK), NXP(NBLOK)
        DO 90 K = 1, NBLOK
        DO 80 J = 1, NXPMAX
        IF ((NXP(K).LE.NXPMAX/2).AND.(J.LE.NXP(K))) THEN
           F(J,K) = R(J,K)
        ELSEIF ((NXP(K).GT.NXPMAX/2).AND.(J.LE.NXP(K))) THEN
           F(J,K) = -R(J,K)
        ENDIF
     80 CONTINUE
     90 CONTINUE
This form is more portable, but less efficient on some computers.

VII. SIMD Connection Machine

The algorithms which result from this technique can also be ported to a SIMD computer such as the Connection Machine, but the code itself is not portable. Rather, it must be translated into a new language called cmFortran. One approach which always works in performing the translation is to reverse the order of the inner and outer loops, then replace the loop over the parallel index K with an array construction. In many cases, however, special constructs and intrinsics exist which make the conversion even easier. For purposes of programming, the Connection Machine is treated as a shared memory machine.

As an illustration, the first simple example can be written in cmFortran as follows:

        parameter (nx=32,nvp=4,mshare=1)
        parameter (nxp=nx/nvp,nblok=1+mshare*(nvp-1))
        dimension f(nxp,nblok)
   CMF$ LAYOUT f(:serial,:news)
        koff = 0
        do 10 j = 1, nxp
        f(j,:) = j + nxp*([1:nblok]+koff-1)
     10 continue
where the index K is replaced by a colon to indicate that all elements K are to be operated on and the brackets indicate an array whose elements are the integers from 1 to nblok. The LAYOUT compiler directive declares that the second index in the array f is distributed across processors, and the first index is not. Although this example works, there is an even simpler way to write this loop using the forall construction:
        parameter (nx=32,nvp=4,mshare=1)
        parameter (nxp=nx/nvp,nblok=1+mshare*(nvp-1))
        dimension f(nxp,nblok)
   CMF$ LAYOUT f(:serial,:news)
        koff = 0
        forall (j=1:nxp,k=1:nblok) f(j,k) = j + nxp*(k + koff - 1)
The second example of a summation reduction can be written even more simply, since a summation reduction operator already exists in cmFortran:
        parameter (nx=32,nvp=4,mshare=1)
        parameter (nxp=nx/nvp,nblok=1+mshare*(nvp-1))
        dimension f(nxp,nblok)
   CMF$ LAYOUT f(:serial,:news)
        fsum = sum(f)
The third example using communication can be written simply using the circular shift communication primitive:
        parameter (nx=32,nvp=4,mshare=1)
        parameter (nxp=nx/nvp,nblok=1+mshare*(nvp-1))
        dimension f(nxp,nblok), r(nxp,nblok)
   CMF$ LAYOUT f(:serial,:news), r(:serial,:news)
        r(1,:) = cshift(f(nxp,:),1,shift=1)
        r(1,:) = f(1,:) - r(1,:)
        forall (j=2:nxp,k=1:nblok) r(j,k) = f(j,k) - f(j-1,k)
The example with recursion can be written using the technique of inverting the order of the double loops discussed at the beginning of this section:
        parameter (nx=32,nvp=4,mshare=1)
        parameter (nxp=nx/nvp,nblok=1+mshare*(nvp-1))
        dimension f(nxp,nblok), n(nxp,nblok), l(nblok)
   CMF$ LAYOUT f(:serial,:news), n(:serial,:news), l(:news)
        l = 0
        do 20 j = 1, nxp
        where (f(j,:).gt.bound) l = l + 1
        forall (k=1:nblok,f(j,k).gt.bound) n(l(k),k) = j
     20 continue
Finally, the last example with masks can be written:
        parameter (nx=32,nvp=4,mshare=1)
        parameter (nxpmax=nx/nvp,nblok=1+mshare*(nvp-1))
        dimension f(nxpmax,nblok), r(nxpmax,nblok), nxp(nblok)
   CMF$ LAYOUT f(:serial,:news), r(:serial,:news), nxp(:news)
        forall (j=1:nxpmax,k=1:nblok,
       1(nxp(k).le.nxpmax/2).and.(j.le.nxp(k))) f(j,k) = r(j,k)
        forall (j=1:nxpmax,k=1:nblok,
       1(nxp(k).gt.nxpmax/2).and.(j.le.nxp(k))) f(j,k) = -r(j,k)
On the Connection Machine, it is best to make NVP as large as possible.

VIII. Conclusions

To summarize, the technique described here consists of the following rules. Break up single loops into double loops, where the outer loop is to be run in parallel. The range of the outer loop index K is the number of parallel processes. As much as possible, only allow expressions in which the outer loop index K is the same everywhere. If this is not possible, then the only operation permitted is to copy the data so that subsequent calculations do have the same outer loop index K. To make the code more portable to distributed memory computers, new variables are introduced to control the extent of the outer loop and the value of K. In addition, versions of distributed memory intrinsics are introduced, in some cases dummy ones. The remaining non-portable feature involves communication, where memory copies must be replaced with message-passing subroutines. This process is quite mechanical, and perhaps some day much of it could be done by compilers.

Using these rules we have written a (nearly) portable one dimensional particle-in-cell code, similar to the one described in Ref. [1], using the General Concurrent PIC algorithm described in Ref. [2]. Only three subroutines involved communications and hence had non-portable constructs: an FFT subroutine (in six places), a subroutine for passing particles from one processor to another (in four places), and a subroutine for adding and replicating guard cells (in four places). The code has been run successfully in parallel on the Intel iPSC/860, the Cray C-90, the IBM ES/9000, and a cluster of workstations, and of course it runs sequentially on a serial computer. In fact, this code could even be run on disk-based systems, where communication calls are replaced by reads and writes to disk, but this has not been done.

The value of using shared memory automatic parallelizing compilers for writing parallel code first became evident while porting the Intel version of the code described in Ref. [2] to the IBM ES/9000. During this conversion, it became clear that the structure of the code on shared and distributed memory computers was the same, except that the outermost subscript was explicit on the shared memory computers but implicit (i.e., invisible) on the distributed memory computers. Furthermore, the compilers were very useful in finding mistakes in parallelism, and the code could be debugged serially in a familiar environment. (One just had to trust the compiler that it was safe to run in parallel; one did not actually have to run it in parallel.) Now I have reversed the process, and all new code for distributed memory computers is first written in a shared memory environment with these new compilers. This gets most of the bugs out. Then the code is moved to the distributed memory environment. Only one new kind of bug might be introduced at this point, namely synchronization errors, which I will not discuss here. Another benefit for beginners is that they do not have to struggle with parallel debuggers. For a production code which will run only on a distributed memory computer, one could eliminate the outer loop index at the end.

When writing portable code, one must write for the lowest common denominator, and sometimes this results in performance sacrifices on some machines. Examples include an extra loop index (which is always 1) when running on a distributed memory machine, an extra step in copying data when a compiler could parallelize with the data as is, or IF tests inside the innermost loop when they could have been placed outside that loop on some computers. Experience with the particle-in-cell code shows that such extra overhead in the critical loops was quite small, a few per cent. The subroutine with the largest overhead (40%) was the distributed 1d real-to-complex FFT, but little time was spent in this routine. Whether such overhead is important in some other code depends strongly on what the critical loops in that code are. However, programming time is often a very scarce resource, and one will often accept even substantial inefficiencies in computer time to save programmer time.

When all these rules are followed, the resulting code looks a great deal like cmFortran code written in Fortran 77. Such code is, as one can see from the examples, quite a bit easier to write in cmFortran or Fortran 90 than in Fortran 77. So there is some hope that as Fortran 90 or High Performance Fortran style compilers become more prevalent, parallel programming will get easier in the future.

Acknowledgments

I wish to thank Aeint de Boer and Paul Hoffman of UCLA for help with the IBM parallelizing compiler and the UCLA Office of Academic Computing for providing access to the IBM ES/9000. I wish to acknowledge access to the Cray C-90 provided by the National Research Supercomputer Center at Livermore, California. I further acknowledge that this research was performed in part using the Intel Touchstone Gamma System operated by Caltech on behalf of the Concurrent Supercomputing Consortium, and that access to this facility was provided by NASA/JPL, and I wish to thank Robert Ferraro, Paulette Liewer, and Edith Huang of NASA/JPL for help with that system. Finally, I acknowledge access to the Thinking Machines CM-200 provided by the Advanced Computing Laboratory (LANL) and the consulting support they provided. This work was supported by Sandia National Laboratory and the U. S. Department of Energy.

References

  1. V. K. Decyk and J. E. Slottow, "Supercomputers in the Classroom," Computers in Physics, vol. 3, no. 2, (1989), p. 50.
  2. P. C. Liewer and V. K. Decyk, "A General Concurrent Particle-in-Cell Algorithm," J. Computational Phys. 85, 302 (1989).


Rev. 28 Nov 98;