Date: Sun, 6 Aug 2000 12:09:36 -0700
From: Jan de Leeuw <deleeuw@stat.ucla.edu>
Subject: G4/500 MP

I just set up my G4/500 MP, with two processors. The second processor is only used by programs that explicitly take its presence into account (at least on MacOS 8.0 or 9.0). This note shows that doing so is easy with MacMP, a convenient interface to the MacOS multiprocessing libraries developed by Viktor Decyk (Physics, at UCLA).

Let us compute the variance of the first 10^9 integers. And let's do it in a stupid way, with one loop for the sum and another loop for the sum of squares. On one processor this is done by the uni.f or uni.c program, which takes 120.2 seconds on the G4/500. It is written to be as similar as possible to multi.f or multi.c, which uses the second processor to compute the sum and the first one to compute the sum of squares. That takes 69.2 seconds, i.e. 58% of the uni-time (the time is not exactly half, since the two processors are not doing the same amount of work). You can also see, from the code, how simple it is to put MP into your own applications.
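To make the two-loop computation concrete, here is a minimal sketch in C. It is not taken from uni.c; the function name variance_two_loops is hypothetical, and it assumes the integers are 0 .. n-1 and that the population variance (mean of squares minus square of the mean) is wanted.

```c
#include <assert.h>

/* Hypothetical helper, not the contents of uni.c: population variance of
   the integers 0 .. n-1, computed with two separate loops. */
double variance_two_loops(int n)
{
    double sum = 0.0, sumsq = 0.0, mean;
    int i;

    assert(n > 0);
    for (i = 0; i < n; i++)          /* first loop: the sum */
        sum += (double) i;
    for (i = 0; i < n; i++)          /* second loop: the sum of squares */
        sumsq += (double) i * (double) i;

    mean = sum / n;
    return sumsq / n - mean * mean;  /* E[x^2] - (E[x])^2 */
}
```

In the two-processor arrangement described above, each of the two loops would be handed to its own processor; the MacMP calls needed for that are not shown here.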

A better approach is to program a single loop to accumulate both sums, as is done in the program uniadder.f or uniadder.c for a single processor.  For dual processors, one can split the loop into two halves, as in the example multiadder.f or multiadder.c.  This gives almost perfect speedup.  And then, finally, you could use Viktor Decyk's MacMPI to distribute your computations over a network of Macs, such as gSCAD (our cluster of 16 G4/450 computers).
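The idea of splitting the single loop into two halves can be sketched as follows. This is an illustration only, not the code of multiadder.c: the names add_range and variance_split are hypothetical, and the two halves are simply run one after the other here, where the real program would give each half to its own processor through MacMP.

```c
#include <assert.h>

/* Hypothetical helper: sum and sum of squares of start .. end-1,
   accumulated in a single loop. */
void add_range(int start, int end, double *s1, double *s2)
{
    double ls1 = 0.0, ls2 = 0.0;
    int i;

    for (i = start; i < end; i++) {
        ls1 += (double) i;
        ls2 += (double) i * (double) i;
    }
    *s1 = ls1;
    *s2 = ls2;
}

/* Population variance of 0 .. n-1 from two partial results. */
double variance_split(int n)
{
    double a1, a2, b1, b2, mean;
    int half = n / 2;

    assert(n > 0);
    add_range(0, half, &a1, &a2);    /* work for the first processor */
    add_range(half, n, &b1, &b2);    /* work for the second processor */

    mean = (a1 + b1) / n;            /* combine the partial sums */
    return (a2 + b2) / n - mean * mean;
}
```

Each half runs the same number of loop passes (to within one) and does the same work per pass, which is why this split balances the two processors better than giving one the sum and the other the sum of squares.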

A single G4/500 is equivalent, for many tasks, to a 1 GHz Pentium III.

===
Jan de Leeuw; Professor and Chair, UCLA Department of Statistics;
US mail: 8142 Math Sciences Bldg, Box 951554, Los Angeles, CA 90095-1554
phone (310)-825-9550; fax (310)-206-5658; email: deleeuw@stat.ucla.edu
http://www.stat.ucla.edu/~deleeuw and http://home1.gte.net/datamine/
==================================================================
No matter where you go, there you are. --- Buckaroo Banzai
http://www.stat.ucla.edu/sounds/nomatter.au
==================================================================

Further comments from Viktor Decyk:

As Jan has said, the approach of accumulating both sums in a single loop and then splitting the work into pieces can also be used to distribute the computation over a network of Macs using MacMPI, as is shown in the example mpiadder.f or mpiadder.c.  It is also possible to use a network of dual processor Macs by making use of both the MacMPI and MacMP libraries together, as illustrated in the example multimpiadder.f or multimpiadder.c, which achieves nearly four times speedup on two dual processor Macs over the original uniadder example.
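Splitting the work into pieces for several machines comes down to giving each one its own range of the loop. Here is a minimal sketch of such a partition, assuming nproc participants numbered 0 .. nproc-1; the function block_range is hypothetical and is not taken from mpiadder.c, and no MacMPI calls are shown.

```c
#include <assert.h>

/* Hypothetical helper: give participant `rank` (0 .. nproc-1) the half-open
   range [*start, *end) of 0 .. n-1.  The ranges are contiguous, cover
   everything once, and differ in length by at most one. */
void block_range(int n, int nproc, int rank, int *start, int *end)
{
    int base, rem;

    assert(n >= 0 && nproc > 0 && rank >= 0 && rank < nproc);
    base = n / nproc;                /* minimum pieces per participant */
    rem  = n % nproc;                /* the first `rem` ranks get one extra */

    *start = rank * base + (rank < rem ? rank : rem);
    *end   = *start + base + (rank < rem ? 1 : 0);
}
```

For example, n = 10 over 4 participants gives the ranges [0,3), [3,6), [6,8) and [8,10). On a network of dual processor Macs, each machine could take its range this way and then split it in two again for its own processors; the partial sums are added together at the end.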

Warning for C Programmers:  The original uniadder example used a subroutine adder which was defined as follows:

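void adder (int *n, int *start, double *s1, double *s2) {
   int i;
   /* every pass updates the sums through the pointers s1 and s2 */
   for (i=(*start);i<(*n);i++)
   {
      (*s1)+=(double) i;
      (*s2)+=(double) i*i;   /* the cast applies to the first i, so the product is in double */
   }
}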

This ran much slower than the Fortran version:

      subroutine adder(n,start,s1,s2)
      integer n, start
      double precision s1, s2
      integer i
      do 10 i = start, n-1
      s1 = s1 + dble(i)
      s2 = s2 + dble(i)*i
   10 continue
      return
      end

The reason for this is pointer aliasing.  Since the pointers s1 and s2 in the C program could potentially point to the same memory location, the compiler must write each update of *s1 to memory before it updates *s2, and read the values back on every pass through the loop, instead of keeping the two sums in registers.  This results in a lot of needless memory references, since s1 and s2 in fact point to different locations in memory.  Furthermore, when this subroutine was used in the multiadder version with two processors, almost no speedup was observed, because both CPUs were competing very intensively for the data bus.  The cure was very simple: accumulate in two local variables instead of through the two pointers, so the intermediate results do not have to be written to memory, as follows:

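void adder (int *n, int *start, double *s1, double *s2) {
   int i;
   /* local accumulators: nothing else can point at them */
   double ls1 = 0.0, ls2 = 0.0;
   for (i=(*start);i<(*n);i++)
   {
      ls1+=(double) i;
      ls2+=(double) i*i;
   }
   /* write the results through the pointers once, after the loop */
   *s1 = ls1;
   *s2 = ls2;
}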

This version now runs at the same speed as the Fortran version and gives nearly perfect speedup on two processors.  Note that it stores the sums into *s1 and *s2 rather than adding to them, so unlike the original it does not depend on the caller zeroing them first.  Pointer aliasing does not occur in the Fortran version, since the language requires (and the compiler assumes) that s1 and s2 do not refer to the same memory location.  Although C is a more flexible language than Fortran, it sometimes requires more expertise and care to get comparable performance, as this example shows.
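A different route, not used in the examples above, is the restrict qualifier of the 1999 C standard, which lets the programmer make the same no-overlap promise that Fortran makes for its arguments. The sketch below is an assumption about how the original adder could be declared that way; the name adder_restrict is hypothetical, and whether a given compiler then keeps the sums in registers has not been checked here.

```c
#include <assert.h>

/* Hypothetical variant of the original adder.  `restrict` promises the
   compiler that s1 and s2 do not overlap; keeping that promise is the
   caller's job, and the assert only documents it. */
void adder_restrict(const int *n, const int *start,
                    double *restrict s1, double *restrict s2)
{
    int i;

    assert(s1 != s2);
    for (i = *start; i < *n; i++) {
        *s1 += (double) i;               /* adds to *s1, like the original */
        *s2 += (double) i * (double) i;  /* product formed in double */
    }
}
```

As with the original adder, the caller has to set the two sums to zero before the call. Passing the same address for s1 and s2 breaks the promise and leaves the behavior undefined, so the local-variable version remains the safer choice when in doubt.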