Benchmarks of AppleSeed Performance




MegaFlop performance data

Performance data with 4 and 8 processors using 3D code

Performance data with 4 and 8 processors

Performance data with one processor

3D MFlop RISC performance data with one processor

3D MFlop


 


AltiVec Fractal Demo Benchmarks
  The G4/450's of our AppleSeed Macintosh cluster achieved the following performance running the June 2000 version of the AltiVec Fractal Demo IP, computing the default z4 fractal image (single precision) with the Maximum Count set to 65536. This code, compiled with MacMPI_IP.c using Metrowerks CodeWarrior Pro 5, decomposes the problem into interlaced lines, which results in efficient parallelism. It also fills as many bubbles in the instruction pipeline as it can, in order to use the G4's AltiVec unit efficiently:

Number of G4/450's   MFlops without AltiVec   MFlops using AltiVec *
        1                     385                     1583
        2                     771                     3162
        3                    1146                     4682
        4                    1543                     6331
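
The interlaced-line decomposition mentioned above can be sketched as follows. This is only an illustration in Python, not the demo's actual code: the function names, the image height of 480 rows, and the triangular row_cost model are all hypothetical, chosen to show why dealing rows out round-robin tends to balance the load better than giving each node one contiguous band of rows.

```python
def interlaced_rows(rank, num_nodes, height):
    """Rows for node `rank` when image rows are dealt out round-robin."""
    return range(rank, height, num_nodes)


def contiguous_rows(rank, num_nodes, height):
    """One contiguous band of rows per node (shown for comparison only)."""
    start = rank * height // num_nodes
    stop = (rank + 1) * height // num_nodes
    return range(start, stop)


def row_cost(row, height):
    # Hypothetical cost model: rows near the middle of the image need the
    # most iterations, rows near the top and bottom the fewest.
    middle = height // 2
    return 1 + middle - abs(row - middle)


def imbalance(split, num_nodes, height):
    """Busiest node's load divided by the average load (1.0 is perfect)."""
    loads = [
        sum(row_cost(row, height) for row in split(rank, num_nodes, height))
        for rank in range(num_nodes)
    ]
    return max(loads) / (sum(loads) / num_nodes)


if __name__ == "__main__":
    for split in (interlaced_rows, contiguous_rows):
        print(split.__name__, round(imbalance(split, 4, 480), 3))
```

Under this made-up cost model, the interlaced split on 4 nodes comes out balanced, while the contiguous split leaves the busiest node with 1.5 times the average load. A real fractal image has a less regular cost per row, so the actual numbers would differ.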

  We also ran the June 2000 version of the AltiVec Fractal Demo IP on the same problem, using the AltiVec * units of the UCLA Statistics Department's cluster of 16 G4/400's. Increasing the Maximum Count parameter (MaxCount) increases the amount of computation, while the size of the messages remains constant.

Number of G4/400's   MFlops (MaxCount=4096)   MFlops (MaxCount=16384)   MFlops (MaxCount=65536)
        1                    1347                     1418                      1441
        2                    2616                     2819                      2859
        4                    4976                     5571                      5705
        8                    8913                    10782                     11468
       16                   14747                    20253                     22840

For the smallest problem size on 16 nodes, the computation time became as small as the communications time, resulting in inefficient parallelism.
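
The loss of efficiency can be read directly off the table. The short Python sketch below copies the G4/400 figures from the table and computes the parallel efficiency, defined here (our choice of definition, not stated in the original) as the MFlops on N nodes divided by N times the MFlops on one node at the same MaxCount.

```python
# MFlops from the G4/400 table above, keyed by MaxCount and then by node count.
MFLOPS = {
    4096: {1: 1347, 2: 2616, 4: 4976, 8: 8913, 16: 14747},
    16384: {1: 1418, 2: 2819, 4: 5571, 8: 10782, 16: 20253},
    65536: {1: 1441, 2: 2859, 4: 5705, 8: 11468, 16: 22840},
}


def speedup(max_count, nodes):
    """Measured rate on `nodes` nodes relative to one node."""
    return MFLOPS[max_count][nodes] / MFLOPS[max_count][1]


def efficiency(max_count, nodes):
    """Fraction of ideal linear scaling actually achieved."""
    return speedup(max_count, nodes) / nodes


if __name__ == "__main__":
    for max_count in sorted(MFLOPS):
        row = ", ".join(
            f"{n}: {efficiency(max_count, n):.2f}"
            for n in sorted(MFLOPS[max_count])
        )
        print(f"MaxCount={max_count}: {row}")
```

On 16 nodes this gives an efficiency of about 0.68 for MaxCount=4096, about 0.89 for MaxCount=16384, and about 0.99 for MaxCount=65536.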

* This flop performance calculation shows "Honest" MegaFlops in two significant respects:

  1. Because of the nature of the AltiVec instruction set, every floating-point multiply is performed together with a floating-point add. Consequently, plain a*b multiplies must be accompanied by an add to zero. In this calculation, there are 24 floating-point operations (flops) per iteration per pixel, 6 of which are adds to zero. The above benchmarks count only the 18 contributing flops, even though the hardware really performs those 6 extra adds to zero as well. So for every 1000 MFlops you see in the above benchmarks, an extra 333 MFlops of unused adds to zero is also going on.
  2. Vectorization in this code is across pixels, so every four pixels you see are computed by one vector calculation (AltiVec registers hold four floating-point elements). By the nature of this calculation, some elements can finish their useful work before the others, but to finish the remaining elements the code continues to operate on the entire vector. It is therefore possible that a complete four-element vector calculation is performed for the sake of one last element, meaning that the AltiVec hardware does four times the work needed to complete the operation. (It moves on to the next four pixels when the last four are all done, of course.)

     We hope that doesn't happen too often, but just in case, this code has mechanisms to flag the finished elements and tally only the flops actually used to compute the pixels. That is, as soon as an element has finished, any further work on that element goes uncounted, even though the AltiVec hardware continues to crank away. Counting the extra unused flops would make the calculated performance jump by up to a factor of two for some images.
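
The two counting rules above can be illustrated with a small simulation. This is a sketch in plain Python rather than AltiVec code, and it is not the demo's implementation: the iteration z -> z**4 + c, the bailout radius of 2, and all names are assumptions made for illustration. Only the figures of 18 counted flops, 24 issued flops, and four elements per vector come from the text.

```python
USEFUL_FLOPS = 18  # contributing flops per iteration per pixel (from the text)
ISSUED_FLOPS = 24  # the same plus the 6 adds to zero (from the text)
LANES = 4          # floating-point elements per AltiVec register


def iterate_vector(cs, max_count, power=4, bailout=2.0):
    """Iterate four pixels in lockstep and tally flops two ways.

    Returns (counts, useful_flops, issued_flops). The formula and the
    bailout test are assumptions, not taken from the demo's source.
    """
    z = [0j] * LANES
    counts = [0] * LANES
    active = [True] * LANES
    useful = issued = 0
    while any(active):
        # The vector unit works on all four lanes until the last one is done.
        issued += ISSUED_FLOPS * LANES
        for lane in range(LANES):
            if not active[lane]:
                continue  # finished lane: hardware still busy, flops uncounted
            useful += USEFUL_FLOPS
            z[lane] = z[lane] ** power + cs[lane]
            counts[lane] += 1
            if abs(z[lane]) > bailout or counts[lane] >= max_count:
                active[lane] = False
    return counts, useful, issued


if __name__ == "__main__":
    # Three pixels that never escape and one that escapes immediately.
    print(iterate_vector([0j, 0j, 0j, 2 + 2j], 100))
```

When all four lanes run for the same number of iterations, the issued total exceeds the counted total by exactly the 24/18 ratio of rule 1. When one lane outlives the others, the gap widens further, which is the situation rule 2 describes.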




Latest PIC Milestone

   On February 6, we established a new milestone with AppleSeed.  We were able to run a 100 million particle 3D electrostatic PIC simulation on an 8-node cluster of dual-processor Macintosh G4/450's.  The total time was 17.8 seconds per time step, with a grid of 128x128x256.  We used Bedros Afeyan's Polymath 2000 cluster, which has 1 GB of memory per node, since we don't have any machines large enough at UCLA to do the job.  The current cost of such machines is less than $2500/node. Only 5 or 6 years ago, such calculations required the world's largest supercomputers.
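
To put these figures in perspective, the following Python snippet works out a few derived quantities. It assumes that the particles are spread evenly over the 8 nodes and that 1 GB means 2**30 bytes; neither assumption is stated in the original, and the variable names are ours.

```python
particles = 100_000_000
grid = (128, 128, 256)
seconds_per_step = 17.8
nodes = 8
memory_per_node_bytes = 1 * 1024**3  # assumption: 1 GB = 2**30 bytes

cells = grid[0] * grid[1] * grid[2]
particles_per_cell = particles / cells

# Assumption: particles are divided evenly among the nodes.
particles_per_node = particles // nodes

# Wall-clock time of the whole cluster per particle per time step.
ns_per_particle_step = seconds_per_step / particles * 1e9

# Upper bound on storage per particle if a node's memory held nothing else.
bytes_per_particle_ceiling = memory_per_node_bytes / particles_per_node

if __name__ == "__main__":
    print(f"grid cells:            {cells}")
    print(f"particles per cell:    {particles_per_cell:.1f}")
    print(f"particles per node:    {particles_per_node}")
    print(f"ns per particle-step:  {ns_per_particle_step:.0f}")
    print(f"bytes/particle (max):  {bytes_per_particle_ceiling:.1f}")
```

That works out to about 24 particles per grid cell, 12.5 million particles per node, 178 nanoseconds of wall-clock time per particle per time step, and at most about 86 bytes of memory per particle.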
 


Gigabit Performance Results

Comparison of Apple’s Gigabit Ethernet (1000BaseT) adapter with the earlier Fast Ethernet (100BaseT) adapter on two dual-processor G4/450's running OS 9. Measured Bandwidth (MBytes/sec) is for 2 processors connected with a cross-over cable exchanging data as a function of message size.  Results show that the Gigabit Ethernet is more than 3 times faster than Fast Ethernet.

Back to AppleSeed

http://exodus.physics.ucla.edu/appleseed/appleseed.html
 

last update: April 17, 2001