Appleseed: A Parallel Macintosh Cluster for Numerically Intensive Computing

by

Viktor K. Decyk, Dean E. Dauger, and Pieter R. Kokelaar

Department of Physics and Astronomy
University of California, Los Angeles
Los Angeles, CA 90095-1547

email: decyk, dauger, and pekok@physics.ucla.edu

Abstract

We have constructed a parallel cluster consisting of 4 Apple Macintosh computers running the MacOS, and have achieved very good performance on numerically intensive, parallel plasma particle-in-cell simulations.  A partial implementation of the MPI message-passing library was written in Fortran77, based on the Program-to-Program Communications Toolbox in the MacOS.  This library enabled us to port code, without modification, from other parallel processors to the Macintosh cluster.  For large problems, where message packets are large and relatively few in number, performance of over 50 MFlops/node was achieved.  This gave a cost effectiveness for the cluster as low as $45/MFlop, depending on the memory requirements.

Introduction

In recent years there has been growing interest in clustering commodity computers to build inexpensive parallel computers. A number of projects [1-3] have demonstrated that, for certain classes of problems, this is a viable approach to cheap, numerically intensive computing. The most common platform for such a parallel cluster is the Pentium processor running the Linux version of Unix. Recently, Apple introduced the Macintosh G3 computer, which uses the Motorola PowerPC 750 processor; for numerically intensive computing, this processor is substantially faster than the Pentium at a comparable price. We decided to investigate whether a cluster based on the Macintosh G3 was practical.

This investigation was initially motivated by the availability of an excellent optimizing Fortran compiler for the Macintosh produced by the Absoft Corporation [4]. When we tested our well-benchmarked suite of plasma particle-in-cell (PIC) simulation codes [5-6] on the Macintosh G3/266, we obtained impressive single-node performance, as shown in Table I. Not only was the G3 nearly twice as fast as the Pentium Pro/200, but its performance was comparable to that achieved on some of the Crays.
 
Table I. 
2D Particle Benchmarks 
 
----------------------------------------- 
The following are times for a 2D particle simulation, using 327,680 particles and a 64x128 mesh for 325 time steps. Push Time is the time to update one particle's position and deposit its charge, for one time step. Loop Time is the total time for running the simulation minus the initialization time. 
 
 
Computer                   Push Time (nsec.)   Loop Time (sec.)
Macintosh G3/266                  1950              213.2
Cray T3E-900, 1 proc.             1970              212.7
Cray Y-MP, 1 proc.                1980              215.0
IBM RS/6000, Model 590            2130              289.2
Intel Pentium Pro/200             3480              376.9
Cray T3D, 1 proc.                 7060              751.8
 
 

Our PIC codes are used in a number of High-Performance Computing projects, such as modeling fusion reactors [7] and advanced accelerators [8]. These projects require massively parallel computers, such as the 512 node Cray T3E at NERSC. However, code development and student projects can be performed on more modest parallel machines, such as low-cost clusters. All of our parallel PIC codes currently use an applications programming interface (API) called MPI (Message Passing Interface) [9]. To our knowledge, MPI is not available for the Macintosh. However, there is a message-passing API available in the MacOS, called the Program-to-Program Communications (PPC) Toolbox [10], which is comparable to MPI in its low-level features. The similarity of the native PPC message-passing facility to MPI further encouraged us to build the Macintosh cluster.

Software Implementation

Although a complete implementation of MPI has many high-level features (such as user-defined datatypes and division of processes) not available in PPC, these features were generally not needed by our PIC codes. It was therefore straightforward to write a partial implementation of MPI (34 subroutines) based on PPC, which we call MacMPI. PPC uses the AppleTalk protocol, which can run on Ethernet hardware but does not require an IP address. The entire library was written in Fortran77, making use of Fortran extensions for creating and referencing pointers available in the Absoft compiler.
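
To illustrate the style of code MacMPI supports, the following is a minimal sketch of a Fortran77 program using only basic point-to-point calls from the standard MPI-1 Fortran bindings (MPI_INIT, MPI_COMM_SIZE, MPI_COMM_RANK, MPI_SEND, MPI_RECV, and MPI_FINALIZE). It is not taken from our PIC codes; the program and message contents are illustrative only, and we assume these routines are among those provided by MacMPI.

      program hello
c minimal sketch: each node sends its rank to node 0, which prints
c the values received.  MPI_COMM_WORLD, MPI_INTEGER, and
c MPI_STATUS_SIZE are defined in the include file mpif.h.
      implicit none
      include 'mpif.h'
      integer ierr, rank, nproc, i, msg
      integer status(MPI_STATUS_SIZE)
      call MPI_INIT(ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD,nproc,ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD,rank,ierr)
      if (rank.eq.0) then
c master (node 0) collects one integer from each of the other nodes
      do 10 i = 1, nproc-1
      call MPI_RECV(msg,1,MPI_INTEGER,i,1,MPI_COMM_WORLD,status,ierr)
      write (*,*) 'received ', msg, ' from node ', i
   10 continue
      else
c each slave node sends its rank to the master
      msg = rank
      call MPI_SEND(msg,1,MPI_INTEGER,0,1,MPI_COMM_WORLD,ierr)
      endif
      call MPI_FINALIZE(ierr)
      stop
      end

Such a program is compiled and run exactly as described in the Using MacMPI section below.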

The only complicated subroutine was the initialization procedure, MPI_INIT. To initialize the cluster, a file is created which contains a list of the n computer names and zones participating in the parallel computation. The node which has this file is designated the master node (node 0). The master initiates a peer connection with each of the other participating nodes (1 through n-1), and then passes to node 1 the list of remaining nodes (2 through n-1). Node 1 then establishes a peer connection with each of them, and passes the list of remaining nodes (3 through n-1) to node 2, and so on. The last node receives a null list and does not pass it on further. Each node also establishes a connection to itself, as required by MPI. The executable file can be copied to each node and started manually. We have also written a Launch Den Mother utility to automatically copy and launch executables on the nodes.  This utility will eventually be made available on the web.
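
The connection ordering used by MPI_INIT can be illustrated with the following toy Fortran77 program, which simply prints the sequence of connections for a hypothetical cluster of n = 4 nodes. It is not MacMPI source code; the actual library opens PPC Toolbox sessions, whereas this sketch only prints the sequence of events.

      program chain
c prints the daisy-chain connection order used by MPI_INIT:
c node k opens peer connections to nodes k+1 through n-1, then passes
c the list of nodes k+2 through n-1 on to node k+1
      implicit none
      integer n, k, j
      parameter (n=4)
      do 20 k = 0, n-1
      do 10 j = k+1, n-1
      write (*,*) 'node ', k, ' opens connection to node ', j
   10 continue
      if (k.lt.n-2) then
         write (*,*) 'node ', k, ' passes list ', k+2, ' ... ', n-1,
     1' to node ', k+1
      else if (k.eq.n-2) then
         write (*,*) 'node ', k, ' passes a null list to node ', k+1
      endif
c every node also opens a connection to itself, as required by MPI
      write (*,*) 'node ', k, ' opens connection to itself'
   20 continue
      stop
      end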

Once the MacMPI library was implemented, we were able to port the parallel PIC codes from the Cray T3E and IBM SP2 to the Apple Macintosh cluster without modification. This library and related files are available at our web site: http://exodus.physics.ucla.edu/appleseed/appleseed.html

Hardware Implementation

The baseline Macintosh G3 running at 266 MHz currently (August, 1998) costs $1529 at UCLA. This machine is a desktop model with 32 MB RAM, a 4 GB hard drive, a CD-ROM drive, and a Zip drive. Because the G3 cluster will also be used for visualization, we decided to purchase the tower model for $1790, which has a larger disk drive and video output. We upgraded each Macintosh by adding one 256 MB memory card at a cost of $879 each, so that the total memory of each Macintosh was 288 MB.

For networking, we purchased an Asante PCI Fast Ethernet (100 Mbps) Adapter for each computer at a cost of $95 each. If only two Macs are being clustered, the only additional equipment needed is a Category 5 cross-over cable. We made our own cables, which otherwise would have cost $8 apiece. A hub or switch is required to cluster 4 Macintoshes, and we purchased an Asante 5 port 100BaseT Ethernet Hub for $185, as well as a power strip for $15. For a simple introduction to Macintosh networking, we recommend the Mac OS 8 Bible [11].

Costs for various configurations are summarized in Table II, and the most expensive version came to $11,256 for a cluster of 4, containing over 1 GB of memory. This cost does not include the monitor. We purchased a 20" Trinitron Apple Display for $1439. In addition, we purchased a KVM switch with cables made by Black Box Corp. for $502 to enable us to share the single monitor among the 4 computers. Sharing the monitor was a convenience while debugging the MacMPI library and manually starting the applications. We anticipate that a monitor switch will not be necessary during production runs. Figure 1 shows the current configuration.
 
Table II. 
Cost of Macintosh G3 cluster for various configurations 
 
 
Desktop cluster                           Tower cluster
2 Macs, 320 MB RAM, 8 GB disk  = $4,261   2 Macs, 576 MB RAM, 12 GB disk  = $5,543
4 Macs, 640 MB RAM, 16 GB disk = $8,692   4 Macs, 1.15 GB RAM, 24 GB disk = $11,256
 
 
Figure 1.  Appleseed: Cluster of 4 Apple Macintosh G3 computers with a single monitor. 

The cluster has two networks running simultaneously. To maximize performance, MacMPI runs its AppleTalk traffic over the Fast Ethernet (100BaseT) connection, which is kept as a private network with no other traffic on it. In addition, the Macintoshes can be connected to the Internet using the built-in Ethernet (10BaseT) running TCP/IP. This gives the cluster access to the outside world and enables importing and exporting files with an ftp program. It is also possible to connect a LocalTalk LaserWriter to the AppleTalk network via a LaserWriter Bridge.

Performance

The performance of this cluster was excellent for certain classes of problems, mainly those where communication was small compared to calculation and the message packet size was large. Results for the large 3D benchmark described in Ref. [6] are summarized in Table III. One can see that the Mac cluster performance was comparable to that achieved by the Cray T3E-900 and the IBM SP2/266 in this case. Indeed, the recent advances in computational performance are astonishing. A cluster of 4 Macintoshes now has the same computational power (and twice the memory) as a 4 processor Cray Y-MP, one of the best supercomputers of 8 years ago, for one thousandth of the cost!
 
Table III. 
3D Particle Benchmarks 
----------------------------------------- 
The following are times for a 3D particle simulation, using 7,962,624 particles and a 64x32x128 mesh for 425 time steps. Push Time is the time to update one particle's position and deposit its charge, for one time step. Loop Time is the total time for running the simulation minus the initialization time. 
 
Computer                       Push Time (nsec.)   Loop Time (sec.)

IBM SP2, w/MPL, 8 proc.               2104              7331.1

Cray T3E-900, w/MPI, 8 proc.          1800              6196.3
Cray T3E-900, w/MPI, 4 proc.          3844             13233.7

Mac G3/266 cluster, 4 proc.           3231             11929.6
Mac G3/266 cluster, 2 proc.           7182             25738.5
 
 

To determine what packet sizes gave good performance, we developed a swap benchmark, in which pairs of processors swap packets of equal size; bandwidth was defined to be twice the packet size divided by the time to exchange the data. Figure 2 shows a typical curve. As one can see, high bandwidth is achieved for packet sizes of around 2^15 (32,768) words. The best bandwidth rates achieved on this test are less than one fourth the peak speed of the 100 Mbps hardware when four nodes are communicating simultaneously.
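
A sketch of such a swap benchmark in Fortran77 is given below. It uses only standard MPI-1 calls and assumes an even number of nodes, paired 0 with 1, 2 with 3, and so on; the packet size of 32,768 words and the use of MPI_WTIME for timing (which we assume is declared in mpif.h and provided by MacMPI) are assumptions for illustration, not a listing of our actual benchmark code.

      program swap
c each node repeatedly exchanges a packet of nwords real words with
c its partner; bandwidth is twice the packet size divided by the
c average time per exchange
      implicit none
      include 'mpif.h'
      integer nwords, nreps
      parameter (nwords=32768,nreps=10)
      real sbuf(nwords), rbuf(nwords)
      integer ierr, rank, nproc, partner, i
      integer status(MPI_STATUS_SIZE)
      double precision t0, t1
      call MPI_INIT(ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD,nproc,ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD,rank,ierr)
c pair the nodes: 0 with 1, 2 with 3, etc.
      partner = rank + 1 - 2*mod(rank,2)
      do 10 i = 1, nwords
      sbuf(i) = real(rank)
   10 continue
      t0 = MPI_WTIME()
      do 20 i = 1, nreps
c even ranks send first and odd ranks receive first, to avoid deadlock
      if (mod(rank,2).eq.0) then
         call MPI_SEND(sbuf,nwords,MPI_REAL,partner,i,MPI_COMM_WORLD,
     1ierr)
         call MPI_RECV(rbuf,nwords,MPI_REAL,partner,i,MPI_COMM_WORLD,
     1status,ierr)
      else
         call MPI_RECV(rbuf,nwords,MPI_REAL,partner,i,MPI_COMM_WORLD,
     1status,ierr)
         call MPI_SEND(sbuf,nwords,MPI_REAL,partner,i,MPI_COMM_WORLD,
     1ierr)
      endif
   20 continue
      t1 = MPI_WTIME()
c report bandwidth in MBytes/sec, counting 4 bytes per real word
      if (rank.eq.0) then
         write (*,*) 'bandwidth (MBytes/sec) = ',
     1 2.0d0*4.0d0*dble(nwords)*dble(nreps)/((t1 - t0)*1.0d6)
      endif
      call MPI_FINALIZE(ierr)
      stop
      end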
 
Figure 2 
Bandwidth (MBytes/sec) for four processors swapping data as a function of packet size. Tests are repeated 10 times. Solid curve is average rate, dashed curves are maximum and minimum rates. 
 

For the 3D benchmark case described here, the average packet size varied between 2^13 and 2^17 words, which is right in the middle of the region of good performance. Benchmarks for smaller problems, such as the 2D case discussed in Ref. [6], did not scale as well, as shown in Table IV, but still gave good performance. We estimate that the codes are running at over 50 MFlops/node, which gives a cost effectiveness of about $45-55/MFlop for the cluster, depending on the configuration.
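
(As a rough consistency check: four nodes at somewhat over 50 MFlops each give an aggregate rate of roughly 200 MFlops, and dividing the four-Macintosh cluster costs from Table II, $8,692 and $11,256, by this rate gives approximately $43/MFlop and $56/MFlop, respectively.)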
 
 
Table IV. 
2D Particle Benchmarks 
----------------------------------------- 
The following are times for a 2D particle simulation, using 3,571,712 particles and a 128x256 mesh for 325 time steps. Parallel codes use Domain Decomposition. Push Time is the time to update one particle's position and deposit its charge, for one time step. Loop Time is the total time for running the simulation minus the initialization time. 
 
 
Computer                       Push Time (nsec.)   Loop Time (sec.)

IBM SP2, w/MPL, 4 proc.                807               942.0

Cray T3E-900, w/MPI, 4 proc.           481               564.7
Cray T3E-900, w/MPI, 2 proc.          1193              1406.0

Mac G3/266 cluster, 4 proc.            595               795.6
Mac G3/266 cluster, 2 proc.           1156              1448.8
Mac G3/266 cluster, 1 proc.           2323              2744.1
 
Using MacMPI

To compile and run a Fortran source code, three additional files are needed: the library MacMPI.f, the include file mpif.h, and a file of participating nodes named nodelist. Creating an executable with the Absoft compiler is straightforward. If a user has a Fortran 77 program called test.f and a subroutine library called testlib.f, the following command will link them with MacMPI.f and produce an executable optimized for the G3 architecture:

f77 -O -Q92 test.f testlib.f MacMPI.f

The include file mpif.h must also be present. One can also run the code with automatic double precision, as follows:

f77 -O -N113 -N2 -Q92 test.f testlib.f MacMPI.f

This option was used by our benchmark codes. It is possible to create a makefile either manually or via a graphical interface, although the makefiles differ from standard Unix-style makefiles.

To run a Fortran 90 program, one should compile the Fortran 90 program and MacMPI.f separately and then link the resulting object files, as follows:

f77 -c -O -Q92 MacMPI.f
f90 -c -O -604 test.f
f90 -O -604 test.f.o MacMPI.f.o

To set up the Macintosh for parallel processing in MacOS 8.1, one must set the AppleTalk Control Panel to use the Fast Ethernet adapter and verify in the Chooser that AppleTalk is active. Next, the computer name must be set and Program Linking enabled in the File Sharing Control Panel. Finally, in the Users and Groups Control Panel, one must allow Guests to link.

MacMPI requires that the master node have a file called nodelist present in the same directory as the executable. This is a plain text file. The first line contains a port name. If the name ppc_link is used, then the slave nodes do not need to have a nodelist file. (If some other port name is used, then the slave nodes need to have a nodelist file which contains only the port name.) The second line contains the name self. This name is required only if the cluster contains a single processor. Finally, the remaining lines consist of a list of computer names and zones, one name per line, in the form:

computer_name@zone_name

If there is only one zone in the AppleTalk network, the zone names are omitted. The names cannot have trailing blanks.  A sample nodelist file is shown in Table V.
 
 

Table V. 
Sample nodelist file 

ppc_link 
self 
uclapic1 
BarbH2@Physics-Plasma Theory 
fried@Physics-Plasma Theory 

 

If the automatic starting capability is not used, one has to copy the executable to each node manually (via floppy disk or network) and start each executable by double clicking it. The master must be started last. The Launch Den Mother utility automates this procedure.

During execution, some errors detected by MacMPI are written to Fortran unit 2, which defaults to a file called FOR002.DAT. This file should be examined if problems occur. Some errors may be due to the fact that our implementation of MPI is only partial. One error log entry is caused by a bug in AppleTalk: it reports that an incomplete read occurred, even though the expected and actual amounts of data received are the same. The MacMPI library has a workaround for this bug, so this entry is for informational purposes only.

Future

Our current plans are to extend the cluster to eight machines.  Four of them will be dedicated solely to running large production calculations.  The other four will be used as general desktop machines during working hours and will join the production pool at night. Since two separate networks are used, the two uses can coexist.

Improvement in network performance will improve the execution speed of smaller problems. Discussions with Apple Computer indicated that part of the reason AppleTalk did not achieve maximum performance with 100 Mbps Ethernet was certain "inefficiencies" in the AppleTalk implementation, which will be fixed in the next release of the MacOS (8.5). Another improvement we plan to investigate is replacing the Ethernet hub with a switch. This should remove degradation due to collisions and allow full-duplex communication. Finally, in the long term, substantial performance improvement over Fast Ethernet appears possible using the FireWire technology, invented by Apple and used in digital cameras.

The new, inexpensive iMacs recently announced by Apple Computer will have built-in Fast Ethernet, and will only be about 12% slower than the machines we tested. A cluster of iMacs might be particularly attractive for a combination student lab/parallel computer.

Acknowledgments

We wish to acknowledge the useful advice given to us by Myron Krawczuk, Macintosh Consultant, New Jersey, Cliff McCollum, U. Victoria, Canada, Johan Berglund, KTH, Sweden, and Chris Thomas, UCLA. This work is supported by DOE and NSF.

References

[1] D. S. Katz, T. Cwik, B. H. Kwan, J. Z. Lou, P. L. Springer, T. L. Sterling, and P. Wang, "An Assessment of a Beowulf System for a Wide Class of Analysis and Design Software," to appear in Advances in Engineering Software, v. 26(6-9), August 1998. See also http://www-hpc.jpl.nasa.gov/PS/HYGLAC/beowulf.html

[2] Samuel A. Fineberg and Kevin T. Pedretti, "Analysis of 100 Mbps Ethernet for the Whitney Commodity Computing Testbed," NAS Technical Report NAS-97-025, October, 1997. See also http://parallel.nas.nasa.gov/Parallel/Projects/Whitney

[3] M. S. Warren, J. K. Salmon, D. J. Becker, M. P. Goda, T. Sterling, and G. S. Winckelmans, "Pentium Pro inside: I. a treecode at 430 Gigaflops on ASCI Red, II. Price/performance of $50/Mflop on Loki and Hyglac," Supercomputing '97, IEEE Computer Society, Los Alamitos, CA, 1997. See also http://cnls.lanl.gov/avalon

[4] See http://www.absoft.com/

[5] V. K. Decyk, "Benchmark Timings with Particle Plasma Simulation Codes," Supercomputer 27, vol V-5, p. 33 (1988).

[6] V. K. Decyk, "Skeleton PIC Codes for Parallel Computers," Computer Physics Communications 87, 87 (1995).

[7] R. D. Sydora, V. K. Decyk, and J. M. Dawson, "Fluctuation-induced heat transport results from a large global 3D toroidal particle simulation model," Plasma Phys. Control. Fusion 38, A281 (1996).

[8] K.-C. Tzeng, W. B. Mori, and T. Katsouleas, "Electron Beam Characteristics from Laser-Driven Wave Breaking," Phys. Rev. Lett. 79, 5258 (1997).

[9] M. Snir, S. Otto, S. Huss-Lederman, D. Walker, and J. Dongarra, MPI: The Complete Reference [MIT Press, Cambridge, MA, 1996].

[10] Apple Computer, Inside Macintosh: Interapplication Communication [Addison-Wesley, Reading, MA, 1993], chapter 11.

[11] Lon Poole, MacWorld Mac OS 8 Bible [IDG Books Worldwide, Foster City, CA, 1997], chapter 17.