Appleseed: A Parallel Macintosh Cluster for Numerically Intensive Computing
by
Viktor K. Decyk, Dean E. Dauger, and Pieter R. Kokelaar
Department of Physics and Astronomy
University of California, Los Angeles
Los Angeles, CA 90095-1547
email: decyk, dauger, and pekok@physics.ucla.edu
Abstract
We have constructed a parallel cluster consisting of 4 Apple Macintosh computers running the MacOS, and have achieved very good performance on numerically intensive, parallel plasma particle-in-cell simulations. A partial implementation of the MPI message-passing library was written in Fortran77, based on the Program-to-Program Communications Toolbox in the MacOS. This library enabled us to port code, without modification, from other parallel processors to the Macintosh cluster. For large problems where message packets are large and relatively few in number, performance of over 50 MFlops/node was achieved. This gave a cost effectiveness for the cluster as low as $45/MFlop, depending on the memory requirements.
Introduction
In recent years there has been a growing interest in clustering commodity computers to build inexpensive parallel computers. A number of projects [1-3] have demonstrated that for certain classes of problems, this is a viable approach for cheap, numerically intensive computing. The most common platform for building such a parallel cluster is based on the Pentium processor running the Linux version of Unix. Recently, Apple introduced the Macintosh G3 computer, which uses the Motorola PowerPC 750, a processor that is substantially faster for numerically intensive computing than the Pentium at a comparable price. We decided to investigate whether a cluster based on the Macintosh G3 was practical.
This investigation was initially motivated by the availability of an
excellent optimizing Fortran compiler for the Macintosh produced by the
Absoft Corporation [4]. When we tested our well-benchmarked suite of plasma
particle-in-cell (PIC) simulation codes [5-6] on the Macintosh G3/266,
we obtained impressive single node performance, as shown in Table I. Not
only was the performance nearly twice as fast as the Pentium Pro/200, but
it was comparable to the performance achieved on some of the Crays.
Our PIC codes are used in a number of High-Performance Computing Projects, such as modeling fusion reactors [7] and advanced accelerators [8]. For these projects massively parallel computers are required, such as the 512 node Cray T3E at NERSC. However, code development and student projects can be performed on more modest parallel machines such as the low cost clusters. All of our parallel PIC codes currently use an applications programming interface (API) called MPI (Message Passing Interface) [9]. To our knowledge, MPI is not available for the Macintosh. However, there is a message-passing API available in the MacOS, called the Program-to-Program Communications (PPC) Toolbox [10], which is comparable to MPI in its low-level features. The similarity of the native PPC message-passing facility to MPI further encouraged us to build the Macintosh cluster.
Software Implementation
Although a complete implementation of MPI has many high-level features (such as user defined datatypes and division of processes) not available in PPC, these features were generally not needed by our PIC codes. It was therefore straightforward to write a partial implementation of MPI (34 subroutines) based on PPC, which we call MacMPI. PPC uses the AppleTalk protocol, which can run on Ethernet hardware, but does not require an IP address. The entire library was written in Fortran77, making use of Fortran extensions for creating and referencing pointers available in the Absoft compiler.
The only complicated subroutine was the initialization procedure, MPI_INIT. To initialize the cluster, a file is created which contains a list of n computer names and zones which are participating in the parallel computation. The node which has this file is designated the master node (node 0). The master initiates a peer connection with each of the other participating nodes (1 through n-1), and then passes to node 1 the list of remaining nodes (2 through n-1). Node 1 then establishes a peer connection with them, and passes on the list of remaining nodes (3 through n-1) to node 2, and so on. The last node receives a null list and does not pass it on further. Each node also establishes a connection to itself, as required by MPI. The executable file can be copied to each node and started manually. We have also written a Launch Den Mother utility to automatically copy and launch executables on the nodes. This utility will eventually be made available on the web.
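The chained connection scheme can be illustrated with a short simulation. This is only a sketch in Python, not the MacMPI code (which is Fortran77 on top of PPC): the function name connection_plan and the node names are hypothetical, and the point at which each node opens its self-connection is an assumption, since the text does not specify it.

```python
def connection_plan(nodes):
    """Simulate the order of peer connections in the chained start-up.

    `nodes` is the list of participating computers, master first.
    Returns (initiator, target) pairs in the order they would be opened.
    """
    links = []
    remaining = list(nodes)
    while remaining:
        # The node at the head of the list is the current initiator;
        # the tail is the list of nodes it still has to reach.
        current, remaining = remaining[0], remaining[1:]
        # Assumption: the self-connection is opened first on each node.
        links.append((current, current))
        # Connect to every node left in the list ...
        for peer in remaining:
            links.append((current, peer))
        # ... then hand `remaining` to the next node. The last node
        # receives an empty list and the loop ends.
    return links


if __name__ == "__main__":
    for initiator, target in connection_plan(["node0", "node1", "node2", "node3"]):
        print(initiator, "->", target)
```

Under this scheme every pair of nodes is connected exactly once, so n nodes open n(n-1)/2 peer connections plus n self-connections.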
Once the MacMPI library was implemented, we were able to port the parallel PIC codes from the Cray T3E and IBM SP2 to the Apple Macintosh cluster without modification. This library and related files are available at our web site: http://exodus.physics.ucla.edu/appleseed/appleseed.html
Hardware Implementation
The baseline Macintosh G3 running at 266 MHz currently (August, 1998) costs $1529 at UCLA. This machine is a desktop model with 32 MB RAM, a 4 GB Hard Drive, CD-ROM and Zip-drive. Because the G3 cluster will also be used for visualization, we decided to purchase the tower model for $1790, which has a larger disk drive and video output. We upgraded each Macintosh by adding one 256 MB memory card at a cost of $879 each, so that the total memory of each Macintosh was 288 MB.
For networking, we purchased an Asante PCI Fast Ethernet (100 Mbps) Adapter for each computer at a cost of $95 each. If only two Macs are being clustered, the only additional equipment needed is a Category 5 cross-over cable. We made our own cables, which otherwise would have cost $8 apiece. A hub or switch is required to cluster 4 Macintoshes, and we purchased an Asante 5 port 100BaseT Ethernet Hub for $185, as well as a power strip for $15. For a simple introduction to Macintosh networking, we recommend the Mac OS 8 Bible [11].
Costs for various configurations are summarized in Table II, and the
most expensive version came to $11,256 for a cluster of 4, containing over
1 GB of memory. This cost does not include the monitor. We purchased a
20" Trinitron Apple Display for $1439. In addition, we purchased a KVM
switch with cables made by Black Box Corp. for $502 to enable us to share
the single monitor among the 4 computers. Sharing the monitor was a convenience
while debugging the MacMPI library and manually starting the applications.
We anticipate that a monitor switch will not be necessary during production
runs. Figure 1 shows the current configuration.
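The total for the most expensive configuration can be checked from the prices quoted above. The following Python sketch uses only those prices; the function and constant names are illustrative, and the assumption is that this configuration is four tower models, each with the memory upgrade and a Fast Ethernet adapter, plus one hub and one power strip.

```python
# Prices quoted in the text (US dollars, August 1998).
TOWER_G3 = 1790          # tower model, ships with 32 MB RAM
MEMORY_256MB = 879       # one 256 MB memory card
ETHERNET_ADAPTER = 95    # PCI Fast Ethernet adapter
HUB = 185                # 5 port 100BaseT hub
POWER_STRIP = 15


def cluster_cost(nodes, memory_upgrade=True):
    """Cost of a cluster of tower G3s, excluding monitor and KVM switch."""
    per_node = TOWER_G3 + ETHERNET_ADAPTER
    if memory_upgrade:
        per_node += MEMORY_256MB
    return nodes * per_node + HUB + POWER_STRIP


def cluster_memory_mb(nodes, memory_upgrade=True):
    """Total RAM in MB: 32 MB base per node plus the optional 256 MB card."""
    return nodes * (32 + (256 if memory_upgrade else 0))


if __name__ == "__main__":
    print(cluster_cost(4), cluster_memory_mb(4))
```

With these inputs, cluster_cost(4) evaluates to 11256 and cluster_memory_mb(4) to 1152, consistent with the quoted $11,256 and "over 1 GB of memory".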
Performance
The performance of this cluster was excellent for certain classes of
problems, mainly those where communication was small compared to calculation
and the message packet size was large. Results for the large 3D benchmark
described in Ref. [6] are summarized in Table III. One can see that the
Mac cluster performance was comparable to that achieved by the Cray T3E-900
and the IBM SP2/266 in this case. Indeed, the recent advances in computational
performance are astonishing. A cluster of 4 Macintoshes now has the same
computational power (and twice the memory) as a 4 processor Cray Y-MP,
one of the best supercomputers of 8 years ago, for one thousandth of the
cost!
To determine what packet sizes gave good performance, we developed a
swap benchmark, in which pairs of processors swap packets of equal size.
Bandwidth is defined as twice the packet size divided by the time
to exchange the data. Figure 2 shows a typical curve. As one can see, high
bandwidth is achieved for packet sizes of around 2^15 (32768) words. The best
bandwidth rates achieved on this test are less than one fourth the peak
speed of the 100 Mbps hardware when four nodes are communicating simultaneously.
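The bandwidth definition used by the swap benchmark can be written out as a small calculation. This Python sketch is illustrative only: the function names are hypothetical, and the word size of 4 bytes is an assumption, since the text does not state it.

```python
def swap_bandwidth(packet_words, exchange_seconds, bytes_per_word=4):
    """Bandwidth in bytes per second for one swap.

    Both partners send a packet of the same size, so the data moved is
    twice the packet size. bytes_per_word=4 is an assumption.
    """
    if exchange_seconds <= 0:
        raise ValueError("exchange time must be positive")
    return 2 * packet_words * bytes_per_word / exchange_seconds


def fraction_of_peak(bandwidth_bytes_per_s, link_mbps=100):
    """Ratio of a measured bandwidth to the peak rate of the link."""
    peak_bytes_per_s = link_mbps * 1e6 / 8   # 100 Mbps is 12.5e6 bytes/s
    return bandwidth_bytes_per_s / peak_bytes_per_s


if __name__ == "__main__":
    # The timing below is a made-up value for illustration, not a measurement.
    bw = swap_bandwidth(packet_words=2**15, exchange_seconds=0.1)
    print(bw, fraction_of_peak(bw))
```

On a 100 Mbps link, one fourth of the peak rate corresponds to 3.125e6 bytes per second.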
For the 3D benchmark case described here, the average packet size varied
between 2^13 and 2^17 words, which is right in the middle of the region of
good performance. Benchmarks for smaller problems such as the 2D case discussed
in Ref. [6], did not scale as well, as shown in Table IV, but still gave
good performance. We estimate that the codes are running over 50 MFlops/node,
which gives a cost effectiveness of about $45-55/MFlop for the cluster,
depending on the configuration.
To compile and run a Fortran source code, three additional files are needed: the library MacMPI.f, the include file mpif.h, and a file of participating nodes named nodelist. Creating an executable with the Absoft compiler is straightforward. If a user has a Fortran 77 program called test.f and a subroutine library called testlib.f, the following command will link with MacMPI.f and produce an executable optimized for the G3 architecture:
f77 -O -Q92 test.f testlib.f MacMPI.f
The include file mpif.h must also be present. One can also run the code with automatic double precision, as follows:
f77 -O -N113 -N2 -Q92 test.f testlib.f MacMPI.f
This option was used by our benchmark codes. It is possible to create a makefile either manually or via a graphical interface, although the makefiles differ from the standard Unix style makefiles.
To run a Fortran 90 program, one should compile the Fortran 90 program and MacMPI.f separately, as follows:
f77 -O -Q92 MacMPI.f
f90 -O -604 test.f
f90 -O -604 test.f.o MacMPI.f.o
To set up the Macintosh for parallel processing in MacOS 8.1, one must set the AppleTalk Control Panel to use the Fast Ethernet Adapter and verify in the Chooser that AppleTalk is active. Next, the computer name must be set and Program Linking should be enabled in the File Sharing Control Panel. Finally, in the Users and Groups Control Panel, one must allow Guests to link.
MacMPI requires that the master node have a file called nodelist present in the same directory as the executable. This is a plain text file. The first line contains a port name. If the name ppc_link is used, then the slave nodes do not need to have a nodelist file. (If some other port name is used, then the slave nodes need to have a nodelist file which contains only the port name.) The second line contains the name self. This name is required only if the cluster contains a single processor. Finally, the remaining lines consist of a list of computer names and zones, one name per line, in the form:
computer_name@zone_name
If there is only one zone in the AppleTalk network, the zone names are
omitted. The names cannot have trailing blanks. A sample nodelist
file is shown in Table V.
ppc_link
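The nodelist layout described above can be produced with a short helper. This Python sketch is not part of MacMPI: the function name build_nodelist and the computer and zone names in the example are hypothetical. The default carriage-return line ending reflects the classic Mac OS text file convention, and that MacMPI expects it is an assumption.

```python
def build_nodelist(computers, port_name="ppc_link", zone=None,
                   include_self=False, newline="\r"):
    """Return the text of a nodelist file.

    Line 1 is the port name, an optional second line is the name "self",
    and each remaining line is computer_name or computer_name@zone_name.
    Trailing blanks are stripped, since names cannot have them.
    """
    lines = [port_name]
    if include_self:
        lines.append("self")
    for name in computers:
        # The zone is omitted when the AppleTalk network has only one zone.
        entry = f"{name}@{zone}" if zone else name
        lines.append(entry)
    return "".join(line.rstrip() + newline for line in lines)


if __name__ == "__main__":
    # Hypothetical computer names, shown with Unix newlines for readability.
    print(build_nodelist(["mac_a", "mac_b", "mac_c"], newline="\n"), end="")
```

The resulting text would be saved as nodelist in the same directory as the executable on the master node.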
If automatic starting capabilities are not available, one has to manually copy the executable to each node (via floppy disk or network), and start up (by double clicking) each executable. The master must be started last. The Launch Den Mother utility automates this procedure.
During execution, some errors detected by MacMPI are written to Fortran unit 2, which defaults to a file called FOR002.DAT. This file should be examined if problems occur. Some errors may be due to the fact that our implementation of MPI is only partial. One error log entry is caused by a bug in AppleTalk. This entry says that an Incomplete Read occurred, but the expected and actual data received are the same. The MacMPI library has a workaround for this bug, so this error entry is for informational purposes only.
Future
Our current plans are to extend the cluster to eight machines. Four of them will be dedicated solely to running large production calculations. The other four will be used as general desktop machines during working hours and join the production pool at night. Since two separate networks are used, the two uses can coexist.
Improvement in network performance will improve the execution speed of smaller problems. Discussions with Apple Computer indicated that part of the reason why AppleTalk did not achieve maximum performance with 100 Mbps Ethernet had to do with certain "inefficiencies" in the AppleTalk implementation, which will be fixed in the next release of the MacOS (8.5). Another area of improvement we plan to investigate is to replace the Ethernet hub with a switch. This should remove degradation due to collisions and allow full duplex communication. Finally, in the long term, substantial performance improvement over Fast Ethernet appears possible using the FireWire technology, invented by Apple and used in digital cameras.
The new, inexpensive iMacs recently announced by Apple Computer will have built-in Fast Ethernet, and will only be about 12% slower than the machines we tested. A cluster of iMacs might be particularly attractive for a combination student lab/parallel computer.
Acknowledgments
We wish to acknowledge the useful advice given to us by Myron Krawczuk, Macintosh Consultant, New Jersey, Cliff McCollum, U. Victoria, Canada, Johan Berglund, KTH, Sweden, and Chris Thomas, UCLA. This work is supported by DOE and NSF.
References
[1] D. S. Katz, T. Cwik, B. H. Kwan, J. Z. Lou, P. L. Springer, T. L. Sterling, and P. Wang, "An Assessment of a Beowulf System for a Wide Class of Analysis and Design Software," to appear in Advances in Engineering Software, v. 26(6-9), August 1998. See also http://www-hpc.jpl.nasa.gov/PS/HYGLAC/beowulf.html
[2] Samuel A. Fineberg and Kevin T. Pedretti, "Analysis of 100 Mbps Ethernet for the Whitney Commodity Computing Testbed," NAS Technical Report NAS-97-025, October, 1997. See also http://parallel.nas.nasa.gov/Parallel/Projects/Whitney
[3] M. S. Warren, J. K. Salmon, D. J. Becker, M. P. Goda, T. Sterling, and G. S. Winckelmans, "Pentium Pro inside: I. a treecode at 430 Gigaflops on ASCI Red, II. Price/performance of $50/Mflop on Loki and Hyglac," Supercomputing '97, Los Alamitos, 1997. IEEE Comp. Soc. See also http://cnls.lanl.gov/avalon
[4] See http://www.absoft.com/
[5] V. K. Decyk, "Benchmark Timings with Particle Plasma Simulation Codes," Supercomputer 27, vol V-5, p. 33 (1988).
[6] V. K. Decyk, "Skeleton PIC Codes for Parallel Computers," Computer Physics Communications 87, 87 (1995).
[7] R. D. Sydora, V. K. Decyk, and J. M. Dawson, "Fluctuation-induced heat transport results from a large global 3D toroidal particle simulation model," Plasma Phys. Control. Fusion 38, A281 (1996).
[8] K.-C. Tzen, W. B. Mori, and T. Katsouleas, "Electron Beam Characteristics from Laser-Driven Wave Breaking," Phys. Rev. Lett. 79, 5258 (1997).
[9] M. Snir, S. Otto, S. Huss-Lederman, D. Walker, and J. Dongarra, MPI: The Complete Reference [MIT Press, Cambridge, MA, 1996].
[10] Apple Computer, Inside Macintosh: Interapplication Communication [Addison-Wesley, Reading, MA, 1993], chapter 11.
[11] Lon Poole, MacWorld Mac OS 8 Bible [IDG Books Worldwide, Foster
City, CA, 1997], chapter 17.