Appleseed: A Parallel Macintosh Cluster for Numerically Intensive Computing

by

Viktor K. Decyk, Dean E. Dauger, and Pieter R. Kokelaar

Department of Physics and Astronomy
University of California, Los Angeles
Los Angeles, CA 90095-1547

email: decyk, dauger, and pekok@physics.ucla.edu

Abstract

We have constructed a parallel cluster consisting of 8 Apple Macintosh computers running the MacOS, and have achieved very good performance on numerically intensive, parallel plasma particle-in-cell simulations. A subset of the MPI message-passing library was implemented in Fortran77, based on the Program-to-Program Communications Toolbox in the MacOS. This library enabled us to port code, without modification, from other parallel processors to the Macintosh cluster. For large problems where message packets are large and relatively few in number, performance of over 50 MFlops/node was achieved. This gave a cost effectiveness for the cluster as low as $45/MFlop, depending on the memory requirements. Unlike Unix-based clusters, no special expertise in operating systems is required to build and run the cluster.

Introduction

In recent years there has been a growing interest in clustering commodity computers to build inexpensive parallel computers. A number of projects [1-3] have demonstrated that for certain classes of problems, this is a viable approach for cheap, numerically intensive computing. The most common platform for building such a parallel cluster is based on the Pentium processor running the Linux version of Unix. Recently, Apple introduced the Macintosh G3 computer, which uses the Motorola PowerPC 750 processor; for numerically intensive computing, this processor is substantially faster than Pentiums with comparable clock speeds. We decided to investigate whether a cluster based on the Macintosh G3 was practical.

This investigation was initially motivated by the impressive single node performance we achieved on our well-benchmarked suite of plasma particle-in-cell (PIC) simulation codes [4-5] on the Macintosh G3/266, as shown in Table I. This was due in part to the availability of an excellent optimizing Fortran compiler for the Macintosh produced by the Absoft Corporation [6]. Not only was the performance nearly twice as fast as the Pentium Pro/200, but it was comparable to the performance achieved on some of the Crays.

Table I.

2D Particle Benchmarks

-----------------------------------------
The following are times for a 2D particle simulation, using 327,680 particles and a 64x128 mesh for 325 time steps. Push Time is the time to update one particle's position and deposit its charge, for one time step. Loop Time is the total time for running the simulation minus the initialization time.
-----------------------------------------
Computer                 Push Time     Loop Time

Macintosh G3/300:        1750 nsec.    191.1 sec.
Macintosh G3/266:        1950 nsec.    213.2 sec.
Cray T3E-900, 1 proc:    1970 nsec.    212.7 sec.
Cray Y-MP, 1 proc:       1980 nsec.    215.0 sec.
IBM RS/6000, Model 590:  2130 nsec.    289.2 sec.
iMac/233:                2220 nsec.    243.2 sec.
Intel Pentium II/300:    2610 nsec.    282.1 sec.
Intel Pentium Pro/200:   3480 nsec.    376.9 sec.
Cray T3D, 1 proc:        7060 nsec.    751.8 sec.
 

A further motivation to build the cluster came when we realized that the MacOS had a native message-passing applications programming interface (API), called the Program-to-Program Communications (PPC) Toolbox [7]. It has been there since MacOS 7.0, and is used by AppleShare and Apple Events. We already had many programs written using the Message-Passing Interface (MPI) [8], a common message-passing API used on other high-performance parallel computers. The similarity of the native PPC Toolbox message-passing facility to the low-level features of MPI further encouraged us to build the Macintosh cluster.

Our PIC codes are used in a number of High-Performance Computing Projects, such as modeling fusion reactors [9] and advanced accelerators [10]. These projects require massively parallel computers, such as the 512 node Cray T3E at NERSC. However, it would be very convenient if code development and student projects could be performed on more modest parallel machines, such as low-cost clusters. It is also preferable that the resources of the large computers be devoted only to the large problems which require them.

Software Implementation

Although a complete implementation of MPI has many high-level features (such as user-defined datatypes and division of processes) not available in the PPC Toolbox, these features were generally not needed by our PIC codes. It was therefore straightforward to write a partial implementation of MPI (34 subroutines) based on the PPC Toolbox, which we call MacMPI. The PPC Toolbox currently uses the AppleTalk protocol, which can run on Ethernet hardware but does not require an IP address. The entire MacMPI library was written in Fortran77, making use of Fortran extensions for creating and dereferencing pointers available in the Absoft compiler. A C language interface to MacMPI was also implemented.

The only complicated subroutine was the initialization procedure, MPI_INIT. To initialize the cluster, a nodelist file is read which contains a list of n computer names and zones which are participating in the parallel computation. The node which has this file is designated the master node (node 0). The master initiates a peer connection with each of the other participating nodes (1 through n-1), and then passes to node 1 the list of remaining nodes (2 through n-1). Node 1 then establishes a peer connection with them, and passes on the list of remaining nodes (3 through n-1) to node 2, and so on. The last node receives a null list and does not pass it on further. Each node also establishes a connection to itself, as required by MPI. The executable file can be copied to each node and started manually. We have also written a utility called Launch Den Mother to automatically create the nodelist file and copy and launch executables on the nodes.
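
The connection cascade can be illustrated with the following short Fortran 77 sketch. It is purely illustrative and is not part of MacMPI: it makes no PPC Toolbox calls, assumes a cluster of n = 8 nodes, and simply prints the connections each node opens and the portion of the nodelist it forwards to the next node.

      program cascade
c illustrative sketch of the connection cascade performed by MPI_INIT;
c no PPC Toolbox calls are made; a cluster of n = 8 nodes is assumed
      implicit none
      integer n, i, j
      parameter(n=8)
      do 20 i = 0, n - 1
c each node opens a connection to itself, as required by MPI
         write (*,*) 'node', i, ': connect to self'
c node i opens peer connections to nodes i+1 through n-1
         do 10 j = i + 1, n - 1
            write (*,*) 'node', i, ': connect to node', j
   10    continue
c node i then forwards the sub-list (i+2 through n-1) to node i+1;
c the last node receives a null list and forwards nothing
         if (i .lt. n-1) write (*,*) 'node', i, ': forward nodes',
     1   i + 2, 'through', n - 1, 'to node', i + 1
   20 continue
      stop
      end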

Once the MacMPI library was implemented, we were able to port the parallel PIC codes from the Cray T3E and IBM SP2 to the Apple Macintosh cluster without modification. This library and related files and utilities are available at our web site: http://exodus.physics.ucla.edu/appleseed/appleseed.html. For a simple introduction to MPI programming, we recommend Using MPI [11].

Hardware Implementation

The baseline Macintosh G3 running at 266 MHz currently (August, 1998) costs $1439 at UCLA. This machine is a desktop model with 32 MB RAM, a 4 GB hard drive, a CD-ROM drive, and a Zip drive. Because the G3 cluster will be used for multiple purposes, including visualization, we decided to purchase the tower model for $1790, which is more expandable and has a larger disk drive and video output. We upgraded each Macintosh by adding a 256 MB memory card at a cost of $879, so that the total memory of each Macintosh was 288 MB.

The Macintosh G3 comes with a built-in 10BaseT Ethernet adapter. Although it is possible to run MacMPI with this adapter, 100BaseT PCI Fast Ethernet adapters are much faster and have become very inexpensive. We purchased adapters from a number of different vendors and ran tests representative of the kinds of communications used in our codes. The results (Figure 1) showed that Asanté gave the best performance, with Farallon a close second. Each Asanté adapter cost $95, so that the total cost of each Macintosh in our tower configuration was $2764.


Figure 1. Performance of Fast Ethernet Adapters, with 2 nodes and a cross-over cable
 

If only two Macs are being clustered, the only additional equipment needed is a Category 5 cross-over cable. We made our own cables, which otherwise would have cost $8 apiece. A hub or switch is required to cluster more than 2 Macintoshes. To cluster 4 Macs, we purchased an Asanté 5 port 100BaseT Ethernet Hub (model FH100TX5) for $185, as well as a power strip for $15. To cluster 8 Macs, we used an Asanté 8 port 100BaseT Ethernet Hub (model FH100TX8) for $319, and two power strips. For a simple introduction to Macintosh networking, we recommend the Mac OS 8 Bible [12].

Costs for various configurations are summarized in Table II, and the most expensive version came to $22,461 for a cluster of 8, containing over 2 GB of memory, and nearly 50 GB of disk space. This cost does not include the monitor. We purchased a 20" Trinitron Apple Display whose current cost is $1349. In addition, we purchased a KVM switch with cables made by Black Box Corp. for $502 to enable us to share a single monitor among 4 computers. Sharing the monitor was a convenience while debugging the MacMPI library and manually starting the applications. We anticipate that a monitor switch will not be necessary during production runs. Figure 2 shows a configuration with 8 computers.

Table II.

Cost of Macintosh G3/266 cluster for various configurations
(Prices as of August, 1998)

-----------------------------------------
Desktop cluster
2 Macs, 320 MB RAM, 8 GB disk =    $4,241
4 Macs, 640 MB RAM, 16 GB disk =   $8,652
8 Macs, 1.28 GB RAM, 32 GB disk = $17,253
-----------------------------------------
Tower cluster
2 Macs, 576 MB RAM, 12 GB disk =   $5,543
4 Macs, 1.15 GB RAM, 24 GB disk = $11,256
8 Macs, 2.30 GB RAM, 48 GB disk = $22,461
-----------------------------------------
 
 


Figure 2: Appleseed cluster of 8 Apple Macintosh G3 computers with an iMac.
 

The cluster has two networks running simultaneously. MacMPI uses AppleTalk with Fast Ethernet (100BaseT). This network has no other nodes on it, in order to maximize performance and enhance security. In addition, the Macintoshes can be connected to the Internet using the built-in Ethernet (10BaseT) running TCP/IP. This gives the cluster access to the outside world and enables importing and exporting files using an ftp program. It is also possible to connect a LocalTalk LaserWriter to the AppleTalk network via a LaserWriter Bridge. Note that we could have built this cluster from any PowerPC Macintoshes. We chose the G3 only because of its superior performance.

Performance

The performance of this cluster was excellent for certain classes of problems, mainly those where communication was small compared to calculation and the message packet size was large. Results for the large 3D benchmark described in Ref. [5] are summarized in Table III. One can see that the Mac cluster performance was comparable to that achieved by the Cray T3E-900 and the IBM SP2/266 in this case. Indeed, the recent advances in computational performance are astonishing. A cluster of 4 Macintoshes now has the same computational power (and twice the memory) as a 4 processor Cray Y-MP, one of the best supercomputers of 8 years ago, for less than one thousandth of the cost!

Table III.

3D Particle Benchmarks

-----------------------------------------
The following are times for a 3D particle simulation, using 7,962,624 particles and a 64x32x128 mesh for 425 time steps. Push Time is the time to update one particle's position and deposit its charge, for one time step. Loop Time is the total time for running the simulation minus the initialization time.
-----------------------------------------
Computer                      Push Time     Loop Time

Mac G3/266 cluster, 8 proc:   1496 nsec.     5891.2 sec.
Mac G3/266 cluster, 4 proc:   3231 nsec.    11929.6 sec.
Mac G3/266 cluster, 2 proc:   7182 nsec.    25738.5 sec.
-----------------------------------------
Cray T3E-900, w/MPI, 8 proc:  1800 nsec.     6196.3 sec.
Cray T3E-900, w/MPI, 4 proc:  3844 nsec.    13233.7 sec.
-----------------------------------------
IBM SP2, w/MPL, 8 proc:       2104 nsec.     7331.1 sec.
-----------------------------------------
 

To determine what packet sizes gave good performance, we developed a swap benchmark, in which pairs of processors swap packets of equal size, and defined the bandwidth to be twice the packet size divided by the time to exchange the data. Figure 3 shows a typical curve. As one can see, high bandwidth is achieved for packet sizes of around 2^15 (32768) words. The best bandwidth rates achieved in this test are less than 20% of the peak speed of the 100 Mbps hardware. With 4 nodes, we have done additional tests comparing the 5 port Asanté Ethernet Hub with a 4 port Asanté Fast Ethernet Switch (model FS4004DS). Surprisingly, the performance with the Hub and the Switch was essentially the same.
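
The essence of the swap test can be sketched in Fortran 77 as follows. This is not the actual benchmark code: it assumes that MPI_SENDRECV and MPI_WTIME are among the routines provided by the MPI library in use, that the number of nodes is even, and that default 4-byte reals are used.

      program swap
c sketch of the swap benchmark (not the actual benchmark code): pairs
c of nodes exchange equal-size packets, and the bandwidth is twice the
c packet size divided by the exchange time; MPI_SENDRECV and MPI_WTIME
c are assumed to be available, with default 4-byte reals
      implicit none
      include 'mpif.h'
      integer nwords, ntimes
      parameter(nwords=32768,ntimes=100)
      real sbuf(nwords), rbuf(nwords)
      integer ierr, rank, i, partner, stat(MPI_STATUS_SIZE)
      double precision t0, t1, bw
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD,rank,ierr)
c pair node 0 with node 1, node 2 with node 3, and so on
      partner = rank + 1 - 2*mod(rank,2)
      do 10 i = 1, nwords
         sbuf(i) = real(rank)
   10 continue
      t0 = MPI_WTIME()
      do 20 i = 1, ntimes
         call MPI_SENDRECV(sbuf,nwords,MPI_REAL,partner,0,rbuf,
     1   nwords,MPI_REAL,partner,0,MPI_COMM_WORLD,stat,ierr)
   20 continue
      t1 = MPI_WTIME()
c bandwidth in MBytes/sec: two packets of 4-byte words per exchange
      bw = 2.0d0*4.0d0*dble(nwords)*dble(ntimes)/((t1 - t0)*1.0d6)
      write (*,*) 'node', rank, ': bandwidth (MBytes/sec) =', bw
      call MPI_FINALIZE(ierr)
      stop
      end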


Figure 3: Bandwidth (MBytes/sec) for 2-8 processors swapping data simultaneously as a function of packet size. With 2 nodes, only a cross-over cable is used. With 4-8 nodes, a Fast Ethernet Hub is used. Peak performance with 8 nodes is about 60% of that with 2 nodes.
 

For the 3D benchmark case described in Table III, the average packet size varied between 2^13 and 2^17 words, which is right in the middle of the region of good performance.

Benchmarks for smaller problems, such as the 2D case discussed in Ref. [5], did not scale as well, as shown in Table IV, but still gave good performance.

Table IV.

2D Particle Benchmarks

-----------------------------------------
The following are times for a 2D particle simulation, using 3,571,712 particles and a 128x256 mesh for 325 time steps. Parallel codes use Domain Decomposition. Push Time is the time to update one particle's position and deposit its charge, for one time step. Loop Time is the total time for running the simulation minus the initialization time.
-----------------------------------------
Computer                      Push Time     Loop Time

Cray T3E-900, w/MPI, 8 proc:   185 nsec.     218.6 sec.
Cray T3E-900, w/MPI, 4 proc:   481 nsec.     564.7 sec.
Cray T3E-900, w/MPI, 2 proc:  1193 nsec.    1406.0 sec.
-----------------------------------------
Mac G3/266 cluster, 8 proc:   325 nsec.     502.7 sec.
Mac G3/266 cluster, 4 proc:   595 nsec.     795.6 sec.
Mac G3/266 cluster, 2 proc:  1156 nsec.    1448.8 sec.
Mac G3/266 cluster, 1 proc:  2323 nsec.    2744.1 sec.
-----------------------------------------
IBM SP2, w/MPL, 8 proc:       356 nsec.     423.8 sec.
IBM SP2, w/MPL, 4 proc:       807 nsec.     942.0 sec.
-----------------------------------------
 
 

We estimate that the codes are running over 50 MFlops/node, which gives a cost effectiveness of about $45-55/MFlop for the cluster, depending on the memory configuration (prices as of August, 1998).
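
(For example, at 50 MFlops/node an 8 node cluster delivers about 400 MFlops; dividing the 8 Mac totals in Table II by this rate gives roughly $43/MFlop for the desktop configuration and $56/MFlop for the tower configuration.)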

Using MacMPI

To compile and run a Fortran source code, two additional files are needed, the library MacMPI.f and the include file mpif.h. Creating an executable with the Absoft compiler is straightforward. If a user has a Fortran 77 program called test.f and a subroutine library called testlib.f, the following command will link with MacMPI.f and produce an executable optimized for the G3 architecture:

f77 -O -Q92 test.f testlib.f MacMPI.f

The include file mpif.h must also be present. One can also run the code with automatic double precision, as follows:

f77 -O -N113 -N2 -Q92 test.f testlib.f MacMPI.f

This option was used by our benchmark codes. It is possible to create a makefile either manually or via a graphical interface, although these makefiles differ from standard Unix-style makefiles.
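
As an illustration, a minimal test.f (a sketch, not part of our benchmark suite) might look like the following; it uses only routines defined by the MPI standard, which we assume are among those provided by MacMPI:

      program test
c minimal sketch of an MPI program for use with MacMPI: node 0 sends
c an integer to node 1, which prints it
      implicit none
      include 'mpif.h'
      integer ierr, rank, nproc, msg, stat(MPI_STATUS_SIZE)
c MPI_INIT sets up the cluster connections
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD,rank,ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD,nproc,ierr)
      write (*,*) 'hello from node', rank, ' of', nproc
      if ((rank.eq.0).and.(nproc.gt.1)) then
         msg = 1998
         call MPI_SEND(msg,1,MPI_INTEGER,1,0,MPI_COMM_WORLD,ierr)
      else if (rank.eq.1) then
         call MPI_RECV(msg,1,MPI_INTEGER,0,0,MPI_COMM_WORLD,stat,ierr)
         write (*,*) 'node 1 received', msg
      endif
      call MPI_FINALIZE(ierr)
      stop
      end

Such a program could be compiled with the first command shown above, omitting testlib.f if no separate subroutine library is needed.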

To run a Fortran 90 program, one should compile the Fortran 90 program and MacMPI.f separately, as follows:

f77 -O -Q92 -c MacMPI.f
f90 -O -604 -c test.f
f90 -O -604 test.f.o MacMPI.f.o

We have written a C language interface to MacMPI.f called MacMPI.c, and an associated include file mpi.h. To compile and run a C program with the Absoft compiler, the C and Fortran files should be compiled separately, as follows:

f77 -O -Q92 -c MacMPI.f
acc -O -A -N92 -c MacMPI.c
acc -O -A -N92 test.c MacMPI.c.o MacMPI.f.o

To set up the Macintosh for parallel processing in MacOS 8.1, one must set the AppleTalk Control Panel to use the Fast Ethernet Adapter and verify in the Chooser that AppleTalk is active. Next, the computer name must be set and Program Linking should be enabled in the File Sharing Control Panel. Finally, in the Users and Groups Control Panel, one must allow Guests to link. In addition, it is strongly recommended that the sleep time in the Energy Saver Control Panel be set to Never (although it is OK to let the monitor go to sleep). A running Fortran program may appear to be inactive to the MacOS, and if the Mac is put to sleep, it will suddenly start to run very slowly.

Launch Den Mother

A parallel application can be started either manually or automatically. A utility called Launch Den Mother (and associated Launch Puppies) has been written to automate the procedure of selecting remote computers, copying the executable and associated input files, and starting the parallel application on each computer.

Before running Launch Den Mother, each participating Macintosh must have the Launch Puppy utility located in a folder called AppleSeed, which must reside in the top directory of the startup disk. In addition, the AutoGuest INIT Extension available from Apple Computer [13] must be installed in the Extensions folder inside the System Folder on each participating Macintosh. AutoGuest permits the Finder of each computer to start the Launch Puppy, if guest link access has been selected in the Users and Groups Control Panel, without asking for further verification from the owner of each machine. Launch Den Mother requires MacOS 8.0 or later to run.

The Launch Den Mother utility needs to reside only on the computers which will be initiating a parallel application, although we normally install it on all the computers. After starting Launch Den Mother, a single dialog box appears, as shown in Figure 4.

Figure 4. Launch Den Mother utility dialog box.

First one selects the application to run. In the upper left-hand corner of the dialog box appears a list of files, from which one selects the files to be copied to the other Macs and executed. The list of selected files is displayed in the lower left-hand corner of the dialog box. In Figure 4, one can see that from the Erica folder we have chosen two files: an executable file called lattice.out and an input file called input.lattice, which is needed by lattice.out. Only one executable can be selected.

Then one selects the computers to run on. In the upper right-hand corner of the dialog box appears a list of available Macintoshes whose owners have permitted Program Linking in the File Sharing Control Panel. One selects from this list the computers one wishes to run on. In Figure 4, three computers from the Local Zone have been selected for execution: Dawson, uclapic5, and uclapic6. Five other computers were available, but four of them (uclapic1-4) were already busy running a parallel job. It is not required that the user's computer be one of those selected for running the parallel application.

Once the application and computers have been selected, one clicks on Transfer Files. The files are then copied to the AppleSeed folder on each computer and each application is started up. MacMPI controls any further communication between nodes. That's all there is to it.

The Launch Den Mother works by sending an Apple Event to the Finder on each remote Macintosh, requesting that the Launch Puppy be started up. The Launch Puppy must be in the AppleSeed folder so that the Finder can find it. This remote Apple Event requires MacOS 8.0 to work properly. Once all the remote Launch Puppies are started up, the Den Mother sends the requested files to each Puppy, which then copies them to the AppleSeed folder on the local Macintosh. When the files have been successfully copied, the Puppy starts the parallel application, sends a message to the Den Mother, and the Den Mother tells the Puppy to quit. After all the Puppies have been told to quit, the Den Mother herself quits, and MacMPI takes over.

If the user selected his or her own machine to participate, that machine becomes the master node (node 0 in MPI). Otherwise, the first node on the list becomes a remote master, and the user's Launch Den Mother starts up and passes control to a remote Launch Den Mother. Further details about the Launch Den Mother utility can be found in the README documentation available with the distribution on our web site.

During execution, some errors detected by MacMPI are written to Fortran unit 2, which defaults to a file called FOR002.DAT. This file should be examined if problems occur. Some errors may be due to the fact that our implementation of MPI is only partial. One error log entry is caused by a bug in AppleTalk: it reports that an Incomplete Read occurred, even though the expected and actual amounts of data received are the same. The MacMPI library has a workaround for this bug, so this entry is for informational purposes only. Most MPI errors are fatal and will cause the program to halt.

After execution, there are usually output files created by the application. In most of our applications, only the master node produces any output. All the other nodes which have output data send it to the master node using MPI. Since the master node is usually either on the user's desk or in a common area available to everyone, there is no difficulty accessing these files. However, there may be output files in the AppleSeed folder on a remote computer. These can be accessed using AppleShare (via the Chooser), if the owner of the remote computer allows it. To allow access, the owner first needs to turn File Sharing on in the File Sharing Control Panel and allow guests to connect to the computer in the Users & Groups Control Panel. To allow read-only access to the AppleSeed folder while disallowing access to all other files, one selects the AppleSeed folder, chooses Sharing in the File menu, then selects Share this item and allows Read Only access to Everyone.
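
The send-to-master pattern described above can be sketched in Fortran 77 as follows. This fragment is not taken from our codes: the array result, its size, and the file name output.dat are placeholders, and only standard MPI point-to-point routines are used.

      program output
c sketch of collecting output on the master node: every other node
c sends its local result array to node 0, which alone writes the file;
c the array result and the file name output.dat are placeholders
      implicit none
      include 'mpif.h'
      integer nx
      parameter(nx=64)
      real result(nx)
      integer ierr, rank, nproc, node, i, stat(MPI_STATUS_SIZE)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD,rank,ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD,nproc,ierr)
c each node computes a local result (here just filled with its rank)
      do 10 i = 1, nx
         result(i) = real(rank)
   10 continue
      if (rank.gt.0) then
c non-master nodes send their data to node 0
         call MPI_SEND(result,nx,MPI_REAL,0,1,MPI_COMM_WORLD,ierr)
      else
c the master writes its own data, then receives and writes the rest
         open (unit=10,file='output.dat',form='formatted')
         write (10,*) (result(i),i=1,nx)
         do 20 node = 1, nproc - 1
            call MPI_RECV(result,nx,MPI_REAL,node,1,MPI_COMM_WORLD,
     1      stat,ierr)
            write (10,*) (result(i),i=1,nx)
   20    continue
         close (10)
      endif
      call MPI_FINALIZE(ierr)
      stop
      end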

Manual execution

Parallel applications can also be started manually. This is necessary if the Macintoshes are running a system earlier than MacOS 8.0 on older PowerPCs.

MacMPI requires that the master node have a file called nodelist present in the same directory as the executable. If the parallel job is started manually, this file must be created by the user. (The Launch Den Mother utility creates this file automatically.) This is a plain text file. The first line contains a port name. If the name ppc_link is used, then the slave nodes do not need to have a nodelist file. (If some other port name is used, then the slave nodes need a nodelist file which contains only the port name.) The second line contains the name self. This name is required only if the cluster contains a single processor. Finally, the remaining lines consist of a list of computer names and zones, one name per line, in the form:

computer_name@zone_name

If there is only one zone in the AppleTalk network, the zone names are omitted. The names cannot have trailing blanks. A sample nodelist file is shown in Table V.

Table V.

Sample nodelist file

ppc_link
self
uclapic1
BarbH2@Physics-Plasma Theory
fried@Physics-Plasma Theory
 

To start the parallel job manually, one has to copy the executable to each node (via floppy disk, AppleShare or ftp), and start up (by double clicking) each executable. The master must be started last.

Evaluation

The inexpensive, powerful cluster of Macintosh G3s has become a valuable addition to our research group. It is especially useful for running large calculations for extended periods. We have routinely run simulations on 4 nodes for 100 hours at a time, using 1 GByte of memory. This is especially useful for unfunded research, for student or exploratory projects, or when meeting short deadlines. The turnaround time for such jobs is often shorter than at supercomputer centers with more powerful computers, because we do not have to share this resource with the entire country. (Some problems, however, can only run at the supercomputer centers because they are too large for the Macintosh cluster.)

The presence of the cluster has encouraged students and visitors to learn how to write portable, parallel MPI programs, which they can run later on larger computers elsewhere. In fact, since Fast Ethernet is slow compared to the networks used by large parallel computers, our students are encouraged to develop better, more efficient algorithms that use less communication. Later, when they move the code to a larger parallel computer, the code scales very well with large numbers of processors.

Our current configuration has 4 machines located in one room, with a single monitor and monitor switch. These 4 Macs are a common resource and are available for computing all the time. The other 4 are located in various offices and are available when their owners allow it, typically nights and weekends. The 4 nodes which are located together are ideal for debugging code and for long, extended calculations. The other 4 machines are useful for shorter, overnight calculations. It is also possible to use all 8 nodes in a single calculation, although the longest period they are all available is the weekend (about 60 hours).

Because the cluster is used only by a small research group, we do not need sophisticated job management or scheduling tools (which may not even exist). Everything is done in the spirit of cooperation, and so far that has worked. (We don't like to have uncooperative people in our group, anyway!)

Why are we using the MacOS? Why not run Linux (a free Unix) on the Macs, for example? One reason is that we have always been Macintosh users and are very productive in the MacOS environment. There are good third party mathematical and numerical software packages, such as Mathematica, which run better on the Macintosh G3 than on our Unix workstations. Another reason is that many of the Macs are used for purposes other than numerical calculations and rely on software written for the MacOS. Furthermore, we find that the Mac environment makes it very easy to couple the output of our numerical codes to other software written for the MacOS, such as Fortner's graphics packages, Mathematica, or QuickTime, or to programs we use for presentation, such as ClarisWorks or Microsoft Word. Finally, the MacOS has encouraged us to write software to a higher standard, with more of a Mac "look and feel" (such as the Launch Den Mother).

Linux, in comparison, is far more difficult for the novice to use than the Mac. Installing a parallel cluster of machines using Linux is not trivial, and virtually everyone we know who has done this has had problems, many lasting months. It requires substantial Unix expertise to correctly install, maintain, and run a Unix cluster. With the Mac cluster, the only non-standard item required is a single library contained in a single file, MacMPI. Everything else works right out of the box: just plug it in and connect it. Although students who become Unix experts can go on to make lots of money, they often stop doing physics. And we are a physics department, after all, where physics research is our primary focus.

What are the problem areas? One area is network communications. The AppleTalk network currently gives performance no better than 20% of the peak of the Fast Ethernet hardware. This performance is adequate for many of the large problems we run, but it limits the range and types of problems that can be run on the cluster. Discussions with Apple Computer have indicated that the performance may improve significantly with the next release of the MacOS (8.5). If the performance does not improve, we will probably rewrite MacMPI to use Open Transport rather than the PPC Toolbox. We should also gain some modest performance improvement by replacing the Ethernet Hub with a Switch. Finally, other technologies for fast communication, such as FireWire (invented by Apple and used in digital cameras), also look promising and are becoming inexpensive, and we will look into them as they mature.

Another problem area is Fortran run-time errors generated by applications running on remote machines, especially machines located behind closed doors. Some run-time errors abort gracefully but require user interaction to terminate, which may not be possible on a remote machine. Others are less graceful and hang the machine. We do not have any good solutions for these problems, except to encourage students to debug their codes on the 4 nodes which are located together and to run on remote machines only after the code has been tested. Nevertheless, unexpected errors still happen, especially with student-written software.

The future continues to look bright. The new, inexpensive iMacs have built-in Fast Ethernet and are only about 12% slower than the machines we tested, while costing $335 less per machine. Faster G3s are now available, and even more powerful G4s are expected, some with Motorola AltiVec vector processors. Computational power continues to improve rapidly, and the Macintosh has finally become an interesting and attractive platform for numerically intensive computing.

Acknowledgements

We wish to acknowledge the useful advice given to us by Myron Krawczuk, Macintosh Consultant, New Jersey; Cliff McCollum, U. Victoria, Canada; Johan Berglund, KTH, Sweden; and Chris Thomas, Pete Nielsen, and Paul Hoffman, UCLA. This work is supported by NSF contracts DMS-9722121 and PHY 93-19198 and DOE contracts DE-FG03-98DP00211, DE-FG03-97ER25344, DE-FG03-86ER53225, and DE-FG03-92ER40727.

References

[1] D. S. Katz, T. Cwik, B. H. Kwan, J. Z. Lou, P. L. Springer, T. L. Sterling, and P. Wang, "An Assessment of a Beowulf System for a Wide Class of Analysis and Design Software," to appear in Advances in Engineering Software, v. 26(6-9), August 1998. See also http://www-hpc.jpl.nasa.gov/PS/HYGLAC/beowulf.html

[2] Samuel A. Fineberg and Kevin T. Pedretti, "Analysis of 100 Mbps Ethernet for the Whitney Commodity Computing Testbed," NAS Technical Report NAS-97-025, October, 1997. See also http://parallel.nas.nasa.gov/Parallel/Projects/Whitney

[3] M. S. Warren, J. K. Salmon, D. J. Becker, M. P. Goda, T. Sterling, and G. S. Winckelmans. "Pentium Pro inside: I. a treecode at 430 Gigaflops on ASCI Red, II. Price/performance of $50/Mflop on Loki and Hyglac", Supercomputing '97, Los Alamitos, 1997. IEEE Comp. Soc. See also http://cnls.lanl.gov/avalon

[4] V. K. Decyk, "Benchmark Timings with Particle Plasma Simulation Codes," Supercomputer 27, vol V-5, p. 33 (1988).

[5] V. K. Decyk, "Skeleton PIC Codes for Parallel Computers," Computer Physics Communications 87, 87 (1995).

[6] See http://www.absoft.com/

[7] Apple Computer, Inside Macintosh: Interapplication Communication [Addison-Wesley, Reading, MA, 1993], chapter 11.

[8] M. Snir, S. Otto, S. Huss-Lederman, D. Walker, and J. Dongarra, MPI: The Complete Reference [MIT Press, Cambridge, MA, 1996].

[9] R. D. Sydora, V. K. Decyk, and J. M. Dawson, "Fluctuation-induced heat transport results from a large global 3D toroidal particle simulation model," Plasma Phys. Control. Fusion 38, A281 (1996).

[10] K.-C. Tzeng, W. B. Mori, and T. Katsouleas, "Electron Beam Characteristics from Laser-Driven Wave Breaking," Phys. Rev. Lett. 79, 5258 (1997).

[11] William Gropp, Ewing Lusk, and Anthony Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface [MIT Press, Cambridge, MA, 1994].

[12] Lon Poole, MacWorld Mac OS 8 Bible [IDG Books Worldwide, Foster City, CA, 1997], chapter 17.

[13] See http://devworld.apple.com/technotes/ic/AutoGuest.sea.hqx