When the monitor variable is set to 2, a complete log file of all the MPI messages is recorded in a file on each node. In Fortran, the name of the file is whatever is connected to unit 2 (and defaults to the name FOR002.DAT, if no open for unit 2 has been called.) In C, the file is called MPIErrs. To help decipher the meaning of these files, we start with a successful run of the knock program with two nodes, using MacMPI_IP.f. Then we will give some examples of failures. CASE I: Successful run of knock.f On node 0: Log Data Meaning of entry: MPI_INIT started <= MPI_INIT successfully started local host=169.232.159.155,port= 32385 <= IP address of this node requesting connection with=169.232.159.155 <= node 0 is starting connection to itself tentative connection started with=169.232.159.155 connection confirmed with idproc = 0 <= node 0 (itself) is successfully connected requesting connection with=169.232.159.156 <= node 0 is starting connection to node 1 tentative connection started with=169.232.159.156 send posted: destination= 1 size= 8 tag= 3 <= confirmation requested sent to node 1 sent: destination= 1 size= 8 time= 107 tag= 3 receive posted: source= 1 size= 4 tag= 4 received: source= 1 size= 4 time= 5939 tag= 4 <= confirmation confirmed from node 1 connection confirmed with idproc = 1 <= node 1 is successfully connected all nodes activated: idproc, nproc= 0 2 <= all nodes in nodelist connected send posted: destination= 1 size= 0 tag= 1 <= sending to node 1 the address of next node (0 size means there are no other nodes) sent: destination= 1 size= 0 time= 46 tag= 1 receive posted: source= 1 size= 4 tag= 2 received: source= 1 size= 4 time= 1617 tag= 2 <= received verification of number of nodes MPI_INIT complete <= MPI_INIT successfully completed send posted: destination= 1 size= 12 tag= 1 <= Knock message send posted to node 1, 12 bytes, tag 1 sent: destination= 1 size= 12 time= 116 tag= 1 <= Knock message successfully sent to OpenTransport, time since posting, 116 microsec. receive posted: source= 1 size= 12 tag= 2 <= Knock message receive posted from node 1, 12 bytes, tag 2 received: source= 1 size= 12 time= 1321 tag= 2 <= Knock message received from Open Transport, time since posting, 1321 microsec. MPI_FINALIZE started <= MPI_FINALIZE successfully started Message Size Distribution Function <= statistics about the number of messages sent and received, Size(bytes) = 1 Number = 0 and their sizes Size(bytes) = 2 Number = 0 Size(bytes) = 4 Number = 2 Size(bytes) = 8 Number = 1 Size(bytes) = 16 Number = 2 Size(bytes) = 32 Number = 0 Size(bytes) = 64 Number = 0 Size(bytes) = 128 Number = 0 Size(bytes) = 256 Number = 0 Size(bytes) = 512 Number = 0 Size(bytes) = 1024 Number = 0 Size(bytes) = 2048 Number = 0 Size(bytes) = 4096 Number = 0 Size(bytes) = 8192 Number = 0 Size(bytes) = 16384 Number = 0 Size(bytes) = 32768 Number = 0 Size(bytes) = 65536 Number = 0 Size(bytes) = 131072 Number = 0 Size(bytes) = 262144 Number = 0 Size(bytes) = 524288 Number = 0 Size(bytes) = 1048576 Number = 0 Size(bytes) = 2097152 Number = 0 Size(bytes) = 4194304 Number = 0 Size(bytes) = 8388608 Number = 0 Sending Time = 0 %, Speed = 8.968610E-02 MB/s <= average send speed (time between posting and completion) Receiving Time = 0 %, Speed = 2.253013E-03 MB/s <= average receive speed (time between posting and completion) MPI_FINALIZE complete <= MPI_FINALIZE successfully completed On node 1: Log Data Meaning of entry: MPI_INIT started <= MPI_INIT successfully started local host=169.232.159.156,port= 32385 <= IP address of this node receive posted: source= 0 size= 8 tag= 3 <= confirmation request received from node 0 received: source= 0 size= 8 time= 78 tag= 3 send posted: destination= 0 size= 4 tag= 4 sent: destination= 0 size= 4 time= 70 tag= 4 <= confirmation sent to node 0 connection accepted with idproc = 0 <= node 0 is successfully connected requesting connection with=169.232.159.156 <= node 1 is starting connection to itself tentative connection started with=169.232.159.156 connection confirmed with idproc = 1 <= node 1 is successfully connected receive posted: source= 0 size= 16 tag= 1 <= receiving from node 0 the address of next node (0 size means there are no other nodes) received: source= 0 size= 0 time= 15 tag= 1 all nodes activated: idproc, nproc= 1 2 <= no more nodes to connect send posted: destination= 0 size= 4 tag= 2 <= sending verification of number of nodes sent: destination= 0 size= 4 time= 101 tag= 2 MPI_INIT complete <= MPI_INIT successfully completed receive posted: source= 0 size= 12 tag= 1 <= Knock message receive posted from node 0, 12 bytes, tag 1 received: source= 0 size= 12 time= 2067 tag= 1 <= Knock message received from Open Transport, time since posting, 2067 microsec. send posted: destination= 0 size= 12 tag= 2 <= Knock message send posted to node 0, 12 bytes, tag 2 sent: destination= 0 size= 12 time= 77 tag= 2 <= Knock message successfully sent to OpenTransport, time since posting, 77 microsec. MPI_FINALIZE started <= MPI_FINALIZE successfully started Message Size Distribution Function .... MPI_FINALIZE complete <= MPI_FINALIZE successfully completed Analysis: Everything was successfully completed. CASE II: TCP/IP network not functioning properly On node 0: Log Data Meaning of entry: MPI_INIT started local host=169.232.159.155,port= 32389 <= IP address of this node requesting connection with=169.232.159.155 tentative connection started with=169.232.159.155 connection confirmed with idproc = 0 <= node 0 (itself) is successfully connected requesting connection with=169.232.159.156 <= node 0 is starting connection to node 1 OTConnect Wait Time Exceeded <= More than 60 seconds expired trying to connect Trying to start location = 169.232.159.156 <= node 1 (169.232.159.156) did not respond MPI_FINALIZE started OrderlyDisconnect Timeout: i, state = 2 3 <= unable to disconnect endpoint 2 (node 1) within 5 seconds (current state = 3, T_OUTCON, connection attempt started) Message Size Distribution Function ... MPI_FINALIZE complete On node 1: Log Data Meaning of entry: MPI_INIT started local host=169.232.159.156,port= 32389 <= IP address of this node OTListen Wait Time Exceeded <= More than 60 seconds expired waiting for connection MPI_FINALIZE started MPI_FINALIZE complete Analysis: Neither node was able to see each other. Either TCP/IP network is not working properly, or TCP IP address in nodelist file is incorrect (when manually created). If the AppleTalk and TCP/IP networks are using different adapters, it is possible for the Launch Den Mother to see the nodes with AppleTalk, but not see them with TCP/IP. (In fact, I had disconnected the TCP/IP Ethernet adapter just after launch.) CASE III: successful MPI collective calls in pingpong.f On node 0: Log Data Meaning of entry: MPI_BARRIER started <= MPI_BARRIER initiated receive posted: source= 1 size= 4 tag= 0 <= receive posted from node 1 by MPI_BARRIER received: source= 1 size= 4 time= 48 tag= 0 <= data received from node 1 by MPI_BARRIER, 48 microsec. MPI_BCAST started <= MPI_BCAST called by MPI_BARRIER, Everyone checked in send posted: destination= 1 size= 4 tag= 0 <= send posted to node 1 by MPI_BCAST, 4 bytes sent: destination= 1 size= 4 time= 78 tag= 0 <= data sent to node 1 by MPI_BCAST, 4 bytes, 78 microsec. MPI_BCAST complete <= MPI_BCAST sucessfully completed MPI_BARRIER complete <= MPI_BARRIER successfuly completed ... MPI_ALLREDUCE started <= MPI_ALLREDUCE started MPI_REDUCE started <= MPI_REDUCE called by MPI_ALLREDUCE receive posted: source= 1 size= 4 tag= 1 <= receive posted from node 1 by MPI_REDUCE, 4 bytes received: source= 1 size= 4 time= 186 tag= 1 <= data received from node 1 by MPI_REDUCE, 4 bytes, 186 microsec. MPI_REDUCE complete <= MPI_REDUCE successfully completed MPI_BCAST started <= MPI_BCAST called by MPI_ALLREDUCE send posted: destination= 1 size= 4 tag= 0 <= send posted to node 1 by MPI_BCAST, 4 bytes sent: destination= 1 size= 4 time= 51 tag= 0 <= data sent to node 1 by MPI_BCAST, 4 bytes, 51 microsec. MPI_BCAST complete <= MPI_BCAST successfully completed MPI_ALLREDUCE complete <= MPI_ALLREDUCE succesfully completed ... On node 1: Log Data Meaning of entry: MPI_BARRIER started <= MPI_BARRIER initiated send posted: destination= 0 size= 4 tag= 0 <= send posted to node 0 by MPI_BARRIER, 4 bytes sent: destination= 0 size= 4 time= 91 tag= 0 <= data sent to node 0 by MPI_BARRIER, 4 bytes, 91 microsec. MPI_BCAST started <= MPI_BCAST called by MPI_BARRIER, node 1 checked in receive posted: source= 0 size= 4 tag= 0 <= receive posted from node 0 by MPI_BCAST received: source= 0 size= 4 time= 4447 tag= 0 <= data received from node 0 by MPI_BARRIER, 4447 microsec. MPI_BCAST complete <= MPI_BCAST sucessfully completed MPI_BARRIER complete <= MPI_BARRIER successfuly completed ... MPI_ALLREDUCE started <= MPI_ALLREDUCE started MPI_REDUCE started <= MPI_REDUCE called by MPI_ALLREDUCE send posted: destination= 0 size= 4 tag= 1 <= send posted to node 0 by MPI_REDUCE, 4 bytes sent: destination= 0 size= 4 time= 46 tag= 1 <= data sent to node 0 by MPI_REDUCE, 4 bytes, 46 microsec. MPI_REDUCE complete <= MPI_REDUCE successfully completed MPI_BCAST started <= MPI_BCAST called by MPI_ALLREDUCE receive posted: source= 0 size= 4 tag= 0 <= receive posted from node 0 by MPI_BCAST, 4 bytes received: source= 0 size= 4 time= 886 tag= 0 <= data received from node 0 by MPI_REDUCE, 4 bytes, 886 microsec. MPI_BCAST complete <= MPI_BCAST successfully completed MPI_ALLREDUCE complete <= MPI_ALLREDUCE succesfully completed ... Analysis: Successful collective calls in pingpong.f CASE IV: MPI collective calls failed due to network error in pingpong.f On node 0: Log Data Meaning of entry: send posted: destination= 1 size= 4194304 tag= 21 <= send posted to node 1, 4194304 bytes, tag 21 Send Timeout, 10.0660 sec, Retrying... <= data fragment not sent within 10 sec., destination= 1 size= 4194304 tag= 21 to node 1, sending 4194304 bytes, tag 21 Flow Control prevents sending data <= Result of retry: cannot send data ... Send Timeout, 110.075 sec, Retrying... <= data not sent within 110 sec., destination= 1 size= 4194304 tag= 21 to node 1, sendng 4194304 bytes, tag 21 Flow Control prevents sending data <= Result of retry: cannot send data OTSnd Error, nerr, dest, tag= -3155 1 21 <= Error -3155 detected in message to node 1, with tag 21 Endpoint not in an appropriate state <= MEANING OF ERROR -3155, kOTOutStateErr: NODE 1 IS DEAD, Summary of failed message state follows: destination/source = 1 <= Failed message was being sent to node 1 communicator = 0 <= Message had Communitor 0 = MPI_COMM_WORLD tag = 21 <= Message had tag 21 datatype = 18 <= Message had datatype 18 = MPI_INTEGER (see mpif.h) request handle = 1 <= index to which message record, 1 means first record Endpoint reference pointer = 492288676 <= pointer to Open Transport endpoint iocompletion flag = -3155 <= Error return code from Open Transport current buffer pointer = 493445204 <= pointer to where next fragment of data will sent from current buffer length = 3375104 <= size of data remaining to be sent T_MORE flag = 0 <= used only by AppleTalk, ignore for TCP/IP data pointer = 492626004 <= pointer to start of data being sent data length = 4194304 <= total length of data to be sent actual length sent/received = 819200 <= amount of data sent so far, 819200 bytes out of 4194304 was successfully sent before failure first time stamp = 1 -223974229 <= time stamp when send was first posted second time stamp = 1 -112248008 <= current time stamp next message pointer = 0 <= which message record to be processed next (0 means no more messages in send queue) non-fatal error code = -3161 <= last non-fatal error, -3161, kOTFlowErr, Flow Control MPI_TEST: Error code = -3155 <= MPI_TEST returned with error -3155, kOTOutStateErr MPI_FINALIZE started ... MPI_FINALIZE complete MPI_ABORT complete <= MPI_ABORT called by MPI_TEST On node 1: Log Data Meaning of entry: receive posted: source= 0 size= 4194304 tag= 21 <= receive posted from node 0, 4194304 bytes, tag 21 Receive Timeout, 10.1160 sec, Retrying... <= data fragment not received within 10 sec., source= 0 size= 4194304 tag= 21 to node 0, receiving 4194304 bytes, tag 21 No data is currently available <= Result of retry: no data has arrived ... Receive Timeout, 300.148 sec, Retrying... <= data fragment not received within 300 sec., source= 0 size= 4194304 tag= 21 to node 0, receiving 4194304 bytes, tag 21 No data is currently available <= Result of retry: no data has arrived OTRcv Retry failed <= GIVING UP RETRYING AFTER 300 SECONDS Summary of failed message state follows: destination/source = 0 <= Failed message was being received from node 0 communicator = 0 <= Message had Communitor 0 = MPI_COMM_WORLD tag = 21 <= Message had tag 21 datatype = 18 <= Message had datatype 18 = MPI_INTEGER (see mpif.h) request handle = 1 <= index to which message record, 1 means first record Endpoint reference pointer = 495571780 <= pointer to Open Transport endpoint iocompletion flag = 51 <= if positive, indicates number of fragments received current buffer pointer = 500860096 <= pointer to where next fragment of data will be received current buffer length = 3437720 <= size of space left in receive buffer T_MORE flag = 0 <= used only by AppleTalk, ignore for TCP/IP data pointer = 500103512 <= pointer to start of data being received data length = 4194304 <= total length of data in receive buffer actual length sent/received = 756584 <= amount of data received so far, 756584 bytes out of 4194304 was successfully received before failure first time stamp = 1 -499327666 <= time stamp when receive was first posted second time stamp = 1 -199177989 <= current time stamp next message pointer = 0 <= which message record to be processed next (0 means no more messages in receive queue) non-fatal error code = -3162 <= last non-fatal error, -3162, kOTNoDataErr, No Data MPI_TEST: Error code = -3162 <= MPI_TEST returned with error -3162, kOTNoDataErr MPI_FINALIZE started OrderlyDisconnect Timeout: i, state = 1 6 <= unable to disconnect endpoint 1 (node 0) within 5 seconds (current state = 6, T_OUTREL, disconnect not acknowledged) Message Size Distribution Function ... MPI_FINALIZE complete MPI_ABORT complete <= MPI_ABORT called by MPI_TEST Analysis: The logfile on node 0 indicates the node quit because it detected that node 1 was in an inappropriate state, probably dead. That means one should look in the logfile for node 1 to determine the actual cause of the failure. The logfile on node 1 indicates that that it quit because no data had arrived for 5 minutes. The fact that node 0 was not able to send and node 1 was not receiving anything indicates that there was probably a network failure. (In fact, I had disconnected the Ethernet adapter.)