When the monitor variable is set to 2, a complete log of all the
MPI messages is recorded in a file on each node.  In Fortran, the name
of the file is whatever is connected to unit 2 (and defaults to the
name FOR002.DAT if no OPEN for unit 2 has been executed).  In C, the
file is called MPIErrs.
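
For example, the Fortran log file can be given an explicit name by connecting unit 2
before MPI_INIT is called.  The following is only a minimal sketch; the file name
node0.log is illustrative, not something MacMPI requires:

      program setlog
c connect unit 2 before MPI_INIT so the monitor log is written to an
c explicitly named file instead of the default FOR002.DAT
c (the file name 'node0.log' is illustrative)
      include 'mpif.h'
      integer ierr
      open(unit=2,file='node0.log',status='unknown')
      call MPI_INIT(ierr)
c ... rest of the program ...
      call MPI_FINALIZE(ierr)
      end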

    To help decipher the meaning of these files, we start with a successful
run of the knock program with two nodes, using MacMPI_IP.f.  Then we will give some
examples of failures.
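
For reference, the point-to-point traffic traced in CASE I below corresponds to an
exchange of roughly the following form.  This is only a minimal sketch, not the actual
knock.f source; the 12-character message text and variable names are made up, but the
message sizes (12 bytes) and tags (1 and 2) match the log entries:

      program knock
      include 'mpif.h'
      integer idproc, nproc, ierr
      integer status(MPI_STATUS_SIZE)
      character*12 msg
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD,idproc,ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD,nproc,ierr)
      if (idproc.eq.0) then
c node 0: send 12 bytes, tag 1, then receive the 12-byte reply, tag 2
         msg = 'knock knock'
         call MPI_SEND(msg,12,MPI_CHARACTER,1,1,MPI_COMM_WORLD,ierr)
         call MPI_RECV(msg,12,MPI_CHARACTER,1,2,MPI_COMM_WORLD,
     &                 status,ierr)
      else if (idproc.eq.1) then
c node 1: receive 12 bytes, tag 1, then send the 12-byte reply, tag 2
         call MPI_RECV(msg,12,MPI_CHARACTER,0,1,MPI_COMM_WORLD,
     &                 status,ierr)
         msg = 'who is there'
         call MPI_SEND(msg,12,MPI_CHARACTER,0,2,MPI_COMM_WORLD,ierr)
      endif
      call MPI_FINALIZE(ierr)
      end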

CASE I: Successful run of knock.f
On node 0: Log Data                                  Meaning of entry:

 MPI_INIT started                                     <= MPI_INIT successfully started
 local host=169.232.159.155,port=  32385              <= IP address of this node
 requesting connection with=169.232.159.155           <= node 0 is starting connection to itself
 tentative connection started with=169.232.159.155
 connection confirmed with idproc =   0               <= node 0 (itself) is successfully connected
 requesting connection with=169.232.159.156           <= node 0 is starting connection to node 1
 tentative connection started with=169.232.159.156
 send posted: destination=  1  size=  8 tag=  3       <= confirmation request sent to node 1
 sent: destination=  1  size=  8  time=  107 tag=  3
 receive posted: source=  1  size=  4 tag=  4
 received: source=  1  size=  4  time=  5939 tag=  4  <= confirmation confirmed from node 1
 connection confirmed with idproc =   1               <= node 1 is successfully connected
 all nodes activated: idproc, nproc=  0  2            <= all nodes in nodelist connected
 send posted: destination=  1  size=  0 tag=  1       <= sending to node 1 the address of next node
                                                         (0 size means there are no other nodes)
 sent: destination=  1  size=  0  time=  46 tag=  1
 receive posted: source=  1  size=  4 tag=  2 
 received: source=  1  size=  4  time=  1617 tag=  2  <= received verification of number of nodes
 MPI_INIT complete                                    <= MPI_INIT successfully completed                             
 
 send posted: destination=  1  size=  12 tag=  1      <= Knock message send posted to node 1, 12 bytes, tag 1
 sent: destination=  1  size=  12  time=  116 tag=  1 <= Knock message successfully sent to Open Transport,
                                                         time since posting, 116 microsec.
 receive posted: source=  1  size=  12 tag=  2        <= Knock message receive posted from node 1, 12 bytes, tag 2
 received: source=  1  size=  12  time=  1321 tag=  2 <= Knock message received from Open Transport,
                                                         time since posting, 1321 microsec.
 MPI_FINALIZE started                                 <= MPI_FINALIZE successfully started
  Message Size Distribution Function                  <= statistics about the number of messages sent and received,
  Size(bytes) =   1  Number =   0                        and their sizes
  Size(bytes) =   2  Number =   0
  Size(bytes) =   4  Number =   2
  Size(bytes) =   8  Number =   1
  Size(bytes) =   16  Number =   2
  Size(bytes) =   32  Number =   0
  Size(bytes) =   64  Number =   0
  Size(bytes) =   128  Number =   0
  Size(bytes) =   256  Number =   0
  Size(bytes) =   512  Number =   0
  Size(bytes) =   1024  Number =   0
  Size(bytes) =   2048  Number =   0
  Size(bytes) =   4096  Number =   0
  Size(bytes) =   8192  Number =   0
  Size(bytes) =   16384  Number =   0
  Size(bytes) =   32768  Number =   0
  Size(bytes) =   65536  Number =   0
  Size(bytes) =   131072  Number =   0
  Size(bytes) =   262144  Number =   0
  Size(bytes) =   524288  Number =   0
  Size(bytes) =   1048576  Number =   0
  Size(bytes) =   2097152  Number =   0
  Size(bytes) =   4194304  Number =   0
  Size(bytes) =   8388608  Number =   0
  Sending Time =   0 %, Speed =   8.968610E-02 MB/s   <= average send speed (time between posting and completion)
  Receiving Time =   0 %, Speed =   2.253013E-03 MB/s <= average receive speed (time between posting and completion)
 MPI_FINALIZE complete                                <= MPI_FINALIZE successfully completed


On node 1: Log Data                                  Meaning of entry:

 MPI_INIT started                                     <= MPI_INIT successfully started
 local host=169.232.159.156,port=  32385              <= IP address of this node
 receive posted: source=  0  size=  8 tag=  3         <= confirmation request received from node 0
 received: source=  0  size=  8  time=  78 tag=  3
 send posted: destination=  0  size=  4 tag=  4
 sent: destination=  0  size=  4  time=  70 tag=  4   <= confirmation sent to node 0
 connection accepted with idproc =   0                <= node 0 is successfully connected
 requesting connection with=169.232.159.156           <= node 1 is starting connection to itself
 tentative connection started with=169.232.159.156
 connection confirmed with idproc =   1               <= node 1 is successfully connected
 receive posted: source=  0  size=  16 tag=  1        <= receiving from node 0 the address of next node
                                                         (0 size means there are no other nodes)
 received: source=  0  size=  0  time=  15 tag=  1
 all nodes activated: idproc, nproc=  1  2            <= no more nodes to connect
 send posted: destination=  0  size=  4 tag=  2       <= sending verification of number of nodes
 sent: destination=  0  size=  4  time=  101 tag=  2
 MPI_INIT complete                                    <= MPI_INIT successfully completed 
 
 receive posted: source=  0  size=  12 tag=  1        <= Knock message receive posted from node 0, 12 bytes, tag 1
 received: source=  0  size=  12  time=  2067 tag=  1 <= Knock message received from Open Transport,
                                                         time since posting, 2067 microsec.
 send posted: destination=  0  size=  12 tag=  2      <= Knock message send posted to node 0, 12 bytes, tag 2
 sent: destination=  0  size=  12  time=  77 tag=  2  <= Knock message successfully sent to Open Transport,
                                                         time since posting, 77 microsec.
 MPI_FINALIZE started                                 <= MPI_FINALIZE successfully started
  Message Size Distribution Function
  ....
 MPI_FINALIZE complete                                <= MPI_FINALIZE successfully completed
 
Analysis: Everything was successfully completed.
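
Note that the Sending and Receiving Speed figures printed during MPI_FINALIZE appear to be
the total bytes moved divided by the total time between posting and completion.  For node 0
above, the receives moved 4 + 4 + 12 = 20 bytes in 5939 + 1617 + 1321 = 8877 microseconds,
or about 2.253E-03 MB/s, which matches the logged Receiving Speed; the sends moved 20 bytes
in 107 + 116 = 223 microseconds (the zero-length send apparently does not contribute), or
about 8.97E-02 MB/s, matching the Sending Speed.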


CASE II: TCP/IP network not functioning properly
On node 0: Log Data                                  Meaning of entry:

 MPI_INIT started
 local host=169.232.159.155,port=  32389              <= IP address of this node
 requesting connection with=169.232.159.155
 tentative connection started with=169.232.159.155
 connection confirmed with idproc =   0               <= node 0 (itself) is successfully connected
 requesting connection with=169.232.159.156           <= node 0 is starting connection to node 1
 OTConnect Wait Time Exceeded                         <= More than 60 seconds expired trying to connect
 Trying to start location = 169.232.159.156           <= node 1 (169.232.159.156) did not respond
 MPI_FINALIZE started
 OrderlyDisconnect Timeout: i, state =   2  3         <= unable to disconnect endpoint 2 (node 1) within 5 seconds
                                                         (current state = 3, T_OUTCON, connection attempt started)
  Message Size Distribution Function
  ...
 MPI_FINALIZE complete


On node 1: Log Data                                  Meaning of entry:

 MPI_INIT started
 local host=169.232.159.156,port=  32389              <= IP address of this node
 OTListen Wait Time Exceeded                          <= More than 60 seconds expired waiting for connection
 MPI_FINALIZE started
 MPI_FINALIZE complete

Analysis: Neither node could see the other.  Either the TCP/IP network is not working
properly, or a TCP/IP address in the nodelist file is incorrect (when the file was created
manually).  If the AppleTalk and TCP/IP networks are using different adapters, it is
possible for the Launch Den Mother to see the nodes with AppleTalk, but not see them with
TCP/IP.  (In fact, I had disconnected the TCP/IP Ethernet adapter just after launch.)


CASE III: Successful MPI collective calls in pingpong.f
On node 0: Log Data                                  Meaning of entry:

 MPI_BARRIER started                                  <= MPI_BARRIER initiated
 receive posted: source=  1  size=  4 tag=  0         <= receive posted from node 1 by MPI_BARRIER
 received: source=  1  size=  4  time=  48 tag=  0    <= data received from node 1 by MPI_BARRIER, 48 microsec.
 MPI_BCAST started                                    <= MPI_BCAST called by MPI_BARRIER, everyone checked in
 send posted: destination=  1  size=  4 tag=  0       <= send posted to node 1 by MPI_BCAST, 4 bytes
 sent: destination=  1  size=  4  time=  78 tag=  0   <= data sent to node 1 by MPI_BCAST, 4 bytes, 78 microsec.
 MPI_BCAST complete                                   <= MPI_BCAST successfully completed
 MPI_BARRIER complete                                 <= MPI_BARRIER successfully completed
 ...
 MPI_ALLREDUCE started                                <= MPI_ALLREDUCE started
 MPI_REDUCE started                                   <= MPI_REDUCE called by MPI_ALLREDUCE
 receive posted: source=  1  size=  4 tag=  1         <= receive posted from node 1 by MPI_REDUCE, 4 bytes
 received: source=  1  size=  4  time=  186 tag=  1   <= data received from node 1 by MPI_REDUCE, 4 bytes, 186 microsec.
 MPI_REDUCE complete                                  <= MPI_REDUCE successfully completed
 MPI_BCAST started                                    <= MPI_BCAST called by MPI_ALLREDUCE
 send posted: destination=  1  size=  4 tag=  0       <= send posted to node 1 by MPI_BCAST, 4 bytes
 sent: destination=  1  size=  4  time=  51 tag=  0   <= data sent to node 1 by MPI_BCAST, 4 bytes, 51 microsec.
 MPI_BCAST complete                                   <= MPI_BCAST successfully completed
 MPI_ALLREDUCE complete                               <= MPI_ALLREDUCE successfully completed
 ...


On node 1: Log Data                                  Meaning of entry:

 MPI_BARRIER started                                  <= MPI_BARRIER initiated
 send posted: destination=  0  size=  4 tag=  0       <= send posted to node 0 by MPI_BARRIER, 4 bytes
 sent: destination=  0  size=  4  time=  91 tag=  0   <= data sent to node 0 by MPI_BARRIER, 4 bytes, 91 microsec.
 MPI_BCAST started                                    <= MPI_BCAST called by MPI_BARRIER, node 1 checked in
 receive posted: source=  0  size=  4 tag=  0         <= receive posted from node 0 by MPI_BCAST
 received: source=  0  size=  4  time=  4447 tag=  0  <= data received from node 0 by MPI_BCAST, 4447 microsec.
 MPI_BCAST complete                                   <= MPI_BCAST successfully completed
 MPI_BARRIER complete                                 <= MPI_BARRIER successfully completed
 ...
 MPI_ALLREDUCE started                                <= MPI_ALLREDUCE started
 MPI_REDUCE started                                   <= MPI_REDUCE called by MPI_ALLREDUCE
 send posted: destination=  0  size=  4 tag=  1       <= send posted to node 0 by MPI_REDUCE, 4 bytes
 sent: destination=  0  size=  4  time=  46 tag=  1   <= data sent to node 0 by MPI_REDUCE, 4 bytes, 46 microsec.
 MPI_REDUCE complete                                  <= MPI_REDUCE successfully completed
 MPI_BCAST started                                    <= MPI_BCAST called by MPI_ALLREDUCE
 receive posted: source=  0  size=  4 tag=  0         <= receive posted from node 0 by MPI_BCAST, 4 bytes
 received: source=  0  size=  4  time=  886 tag=  0   <= data received from node 0 by MPI_BCAST, 4 bytes, 886 microsec.
 MPI_BCAST complete                                   <= MPI_BCAST successfully completed
 MPI_ALLREDUCE complete                               <= MPI_ALLREDUCE successfully completed
 ...

Analysis: Successful collective calls in pingpong.f.
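
For reference, the collective traffic traced above corresponds to calls of roughly the
following form.  This is only a minimal sketch, not the actual pingpong.f source; the
variable names and the MPI_SUM reduction are illustrative, but the 4-byte (single integer)
payload matches the log entries:

      program collectives
      include 'mpif.h'
      integer idproc, nproc, ival, isum, ierr
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD,idproc,ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD,nproc,ierr)
c MPI_BARRIER: node 1 checks in to node 0, then MacMPI calls MPI_BCAST
c internally to release everyone (the MPI_BCAST entries in the log)
      call MPI_BARRIER(MPI_COMM_WORLD,ierr)
c MPI_ALLREDUCE: logged as an MPI_REDUCE to node 0 followed by an
c MPI_BCAST of the 4-byte result
      ival = idproc + 1
      call MPI_ALLREDUCE(ival,isum,1,MPI_INTEGER,MPI_SUM,
     &                   MPI_COMM_WORLD,ierr)
      call MPI_FINALIZE(ierr)
      end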


CASE IV: MPI collective calls failed due to network error in pingpong.f
On node 0: Log Data                                      Meaning of entry:

 send posted: destination=  1  size=  4194304 tag=  21  <= send posted to node 1, 4194304 bytes, tag 21
 Send Timeout,   10.0660  sec, Retrying...              <= data fragment not sent within 10 sec.,
 destination=  1  size=  4194304 tag=  21                  to node 1, sending 4194304 bytes, tag 21
 Flow Control prevents sending data                     <= Result of retry: cannot send data
 ...
 Send Timeout,   110.075  sec, Retrying...              <= data not sent within 110 sec.,
 destination=  1  size=  4194304 tag=  21                  to node 1, sending 4194304 bytes, tag 21
 Flow Control prevents sending data                     <= Result of retry: cannot send data
 OTSnd Error, nerr, dest, tag= -3155  1  21             <= Error -3155 detected in message to node 1, with tag 21
 Endpoint not in an appropriate state                   <= MEANING OF ERROR -3155, kOTOutStateErr: NODE 1 IS DEAD,
                                                            Summary of failed message state follows:
 destination/source =   1                               <= Failed message was being sent to node 1
 communicator =   0                                     <= Message had Communicator 0 = MPI_COMM_WORLD
 tag =   21                                             <= Message had tag 21
 datatype =   18                                        <= Message had datatype 18 = MPI_INTEGER (see mpif.h)
 request handle =   1                                   <= index to which message record, 1 means first record
 Endpoint reference pointer =   492288676               <= pointer to Open Transport endpoint
 iocompletion flag =  -3155                             <= Error return code from Open Transport
 current buffer pointer =   493445204                   <= pointer to where next fragment of data will be sent from
 current buffer length =   3375104                      <= size of data remaining to be sent
 T_MORE flag =   0                                      <= used only by AppleTalk, ignore for TCP/IP
 data pointer =   492626004                             <= pointer to start of data being sent
 data length =   4194304                                <= total length of data to be sent
 actual length sent/received =   819200                 <= amount of data sent so far, 819200 bytes out of 4194304
                                                            was successfully sent before failure
 first time stamp =   1 -223974229                      <= time stamp when send was first posted
 second time stamp =   1 -112248008                     <= current time stamp
 next message pointer =   0                             <= which message record to be processed next
                                                            (0 means no more messages in send queue)
 non-fatal error code =  -3161                          <= last non-fatal error, -3161, kOTFlowErr, Flow Control

 MPI_TEST: Error code =  -3155                          <= MPI_TEST returned with error -3155, kOTOutStateErr
 MPI_FINALIZE started
 ...
 MPI_FINALIZE complete
 MPI_ABORT complete                                     <= MPI_ABORT called by MPI_TEST


On node 1: Log Data                                      Meaning of entry:

 receive posted: source=  0  size=  4194304 tag=  21    <= receive posted from node 0, 4194304 bytes, tag 21
 Receive Timeout,   10.1160  sec, Retrying...           <= data fragment not received within 10 sec.,
 source=  0  size=  4194304 tag=  21                       from node 0, receiving 4194304 bytes, tag 21
 No data is currently available                         <= Result of retry: no data has arrived
 ...
 Receive Timeout,   300.148  sec, Retrying...           <= data fragment not received within 300 sec.,
 source=  0  size=  4194304 tag=  21                       from node 0, receiving 4194304 bytes, tag 21
 No data is currently available                         <= Result of retry: no data has arrived
 OTRcv Retry failed                                     <= GIVING UP RETRYING AFTER 300 SECONDS
                                                            Summary of failed message state follows:
 destination/source =   0                               <= Failed message was being received from node 0
 communicator =   0                                     <= Message had Communicator 0 = MPI_COMM_WORLD
 tag =   21                                             <= Message had tag 21
 datatype =   18                                        <= Message had datatype 18 = MPI_INTEGER (see mpif.h)
 request handle =   1                                   <= index to which message record, 1 means first record
 Endpoint reference pointer =   495571780               <= pointer to Open Transport endpoint
 iocompletion flag =   51                               <= if positive, indicates number of fragments received
 current buffer pointer =   500860096                   <= pointer to where next fragment of data will be received
 current buffer length =   3437720                      <= size of space left in receive buffer
 T_MORE flag =   0                                      <= used only by AppleTalk, ignore for TCP/IP
 data pointer =   500103512                             <= pointer to start of data being received
 data length =   4194304                                <= total length of data in receive buffer
 actual length sent/received =   756584                 <= amount of data received so far, 756584 bytes out of 4194304
                                                            was successfully received before failure
 first time stamp =   1 -499327666                      <= time stamp when receive was first posted
 second time stamp =   1 -199177989                     <= current time stamp
 next message pointer =   0                             <= which message record to be processed next
                                                            (0 means no more messages in receive queue)
 non-fatal error code =  -3162                          <= last non-fatal error, -3162, kOTNoDataErr, No Data

 MPI_TEST: Error code =  -3162                          <= MPI_TEST returned with error -3162, kOTNoDataErr
 MPI_FINALIZE started
 OrderlyDisconnect Timeout: i, state =   1  6           <= unable to disconnect endpoint 1 (node 0) within 5 seconds
                                                           (current state = 6, T_OUTREL, disconnect not acknowledged)
  Message Size Distribution Function
  ...
 MPI_FINALIZE complete
 MPI_ABORT complete                                     <= MPI_ABORT called by MPI_TEST

Analysis: The logfile on node 0 indicates the node quit because it detected that node 1 was
in an inappropriate state, probably dead.  That means one should look in the logfile for
node 1 to determine the actual cause of the failure.  The logfile on node 1 indicates that
it quit because no data had arrived for 5 minutes.  The fact that node 0 was not able to
send and node 1 was not receiving anything indicates that there was probably a network
failure.  (In fact, I had disconnected the Ethernet adapter.)
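
For reference, the failed transfer traced above is the kind of entry produced when a message
is completed with MPI_TEST.  The following is only a minimal sketch, not the actual
pingpong.f source; the buffer contents and polling loop are illustrative (the actual program
may use different calls), but the payload of 1048576 integers (MPI_INTEGER, 4194304 bytes),
the tag 21, and the use of MPI_TEST match the log:

      program bigsend
      include 'mpif.h'
      integer n
      parameter (n=1048576)
      integer buf(n)
      integer idproc, ierr, request, i
      integer status(MPI_STATUS_SIZE)
      logical flag
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD,idproc,ierr)
      if (idproc.eq.0) then
c node 0: post a 4194304-byte send (1048576 integers) to node 1, tag 21
         do 5 i = 1, n
            buf(i) = i
    5    continue
         call MPI_ISEND(buf,n,MPI_INTEGER,1,21,MPI_COMM_WORLD,
     &                  request,ierr)
      else
c node 1: post the matching receive from node 0
         call MPI_IRECV(buf,n,MPI_INTEGER,0,21,MPI_COMM_WORLD,
     &                  request,ierr)
      endif
c poll for completion; per the log above, it is MacMPI's MPI_TEST that
c retries, logs the failed message summary, and finally calls MPI_ABORT
c when the network has gone away
      flag = .false.
   10 continue
      if (.not.flag) then
         call MPI_TEST(request,flag,status,ierr)
         go to 10
      endif
      call MPI_FINALIZE(ierr)
      end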


Last Updated: November 6, 2000