>I have since discovered that my performance issue with MPI_INIT taking a
>long time seems to be due to underlying network issues, not LAM (we have a
>guest account on some UPenn BProc machines that we do our testing on).
>So we can throw that possibility out. Other than running slowly, the
>mandelbrot example runs fine for me.
>
>What's your filesystem situation like? Do you have a more-or-less uniform
>filesystem (from the user's point of view) on all nodes, such that
>"master" and "slave" can be found in the same directory on all nodes?
>
>Stripping down the app schema file to the following works for me:
>
>-----
>h /home/jsquyres/cvs/trillium/examples/mandelbrot/master
>C /home/jsquyres/cvs/trillium/examples/mandelbrot/slave
>-----
>
>Note that I removed the "-s h" from the slave line since it just causes
>more slowness on the network, and since /home is uniformly exported to all
>nodes. Note, too, that the absolute filename isn't necessary -- it's just
>generated that way as a "safest" example. You could just have:
>
>-----
>h master
>C slave
>-----
>
I have /home NFS mounted on all the compute nodes, so they can all see
the master and slave applications. I decided to let things run for a
while, and the program finally terminated with this output:
master: allocating block (380, 180) - (399, 199) to process 3
master: allocating block (400, 180) - (419, 199) to process 2
master: allocating block (420, 180) - (439, 199) to process 4
master: allocating block (440, 180) - (459, 199) to process 1
master: allocating block (460, 180) - (479, 199) to process 3
MPI_Recv: process in local group is dead (rank 2, MPI_COMM_WORLD)
-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.
PID 5608 failed on node n0 (192.168.1.1) with exit status 1.
-----------------------------------------------------------------------------
Rank (2, MPI_COMM_WORLD): Call stack within LAM:
Rank (2, MPI_COMM_WORLD): - MPI_Recv()
Rank (2, MPI_COMM_WORLD): - main()
I'm assuming this just means some timeout value was reached. As I said,
if I launch mpirun on one of the compute nodes instead of the head node,
everything runs just fine. Perhaps I have a problem with my networking
configuration.
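
For reference, here is a minimal sketch of the kind of master/slave
exchange the error output is complaining about. This is not the actual
mandelbrot source -- the buffer contents, tag, and block coordinates are
made up -- it just illustrates where MPI_Recv blocks and why the
"return 0" / MPI_Finalize advice in the mpirun message matters:

-----
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    int work[4] = {0, 0, 19, 19};   /* illustrative block coordinates */
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* Master: hand one block to each slave. */
        for (i = 1; i < size; ++i) {
            printf("master: allocating block (%d, %d) - (%d, %d) to process %d\n",
                   work[0], work[1], work[2], work[3], i);
            MPI_Send(work, 4, MPI_INT, i, 0, MPI_COMM_WORLD);
        }
    } else {
        /* Slave: block in MPI_Recv waiting for an assignment.  This is
           the call LAM names in the error above when a peer in the
           communicator has exited. */
        MPI_Recv(work, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
    }

    /* Clean shutdown: finalize and return 0 so mpirun doesn't flag a
       nonzero exit status. */
    MPI_Finalize();
    return 0;
}
-----
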
Mike