LAM/MPI General User's Mailing List Archives

From: Daniel Ng (danielng52_at_[hidden])
Date: 2007-10-08 21:57:03


Hi,

I was running an MPI timing program (2cz-prog1) by submitting the job file job.1 with qsub (job.1 is attached at the end of this message).
Several runs completed successfully, all the way to the end of the job file.
However, one run terminated with the following message (see the result1 file below for the full output):

::
bufferd (dtry_send): No child processes
mpirun (rpwait): Connection reset by peer
Broken pipe
::

I then resubmitted the same job file and it ran without problems. Can someone explain what caused the message above?

Thanks.
Daniel.

======================== begin of job.1 ==============================
#!/bin/bash
#$ -cwd
#$ -o ./result1
#$ -j y
#$ -r y
echo "Job started."
lamboot -v hosts/hostfile
lamnodes
echo --------------------------- RUN 1 ---------------------------
date
# Each mpirun below creates one line of output
mpirun n0-8 ./2cz-prog1 482 1.9584 0.00001 t
mpirun n0-10 ./2cz-prog1 482 1.9584 0.00001 t
mpirun n0-12 ./2cz-prog1 482 1.9584 0.00001 t
mpirun n0-15 ./2cz-prog1 482 1.9584 0.00001 t
date
mpirun n0-8 ./2cz-prog1 962 1.9784 0.00001 t
mpirun n0-10 ./2cz-prog1 962 1.9784 0.00001 t
mpirun n0-12 ./2cz-prog1 962 1.9784 0.00001 t
mpirun n0-15 ./2cz-prog1 962 1.9784 0.00001 t
date
mpirun n0-8 ./2cz-prog1 1442 1.9853 0.00001 t
mpirun n0-10 ./2cz-prog1 1442 1.9853 0.00001 t
mpirun n0-12 ./2cz-prog1 1442 1.9853 0.00001 t
mpirun n0-15 ./2cz-prog1 1442 1.9853 0.00001 t
date
======================== end of job.1 ==============================
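
In job.1 every mpirun line runs unconditionally, so once one run fails the later ones are still attempted. A minimal sketch of a guard follows, assuming that mpirun returns a non-zero exit status when a run fails (worth checking against the LAM 7.1.1 documentation). The helper name run_step is hypothetical and is not part of the original job file; the demo lines use true and false in place of mpirun so the sketch can be tried anywhere.

```shell
#!/bin/bash
# Hypothetical helper (not in the original job.1): run one command,
# report a failure on stderr, and pass the exit status back to the caller.
run_step() {
    "$@"
    rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "FAILED (exit status $rc): $*" >&2
    fi
    return "$rc"
}

# Demo with stand-in commands instead of mpirun.
run_step true && echo "step 1 ok"
run_step false || echo "step 2 failed; a real job would stop here"
```

In the job file, a line such as `mpirun n0-8 ./2cz-prog1 482 1.9584 0.00001 t` would then become `run_step mpirun n0-8 ./2cz-prog1 482 1.9584 0.00001 t || exit 1`. This does not explain the failure; it only stops the job at the first failed run so the remaining runs are not attempted.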

======================== begin of result1 ==============================
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
Batch job started.
n-1<23242> ssi:boot:base:linear: booting n0 (aurora.local)
n-1<23242> ssi:boot:base:linear: booting n1 (compute-0-0.local)
n-1<23242> ssi:boot:base:linear: booting n2 (compute-0-1.local)
n-1<23242> ssi:boot:base:linear: booting n3 (compute-0-2.local)
n-1<23242> ssi:boot:base:linear: booting n4 (compute-0-3.local)
n-1<23242> ssi:boot:base:linear: booting n5 (compute-0-4.local)
n-1<23242> ssi:boot:base:linear: booting n6 (compute-0-5.local)
n-1<23242> ssi:boot:base:linear: booting n7 (compute-0-6.local)
n-1<23242> ssi:boot:base:linear: booting n8 (compute-0-7.local)
n-1<23242> ssi:boot:base:linear: booting n9 (compute-0-8.local)
n-1<23242> ssi:boot:base:linear: booting n10 (compute-0-9.local)
n-1<23242> ssi:boot:base:linear: booting n11 (compute-0-10.local)
n-1<23242> ssi:boot:base:linear: booting n12 (compute-0-11.local)
n-1<23242> ssi:boot:base:linear: booting n13 (compute-0-12.local)
n-1<23242> ssi:boot:base:linear: booting n14 (compute-0-13.local)
n-1<23242> ssi:boot:base:linear: booting n15 (compute-0-14.local)
n-1<23242> ssi:boot:base:linear: finished

LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University

n0 aurora.cs.usm.my:1:
n1 compute-0-0.local:1:
n2 compute-0-1.local:1:
n3 compute-0-2.local:1:
n4 compute-0-3.local:1:
n5 compute-0-4.local:1:
n6 compute-0-5.local:1:
n7 compute-0-6.local:1:
n8 compute-0-7.local:1:
n9 compute-0-8.local:1:
n10 compute-0-9.local:1:
n11 compute-0-10.local:1:
n12 compute-0-11.local:1:
n13 compute-0-12.local:1:
n14 compute-0-13.local:1:
n15 compute-0-14.local:1:origin,this_node
--------------------------- RUN 1 ---------------------------
Tue Oct 9 08:48:51 MYT 2007
0.499175 prog1 n=482 nW=8 panel=15 w=1.9584 Itr=305 Theo=263 eps=1e-05 maxE=2.3613e-04 09/10/2007 Tue 08:48:51AM
0.494605 prog1 n=482 nW=10 panel=12 w=1.9584 Itr=305 Theo=263 eps=1e-05 maxE=2.3613e-04 09/10/2007 Tue 08:48:52AM
0.451702 prog1 n=482 nW=12 panel=10 w=1.9584 Itr=305 Theo=263 eps=1e-05 maxE=2.3613e-04 09/10/2007 Tue 08:48:53AM
0.558133 prog1 n=482 nW=15 panel=8 w=1.9584 Itr=305 Theo=263 eps=1e-05 maxE=2.3613e-04 09/10/2007 Tue 08:48:55AM
Tue Oct 9 08:48:55 MYT 2007
2.700609 prog1 n=962 nW=8 panel=30 w=1.9784 Itr=592 Theo=525 eps=1e-05 maxE=5.4471e-04 09/10/2007 Tue 08:48:58AM
2.189707 prog1 n=962 nW=10 panel=24 w=1.9784 Itr=592 Theo=525 eps=1e-05 maxE=5.4471e-04 09/10/2007 Tue 08:49:01AM
1.897054 prog1 n=962 nW=12 panel=20 w=1.9784 Itr=592 Theo=525 eps=1e-05 maxE=5.4471e-04 09/10/2007 Tue 08:49:04AM
bufferd (dtry_send): No child processes
mpirun (rpwait): Connection reset by peer
Broken pipe
Tue Oct 9 08:49:06 MYT 2007
-----------------------------------------------------------------------------
It seems that there is no lamd running on the host compute-0-14.local.

This indicates that the LAM/MPI runtime environment is not operating.
The LAM/MPI runtime environment is necessary for the "mpirun" command.

Please run the "lamboot" command the start the LAM/MPI runtime
environment. See the LAM/MPI documentation for how to invoke
"lamboot" across multiple machines.
-----------------------------------------------------------------------------
:
: .... The "there is no lamd running" message repeats ....
:

======================== end of result1 ==============================
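
Because the "there is no lamd running" block repeats for each later mpirun, the first failure can be hard to spot in a long result file. A minimal sketch for finding it follows, assuming a grep that supports -m (GNU and BSD grep do). The function name first_error is hypothetical, and the pattern list contains only the messages that appear in result1 above.

```shell
#!/bin/bash
# Hypothetical helper: print the line number and text of the first line in
# the given file that matches one of the failure messages seen in result1.
first_error() {
    grep -n -m 1 -E 'No child processes|Connection reset by peer|Broken pipe|no lamd running' "$1"
}

# Demo on a three-line excerpt of result1 written to a temporary file.
sample=$(mktemp)
printf '%s\n' \
    '1.897054 prog1 n=962 nW=12 panel=20 w=1.9784 Itr=592 Theo=525 eps=1e-05 maxE=5.4471e-04 09/10/2007 Tue 08:49:04AM' \
    'bufferd (dtry_send): No child processes' \
    'mpirun (rpwait): Connection reset by peer' > "$sample"
first_error "$sample"
rm -f "$sample"
```

On this excerpt the function reports line 2, the bufferd message. Run against the real output as `first_error result1`, it would point at the first failure rather than the repeated messages that follow it.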