
LAM/MPI General User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-10-09 03:23:22


Here's a guess: your job's time limit expired and PBS killed a
bunch of LAM daemons / processes, such that internal LAM communication
started failing, eventually resulting in mpirun failing because the
local lamd was already dead. Or PBS (or some other entity) killed the
lamds for some other reason.
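
If that is what happened, you can usually confirm and recover from the
shell before resubmitting. A rough sketch, assuming the hostfile path
from the job script below (N is LAM's mnemonic for "all nodes"):

  # Check whether the LAM daemons are still alive:
  lamnodes                  # fails outright if the local lamd is dead
  tping -c1 N               # pings every lamd once; errors if any died

  # If they're dead or half-dead, wipe the leftovers and reboot:
  lamwipe -v hosts/hostfile
  lamboot -v hosts/hostfile

It can also help to end job.1 with "lamhalt" so the daemons are shut
down cleanly, and to make sure the requested wall-clock limit
comfortably covers all twelve mpirun invocations.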

On Oct 9, 2007, at 3:57 AM, Daniel Ng wrote:

>
> Hi,
>
> I was running an MPI timing program (2cz-prog1) through qsub, using
> job.1 (attached at the end of this message).
> A few runs executed successfully all the way to the end of the job
> file. However, one run terminated with the following message (see
> the result1 file for the full output):
>
> ::
> bufferd (dtry_send): No child processes
> mpirun (rpwait): Connection reset by peer
> Broken pipe
> ::
>
> Then I resubmitted the same job file and it ran fine. Can someone
> enlighten me as to what caused the message above?
>
> Thanks.
> Daniel.
>
>
> ======================== begin of job.1 ==============================
> #!/bin/bash
> #$ -cwd
> #$ -o ./result1
> #$ -j y
> #$ -r y
> echo "Job started."
> lamboot -v hosts/hostfile
> lamnodes
> echo --------------------------- RUN 1 ---------------------------
> date
> # Each mpirun below creates one line of output
> mpirun n0-8 ./2cz-prog1 482 1.9584 0.00001 t
> mpirun n0-10 ./2cz-prog1 482 1.9584 0.00001 t
> mpirun n0-12 ./2cz-prog1 482 1.9584 0.00001 t
> mpirun n0-15 ./2cz-prog1 482 1.9584 0.00001 t
> date
> mpirun n0-8 ./2cz-prog1 962 1.9784 0.00001 t
> mpirun n0-10 ./2cz-prog1 962 1.9784 0.00001 t
> mpirun n0-12 ./2cz-prog1 962 1.9784 0.00001 t
> mpirun n0-15 ./2cz-prog1 962 1.9784 0.00001 t
> date
> mpirun n0-8 ./2cz-prog1 1442 1.9853 0.00001 t
> mpirun n0-10 ./2cz-prog1 1442 1.9853 0.00001 t
> mpirun n0-12 ./2cz-prog1 1442 1.9853 0.00001 t
> mpirun n0-15 ./2cz-prog1 1442 1.9853 0.00001 t
> date
> ======================== end of job.1 ==============================
>
>
> ======================== begin of result1 ==============================
> Warning: no access to tty (Bad file descriptor).
> Thus no job control in this shell.
> Batch job started.
> n-1<23242> ssi:boot:base:linear: booting n0 (aurora.local)
> n-1<23242> ssi:boot:base:linear: booting n1 (compute-0-0.local)
> n-1<23242> ssi:boot:base:linear: booting n2 (compute-0-1.local)
> n-1<23242> ssi:boot:base:linear: booting n3 (compute-0-2.local)
> n-1<23242> ssi:boot:base:linear: booting n4 (compute-0-3.local)
> n-1<23242> ssi:boot:base:linear: booting n5 (compute-0-4.local)
> n-1<23242> ssi:boot:base:linear: booting n6 (compute-0-5.local)
> n-1<23242> ssi:boot:base:linear: booting n7 (compute-0-6.local)
> n-1<23242> ssi:boot:base:linear: booting n8 (compute-0-7.local)
> n-1<23242> ssi:boot:base:linear: booting n9 (compute-0-8.local)
> n-1<23242> ssi:boot:base:linear: booting n10 (compute-0-9.local)
> n-1<23242> ssi:boot:base:linear: booting n11 (compute-0-10.local)
> n-1<23242> ssi:boot:base:linear: booting n12 (compute-0-11.local)
> n-1<23242> ssi:boot:base:linear: booting n13 (compute-0-12.local)
> n-1<23242> ssi:boot:base:linear: booting n14 (compute-0-13.local)
> n-1<23242> ssi:boot:base:linear: booting n15 (compute-0-14.local)
> n-1<23242> ssi:boot:base:linear: finished
>
> LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
>
> n0 aurora.cs.usm.my:1:
> n1 compute-0-0.local:1:
> n2 compute-0-1.local:1:
> n3 compute-0-2.local:1:
> n4 compute-0-3.local:1:
> n5 compute-0-4.local:1:
> n6 compute-0-5.local:1:
> n7 compute-0-6.local:1:
> n8 compute-0-7.local:1:
> n9 compute-0-8.local:1:
> n10 compute-0-9.local:1:
> n11 compute-0-10.local:1:
> n12 compute-0-11.local:1:
> n13 compute-0-12.local:1:
> n14 compute-0-13.local:1:
> n15 compute-0-14.local:1:origin,this_node
> --------------------------- RUN 1 ---------------------------
> Tue Oct 9 08:48:51 MYT 2007
> 0.499175 prog1 n=482 nW=8 panel=15 w=1.9584 Itr=305 Theo=263 eps=1e-05 maxE=2.3613e-04 09/10/2007 Tue 08:48:51AM
> 0.494605 prog1 n=482 nW=10 panel=12 w=1.9584 Itr=305 Theo=263 eps=1e-05 maxE=2.3613e-04 09/10/2007 Tue 08:48:52AM
> 0.451702 prog1 n=482 nW=12 panel=10 w=1.9584 Itr=305 Theo=263 eps=1e-05 maxE=2.3613e-04 09/10/2007 Tue 08:48:53AM
> 0.558133 prog1 n=482 nW=15 panel=8 w=1.9584 Itr=305 Theo=263 eps=1e-05 maxE=2.3613e-04 09/10/2007 Tue 08:48:55AM
> Tue Oct 9 08:48:55 MYT 2007
> 2.700609 prog1 n=962 nW=8 panel=30 w=1.9784 Itr=592 Theo=525 eps=1e-05 maxE=5.4471e-04 09/10/2007 Tue 08:48:58AM
> 2.189707 prog1 n=962 nW=10 panel=24 w=1.9784 Itr=592 Theo=525 eps=1e-05 maxE=5.4471e-04 09/10/2007 Tue 08:49:01AM
> 1.897054 prog1 n=962 nW=12 panel=20 w=1.9784 Itr=592 Theo=525 eps=1e-05 maxE=5.4471e-04 09/10/2007 Tue 08:49:04AM
> bufferd (dtry_send): No child processes
> mpirun (rpwait): Connection reset by peer
> Broken pipe
> Tue Oct 9 08:49:06 MYT 2007
> -----------------------------------------------------------------------------
> It seems that there is no lamd running on the host compute-0-14.local.
>
> This indicates that the LAM/MPI runtime environment is not operating.
> The LAM/MPI runtime environment is necessary for the "mpirun" command.
>
> Please run the "lamboot" command to start the LAM/MPI runtime
> environment. See the LAM/MPI documentation for how to invoke
> "lamboot" across multiple machines.
> -----------------------------------------------------------------------------
> :
> : .... The "there is no lamd running" message repeats ....
> :
>
> ======================== end of result1 ==============================
>
>
>
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/

-- 
Jeff Squyres
Cisco Systems