
LAM/MPI General User's Mailing List Archives


From: Daniel Ng (danielng52_at_[hidden])
Date: 2007-10-09 05:59:05


Hi Jeff,

Actually, that unsuccessful run was the last job in a queue of 8 jobs
that had been running for about 11 h 23 min in total.
The exact running times for the first 7 jobs were (hh:mm:ss):
      5:18:11 1:14:32 1:36:44 0:28:36 1:17:43 1:03:38 0:24:07

Is there a default total time limit set for a user to run a series of jobs?
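
In case it is relevant, here is how I would check the queue's
configured time limits (a sketch: job.1 below uses SGE-style #$
directives, so I am assuming an SGE scheduler; "all.q" is just a
placeholder queue name):

  # SGE: show the queue's wall-clock / CPU time limits
  qconf -sq all.q | grep -E 'h_rt|s_rt|h_cpu|s_cpu'
  # Rough PBS/Torque equivalent, in case PBS is what is really in play
  qstat -Q -f all.q | grep -i -E 'walltime|cput'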

----- Original Message -----
From: "Jeff Squyres" <jsquyres_at_[hidden]>
To: "General LAM/MPI mailing list" <lam_at_[hidden]>
Sent: Tuesday, October 09, 2007 3:23 PM
Subject: Re: LAM: bufferd (dtry_send): No child processes

> Here's a guess: your time limit expired on the job and PBS killed a
> bunch of LAM daemons / processes, such that internal LAM communication
> started failing, eventually resulting in mpirun failing because the
> local lamd was already dead. Or PBS (or some other entity) killed the
> lamds for some other reason.
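>
> A quick way to test that theory the next time it happens (a sketch;
> both commands ship with LAM/MPI): check whether the lamds still answer
> right before the failing mpirun, e.g.
>
>   lamnodes      # list the nodes the local lamd still knows about
>   tping -c 1 N  # one ping round-trip to every lamd in the universe
>
> If either of those hangs or errors out, the RTE was already dead
> before mpirun started.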
>
>
> On Oct 9, 2007, at 3:57 AM, Daniel Ng wrote:
>
>>
>> Hi,
>>
>> I was running an MPI timing program (2cz-prog1) via qsub with the
>> job script job.1 (attached at the end of this message).
>> Several runs executed successfully through to the end of the job file.
>> However, one job run terminated with the following message
>> (see the result1 file for the full output):
>>
>> ::
>> bufferd (dtry_send): No child processes
>> mpirun (rpwait): Connection reset by peer
>> Broken pipe
>> ::
>>
>> Then I resubmitted the same job file and it ran fine. Can someone
>> enlighten me as to what caused the message above?
>>
>> Thanks.
>> Daniel.
>>
>>
>> ======================== begin of job.1 ==============================
>> #!/bin/bash
>> #$ -cwd
>> #$ -o ./result1
>> #$ -j y
>> #$ -r y
>> echo "Job started."
>> lamboot -v hosts/hostfile
>> lamnodes
>> echo --------------------------- RUN 1 ---------------------------
>> date
>> # Each mpirun below creates one line of output
>> mpirun n0-8 ./2cz-prog1 482 1.9584 0.00001 t
>> mpirun n0-10 ./2cz-prog1 482 1.9584 0.00001 t
>> mpirun n0-12 ./2cz-prog1 482 1.9584 0.00001 t
>> mpirun n0-15 ./2cz-prog1 482 1.9584 0.00001 t
>> date
>> mpirun n0-8 ./2cz-prog1 962 1.9784 0.00001 t
>> mpirun n0-10 ./2cz-prog1 962 1.9784 0.00001 t
>> mpirun n0-12 ./2cz-prog1 962 1.9784 0.00001 t
>> mpirun n0-15 ./2cz-prog1 962 1.9784 0.00001 t
>> date
>> mpirun n0-8 ./2cz-prog1 1442 1.9853 0.00001 t
>> mpirun n0-10 ./2cz-prog1 1442 1.9853 0.00001 t
>> mpirun n0-12 ./2cz-prog1 1442 1.9853 0.00001 t
>> mpirun n0-15 ./2cz-prog1 1442 1.9853 0.00001 t
>> date
>> ======================== end of job.1 ==============================
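>>
>> (A side note on job.1: the script never shuts the RTE down, so the
>> lamds booted by one job linger until something kills them. Something
>> like the following at the end of the script would clean up; lamwipe
>> is the forceful fallback if lamhalt cannot reach the daemons:)
>>
>> # Cleanly shut down the LAM RTE; force-wipe if lamhalt fails
>> lamhalt -v || lamwipe -v hosts/hostfile
>> echo "Job finished."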
>>
>>
>> ======================== begin of result1 ==============================
>> Warning: no access to tty (Bad file descriptor).
>> Thus no job control in this shell.
>> Batch job started.
>> n-1<23242> ssi:boot:base:linear: booting n0 (aurora.local)
>> n-1<23242> ssi:boot:base:linear: booting n1 (compute-0-0.local)
>> n-1<23242> ssi:boot:base:linear: booting n2 (compute-0-1.local)
>> n-1<23242> ssi:boot:base:linear: booting n3 (compute-0-2.local)
>> n-1<23242> ssi:boot:base:linear: booting n4 (compute-0-3.local)
>> n-1<23242> ssi:boot:base:linear: booting n5 (compute-0-4.local)
>> n-1<23242> ssi:boot:base:linear: booting n6 (compute-0-5.local)
>> n-1<23242> ssi:boot:base:linear: booting n7 (compute-0-6.local)
>> n-1<23242> ssi:boot:base:linear: booting n8 (compute-0-7.local)
>> n-1<23242> ssi:boot:base:linear: booting n9 (compute-0-8.local)
>> n-1<23242> ssi:boot:base:linear: booting n10 (compute-0-9.local)
>> n-1<23242> ssi:boot:base:linear: booting n11 (compute-0-10.local)
>> n-1<23242> ssi:boot:base:linear: booting n12 (compute-0-11.local)
>> n-1<23242> ssi:boot:base:linear: booting n13 (compute-0-12.local)
>> n-1<23242> ssi:boot:base:linear: booting n14 (compute-0-13.local)
>> n-1<23242> ssi:boot:base:linear: booting n15 (compute-0-14.local)
>> n-1<23242> ssi:boot:base:linear: finished
>>
>> LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
>>
>> n0 aurora.cs.usm.my:1:
>> n1 compute-0-0.local:1:
>> n2 compute-0-1.local:1:
>> n3 compute-0-2.local:1:
>> n4 compute-0-3.local:1:
>> n5 compute-0-4.local:1:
>> n6 compute-0-5.local:1:
>> n7 compute-0-6.local:1:
>> n8 compute-0-7.local:1:
>> n9 compute-0-8.local:1:
>> n10 compute-0-9.local:1:
>> n11 compute-0-10.local:1:
>> n12 compute-0-11.local:1:
>> n13 compute-0-12.local:1:
>> n14 compute-0-13.local:1:
>> n15 compute-0-14.local:1:origin,this_node
>> --------------------------- RUN 1 ---------------------------
>> Tue Oct 9 08:48:51 MYT 2007
>> 0.499175 prog1 n=482 nW=8 panel=15 w=1.9584 Itr=305 Theo=263
>> eps=1e-05 maxE=2.3613e-04 09/10/2007 Tue 08:48:51AM
>> 0.494605 prog1 n=482 nW=10 panel=12 w=1.9584 Itr=305 Theo=263
>> eps=1e-05 maxE=2.3613e-04 09/10/2007 Tue 08:48:52AM
>> 0.451702 prog1 n=482 nW=12 panel=10 w=1.9584 Itr=305 Theo=263
>> eps=1e-05 maxE=2.3613e-04 09/10/2007 Tue 08:48:53AM
>> 0.558133 prog1 n=482 nW=15 panel=8 w=1.9584 Itr=305 Theo=263
>> eps=1e-05 maxE=2.3613e-04 09/10/2007 Tue 08:48:55AM
>> Tue Oct 9 08:48:55 MYT 2007
>> 2.700609 prog1 n=962 nW=8 panel=30 w=1.9784 Itr=592 Theo=525
>> eps=1e-05 maxE=5.4471e-04 09/10/2007 Tue 08:48:58AM
>> 2.189707 prog1 n=962 nW=10 panel=24 w=1.9784 Itr=592 Theo=525
>> eps=1e-05 maxE=5.4471e-04 09/10/2007 Tue 08:49:01AM
>> 1.897054 prog1 n=962 nW=12 panel=20 w=1.9784 Itr=592 Theo=525
>> eps=1e-05 maxE=5.4471e-04 09/10/2007 Tue 08:49:04AM
>> bufferd (dtry_send): No child processes
>> mpirun (rpwait): Connection reset by peer
>> Broken pipe
>> Tue Oct 9 08:49:06 MYT 2007
>> -----------------------------------------------------------------------------
>> It seems that there is no lamd running on the host compute-0-14.local.
>>
>> This indicates that the LAM/MPI runtime environment is not operating.
>> The LAM/MPI runtime environment is necessary for the "mpirun" command.
>>
>> Please run the "lamboot" command to start the LAM/MPI runtime
>> environment. See the LAM/MPI documentation for how to invoke
>> "lamboot" across multiple machines.
>> -----------------------------------------------------------------------------
>> :
>> : .... The "there is no lamd running" message repeats ....
>> :
>>
>> ======================== end of result1 ==============================
>>
>>
>>
>>
>
>
> --
> Jeff Squyres
> Cisco Systems
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/