
LAM/MPI General User's Mailing List Archives


From: Riju John (riju.john_at_[hidden])
Date: 2005-06-01 15:16:04


Thank you for the detailed reply. I will try what you suggested and let
you know how it goes.
Once again, thanks!

-Riju

On 5/31/05, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> Are you able to do a "tping -c 3" when this occurs? I'm wondering if
> your LAM daemons are somehow locking up. More specifically, if you can
> log in to the first node where processes from your second job have
> failed to start, can you run the command "lamnodes"? This command
> connects to the local lamd and queries some information from it -- it
> will verify whether the lamd on that node is still functioning properly.
>
> If it responds, then that lamd is working fine. If it does not
> respond, then the lamd has gone catatonic for some reason (verify that
> it's still actually running).
>
> Let me know what you find.
>
>
>
> On May 27, 2005, at 6:19 PM, Riju John wrote:
>
> > Hi Jeff,
> >
> > What I noticed was that mpirun starts and the processes start on the
> > first few nodes, but not on the remaining nodes. Those nodes have
> > only lamd running, and no application processes.
> >
> > Thanks,
> > Riju
> >
> > On 5/27/05, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> >> On May 27, 2005, at 5:32 PM, Riju John wrote:
> >>
> >>> I am running lam-7.0 on an Opteron cluster running SuSE 8.1.
> >>>
> >>> I noticed that mpirun sometimes hangs when running multiple MPI
> >>> jobs. These jobs run on 64 slave nodes and keep the system
> >>> resources fairly busy. The first job is doing a fair amount of disk
> >>> I/O when the second job starts, and the second job sometimes hangs.
> >>> This happens before even getting to MPI_Init. Has anyone seen this
> >>> kind of problem before? Is there any option to mpirun that can help
> >>> with this problem?
> >>
> >> No, I have not seen this before. Do you know if the processes start
> >> at all? I.e., do they reach main()? Or are they stuck somewhere
> >> between the beginning of main() and the beginning of MPI_INIT()?
> >>
> >> --
> >> {+} Jeff Squyres
> >> {+} jsquyres_at_[hidden]
> >> {+} http://www.lam-mpi.org/
> >>
> >> _______________________________________________
> >> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
> >>
> >
> >
>
> --
> {+} Jeff Squyres
> {+} jsquyres_at_[hidden]
> {+} http://www.lam-mpi.org/
>
>
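The diagnostic sequence Jeff describes above can be sketched as the following commands, to be run on a node where the second job's processes failed to start. This is a sketch assuming a standard LAM 7.x installation with its tools (tping, lamnodes) on the PATH and a LAM universe already booted; the exact output depends on the cluster.

```shell
# Ping every node in the LAM universe three times each. The "N" argument
# is LAM's node-set notation for "all nodes"; if this hangs, one or more
# lamd daemons are not responding.
tping -c 3 N

# Query the local lamd for the nodes it knows about. If this hangs or
# errors out, the local lamd has gone catatonic.
lamnodes

# Verify that the lamd process itself is still alive on this node.
ps ax | grep '[l]amd'
```

If lamnodes responds but the second job still will not start, the lamd itself is likely fine and the hang is elsewhere in process startup.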