On Mar 28, 2006, at 9:05 AM, j_reichel_at_[hidden] wrote:
> I am trying to integrate LAM with SGE 6.0, but it does not work
> correctly.
> I have a startlam script and I added a new Parallel Environment to
> SGE.
> But after submitting the job there is no result.
> I think there is a problem with the lamboot command.
> I started it with the -d option to see what happens.
> When I look at the log file I can see that the lamd daemon is
> started on all the nodes of the cluster.
>
> But the last part of the log file says that there is no lamd on
> the head node.
<snip>
> n-1<10054> ssi:boot:rsh: starting on n0 (ppc207): hboot -t -c lam-conf.lamd -d -sessionsuffix sge-78-undefined -I -H 141.35.13.107 -P 32789 -n 0 -o 0
<snip>
> n-1<10054> ssi:boot: Closing
> -----------------------------------------------------------------------------
> It seems that there is no lamd running on the host ppc207.
>
> This indicates that the LAM/MPI runtime environment is not operating.
> The LAM/MPI runtime environment is necessary for the "lamhalt"
> command.
>
> Please run the "lamboot" command the start the LAM/MPI runtime
> environment. See the LAM/MPI documentation for how to invoke
> "lamboot" across multiple machines.
> -----------------------------------------------------------------------------
It looks like the lamboot completed successfully, and then lamhalt
was the command that actually failed to run. As the error message
says, this is because it couldn't find a lamd running on host
ppc207. As we can see from the log file, lamboot certainly started a
lamd on the node. Can you log in to that node and see if a lamd is
actually running? There are a couple of possible issues:
* For some reason, lamhalt isn't picking up the session suffix from
  SGE. This usually happens when you log in to a node in a way that
  bypasses the batch scheduler, but have used the batch scheduler to
  start LAM/MPI. See the lamboot man page for more information on the
  session suffix.
* SGE decided for some reason to kill the lamd. I don't know much
  about SGE, so I can't comment on the likelihood of this scenario.
  But a quick 'ps' should shed some light on this possibility.
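
Both checks can be sketched as a small script to run on ppc207. This is only a rough sketch under a few assumptions: the function names are made up for illustration, the suffix format sge-<JOB_ID>-<SGE_TASK_ID> is inferred from the "sge-78-undefined" value in the log above, and LAM_MPI_SESSION_SUFFIX is assumed to be the override variable described in the lamboot man page. Check the man page for your LAM version before relying on it.

```shell
#!/bin/sh
# Sketch: rebuild the session suffix seen in the log ("sge-78-undefined")
# from SGE's job variables, then look for a lamd on this node.

# Assumption: LAM builds the SGE suffix as sge-<JOB_ID>-<SGE_TASK_ID>.
# (hypothetical helper, not part of LAM or SGE)
sge_session_suffix() {
    # $1 = job id, $2 = task id (defaults to "undefined", as in the log)
    printf 'sge-%s-%s\n' "$1" "${2:-undefined}"
}

# List lamd-related processes owned by the current user.
# The [l] in the pattern keeps grep from matching itself.
list_lamd() {
    ps -u "$(id -un)" -o pid= -o args= | grep '[l]amd' || echo "no lamd found"
}

# Inside an SGE job JOB_ID is set; outside of one, this part is skipped.
if [ -n "$JOB_ID" ]; then
    LAM_MPI_SESSION_SUFFIX=$(sge_session_suffix "$JOB_ID" "$SGE_TASK_ID")
    export LAM_MPI_SESSION_SUFFIX
    echo "session suffix: $LAM_MPI_SESSION_SUFFIX"
fi

list_lamd
```

If 'ps' shows a lamd whose hboot arguments carry a session suffix, but lamhalt run from your login shell still reports no lamd, the suffix mismatch in the first bullet is the likely cause. If no lamd shows up at all, the second bullet is the more likely one.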
Brian
--
Brian Barrett
LAM/MPI developer and all around nice guy
Have a LAM/MPI day: http://www.lam-mpi.org/