Hi Amey!
It seems that the problems start even before lamboot. As you can see in
the attached script, I ask for 2 processors (210 are available), but the
output file says I got only 1 processor.
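
To make the mismatch easier to spot, here is a minimal Python sketch that
pulls the host and CPU counts out of "lamboot -d" output. It is only an
illustration: the function name summarize_tm_hosts and the two regular
expressions are assumptions of mine, modelled on the "found the following
... hosts" and "(cpu=...)" lines in the output below, and are not part of
LAM or PBS.

```python
import re

# Patterns modelled on the tm boot module lines in the lamboot -d output.
# They are assumptions based on this one log, not a documented format.
HOSTS_RE = re.compile(r"ssi:boot:tm: found the following (\d+) hosts?:")
NODE_RE = re.compile(r"ssi:boot:tm: (n\d+) (\S+) \(cpu=(\d+)\)")


def summarize_tm_hosts(log_text):
    """Return (announced host count, [(node, host, cpus)], total cpus)."""
    announced = None
    nodes = []
    for line in log_text.splitlines():
        match = HOSTS_RE.search(line)
        if match:
            announced = int(match.group(1))
            continue
        match = NODE_RE.search(line)
        if match:
            nodes.append((match.group(1), match.group(2), int(match.group(3))))
    total_cpus = sum(cpus for _, _, cpus in nodes)
    return announced, nodes, total_cpus


# Two lines copied from the output attached to this message.
SAMPLE = """\
n0<11968> ssi:boot:tm: found the following 1 hosts:
n0<11968> ssi:boot:tm: n0 s02n01.yggdrasil.mek.dtu.dk (cpu=1)
"""

if __name__ == "__main__":
    announced, nodes, total_cpus = summarize_tm_hosts(SAMPLE)
    print("hosts announced:", announced)
    print("nodes:", nodes)
    print("total cpus:", total_cpus)
```

If the total is lower than the number of processors requested from PBS,
the allocation was already short before lamboot did anything.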
Best regards, Jess
On Wed, 2003-12-03 at 17:17, Amey Dharurkar wrote:
>
> Hi,
>
> LAM never runs as root, so it can't change the user/group id. Can you
> clarify the error further (does it occur during lamboot or mpirun)? If
> you are getting an error during lamboot, then send the output of "lamboot
> -d" so we can find the exact cause.
>
> Hope this helps.
>
> Amey S. Dharurkar
> ----------------------------------------------------------
> Graduate Student, Indiana University
> Ph. O:(812)855-3609, H:(812)331-8203
>
> On Tue, 3 Dec 2003, jess michelsen wrote:
>
> >
> > Hi everyone!
> >
> > I'm having a slight problem running LAM-MPI jobs under PBSpro. The
> > processes/nodes ranking higher than 0 respond with an error when they
> > attempt to start:
> >
> > user not in group
> >
> > I'm currently talking to the guys at PBSpro, and they are wondering
> > whether LAM for some reason could change the user/group id when it
> > spawns processes (does LAM spawn any processes? I believed PBS starts 1
> > process on each node and I don't specify lamd). Authentication is by
> > PublicKey, the cluster uses NIS, and the user's working directory is nfs
> > mounted. LAM 7.0.2 is configured to employ PBSpro's tm. Platform is RH
> > 8.0. Jobs behave excellently when run manually, i.e. lamboot; mpirun.
> >
> > Best regards, Jess Michelsen
> >
> >
> >
> >
> > _______________________________________________
> > This list is archived at http://www.lam-mpi.org/MailArchives/lam/
> >
test1
Working directory is /home/afmjam/EllipSys2D/ProfileTest
Running on host s02n01.yggdrasil.mek.dtu.dk
s02n01
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/pbs.50.tyr.yggdrasil.mek.dtu.dk/lam-afmjam_at_[hidden]/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/pbs.50.tyr.yggdrasil.mek.dtu.dk/lam-afmjam_at_[hidden]/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/pbs.50.tyr.yggdrasil.mek.dtu.dk/lam-afmjam_at_[hidden]/lam-io-socket
tkill: f_kill = "/tmp/pbs.50.tyr.yggdrasil.mek.dtu.dk/lam-afmjam_at_[hidden]/lam-killfile"
tkill: nothing to kill: "/tmp/pbs.50.tyr.yggdrasil.mek.dtu.dk/lam-afmjam_at_[hidden]/lam-killfile"
LAM 7.0.2/MPI 2 C++ - Indiana University
--------------------------------------
Running PBS epilogue script
Killing processes of user afmjam on the batch nodes
--------------------------------------
Running PBS epilogue script
Killing processes of user afmjam on the batch nodes
Doing node s02n01
Doing node s02n01
Done.
Done.
n0<11968> ssi:boot: Opening
n0<11968> ssi:boot: opening module globus
n0<11968> ssi:boot: initializing module globus
n0<11968> ssi:boot:globus: globus-job-run not found, globus boot will not run
n0<11968> ssi:boot: module not available: globus
n0<11968> ssi:boot: opening module rsh
n0<11968> ssi:boot: initializing module rsh
n0<11968> ssi:boot:rsh: module initializing
n0<11968> ssi:boot:rsh:agent: ssh
n0<11968> ssi:boot:rsh:username: <same>
n0<11968> ssi:boot:rsh:verbose: 1000
n0<11968> ssi:boot:rsh:algorithm: linear
n0<11968> ssi:boot:rsh:priority: 10
n0<11968> ssi:boot: module available: rsh, priority: 10
n0<11968> ssi:boot: opening module tm
n0<11968> ssi:boot: initializing module tm
n0<11968> ssi:boot:tm: module initializing
n0<11968> ssi:boot:tm:verbose: 1000
n0<11968> ssi:boot:tm:priority: 75
n0<11968> ssi:boot: module available: tm, priority: 75
n0<11968> ssi:boot: finalizing module globus
n0<11968> ssi:boot:globus: finalizing
n0<11968> ssi:boot: closing module globus
n0<11968> ssi:boot: finalizing module rsh
n0<11968> ssi:boot:rsh: finalizing
n0<11968> ssi:boot: closing module rsh
n0<11968> ssi:boot: Selected boot module tm
n0<11968> ssi:boot:tm: found the following 1 hosts:
n0<11968> ssi:boot:tm: n0 s02n01.yggdrasil.mek.dtu.dk (cpu=1)
n0<11968> ssi:boot:tm: starting RTE procs
n0<11968> ssi:boot:base:linear_windowed: starting
n0<11968> ssi:boot:base:linear_windowed: window size: 5
n0<11968> ssi:boot:base:server: opening server TCP socket
n0<11968> ssi:boot:base:server: opened port 32897
n0<11968> ssi:boot:base:linear_windowed: booting n0 (s02n01.yggdrasil.mek.dtu.dk)
n0<11968> ssi:boot:tm: starting wipe on (s02n01.yggdrasil.mek.dtu.dk)
n0<11968> ssi:boot:tm: starting on n0 (s02n01.yggdrasil.mek.dtu.dk): /usr/lam/bin//tkill -setsid -d
n0<11968> ssi:boot:tm: successfully launched on n0 (s02n01.yggdrasil.mek.dtu.dk)
n0<11968> ssi:boot:tm: waiting for completion on n0 (s02n01.yggdrasil.mek.dtu.dk)
n0<11968> ssi:boot:tm: finished on n0 (s02n01.yggdrasil.mek.dtu.dk)
n0<11968> ssi:boot:tm: starting lamd on (s02n01.yggdrasil.mek.dtu.dk)
n0<11968> ssi:boot:tm: starting on n0 (s02n01.yggdrasil.mek.dtu.dk): /usr/lam/bin//lamd -H 172.16.2.1 -P 32897 -n 0 -o 0 -d
n-1<11970> ssi:boot: Opening
n-1<11970> ssi:boot: opening module globus
n-1<11970> ssi:boot: initializing module globus
n-1<11970> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<11970> ssi:boot: module not available: globus
n-1<11970> ssi:boot: opening module rsh
n-1<11970> ssi:boot: initializing module rsh
n-1<11970> ssi:boot:rsh: module initializing
n-1<11970> ssi:boot:rsh:agent: ssh
n-1<11970> ssi:boot:rsh:username: <same>
n-1<11970> ssi:boot:rsh:verbose: 1000
n-1<11970> ssi:boot:rsh:algorithm: linear
n-1<11970> ssi:boot:rsh:priority: 10
n-1<11970> ssi:boot: module available: rsh, priority: 10
n-1<11970> ssi:boot: opening module tm
n-1<11970> ssi:boot: initializing module tm
n-1<11970> ssi:boot:tm: module initializing
n-1<11970> ssi:boot:tm:verbose: 1000
n-1<11970> ssi:boot:tm:priority: 75
n-1<11970> ssi:boot: module available: tm, priority: 75
n-1<11970> ssi:boot: finalizing module globus
n-1<11970> ssi:boot:globus: finalizing
n-1<11970> ssi:boot: closing module globus
n-1<11970> ssi:boot: finalizing module rsh
n-1<11970> ssi:boot:rsh: finalizing
n-1<11970> ssi:boot: closing module rsh
n-1<11970> ssi:boot: Selected boot module tm
n0<11968> ssi:boot:tm: successfully launched on n0 (s02n01.yggdrasil.mek.dtu.dk)
n0<11968> ssi:boot:base:linear_windowed: finished launching
n0<11968> ssi:boot:base:server: expecting connection from finite list
n0<11968> ssi:boot:base:server: got connection from 172.16.2.1
n0<11968> ssi:boot:base:server: this connection is expected (n0)
n0<11968> ssi:boot:base:server: remote lamd is at 172.16.2.1:32798
n0<11968> ssi:boot:base:server: closing server socket
n0<11968> ssi:boot:base:server: connecting to lamd at 172.16.2.1:32901
n0<11968> ssi:boot:base:server: connected
n0<11968> ssi:boot:base:server: sending number of links (1)
n0<11968> ssi:boot:base:server: sending info: n0 (s02n01.yggdrasil.mek.dtu.dk)
n0<11968> ssi:boot:base:server: finished sending
n0<11968> ssi:boot:base:server: disconnected from 172.16.2.1:32901
n0<11968> ssi:boot:base:linear_windowed: finished
n0<11968> ssi:boot:tm: all RTE procs started
n0<11968> ssi:boot:tm: finalizing
n0<11968> ssi:boot: Closing
n-1<11970> ssi:boot:tm: finalizing
n-1<11970> ssi:boot: Closing
mpirun: cannot start ./MPItest1 on n0 (o): No such file or directory
Usage: ssh [options] host [command]
Options:
-l user Log in using this user name.
-n Redirect input from /dev/null.
-F config Config file (default: ~/.ssh/config).
-A Enable authentication agent forwarding.
-a Disable authentication agent forwarding (default).
-X Enable X11 connection forwarding.
-x Disable X11 connection forwarding (default).
-i file Identity for public key authentication (default: ~/.ssh/identity)
-t Tty; allocate a tty even if command is given.
-T Do not allocate a tty.
-v Verbose; display verbose debugging messages.
Multiple -v increases verbosity.
-V Display version number only.
-P Don't allocate a privileged port.
-q Quiet; don't display any warning messages.
-f Fork into background after authentication.
-e char Set escape character; ``none'' = disable (default: ~).
-c cipher Select encryption algorithm
-m macs Specify MAC algorithms for protocol version 2.
-p port Connect to this port. Server must be on the same port.
-L listen-port:host:port Forward local port to remote address
-R listen-port:host:port Forward remote port to local address
These cause ssh to listen for connections on a port, and
forward them to the other side by connecting to host:port.
-D port Enable dynamic application-level port forwarding.
-C Enable compression.
-N Do not execute a shell or command.
-g Allow remote hosts to connect to forwarded ports.
-1 Force protocol version 1.
-2 Force protocol version 2.
-4 Use IPv4 only.
-6 Use IPv6 only.
-o 'option' Process the option as if it was read from a configuration file.
-s Invoke command (mandatory) as SSH2 subsystem.
-b addr Local IP address.
Usage: ssh [options] host [command]
Options:
-l user Log in using this user name.
-n Redirect input from /dev/null.
-F config Config file (default: ~/.ssh/config).
-A Enable authentication agent forwarding.
-a Disable authentication agent forwarding (default).
-X Enable X11 connection forwarding.
-x Disable X11 connection forwarding (default).
-i file Identity for public key authentication (default: ~/.ssh/identity)
-t Tty; allocate a tty even if command is given.
-T Do not allocate a tty.
-v Verbose; display verbose debugging messages.
Multiple -v increases verbosity.
-V Display version number only.
-P Don't allocate a privileged port.
-q Quiet; don't display any warning messages.
-f Fork into background after authentication.
-e char Set escape character; ``none'' = disable (default: ~).
-c cipher Select encryption algorithm
-m macs Specify MAC algorithms for protocol version 2.
-p port Connect to this port. Server must be on the same port.
-L listen-port:host:port Forward local port to remote address
-R listen-port:host:port Forward remote port to local address
These cause ssh to listen for connections on a port, and
forward them to the other side by connecting to host:port.
-D port Enable dynamic application-level port forwarding.
-C Enable compression.
-N Do not execute a shell or command.
-g Allow remote hosts to connect to forwarded ports.
-1 Force protocol version 1.
-2 Force protocol version 2.
-4 Use IPv4 only.
-6 Use IPv6 only.
-o 'option' Process the option as if it was read from a configuration file.
-s Invoke command (mandatory) as SSH2 subsystem.
-b addr Local IP address.