LAM/MPI General User's Mailing List Archives

From: jerome lefevre (jlefevre_at_[hidden])
Date: 2005-06-27 06:48:44


Hardware : Dual-Opteron 246 Motherboard Tyan S2885 + gigabit
OS : Fedora Core 2 i386 + OSCAR 4.0
Cluster : 4 Nodes + 1 Front-end
www.ird.nc/UR65/ROMS

This post has also been sent to the OSCAR forum.

Hi LAM community,

I am having trouble with PBS, Maui and LAM. I would like to manage jobs with PBS, but I am struggling with the TM boot module.
Note that if I run a job with the traditional sequence "lamboot", "mpirun", "lamhalt", it succeeds and all nodes compute.
However, sometimes lamd processes seem to remain on my nodes (I suppose). For example, when I restart a job with the same executable but with new arrays, the job fails, as if the old arrays had not been properly cleared from memory.
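
One way to confirm the suspicion about leftover daemons is to count lamd processes on each node after "lamhalt" has returned. The following is a minimal sketch, not part of the original setup: the helper name count_lamd is hypothetical, and it assumes its input is a list of process names, one per line, such as the output of `ps -e -o comm=`.

```shell
#!/bin/sh
# count_lamd (hypothetical helper): read process names, one per line, on
# stdin and print how many of them are exactly "lamd".
count_lamd() {
    awk '$1 == "lamd" { n++ } END { print n + 0 }'
}
```

For example, `ssh node1 ps -e -o comm= | count_lamd` (node1 is a placeholder hostname) should print 0 once the LAM universe is down. A nonzero count after lamhalt would support the stale-lamd theory; in that case running lamwipe with the same boot schema before the next lamboot is worth trying.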

So I hope PBS will let me manage my jobs cleanly. But I have had no luck with LAM and PBS :)...

Here is a summary of my tests, with some log output:

1 - With test_cluster (the OSCAR script), the output is always:
------------------------------------------------------------------------
>./test_cluster
Performing root tests...
Starting MAUI Scheduler: [ OK ]
Could not start maui, please check configuration and rerun tests
Maui service check:maui [FAILED]
PBS node check [PASSED]
Starting PBS Server: pbs_server: another server running
                                                           [ÉCHOUÉ]
Could not start pbs_server, please check configuration and rerun tests
PBS service check:pbs_server [FAILED]
There were issues running some root test scripts. Please check your logs
/home mounts [PASSED]
 Preparing user tests...
Performing user tests...
SSH ping test [PASSED]
SSH server->node [PASSED]
SSH node->server [PASSED]
LAM/MPI (via PBS) [FAILED]
Ganglia test [FAILED]
MPICH (via PBS) [FAILED]
PVM (via PBS) [FAILED]
PBS default queue definition [PASSED]
PBS Shell Test [FAILED]
There were issues running some user test scripts. Please check your logs
------------------------------------------------------------------------

If I submit a job with PBS, the pbs_server log shows:
------------------------------------------------------------------------
06/27/2005 10:05:06;0001;PBS_Server;Svr;PBS_Server;Connection refused (111) in contact_sched, Could not contact Scheduler
06/27/2005 10:06:06;0001;PBS_Server;Svr;PBS_Server;Connection refused (111) in contact_sched, Could not contact Scheduler
06/27/2005 10:07:06;0001;PBS_Server;Svr;PBS_Server;Connection refused (111) in contact_sched, Could not contact Scheduler
------------------------------------------------------------------------

and qstat shows my job permanently queued.
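
To see whether the scheduler is still unreachable after a configuration change, the failed contact attempts in the pbs_server log can be counted. This is a minimal sketch: the function name count_sched_failures is hypothetical, and it assumes the semicolon-separated log format shown above, with the message in the sixth field.

```shell
#!/bin/sh
# count_sched_failures (hypothetical helper): read pbs_server log lines on
# stdin and print how many report a failed attempt to contact the scheduler.
# Assumes the layout date;code;server;type;name;message.
count_sched_failures() {
    awk -F';' '$6 ~ /Could not contact Scheduler/ { n++ } END { print n + 0 }'
}
```

Usage would look like `count_sched_failures < "$PBS_SERVER_LOG"`, where PBS_SERVER_LOG is a placeholder for the path of the current server log file. If the count keeps growing by one per minute, as in the excerpt above, pbs_server still finds no scheduler listening, which would explain why the job never leaves the queue.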

2 - So I tried this: stop the Maui service and restart pbs_sched. Now I can submit PBS jobs, and the pbs_server log shows:
------------------------------------------------------------------------
06/27/2005 16:04:11;0040;PBS_Server;Svr;editr.cluster.ird.nc;Scheduler sent command time
06/27/2005 16:04:37;0040;PBS_Server;Svr;editr.cluster.ird.nc;Scheduler sent command new
06/27/2005 16:04:38;0040;PBS_Server;Svr;editr.cluster.ird.nc;Scheduler sent command recyc
06/27/2005 16:04:43;0040;PBS_Server;Svr;editr.cluster.ird.nc;Scheduler sent command term
06/27/2005 16:05:43;0040;PBS_Server;Svr;editr.cluster.ird.nc;Scheduler sent command time
06/27/2005 16:06:43;0040;PBS_Server;Svr;editr.cluster.ird.nc;Scheduler sent command time
------------------------------------------------------------------------

BUT the boot sequence is wrong. LAM sees only one node, as this output shows (lamboot -d in my PBS script):
------------------------------------------------------------------------
n-1<9598> ssi:boot:open: opening
n-1<9598> ssi:boot:open: opening boot module globus
n-1<9598> ssi:boot:open: opened boot module globus
n-1<9598> ssi:boot:open: opening boot module rsh
n-1<9598> ssi:boot:open: opened boot module rsh
n-1<9598> ssi:boot:open: opening boot module slurm
n-1<9598> ssi:boot:open: opened boot module slurm
n-1<9598> ssi:boot:open: opening boot module tm
n-1<9598> ssi:boot:open: opened boot module tm
n-1<9598> ssi:boot:select: initializing boot module tm
n-1<9598> ssi:boot:tm: module initializing
n-1<9598> ssi:boot:tm:verbose: 1000
n-1<9598> ssi:boot:tm:priority: 75
n-1<9598> ssi:boot:select: boot module available: tm, priority: 75
n-1<9598> ssi:boot:select: initializing boot module globus
n-1<9598> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<9598> ssi:boot:select: boot module not available: globus
n-1<9598> ssi:boot:select: initializing boot module rsh
n-1<9598> ssi:boot:rsh: module initializing
n-1<9598> ssi:boot:rsh:agent: ssh -x
n-1<9598> ssi:boot:rsh:username: <same>
n-1<9598> ssi:boot:rsh:verbose: 1000
n-1<9598> ssi:boot:rsh:algorithm: linear
n-1<9598> ssi:boot:rsh:no_n: 0
n-1<9598> ssi:boot:rsh:no_profile: 0
n-1<9598> ssi:boot:rsh:fast: 0
n-1<9598> ssi:boot:rsh:ignore_stderr: 0
n-1<9598> ssi:boot:rsh:priority: 10
n-1<9598> ssi:boot:select: boot module available: rsh, priority: 10
n-1<9598> ssi:boot:select: initializing boot module slurm
n-1<9598> ssi:boot:slurm: not running under SLURM
n-1<9598> ssi:boot:select: boot module not available: slurm
n-1<9598> ssi:boot:select: finalizing boot module globus
n-1<9598> ssi:boot:globus: finalizing
n-1<9598> ssi:boot:select: closing boot module globus
n-1<9598> ssi:boot:select: finalizing boot module rsh
n-1<9598> ssi:boot:rsh: finalizing
n-1<9598> ssi:boot:select: closing boot module rsh
n-1<9598> ssi:boot:select: finalizing boot module slurm
n-1<9598> ssi:boot:slurm: finalizing
n-1<9598> ssi:boot:select: closing boot module slurm
n-1<9598> ssi:boot:select: selected boot module tm
n-1<9598> ssi:boot:tm: found the following 1 hosts:
n-1<9598> ssi:boot:tm: n0 editr.cluster.ird.nc (cpu=1)
n-1<9598> ssi:boot:tm: starting RTE procs
n-1<9598> ssi:boot:base:linear_windowed: starting
n-1<9598> ssi:boot:base:linear_windowed: window size: 5
n-1<9598> ssi:boot:base:server: opening server TCP socket
n-1<9598> ssi:boot:base:server: opened port 34939
n-1<9598> ssi:boot:base:linear_windowed: booting n0 (editr.cluster.ird.nc)
n-1<9598> ssi:boot:tm: starting wipe on (editr.cluster.ird.nc)
n-1<9598> ssi:boot:tm: starting on n0 (editr.cluster.ird.nc): /opt/lam-7.1_pgi/bin/tkill -setsid -d
n-1<9598> ssi:boot:tm: successfully launched on n0 (editr.cluster.ird.nc)
n-1<9598> ssi:boot:tm: waiting for completion on n0 (editr.cluster.ird.nc)
n-1<9598> ssi:boot:tm: finished on n0 (editr.cluster.ird.nc)
n-1<9598> ssi:boot:tm: starting lamd on (editr.cluster.ird.nc)
n-1<9598> ssi:boot:tm: starting on n0 (editr.cluster.ird.nc): /opt/lam-7.1_pgi/bin/lamd -H 192.168.150.50 -P 34939 -n 0 -o 0 -d
n-1<9598> ssi:boot:tm: successfully launched on n0 (editr.cluster.ird.nc)
n-1<9598> ssi:boot:base:linear_windowed: finished launching
n-1<9598> ssi:boot:base:server: expecting connection from finite list
n-1<9600> ssi:boot:open: opening
n-1<9600> ssi:boot:open: opening boot module globus
n-1<9600> ssi:boot:open: opened boot module globus
n-1<9600> ssi:boot:open: opening boot module rsh
n-1<9600> ssi:boot:open: opened boot module rsh
n-1<9600> ssi:boot:open: opening boot module slurm
n-1<9600> ssi:boot:open: opened boot module slurm
n-1<9600> ssi:boot:open: opening boot module tm
n-1<9600> ssi:boot:open: opened boot module tm
n-1<9600> ssi:boot:select: initializing boot module tm
n-1<9600> ssi:boot:tm: module initializing
n-1<9600> ssi:boot:tm:verbose: 1000
n-1<9600> ssi:boot:tm:priority: 75
n-1<9600> ssi:boot:select: boot module available: tm, priority: 75
n-1<9600> ssi:boot:select: initializing boot module globus
n-1<9600> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<9600> ssi:boot:select: boot module not available: globus
n-1<9600> ssi:boot:select: initializing boot module rsh
n-1<9600> ssi:boot:rsh: module initializing
n-1<9600> ssi:boot:rsh:agent: ssh -x
n-1<9600> ssi:boot:rsh:username: <same>
n-1<9600> ssi:boot:rsh:verbose: 1000
n-1<9600> ssi:boot:rsh:algorithm: linear
n-1<9600> ssi:boot:rsh:no_n: 0
n-1<9600> ssi:boot:rsh:no_profile: 0
n-1<9600> ssi:boot:rsh:fast: 0
n-1<9600> ssi:boot:rsh:ignore_stderr: 0
n-1<9600> ssi:boot:rsh:priority: 10
n-1<9600> ssi:boot:select: boot module available: rsh, priority: 10
n-1<9600> ssi:boot:select: initializing boot module slurm
n-1<9600> ssi:boot:slurm: not running under SLURM
n-1<9600> ssi:boot:select: boot module not available: slurm
n-1<9600> ssi:boot:select: finalizing boot module globus
n-1<9600> ssi:boot:globus: finalizing
n-1<9600> ssi:boot:select: closing boot module globus
n-1<9600> ssi:boot:select: finalizing boot module rsh
n-1<9600> ssi:boot:rsh: finalizing
n-1<9600> ssi:boot:select: closing boot module rsh
n-1<9600> ssi:boot:select: finalizing boot module slurm
n-1<9600> ssi:boot:slurm: finalizing
n-1<9600> ssi:boot:select: closing boot module slurm
n-1<9600> ssi:boot:select: selected boot module tm
n-1<9600> ssi:boot:send_lamd: getting node ID from command line
n-1<9600> ssi:boot:send_lamd: getting agent haddr from command line
n-1<9600> ssi:boot:send_lamd: getting agent port from command line
n-1<9600> ssi:boot:send_lamd: getting node ID from command line
n-1<9600> ssi:boot:send_lamd: connecting to 192.168.150.50:34939, node id 0
n-1<9600> ssi:boot:send_lamd: sending dli_port 32797
n-1<9598> ssi:boot:base:server: got connection from 192.168.150.50
n-1<9598> ssi:boot:base:server: this connection is expected (n0)
n-1<9598> ssi:boot:base:server: remote lamd is at 192.168.150.50:32797
n-1<9598> ssi:boot:base:server: closing server socket
n-1<9598> ssi:boot:base:server: connecting to lamd at 192.168.150.50:34943
n-1<9598> ssi:boot:base:server: connected
n-1<9598> ssi:boot:base:server: sending number of links (1)
n-1<9598> ssi:boot:base:server: sending info: n0 (editr.cluster.ird.nc)
n-1<9598> ssi:boot:base:server: finished sending
n-1<9600> ssi:boot:tm: finalizing
n-1<9598> ssi:boot:base:server: disconnected from 192.168.150.50:34943
n-1<9600> ssi:boot: Closing
n-1<9598> ssi:boot:base:linear_windowed: finished
n-1<9598> ssi:boot:tm: all RTE procs started
n-1<9598> ssi:boot:tm: finalizing
n-1<9598> ssi:boot: Closing
------------------------------------------------------------------------
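
A useful cross-check is to compare the host count that the tm module reports with the number of distinct hosts PBS actually allocated to the job. The sketch below assumes the "found the following N hosts:" line format seen in the log above and a node file with one hostname per line, as PBS provides through $PBS_NODEFILE. Both function names are hypothetical.

```shell
#!/bin/sh
# tm_host_count (hypothetical helper): read "lamboot -d" output on stdin and
# print N from the first "ssi:boot:tm: found the following N hosts:" line.
tm_host_count() {
    sed -n 's/.*ssi:boot:tm: found the following \([0-9][0-9]*\) hosts:.*/\1/p' | head -n 1
}

# nodefile_host_count (hypothetical helper): print the number of distinct,
# non-empty hostnames in the node file given as $1.
nodefile_host_count() {
    awk 'NF && !seen[$1]++ { n++ } END { print n + 0 }' "$1"
}
```

Inside the job script, the lamboot output could be captured with `lamboot -d 2>&1 | tee lamboot.log`, then compared via `tm_host_count < lamboot.log` and `nodefile_host_count "$PBS_NODEFILE"`. If the node file itself lists a single host, the tm module is only reporting what PBS handed to the job, and the place to look is the resource request (for example the `-l nodes=...` line) or the scheduler, not LAM.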

I don't know what is wrong. I configured LAM 7.1 with this command:
./configure --prefix=/opt/lam-7.1_pgi --with-boot=tm --with-tm=/opt/pbs \
--with-pic --with-rsh="ssh -x" --with-prefix=memcopy --enable-shared --disable-static --enable-tv-queue

See my configure.log

What should I check?
Many thanks, best regards,

Jérôme