Hi,
OK, I'm sending you a snapshot from my shell, and I can confirm: there is
no pbs.conf.
I will read more about PBS and perhaps reinstall it.
Good day,
Jérôme
*************************************************************
[root_at_editr root]# updatedb
[root_at_editr root]# locate pbs.conf
[root_at_editr root]#
[root_at_editr root]# cexec 'updatedb'
************************* oscar_cluster *************************
--------- node1---------
Warning: No xauth data; using fake authentication data for X11 forwarding.
--------- node2---------
--------- node3---------
Warning: No xauth data; using fake authentication data for X11 forwarding.
--------- node4---------
Warning: No xauth data; using fake authentication data for X11 forwarding.
[root_at_editr root]# cexec 'locate pbs.conf'
************************* oscar_cluster *************************
--------- node1---------
Warning: No xauth data; using fake authentication data for X11 forwarding.
--------- node2---------
--------- node3---------
Warning: No xauth data; using fake authentication data for X11 forwarding.
--------- node4---------
Warning: No xauth data; using fake authentication data for X11 forwarding.
[root_at_editr root]#
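One note on the search itself: locate only sees what the last updatedb run indexed, so a live-filesystem search and a check for running daemons may be more telling. A sketch (the directories below are common OpenPBS/OSCAR defaults, not confirmed paths on this cluster):

```shell
# locate reads updatedb's database, which can miss fresh or pruned paths;
# find searches the filesystem directly. The paths are typical
# OpenPBS/OSCAR defaults -- an assumption, adjust to your install.
find /etc /opt /var/spool /usr/spool -maxdepth 4 -name 'pbs.conf' 2>/dev/null || true

# If qmgr/qstat answer at all, some PBS daemons are running and found
# their configuration somewhere; list them to confirm.
ps -e -o comm= | grep '^pbs_' || echo "no PBS daemons running here"
```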
[umr65_at_editr umr65]$ lamboot -v
LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
n-1<29068> ssi:boot:base:linear: booting n0 (node1.cluster.ird.nc)
n-1<29068> ssi:boot:base:linear: booting n1 (node2.cluster.ird.nc)
n-1<29068> ssi:boot:base:linear: booting n2 (node3.cluster.ird.nc)
n-1<29068> ssi:boot:base:linear: booting n3 (node4.cluster.ird.nc)
n-1<29068> ssi:boot:base:linear: booting n4 (editr.cluster.ird.nc)
n-1<29068> ssi:boot:base:linear: finished
[umr65_at_editr umr65]$ lamhalt
LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
[umr65_at_editr umr65]$ qsub -lnodes=2 -I
qsub: waiting for job 47.editr.cluster.ird.nc to start
Do you wish to terminate the job and exit (y|[n])? y
Job 47.editr.cluster.ird.nc is being deleted
[umr65_at_editr umr65]$ qstat -f
Job Id: 47.editr.cluster.ird.nc
Job_Name = STDIN
Job_Owner = umr65_at_[hidden]
job_state = Q
queue = workq
server = editr.cluster.ird.nc
Checkpoint = u
ctime = Thu Jun 30 17:04:02 2005
Error_Path = editr.cluster.ird.nc:/home/umr65/STDIN.e47
exec_host = node1.cluster.ird.nc/0+editr.cluster.ird.nc/0
Hold_Types = n
interactive = True
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Thu Jun 30 17:04:15 2005
Output_Path = editr.cluster.ird.nc:/home/umr65/STDIN.o47
Priority = 0
qtime = Thu Jun 30 17:04:02 2005
Rerunable = True
Resource_List.cput = 10000:00:00
Resource_List.ncpus = 1
Resource_List.nodect = 2
Resource_List.nodes = 2
Resource_List.walltime = 10000:00:00
Variable_List = PBS_O_HOME=/home/umr65,PBS_O_LANG=fr_FR.UTF-8,
PBS_O_LOGNAME=umr65,
PBS_O_PATH=/opt/intel_fc_81/bin:/usr/pgi/linux86/5.2/bin:/opt/intel_fc
_81/bin:/usr/pgi/linux86/5.2/bin:/usr/kerberos/bin:/opt/lam-7.1_pgi/bin
:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/opt/env-switcher/bin:/opt
/kernel_picker/bin:/opt/pvm3/lib:/opt/pvm3/lib/LINUX:/opt/pvm3/bin/LINU
X:/opt/c3-4/:/opt/pbs/bin:/opt/pbs/lib/xpbs/bin:/opt/ferret_V58/bin:/op
t/netcdf-3.6_pgi/bin:/opt/NCO_300/bin:/home/umr65/bin:./:/opt/ferret_V5
8/bin:/opt/netcdf-3.6_pgi/bin:/opt/NCO_300/bin,
PBS_O_MAIL=/var/spool/mail/umr65,PBS_O_SHELL=/bin/bash,
PBS_O_HOST=editr.cluster.ird.nc,PBS_O_WORKDIR=/home/umr65,
PBS_O_QUEUE=workq
comment = Job started on Thu Jun 30 at 17:04
etime = Thu Jun 30 17:04:02 2005
[umr65_at_editr umr65]$
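For what it's worth, the qstat output above does show the two-node request registered (Resource_List.nodes = 2, nodect = 2, and exec_host lists two hosts); in OpenPBS, ncpus is a separate resource whose default is 1, so ncpus=1 by itself may not be the fault. A hedged sketch of checks one might run on the head node (the ":ppn=" syntax is OpenPBS/Torque-style and assumed here; the commands are guarded so they are a no-op where the PBS client tools are absent):

```shell
# "nodes" and "ncpus" are independent resources in OpenPBS; ncpus=1 is
# the default and does not contradict a nodes=2 request. To make the
# per-node processor count explicit (whether this build honors :ppn=
# is an assumption), one would submit:
#
#     qsub -l nodes=2:ppn=1 -I
#
# If the job stays queued anyway, check that each compute node's pbs_mom
# is up and that the queue is enabled and started:
if command -v pbsnodes >/dev/null 2>&1; then
    pbsnodes -a    # look for "state = free" on every compute node
    qstat -Q       # the queue should show Ena = yes and Str = yes
else
    echo "PBS client tools not found in PATH"
fi
```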
********************************************************************
> The PBS config files only appear on one machine (the head node?); make
> sure to check for them on the relevant node. However, the output from
> qmgr and qstat means that PBS is configured somehow -- I'm guessing
> that you're looking for the config files in the wrong place.
>
> The big question is this:
>
> > > But if I ask for a PBS job in interactive mode, like this:
> > >
> > > [umr65_at_editr SCRATCH]$ qsub -lnodes=2 -I
> > > qsub: waiting for job 43.editr.cluster.ird.nc to start
> > >
> > > After a long time, PBS is still waiting... If I check with "qstat -f",
> > > "Resource_List.ncpus" is always equal to 1. However, I asked for 2
> > > nodes! What is wrong? Do you want any other logs or output?
>
> Do you ever get a job shell? From this text, it's not clear if your
> prior tests were with this same qsub line or a different command line
> (because you said that your prior tests *ran*, but unexpectedly only
> with one node).
>
> You never answered my questions about what you meant with your problems
> with lamd's possibly remaining on a node after the job completed. So
> I'm assuming that I either misunderstood the question, or it's somehow
> no longer a problem.
>
> So I hand this thread off to the OSCAR list and someone who can answer
> PBS questions...
>
> Once you can reliably get a PBS job with the right number of nodes, try
> lamboot again; I'm guessing that LAM will do the Right Thing (i.e.,
> running "lamnodes" after lamboot will show all the nodes in your job).
> If it doesn't, post back to the LAM list, but please include a direct
> cut-n-paste from your shell output showing all the steps you took that
> result in lamboot not using all the nodes in your job. Thanks!
>
> --
> {+} Jeff Squyres
> {+} jsquyres_at_[hidden]
> {+} http://www.lam-mpi.org/
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/