LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-06-28 06:45:35


On Jun 28, 2005, at 7:35 AM, jerome lefevre wrote:

> You will find my configure.log in this post.

Thanks.

> Some details, first about the scheduler:
>
> With Maui running, PBS_server always complains with this message:
> 06/27/2005 10:14:06;0001;PBS_Server;Svr;PBS_Server;Connection refused (111) in contact_sched, Could not contact Scheduler

You'll need to ask the OSCAR list about this error -- I'm not a PBS
expert.

> If I stop MAUI and start PBS_SCHED, PBS_SERVER tells me:
> 06/27/2005 16:04:37;0040;PBS_Server;Svr;editr.cluster.ird.nc;Scheduler sent command new
> 06/27/2005 16:04:38;0040;PBS_Server;Svr;editr.cluster.ird.nc;Scheduler sent command recyc
> 06/27/2005 16:04:43;0040;PBS_Server;Svr;editr.cluster.ird.nc;Scheduler sent command term
> 06/27/2005 16:05:43;0040;PBS_Server;Svr;editr.cluster.ird.nc;Scheduler sent command time
>
> So, in the next test, I stop the MAUI scheduler and keep PBS_SCHED
> running.

LAM will not care whether you are running the PBS scheduler or the Maui
scheduler; it does not interact with the scheduler. It only interacts
with the PBS MOMs themselves (i.e., *after* all scheduling decisions
have been made). So from LAM's point of view, whichever scheduler you
get running is fine.

> Now, if you look at the laminfo output, we see: SSI boot: tm (API v1.1,
> Module v1.1)
> Can we presume the PBS and LAM 7.1 interface is correct?

Probably, especially since you showed that it worked in your first post
(i.e., it did a lamboot successfully using TM). However, I would have
liked to see the full output from laminfo to see the other modules and
other configuration information.
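
For a quick check like the one quoted above, a small filter over
laminfo's output can confirm that the tm boot module is listed. This is
only a sketch: has_tm_boot is a hypothetical helper name, and it assumes
laminfo prints a line containing "SSI boot: tm", as in the quoted output.

```shell
#!/bin/sh
# has_tm_boot is a hypothetical helper, not part of LAM/MPI.
# It reads laminfo-style text on stdin and succeeds (exit status 0)
# only if a line mentioning the tm boot module is present.
has_tm_boot() {
    grep -q 'SSI boot: tm'
}

# Example use (assumes laminfo is in your PATH):
#   laminfo | has_tm_boot && echo "tm boot module is available"
```

This only answers the narrow question of whether the module was built;
the full laminfo output is still the more useful thing to post.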

> But if I request a PBS job in interactive mode, like this:
>
> [umr65_at_editr SCRATCH]$ qsub -lnodes=2 -I
> qsub: waiting for job 43.editr.cluster.ird.nc to start
>
> After a long time, PBS is still waiting... If I check with "qstat -f",
> "Resource_List.ncpus" is always equal to 1. However, I asked for 2
> nodes! What is wrong? Do you want other logs or output?

This also seems to be a PBS problem, not a LAM problem. LAM will *use*
PBS, but if PBS is configured incorrectly (e.g., PBS is only assigning
you 1 node when you asked for 2), LAM cannot fix this -- it can only
use the outputs from PBS. More specifically, PBS does all the
scheduling and assignments -- once the job starts and all decisions
have been made, LAM simply uses the results of those decisions to run
your parallel application.
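
One way to see what PBS actually handed to a job, once the job does
start, is to count the distinct hosts in the node file before running
lamboot. The following is a rough sketch, assuming Torque has set the
usual PBS_NODEFILE environment variable inside the job;
count_pbs_nodes is a made-up helper name, not a LAM or PBS command.

```shell
#!/bin/sh
# count_pbs_nodes is a hypothetical helper, not part of LAM or PBS.
# It prints the number of distinct hosts in a PBS-style node file
# (one hostname per line, possibly repeated for multiple processors).
# With no argument it falls back to $PBS_NODEFILE (assumed to be set
# by PBS/Torque inside a running job).
count_pbs_nodes() {
    nodefile="${1:-${PBS_NODEFILE:-}}"
    if [ -z "$nodefile" ] || [ ! -r "$nodefile" ]; then
        echo "no readable node file (are you inside a PBS job?)" >&2
        return 1
    fi
    # Collapse duplicate hostnames, then count the non-empty lines.
    sort -u "$nodefile" | grep -c .
}
```

If this prints 1 inside a job that was submitted with -lnodes=2, the
assignment went wrong on the PBS side, before LAM was ever involved.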

The OSCAR list is probably your best bet for answers to these questions
(because OSCAR will have configured and set up Torque on your cluster).
They'll probably ask for more configuration information about your
Torque setup to see if something went wrong during the OSCAR
initialization.

Two more items:

- You did not answer my questions about your LAM-specific problems
(statements about lamds being left around, etc.). Are you having
problems like this, or are they all a symptom of PBS/Torque not running
correctly?

- Is there any reason you compiled/installed LAM manually rather than
using the LAM that comes with OSCAR? Did you do it simply to upgrade?
(OSCAR still ships with v7.0.6, mainly because I have not had the
cycles to upgrade the LAM package in OSCAR. I'll probably wait until
after LAM/MPI v7.1.2 ships.)

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/