
LAM/MPI General User's Mailing List Archives

From: etienne gondet (etienne.gondet_at_[hidden])
Date: 2003-08-06 08:50:34


    Dear Jeff,

    Thanks for all those clues.

The problem must be linked to lamboot: although it looks like it is
working, it is not. The daemon is not started on the remote node.

22:> lamboot -v boot_schema

LAM 7.0/MPI 2 C++/ROMIO - Indiana University

n0<247384> ssi:boot:base:linear: booting n0 (hpca2501)
n0<247384> ssi:boot:base:linear: booting n1 (hpca2301)
n0<247384> ssi:boot:base:linear: finished
23:> echo $?
0
24:> rsh hpca2301
*******************************************************************************
*  Welcome to ECMWF HPC Phase 1A Machine
*
*  Please report any problems to the calldesk x2303 (++ 44 118 9499303)
*
*  Users please note there are new Loadleveler classes: ns np
*
*  From Feb 12th you will no longer be able to run jobs in the original
*  classes parallel, any, epilog, bigmem.
*
*  please use class ns (for single process jobs) or
*  class np (for multi-process jobs)
*******************************************************************************
Last login: Wed Aug 6 13:35:42 GMT 2003 on /dev/pts/0 from hpca2501

2:> ps -edf | grep -i lmz
     lmz 32558 68206 2 13:45:40 pts/0 0:00 ps -edf
     lmz 50516 68206 0 13:45:40 pts/0 0:00 grep -i lmz
     lmz 68206 70612 0 13:44:41 pts/0 0:00 -ksh

With the 6.5.X series I was used to seeing the daemon with ps, and this
output at rsh time was not a problem.
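
(For reference, the boot_schema above is just a plain LAM host file
listing the nodes. With the two machines from the output it looks
something like the following; the cpu= fields are optional and the
counts shown here are placeholders, not the real configuration:)

    hpca2501 cpu=1
    hpca2301 cpu=1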

Jeff Squyres wrote:

>On Wed, 6 Aug 2003, etienne gondet wrote:
>
>
>
>>>- in the case where you do not use an app schema, you have the maxprocs
>>>argument of MPI_Comm_spawn set to 1, meaning that only one "block"
>>>executable is launched.
>>>
>>>
>> I don't understand what "twofold" means, but on that IBM SP4 the block
>>is never started on the other node. I go with rsh to that other node
>>and I never see a process called block with ps -edf, and the driver
>>process is in a deadlock. The problem is before the pingpong and
>>
>>
>
>Are you sure that block is not spawned on the current node? Without an
>app schema, it should be launched on n0, not n1. There could be a
>buffering issue such that you simply do not see all the output before the
>deadlock.
>
>
    Yes, it is not.
    With no app schema the block is started on the same node and gives
back some output on stdout.
    Usually with 6.5.9 on the SP4 I had no buffering troubles such as on
the VPP or NEC.

>
>
>>relates to process management, not to the communication protocol and buffers.
>>I understand that, because I spawn a single-process block, it should deadlock
>>later in the pingpong after 64k. But I just reduced the number of
>>processes just in case.
>>
>>
>
>I would suggest a few things:
>
>- lower your message size to less than 64k (say, 20 bytes).
>- make the argument to MPI_Comm_spawn (in the cases without app schemas)
> be 2, not 1.
>
>
    I tried with 4.
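
    Roughly, the spawn call is of this shape (a minimal sketch only; the
executable name "block", the short message, and the lack of error
handling are illustrative assumptions, not copied from the real program):

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm intercomm;
    int errcodes[4];
    char msg[20] = "ping";

    MPI_Init(&argc, &argv);

    /* maxprocs raised to 4, instead of 1 ("block" is an assumed name) */
    MPI_Comm_spawn("block", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &intercomm, errcodes);

    /* small message, well under 64k; assumes the spawned block posts a
       matching MPI_Recv on the parent intercommunicator */
    MPI_Send(msg, 20, MPI_CHAR, 0, 0, intercomm);

    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}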

>Unless something else is going wrong, either or both of these should allow
>your program to complete.
>
>
>
>>>lamhalt will hang for a while if any of the LAM daemons have already died.
>>>
>>>
>> A very long while.
>>
>>
>
>It should only wait for about 15 seconds before giving up. Can you attach
>a debugger and see where exactly lamhalt is stuck?
>
>(you may need to recompile LAM with -g to get useful information here)
>
>
    I will recompile LAM with -g and try to attach a debugger in the
following days.

>
>