LAM/MPI General User's Mailing List Archives

From: Brian W. Barrett (brbarret_at_[hidden])
Date: 2008-09-02 10:56:21


Hi -

Unfortunately, the only advice I can give is similar to what I gave last
time. Something is preventing LAM from allocating enough System V
shared memory. The fact that nothing shows up in ipcs is a little
unusual, but I can't offer any solutions to the problem.
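
One way to double-check the "nothing shows up in ipcs" observation is to count
the shared memory segments a given user owns on each node. The following is a
minimal sketch, not a LAM/MPI tool: the helper name count_user_segments is
made up for illustration, and it assumes the Linux `ipcs -m` column order
(key, shmid, owner, ...), which may differ on other systems.

```shell
#!/bin/sh
# Sketch only: the helper name is illustrative, not part of LAM/MPI.

# count_user_segments USER
# Reads `ipcs -m` style output on stdin and prints how many shared memory
# segments are owned by USER. Assumes the Linux column order
# key, shmid, owner, ...; data rows are recognized by a key starting with 0x.
count_user_segments() {
    awk -v user="$1" '$1 ~ /^0x/ && $3 == user { n++ } END { print n + 0 }'
}

# Typical use on a compute node (skipped if ipcs is not installed).
if command -v ipcs >/dev/null 2>&1; then
    me=$(id -un)
    echo "segments owned by $me: $(ipcs -m | count_user_segments "$me")"
fi
```

If the count is nonzero after all jobs have ended, the leftover segments can
be removed one at a time with `ipcrm -m <shmid>`. Note that ipcs may truncate
long user names in the owner column, so a zero count is only a hint.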

LAM/MPI needs 4-8MB of System V shared memory and a couple of System V
semaphores per process to use the SYSV or USYSV rpis. If this is not
available, then there's not much we can do.
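
As a rough illustration of checking whether that much shared memory can be
requested at all, the sketch below compares the Linux per-segment ceiling in
/proc/sys/kernel/shmmax with a required size in bytes. The function names are
hypothetical, the 8 MB default is simply the upper figure mentioned above, and
passing this check does not guarantee that the sysv RPI will initialize (other
limits, such as semaphores, are not examined).

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of LAM/MPI.

# shm_limit_ok REQUIRED_BYTES SHMMAX_BYTES
# Succeeds (exit 0) when SHMMAX_BYTES >= REQUIRED_BYTES. awk does the
# comparison because shmmax can exceed the shell's integer range.
shm_limit_ok() {
    awk -v need="$1" -v max="$2" 'BEGIN { exit !((max + 0) >= (need + 0)) }'
}

# report_shm_limit REQUIRED_BYTES
# Reads the Linux limit from /proc and prints a one-line verdict.
report_shm_limit() {
    limit_file=/proc/sys/kernel/shmmax
    if [ ! -r "$limit_file" ]; then
        echo "cannot read $limit_file (not Linux?)"
        return 2
    fi
    shmmax=$(cat "$limit_file")
    if shm_limit_ok "$1" "$shmmax"; then
        echo "ok: shmmax=$shmmax >= $1"
    else
        echo "too small: shmmax=$shmmax < $1"
        return 1
    fi
}

# 8388608 bytes = 8 MB; pass a different size as the first argument.
report_shm_limit "${1:-8388608}" || echo "(check exited with status $?)"
```

On Linux, `ipcs -l` prints the same shared memory limits together with the
semaphore limits, which is a quicker way to eyeball them on a single node.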

If rebooting helped last time, I'd try it again. But I'd also find a
Linux sysadmin with more expertise than I have to figure out what's going
on.

Good luck,

Brian

On Mon, 1 Sep 2008, Naveed Durrani wrote:

> I am again facing the problem for which I requested help previously. My
> last post was
> "Dear respected members,
> Hi,
> I am new to LAM and MPI and am facing the following problem. I request you
> to guide me.
> I have been using LAM without any trouble for my mpi code until recently.
> My script uses PBS and is as follows:
> #!/bin/sh
> #PBS -l nodes=2:ppn=2
> #PBS -l walltime=100:00:00
> lamboot -v
> cd /data/rundir 
> mpirun -ssi rpi sysv -np 4 program_exe <input > output
>  
> Without my changing anything, I now get the following message with this
> script:
>  
> n0<17669> ssi:boot:base:linear_windowed: booting n0 (node18)
> n0<17669> ssi:boot:base:linear_windowed: booting n1 (node17)
> -----------------------------------------------------------------------------
> The selected RPI failed to initialize during MPI_INIT.  This is a
> fatal error; I must abort.
> This occurred on host node18 (n0).
> The PID of failed process was 17675 (MPI_COMM_WORLD rank: 0)
> -----------------------------------------------------------------------------
> -----------------------------------------------------------------------------
> One of the processes started by mpirun has exited with a nonzero exit
> code.  This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.
> PID 17676 failed on node n0 (192.168.101.18) with exit status 1.
> -----------------------------------------------------------------------------
>  
> If I exclude the '-ssi rpi sysv' option from my script, my program runs
> fine. I have gone through the FAQs and tried what I could (removing
> leftover ipcs entries, lamclean, etc.) but to no avail. Do I need to
> restart my cluster? (That may be difficult, as other people are using it
> for jobs of a different nature.) Please guide me about this. Any
> suggestions will be welcome.
> "
> I am now running all my simulations without the '-ssi rpi sysv'
> option. The simulations are running okay at the moment, but I am
> still not sure whether it is okay to run a LAM/MPI-based program
> without this option.
>  
>  
> On 7/7/08, Brian W. Barrett <brbarret_at_[hidden]> wrote:
> It is likely ok now.  I'd guess that there was something a
> little off in the configuration of the node that was causing
> the failures.  System V shared memory can be really touchy,
> and problems with it can be really hard to diagnose. If you're able to
> start running, there shouldn't be any danger of problems further
> down the line.
>
> Brian
>
>
> On Mon, 7 Jul 2008, Endee wrote:
>
> Note: My code is running well again, without any change on my part, with
> mpirun -ssi rpi sysv -np 4 application <in >out .
> However, I shall go through the sequence of what happened previously. I am
> still not sure what went wrong or why it is okay now.
> 1. $ laminfo
>          LAM/MPI: 7.0.4
>           Prefix: /home/mep02hx/applic/lam-7.0
>     Architecture: i686-pc-linux-gnu
>    Configured by: mep02hx
>    Configured on: Fri Apr 16 14:31:37 BST 2004
>   Configure host: bluegrid
>       C bindings: yes
>     C++ bindings: yes
>  Fortran bindings: yes
>      C profiling: yes
>    C++ profiling: yes
> Fortran profiling: yes
>    ROMIO support: yes
>     IMPI support: no
>    Debug support: no
>     Purify clean: no
>         SSI boot: globus (Module v0.5)
>         SSI boot: rsh (Module v1.0)
>         SSI boot: tm (Module v1.0)
>         SSI coll: lam_basic (Module v7.0)
>         SSI coll: smp (Module v1.0)
>          SSI rpi: crtcp (Module v1.0.1)
>          SSI rpi: lamd (Module v7.0)
>          SSI rpi: sysv (Module v7.0)
>          SSI rpi: tcp (Module v7.0)
>          SSI rpi: usysv (Module v7.0)
> 2. I submit my batch script batch.sh
> #!/bin/sh
> #PBS -l nodes=4:ppn=2
> #PBS -l walltime=100:00:00
> lamboot -v
> cd /rundir
> mpirun -ssi rpi sysv -np 4 exefile < in > out
>
> # --above script ends---
> This script generates batch.sh.exxx and batch.sh.oxxx files in addition to
> the output file out. batch.sh.oxxx gives "LAM 7.0.4/MPI 2 C++/ROMIO - Indiana
> University" as before (which comes from 'lamboot -v' and shows that LAM has
> started). However, unlike the previous batch.sh.exxx, which was
> "n0<9246> ssi:boot:base:linear_windowed: booting n0 (node10)
> n0<9246> ssi:boot:base:linear_windowed: booting n1 (node9)"
> showing successful booting on the nodes, I now get the following message
> (detailed output obtained by adding the '-ssi rpi_verbose 1' option to
> the same script):
>
> n0<17991> ssi:boot:base:linear_windowed: booting n0 (node18)
> n0<17991> ssi:boot:base:linear_windowed: booting n1 (node17)
> n0<17997> ssi:rpi:sysv: module initializing
> n0<17997> ssi:rpi:sysv:pollyield: 1
> n0<17998> ssi:rpi:sysv: module initializing
> n0<17998> ssi:rpi:sysv:pollyield: 1
> n0<17998> ssi:rpi:sysv:short: 8192 bytes
> n0<17997> ssi:rpi:sysv:short: 8192 bytes
> n0<17998> ssi:rpi:sysv:shmpoolsize: 16777216 bytes
> n0<17997> ssi:rpi:sysv:shmpoolsize: 16777216 bytes
> n0<17998> ssi:rpi:sysv:shmmaxalloc: 65536 bytes
> n0<17997> ssi:rpi:sysv:shmmaxalloc: 65536 bytes
> n0<17998> ssi:rpi:tcp:short: 65536 bytes
> n0<17997> ssi:rpi:tcp:short: 65536 bytes
> n1<14838> ssi:rpi:sysv: module initializing
> n1<14838> ssi:rpi:sysv:pollyield: 1
> n1<14838> ssi:rpi:sysv:short: 8192 bytes
> -----------------------------------------------------------------------------
> The selected RPI failed to initialize during MPI_INIT.  This is a
> fatal error; I must abort.
>
> This occurred on host node18 (n0).
>
> The PID of failed process was 17997 (MPI_COMM_WORLD rank: 0)
> -----------------------------------------------------------------------------
> n1<14838> ssi:rpi:sysv:shmpoolsize: 16777216 bytes
> n1<14838> ssi:rpi:sysv:shmmaxalloc: 65536 bytes
> n1<14838> ssi:rpi:tcp:short: 65536 bytes
> n1<14837> ssi:rpi:sysv: module initializing
> n1<14837> ssi:rpi:sysv:pollyield: 1
> n1<14837> ssi:rpi:sysv:short: 8192 bytes
> n1<14837> ssi:rpi:sysv:shmpoolsize: 16777216 bytes
> n1<14837> ssi:rpi:sysv:shmmaxalloc: 65536 bytes
> n1<14837> ssi:rpi:tcp:short: 65536 bytes
>
> -----------------------------------------------------------------------------
> One of the processes started by mpirun has exited with a nonzero exit
> code.  This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.
>
> PID 17998 failed on node n0 (192.168.101.18) with exit status 1.
>
>
>
> 3. I removed '-ssi rpi sysv' from the script in 2 above and submitted the
> job. The program has been running without any problem for the last 12 hours!
> I have just rechecked the original script of 2 (above), and it now works
> with '-ssi rpi sysv'. I don't know why my program keeps running with
> correct output without '-ssi rpi sysv', or why it was failing with it.
>
> 4. I have studied OpenMP, but I am at the end of my PhD. I will implement
> it afterwards, as I am pressed for time and would prefer to get results
> with my present code.
>
> Thanks again for being so kind to guide me.
> Best regards,
> ND
>
>
>
> On 07/07/2008, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>
> You seem to be having multiple different errors:
>
> 1. a command line problem where mpirun thinks that "sysv" is the
> executable to run
> 2. a problem compiling the sysv RPI
> 3. a problem getting the sysv RPI to initialize (which I'm not sure how
> you got to this point, given #1 and #2)
>
> I think that these errors have gotten mixed up and muddled in the thread so
> far.  Can you send all the information listed here:
>
> http://www.lam-mpi.org/using/support/
>
> Specifically, can you show exactly all the steps you are following (to
> include all commands) for each error?
>
> Thanks!
>
>
>
>
>
> --
> Jeff Squyres
> Cisco Systems
>
>
>
>
> _______________________________________________
> This list is archived at
> http://www.lam-mpi.org/MailArchives/lam/
>
>
>
> --
>  Brian Barrett
>  LAM/MPI Developer
>  Make today a LAM/MPI day!
>
>
>
>
> --
> Naveed Iqbal Durrani
>

-- 
   Brian Barrett
   LAM/MPI Developer
   Make today a LAM/MPI day!