
LAM/MPI General User's Mailing List Archives


From: Brian W. Barrett (brbarret_at_[hidden])
Date: 2008-07-07 16:55:43


It is likely OK now. I'd guess that something was a little off in
the configuration of the node, and that was causing the failures. System V
shared memory can be really touchy, and problems with it can be really hard
to diagnose. If you're able to start running, there shouldn't be any danger
of problems further down the line.
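
If it does happen again, a rough sketch like the one below might help narrow
things down on the affected node. It assumes a Linux node (the /proc path and
ipcs are system tools, not part of LAM), and shm_pool_fits is only a
hypothetical helper name. It compares the 16777216-byte shmpoolsize from the
verbose output quoted below against the kernel's shmmax limit and lists the
existing shared memory segments. Treat it as a starting point, not an
official LAM diagnostic.

```shell
#!/bin/sh
# Sketch only: compare the sysv RPI pool size against the kernel limit.
# shm_pool_fits is a hypothetical helper, not a LAM/MPI command.

# Returns 0 if a pool of $1 bytes fits within a limit of $2 bytes.
# awk is used so that very large limits do not overflow shell arithmetic.
shm_pool_fits() {
    awk -v pool="$1" -v limit="$2" 'BEGIN { exit !(pool + 0 <= limit + 0) }'
}

POOL_BYTES=16777216   # shmpoolsize reported with -ssi rpi_verbose 1

# /proc/sys/kernel/shmmax is Linux-specific; skip the check if it is absent.
if [ -r /proc/sys/kernel/shmmax ]; then
    SHMMAX=$(cat /proc/sys/kernel/shmmax)
    if shm_pool_fits "$POOL_BYTES" "$SHMMAX"; then
        echo "shmmax ($SHMMAX) is large enough for the pool ($POOL_BYTES)"
    else
        echo "shmmax ($SHMMAX) is smaller than the pool ($POOL_BYTES)"
    fi
fi

# List existing System V shared memory segments; segments left behind by
# earlier runs are one thing worth looking for.
if command -v ipcs >/dev/null 2>&1; then
    ipcs -m || true
fi
```

Running this on each node named in the failure message (node18 in the log
below) shows whether the limit or leftover segments differ from the nodes
that work.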

Brian

On Mon, 7 Jul 2008, Endee wrote:

> Note: My code is running well again, without any change made by me, with
> mpirun -ssi rpi sysv -np 4 application <in >out . However, I shall go through
> the sequence of what happened previously. I am still not sure what went
> wrong or why it is okay now.
> 1. $ laminfo
> LAM/MPI: 7.0.4
> Prefix: /home/mep02hx/applic/lam-7.0
> Architecture: i686-pc-linux-gnu
> Configured by: mep02hx
> Configured on: Fri Apr 16 14:31:37 BST 2004
> Configure host: bluegrid
> C bindings: yes
> C++ bindings: yes
> Fortran bindings: yes
> C profiling: yes
> C++ profiling: yes
> Fortran profiling: yes
> ROMIO support: yes
> IMPI support: no
> Debug support: no
> Purify clean: no
> SSI boot: globus (Module v0.5)
> SSI boot: rsh (Module v1.0)
> SSI boot: tm (Module v1.0)
> SSI coll: lam_basic (Module v7.0)
> SSI coll: smp (Module v1.0)
> SSI rpi: crtcp (Module v1.0.1)
> SSI rpi: lamd (Module v7.0)
> SSI rpi: sysv (Module v7.0)
> SSI rpi: tcp (Module v7.0)
> SSI rpi: usysv (Module v7.0)
> 2. I submit my batch script batch.sh
> #!/bin/sh
> #PBS -l nodes=4:ppn=2
> #PBS -l walltime=100:00:00
> lamboot -v
> cd /rundir
> mpirun -ssi rpi sysv -np4 exefile < in > out
>
> # --above script ends---
> This script generates batch.sh.exxx and batch.sh.oxxx files in addition to
> the output file out. batch.sh.oxxx gives "LAM 7.0.4/MPI 2 C++/ROMIO - Indiana
> University" as before (which comes from 'lamboot -v' and shows that LAM has
> started). However, unlike the previous batch.sh.exxx, which was
> "n0<9246> ssi:boot:base:linear_windowed: booting n0 (node10)
> n0<9246> ssi:boot:base:linear_windowed: booting n1 (node9)"
> showing successful booting on the nodes, now I get the following message
> (the output is detailed because of the '-ssi rpi_verbose 1' option in the
> same script):
>
> n0<17991> ssi:boot:base:linear_windowed: booting n0 (node18)
> n0<17991> ssi:boot:base:linear_windowed: booting n1 (node17)
> n0<17997> ssi:rpi:sysv: module initializing
> n0<17997> ssi:rpi:sysv:pollyield: 1
> n0<17998> ssi:rpi:sysv: module initializing
> n0<17998> ssi:rpi:sysv:pollyield: 1
> n0<17998> ssi:rpi:sysv:short: 8192 bytes
> n0<17997> ssi:rpi:sysv:short: 8192 bytes
> n0<17998> ssi:rpi:sysv:shmpoolsize: 16777216 bytes
> n0<17997> ssi:rpi:sysv:shmpoolsize: 16777216 bytes
> n0<17998> ssi:rpi:sysv:shmmaxalloc: 65536 bytes
> n0<17997> ssi:rpi:sysv:shmmaxalloc: 65536 bytes
> n0<17998> ssi:rpi:tcp:short: 65536 bytes
> n0<17997> ssi:rpi:tcp:short: 65536 bytes
> n1<14838> ssi:rpi:sysv: module initializing
> n1<14838> ssi:rpi:sysv:pollyield: 1
> n1<14838> ssi:rpi:sysv:short: 8192 bytes
> -----------------------------------------------------------------------------
> The selected RPI failed to initialize during MPI_INIT. This is a
> fatal error; I must abort.
>
> This occurred on host node18 (n0).
>
> The PID of failed process was 17997 (MPI_COMM_WORLD rank: 0)
> -----------------------------------------------------------------------------
> n1<14838> ssi:rpi:sysv:shmpoolsize: 16777216 bytes
> n1<14838> ssi:rpi:sysv:shmmaxalloc: 65536 bytes
> n1<14838> ssi:rpi:tcp:short: 65536 bytes
> n1<14837> ssi:rpi:sysv: module initializing
> n1<14837> ssi:rpi:sysv:pollyield: 1
> n1<14837> ssi:rpi:sysv:short: 8192 bytes
> n1<14837> ssi:rpi:sysv:shmpoolsize: 16777216 bytes
> n1<14837> ssi:rpi:sysv:shmmaxalloc: 65536 bytes
> n1<14837> ssi:rpi:tcp:short: 65536 bytes
>
> -----------------------------------------------------------------------------
> One of the processes started by mpirun has exited with a nonzero exit
> code. This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.
>
> PID 17998 failed on node n0 (192.168.101.18) with exit status 1.
>
>
>
> 3. I removed the '-ssi rpi sysv' from the script in 2 above and resubmitted
> it. The program has been running without any problem for the last 12 hours!
> I have now just rechecked the original script from 2 (above), and it is working
> with '-ssi rpi sysv'. I don't know why my program runs with correct output
> without '-ssi rpi sysv', or why it was failing with it.
>
> 4. I have studied OpenMP, but I am at the end of my PhD. I will implement
> it afterwards, as I am pressed for time and would prefer to get results with
> my present code.
>
> Thanks again for being so kind as to guide me.
> Best regards,
> ND
>
>
>
> On 07/07/2008, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>>
>> You seem to be having multiple different errors:
>>
>> 1. a command line problem where mpirun thinks that "sysv" is the executable
>> to run
>> 2. a problem compiling the sysv RPI
>> 3. a problem getting the sysv RPI to initialize (which I'm not sure how you
>> got to this point, given #1 and #2)
>>
>> I think that these errors have gotten mixed up and muddled in the thread so
>> far. Can you send all the information listed here:
>>
>> http://www.lam-mpi.org/using/support/
>>
>> Specifically, can you show exactly all the steps you are following (to
>> include all commands) for each error?
>>
>> Thanks!
>>
>>
>>
>>
>>>
>>> --
>>> Jeff Squyres
>>> Cisco Systems
>>
>>
>>
>> _______________________________________________
>> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>>
>

-- 
   Brian Barrett
   LAM/MPI Developer
   Make today a LAM/MPI day!