
LAM/MPI General User's Mailing List Archives


From: Naveed Durrani (ndurrani_at_[hidden])
Date: 2008-09-01 01:49:05


I am again facing the problem for which I requested help previously. My
last post was:
"Dear respected members,
Hi,
I am new to LAM and MPI and am facing the following problem; I would be
grateful for your guidance.
I have been using LAM without any trouble for my MPI code until recently.
My script uses PBS and is as follows:
#!/bin/sh
# Request 2 nodes with 2 processors per node, and 100 hours of walltime
#PBS -l nodes=2:ppn=2
#PBS -l walltime=100:00:00
# Boot the LAM run-time environment on the nodes PBS allocated
lamboot -v
cd /data/rundir
# Run 4 processes, forcing the System V shared-memory RPI
mpirun -ssi rpi sysv -np 4 program_exe < input > output

Without my having changed anything, I am now getting the following message
with this script:

n0<17669> ssi:boot:base:linear_windowed: booting n0 (node18)
n0<17669> ssi:boot:base:linear_windowed: booting n1 (node17)
-----------------------------------------------------------------------------
The selected RPI failed to initialize during MPI_INIT. This is a
fatal error; I must abort.
This occurred on host node18 (n0).
The PID of failed process was 17675 (MPI_COMM_WORLD rank: 0)
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.
PID 17676 failed on node n0 (192.168.101.18) with exit status 1.
-----------------------------------------------------------------------------
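
A likely cause of this particular message: during MPI_INIT the sysv RPI
allocates a System V shared-memory pool, and if the kernel's shmmax limit
is below the pool size, or stale segments left by crashed jobs exhaust the
system-wide limits, the allocation fails and typically surfaces as exactly
this error. A quick check (a sketch; the 16777216-byte figure is the
shmpoolsize reported in the verbose output later in this thread):

# Show the kernel's System V shared-memory limits
ipcs -lm
# shmmax must be at least the RPI's pool size (16777216 bytes here)
cat /proc/sys/kernel/shmmax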

If I exclude '-ssi rpi sysv' from my script, my program runs fine. I have
gone through the FAQs and have, to the best of my ability, tried removing
stale segments with ipcs/ipcrm, running lamclean, and so on, but to no
avail. Do I need to restart my cluster? (That may be difficult, as other
people are using it for jobs of a different nature.) Please guide me; any
suggestions are welcome.
"
I am now running all my simulations without the '-ssi rpi sysv' option.
The simulations are running fine at the moment, but I am still not sure
whether leaving this option out is acceptable for a LAM/MPI-based
program.
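
For what it's worth, omitting the option should be safe: when no RPI is
forced on the command line, LAM selects one of the available RPI modules
itself at MPI_INIT (the laminfo output below lists the candidates). If a
shared-memory transport is still wanted, the spin-lock variant can be
requested instead of the semaphore-based sysv one; a sketch, reusing the
file names from the script above:

# usysv: shared memory with spin locks instead of SysV semaphores
mpirun -ssi rpi usysv -np 4 program_exe < input > output
# tcp: no shared memory at all; slower on-node, but avoids shm limits
mpirun -ssi rpi tcp -np 4 program_exe < input > output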

On 7/7/08, Brian W. Barrett <brbarret_at_[hidden]> wrote:
>
> It is likely OK now. I'd guess that there was something a little off in
> the configuration of the node that was causing the failures. System V
> shared memory can be really touchy, and its problems can be really hard
> to diagnose. If you're able to start running, there shouldn't be any
> danger of problems further down the line.
>
> Brian
>
>
> On Mon, 7 Jul 2008, Endee wrote:
>
>> Note: My code is running well again, without any change made by me,
>> with mpirun -ssi rpi sysv -np 4 application <in >out. However, I shall
>> go through the sequence of what happened previously; I am still not
>> sure what went wrong or why it is okay now.
>> 1. $ laminfo
>> LAM/MPI: 7.0.4
>> Prefix: /home/mep02hx/applic/lam-7.0
>> Architecture: i686-pc-linux-gnu
>> Configured by: mep02hx
>> Configured on: Fri Apr 16 14:31:37 BST 2004
>> Configure host: bluegrid
>> C bindings: yes
>> C++ bindings: yes
>> Fortran bindings: yes
>> C profiling: yes
>> C++ profiling: yes
>> Fortran profiling: yes
>> ROMIO support: yes
>> IMPI support: no
>> Debug support: no
>> Purify clean: no
>> SSI boot: globus (Module v0.5)
>> SSI boot: rsh (Module v1.0)
>> SSI boot: tm (Module v1.0)
>> SSI coll: lam_basic (Module v7.0)
>> SSI coll: smp (Module v1.0)
>> SSI rpi: crtcp (Module v1.0.1)
>> SSI rpi: lamd (Module v7.0)
>> SSI rpi: sysv (Module v7.0)
>> SSI rpi: tcp (Module v7.0)
>> SSI rpi: usysv (Module v7.0)
>> 2. I submit my batch script, batch.sh:
>> #!/bin/sh
>> # Request 4 nodes with 2 processors per node, and 100 hours of walltime
>> #PBS -l nodes=4:ppn=2
>> #PBS -l walltime=100:00:00
>> # Boot the LAM run-time environment on the allocated nodes
>> lamboot -v
>> cd /rundir
>> # Run 4 processes, forcing the System V shared-memory RPI
>> mpirun -ssi rpi sysv -np 4 exefile < in > out
>>
>> # --above script ends---
>> This script generates batch.sh.exxx and batch.sh.oxxx files in
>> addition to the output file out. batch.sh.oxxx contains "LAM 7.0.4/MPI
>> 2 C++/ROMIO - Indiana University" as before (this comes from 'lamboot
>> -v' and shows that LAM has started). However, whereas the previous
>> batch.sh.exxx was
>> "n0<9246> ssi:boot:base:linear_windowed: booting n0 (node10)
>> n0<9246> ssi:boot:base:linear_windowed: booting n1 (node9)"
>> showing successful booting of the nodes, I now get the following
>> message (the output is more detailed because of the '-ssi rpi_verbose
>> 1' option in the same script):
>>
>> n0<17991> ssi:boot:base:linear_windowed: booting n0 (node18)
>> n0<17991> ssi:boot:base:linear_windowed: booting n1 (node17)
>> n0<17997> ssi:rpi:sysv: module initializing
>> n0<17997> ssi:rpi:sysv:pollyield: 1
>> n0<17998> ssi:rpi:sysv: module initializing
>> n0<17998> ssi:rpi:sysv:pollyield: 1
>> n0<17998> ssi:rpi:sysv:short: 8192 bytes
>> n0<17997> ssi:rpi:sysv:short: 8192 bytes
>> n0<17998> ssi:rpi:sysv:shmpoolsize: 16777216 bytes
>> n0<17997> ssi:rpi:sysv:shmpoolsize: 16777216 bytes
>> n0<17998> ssi:rpi:sysv:shmmaxalloc: 65536 bytes
>> n0<17997> ssi:rpi:sysv:shmmaxalloc: 65536 bytes
>> n0<17998> ssi:rpi:tcp:short: 65536 bytes
>> n0<17997> ssi:rpi:tcp:short: 65536 bytes
>> n1<14838> ssi:rpi:sysv: module initializing
>> n1<14838> ssi:rpi:sysv:pollyield: 1
>> n1<14838> ssi:rpi:sysv:short: 8192 bytes
>>
>> -----------------------------------------------------------------------------
>> The selected RPI failed to initialize during MPI_INIT. This is a
>> fatal error; I must abort.
>>
>> This occurred on host node18 (n0).
>>
>> The PID of failed process was 17997 (MPI_COMM_WORLD rank: 0)
>>
>> -----------------------------------------------------------------------------
>> n1<14838> ssi:rpi:sysv:shmpoolsize: 16777216 bytes
>> n1<14838> ssi:rpi:sysv:shmmaxalloc: 65536 bytes
>> n1<14838> ssi:rpi:tcp:short: 65536 bytes
>> n1<14837> ssi:rpi:sysv: module initializing
>> n1<14837> ssi:rpi:sysv:pollyield: 1
>> n1<14837> ssi:rpi:sysv:short: 8192 bytes
>> n1<14837> ssi:rpi:sysv:shmpoolsize: 16777216 bytes
>> n1<14837> ssi:rpi:sysv:shmmaxalloc: 65536 bytes
>> n1<14837> ssi:rpi:tcp:short: 65536 bytes
>>
>>
>> -----------------------------------------------------------------------------
>> One of the processes started by mpirun has exited with a nonzero exit
>> code. This typically indicates that the process finished in error.
>> If your process did not finish in error, be sure to include a "return
>> 0" or "exit(0)" in your C code before exiting the application.
>>
>> PID 17998 failed on node n0 (192.168.101.18) with exit status 1.
>>
>>
>>
>> 3. I removed '-ssi rpi sysv' from the script above in 2 and submitted
>> the job. The program has been running without any problem for the last
>> 12 hours! Now I have just rechecked the original script from 2 (above),
>> and it is working with '-ssi rpi sysv'. I don't know why my program
>> without '-ssi rpi sysv' is still running with correct output, nor why
>> it was failing with that option before.
>>
>> 4. I have studied OpenMP, but I am at the end of my PhD; I will take
>> that up afterwards, as I am pressed for time and prefer to get results
>> with my present code.
>>
>> Thanks again for being so kind to guide me.
>> Best regards,
>> ND
>>
>>
>>
>> On 07/07/2008, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>>
>>>
>>> You seem to be having multiple different errors:
>>>
>>> 1. a command-line problem where mpirun thinks that "sysv" is the
>>> executable to run
>>> 2. a problem compiling the sysv RPI
>>> 3. a problem getting the sysv RPI to initialize (I'm not sure how you
>>> got to this point, given #1 and #2)
>>>
>>> I think that these errors have gotten mixed up and muddled in the
>>> thread so far. Can you send all the information listed here:
>>>
>>> http://www.lam-mpi.org/using/support/
>>>
>>> Specifically, can you show exactly all the steps you are following
>>> (including all commands) for each error?
>>>
>>> Thanks!
>>>
>>>> --
>>>> Jeff Squyres
>>>> Cisco Systems
>>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>>>
>>>
>>
> --
> Brian Barrett
> LAM/MPI Developer
> Make today a LAM/MPI day!
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>

-- 
Naveed Iqbal Durrani