LAM/MPI General User's Mailing List Archives

From: Endee (nd1977_at_[hidden])
Date: 2008-07-07 14:18:38


Note: My code is running well again, without any change on my part, with
mpirun -ssi rpi sysv -np 4 application <in >out . However, I shall go through
the sequence of what happened previously. I am still not sure what went
wrong or why it is okay now.
1. $ laminfo
           LAM/MPI: 7.0.4
            Prefix: /home/mep02hx/applic/lam-7.0
      Architecture: i686-pc-linux-gnu
     Configured by: mep02hx
     Configured on: Fri Apr 16 14:31:37 BST 2004
    Configure host: bluegrid
        C bindings: yes
      C++ bindings: yes
  Fortran bindings: yes
       C profiling: yes
     C++ profiling: yes
 Fortran profiling: yes
     ROMIO support: yes
      IMPI support: no
     Debug support: no
      Purify clean: no
          SSI boot: globus (Module v0.5)
          SSI boot: rsh (Module v1.0)
          SSI boot: tm (Module v1.0)
          SSI coll: lam_basic (Module v7.0)
          SSI coll: smp (Module v1.0)
           SSI rpi: crtcp (Module v1.0.1)
           SSI rpi: lamd (Module v7.0)
           SSI rpi: sysv (Module v7.0)
           SSI rpi: tcp (Module v7.0)
           SSI rpi: usysv (Module v7.0)
2. I submit my batch script batch.sh
#!/bin/sh
#PBS -l nodes=4:ppn=2
#PBS -l walltime=100:00:00
lamboot -v
cd /rundir
mpirun -ssi rpi sysv -np4 exefile < in > out

# --above script ends---
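
A minimal sketch of how the batch script could record which step failed. The helper name run_step is hypothetical and is not part of LAM or PBS; it only wraps a command and reports a nonzero exit status on stderr, which under PBS ends up in the job's error file.

```shell
# Hypothetical helper: run a command and report on stderr if it fails.
# Usage: run_step LABEL COMMAND [ARGS...]
run_step() {
    label="$1"
    shift
    "$@"
    status=$?
    if [ "$status" -ne 0 ]; then
        echo "step '$label' failed with exit status $status" >&2
    fi
    return "$status"
}

# Intended use inside batch.sh (left as comments, since it needs LAM):
#   run_step lamboot lamboot -v || exit 1
#   cd /rundir || exit 1
#   run_step mpirun mpirun -ssi rpi sysv -np 4 exefile < in > out
```

With this in place, a failure in lamboot stops the job before mpirun is attempted, so the two kinds of error are not mixed in the same output.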
This script generates the files batch.sh.exxx and batch.sh.oxxx in addition to
the output file out. batch.sh.oxxx contains "LAM 7.0.4/MPI 2 C++/ROMIO - Indiana
University" as before (this comes from 'lamboot -v' and shows that LAM has
started). However, unlike the previous batch.sh.exxx, which was
"n0<9246> ssi:boot:base:linear_windowed: booting n0 (node10)
 n0<9246> ssi:boot:base:linear_windowed: booting n1 (node9)"
 showing successful booting on the nodes, I now get the messages below (the
output is detailed because I added the '-ssi rpi_verbose 1' option to the same script):

n0<17991> ssi:boot:base:linear_windowed: booting n0 (node18)
n0<17991> ssi:boot:base:linear_windowed: booting n1 (node17)
n0<17997> ssi:rpi:sysv: module initializing
n0<17997> ssi:rpi:sysv:pollyield: 1
n0<17998> ssi:rpi:sysv: module initializing
n0<17998> ssi:rpi:sysv:pollyield: 1
n0<17998> ssi:rpi:sysv:short: 8192 bytes
n0<17997> ssi:rpi:sysv:short: 8192 bytes
n0<17998> ssi:rpi:sysv:shmpoolsize: 16777216 bytes
n0<17997> ssi:rpi:sysv:shmpoolsize: 16777216 bytes
n0<17998> ssi:rpi:sysv:shmmaxalloc: 65536 bytes
n0<17997> ssi:rpi:sysv:shmmaxalloc: 65536 bytes
n0<17998> ssi:rpi:tcp:short: 65536 bytes
n0<17997> ssi:rpi:tcp:short: 65536 bytes
n1<14838> ssi:rpi:sysv: module initializing
n1<14838> ssi:rpi:sysv:pollyield: 1
n1<14838> ssi:rpi:sysv:short: 8192 bytes
-----------------------------------------------------------------------------
The selected RPI failed to initialize during MPI_INIT. This is a
fatal error; I must abort.

This occurred on host node18 (n0).

The PID of failed process was 17997 (MPI_COMM_WORLD rank: 0)
-----------------------------------------------------------------------------
n1<14838> ssi:rpi:sysv:shmpoolsize: 16777216 bytes
n1<14838> ssi:rpi:sysv:shmmaxalloc: 65536 bytes
n1<14838> ssi:rpi:tcp:short: 65536 bytes
n1<14837> ssi:rpi:sysv: module initializing
n1<14837> ssi:rpi:sysv:pollyield: 1
n1<14837> ssi:rpi:sysv:short: 8192 bytes
n1<14837> ssi:rpi:sysv:shmpoolsize: 16777216 bytes
n1<14837> ssi:rpi:sysv:shmmaxalloc: 65536 bytes
n1<14837> ssi:rpi:tcp:short: 65536 bytes

-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.

PID 17998 failed on node n0 (192.168.101.18) with exit status 1.
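
Since the error messages are interleaved with the verbose output, a small filter can pull out just the failed processes. This is only a sketch: the function name failed_procs is hypothetical, and the two patterns are copied from the messages in the log above, so other LAM versions may word them differently.

```shell
# Hypothetical helper: list the processes that the log reports as failed.
# Reads the named files, or standard input if no file is given.
failed_procs() {
    sed -n \
        -e 's/^The PID of failed process was \([0-9][0-9]*\) (MPI_COMM_WORLD rank: \([0-9][0-9]*\)).*/pid=\1 rank=\2/p' \
        -e 's/^PID \([0-9][0-9]*\) failed on node \([^ ]*\) (\([^)]*\)) with exit status \([0-9][0-9]*\)\..*/pid=\1 node=\2 addr=\3 status=\4/p' \
        "$@"
}
```

It can be run on the job's error file, for example: failed_procs batch.sh.exxx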

3. I removed '-ssi rpi sysv' from the script in 2 above and resubmitted the
job. The program has now been running without any problem for the last 12
hours. I have also just rechecked the original script from 2 (above), and it
now works with '-ssi rpi sysv'. I don't know why my program runs correctly
without '-ssi rpi sysv', or why it was failing with it.

4. I have studied OpenMP, but I am at the end of my PhD. I will implement
it afterwards, as I am pressed for time and would prefer to get results with
my present code.

Thanks again for being so kind to guide me.
Best regards,
ND

On 07/07/2008, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>
> You seem to be having multiple different errors:
>
> 1. a command line problem where mpirun thinks that "sysv" is the executable
> to run
> 2. a problem compiling the sysv RPI
> 3. a problem getting the sysv RPI to initialize (which I'm not sure how you
> got to this point, given #1 and #2)
>
> I think that these errors have gotten mixed up and muddled in the thread so
> far. Can you send all the information listed here:
>
> http://www.lam-mpi.org/using/support/
>
> Specifically, can you show exactly all the steps you are following (to
> include all commands) for each error?
>
> Thanks!
>
>
>
>
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>
>
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>