Dear members,
I am new to LAM and MPI and am facing the following problem. I would
appreciate your guidance.
I had been using LAM without any trouble for my MPI code until recently. My
job script uses PBS and is as follows:
#!/bin/sh
#PBS -l nodes=2:ppn=2
#PBS -l walltime=100:00:00
lamboot -v
cd /data/rundir
mpirun -ssi rpi sysv -np 4 program_exe <input > output
Without my changing anything, I now get the following message from this
script:
n0<17669> ssi:boot:base:linear_windowed: booting n0 (node18)
n0<17669> ssi:boot:base:linear_windowed: booting n1 (node17)
-----------------------------------------------------------------------------
The selected RPI failed to initialize during MPI_INIT. This is a
fatal error; I must abort.
This occurred on host node18 (n0).
The PID of failed process was 17675 (MPI_COMM_WORLD rank: 0)
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.
PID 17676 failed on node n0 (192.168.101.18) with exit status 1.
-----------------------------------------------------------------------------
If I remove '-ssi rpi sysv' from my script, my program runs fine.
I have gone through the FAQs and, as best I can, have tried cleaning up with
ipcs, lamclean, etc., but to no avail. Do I need to restart the cluster? That
may be difficult, as other people are using it for jobs of a different nature.
Please guide me on this. Any suggestions are welcome.
ipcs returns the following:
------ Shared Memory Segments --------
key shmid owner perms bytes nattch status
------ Semaphore Arrays --------
key semid owner perms nsems
------ Message Queues --------
key msqid owner perms used-bytes messages
so there are no leftover shared memory segments, semaphore arrays, or message queues.
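A minimal sketch of how the same ipcs check could be repeated on every node allocated to the job, since the error names node18. It assumes passwordless ssh to the compute nodes, that PBS sets $PBS_NODEFILE inside the job, and that ipcs prints the Linux-style column layout shown above. The function names list_ipc_ids and check_job_nodes are hypothetical helpers, not part of LAM or PBS.

```shell
#!/bin/sh
# Hypothetical helper (not part of LAM): read `ipcs -m` or `ipcs -s` output
# on stdin and print the IDs (second column) of entries owned by the given
# user. Header and separator lines are skipped because their second field
# is not numeric.
list_ipc_ids() {
    awk -v owner="$1" '$3 == owner && $2 ~ /^[0-9]+$/ { print $2 }'
}

# Hypothetical helper: run the shared-memory check on each node of the job.
# Assumes passwordless ssh and that $PBS_NODEFILE is set by PBS.
check_job_nodes() {
    for node in $(sort -u "$PBS_NODEFILE"); do
        echo "== $node =="
        ssh "$node" ipcs -m | list_ipc_ids "$USER"
    done
}
```

Inside the job script, check_job_nodes could be called just before lamboot. Any ID it prints belongs to a segment still held under my user name on that node, and could then be removed there with ipcrm -m followed by that ID.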
Running lamclean and then lamboot starts LAM without any trouble, but the above
error persists with the same code that was running perfectly before.
Thanks,
ND