Hi,
Thanks for your response
No, the program does not fail immediately; it runs for a while and then fails.
Remember, the same code runs just fine on the other cluster...
Here I am also sending the output of laminfo:
rangij_at_sd1:~> laminfo
LAM/MPI: 7.0
Prefix: /opt/lam
Architecture: x86_64-unknown-linux-gnu
Configured by: root
Configured on: Wed Sep 24 00:00:55 UTC 2003
Configure host: morricone
C bindings: yes
C++ bindings: yes
Fortran bindings: yes
C profiling: yes
C++ profiling: yes
Fortran profiling: yes
ROMIO support: yes
IMPI support: yes
Debug support: no
Purify clean: no
SSI boot: globus (Module v0.5)
SSI boot: rsh (Module v1.0)
SSI coll: impi (Module v7.0)
SSI coll: lam_basic (Module v7.0)
SSI coll: smp (Module v1.0)
SSI rpi: crtcp (Module v1.0)
SSI rpi: lamd (Module v7.0)
SSI rpi: sysv (Module v7.0)
SSI rpi: tcp (Module v7.0)
SSI rpi: usysv (Module v7.0)
Thanks again...
-Jai Rangi
-----Original Message-----
From: Nihar Sanghvi [mailto:nsanghvi_at_[hidden]]
Sent: Wednesday, March 31, 2004 4:05 PM
To: General LAM/MPI mailing list
Subject: Re: LAM: MPIRUN error
Hi,
We would appreciate it if you could provide more details about the conditions
under which the program is failing. Does it fail immediately after starting,
or does it fail after running for a while?
You could also check whether memory for the large matrices is being allocated
and managed properly.
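For example, you could verify every allocation explicitly before handing the
buffer to MPI. Here is a minimal sketch in C (the dimension N and the helper
alloc_matrix are placeholders, not taken from your code):

    #include <stdio.h>
    #include <stdlib.h>

    #define N 95  /* placeholder; substitute your actual matrix size */

    /* Allocate a rows x cols matrix as one contiguous block, so a single
       MPI_Send/MPI_Recv of rows*cols elements stays within the buffer. */
    double *alloc_matrix(int rows, int cols)
    {
        double *m = malloc((size_t)rows * (size_t)cols * sizeof(double));
        if (m == NULL) {
            fprintf(stderr, "failed to allocate %dx%d matrix\n", rows, cols);
            exit(1);
        }
        return m;
    }

    int main(void)
    {
        double *a = alloc_matrix(N, N);  /* fails loudly instead of corrupting memory */
        free(a);
        return 0;
    }

If a buffer is smaller than the count passed to MPI_Send or MPI_Recv, one
rank can crash, and the surviving ranks may then report exactly the
"process in local group is dead" error you are seeing.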
The output of laminfo will also give us an idea of your environment.
Thanks,
Nihar
On Wed, 31 Mar 2004, RANGI, JAI wrote:
>I got this error while doing matrix multiplication for two matrices of
>size 95x95.
>I don't get any error if the matrices are, say, 95x55 and 55x95 or smaller
>than this. I am running the lam-7.0-67 version of LAM, and the cluster is
>made of 64-bit Opteron processors running a 64-bit SuSE operating system.
>
>I never had any problem with the lam-6.5.4-1dyn version of LAM on a
>different cluster built out of Pentium II machines. There I am able to do
>matrix multiplication up to 500x500.
>
>
>MPI_Send: process in local group is dead (rank 0, MPI_COMM_WORLD)
>Rank (0, MPI_COMM_WORLD): Call stack within LAM:
>Rank (0, MPI_COMM_WORLD):  - MPI_Send()
>Rank (0, MPI_COMM_WORLD):  - main()
>-----------------------------------------------------------------------------
>One of the processes started by mpirun has exited with a nonzero exit code.
>This typically indicates that the process finished in error. If your process
>did not finish in error, be sure to include a "return 0" or "exit(0)" in
>your C code before exiting the application.
>
>PID 14236 failed on node n12 (192.168.1.113) with exit status 1.
>-----------------------------------------------------------------------------
>
>Any hint would be appreciated.
>Thanks
>
>
>Jai Rangi
>Unix System Administrator, Computing Services,
>South Dakota State University
>Brookings SD 57006.
>email: jai_rangi_at_[hidden]
>Ph: 605 688 4689
>Fax: 605 688 4605
>-------------------------------------------------------
>In the world with no fences, why would you need Gates?
> - Linux
>-------------------------------------------------------
>
>
Powered by LAM/MPI...
---------------------------------------
Nihar Sanghvi
LAM/MPI Team
Graduate Student (Indiana University)
http://www.lam-mpi.org
--------------------------------------