Hi,
Given that your program runs for a while before crashing, and that the
same code runs fine on a different cluster, I would say that this is most
likely a programming bug. There could be several reasons for it: the
program accessing a matrix element that it is not supposed to access, a
race condition, a memory error of some sort, etc.
You might want to try running the program under a debugger to find out.
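As an illustration of the first kind of bug, here is a minimal sketch in C
of a heap-allocated matrix with a bounds-checked accessor. I have not seen
your code, so the `matrix` type and the `matrix_init`, `matrix_at` and
`matrix_free` names are hypothetical, and this is only one way to make an
out-of-range access show up as a clear message instead of a crash later on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper type: a row-major matrix stored on the heap. */
typedef struct {
    size_t rows;
    size_t cols;
    double *data;
} matrix;

/* Allocate a zero-filled rows x cols matrix.
 * Returns 0 on success, -1 on size overflow or allocation failure. */
int matrix_init(matrix *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    m->rows = 0;
    m->cols = 0;
    m->data = NULL;
    if (cols != 0 && rows > SIZE_MAX / cols)
        return -1;                      /* rows * cols would overflow */
    m->data = calloc(rows * cols, sizeof *m->data);
    if (m->data == NULL)
        return -1;
    m->rows = rows;
    m->cols = cols;
    return 0;
}

/* Release the storage; safe to call on an already freed matrix. */
void matrix_free(matrix *m)
{
    assert(m != NULL);
    free(m->data);
    m->data = NULL;
    m->rows = 0;
    m->cols = 0;
}

/* Checked accessor: returns a pointer to element (i, j), or NULL
 * (after printing a diagnostic) if the index is out of range. */
double *matrix_at(matrix *m, size_t i, size_t j)
{
    assert(m != NULL);
    if (i >= m->rows || j >= m->cols) {
        fprintf(stderr, "matrix index (%zu,%zu) out of range %zux%zu\n",
                i, j, m->rows, m->cols);
        return NULL;
    }
    return &m->data[i * m->cols + j];
}
```

Routing every element access through something like `matrix_at` while
debugging would report the first bad index on stderr, which is usually
easier to track down than the "process in local group is dead" message.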
Hope this helps,
Thanks,
Nihar
On Thu, 1 Apr 2004, RANGI, JAI wrote:
>Hi,
>Thanks for your response.
>No, the program does not fail immediately; it runs for a while and then fails.
>Remember, the same code runs just fine on the other cluster...
>Here I am also sending the output of laminfo:
>
>
>rangij_at_sd1:~> laminfo
> LAM/MPI: 7.0
> Prefix: /opt/lam
> Architecture: x86_64-unknown-linux-gnu
> Configured by: root
> Configured on: Wed Sep 24 00:00:55 UTC 2003
> Configure host: morricone
> C bindings: yes
> C++ bindings: yes
> Fortran bindings: yes
> C profiling: yes
> C++ profiling: yes
> Fortran profiling: yes
> ROMIO support: yes
> IMPI support: yes
> Debug support: no
> Purify clean: no
> SSI boot: globus (Module v0.5)
> SSI boot: rsh (Module v1.0)
> SSI coll: impi (Module v7.0)
> SSI coll: lam_basic (Module v7.0)
> SSI coll: smp (Module v1.0)
> SSI rpi: crtcp (Module v1.0)
> SSI rpi: lamd (Module v7.0)
> SSI rpi: sysv (Module v7.0)
> SSI rpi: tcp (Module v7.0)
> SSI rpi: usysv (Module v7.0)
>
>Thanks Again...
>
>
>-Jai Rangi
>
>
>
>
>
>-----Original Message-----
>From: Nihar Sanghvi [mailto:nsanghvi_at_[hidden]]
>Sent: Wednesday, March 31, 2004 4:05 PM
>To: General LAM/MPI mailing list
>Subject: Re: LAM: MPIRUN error
>
>
>Hi,
>
>We would appreciate if you could provide more details of the condition
>under which the program is failing. Does it fail immediately after
>starting or does it fail after running for a while ?
>
>You could also check if the memory management for the huge matrix is being
>done properly.
>
>Output of laminfo will also give us an idea about the environment.
>
>Thanks,
>
>
>Nihar
>
>
>
>
>
>On Wed, 31 Mar 2004, RANGI, JAI wrote:
>
>>I got this error while doing the matrix multiplication of two matrices of
>>size 95x95.
>>I don't get any error if the matrices are, say, 95x55 and 55x95 or smaller
>>than this. I am running the lam-7.0-67 version of LAM, and the cluster is
>>made of 64-bit Opteron processors with the SUSE 64-bit operating system.
>>
>>I never had any problem with the lam-6.5.4-1dyn version of LAM on a
>>different cluster built out of Pentium 2 machines. There I am able to do
>>matrix multiplication of up to 500x500.
>>
>>
>>MPI_Send: process in local group is dead (rank 0, MPI_COMM_WORLD)
>>Rank (0, MPI_COMM_WORLD): Call stack within LAM:
>>Rank (0, MPI_COMM_WORLD): - MPI_Send()
>>Rank (0, MPI_COMM_WORLD): - main()
>>---------------------------------------------------------------------------
>>One of the processes started by mpirun has exited with a nonzero exit code.
>>This typically indicates that the process finished in error. If your process
>>did not finish in error, be sure to include a "return 0" or "exit(0)" in
>>your C code before exiting the application.
>>
>>PID 14236 failed on node n12 (192.168.1.113) with exit status 1.
>>---------------------------------------------------------------------------
>>
>>Any hint will be appreciated
>>Thanks
>>
>>
>>Jai Rangi
>>Unix System Administrator, Computing Services,
>>South Dakota State University
>>Brookings SD 57006.
>>email: jai_rangi_at_[hidden]
>>Ph: 605 688 4689
>>Fax: 6056884605
>>-------------------------------------------------------
>>In the world with no fences, why would you need Gates ?
>> - Linux
>>-------------------------------------------------------
>>
>>
>
>
>Powered by LAM/MPI...
>---------------------------------------
>Nihar Sanghvi
>LAM/MPI Team
>Graduate Student (Indiana University)
>http://www.lam-mpi.org
>--------------------------------------
>
>
>