
LAM/MPI General User's Mailing List Archives


From: Brian W. Barrett (brbarret_at_[hidden])
Date: 2003-10-16 10:59:27


It's pretty hard to pin down the problem based on the information you
provided. The LAM team no longer has access to HPUX machines, so it is
possible that something changed that causes problems on HPUX. On the
other hand, it is not unusual for codes developed on one platform to
have odd issues on another platform (even MPI apps :) ).

Have you run the LAM/MPI test suite (available for download on the same
page as the 7.0.2 release)? It does a reasonably good job of poking
around in the MPI implementation. If it passes, then I would start
looking at the user code. The error message indicates that one of the
two processes died either in the barrier or before the barrier
started. Without seeing the code, I can't make any guesses as to the
cause of the death; a debugger may be helpful here.
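
If the suite passes and you want something smaller to poke at, a
bare-bones barrier test run across the same two nodes would at least
tell you whether the hang is in LAM or in the application. A rough
sketch (the program and file names here are just placeholders):

      program btest
      implicit none
      include 'mpif.h'
      integer :: ierr, rank, nprocs

      ! start MPI and report in before the barrier
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      print *, 'rank', rank, 'of', nprocs, 'entering barrier'

      ! if a peer process died before this point, the barrier fails here
      call MPI_BARRIER(MPI_COMM_WORLD, ierr)
      print *, 'rank', rank, 'passed the barrier'

      call MPI_FINALIZE(ierr)
      end program btest

Compile it with the same wrapper you use for the CFD code (something
like "mpif77 btest.f90 -o btest", assuming mpif77 picks up f90 as you
described) and run it the same way: mpirun n0 n1 btest. If this also
dies in the barrier, I'd look at the LAM installation and the network
between the two workstations; if it passes, the application is the
more likely suspect.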

Hope that helps,

Brian

On Thursday, October 16, 2003, at 05:38 AM, Sebastian Henkel wrote:

> I have installed LAM/MPI 7.0.2 on an HPUX 10.20 workstation. The
> compilation was successful and I could run the examples and several
> other test cases without problems on several nodes. As requested on
> the webpage at "lam dash mpi dot org", I have attached the config.log
> file and the laminfo output to this mail.
>
> The problem I am having is with a CFD program developed with LAM/MPI
> on RH-Linux. According to the developers it runs without problems
> there. When started with both processes on the same workstation it
> works without any problems. If I start the case on two nodes, each on
> a different workstation, the program doesn't get past MPI_BARRIER. I
> suppose it could be a problem with HPUX 10.20, but I don't know.
>
> I need help, as I am not familiar with MPI in any way besides
> compiling it.
>
> I run MPI with the following command (tcp is the default):
>
> mpirun n0 n1 program
>
> The error I receive is:
>
> MPI_Recv: process in local group is dead (rank 0, MPI_COMM_WORLD)
> Rank (0, MPI_COMM_WORLD): Call stack within LAM:
> Rank (0, MPI_COMM_WORLD): - MPI_Recv()
> Rank (0, MPI_COMM_WORLD): - MPI_Barrier()
> Rank (0, MPI_COMM_WORLD): - main()
>
> Having added write statements to the program, I know it always
> crashes when calling MPI_BARRIER.
>
> The CFD program and the MPI implementation are written in Fortran 90
> and built with the mpif77 wrapper. When compiling LAM I made sure
> that mpif77 uses f90 as the Fortran compiler.
>
> Hopefully someone can give me a hint as to what the reason might be.
>
> Best regards
>
> Sebastian Henkel
> --
>
> Dipl.-Ing. Sebastian Henkel, Naval Architect, TKB Basic Design
>
> Tel.:   +49 461 4940-508   FLENSBURGER SCHIFFBAU-GESELLSCHAFT mbH & Co. KG
> Fax:    +49 461 4940-217   Batteriestrasse 52, D-24939 Flensburg, Germany
> E-Mail: henkel at fsg-ship dot de
>
>
> <config.log.gz><laminfo.out>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>

-- 
   Brian Barrett
   LAM/MPI developer and all around nice guy
   Have a LAM/MPI day: http://www.lam-mpi.org/