
LAM/MPI General User's Mailing List Archives


From: Thomas Poehnitzsch (tpoehn_at_[hidden])
Date: 2004-03-09 04:56:37


Hi Llamas,

I have a problem with the IMPI part of LAM. Let's start with a short
description of the problem. When I try to run an IMPI job with more
than one process per client, the client hangs in an MPI_Recv called from
within MPI_Init.

I tried it with the smallest MPI-program I could think of:

  #include <stdio.h>
  #include "mpi.h"

  int main(int argc, char* argv[])
  {
      MPI_Init(&argc, &argv);
      printf("Hello!\n");
      MPI_Finalize();
      return 0;
  }

Then I started off by running an impi_server and two clients with one
process each, and everything was fine. As soon as I tried to increase
the number of processes on (at least) one of the clients, the client
hung.
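
To be concrete, the working case (two clients with one process each)
looked roughly like this, with the same flags as in the failing run
shown further down:

$ impi_server -server 2 -p 12345 -v -auth 0
$ impirun -v -client 0 192.168.12.65:12345 -np 1 host_lam_solaris
$ impirun -v -client 1 192.168.12.65:12345 -np 1 host_lam_solaris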

Ok, now in more detail. In order to make bug-squishing a bit easier, I
did all my tests in a homogeneous environment. I read somewhere that
most of the development of LAM is done on SunOS, so I tried everything
on a Sun (uname -a: SunOS node43 5.8 Generic_108528-19 sun4u sparc).

As for the programs used and how they were built:

gcc version 2.95.2 19991024 (release)
export CFLAGS="-ggdb -g -DLAM_WANT_DEBUG"
export LAMRSH="ssh -x"

impi_server-1.3: ./configure --prefix=/PATH_TO/impi-server_debug \
                        --with-debug && make && make install

lam-7.0.4: ./configure --prefix=/opt/home/thpo/lam/solaris_debug \
                --with-impi --with-debug && make && make install

Additionally, I set some of the fl_debug flags in the LAM sources to "1"
in order to get more debug information. The problem, however, existed
before I made any changes, so "it wasn't me".
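
Just to illustrate the kind of change I mean (the exact variable and
file names vary from file to file, so take this as a sketch only):

  /* sketch of the kind of edit: turn on an existing debug flag */
  static int fl_debug = 1;        /* was 0 */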

$ laminfo
           LAM/MPI: 7.0.4
            Prefix: /opt/home/thpo/lam/solaris_debug
      Architecture: sparc-sun-solaris2.8
     Configured by: thpo
     Configured on: Thu Mar 4 00:51:25 MET 2004
    Configure host: node41
        C bindings: yes
      C++ bindings: yes
  Fortran bindings: yes
       C profiling: yes
     C++ profiling: yes
 Fortran profiling: yes
     ROMIO support: yes
      IMPI support: yes
     Debug support: yes
      Purify clean: no
          SSI boot: globus (Module v0.5)
          SSI boot: rsh (Module v1.0)
          SSI coll: impi (Module v7.0)
          SSI coll: lam_basic (Module v7.0)
          SSI coll: smp (Module v1.0)
           SSI rpi: crtcp (Module v1.0.1)
           SSI rpi: lamd (Module v7.0)
           SSI rpi: sysv (Module v7.0)
           SSI rpi: tcp (Module v7.0)
           SSI rpi: usysv (Module v7.0)

So which command lines did I use to build and run the test program, you
ask? Here we go:

$ mpicc -ggdb -g -o host_lam_solaris src/host.c
$ export IMPI_AUTH_NONE=

$ impi_server -server 1 -p 12345 -v -auth 0
  Attempting to parse auth command line, argv[i + 1] = 0.
  server_auths[0] = 0
  IMPI server version 0 started on host node41
  IMPI server listening on port 12345 for 1 connection(s).
  192.168.12.65:12345
  IMPI server: Entering main server loop.

$ impirun -v -client 0 192.168.12.65:12345 -np 2 host_lam_solaris
  20027 host_lam_solaris running on n0 (o)
  14354 host_lam_solaris running on n1
  IMPI_Init rank 0 waiting for args from mpirun (1024 -- 1024)
  Got impid args: 192.168.12.65:12345 0
  Spawning: impid 0 192.168.12.65:12345
  IMPI_Send_lamgiappe
  LAM IMPI client starting: client rank 0, server 192.168.12.65, port 12345
  Done with comm spawn
  About to intercomm merge
  IMPI_Send_lamgiappe
  IMPI_Send_lamgiappe
  IMPI_Send_lamgiappe
  Done with comm spawn
  About to intercomm merge
  Finished MPI_Init in impid
  IMPI_Send_lamgiappe
  WAITING ON MANDATORY REQUESTS: 1
  IMPI_Send_lamgiappe
  WAITING ON MANDATORY REQUESTS: 0
  MANDATORY REQUESTS FINISHED
  MPI_REQ_ADVANCE RETURNED IN MANDATORY LOOP
  MPI_REQ_ADVANCE IN MANDATORY LOOP: found mand req done: 0x1045d0
  MANDATORY REQUEST LOOP: looping 0
  MANDATORY REQUESTS FINISHED
  IMPI_Send_lamgiappe
  IMPI_Send_lamgiappe
  WAITING ON MANDATORY REQUESTS: 1
  WAITING ON MANDATORY REQUESTS: 0
  MPI_REQ_ADVANCE RETURNED IN MANDATORY LOOP
  MANDATORY REQUESTS FINISHED
  MPI_REQ_ADVANCE IN MANDATORY LOOP: found mand req done: 0x148bf0
  MANDATORY REQUEST LOOP: looping 0
  MANDATORY REQUESTS FINISHED
  IMPI_Send_lamgiappe
  Did intercomm merge
  About to MPI_Gather
  IMPI_Send_lamgiappe
  Did MPI_Gather
  About to MPI_Recv
  Did intercomm merge
  About to MPI_Gather
  IMPI_Send_lamgiappe
  Did MPI_Gather
  About to MPI_Recv

On the first two nodes of the cluster, the MPI program is running, and
backtraces with gdb revealed that both processes are waiting in an
MPI_Recv called from within MPI_Init.
The impid on the first node (the only impid running) is waiting in a
PMPI_Recv called by MPI_Intercomm_merge.
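
For reference, I got the backtraces by attaching gdb to the hanging
processes by PID (e.g. rank 0 from the impirun output above) and
issuing "bt":

$ gdb host_lam_solaris 20027
(gdb) bt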

Well, as this mail has already become a bit lengthy, I'd better stop
describing everything in more detail now.

I hope I haven't overlooked anything in the FAQ that would explain this
problem, and that you can reproduce it. If you need any more information
(gdb or configure output) to pinpoint the problem, just ask.

Thanks in advance for taking the time to look into my problem.

    Thomas

-- 
I doubt, therefore I might be.

