Hi Llamas,
I have a problem with the IMPI part of LAM. Let me start with a
description of the problem. When I try to run an IMPI job with more
than one process per client, the client hangs in an MPI_Recv called
from within MPI_Init.
I tried it with the smallest MPI program I could think of:
#include <stdio.h>
#include "mpi.h"

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    printf("Hello!\n");
    MPI_Finalize();
    return 0;
}
Then I started off by running an impi_server and two clients with one
process each, and everything was fine. As soon as I tried to increase
the number of processes on (at least) one of the clients, the client
hung.
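To make that difference explicit: the working two-client setup was
(from memory, so the exact flags may be slightly off) roughly

$ impi_server -server 2 -p 12345 -v -auth 0
$ impirun -v -client 0 192.168.12.65:12345 -np 1 host_lam_solaris
$ impirun -v -client 1 192.168.12.65:12345 -np 1 host_lam_solaris

with one impirun per client, and the hang appears as soon as I raise
the -np argument of one of the clients to 2 or more. The full
transcript of a minimal hanging case (a single client started with
-np 2) follows further down.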
OK, now in more detail. In order to make bug-squishing a bit easier, I
did all my tests in a homogeneous environment. I read somewhere that
most of the development of LAM is done on SunOS, so I tried everything
on a Sun (uname -a: SunOS node43 5.8 Generic_108528-19 sun4u sparc).
As for the programs used and how they were built:
gcc version 2.95.2 19991024 (release)
export CFLAGS="-ggdb -g -DLAM_WANT_DEBUG"
export LAMRSH="ssh -x"
impi_server-1.3: ./configure --prefix=/PATH_TO/impi-server_debug \
--with-debug && make && make install
lam-7.0.4: ./configure --prefix=/opt/home/thpo/lam/solaris_debug \
--with-impi --with-debug && make && make install
Additionally, I set some of the fl_debug flags in the LAM sources to
"1" in order to get more debug information. The problem, however,
existed before I made any changes, so "it wasn't me".
$ laminfo
LAM/MPI: 7.0.4
Prefix: /opt/home/thpo/lam/solaris_debug
Architecture: sparc-sun-solaris2.8
Configured by: thpo
Configured on: Thu Mar 4 00:51:25 MET 2004
Configure host: node41
C bindings: yes
C++ bindings: yes
Fortran bindings: yes
C profiling: yes
C++ profiling: yes
Fortran profiling: yes
ROMIO support: yes
IMPI support: yes
Debug support: yes
Purify clean: no
SSI boot: globus (Module v0.5)
SSI boot: rsh (Module v1.0)
SSI coll: impi (Module v7.0)
SSI coll: lam_basic (Module v7.0)
SSI coll: smp (Module v1.0)
SSI rpi: crtcp (Module v1.0.1)
SSI rpi: lamd (Module v7.0)
SSI rpi: sysv (Module v7.0)
SSI rpi: tcp (Module v7.0)
SSI rpi: usysv (Module v7.0)
So which command lines did I use to build and run the test program,
you ask? Here we go:
$ mpicc -ggdb -g -o host_lam_solaris src/host.c
$ export IMPI_AUTH_NONE=
$ impi_server -server 1 -p 12345 -v -auth 0
Attempting to parse auth command line, argv[i + 1] = 0.
server_auths[0] = 0
IMPI server version 0 started on host node41
IMPI server listening on port 12345 for 1 connection(s).
192.168.12.65:12345
IMPI server: Entering main server loop.
$ impirun -v -client 0 192.168.12.65:12345 -np 2 host_lam_solaris
20027 host_lam_solaris running on n0 (o)
14354 host_lam_solaris running on n1
IMPI_Init rank 0 waiting for args from mpirun (1024 -- 1024)
Got impid args: 192.168.12.65:12345 0
Spawning: impid 0 192.168.12.65:12345
IMPI_Send_lamgiappe
LAM IMPI client starting: client rank 0, server 192.168.12.65, port 12345
Done with comm spawn
About to intercomm merge
IMPI_Send_lamgiappe
IMPI_Send_lamgiappe
IMPI_Send_lamgiappe
Done with comm spawn
About to intercomm merge
Finished MPI_Init in impid
IMPI_Send_lamgiappe
WAITING ON MANDATORY REQUESTS: 1
IMPI_Send_lamgiappe
WAITING ON MANDATORY REQUESTS: 0
MANDATORY REQUESTS FINISHED
MPI_REQ_ADVANCE RETURNED IN MANDATORY LOOP
MPI_REQ_ADVANCE IN MANDATORY LOOP: found mand req done: 0x1045d0
MANDATORY REQUEST LOOP: looping 0
MANDATORY REQUESTS FINISHED
IMPI_Send_lamgiappe
IMPI_Send_lamgiappe
WAITING ON MANDATORY REQUESTS: 1
WAITING ON MANDATORY REQUESTS: 0
MPI_REQ_ADVANCE RETURNED IN MANDATORY LOOP
MANDATORY REQUESTS FINISHED
MPI_REQ_ADVANCE IN MANDATORY LOOP: found mand req done: 0x148bf0
MANDATORY REQUEST LOOP: looping 0
MANDATORY REQUESTS FINISHED
IMPI_Send_lamgiappe
Did intercomm merge
About to MPI_Gather
IMPI_Send_lamgiappe
Did MPI_Gather
About to MPI_Recv
Did intercomm merge
About to MPI_Gather
IMPI_Send_lamgiappe
Did MPI_Gather
About to MPI_Recv
On the first two nodes of the cluster the MPI program is running, and
backtracing with gdb revealed that both processes are waiting in an
MPI_Recv called from within MPI_Init.
The impid on the first node (the only impid running) is waiting in a
PMPI_Recv called by MPI_Intercomm_merge.
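In case it matters, I got those backtraces simply by attaching gdb to
the hung processes and looking at the stack, along the lines of

$ gdb host_lam_solaris 20027
(gdb) bt
(gdb) detach
(gdb) quit

(20027 and 14354 are the PIDs printed by impirun above), and likewise
for the impid.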
Well, as this mail has already become a bit lengthy, I had better stop
describing everything in more detail now.
I hope I haven't overlooked anything in the FAQ that explains this
problem, and that you can reproduce it. If you need any more
information (gdb or configure output) to pinpoint the problem, just
ask.
Thanks in advance for taking the time to look into my problem.
Thomas
--
I doubt, therefore I might be.