Hello!
I am quite new to LAM/MPI and have just installed it on a cluster
of 8 AMD machines running Linux/OpenMosix-2.4.26. The cluster has
two networks: a slow administrative one and a fast gigabit one for LAM
data transfers. The second network is reserved for LAM by populating
/etc/lam-bhost.def with the machine names assigned to the cards in the
gigabit network. Everywhere else in the system, traffic is routed over
the first (administrative) network. This works fine except for
programs that call MPI_Comm_spawn (for example spawn.c and
spawn_multiple.c from the lamtests suite). If LAM_MPI_SSI_rpi is set to
any value other than "lamd", these programs hang at this call and can be
terminated only with lamhalt or Ctrl+C. How often this happens depends
on the number of lamd daemons booted: with fewer daemons the hangs are
less frequent. When I set the environment variable LAM_MPI_SSI_rpi to
"lamd" there are no hangs, but the response time is worse. I have
tested this with lam-7.1.1 and lam-7.1.2b29.
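
To make the setup concrete, the boot schema looks roughly like the
sketch below. The host names node1-gb ... node8-gb are placeholders
(assumptions for illustration only); they stand for the names bound to
the gigabit cards, one machine per line.

```
# /etc/lam-bhost.def (sketch; host names are hypothetical placeholders)
# Each line names the gigabit-network interface of one machine.
node1-gb
node2-gb
node3-gb
node4-gb
node5-gb
node6-gb
node7-gb
node8-gb
```
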
Is there a fix for this bug, so that rpi:tcp can be used with
MPI_Comm_spawn? AFAIK the rpi:tcp module is required for Mosix to do
load balancing and migrate processes between nodes.
Thank you
M.Kondrin
PS: LAM was configured with:
CFLAGS="-O2 -march=i486 -mcpu=i686" ./configure --prefix=/usr \
  --sysconfdir=/etc --localstatedir=/var --enable-shared=yes \
  --with-rpi=tcp --with-modules --with-trillium
The boot module is rsh (I have kerberized rsh on the boxes). The LAM
executables are linked as follows:
ldd /usr/bin/mpirun
liblam.so.0 => /usr/lib/liblam.so.0 (0x40024000)
libdl.so.2 => /lib/libdl.so.2 (0x4006e000)
libutil.so.1 => /lib/libutil.so.1 (0x40072000)
libpthread.so.0 => /lib/libpthread.so.0 (0x40076000)
libc.so.6 => /lib/libc.so.6 (0x400c7000)
/lib/ld-linux.so.2 (0x40000000)
Output of laminfo -all:
LAM/MPI: 7.1.2b29
SSI boot: globus (SSI v1.0, API v1.1, Module v0.6)
SSI boot: rsh (SSI v1.0, API v1.1, Module v1.1)
SSI boot: slurm (SSI v1.0, API v1.1, Module v1.0)
SSI coll: lam_basic (SSI v1.0, API v1.1, Module v7.1)
SSI coll: shmem (SSI v1.0, API v1.1, Module v1.0)
SSI coll: smp (SSI v1.0, API v1.1, Module v1.2)
SSI rpi: crtcp (SSI v1.0, API v1.1, Module v1.1)
SSI rpi: lamd (SSI v1.0, API v1.0, Module v7.1)
SSI rpi: tcp (SSI v1.0, API v1.0, Module v7.1)
SSI rpi: sysv (SSI v1.0, API v1.0, Module v7.1)
SSI rpi: usysv (SSI v1.0, API v1.0, Module v7.1)
SSI cr: self (SSI v1.0, API v1.0, Module v1.0)
Prefix: /usr
Bindir: /usr/bin
Libdir: /usr/lib
Incdir: /usr/include
Pkglibdir: /usr/lib/lam
Sysconfdir: /etc
Architecture: i686-pc-linux-gnu
Configured by: root
Configured on: Sat Dec 10 13:24:55 UTC 2005
Configure host: alpha....
Memory manager: ptmalloc2
C bindings: yes
C++ bindings: yes
Fortran bindings: yes
C compiler: gcc
C char size: 1
C bool size: 1
C short size: 2
C int size: 4
C long size: 4
C float size: 4
C double size: 8
C pointer size: 4
C char align: 1
C bool align: 1
C int align: 4
C float align: 4
C double align: 4
C++ compiler: g++
Fortran compiler: g77
Fortran symbols: double_underscore
Fort integer size: 4
Fort real size: 4
Fort dbl prec size: 4
Fort cplx size: 4
Fort dbl cplx size: 4
Fort integer align: 4
Fort real align: 4
Fort dbl prec align: 4
Fort cplx align: 4
Fort dbl cplx align: 4
C profiling: yes
C++ profiling: yes
Fortran profiling: yes
C++ exceptions: no
Thread support: yes
ROMIO support: yes
IMPI support: no
Debug support: no
Purify clean: no
SSI base: parameter "verbose" (default value: <none>)
SSI mpi: parameter "mpi_hostmap" (default value: "/etc/lam-hostmap.txt")
SSI base: parameter "base_module_path" (default value: "/usr/lib/lam")
SSI boot: parameter "boot_verbose" (default value: <none>)
SSI boot: parameter "boot" (default value: <none>)
SSI boot: parameter "boot_base_promisc" (default value: "0")
SSI boot: parameter "boot_base_window_size" (default value: "5")
SSI boot: parameter "boot_globus_priority" (default value: "3")
SSI boot: parameter "boot_rsh_username" (default value: <none>)
SSI boot: parameter "boot_rsh_agent" (default value: "rsh")
SSI boot: parameter "boot_rsh_no_n" (default value: "0")
SSI boot: parameter "boot_rsh_no_profile" (default value: "0")
SSI boot: parameter "boot_rsh_fast" (default value: "0")
SSI boot: parameter "boot_rsh_ignore_stderr" (default value: "0")
SSI boot: parameter "boot_rsh_priority" (default value: "10")
SSI boot: parameter "boot_slurm_priority" (default value: "50")
SSI rpi: parameter "rpi_verbose" (default value: <none>)
SSI rpi: parameter "rpi" (default value: <none>)
SSI rpi: parameter "rpi_crtcp_priority" (default value: "25")
SSI rpi: parameter "rpi_crtcp_short" (default value: "65536")
SSI rpi: parameter "rpi_crtcp_sockbuf" (default value: "-1")
SSI rpi: parameter "rpi_lamd_priority" (default value: "20")
SSI rpi: parameter "rpi_tcp_short" (default value: "65536")
SSI rpi: parameter "rpi_tcp_sockbuf" (default value: "-1")
SSI rpi: parameter "rpi_tcp_priority" (default value: "75")
SSI rpi: parameter "rpi_sysv_pollyield" (default value: "1")
SSI rpi: parameter "rpi_sysv_poolsize" (default value: "16777216")
SSI rpi: parameter "rpi_sysv_maxalloc" (default value: "1048576")
SSI rpi: parameter "rpi_sysv_short" (default value: "8192")
SSI rpi: parameter "rpi_sysv_priority" (default value: "30")
SSI rpi: parameter "rpi_usysv_readlockpoll" (default value: "10000")
SSI rpi: parameter "rpi_usysv_writelockpoll" (default value: "10")
SSI rpi: parameter "rpi_usysv_pollyield" (default value: "1")
SSI rpi: parameter "rpi_usysv_poolsize" (default value: "16777216")
SSI rpi: parameter "rpi_usysv_maxalloc" (default value: "1048576")
SSI rpi: parameter "rpi_usysv_short" (default value: "8192")
SSI rpi: parameter "rpi_usysv_priority" (default value: "40")
SSI coll: parameter "coll_verbose" (default value: <none>)
SSI coll: parameter "coll_shmem" (default value: "0")
SSI cr: parameter "cr_verbose" (default value: <none>)
SSI cr: parameter "cr" (default value: <none>)
SSI cr: parameter "cr_self_priority" (default value: "25")
SSI cr: parameter "cr_self_do_restart" (default value: "0")
SSI cr: parameter "cr_self_prefix" (default value: "lam_cr_self")
SSI cr: parameter "cr_self_checkpoint" (default value: <none>)
SSI cr: parameter "cr_self_continue" (default value: <none>)
SSI cr: parameter "cr_self_restart" (default value: <none>)