Mosix is going to have quite a few problems when trying to run MPI
applications. It may handle some of the issues by leaving socket
forwarding agents around, but in general LAM is not set up to handle
Mosix migration.

In particular, note that LAM/MPI is composed of two parts: the
run-time environment and the MPI application. If Mosix is migrating
these indiscriminately, Bad Things will happen (I can explain more if
you'd like). Try turning off migration/Mosix and see if the problem
disappears.
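
If it helps with the comparison, here is a rough sketch of a helper that
runs the same test under a chosen RPI module and reports whether it
finished or hung. It is only an illustration, not part of LAM: the
function name run_with_rpi, the 60-second default limit, and the
"mpirun -np 1" invocation are my assumptions, and it relies on the GNU
coreutils "timeout" command and an already-booted LAM universe. The only
thing taken from your report is the LAM_MPI_SSI_rpi environment variable.

```shell
#!/bin/sh
# Sketch: run one MPI program under a given RPI module and classify the
# outcome. Hypothetical helper; the name and the defaults are assumptions.

run_with_rpi() {
    rpi="$1"            # RPI module to select, e.g. tcp or lamd
    prog="$2"           # path to the MPI program, e.g. the spawn test
    limit="${3:-60}"    # seconds to wait before calling it a hang

    # Nothing to do when no mpirun is on the PATH.
    if ! command -v mpirun >/dev/null 2>&1; then
        echo "rpi=$rpi: skipped (mpirun not found)"
        return 0
    fi

    # LAM_MPI_SSI_rpi selects the RPI module for this run only.
    if LAM_MPI_SSI_rpi="$rpi" timeout "$limit" mpirun -np 1 "$prog"; then
        echo "rpi=$rpi: completed"
        return 0
    else
        status=$?
        if [ "$status" -eq 124 ]; then
            # GNU timeout exits with 124 when the time limit expires.
            echo "rpi=$rpi: no result after ${limit}s (possible hang)"
        else
            echo "rpi=$rpi: failed with exit status $status"
        fi
        return "$status"
    fi
}
```

Calling "run_with_rpi tcp ./spawn" and then "run_with_rpi lamd ./spawn",
once with migration enabled and once with it turned off, should show
whether the hang follows Mosix or the RPI module. A run killed by the
time limit may leave processes behind, so lamclean between runs is
probably a good idea.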
On Dec 10, 2005, at 4:03 PM, M.Kondrin wrote:
> Hello!
> I am quite new to lam-mpi and have just installed it on a cluster
> consisting of 8 AMD machines with Linux/OpenMosix-2.4.26. I have two
> networks there: the slow administrative one and the fast one for lam
> data transfers. This second network is reserved for lam by populating
> /etc/lam-bhost.def with the machine names assigned to the cards in the
> gigabit network. Everywhere else in the system, traffic is routed to
> the first (administrative) network. It works almost fine except with
> programs which call MPI_Comm_spawn (for example spawn.c and
> spawn_multiple.c from the lamtests suite). If LAM_MPI_SSI_rpi is set to
> a value other than "lamd", the programs hang at this call and can be
> terminated only with lamhalt or ctrl+C. It depends on the number of
> lamd daemons booted: with fewer daemons the hangs are less frequent.
> When I set the environment variable LAM_MPI_SSI_rpi to lamd, on the
> other hand, there are no hangs, but the response time is worse. I have
> tested this with lam-7.1.1 and lam-7.1.2b29.
> Is there a fix for this bug so that rpi:tcp can be used with
> MPI_Comm_spawn? The rpi:tcp module is required, AFAIK, for Mosix to do
> load-balancing and migrate processes between nodes.
> Thank you
> M.Kondrin
> PS lam was configured with:
> CFLAGS="-O2 -march=i486 -mcpu=i686" ./configure --prefix=/usr \
>   --sysconfdir=/etc --localstatedir=/var --enable-shared=yes \
>   --with-rpi=tcp --with-modules --with-trillium
> The boot module is rsh (I have kerberized rsh on the boxes). Lam
> executables are linked with:
> ldd /usr/bin/mpirun
> liblam.so.0 => /usr/lib/liblam.so.0 (0x40024000)
> libdl.so.2 => /lib/libdl.so.2 (0x4006e000)
> libutil.so.1 => /lib/libutil.so.1 (0x40072000)
> libpthread.so.0 => /lib/libpthread.so.0 (0x40076000)
> libc.so.6 => /lib/libc.so.6 (0x400c7000)
> /lib/ld-linux.so.2 (0x40000000)
>
> laminfo -all :
> LAM/MPI: 7.1.2b29
> SSI boot: globus (SSI v1.0, API v1.1, Module v0.6)
> SSI boot: rsh (SSI v1.0, API v1.1, Module v1.1)
> SSI boot: slurm (SSI v1.0, API v1.1, Module v1.0)
> SSI coll: lam_basic (SSI v1.0, API v1.1, Module v7.1)
> SSI coll: shmem (SSI v1.0, API v1.1, Module v1.0)
> SSI coll: smp (SSI v1.0, API v1.1, Module v1.2)
> SSI rpi: crtcp (SSI v1.0, API v1.1, Module v1.1)
> SSI rpi: lamd (SSI v1.0, API v1.0, Module v7.1)
> SSI rpi: tcp (SSI v1.0, API v1.0, Module v7.1)
> SSI rpi: sysv (SSI v1.0, API v1.0, Module v7.1)
> SSI rpi: usysv (SSI v1.0, API v1.0, Module v7.1)
> SSI cr: self (SSI v1.0, API v1.0, Module v1.0)
> Prefix: /usr
> Bindir: /usr/bin
> Libdir: /usr/lib
> Incdir: /usr/include
> Pkglibdir: /usr/lib/lam
> Sysconfdir: /etc
> Architecture: i686-pc-linux-gnu
> Configured by: root
> Configured on: Sat Dec 10 13:24:55 UTC 2005
> Configure host: alpha....
> Memory manager: ptmalloc2
> C bindings: yes
> C++ bindings: yes
> Fortran bindings: yes
> C compiler: gcc
> C char size: 1
> C bool size: 1
> C short size: 2
> C int size: 4
> C long size: 4
> C float size: 4
> C double size: 8
> C pointer size: 4
> C char align: 1
> C bool align: 1
> C int align: 4
> C float align: 4
> C double align: 4
> C++ compiler: g++
> Fortran compiler: g77
> Fortran symbols: double_underscore
> Fort integer size: 4
> Fort real size: 4
> Fort dbl prec size: 4
> Fort cplx size: 4
> Fort dbl cplx size: 4
> Fort integer align: 4
> Fort real align: 4
> Fort dbl prec align: 4
> Fort cplx align: 4
> Fort dbl cplx align: 4
> C profiling: yes
> C++ profiling: yes
> Fortran profiling: yes
> C++ exceptions: no
> Thread support: yes
> ROMIO support: yes
> IMPI support: no
> Debug support: no
> Purify clean: no
> SSI base: parameter "verbose" (default value: <none>)
> SSI mpi: parameter "mpi_hostmap" (default value: "/etc/lam-hostmap.txt")
> SSI base: parameter "base_module_path" (default value: "/usr/lib/lam")
> SSI boot: parameter "boot_verbose" (default value: <none>)
> SSI boot: parameter "boot" (default value: <none>)
> SSI boot: parameter "boot_base_promisc" (default value: "0")
> SSI boot: parameter "boot_base_window_size" (default value: "5")
> SSI boot: parameter "boot_globus_priority" (default value: "3")
> SSI boot: parameter "boot_rsh_username" (default value: <none>)
> SSI boot: parameter "boot_rsh_agent" (default value: "rsh")
> SSI boot: parameter "boot_rsh_no_n" (default value: "0")
> SSI boot: parameter "boot_rsh_no_profile" (default value: "0")
> SSI boot: parameter "boot_rsh_fast" (default value: "0")
> SSI boot: parameter "boot_rsh_ignore_stderr" (default value: "0")
> SSI boot: parameter "boot_rsh_priority" (default value: "10")
> SSI boot: parameter "boot_slurm_priority" (default value: "50")
> SSI rpi: parameter "rpi_verbose" (default value: <none>)
> SSI rpi: parameter "rpi" (default value: <none>)
> SSI rpi: parameter "rpi_crtcp_priority" (default value: "25")
> SSI rpi: parameter "rpi_crtcp_short" (default value: "65536")
> SSI rpi: parameter "rpi_crtcp_sockbuf" (default value: "-1")
> SSI rpi: parameter "rpi_lamd_priority" (default value: "20")
> SSI rpi: parameter "rpi_tcp_short" (default value: "65536")
> SSI rpi: parameter "rpi_tcp_sockbuf" (default value: "-1")
> SSI rpi: parameter "rpi_tcp_priority" (default value: "75")
> SSI rpi: parameter "rpi_sysv_pollyield" (default value: "1")
> SSI rpi: parameter "rpi_sysv_poolsize" (default value: "16777216")
> SSI rpi: parameter "rpi_sysv_maxalloc" (default value: "1048576")
> SSI rpi: parameter "rpi_sysv_short" (default value: "8192")
> SSI rpi: parameter "rpi_sysv_priority" (default value: "30")
> SSI rpi: parameter "rpi_usysv_readlockpoll" (default value: "10000")
> SSI rpi: parameter "rpi_usysv_writelockpoll" (default value: "10")
> SSI rpi: parameter "rpi_usysv_pollyield" (default value: "1")
> SSI rpi: parameter "rpi_usysv_poolsize" (default value: "16777216")
> SSI rpi: parameter "rpi_usysv_maxalloc" (default value: "1048576")
> SSI rpi: parameter "rpi_usysv_short" (default value: "8192")
> SSI rpi: parameter "rpi_usysv_priority" (default value: "40")
> SSI coll: parameter "coll_verbose" (default value: <none>)
> SSI coll: parameter "coll_shmem" (default value: "0")
> SSI cr: parameter "cr_verbose" (default value: <none>)
> SSI cr: parameter "cr" (default value: <none>)
> SSI cr: parameter "cr_self_priority" (default value: "25")
> SSI cr: parameter "cr_self_do_restart" (default value: "0")
> SSI cr: parameter "cr_self_prefix" (default value: "lam_cr_self")
> SSI cr: parameter "cr_self_checkpoint" (default value: <none>)
> SSI cr: parameter "cr_self_continue" (default value: <none>)
> SSI cr: parameter "cr_self_restart" (default value: <none>)
>
>
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/