
LAM/MPI General User's Mailing List Archives


From: Topp, Dave (GEAE) (Dave.Topp_at_[hidden])
Date: 2003-11-03 10:50:56


I am running LAM-7.0.2 on Red Hat 7.3 (2.4.20-19.7smp) and am having
problems getting LAM to boot when I specify the LAM_MPI_SESSION_SUFFIX
environment variable. I am running jobs under LSF Batch (but not using the
LSF Parallel product). The documentation says if I specify this variable, I
will be able to support multiple mpirun sessions from the same user on the
same host - something I need in our batch environment. I am using my own
rsh command (using LSF functionality to start remote processes rather than
rsh). LAM boots when I don't specify the session suffix. When I try
with the suffix specified, I get something like this:
 
lamboot is /afs/ae.ge.com/apps/lam/LINUX24/lam-7.0.2_prod_ge/bin/lamboot
-----------------------------------------------------------------------------
 
LAM 7.0.2/MPI 2 C++/ROMIO - Indiana University
 
Synopsis: hboot [-dhnNstv] [-c <schema>] [-I <inet_topo>] [-R <rtr_topo>]
 
Description: Start LAM on the local node
 
Options:
        -c <conf> Use <conf> as the process schema
        -b <name> Use <name> for the unix socket names
        -d Print debug information (implies -v)
        -h Print this message
        -I <inet_topo> Set $inet_topo variable
        -N Pretend to hboot (used with recon(1))
        -R <rtr_topo> Set $rtr_topo variable
        -s Close stdio of processes
        -t Kill existing session first
        -v Be verbose
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
LAM failed to execute a LAM binary on the remote node "csep0005".
Since LAM was already able to determine your remote shell as "hboot",
it is probable that this is not an authentication problem.
 
LAM tried to use the remote agent command
"/afs/ae.ge.com/apps/lam/LINUX24/lam-7.0.2_prod_ge/ge/ge_rsh.ksh"
to invoke the following command:
 
        /afs/ae.ge.com/apps/lam/LINUX24/lam-7.0.2_prod_ge/ge/ge_rsh.ksh csep0005 -n hboot -t -c lam-conf.lamd -sessionsuffix 27921 -s -I "-x -H 129.202.44.4 -P 47860 -n 1 -o 0"

 
This can indicate several things. You should check the following:
 
        - The LAM binaries are in your $PATH
        - You can run the LAM binaries
        - The $PATH variable is set properly before your
          .cshrc/.profile exits
 
Try to invoke the command listed above manually at a Unix prompt.
 
You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.
 
When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
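 
As an aid to retrying that command by hand, here is a minimal Python sketch
(not part of LAM; the helper name split_agent_command is hypothetical) that
rejoins the command from the error message onto one line and shows how a
POSIX shell would split it into arguments. It assumes the line breaks in the
message are only mail wrapping. It shows only what the remote agent script
should receive, not what it actually does with those arguments.

```python
import shlex

# Command exactly as printed in the LAM error message, rejoined onto one line
# (assumption: the line breaks in the message are only mail wrapping).
CMD = (
    '/afs/ae.ge.com/apps/lam/LINUX24/lam-7.0.2_prod_ge/ge/ge_rsh.ksh csep0005 '
    '-n hboot -t -c lam-conf.lamd -sessionsuffix 27921 -s '
    '-I "-x -H 129.202.44.4 -P 47860 -n 1 -o 0"'
)


def split_agent_command(cmd):
    """Tokenize a command line using POSIX shell quoting rules."""
    return shlex.split(cmd)


argv = split_agent_command(CMD)

# Print each argument with its index so it can be compared with what the
# remote agent script sees (for example by echoing "$@" inside that script).
for index, arg in enumerate(argv):
    print(index, repr(arg))
```

The quoted string after -I is a single argument, and -sessionsuffix and 27921
are two separate arguments that appear only when the suffix is set.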

 
Any idea what is wrong here? I have also tried with LAM 7.0 and got the
same result.
 
Thanks,
Dave