LAM/MPI General User's Mailing List Archives

From: Sarat C Maruvada (csarat1_at_[hidden])
Date: 2004-04-09 14:59:46


Hello everyone. I am sure the subject shows a topic that has been
written about time and time again. Rest assured that I did look through
most of the earlier threads before deciding to post again. Here is the problem:

LAM/MPI version 7.0.2
Sun Grid Engine v 5.3p4

* Installed an OSCAR-based cluster. LAM/MPI works perfectly fine when run
as a user (lamboot, followed by MPI runs and lamclean, all work very well).
* When using the integration script with SGE, the nodes do not boot because of
 "warning: fake X11 data forwarded", and hence the lamboot fails. As a user,
all the nodes of the cluster have ssh keys that are in the known_hosts file.

I have also tried the latest integration scripts shown on the mailing list,
but to no avail. There was a note about ckill.c having problems, but the
installed LAM didn't have the file ckill.c. After many unsuccessful tries, I
have given up. I will attach the PE script in SGE here along with any
other relevant scripts. Any help would be greatly appreciated.

PE: lammpi
----------
pe_name lammpi
queue_list all
slots 30
user_lists NONE
xuser_lists NONE
start_proc_args /home/SGE/mpi/lamstart.sh $pe_hostfile
stop_proc_args /home/SGE/mpi/lamstop.sh
allocation_rule $round_robin
control_slaves FALSE
job_is_first_task TRUE

lamstart.sh:
------------
#!/bin/sh

cat /dev/null > /tmp/lamnodes-$USER.$HOSTNAME
cat $1 | while read line; do
    host=`echo $line | cut -f1 -d" "| cut -f1 -d"."`
    nslots=`echo $line | cut -f2 -d" "`
    echo "${host} cpu=${nslots}" >> /tmp/lamnodes-$USER.$HOSTNAME
done
/opt/lam-7.0/bin/lamboot -ssi boot rsh -ssi rsh_agent "ssh -x" \
    /tmp/lamnodes-$USER.$HOSTNAME > /dev/null
#/opt/lam-7.0/bin/lamboot /tmp/lamnodes-$USER.$HOSTNAME >/dev/null

##rm -f /tmp/lamnodes-$USER.$HOSTNAME
-> Tried using ssh -x to suppress the error message, but it behaves as if the
-x option had not been specified at all. Is there anywhere else I should change it?
******************************************************************
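
The conversion loop in lamstart.sh above can be sketched as a small
function, which makes it easier to test without SGE or LAM present. This is
only a sketch: the name pe_to_lam_schema is hypothetical, and it assumes each
line of the PE hostfile starts with the host name followed by the slot count,
which is how the script above reads it.

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE or LAM): read PE hostfile lines on
# stdin and write LAM boot schema lines ("host cpu=N") on stdout.
# Assumes field 1 is the host name and field 2 is the slot count.
pe_to_lam_schema() {
    while read -r host nslots rest; do
        # Skip blank lines.
        [ -n "$host" ] || continue
        # Strip the domain part, as the cut -f1 -d"." call in lamstart.sh does.
        echo "${host%%.*} cpu=${nslots}"
    done
}

# Possible use inside a start script (the paths are the ones used above):
#   pe_to_lam_schema < "$1" > /tmp/lamnodes-$USER.$HOSTNAME
```

Reading the fields directly with read avoids the two cut pipelines per line,
but the output is meant to be the same as that of the loop in lamstart.sh.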

lamstop.sh:
-----------
#!/bin/sh

lamhalt >/dev/null

Also, the standard error returned by SGE into the .pe*** file is shown
below. It's only the starting block; all errors in the file are of the same
format:
---------------------------------------------------------------------------
ERROR: LAM/MPI unexpectedly received the following on stderr:
Warning: No xauth data; using fake authentication data for X11 forwarding.
-----------------------------------------------------------------------------
LAM attempted to execute a process on the remote node "nodelamda32",
but received some output on the standard error.

LAM tried to use the remote agent command "/usr/bin/ssh"
to invoke "echo $SHELL" on the remote node.

This can indicate an authentication error with the remote agent, or
can indicate an error in your $HOME/.cshrc, $HOME/.login, or
$HOME/.profile files. The following is a list of items that you may
wish to check on the remote node:

        - You have an account and can login to the remote machine
        - Incorrect permissions on your home directory (should
          probably be 0755)
        - Incorrect permissions on your $HOME/.rhosts file (if you are
          using rsh -- they should probably be 0644)
        - You have an entry in the remote $HOME/.rhosts file (if you
          are using rsh) for the machine and username that you are
          running from
        - Your .cshrc/.profile must not print anything out to the
          standard error
        - Your .cshrc/.profile should set a correct TERM type
        - Your .cshrc/.profile should set the SHELL environment
          variable to your default shell

Try invoking the following command at the unix command line:

        /usr/bin/ssh nodelamda32 -n echo $SHELL

You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.

When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
------------------------------------------------------------------------
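
The check described in the message above (nothing may appear on standard
error when the remote agent runs) can be reproduced by hand. The following
is a minimal sketch, not LAM's own code; stderr_is_clean is a hypothetical
helper name. It runs any command, discards its standard output, and succeeds
only if standard error stayed empty.

```shell
#!/bin/sh
# Hypothetical helper (not part of LAM): run the given command and succeed
# only if it wrote nothing to standard error.
stderr_is_clean() {
    # 2>&1 sends stderr into the captured stream first; >/dev/null then
    # discards stdout, so only stderr text ends up in $err.
    err=$("$@" 2>&1 >/dev/null)
    [ -z "$err" ]
}

# Possible use from the shell, with and without ssh's -x option
# (which disables X11 forwarding):
#   stderr_is_clean /usr/bin/ssh nodelamda32 -n 'echo $SHELL' && echo clean
#   stderr_is_clean /usr/bin/ssh -x nodelamda32 -n 'echo $SHELL' && echo clean
```

If the first form fails and the second succeeds, the X11 warning is the only
output in the way. Note that the error text names the agent as plain
"/usr/bin/ssh", which suggests the "ssh -x" setting did not reach lamboot.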

Thanks a lot in advance.

Sincerely,
Sarat.