Hello,
I am running optimization software from Sandia Labs called DAKOTA. This
software calls simulator_script, which runs a simulation called GAMES in
parallel on a cluster. I first call lamboot with a hostfile and then call
DAKOTA, which invokes simulator_script; simulator_script sets up the
parallel environment and runs GAMES via run_Gamesv1p2.c, which in turn
calls the shell script that runs the Matlab GAMES simulation.
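For reference, here is a simplified sketch of that chain (the input file
name and process count below are placeholders, not my real values):

    lamboot -v hostfile             # start the LAM runtime on all nodes
    dakota -i games_opt.in          # DAKOTA drives the optimization
    # for each evaluation, DAKOTA invokes simulator_script, which runs:
    #   mpiexec -np 8 run_Gamesv1p2     # binary built from run_Gamesv1p2.c
    #   (run_Gamesv1p2 then calls the shell script that starts Matlab GAMES)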
My problem is that LAM receives the following message on stderr:

    xhost: unable to open display ""

when it calls /usr/bin/ssh -x -a head -n 'echo $SHELL' on the nodes
other than head in the cluster. The command correctly prints $SHELL, but
the xhost error appears first.
I've tried the following (individually and in combination), but none of
them has solved the problem; the combination I expected to work is
sketched after the list:
1. mpiexec -ssi boot_rsh_ignore_stderr 1 (within simulator_script)
2. mpiexec -ssi rsh_agent "ssh -X" (within simulator_script)
3. mpiexec -ssi rsh_agent "ssh" (within simulator_script)
4. LAMRSH="ssh -X"; export LAMRSH (before calling DAKOTA; tried both
   before and after lamboot)
5. LAMRSH="ssh"; export LAMRSH (before calling DAKOTA; tried both before
   and after lamboot)
6. lamboot -ssi boot_rsh_ignore_stderr 1
7. lamboot -ssi boot rsh -ssi boot_rsh_ignore_stderr 1
8. mpiexec -ssi boot rsh -ssi boot_rsh_ignore_stderr 1 (within
   simulator_script)
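My understanding (possibly wrong) is that the boot_rsh_* parameters only
take effect at lamboot time, when the remote lamds are started, so the
mpiexec variants above may be no-ops. The combination I expected to
work, but which still fails for me, is:

    LAMRSH="ssh -x -a"      # remote agent for LAM to use
    export LAMRSH
    lamboot -ssi boot rsh -ssi boot_rsh_ignore_stderr 1 hostfile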
How can I change the ssh command that LAM/MPI runs when it tries to
connect back to the head node?
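From the LAM FAQ I gather the agent can also be set per-boot with the
boot_rsh_agent SSI parameter (which I have not tried yet); if I read it
correctly, the invocation would be:

    lamboot -ssi boot rsh -ssi boot_rsh_agent "ssh -x -a" hostfile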
I've attached the laminfo output along with the complete error message I
receive. I've also attached my simulator_script, which is where I call
mpiexec, and run_Gamesv1p2.c, which sets up the MPI environment.
Thanks,
Jennifer Thorne
LAM/MPI: 7.1.4
Prefix: /opt/lam-7.1.4
Architecture: x86_64-redhat-linux-gnu
Configured by: root
Configured on: Mon Nov 5 19:07:20 CET 2007
Configure host: fuji
Memory manager: ptmalloc2
C bindings: yes
C++ bindings: yes
Fortran bindings: yes
C compiler: gcc
C++ compiler: g++
Fortran compiler: gfortran
Fortran symbols: underscore
C profiling: yes
C++ profiling: yes
Fortran profiling: yes
C++ exceptions: no
Thread support: yes
ROMIO support: yes
IMPI support: no
Debug support: no
Purify clean: no
SSI boot: globus (API v1.1, Module v0.6)
SSI boot: rsh (API v1.1, Module v1.1)
SSI boot: slurm (API v1.1, Module v1.0)
SSI boot: tm (API v1.1, Module v1.1)
SSI coll: lam_basic (API v1.1, Module v7.1)
SSI coll: shmem (API v1.1, Module v1.0)
SSI coll: smp (API v1.1, Module v1.2)
SSI rpi: crtcp (API v1.1, Module v1.1)
SSI rpi: lamd (API v1.0, Module v7.1)
SSI rpi: sysv (API v1.0, Module v7.1)
SSI rpi: tcp (API v1.0, Module v7.1)
SSI rpi: usysv (API v1.0, Module v7.1)
SSI cr: self (API v1.0, Module v1.0)
ERROR: LAM/MPI unexpectedly received the following on stderr:
xhost: unable to open display ""
-----------------------------------------------------------------------------
LAM attempted to execute a process on the remote node "stimpy",
but received some output on the standard error. This heuristic
assumes that any output on the standard error indicates a fatal error,
and therefore aborts. You can disable this behavior (i.e., have LAM
ignore output on standard error) in the rsh boot module by setting the
SSI parameter boot_rsh_ignore_stderr to 1.
LAM tried to use the remote agent command "/usr/bin/ssh"
to invoke "echo $SHELL" on the remote node.
*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.
This can indicate an authentication error with the remote agent, or
can indicate an error in your $HOME/.cshrc, $HOME/.login, or
$HOME/.profile files. The following is a (non-inclusive) list of items
that you should check on the remote node:
- You have an account and can login to the remote machine
- Incorrect permissions on your home directory (should
probably be 0755)
- Incorrect permissions on your $HOME/.rhosts file (if you are
using rsh -- they should probably be 0644)
- You have an entry in the remote $HOME/.rhosts file (if you
are using rsh) for the machine and username that you are
running from
- Your .cshrc/.profile must not print anything out to the
standard error
- Your .cshrc/.profile should set a correct TERM type
- Your .cshrc/.profile should set the SHELL environment
variable to your default shell
Try invoking the following command at the unix command line:
/usr/bin/ssh -x -a head -n 'echo $SHELL'
You will need to configure your local setup such that you will *not*
be prompted for a password to invoke this command on the remote node.
No output should be printed from the remote node before the output of
the command is displayed.
When you can get this command to execute successfully by hand, LAM
will probably be able to function properly.
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
It seems that there is no lamd running on the host node02.jhuapl.edu.
This indicates that the LAM/MPI runtime environment is not operating.
The LAM/MPI runtime environment is necessary for the "lamhalt" command.
Please run the "lamboot" command to start the LAM/MPI runtime
environment. See the LAM/MPI documentation for how to invoke
"lamboot" across multiple machines.
-----------------------------------------------------------------------------