LAM/MPI General User's Mailing List Archives

From: Robert Becker (beckerr_at_[hidden])
Date: 2005-04-26 13:59:17


I am currently working on getting a 64-bit Scyld 29cz4-based cluster up
and running to run ABAQUS jobs. Since Scyld is Bproc-based, I am
having a few problems.

The first problem was getting LAM 7.0.4 to compile. After some digging
around I found a patch to fix that problem. LAM compiled and (I think)
installed fine.

The way Bproc handles hostnames could be messing some things up. Bproc
uses numbers for the hostnames: the master node is -1, and the compute
nodes range from 0 to n. With a host definition file listing -1, 0 and 1,
I am able to lamboot with no problems. Most of the test applications work
as well.
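
For reference, a host definition file of the kind described above can be
produced from a list of (node, CPU count) pairs. This is only a minimal
Python sketch: it assumes LAM's boot schema syntax of one host per line
with an optional cpu=N count (the same form the boot log prints), and the
function name and the CPU counts are made up for illustration.

```python
def boot_schema_lines(hosts):
    """Return boot schema lines for (node_name, cpu_count) pairs.

    Assumes one "hostname cpu=N" entry per line.
    """
    lines = []
    for name, cpus in hosts:
        if cpus < 1:
            raise ValueError(f"node {name!r} needs at least one CPU")
        lines.append(f"{name} cpu={cpus}")
    return lines


# Bproc node numbers: -1 is the master, 0..n are the compute nodes.
hosts = [("-1", 2), ("0", 2), ("1", 2)]
schema = "\n".join(boot_schema_lines(hosts)) + "\n"
print(schema, end="")
```

The resulting text could be saved to a file and passed to lamboot as the
boot schema.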

ABAQUS uses mpiexec to boot LAM, so the user does not have to boot it
for every instance.

Here are the relevant MPI variables for ABAQUS.

mp_host_list=[['-1',2],['0',2],['1',2]]
mp_mode=MPI
mp_mpi_implementation=LAM
mp_file_system=(SHARED,SHARED)
mp_mpirun_path= {LAM: '/usr/local/lam-7.0.4/bin/mpiexec'}
mp_rsh_command = 'bpsh -N %H %C'

I have fudged the mp_rsh_command, and I am not sure whether it will work.
From the research I have done, ABAQUS only uses the RSH command to move
files between nodes when a shared file system is not in use. Since I am
using a shared file system, ABAQUS should not invoke the command, so it
should not cause any problems.
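
To make the template above concrete, here is a small Python sketch of how
a string such as 'bpsh -N %H %C' could be expanded. It assumes that %H
stands for the target host and %C for the command to run there; the
helper name and the example command are hypothetical, and this is not
ABAQUS's actual implementation.

```python
import re


def expand_rsh_command(template, host, command):
    """Fill in %H (host) and %C (command) in a single pass.

    The placeholder meanings are an assumption made for illustration.
    """
    values = {"%H": host, "%C": command}
    return re.sub(r"%[HC]", lambda m: values[m.group(0)], template)


template = "bpsh -N %H %C"
expanded = expand_rsh_command(template, "0", "ls /home/becker/abatest")
print(expanded)
```

Printing the expanded string for each node is a cheap way to check by
hand that the resulting command actually runs before ABAQUS needs it.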

Attached is the verbose output from attempting to start an ABAQUS job.
There are two things I see that could be the problem. The first is
ABAQUS uses the master hostname as one of the nodes automatically. It
gets the second one from the config file.

The second is that during boot-up it tries to reference a node -3.
There is no node -3.
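
The mismatch can be pulled out of the verbose output mechanically. The
Python sketch below pairs each host listed under "found the following
hosts" with the node number from the "launching on nodes" line of the
same process. Pairing by position is an assumption, although the later
"booting n1 (edms-abaqus)" and "execmoving ... to -3" lines are consistent
with it. The function name is made up, and the sample is copied from the
attached log.

```python
import re
from collections import defaultdict

# Patterns for two kinds of lines in the "-d" boot output.
HOST_RE = re.compile(r"<(\d+)> ssi:boot:bproc: n\d+ (\S+) \(cpu=\d+\)")
NODES_RE = re.compile(r"<(\d+)> ssi:boot:bproc:vector: launching on nodes (\S+)")


def host_node_pairs(log_text):
    """Map each booting pid to (schema host, launch node) pairs."""
    hosts = defaultdict(list)
    nodes = {}
    for line in log_text.splitlines():
        m = HOST_RE.search(line)
        if m:
            hosts[m.group(1)].append(m.group(2))
            continue
        m = NODES_RE.search(line)
        if m:
            nodes[m.group(1)] = m.group(2).split(",")
    return {pid: list(zip(hosts[pid], nodes[pid])) for pid in nodes}


sample = """\
n0<30469> ssi:boot:bproc: found the following hosts:
n0<30469> ssi:boot:bproc: n0 -1 (cpu=2)
n0<30469> ssi:boot:bproc: n1 edms-abaqus (cpu=1)
n0<30469> ssi:boot:bproc: starting RTE procs
n0<30469> ssi:boot:bproc:vector: launching on nodes -1,-3
"""

for pid, pairs in host_node_pairs(sample).items():
    for host, node in pairs:
        note = "" if host == node else "  <-- name and node number differ"
        print(f"pid {pid}: {host} -> node {node}{note}")
```

On the sample, the entry for edms-abaqus is the one flagged, which is
the host name that ABAQUS added on its own.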

Any info or help here would be highly appreciated.

Thanks.


[becker_at_edms-abaqus abatest]$ /usr/local/abaqus/Commands/abq652 python testmpi.py -mp_mode mpi -verbose 3
mp_mpi_implementation : LAM
mp_mpirun_path : /usr/local/lam-7.0.4/bin/mpiexec
mp_mpirun_options : -d
mp_host_list : (('-1', 2),)
mp_file_system : (SHARED, SHARED)
mp_rsh_command : bpsh -N %H %C
local host : edms-abaqus
platform : lnx86_64
Begin ABAQUS Unit Test
Tue 26 Apr 2005 02:12:15 PM EDT
Did not find environment variable ABA_DDM_DEBUG.
Did not find environment variable ABA_ELP_SURFACE_SPLIT.
Did not find environment variable ABA_ELP_SUSPEND.
Did not find environment variable ABA_MPI_DEBUG_LEVEL.
Did not find environment variable ABA_RESOURCE_MONITOR.
Did not find environment variable ABAQUSLM_LICENSE_FILE.
Did not find environment variable ABQ_DATACHECK.
Did not find environment variable ABQ_RECOVER.
Did not find environment variable ABQ_RESTART.
Did not find environment variable ABQ_XPL_WINDOWDUMP.
Did not find environment variable ABQ_XPL_PARTITIONSIZE.
Did not find environment variable ABQLMHANGLIMIT.
Did not find environment variable ABQLMQUEUE.
Did not find environment variable ABQLMUSER.
Did not find environment variable CCI_INITIAL_EXCHANGE.
Did not find environment variable ABAQUS_CCI_DEBUG.
Did not find environment variable CCI_RENDEZVOUS.
Did not find environment variable DOMAIN_CPUS.
Did not find environment variable FLEXLM_DIAGNOSTICS.
Did not find environment variable FOR0064.
Did not find environment variable KMP_DUPLICATE_LIB_OK.
Did not find environment variable MP_NUMBER_OF_THREADS.
Did not find environment variable MPC_GANG.
Did not find environment variable MPI_WORKDIR.
Did not find environment variable OMP_DYNAMIC.
Did not find environment variable OMP_NUM_THREADS.
Did not find environment variable PAIDUP.
Did not find environment variable PARALLEL_METHOD.
Copying files to host: -1
Run mpiexec
  Command: /usr/local/lam-7.0.4/bin/mpiexec -d -machinefile /home/becker/abatest/dmpT_CommTest.app -wd /home/becker/abatest -x ABA_MEMORY_MODE,ABA_MPI_VERBOSE_LEVEL,ABA_PATH,ABAQUS_LANG,DOMAIN,NCPUS,P4_SOCKBUFSIZE,LD_LIBRARY_PATH, n0 n0 /usr/local/abaqus/6.5-2/exec/dmpT_CommTest.exe -outdir /home/becker/abatest -job dmpT_CommTest -cpus 2 -domains 2 -verbose 3
  Environment:
    ABAQUS_LANG = en_US.ISO8859-1
    ABAQUS_PY_TRANSLATION_DICTIONARY = Configuration/Xresources/en_US/en_US_PyDict.py
    ABAQUS_TRANSLATION_DICTIONARY = Configuration/Xresources/en_US/en_US_Dict.py
    ABA_COMMAND = /usr/local/abaqus/Commands/abq652
    ABA_HOME = /usr/local/abaqus/6.5-2
    ABA_LIBRARY_PATH = /usr/local/abaqus/6.5-2/cae/.:/usr/local/abaqus/6.5-2/cae/exec/lbr:/usr/local/abaqus/6.5-2/cae/Python/Obj/lbr:/usr/local/abaqus/6.5-2/cae/External/Acis:/usr/local/abaqus/6.5-2/cae/External:/usr/local/abaqus/6.5-2/exec:/usr/local/abaqus/6.5-2/cae/External/Interop_32:/usr/local/abaqus/6.5-2/cae/External/32
    ABA_LIBRARY_PATHNAME = LD_LIBRARY_PATH
    ABA_MEMORY_MODE = 1
    ABA_MPI_LIBRARY_PATH = /usr/local/abaqus/6.5-2/cae/External/dmp
    ABA_MPI_VERBOSE_LEVEL = 3
    ABA_PATH = /usr/local/abaqus/6.5-2:/usr/local/abaqus/6.5-2/cae
    DISPLAY = localhost:12.0
    DOMAIN = 2
    G_BROKEN_FILENAMES = 1
    HISTSIZE = 1000
    HOME = /home/becker
    HOSTNAME = edms-abaqus
    INPUTRC = /etc/inputrc
    LAMHOME = /usr/local/lam-7.0.4/
    LAM_MPI_SESSION_PREFIX = /tmp
    LAM_MPI_SESSION_SUFFIX = 30462
    LANG = en_US.UTF-8
    LD_LIBRARY_PATH = /usr/local/abaqus/6.5-2/cae/External/dmp/lam:/usr/local/abaqus/6.5-2/cae/exec/lbr:/usr/local/abaqus/6.5-2/cae/Python/Obj/lbr:/usr/local/abaqus/6.5-2/cae/External/Acis:/usr/local/abaqus/6.5-2/cae/External:/usr/local/abaqus/6.5-2/exec:/usr/local/abaqus/6.5-2/cae/External/Interop_32:/usr/local/abaqus/6.5-2/cae/External/32
    LESSOPEN = |/usr/bin/lesspipe.sh %s
    LOGNAME = becker
    LS_COLORS = no=00:fi=00:di=00;34:ln=00;36:pi=40;33:so=00;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=00;32:*.cmd=00;32:*.exe=00;32:*.com=00;32:*.btm=00;32:*.bat=00;32:*.sh=00;32:*.csh=00;32:*.tar=00;31:*.tgz=00;31:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.zip=00;31:*.z=00;31:*.Z=00;31:*.gz=00;31:*.bz2=00;31:*.bz=00;31:*.tz=00;31:*.rpm=00;31:*.cpio=00;31:*.jpg=00;35:*.gif=00;35:*.bmp=00;35:*.xbm=00;35:*.xpm=00;35:*.png=00;35:*.tif=00;35:
    MAIL = /var/spool/mail/becker
    NCPUS = 2
    OLDPWD = /home/becker/lam-7.0.4
    P4_SOCKBUFSIZE = 131072
    PATH = /usr/kerberos/bin:/usr/local/intel_fce_80/bin:/usr/local/intel_cce_80/bin:/usr/local/abaqus/Commands:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/home/becker/bin:/usr/local/lam-7.0.4/:/usr/local/lam-7.0.4//bin
    PWD = /home/becker/abatest
    PYTHONPATH = /usr/local/abaqus/6.5-2/cae/Python/Lib:/usr/local/abaqus/6.5-2/cae/Python/Obj:/usr/local/abaqus/6.5-2/cae/exec/lbr:.
    SHELL = /bin/bash
    SHLVL = 1
    SSH_ASKPASS = /usr/libexec/openssh/gnome-ssh-askpass
    SSH_CLIENT = 131.167.77.92 40184 22
    SSH_CONNECTION = 131.167.77.92 40184 131.167.45.84 22
    SSH_TTY = /dev/pts/1
    TERM = xterm
    USER = becker
    _ = /usr/local/abaqus/Commands/abq652
mpiexec: Global argument parsing done
mpiexec: Booting lam..
n0<30469> ssi:boot: Opening
n0<30469> ssi:boot: opening module bproc
n0<30469> ssi:boot: initializing module bproc
n0<30469> ssi:boot:bproc: module initializing
n0<30469> ssi:boot:bproc:verbose: 1000
n0<30469> ssi:boot:bproc:priority: 50
n0<30469> ssi:boot: module available: bproc, priority: 50
n0<30469> ssi:boot: opening module globus
n0<30469> ssi:boot: initializing module globus
n0<30469> ssi:boot:globus: globus-job-run not found, globus boot will not run
n0<30469> ssi:boot: module not available: globus
n0<30469> ssi:boot: opening module rsh
n0<30469> ssi:boot: initializing module rsh
n0<30469> ssi:boot:rsh: module initializing
n0<30469> ssi:boot:rsh:agent: rsh
n0<30469> ssi:boot:rsh:username: <same>
n0<30469> ssi:boot:rsh:verbose: 1000
n0<30469> ssi:boot:rsh:algorithm: linear
n0<30469> ssi:boot:rsh:priority: 10
n0<30469> ssi:boot: module available: rsh, priority: 10
n0<30469> ssi:boot: finalizing module globus
n0<30469> ssi:boot:globus: finalizing
n0<30469> ssi:boot: closing module globus
n0<30469> ssi:boot: finalizing module rsh
n0<30469> ssi:boot:rsh: finalizing
n0<30469> ssi:boot: closing module rsh
n0<30469> ssi:boot: Selected boot module bproc
 
LAM 7.0.4/MPI 2 C++/ROMIO/bproc - Indiana University
 
n0<30469> ssi:boot:base: looking for boot schema in following directories:
n0<30469> ssi:boot:base: <current directory>
n0<30469> ssi:boot:base: $TROLLIUSHOME/etc
n0<30469> ssi:boot:base: $LAMHOME/etc
n0<30469> ssi:boot:base: /usr/local/lam-7.0.4//etc
n0<30469> ssi:boot:base: looking for boot schema file:
n0<30469> ssi:boot:base: /home/becker/abatest/dmpT_CommTest.app
n0<30469> ssi:boot:base: found boot schema: /home/becker/abatest/dmpT_CommTest.app
n0<30469> ssi:boot:bproc: found the following hosts:
n0<30469> ssi:boot:bproc: n0 -1 (cpu=2)
n0<30469> ssi:boot:bproc: n1 edms-abaqus (cpu=1)
n0<30469> ssi:boot:bproc: resolved hosts:
n0<30469> ssi:boot:bproc: n0 -1 --> 192.168.0.1 (origin)
n0<30469> ssi:boot:bproc: n1 edms-abaqus --> 192.168.0.101
n0<30469> ssi:boot:bproc: starting RTE procs
n0<30469> ssi:boot:bproc:vector: starting
n0<30469> ssi:boot:bproc:vector: launching on nodes -1,-3
n0<30469> ssi:boot:bproc:vector: starting wipe on -1,-3
n0<30469> ssi:boot:bproc: execmoving tkill -d to -1,-3
n0<30469> ssi:boot:bproc:vexecmove: index 0, node -1, child about to exec /usr/local/lam-7.0.4//bin/tkill
n0<30469> ssi:boot:bproc:vexecmove: index 0, node -1, parent did fork of child as pid 30470
n0<30469> ssi:boot:bproc:vexecmove: index 1, node -3, parent did fork of child as pid 30471
n0<30469> ssi:boot:bproc:vexecmove: index 1, node -3, child about to exec /usr/local/lam-7.0.4//bin/tkill
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-becker_at_edms-abaqus-30462/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-becker_at_edms-abaqus-30462/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-becker_at_edms-abaqus-30462/lam-io-socket
tkill: f_kill = "/tmp/lam-becker_at_edms-abaqus-30462/lam-killfile"
tkill: nothing to kill: "/tmp/lam-becker_at_edms-abaqus-30462/lam-killfile"
n0<30469> ssi:boot:bproc: successfully launched all processes on -1,-3
n0<30469> ssi:boot:bproc:vector: started 2 nodes when 2 were needed
n0<30469> ssi:boot:bproc:vector: finished
n0<30469> ssi:boot:base:linear_windowed: starting
n0<30469> ssi:boot:base:linear_windowed: window size: 5
n0<30469> ssi:boot:base:server: opening server TCP socket
n0<30469> ssi:boot:base:server: opened port 35016
n0<30469> ssi:boot:base:linear_windowed: booting n0 (-1)
n0<30469> ssi:boot:bproc: starting lamd on (-1)
n0<30469> ssi:boot:bproc: execmoving /usr/local/lam-7.0.4//bin/lamd -H 192.168.0.1 -P 35016 -n 0 -o 0 -d to -1
n0<30469> ssi:boot:bproc:vexecmove: index 0, node -1, parent did fork of child as pid 30472
n0<30469> ssi:boot:bproc:vexecmove: index 0, node -1, child about to exec /usr/local/lam-7.0.4//bin/lamd
n0<30469> ssi:boot:bproc: successfully launched all processes on -1
n0<30469> ssi:boot:base:linear_windowed: booting n1 (edms-abaqus)
n0<30469> ssi:boot:bproc: starting lamd on (edms-abaqus)
n0<30469> ssi:boot:bproc: execmoving /usr/local/lam-7.0.4//bin/lamd -H 192.168.0.1 -P 35016 -n 1 -o 0 -d to -3
n0<30469> ssi:boot:bproc:vexecmove: index 0, node -3, parent did fork of child as pid 30473
n0<30469> ssi:boot:bproc: successfully launched all processes on -3
n0<30469> ssi:boot:base:linear_windowed: finished launching
n0<30469> ssi:boot:base:server: expecting connection from finite list
n0<30469> ssi:boot:bproc:vexecmove: index 0, node -3, child about to exec /usr/local/lam-7.0.4//bin/lamd
n-1<30472> ssi:boot: Opening
n-1<30472> ssi:boot: opening module bproc
n-1<30472> ssi:boot: initializing module bproc
n-1<30472> ssi:boot:bproc: module initializing
n-1<30472> ssi:boot:bproc:verbose: 1000
n-1<30472> ssi:boot:bproc:priority: 50
n-1<30472> ssi:boot: module available: bproc, priority: 50
n-1<30472> ssi:boot: opening module globus
n-1<30472> ssi:boot: initializing module globus
n-1<30472> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<30472> ssi:boot: module not available: globus
n-1<30472> ssi:boot: opening module rsh
n-1<30472> ssi:boot: initializing module rsh
n-1<30472> ssi:boot:rsh: module initializing
n-1<30472> ssi:boot:rsh:agent: rsh
n-1<30472> ssi:boot:rsh:username: <same>
n-1<30472> ssi:boot:rsh:verbose: 1000
n-1<30472> ssi:boot:rsh:algorithm: linear
n-1<30472> ssi:boot:rsh:priority: 10
n-1<30472> ssi:boot: module available: rsh, priority: 10
n-1<30472> ssi:boot: finalizing module globus
n-1<30472> ssi:boot:globus: finalizing
n-1<30472> ssi:boot: closing module globus
n-1<30472> ssi:boot: finalizing module rsh
n-1<30472> ssi:boot:rsh: finalizing
n-1<30472> ssi:boot: closing module rsh
n-1<30472> ssi:boot: Selected boot module bproc
n0<30469> ssi:boot:base:server: got connection from 192.168.0.1
n0<30469> ssi:boot:base:server: this connection is expected (n0)
n0<30469> ssi:boot:base:server: remote lamd is at 192.168.0.1:33225
n0<30469> ssi:boot:base:server: expecting connection from finite list
n0<30469> ssi:boot:base:server: got connection from 0.0.0.0
-----------------------------------------------------------------------------
The lamboot agent timed out while waiting for the newly-booted process
to call back and indicated that it had successfully booted.
 
As far as LAM could tell, the remote process started properly, but
then never called back. Possible reasons that this may happen:
 
        - There are network filters between the lamboot agent host and
          the remote host such that communication on random TCP ports
          is blocked
        - Network routing from the remote host to the local host isn't
          properly configured (this is uncommon)
 
You can check these things by watching the output from "lamboot -d".
 
1. On the command line for hboot, there are two important parameters:
   one is the IP address of where the lamboot agent was invoked, the
   other is the port number that the lamboot agent is expecting the
   newly-booted process to call back on (this will be a random
   integer).
 
2. Manually login to the remote machine and try to telnet to the port
   indicated on the hboot command line. For example,
       telnet <ipnumber> <portnumber>
   If all goes well, you should get a "Connection refused" error. If
   you get any other kind of error, it could indicate either of the
   two conditions above. Consult with your system/network
   administrator.
-----------------------------------------------------------------------------
n0<30469> ssi:boot:base:server: failed to connect to remote lamd!
n0<30469> ssi:boot:base:server: closing server socket
n0<30469> ssi:boot:base:linear_windowed: aborted!
-----------------------------------------------------------------------------
lamboot encountered some error (see above) during the boot process,
and will now attempt to kill all nodes that it was previously able to
boot (if any).
 
Please wait for LAM to finish; if you interrupt this process, you may
have LAM daemons still running on remote nodes.
-----------------------------------------------------------------------------
n0<30474> ssi:boot: Opening
n0<30474> ssi:boot: opening module bproc
n0<30474> ssi:boot: initializing module bproc
n0<30474> ssi:boot:bproc: module initializing
n0<30474> ssi:boot:bproc:verbose: 1000
n0<30474> ssi:boot:bproc:priority: 50
n0<30474> ssi:boot: module available: bproc, priority: 50
n0<30474> ssi:boot: opening module globus
n0<30474> ssi:boot: initializing module globus
n0<30474> ssi:boot:globus: globus-job-run not found, globus boot will not run
n0<30474> ssi:boot: module not available: globus
n0<30474> ssi:boot: opening module rsh
n0<30474> ssi:boot: initializing module rsh
n0<30474> ssi:boot:rsh: module initializing
n0<30474> ssi:boot:rsh:agent: rsh
n0<30474> ssi:boot:rsh:username: <same>
n0<30474> ssi:boot:rsh:verbose: 1000
n0<30474> ssi:boot:rsh:algorithm: linear
n0<30474> ssi:boot:rsh:priority: 10
n0<30474> ssi:boot: module available: rsh, priority: 10
n0<30474> ssi:boot: finalizing module globus
n0<30474> ssi:boot:globus: finalizing
n0<30474> ssi:boot: closing module globus
n0<30474> ssi:boot: finalizing module rsh
n0<30474> ssi:boot:rsh: finalizing
n0<30474> ssi:boot: closing module rsh
n0<30474> ssi:boot: Selected boot module bproc
n0<30474> ssi:boot:base: looking for boot schema in following directories:
n0<30474> ssi:boot:base: <current directory>
n0<30474> ssi:boot:base: $TROLLIUSHOME/etc
n0<30474> ssi:boot:base: $LAMHOME/etc
n0<30474> ssi:boot:base: /usr/local/lam-7.0.4//etc
n0<30474> ssi:boot:base: looking for boot schema file:
n0<30474> ssi:boot:base: /home/becker/abatest/dmpT_CommTest.app
n0<30474> ssi:boot:base: found boot schema: /home/becker/abatest/dmpT_CommTest.app
n0<30474> ssi:boot:bproc: found the following hosts:
n0<30474> ssi:boot:bproc: n0 -1 (cpu=2)
n0<30474> ssi:boot:bproc: n1 edms-abaqus (cpu=1)
n0<30474> ssi:boot:bproc: resolved hosts:
n0<30474> ssi:boot:bproc: n0 -1 --> 192.168.0.1 (origin)
n0<30474> ssi:boot:bproc: n1 edms-abaqus --> 192.168.0.101
n0<30474> ssi:boot:bproc: starting RTE procs
n0<30474> ssi:boot:bproc:vector: starting
n0<30474> ssi:boot:bproc:vector: launching on nodes -1,-3
n0<30474> ssi:boot:bproc:vector: starting wipe on -1,-3
n0<30474> ssi:boot:bproc: execmoving tkill -d to -1,-3
n0<30474> ssi:boot:bproc:vexecmove: index 0, node -1, parent did fork of child as pid 30475
n0<30474> ssi:boot:bproc:vexecmove: index 0, node -1, child about to exec /usr/local/lam-7.0.4//bin/tkill
n0<30474> ssi:boot:bproc:vexecmove: index 1, node -3, parent did fork of child as pid 30476
n0<30474> ssi:boot:bproc:vexecmove: index 1, node -3, child about to exec /usr/local/lam-7.0.4//bin/tkill
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-becker_at_edms-abaqus-30462/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-becker_at_edms-abaqus-30462/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-becker_at_edms-abaqus-30462/lam-io-socket
tkill: f_kill = "/tmp/lam-becker_at_edms-abaqus-30462/lam-killfile"
tkill: killing LAM...
tkill: killing PID (SIGHUP) 30472 ...
tkill: trying -9 ...
tkill: cannot kill
-----------------------------------------------------------------------------
tkill failed to kill the LAM daemon; I think that it is PID 30472, but
"kill" and "kill -9" did not seem to kill it.
 
Things to check:
 
        - Do a "ps" and see if the process still exists
        - Use the Unix kill(1) command to kill the process
-----------------------------------------------------------------------------
tkill: all finished
n0<30474> ssi:boot:bproc: successfully launched all processes on -1,-3
n0<30474> ssi:boot:bproc:vector: started 2 nodes when 2 were needed
n0<30474> ssi:boot:bproc:vector: finished
n0<30474> ssi:boot:bproc: all RTE procs started
n0<30474> ssi:boot:bproc: finalizing
n0<30474> ssi:boot: Closing
lamboot did NOT complete successfully
mpiexec: Inside handle_waitpid_status Function: lamboot, Error Status: 28160
mpiexec: mpiexec_die called
 
LAM 7.0.4/MPI 2 C++/ROMIO/bproc - Indiana University
 
-----------------------------------------------------------------------------
It seems that there is no lamd running on the host edms-abaqus.
 
This indicates that the LAM/MPI runtime environment is not operating.
The LAM/MPI runtime environment is necessary for the "lamhalt" command.
 
Please run the "lamboot" command the start the LAM/MPI runtime
environment. See the LAM/MPI documentation for how to invoke
"lamboot" across multiple machines.
-----------------------------------------------------------------------------
 
Tue 26 Apr 2005 02:13:32 PM EDT
Phase ABAQUS Unit Test status 253
ABAQUS Error: ABAQUS Unit Test exited with an error.
[becker_at_edms-abaqus abatest]$