Hi all!
Maybe someone can help me with this issue.
I have a problem with LAM/MPI 7.0.6 on RHEL4 that did not occur with
LAM 6.5.9 on Red Hat 9.
We use simulation packages (e.g. LS-DYNA) on dual-processor machines.
To take full advantage of the 64-bit architecture and OS with this
software, we need to run it in parallel mode, which is done through
LAM/MPI (the developers compiled the software specifically for LAM/MPI
on the 64-bit EM64T architecture).
To do that, I first need to start the LAM runtime with the "lamboot"
command, which should start the MPI environment and enable both CPUs.
When I run "lamboot", I get the following output:
$ lamboot -v
LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University
n-1<4268> ssi:boot:base:linear: booting n0 (localhost)
n-1<4268> ssi:boot:base:linear: finished
As I read it, the process exited without enabling node1, and therefore
the initialization fails.
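For reference, here is a minimal sketch of how a boot schema maps to nodes and CPUs. It assumes the usual LAM boot schema layout of one host per line with an optional cpu=N key, "#" comment lines, and a default of one CPU when cpu= is absent. The function name count_schema is made up for this illustration and is not part of LAM.

```shell
# Sketch only: list the nodes and CPU counts a boot schema describes.
# Assumes one host per line, an optional cpu=N key, and "#" comment lines.
# count_schema is a hypothetical helper name, not a LAM command.
count_schema() {
    awk '
        /^[ \t]*(#|$)/ { next }    # skip comment lines and blank lines
        {
            cpus = 1               # assumed default when no cpu= key is given
            for (i = 2; i <= NF; i++)
                if ($i ~ /^cpu=[0-9]+$/) cpus = substr($i, 5) + 0
            printf "n%d %s cpus=%d\n", nodes, $1, cpus
            nodes++
            total += cpus
        }
        END { printf "nodes=%d cpus=%d\n", nodes, total }
    '
}

# Example: a schema with a single host line.
printf 'redhat2 cpu=2\n' | count_schema
```

Under these assumptions, a schema containing only "redhat2 cpu=2" describes a single node n0 with two CPUs, so a second node would only be listed if the schema had a second host line.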
At first I thought this was because rsh printed some messages when
connecting:
$ rsh redhat2
connect to address 192.168.1.11: Connection refused
Trying krb4 rlogin...
connect to address 192.168.1.11: Connection refused
trying normal rlogin (/usr/bin/rlogin)
Last login: Thu May 11 11:22:38 from redhat15
After fiddling enough ( ;-) ) I managed to get rid of those messages,
but the problem still persists.
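For anyone who wants to repeat the remote-shell check, here is a small sketch that reports whether a command writes anything to stderr. My understanding, which I have not verified against the 7.0.6 sources, is that LAM's rsh boot module can treat stderr output from the remote agent as a failed launch. The helper name stderr_is_quiet is made up for this example.

```shell
# Sketch only: report whether a command prints anything on stderr.
# stderr_is_quiet is a hypothetical helper name, not a LAM tool.
# Usage: stderr_is_quiet CMD [ARGS...]
stderr_is_quiet() {
    # Capture stderr only: 2>&1 sends stderr to the capture,
    # then >/dev/null discards stdout.
    err_output=$("$@" 2>&1 >/dev/null)
    if [ -z "$err_output" ]; then
        echo "quiet"
    else
        echo "noisy"
    fi
}

# Local demonstration with stand-in commands:
stderr_is_quiet sh -c 'echo hello'
stderr_is_quiet sh -c 'echo warning >&2'

# Hypothetical use against the host from this message (not run here):
#   stderr_is_quiet rsh redhat2 -n 'echo $SHELL'
```

The same check could be pointed at whichever agent lamboot actually uses on the machine.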
When I try to run the job, it exits miserably...
The lamboot -d command returns the following:
$ lamboot $HOME/cluster -d
n-1<8737> ssi:boot: Opening
n-1<8737> ssi:boot: opening module globus
n-1<8737> ssi:boot: initializing module globus
n-1<8737> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<8737> ssi:boot: module not available: globus
n-1<8737> ssi:boot: opening module rsh
n-1<8737> ssi:boot: initializing module rsh
n-1<8737> ssi:boot:rsh: module initializing
n-1<8737> ssi:boot:rsh:agent: /usr/bin/ssh -x -a
n-1<8737> ssi:boot:rsh:username: <same>
n-1<8737> ssi:boot:rsh:verbose: 1000
n-1<8737> ssi:boot:rsh:algorithm: linear
n-1<8737> ssi:boot:rsh:priority: 10
n-1<8737> ssi:boot: module available: rsh, priority: 10
n-1<8737> ssi:boot: finalizing module globus
n-1<8737> ssi:boot:globus: finalizing
n-1<8737> ssi:boot: closing module globus
n-1<8737> ssi:boot: Selected boot module rsh
LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University
n-1<8737> ssi:boot:base: looking for boot schema in following directories:
n-1<8737> ssi:boot:base: <current directory>
n-1<8737> ssi:boot:base: $TROLLIUSHOME/etc
n-1<8737> ssi:boot:base: $LAMHOME/etc
n-1<8737> ssi:boot:base: /etc/lam
n-1<8737> ssi:boot:base: looking for boot schema file:
n-1<8737> ssi:boot:base: /home/catusr/cluster
n-1<8737> ssi:boot:base: found boot schema: /home/catusr/cluster
n-1<8737> ssi:boot:rsh: found the following hosts:
n-1<8737> ssi:boot:rsh: n0 redhat2 (cpu=2)
n-1<8737> ssi:boot:rsh: resolved hosts:
n-1<8737> ssi:boot:rsh: n0 redhat2 --> 192.168.1.11 (origin)
n-1<8737> ssi:boot:rsh: starting RTE procs
n-1<8737> ssi:boot:base:linear: starting
n-1<8737> ssi:boot:base:server: opening server TCP socket
n-1<8737> ssi:boot:base:server: opened port 33121
n-1<8737> ssi:boot:base:linear: booting n0 (redhat2)
n-1<8737> ssi:boot:rsh: starting lamd on (redhat2)
n-1<8737> ssi:boot:rsh: starting on n0 (redhat2): hboot -t -c lam-conf.lamd -d -I -H 192.168.1.11 -P 33121 -n 0 -o 0
n-1<8737> ssi:boot:rsh: launching locally
hboot: performing tkill
hboot: tkill -d
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-catusr_at_redhat2/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-catusr_at_redhat2/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-catusr_at_redhat2/lam-io-socket
tkill: f_kill = "/tmp/lam-catusr_at_redhat2/lam-killfile"
tkill: killing LAM...
tkill: killing PID (SIGHUP) 8718 ...
tkill: killed
tkill: all finished
hboot: booting...
hboot: fork /usr/bin/lamd
hboot: attempting to execute
[1] 8740 lamd -H 192.168.1.11 -P 33121 -n 0 -o 0 -d
n-1<8737> ssi:boot:rsh: successfully launched on n0 (redhat2)
n-1<8737> ssi:boot:base:server: expecting connection from finite list
n-1<8740> ssi:boot: Opening
n-1<8740> ssi:boot: opening module globus
n-1<8740> ssi:boot: initializing module globus
n-1<8740> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<8740> ssi:boot: module not available: globus
n-1<8740> ssi:boot: opening module rsh
n-1<8740> ssi:boot: initializing module rsh
n-1<8740> ssi:boot:rsh: module initializing
n-1<8740> ssi:boot:rsh:agent: /usr/bin/ssh -x -a
n-1<8740> ssi:boot:rsh:username: <same>
n-1<8740> ssi:boot:rsh:verbose: 1000
n-1<8740> ssi:boot:rsh:algorithm: linear
n-1<8740> ssi:boot:rsh:priority: 10
n-1<8740> ssi:boot: module available: rsh, priority: 10
n-1<8740> ssi:boot: finalizing module globus
n-1<8740> ssi:boot:globus: finalizing
n-1<8740> ssi:boot: closing module globus
n-1<8740> ssi:boot: Selected boot module rsh
n-1<8737> ssi:boot:base:server: got connection from 192.168.1.11
n-1<8737> ssi:boot:base:server: this connection is expected (n0)
n-1<8737> ssi:boot:base:server: remote lamd is at 192.168.1.11:32772
n-1<8737> ssi:boot:base:server: closing server socket
n-1<8737> ssi:boot:base:server: connecting to lamd at 192.168.1.11:33122
n-1<8737> ssi:boot:base:server: connected
n-1<8737> ssi:boot:base:server: sending number of links (1)
n-1<8737> ssi:boot:base:server: sending info: n0 (redhat2)
n-1<8737> ssi:boot:base:server: finished sending
n-1<8737> ssi:boot:base:server: disconnected from 192.168.1.11:33122
n-1<8737> ssi:boot:base:linear: finished
n-1<8737> ssi:boot:rsh: all RTE procs started
n-1<8737> ssi:boot:rsh: finalizing
n-1<8737> ssi:boot: Closing
n-1<8740> ssi:boot:rsh: finalizing
n-1<8740> ssi:boot: Closing
It looks to me as if the LAM process dies for no evident reason.
I had a look at /tmp/lam-debug-log.txt and I can see that the process
exits, but nothing in it tells me what went wrong :-( (The contents of
lam-debug-log.txt are included at the bottom of this message.)
Does anybody have an idea how to solve the problem?
Any help will be greatly appreciated!
Thank you
Cheers

Contents of /tmp/lam-debug-log.txt:
started (7.0.6), uid 300, gid 304
kernel: initialized
Link 0: node: 0, cpus: 2, type: 384, ip: 192.168.1.11
kio_req: new client on fd=13
kouter: attached process pid=4560, pri=1095, fd=13
flatd: flqload - successfully created file /tmp/lam-catusr_at_redhat2/lam-flatd0
flatd: flqload - file descriptor 14
flatd: flqload - successfully appended 74 bytes to /tmp/lam-catusr_at_redhat2/lam-flatd0
kenyad: pqcreating with rtf 0x79010
kenyad: looking for executable "/swcae/ls-dyna/mpp970_s_6763_em64t_linux_lam659_dynamic" in directory "/usr2/CAE"
kenyad: found "/swcae/ls-dyna/mpp970_s_6763_em64t_linux_lam659_dynamic"
kenyad: creating new user process...
kenyad: attempting to receive stdout/stderr file descriptors
kenyad: recv_stdio_fds: happiness
kenyad: setting environment variables to pass to new process
kenyad: setting TROLLIUSFD
kenyad: setting TROLLIUSRTF
kenyad: setting LAMJOBID
kenyad: setting LAMKENYAPID
kenyad: setting LAMWORLD
kenyad: setting LAMPARENT
kenyad: setting LAMRANK
kenyad: checking for working directory flag
kenyad: working directory set explicitly
kenyad: running in directory /usr2/CAE
kenyad: fork/exec succeeded, pid 4561, index 11, rtf 0x79012
kenyad: create succeeded, process running
flatd: flqload - successfully created file /tmp/lam-catusr_at_redhat2/lam-flatd1
flatd: flqload - file descriptor 14
flatd: flqload - successfully appended 74 bytes to /tmp/lam-catusr_at_redhat2/lam-flatd1
kenyad: pqcreating with rtf 0x79010
kenyad: looking for executable "/swcae/ls-dyna/mpp970_s_6763_em64t_linux_lam659_dynamic" in directory "/usr2/CAE"
kenyad: found "/swcae/ls-dyna/mpp970_s_6763_em64t_linux_lam659_dynamic"
kenyad: creating new user process...
kenyad: attempting to receive stdout/stderr file descriptors
kenyad: recv_stdio_fds: happiness
kenyad: setting environment variables to pass to new process
kenyad: setting TROLLIUSFD
kenyad: setting TROLLIUSRTF
kenyad: setting LAMJOBID
kenyad: setting LAMKENYAPID
kenyad: setting LAMWORLD
kenyad: setting LAMPARENT
kenyad: setting LAMRANK
kenyad: checking for working directory flag
kenyad: working directory set explicitly
kenyad: running in directory /usr2/CAE
kenyad: fork/exec succeeded, pid 4562, index 12, rtf 0x79012
kenyad: create succeeded, process running
died: caught child death; trying to detach
kouter: kqdetach detached process pid=4560
kouter: kqdetach calling kio_close
kouter: kqdetach calling knuke
kio_req: new client on fd=13
kouter: attached process pid=4563, pri=1095, fd=13
flatd: flqload - successfully created file /tmp/lam-catusr_at_redhat2/lam-flatd2
flatd: flqload - file descriptor 14
flatd: flqload - successfully appended 73 bytes to /tmp/lam-catusr_at_redhat2/lam-flatd2
kenyad: pqcreating with rtf 0x79010
kenyad: looking for executable "/swcae/ls-dyna/mpp970_s_6763_em64t_linux_lam659_static" in directory "/usr2/CAE"
kenyad: found "/swcae/ls-dyna/mpp970_s_6763_em64t_linux_lam659_static"
kenyad: creating new user process...
kenyad: attempting to receive stdout/stderr file descriptors
kenyad: recv_stdio_fds: happiness
kenyad: setting environment variables to pass to new process
kenyad: setting TROLLIUSFD
kenyad: setting TROLLIUSRTF
kenyad: setting LAMJOBID
kenyad: setting LAMKENYAPID
kenyad: setting LAMWORLD
kenyad: setting LAMPARENT
kenyad: setting LAMRANK
kenyad: checking for working directory flag
kenyad: working directory set explicitly
kenyad: running in directory /usr2/CAE
kenyad: fork/exec succeeded, pid 4564, index 11, rtf 0x79012
kenyad: create succeeded, process running
flatd: flqload - successfully created file /tmp/lam-catusr_at_redhat2/lam-flatd3
flatd: flqload - file descriptor 14
flatd: flqload - successfully appended 73 bytes to /tmp/lam-catusr_at_redhat2/lam-flatd3
kenyad: pqcreating with rtf 0x79010
kenyad: looking for executable "/swcae/ls-dyna/mpp970_s_6763_em64t_linux_lam659_static" in directory "/usr2/CAE"
kenyad: found "/swcae/ls-dyna/mpp970_s_6763_em64t_linux_lam659_static"
kenyad: creating new user process...
kenyad: attempting to receive stdout/stderr file descriptors
kenyad: recv_stdio_fds: happiness
kenyad: setting environment variables to pass to new process
kenyad: setting TROLLIUSFD
kenyad: setting TROLLIUSRTF
kenyad: setting LAMJOBID
kenyad: setting LAMKENYAPID
kenyad: setting LAMWORLD
kenyad: setting LAMPARENT
kenyad: setting LAMRANK
kenyad: checking for working directory flag
kenyad: working directory set explicitly
kenyad: running in directory /usr2/CAE
kenyad: fork/exec succeeded, pid 4565, index 12, rtf 0x79012
kenyad: create succeeded, process running
died: caught child death; trying to detach
kouter: kqdetach detached process pid=4563
kouter: kqdetach calling kio_close
kouter: kqdetach calling knuke
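
To make a log like the one above easier to skim, here is a rough sketch that condenses it into one line per launched process and one line per reported child death. It assumes the line formats shown in this message ("kenyad: found ...", "kenyad: fork/exec succeeded, pid N, ..." and "died: caught child death ..."). The function name summarize_lam_log is made up, and the sample lines in the demonstration are shortened stand-ins, not real log content.

```shell
# Sketch only: condense a LAM debug log into launch and death events.
# summarize_lam_log is a hypothetical helper name, not a LAM tool.
# Reads the log on stdin.
summarize_lam_log() {
    awk '
        /^kenyad: found / {
            exe = $3
            gsub(/"/, "", exe)     # drop the surrounding double quotes
        }
        /^kenyad: fork\/exec succeeded/ {
            pid = $5
            sub(/,$/, "", pid)     # drop the trailing comma after the pid
            print "started pid " pid " " exe
        }
        /^died: caught child death/ {
            print "child death reported"
        }
    '
}

# Demonstration with shortened stand-in lines:
printf '%s\n' \
    'kenyad: found "/path/to/solver"' \
    'kenyad: fork/exec succeeded, pid 100, index 11, rtf 0x79012' \
    'died: caught child death; trying to detach' | summarize_lam_log

# Hypothetical use on the real file (not run here):
#   summarize_lam_log < /tmp/lam-debug-log.txt
```

This only reorganizes what the log already says; it does not explain why the child processes exit.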
--
--------------------------------
| __ __ | Valter DAL BO
| / \ /| |'-. | e-mail: dalbo_at_[hidden]
| .\__/ || | | |
| _ / `._ \|_|_.-' | Tesco TS S.p.A.
| | / \__.`=._) (_ | http://www.tesco.it
| |/ ._/ |"""""""""| |
| |'. `\ | | | tel.: +390113011711
| ;"""/ / | | | fax : +390113140362
| ) /_/| |.-------.| | mobile: +393357707810
| ' `-`' " " | C.so Tazzoli 10137 Torino ITALY
--------------------------------