Hi Jeff,
Many thanks for your mail. If I put a busy loop such as while(1); in the
program, I find that it runs on both machines. So I guess something is
preventing stdout from being redirected correctly from the slave machine.
I have attached the lamboot log below, in case it gives a clue.
Thanks,
Anirban
lamboot -v -d lamhosts
n-1<24860> ssi:boot:open: opening
n-1<24860> ssi:boot:open: opening boot module globus
n-1<24860> ssi:boot:open: opened boot module globus
n-1<24860> ssi:boot:open: opening boot module rsh
n-1<24860> ssi:boot:open: opened boot module rsh
n-1<24860> ssi:boot:open: opening boot module slurm
n-1<24860> ssi:boot:open: opened boot module slurm
n-1<24860> ssi:boot:select: initializing boot module slurm
n-1<24860> ssi:boot:slurm: not running under SLURM
n-1<24860> ssi:boot:select: boot module not available: slurm
n-1<24860> ssi:boot:select: initializing boot module rsh
n-1<24860> ssi:boot:rsh: module initializing
n-1<24860> ssi:boot:rsh:agent: ssh -x
n-1<24860> ssi:boot:rsh:username: <same>
n-1<24860> ssi:boot:rsh:verbose: 1000
n-1<24860> ssi:boot:rsh:algorithm: linear
n-1<24860> ssi:boot:rsh:no_n: 0
n-1<24860> ssi:boot:rsh:no_profile: 0
n-1<24860> ssi:boot:rsh:fast: 0
n-1<24860> ssi:boot:rsh:ignore_stderr: 0
n-1<24860> ssi:boot:rsh:priority: 10
n-1<24860> ssi:boot:select: boot module available: rsh, priority: 10
n-1<24860> ssi:boot:select: initializing boot module globus
n-1<24860> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<24860> ssi:boot:select: boot module not available: globus
n-1<24860> ssi:boot:select: finalizing boot module slurm
n-1<24860> ssi:boot:slurm: finalizing
n-1<24860> ssi:boot:select: closing boot module slurm
n-1<24860> ssi:boot:select: finalizing boot module globus
n-1<24860> ssi:boot:globus: finalizing
n-1<24860> ssi:boot:select: closing boot module globus
n-1<24860> ssi:boot:select: selected boot module rsh
LAM 7.1.2/MPI 2 C++/ROMIO - Indiana University
n-1<24860> ssi:boot:base: looking for boot schema in following directories:
n-1<24860> ssi:boot:base: <current directory>
n-1<24860> ssi:boot:base: $TROLLIUSHOME/etc
n-1<24860> ssi:boot:base: $LAMHOME/etc
n-1<24860> ssi:boot:base: /usr/lib/lam/etc
n-1<24860> ssi:boot:base: looking for boot schema file:
n-1<24860> ssi:boot:base: lamhosts
n-1<24860> ssi:boot:base: found boot schema: lamhosts
n-1<24860> ssi:boot:rsh: found the following hosts:
n-1<24860> ssi:boot:rsh: n0 alwolf00 (cpu=1)
n-1<24860> ssi:boot:rsh: n1 alwolf01 (cpu=1)
n-1<24860> ssi:boot:rsh: resolved hosts:
n-1<24860> ssi:boot:rsh: n0 alwolf00 --> 10.1.78.138 (origin)
n-1<24860> ssi:boot:rsh: n1 alwolf01 --> 10.1.78.141
n-1<24860> ssi:boot:rsh: starting RTE procs
n-1<24860> ssi:boot:base:linear: starting
n-1<24860> ssi:boot:base:server: opening server TCP socket
n-1<24860> ssi:boot:base:server: opened port 43692
n-1<24860> ssi:boot:base:linear: booting n0 (alwolf00)
n-1<24860> ssi:boot:rsh: starting lamd on (alwolf00)
n-1<24860> ssi:boot:rsh: starting on n0 (alwolf00): hboot -t -c lam-conf.lamd -d -v -I -H 10.1.78.138 -P 43692 -n 0 -o 0
n-1<24860> ssi:boot:rsh: launching locally
hboot: performing tkill
hboot: tkill -d
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-wolf_at_alwolf00/lam-killfile
tkill: f_kill = "/tmp/lam-wolf_at_alwolf00/lam-killfile"
tkill: nothing to kill: "/tmp/lam-wolf_at_alwolf00/lam-killfile"
hboot: booting...
hboot: fork /usr/bin/lamd
hboot: attempting to execute
[1] 24863 lamd -H 10.1.78.138 -P 43692 -n 0 -o 0 -d
n-1<24860> ssi:boot:rsh: successfully launched on n0 (alwolf00)
n-1<24860> ssi:boot:base:server: expecting connection from finite list
n-1<24863> ssi:boot:open: opening
n-1<24863> ssi:boot:open: opening boot module globus
n-1<24863> ssi:boot:open: opened boot module globus
n-1<24863> ssi:boot:open: opening boot module rsh
n-1<24863> ssi:boot:open: opened boot module rsh
n-1<24863> ssi:boot:open: opening boot module slurm
n-1<24863> ssi:boot:open: opened boot module slurm
n-1<24863> ssi:boot:select: initializing boot module slurm
n-1<24863> ssi:boot:slurm: not running under SLURM
n-1<24863> ssi:boot:select: boot module not available: slurm
n-1<24863> ssi:boot:select: initializing boot module rsh
n-1<24863> ssi:boot:rsh: module initializing
n-1<24863> ssi:boot:rsh:agent: ssh -x
n-1<24863> ssi:boot:rsh:username: <same>
n-1<24863> ssi:boot:rsh:verbose: 1000
n-1<24863> ssi:boot:rsh:algorithm: linear
n-1<24863> ssi:boot:rsh:no_n: 0
n-1<24863> ssi:boot:rsh:no_profile: 0
n-1<24863> ssi:boot:rsh:fast: 0
n-1<24863> ssi:boot:rsh:ignore_stderr: 0
n-1<24863> ssi:boot:rsh:priority: 10
n-1<24863> ssi:boot:select: boot module available: rsh, priority: 10
n-1<24863> ssi:boot:select: initializing boot module globus
n-1<24863> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<24863> ssi:boot:select: boot module not available: globus
n-1<24863> ssi:boot:select: finalizing boot module slurm
n-1<24863> ssi:boot:slurm: finalizing
n-1<24863> ssi:boot:select: closing boot module slurm
n-1<24863> ssi:boot:select: finalizing boot module globus
n-1<24863> ssi:boot:globus: finalizing
n-1<24863> ssi:boot:select: closing boot module globus
n-1<24863> ssi:boot:select: selected boot module rsh
n-1<24863> ssi:boot:send_lamd: getting node ID from command line
n-1<24863> ssi:boot:send_lamd: getting agent haddr from command line
n-1<24863> ssi:boot:send_lamd: getting agent port from command line
n-1<24863> ssi:boot:send_lamd: getting node ID from command line
n-1<24863> ssi:boot:send_lamd: connecting to 10.1.78.138:43692, node id 0
n-1<24860> ssi:boot:base:server: got connection from 10.1.78.138
n-1<24860> ssi:boot:base:server: this connection is expected (n0)
n-1<24863> ssi:boot:send_lamd: sending dli_port 40815
n-1<24860> ssi:boot:base:server: remote lamd is at 10.1.78.138:40815
n-1<24860> ssi:boot:base:linear: booting n1 (alwolf01)
n-1<24860> ssi:boot:rsh: starting lamd on (alwolf01)
n-1<24860> ssi:boot:rsh: starting on n1 (alwolf01): hboot -t -c lam-conf.lamd -d -v -s -I "-H 10.1.78.138 -P 43692 -n 1 -o 0"
n-1<24860> ssi:boot:rsh: launching remotely
n-1<24860> ssi:boot:rsh: attempting to execute: ssh -x alwolf01 -n 'echo $SHELL'
n-1<24860> ssi:boot:rsh: remote shell /bin/bash
n-1<24860> ssi:boot:rsh: attempting to execute: ssh -x alwolf01 -n hboot -t -c lam-conf.lamd -d -v -s -I '"-H 10.1.78.138 -P 43692 -n 1 -o 0"'
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-wolf_at_alwolf01/lam-killfile
tkill: f_kill = "/tmp/lam-wolf_at_alwolf01/lam-killfile"
tkill: nothing to kill: "/tmp/lam-wolf_at_alwolf01/lam-killfile"
hboot: performing tkill
hboot: tkill -d
hboot: booting...
hboot: fork /usr/bin/lamd
[1] 14315 lamd -H 10.1.78.138 -P 43692 -n 1 -o 0 -d
n-1<24860> ssi:boot:rsh: successfully launched on n1 (alwolf01)
n-1<24860> ssi:boot:base:server: expecting connection from finite list
n-1<24860> ssi:boot:base:server: got connection from 10.1.78.141
n-1<24860> ssi:boot:base:server: this connection is expected (n1)
n-1<24860> ssi:boot:base:server: remote lamd is at 10.1.78.141:52724
n-1<24860> ssi:boot:base:server: closing server socket
n-1<24860> ssi:boot:base:server: connecting to lamd at 10.1.78.138:35556
n-1<24860> ssi:boot:base:server: connected
n-1<24860> ssi:boot:base:server: sending number of links (2)
n-1<24860> ssi:boot:base:server: sending info: n0 (alwolf00)
n-1<24860> ssi:boot:base:server: sending info: n1 (alwolf01)
n-1<24863> ssi:boot:rsh: finalizing
n-1<24863> ssi:boot: Closing
n-1<24860> ssi:boot:base:server: finished sending
n-1<24860> ssi:boot:base:server: disconnected from 10.1.78.138:35556
n-1<24860> ssi:boot:base:server: connecting to lamd at 10.1.78.141:55884
n-1<24860> ssi:boot:base:server: connected
n-1<24860> ssi:boot:base:server: sending number of links (2)
n-1<24860> ssi:boot:base:server: sending info: n0 (alwolf00)
n-1<24860> ssi:boot:base:server: sending info: n1 (alwolf01)
n-1<24860> ssi:boot:base:server: finished sending
n-1<24860> ssi:boot:base:server: disconnected from 10.1.78.141:55884
n-1<24860> ssi:boot:base:linear: finished
n-1<24860> ssi:boot:rsh: all RTE procs started
n-1<24860> ssi:boot:rsh: finalizing
n-1<24860> ssi:boot: Closing
wolf_at_alwolf00:/Wulf$
On Fri, May 14, 2010 at 2:57 PM, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> On May 14, 2010, at 9:38 AM, Anirban Lahiri wrote:
>
> > I would be happy to use OpenMPI. However, I haven't been able to compile
> > OpenMPI correctly on ARMv7. When I try to compile it, OpenMPI says that it
> > cannot find the atomic operations for the specific architecture. I know it
> > is possible to overcome this by implementing the ARMv7 specific parts in the
> > OpenMPI libraries. The atomic primitives for ARMv7 already exist in Linux so
> > it should not be too difficult to do. Unfortunately, I currently don't have
> > this scheduled in.
>
> Fair enough. If you ever get the opportunity to send us a patch for Open
> MPI, that would be great. I don't think we have any current users on ARM;
> that's probably why it doesn't work.
>
> > However, there have been quite a few implementations of LAM on ARMv7
> > platforms. Therefore that's an easier route for the time being.
>
> Sounds good.
>
> I don't have any insight of why you're not seeing the stdout. You might
> want to drop a file in each MPI process and see if the procs are actually
> running out on your nodes (they probably are, given that mpirun completes
> successfully, but it's good to check).
>
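
For the suggestion above about dropping a file from each MPI process, a
minimal sketch in C could look like the following. It does not rely on stdout
at all, so it should show whether the processes really start on each node even
if output forwarding is broken. The helper name write_marker(), the /tmp
directory and the file name pattern are only assumptions for illustration;
they do not come from LAM.

```c
/* Sketch: leave a marker file behind so that it is possible to tell whether
 * a process really ran on a given node, independent of stdout forwarding.
 * write_marker() and the file name pattern are made up for this example. */
#include <stdio.h>
#include <sys/utsname.h>
#include <unistd.h>

/* Creates <dir>/mpi-marker.<rank>.<pid> containing host name, rank and pid.
 * Returns 0 on success, -1 on failure. */
int write_marker(const char *dir, int rank)
{
    struct utsname u;
    char path[512];
    FILE *fp;
    int n;

    /* uname() gives the node name of the machine this process runs on. */
    if (uname(&u) < 0)
        return -1;

    n = snprintf(path, sizeof(path), "%s/mpi-marker.%d.%ld",
                 dir, rank, (long)getpid());
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    fp = fopen(path, "w");
    if (fp == NULL)
        return -1;

    fprintf(fp, "host=%s rank=%d pid=%ld\n",
            u.nodename, rank, (long)getpid());
    return fclose(fp) == 0 ? 0 : -1;
}

/* In the MPI program, call it as soon as the rank is known, e.g.:
 *
 *     MPI_Init(&argc, &argv);
 *     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
 *     write_marker("/tmp", rank);
 */
```

After mpirun finishes, listing /tmp/mpi-marker.* on alwolf00 and alwolf01
should show one file per process that actually ran there.
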
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>