LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2006-08-02 07:07:37


Is /usr/bin/lamboot your 6.5.9 installation?

The next thing to check is whether the a.out found on each node was compiled
with the same version of LAM/MPI. Do you have a networked filesystem? If so,
this is probably a moot point; if not, ensure that a.out matches across all
nodes.
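
As a rough sketch of that comparison, assuming md5sum is installed on every node, rsh works non-interactively, and the binary sits at the same absolute path everywhere (none of which is confirmed in this thread), something like the following could flag a node whose a.out differs. The function names are made up for illustration.

```shell
#!/bin/sh
# Sketch only. Assumptions not taken from the thread: md5sum exists on each
# node, rsh works without a password, and the binary path is identical
# on every node.

# Read "host checksum" pairs on stdin and print every host whose checksum
# differs from the one on the first line. Exit status is 1 on any mismatch.
report_mismatches() {
    awk 'NR == 1 { ref = $2 }
         $2 != ref { print $1; bad = 1 }
         END { exit bad + 0 }'
}

# Print one "host checksum" pair per host.
# $1 is the absolute path of the binary; the remaining arguments are hosts.
collect_checksums() {
    binary=$1
    shift
    for host in "$@"; do
        sum=$(rsh -n "$host" md5sum "$binary" | awk '{ print $1 }')
        echo "$host $sum"
    done
}
```

Usage would be along the lines of `collect_checksums /path/to/a.out node08 node19 node22 | report_mismatches`, where `/path/to/a.out` is a placeholder for wherever the executable actually lives. A host that cannot be reached gets an empty checksum and so shows up as a mismatch too.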

When you mpirun across multiple nodes (including the problematic node), do
you see a.out in the process table on node 22? Can you verify that LAM
thinks that it launched on node 22 by using "mpirun -v"? Can you lamexec
non-MPI applications across multiple nodes (including 22), such as "lamexec
N hostname"?
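
To make the "mpirun -v" check less error-prone by eye, here is a small sketch that reads saved "mpirun -v" output and lists the LAM node IDs that never reported a launched process. It assumes the 6.5.9 line format seen in this thread ("PID a.out running on nN") and one process per node; the function name and the log file name are hypothetical.

```shell
#!/bin/sh
# Sketch only. Assumes "mpirun -v" lines look like the ones quoted in this
# thread, e.g. "18090 a.out running on n0 (o)", and that one process is
# expected on each of the nodes n0 .. n(N-1).

# $1 is the number of nodes; "mpirun -v" output is read from stdin.
# Prints each node ID that has no "running on nN" line.
missing_nodes() {
    awk -v n="$1" '
        / running on n[0-9]+/ {
            # The node ID is the field right after the word "on".
            for (i = 1; i < NF; i++)
                if ($i == "on") seen[$(i + 1)] = 1
        }
        END {
            for (k = 0; k < n; k++) {
                key = "n" k
                if (!(key in seen)) print key
            }
        }'
}
```

For example, after capturing the output of the hanging three-node run in a file (say `run.log`) and interrupting mpirun, `missing_nodes 3 < run.log` would list the nodes that LAM never reported as launched.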

Is it possible for you to upgrade? The 7.x series of LAM gives much more
diagnostic information than the 6.x series.

On 8/2/06 12:37 AM, "Zubair Anwar" <zubair.anwar_at_[hidden]> wrote:

> Thanks for the reply. I get the following output from lamboot -V
>
> node08.local.net 29: lamboot -V
>
> LAM 6.5.9/MPI 2 C++/ROMIO - Indiana University
>
> Arch: i386-redhat-linux-gnu
> RPI: usysv
> node08.local.net 30: exit
> rlogin: connection closed.
> node22.local.net 28: lamboot -V
>
> LAM 6.5.9/MPI 2 C++/ROMIO - Indiana University
>
> Arch: i386-redhat-linux-gnu
> RPI: usysv
>
> When I do "rsh node22.local.net which lamboot" I get
> /usr/bin/lamboot
>
>
>
> On 8/1/06, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>>
>> The first thing that I would check for is a version mismatch of LAM between
>> your nodes. It looks like when you have interactive shells, you're using
>> 6.5.9. Double check that non-interactive shells also get version 6.5.9
>> (e.g., "rsh node22.local.net which lamboot").
>>
>>
>> On 7/30/06 12:40 AM, "Zubair Anwar" <zubair.anwar_at_[hidden]> wrote:
>>
>>> I am having a strange problem to which I could not find an answer on the
>>> list or the web.
>>>
>>> I just replaced a compute node in a cluster with a new machine. The cluster
>>> is behind a head node (that does not compute). Jobs are run by logging into
>>> one of the compute nodes, then changing directories to the executable
>>> directory on the head node, followed by lambooting a machinefile and then
>>> mpirun.
>>>
>>> The problem is that while I can do mpirun from the machine in question, any
>>> boot schema that contains other nodes hangs. I must mention that a boot
>>> schema with any combination of the other machines works fine. It is only the
>>> new node (node22) that gives problems. Here is what I get when I run code
>>> with a boot schema containing just node 22.
>>>
>>> node22.local.net 27: lamboot -v -d mac
>>>
>>> LAM 6.5.9/MPI 2 C++/ROMIO - Indiana University
>>>
>>> lamboot: boot schema file: mac
>>> lamboot: opening hostfile mac
>>> lamboot: found the following hosts:
>>> lamboot: n0 node22
>>> lamboot: resolved hosts:
>>> lamboot: n0 node22 --> 192.168.0.22
>>> lamboot: found 1 host node(s)
>>> lamboot: origin node is 0 (node22)
>>> Executing hboot on n0 (node22 - 2 CPUs)...
>>> lamboot: attempting to execute "hboot -t -c lam-conf.lam -d -v -I " -H
>>> 192.168.0.22 -P 33629 -n 0 -o 0 ""
>>> hboot: process schema = "/etc/lam/lam-conf.lam"
>>> hboot: found /usr/bin/lamd
>>> hboot: performing tkill
>>> hboot: tkill
>>> hboot: booting...
>>> hboot: fork /usr/bin/lamd
>>> hboot: attempting to execute
>>> [1] 10912 lamd -H 192.168.0.22 -P 33629 -n 0 -o 0 -d
>>> topology done
>>> lamboot completed successfully
>>>
>>> and when I do mpirun I get:
>>>
>>> node22.local.net 28: mpirun -np 4 a.out
>>> hello world from processor 3
>>> hello world from processor 0
>>> hello world from processor 1
>>> hello world from processor 2
>>>
>>> However, a boot schema with node08, node19 and node22 followed by mpirun does
>>> the following (node08 and node19 are an example here; other nodes are fine
>>> too; it is just node22 that causes problems).
>>>
>>> node08.local.net 31: lamboot -v -d machinefile
>>>
>>> LAM 6.5.9/MPI 2 C++/ROMIO - Indiana University
>>>
>>> lamboot: boot schema file: machinefile
>>> lamboot: opening hostfile machinefile
>>> lamboot: found the following hosts:
>>> lamboot: n0 node08
>>> lamboot: n1 node19
>>> lamboot: n2 node22
>>> lamboot: resolved hosts:
>>> lamboot: n0 node08 --> 192.168.0.8
>>> lamboot: n1 node19 --> 192.168.0.19
>>> lamboot: n2 node22 --> 192.168.0.22
>>> lamboot: found 3 host node(s)
>>> lamboot: origin node is 0 (node08)
>>> Executing hboot on n0 (node08 - 1 CPU)...
>>> lamboot: attempting to execute "hboot -t -c lam-conf.lam -d -v -I " -H
>>> 192.168.0.8 -P 33101 -n 0 -o 0 ""
>>> hboot: process schema = "/etc/lam/lam-conf.lam"
>>> hboot: found /usr/bin/lamd
>>> hboot: performing tkill
>>> hboot: tkill
>>> hboot: booting...
>>> hboot: fork /usr/bin/lamd
>>> hboot: attempting to execute
>>> [1] 18080 lamd -H 192.168.0.8 -P 33101 -n 0 -o 0 -d
>>> Executing hboot on n1 (node19 - 1 CPU)...
>>> lamboot: attempting to execute "rsh node19 -n echo $SHELL"
>>> lamboot: got remote shell /bin/tcsh
>>> lamboot: attempting to execute "rsh node19 -n hboot -t -c lam-conf.lam-d -v
>>> -s -I "-H 192.168.0.8 -P 33101 -n 1 -o 0 ""
>>> hboot: process schema = "/etc/lam/lam-conf.lam"
>>> hboot: found /usr/bin/lamd
>>> hboot: performing tkill
>>> hboot: tkill
>>> hboot: booting...
>>> hboot: fork /usr/bin/lamd
>>> [1] 15483 lamd -H 192.168.0.8 -P 33101 -n 1 -o 0 -d
>>> Executing hboot on n2 (node22 - 1 CPU)...
>>> lamboot: attempting to execute "rsh node22 -n echo $SHELL"
>>> lamboot: got remote shell /bin/tcsh
>>> lamboot: attempting to execute "rsh node22 -n hboot -t -c lam-conf.lam-d -v
>>> -s -I "-H 192.168.0.8 -P 33101 -n 2 -o 0 ""
>>> hboot: process schema = "/etc/lam/lam-conf.lam"
>>> hboot: found /usr/bin/lamd
>>> hboot: performing tkill
>>> hboot: tkill
>>> hboot: booting...
>>> hboot: fork /usr/bin/lamd
>>> [1] 10965 lamd -H 192.168.0.8 -P 33101 -n 2 -o 0 -d
>>> topology done
>>> lamboot completed successfully
>>> node08.local.net 33: mpirun -v -np 2 a.out
>>> 18088 a.out running on n0 (o)
>>> 15485 a.out running on n1
>>> hello world from processor 0
>>> hello world from processor 1
>>> node08.local.net 34: mpirun -v -np 3 a.out
>>> 18090 a.out running on n0 (o)
>>> 15486 a.out running on n1
>>>
>>> Suspended
>>> node08.local.net 35:
>>>
>>> I had to do Ctrl+Z to abort. I can rsh back and forth and also do tping
>>> before the run. Any ideas what's going wrong?
>>> _______________________________________________
>>> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>>
>>
>> --
>> Jeff Squyres
>> Server Virtualization Business Unit
>> Cisco Systems

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems