It turns out that the problem was with our Torque installation.
lam-7.1.2b26 can now boot successfully using the "tm" boot
module under Torque 1.2.0p6.
The simplest test of the Torque/PBS "tm" system, without ever
invoking any MPI daemons, is to run the following command from
within a PBS batch job:
    pbsdsh hostname
This should simply list the node names allocated to your PBS job,
using the "tm" interface to connect to all nodes. If pbsdsh fails,
so will any LAM-MPI command that uses the "tm" interface.
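
For anyone who wants to run this check as a batch job of its own, here is a
minimal sketch of a job script. The function name check_tm, the job name and
the node count are my own illustrative choices, not anything Torque provides;
the optional argument exists only so the function can be exercised outside a
batch job. It assumes pbsdsh is in the PATH on the node where the job starts
and that PBS_JOBID is set inside the job, as Torque normally does.

```shell
#!/bin/sh
#PBS -N tm-check
#PBS -l nodes=2

# check_tm (hypothetical helper): run "<cmd> hostname" and report the result.
# <cmd> defaults to pbsdsh; passing another command is only for dry runs.
check_tm() {
    tm_cmd="${1:-pbsdsh}"
    if ! command -v "$tm_cmd" >/dev/null 2>&1; then
        echo "tm check: $tm_cmd not found in PATH" >&2
        return 2
    fi
    if "$tm_cmd" hostname; then
        echo "tm check: OK"
    else
        echo "tm check: FAILED" >&2
        return 1
    fi
}

# Only run the real check when we are inside a PBS batch job.
if [ -n "${PBS_JOBID:-}" ]; then
    check_tm
fi
```

Submit it with qsub and look at the job's output files: a list of node names
followed by "tm check: OK" means the "tm" interface itself is working, so any
remaining trouble is more likely on the LAM-MPI side.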
The discussion of the Torque problem can be read here:
http://www.supercluster.org/pipermail/torqueusers/2005-September/thread.html
In short, pbs_mom had the wrong path to the pbs_demux executable
compiled in, an error introduced by the RPM build.
Ole Holm Nielsen wrote:
> When we upgraded our test cluster from the Torque batch system
> version torque-1.2.0p4 to torque-1.2.0p6, parallel jobs using
> LAM-MPI beta version lam-7.1.2b22 would no longer boot the LAM
> daemons. I downloaded and rebuilt lam-7.1.2b26 with the new
> Torque libraries, but that didn't help any.
>
> The problem with Torque is specific to LAM-MPI (serial jobs run
> perfectly well). When LAM-MPI selects a boot schema in a Torque
> batch job, it defaults to the Torque/OpenPBS "tm" schema.
> Unfortunately, this tm schema is unable to boot correctly (see
> output below). If I force LAM-MPI to use the "rsh" boot schema
> (export LAM_MPI_SSI_boot_tm_priority=1), everything with LAM-MPI
> works just fine! It is of course possible that LAM-MPI used
> to default to the "rsh" boot schema with torque-1.2.0p4, but we
> can't verify that any more.
>
> Question: Is Torque's LAM-MPI "tm" boot schema supposed to be
> working correctly with Torque? I'd love to get it to
> work because of the performance improvements promised in the
> LAM-MPI documentation.
--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark