This sounds like a pretty reasonable request. Based on prior work with
Robin (thanks for access to your cluster for testing! :-), I am intimately
familiar with this problem.
We actually *don't* use gethostbyname with the tm module -- PBS gives us
the list of hostnames from a tm API call (tm_nodeinfo). It's going to
give us only the names that PBS knows about.
Keep in mind that we want to *boot* over these names/addresses. It's the
MPI SSI modules that we want to fool into using different hostnames. In
particular, the tcp RPI module gets its list of IP addresses from the
internal tables in the lamd (which originally came from lamboot -- so
they're the tm-supplied addresses). This may matter for other MPI SSI
modules that use IP addresses for communication -- but none do yet (I'm
guessing that it's probably not worth updating the lamd RPI module).
So LAM would have to translate that somehow. I'm open to suggestions on
how to do that...
An off-the-top-of-my-head idea is that if the file
$sysconfdir/lam-host-map.txt exists, the tcp RPI module will read it at
run-time and convert the lamd-supplied addresses/names as per the file.
The file could be a very simple format, perhaps something like:
-----
# one entry per line
pbs1.example.com mpi=internal-fast1.example.com
pbs2.example.com mpi=internal-fast2.example.com
-----
The "mpi=" bit is there for two reasons:
- indicate that MPI communications should use that address (as opposed to
LAM out-of-band communications and whatnot)
- I can re-use the boot schema file parsing code and simply extract the
"mpi" key on each entry :-)
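
To make the proposed format concrete, here is a rough sketch of the lookup in Python. None of this is LAM code: the function names (parse_host_map, translate) are made up for illustration, the real thing would be C inside the tcp RPI module, and the behavior for hosts not listed in the file (fall back to the lamd-supplied name) is only my assumption.

```python
# Hypothetical sketch only: not LAM code, names are invented for illustration.

def parse_host_map(text):
    """Return {boot_name: mpi_name} from 'host key=value ...' lines."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        host, *attrs = line.split()
        for attr in attrs:
            key, sep, value = attr.partition("=")
            # Only the "mpi" key matters here; other keys are ignored.
            if sep and key == "mpi" and value:
                mapping[host] = value
    return mapping


def translate(name, mapping):
    """Use the MPI-specific name if one is listed, else keep the original.

    Falling back to the original name is an assumption, not settled design.
    """
    return mapping.get(name, name)


EXAMPLE = """\
# one entry per line
pbs1.example.com mpi=internal-fast1.example.com
pbs2.example.com mpi=internal-fast2.example.com
"""

if __name__ == "__main__":
    host_map = parse_host_map(EXAMPLE)
    for name in ("pbs1.example.com", "pbs3.example.com"):
        print(name, "->", translate(name, host_map))
```

Hosts without an "mpi" entry pass through unchanged in this sketch, so a partial map file would still work.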
How does that sound? Would that satisfy your requirements?
(keep in mind that this would appear in 7.1 at the earliest)
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/
On Wed, 17 Sep 2003, Robin Humble wrote:
> On Wed, Sep 17, 2003 at 04:34:04PM +0200, Jean-Marie Teuler wrote:
> >Is it possible to cheat the tm module so that it enlists instead
> >node1-1000... node4-1000?
>
> I second this request.
> [snipped]