LAM/MPI General User's Mailing List Archives

From: Josh Lehan (jlehan_at_[hidden])
Date: 2005-04-26 16:51:16


Robert Becker wrote:
> Attached is the verbose output from attempting to start an ABAQUS job.
> There are two things I see that could be the problem. The first is
> ABAQUS uses the master hostname as one of the nodes automatically. It
> gets the second one from the config file.
>
> Other things I notice is during the boot up it is trying to reference a
> node -3. There is no node -3.
>
> Any info or help here would be highly appreciated.

Hello!

Yes, there is definitely a problem: it has resolved the boot schema into
the two node numbers -1 and -3, which is not correct.

What are the contents of the boot schema? I recommend using BProc-style
host names (a dot followed by the node number), like this example:

.-1 cpu=2
.0 cpu=1
.1 cpu=1

> n0<30469> ssi:boot:base: found boot schema: /home/becker/abatest/dmpT_CommTest.app
> n0<30469> ssi:boot:bproc: found the following hosts:
> n0<30469> ssi:boot:bproc: n0 -1 (cpu=2)
> n0<30469> ssi:boot:bproc: n1 edms-abaqus (cpu=1)
> n0<30469> ssi:boot:bproc: resolved hosts:
> n0<30469> ssi:boot:bproc: n0 -1 --> 192.168.0.1 (origin)
> n0<30469> ssi:boot:bproc: n1 edms-abaqus --> 192.168.0.101
> n0<30469> ssi:boot:bproc: starting RTE procs
> n0<30469> ssi:boot:bproc:vector: starting
> n0<30469> ssi:boot:bproc:vector: launching on nodes -1,-3
> n0<30469> ssi:boot:bproc:vector: starting wipe on -1,-3
> n0<30469> ssi:boot:bproc: execmoving tkill -d to -1,-3
> n0<30469> ssi:boot:bproc:vexecmove: index 0, node -1, child about to exec /usr/local/lam-7.0.4//bin/tkill
> n0<30469> ssi:boot:bproc:vexecmove: index 0, node -1, parent did fork of child as pid 30470
> n0<30469> ssi:boot:bproc:vexecmove: index 1, node -3, parent did fork of child as pid 30471
> n0<30469> ssi:boot:bproc:vexecmove: index 1, node -3, child about to exec /usr/local/lam-7.0.4//bin/tkill

LAM reads and parses the hostfile correctly, but something goes wrong
when it looks up the BProc node number for LAM node n1 (edms-abaqus):
the lookup comes back as -3.

If you're unable to get ABAQUS to let you change the hostfile, then you
might want to try taking a look at the local_bproc_resolve() function in
ssi_boot_bproc.c (in the share/ssi/boot/bproc/src directory). That
function's responsible for returning a correct node number when given a
hostname string. I wonder if there's a bug in that function?
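
For comparison while reading that function, here is a minimal sketch of
what parsing a BProc-style name (a dot followed by a node number) could
look like. This is not the LAM source: parse_dot_node() is a
hypothetical name, and the real local_bproc_resolve() also has to deal
with ordinary hostnames such as edms-abaqus, which this sketch simply
rejects.

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper, not part of LAM: parse a BProc-style name such
 * as ".-1" or ".3" into a node number.
 *
 * Returns 0 and stores the number in *node_out on success.
 * Returns -1 if the string is not a dot followed by a complete decimal
 * integer (for example a plain hostname, or trailing junk).
 */
int parse_dot_node(const char *name, int *node_out)
{
    char *end;
    long value;

    if (name == NULL || node_out == NULL || name[0] != '.')
        return -1;

    /* Accept only a digit or a minus sign right after the dot, so that
     * strtol() cannot skip leading whitespace or accept a '+'. */
    if (!isdigit((unsigned char) name[1]) && name[1] != '-')
        return -1;

    errno = 0;
    value = strtol(name + 1, &end, 10);

    /* Reject overflow, "no digits at all", and trailing characters. */
    if (errno != 0 || end == name + 1 || *end != '\0')
        return -1;
    if (value < INT_MIN || value > INT_MAX)
        return -1;

    *node_out = (int) value;
    return 0;
}
```

A quick check in a debugger or a small test program is to feed it the
names from your boot schema: ".-1" should give -1, and "edms-abaqus"
should be reported as unparseable instead of quietly turning into some
node number.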

Also, can you upgrade to the newest beta version of LAM?

I submitted a BProc patch for 7.1.2, and judging by the strings in your
output, you've already applied it to your version of LAM. The patch may
be incompatible with an older 7.0 release of LAM: when I wrote it, I
tested only against 7.1.1 and the 7.1.2 beta.

Josh Lehan
Scyld