Summary: * Running lamboot in a session started with slurm's
'srun -N 2 -A' results in an immediate segmentation
violation on a cluster with non-standardised hostnames.
System: * Current Ubuntu 6.10 (aka "edgy") with lam 7.1.1a
but also upgraded to lam 7.1.3 with Moe Jette's patch
and also on the newest 7.1.4b2
* slurm is 1.2.1
* rest is vanilla ubuntu with a few packages backported from Debian
Context: * lam by itself works just fine, from R, with C/C++ apps, ...
* slurm by itself is fine
* lam inside slurm is fine IF AND ONLY IF I use hosts named
like foo100 and foo104, i.e. numeric suffixes on a common prefix
Problem: * as soon as I add a third host with a different name like dev-foo1
this stops working and lamboot segfaults
I have exchanged numerous mails with slurm's Moe Jette on this for two days,
and added copious debugging output in the lam file
share/ssi/boot/slurm/src/ssi_boot_slurm.c
because I first suspected one of the free() calls in there to be the culprit.
I no longer think so, but I am at a loss as to why/where this fails.
My best guess is that it has to do with the hostnames: yes, it works with
'foo100,foo104', and no, it fails with 'foo100,foo104,dev-foo1'.
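
To make the difference between the two cases concrete, here is a minimal
sketch (in Python, NOT the code in ssi_boot_slurm.c) of how a SLURM-style
compressed node list expands into hostnames. It assumes slurm hands the
allocation over in bracketed form, e.g. 'foo[100,104]' for the working case
and something like 'foo[100,104],dev-foo1' for the failing one; the exact
strings and their ordering are an assumption, and expand_nodelist is a
hypothetical helper written only for this illustration.

```python
import re

def expand_nodelist(nodelist):
    """Expand a SLURM-style compressed node list into individual hostnames.

    Handles only the simple forms 'name', 'prefix[a,b]' and 'prefix[a-b]'.
    Illustration only; this is not SLURM's or LAM's actual parser.
    """
    hosts = []
    # Pick out top-level items, keeping a bracketed group with its prefix.
    for item in re.findall(r'[^,\[]+(?:\[[^\]]*\])?', nodelist):
        match = re.fullmatch(r'([^\[]+)\[([^\]]*)\]', item)
        if match is None:
            hosts.append(item)        # plain hostname such as dev-foo1
            continue
        prefix, ranges = match.groups()
        for part in ranges.split(','):
            lo, _, hi = part.partition('-')
            hi = hi or lo             # a single number is a range of one
            width = len(lo)           # preserve zero padding, e.g. 08
            for n in range(int(lo), int(hi) + 1):
                hosts.append(f"{prefix}{n:0{width}d}")
    return hosts

# Assumed shape of the working case: one common prefix, numeric suffixes.
print(expand_nodelist("foo[100,104]"))
# Assumed shape of the failing case: a second, unbracketed item follows.
print(expand_nodelist("foo[100,104],dev-foo1"))
```

The point of the sketch is only that the second list has two top-level
items of different shape, so a parser that expects a single prefix plus
bracket group may take a different code path there.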
Suggestions would be very welcome. I'm at a loss as to how to fix this.
Thanks, Dirk
--
Hell, there are no rules here - we're trying to accomplish something.
-- Thomas A. Edison