LAM/MPI General User's Mailing List Archives

From: Dirk Eddelbuettel (edd_at_[hidden])
Date: 2007-03-10 18:39:55


Summary: * Running lamboot in a session started with slurm's
            'srun -N 2 -A' results in an immediate segmentation
            violation on a cluster with non-standardised hostnames.

System: * Current Ubuntu 06.10 (aka "edgy") with lam 7.1.1a
              but also upgraded to lam 7.1.3 with Moe Jette's patch
              and also on the newest 7.1.4b2
          * slurm is 1.2.1
          * rest is vanilla ubuntu with a few packages backported from Debian

Context: * lam by itself works just fine, from R, with C/C++ apps, ...
          * slurm by itself is fine
          * lam inside slurm is fine IF AND ONLY IF I use hosts named
            like foo100 and foo104, i.e. numeric suffixes on a common prefix

Problem: * as soon as I add a third host with a different name like dev-foo1
            this stops working and lamboot segfaults

I have exchanged numerous mails with slurm's Moe Jette on this for two days,
and added copious debugging output in the lam file
        share/ssi/boot/slurm/src/ssi_boot_slurm.c
because I first suspected one of the free() calls in there to be the culprit.
I no longer think so, but I am at a loss as to why/where this fails.

My best guess is that it has to do with the 'yes, it works on foo100,foo104'
and 'no, it fails on foo100,foo104,dev-foo1' difference.
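
To make the difference between the two cases concrete, here is a minimal
sketch of how a compressed node list could be expanded into hostnames. It
assumes (this is not confirmed anywhere above) that slurm hands the allocation
over in its bracketed form, e.g. 'foo[100,104]' versus 'foo[100,104],dev-foo1'.
The function names are hypothetical and this is not the code in
ssi_boot_slurm.c, only an illustration of the two input shapes.

```python
import re


def split_top_level(nodelist):
    """Split on commas that are not inside square brackets."""
    parts, depth, current = [], 0, ""
    for ch in nodelist:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append(current)
            current = ""
        else:
            current += ch
    if current:
        parts.append(current)
    return parts


def expand_hostlist(nodelist):
    """Expand an (assumed) slurm-style compressed node list into hostnames."""
    hosts = []
    for part in split_top_level(nodelist):
        m = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", part)
        if m is None:
            # Plain hostname without a bracket expression, e.g. dev-foo1.
            hosts.append(part)
            continue
        prefix, body, suffix = m.groups()
        for item in body.split(","):
            if "-" in item:
                # Numeric range such as 100-104; keep any zero padding.
                lo, hi = item.split("-")
                width = len(lo)
                for n in range(int(lo), int(hi) + 1):
                    hosts.append(f"{prefix}{n:0{width}d}{suffix}")
            else:
                hosts.append(f"{prefix}{item}{suffix}")
    return hosts


if __name__ == "__main__":
    print(expand_hostlist("foo[100,104]"))
    print(expand_hostlist("foo[100,104],dev-foo1"))
```

The working case is a single bracket expression, while the failing case mixes
a bracket expression with a plain name that itself contains a '-', so a parser
has to handle both forms in one string.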

Suggestions would be very welcome. I'm at a loss as to how to fix this.

Thanks, Dirk

-- 
Hell, there are no rules here - we're trying to accomplish something. 
                                                  -- Thomas A. Edison