LAM/MPI General User's Mailing List Archives

From: Bryan O'Sullivan (bos_at_[hidden])
Date: 2005-02-04 01:53:18


I'm having trouble running lamboot with slurm. This is on a cluster of
dual-CPU Opteron systems running Fedora Core 3. LAM and SLURM were both
built on Fedora Core 1. The version of SLURM I'm using is 0.3.10.

        $ srun -A -n 2 -p tools
        $ printenv | grep SLURM
        SLURM_NODELIST=eng-24
        SLURM_NNODES=1
        SLURM_JOBID=155
        SLURM_TASKS_PER_NODE=2
        SLURM_NPROCS=2
        SLURM_DISTRIBUTION=block
        $ lamboot
        
        LAM 7.2b1svn02032005/MPI 2 C++/ROMIO - Indiana University
        
        Segmentation fault

As you can see, this is happening with LAM's nightly SVN snapshot. I
tried the nightly because 7.1.1 had the same problem.

Compiled with debug information, I get the following backtrace from gdb:

        (gdb) bt
        #0 0x000000304e46f6b0 in strcmp () from /lib64/tls/libc.so.6
        #1 0x0000002a95587c32 in lam_ssi_boot_slurm_allocate_nodes (nodes_arg=0x0,
            nnodes_arg=0x7fbffff2ec, origin_arg=0x7fbffff2f0)
            at ../../../../../../share/ssi/boot/slurm/src/ssi_boot_slurm.c:347
        #2 0x0000000000401973 in main (argc=1, argv=0x0)
            at ../../../tools/lamboot/lamboot.c:247

Looking at frame 1, I see that the second parameter (short_hostname) to
strcmp is valid, but the host_names array contains garbage:

        (gdb) p short_hostname
        $4 = "eng-25", '\0' <repeats 4450 times>, ...snipped...
        (gdb) p host_names
        $5 = (char **) 0x506c10
        (gdb) p host_names[0]
        $6 = 0x34322d676e65 <Address 0x34322d676e65 out of bounds>
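
The out-of-bounds "address" is telling in itself. Read lowest byte first,
which is how x86-64 lays a word out in memory, 0x34322d676e65 is the ASCII
text "eng-24". The slot therefore holds the characters of the hostname
rather than a pointer to them. A minimal C sketch of that decoding follows;
word_to_bytes_le is a made-up helper name for illustration and is not part
of LAM:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from LAM): copy the eight bytes of a 64-bit
 * value into out, lowest byte first, and NUL-terminate the result.
 * Lowest-byte-first is the in-memory order on little-endian machines
 * such as x86-64. out must have room for 9 bytes. */
void word_to_bytes_le(uint64_t word, char out[9])
{
    assert(out != NULL);
    for (size_t i = 0; i < 8; i++) {
        out[i] = (char)((word >> (8 * i)) & 0xff);
    }
    out[8] = '\0';
}
```

Calling word_to_bytes_le(0x34322d676e65ULL, buf) leaves the bytes
'e' 'n' 'g' '-' '2' '4' in buf, followed by NULs.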

The host_names variable is initialized at ssi_boot_slurm.c:330, from
slurm_nodelist:

        (gdb) p *slurm_nodelist
        $10 = {la_element_size = 8, la_num_allocated = 10, la_num_used = 1,
          la_array = 0x506c10 "eng-24", la_comp = 0}

This looks valid to me, but the trouble appears to be that la_array has
completely different contents depending on whether there's one element
in the array or more than one.

With one element in the array, la_array points to plain character data;
in other words, it's an array of char. However, with two elements in the
array (say I run with "srun -N 2" instead of "srun -n 2"), la_array is
an array of char*. The code that reads the data doesn't take this into
account. I assume it's a bug in the way lam_array_t is used, rather than
in the user of the code.
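
One way an array of pointer-sized elements can end up holding raw string
bytes is sketched below. This is an assumption about the mechanism and has
not been checked against LAM's source: demo_array and its functions are
hypothetical stand-ins, not the lam_array_t API. The sketch assumes a
generic append that copies element_size bytes from whatever address it is
given. Passing the address of a char* stores the pointer; passing the
string itself stores the string's first bytes in the slot.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_CAPACITY 10

/* Hypothetical generic array with fixed-size elements (not lam_array_t). */
struct demo_array {
    size_t element_size;
    size_t num_used;
    unsigned char data[DEMO_CAPACITY * sizeof(char *)];
};

/* Set up an empty array whose elements are the size of a char*. */
void demo_init(struct demo_array *a)
{
    assert(a != NULL);
    memset(a, 0, sizeof *a);
    a->element_size = sizeof(char *);
}

/* Copy element_size bytes starting at elem into the next free slot.
 * Returns 0 on success, -1 if the array is full. */
int demo_append(struct demo_array *a, const void *elem)
{
    assert(a != NULL && elem != NULL);
    if (a->num_used == DEMO_CAPACITY) {
        return -1;
    }
    memcpy(a->data + a->num_used * a->element_size, elem, a->element_size);
    a->num_used++;
    return 0;
}

/* Read slot i back as a char*, the way a consumer indexing
 * host_names[i] would interpret it. */
const char *demo_get_as_string(const struct demo_array *a, size_t i)
{
    const char *p;
    assert(a != NULL && i < a->num_used);
    memcpy(&p, a->data + i * a->element_size, sizeof p);
    return p;
}
```

Given char name[16] = "eng-24" and const char *p = name, the call
demo_append(&arr, &p) stores a usable pointer, whereas
demo_append(&arr, name) leaves the characters "eng-24" in the slot, which a
consumer would then misread as an address.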

I believe I've shown a simple reproduction recipe (try to run
lamboot when you've allocated just one node) and pinpointed the problem
(inconsistent handling of lam_array_t), so I'll call myself done.

Good night,

        <b