I'm having trouble running lamboot with slurm. This is on a cluster of
dual-CPU Opteron systems running Fedora Core 3. LAM and SLURM were both
built on Fedora Core 1. The version of SLURM I'm using is 0.3.10.

$ srun -A -n 2 -p tools
$ printenv | grep SLURM
SLURM_NODELIST=eng-24
SLURM_NNODES=1
SLURM_JOBID=155
SLURM_TASKS_PER_NODE=2
SLURM_NPROCS=2
SLURM_DISTRIBUTION=block
$ lamboot
LAM 7.2b1svn02032005/MPI 2 C++/ROMIO - Indiana University
Segmentation fault

As you can see, this is happening with LAM's nightly SVN snapshot. I
tried the nightly because 7.1.1 had the same problem.
With LAM compiled with debug information, gdb gives the following
backtrace:

(gdb) bt
#0 0x000000304e46f6b0 in strcmp () from /lib64/tls/libc.so.6
#1 0x0000002a95587c32 in lam_ssi_boot_slurm_allocate_nodes (nodes_arg=0x0,
    nnodes_arg=0x7fbffff2ec, origin_arg=0x7fbffff2f0)
    at ../../../../../../share/ssi/boot/slurm/src/ssi_boot_slurm.c:347
#2 0x0000000000401973 in main (argc=1, argv=0x0)
    at ../../../tools/lamboot/lamboot.c:247

Looking at frame 1, I see that the second parameter (short_hostname) to
strcmp is valid, but the host_names array contains garbage:

(gdb) p short_hostname
$4 = "eng-25", '\0' <repeats 4450 times>, ...snipped...
(gdb) p host_names
$5 = (char **) 0x506c10
(gdb) p host_names[0]
$6 = 0x34322d676e65 <Address 0x34322d676e65 out of bounds>

The host_names variable is initialized at ssi_boot_slurm.c:330, from
slurm_nodelist:

(gdb) p *slurm_nodelist
$10 = {la_element_size = 8, la_num_allocated = 10, la_num_used = 1,
    la_array = 0x506c10 "eng-24", la_comp = 0}

This looks valid to me, but the trouble appears to be that la_array has
completely different contents depending on whether there's one element
in the array or more than one.
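
The out-of-bounds value gdb printed for host_names[0] is itself a clue.
Read as little-endian bytes, 0x34322d676e65 spells the node name, which
suggests the first 8 bytes of the string were interpreted as a pointer.
The small C sketch below does that decoding; pointer_bits_to_text is a
made-up helper for illustration and is not part of LAM.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of LAM): unpack a 64-bit value into the
 * bytes a little-endian machine would hold in memory, lowest byte first,
 * and NUL-terminate the result so it can be printed as text.
 */
static void pointer_bits_to_text(uint64_t bits, char out[9])
{
    assert(out != NULL);
    for (int i = 0; i < 8; i++) {
        out[i] = (char)((bits >> (8 * i)) & 0xff);
    }
    out[8] = '\0';
}
```

Passing 0x34322d676e65 to this helper yields the bytes 65 6e 67 2d 32 34
followed by two zero bytes, that is, the text "eng-24", the same string
gdb shows for la_array.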
With one element in the array, la_array points to plain data here. In
other words, it's an array of char. However, with two elements in the
array (let's say I run with "srun -N 2" instead of "srun -n 2"),
la_array is an array of char*. The code using the data doesn't take
this into account. I assume it's a bug in the way lam_array_t is used,
rather than in the user of the code.
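
To illustrate the kind of mismatch I mean, here is a minimal sketch of a
by-value array. It is a toy stand-in, not LAM's lam_array_t: the toy_*
names are made up, and only the field layout is modelled on the gdb
output above. With an element size of sizeof(char *), appending the
string buffer itself copies 8 bytes of text into the slot, while
appending the address of a char * stores a real pointer. Whether this is
what actually happens inside the SLURM boot module is an assumption on
my part.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy by-value array; hypothetical, not LAM's implementation. */
typedef struct {
    size_t element_size;   /* bytes copied per element */
    size_t num_allocated;  /* slots available */
    size_t num_used;       /* slots filled */
    char *array;           /* raw element storage */
} toy_array_t;

static toy_array_t *toy_arr_init(size_t element_size)
{
    toy_array_t *a = malloc(sizeof(*a));
    if (a == NULL)
        return NULL;
    a->element_size = element_size;
    a->num_allocated = 10;
    a->num_used = 0;
    a->array = calloc(a->num_allocated, element_size);
    if (a->array == NULL) {
        free(a);
        return NULL;
    }
    return a;
}

/* Copy element_size bytes starting at 'element' into the next slot. */
static int toy_arr_append(toy_array_t *a, const void *element)
{
    assert(a != NULL && element != NULL);
    if (a->num_used == a->num_allocated) {
        size_t new_count = a->num_allocated * 2;
        char *grown = realloc(a->array, new_count * a->element_size);
        if (grown == NULL)
            return -1;
        a->array = grown;
        a->num_allocated = new_count;
    }
    memcpy(a->array + a->num_used * a->element_size, element,
           a->element_size);
    a->num_used++;
    return 0;
}

/* Return 1 if slot 0 holds the address 'name', 0 otherwise. */
static int slot_holds_pointer(const toy_array_t *a, const char *name)
{
    return memcmp(a->array, &name, sizeof(name)) == 0;
}

static void toy_arr_free(toy_array_t *a)
{
    free(a->array);
    free(a);
}
```

Given char name[16] = "eng-24" and char *p = name, calling
toy_arr_append(arr, name) leaves the text "eng-24" in slot 0, so reading
the slot back as a char * gives a garbage address. Calling
toy_arr_append(arr, &p) stores the pointer, which is what code treating
the storage as char ** expects.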
I believe I've shown a simple reproduction recipe (try to run
lamboot when you've allocated just one node) and pinpointed the problem
(inconsistent handling of lam_array_t), so I'll call myself done.
Good night,