LAM/MPI General User's Mailing List Archives

From: Robert LeBlanc (leblanc_at_[hidden])
Date: 2005-02-04 10:55:58


I was having the exact same problem with our dual-Opteron Debian cluster. I
have been chugging away at it for a week now and was about to ask the
list. I am glad to know that I was not the only one.

Robert LeBlanc
BioAg Computing
Brigham Young University

> -----Original Message-----
> From: lam-bounces_at_[hidden] [mailto:lam-bounces_at_[hidden]] On Behalf
> Of Bryan O'Sullivan
> Sent: Friday, February 04, 2005 12:01 AM
> To: lam_at_[hidden]
> Subject: LAM: 7.1.1, SVN nightly lamboot segfaults with SLURM when run on
> one node
>
> I'm having trouble running lamboot with slurm. This is on a cluster of
> dual-CPU Opteron systems running Fedora Core 3. LAM and SLURM were both
> built on Fedora Core 1. The version of SLURM I'm using is 0.3.10.
>
> $ srun -A -n 2 -p tools
> $ printenv | grep SLURM
> SLURM_NODELIST=eng-24
> SLURM_NNODES=1
> SLURM_JOBID=155
> SLURM_TASKS_PER_NODE=2
> SLURM_NPROCS=2
> SLURM_DISTRIBUTION=block
> $ lamboot
>
> LAM 7.2b1svn02032005/MPI 2 C++/ROMIO - Indiana University
>
> Segmentation fault
>
> As you can see, this is happening with LAM's nightly SVN snapshot. I
> tried the nightly because 7.1.1 had the same problem.
>
> With LAM compiled with debugging information, I get the following backtrace
> from gdb:
>
> (gdb) bt
> #0 0x000000304e46f6b0 in strcmp () from /lib64/tls/libc.so.6
> #1 0x0000002a95587c32 in lam_ssi_boot_slurm_allocate_nodes (nodes_arg=0x0,
>     nnodes_arg=0x7fbffff2ec, origin_arg=0x7fbffff2f0)
>     at ../../../../../../share/ssi/boot/slurm/src/ssi_boot_slurm.c:347
> #2 0x0000000000401973 in main (argc=1, argv=0x0)
>     at ../../../tools/lamboot/lamboot.c:247
>
> Looking at frame 1, I see that the second parameter (short_hostname) to
> strcmp is valid, but the host_names array contains garbage:
>
> (gdb) p short_hostname
> $4 = "eng-25", '\0' <repeats 4450 times>, ...snipped...
> (gdb) p host_names
> $5 = (char **) 0x506c10
> (gdb) p host_names[0]
> $6 = 0x34322d676e65 <Address 0x34322d676e65 out of bounds>
>
> The host_names variable is initialized at ssi_boot_slurm.c:330, from
> slurm_nodelist:
>
> (gdb) p *slurm_nodelist
> $10 = {la_element_size = 8, la_num_allocated = 10, la_num_used = 1,
>     la_array = 0x506c10 "eng-24", la_comp = 0}
>
> This looks valid to me, but the trouble appears to be that la_array has
> completely different contents depending on whether there's one element
> in the array or more than one.
>
> With one element in the array, la_array points to plain data. In
> other words, it's an array of char. However, with two elements in the
> array (say, if I run with "srun -N 2" instead of "srun -n 2"),
> la_array is an array of char*. The code using the data doesn't take
> this into account. I assume the bug is in the way the lam_array_t is
> populated, rather than in the code that consumes it.
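>
> As a purely illustrative sketch (this is not the actual LAM source; the
> struct below only mirrors the field names visible in the gdb output, and it
> assumes, as that output suggests, that the single-element case stores the
> hostname inline), the following C program shows how reading inline string
> storage as an array of char* produces exactly the garbage pointer seen
> above: 0x34322d676e65 is just the bytes "eng-24" read little-endian as a
> pointer value.
>
> /* Hypothetical demonstration -- not the LAM code itself.
>  * Assumes a little-endian 64-bit host (e.g. x86-64/Opteron). */
> #include <stdio.h>
>
> struct fake_array {        /* mirrors the lam_array_t fields gdb showed */
>     int   la_element_size;
>     int   la_num_allocated;
>     int   la_num_used;
>     char *la_array;        /* here: one hostname stored as plain chars */
> };
>
> int main(void)
> {
>     static char inline_storage[8] = "eng-24";
>     struct fake_array a = { 8, 10, 1, inline_storage };
>
>     /* Consumer that assumes la_array is really a char** -- the same
>      * assumption the host_names code appears to make. */
>     char **host_names = (char **) a.la_array;
>
>     /* Prints 0x34322d676e65: the string bytes themselves, misread as a
>      * pointer. Passing that to strcmp() would segfault, matching the
>      * backtrace. */
>     printf("host_names[0] = %p\n", (void *) host_names[0]);
>     return 0;
> }
>
> With two or more elements, la_array presumably really does hold an array
> of char*, which would explain why the multi-node case works.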
>
> I believe I've shown a simple reproduction recipe (try to run
> lamboot when you've allocated just one node) and pinpointed the problem
> (inconsistent handling of lam_array_t), so I'll call myself done.
>
> Good night,
>
> <b
>
>
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>
>
>