Michael Madore wrote:
> Jeff Squyres wrote:
>
>>
>> The [] indicate that the process is running elsewhere, and I see from
>> your bpsh output that it looks like it is actually running on each
>> node. So that looks good so far.
>>
>> The question, then, is why it doesn't finish. Can you attach to the
>> master with a debugger and see where it stopped?
>
> Here is the output from gdb. I wasn't exactly sure what you needed,
> so I just did a backtrace:
>
> (gdb) bt
> #0 0x420cdb44 in read () from /lib/i686/libc.so.6
> #1 0x4003eb44 in __JCR_LIST__ () from /lib/i686/libpthread.so.0
> #2 0x08068d40 in sread ()
> #3 0x080697dc in lam_ssi_rpi_tcp_low_fastrecv ()
> #4 0x08067ae7 in lam_ssi_rpi_tcp_fastrecv ()
> #5 0x08058c5a in MPI_Recv ()
> #6 0x0804a21b in harvest ()
> #7 0x0804a0fb in main ()
> #8 0x420158d4 in __libc_start_main () from /lib/i686/libc.so.6

As a further data point, I'm seeing similar behavior with the cpi example:

[mmadore_at_asl156 mmadore]$ mpirun n0-2 cpi
Process 0 of 3 on master
2 points: pi is approximately 3.1623529411764704, error = 0.0207602875866773
wall clock time = 0.002841
3 points: pi is approximately 3.1508492098656031, error = 0.0092565562758100
wall clock time = 0.000054
Process 2 of 3 on 1
Process 1 of 3 on 0

[mmadore_at_asl156 mmadore]$ mpirun C cpi
Process 0 of 5 on master
2 points: pi is approximately 3.1623529411764704, error = 0.0207602875866773
wall clock time = 0.001922
3 points: pi is approximately 3.1508492098656031, error = 0.0092565562758100
wall clock time = 0.000063
Process 4 of 5 on 3
Process 1 of 5 on 0
Process 3 of 5 on 2
Process 2 of 5 on 1

At this point the program hangs and makes no further progress. The output
from gdb looks similar to the Mandelbrot example:

(gdb) bt
#0 0x420cdb44 in read () from /lib/i686/libc.so.6
#1 0x40060b44 in __JCR_LIST__ () from /lib/i686/libpthread.so.0
#2 0x080687b0 in sread ()
#3 0x0806924c in lam_ssi_rpi_tcp_low_fastrecv ()
#4 0x08067557 in lam_ssi_rpi_tcp_fastrecv ()
#5 0x0807483a in PMPI_Recv ()
#6 0x080611cf in lam_ssi_coll_lam_basic_reduce_lin ()
#7 0x08058c1c in MPI_Reduce ()
#8 0x08049ea0 in main ()
#9 0x420158d4 in __libc_start_main () from /lib/i686/libc.so.6

The following examples seem to run correctly:

alltoall, ring, topology, wave1d, cxx, hello, and trivial

I also successfully compiled and ran the cpi example using MPICH, so I
know my setup isn't completely broken. :-)

Mike