>
>
>
> The sysv timing is perhaps the only one that makes sense. But it does
> show that the Origin is about 10-15 times slower than your Altrix
> machine (but there's a lot of other factors here -- I assume the
> machines were idle, you didn't oversubscribe the CPUs, etc.).
>
> But you're right; clearly, a 3 second job on the Altrix should not
> take over an hour on the Origin. It certainly suggests a tcp problem,
> but anything is suspect for a difference that egregious. A few random
> questions to look into: Are there any messages in the system logs that
> indicate that things were going wrong? Can you run tcp ping-pong
> tests (without MPI) to verify bandwidth and latency (e.g., Netpipe)?
> Was someone else using the Origin machine at the time? Was memory
> full, and therefore continually swapping?
>
A clean build of LAM didn't change anything (the various tests seem to
produce the same results... not a big surprise, really).
However, continued testing reveals the following: if I lower the TCP
short-message size in LAM from 64K to 60K, which is the system's default
for the TCP send/receive buffer sizes, the problem goes away. That
confuses me a bit, as I assumed LAM would bump up the TCP buffers to
whatever short-message or TCP buffer size the user requested, as long as
it was within the limits of the OS.
To see whether the problem exists on other IRIX machines, I tested it on
an O2 (IRIX 6.5.21, MIPSpro 4.71, system TCP send/receive space 60K), and
there the problem seems to go away: it works OK independently of the
short-message size. I tested it with 60K, 64K, and 200K, and all give
sensible MPI throughput.
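
For anyone who wants to reproduce the comparison, a minimal MPI ping-pong
along these lines (not the exact test I ran, just a sketch) gives a rough
throughput figure per message size around the 60K/64K boundary:

/* Minimal MPI ping-pong sketch (illustration only).  Rank 0 sends a
 * buffer to rank 1 and waits for it to come back; the round-trip time
 * gives a rough throughput figure for each message size. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    const int sizes[] = { 61440, 65536, 204800 };   /* 60K, 64K, 200K */
    const int reps = 100;
    int rank, i, r;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < 3; i++) {
        char *buf = malloc(sizes[i]);
        double t0, t1;

        memset(buf, 0, sizes[i]);
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (r = 0; r < reps; r++) {
            if (rank == 0) {
                MPI_Send(buf, sizes[i], MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, sizes[i], MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         &status);
            } else if (rank == 1) {
                MPI_Recv(buf, sizes[i], MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         &status);
                MPI_Send(buf, sizes[i], MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();
        if (rank == 0)
            printf("%7d bytes: %.2f MB/s\n", sizes[i],
                   2.0 * sizes[i] * reps / (t1 - t0) / 1.0e6);
        free(buf);
    }

    MPI_Finalize();
    return 0;
}

Run it with two processes, e.g. "mpirun -np 2 pingpong". If I remember
the LAM 7 SSI parameter names correctly, the short-message threshold and
socket buffer size can also be set per run with something like
"mpirun -ssi rpi_tcp_short 61440 -ssi rpi_tcp_sockbuf 61440 -np 2
pingpong" (check the LAM documentation for the exact names).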
For my part, I can work around this by supplying sensible parameters to
mpirun in the script that runs my executables, but it would be nice to
know what triggers it.
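
To help narrow that down, the raw-TCP ping-pong suggested above could be
run outside MPI, to see whether the plain IRIX TCP stack also falls off a
cliff once the message is larger than the 60K socket buffers. NetPIPE is
the more thorough tool; the rough sketch below is just an illustration,
and the port number and message size are placeholders.

/* Raw TCP ping-pong sketch (no MPI).  Start the receiver first
 * ("./tcppp"), then the sender ("./tcppp <receiver-hostname>"). */
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

#define PORT 5555
#define MSG  65536          /* try 61440 vs 65536 */
#define REPS 100

/* Loop until the whole buffer has been sent or received. */
static void xfer(int fd, char *buf, int n, int do_send)
{
    int done = 0, k;
    while (done < n) {
        k = do_send ? write(fd, buf + done, n - done)
                    : read(fd, buf + done, n - done);
        if (k <= 0) { perror("xfer"); exit(1); }
        done += k;
    }
}

int main(int argc, char **argv)
{
    char *buf = calloc(1, MSG);
    struct sockaddr_in addr;
    struct timeval t0, t1;
    int fd, i;
    double secs;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(PORT);

    if (argc == 1) {            /* receiver: echo everything back */
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
        listen(lfd, 1);
        fd = accept(lfd, NULL, NULL);
        for (i = 0; i < REPS; i++) {
            xfer(fd, buf, MSG, 0);
            xfer(fd, buf, MSG, 1);
        }
    } else {                    /* sender: time the round trips */
        struct hostent *h = gethostbyname(argv[1]);
        if (h == NULL) { perror("gethostbyname"); exit(1); }
        fd = socket(AF_INET, SOCK_STREAM, 0);
        memcpy(&addr.sin_addr, h->h_addr_list[0], h->h_length);
        connect(fd, (struct sockaddr *)&addr, sizeof(addr));
        gettimeofday(&t0, NULL);
        for (i = 0; i < REPS; i++) {
            xfer(fd, buf, MSG, 1);
            xfer(fd, buf, MSG, 0);
        }
        gettimeofday(&t1, NULL);
        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("%d bytes: %.2f MB/s\n", MSG, 2.0 * MSG * REPS / secs / 1e6);
    }
    close(fd);
    return 0;
}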
-Morten