I think the key culprit here is cache size -- I forgot that these P-III machines have a whopping 2MB of L2 transfer cache. Furthermore, I'm fairly sure they have some additional on-board L3.
I suspect the bottom line is that my messages hit main memory far less often on the older machines than on the single processor trying to manage 4 independent processes.
Thanks -- my knee-jerk reaction was to make sure there wasn't a silver bullet out there to improve single-processor performance (e.g., making the shared memory pool size smaller than the cache or something like that).
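A rough sketch of that cache-fit reasoning, in Python. The 2MB figure for the P-III Xeons is the one mentioned above; the 512KB L2 for the Pentium 4 and the 400KB per-process working set are assumptions made purely for illustration, and fits_in_cache is a hypothetical helper, not anything provided by LAM.

```python
KB = 1024
MB = 1024 * KB


def fits_in_cache(working_set_bytes, procs_per_cpu, cache_bytes):
    """Return True if the processes sharing one CPU could keep their
    combined working sets resident in that CPU's cache."""
    return working_set_bytes * procs_per_cpu <= cache_bytes


# Hypothetical per-process working set (an assumption, not a measurement).
working_set = 400 * KB

# Dual P-III Xeon: 4 processes over 2 CPUs = 2 per CPU, 2MB L2 each.
dual_ok = fits_in_cache(working_set, procs_per_cpu=2, cache_bytes=2 * MB)

# Single Pentium 4: all 4 processes share one CPU; 512KB L2 is assumed.
single_ok = fits_in_cache(working_set, procs_per_cpu=4, cache_bytes=512 * KB)

print("dual box fits in cache:  ", dual_ok)
print("single box fits in cache:", single_ok)
```

With these assumed numbers the dual box keeps everything in cache while the single box does not, which would be consistent with the timings, but the real working-set size would have to be measured.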
-Tony
Manish Chablani <mchablan_at_[hidden]> wrote:
Hi,
There are many other factors that might have influenced your
measured performance, such as:
- RAM size
- cache size
- what exactly your program is doing (is it continually hitting memory?
is it causing millions of cache reloads in the single CPU case? is it
thrashing the RAM in the single CPU case?)
In general, you will definitely be thrashing process scheduling more on the
1 CPU box.
I would advise you to study the behavior of your program (the amount
of memory it uses and how frequently it references memory) and come up
with a reason based on the above factors.
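One possible starting point for that kind of study, sketched in Python with the standard resource module (Unix only). It runs a command and reports the child's peak memory, page faults, and context switches. profile_command is a hypothetical helper name, and the sketch assumes the program can be launched directly as a child process; ranks started through a LAM daemon may not be children of the wrapper, in which case only the launcher itself would be measured.

```python
import resource
import subprocess
import sys


def profile_command(argv):
    """Run argv to completion and return resource usage of child processes."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    proc = subprocess.run(argv)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "returncode": proc.returncode,
        # Peak resident set size of the largest waited-for child so far
        # (kilobytes on Linux; units differ on some other systems).
        "max_rss": after.ru_maxrss,
        # Counters are cumulative, so report the difference for this run.
        "major_faults": after.ru_majflt - before.ru_majflt,
        "minor_faults": after.ru_minflt - before.ru_minflt,
        "voluntary_switches": after.ru_nvcsw - before.ru_nvcsw,
        "involuntary_switches": after.ru_nivcsw - before.ru_nivcsw,
    }


if __name__ == "__main__":
    # Placeholder command; substitute the real program to be studied.
    stats = profile_command([sys.executable, "-c", "pass"])
    for name, value in stats.items():
        print(f"{name}: {value}")
```

A large count of major faults would point at RAM thrashing, and a large count of involuntary context switches at scheduling pressure; cache reloads are not visible through this interface and would need a hardware-counter tool.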
hope this helps,
Manish Chablani
------------------------------------------------------
Graduate Student, CS Department, Indiana University.
http://www.cs.indiana.edu/~mchablan
LAM/MPI Developer
Make today a LAM/MPI day !!!
http://www.lam-mpi.org
------------------------------------------------------
On Mon, 31 Mar 2003, Tony Caola wrote:
>
> Bear with me if the answer to this is in the archives. . . :-)
>
> I would like to run a parallel code of mine on a single-processor machine for the purposes of troubleshooting and prepping production runs. However, I am seeing *terrible* performance on a simulated '4-processor' run -- relative to running the same '4-processor' code on a much slower dual SMP machine. Some details follow, but my basic question is this: Are single-CPU Linux boxes just awful at simulating multi-processor runs -- so I'm simply out of luck?
>
> A tale of 2 boxes:
>
> Box 1: Pentium 4 2.4b GHz-based system with DDR333 and 533 MHz FSB
> Linux kernel: 2.4.18
> (approximate memory bandwidth: 1200MB/sec -- based on STREAM benchmark)
> LAM 6.5.9 with sysv RPI
> Time for 4-processor solver test: 150 seconds
> (Note: same performance observed under LAM 6.3.2)
>
> Box 2: Dual processor Pentium 3 Xeon 500 MHz-based system
> Linux kernel: 2.4.2
> (approximate memory bandwidth: 300MB/sec -- based on STREAM benchmark)
> LAM 6.3.2 with usysv RPI
> Time for (same) 4-processor solver test: 20 seconds
>
> Since the single processor machine has ~twice the processing speed of the dual machine, I expected that (with some degradation) the timings would be about equal. I am stunned that the faster machine is taking an order of magnitude longer than the dual-processor box for the same '4-processor' run. Has anyone seen this kind of poor performance running multi-processor code on a uniprocessor? Any suggestions?
>
> Thanks,
>
> Tony
>
>
>
_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/