> http://www.cs.umass.edu/~emery/hoard/
>
> If it helps, please, tell us about your experience. I've heard some
> people who had similar problems, and Hoard saved the day for them.
>
> --
> Andriy Fedorov
>
Hello, Andriy,
thanks for the info, and sorry for being so late in replying.
I've been waiting for the head node to get free (users tend to
load the head node to 100% CPU and leave the remaining nodes at 0% :-).
All nodes are dual-processor, but most slave nodes run an older 2.4.20
kernel, one of them has the newer 2.6.6 but is somehow misconfigured
(it reports 1GB of RAM instead of 2GB), and the main node had one CPU
at 100%, so I couldn't run the benchmark exactly as suggested.
I am probably doing something wrong; either that, or my
problem has nothing to do with malloc but with pure memory
contention. BTW, the malloc idea is a good point, I had not thought
of it. Octave is written in C++ and I suppose it uses the STL,
and in any case the test scripts will certainly use malloc.
Thanks for the idea. I hope you can spot some error in what I've
done that explains why I'm not getting positive results.
I downloaded heaplayers-3.0.3c from the provided URL,
changed to allocators/hoard,
ran the compile-hoard script suggested in readme.pdf,
and obtained the libhoard.so shared library.
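In commands, that was roughly the following (the tarball name and the exact
way compile-hoard is invoked are from memory, so they may not be exact):

$ tar xzf heaplayers-3.0.3c.tar.gz
$ cd heaplayers-3.0.3c/allocators/hoard
$ sh compile-hoard      # the build script mentioned in readme.pdf
# this leaves libhoard.so in the current directory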
Since I couldn't find scripts to compile the linux-scalability
benchmark, I downloaded a previous RPM package, libhoard-2.0-1
(RedHat contrib, libc6), and found a Makefile containing, among
others, these targets:
linux-scalability-hoard:
        c++ -DNDEBUG -D_REENTRANT=1 -fno-exceptions -Wall -O6 -fexpensive-optimizations -finline-functions -fomit-frame-pointer -ffast-math $(DEFS) -O linux-scalability.c ../../libhoard.a -lpthread -o linux-scalability-hoard

linux-scalability:
        c++ -DNDEBUG -D_REENTRANT=1 -fno-exceptions -Wall -O6 -fexpensive-optimizations -finline-functions -fomit-frame-pointer -ffast-math $(DEFS) linux-scalability.c -lpthread -o linux-scalability
Is the lone -O flag right in the 1st target? Shouldn't the -O6 suffice?
Is linux-scalability a good test? Should I try others?
There are also a larson.cpp, testme.cpp and malloc-test.c, but I cannot find Makefiles for them.
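If it helps, I guess they could be compiled following the same pattern as the
targets above, something like this (untested, just my guess, and the libhoard.a
path would have to match the actual tree):

# untested guess, reusing the flags and the libhoard.a path from the Makefile above
$ c++ -DNDEBUG -D_REENTRANT=1 -fno-exceptions -O6 larson.cpp ../../libhoard.a -lpthread -o larson-hoard
$ c++ -DNDEBUG -D_REENTRANT=1 -fno-exceptions -O6 malloc-test.c ../../libhoard.a -lpthread -o malloc-test-hoard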
So I ran both targets as suggested in the README from libhoard-2.0-1, and I get:
$ ./linux-scalability 512 10000000 1
Starting test...
Thread 16386 adjusted timing: 3.634673 seconds for 10000000 requests of 512 bytes.
$ ./linux-scalability 512 10000000 2
Starting test...
Thread 16386 adjusted timing: 3.609622 seconds for 10000000 requests of 512 bytes.
Thread 32771 adjusted timing: 3.663946 seconds for 10000000 requests of 512 bytes.
$ ./linux-scalability-hoard 512 10000000 1
Starting test...
Thread 16386 adjusted timing: 3.175249 seconds for 10000000 requests of 512 bytes.
$ ./linux-scalability-hoard 512 10000000 2
Starting test...
Thread 32771 adjusted timing: 3.156159 seconds for 10000000 requests of 512 bytes.
Thread 16386 adjusted timing: 3.180477 seconds for 10000000 requests of 512 bytes.
It seems Hoard has some impact indeed, but I'm not sure those are the expected
results, and I suspect I'm doing something wrong. Is that the expected scaling?
It scales exactly like the stock malloc: roughly the same time for 2 threads
as for 1 thread. The response time is certainly better, though.
I also tried the other way of running it, using LD_PRELOAD:
$ export LD_PRELOAD=\
> $HOME/heaplayers-3.0.3c/allocators/hoard/libhoard.so:/usr/lib/libdl.so
$ ./linux-scalability 512 10000000 1
Starting test...
Thread 16386 adjusted timing: 3.211785 seconds for 10000000 requests of 512 bytes.
$ ./linux-scalability-hoard 512 10000000 2
Starting test...
Thread 32771 adjusted timing: 3.161546 seconds for 10000000 requests of 512 bytes.
Thread 16386 adjusted timing: 3.262856 seconds for 10000000 requests of 512 bytes.
So any application can be made to take advantage of Hoard. I then tried it
with Octave, exporting LD_PRELOAD before invoking it in two terminals (the
exact commands are below, after the timings), but the results are identical
to those with plain malloc:
octave:1> mono=mem(2E6)
time = 0.727685
t1 = 0.063575 -- 251.671 MB/s allocating
t2 = 0.254559 -- 62.854 MB/s counting
t3 = 0.409551 -- 39.067 MB/s computing
mono = 0.40955
octave:2> while !exist('mem.lock'), end; bi=mem(1E6) % the same line is run in the other Octave instance at the same time;
% "touch mem.lock" in another terminal then releases both at once
time = 0.450245
t1 = 0.042926 -- 186.367 MB/s allocating
t2 = 0.150123 -- 53.290 MB/s counting
t3 = 0.257196 -- 31.105 MB/s computing
bi = 0.25720
octave:3> mono/bi
ans = 1.5924
I get the same ~1.6 speedup as with plain malloc when computing half the workload on each CPU.
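For reference, each Octave session was started roughly like this (same
libhoard.so path as before; mem is my own test script, and mem.lock is only
there to start the two runs at the same time):

# in each of the two terminals, before starting Octave
$ export LD_PRELOAD=$HOME/heaplayers-3.0.3c/allocators/hoard/libhoard.so:/usr/lib/libdl.so
$ octave
# once both instances are waiting on mem.lock, in a third terminal:
$ touch mem.lock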
So the possibilities are:
1.- I'm doing something wrong, and the solution is indeed in malloc
(I hope so, that would be the workaround :-)
2.- I'm not doing anything wrong and that is the expected behaviour (see
linux-scalability above). Then I'm out of luck, since I don't know how to
prove that it's a memory contention problem.
Even if you don't know the answer, thanks a lot for letting me know about
Hoard. Sorry for the lengthy e-mail, but... hey, you asked for feedback :-)
Thanks again for your help
-javier
P.S.: tools for monitoring memory contention... does anybody know of one?!? :-)