LAM/MPI General User's Mailing List Archives

From: Anthony J. Ciani (aciani1_at_[hidden])
Date: 2005-04-22 17:45:23


Hello LAM Users,

On Fri, 22 Apr 2005, Eugene BT wrote:
> Thanks Anthony,
>
> That was very useful. To elaborate, I was running the tests on an 8-way
> Opteron with 64GB of memory (8GB per CPU) and SuSE Professional 9.2, and each
> test typically only uses 1-8GB of memory total. What you are saying regarding
> the non-functioning of NUMA makes a lot of sense, especially in light of my
> last test. This test used 8 times more memory than the first one, but should
> have been much less communication-intensive, yet I saw significantly reduced
> performance (-20% efficiency). This of course utterly confused me, but if, as
> you suggest, the NUMA system has been disabled, then more reads/writes would
> have been non-local, leading to the drop. As far as I know though, the 2.6 SMP
> kernel is NUMA aware, so perhaps node interleaving is enabled in the BIOS. I
> will report back once we have our own machines in any case.

You can check both NUMA and node interleaving from the kernel messages, if
you can access /var/log/messages or can run dmesg (and the message buffer
hasn't wrapped past the start). Look for lines like these near the
beginning, just after the BIOS-provided physical memory map:

Apr 21 21:41:58 wolf2 kernel: Scanning NUMA topology in Northbridge 24
Apr 21 21:41:58 wolf2 kernel: Number of nodes 2 (10010)
Apr 21 21:41:58 wolf2 kernel: Node 0 MemBase 0000000000000000 Limit 000000003fffffff
Apr 21 21:41:58 wolf2 kernel: Node 1 MemBase 0000000040000000 Limit 000000007fff0000
Apr 21 21:41:58 wolf2 kernel: Using node hash shift of 24
Apr 21 21:41:58 wolf2 kernel: Bootmem setup node 0 0000000000000000-000000003fffffff
Apr 21 21:41:58 wolf2 kernel: Bootmem setup node 1 0000000040000000-000000007fff0000

There is also a /sys interface to NUMA that reports how well the allocation
policy is being followed, in:
/sys/devices/system/node/nodeX/numastat

The values are explained in
/usr/src/linux/Documentation/numastat.txt

If there are a large number of misses (compared to hits), then you are
either getting a lot of CPU migration, or the memory is node interleaved.
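
For illustration (the numbers below are made up; the field names are the
real ones described in numastat.txt), each per-node file contains counters
like:

numa_hit 1284003
numa_miss 2100
numa_foreign 1876
interleave_hit 5032
local_node 1280411
other_node 5692

numa_hit counts pages that were allocated on the node the kernel intended,
numa_miss counts pages that ended up on this node although another node was
preferred, and local_node/other_node tell you whether the allocating process
was running on this node at the time.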

SUSE Pro 9.2's SMP kernel does have NUMA support enabled.

>> It would certainly be nice to somehow guarantee that each task ran with
>> its memory locally allocated, but this would require some sort of user
>> space interface in the NUMA kernel code, as I believe there is in IRIX.
>
> I assume this is not the case in the current linux kernels?

As for the scheduler automatically locking a process down to the first
available, least-used CPU, it doesn't. However, the O(1) scheduler
natively tries to spread processes out and maintain processor
continuity (the O(n) scheduler in the 2.4.x kernels ping-ponged A LOT).
With NUMA, I believe it also considers where a process has allocated memory
when making a scheduling decision, but I'm not really sure. NUMA
awareness causes the kernel to try to allocate memory on (or near) the
node of the current processor. So there is a natural, or "soft", CPU
ownership in the 2.6.x kernels.

Since 2.5.8, you CAN set which CPUs a process may be scheduled on by using
the sched_setaffinity() system call. This creates a "hard" CPU ownership;
however, there is no facility to set the mask to the current CPU. For
testing, you could just create a mask using the rank, so that each rank
runs on a different CPU. If you do this before allocating bunches of
memory, then you should be fairly well guaranteed that each task is using
locally allocated memory.
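
Just as an illustration (this is not LAM code, and it assumes a glibc whose
sched_setaffinity() takes a cpu_set_t plus the CPU_ZERO/CPU_SET macros; the
prototype changed a couple of times in the 2.3.x series), something like
this, done right after MPI_Init(), would pin rank N to CPU N:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    cpu_set_t mask;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Pin this rank to CPU number == rank before allocating much
       memory, so later allocations get satisfied from the local node.
       Assumes one rank per CPU on a single N-way box. */
    CPU_ZERO(&mask);
    CPU_SET(rank, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
        perror("sched_setaffinity");

    /* ... allocate memory and do the real work here ... */

    MPI_Finalize();
    return 0;
}

The pid argument of 0 means the calling process. On a multi-host run you
would of course want the rank within the host rather than the global rank.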

In the sysv and usysv RPIs, it might be possible to make a routine to
collect the CPU masks and usage for all processes in the system, and then
determine the best masks to use for the child processes, but due to the
natural CPU affinity of the current kernels, and lots of other
headaches, I guess there really isn't much use for this.

------------------------------------------------------------
               Anthony Ciani (aciani1_at_[hidden])
            Computational Condensed Matter Physics
    Department of Physics, University of Illinois, Chicago
               http://ciani.phy.uic.edu/~tony
------------------------------------------------------------