A potential NUMA scheduling solution
By Jonathan Corbet
October 31, 2012
Earlier this year, two different developers set out to create a solution to
the problem of performance (or the lack thereof) on non-uniform memory
access (NUMA) systems. The Linux kernel's scheduler will freely move
processes around to maximize CPU utilization on large systems;
unfortunately, on NUMA systems, that can lead to processes being separated
from their memory, reducing performance considerably. Two very different
solutions to the problem were posted, leaving no clear path toward a
single solution that could be merged into the mainline. Now, perhaps, that
single solution exists, but the way that solution came about raises some
questions.
The first approach was Peter Zijlstra's sched/numa patch set. It added a "lazy
migration" mechanism (implemented by Lee Schermerhorn) that uses soft page
faults to move useful pages to the NUMA node where they are actually being
used. On top of that, it implemented a new "home node" concept that, whenever
possible, keeps the scheduler from moving processes between NUMA nodes;
it also tries to make memory allocations happen on the allocating process's
home node. Finally, there was a pair of system calls allowing a process to
change its home node and to form groups of processes that should all run on
the same home node.
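The kernel-side mechanism lives in the page-fault paths, but the idea behind it is easy to model. The following sketch is only an illustration of the concept, not code from sched/numa; the fake_page structure and both helpers are invented for the example.

    /*
     * Illustration only, not sched/numa code: a toy model of lazy migration.
     * The scanner clears a page's "present" state so the next access takes a
     * soft fault; the fault handler runs in the context of the faulting task,
     * so it knows which node the access came from and can move the page there.
     */
    #include <stdio.h>

    struct fake_page {
        int node;       /* node whose memory currently holds the page */
        int present;    /* cleared to force a soft fault on the next access */
    };

    /* The scanner unmaps the page so that a later access will fault. */
    static void mark_for_hinting_fault(struct fake_page *page)
    {
        page->present = 0;
    }

    /* The soft-fault handler sees which node the faulting access came from. */
    static void hinting_fault(struct fake_page *page, int faulting_node)
    {
        if (page->node != faulting_node)
            page->node = faulting_node;   /* stands in for a real page migration */
        page->present = 1;                /* map the page back in */
    }

    int main(void)
    {
        struct fake_page p = { .node = 0, .present = 1 };

        mark_for_hinting_fault(&p);
        hinting_fault(&p, 1);   /* a process running on node 1 touches the page */
        printf("page now lives on node %d\n", p.node);
        return 0;
    }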
Andrea Arcangeli's AutoNUMA patch set,
instead, was more strongly focused on migrating pages to the nodes where
they are actually being used. To that end, it created a tracking mechanism
(again, using page faults) to figure out where page accesses were coming
from; there was a new kernel thread to perform this tracking. Whenever the
generated statistics revealed that too many pages were being accessed from
remote nodes, the kernel would consider either relocating the processes
performing those accesses or relocating the pages; either way, the goal was
to get both the processes and the pages on the same node.
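In rough terms, the decision AutoNUMA makes can be pictured like this. The sketch below is not AutoNUMA's code; the statistics structure, the remote_ratio() helper, and the 50% threshold are all assumptions made purely for illustration.

    /*
     * Not AutoNUMA code: a much-simplified model of the decision described
     * above.  Once enough fault statistics have accumulated, either the task
     * or its pages are moved so that both end up on the same node.
     */
    #include <stdio.h>

    struct task_numa_stats {
        unsigned long faults_local;    /* hinting faults from the task's own node */
        unsigned long faults_remote;   /* hinting faults from other nodes */
        int preferred_node;            /* node holding most of the task's memory */
    };

    static int remote_ratio(const struct task_numa_stats *s)
    {
        unsigned long total = s->faults_local + s->faults_remote;

        return total ? (int)(s->faults_remote * 100 / total) : 0;
    }

    static void numa_balance(const struct task_numa_stats *s, int current_node)
    {
        if (remote_ratio(s) < 50)
            return;    /* placement looks acceptable; do nothing */

        if (current_node != s->preferred_node)
            /* Option 1: move the task to the node holding most of its memory. */
            printf("would migrate task to node %d\n", s->preferred_node);
        else
            /* Option 2: the task is already there; move the pages instead. */
            printf("would migrate remote pages toward node %d\n", current_node);
    }

    int main(void)
    {
        struct task_numa_stats s = {
            .faults_local = 20, .faults_remote = 80, .preferred_node = 1,
        };

        numa_balance(&s, 0);    /* the task is currently running on node 0 */
        return 0;
    }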
To say that the two developers disagreed on the right solution is to
understate the case considerably. Peter claimed that AutoNUMA abused the
scheduler, added too much memory overhead, and slowed scheduling decisions
unacceptably. Andrea responded that sched/numa would not work well,
especially for larger jobs, without manual tweaking by developers and/or
system administrators. The conversation was rather less than polite at
times — until it went silent altogether. Peter last responded to the
AutoNUMA discussion at the end of June — this
example demonstrates the level of the discussion at that time — and the last sched/numa posting happened at the
end of July.
The silence ended on October 25 with Peter's posting of the numa/core patch set. The patch introduction
reads:
Here's a re-post of the NUMA scheduling and migration improvement
patches that we are working on. These include techniques from
AutoNUMA and the sched/numa tree and form a unified basis - it has
got all the bits that look good and mergeable....
These patches will continue their life in tip:numa/core and unless
there are major showstoppers they are intended for the v3.8 merge
window. We believe that they provide a solid basis for future work.
It is worth noting that the value of "we" is not well defined anywhere in
the patch set.
Numa/core brings in much of the sched/numa patch set, including the lazy
migration scheme, the memory policy changes, and the home node concept.
The core scheduler change tries to keep processes on their home node by
adding resistance to moving a process away from that node, and by trying
to push misplaced processes back to the home node during load balancing.
There is also a feature to wake sleeping processes on the home node
regardless of where they were running before, but it
is disabled because "we found this to be far too aggressive."
Missing from this patch set are the proposed numa_tbind() and
numa_mbind() system calls; it's not clear whether those are meant
to be added later.
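The "resistance" to leaving the home node can be thought of as a bias in the scheduler's placement decision. The sketch below only illustrates that idea; it is not taken from numa/core, and the node_of() helper, the HOME_NODE_BONUS weighting, and the load figures are invented for the example.

    /*
     * Illustration only: a placement function that prefers CPUs on the
     * task's home node unless another node is lighter by a wide margin.
     */
    #include <stdio.h>

    #define NR_CPUS         8
    #define CPUS_PER_NODE   4
    #define HOME_NODE_BONUS 25    /* assumed weighting, not from the patches */

    static int node_of(int cpu) { return cpu / CPUS_PER_NODE; }

    /* Pick the least-loaded CPU, treating CPUs on the home node as if they
     * were HOME_NODE_BONUS units lighter, so a task only leaves its home
     * node when the imbalance is large enough to be worth the cost. */
    static int pick_cpu(const int load[NR_CPUS], int home_node)
    {
        int best = 0, best_score = 1 << 30;

        for (int cpu = 0; cpu < NR_CPUS; cpu++) {
            int score = load[cpu];

            if (node_of(cpu) == home_node)
                score -= HOME_NODE_BONUS;
            if (score < best_score) {
                best_score = score;
                best = cpu;
            }
        }
        return best;
    }

    int main(void)
    {
        /* Node 0 (CPUs 0-3) is busier than node 1, but not by enough to pull
         * a task whose home node is 0 across the node boundary. */
        int load[NR_CPUS] = { 40, 35, 45, 50, 30, 30, 30, 30 };

        printf("chosen CPU: %d\n", pick_cpu(load, 0));    /* a node-0 CPU */
        return 0;
    }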
The patch set also includes some ideas from AutoNUMA. The page
structure gains a new last_nid field to record the ID of the NUMA
node last observed to access the page. That new field will cause
struct page to grow on 32-bit systems, which is never a
popular thing to do. It is expected, though, that most systems where
better NUMA scheduling really matters will be 64-bit.
Scanning of memory is still done:
pages are marked as being absent so that usage patterns can be observed
from the resulting soft faults. But the kernel thread to perform this
scanning no longer exists; it is, instead, done by each process in its own
context. The number of pages scanned is proportional to each process's run
time, so little effort is put into the scanning of pages belonging to
processes that rarely run. Scanning does not start until a given process
has accumulated at least one second of run time. It makes sense that there
is little value in optimizing the NUMA placement of short-lived processes;
in this case, that intuition was confirmed with an improvement in the
all-important kernel-compilation benchmark. Most of the memory overhead
added by the original AutoNUMA patches has been removed.
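The throttling policy itself fits in a few lines. The sketch below is an illustration rather than the patch set's code: the one-second threshold and the proportionality rule come from the description above, while the function name and the pages-per-second rate are assumptions.

    /*
     * Illustration only: how much scanning a task has "earned" with its
     * accumulated run time.  SCAN_PAGES_PER_SEC is an assumed rate.
     */
    #include <stdio.h>

    #define NSEC_PER_SEC       1000000000ULL
    #define SCAN_PAGES_PER_SEC 256    /* assumed: pages unmapped per second of runtime */

    static unsigned long pages_to_scan(unsigned long long task_runtime_ns,
                                       unsigned long long runtime_already_scanned_ns)
    {
        if (task_runtime_ns < NSEC_PER_SEC)
            return 0;    /* short-lived so far: placement not worth optimizing */

        unsigned long long unpaid_ns = task_runtime_ns - runtime_already_scanned_ns;

        return (unsigned long)(unpaid_ns * SCAN_PAGES_PER_SEC / NSEC_PER_SEC);
    }

    int main(void)
    {
        /* A task with half a second of run time does no scanning; one with
         * three seconds, two of which have already been covered, scans one
         * second's worth of pages. */
        printf("%lu\n", pages_to_scan(NSEC_PER_SEC / 2, 0));
        printf("%lu\n", pages_to_scan(3 * NSEC_PER_SEC, 2 * NSEC_PER_SEC));
        return 0;
    }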
Thus far, there has been little in the way of reviews of this large patch
set, and no benchmark results have been posted. Things will have to pick up on that
front if a patch set of this size is going to be ready by the time the 3.8
merge window opens. The numa/core patches may improve NUMA scheduling, and
they may be the right basis to move forward with, but the development
community as a whole does not know that yet.
There is one other thing that jumps out at an attentive observer. These
patches credit Andrea's work with a set of Suggested-by and
Based-on-idea-by tags, but none of them are signed off by Andrea.
It would appear that, while some of his ideas have found their way into
this patch set, his code has not. Neither, it seems, has Andrea himself:
he has been conspicuously absent from the review discussion.
In the absence of any further information, it is hard not to
conclude that Andrea has removed himself from this particular project.
Certainly Red Hat cannot be faulted if it is unable to feel entirely
comfortable when some of its
highest-profile engineers are fighting among themselves in a public forum.
So it is not hard to imagine that the developers involved were given clear
instructions to resolve the situation. If that were the case, we would have a
solution that was arrived at as much by Red Hat management as by the wider
development community.
Such speculation (and it certainly is no more than that), of course,
says nothing about the quality of the current patch set. That will be
judged by the development community, presumably between now and when the
3.8 merge window opens. Assuming the patches pass this review, we should
have an improved NUMA scheduler and an end to an ongoing dispute. As the
number of NUMA (and NUMA-like) systems grows, that can only be a good thing.