Avoiding - and fixing - memory fragmentation
[Posted November 28, 2006 by corbet]
Memory fragmentation is a kernel programming issue with a long history. As
a system runs, pages are allocated for a variety of tasks with the result
that memory fragments over time. A busy system with a long uptime may have
very few blocks of pages which are physically-contiguous. Since Linux is a
virtual memory system, fragmentation normally is not a problem; physically
scattered memory can be made virtually contiguous by way of the page
tables.
But there are a few situations where physically-contiguous memory
is absolutely required. These include large kernel data structures (except
those created with vmalloc()) and any memory which must appear
contiguous to peripheral devices. DMA buffers for low-end devices (those
which cannot do scatter/gather I/O) are a classic example. If a large
("high order") block of memory is not available when needed, something will
fail and yet another user will start to consider switching to BSD.
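As a concrete (and hedged) illustration, a driver needing a physically contiguous buffer typically asks the page allocator for a higher-order block; the order-4 request below (sixteen contiguous pages on a system with 4KB pages) is exactly the kind of allocation that can fail on a fragmented system. The helper function is invented for this article; alloc_pages() and page_address() are the standard kernel interfaces.

    #include <linux/gfp.h>
    #include <linux/mm.h>

    /* Ask for 2^4 = 16 physically contiguous pages (64KB with 4KB pages).
     * On a badly fragmented system this request can fail even though
     * plenty of individual free pages remain. */
    static void *grab_contiguous_buffer(void)
    {
        struct page *pages = alloc_pages(GFP_KERNEL, 4);

        if (!pages)
            return NULL;            /* no free order-4 block was available */
        return page_address(pages); /* kernel virtual address of the block */
    }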
Over the years, a number of approaches to the memory fragmentation problem
have been considered, but none have been merged. Adding any sort of
overhead to the core memory management code tends to be a hard sell. But
this resistance does not mean that people stop trying. One of the most
persistent in this area has been Mel Gorman, who has been working on an
anti-fragmentation patch set for some years. Mel is back with version 27 of his
patch, now rebranded "page clustering." This version appears to have
attracted some interest, and may yet get into the mainline.
The core observation in Mel's patch set remains that some types of memory
are more easily reclaimed than others. A page which is backed up on a
filesystem somewhere can be readily discarded and reused, for example,
while a page holding a process's task structure is pretty well nailed
down. One stubborn page is all it takes to keep an entire large block of
memory from being consolidated and reused as a physically-contiguous
whole. But if all of the easily-reclaimable pages could be kept together,
with the non-reclaimable pages grouped into a separate region of memory, it
should be much easier to create larger blocks of free memory.
So Mel's patch divides each memory zone into three types of blocks:
non-reclaimable, easily reclaimable, and movable. The "movable" type is a
new feature in this patch set; it is used for pages which can be easily
shifted elsewhere using the kernel's page migration mechanism. In
many cases, moving a page might be easier than reclaiming it, since there
is no need to involve a backing store device. Grouping pages in this way
should also make the creation of larger blocks "just happen" when a process
is migrated from one NUMA node to another.
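To make "moving a page" concrete: at its core, the operation copies the data to a new physical page and repoints the page table entry, as the next paragraph describes. The function below is a simplified sketch written for this article (it ignores locking, TLB flushing, and reverse mapping); copy_highpage(), set_pte_at(), and mk_pte() are standard kernel primitives.

    #include <linux/mm.h>
    #include <linux/highmem.h>

    /* Simplified sketch: migrate one mapped user page to a new location.
     * Real page migration must also handle locking, TLB invalidation,
     * swap cache state, and reverse mappings. */
    static void move_user_page(struct vm_area_struct *vma, unsigned long vaddr,
                               struct page *old_page, struct page *new_page,
                               pte_t *ptep)
    {
        /* Copy the contents to the new physical page. */
        copy_highpage(new_page, old_page);

        /* Repoint the PTE; the virtual address seen by the process
         * does not change, only the physical page behind it. */
        set_pte_at(vma->vm_mm, vaddr, ptep,
                   mk_pte(new_page, vma->vm_page_prot));
    }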
So, in this patch, movable pages (those marked with __GFP_MOVABLE)
are generally those belonging to user-space processes. Moving a user-space
page is just a matter of copying the data and changing the page table
entry, so it is a relatively easy thing to do. Reclaimable pages
(__GFP_RECLAIMABLE), by contrast, usually belong to the kernel. They are
either allocations which are expected to be short-lived (some kinds of DMA
buffers, for example, which exist only for the duration of an I/O
operation) or allocations which can be discarded if needed (various types
of caches).
Everything else is expected to be hard to reclaim.
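As a rough sketch of how the allocator might sort incoming requests into these three groups, consider the fragment below. Only the two GFP flags are named in the patch discussion above; the enum and helper function are invented here for illustration.

    /* Illustrative only: classify an allocation request by its GFP flags
     * into one of the three block types described above. */
    enum block_type {
        BLOCK_UNMOVABLE,    /* hard to reclaim: the conservative default */
        BLOCK_RECLAIMABLE,  /* short-lived or discardable kernel memory */
        BLOCK_MOVABLE,      /* user-space pages that can be migrated */
    };

    static enum block_type classify_allocation(gfp_t gfp_flags)
    {
        if (gfp_flags & __GFP_MOVABLE)
            return BLOCK_MOVABLE;
        if (gfp_flags & __GFP_RECLAIMABLE)
            return BLOCK_RECLAIMABLE;
        return BLOCK_UNMOVABLE;
    }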
By simply grouping different types of allocation in this way, Mel was able
to get some pretty good results:
In benchmarks and stress tests, we are finding that 80% of memory
is available as contiguous blocks at the end of the test. To
compare, a standard kernel was getting < 1% of memory as large
pages on a desktop and about 8-12% of memory as large pages at the
end of stress tests.
Linus has, in the past, been generally opposed to efforts to reduce memory
fragmentation. His comments this time
around have been much more detail-oriented, however: should allocations be
considered movable or non-movable by default? The answer would appear to
be "non-movable," since somebody always has to make some effort to ensure
that a specific allocation can be moved. Since the discussion is now
happening at this level, some sort of fragmentation avoidance might just
find its way into the kernel.
A related approach to fragmentation is the lumpy reclaim mechanism posted
by Andy Whitcroft, originally the work of Peter Zijlstra. Memory reclaim in
Linux is normally done by way of a least-recently-used (LRU) list; the hope
is that, if a page must be discarded, going after the least recently used
page will minimize the chances of throwing out a page which will be needed
soon. This mechanism will tend to free pages which are scattered randomly
in the physical address space, however, making it hard to create larger
blocks of free memory.
The lumpy reclaim patch tries to address this problem by modifying the LRU
algorithm slightly. When memory is needed, the next victim is chosen from
the LRU list as before. The reclaim code then looks at the surrounding
pages (enough of them to form a higher-order block) and tries to free them
as well. If it succeeds, lumpy reclaim will quickly create a larger free
block while reclaiming a minimal number of pages.
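The algorithm, as described, might look roughly like the sketch below. This is only an illustration of the idea; the helpers lru_next_victim() and try_to_reclaim_page() are placeholders invented here, not functions from the lumpy reclaim patch, while page_to_pfn() and pfn_to_page() are the usual kernel conversions.

    /* A hedged sketch of the lumpy reclaim idea, not the actual patch. */
    #define LUMPY_ORDER 3   /* example target: 2^3 = 8 contiguous pages */

    static void lumpy_reclaim_one(void)
    {
        /* Pick the next victim from the LRU list as usual. */
        struct page *victim = lru_next_victim();        /* placeholder */
        unsigned long pfn = page_to_pfn(victim);
        /* Round down to the naturally aligned block containing the victim. */
        unsigned long start = pfn & ~((1UL << LUMPY_ORDER) - 1);
        unsigned long i;

        /* Try to free every page in the surrounding block; if they all
         * go, an order-LUMPY_ORDER block becomes free at once. */
        for (i = 0; i < (1UL << LUMPY_ORDER); i++) {
            struct page *page = pfn_to_page(start + i);

            if (!try_to_reclaim_page(page))             /* placeholder */
                break;
        }
    }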
Clearly, this approach will work better if the surrounding pages can be
freed. As a result, it combines well with a clustering mechanism like Mel
Gorman's. The distortion of the LRU approach could have performance
implications, since the neighboring pages may be under heavy use when the
lumpy reclaim code goes after them. In an attempt to minimize this effect,
lumpy reclaim only happens when the kernel is having trouble satisfying a
request for a larger block of memory.
Whether - and when - these patches will be merged remains to be seen. Core memory
management patches tend to inspire a high level of caution; they can easily
create chaos when exposed to real-world workloads. The problem doesn't go
away by itself, however, so something is likely to happen, sooner or later.
you know, guys, i love articles like these. they are easy to read and
understand for non-kernel hackers, while providing an interesting insight
into certain kernel-development issues.
even though you're supporting Novell*, i'm really happy with my LWN
subscription...
* yeah, that's me trying to make a joke
Lumpy reclaim might distort the LRU a bit, but I think you'll see a nice pattern emerge. If a "hot" page gets reclaimed early because it shares a higher-order frame with a "cold" page, the hot page will refault soon and get a new page as an order-0 allocation. I believe this should tend to migrate hot pages to share higher-order frames. Thus, I'd imagine the performance issues might be temporary, and if combined with Mel's code, rare.
Or, I could just be full of it.
Combining both of the described methods would indeed be excellent. Page Clustering probably represents much more of a contributor than lumpy reclaim; Page Clustering keeps memory in a more optimal state, and creates a self-stabilizing system.
What interests me with Page Clustering is the movable pages. Even if memory gets fragmented under unusual stress (it's never impossible), movable pages can be moved into their respective areas. In other words, when memory enters a bad state, the kernel can actually shift it back towards a good state as needed instead of just flailing.
We know from file systems that the Page Clustering technique works wonders when memory is not full; slow memory (disk) with more than 5% free space rarely experiences fragmentation due to excellent file system drivers that work by clustering related data in the best-fit gap. When file systems get more full, they start fragmenting; similarly, if you can actually fill your memory more than 95% with non-reclaimable memory, you'll cause this sort of fragmentation.
Movable memory in this analogy would be something like a file system that packs files and directories together into minimum-size chunks. My home directory has 185 inodes in it, and due to bad management I've managed to make it take between 15 and 30 seconds to parse when the disk isn't busy and it's not cached; if those dentries were moved back together, at least into 64K chunks, it'd take only 5 or 6 8ms seeks and perform as fast as the 1000+ file directories that open in under 2 seconds.
Movable memory DOES allow this in the Page Clustering model, and such an implementation would let memory recover from a sudden spike as needed; a sudden spike of failures finding high-order memory would cause non-releasable memory to be shifted around and grouped back together, returning the system to a state where high-order blocks are again easy to come by.
No real point here, just felt like talking about self-stabilizing systems and tipping my hat to Mel for his excellent design.