Kernel development
The current development kernel is 3.9-rc5, released on March 31. Linus says:
"Nothing really peculiar stands out. Exynos DRM updates, IBM RamSan
driver updates are a bit larger, l2tp update... The rest is pretty much
small patches spread out all over. Mostly drivers (block, net, media, tty,
usb), networking, and some filesystem updates (btrfs, nfs). Some arch
updates (x86, arc). Things seem to be calming down a bit, and everything
seems largely on track for a 3.9 release in a few weeks."
Stable updates: 3.8.5, 3.4.38,
and 3.0.71 were released on March 28
with the usual set of fixes. 3.5.7.9 was
also released on the 28th.
Another big set of fixes is coming with
3.8.6,
3.4.39, and
3.0.72, which are due on or after
April 4.
It would be nice if we can avoid the feature-removal.txt step. I
mean, I heard some doleful whispers once when my pointer selected
that file. Instead of opening the file I just closed the directory
instantly. I never told anybody about that before, this is the
first time. I also heard that somebody added an entry to that file
to schedule the removal of that file? This is devilry.
— Frederic Weisbecker
< system in single state : everyone sees cat = alive >
insert_into_box(cat);
< system in dual state : new calls see cat == dead, but
current calls see cat == alive >
open_box();
< system is back to single state: everyone sees cat = dead >
funeral(cat);
— Steven Rostedt explains RCU
On behalf of the ZFS-on-Linux project, Brian Behlendorf has announced
the availability of version 0.6.1 of this Solaris-derived filesystem.
"Over two years of use by real users has convinced us ZoL is ready
for wide scale deployment on everything from desktops to super
computers." The project's home
page offers binary modules for a wide variety of distributions. (See
the FAQ for the project's take
on licensing issues.)
Kernel development news
By Michael Kerrisk April 3, 2013
Minchan Kim's recent patch series to
provide user-space-triggered reclaim of a process's pages represents one
more point in a spectrum that increasingly sees memory management on Linux
as a task that is indirectly influenced or even directly controlled from
user space.
Approaches such as Android's low-memory killer represent one end of the
page-reclaim spectrum, where memory management is primarily under kernel
control. When memory is low, the low-memory killer picks a victim process
and kills it outright; applications that live in such an environment have
to work with the possibility that they may disappear from
one moment to the next. As Minchan pointed out via an amusing example, the
effects of the low-memory killer's approach to page reclaim can be extreme:
[Having a process killed to free memory] was really terrible
experience because I lost my best score of game I had ever after I
switch the phone call while I enjoyed the game
John Stultz's volatile ranges work and Minchan's own work on a similar
feature (both described in this article)
represent a middle point in the spectrum. The volatile ranges approach,
inspired by Android's ashmem, provides a process with a way to inform the
kernel that a certain range of its own virtual address space can be
preferentially reclaimed if memory pressure is high. Under this approach,
the kernel takes no responsibility for the contents of the reclaimed pages:
if the kernel needs the memory, the page contents are discarded, and it is
assumed that the application has sufficient information available to
re-create those pages with the right contents if they are needed. As with
the low-memory killer, the decision about if and when to reclaim the pages
remains with the kernel.
By contrast, Minchan's patch set places the decision about when to
reclaim pages directly under the control of user space. The interface provided by
these patches is simple. A /proc/PID/reclaim file is
provided for each process. A process with suitable permissions—that
is, a process owned by root or one with the same user ID as the
process PID—can write one of the following values to
the file, to cause some or all of the process's pages to be reclaimed:
1. Reclaim all process pages in file-backed mappings.
2. Reclaim all process pages in anonymous
   (MAP_ANONYMOUS) mappings.
3. Reclaim all pages of the process (i.e., the combination of 1 and 2).
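As a rough illustration, a task manager that decided a background
process could spare its pages might do something like the following.
This is a minimal sketch, assuming the numeric values listed above are
written to the file as ASCII strings; the tokens actually accepted are
defined by Minchan's patches.

    #include <stdio.h>
    #include <sys/types.h>

    /* Sketch: ask the kernel to reclaim all pages of process 'pid'
       via the proposed /proc/PID/reclaim interface. The value "3"
       (reclaim everything) follows the list above and is an
       assumption about the interface, not a confirmed detail. */
    static int reclaim_all(pid_t pid)
    {
        char path[64];
        FILE *f;
        int ret = 0;

        snprintf(path, sizeof(path), "/proc/%ld/reclaim", (long) pid);
        f = fopen(path, "w");
        if (f == NULL)
            return -1;
        if (fputs("3", f) == EOF)
            ret = -1;
        if (fclose(f) == EOF)
            ret = -1;
        return ret;
    }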
As currently implemented, all of the process's pages that match the
specified criterion are reclaimed. Your editor
wondered
whether there might be benefit in allowing some control over the
range of pages that are reclaimed from the target process, by allowing an
address range to be written to the /proc/PID/reclaim file.
By contrast with volatile ranges and the low-memory killer,
modifications in pages reclaimed via /proc/PID/reclaim are
not lost. Modified pages are written to the underlying file in the case of
shared (MAP_SHARED) file mappings or to swap in other cases. Thus,
if the process touches the reclaimed page later, it will be faulted into
memory with the contents at the time it was reclaimed. The patches also
include some logic to handle the case where multiple processes are sharing
the same pages; in that case, the pages are reclaimed only after all of the
processes have marked them for reclaim. Like the low-memory killer,
/proc/PID/reclaim can be used to reclaim all of the pages
in a process, but without needing to kill the process to do so.
The idea behind Minchan's proposal is that a user-space task manager
could take over some part of the job of memory management. In some cases,
this may be more effective than allowing the kernel to make
memory-management decisions, since the user-space task manager can bring
application-specific intelligence to decisions about whether to reclaim a
process's pages. For example, some application processes may be in the
foreground while others are in the background. It may be desirable to
preferentially reclaim pages from one of the background processes, even if
it has some frequently accessed pages. Of course, the task manager would
somehow need to know when the system is under memory pressure. To that end,
a mechanism like Anton Vorontsov's proposed vmpressure_fd() API might be useful.
Minchan's patches apply against Michal Hocko's MMOTM
(memory management of the moment) tree. The patches came out on March
25, but have so far seen little review. Nevertheless, they present an idea
that will probably be of particular interest for the developers of mobile
and embedded devices and thus it seems likely that they will get some
attention at some point in the future.
By Michael Kerrisk April 3, 2013
Dave Jones continues to exercise his Trinity fuzz tester and uncover interesting
bugs in kernel code. One recent find was a long-standing bug in
the implementation of network namespaces.
The discussion of the bug began when
Dave posted a note to the linux-kernel
mailing list with stack traces that showed a kernel deadlock in the VFS
code. Dave's report prompted Al Viro
to wonder how a Trinity instance was
managing to sit blocked on two locks (a situation that should never
be able to happen), as shown in the
lockdep output posted by Dave (the output
has some key pieces highlighted):
Showing all locks held in the system:
4 locks on stack by trinity-child2/7669:
#0: blocked: (sb_writers#4){.+.+.+},
instance: ffff8801292d17d8, at: [<ffffffff811df134>] mnt_want_write+0x24/0x50
#1: held: (&type->s_vfs_rename_key){+.+.+.},
instance: ffff8801292d1928, at: [<ffffffff811c6f5e>] lock_rename+0x3e/0x120
#2: held: (&type->i_mutex_dir_key#2/1){+.+.+.},
instance: ffff880110b3a858, at: [<ffffffff811c701e>] lock_rename+0xfe/0x120
#3: blocked: (&type->i_mutex_dir_key#2/2){+.+.+.},
instance: ffff880110b3a858, at: [<ffffffff811c7034>] lock_rename+0x114/0x120
Al also noted that the output suggested
that a directory inode in the inode cache was mapped by two different
dentries, since lockdep showed two i_mutex_dir_key locks on the
same address. A dentry (directory entry) is a data structure representing a
filename in the kernel directory entry cache (dcache); a brief overview of
dentries and the dcache can be found in this
article. As will become clear shortly, it should normally never happen
that a directory inode is mapped twice in the dcache.
Some suggestions ensued regarding suitable debugging statements to add
to the kernel's lock_rename() function to further investigate the
problem. In particular, when two locks were held on the same inode
address, Linus wanted to see the filenames
corresponding to the inode and Al was
interested to know the name of the filesystem holding the two inodes.
Further runs of Trinity with those debugging statements in place revealed that the locks in question were occurring
for various entries under the /proc tree. At that point Linus refined the observation to note that the entries in
question were for directories under /proc/net, but, like Al, he was
puzzled as to how that could occur.
Here, a little background is probably in order. Once upon a time,
/proc/net was a single directory. But, with the invention of network
namespaces, it is now a symbolic link to the /proc/self/net
directory; in other words, each process now has its own
network-namespace-specific view of networking information under
/proc.
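On a kernel built this way, the link is plainly visible (output is
illustrative):

    $ ls -ld /proc/net
    lrwxrwxrwx 1 root root 8 Apr  3 12:00 /proc/net -> self/net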
With the output from the kernel debugging statements, the pieces
started falling rapidly into place. Dave realized that he had started seeing the
Trinity failure reports after he had enabled kernel namespaces support
following a recent bug fix by Eric Biederman. Al began looking more closely at some of the
subdirectories under the /proc/PID/net directories, and
made an unhappy discovery:
al@duke:~/linux/trees/vfs$ ls -lid /proc/{1,2}/net/stat
4026531842 dr-xr-xr-x 2 root root 0 Mar 21 19:33 /proc/1/net/stat
4026531842 dr-xr-xr-x 2 root root 0 Mar 21 19:33 /proc/2/net/stat
That discovery prompted a small explosion:
WE CAN
NOT HAVE SEVERAL DENTRIES OVER THE SAME DIRECTORY INODE. […]
Sigh... Namespace kinds - there should've been only one...
Those with a long memory, or who paid careful attention when reading a recent LWN article, might smile with the
realization that, to begin with and for many years thereafter, there was
only one class of namespace—mount namespaces, as implemented by one
Al Viro.
Humor aside, Al had discovered the origin of the problem. The
directory listing above shows two directory entries linked to the same
inode. More generally, for all of the processes that share a network
namespace, each of the corresponding entries in
/proc/PID/net is implemented as a hard link to the same
(virtual) /proc file.
Implementing corresponding /proc entries as hard links to the
same inode is a technique used in various places in the implementation of
namespaces. Indeed, allowing multiple hard links to a file is a normal
feature of UNIX-type systems. Except in one case: Linux, like other UNIX
systems, forbids multiple hard links to a directory. The reliability of
various pieces of kernel and user-space code is predicated on that
assumption. However, a
Linux 2.6.25 patch made early in the implementation of network
namespaces set in train some changes that quietly broke the assumption for
the directories under /proc/PID/net.
Having determined the cause of the problem, the developers then needed
to devise a suitable fix. At this point, pragmatic factors came into play,
since the task is not only to fix the kernel going forward, but also going
backward. In other words, the ideal solution would be one that could be
applied not only to the current kernel source tree but also to the
stable and long-term kernel series. That led Linus to speculate about the possibility of allowing
an exception to the rule that directory inodes are not allowed to have
multiple links. Since the locks in question are placed at the inode level,
why not change lock_rename() to replace the check on whether that
function is dealing with the same dentries with a check on whether it is
dealing with the same inodes?
However, Al was quick to point out that
while modifying the check would solve the particular deadlock problem found
by Dave, other problems would remain. The kernel code that deals with
those locks depends upon a topological sort
based on the hierarchical relationship between entries in the dcache; the
presence of multiple directory entries that link to the same inode renders
that sort unreliable.
Al went on to describe what he
considered to be the full and proper solution: creating
/proc/PID/net files as symbolic links to
per-network-namespace directories of the form
/proc/netns-ID/net, where netns-ID is a
per-namespace identifier. Alternatively,
the existing /proc/PID/net trees could be kept, but the
subdirectories could be created as duplicate subtrees rather than hard
links to a single directory subtree. Al was, however, unsure about the feasibility of implementing
this solution as a patch that could be backported to past stable kernel
series.
In the meantime, Linus came up with another
proposal. proc_get_inode(), the kernel function for allocating
inodes in the /proc filesystem, has the following form:
struct inode *proc_get_inode(struct super_block *sb, struct proc_dir_entry *de)
{
    struct inode *inode = iget_locked(sb, de->low_ino);

    if (inode && (inode->i_state & I_NEW)) {
        ...
        /* Populate fields in newly allocated cache entry
           pointed to by 'inode' */
        ...
        unlock_new_inode(inode);
    } else
        pde_put(de);
    return inode;
}
The iget_locked() function searches the kernel's inode cache
for an inode whose number corresponds to that recorded in the
proc_dir_entry structure de. It returns either a pointer to an
existing entry or, if no entry could be found, it allocates a new
uninitialized cache entry that it returns to the caller. The
proc_get_inode() function then populates the fields of the newly
allocated inode cache entry using information from the proc_dir_entry.
The deadlock problem is a result of the fact that—because
multiple dentries map to the same inode—multiple locks may be placed
on the same entry in the inode cache. Conversely, deadlocks could be
avoided if it were possible to avoid placing multiple locks on the inode
entries returned from the cache. As
Linus noted, in the case of /proc
files, it is not really necessary to find an existing entry in the cache,
because there is no on-disk representation for the inodes
under /proc. Instead, proc_get_inode() could simply
always create a new cache entry via a call to new_inode_pseudo()
and populate that cache entry. Since a new cache entry is always created,
it will not be visible to any other process, so that there will be no
possibility of lock conflicts and deadlocks. In other words, the logic
of proc_get_inode() can be modified to be:
struct inode *proc_get_inode(struct super_block *sb, struct proc_dir_entry *de)
{
    struct inode *inode = new_inode_pseudo(sb);

    if (inode) {
        inode->i_ino = de->low_ino;
        ...
        /* Populate fields in newly allocated cache entry
           pointed to by 'inode' */
        ...
    } else
        pde_put(de);
    return inode;
}
Here, it is worth noting that the kernel uses two different allocation
schemes for the inodes under /proc: one scheme that is generally
employed for inodes under the /proc/PID directories and
another for the inodes in the remainder of /proc. Linus's patch
affects only inode allocations for entries in the second category. However,
as a consequence of the implementation history, whereby /proc/net
was migrated to /proc/PID/net, the inodes under
/proc/PID/net are allocated in the same fashion as inodes
outside /proc/PID, and so the patch also affects those
inodes.
In the subsequent commit
message, Linus noted that the patch could have been refined so that the
new behavior was applied only to directory entries, rather than all
entries, under /proc. However, in the interests of keeping the
change simple, no such differentiation was made.
The effect of Linus's patch is to prevent multiple locks (and thus
deadlocks) on the same inode. Al agreed that the change should not be a
problem from a correctness perspective. On the other hand, this change also
has the effect of nullifying the benefits of inode caching for
/proc files outside /proc/PID. Al wondered about
the performance impact of that change. However, some casual instrumentation
of the kernel suggested that the benefits
of inode caching for /proc are low in any case. In addition, Dave
reported that with the fix applied,
Trinity was no longer hitting the deadlock problem.
This article was contributed by Dan Magenheimer
Amdahl's law tells us that there is always a bottleneck in any computing
system. Historically,
the bottleneck in many workloads on many systems has been the CPU, and
so system designers have made CPUs faster and more efficient and
also continue to increase the number of CPU cores even in low-end
systems. So now, increasingly, RAM is
the bottleneck; CPUs wait idly while data is moved
from disk to RAM and back again. Adding
more RAM is not always a timely or cost-effective option and sometimes
not an option at all. Faster I/O buses and solid-state disks reduce
the bottleneck but don't eliminate it.
Wouldn't it be nice if it were possible to increase the effective amount
of data stored in RAM? And, since those CPUs are waiting anyway,
perhaps we could use those spare CPU cycles to contribute towards
that objective? This is the goal of in-kernel compression: We keep more
data — compressed — in RAM and use otherwise idle CPU cycles to execute
compression and decompression algorithms.
With the recent posting of zswap, there are
now three in-kernel compression
solutions proposed for merging into the kernel memory management (MM)
subsystem: zram, zcache, and zswap. While a casual observer might
think that only one is necessary, the three have significant differences
and may target different user bases. So, just as there are many filesystems
today co-existing in the kernel, someday there may be multiple
compression solutions. Or maybe not... this will be decided by Linus
and key kernel developers. To help inform that decision,
this article compares and contrasts the different solutions, which
we will call, for brevity, the "zprojects". We will first introduce
some of the key concepts and challenges of compression. Then
we will identify three design layers and elaborate on the different
design choices made by the zprojects. Finally, we will discuss how
the zprojects may interact with the rest of the kernel and then
conclude.
Compression basics
For in-kernel compression to work, the kernel must take byte sequences
in memory and compress them, and then keep the compressed version in RAM
until some point in the future when the data is needed again.
While the data is in a compressed state, it is impossible to
read any individual bytes from it or write any individual bytes
to it. Then, when the data is needed again,
the compressed sequence must be decompressed so that individual bytes
can again be directly accessed.
It is possible to compress any number of sequential bytes but
it is convenient to use a fixed "unit" of compression. A
standard storage unit used throughout the kernel is a "page"
which is made up of a fixed constant PAGE_SIZE bytes (4KB on
most architectures supported by Linux). If this page is aligned at a
PAGE_SIZE address boundary, it is called a "page frame";
the kernel maintains a corresponding "struct page" for every
page frame in system RAM. All three zprojects use a page as
the unit of compression and allocate and manage page frames
to store compressed pages.
There are many possible compression algorithms. In general,
achieving a higher "compression ratio" takes a larger number
of CPU cycles, whereas less-effective compression can be done faster.
It is important to achieve a good balance between time and
compression ratio. All three zprojects, by default, use the
LZO(1X) algorithm in the kernel's lib/ directory, which is
known to deliver a good balance. However, it is important that
the choice of algorithm remain flexible; possibly the in-CPU
algorithm might be replaced in some cases by an architecture-specific
hardware compression engine.
In general, the number of cycles required to compress, and to
later decompress, a sequence of bytes is roughly proportional
to the number of bytes in the sequence. Since the size of a page
is fairly large, page compression and decompression are expensive
operations and, so, we wish to limit the number of those operations. Thus
we must carefully select which pages
to compress, choosing pages that are likely to be used again but
not likely to be used in the near future, lest we spend all the
CPU's time repeatedly compressing and then decompressing pages.
Since individual bytes in a compressed page are inaccessible,
we must not only ensure that normal CPU linear byte addressing
is not attempted on any individual byte, but also ensure that the
kernel can clearly identify the compressed page so that it can
be found and decompressed when the time comes to do so.
When a page is compressed, the compression algorithm is applied
and the result is a sequence of bytes which we can refer to as a
"zpage".
The size of a zpage, its "zsize", is highly data-dependent,
hard to predict, and highly variable. Indeed the expected distribution of
zsize drives a number of key zproject design decisions so it
is worthwhile to understand it further. The "compression ratio"
for a page is zsize divided by PAGE_SIZE. For nearly all pages,
zsize is less than PAGE_SIZE so the compression ratio is less than one,
but odd cases may occur where compression actually results in
the zsize being larger than PAGE_SIZE. A good compression solution must
have a contingency plan to deal
with such outliers. At the other extreme, a data page containing
mostly zeroes or ones may compress by a factor of 100 or more.
For example, LZO applied to an all-zero page results in a
zpage with zsize equal to 28, for a compression ratio of 0.0068.
As a rough rule of thumb, "on average", compression has a ratio of about
0.5 (i.e. it reduces a page of data by about a factor of two),
so across a wide set of workloads, it is useful to envision the distribution
as a bell curve, centered at PAGE_SIZE/2.
We will refer to zpages with zsize less than PAGE_SIZE/2 as "thin"
zpages, and zpages with zsize greater than PAGE_SIZE/2 as "fat" zpages.
For any given workload, of course, the distribution may "skew thin"
(if many mostly-zero pages are compressed, for example) or "skew fat",
as is true if the data is difficult to compress (already-compressed data
like a JPEG image would be an example).
A good compression
solution must thus be able to handle zsize distributions from a wide range of
workloads.
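For the curious, the zsize of any given page is easy to measure from
user space with the same LZO1X algorithm. The following sketch uses
the userspace liblzo2 library (link with -llzo2) rather than the
in-kernel implementation; it compresses one all-zero page and
classifies the result as thin or fat:

    #include <stdio.h>
    #include <lzo/lzo1x.h>

    #define PAGE_SIZE 4096

    int main(void)
    {
        static unsigned char page[PAGE_SIZE];     /* zero-filled */
        /* Worst case: LZO output can slightly exceed the input size. */
        static unsigned char zpage[PAGE_SIZE + PAGE_SIZE / 16 + 64 + 3];
        static lzo_align_t wrkmem[(LZO1X_1_MEM_COMPRESS +
                  sizeof(lzo_align_t) - 1) / sizeof(lzo_align_t)];
        lzo_uint zsize;

        if (lzo_init() != LZO_E_OK)
            return 1;
        if (lzo1x_1_compress(page, PAGE_SIZE, zpage, &zsize,
                             wrkmem) != LZO_E_OK)
            return 1;
        /* An all-zero page should report a zsize of about 28,
           i.e. a compression ratio of about 0.0068. */
        printf("zsize=%lu ratio=%.4f (%s)\n", (unsigned long) zsize,
               (double) zsize / PAGE_SIZE,
               zsize < PAGE_SIZE / 2 ? "thin" : "fat");
        return 0;
    }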
With this overview and terminology, we can now compare and
contrast the three zprojects. At a high level, each uses
a "data source layer" to provide pages to compress and a
"data management layer" to organize and store the zpages.
This data management layer has two sublayers, the first
for data organization, which we will call the "metadata layer",
and the second for determining how best to utilize existing
kernel memory and kernel allocation code (such as alloc_page())
to optimally store the zpages, which we will call the
"zpage allocator layer".
Data source layer — overview
As previously noted, the nature of compression limits the
types and quantities of pages to which it can be applied.
It doesn't make sense to compress a page for which there is
a high probability of imminent reuse. Nor does it make
sense to compress a page for which there is a high probability
the data it contains will never again be reused. All three
zprojects assume one or more data sources that provide
a "pre-qualified" sequence of data pages. In one zproject,
an existing kernel subsystem is the data source. For
the other two zprojects, previous kernel hooks added as part of
the "transcendent memory" projects
known as cleancache and frontswap,
source the data page stream.
Each data source must also provide a unique identifier,
or "key", for each page of data so that the page, once stored
and associated with that unique key, can be later retrieved by
providing the key.
It's also important to note that different data sources
may result in streams of pages with dramatically different
contents and thus, when compressed, different
zsize distributions.
Swap/anonymous pages as a data source
The Linux swap subsystem provides a good opportunity for compression
because, when the system is under memory pressure, swap acts as a
gatekeeper between (a) frequently used anonymous pages which are kept
in RAM, and (b) presumably less frequently used anonymous pages that
"overflow", or can be "swapped", to a very-much-slower-than-RAM swap disk.
If, instead of using a slow swap disk, we can cause the gatekeeper to
store the overflow pages in RAM, compressed, we may be
able to avoid swapping them entirely. Further, the swap subsystem already
provides a unique key for each page so that the page
can be fetched from the swap disk.
So, unsurprisingly, all three zprojects consider swap pages as
an ideal data source for compression.
One zproject, zram, utilizes the existing swap device model. A zram swap
device is explicitly created and enabled in
user space (via mkswap and swapon) and this zram device is prioritized
(presumably highest) against other configured swap devices. When the swap subsystem sends
a data page to the zram device, the page goes through the
block I/O subsystem to the zram "driver". All swap pages sent
to zram are compressed and associated with a key created by concatenating
the swap device identifier and page offset. Later, when the swap subsystem
determines (via a page fault) that the page must be brought back
in from the swap device to fill a specific page frame, the zram
device is notified; it then decompresses the zpage matching the swap
device number and page offset into that page frame.
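Configuration follows the usual swap-device pattern; the sysfs names
below match 3.x-era staging kernels but are shown for illustration
only:

    # modprobe zram
    # echo $((256*1024*1024)) > /sys/block/zram0/disksize
    # mkswap /dev/zram0
    # swapon -p 100 /dev/zram0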
The other two zprojects, zcache and zswap, use the kernel's frontswap
hooks (available since Linux 3.5) in the swap subsystem. Frontswap
acts as a "fronting store" — a type of a cache — to an existing
swap device. It avoids the block I/O subsystem completely;
swapped pages instead go directly to
zcache/zswap where they are compressed. On the fetch path, when the
page fault occurs, the block I/O subsystem is again bypassed and then,
as with zram, the page is decompressed directly into the faulting page frame.
There are some subtle but important side effects of the different
approaches:
-
The block I/O subsystem is designed to work with fixed-size block
devices and is not very happy when a block write results in a failure.
As a result, when a poorly-compressible page ("very fat" zpage) is
presented to zram, all PAGE_SIZE bytes of the uncompressed page are
stored by zram in RAM, for no net space savings. The frontswap-based
zprojects have more flexibility; if a very fat zpage is presented,
zcache/zswap can simply reject the request, in which case the swap subsystem
will pass on the original data page to the real swap disk (a store
hook of this kind is sketched after this list).
-
Since zram presents itself as a swap disk, user-space configuration is
required, but this also means that zram can work on systems with no
physical swap device, a configuration common for embedded Linux
systems. On the other hand, zcache and zswap depend entirely on
already-configured swap devices and so are not functional unless at
least one physical swap device is configured and sized adequately. As
a result, one might think of zram as being a good match for an
embedded environment, while the others are better suited for a more
traditional server/data-center environment.
-
It is important to note that all three zprojects must never discard
any of these data pages unless explicitly instructed, otherwise
user-space programs may experience data loss. Since all three
zprojects have a finite capacity on any given system, it is prudent to
have a "pressure relief valve". Both zcache and zswap have the
ability to "writeback" data to the swapdisk to which the data was
originally intended to be written; this cannot be done with zram.
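To make the rejection path concrete, here is a sketch of a frontswap
store hook turning away a very fat zpage, loosely following the
frontswap interface of this era; compress_page() and stash_zpage()
are hypothetical helpers, and the three-quarters threshold is an
arbitrary example:

    /* Sketch: returning non-zero rejects the page, after which the
       swap subsystem writes the original page to the real swap
       device instead. */
    static int zbackend_store(unsigned type, pgoff_t offset,
                              struct page *page)
    {
        unsigned int zsize;
        void *zpage = compress_page(page, &zsize);    /* hypothetical */

        if (zpage == NULL || zsize > PAGE_SIZE * 3 / 4)
            return -1;     /* "very fat": let the swap disk have it */
        return stash_zpage(type, offset, zpage, zsize); /* hypothetical */
    }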
Clean page cache pages as a data source
It is not uncommon for the vast majority of page frames in a running
system to contain data pages that are identical
to pages already existing on
a disk in a filesystem. In Linux, these pages are stored in
the page cache; the data in these pages is kept in RAM in anticipation
that it will be used again in the future.
When memory pressure occurs and the kernel is looking to free
RAM, the kernel will be quick to
discard some of this duplicate "clean" data because it can be fetched from
the filesystem later if needed.
A "cleancache" hook, present in the kernel since Linux 3.0, can
divert the data in these page frames, sending the pages to zcache
where the data can be compressed and preserved in RAM. Then, if the
kernel determines the data is again needed in RAM, a page fault
will be intercepted by another cleancache hook
which results in the decompression of the data.
Because modern filesystems can be large, unique identification
of a page cache page is more problematic than is unique identification of
swap pages. Indeed, the key provided via cleancache must uniquely
identify: a filesystem; an "exportable" inode value representing a file
within that filesystem; and a page offset within the file. This
combination requires up to 240 bits.
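The kernel's cleancache interface captures the per-file portion of
this key in a structure along these lines (paraphrased from
include/linux/cleancache.h; the pool id identifying the filesystem
and the page offset are passed alongside it):

    #define CLEANCACHE_KEY_MAX 6

    /* Big enough for an NFS-style exportable file handle. */
    struct cleancache_filekey {
        union {
            ino_t ino;                            /* the simple case */
            __u32 fh[CLEANCACHE_KEY_MAX];
            char key[CLEANCACHE_KEY_MAX * sizeof(__u32)];
        } u;
    };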
Zcache was designed to handle cleancache pages, including the full
range of required keys. As a result, the data management layer
is much more complex, consisting of a combination of different data
structures which we will describe below. Further, the location
of some cleancache hooks in the VFS part of the kernel results
in some calls to the data management layer with interrupts disabled,
which must be properly handled.
Even more than with swap data, filesystem data can proliferate
rapidly and, even compressed, can quickly fill up RAM. So,
as with swap data, it is prudent to have a pressure-relief valve
for page cache data. Unlike swap data, however, we know that
page cache data can simply be
dropped whenever necessary. Since zcache
was designed from the beginning to manage page cache pages, its
data management layer was also designed to efficiently handle
the eviction of page cache zpages.
Zram can maintain entire fixed-size filesystems in compressed form
in RAM, but not as a cache for a non-RAM-based filesystem. In essence,
part of RAM can be pre-allocated as a "fast disk" for a filesystem;
even filesystem metadata is compressed and stored in zram.
While zram may be useful for small filesystems that can entirely fit
in RAM, the caching capability provided via cleancache combined with
the compression provided by zcache is useful even for large filesystems.
Zswap is singularly focused on swap and so does not use the page
cache or filesystem data as a data source at all. As a result, its
design is simpler because it can ignore the complexities required
to handle the tougher page cache data source.
The metadata layer
Once the zproject receives a data page for compression, it must
set up and maintain data structures so that the zpage can be stored
and associated with a key for that page, and then later found
and decompressed. Concurrency must also be considered to ensure
that unnecessary serialization bottlenecks do not occur. The three
zprojects use very different data structures for different reasons,
in part driven by the requirements of data sources to be maintained.
Since the number of swap devices in a system is small, and the
number of pages in each is predetermined and fixed, a small, unsigned
integer representing the swap device combined with a page offset is sufficient to identify
a specific page. So zram simply uses a direct table lookup (one
table per configured zram swap device) to find and manage its data.
For example, for each data page stored, the zram table contains a
zsize, since zsize is useful for decompression validation, as well
as information to get to the stored data. Concurrency is restricted,
per-device, by a reader-writer semaphore.
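Conceptually, the per-device table looks something like the following
sketch; the field names are illustrative rather than zram's exact
structure:

    /* One slot per page offset in the zram swap device; lookup is
       a direct index: entry = &zram->table[offset]; */
    struct ztable_entry {
        unsigned long handle;   /* where the allocator put the zpage */
        u16 zsize;              /* compressed size, checked on load  */
        u8 flags;               /* e.g. "this slot is a zero page"   */
    };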
Zswap must manage the same swap-driven address space, but it utilizes
one dynamic red-black tree (rbtree) per swap device
to manage its
data. During tree access, a spinlock on the tree must be held; this
is not a bottleneck because simultaneous access within any one
swap device is greatly limited by other swap subsystem factors.
Zswap's metadata layer is intentionally minimalist.
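A zswap tree node, by contrast, must carry its own key. An
illustrative sketch of such an rbtree entry (again, the field names
are not zswap's exact structure):

    /* Keyed by page offset within one swap device's rbtree. */
    struct zentry {
        struct rb_node rbnode;
        pgoff_t offset;         /* the key: offset in the swap device */
        unsigned long handle;   /* where the allocator put the zpage  */
        unsigned int zsize;
    };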
Zcache must manage both swap data and pagecache data, and the latter
has a much more extensive key space which drives the size and
complexity of zcache data structures. For each filesystem,
zcache creates a "pool" with a hash table of red-black tree root
pointers; each node in the rbtree represents a filesystem
inode — the inode space in an exportable filesystem is
often very sparsely populated which lends itself to management
by an rbtree. Then each rbtree node contains the head of a radix
tree — the data structure used throughout the kernel for page
offsets — and this radix tree is used to look up a specific page
offset. The leaf of the radix tree points to a descriptor which,
among other things, references the zpage.
Each hash table entry has a spinlock to manage concurrency; unlike
swap and frontswap, filesystem
accesses via cleancache may be highly concurrent so contention must
be aggressively avoided.
Impact of writeback on the metadata layer
Earlier it was noted that zcache and zswap support "writeback";
this adds an important twist to the data structures in that,
ideally, we'd like to write back pages in some reasonable order, such as
least recently
used (LRU). To that end, both zcache and
zswap maintain a queue to order the stored data; zswap queues
each individual zpage and zcache queues the page frames used to store
the zpages. While lookup requires a key and searches the data
structures from the root downward, writeback chooses one or more zpages from
the leaves of the data structure and then must not only remove or
decompress the data, but must also remove its metadata from the leaf upward.
This requirement has two ramifications: First, the leaf nodes of
data structures must retain the key along with information necessary
to remove the metadata. Second, if writeback is allowed to execute
concurrently with other data structure accesses (lookup/insert/remove),
challenging new race conditions arise which must be understood
and avoided.
Zswap currently implements writeback only as a side effect of
storing data (when kernel page allocation fails), which limits the
possible race conditions. Zcache implements writeback as a shrinker
routine that can be run completely independently and, thus,
must handle a broader set of potential race conditions. Zcache must also
handle this same broader set when discarding page cache pages.
Other metadata layer topics
One clever data management technique is worth mentioning here: Zram
deduplicates "zero pages" — pages that contain only zero bytes —
requiring no additional storage. In some zsize distributions,
such pages may represent a significant portion of the zpages
to be stored. While the data of the entire page may need to be
scanned to determine if it is a zero page, this overhead may
be small relative to that of compression and, in any case,
it pre-populates the cache with bytes that the compression algorithm
would need anyway. Zcache and zswap should add this feature.
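The check itself is a simple word-at-a-time scan, along the lines of
the following sketch (not zram's exact code):

    /* Sketch: detect an all-zero page by scanning it word by word;
       the scan also warms the CPU cache with the very bytes the
       compressor would otherwise have to read. */
    static bool page_is_zero_filled(const void *mem)
    {
        const unsigned long *p = mem;
        size_t i;

        for (i = 0; i < PAGE_SIZE / sizeof(*p); i++)
            if (p[i])
                return false;
        return true;
    }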
More sophisticated deduplication might also be useful. Patches are
welcome.
Zpage allocator layer
Memory allocation for zpages can present some unique challenges.
First, we wish to maximize the number of zpages stored in a given
number of kernel page frames, a ratio which we will call "density".
Maximizing the density
can be challenging as we do not know a priori the number of
zpages to be stored, the distribution of zsize of those zpages,
or the access patterns which insert and remove those zpages. As a result,
fragmentation can be a concern. Second, allocating memory to store
zpages may often occur in an environment of high-to-severe memory
pressure; an effective allocator must handle frequent failure and
should minimize large (i.e. multiple-contiguous-page) allocations. Third,
if possible, we wish the allocator to support some mechanism to enable
ordered writeback and/or eviction as described above.
One obvious alternative is to use an allocator already present and
widely used in the kernel: kmalloc(), which is optimized for
allocations that nicely fit into default sizes of 2N byte chunks.
For allocations of size 2N+x, however, as much as 50% of the
allocated memory is wasted. kmalloc() provides an option for
"odd" sizes but this requires pre-allocated caches, often of
contiguous sets of pages ("higher order" pages) which are difficult
to obtain under memory pressure. Early studies showed that kmalloc()
allocation failures were unacceptably frequent and density was insufficient,
so kmalloc() was discarded and other options were pursued, all based
on alloc_page(), which always allocates a single page frame.
To improve density, the xvmalloc allocator was written, based on
the two-level sequential fit (TLSF) algorithm. Xvmalloc provided
high density
for some workloads and was the initial default allocator for zram and for
the frontswap-driven portion of the initial zcache. It was found, however,
that the high density led to poor fragmentation characteristics,
especially for zsize distributions that skewed fat. Xvmalloc has
since been discarded (though remnants of it survive through Linux 3.8).
A new allocator, zbud, was written for the cleancache-driven portion of the
original zcache code. The initial version of zbud had less focus on high
density and more on the ability to efficiently evict cleancache-provided
zpages. With zbud, no more than two zpages are contained
in a page frame, buddying up a pair of zpages, one fat and one thin.
Page frames, rather than zpages, are entered into an LRU queue so
that entire page frames can be easily "freed" to the system.
On workloads where zsize skews thin, a significant amount of space
is wasted, but fragmentation is limited, and eviction is still easy.
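The pairing can be pictured with a sketch like this (illustrative,
not the actual zbud metadata):

    /* A zbud page frame holds at most two zpages: one packed from
       the front of the frame, one from the back. Whole frames, not
       zpages, go on the LRU list used for eviction. */
    struct zbud_frame {
        struct list_head lru;        /* frame-level LRU             */
        unsigned int first_size;     /* zpage packed from the front */
        unsigned int last_size;      /* zpage packed from the back  */
    };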
Zsmalloc, an allocator intended to "rule them all", was then written.
Zsmalloc takes the best of kmalloc() but with the ability to provide
the complete range
of "size classes" suitable to store zpages; zsmalloc also never requires
higher-order allocations. All zpages with zsize within
a "class" (e.g. between 33-48 bytes) are grouped together and stored
in the same "zspage". A clever feature (especially useful for fat zpages)
allows discontiguous page frames to be "stitched" together, so that
the last bytes in the first page frame contain the first part of the zpage
and the first bytes of the second page frame contain the second part.
A group of N of these discontiguous pages is allocated on demand
for each size class, with N chosen by the zsmalloc code
to optimize density. For example, grouping four page frames into one
zspage for a 3264-byte class stores five zpages with only 64 bytes
left over, where separate page frames would each waste 832 bytes.
As a result of these innovations, zsmalloc achieves great density
across a wide range of zsize distributions: balanced, fat, and thin.
Alas, zsmalloc's strengths also prove to be its Achilles heel:
high density and page crossings make it very difficult for eviction
and writeback to concurrently free page frames, resulting in a different
kind of fragmentation (and consequent lower density) and rendering it
unsuitable for management of cleancache-provided pages in zcache.
So zbud was revised to utilize some of the techniques pioneered
by zsmalloc.
Zbud is still limited to two zpages per page frame, but when under
heavy eviction or writeback, its density compares favorably
with zsmalloc. The decision by zcache to use zbud for swap pages
sadly led to a fork of zcache, first to "old zcache" (aka zcache1)
and "new zcache" (aka zcache2) and then to a separate zproject
entirely, zswap, because the author prefers the density of zsmalloc
over the ability to support cleancache pages and the predictability
and page frame-reclaim of zbud.
So, today, there are two zpage allocators, both still evolving.
The choice for best zpage allocator depends
on use models and unpredictable workloads. It may be possible that
some hybrid of the two will emerge, or perhaps some yet unwritten
allocator may arise to "rule them all".
Memory management subsystem interaction
In some ways, a zproject is like a cache for pages temporarily removed
from the MM subsystem. However it is a rather unusual cache in that,
to store its data, it steals capacity (page frames) from its backing
store, the MM subsystem. From the point-of-view of the MM subsystem,
the RAM is just gone, just as if the page frames were absorbed by
a RAM-voracious device driver. This is a bit unfortunate, given that
both the zproject and the MM subsystem are respectively, but separately,
managing compressed and uncompressed anonymous pages (and possibly
compressed and uncompressed page cache pages as well).
So it is educational to consider how
a zproject may "load balance" its memory needs with the MM subsystem,
and with the rest of the kernel.
Currently, all three zprojects obtain page frames using the standard
kernel alloc_page()
call, with the "GFP flags" parameter chosen to ensure that the kernel's
emergency reserve pages are never used.
So each zproject must be (and is) resilient to the fairly frequent
situation where an alloc_page() call returns failure.
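A typical allocation attempt therefore looks something like the
following sketch; the exact flag combination varies by zproject and
is shown here only for illustration:

    /* Sketch: ask for one page frame, without touching the
       emergency reserves (__GFP_NOMEMALLOC), without retrying hard
       (__GFP_NORETRY), and without a warning splat on failure
       (__GFP_NOWARN). The caller must cope with NULL. */
    struct page *frame = alloc_page(__GFP_NOWARN | __GFP_NORETRY |
                                    __GFP_NOMEMALLOC);
    if (frame == NULL)
        return -ENOMEM;    /* common under heavy memory pressure */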
Zram makes no other attempt to limit its total use of page frames but,
as we've seen,
it is configured by the administrator as a swap device, and swap parameters do
limit the number of swap pages, and thus zpages, that can be accepted.
This indirectly
but unpredictably limits the number of page frames used. Also,
the configuration is independent of and not cognizant of the total RAM
installed in the system, so configuration errors are very possible.
Zswap uses a slightly modified version of zsmalloc to track the
total number of page frames allocated. It ensures this number does
not exceed a certain percent of total RAM in the system, and this
limit is configurable via sysfs. As previously noted, using
writeback, zswap can reduce the number of (anonymous) zpages stored
and this, indirectly, can reduce the number of page frames,
though at the likely cost of fragmentation.
Zcache and zbud were designed from the beginning with much more dynamic
constraints on page frame utilization in mind, intended eventually to
flexibly mesh with the dramatically time-varying memory balancing algorithms
that the MM subsystem uses.
While the exact interaction model and best future control policy are not
yet determined, zcache's and zbud's underlying page frame-based writeback
and eviction mechanisms support a shrinker-like interface that can
be told to, for example, reduce the number of page frames used for
storing anonymous zpages from 10000 to 8937 and the number of page frames
used for storing page cache zpages from 3300 to 463, and zcache can do that.
Since zbud is much more predictable (two zpages per page frame), the
MM subsystem can estimate and manage both the total number of anonymous
and pagecache pages (compressed and uncompressed) in the system and the
total number of page frames used to manage them.
Status and conclusions
Zram, zcache, and zswap are all advancing the concept of in-kernel compression
in different ways. Zram and zcache are in-tree but both still in
the staging tree. While the staging tree has served to expose
the concept of in-kernel compression to a wide group of kernel
developers and even to the users of a few leading-edge distributions,
as Andrew Morton says,
the staging tree "is where code goes to be ignored."
Staging tree maintainer Greg Kroah-Hartman has become reluctant to accept any new
zproject functionality into the staging tree. This creates a
conundrum for these zprojects: evolution in the staging tree
has resulted in rapid improvements in the design and implementation
and exposure of the zprojects, but because of this evolution
they are not stable enough for promotion into the core kernel
and further evolution is now stymied.
Zswap is attempting to circumvent the staging tree entirely by
proposing what is essentially a simplified frontswap-only fork of
zcache using zsmalloc instead of zbud, for direct merging into the
MM subsystem. Since it is also much simpler than zcache, it has
garnered more reviewers
which is a valuable advantage for to-be-merged code. But it is
entirely dependent on the still-in-staging zsmalloc, does not
support page cache pages at all, and does not support page-frame-based
writeback, so its simplicity comes at a cost. If zswap is
merged, it remains to be seen if it will ever be extended
adequately.
So, in-kernel compression has clear advantages and real users.
It remains unclear though when (or if) it will be merged
into the core kernel and how it should interact with
the core kernel. If you are intrigued, the various zproject
developers encourage your ideas and contributions.
Acknowledgments
Nitin Gupta was the original author of compcache and I
was the original author of transcendent memory. While both projects
were initially targeted to improve memory utilization in a multi-tenant
virtualized environment, their application to compression in a single
kernel quickly became apparent. Nitin designed and wrote the Linux
code for zram, xvmalloc, and zsmalloc, and wrote an early prototype
implementation of zcache. I wrote frontswap and cleancache (with the
cleancache hook placement originally done by Chris Mason), and the
Linux code for zcache, zbud, and ramster.
Seth Jennings contributed
a number of improvements to zcache and is the author of zswap.
Kernel mm developer Minchan Kim was a helpful influence for all the
zprojects, and none of the zprojects would have been possible without
Greg Kroah-Hartman's support — and occasional whippings — as
maintainer of the staging drivers tree.
As with any open-source project, many others have contributed ideas,
bug fixes, and code improvements and we are thankful for everyone's
efforts.