What ever happened to chunkfs?
This article was contributed by Valerie Aurora (formerly Henson)
"What ever happened to chunkfs?" This is a question I hear every few
months, and I always answer the same way, "Chunkfs works, the overhead
is reasonable, and it is only practical if it is part of the file
system design from the beginning, not tacked on after the fact. I
just need to write up the paper summarizing all the data." Thanks to
your benevolent LWN editor, you are now reading that paper.
Background
Before we describe chunkfs, we should first talk about the problems
that led to its creation. Chunkfs was motivated by the growing
difficulty of reading and processing all the data necessary to check
an entire file system. As the capacity of storage grows, the capacity
of the rest of the system to read and check that data is not keeping
pace with that growth. As a result, the time to check a "normal" file
system grows, rather than staying constant as it would if the memory,
bandwidth, seek time, etc. of the system grew in proportion to its
storage capacity. These differential rates of growth in hardware -
like the differences between RAM capacity, bandwidth, and latency -
are part of what keeps systems research interesting.
To understand the change in time to check and repair a file system, a
useful analogy is to compare the growth of my library (yes, some
people still own books) with the growth of my ability to read,
organize, and find the books in the library. As a child, my books fit
all on one shelf. I could find the book I wanted in a matter of
seconds and read every single one of my books in a week. As an adult,
my books take up several bookshelves (in addition to collecting on any
flat surface). I can read many times faster than I could when I was a
kid, and organize my books better, but finding a book can take several
minutes, and reading them all would take me a year or more. If I was
trying to find a twenty dollar bill I left in a book, it would take me
several hours to riffle through all the books to find it. Even though
I am a better reader now, my library grew faster than my ability to
read or search my library. Similarly, computers are faster than ever,
but storage capacity grew even faster.
The consequence of this phenomenon that first caught my attention was
the enormous disparity between seek time and capacity of disks. I
calculated a projected change in fsck time given
these predictions
[PDF]
from a storage industry giant in 2006:
Improvement in disk performance, 2006 - 2013

    Capacity:    16.0x
    Bandwidth:    5.0x
    Seek time:    1.2x

    => at least 10x increase in fsck time!
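To make the arithmetic behind that projection concrete, here is a rough
sketch in Python. The 50/50 split between seek-bound and bandwidth-bound
fsck work is an illustrative assumption, not the model behind the
original projection; the point is only that when capacity grows 16x
while seek time barely improves, any seek-heavy phase of fsck blows up.

    # Rough sketch of where a ">= 10x" fsck slowdown comes from. The split
    # between seek-bound and bandwidth-bound work is an assumed, illustrative
    # parameter, not a measured one.

    def projected_fsck_slowdown(capacity_x, bandwidth_x, seek_x,
                                seek_bound_fraction=0.5):
        """Relative change in fsck time when capacity grows by capacity_x
        while bandwidth and seek time improve by bandwidth_x and seek_x."""
        # The amount of work scales with capacity; the seek-bound part only
        # gets seek_x faster, the streaming part only bandwidth_x faster.
        seek_part = seek_bound_fraction * capacity_x / seek_x
        stream_part = (1 - seek_bound_fraction) * capacity_x / bandwidth_x
        return seek_part + stream_part

    # The 2006 -> 2013 projections quoted above:
    print(projected_fsck_slowdown(capacity_x=16.0, bandwidth_x=5.0, seek_x=1.2))
    # ~8.3x with a 50/50 split; a more seek-dominated fsck easily exceeds 10x.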
Since then, commodity flash-based storage finally became a practical
reality. Solid-state disks (SSDs) have much faster "seeks", somewhat improved bandwidth,
and drastically reduced capacity compared to disks, so the performance
of fsck ought to be extremely good. (I am unable to find any
measurements of fsck time on an SSD, but I would estimate it to be on
the order of seconds. Readers?) However, SSDs don't solve all our
problems. First, SSDs are an element in the cache hierarchy of a
system, layered between system RAM and disk, not a complete
replacement for disks. The sweet spot for SSDs will continue to
expand, but it will be many years or decades before disks are
completely eliminated - look at the history of tape. It's easy for a
laptop user to wave their hands and say, "Disks are obsolete," but try
telling that to Google, your mail server sysadmin, or even your
MythTV box.
Second, remember that the whole reason we care about fsck in the first
place is because file systems get corrupted. One important source of
file system corruption is failures in the storage hardware itself.
Ask any SSD manufacturer about the frequency of corruption on their
hardware and they will inundate you with equations, lab
measurements, and simulations showing that the flash in their SSDs will
never wear out in "normal" use - but they also won't reveal the
details of their wear-leveling algorithms or the workloads they used
to make their predictions. As a result, we get surprises in
real-world use, both in performance and in reliability. When it comes
to failure rates, we don't have good statistics yet, so I have to fall
back on my experience and that of my kernel hacker friends. What we
see in practice is that SSDs corrupt more often and more quickly than
disks, and prefer to devour the tastiest, most crucial parts of the
file system. You can trust the hardware manufacturers when they wave
their hands and tell you not to worry your pretty little head, but I
put more weight on the money I've made consulting for people with
corrupted SSDs.
So, if corruption is a concern, an important benefit of the chunkfs
approach is that file system corruption is usually limited to a small part
of the file system, regardless of the source of the
corruption. Several other repair-driven file system principles - such
as duplicating and checksumming important data - make recovery of data
after corruption much more likely. So if you don't care about fsck
time because you'll be using an SSD for the rest of your life, you
might still read on because you care about getting your data back from
your SSD.
The conclusion I came to is that file systems should be designed with
fast, reliable check and repair as a goal, from the ground up. I
outline some useful methods for achieving this goal in a short
paper: Repair-driven
File System Design [PDF].
Chunkfs design
Chunkfs is the unmusical name for a file system architecture designed
under the assumption that the file system will be corrupted at some
point. Therefore the on-disk format is optimized not only for
run-time performance but also for fast, reliable file system check and
repair. The basic concept behind chunkfs is that a single logical
file system is constructed out of multiple individual file systems
(chunks), each of which can be checked and repaired (fscked)
individually.
This is great and all, but now we have a hundred little file systems,
with the namespace and disk space fragmented amongst them - hardly an
improvement. For example, if we have a 100GB file system divided into
100 1GB chunks, and we want to create a 2GB file, it won't fit in a
single chunk - it has to somehow be shared across multiple chunks. So
how do we glue all the chunks back together again so we can share the
namespace and disk space while preserving the ability to check and
repair each chunk individually? What we want is a way to connect a
file or directory in one chunk and another file or directory in
another chunk in such a way that the connection can be quickly and
easily checked and repaired without a full fsck of the rest of the
chunk. The solution is something we named a "continuation inode". A
continuation inode "continues" a file or directory into another chunk.
The original inode and its continuation inode are linked by two
pointers: one pointing from the original inode to the continuation
inode, and another pointing back from the continuation inode to the
original inode. When
checking the second chunk, you can check the validity of the
continuation inode quickly, by using the back pointer to look up the
original inode in its chunk. This is a check of a "cross-chunk
reference" - any file system metadata that makes a connection between
data in two different chunks. Cross-chunk references must always have
forward and back pointers so that they can be verified starting from
either chunk.
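As a concrete (and heavily simplified) sketch of this check, the Python
below verifies a single continuation inode from its own chunk, using
only the back pointer and one lookup in the origin's chunk. The data
structures and the lookup_inode() helper are hypothetical stand-ins,
not the prototype's actual on-disk format.

    # Minimal sketch of checking one cross-chunk reference from the
    # continuation's side. The structures and lookup_inode() helper are
    # hypothetical, not the real chunkfs on-disk format.
    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class InodeRef:
        chunk_id: int
        inode_num: int

    @dataclass
    class ContinuationInode:
        back: InodeRef               # back pointer to the original inode

    @dataclass
    class OriginalInode:
        continuations: list = field(default_factory=list)   # forward pointers

    def check_continuation(cont_ref, cont, lookup_inode):
        """Verify one continuation inode; touches only the two chunks involved.

        lookup_inode(chunk_id, inode_num) reads a single inode from a chunk
        and returns None if it no longer exists.
        """
        origin = lookup_inode(cont.back.chunk_id, cont.back.inode_num)
        if origin is None:
            return "orphan"      # origin lost; the continuation must be deleted
        if cont_ref not in origin.continuations:
            return "mismatch"    # forward and back pointers disagree; repair
        return "ok"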
Cross-chunk references must satisfy one other requirement: You must be
able to quickly find all cross-references from chunk A to arbitrary
chunk B, without searching or checking the entire chunk A. To see why
this must be, we'll look at an example. Chunk A has an inode X which
is continued into chunk B. What if chunk A was corrupted in such a
manner that inode X was lost entirely? We would finish checking and
repairing chunk A, and it would be internally self-consistent, but
chunk B would still have a continuation inode for X that is now
impossible to look up or delete - an orphan continuation inode. So as
the second step in checking chunk A, we must then find all references
to chunk A from chunk B (and all the other chunks) and check that they
are in agreement. If we couldn't, we would have to search every other
chunk in the file system to check a single chunk - and that's not
quick.
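The sketch below shows why the per-chunk index matters: if each chunk
keeps its outgoing continuations bucketed by target chunk (the
dictionary here is a hypothetical stand-in for whatever structure a
real chunkfs would use), finding every reference into a damaged chunk
costs one lookup per chunk rather than a scan of every inode in the
file system.

    # Hypothetical sketch: each chunk indexes its outgoing cross-chunk
    # references by target chunk, so the references into a damaged chunk
    # can be found without scanning whole chunks.

    def refs_into_chunk(damaged_chunk, chunks):
        """chunks maps chunk_id -> {target_chunk_id: set of continuation refs}.

        Yields every cross-chunk reference pointing into damaged_chunk.
        With the index this is one dictionary lookup per chunk; without it,
        every inode in every other chunk would have to be read and checked.
        """
        for chunk_id, outgoing in chunks.items():
            if chunk_id != damaged_chunk:
                yield from outgoing.get(damaged_chunk, set())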
A chunkfs-style file system can be checked and repaired incrementally
and online, whereas most file systems must be checked all at once
while the file system is offline. Some exceptions to this general
rule do exist, such as the BSD FFS snapshot-based online check, and an
online self-healing mechanism for NTFS, but in general these
facilities are hacked on after the fact and are severely limited in
scope. For example, if the BSD online check actually finds a problem,
the file system must be unmounted and repaired offline, all at once,
in the usual way, including a rerun of the checking stage of fsck.
The chunkfs design requires solutions to several knotty problems that we
don't have room to cover in this article, but you can read about them in our
2006 Hot Topics in
Dependability paper:
Chunkfs: Using
divide-and-conquer to improve file system reliability and
repair [PDF].
Measurements
Repair-driven file system design, chunkfs in particular, sounds like a
neat idea, but as a file system developer, I've had a lot of neat
ideas that turned out to be impossible to implement, or had too much
overhead to be practical. The questions I needed to answer about
chunkfs were: Can it be done? If it can be done, will the overhead of
continuation inodes outweigh the benefit? In particular, we need to
balance the time it takes to check an individual chunk with the time
it takes to check its connections to other chunks (making sure that
the forward and back pointers of the continuation inodes agree with
their partners in the other chunk). First, let's take a closer look
at file system check time on chunkfs.
The time to check and repair a file system with one damaged chunk is
the sum of two different components, the time to check one chunk
internally and the time to check the cross references to and from this
chunk.
T_fs = T_chunk + T_cross
The time to check one chunk is a function of the size of the chunk,
and the size of the chunk is the total size of the file system divided
by the number of chunks.
T_chunk = f(size_chunk)
size_chunk = size_fs / n_chunks
The time to check the cross-chunk references to and from this chunk
depends on the number of those cross references. The exact number of
cross-chunk references will vary, but in general larger chunks will
have fewer cross-references and smaller chunks will have more - that
is, cross-chunk check time grows as the number of chunks grows.
T_cross = f(n_chunks)
So, per-chunk check time gets smaller as we divide the file system
into more chunks, and at the same time the cross-chunk check time
grows larger. Additionally, the extra disk space taken up by
continuation inodes grows as the number of chunks grows, as does the
overhead of looking up and following continuation inodes during normal
operation. We want to find a sweet spot where the sum of the time to
check a chunk and its cross-references is minimized, while keeping the
runtime overhead of the continuation inodes small.
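A toy model of this trade-off is sketched below. All the constants - the
internal check rate, the per-reference seek cost, and how fast
cross-references grow with the number of chunks - are made-up
illustrative numbers, not measurements; the only point is that T_chunk
shrinks and T_cross grows as chunks multiply, so their sum has a
minimum.

    # Toy model of the chunk-count trade-off. All constants are made up
    # for illustration; only the shape of the curve matters.

    def total_check_time(n_chunks, fs_size_gb=1000.0,
                         check_secs_per_gb=10.0,  # internal check rate (assumed)
                         seek_secs=0.01,          # one seek per cross-reference
                         refs_growth=2.0):        # cross-refs per chunk vs. n_chunks
        """T_fs = T_chunk + T_cross when a single chunk is damaged."""
        t_chunk = check_secs_per_gb * (fs_size_gb / n_chunks)  # T_chunk = f(size_chunk)
        t_cross = seek_secs * refs_growth * n_chunks           # T_cross = f(n_chunks)
        return t_chunk + t_cross

    best_n = min(range(1, 5001), key=total_check_time)
    print(best_n, total_check_time(best_n))   # sweet spot around ~700 chunks here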
Does such a sweet spot exist? The answer depends on the layout of
files and directories on disk in real life. If cross-chunk references
are extremely common, then the overhead of continuation inodes will
outweigh the benefits. We came up with an easy way to estimate the
number of cross-chunk references in a chunkfs file system: Take a
"real" in-use ext3 file system and for each file, measure the number
of block groups containing data from that file. Then, for each
directory, measure the number of block groups containing files from
that directory. If we add up, for each file and directory, the number
of block groups it touches minus one, we get the number of cross-chunk
references in a similar chunkfs file system with chunks the size of
the ext3 file system's block groups
(details
here).
Karuna Sagar wrote a tool to measure these cross-references,
called cref, and I added some code to do a worst-case
estimate of the time to check the cross-references (assuming one seek
for every cross-reference).
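The counting trick itself is simple enough to sketch in a few lines of
Python; the mapping of files and directories to block groups below is a
made-up example, and the one-seek-per-reference cost matches the
worst-case assumption above.

    # Sketch of the cref-style estimate: each file or directory contributes
    # (number of block groups it touches - 1) cross-chunk references. The
    # example layout is invented.

    def estimate_cross_refs(object_to_block_groups):
        """object_to_block_groups: name -> set of block group numbers used."""
        return sum(len(groups) - 1
                   for groups in object_to_block_groups.values() if groups)

    def worst_case_check_secs(cross_refs, seek_secs=0.01):
        # Worst case: one disk seek per cross-reference.
        return cross_refs * seek_secs

    layout = {"/var/log/syslog": {12, 13}, "/home/val": {3}, "/big.iso": {3, 4, 5}}
    refs = estimate_cross_refs(layout)
    print(refs, worst_case_check_secs(refs))   # 3 cross-references, 0.03 seconds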
The results
were encouraging; assuming disk hardware progresses as predicted, the
average cross-chunk reference checking time would be about 5
seconds in 2013, and the worst case would be about 160 seconds (about
2.5 minutes). This is with a 1GB chunk size, so the time to check the
chunk itself would be a few seconds. This estimate is worst-case in
another way: the ext3 allocator is in no way optimized to reduce
cross-chunk references. A chunkfs-style file system would have a more
suitable allocator.
Implementations
Chunkfs was prototyped three times. The first
prototype, written by Amit Gud
for his master's
thesis [PDF] in 2006-2007, implemented chunkfs as modifications
to the ext2 driver in FUSE. Note that the design of chunkfs described
in the thesis is out of date in some respects; see the chunkfs paper [PDF]
for the most recent version. He also ported this
implementation to the kernel in mid-2007, just for kicks. The results
were encouraging. Our main concern was that continuation inodes would
proliferate wildly, overwhelming any benefits. Instead, files with
continuation inodes were uncommon - 2.5% of all files in the file systems -
and no individual file had more than 10 continuation inodes. The test
workload included some simulation of aging by periodically deleting files
while filling the test file system. (It would be interesting to see the
results from using Impressions
[PDF], a very sophisticated tool for generating realistic file system
images.)
These implementations were good first steps, but they were based on an
earlier version of the chunkfs design, before we had solved some
important problems. In these implementations, any chunk that had been
written to since the file system was mounted had to be fscked after a
crash. Given that most of the file system is idle in common usage,
this reduced check time by about 2/3 in the test cases, but we are
looking for a factor of 100, not a factor of 3. These prototypes also
lacked a quick way to locate all of the references into a particular
damaged chunk, so they only checked the references leading out of the
chunk. They used an old version of the solution for hard links, which
would allow hard links to fail if the target's chunk ran out of space,
instead of growing a continuation for the target into a chunk that did
have space. In short, they were a good first step, but lacked the key
feature: drastically reduced file system repair time.
In mid-2007, I decided to write a prototype of chunkfs as a layered
file system, similar to unionfs or ecryptfs, that was completely
independent of the client file system used in each chunk.
Continuation inodes are implemented as regular files, using extended
attributes to store the continuation-related metadata (the forward and
back pointers and the offset and length of the file stored in this
continuation inode). When the file data exceeds an arbitrary size
(40K in my prototype), a continuation inode is allocated in another
chunk and the data after that point in the file is stored in that
inode's data. All of the continuation inodes emanating from a
particular chunk are kept in one directory so that they can be quickly
scanned during the cross-chunk pass of fsck.
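A rough sketch of that split-on-write behavior is below. The 40K
threshold matches the prototype, but the xattr names, the placement of
continuation files under a per-chunk "chunks" directory, and the
pick_other_chunk() policy are all hypothetical details of my own,
standing in for the real layered implementation (and os.setxattr() is
Linux-only).

    # Sketch of splitting a file into a continuation once it passes an
    # arbitrary threshold. The xattr keys, directory layout, and chunk
    # selection policy are invented stand-ins; os.setxattr() is Linux-only.
    import os

    SPLIT_THRESHOLD = 40 * 1024        # 40K, as in the prototype

    def write_with_continuation(path, data, pick_other_chunk):
        head, tail = data[:SPLIT_THRESHOLD], data[SPLIT_THRESHOLD:]
        with open(path, "wb") as f:
            f.write(head)
        if not tail:
            return
        cont_dir = os.path.join(pick_other_chunk(), "chunks")
        os.makedirs(cont_dir, exist_ok=True)
        cont_path = os.path.join(cont_dir, os.path.basename(path))
        with open(cont_path, "wb") as f:
            f.write(tail)
        # Forward and back pointers plus offset/length, kept in xattrs.
        os.setxattr(path, "user.chunkfs.continuation", cont_path.encode())
        os.setxattr(cont_path, "user.chunkfs.origin", path.encode())
        os.setxattr(cont_path, "user.chunkfs.offset", str(SPLIT_THRESHOLD).encode())
        os.setxattr(cont_path, "user.chunkfs.length", str(len(tail)).encode())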
To test that the prototype could recover from file system corruption
in one chunk without checking the entire file system, I implemented
fsck as a simple shell script. First, it fscks the damaged chunk by
running the ext2 fsck on it. This may end up deleting or moving
arbitrary files in the chunk, which could make it out of sync with the
other chunks. Second, it mounts the now-repaired file systems and
reads their /chunks directories to find all connections
to the damaged chunk and consistency check them. If it finds an
orphaned continuation - a continuation whose origin inode in the
damaged chunk was destroyed or lost - then it deletes that
continuation inode.
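In the same hypothetical terms as the sketch above, the two passes of
that script look roughly like this; the real prototype's script ran
e2fsck and then walked each chunk's continuation directory, but the
paths and xattr names here are my invented stand-ins.

    # Sketch of the two-pass repair: fsck the damaged chunk, then delete
    # orphaned continuations that point into it. Paths and xattr names are
    # the same invented ones as in the previous sketch.
    import os, subprocess

    def repair(damaged_dev, damaged_mount, all_chunk_mounts):
        # Pass 1: ordinary per-chunk fsck of the damaged chunk only.
        subprocess.run(["e2fsck", "-y", damaged_dev], check=False)
        # Pass 2: scan every chunk's continuation directory for references
        # into the damaged chunk whose origin no longer exists.
        for mount in all_chunk_mounts:
            cont_dir = os.path.join(mount, "chunks")
            if not os.path.isdir(cont_dir):
                continue
            for name in os.listdir(cont_dir):
                cont_path = os.path.join(cont_dir, name)
                try:
                    origin = os.getxattr(cont_path, "user.chunkfs.origin").decode()
                except OSError:
                    continue
                if origin.startswith(damaged_mount) and not os.path.exists(origin):
                    os.unlink(cont_path)    # orphan continuation: delete it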
The fsck script is deficient in one particular aspect: it checks all
the chunks because I didn't write the code to mark a chunk as damaged.
This is a difficult problem in general; sometimes we know that a chunk
has been damaged because the disk gave us an IO error, or we found an
inconsistency while using the file system, and then we mark the file
system as corrupted and needing fsck. But plenty of corruption is
silent - how can we figure out which chunk was silently corrupted? We
can't, but we can quietly "scrub" chunks by fscking them in the background.
Currently, ext2/3/4 triggers a paranoia fsck every N mounts or M days
since the last check. Unfortunately, this introduces an unexpected
delay of minutes or hours at boot time; if you're lucky, you can go
have a cup of coffee while it finishes; if you're unlucky, you'll be
figuring out how to disable it while everyone who has come to your
talk about improvements in Linux usability watches you. (N.B.: Reboot
and add "fastboot" to the kernel command line.) With chunkfs, we could
run fsck on a few chunks at every boot, adding a few seconds to every
boot but avoiding occasional long delays. We could also fsck inactive
chunks in the background while the file system is in use.
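One way to schedule that, sketched below, is a simple round-robin
cursor persisted across boots so every chunk eventually gets scrubbed;
the state file location and chunks-per-boot count are arbitrary
choices for illustration.

    # Sketch of round-robin chunk scrubbing: check a few chunks per boot,
    # remembering where we left off. The state file path and the number of
    # chunks per boot are arbitrary illustrative choices.

    def chunks_to_scrub(n_chunks, state_file="/var/lib/chunkfs/scrub_cursor",
                        per_boot=4):
        try:
            with open(state_file) as f:
                cursor = int(f.read().strip())
        except (OSError, ValueError):
            cursor = 0
        selected = [(cursor + i) % n_chunks for i in range(per_boot)]
        with open(state_file, "w") as f:
            f.write(str((cursor + per_boot) % n_chunks))
        return selected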
Results
I gave a talk at LCA in January 2008 about chunkfs which included a
live demo of
the final
chunkfs prototype in action: creating files, continuing them into
the next chunk, deliberately damaging a chunk, and repairing the file
system. Because I carefully prepared a typescript-based fallback in
case things went sideways, the demo went perfectly. A video
of the talk [Ogg Theora] is available.
Note that I have never been able to bring myself to watch this talk,
so I don't know if it contains any embarrassing mistakes. If you
want, you can follow along during the demo section with
the annotated
output from the demo.
The collection-of-file-systems approach ended up being more complex
than I had hoped. The prototype used ext2 as the per-chunk file
system - a reasonable choice for some throw-away code, but not
workable in production. However, once you switch to any reliable file
system, you end up with one journal per chunk, which is quite a lot of
overhead. In addition, it seemed likely we'd need another journal to
recover from crashes during cross-chunk operations. Another source of
complexity was the use of a layered file system approach - basically,
the chunkfs layer pretends to be a client file system when it's
talking to the VFS, and pretends to be the VFS when it's talking to
the per-chunk file system. Since none of this is officially
supported, we end up with a long list of hacks and workarounds, and
it's easy to run into surprise reference counting or locking bugs.
Overall, the approach worked for a prototype, but it didn't seem worth
the investment it would take to make it production quality, especially with
the advent
of btrfs.
Future work
When I began working on chunkfs, the future of Linux file systems
development looked bleak. The chunkfs in-kernel prototype looked like
it might be the only practical way to get repair-driven design into a
Linux file system. All that has changed; file system development is a
popular, well-funded activity and Linux has a promising
next-generation file system, btrfs, which implements many principles
of repair-driven file system design, including checksums, magic
numbers, and redundant metadata. Chunkfs has served its purpose as a
demonstration of the power of the repair-driven design approach and I
have no further development plans for it.
Conclusions
The three chunkfs prototypes and our estimates of cross-chunk
references using real-world file systems showed that the chunkfs
architecture works as advertised. The prototypes also convinced us
that it would be difficult to retrofit existing journaling file
systems to the chunkfs architecture. Features that make file system
check and repair fast and reliable work best when designed into the
architecture of the file system from the beginning. Btrfs is an
example of a file system
designed from the ground up with the goal of fast, reliable check and
repair.
Credits
Many people and organizations generously contributed to the design and
prototyping of chunkfs. Arjan van de Ven was the co-inventor of the
original chunkfs architecture. Theodore Ts'o and Zach Brown gave
invaluable advice and criticism while we were thrashing out the
details. The participants of the Hot Topics in Dependability Workshop
gave us valuable feedback and encouragement. Karuna Sagar wrote the
cross-reference measuring tool that gave us the confidence to go
forward with an implementation. Amit Gud wrote two prototypes while a
graduate student at Kansas State University. The development of the
third prototype was funded at various points by Intel, EMC,
and VAH Consulting. Finally,
too many people to list made valuable contributions during discussions
about chunkfs on mailing lists, and I greatly appreciate their help.
Maybe it's just me, but I don't
think I'd be happy with a file system for which there is no reasonable
fsck program.
If this is actually true, it is one more reason to bury Reiserfs deeply,
with a wooden stake driven through its heart.
Yup. They made one mistake, but that mistake was grave: they assumed they could trust reiserfs. I've seen a few cases where a tiny number of bad blocks killed reiserfs completely: nothing except this "cobble it together if you can" option worked, and the "cobbled together" filesystem was a mess (because there were some virtual machine images on that filesystem). Note: this is exactly the type of corruption SSDs show in the real world. If reiserfs's "gentle" fsck does not work and the "last resort" approach is unusable, then the only solution is to switch to another filesystem...
But bad blocks DO exist in real life - you cannot just ignore them! The reiserfs design never considered this facet of life: if you have ONE bad block in the wrong place, you are screwed 100%. When the sector size is 512 bytes and the HDD size is 2TiB, losing 0.0000000003% of your data means 100% of your stuff is lost. This is not even funny.
XFS is also not a good idea (I was bitten by it too) - but there we have a bad implementation, not a bad design. An implementation can be fixed; a design mistake is unfixable.
I've had a rock-solid way to reproduce this effect: run a bittorrent client on a 100% full filesystem. Sure, this is not a nice thing to do to a filesystem (and currently btrfs does not handle this case all that well), but stuff happens. If I cannot trust my filesystem in such conditions, how can I trust it at all?
It looks like the XFS problems are in the past, but trust is easy to lose and hard to resurrect - so now I'm firmly in the ext3 camp.
That's the theory, and it's quite plausible, if there are spare blocks
on the disk, but I have seen several drives (from different
manufacturers) with read errors that were also write errors, and none
where the error went away by writing. And it's not that these drives
had run out of spare blocks or something; the errors apparently were
caused by the head running amok in unusual power supply conditions.
It sounds to me like you need to discover write-intent bitmaps.
Such a bitmap is effectively a set of 'dirty' bits, one for each
chunk of the array (and you can choose the chunk size).
So if you set the chunk size to 50GB (I would probably set it a bit
smaller) you get the same functionality as you describe, only with
much less hassle.
So just create a raid1 or - if you have more than 2 drives - raid10,
and
mdadm --grow /dev/md0 --bitmap=internal --bitmap-chunk=1000000
and you will be much happier.
There's a good bit of FUD in the middle of the article when it talks about SSDs. Of course, it is certainly true that overused individual disk sectors in SSDs tend to fail. However, a fairly large percentage of these failures occur on write, meaning that you know something went wrong, but your old data is still there. In such situations, the corruption resilience properties of chunkfs are not really that useful.
If we now turn our attention to regular hard drives with mechanical platters, one of the common ways that such drives can fail involves the entire drive dying at once. (This can happen to SSDs too, but much more rarely.) Chunkfs won't really help very much in this case either. I should also add that mechanical failure of mechanical drives is more common than SSD failure. Moreover, when a mechanical drive fails for whatever reason, it almost always fails on read instead of (or in addition to) on write, meaning that you've lost old data. So, even though SSDs aren't perfect, they're still better than the alternatives if your goal is reliability. And even though the chunkfs concept is handy for a large class of disk failures, you're still better off with ext3 on an SSD than chunkfs (or any FS really) on a mechanical drive.