A better btrfs
By Jonathan Corbet
January 15, 2008
Chris Mason has recently released Btrfs v0.10, which contains a
number of interesting new features. In general, Btrfs has come a long way
since LWN first wrote about
it last June. Btrfs may, in a few years, be the filesystem most of us
are using - at least, for those of us who will still be using rotating
storage then. So it bears watching.
Btrfs, remember, is an entirely new filesystem being developed by Chris
Mason. It is a copy-on-write system which is capable of quickly creating
snapshots of the state of the filesystem at any time. The snapshotting is
so fast, in fact, that it is used as the Btrfs transactional mechanism,
eliminating the need for a separate journal. It supports subvolumes -
essentially the existence of multiple, independent filesystems on the same
device. Btrfs is designed for speed, and also provides checksumming for
all stored data.
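To see why copy-on-write makes snapshots nearly free, consider the
following toy sketch in C. It is not Btrfs's actual data structure,
just an illustration of the underlying idea: an update copies the
handful of nodes on the path from the root, so taking a snapshot
amounts to saving the old root pointer.

    /* Toy illustration (not Btrfs's actual data structures) of why
     * copy-on-write makes snapshots cheap: updates copy the nodes on
     * the path from the root, so a snapshot is a saved root pointer. */
    #include <stdio.h>
    #include <stdlib.h>

    struct node {
        int key, value;
        struct node *left, *right;
    };

    static struct node *mknode(int key, int value,
                               struct node *l, struct node *r)
    {
        struct node *n = malloc(sizeof(*n));
        n->key = key;
        n->value = value;
        n->left = l;
        n->right = r;
        return n;
    }

    /* Insert without touching the old version: only the nodes along
     * the search path are copied; everything else is shared. */
    static struct node *insert(struct node *root, int key, int value)
    {
        if (!root)
            return mknode(key, value, NULL, NULL);
        if (key < root->key)
            return mknode(root->key, root->value,
                          insert(root->left, key, value), root->right);
        if (key > root->key)
            return mknode(root->key, root->value,
                          root->left, insert(root->right, key, value));
        return mknode(key, value, root->left, root->right);
    }

    int main(void)
    {
        struct node *v1 = insert(insert(NULL, 1, 10), 2, 20);
        struct node *snap = v1;              /* "snapshot": save the root */
        struct node *v2 = insert(v1, 2, 99); /* copies root and one child */

        printf("snapshot: key 2 -> %d\n", snap->right->value);
        printf("current:  key 2 -> %d\n", v2->right->value);
        return 0;
    }

The snapshot still reads the old value after the update, because the
update wrote new nodes instead of overwriting the shared ones.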
Some kernel patches show up and quickly find their way into production
use. For example, one year ago, nobody (outside of the -ck list, perhaps) was talking
about fair scheduling; but, as of this writing, the CFS scheduler has been
shipping for a few months. KVM also went from initial posting to merged
over the course of about two kernel release cycles.
Filesystems do not work that way, though.
Filesystem developers tend to be a cautious, conservative bunch; those who
aren't that way tend not to survive their first few encounters with users
who have lost data. This is all a way of saying that, even though Btrfs is
advancing quickly, one should not plan on using it in any sort of
production role for a while yet. As if to drive that point home, Btrfs
still crashes the system when the filesystem runs out of space. The v0.10
patch, like its predecessors, also changes the on-disk format.
The on-disk format change is one of the key features in this version of the
Btrfs patch. The format now includes back references on almost all objects
in the filesystem. As a result, it is now easy to answer questions like
"to which file does this block belong?" Back references have a few uses,
not the least of which is the addition of some redundant information which
can be used to check the integrity of the filesystem. If a file claims to
own a set of blocks which, in turn, claim to belong to a different file,
then something is clearly wrong. Back references can also be used to
quickly determine which files are affected when disk blocks turn bad.
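As a rough illustration of how that redundancy can be exploited, here
is a small C sketch of a consistency check. The structures are
invented for this example; Btrfs's real extent back references are
considerably more involved.

    /* Sketch of an integrity check built on back references. The
     * structures are invented for illustration only. */
    #include <stdio.h>

    #define NBLOCKS 8

    struct block {
        int owner_inode;   /* back reference: which file claims this block */
    };

    struct file {
        int inode;
        int blocks[4];     /* forward references: blocks this file owns */
        int nblocks;
    };

    /* Verify that every block a file points to points back at the file. */
    static int check_file(const struct file *f, const struct block *blocks)
    {
        for (int i = 0; i < f->nblocks; i++) {
            int b = f->blocks[i];
            if (blocks[b].owner_inode != f->inode) {
                printf("inode %d claims block %d, but block %d says inode %d\n",
                       f->inode, b, b, blocks[b].owner_inode);
                return -1;
            }
        }
        return 0;
    }

    int main(void)
    {
        struct block blocks[NBLOCKS] = {
            [0] = { .owner_inode = 1 },
            [1] = { .owner_inode = 1 },
            [2] = { .owner_inode = 2 },   /* corrupted: should belong to 1 */
        };
        struct file f = { .inode = 1, .blocks = { 0, 1, 2 }, .nblocks = 3 };

        return check_file(&f, blocks) ? 1 : 0;
    }

Running the same check in the other direction - starting from a bad
block's back reference - is what makes it cheap to name the affected
files.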
Most users, however, will be more interested in another new feature which
has been enabled by the existence of back references: online resizing. It
is now possible to change the size of a Btrfs filesystem while it is
mounted and busy - this includes shrinking the filesystem. If the Btrfs
code has to give up some space, it can now quickly find the affected files
and move the necessary blocks out of the way. So Btrfs should work nicely
with the device mapper code, growing or shrinking filesystems as conditions
require.
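For the curious, today's kernels drive the resize through an ioctl on
the mounted filesystem. The following sketch assumes the modern
linux/btrfs.h interface, which postdates this article; it is roughly
what the command-line tools do.

    /* Minimal online-resize sketch, assuming the modern linux/btrfs.h
     * ioctl interface (which appeared after this article was written).
     * The size string follows the same convention as the btrfs tools:
     * "max", "+1g", "-2g", or an absolute byte count. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/btrfs.h>

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <mount-point> <size>\n", argv[0]);
            return 1;
        }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        struct btrfs_ioctl_vol_args args;
        memset(&args, 0, sizeof(args));
        strncpy(args.name, argv[2], sizeof(args.name) - 1);

        /* The filesystem stays mounted and usable while this runs. */
        if (ioctl(fd, BTRFS_IOC_RESIZE, &args) < 0) {
            perror("BTRFS_IOC_RESIZE");
            close(fd);
            return 1;
        }
        close(fd);
        return 0;
    }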
Another interesting feature in v0.10 is the associated in-place ext3
converter. It is now possible to non-destructively convert an existing
ext3 filesystem to Btrfs - and to go back if need be. The converter works
by stashing a copy of the ext3 metadata found at the beginning of the disk, then
creating a parallel directory tree in the free space on the filesystem. So
the entire ext3 filesystem remains on the disk, taking up some space but
preserving a fallback should Btrfs not work out. The actual file data is
shared between the two filesystems; since Btrfs does copy-on-write, the
original ext3 filesystem remains even after the Btrfs filesystem has been
changed. Switching to Btrfs forevermore is a simple matter of deleting the
ext3 subvolume, recovering the extra disk space in the process.
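The data sharing at the heart of the converter can be demonstrated
with the FICLONE ioctl, used here purely as a stand-in (it is a Linux
4.5+ interface; the converter builds its shared references directly).
Two files end up sharing the same extents, and a write to either one
is redirected to new blocks, leaving the other untouched.

    /* Demonstrates the copy-on-write data sharing the converter relies
     * on: two files share the same extents until one of them is written.
     * Both paths are placeholders on a mounted Btrfs filesystem. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>

    int main(void)
    {
        int src = open("/mnt/original", O_RDONLY);
        int dst = open("/mnt/clone", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (src < 0 || dst < 0) {
            perror("open");
            return 1;
        }

        /* Share extents instead of copying data; writes to either file
         * go to new blocks, leaving the other file intact. */
        if (ioctl(dst, FICLONE, src) < 0) {
            perror("FICLONE");
            return 1;
        }
        close(src);
        close(dst);
        return 0;
    }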
Finally, the copy-on-write mechanism can now be turned off with a mount option. For
certain types of workloads, copy-on-write just slows things down without
providing any real advantages. Since (1) one of those workloads is
relational database management, and (2) Chris works for Oracle, the
only surprise here is that this option took as long as it did to arrive.
If multiple snapshots reference a given file, though, copy-on-write is
still performed; otherwise it would not be possible to keep the snapshots
independent of each other.
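The option in question is nodatacow. A minimal sketch of setting it at
mount time via the mount(2) system call, equivalent to running
"mount -t btrfs -o nodatacow /dev/sdb1 /mnt" (device and mount point
are placeholders):

    /* Mounting Btrfs with copy-on-write disabled for data, via the
     * nodatacow mount option. Run as root; device and mount point
     * here are placeholders. */
    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        if (mount("/dev/sdb1", "/mnt", "btrfs", 0, "nodatacow") < 0) {
            perror("mount");
            return 1;
        }
        return 0;
    }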
For those who are curious about where Btrfs will go from here, Chris has
posted a
timeline describing what he plans to accomplish over the coming year.
Next on the list would appear to be "storage pools," allowing a Btrfs
filesystem to span multiple devices. Once that's in place, striping and
mirroring will be implemented within the filesystem. Longer-term projects
include per-directory snapshots, fine-grained locking (the filesystem
currently uses a single, global lock), built-in incremental backup support,
and online filesystem checking. Fixing that pesky out-of-space problem
isn't on the list, but one assumes Chris has it in the back of his mind
somewhere.
Especially those "rampant layering violations" remind me a lot of Solaris' ZFS. Actually, the feature set promised by btrfs sounds like a re-implementation of ZFS. Seeing that btrfs is being developed at Oracle makes the competition scenario between Oracle and Sun mentioned on the front page sound all the more plausible. I guess that means ZFS won't be easily portable to Linux anytime soon... (Isn't ZFS GPLv3 only?)
But who cares about error rate per disk?
Error rate per system administrator or per year or per gigabyte seems more interesting.
It would if the error rate you're using is disk errors per year per sysadmin, which is what we were talking about.
It underscores the point that there are lots of error rates you can define, and you have to pay attention to your denominators.
Nonetheless, your ideas about reducing errors per something by improving system administration methods are interesting.
That risk would be the same if you had 10 disks, each with one tenth the data and one tenth the error rate.
I believe that people are storing more and more data on a given system, and the fact that
p(error on this system) is going up should be of concern.
Now you're talking about error rate per system, not per disk.
And I'm not convinced that's important either. Spreading data out across 10 systems doesn't make the data loss hurt any less.
When the error requires taking the disk offline to fix it, I care. Until I can access all of a Tbyte disk in the time it takes to access, say, 200
Gbytes, the downtime per spindle will be greater, and downtime is what you really care about, for any problem.*
It's bad enough waiting for a fsck on a 100 GB partition.
* Well, data loss figures in there somewhere, but lost data typically hurts only a few users, while being offline affects everyone.
I note that your announcement and the timeline contain a feature
"data=ordered support", but it's not clear what kind of consistency
guarantee this entails (apart from "preventing null bytes in a file
after a crash"). Maybe you can implement in-order semantics as part
of the data=ordered support, or add it as another feature to the ToDo
list.