


Kernel development

Brief items

Kernel release status

The current 2.6 prepatch is 2.6.24-rc1, released by Linus on October 23. As Linus noted when tagging the release:

The patch is big. Really big. You just won't believe how vastly hugely mindbogglingly big it is. I mean you may think it's a long way down the road to the chemist, but that's just peanuts to how big the patch from 2.6.23 is.

See the article below for this week's update on what was merged before the window closed. There's also the short-form changelog for a list of patches or the long changelog for all the details.

As of this writing, no patches have been merged into the mainline git repository since the -rc1 release. There have also been no -mm releases over the last week.

For older kernels: 2.6.20.21 was released on October 17 with a few dozen patches including a couple of security fixes. 2.6.16.56-rc1 came out on October 20; it has a dozen or so patches, again with a couple of security fixes.


Kernel development news

Quotes of the week

mb() is the new lock_kernel(). Sigh.

-- Andrew Morton

So I *really* don't want to throw any stones in a glass house here. Quite the reverse. I'd like to get rid of some of the glass, and replace it with padding. Because you all know we'd all fit better in a padded room than a glass house..

-- Linus Torvalds

I was in bugfix-only mode from a week prior to 2.6.24 release and during the merge window. Partly caused by the already-idiotic amount of stuff we had queued for 2.6.24, partly because we needed to concentrate on stabilising the 2.6.25 patchpile rather than writing new stuff.

And partly to send the signal that rather than beavering away on new features all the time, we should also be spending some (more) time testing, reviewing and bugfixing the current and soon-to-be-current code.

-- Andrew Morton

I don't currently know of any common piece of hardware in use today that is not supported on Linux. And since these vendors do not know, and I don't, I'm asking the world to help out.

-- Greg Kroah-Hartman looks for driver projects


New maintainers for the x86 architecture

At the kernel summit in September, Andi Kleen, the maintainer of the i386 and x86_64 architecture code, stated that he would not maintain that code if it was merged into the unified x86 architecture. He appears not to have changed his mind on that score; a patch merged for 2.6.24 states that the x86 maintainers will be Thomas Gleixner, Ingo Molnar, and H. Peter Anvin. The x86 code is clearly in good hands, but it is sad to see Andi bow out; we owe him a lot of thanks for maintaining, for so long, the architectures that most of us use.


Merged for 2.6.24, part 2

By Jonathan Corbet
October 24, 2007

The 2.6.24 merge window has now closed; more than 7000 changesets were merged before 2.6.24-rc1 was released. The bulk of the new features for 2.6.24 were described last week. Here's a summary of patches merged since then, starting with user-visible changes:

There are new drivers for Marvell Libertas 8385/6 wireless chips, Freescale 3.0Gbps SATA controllers, Fujitsu laptops (LCD brightness in particular), and TI AR7 watchdog devices.
Another set of old Open Sound System drivers has been removed from the kernel.
The "uninitialized block groups" feature has been merged into the ext4 filesystem. UBG helps to speed filesystem checks by keeping track of which parts of a disk partition have never been used, and, thus, do not require checking.
As was discussed back in August, the binary sysctl() interface has been marked deprecated, and the code for many of the sysctl targets (much of which appears to not have worked for some time) has been removed. There is a new checker which looks for problematic sysctl definitions; according to Eric Biederman, "As best as I can determine all of the several hundred errors spewed on boot up now are legitimate."
The semantics of the CAP_SETPCAP capability have been changed. In previous kernels, this capability gave a process the ability to bestow new capabilities upon another process; now, instead, it allows a process to set capabilities within its own "inherited" mask.
Process CPU time accounting (via taskstats) has been augmented with information allowing CPU usage time to be scaled by CPU frequency.
The Control Groups (formerly process containers) patch set has been merged. Control groups will allow the CFS group scheduling feature to be used; it will also be the control mechanism used for containers in general.
Process ID namespaces have been added; this feature lets container implementations create a different view of the list of processes on the system for every container.
The kernel markers patch set has been merged.
The CIFS filesystem now has access control list (ACL) support.
The old, unmaintained Fibre Channel support code has been removed.

Changes visible to kernel developers include:

The process of merging the i386 and x86_64 architectures continues, with many files having been merged by the time the window closed. This job is far from complete, though. For the curious, this message from Ingo Molnar talks a bit about what is going on there. "The x86 architecture is the most common Linux architecture after all - and users care much more about having a working kernel than they care about cleanups and unifications.... This cannot be realistically finished in v2.6.24, without upsetting the codebase."
The paravirt_ops structure has been split into several smaller, more specialized operations vectors. These include pv_init_ops (boot-time operations), pv_time_ops (for time-related operations), pv_cpu_ops (privileged instructions), pv_irq_ops (interrupt handling), pv_mmu_ops (page table management), and a few others.

Some new bit operations have been added:

    int test_and_set_bit_lock(unsigned long nr, unsigned long *addr);
    void clear_bit_unlock(unsigned long nr, unsigned long *addr);
    void __clear_bit_unlock(unsigned long nr, unsigned long *addr);

These operations are intended to be used in the creation of single-bit locks; they work without the need for any additional memory barriers.

There is a new KERN_CONT priority level for printk(). It is, in fact, empty; it is meant to serve as a marker for printk() calls which continue a previous (not terminated with a newline) printed line.
The watchdog device drivers have been moved to a new home at drivers/watchdog.
A notifier mechanism for console events has been added; this feature is aimed at accessibility tools (like Speakup) which need to know when something has changed on the console display.
The filesystem export operations, used to make filesystems available over protocols like NFS, have been reworked. Two new methods (fh_to_dentry() and fh_to_parent()) replace the old get_dentry() interface. There is a new structure (struct fid) used to describe file handles. This work is aimed at making the export interface easier to use and (eventually) supporting 64-bit inode numbers.
The virtio patches - providing an infrastructure for I/O into and out of virtualized guests - have been merged.
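The single-bit lock operations listed above (test_and_set_bit_lock() and clear_bit_unlock()) can be approximated in user space to show the intended usage pattern. What follows is only a sketch built on C11 atomics, not the kernel implementation: the sketch_ and bit_ names are hypothetical, the _Atomic pointer type differs from the kernel prototypes, and the acquire/release orderings are an assumption about how "no additional memory barriers" is achieved.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in: set bit nr and return its previous value.
 * The acquire ordering keeps the critical section from moving above it. */
static int sketch_test_and_set_bit_lock(unsigned long nr,
                                        _Atomic unsigned long *addr)
{
    unsigned long mask = 1UL << (nr % BITS_PER_WORD);
    unsigned long old = atomic_fetch_or_explicit(&addr[nr / BITS_PER_WORD],
                                                 mask, memory_order_acquire);
    return (old & mask) != 0;
}

/* Hypothetical stand-in: clear bit nr with release ordering, so writes
 * made inside the critical section are visible before the lock is free. */
static void sketch_clear_bit_unlock(unsigned long nr,
                                    _Atomic unsigned long *addr)
{
    unsigned long mask = 1UL << (nr % BITS_PER_WORD);
    atomic_fetch_and_explicit(&addr[nr / BITS_PER_WORD], ~mask,
                              memory_order_release);
}

/* Try to take the lock held in bit nr; returns 1 on success, 0 if busy. */
static int bit_trylock(unsigned long nr, _Atomic unsigned long *word)
{
    return !sketch_test_and_set_bit_lock(nr, word);
}

/* Release the lock held in bit nr; the bit must currently be set. */
static void bit_unlock(unsigned long nr, _Atomic unsigned long *word)
{
    assert(atomic_load(&word[nr / BITS_PER_WORD]) &
           (1UL << (nr % BITS_PER_WORD)));
    sketch_clear_bit_unlock(nr, word);
}
```

A caller would spin or back off on bit_trylock() until it returns 1, touch the protected data, then call bit_unlock(); no explicit barrier appears anywhere in the caller.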

Now the stabilization period begins.


Various topics, all related to interrupts

By Jonathan Corbet
October 24, 2007

An interrupt handler is the portion of a device driver which is charged with responding to interrupts from the hardware; at a minimum it should shut the hardware up and initiate any processing which needs to be performed. When your editor worked on the second edition of Linux Device Drivers, the prototype for interrupt handlers looked like this:

    void handler(int irq, void *dev_id, struct pt_regs *regs);

The kernel development process is not particularly kind to book authors who, as a rule, prefer to see the ink dry on their creations before the text becomes obsolete. True to form, the handler prototype has changed a couple of times since LDD2, with the result that the 2.6.23 version looks like:

    irqreturn_t handler(int irq, void *dev_id);

Along the way, interrupt handlers gained a return type (used to tell the kernel whether an interrupt was actually processed or not) and lost the processor registers argument. One would think that this interface (along with those who attempt to document it) had suffered enough, but, it seems, there will be no rest in the near future.

In particular, Jeff Garzik has proposed that the irq argument be removed from the interrupt handler prototype. There are very few interrupt handlers which actually use that argument currently. And, as it turns out, most of the remaining handlers do not actually need it; they are often using the interrupt number to identify the interrupting device, but the dev_id pointer already exists for just that purpose. Still, getting this patch into the kernel would require a significant amount of work, since every in-tree interrupt handler will have to be audited and fixed up.
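To illustrate why the irq argument is so often redundant, here is a small user-space mock of a handler with the 2.6.23 prototype. It is only a sketch: struct sketch_device and sketch_handler() are hypothetical, and irqreturn_t is mocked as an enum rather than taken from kernel headers. The handler finds its device through dev_id alone, which is also what lets two devices sharing one interrupt line be told apart.

```c
#include <assert.h>
#include <stddef.h>

/* Mocked return type; the real definition lives in kernel headers. */
typedef enum { IRQ_NONE = 0, IRQ_HANDLED = 1 } irqreturn_t;

/* Hypothetical per-device state, registered as dev_id. */
struct sketch_device {
    int irq;           /* interrupt line the device was assigned */
    int pending;       /* stands in for a hardware status register */
    unsigned handled;  /* number of interrupts serviced so far */
};

/* Handler with the 2.6.23 prototype. The irq argument is ignored:
 * everything needed to identify the device comes from dev_id. */
static irqreturn_t sketch_handler(int irq, void *dev_id)
{
    struct sketch_device *dev = dev_id;

    (void)irq;
    assert(dev != NULL);

    if (!dev->pending)
        return IRQ_NONE;   /* not our interrupt (shared line) */

    dev->pending = 0;      /* "shut the hardware up" */
    dev->handled++;
    return IRQ_HANDLED;
}
```

If both devices in a test sit on the same line, calling the handler once per registered dev_id is enough to determine which one raised the interrupt; the line number adds no information.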

So Jeff is taking it slowly; this is not a patch set which is aimed at being merged for 2.6.24. Before it goes in, there is room for a lot of useful work cleaning up the current use of the irq argument in drivers, all of which would ease the eventual transition to the new call. Handlers which really need the IRQ number can call the new get_irqfunc_irq() function. But, says Jeff, "I am finding a ton of bugs in each get_irqfunc_irq() driver, so I would rather patiently sift through them, and push fixes and cleanups upstream." Quite a few interrupt handler fixes resulting from this work have already been posted.

Eric Biederman worries that converting all of the drivers could be a challenge; he has posted a proposal which would create two different interrupt registration and handler interfaces, allowing drivers which really need the IRQ number to continue to receive it. Jeff is confident that the extra structure will not be necessary, though. Thomas Gleixner, instead, would like to see the patches merged immediately, but it is almost certain that this patch set will be given one more development cycle to mature before going into the mainline.

Alexey Dobriyan, meanwhile, would like to fix up the interrupt-safe spinlock interface. Most code which requires a spinlock in the presence of interrupts calls:

    void spin_lock_irqsave(spinlock_t *lock, unsigned long flags);

The flags variable is used by the (architecture-specific) code to save any interrupt state which may be needed when spin_unlock_irqrestore() is called. The problem with this interface is that it is not particularly type-safe. Developers have been known to use an int type instead of unsigned long; that usage will generate no errors and it will work fine on the x86 architecture. It will, however, fail in ugly ways on some other architectures.

So Alexey would like to turn flags into a new type (irq_flags_t). This type would initially be defined to be unsigned long, so the change would not break compilation. It would be annotated, though, so that the sparse utility could point out all of the places where spin_lock_irqsave() is called with an incorrect type. In the more distant future, when the changeover is complete, architecture maintainers would be able to redefine the type to whatever works best on their systems, be it a structure or a single byte.
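The end state described here, where the flags type is opaque, can be sketched in user space. Everything below is hypothetical: the struct form of irq_flags_t is just one shape an architecture might choose, and sketch_irq_save() and sketch_irq_restore() are mocks operating on a plain variable rather than on real interrupt state. The point is only that, once the type is a structure, handing in an int no longer compiles.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical opaque form of the proposed type. */
typedef struct { unsigned long bits; } irq_flags_t;

/* Mock of the CPU's interrupt-enable state. */
static bool sketch_irqs_enabled = true;

/* Record the current state, then disable "interrupts". */
static irq_flags_t sketch_irq_save(void)
{
    irq_flags_t flags = { .bits = sketch_irqs_enabled ? 1UL : 0UL };

    sketch_irqs_enabled = false;
    return flags;
}

/* Put back whatever state was recorded by the matching save. */
static void sketch_irq_restore(irq_flags_t flags)
{
    assert(!sketch_irqs_enabled);  /* restore pairs with an earlier save */
    sketch_irqs_enabled = (flags.bits != 0);
}

/*
 * With the struct definition, this mistake is rejected by the compiler
 * instead of silently truncating on some architectures:
 *
 *     int flags = 0;
 *     sketch_irq_restore(flags);    // error: incompatible type
 */
```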

Andrew Morton had a mixed response to the patch:

Yes, it's always been ugly that we use unsigned long for this rather than abstracting it properly.

However I'd prefer that we have some really good reason for introducing irq_flags_t now. Simply so that I don't needlessly spend the next two years wrestling with literally thousands of convert-to-irq_flags_t patches and having to type "please use irq_flags_t here" in hundreds of patch reviews.

As an alternative, it was suggested that most calls of spin_lock_irqsave() should be changed to spin_lock_irq() instead. The latter version disables interrupts without saving the previous state; the accompanying spin_unlock_irq() call will then unconditionally re-enable interrupts. Those functions can be made to work, but only if it is known that interrupts will not have already been disabled when spin_lock_irq() is called. Otherwise the spin_unlock_irq() call risks enabling interrupts when some other part of the kernel expects them to still be disabled. The resulting random behavior is generally seen as undesirable by most computer users. So, in other words, spin_lock_irqsave() is a safer interface, which is why there is not a great deal of support for removing it. The prospect of well-intentioned kernel janitors changing code to spin_lock_irq() without really understanding the broader context is just too scary.
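The hazard can be made concrete with a user-space mock in which "interrupts" are a single boolean. This is a sketch only: the mock_ and helper_ functions are hypothetical stand-ins for the spinlock calls named above and model nothing but the interrupt-enable state, not the lock itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Mock of the CPU's interrupt-enable flag. */
static bool irqs_on = true;

/* Models spin_lock_irq()/spin_unlock_irq(): no state is saved, and
 * the unlock side re-enables interrupts unconditionally. */
static void mock_lock_irq(void)   { irqs_on = false; }
static void mock_unlock_irq(void) { irqs_on = true; }

/* Models spin_lock_irqsave()/spin_unlock_irqrestore(). */
static unsigned long mock_lock_irqsave(void)
{
    unsigned long flags = irqs_on ? 1UL : 0UL;

    irqs_on = false;
    return flags;
}

static void mock_unlock_irqrestore(unsigned long flags)
{
    assert(flags == 0UL || flags == 1UL);
    irqs_on = (flags != 0UL);
}

/* Two versions of the same helper, differing only in the lock variant. */
static void helper_irq(void)
{
    mock_lock_irq();
    mock_unlock_irq();
}

static void helper_irqsave(void)
{
    unsigned long flags = mock_lock_irqsave();

    mock_unlock_irqrestore(flags);
}

/* A caller that has already disabled interrupts invokes the helper and
 * reports whether interrupts are still off when the helper returns. */
static bool still_disabled_after(void (*helper)(void))
{
    unsigned long flags = mock_lock_irqsave();
    bool off;

    helper();
    off = !irqs_on;
    mock_unlock_irqrestore(flags);
    return off;
}
```

With helper_irqsave() the caller's assumption holds; with helper_irq() interrupts come back on in the middle of the caller's critical section, which is the failure the paragraph above warns about.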

Finally, there was a discussion involving synchronize_irq() which illustrates just how hard it can be to get a handle on race conditions on multiprocessor systems. This function:

    void synchronize_irq(unsigned int irq);

is intended to help coordinate actions between a driver's interrupt and non-interrupt code. At its core, it is a simple loop:

    while (desc->status & IRQ_INPROGRESS)
        cpu_relax();

In other words, synchronize_irq() will busy-wait until it is known that no handlers are running for the given interrupt. The idea is that any interrupt handler which might have been running before the call to synchronize_irq() will have completed when that function returns. The typical usage pattern is something like this:

    some_important_flag = a_new_value;
    synchronize_irq(irq);
    /* Code which depends on IRQ handler seeing a_new_value here */

With code like this, after the synchronize_irq() call, any interrupt handler will be guaranteed to see a_new_value - or so people think.

The problem is that contemporary processors will happily reorder memory operations to avoid pipeline stalls and improve performance; the "What every programmer should know about memory" series currently being serialized by LWN describes these issues in detail. What is relevant here is that the change to some_important_flag might be reordered (delayed) such that it does not become visible to other processors on the system until sometime after synchronize_irq() returns. During the window when the change is not visible, the promise of synchronize_irq() is not kept - an interrupt handler could run and see the old value, possibly creating mayhem as a result. That is the sort of obscure, one-in-a-billion race condition which keeps kernel hackers up at night.

Actually, kernel hacking and coffee keep kernel hackers up at night, but your editor's point should be clear.

Benjamin Herrenschmidt, upon finding this race, attempted to fix it with a memory barrier. After some discussion, though, it became clear that the memory barrier was not sufficient. Barriers can affect the order in which operations become visible, but they cannot, in the absence of corresponding barriers on another processor, guarantee that a specific change becomes visible to that processor at any given time. That sort of guarantee requires the use of a locked operation which forces synchronization between processors - the sort of operation which is typically used to implement spinlocks.

So the real solution appears to be this patch by Linus Torvalds and Herbert Xu. The while loop shown above persists in the new version, and it continues to run with no locks held - holding the interrupt descriptor lock when the interrupt subsystem may want it could lead to deadlocks. But, once it appears that no handlers are running, the descriptor lock is acquired and the status is checked one more time. If no handlers are running, the synchronize operation is complete; otherwise the code goes back to busy-waiting. The acquisition of the descriptor lock guarantees that memory barriers will have been executed on both sides of any potential race condition; that, in turn, will force the ordering of the memory operations. So, with this change in place, synchronize_irq() will truly synchronize with IRQ handlers and one more difficult race condition will have been eliminated.
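The control flow described above (wait without the lock, then take the lock and check again) can be sketched in user space. This is not the patch itself: struct sketch_irq_desc, the atomic_flag standing in for the descriptor spinlock, the mocked IRQ_INPROGRESS value, and the rechecks counter are all assumptions made so the example is self-contained.

```c
#include <assert.h>
#include <stdatomic.h>

#define IRQ_INPROGRESS 0x1u   /* mocked status bit */

/* Hypothetical stand-in for the kernel's interrupt descriptor. */
struct sketch_irq_desc {
    atomic_uint status;   /* IRQ_INPROGRESS is set while a handler runs */
    atomic_flag lock;     /* stands in for the descriptor spinlock */
    unsigned rechecks;    /* instrumentation: locked checks performed */
};

static void sketch_desc_init(struct sketch_irq_desc *d)
{
    atomic_init(&d->status, 0u);
    atomic_flag_clear(&d->lock);
    d->rechecks = 0;
}

static void sketch_desc_lock(struct sketch_irq_desc *d)
{
    while (atomic_flag_test_and_set_explicit(&d->lock, memory_order_acquire))
        ;   /* spin until the lock is free */
}

static void sketch_desc_unlock(struct sketch_irq_desc *d)
{
    atomic_flag_clear_explicit(&d->lock, memory_order_release);
}

static void sketch_synchronize_irq(struct sketch_irq_desc *d)
{
    unsigned int status;

    do {
        /* Busy-wait with no lock held, as in the original loop. */
        while (atomic_load_explicit(&d->status, memory_order_relaxed)
               & IRQ_INPROGRESS)
            ;   /* the kernel would call cpu_relax() here */

        /* Taking and releasing the lock forces ordering against any
         * handler path that takes the same lock. */
        sketch_desc_lock(d);
        status = atomic_load_explicit(&d->status, memory_order_relaxed);
        sketch_desc_unlock(d);
        d->rechecks++;

        /* If a handler started in the meantime, go back to waiting. */
    } while (status & IRQ_INPROGRESS);
}
```

In the uncontended case the function performs exactly one locked check and returns; the outer loop only repeats when a handler slipped in between the lockless wait and the lock acquisition.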


LSM: loadable or static?

By Jake Edge
October 24, 2007

The ever-contentious Linux Security Modules (LSM) API is being debated once again on linux-kernel. This time the question is not its removal, which Linus Torvalds came down firmly against, but whether it should allow security modules to be loaded dynamically. As part of 2.6.24, Torvalds merged a patch to convert LSM into a static interface, but has indicated a willingness to revert it. The key sticking point is whether there are real security modules that require the ability to be runtime-loaded.

A complaint by Thomas Fricaccia about the change caused Torvalds to put out a call for folks using module loading with their LSM code. The patch could be reverted if there are "real-world" uses for that ability. Torvalds again questions the sanity of security developers, but is clearly looking for someone to step up:

I'd like to note that I asked people who were actually affected, and had examples of their real-world use to step forward and explain their use, and that I explicitly mentioned that this is something we can easily re-visit.

Jan Engelhardt responded with information about his MultiAdmin module, which allows multiple root users on a system, each with their own UID. This allows separate tracking of file ownership, resource usage and the like for each administrator. MultiAdmin also allows for the creation of sub-administrators who can perform some root activities for processes and files owned by a subset of users. The use case he cites is for professors being allowed to administer their students' accounts without getting full root privileges.

James Morris, who proposed the static LSM change, responded that MultiAdmin seemed to qualify as a real-world use under Torvalds's criteria. Though it is not clear that MultiAdmin requires a loadable interface, it does use it. The venerable root_plug security module – which only allows root processes to start if a particular USB device is plugged in – also implements loading and unloading. In both cases, configuration could be done via sysfs parameters with an enable flag to turn them on or off.

To some extent, for the examples offered so far, loading is a convenience for administrators, but the major users for unloading are developers. Crispin Cowan sums it up:

Why would you want to dynamically unload a module: because it is convenient for debugging. Ok, so it is unsafe, and sometimes wedges your kernel, which sometimes forces you to reboot. With this patch in place, it forces you to *always* reboot when you want to try a hack to the module.

Other justifications for leaving the LSM loadable interface in the kernel have been less compelling. It is hard to imagine that the US Sarbanes-Oxley regulation would allow loading security modules into a running kernel, but not allow the kernel to be rebuilt, as Fricaccia suggested. Inserting proprietary security modules that are provided by the vendor in a binary-only form seems foolhardy – this kind of potential abuse is the kind of hole Morris's patch was meant to close – but could be seen as a reason to allow LSM loading.

A compromise may have been found in a patch posted by Arjan van de Ven, which converts LSM to be either static or loadable depending on a compile-time kernel option. A consensus seems to be building that this is a reasonable approach, allowing distributions and users to decide for themselves whether they will allow security modules to be loaded. As of this writing, Torvalds has not weighed back in with a decision and the newly released 2.6.24-rc1 kernel has the static patch.

Dynamic loading of security modules is a potential source of problems – what better place for a rootkit to hide? – but there are valid reasons that someone might want to use it. Linux strives to be open to many uses, including some that the kernel hackers might find distasteful; dynamic security modules would seem to be one of those uses.


Patches and updates

Kernel trees