2.6.37 merge window, part 1

By Jonathan Corbet
October 27, 2010
The 2.6.36 kernel was released on October 20, and the 2.6.37 merge window duly started shortly thereafter. As of this writing, some 6450 changes have been merged for the next development cycle, with more surely to come. Some of the more significant, user-visible changes merged for 2.6.37 include:

  • The first parts of the inode scalability patch set have been merged, but, as of this writing, the core locking changes have not yet been pushed for inclusion. See this article for more information on the inode scalability work.

  • The x86 architecture now uses separate stacks for interrupt handling when 8K stacks are in use. The option to use 4K stacks has been removed.

  • The big kernel lock removal process continues; the core kernel is almost entirely BKL-free. There is now a configuration option which may be used to build a kernel without the BKL. File locking still requires the BKL, though; schemes are afoot to fix it before the close of the merge window, but this work is not yet complete. If file locking can be cleaned up, it will be possible for many (or most) users to run a BKL-free 2.6.37 kernel.

  • The "rados block device" has been added. RBD allows the creation of a special block device which is backed by objects stored in the Ceph distributed system.

  • The GFS2 cluster filesystem is no longer marked "experimental." GFS2 has also gained support for the fallocate() system call.

  • A new sysfs file, /sys/selinux/status, allows a user-space application to quickly notice when security policies have changed. The intended use is evidently daemons which cache the results of access-control decisions and need to know when those results might change. A separate file, called policy, has been added for those simply wanting to read the current policy from the kernel.

  • The scheduler now works harder to avoid migrating high-priority realtime tasks. The scheduler also will no longer charge processor time used to handle interrupts to the process which happened to be running at the time.

  • VMware's VMI paravirtualization support has been deprecated by the company and, as scheduled, removed from the 2.6.37 kernel.

  • Some hibernation improvements have been merged, including the ability to compress the hibernation image with LZO.

  • The ARM architecture has gained support for the seccomp (secure computing) feature.

  • The block layer can now throttle I/O bandwidth to specific devices, controlled by the cgroup mechanism. This is the second piece of the I/O bandwidth controller puzzle which allows the establishment of specific bandwidth limits which will be enforced even if more I/O bandwidth is available.

  • The new "ttyprintk" device allows suitably-privileged user space to feed messages through the kernel by way of a pseudo TTY device.

  • The kernel has gained support for the point-to-point tunneling protocol (PPTP); see the accel-pptp project page for more information.

  • The NFS client has a new "idmapper" implementation for the translation between user and group names and IDs. The new code is more flexible and performs better; see Documentation/filesystems/nfs/idmapper.txt for details.

  • There is a new -olocal_lock= mount option for the NFS client which can cause it to treat either (or both) of flock() and POSIX locks as local.

  • Most of the functions of the nfsservctl() system call have been deprecated and marked for removal in 2.6.40. There is a new configuration option for those who would like to remove this functionality ahead of time.

  • Simple support for the pNFS protocol has been merged.

  • Huge pages can now be migrated between nodes like normal memory pages.

  • There is the usual pile of new drivers:

    • Systems and processors: Flexibility Connect boards, Telechips TCC ARM926-based systems, Telechips TCC8000-SDK development kits, Vista Silicon Visstrim_m10 i.MX27-based boards, LaCie d2 Network v2 NAS boards, Qualcomm MSM8x60 RUMI3 emulators, Qualcomm MSM8x60 SURF eval boards, Eukrea CPUIMX51SD modules, Freescale MPC8308 P1M boards, APM APM821xx evaluation boards, Ito SH-2007 reference boards, IBM "SMI-free" realtime BIOS's, MityDSP-L138 and MityDSP-1808 systems, OMAP3 Logic 3530 LV SOM boards, OMAP3 IGEP modules, and taskit Stamp9G20 CPU modules.

    • Block: Chelsio T4 iSCSI offload engines.

    • Input: Roccat Pyra gaming mice, UC-Logic WP4030U, WP5540U and WP8060U tablets, several varieties of Waltop tablets, OMAP4 keyboard controllers, NXP Semiconductor LPC32XX touchscreen controllers, Hanwang Art Master III tablets, ST-Ericsson Nomadik SKE keyboards, ROHM BU21013 touch panel controllers, and TI TNETV107X touchscreens.

    • Miscellaneous: Freescale eSPI controllers, Topcliff platform controller hub devices, OMAP AES crypto accelerators, NXP PCA9541 I2C master selectors, Intel Clarksboro memory controller hubs, OMAP 2-4 onboard serial ports, GPIO-controlled fans, Linear Technology LTC4261 Negative Voltage Hot Swap Controller I2C interfaces, TI BQ20Z75 gas gauge ICs, OMAP TWL4030 BCI chargers, ROHM BH1770GLC and OSRAM SFH7770 combined ALS and proximity sensors, Avago APDS990X combined ALS and proximity sensors, Intersil ISL29020 ambient light sensors, and Medfield Avago APDS9802 ALS sensor modules.

    • Network: Brocade 1010/1020 10Gb Ethernet cards, Conexant CX82310 USB ethernet ports, Atheros AR9170 "otus" 802.11n USB devices, and Topcliff PCH Gigabit Ethernet controllers.

    • Sound: Marvell 88pm860x codecs, TI WL1273 FM radio codecs, HP iPAQ RX1950 audio devices, Native Instruments Traktor Kontrol S4 audio devices, Aztech Sound Galaxy AZT1605 and AZT2316 ISA sound cards, Wolfson Micro WM8985 and WM8962 codecs, Wolfson Micro WM8804 S/PDIF transceivers, Samsung S/PDIF controllers, and Cirrus Logic EP93xx AC97 controllers.

    • USB: Intel Langwell USB OTG transceivers, YUREX "leg shake" sensors, and USB-attached SCSI devices.

  • The old ieee1394 stack has been removed, replaced at last by the "firewire" drivers.
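As an illustration of the fallocate() support mentioned for GFS2 above, here is a minimal user-space sketch of preallocating file space. It uses the portable posix_fallocate() wrapper rather than the raw system call; on Linux that wrapper normally ends up in fallocate() when the filesystem supports it. The function name and the scratch-file path are made up for this example, and nothing here is specific to GFS2.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Preallocate `len` bytes for a fresh temporary file and return the
 * resulting file size, or -1 on any error. */
long preallocated_size(off_t len)
{
    char path[] = "/tmp/falloc-demo-XXXXXX";   /* hypothetical scratch file */
    struct stat st;
    long result = -1;

    assert(len > 0);

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;

    /* posix_fallocate() returns 0 on success or an error number. */
    if (posix_fallocate(fd, 0, len) == 0 && fstat(fd, &st) == 0)
        result = (long)st.st_size;

    close(fd);
    unlink(path);
    return result;
}
```

Calling preallocated_size(8192) should report a file of 8192 bytes even though nothing was ever written to it.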

Changes visible to kernel developers include:

  • The jump label optimization mechanism has been merged; its initial purpose is to reduce the overhead of inactive tracepoints.

  • Yet another RCU variant has been added: "tiny preempt RCU" is meant for uniprocessor systems. "This implementation uses but a single blocked-tasks list rather than the combinatorial number used per leaf rcu_node by TREE_PREEMPT_RCU, which reduces memory consumption and greatly simplifies processing. This version also takes advantage of uniprocessor execution to accelerate grace periods in the case where there are no readers."

  • New tracepoints have been added in the network device layer, places where sk_buff structures are freed, softirq_raise(), workqueue operations, and memory management LRU list shrinking operations. There is also a new script for using perf to analyze network device events.

  • The wakeup latency tracer now has function graph support.

  • There is a new mechanism for running arbitrary code in hardware interrupt context.

  • The power management layer now has a formal concept of "wakeup sources" which can bring the system out of a sleep state. Among other things, it can collect statistics to help the user determine what is keeping a system awake. Wakeup events can abort the freezing of tasks, reducing the time required to recover from an aborted suspend or hibernate operation.

  • A new mechanism for managing the automatic suspending of idle devices has been added.

  • There is a new set of functions for managing the "operating performance points" of system-on-chip components. (commit).

  • A long list of changes to the memblock (formerly LMB) low-level management code has been merged, and the x86 architecture now uses memblock for its early memory management.

  • The default handling for lseek() has changed: if a driver does not provide its own llseek() function, the VFS layer will cause all attempts to change the file position to fail with an ESPIPE error. All in-tree drivers which lacked llseek() functions have been changed to use noop_llseek(), which preserves the previous behavior.

  • There is a new way to create workqueues:

        struct workqueue_struct *alloc_ordered_workqueue(const char *name, 
                                                         unsigned int flags);
    

    Items submitted to the resulting workqueue will be run in order, one at a time. It's meant to eventually replace the old singlethreaded workqueues.

    Also added is:

        bool flush_work_sync(struct work_struct *work);
    

    This function will wait until a specific work item has completed.

  • The ALSA ASoC API has been significantly extended to support sound cards with multiple codecs and DMA controllers. (commit).

  • The stack-based kmap_atomic() patch has been merged, with an associated API change. See the new Documentation/vm/highmem.txt file for details.

  • There are two new memory allocation helpers:

        void *vzalloc(unsigned long size);
        void *vzalloc_node(unsigned long size, int node);
    
    Both behave like the equivalent vmalloc() calls, but they also zero the allocated memory.

  • Most of the work needed to remove the concept of hard barriers from the block layer has been merged. This task will probably be completed before the closing of the merge window.

Linus has let it be known that he expects this merge window to be shorter than usual so that it can be closed before the 2010 Kernel Summit begins on November 1. Expect patches to be merged at a high rate until the end of October; an update next week will cover the changes merged in the last part of the 2.6.37 merge window.



2.6.37 merge window, part 1

Posted Oct 28, 2010 13:16 UTC (Thu) by i3839 (guest, #31386) [Link]

> The x86 architecture now uses separate stacks for interrupt handling
> when 8K stacks are in use. The option to use 4K stacks has been removed.

Why? I've been running with 4K stacks for years without any problems,
why disable it for the people not running complex filesystem/block layer
stacks?

2.6.37 merge window, part 1

Posted Oct 28, 2010 14:55 UTC (Thu) by nix (subscriber, #2304) [Link]

Because you don't need to run very complex stacks at all to exceed 4K. IIRC, NFS-served XFS can break the limit: it doesn't require much.

2.6.37 merge window, part 1

Posted Oct 28, 2010 20:34 UTC (Thu) by i3839 (guest, #31386) [Link]

Well, then they should fix the stacking happening at all, and make XFS and NFS less stack hungry, instead of pushing 8K stacks and pretending the problem doesn't exist. If there is a stack shortage then 8K might not be enough either. Or let XFS and NFS and others select 8K stacks.

What I dislike is that they take away the 4K stack option altogether.

2.6.37 merge window, part 1

Posted Oct 29, 2010 3:01 UTC (Fri) by nevets (subscriber, #11875) [Link]

You want to see how much stack is being used in the kernel?

Just run the stack_tracer (if enabled).

# mount -t debugfs nodev /sys/kernel/debug
# echo 1 > /proc/sys/kernel/stack_trace_enabled
# cat /sys/kernel/debug/tracing/stack_trace
        Depth    Size   Location    (43 entries)
        -----    ----   --------
  0)     4064     112   __slab_alloc+0x38/0x3f1
  1)     3952      80   kmem_cache_alloc+0x82/0x103
  2)     3872      16   mempool_alloc_slab+0x15/0x17
  3)     3856     144   mempool_alloc+0x5e/0x110
  4)     3712      16   scsi_sg_alloc+0x48/0x4a [scsi_mod]
  5)     3696     112   __sg_alloc_table+0x62/0x103
[...]
 38)      768     320   load_elf_binary+0x8a6/0x174a
 39)      448      96   search_binary_handler+0xc0/0x24d
 40)      352     112   do_execve+0x1d0/0x2ba
 41)      240      64   sys_execve+0x43/0x5a
 42)      176     176   stub_execve+0x6a/0xc0

I just about hit 4K immediately after enabling it. That first number is the stack depth (4064 bytes). That is 42 calls deep. This also shows the stack size of each function (the Size field).
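The relationship between the Depth and Size columns can be checked mechanically: each row's depth, minus that row's own size, should equal the depth of the row below it (its caller). Here is a small user-space sketch that parses rows in the format shown above and tests that relationship. The struct and function names are invented for the example, and the parser assumes the column layout in this trace.

```c
#include <assert.h>
#include <stdio.h>

struct stack_entry {
    int index;          /* row number; 0 is the most recently called function */
    int depth;          /* bytes of stack in use at this frame */
    int size;           /* bytes used by this frame alone */
    char location[128]; /* first token of the Location column */
};

/* Parse one row such as "  0)     4064     112   __slab_alloc+0x38/0x3f1".
 * Returns 1 on success, 0 if the line is not a trace row. */
int parse_stack_line(const char *line, struct stack_entry *out)
{
    assert(line != NULL && out != NULL);
    return sscanf(line, " %d) %d %d %127s",
                  &out->index, &out->depth, &out->size, out->location) == 4;
}

/* Adjacent rows are consistent when the callee's depth minus its own
 * size equals the depth of the function that called it. */
int rows_consistent(const struct stack_entry *callee,
                    const struct stack_entry *caller)
{
    return callee->depth - callee->size == caller->depth;
}
```

For the first two rows above, 4064 - 112 = 3952, which is the depth reported for kmem_cache_alloc(). Header lines and the "[...]" elision are rejected by the parser.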

2.6.37 merge window, part 1

Posted Oct 29, 2010 15:35 UTC (Fri) by nix (subscriber, #2304) [Link]

Yeah, it was really libata that broke this camel's back. It pulls in the SCSI midlayer for virtually everything: a good idea, because this stuff really *is* SCSI-like, but it makes the call stacks a good bit deeper.

(I stopped using 4kstacks a few years ago when I figured out that it was the cause of my hard lockups when running executables over NFS. That was pre-libata...)

2.6.37 merge window, part 1

Posted Oct 30, 2010 19:00 UTC (Sat) by i3839 (guest, #31386) [Link]

Well, I'd argue that a call stack of 42 is insane and should never happen, but who am I... In such cases any guarantees are off and it's time to actually detect and prevent stack shortage. And not with tracing and such, but a guard page or something.

2.6.37 merge window, part 1

Posted Oct 31, 2010 12:44 UTC (Sun) by nix (subscriber, #2304) [Link]

Uh, a guard page? That would make your 4K stack equivalent to 8K again, only you couldn't use half of it. Not so terribly useful, I think.

Guard pages only make sense if the guarded data is generally much bigger than a page.

2.6.37 merge window, part 1

Posted Oct 31, 2010 23:16 UTC (Sun) by i3839 (guest, #31386) [Link]

Well, I mean reserving a guard page in the virtual address space, not allocating a physical page for it. It would cause a page fault, so I guess it can't work when interrupts are disabled, but the rest of the time it should work now interrupt handlers got their own stack. Except if I'm missing something.
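The idea of a guard page that occupies only virtual address space can be illustrated in user space, with the loud caveat that this is an analogy and not how kernel stacks are allocated. The sketch maps two pages, revokes all access to the lower one with mprotect(), and has a child process write one byte below the usable page to see whether the access faults. The function name is invented for the example.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* for MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if a write into the guard page kills the writer with a fault,
 * 0 if the write went through, -1 on setup failure. */
int guard_page_traps(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t len = 2 * (size_t)page;

    /* Two pages: the lower is the guard, the upper is the usable "stack". */
    char *base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;
    if (mprotect(base, (size_t)page, PROT_NONE) != 0) {
        munmap(base, len);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        munmap(base, len);
        return -1;
    }
    if (pid == 0) {
        /* Child: simulate running one byte past the bottom of the stack. */
        volatile char *p = base + page - 1;
        *p = 1;
        _exit(0);   /* only reached if the guard failed to trap */
    }

    int status = 0;
    int result = -1;
    if (waitpid(pid, &status, 0) == pid)
        result = WIFSIGNALED(status) &&
                 (WTERMSIG(status) == SIGSEGV || WTERMSIG(status) == SIGBUS);
    munmap(base, len);
    return result;
}
```

The guard page is never touched successfully, so it should not need a physical page behind it; the cost is address space plus the fault handling, which is the part that gets awkward inside the kernel.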

2.6.37 merge window, part 1

Posted Nov 1, 2010 0:18 UTC (Mon) by nix (subscriber, #2304) [Link]

Hm, yeah, that would work, I think... kernel stacks are physically contiguous, but I don't see an obvious reason why they couldn't have a merely-virtually-contiguous unmapped guard page. (There probably is a reason, or they'd have done it.)

2.6.37 merge window, part 1

Posted Nov 1, 2010 10:05 UTC (Mon) by i3839 (guest, #31386) [Link]

Well, I'm pretty sure the kernel doesn't want a virtually mapped stack, so extending it could get a bit tricky. All in all it might not be worth the complexity compared to just using an 8kB stack.

The main advantage of 4kB stack is not the saving of one page, but the added pressure of keeping bloat down. Things like 42 nested function calls are just not good to have.

nevets, I think you could post that trace as a bug somewhere. :-/

2.6.37 merge window, part 1

Posted Nov 1, 2010 10:29 UTC (Mon) by dlang (✭ supporter ✭, #313) [Link]

I thought that the big advantage of the 4K page was the ability to allocate a single page instead of needing to allocate a pair of pages (order 0 allocation instead of order 1 allocation), greatly reducing the problem of memory fragmentation.
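For reference, the "order" of an allocation is the power of two by which the page size is multiplied: an order-n request asks for 2^n physically contiguous pages. A rough sketch of the arithmetic, assuming 4096-byte pages as on x86 (the function name is invented and this is not the kernel's own helper):

```c
#include <assert.h>
#include <stddef.h>

/* Smallest order such that (page_size << order) >= size. */
unsigned int order_for_size(size_t size, size_t page_size)
{
    assert(page_size > 0);

    unsigned int order = 0;
    while ((page_size << order) < size)
        order++;
    return order;
}
```

With 4096-byte pages, a 4K stack is an order-0 allocation and an 8K stack is order 1, which is the distinction being drawn here.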

2.6.37 merge window, part 1

Posted Nov 1, 2010 17:33 UTC (Mon) by i3839 (guest, #31386) [Link]

The chance that you can't allocate two contiguous pages is fairly small. We're talking about the stack page here, so it's one per task, which isn't much. Fragmentation is more a problem for bigger allocations than order 1, for allocations that may not fail, and for very frequent allocations. The task stack is none of those, so it's fine.

2.6.37 merge window, part 1

Posted Nov 7, 2010 11:24 UTC (Sun) by kevinm (guest, #69913) [Link]

I wonder, now that interrupt context uses its own stack, whether the task stacks couldn't be vmalloc()ed?

Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds