Ticket spinlocks
By Jonathan Corbet
February 6, 2008
Spinlocks are the lowest-level mutual exclusion mechanism in the Linux
kernel. As such, they have a great deal of influence over the safety and
performance of the kernel, so it is not surprising that a great deal of
optimization effort has gone into the various (architecture-specific)
spinlock implementations. That does not mean that all of the work has been
done, though; a patch merged for 2.6.25 shows that there is always more
which can be done.
On the x86 architecture, in the 2.6.24 kernel, a spinlock is represented by
an integer value. A value of one indicates that the lock is available.
The spin_lock() code works by decrementing the value (in a
system-wide atomic manner), then looking to see whether the result is
zero; if so, the lock has been successfully obtained. Should, instead, the
result of the decrement operation be negative, the spin_lock() code
knows that the lock is owned by somebody else. So it busy-waits ("spins")
in a tight loop until the value of the lock becomes positive; then it goes
back to the beginning and tries again.
Once the critical section has been executed, the owner of the lock releases
it by setting the value back to one.
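That logic can be sketched in C along roughly the following lines; the kernel
itself implements it in a few lines of inline assembly, and the type and
function names below are illustrative rather than the kernel's own:

    /* Plausible C rendering of the pre-2.6.25 x86 spinlock logic. */
    typedef struct {
        volatile int slock;             /* one means "available" */
    } old_spinlock_t;

    static void old_spin_lock(old_spinlock_t *lock)
    {
        for (;;) {
            /* Atomic decrement; a result of zero means we now own the lock */
            if (__sync_sub_and_fetch(&lock->slock, 1) == 0)
                return;
            /* Contended: spin until the value goes positive, then retry */
            while (lock->slock <= 0)
                __asm__ __volatile__("pause");  /* cpu_relax() in the kernel */
        }
    }

    static void old_spin_unlock(old_spinlock_t *lock)
    {
        lock->slock = 1;                /* release: the lock is available again */
    }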
This implementation is very fast, especially in the uncontended case (which
is how things should be most of the time). It also makes it easy to see
how bad the contention for a lock is - the more negative the value of the
lock gets, the more processors are trying to acquire it. But there is one
shortcoming with this approach: it is unfair. Once the lock is released,
the first processor which is able to decrement it will be the new owner.
There is no way to ensure that the processor which has been waiting the
longest gets the lock first; in fact, the processor which just released the
lock may, by virtue of owning that cache line, have an advantage should it
decide to reacquire the lock quickly.
One would hope that spinlock unfairness would not be a problem; usually, if
there is serious contention for locks, that contention is a performance
issue even before fairness is taken into account. Nick Piggin recently
revisited this issue, though, after noticing:
On an 8 core (2 socket) Opteron, spinlock unfairness is extremely
noticable, with a userspace test having a difference of up to 2x
runtime per thread, and some threads are starved or "unfairly"
granted the lock up to 1 000 000 (!) times.
This sort of runtime difference is certainly undesirable. But lock
unfairness can also create latency issues; it is hard to give latency
guarantees when the wait time for a spinlock can be arbitrarily long.
Nick's response was a new spinlock implementation which he calls "ticket
spinlocks." Under the initial version of this patch, a spinlock became a
16-bit quantity, split into two bytes: a "next" field and an "owner" field.
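In C terms, that layout can be pictured roughly as follows; the kernel
actually packs the two counters into a single 16-bit word rather than
declaring a structure, so the declaration below is purely illustrative:

    struct ticket_spinlock {
        volatile unsigned char owner;   /* ticket currently holding the lock */
        volatile unsigned char next;    /* ticket to hand to the next arrival */
    };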
Each byte can be thought of as a ticket number. If you have ever been to a
store where customers take paper tickets to ensure that they are served in
the order of arrival, you can think of the "next" field as being the number
on the next ticket in the dispenser, while "owner" is the number appearing
in the "now serving" display over the counter.
So, in the new scheme, the value of a lock is initialized (both fields) to
zero. spin_lock() starts by noting the value of the lock, then
incrementing the "next" field - all in a single, atomic operation. If the
value of "next" (before the increment) is equal to "owner," the lock has
been obtained and work can continue. Otherwise the processor will spin,
waiting until "owner" is incremented to the right value. In this scheme,
releasing a lock is a simple matter of incrementing "owner."
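A hedged C sketch of the acquire and release paths appears below. The actual
2.6.25 code grabs the old value and bumps "next" with a single x86 xadd
instruction on the combined 16-bit word (and includes the proper memory
barriers); the per-field version here is only meant to show the idea, and the
function names are invented:

    struct ticket_spinlock {
        volatile unsigned char owner;   /* the two-byte layout sketched above */
        volatile unsigned char next;
    };

    static void ticket_spin_lock(struct ticket_spinlock *lock)
    {
        /* Take a ticket: atomically fetch "next" and increment it for
         * whoever arrives after us. */
        unsigned char ticket = __sync_fetch_and_add(&lock->next, 1);

        /* Spin until our number comes up. */
        while (lock->owner != ticket)
            __asm__ __volatile__("pause");      /* cpu_relax() */
    }

    static void ticket_spin_unlock(struct ticket_spinlock *lock)
    {
        /* Only the lock holder writes "owner", so a simple increment
         * hands the lock to the next waiter in line. */
        lock->owner++;
    }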
The implementation described above does have one small disadvantage in that
it limits the number of processors to 256 - any more than that, and a
heavily-contended lock could lead to multiple processors thinking they had
the same ticket number. Needless to say, the resulting potential for
mayhem is not something which can be tolerated. But the 256-processor
limit is an unwelcome constraint for those working on large systems, which
already have rather more processors than that. So the add-on "big
ticket" patch - also merged for 2.6.25 - uses 16-bit values when the
configured maximum number of processors exceeds 256. That raises the
maximum system size to 65536 processors - who could ever want more than
that?
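In effect, the big-ticket change just widens the ticket counters when the
kernel is configured for a large machine; conceptually, it amounts to
something like the following, where the exact symbols and types are
illustrative rather than taken from the actual patch:

    /* Pick a ticket width big enough for the configured CPU limit. */
    #if NR_CPUS > 256
    typedef unsigned short ticket_t;    /* 32-bit lock word, up to 65536 CPUs */
    #else
    typedef unsigned char ticket_t;     /* 16-bit lock word, up to 256 CPUs */
    #endif

    struct ticket_spinlock {
        volatile ticket_t owner;
        volatile ticket_t next;
    };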
With the older spinlock implementation, all processors contending for a
lock fought to see who could grab it first. Now they wait nicely in line
and grab the lock in the order of arrival. Multi-thread run times even
out, and maximum latencies are reduced (and, more to the point, made
deterministic). There is a slight cost to the new implementation, says
Nick, but that gets very small on contemporary processors and is
essentially zero relative to the cost of a cache miss - which is a common
event when dealing with contended locks. The x86 maintainers clearly
thought that the benefits of eliminating the unseemly scramble for
spinlocks exceeded this small cost; it seems unlikely that others will disagree.
The x86 maintainers clearly thought that the benefits of eliminating the unseemly scramble for spinlocks exceeded this small cost; it seems unlikely that others will disagree.
Not only do I not disagree, I'd love to run a kernel that has this cost, well actually I want the advantages, not so much the disadvantages.
It sounds like a really nice improvement.
Isn't there a naming inconsistency in the algorithm description as it now stands (at 05:59:44 UTC)? The box diagram and the following paragraph refer to fields "next" and "owner", but the next paragraph after that talks about "next" and "current". I assume "owner" and "current" refer to the same thing?
That was a slip of the brain as I was running out of time to get the article done. Of course the limit is 65536; the article has been fixed. Sorry for the confusion.
The more sophisticated the spinlock allocation mechanism, the greater the overheads. Simply having a wide enough table (and it can be as wide as the UPID mechanism) would be fast, but costly on memory. Spinlocks are also not going to be under equal demand. Perhaps spinlocks can be created with a given ticket width (to reflect expected demand) or have the ticket width grow once a high water mark is exceeded (to reflect actual demand).
There's also the problem of what to do if a maximum of N processes can control something. Do you allocate N spinlocks? That ruins any quality of service you're trying to attain. Or if a process has claimed one spinlock, do you bar it from reclaiming it (or claiming one of the other N) until some sort of fair service guarantee has been achieved?
There seem to be many semi-independent problems involved here, with spinlocks ending up used for 1:1, 1:N, M:1 and M:N situations. In practice, some of these may never really apply, but even if one of the other options applies once, you've a candidate for leading the Department of Headaches.
Do we need smarter spinlocks? Something on top of "dumb" spinlocks to add any extra capabilities? Something that divides the problem-space up such that smarter spinlocks aren't needed? Big spinlocks, such that they can stay as they are and survive in more generalized situations?