Ingo Molnar's completely fair scheduler (CFS) patch continues to develop;
the current version, as of this writing, is v18. One aspect of CFS
behavior is seen as a serious shortcoming by many potential users,
however: it only implements fairness between individual processes.
If 50 processes are trying to run at any given time, CFS will carefully
ensure that each gets 2% of the CPU. It could be, however, that one of
those processes is the X server belonging to Alice, while the other 49 are
part of a massive kernel build launched by Karl the opportunistic kernel
hacker, who logged
in over the net to take advantage of some free CPU time. Assuming that
allowing Karl on the system is considered fair at all, it is reasonable to
say that his 49 compiler processes should, as a group, share the processor
with Alice's X server. In other words, X should get 50% of the CPU (if it
needs it) while all of Karl's processes share the other 50%.
This type of scheduling is called "group scheduling"; Linux has never
really supported it with any scheduler. It would be nice if a "completely
fair scheduler" to be merged in the future had the potential to be
completely fair in this regard too. Thanks to work by Srivatsa Vaddagiri
and others, things may well happen in just that way.
The first part of Srivatsa's work was merged into v17 of the CFS patch. It
creates the concept of a "scheduling entity" - something to be scheduled,
which might not be a process. This work takes the per-process scheduling
information and packages it up within a sched_entity structure.
In this form, it is essentially a cleanup - it encapsulates the relevant
information (a useful thing to do in its own right) without actually
changing how the CFS scheduler works.
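In rough terms, the result looks something like the sketch below; take
the field names as illustrative rather than as a verbatim copy of the
v17 patch:

    #include <linux/rbtree.h>
    #include <linux/types.h>

    /* Per-entity scheduling information, pulled out of task_struct. */
    struct sched_entity {
        struct rb_node  run_node;         /* place in the rbtree run queue */
        u64             exec_start;       /* when the entity last got the CPU */
        u64             sum_exec_runtime; /* total CPU time consumed so far */
        /* ... further per-entity fairness bookkeeping ... */
    };

    /* task_struct then simply embeds one of these: */
    struct task_struct {
        /* ... */
        struct sched_entity se;           /* scheduling fields now live here */
        /* ... */
    };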
Group scheduling is implemented in a separate set of patches which
are not yet part of the CFS code. These patches turn a scheduling entity into
a hierarchical structure. There can now be scheduling entities which are
not directly associated with processes; instead, they represent a specific
group of processes. Each scheduling entity of this type has its own run
queue within it. All scheduling entities also now have a parent
pointer and a pointer to the run queue into which they should be scheduled.
By default, processes are at the top of the hierarchy, and each is
scheduled independently. A process can be moved underneath another
scheduling entity, though, essentially removing it from the primary run
queue. When that process becomes runnable, it is put on the run queue
associated with its parent scheduling entity.
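In the same simplified style, the hierarchical version of the entity
gains a few links; mainline eventually settled on names like parent,
cfs_rq, and my_q for these, but the details here should again be read
as a sketch:

    struct sched_entity {
        /* ... per-entity statistics as before ... */
        struct sched_entity *parent; /* enclosing group; NULL at the top level */
        struct cfs_rq       *cfs_rq; /* run queue this entity is queued on */
        struct cfs_rq       *my_q;   /* run queue owned by a group entity;
                                        NULL when the entity is a real task */
    };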
When the scheduler goes to pick the next task to run, it looks at all of
the top-level scheduling entities and takes the one which is considered
most deserving of the CPU. If that entity is not a process (it's a
higher-level scheduling entity), then the scheduler looks at the run queue
contained within that entity and starts over again. Things continue down
the hierarchy until an actual process is found, at which point it is run.
As the process runs, its runtime statistics are collected as usual, but
they are also propagated up the hierarchy so that its CPU usage is properly
reflected at each level.
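A condensed sketch of that descent, with helper names modeled on the CFS
code, might look like this; charge_runtime() is a hypothetical helper
standing in for the real statistics-update path:

    static struct task_struct *pick_next_task_fair(struct rq *rq)
    {
        struct cfs_rq *cfs_rq = &rq->cfs;      /* top-level run queue */
        struct sched_entity *se;

        if (!cfs_rq->nr_running)
            return NULL;                       /* nothing runnable here */

        do {
            se = pick_next_entity(cfs_rq);     /* most deserving entity */
            cfs_rq = group_cfs_rq(se);         /* se->my_q; NULL for a task */
        } while (cfs_rq);                      /* descend until a real task */

        return task_of(se);
    }

    /*
     * Statistics flow the other way: charging a task's runtime also
     * charges every group entity above it.
     */
    static void charge_runtime(struct sched_entity *se, u64 delta_exec)
    {
        for (; se; se = se->parent)
            se->sum_exec_runtime += delta_exec;
    }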
So now the system administrator can create one scheduling entity for Alice,
and another for Karl. All of Alice's processes are placed under her
representative scheduling entity; a similar thing happens to all of the
processes in Karl's big kernel build. The CFS scheduler will enforce
fairness between Alice and Karl; once it decides who deserves the CPU, it
will drop down a level and perform fair scheduling of that user's
processes.
The creation of the process hierarchy need not be done on a per-user basis;
processes can be organized in any way that the administrator sees fit. The
grouping could be coarser; for example, on a university machine, all
students could be placed in one group and faculty in another. Or the
hierarchy could be based on the type of process: there could be scheduling
entities representing system daemons, interactive tools, monster number-crunching
CPU hogs, etc. There is nothing in the patch which limits the ways in
which processes can be grouped.
One remaining question might be: how does the system administrator actually
cause this grouping to happen? The answer is in the second part of the
group scheduling patch, which integrates scheduling entities with the process container mechanism.
The administrator mounts a container filesystem with the cpuctl
option; scheduling groups can then be created as directories within that
filesystem. Processes can be moved into (and out of) groups using the
usual container interface. So any particular policy can be implemented
through the creation of a simple, user-space daemon which responds to
process creation events by placing newly-created processes in the right
group.
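As a hypothetical example of the operations involved (the cpuctl option
comes from the patch description; the mount point, group names, PID, and
the "tasks" file follow the usual container-filesystem conventions but
should be treated as illustrative):

    #include <stdio.h>
    #include <sys/mount.h>
    #include <sys/stat.h>

    int main(void)
    {
        /* Mount the container filesystem with the CPU controller enabled. */
        if (mount("container", "/containers", "container", 0, "cpuctl") < 0)
            perror("mount");

        /* Each directory within the filesystem is a scheduling group. */
        mkdir("/containers/alice", 0755);
        mkdir("/containers/karl", 0755);

        /* A process is moved by writing its PID to the group's tasks file. */
        FILE *tasks = fopen("/containers/karl/tasks", "w");
        if (tasks) {
            fprintf(tasks, "%d\n", 1234);  /* one of Karl's compilers */
            fclose(tasks);
        }
        return 0;
    }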
In its current form, the container code only supports a single level of group
hierarchy, so a two-level scheme (divide users into administrators,
employees, and guests, then enforce fairness between users in each group,
for example) cannot be implemented. This appears to be a "didn't get
around to it yet" sort of limitation, though, rather than something which
is inherent in the code.
With this feature in place, CFS will become more interesting to a number of
potential users. Those users may have to wait a little longer, though.
The 2.6.23 merge window will be opening soon, but it seems unlikely that
this work will be considered ready for inclusion at that time. Maybe
2.6.24 will be a good release for people wanting a shiny, new, group-aware
scheduler.
Using a userspace daemon to set process scheduling policy might be fine for long-running processes like the X server, but wouldn't it introduce a lot of overhead for short-lived processes like gcc during kernel builds? I would expect it to add a few context switches of overhead to every fork(); that doesn't seem consistent with the general kernel-developer attitude toward efficiency.
It is indeed nice that such a feature is appearing in the vanilla kernel.