Still waiting for swap prefetch
[Posted July 25, 2007 by corbet]
It has been almost two years since LWN covered the swap prefetch
patch. This work, done by Con Kolivas, is based on the idea that if a
system is idle, and it has pushed user data out to swap, perhaps it should
spend a little time speculatively fetching that swapped data back into
any free memory that might be sitting around. Then, when some application
wants that memory in the future, it
will already be available and the time-consuming process of fetching it
from disk can be avoided.
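In conceptual terms, the idea amounts to a loop like the sketch below. This is only an illustration, not Con's patch; the helper functions are placeholders for the real idle and free-memory checks, and the actual work happens inside the kernel with many more safeguards.

	#include <stdio.h>

	/* Placeholder predicates standing in for the real checks. */
	static int system_is_idle(void)        { return 1; }
	static int plenty_of_free_memory(void) { return 1; }
	static int next_swapped_page(void)     { return -1; } /* -1: nothing left */
	static void read_page_back_from_swap(int page) { (void)page; }

	/* One pass of the concept: when the system is idle, quietly copy
	 * swapped pages back into otherwise-unused memory.  The on-disk
	 * copy is kept, so the prefetched pages stay clean and can be
	 * dropped again instantly if real work needs the RAM. */
	static void swap_prefetch_pass(void)
	{
		if (!system_is_idle())
			return;

		while (plenty_of_free_memory()) {
			int page = next_swapped_page();
			if (page < 0)
				break;
			read_page_back_from_swap(page);
		}
	}

	int main(void)
	{
		swap_prefetch_pass();
		printf("prefetch pass complete\n");
		return 0;
	}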
The classic use case for this feature is a
desktop system which runs memory-intensive daemons (updatedb, say, or a
backup process) during the night. Those daemons may shove a lot of useful
data to swap, where it will languish until the system's user arrives,
coffee in hand, the next morning. Said user's coffee may well grow cold by
the time the various open applications have managed to fault in enough
memory to function again. Swap prefetch is intended to allow users to
enjoy their computers and hot coffee at the same time.
There is a vocal set of users out there who will attest that swap prefetch
has made their systems work better. Even so, the swap prefetch patch has
languished in the -mm tree for almost all of those two years with no path
to the mainline in sight. Con has given up
on the patch (and on kernel development in general):
The window for 2.6.23 has now closed and your position on this is
clear. I've been supporting this code in -mm for 21 months since
16-Oct-2005 without any obvious decision for this code forwards or
backwards.
I am no longer part of your operating system's kernel's world; thus
I cannot support this code any longer. Unless someone takes over
the code base for swap prefetch you have to assume it is now
unmaintained and should delete it.
It is an unfortunate thing when a talented and well-meaning developer runs
afoul of the kernel development process and walks away. We cannot afford
to lose such people. So it is worth the trouble to try to understand what
went wrong.
Problem #1 is that Con chose to work in some of the trickiest parts of the
kernel. Swap prefetch is a memory management patch, and those patches
always have a long and difficult path into the kernel. It's not just Con
who has run into this: Nick Piggin's lockless pagecache patches have
been
knocking on the door for just as long. The LWN article on Wu Fengguang's
adaptive readahead patches appeared at about the same time as the swap
prefetch article - and that was after your editor had stared at them for
weeks trying to work up the courage to write something. Those patches
were only merged earlier this month, and, even then, only after many of the
features were stripped out. Memory management is not an area for
programmers looking for instant gratification.
There is a reason for this. Device drivers either work or they do not, but
the virtual memory subsystem behaves a little differently for every
workload which is put to it. Tweaking the heuristics which drive memory
management is a difficult process; a change which makes one workload run
better can, unpredictably, destroy performance somewhere else. And that
"somewhere else" might not surface until some large financial institution
somewhere tries to deploy a new kernel release. The core kernel
maintainers have seen this sort of thing happen often enough to become
quite conservative with memory management changes. Without convincing
evidence that the change makes things better (or at least does no harm) in
all situations, it will be hard to get a significant change merged.
In a recent interview Con stated:
Then along came swap prefetch. I spent a long time maintaining and
improving it. It was merged into the -mm kernel 18 months ago and
I've been supporting it since. Andrew [Morton] to this day remains
unconvinced it helps and that it 'might' have negative consequences
elsewhere. No bug report or performance complaint has been
forthcoming in the last 9 months. I even wrote a benchmark that
showed how it worked, which managed to quantify it!
The problem is that, as any developer knows, "no bug reports" is not the same
as "no bugs." What is needed in a situation like this is not just
testimonials from happy desktop users; there also needs to be some sort of
sense that the patch has been tried out in a wide variety of situations.
The relatively self-selecting nature of Con's testing community (more on
this shortly) makes that wider testing harder to achieve.
A patch like swap prefetch will require a certain amount of support from
the other developers working in memory management before it can be merged.
These developers have, as a whole, not quite been ready to jump onto the
prefetch bandwagon. A concern which has been raised a few times is that
the morning swap-in problem may well be a sign of a larger issue within the
virtual memory subsystem, and that prefetch mostly serves as a way of
papering over that problem. And it fails to even paper things completely,
since it brings back some pages from swap, but doesn't (and really can't)
address file-backed pages which will also have been pushed out. The
conclusion that this reasoning leads to is that it would be better to find
and fix the real problem rather than hiding it behind prefetch.
The way to address this concern is to try to get a better handle on what
workloads are having problems so that the root cause can be addressed.
That's why Andrew Morton says:
To attack the second question we could start out with bug reports:
system A with workload B produces result C. I think result C is
wrong for <reasons> and would prefer to see result D.
and why Nick Piggin complains:
Not talking about swap prefetch itself, but everytime I have asked
anyone to instrument or produce some workload where swap prefetch
helps, they never do.
Fair enough if swap prefetch helps them, but I also want to look at
why that is the case and try to improve page reclaim in some of
these situations (for example standard overnight cron jobs
shouldn't need swap prefetch on a 1 or 2GB system, I would hope).
There have been a few attempts to characterize workloads which are improved
by swap prefetch, but the descriptions tend toward the vague and hard to
reproduce. This is not an easy situation to write a simple benchmark for
(though Con has tried), so demonstrating the problem is a hard thing to
do. Still, if the prefetch proponents are serious about wanting this code
in the mainline, they will need to find ways to better communicate
information about the problems solved by prefetch to the development
community.
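For illustration, the kind of test being asked for can be sketched in a few dozen lines of C. This is not Con's benchmark, just a toy with arbitrary sizes: it faults in a "desktop" working set, pushes it out to swap with a large "overnight job" allocation, sits idle for a while (the window in which prefetch could do its work), then times how long the working set takes to fault back in.

	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/time.h>

	#define MB (1024UL * 1024UL)

	/* Fault in one byte per page so the buffer really occupies RAM. */
	static void touch(char *buf, size_t size)
	{
		size_t i;

		for (i = 0; i < size; i += 4096)
			buf[i] = 1;
	}

	static double seconds(void)
	{
		struct timeval tv;

		gettimeofday(&tv, NULL);
		return tv.tv_sec + tv.tv_usec / 1e6;
	}

	int main(void)
	{
		size_t desktop = 512 * MB;   /* the "applications" we care about */
		size_t hog = 2048 * MB;      /* the "overnight job"; sizes are arbitrary */
		char *apps, *job;
		double t0, t1;

		apps = malloc(desktop);
		job = malloc(hog);
		if (!apps || !job) {
			fprintf(stderr, "allocation failed\n");
			return 1;
		}

		touch(apps, desktop);        /* populate the working set */
		touch(job, hog);             /* push the working set out to swap */
		free(job);                   /* the nightly job exits */

		sleep(600);                  /* idle time in which prefetch could run */

		t0 = seconds();
		touch(apps, desktop);        /* the "morning": fault it all back in */
		t1 = seconds();

		printf("re-touching %lu MB took %.2f seconds\n",
		       (unsigned long)(desktop / MB), t1 - t0);
		return 0;
	}

Run on the same machine with and without swap prefetch enabled, the difference in that final number is the effect being argued about.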
Communications with the community have been an occasional problem with
Con's patches. Almost uniquely among kernel developers, Con
chose to do most of his work on his own mailing list. That has resulted in
a self-selected community of users which is nearly uniformly supportive of Con's work,
but which, in general, is not participating much in the development of that
work. It is rare to see patches posted to the ck-list which were not
written by Con himself. The result was the formation of a sort of
cheerleading squad which would occasionally spill over onto linux-kernel
demanding the merging of Con's patches. This sort of one-way communication
was not particularly helpful for anybody involved. It failed to convince
developers outside of ck-list, and it failed to make the patches better.
This dynamic became actively harmful when ck-list members (and Con)
continued to push for inclusion of patches in the face of real problems.
This behavior came to the fore after Con posted the RSDL scheduler. RSDL
restarted the whole CPU scheduling discussion and ended up leading to some
good work. But some users were reporting real regressions with RSDL and
were being told that those regressions were to be expected and would not be
fixed. This behavior soured
Linus on RSDL and set the stage for Ingo Molnar's CFS scheduler. Some
(not all) people are convinced that Con's scheduler was the better design,
but refusal to engage with negative feedback doomed the whole exercise.
Some of Con's ideas made it into the mainline, but his code did not.
The swap prefetch patches appear to lack any obvious problems; nobody is
reporting that prefetch makes things worse. But the ck-list members
pushing for its inclusion (often with Con's encouragement) have not been
providing the sort of information that the kernel developers want to see.
Even so, while a consensus in favor of merging this patch has
not formed, there are some important developers who support its inclusion.
They include Ingo Molnar and David Miller, who says:
There is a point at which it might be wise to just step back and
let the river run it's course and see what happens. Initially,
it's good to play games of "what if", but after several months it's
not a productive thing and slows down progress for no good reason.
If a better mechanism gets implemented, great! We'll can easily
replace the swap prefetch stuff at such time. But until then swap
prefetch is what we have and it's sat long enough in -mm with no
major problems to merge it.
So swap prefetch may yet make it into the mainline - that discussion is
not, yet, done. If we are especially lucky, Con will find a way to get back into
kernel development, where his talents and user focus are very much needed.
But this sort of situation will certainly come up again. Getting major
changes into the core kernel is not an easy thing to do, and, arguably,
that is how it should be. If the process must make mistakes, they should
probably happen on the side of being conservative, even if the occasional
result is the exclusion of patches that end up being helpful.
"for example standard overnight cron jobs shouldn't need swap prefetch on a 1 or 2GB system"
Now I've been programming since magnetic drums were hip so I may be a bit confused here but it seems to me that I can remember a time less than a decade ago when a system didn't need a couple of gigs to run Linux well.
It seems that Linux may still be a little more efficient than Vista on most loads (but not Firefox) but back in the nifty nineties Linux was a _lot_ more efficient than Windows. In short, Linux has been getting worse faster than Windows.
When I have 2GB of ram in my home desktop system, I never ever want to see Linux drop binaries from ram, and swap them back in from the binary text file. Or see it drop data pages to increase hdd cache in memory.
Turn off swap then if you feel that way. I've got no problem with my system swapping out the getty copies which will never be used and instead using the RAM to cache my mail files.
I have a bunch of stuff that includes running programs, program data and files that the CPU has to deal with. I want everything to go as fast as possible, so whenever possible, if the CPU wants some of that data it should be able to get it as quickly as possible, which means it should be in RAM if at all possible. The better the kernel does this, the faster everything will run.
Why should the kernel have to guess what to page back in when you could use the "pagein" tool from OpenOffice.org, or something like it, from user space?
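As a rough sketch of that idea (not OpenOffice.org's actual pagein tool, just an illustration), a tiny program can read the files it is given so their contents land in the page cache before the user needs them. Note that this only helps with file-backed pages; anonymous data that has gone out to swap is exactly what such a tool cannot bring back.

	#include <stdio.h>
	#include <fcntl.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		char buf[65536];
		int i;

		for (i = 1; i < argc; i++) {
			int fd = open(argv[i], O_RDONLY);

			if (fd < 0) {
				perror(argv[i]);
				continue;
			}
			/* Reading the whole file is enough to pull it into the
			 * page cache; posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED)
			 * would be a gentler way to ask for the same thing. */
			while (read(fd, buf, sizeof(buf)) > 0)
				;
			close(fd);
		}
		return 0;
	}

Run at login against the binaries and libraries the user cares about, it does from user space roughly what the comment above suggests the kernel should not have to guess at.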
Then you will be surprised.
When you get 2GB (I have them), just start Eclipse on a larger project, while Firefox and Thunderbird are running; and watch your memory getting used. And you didn't even start any VM (Xen or VMware)...
If Firefox is really faster on Windows than on Linux, then it should be easy to hack up some benchmarks in Javascript to prove it. Nobody will be able to fix it if he cannot measure the performance. And Internet network bandwidth has to be taken out of the equation.
Sure, you can write bad code in any language, but in my experience Python is not particularly slower than other script languages. Here is a benchmark that appears to validate that experience: http://www.timestretch.com/FractalBenchmark.html Another one is here: http://acker3.ath.cx/wordpress/archives/7
Lua might be something to look at.
Of course you will always be able to write faster code in C, but this will take you some more time.
"Of course you will always be able to write faster code in C, but this will take you some more time."
Ya had me until you said this. Why stop at C, when it could be written in ASM? And heck, how do you know the assembler will generate fast code, better do it in hex instead.
I would have thought that after all these years that people would learn more about computer science and programming than to troll the "C is always faster than everything else" line.
The thing to do is to generate the assembler and then munge it with a horrible perl script.
(Hey, ghc does it, it must be good! :) )
Why stop at C, when it could be written in ASM?
That's not a natural progression. Code compiled from C is often faster than that compiled from assembly language, for the same reason that a computer can land an airplane more smoothly than a human. Even code compiled from C by a naive compiler (e.g. gcc -O0) is unlikely to be slower than code compiled from assembly language. C is that low-level a language.
how do you know the assembler will generate fast code, better do it in hex instead
We do know that. The assembler will generate code that is not only the same speed as that generated by the hex editor, but is actually the same code. That's the definition of assembly language.
I would have thought that after all these years that people would learn more about computer science and programming than to troll the "C is always faster than everything else" line.
The only line I saw was, "C is always faster than Python." And it is, isn't it?
Benchmarks in Javascript will mostly show the performance of the JS interpreter and things it can block on, i.e. it's not a complete monitoring tool by any means.
"Ubuntu and Red Hat both have this obsession with Python, which is about as bloaty a language as you can get, and then they write crappy apps in Python that would be slow as molasses even in C because they use crappy algorithms."
Could you elaborate on that? I'm not much of a coder, but it would be interesting to hear what the problem with python is since I have considered playing with it.
In regards to Ubuntu and Redhat, well I bet they are open for improvements to their code though you don't say what apps it is. Your comment lacks facts.
My problem with Ocaml is its syntax and its functional mindset: I tried once to learn Ocaml and I disliked the syntax, plus the PDF book I used insisted on using the functional way to solve everything, which is strange as Ocaml is said to support both imperative and functional styles; why the book insisted so much on the functional style is beyond me, blech.
So to be successful, Ocaml would need 2 things: 1) replace the current default syntax with a better one.
There is already an alternative syntax for Ocaml (so apparently I'm not the only one who doesn't like the default syntax), which is quite a bit better, and F# (an Ocaml clone for .Net) has a syntax that looks even better. 2) improve the tutorials and books to teach both the imperative style and the functional style, without such a blatant bias towards the functional style; it has its place, but not for everything.
Somehow I doubt that will happen, so Ocaml is bound to stay out of the limelight.
> There's nothing wrong with python
... except that even a simple "hello world" seems to take 40 megabytes of memory. It's not a few cpu cycles that kill you in performance, it's the enormous overhead that even simple programs get....
% ps ax -o cmd:50,vsz,rss,stime | egrep '[p]ython'
/usr/bin/python -E /usr/sbin/setroubleshootd 808024 12104 Jul06
python /usr/share/system-config-printer/applet.py 232764 6436 Jul06
/usr/bin/python -E /usr/bin/sealert -s 411728 18524 Jul06
python /usr/libexec/revelation-applet --oaf-activa 458328 55056 Jul06
python /usr/lib64/gdesklets/gdesklets-daemon 453980 65768 Jul23
python ./hpssd.py 171220 1176 Jul11
...that's 2_536_044 KB VSZ and 159_064 KB RSS, and as you can see I've rebooted gdesklets recently (it was roughly double that, I think).
Plus I'm not running pupplet/yum-updatesd atm. And for instance "revelation-applet" is just a text entry box 95% of the time.
I appreciate that the huge VSZ numbers are (hopefully) mostly unused shared libs. etc. but half a GB is still a lot for the OS to manage for a text box, and 40+MB of RSS for a simple GUI is far from "so much crap".
For comparison my webserver uses two processes with a VSZ of about 12MB each and RSS of about 1MB each, and I'd prefer that to be smaller.
The page cache is not an `unnecessary but nice performance enhancement'. The text pages of all your binaries are sitting in there while they run, as is all your other mmap()ed stuff. If your page cache was empty you'd not be able to run any userspace code to speak of.
_Linux_ has been getting worse, or the shit that people use on their systems is getting worse?
Check the size of a Linux kernel, and you will see that there is a bigger increase (in percent) over 5 years than in the memory used by mozilla. We have more memory, so we can afford to use it more inefficiently.
We have more memory
Not really; take a look at the OLPC or at the new developments on the embedded device scene (Nokia N770/N800, OpenMoko)...
"We have more memory, so we can afford to use it more inefficiently."
That's badly worded. It would be more correct to say that memory efficiency has been sacrificed for time efficiency.
Let's clarify... in a GNU/Linux system, the GNU part is getting bigger and slower. I can still happily load and run Debian 4.0 on a DX4/100 with 32M of ram and a 2G hdd. Can't do anything that involves a GUI, though.
It seems that Linux may still be a little more efficient than Vista on most loads (but not Firefox) but back in the nifty nineties Linux was a _lot_ more efficient than Windows. In short, Linux has been getting worse faster than Windows.
I am not sure if I agree with that. I am writing this on a Dell laptop with a 1GHz CPU and 512MB of memory. It is a pain to use on Windows XP, and I wouldn't dare try to put Vista on it. But the latest version of Ubuntu trundles along quite nicely. There are a few applications that struggle, like Eclipse, but they struggled 3 years ago too.
Tom
I really hate this notion that a three-year-old computer should be tossed in the trash as so obsolete there is no use it can be put to. Linux used to be a good way to get good use out of older hardware. Not anymore. Now you need hardware equal to (and, before Vista shipped, greater than) the minimum Windows baseline.
And just throwing hardware at the problem doesn't make it go away. Having 2GB of RAM will make it livable but hard drives aren't getting all that much faster. Paging in enough of OO.o and all the libraries it needs to get to mapping the initial window means looking at a throbber almost as long on a hot new monster PC as it does on an older one. Same for all the disc thrashing involved in logon as multi-megabyte blobs of libraries and executables are mapped in to provide what should be small crap like battery indicators and CPU speed monitor widgets in menu bar.
Having more resources is no excuse for sloppy and wasteful practices. And if we want our stuff to be an option for the coming world of smart phones, flash based laptops (without swap) and the embedded world we need to be thinking about getting our act together now.
Memory 'needs' increase exponentially in a Moore's Law like process that is entirely unrelated to Linux. You can still run Linux on tiny systems (I know people running 2.6 on systems smaller than 32MB), which is not true of any recent version of Windows.
Your conclusion simply does not follow from Nick Piggin's quite reasonable postulate.
Now I've been programming since magnetic drums were hip so I may be a bit confused here but it seems to me that I can remember a time less than a decade ago when a system didn't need a couple of gigs to run Linux well.
I would say it still does. I'm writing this as we speak from a PIII 700 MHz with 256 MB of RAM and, honest, Linux runs great. I even use Firefox, which tends to eat up 50% of my RAM according to top, but still no complaints. In fact, even with Firefox using half my RAM I'm still not even touching swap.
Interestingly enough when I read some of the posts on lkml claiming that they had 1GB of RAM and swap prefetch drastically improved their workload, all I could think is "what on earth are these people running?" Until a couple of months ago my dev-machine at work only had 512 MB of RAM and had all kinds of nasty things like Firefox, beagle, and Lotus Notes running but my swap was rarely used. Conversely on days when I was forced to boot it into Windows XP for one reason or another I couldn't leave an application minimized for more than 10 minutes before Windows decided to swap it out. Waiting 1 minute every time I switch apps on Windows is enough to make me go crazy and be thankful that Linux has a sane swapping algorithm.
Yeah we do, but does it really cache a bitmap of the page? That would seem a bit silly
I don't know exactly, but I would say it does. Loading a couple of big JPEG files takes quite a bit longer than changing tabs between them.
That would seem a bit silly
Why, exactly? If you have the RAM to spare, it seems to be as good as any other use. I seem to remember some discussion on LWN about Firefox caching even pages in the history.
When people complain that "Firefox eats up 2 GB" in a 4 GB machine, it gives the wrong impression. Firefox runs fine on my 128 MB laptop, and memory seldom goes above 80 MB.
I wouldn't either. My little experiment with JPEG images involved decompressing a JPEG image vs. caching an uncompressed bitmap. Caching images in a lossy format would be ludicrous.
But caching in a lossless format such as PNG isn't such a good idea either. An important aspect of caches is that you should store an artifact which you already have, not one you have to generate. If you have to compress a bitmap to PNG before caching it, you are wasting a lot of CPU time just to generate a cache which you might as well never use again.
An example: you download a JPEG image, then uncompress it to show it, and then you compress it to PNG before caching it in memory. Messy.
Quite right. Sorry. Oops.
Really, FF crashes on me about once a week, because although I have 2GB RAM and 6GB of swap, firefox manages to malloc() 4 GB!
Geez, what are you doing to the poor thing??
No idea where it is going - admittedly I tend to have about 200 tabs open, but that alone shouldn't be the problem.
OK, that might do it... I have two FF windows containing a total of 9 tabs, and it's been running since early March. Total memory usage (per ps(1)): 197M allocated, 157M resident. (Of course, there's also X11 pixmaps, as was noted in another LWN article a while back. I've forgotten how to check that, but even if FF were the sole X client--which it's not, by far--that's only another 194M allocated and 46M resident.)
Greg
xrestop will show you the X memory usage, such as pixmaps.
Danke, sir! That's the command I couldn't recall.
Greg
>I can remember a time less than a decade ago when a system didn't need a couple of gigs to run Linux well.
Don't worry, it's still fine: running a 2.6.22.1 kernel on old hardware, with a light desktop (xfce4) and lots of little server things (nfs, vnc, streaming, etc.), the system uses... 90 MB of RAM, out of 192 MB total. I originally had 512 in it but it never got remotely close to using it, so I pulled it out and put back in the small sticks I had lying around. Once in a while it uses swap, a tiny bit, but not often.
For full desktop like kde or gnome, 256 is what you need to avoid swapping.
And more for dev work, vmware, heavy graphics, and so on. Just because you put in 2 gig doesn't mean you use it, this box has 2 and it's using 3/4 gig with 50 or so apps open on 8 virtual desktops. But fire up a vmware or vbox install or two and you're getting closer to 2 gigs.
Isn't this precisely the sort of thing that should be controlled from user space? i.e. let the kernel provide the mechanism and let a user space daemon implement the policy?
For example, couldn't one create a virtual file called /proc/<pid>/swapin that when written to, fetches all the swapped data for that process back into ram?
The advantage, of course, would be that no existing users would be affected, and the optimal swap prefetch policy could be adjusted according to the application.
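To make that concrete, the policy side might look like the sketch below. The /proc/<pid>/swapin file is the hypothetical interface proposed above (it does not exist in any kernel), and the PID and the idle decision are placeholders.

	#include <stdio.h>
	#include <sys/types.h>
	#include <unistd.h>

	/* Ask the kernel to fetch a process's swapped pages back into RAM,
	 * using the hypothetical /proc/<pid>/swapin file proposed above. */
	static int swapin_process(pid_t pid)
	{
		char path[64];
		FILE *f;

		snprintf(path, sizeof(path), "/proc/%d/swapin", (int)pid);
		f = fopen(path, "w");
		if (!f)
			return -1;
		fputs("1\n", f);
		fclose(f);
		return 0;
	}

	int main(void)
	{
		/* A real daemon would loop, watch the load average and free
		 * memory, and work out which processes matter (the user's
		 * desktop session, say).  Here we just ask for one
		 * hard-coded placeholder PID. */
		pid_t desktop_pid = 1234;	/* placeholder */

		if (swapin_process(desktop_pid) != 0)
			perror("swapin");
		return 0;
	}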
> The advantage, of course, would be that no existing users would be
> affected, and the optimal swap prefetch policy could be adjusted according
> to the application.
The disadvantages are, of course, that (1) few programs will use that interface, since it is too system-dependent, (2) few programs will use that interface at the time it matters, since developers usually cannot know exactly when their program will need a lot of memory (who knows whether some crazy user will create ten thousand bookmarks in his browser and expand them all...), and (3) a large program with a virtual size even bigger than the available RAM will fail (or cause the whole system to run very slowly) the moment it uses that interface, and, most unluckily, that will happen only on small systems the developer doesn't have access to.
But some users were reporting real regressions with RSDL and were being told that those regressions were to be expected and would not be fixed. This behavior soured Linus on RSDL and set the stage for Ingo Molnar's CFS scheduler. Some (not all) people are convinced that Con's scheduler was the better design, but refusal to engage with negative feedback doomed the whole exercise.
I was following most of that on LKML as it happened, and the way that I saw it was that a guy testing RSDL was reporting the fact that his X server now got 25% CPU instead of 75% as a regression.
Con did respond. He said the scheduler was fair, that was the design, and that he (the tester) could renice X to -10 or -15 if he wished.
I don't see how else Con could respond to that. RSDL was supposed to be fair. Giving X 75% isn't fair. There's just no way to resolve those two things.
That's just the sort of approach which created trouble for SD/RSDL. If people see regressions with their workloads, stamping a "100% certified fair!" label on it will not make them feel better about it. You have to address these problems; if you are unwilling to do so, your code will not make it into the kernel.
CFS is also a "fair" scheduler, but it has not drawn the same sort of complaints - though it will be interesting to see what happens as the testing community gets larger. As I understand it, the CFS brand of "fairness" takes a longer-term view, allowing tasks to get their "fair" share even if they sleep from time to time. That helps to prevent the sort of regressions seen with SD.
The real key, though, is what happens when things go wrong. There will certainly be people reporting scheduler issues over the 2.6.23 cycle. Ingo and the other CFS hackers could certainly dismiss them as "entirely silly," seeing as the scheduler is "completely fair," after all. But they won't do that. Instead, they will do their best to understand and solve the problems. That is why CFS is in the kernel, and SD is not.
Of course it is not unbiased reporting - we pay good money for Jon's
opinion. That's what an editorial is. Jon explained quite clearly, and
even-handedly in my opinion, why he came to the conclusion he did. Agree
with him or not (and I certainly don't always), but criticising his
reporting because you don't agree with his conclusions is not helping
anyone.
I find that hard to believe if we're talking about the same person
complaining. His problem can never be fixed by CFS, unless CFS would automatically renice his X, or would introduce unfairness in some other way.
CFS will cause regressions, because it doesn't do unfair scheduling -
which is what users have come to expect. There is no way around it.
Besides, CFS does worse on 3D gaming compared to SD and mainline, and ppl
will complain about that as well.
Note that I'm happy CFS got in mainline, as far as I can tell, it has a
superior design. It's just that the mentioned reasoning for the choice
doesn't work for me...
Maybe this is worth reading, if you didn't already.
http://osnews.com/story.php/18350/Linus-On-CFS-vs.-SD
(don't forget the OTHER SIDE of the story ;-) )
As I understand it, the CFS brand of "fairness" takes a longer-term view, allowing tasks to get their "fair" share even if they sleep from time to time.
Correct, and i call this concept "sleeper fairness".
The simplest way to describe it is via a specific example: on my box if i run glxgears, it uses exactly 50% of CPU time. If i boot into the SD scheduler, and start a CPU hog in parallel to the glxgears task, the two tasks share the CPU: the CPU hog will get ~60% of CPU time, glxgears will get ~40% of CPU time. If i boot CFS, both tasks will get exactly 50% of CPU time.
I've described this mechanism and other internal details in another thread already, but i think it makes sense to paste that reply here too:
wait_runtime is a scheduler-internal metric that shows how much out-of-balance this task's execution history is compared to what execution time it could get on a "perfect, ideal multi-tasking CPU". So if wait_runtime gets negative that means it has spent more time on the CPU than it should have. If wait_runtime gets positive that means it has spent less time than it "should have". CFS sorts tasks in an rbtree with this value as a key and uses this value to choose the next task to run. (with lots of additional details - but this is the raw scheme.) It will pick the task with the largest wait_runtime value. (i.e. the task that is most in need of CPU time.)
This mechanism and implementation is basically not comparable to SD in any way, the two schedulers are so different. Basically the only common thing between them is that both aim to schedule tasks "fairly" - but even the definition of "fairness" is different: SD strictly considers time spent on the CPU and on the runqueue, CFS takes time spent sleeping into account as well. (and hence the approach of "sleep average" and the act of "rewarding" sleepy tasks, which was the main interactivity mechanism of the old scheduler, survives in CFS. Con was fundamentally against
sleep-average methods. CFS tried to be a no-tradeoffs replacement for the existing scheduler and the sleeper-fairness method was key to that.)
This (and other) design differences and approaches - not surprisingly - produced two completely different scheduler implementations. Anyone who has tried both schedulers will attest to the fact that they "feel" differently and behave differently as well.
Due to these fundamental design differences the data structures and algorithms are necessarily very different, so there was basically no opportunity to share code (besides the scheduler glue code that was already in sched.c), and there's only 1 line of code in common between CFS and SD (out of thousands of lines of code):
/*
 * This idea comes from the SD scheduler of Con Kolivas:
 */
static inline void sched_init_granularity(void)
{
unsigned int factor = 1 + ilog2(num_online_cpus());
This boot-time "ilog2()" tuning based on the number of CPUs available is a tuning approach i saw in SD and i asked Con whether i could use it in CFS. (to which Con kindly agreed.)
The practical difference is noticeable for something like the X server - Xorg is often a "sleepy" process but it's important that when it runs it gets its own maximum share of CPU time. With a "runners only" fairness model it will receive less CPU time than with a "sleepers considered too" (CFS) fairness model.
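The wait_runtime bookkeeping described above can be modelled in a few lines of C. The toy below is in no way CFS itself (the real scheduler keys an rbtree on this value and handles sleepers, priorities and much more); it only shows the basic rule: the task that runs pays for its timeslice, everyone else accumulates credit, and the neediest task goes next.

	#include <stdio.h>

	struct task {
		const char *name;
		long wait_runtime;	/* positive: owed CPU; negative: has had too much */
	};

	/* Pick the task most in need of CPU time.  The real CFS keeps
	 * tasks in an rbtree keyed on this value so the pick is cheap; a
	 * linear scan is used here only to keep the sketch short. */
	static struct task *pick_next(struct task *tasks, int n)
	{
		struct task *best = &tasks[0];
		int i;

		for (i = 1; i < n; i++)
			if (tasks[i].wait_runtime > best->wait_runtime)
				best = &tasks[i];
		return best;
	}

	/* Run one timeslice: the chosen task pays for the time it got
	 * beyond its "ideal" 1/n share, everyone else is credited with
	 * the share it missed out on. */
	static void run_slice(struct task *tasks, int n, long slice)
	{
		struct task *next = pick_next(tasks, n);
		int i;

		for (i = 0; i < n; i++) {
			if (&tasks[i] == next)
				tasks[i].wait_runtime -= slice - slice / n;
			else
				tasks[i].wait_runtime += slice / n;
		}
		printf("ran %s\n", next->name);
	}

	int main(void)
	{
		struct task tasks[2] = { { "cpu-hog", 0 }, { "glxgears", 0 } };
		int i;

		for (i = 0; i < 6; i++)
			run_slice(tasks, 2, 1000);
		return 0;
	}

With both tasks runnable they simply alternate and each ends up with half of the CPU, which is the 50%/50% glxgears result described above; the sleeper-fairness part, crediting tasks for time spent sleeping, is what this toy leaves out.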
SwPr didn't get in because some very rich employers don't want to destabilize their NUMA monsters? Er, doesn't this sound really unlikely?
The article mentions that Linus and some others feel that SwPr is just papering over a more fundamental problem. So, why not spend time trying to fix the fundamental problem before hacking around it?
That's a rhetorical question... there could be a number of reasons: the root cause is too complex to be understood, or the proper fix is worse than SwPr, etc. I just think that Linus & crew would like to see someone attempt to fix the real problem before resorting to a SwPr hack. If a proper fix is attempted and proves unwieldy, then I bet SwPr will jump a lot higher on a number of kernel devs' priority queues.
> ...entirely silly. Complaining about the fairness of a fair scheduler?
They weren't complaining about the fairness, they were complaining about the quality. Is a 100% fair scheduler actually the best scheduler? Probably not.
The thing this complaint shows is that a 'fair scheduler' in itself is not good enough for the end-user desktop.
If you have an application APP that is important to you, you renice it so it has lots of CPU, fine; but then say that this application sends a lot of work to the X server (it could be any server, really). Then a kind of 'priority inversion' happens where APP is slowed down because the X server doesn't have a big enough CPU share.
It's quite difficult to solve. The only way would be to have some way to transfer the 'CPU token' that APP has to the server it is asking to do work on its behalf. If it is a multi-threaded server which uses a different thread for each client, then maybe the kernel could understand what is happening and boost the corresponding server thread's priority accordingly. But in a non-threaded server, I don't see how it could be solved: even if APP says "give my 'CPU token' to server X", how would server X be able to report or understand that it is currently supposed to be working for client APP and not for another client?
You know things are bad when Corbet starts breaking out the Beckett puns.
> A concern which has been raised a few times is that the morning swap-in
> problem may well be a sign of a larger issue within the virtual memory
> subsystem, and that prefetch mostly serves as a way of papering over that
> problem. And it fails to even paper things completely, since it brings back
> some pages from swap, but doesn't (and really can't) address file-backed
> pages which will also have been pushed out. The conclusion that this
> reasoning leads to is that it would be better to find and fix the real
> problem rather than hiding it behind prefetch.
My interpretation: there are 3 classes of systems:
1. Those that have loads of memory and very few memory-hungry programs ever running, and as such never write anything to swap. Swap prefetch is of course a no-op on such systems.
2. Those that have loads of memory and very few memory-hungry programs ever running, but where the swap is being written anyway. Swap prefetch improves the performance of such a system, but developers would ask, "why the hell did it happen in the first place?"
3. Those that do not have enough memory to run all the memory-hungry programs that ever run, where naturally swap prefetch does help, as expected.
So because of the unclear reason that (2) happens, and perhaps because the systems in class (1) might see regressions and need to manually turn off prefetching via a kernel option (never mind that no such regression is currently known after 18 months of testing), the (typically desktop) systems in (3) have to suffer? After all, prefetch is not something that applies only to swap. A block I/O system without prefetch also has horrible performance, and we have had readahead there for ages. So why does swap have to be treated differently? Should we really expect an application that has been swapped out (for whatever reason) to perform much worse than an application being loaded for the first time, just because prefetching happens when the application is first loaded but not when it is pushed out to swap under uncontrollable memory pressure? Is it such a surprise that swap prefetch is needed anyway?
And what is the consequence for (2) if swap can be prefetched? Does it mean there is no way to detect that such a problem exists? Of course not: the kernel keeps page fault counters, if developers care to write a script to collect the stats for each running process (a sketch of what that could look like appears below). The only thing that will happen is that people will no longer be so unhappy about the problem, because it doesn't hit their bottom line: enjoying hot coffee while working in the morning. And the end result is, unsurprisingly, less attention to the problem. But... is it really such a bad thing, after all, that some hard problem can be put aside because it no longer causes serious user dismay?
Perhaps I'm understanding something really wrong.
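For what it's worth, the "script" mentioned above could be as small as the sketch below: walk /proc and report each process's major fault count (field 12 of /proc/<pid>/stat), the faults that actually had to wait for the disk. Sampling it before and after the morning wake-up would show who is paying for the overnight swap-out.

	#include <stdio.h>
	#include <string.h>
	#include <ctype.h>
	#include <dirent.h>

	int main(void)
	{
		DIR *proc = opendir("/proc");
		struct dirent *de;
		char path[64], line[1024];

		if (!proc) {
			perror("/proc");
			return 1;
		}
		while ((de = readdir(proc)) != NULL) {
			FILE *f;
			char comm[64] = "?";
			unsigned long majflt;
			char *p;

			if (!isdigit((unsigned char)de->d_name[0]))
				continue;	/* not a process directory */
			snprintf(path, sizeof(path), "/proc/%s/stat", de->d_name);
			f = fopen(path, "r");
			if (!f)
				continue;	/* the process went away */
			if (!fgets(line, sizeof(line), f)) {
				fclose(f);
				continue;
			}
			fclose(f);

			/* The command name sits in parentheses and may contain
			 * spaces; skip past the closing one before parsing the
			 * numeric fields that follow. */
			p = strrchr(line, ')');
			if (!p)
				continue;
			if (sscanf(p + 2, "%*c %*d %*d %*d %*d %*d %*u %*u %*u %lu",
				   &majflt) != 1)
				continue;
			sscanf(line, "%*d (%63[^)])", comm);
			printf("%5s %-20s majflt=%lu\n", de->d_name, comm, majflt);
		}
		closedir(proc);
		return 0;
	}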
Thanks for explaining both sides of this issue so clearly. I have not been following this and then read the interview about Con leaving kernel dev. I was wondering what the other side of the story was. Great job!