As you said, some users need to profile the code they develop. They may
indeed want to avoid running their tests with root privileges, and that's
where the sysctl is useful - they can disable the feature if they want to.
A few notes now. The rdpmc instruction can also be used for side-channel
attacks, but we don't currently enable it, so it does not matter.
Regarding serialization, I may not have been clear enough either. rdtsc is
not serializing, which means that it does not wait for the previous
instructions to execute completely before being executed. To compensate for
that, the user needs to first execute a serializing instruction like cpuid,
and put the rdtsc right after it. With the fault approach, serialization is
ensured, because 'iret' is used when returning to userland, and it is
serializing. So we have an 'iret+rdtsc', which has the same effect as
'cpuid+rdtsc'.
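
For reference, the 'cpuid+rdtsc' pattern can be sketched as below. This is
only a minimal illustration, assuming GCC or Clang on x86; the helper name
rdtsc_serialized() is made up for this example and is not an existing API.

```c
#include <stdint.h>

/*
 * Hypothetical helper (name invented for this sketch): execute cpuid,
 * which is serializing, and read the TSC right after it.
 * Assumes GCC/Clang inline assembly on x86.
 */
static inline uint64_t
rdtsc_serialized(void)
{
	uint32_t eax = 0, ebx, ecx = 0, edx;
	uint32_t lo, hi;

	/* cpuid: all previous instructions complete before it retires. */
	__asm__ __volatile__("cpuid"
	    : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
	    :
	    : "memory");
	(void)ebx;
	(void)edx;

	/* rdtsc returns the counter in edx:eax. */
	__asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));

	return ((uint64_t)hi << 32) | lo;
}
```

Both asm statements are volatile so that the compiler keeps them, and keeps
them in this order.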
Also, a detail about my remark on accuracy. The basic use case for rdtsc is
the following:
	start = rdtsc
	work
	end = rdtsc
	elapsed = end - start
Here, we will fault on the first rdtsc; so the kernel will be entered, and
many cycles will be consumed there. But it does not matter, since the first
rdtsc is used as the starting point, and we don't care about adding cycles
before it. Therefore, the number of elapsed cycles is the same, with and
without the feature.
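
The arithmetic can be checked with a toy model. This is a sketch only: the
cycle costs are made-up numbers, every name is hypothetical, and the model
assumes that only the first rdtsc traps, as in the use case above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Made-up costs, for illustration only. */
enum { FAULT_COST = 5000, RDTSC_COST = 30 };

/* Hypothetical model of a CPU: a cycle counter and a "next rdtsc traps" flag. */
struct sim_cpu {
	uint64_t cycles;	/* simulated TSC */
	bool tsc_faults;	/* true while rdtsc traps into the kernel */
};

static uint64_t
sim_rdtsc(struct sim_cpu *cpu)
{
	if (cpu->tsc_faults) {
		/* Cycles spent in the kernel, before the counter is read. */
		cpu->cycles += FAULT_COST;
		/* Assumption of this model: later reads do not trap. */
		cpu->tsc_faults = false;
	}
	cpu->cycles += RDTSC_COST;
	return cpu->cycles;
}

/* start = rdtsc; work; end = rdtsc; elapsed = end - start */
static uint64_t
measure(bool feature_enabled, uint64_t work)
{
	struct sim_cpu cpu = { .cycles = 1000, .tsc_faults = feature_enabled };
	uint64_t start, end;

	start = sim_rdtsc(&cpu);
	cpu.cycles += work;	/* the code being profiled */
	end = sim_rdtsc(&cpu);

	return end - start;
}
```

In this model measure(true, n) and measure(false, n) return the same value,
because FAULT_COST is added before 'start' is sampled.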
Finally, I'll add that there are other mitigations available for rdtsc, which
consist, for example, of adding a small random delta directly to the counter
in order to fuzz the results. But then there is the problem of how big this
delta needs to be: big enough to mitigate side-channels, yet small enough to
still give relevant - if slightly inaccurate - information back to userland.
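
To make the trade-off concrete, here is a rough sketch of such a fuzzing
scheme. It is not taken from any existing implementation: the function names
and the max_delta parameter are hypothetical, and rand() only stands in for
whatever random source a real mitigation would use.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical: return the real counter value plus a random delta in
 * [0, max_delta). max_delta must be non-zero. rand() is a placeholder,
 * not a suitable random source for an actual mitigation.
 */
static uint64_t
fuzz_tsc(uint64_t real, uint64_t max_delta)
{
	return real + (uint64_t)rand() % max_delta;
}

/*
 * What userland would compute from two fuzzed reads. The result is
 * signed, because with a short interval the fuzzed 'end' may end up
 * below the fuzzed 'start'.
 */
static int64_t
fuzzed_elapsed(uint64_t start, uint64_t end, uint64_t max_delta)
{
	uint64_t fs = fuzz_tsc(start, max_delta);
	uint64_t fe = fuzz_tsc(end, max_delta);

	return (int64_t)fe - (int64_t)fs;
}
```

With this scheme the error on 'elapsed' always stays strictly below
max_delta in absolute value, so max_delta is directly the knob between
mitigation strength and accuracy.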