Glibc wrappers for (nearly all) Linux system calls
By Jonathan Corbet, August 20, 2015
The GNU C Library (glibc) is a famously conservative project. In the past,
that conservatism created a situation where there is no way to directly
call a number of Linux system calls from a glibc-using program. As glibc
has relaxed a bit in recent years, its developers have started to
reconsider adding wrapper functions for previously inaccessible system
calls. But, as the discussion shows, adding these wrappers is still not as
straightforward as one might think.
A C programmer working with glibc now would look in vain for a
straightforward way to invoke a number of Linux system calls, including
futex(), gettid(), getrandom(), renameat2(), execveat(), bpf(), kcmp(),
seccomp(), and a number of others. The only way to get at these
system calls is via the syscall() function. Over the years, there have
been requests to add wrappers for a number of these system calls; in some
cases, such as gettid()
and futex(),
the requests were summarily rejected by the (at-the-time) glibc maintainer
in fairly typical style. More recently these requests have been reopened
and others have been entertained, but there have been no system-call
wrappers added since glibc 2.15, corresponding roughly to the 3.2 kernel.
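As a minimal sketch of what that looks like in practice, the following reaches gettid() through syscall(). The helper name my_gettid() is hypothetical, chosen so that it cannot clash with anything the C library itself might provide.

```c
#define _GNU_SOURCE          /* exposes syscall() in <unistd.h> */
#include <sys/syscall.h>     /* SYS_gettid */
#include <sys/types.h>       /* pid_t */
#include <unistd.h>          /* syscall(), getpid() */

/* Hypothetical helper standing in for the missing gettid() wrapper. */
static pid_t my_gettid(void)
{
    /* syscall() returns long; a thread ID always fits in pid_t. */
    return (pid_t)syscall(SYS_gettid);
}
```

In the initial thread of a process, the value returned should match getpid(); in any other thread it will differ.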
On the face of it, adding a new system-call wrapper should be a simple
exercise. The kernel has already defined an API for the system call, so it
is just a matter of writing a simple function that passes the caller's
arguments through to the kernel implementation. Things quickly get more
complicated than that, though, for a number of reasons, but they all come
down to one root cause: glibc is not just a wrapper interface for
kernel-supplied functionality. Instead, it provides a (somewhat
standard-defined) API that is meant to be consistent and independent of any
specific operating system.
There are provisions for adding kernel-specific functions to glibc now;
those functions will typically fail (with errno set to
ENOSYS) when called on a kernel that does not support them.
Examples of such functions include the Linux-specific epoll_wait()
and related system calls. As a general rule, though, the glibc developers,
as part of their role maintaining the low-level API for the GNU system,
would like to avoid kernel-specific additions.
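A caller that wants to cope with older kernels has to check for that ENOSYS failure explicitly. The sketch below shows one way to do so, using getrandom() via syscall() and falling back to /dev/urandom. The function name fill_random() and the fallback strategy are assumptions made for illustration; they are not taken from glibc.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: fill buf with len random bytes.
 * Returns 0 on success, -1 with errno set on failure. */
static int fill_random(void *buf, size_t len)
{
    unsigned char *p = buf;
    size_t done = 0;

#ifdef SYS_getrandom            /* headers may predate the system call */
    while (done < len) {
        long n = syscall(SYS_getrandom, p + done, len - done, 0);
        if (n > 0) { done += (size_t)n; continue; }
        if (n < 0 && errno == EINTR) continue;   /* interrupted: retry */
        if (n < 0 && errno == ENOSYS) break;     /* old kernel: fall back */
        if (n == 0) errno = EIO;                 /* unexpected empty read */
        return -1;
    }
    if (done == len)
        return 0;
#endif

    /* Fallback path for kernels (or headers) without getrandom(). */
    int fd = open("/dev/urandom", O_RDONLY);
    if (fd < 0)
        return -1;
    while (done < len) {
        ssize_t n = read(fd, p + done, len - done);
        if (n > 0) { done += (size_t)n; continue; }
        if (n < 0 && errno == EINTR) continue;
        int saved = (n == 0) ? EIO : errno;      /* keep errno across close() */
        close(fd);
        errno = saved;
        return -1;
    }
    close(fd);
    return 0;
}
```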
This concern has had the effect of keeping a lot of Linux system-call
wrappers out of the GNU C Library. It is not necessarily that the glibc
developers do not want that functionality, but figuring out how a new
function would fit into the overall GNU API is not a straightforward task.
The ideal interface may not (from the glibc point of view) be the one exposed
by the Linux kernel, so another may need to be designed. Issues like error
handling, thread safety, support on non-Linux systems, and POSIX-thread
cancellation points can
complicate things considerably. In many cases, it seems that few
developers have wanted to run the gauntlet of getting new system-call
wrappers into the library, even if the overall attitude toward such
wrappers has become markedly more friendly in recent years.
Back in May 2015, Joseph Myers proposed
relaxing the rules just a little bit, at least in cases when the
functionality provided by a wrapper might be of general interest. In such
cases, Joseph suggested, there would be no immediate need to provide support
for other operating-system kernels unless somebody found the desire and the
time to do the work.
Roland McGrath is, by his own admission,
the hardest glibc developer to convince about the value of adding
Linux-specific system calls to the library. He still does not see a clear
case for adding many Linux system-call wrappers to the core library; it is
only clear, he said, when the system call is to be a part of the GNU API:
My top concern is adding cruft to the core libc ABIs. That means
specifically symbols in the shared objects for libc, libpthread,
librt, libdl, libm, and libutil.
I propose that we rule out adding any symbols to the core libc ABIs
that are not entering the OS-independent GNU API.
Roland does not seem to believe that glibc should entirely refuse to
support system calls that don't meet the above criterion, though. Instead,
he suggested creating another library specifically for them. It would be
called something like "libinux-syscalls" (so that one would link with
"-linux-syscalls"). Functions relegated to this library should be
simple wrappers, without internal state, with the idea that supporting
multiple versions of the library would be possible.
There was some discussion on the details of this idea, but the core of it
seems to be relatively uncontroversial. Also uncontroversial is the idea
that glibc need not provide wrappers for system calls that are obsolete,
that cannot be used without interfering with glibc
(set_thread_area()
is an example), or those that are expected to have
a single caller (such as create_module()). So Carlos O'Donell has
proposed a set of rules that would clear
the way for the immediate addition of operating-system-independent system
calls into the core and the addition of a system-dependent library for the
rest.
Of course, "immediate" is a relative term. Any system-call wrappers will
still need to be properly implemented and documented, with test cases and
more. There is also, in some cases, the more fundamental question of what
the API should look like. Consider the case of the
futex()
system call, which provides access to a fast mutual-exclusion mechanism.
As defined by the kernel, futex() is a multiplexer interface, with
a single entry point providing access to a range of different operations.
Torvald Riegel made the case that exposing
this multiplexer interface would do a disservice to glibc users:
Keeping the multiplexing is bad for users. Can you tell me
off-hand what goes in "uaddr2", "val", or "val3" for all the ops?
Is it easy to remember based on the function signature? Can you
remember in which cases "timeout" is actually "val2" and not a
pointer but cast to uint32_t? So are we going to expect users to
cast uint32_t's to a pointer to call one of the operations and
consider that a useful API design? It's a nice way to potentially
trigger compiler warnings though.
He proposed exposing a different API based around several functions with
names like futex_wake() and futex_wait(); he also posted
a patch implementing this interface.
Joseph, while not disagreeing with that
interface, insisted that the C library should provide direct access to
the raw system call, saying: "The fact that, with hindsight, we might
not have designed an API the way it was in fact designed does not mean we
should embed that viewpoint in the choice of APIs provided to
users." In the end, the two seemed to agree that both types of
interface should, in some cases, be provided. If the C library can provide
a useful higher-level interface, that may be appropriate to add, but more
direct access to the system call as provided by the kernel should be there
too.
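To make the distinction concrete, here is a rough sketch of the two layers: a raw pass-through to the multiplexer, and two narrow functions built on top of it. The names futex_wait() and futex_wake() come from the discussion, but the signatures shown here (and the raw_futex() helper) are assumptions for illustration, not the interface from the posted patch.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>     /* FUTEX_WAIT, FUTEX_WAKE */
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>            /* struct timespec */
#include <unistd.h>

/* Raw access: every operation goes through the same six arguments,
 * and the meaning of each one depends on op. */
static long raw_futex(uint32_t *uaddr, int op, uint32_t val,
                      const struct timespec *timeout,
                      uint32_t *uaddr2, uint32_t val3)
{
    return syscall(SYS_futex, uaddr, op, val, timeout, uaddr2, val3);
}

/* Sleep while *uaddr still holds expected; timeout may be NULL.
 * Returns 0 when woken, -1 with errno set otherwise. */
static int futex_wait(uint32_t *uaddr, uint32_t expected,
                      const struct timespec *timeout)
{
    return (int)raw_futex(uaddr, FUTEX_WAIT, expected, timeout, NULL, 0);
}

/* Wake up to nwaiters threads blocked on uaddr.
 * Returns the number of threads woken, or -1 with errno set. */
static int futex_wake(uint32_t *uaddr, int nwaiters)
{
    return (int)raw_futex(uaddr, FUTEX_WAKE, (uint32_t)nwaiters, NULL, NULL, 0);
}
```

The narrow functions carry only the arguments that are meaningful for their operation, which is the point Torvald was making; the raw function remains available for operations that nobody has wrapped.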
The end result of all this is that we are likely to see a break in the
logjam that has kept new system-call wrappers out of glibc. Some new
wrappers could even conceivably show up in the 2.23 release, which can be
expected sometime around February 2016. Even if the attitude and rules
have changed, though, this is still glibc we are talking about, so catching
up with the kernel may take a while yet. But one can take comfort in the
fact that a path is now visible, even if it may yet be a slow one.
The fundamental problem is that the errno convention has outlived its usefulness. The Linux kernel calls return an error code directly, but to be POSIX-compatible, glibc has to squirrel these away in errno. Which requires all this complicated wrapper code, as well as a whole extra mechanism to make errno thread-safe.
I recently hit the situation where the write(2) call didn’t write all the bytes I gave it to disk, with no error indication in errno. The man page only says this can happen
... if, for example, there is insufficient space on the underlying physical medium, or the RLIMIT_FSIZE resource limit is encountered (see setrlimit(2)), or the call was interrupted by a signal handler after having written less than count bytes.
In other words, you don’t know why it happened.
struct writeret { size_t n; int err; }; /* "errno" is a macro, so it cannot be used as a member name */
struct writeret writex(int, const void *, size_t);
Modern ABIs would return the values in registers, and an optimizing compiler could elide the existence of an independent struct writeret object altogether. That kernels still only return a single integer value is more about not evolving with the times. Pre-ANSI C didn't permit passing compound objects by value, only pointers, so ABIs and compilers didn't have to consider optimizing that case. In 1989 ANSI C changed that to permit passing structs and unions, but not arrays. For a long time ABIs and compilers would always pass the values on the stack, and it was considered poor practice to make use of the feature in performance-critical code. But modern ABIs (e.g. AMD64) can pass the member values through registers. So there's no cost to using smallish structs as function parameters or return values.
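For illustration only, a user-space version of such a writex() could be layered over write(2) along these lines. The error member is named err here because errno is a macro; everything else about the implementation is an assumption, since only the prototype was given.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

struct writeret {
    size_t n;    /* number of bytes actually written */
    int    err;  /* 0 on success, otherwise an errno value */
};

/* Sketch: report the byte count and the error code together, by value. */
static struct writeret writex(int fd, const void *buf, size_t count)
{
    struct writeret r = { 0, 0 };
    ssize_t n = write(fd, buf, count);

    if (n < 0)
        r.err = errno;       /* capture the error before anything clobbers it */
    else
        r.n = (size_t)n;
    return r;
}
```

Built on top of write(2), this cannot report a byte count and an error in the same call; that would need kernel support for returning both values.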
The write design may not be wonderful, but the standard procedure upon a short write is to try again with the remainder of the data to be written. Then you will get the real reason in errno (unless it was just a temporary condition the first time).
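That retry procedure is usually wrapped in a small loop like the one sketched here. The name write_all() is hypothetical, and treating a zero-byte write as EIO is a choice made for this example to avoid spinning forever.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Keep calling write() until all count bytes are written or a real
 * error shows up. Returns 0 on success, -1 with errno set on failure. */
static int write_all(int fd, const void *buf, size_t count)
{
    const unsigned char *p = buf;

    while (count > 0) {
        ssize_t n = write(fd, p, count);
        if (n < 0) {
            if (errno == EINTR)
                continue;        /* interrupted before writing: retry */
            return -1;           /* the real reason is now in errno */
        }
        if (n == 0) {
            errno = EIO;         /* no progress: give up rather than loop */
            return -1;
        }
        p += n;                  /* short write: advance past what went out */
        count -= (size_t)n;
    }
    return 0;
}
```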
I partition the negative range using a simple prefix system, which makes it easy to mix-and-match components. For example, from my DNS library,
/* outer parentheses keep the macro safe inside larger expressions */
#define DNS_EBASE (-(('d' << 24) | ('n' << 16) | ('s' << 8) | 64))
enum dns_errno {
    DNS_ENOBUFS = DNS_EBASE,
    DNS_EILLEGAL,
    DNS_EORDER,
    DNS_ESECTION,
    DNS_EUNKNOWN,
    DNS_EADDRESS,
    DNS_ENOQUERY,
    DNS_ENOANSWER,
    DNS_EFETCHED,
    DNS_ESERVICE, /* EAI_SERVICE */
    DNS_ENONAME, /* EAI_NONAME */
    DNS_EFAIL, /* EAI_FAIL */
    DNS_ELAST,
}; /* dns_errno */
/* for documentation only; will always be type int */
#define dns_error_t int
...
dns_error_t dns_res_submit(struct dns_resolver *, const char *, enum dns_type, enum dns_class);
struct dns_packet *dns_res_fetch(struct dns_resolver *, dns_error_t *);
Helpfully, strerror must always return a valid string for all integer values, even for unknown values.
The strerror function maps the number in errnum to a message string. Typically,
the values for errnum come from errno, but strerror shall map any value of type
int to a message.
C11 (N1570) 7.24.6.2p2
So if an application-specific value accidentally leaks to a component that doesn't understand the protocol it's relatively benign. In fact, most strerror implementations will include the value in the message, so you'll actually get useful output if it gets passed to strerror. But usually each component will define its own strerror interface that forwards to strerror or a sub-component's strerror.
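A forwarding strerror of that kind might look roughly like the following. The "app" prefix, the APP_* codes, and the message strings are all made up for this sketch; they are not taken from the DNS library mentioned above.

```c
#include <string.h>          /* strerror() */

/* Hypothetical component prefix, built the same way as DNS_EBASE. */
#define APP_EBASE (-(('a' << 24) | ('p' << 16) | ('p' << 8) | 64))

enum app_errno {
    APP_ENOBUFS = APP_EBASE,
    APP_EILLEGAL,
    APP_ELAST                /* one past the last valid code */
};

/* Map this component's codes to messages; forward everything else. */
static const char *app_strerror(int error)
{
    static const char *const messages[] = {
        "app: no buffer space",   /* APP_ENOBUFS */
        "app: illegal data",      /* APP_EILLEGAL */
    };

    if (error >= APP_EBASE && error < APP_ELAST)
        return messages[error - APP_EBASE];

    /* Not ours: a system errno value or another component's code. */
    return strerror(error);
}
```

A component that embeds a sub-component would forward to that sub-component's strerror function in place of the final strerror() call.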
It's the most useful and practical error reporting method I've found, at least for C code, and especially for C libraries. Of course, sometimes a routine is better defined (easier to use, more intuitive) by returning the error value through a reference, rather than as the return value. But that's just a variation on the theme. Unless you know you'll always operate in a closed software ecosystem, every other scheme is just chasing a dragon like other classic rookie mistakes: constantly writing configuration parsers, relying too heavily on malloc/free replacements, and reinventing logging instead of using stderr or perhaps syslog. Billions of man hours have been wasted down those rabbit holes.
It would be called something like "libinux-syscalls" (so that one would link with "-linux-syscalls").
Umm, I have a problem with the "libinux" part of that. Granted, the associated command-line linker directive may be easier to understand, but everyone[1] knows that the shared library name to link is the same as the actual .so file, minus that extension and also without the leading "lib" (immediately following the -l switch, of course).
Why not just call it "liblinux-syscalls" and tell programmers to link it with -llinux-syscalls? Having two letter l's isn't so bad; hasn't anyone ever used -llzma, -llcms, or -llo?
I think Roland's idea is fine, just not the part about naming a proposed linked library something awkward and possibly confusing programmers into thinking it is a new and different command-line option.
[1] I'm making a silly generalization here, of course.
It's clearly inspired by libiberty. There is a precedent for this!
Well then, why stop here? Why not get creative with DSO names? I propose:
- libove.so - to convey an intense emotional attraction to your object code.
- libick.so - to moisten the object file with your tongue.
- libeak.so - Converts all calls to free(3) (and similar) to NO-OPs. Slight performance gain.
- liboop.so - Enable your program to demonstrate the halting problem.
- Combined linking of libeak.so and liboop.so makes for a quick 'n' easy way to test the kernel's OOM mechanism.
(Just kidding.) ;-)
...possibly confusing programmers into thinking it is a new and different command-line option.
But everyone[1] knows that an argument starting with -l refers to a library, so there's no possibility of confusion!
[1] I'm making a silly generalization as well, of course. :)
As gevaerts points out, there is definite precedent for this, and once noticed, it's never forgotten. And the consequences of getting it wrong are: link fails, need to link again. Not exactly earth-shattering. And the precedent was even associated with the same project: libiberty was what you used when you wanted to take advantage of some advanced glibc features on another vendor's libc.
I'd hope that this would only affect a pretty tiny percentage of code, in any case. It might turn out to be a very important tiny percentage—critical daemons and the like—but I hope people haven't given up on the idea of writing reasonably portable code in general. Or of not making their tangled web of #ifdefs any more tangled than they really need to be!