レナート   Wunschkonzert, Ponyhof und Abenteuerspielplatz   ﻟﻴﻨﺎﺭﺕ

Fri, 07 Oct 2011

A Plumber's Wish List for Linux

Here's a mail we just sent to LKML, for your consideration. Enjoy:

Subject: A Plumber’s Wish List for Linux

We’d like to share our current wish list of plumbing layer features we
are hoping to see implemented in the near future in the Linux kernel and
associated tools. Some items we can implement on our own, others are not
our area of expertise, and we will need help getting them implemented.

Acknowledging that this wish list of ours only gets longer and not
shorter, even though we have implemented a number of other features on
our own in the previous years, we are posting this list here, in the
hope of finding some help.

If you happen to be interested in working on something from this list or
able to help out, we’d be delighted. Please ping us in case you need
clarifications or more information on specific items.

Thanks,
Kay, Lennart, Harald, in the name of all the other plumbers


And here’s the wish list, in no particular order:

* (ioctl based?) interface to query and modify the label of a mounted
FAT volume:
A FAT label is implemented as a hidden directory entry in the file
system, which needs to be renamed when changing the file system label.
This is impossible to do from userspace without unmounting. Hence we’d
like to see a kernel interface that is available on the mounted file
system mount point itself. Of course, bonus points if this new interface
can be implemented for other file systems as well, and also covers fs
UUIDs in addition to labels.
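
To illustrate what "hidden directory entry" means here, the following is a minimal sketch that scans a raw FAT root directory for the entry carrying the volume-label attribute. It assumes the classic 32-byte directory entry layout (11-byte name, attribute byte at offset 11) and an ASCII label; the helper names find_volume_label and make_entry are made up for this example and the sample data is synthetic. It only reads bytes handed to it, which is exactly why it does not help with a mounted volume.

```python
from typing import Optional

ATTR_VOLUME_ID = 0x08   # attribute bit: this entry is the volume label
ATTR_LONG_NAME = 0x0F   # attribute combination used by long-file-name entries
ENTRY_SIZE = 32         # every FAT directory entry is 32 bytes long


def find_volume_label(root_dir: bytes) -> Optional[str]:
    """Return the label found in a raw FAT root directory, or None."""
    for offset in range(0, len(root_dir) - ENTRY_SIZE + 1, ENTRY_SIZE):
        entry = root_dir[offset:offset + ENTRY_SIZE]
        first, attr = entry[0], entry[11]
        if first == 0x00:          # end-of-directory marker
            break
        if first == 0xE5:          # deleted entry
            continue
        if attr == ATTR_LONG_NAME:  # long-file-name fragment, not a label
            continue
        if attr & ATTR_VOLUME_ID:
            # The 11-byte name field is padded with spaces.
            # Assumption: the label is plain ASCII (real volumes may use an OEM code page).
            return entry[:11].decode("ascii", errors="replace").rstrip(" ")
    return None


def make_entry(name: str, attr: int) -> bytes:
    """Build a synthetic 32-byte entry, for demonstration purposes only."""
    return name.encode("ascii").ljust(11, b" ")[:11] + bytes([attr]) + bytes(20)


if __name__ == "__main__":
    demo = make_entry("README  TXT", 0x20) + make_entry("MYUSBSTICK", ATTR_VOLUME_ID)
    print(find_volume_label(demo))
```

Changing the label means rewriting that entry on disk, which userspace cannot safely do behind the back of a mounted file system.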

* CPU modaliases in /sys/devices/system/cpu/cpuX/modalias:
Useful to allow module auto-loading of e.g. cpufreq drivers and KVM
modules. Andi Kleen has a patch to create the alias file itself. CPU
‘struct sysdev’ needs to be converted to ‘struct device’ and a ‘struct
bus_type cpu’ needs to be introduced to allow proper CPU coldplug event
replay at bootup. This is one of the last remaining places where
automatic hardware-triggered module auto-loading is not available, and
we’d like to see that fixed so that the numerous ugly userspace
work-arounds achieving the same can go away.

* expose CAP_LAST_CAP somehow in the running kernel at runtime:
Userspace needs to know the highest valid capability of the running
kernel, which right now cannot reliably be retrieved from header files
only. The fact that this value cannot be detected properly right now
creates various problems for libraries compiled on newer header files
which are run on older kernels. They assume capabilities are available
which actually aren’t. Specifically, libcap-ng claims that all running
processes retain the higher capabilities in this case due to the
“inverted” semantics of CapBnd in /proc/$PID/status.
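
As a rough illustration of the mismatch, here is a sketch of the kind of guesswork userspace is left with: it parses the CapBnd bitmask from the text of /proc/$PID/status and takes the highest set bit as a stand-in for the kernel's last capability. This is a heuristic, not a reliable method; it only makes sense for a process whose bounding set was never reduced. The function names and the sample values in the demo are made up for this example.

```python
import re
from typing import List, Optional


def capbnd_mask(status_text: str) -> Optional[int]:
    """Extract the CapBnd bitmask from the text of /proc/$PID/status."""
    match = re.search(r"^CapBnd:\s*([0-9a-fA-F]+)\s*$", status_text, re.MULTILINE)
    return int(match.group(1), 16) if match else None


def guess_last_cap(mask: int) -> int:
    """Heuristic: highest set bit of the bounding set (-1 for an empty mask).

    Only meaningful if the bounding set was never reduced for this process.
    """
    return mask.bit_length() - 1


def phantom_caps(mask: int, header_last_cap: int) -> List[int]:
    """Capability numbers the headers define but that lie above the mask's top bit."""
    return list(range(guess_last_cap(mask) + 1, header_last_cap + 1))


def read_status(pid: str = "self") -> str:
    """Read the status file of a process (requires a mounted /proc)."""
    with open(f"/proc/{pid}/status") as f:
        return f.read()


if __name__ == "__main__":
    # Hypothetical sample: a kernel that knows capabilities 0..31,
    # while the library was built against headers defining 0..35.
    sample = "Name:\tcat\nCapBnd:\t00000000ffffffff\n"
    mask = capbnd_mask(sample)
    print(guess_last_cap(mask), phantom_caps(mask, header_last_cap=35))
```

A library that trusts its compile-time value instead ends up reasoning about the "phantom" capabilities, which is the situation described above.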

* export ‘struct device_type fb/fbcon’ of ‘struct class graphics’:
Userspace wants to easily distinguish ‘fb’ and ‘fbcon’ from each other
without the need to match on the device name.

* allow changing argv[] of a process without mucking with environ[]:
Something like setproctitle() or a prctl() would be ideal. Of course, it
is questionable whether services like sendmail would make use of this,
but on the other hand, for services which fork but do not immediately
exec() another binary, being able to rename these child processes in ps
is important.
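
For comparison, this is roughly what is available today: prctl(PR_SET_NAME) renames the calling thread, but it only changes the short comm name (at most 15 bytes), not argv[]. The sketch below calls it through ctypes; the helper names comm_bytes and set_comm are made up for this example, and the call is simply skipped on non-Linux systems.

```python
import ctypes
import sys

PR_SET_NAME = 15      # option number from <linux/prctl.h>
TASK_COMM_LEN = 16    # kernel limit for comm, including the trailing NUL


def comm_bytes(title: str) -> bytes:
    """Encode a title and truncate it to what the comm field can hold.

    Note: truncation is byte-wise and may cut a multi-byte character.
    """
    return title.encode("utf-8", errors="replace")[:TASK_COMM_LEN - 1]


def set_comm(title: str) -> bool:
    """Rename the calling thread via prctl(PR_SET_NAME); returns success."""
    if not sys.platform.startswith("linux"):
        return False
    libc = ctypes.CDLL(None, use_errno=True)
    result = libc.prctl(PR_SET_NAME, ctypes.c_char_p(comm_bytes(title)), 0, 0, 0)
    return result == 0


if __name__ == "__main__":
    if set_comm("worker: handling a very long request"):
        with open("/proc/self/comm") as f:
            print(f.read().strip())
```

The full command line shown by ps is untouched by this call, which is the gap the wish list item is about.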

* module-init-tools: provide a proper libmodprobe.so:
Early boot tools, installers, driver install disks want to access
information about available modules to optimize bootup handling.

* fork throttling mechanism as basic cgroup functionality that is
available in all hierarchies independent of the controllers used:
This is important to implement race-free killing of all members of a
cgroup, so that cgroup member processes cannot fork faster than a cgroup
supervisor process could kill them. This needs to be recursive, so that
not only a cgroup but all its subgroups are covered as well.
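
The race can be sketched as follows: without a way to stop forking, a supervisor can only read the member list, kill everything it saw, and repeat until the list comes back empty. The code below is a toy model under stated assumptions; kill_all, read_cgroup_procs and FakeGroup are names made up for this example, and FakeGroup merely simulates members that fork just before they die.

```python
from typing import Callable, Iterable, List


def kill_all(read_pids: Callable[[], Iterable[int]],
             kill: Callable[[int], None],
             max_rounds: int = 100) -> bool:
    """Kill every member of a group; True once the group is observed empty."""
    for _ in range(max_rounds):
        pids = list(read_pids())
        if not pids:
            return True
        for pid in pids:
            kill(pid)
    return False  # members kept forking faster than we could kill them


def read_cgroup_procs(path: str) -> List[int]:
    """Read PIDs from a cgroup.procs file (the path is supplied by the caller)."""
    with open(path) as f:
        return [int(line) for line in f if line.strip()]


class FakeGroup:
    """Toy model: a victim forks once right before it dies, while budget lasts."""

    def __init__(self, members: Iterable[int], forks_budget: int) -> None:
        self.members = set(members)
        self.next_pid = max(self.members) + 1
        self.forks_budget = forks_budget

    def read_pids(self) -> List[int]:
        return sorted(self.members)

    def kill(self, pid: int) -> None:
        if pid in self.members and self.forks_budget > 0:
            self.members.add(self.next_pid)   # child escapes this round
            self.next_pid += 1
            self.forks_budget -= 1
        self.members.discard(pid)


if __name__ == "__main__":
    polite = FakeGroup({100, 101}, forks_budget=5)
    print(kill_all(polite.read_pids, polite.kill))
    hostile = FakeGroup({1}, forks_budget=10**6)
    print(kill_all(hostile.read_pids, hostile.kill, max_rounds=3))
```

With a real cgroup, read_pids would wrap read_cgroup_procs and kill would send SIGKILL; the loop still only terminates if the members eventually stop forking, which is what a throttling mechanism would guarantee.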

* proper cgroup-is-empty notification interface:
The current call_usermodehelper() interface is an inefficient and ugly
hack. Tools would prefer something more lightweight, like a netlink,
poll() or fanotify interface.

* allow user xattrs to be set on files in the cgroupfs (and maybe
procfs?)

* simple, reliable and future-proof way to detect whether a specific pid
is running in a CLONE_NEWPID container, i.e. not in the root PID
namespace. Currently, there are a few ugly hacks available to detect
this (for example, a process wanting to know whether it is running in a
PID namespace could just look for a PID 2 being around and named
kthreadd, which is a kernel thread only visible in the root namespace).
However, all these solutions encode information and expectations that
better shouldn’t be encoded in a namespace test like this. This
functionality is needed in particular since the removal of the ns
cgroup controller, which provided the namespace membership information
to user code.
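
The kthreadd hack mentioned above looks roughly like this. It is shown only to make the problem concrete, not as a recommendation: it assumes /proc is mounted and matches the caller's PID namespace, and it breaks as soon as a container process with PID 2 happens to be called kthreadd. The function name and the proc_root parameter are made up for this example.

```python
import os


def looks_like_root_pid_namespace(proc_root: str = "/proc") -> bool:
    """The hack from the text: is PID 2 visible and named kthreadd?"""
    try:
        with open(os.path.join(proc_root, "2", "comm")) as f:
            return f.read().strip() == "kthreadd"
    except OSError:
        # No PID 2 visible (or no /proc at all): assume a PID namespace.
        return False


if __name__ == "__main__":
    print(looks_like_root_pid_namespace())
```

Every assumption baked into this test (process naming, PID numbering, the /proc mount) is exactly the kind of expectation that a namespace check should not have to encode.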

* allow making use of the “cpu” cgroup controller by default without
breaking RT. Right now creating a cgroup in the “cpu” hierarchy that
shall be able to take advantage of RT is impossible for the generic case
since it needs an RT budget configured which is from a limited resource
pool. What we want is the ability to create cgroups in “cpu” whose
processes get a non-RT weight applied, but for RT take advantage of the
parent’s RT budget. We want the separation of RT and non-RT budget
assignment in the “cpu” hierarchy, because right now, you lose RT
functionality in it unless you assign an RT budget. This issue severely
limits the usefulness of the “cpu” hierarchy on general purpose systems
right now.

* Add a timerslack cgroup controller, to allow increasing the timer
slack of user session cgroups when the machine is idle.

* An auxiliary meta data message for AF_UNIX called SCM_CGROUPS (or
something like that), i.e. a way to attach sender cgroup membership to
messages sent via AF_UNIX. This is useful in case services such as
syslog shall be shared among various containers (or service cgroups),
and the syslog implementation needs to be able to distinguish the
sending cgroup in order to separate the logs on disk. Of course,
SCM_CREDENTIALS can be used to look up the PID of the sender, followed
by a check in /proc/$PID/cgroup, but that is necessarily racy, and the
race is a very real one in practice.
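
The racy workaround can be sketched like this: receive a datagram together with its SCM_CREDENTIALS control message, then look up the sender's cgroups via /proc/$PID/cgroup. This is a sketch under stated assumptions, not how any particular syslog daemon does it; the helper names are made up, and the cgroup paths in the demo are hypothetical sample data.

```python
import socket
import struct
from typing import Dict, Tuple

UCRED = struct.Struct("iII")   # struct ucred: pid, uid, gid


def parse_ucred(data: bytes) -> Tuple[int, int, int]:
    """Decode the payload of an SCM_CREDENTIALS control message."""
    return UCRED.unpack(data[:UCRED.size])


def parse_cgroup_file(text: str) -> Dict[str, str]:
    """Map controller list -> cgroup path from the lines of /proc/$PID/cgroup."""
    result = {}
    for line in text.splitlines():
        parts = line.split(":", 2)   # format: hierarchy-id:controllers:path
        if len(parts) == 3:
            result[parts[1]] = parts[2]
    return result


def receive_with_cgroup(sock: socket.socket):
    """Receive one message and look up the sender's cgroups (racy!)."""
    data, ancdata, _flags, _addr = sock.recvmsg(4096, socket.CMSG_SPACE(UCRED.size))
    for level, ctype, cdata in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_CREDENTIALS:
            pid, _uid, _gid = parse_ucred(cdata)
            # RACE: by the time this file is opened the sender may have
            # exited, moved to another cgroup, or had its PID reused.
            try:
                with open(f"/proc/{pid}/cgroup") as f:
                    return data, pid, parse_cgroup_file(f.read())
            except OSError:
                return data, pid, None
    return data, None, None


if __name__ == "__main__":
    sample = "3:cpu:/user/lennart\n1:name=systemd:/system/syslog.service\n"
    print(parse_cgroup_file(sample))
```

The receiving socket needs SO_PASSCRED enabled with setsockopt() before credentials are delivered. An SCM_CGROUPS message would let the kernel attach the membership at send time, removing the window between receiving the PID and reading /proc.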

* SCM_COMM, with a similar use case as SCM_CGROUPS. This auxiliary
control message should carry the process name as available
in /proc/$PID/comm.

posted at: 01:22


Posted by AdamK at Fri Oct 7 07:18:26 2011
"allow changing argv[]" - doesn't Wine do it already?

Posted by drago01 at Fri Oct 7 10:49:46 2011
No revoke() in that list?

Posted by M Welinder at Fri Oct 21 04:53:51 2011
If you want fanotify events for rename, you
probably want them for link and unlink too.
After all, a link followed by an unlink of the
old name is pretty much the same as a rename.

(I should check what fanotify does, but inotify
used to have this problem.)

Posted by Alexander E. Patrakov at Fri Oct 21 07:39:17 2011
An API to get and set the filesystem label by mount point would make sense not only for FAT, but also for NTFS and EXFAT. So it needs to be compatible with FUSE and, ideally, filesystem-agnostic.

Posted by Dan Ballard at Thu Jan 5 19:22:47 2012
Hey so now that 3.2 has launched, any chance we can see an update on how much got done, and a new call for the 3.3 merge window?

Posted by Anonymous at Sun Jan 22 03:55:32 2012
Request for new item:
"Support for SCTP is missing in SELinux"

Would it be possible to add the above item to the list? There are some pointers in https://bugzilla.redhat.com/show_bug.cgi?id=517676 .

tia.

Posted by Anonymous at Thu Jan 26 12:18:08 2012
While I do agree that the current quota approach won't work for tmpfs, I wonder if that suggests the need for a slightly more general solution.  You mentioned two major issues: "racily upload" and "all current and future UIDs".

The first problem seems like the same issue as setting other mount options at mount time: you need a way to set quota at mount time, or at least to do so before the mount point becomes accessible to anyone else.  Could systemd do this early enough in the boot process that nothing else can run?  Or, could you mount with no permissions for anyone but root, add quotas, then expand the permissions?  Alternatively, could the kernel just support quota information as a mount option?

The second problem seems like a problem common to almost any filesystem with quotas: you don't want to specify a different quota for each user, you want to specify one quota for all users with some potential exceptions.  That seems generally useful.

Would solutions for both of those problems address this issue for you?

Posted by Lennart at Thu Jan 26 23:11:50 2012
Anonymous: I see the usefulness of labels on SCTP, but I am not sure this would fit well under a "plumbers" wishlist...

Other Anonymous: the problem is that apps (i.e. most of the desktop stack) use poll() on /proc/self/mounts to watch what gets mounted and what not, and they expect (rightfully) to be able to access everything that shows up as it shows up. Trying to keep them off that is just going to be messy and is simply not how the system was designed, nor is it desirable. And passing in the quotas as mount options isn't really doable, since at mount time one has little clue about which users might be created later on, and the quota data could get extensive and not really nice to include in the mount options. People would really hate us if we serialized quota info for 50,000 users in the mount options of /tmp! Think what the output of /bin/mount would look like then!

Posted by Anonymous at Sat Jan 28 16:15:33 2012
Lennart: No, I didn't mean that you should have to provide quota data for umpteen users in the mount options of /tmp.  I meant that the quota mechanism should support generic quotas that apply to broad swaths of users, so that you could do something as simple as mount -t tmpfs -o stdquota=50M tmpfs /path/to/some/tmpfs.  Does that seem reasonable?

Posted by Lennart at Sun Jan 29 15:47:08 2012
Anonymous: something like that would be better than nothing. However, I think in the long run this won't suffice, since users are different, and administrators will most likely want to configure different quotas for them, without having to remount all tmpfs all the time... But if somebody preps a patch for this I am going to support it, even though I wonder if it makes sense to adopt a half-way solution in the kernel if it is clear from the beginning that it is a half-way solution only. (Also, since we already mount some tmpfs (i.e. /run) in the initrd, if we don't want to remount it during the normal boot path we'd have to encode policy in the initrd, which we try to avoid.)

Lennart Poettering <mzoybt (at) 0pointer (dot) net>