レナート   Wunschkonzert, Ponyhof und Abenteuerspielplatz   ﻟﻴﻨﺎﺭﺕ

Tue, 12 Apr 2011

systemd for Administrators, Part VII

Here's yet another installment of my ongoing series on systemd for Administrators:

The Blame Game

Fedora 15[1] is the first Fedora release to sport systemd. Our primary goal for F15 was to get everything integrated and working well. One focus for Fedora 16 will be to further polish and speed up what we have in the distribution now. To prepare for this cycle we have implemented a few tools (which are already available in F15), which can help us pinpoint where exactly the biggest problems in our boot-up remain. With this blog story I hope to shed some light on how to figure out what to blame for your slow boot-up, and what to do about it. We want to allow you to put the blame where the blame belongs: on the system component responsible.

The first utility is a very simple one: systemd will automatically write a log message with the time it needed to syslog/kmsg when it finished booting up.

systemd[1]: Startup finished in 2s 65ms 924us (kernel) + 2s 828ms 195us (initrd) + 11s 900ms 471us (userspace) = 16s 794ms 590us.

And here's how you read this: 2s have been spent for kernel initialization, until the time where the initial RAM disk (initrd, i.e. dracut) was started. A bit less than 3s have then been spent in the initrd. Finally, a bit less than 12s have been spent after the actual system init daemon (systemd) has been invoked by the initrd to bring up userspace. Summing this up the time that passed since the boot loader jumped into the kernel code until systemd was finished doing everything it needed to do at boot was a bit less than 17s. This number is nice and simple to understand -- and also easy to misunderstand: it does not include the time that is spent initializing your GNOME session, as that is outside of the scope of the init system. Also, in many cases this is just where systemd finished doing everything it needed to do. Very likely some daemons are still busy doing whatever they need to do to finish startup when this time is elapsed. Hence: while the time logged here is a good indication on the general boot speed, it is not the time the user might feel the boot actually takes.

Also, it is a pretty superficial value: it gives no insight which system component systemd was waiting for all the time. To break this up, we introduced the tool systemd-analyze blame:

$ systemd-analyze blame
  6207ms udev-settle.service
  5228ms cryptsetup@luks\x2d9899b85d\x2df790\x2d4d2a\x2da650\x2d8b7d2fb92cc3.service
   735ms NetworkManager.service
   642ms avahi-daemon.service
   600ms abrtd.service
   517ms rtkit-daemon.service
   478ms fedora-storage-init.service
   396ms dbus.service
   390ms rpcidmapd.service
   346ms systemd-tmpfiles-setup.service
   322ms fedora-sysinit-unhack.service
   316ms cups.service
   310ms console-kit-log-system-start.service
   309ms libvirtd.service
   303ms rpcbind.service
   298ms ksmtuned.service
   288ms lvm2-monitor.service
   281ms rpcgssd.service
   277ms sshd.service
   276ms livesys.service
   267ms iscsid.service
   236ms mdmonitor.service
   234ms nfslock.service
   223ms ksm.service
   218ms mcelog.service
...

This tool lists which systemd unit needed how much time to finish initialization at boot, the worst offenders listed first. What we can see here is that on this boot two services required more than 1s of boot time: udev-settle.service and cryptsetup@luks\x2d9899b85d\x2df790\x2d4d2a\x2da650\x2d8b7d2fb92cc3.service. This tool's output is easily misunderstood as well, it does not shed any light on why the services in question actually need this much time, it just determines that they did. Also note that the times listed here might be spent "in parallel", i.e. two services might be initializing at the same time and thus the time spent to initialize them both is much less than the sum of both individual times combined.

Let's have a closer look at the worst offender on this boot: a service by the name of udev-settle.service. So why does it take that much time to initialize, and what can we do about it? This service actually does very little: it just waits for the device probing being done by udev to finish and then exits. Device probing can be slow. In this instance for example, the reason for the device probing to take more than 6s is the 3G modem built into the machine, which when not having an inserted SIM card takes this long to respond to software probe requests. The software probing is part of the logic that makes ModemManager work and enables NetworkManager to offer easy 3G setup. An obvious reflex might now be to blame ModemManager for having such a slow prober. But that's actually ill-directed: hardware probing quite frequently is this slow, and in the case of ModemManager it's a simple fact that the 3G hardware takes this long. It is an essential requirement for a proper hardware probing solution that individual probers can take this much time to finish probing. The actual culprit is something else: the fact that we actually wait for the probing, in other words: that udev-settle.service is part of our boot process.

So, why is udev-settle.service part of our boot process? Well, it actually doesn't need to be. It is pulled in by the storage setup logic of Fedora: to be precise, by the LVM, RAID and Multipath setup script. These storage services have not been implemented in the way hardware detection and probing work today: they expect to be initialized at a point in time where "all devices have been probed", so that they can simply iterate through the list of available disks and do their work on it. However, on modern machinery this is not how things actually work: hardware can come and hardware can go all the time, during boot and during runtime. For some technologies it is not even possible to know when the device enumeration is complete (example: USB, or iSCSI), thus waiting for all storage devices to show up and be probed must necessarily include a fixed delay when it is assumed that all devices that can show up have shown up, and got probed. In this case all this shows very negatively in the boot time: the storage scripts force us to delay bootup until all potential devices have shown up and all devices that did got probed -- and all that even though we don't actually need most devices for anything. In particular since this machine actually does not make use of LVM, RAID or Multipath![2]

Knowing what we know now we can go and disable udev-settle.service for the next boots: since neither LVM, RAID nor Multipath is used we can mask the services in question and thus speed up our boot a little:

# ln -s /dev/null /etc/systemd/system/udev-settle.service
# ln -s /dev/null /etc/systemd/system/fedora-wait-storage.service
# ln -s /dev/null /etc/systemd/system/fedora-storage-init.service
# systemctl daemon-reload

After restarting we can measure that the boot is now about 1s faster. Why just 1s? Well, the second worst offender is cryptsetup here: the machine in question has an encrypted /home directory. For testing purposes I have stored the passphrase in a file on disk, so that the boot-up is not delayed because I as the user am a slow typer. The cryptsetup tool unfortunately still takes more han 5s to set up the encrypted partition. Being lazy instead of trying to fix cryptsetup[3] we'll just tape over it here [4]: systemd will normally wait for all file systems not marked with the noauto option in /etc/fstab to show up, to be fscked and to be mounted before proceeding bootup and starting the usual system services. In the case of /home (unlike for example /var) we know that it is needed only very late (i.e. when the user actually logs in). An easy fix is hence to make the mount point available already during boot, but not actually wait until cryptsetup, fsck and mount finished running for it. You ask how we can make a mount point available before actually mounting the file system behind it? Well, systemd possesses magic powers, in form of the comment=systemd.automount mount option in /etc/fstab. If you specify it, systemd will create an automount point at /home and when at the time of the first access to the file system it still isn't backed by a proper file system systemd will wait for the device, fsck and mount it.

And here's the result with this change to /etc/fstab made:

systemd[1]: Startup finished in 2s 47ms 112us (kernel) + 2s 663ms 942us (initrd) + 5s 540ms 522us (userspace) = 10s 251ms 576us.

Nice! With a few fixes we took almost 7s off our boot-time. And these two changes are only fixes for the two most superficial problems. With a bit of love and detail work there's a lot of additional room for improvements. In fact, on a different machine, a more than two year old X300 laptop (which even back then wasn't the fastest machine on earth) and a bit of decrufting we have boot times of around 4s (total) now, with a resonably complete GNOME system. And there's still a lot of room in it.

systemd-analyze blame is a nice and simple tool for tracking down slow services. However, it suffers by a big problem: it does not visualize how the parallel execution of the services actually diminishes the price one pays for slow starting services. For that we have prepared systemd-analyize plot for you. Use it like this:

$ systemd-analyze plot > plot.svg
$ eog plot.svg

It creates pretty graphs, showing the time services spent to start up in relation to the other services. It currently doesn't visualize explicitly which services wait for which ones, but with a bit of guess work this is easily seen nonetheless.

To see the effect of our two little optimizations here are two graphs generated with systemd-analyze plot, the first before and the other after our change:

Before After

(For the sake of completeness, here are the two complete outputs of systemd-analyze blame for these two boots: before and after.)

The well-informed reader probably wonders how this relates to Michael Meeks' bootchart. This plot and bootchart do show similar graphs, that is true. Bootchart is by far the more powerful tool. It plots in all detail what is happening during the boot, how much CPU and IO is used. systemd-analyze plot shows more high-level data: which service took how much time to initialize, and what needed to wait for it. If you use them both together you'll have a wonderful toolset to figure out why your boot is not as fast as it could be.

Now, before you now take these tools and start filing bugs against the worst boot-up time offenders on your system: think twice. These tools give you raw data, don't misread it. As my optimization example above hopefully shows, the blame for the slow bootup was not actually with udev-settle.service, and not with the ModemManager prober run by it either. It is with the subsystem that pulled this service in in the first place. And that's where the problem needs to be fixed. So, file the bugs at the right places. Put the blame where the blame belongs.

As mentioned, these three utilities are available on your Fedora 15 system out-of-the-box.

And here's what to take home from this little blog story:

And that's all for now. Thank you for your interest.

Footnotes

[1] Also known as the greatest Free Software OS release ever.

[2] The right fix here is to improve the services in question to actively listen to hotplug events via libudev or similar and act on the devices showing up as they show up, so that we can continue with the bootup the instant everything we really need to go on has shown up. To get a quick bootup we should wait for what we actually need to proceed, not for everything. Also note that the storage services are not the only services which do not cope well with modern dynamic hardware, and assume that the device list is static and stays unchanged. For example, in this example the reason the initrd is actually as slow as it is is mostly due to the fact that Plymouth expects to be executed when all video devices have shown up and have been probed. For an unknown reason (at least unknown to me) loading the video kernel modules for my Intel graphics cards takes multiple seconds, and hence the entire boot is delayed unnecessarily. (Here too I'd not put the blame on the probing but on the fact that we wait for it to complete before going on.)

[3] Well, to be precise, I actually did try to get this fixed. Most of the delay of crypsetup stems from the -- in my eyes -- unnecessarily high default values for --iter-time in cryptsetup. I tried to convince our cryptsetup maintainers that 100ms as a default here are not really less secure than 1s, but well, I failed.

[4] Of course, it's usually not our style to just tape over problems instead of fixing them, but this is such a nice occasion to show off yet another cool systemd feature...

posted at: 03:51 | path: /projects | permanent link to this entry | 30 comments


Posted by Anonymous at Tue Apr 12 05:41:35 2011
Given that you already put services into their own cgroups, could you use those cgroups to isolate and analyze CPU/disk/network/etc activity?  Specifically, you could use that analysis to detect when a service really has finished initializing (even when it does so asynchronously), or to figure out what blocks that service.

Posted by Elad at Tue Apr 12 10:37:58 2011
I agree that "enterprise-level storage managing system" is quite unnecessary in most fedora installations. Is it possible to automatically disable those services (say at anaconda post script) in case that there is no LVM nor RAID configured on the system?

Posted by Paul Bolle at Tue Apr 12 12:08:27 2011
> [...] Startup finished in 2s 65ms 924us (kernel) [...]

I misread that first time (thinking it said 2s 650ms etc.). So a minor nitpick would be that the ms and us counts need zero padding on the left (for people like me, that is).

Posted by Lennart at Tue Apr 12 14:30:33 2011
Anonymous: I fail to see how cgroups would be useful in this context.

Elad: well, sure, for F16 we will hopefully no longer enable LVM by default and thus can keep it out of the boot process.

Paul: we have a patch waiting in our bugzilla which changes the format of timespans a little to fix this.

Posted by Alejandro Nova at Tue Apr 12 19:00:49 2011
I have to file a bug, but against KDE. Gwenview can't display the text in SVGs generated by systemd-analyze plot.

Great work!

Posted by Elad at Tue Apr 12 21:57:05 2011
You obviously have some patches for cups, as in my blame list it's 5117ms (and it's on top of the list).

Posted by cesarb at Wed Apr 13 00:26:45 2011
> Is it possible to automatically disable those services (say at anaconda post script)

Or even better: how about noticing the rd_NO_* flags on the kernel command line and somehow bypassing the corresponding units? If udev-settle is being called only as a dependency of them, and all of them are disabled, it could then be skipped too.

Posted by oiaohm at Wed Apr 13 02:12:46 2011
Same applies to the Raid scripts.  The Raid should mount as soon as all its parts are located.  Any longer delays system start up.  So even for the Raids udev-settle.service is invalid.  Since this requires waiting too long.  If your services and everything else is in the raid waiting for everything to probe is a major waste of time.

Now if all udev registrations are done and lvm and raid volumes still are not mounted this should be an error.  Lot of your complaints about LVM and Raid being a heavy hitter on performance directly relate to this over waiting.

This is basically the alteration to make LVM and Raid hotpluggable as per data recovery systems.  Not like we want to be messing around when recovering data from dead machines.

Lennart I know this is hard.  But there are times when particular users will require raid or lvm.  There is no particular reason to treat them as second class and leave them with defective scripts.

Of course some people can raise the line of what happens if there is a conflict in lvm or raid.  A working system this should not happen.  But if you want to play it safe.  LVM and raid drives remain read only until probing is complete then if they are write enabled change to write.  Still this is less stopping on the boot process than current state of not being able to read the LVM or Raid at all until probing is complete.  Note in datarecovery drive never goes read-write on its own so the conflict is not a issue.

Not that this issue will be require for me. My lvm drives are lvm so I can snapshot them.  Btrfs is not exactly what you call fully production ready yet.

Elad cups can be slow.  Cups is another one that has to probe.  cups start speed is normally directly linked to the printers its trying to use.

Posted by oiaohm at Wed Apr 13 02:14:06 2011
Opps this should be before my other comment.  Cut it off somehow.>>
Lennart I would say the LVM startup scripts need redesign.  Since I do hotplug drives containing LVM without issues all the time.  LVM should be no worse than any other partitions other than little longer to come into existence.

RAID and Multipath(that is basically enhanced raid) enabled yes I can see requirement to settle.

Not all cases is LVM partitions expanded across drives.  This is why I can hotplug.  LVM volume on a single disk not mirror or spanned should just mount without ever requiring udev-settle.service as soon as the drive its on shows up.  Each LVM spanned or mirrored volume is tagged as such.  These should trigger a kinda of wait.  Basically this LVM volume is not mountable until its other parts turn up.  This could be a form of hotplug registration.

Posted by oiaohm at Wed Apr 13 03:52:39 2011
Lennart I also have to bust you bubble of hope.  LVM disappears out of Fedora 16 at the same time Btrfs appears.

Reason Btrfs also contain raiding options.  Raiding has the same set or problems over and over again.

Designing a model to handle raid devices that model will apply to Btrfs LVM ZFS some cluster filesystems.

Other thing that is over looked is a mirror raid can double read speed.  So a mirror raid machine should in theory boot faster than a non mirror unless its wasting time somewhere waiting for stuff it should not be.

Posted by pankaj at Wed Apr 13 05:39:40 2011
Well i'm unable to run check out the plot feature, getting some wierd error.

[pankaj@pankajlaptop2 ~]$ systemd-analyze plot > plot.svg
Traceback (most recent call last):
  File "/usr/bin/systemd-analyze", line 95, in <module>
  start_time, finish_time = acquire_start_time()
  File "/usr/bin/systemd-analyze", line 33, in acquire_start_time
  assert startup_time <= finish_time
AssertionError

How can this happen :(

systemd-analyze blame seems to run fine

[pankaj@pankajlaptop2 ~]$ systemd-analyze blame
  8339ms plymouth-start.service
  5217ms boot.mount
  5214ms systemd-tmpfiles-setup.service
  4725ms NetworkManager.service
  4567ms fedora-storage-init.service
  4038ms udev.service
  3875ms fedora-readonly.service
...

Posted by Elad at Wed Apr 13 10:04:29 2011
pankaj: Your RTC (the computer clock) is non-UTC, this bug was fixed in git. (but you should change it to UTC anyway)

Posted by gmazyland at Wed Apr 13 17:27:57 2011
Well, see, 5 seconds for cryptsetup is long time, for one active keyslot (most of configurations) it should take 1 second.

cryptsetup calculates iteration time to take 1 second by default (during format, after benchmarking system), but because you are probably running it in parallel with some CPU busy code, it takes longer (also systemd doesn't increase process priority like cryptsetup does).

Anyway, you read the paper, for others - read http://clemens.endorphin.org/cryptography before you start to blame someone that 1 second is too long. Security is not for free.

If you want to decrese iteration timeout and you are aware what it means, you can, it is parameter for luksFormat. (You can also do it later by reformating keyslot without data loss.)

For the starters, read "What is iteration count and why is decreasing it a bad idea?" FAQ section here http://code.google.com/p/cryptsetup/wiki/FrequentlyAskedQuestions#5._Security_Aspects

Posted by Lennart at Wed Apr 13 17:36:16 2011
gmazyland: the CPU is otherwise completely idle when the crypto disk is set up. And the link you mentioned just includes the usual bullshit of claiming "1s is perfect", but gives no reason at all why it is, and why 100ms couldn't be fine too. 1s is arbitrarily picked, and I have yet to see any kind of research that would recommend 1s, and would strictly warn of 100ms. The FAQ you mention does not include any reference to any research about this. "Just because" and "Security is not free" is not a valid reason for insisting on 1s.

Posted by zoom at Wed Apr 13 20:03:54 2011
Who cares about boot time... girl scouts?

Posted by zoom at Wed Apr 13 20:05:21 2011
That above was an example of a bullshit argument. Not the reasonably argued reason s for choosing the default of 1s iteration time.

Posted by Karel Zak at Thu Apr 14 00:55:00 2011
Lennart: I don't think that LUKS (or any other document) claiming that 1s is perfect.

The number of iterations in PBKDF2 multiplies CPU time that is necessary for brute-force (or dictionary based) attacks -- 100ms means less iterations, 1s means more iterations.

There is discussion about it in RFC3962, the function is described in RFC2898 and something is at http://www.schneier.com/paper-low-entropy.html.

From your note it seems that you believe that key derivation functions are stupid fake -- do you have any URL where we can read more about it?

Posted by pankaj at Thu Apr 14 15:06:28 2011
@Elad: Thanks for your hint.
How do i go about changing the RTC to use UTC instead of localtime. The hwclock command does not seem to be able to do it.
Thanks

Posted by andreas at Thu Apr 14 15:45:31 2011
@pankaj: In Ubuntu you edit /etc/default/rcS and in Fedora some googling tells me that you run the command system-config-time

http://www.ardamis.com/2009/11/24/windows-time-off-when-dual-booting-linux/

Posted by Lennart at Thu Apr 14 15:51:28 2011
pankaj: the place to configure UTC RTC is the 3rd line of /etc/adjtime. Everything else is non-standard per-distribution hackery. See hwclock(8) for details.

Karel: I am not saying that the alg is flawed, I am just arguing that 1s is needlessly high and 100ms would do too.

Posted by pankaj at Thu Apr 14 16:03:13 2011
Thanks Andreas. Sadly system-config-time is no longer available in Fedora 15. There must be some other way out

Posted by pankaj at Thu Apr 14 16:09:23 2011
@Lennart: thanks a lot, that man page was really extensive. It didn't strike me to check it there.

Posted by Eero Tamminen at Sat Apr 16 12:53:43 2011
Maybe systemd could wrap meego's minimal bootchart to systemd: http://meego.gitorious.org/meego-developer-tools/bootchart ?

It's only 1.5KLOCs of C which includes SVG chart generation...

Posted by Lennart at Sun Apr 17 15:55:12 2011
Eero: Michael's bootchart version can do a lot more, since it doesn't just sample running processes at certain intervals, but actually subscribes to tasks being created and removed.

Posted by misc at Wed Apr 20 15:48:20 2011
I've recently noticed a precarious behavior of systemd and in attempts to circumvent it found a bug:

If I remove my USB HDD (1 btrfs partition) systemd will completely halt the boot process when it would be time to mount. init knew better as it just gave a "mount point not available" warning (or something like that) and ignored that. However, booting does resume after the drive is connected. (fstab: http://pastebin.com/7fJu35bA )

On the Arch Linux bbs falconindy then suggested appending the relevant fstab line with comment=systemd.automount. The system will then boot apparently normally &#8211; but nothing can access the device both when connected after boot and from the beginning, neither can the contents of the folder containing its mount point (/media) be listed. Apparently some system call does not respond and hence leaves the applications freezing.

Posted by Lennart at Wed Apr 20 15:52:08 2011
misc: i think you want to be using "nofail" as mount option. It's a switch understood by traditional mount, and if the previous archlinux boot scripts didn't fail if mount points where missing though nofail was not passed for them then the archlinux boot scripts were buggy.

Posted by misc at Wed Apr 20 17:28:35 2011
Lennart: "nofail" works, thanks.

However, the apparently faulty behavior of comment=systemd.automount remains. Mere reboots after appending it seem to cause no trouble, but once I e. g. umount it the folder becomes inaccessible. It doesn't matter if the drive is connected at the next reboot, only removing the part from fstab works. (The partition is exclusively for music storage, mpd gets started as separate user at boot &#8211; if that's relevant.)

Posted by Lennart at Wed Apr 20 19:08:31 2011
misc: if you manually unmount the automount point, then well, the automount point is gone and will not work. no surprise in that.

(This blog post is not a bug tracker btw. I would much prefer if you filed this bug against your distribution's bug tracker, so that the maintainers can than escalate the bug to us upstream if necessary. Thanks)

Posted by Clyde Kunkel at Tue Oct 18 21:52:27 2011
In current rawhide with one LV snapshot, fedora-storage-init.service takes ~40 secs.  Expected? I will file a bz if this seems excessive.

Posted by kirill at Sun Nov 27 06:15:25 2011
On my computer after installing openSuSE 12.1 systemd boots system quite long. I've rolled back to systemv and got boot process much faster...

which means - it is not so simple, and does require tune-up....

Leave a Comment:

Your Name:


Your E-mail (optional):


Comment:


As a protection against comment spam, please type the following number into the field on the right:
Secret Number Image

Please note that this is neither a support forum nor a bug tracker! Support questions or bug reports posted here will be ignored and not responded to!


It should be obvious but in case it isn't: the opinions reflected here are my own. They are not the views of my employer, or Ronald McDonald, or anyone else.

Please note that I take the liberty to delete any comments posted here that I deem inappropriate, off-topic, or insulting. And I excercise this liberty quite agressively. So yes, if you comment here, I might censor you. If you don't want to be censored your are welcome to comment on your own blog instead.


Lennart Poettering <mzoybt (at) 0pointer (dot) net>
Syndicated on Planet GNOME, Planet Fedora, planet.freedesktop.org, Planet Debian Upstream. feed RSS 0.91, RSS 2.0
Archives: 2005, 2006, 2007, 2008, 2009, 2010, 2011

Valid XHTML 1.0 Strict!   Valid CSS!