by V.R.
I am not sure I am such a big fan of reimplementing NetworkManager…
– Lennart Poettering’s famous last words, March 2011
10 years ago, systemd was announced and swiftly rose to become one of the most persistently controversial and polarizing pieces of software in recent history, especially in the GNU/Linux world. The quality and nature of the debate have not improved in the least since the major flame wars of 2012-2014, and systemd still remains poorly understood and understudied on both the technical and the social level, despite paradoxically having disproportionate levels of attention focused on it.
I am writing this essay both for my own solace, so I can finally lay the subject to rest, and in the hope that my analysis can provide some context to what has been a decade-long farce – and not, as in Benno Rice’s now-famous characterization, a tragedy.
In the first chapter, I draw on contemporary mailing list posts to discuss the efforts to modernize init, rc and service management that took place before systemd, and the prevailing motives at the time. I begin with a preface on the cultural cleavages between different classes of Linux users.
In the second chapter, I discuss the early history and design philosophy of systemd, and what factors drove its adoption.
The third chapter is a technical critique of systemd. It assumes prior familiarity with systemd and it is heavy on discussion of implementation details. I also include a few “case studies” based on bug reports to better illustrate some of the drier theory.
The fourth chapter discusses other historical parallels to systemd in FOSS development, wraps up some of the threads in the first and second chapters, and concludes with some conjectures about the future of low-level Linux userspace.
1. Init modernization efforts before systemd
1.1. The root of the Linux culture war
Complaints over Linux’s fragmented nature as an anarchic bazaar of individual white-box components stitched together into distributions – and the various perennial attempts at reaching a “resolution” of this innate issue – are nearly as old as the first distributions themselves.
Fred van Kempen, an early contributor to Linux’s TCP/IP stack, is quoted in a June 1994 interview with the Linux Journal as saying:
Personally, I think the Linux community will have to get used to (a) paying some money for the software they use (for example, shareware and commercial applications), and (b) a somewhat more closed development environment of the system itself. Many, many people will disagree, and exactly this program is what is keeping Linux from a major breakthrough in The Real World.
The lively bazaar would need to make compromises with the Cathedral and with proprietary software, in this early pioneer’s estimation.
A 1998 article entitled “Linux and Decentralized Development” by Christopher B. Browne, written in the storm
of critical reception following ESR’s famous essay on the Cathedral and the Bazaar, opens with the observation that “many people have complained over the last few years that there should be some sort of “central” Linux organization.”
Browne goes on to argue for the merits of decentralized development, but by the end concedes the usefulness of what he dubs a “Linux Foundation,” describing a possible structure and sources of funding for such an organization – which would indeed become a reality shortly afterward.
Besides the ill-fated Linux Standard Base, one of several early attempts at standardizing a unified “Linux” API was the now forgotten EL/IX specification, drafted in late 1999 by Cygnus Solutions, shortly before their acquisition by Red Hat. It was specifically intended for so-called “deeply embedded” platforms, defined then as “automotive controls, digital cameras, cell phones, pagers,” and to compete with RTOSes such as eCos. Note that in the FAQ we read that “Red Hat is committed to ensuring the portability of Linux and preserving freedom of choice” – a view that does not find as great favor in our time. A contemporaneous article in the EE Times reveals that the announcement of EL/IX was met by a very mixed reception from other vendors who worked with embedded Linux at the time.
The major cultural cleavage in the Linux “community” boils down to two things: its entanglement with the history of GNU and the free software movement, and its image as the “revolution OS” – a product of free culture worked on by hobbyists and volunteers towards the end of emancipation from shrink-wrapped software vendors and pointy-haired bosses, or the snooping cloud service provider as the more modern equivalent would be.
Consequently, the professional Linux plumber and the plebeian hobbyist occupy two different worlds. The people who work at the vanguard of Desktop Linux and DevOps middleware as paid employees have no common ground with the subculture of people who use suckless software, build musl-based distros from scratch and espouse the values of minimalism and self-sufficiency. For many in the latter camp who came to Linux from the message of free software, their eventual realization that real-world Linux development is increasingly dominated by the business interests of large-scale cloud providers as represented in the platinum memberships of the Linux Foundation, is almost analogous to the disillusionment of a communist true believer upon witnessing Comrade Trotsky brutally punish the sailors of the Kronstadt rebellion. The iron law of oligarchy still reigns as usual irrespective of the pretense toward progress and equality.
The age when the homesteading hobbyist could make a substantial difference is long over, yet the image still persists. The communitarian ethos of free software can never be fully wiped from the DNA of GNU/Linux, but it can be increasingly relegated to an irrelevant curiosity. The likes of Richard Stallman, GNU and the FSF are seen more and more as an embarrassment to be
overcome in favor of a more “professional” and “inclusive” scene, which will in all likelihood mean professionally showing the user their place as a data entry within the panopticon, not a free and independent yeoman as the admittedly utopian pipe dream envisioned.
Reality would mug the hobbyist hacker quite early. In an April 2000 interview for the Ottawa Citizen, Red Hat founder Bob Young states:
There are two big myths about the business, and the first is there is a single Linux operating system. Linux is a 16-megabyte kernel of the 600-megabyte operating systems that companies like Corel and Red Hat make. Linux might be the engine of your car, but if you plunk an engine in your driveway, you’re not going to drive your kids to school.
Our job is making people understand this revolution is about open-source software and it is not about Linux at all. Linux is simply the poster boy for this movement. The other myth is that Linux is being written by 18-year-olds in their basements when actually most of it is being written by professional engineering teams.
Another harbinger was the dot-com bubble – the number of long-forgotten Linux companies started to ride the IPO craze was immense. Turbolinux, LynuxWorks, Stormix, Linuxcare, Cobalt Networks and LinuxOne are just a few of the ventures that disappeared as abruptly as they rose. This was a time when LWN.net ran stock listings and meticulously covered every Linux-related financial vehicle. The biggest story from that era, though, was probably IBM announcing its intent to invest $1 billion in Linux over the next three years from 2001 onward. A discussion of IBM’s open source strategy c.2005 is available here.
In 2020, the name of the game is “cloud native”: the operating system is buried under layers of middleware meant to abstract it into a reproducible application server, for the purpose of easily deploying the latest and greatest way of doing what you’ve been able to do for years, but as a networked service. This return of time-sharing, described in Rudolf Winestock’s “The Eternal Mainframe,” has brought forth the ethical problem of Service as a Software Substitute – to the unease of the idealistic hobbyist, and the indifference or enthusiasm of the hardened professional.
GnomeOS has been a high-level goal for many years, motivated by the professional’s resentment of the Linux distribution’s existence as a middleman in application deployment, with its own packaging guidelines, policies, changes to upstream defaults, and indeed the very concept of “package management.” Tobias Bernard expresses this sentiment in “There Is No ‘Linux’ Platform”. Among the damages caused by fragmentation, according to Bernard, are distro maintainers “adding a permanent dock, icons on the desktop, re-enabling the systray” to the DE. Presumably such irresponsibility may lead to future horrors, like people actually reading source code. Either way, the most recent attempt at achieving the GnomeOS vision involves tools such as Flatpak, OSTree, BuildStream and the Freedesktop SDK (essentially a distro crafted out of BuildStream files); it also appears to be the longest-lasting attempt, and the most likely to succeed. How “legacy” distributions will adapt to this brave new world remains an open question.
The most unambiguous statement of the unificationist perspective, however, comes from GNOME developer Emmanuele Bassi:
If desktop environments are the result of a push towards centralisation, and comprehensive, integrated functionality exposed to the people using, but not necessarily contributing to them, splitting off modules into their own repositories, using their own release schedules, their own idiosynchrasies in build systems, options, coding styles, and contribution policies, ought to run counter to that centralising effort. The decentralisation creates strife between projects, and between maintainers; it creates modularisation and API barriers; it generates dependencies, which in turn engender the possiblity of conflict, and barriers to not just contribution, but to distribution and upgrade.
Why, then, this happens?
The mainstream analytical framework of free and open source software tells us that communities consciously end up splitting off components, instead of centralising functionality, once it reaches critical mass; community members prefer delegation and composition of components with well-defined edges and interactions between them, instead of piling functionality and API on top of a hierarchy of poorly defined abstractions. They like small components because maintainers value the design philosophy that allows them to provide choice to people using their software, and gives discerning users the ability to compose an operating system tailored to their needs, via loosely connected interfaces.
Of course, all I said above is a complete and utter fabrication.
You have no idea of the amounts of takes I needed to manage to get through all of that without laughing.
[…]
Complex free software projects with multiple contributors working on multiple components, favour smaller modules because it makes it easier for each maintainer to keep stuff in their head without going stark raving mad. Smaller modules make it easier to insulate a project against strongly opinionated maintainers, and let other, strongly opinionated maintainers, route around the things they don’t like. Self-contained modules make niche problems tractable, or at least they contain the damage.
Of course, if we declared this upfront, it would make everybody’s life easier as it would communicate a clear set of expectations; it would, on the other hand, have the side effect of revealing the wardrobe malfunction of the emperor, which means we have to dress up this unintended side effect of Conway’s Law as “being about choice”, or “mechanism, not policy”, or “network object model”.
[…]
So, if “being about choice” is on the one end of the spectrum, what’s at the other? Maybe a corporate-like structure, with a project driven by the vision of a handful of individuals, and implemented by everyone else who subscribes to that vision—or, at least, that gets paid to implement it.
Of course, the moment somebody decides to propose their vision, or work to implement it, or convince people to follow it, is the moment when they open themselves up to criticism. If you don’t have a foundational framework for your project, nobody can accuse you of doing something wrong; if you do have it, though, then the possibilities fade away, and what’s left is something tangible for people to grapple with—for good or ill.
And if we are to take the “revolution OS” metaphor further, then Bassi’s position is not unlike Stalin’s defense of the need of a vanguard party in The Foundations of Leninism (1924), with those opposed consequently in the role of Trotskyites, Zinovievites and ultra-leftists: “The theory of worshipping spontaneity is decidedly opposed to giving the spontaneous movement a politically conscious, planned character. It is opposed to the Party marching at the head of the working class, to the Party raising the masses to the level of political consciousness, to the Party leading the movement; it is in favour of the politically conscious elements of the movement not hindering the movement from taking its own course; it is in favour of the Party only heeding the spontaneous movement and dragging at the tail of it.”
One can no more dissuade a visionary of this kind than one can dissuade a member of the Fabian Society from the virtues of global humanitarian government, but then neither will the vox populi of provincial yokels be of any use in countering it. One can only resign oneself stoically to the pull of inexorable necessity.
The professionals are doomed in all their vainglory to be perpetually embarking on the Sisyphean task of a unified and integrated Linux ecosystem, even if it means turning the kernel into a runtime for the BPF virtual machine, or making a Rube Goldberg machine of build and deployment pipelines, as appears to be the most recent trend. The hobbyists are doomed to shout in the void with no one to hear them. In this tragedy the only victor is chaos and discord itself, which disguises itself as “progress.” All that is guaranteed is permanent revolution through constant reinvention, where by revolution we mean running around in circles. The suits and ties have forgotten what it was to be Yippies, and for their part the Yippies are fools who are had by ideas, rather than having ideas.
1.2. Before systemd: the disjointed goals of mid-2000s Linux vendors
There were quite a few efforts, from around 2001 to 2010, at remedying the bugbears of sysvinit and initscripts, besides the most famous example, Upstart. Looking into these old mailing list posts not only provides an interesting time capsule; it also shows that the motivations Linux developers had for updating daemon management and startup in the mid-2000s were quite different from those that would be expressed after the arrival of systemd.
A paper by Henrique de Moraes Holschuh, presented at the 3rd Debian Conference in 2002, entitled System Init Scripts and the Debian O.S., provides a comprehensive overview of the state of startup management at the time. H. de Moraes Holschuh looks at NetBSD’s then recent rc.d, runit, Richard Gooch’s simpleinit, Felix von Leitner’s minit, jinit, and the serel dependency manager.
Very amusingly, one of his contemporary complaints was that sysvinit was too resource-intensive:
Sysvinit has a large memory footprint. On an ia32 system, it takes about 1280 kibibytes of virtual space, and 480 kibibytes of RSS. That amounts to an itty-tiny bit of memory in these KDE/GNOME days, but it is hardly what you would use for memory-starved embedded systems.
Moreover, by his testimony, “most of the System V init script system is actually quite good.” The symlink farms in /etc/rc?.d, with their consequent ordering issues, are the primary complaint. All of his proposed improvements are incremental: tools like invoke-rc.d and policy-rc.d for use in dpkg maintainer scripts, easier management of symlink farms, and a registry of initscripts; the most radical, but also quite hypothetical, proposal was to replace runlevels with dependency directives like init-provide, init-before and init-after. This hypothetical system would still have been based on initscripts, and compatible with telinit directives to switch runlevels. At no point was any new declarative configuration format, major overhaul of logging or event-driven feature proposed – much less unifying the distributions.
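As an illustration of how modest even that radical option was, such directives would presumably have lived as comment headers inside otherwise ordinary initscripts. The directive names below come from the paper; the concrete syntax and service names are invented here:

# Hypothetical header for an sshd initscript under Holschuh's proposal.
# Directive names are from the 2002 paper; this syntax is illustrative only.
# init-provide: sshd
# init-after: syslog networking
# init-before: multiuser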
In 2002, Richard Gooch wrote his proposal on Linux Boot Scripts for a program called simpleinit. Here, initscripts cooperatively resolved their own dependencies by means of a program called need(8), with the init daemon running the scripts in any order. An implementation of simpleinit existed for years as part of util-linux, and later an enhancement by Matthias S. Brinkmann called simpleinit-msb was created; simpleinit-msb is still the default in Source Mage GNU/Linux. An old mailing list discussion justifying its use can be found here, citing easier integration with package management, shorter scripts, parallelism and explicit dependency handling as advantages.
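As a minimal sketch of the scheme (service names hypothetical, semantics simplified): a boot script declares its prerequisites by invoking need, which runs each named script first if it has not already run, and returns nonzero if that script failed:

#!/bin/sh
# Hypothetical sshd boot script under simpleinit.
# need(8) starts each prerequisite script if it has not run yet,
# blocking until it completes; a nonzero exit means it failed.
need syslog || exit 1
need networking || exit 1
exec /usr/sbin/sshd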
Within the sources of simpleinit-msb was a rant by Matthias S. Brinkmann titled “Why sysvinit Setups Suck.” It deserves to be quoted in full:
The classic SysVinit boot concept, which uses numbered symlinks (lots of them), has several drawbacks:

- It’s ugly! If you feel that that “K123dostuff” is a pretty name, count yourself lucky, but do me a favor and take advice from your relatives when you name your children 😉
- Unless you’re a black-belt in Mikado, you soon get lost in a SysVinit setup. Most scripts have at least 3 representations in the filesystem: the script itself, an S symlink and a K symlink. A higher symlink count is not uncommon, though.
- You have to manually specify the order in which boot scripts are to be executed. And you have to do this for all boot scripts, even though it would be more natural to only specify the order for those scripts where it matters. You may be able to fool yourself into believing this is a matter of being in control, but honestly, do you really care if the keymap is loaded before or after the system clock is set ? If you want that control, that’s okay, but the problem is that SysVinit forces it on you.
- It doesn’t have dependency management. Sure, giving service A the number 100 and service B the number 200 will guarantee that A is started before B, but sometimes that’s not enough. What if service B needs service A running ? SysVinit doesn’t help you with this. It starts A before B but if A fails SysVinit will still try to run B. If mounting filesystems fails, it will still attempt every remaining service, even those that need to write to disk. In the end you have a system with nothing running but SysVinit’s runlevel program will happily tell you that you’re on runlevel 5, running the X Window System and full network.
- It’s hard to modify. Why do people write fancy GUI programs to add and remove scripts from/to runlevels ? Because manually creating or deleting half a dozen symlinks, all with different names where even a single mistyped letter can cause the system to break is just madness. Tell me truthfully, wouldn’t you just prefer to do a “mv runlevel.3/telnetd unused/” to deinstall the telnetd service and a “mv unused/telnetd runlevel.3/” to add it back again ?
- It doesn’t scale well. Look at LFS: It uses three digits for the sequence numbers and every service has a number in the hundreds. Makes a lot of sense on a system with only 10 boot scripts, doesn’t it ? The trouble is that whenever you add a boot script you need to fit it into the sequence and should you ever have to start a boot script between a script with number N and a script with number N+1, there is only one solution: Reindexing all your boot scripts. It reminds me of a long forgotten past, when I still wrote programs in a BASIC interpreter with line numbers. From time to time I would need to insert more than 9 lines between line N and line N+10. Fortunately there was the “renum” command that would do it for me. But of course you can write a shell script to do this for your symlinks. No problem. SysVinit admins love exercise.
- It doesn’t work well with automatic installations. If you want to build an installation tool that allows the user to select the packages he wants to install and only installs those, you will inevitably run into trouble with packages that come with a boot script (e.g. an ftp daemon). Your only chance is to assign every script your installation tool supports a unique sequence number, working under the assumption that the user installs all packages. And what if the user installs only part of the packages, adds his own scripts and then wants to install more of your packages which unfortunately use numbers the user’s scripts already occupy ? The worst thing is that this problem exists even for scripts whose order in relation to the other scripts doesn’t matter at all (the usual case).
- No user space testing. To test SysVinit boot scripts and runlevel setup you have to be root and execute potentially dangerous commands. Sometimes you have to reboot multiple times before you get it right. Especially people obsessed with the aesthetics of the boot script output will prefer an easy and safe way to work on the scripts in user space.
- Unsafe shutdown. SysVinit installations rely solely on boot scripts to ensure unmounting of file systems and killing of processes. This is very unsafe and can result in data loss if worst comes to worst. A good shutdown program has a safety net to deal with this.
Years later, Busybox developer Denys Vlasenko would publish his own criticisms of sysvinit in the form of a Socratic dialogue, advocating instead the daemontools approach, also defended in the 2012 article Process Supervision: Solved Problem.
In late 2003, a story broke regarding a proposal by GNOME developer Seth Nickell to write an init replacement. It never progressed beyond conceptual design, and it was unabashedly desktop-driven, but it was described as being based around the idea of using D-Bus as a service discovery mechanism for daemons: “ServiceManager, when you tell it “start org.designfu.SomeService” does a check on SomeService’s dependencies, loads those first if necessary, and then activates org.designfu.SomeService using normal dbus activation. Ideally this would mean activating the daemon itself which would use DBus, but it could also mean activating a “python wrapper”. ServiceManager sends out a signal over DBus announcing a new system service (and providing its name). At this point org.designfu.SomeService is responsible for keeping people notified as to its state.”
In addition: “My personal agenda is to encourage daemons to depend on DBus in the future, which will make them more likely to use DBus for providing a client interface. I fear that even after DBus exists and when it makes sense to, e.g. have an org.apache.WebServer DBus service it will get shot down because nobody wants to add the (compile time optional) dependency to get a few “small” features (small to daemon/kernel/network hackers, big to the desktop!).”
The contemporary reception was quite mixed. The OSNews.com comments are a mix of bewilderment, apathy and scant positivity. The LWN.net comments are similar – people defending the status quo, being wary of desktop integration, or suggesting their own preferred alternatives, like a hypothetical on-demand service launcher, the use of Gooch’s boot scripts, or daemontools. One commenter even speaks of his experience rolling his own PID 1 in Common Lisp, but abandoning it due to inconvenience with upstream packaging.
Reacting to Seth Nickell’s proposal, Fedora contributor Shahms King wrote to the mailing list about his experiments with revamping initscripts. Content with sysvinit itself and with runlevels, he instead wanted to rewrite the /etc/rc script in Python for parallel startup, on the basis of a ‘depends’ header in scripts. Fedora developer Bill Nottingham replied with the following, dismissive of parallelism gains:
- Dependency headers are already specified in the LSB, might as well use those tags
- We’ve done testing with some of these alternatives. It really doesn’t provide that much of a speedup just by paralellizing things. At most we saw 10-15%.

IMO, the way to faster boot is simple, yet hard: Do Less.
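The LSB tags Nottingham refers to are comment headers that tools like insserv and startpar would later consume for dependency-based ordering. A typical header, for a hypothetical sshd script:

### BEGIN INIT INFO
# Provides:          sshd
# Required-Start:    $local_fs $network $syslog
# Required-Stop:     $local_fs $network $syslog
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: OpenSSH server daemon
### END INIT INFO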
In a similar vein, Red Hat developer David Zeuthen was dismissive in April 2007 of boot-time gains from replacing sysvinit, instead advocating readahead:
It’s a popular misconception that boot time can be “boosted” by just
replacing SysVinit with something else. The biggest bang for the buck
actually comes from fixing readahead so it reads all the files it needs,
without making the disk seek; see [1] for some old experiments I did 2.5
years ago. The good news, however, is that the readahead maintainer is
working on this; see fedora-devel-list archives for discussion.
What he cited was his own mailing list post from 2004 on boot optimization, which at the end included an off-hand remark on replacing init (but again not initscripts):
The whole init(1) procedure seems dated; perhaps something more
modern built on top of D-BUS is the right choice – SystemServices
by Seth Nickell comes to mind [1]. Ideally services to be started
would have dependencies such as 1) don’t start the gdm service
before /usr/bin/gdm is available; 2) the SSH service would only
be active when NetworkManager says there is a network connection;
/usr from LABEL=/usr would only be mounted when there is a volume
with that label and so forth. Also, such a system would of course
have support for LSB init scripts.
(This is probably a whole project on it’s own so I’m omitting
detailed thinking on it for now)
Starting from around 2005, Fedora had various on-and-off attempts at updating initscripts under the banner of FCNewInit. It did not envision replacing initscripts; on the contrary, it went for full backwards compatibility, with incremental improvements to LSB compliance and a vaguely defined notion of exposing services over D-Bus. The action page concluded: “Looking at these features, the best way to do this is almost certainly to add the D-BUS, etc support into the services themselves, and provide a wrapper for legacy LSB and other initscripts that provides the D-BUS interface to them.”
In June 2005, Harald Hoyer wrote a proof-of-concept for this scheme in the form of a Python script called ServiceManager, which “reads all /etc/init.d/* scripts and creates DBUS “Service” objects. These parse the chkconfig and LSB-style comments of their script and provide a DBUS interface to retrieve information and control them.”
Nothing ultimately came of this effort, likely because it simply added more cruft to what was already a ball of mud. Instead, by 2007 a more conservative effort, again by Harald Hoyer, sought to use LSB dependency headers for parallelizing the scripts, following Mandriva’s work. Notably, Hoyer wrote of init replacements that:
Alternatives to SysVInit (like upstart/initng) can live in Fedora as well, but we are very conservative in
changing the startup mechanism that proved to function for a long time now. Unless the “real” killer feature
is absolutly needed, we would like to keep backwards compatibility as long as possible.
The “real” killer feature in question was left unspecified.
In 2005, Mandriva implemented an initscript header-based parallelization scheme called prcsys, described as saving 12 seconds of boot time. It at first used Mandriva-specific header directives beginning with X-Parallel-*, but was updated in 2006 to be fully LSB-compliant. Debian and openSUSE had similar approaches by means of startpar and insserv. A 2008 blog post by a Mandriva developer further confirms that optimization-related grunt work, and not any radical redesign of the startup process, was the primary area of focus. The same was true throughout the wider mainstream.
In 2007, D-Bus system bus activation was implemented with the justification of potentially bypassing any need for a dedicated service manager entirely, as envisioned previously:
Launching programs using dbus has been a topic of interest for many
months. This would allow simple systems to only start services that are
needed, and that are automatically started only when first requested.
This removes the need for an init system, and means that we can
trivially startup services in parallel. This has immediate pressing need
for OLPC, with a longer term evaluation for perhaps Fedora and RHEL.
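Mechanically, this activation is driven by small declarative files: when a client first requests a bus name, dbus-daemon spawns the executable named in the matching file. A sketch, with a hypothetical service name:

# /usr/share/dbus-1/system-services/org.example.Foo.service (hypothetical):
# dbus-daemon launches the Exec= program upon the first request for the
# bus name given in Name=, with no further supervision of the process.
[D-BUS Service]
Name=org.example.Foo
Exec=/usr/libexec/foo-daemon
User=root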
However, as later pointed out by Jonathan de Boyne Pollard, this is an antipattern as the D-Bus daemon “has almost no mechanisms for dæmon management, such as resource limit controls, auto-restart control, log management, privilege/account management, and so forth.” Moreover, the cooperative approach that many D-Bus services exhibit, or used to exhibit, of directly executing other services effectively let upstreams dictate administrative policy decisions.
In 2005, Debian developers created a working group called initscripts-ng. A major figure here was again Henrique de Moraes Holschuh, mentioned at the beginning of this section; the initial workgroup objectives were on a similar footing to the proposals of his 2002 paper, but this time more ambitious. The high-level goal was to create a distribution-agnostic framework to unify all kinds of startup schemes for ease of administrative policymaking. Adopting either initng or runit as a base was suggested. Even then, the idea of replacing initscripts entirely was dismissed as impractical.
Throughout the years, the Debian group brainstormed a wide variety of schemes, most of which didn’t stick. Some of the more successful things to come out of it were conservative adjustments to startup speed by means of addressing hotspots, described in greater detail in a 2006 position paper. Among the adjustments proposed were replacing bash with dash for initscripts, LSB compliance, using startpar for parallelization, and backgrounding certain time-consuming operations. Ubuntu implemented similar measures before Upstart.
It took until July 2009 for LSB dependency-based startup to become the default in Debian, and parallel startup did not follow until May 2010 – a good 8 years after the initial DebConf proposal in 2002, by which time systemd had already made its entry.
A fascinating diversion in the meantime was a project called metainit, which was, basically, a Perl script to generate initscripts. Yes, really. The number of people working on this was surprisingly high, among them Michael Biebl, future Debian and Ubuntu package maintainer for systemd and upstream systemd developer. The metainit script was little more than a kludge to ease the burden of package maintainers and to permit a slightly higher degree of interoperability between Debian and Ubuntu. Indeed, one ought to observe that the most important consumers of an init system are not sysadmins and ops people, as commonly believed, but rather distro maintainers. It is their laziness that determines the balance. The project unsurprisingly fizzled out soon after being announced for extended testing.
In September 2009, Petter Reinholdtsen, who had done much of the LSB-compliance and parallelization work on Debian, announced a tentative roadmap for the future of Debian’s boot process. It is an interesting time capsule of the period in its own right, coming half a year prior to the announcement of systemd. Reinholdtsen appears to have internalized much of the Upstart-related framing around event-driven, hotplug-capable bootup, and the sequential nature of Debian’s startup at the time was identified as the root problem. The proposed solution was quite ad hoc: Upstart would have been adopted as /sbin/init, but modified to understand sysvinit-style inittab(5). The existing rc system and initscripts were to be left intact in the short term, with Upstart as a fancy initscript launcher, until the scripts could somehow be progressively enhanced to emit and react to Upstart events, with insserv modified to understand Upstart jobs. Not only that, but LSB compliance was still a strong motivation even at this late point, entailing a mix of event-driven job manifests and scripts coexisting:
According to the Linux Software Base specification, all LSB compliant
distributions must handle packages with init.d scripts. As Debian
plans to continue to follow the LSB, this mean the boot system needs
to continue to handle init.d scripts. Because of this, we need a boot
system in debian that is both event based for the early boot, and
which also calls init.d scripts at the appropriate time.
This was the culmination of a long period of discussion regarding Upstart among the initscripts-ng group. As introduced by Scott James Remnant in a May 2006 draft, Upstart aimed to be the superset and replacement of all task launchers, ranging at the time from udev, acpid, apmd, atd and crond to distro-specific helpers like ifupdown. The ReplacementInit page at the Ubuntu wiki goes into further detail as to the motivation.
A “sprint” between Debian and Ubuntu developers that took place in June 2009 involved an Upstart transition plan for Debian. It’s a very good demonstration of just how convoluted it would have been for Upstart to coexist with sysv-rc, and be integrated into Debian’s packaging infrastructure on top of that. As late as April 2010, Petter Reinholdtsen was writing of his hopes that “for Squeeze+1, /sbin/init in sysvinit will be repaced [sic] with upstart while keeping most init.d scripts and insserv in place.” Less than a month later, he was expressing interest in systemd.
Fedora used Upstart from versions 9 to 14 largely as a dumb wrapper for initscripts. Scott James Remnant reported attending GUADEC in 2007, where he says he “took part in a BOF along with members of Fedora/RedHat and SuSE to discuss Upstart and how
it fits in to the “big picture” alongside udev, HAL, D-BUS, etc.” – evidently these were the primary concerns of the market leaders at the time.
In summary:
- Virtually all work from the period of about 2001-2010 was centered around incremental and ad hoc improvements to initscripts, rather than replacing or otherwise overcoming them.
- Parallel startup was not held in universally high regard as a performance optimization feature. Much hope was put instead on the use of readahead and prefetching, besides more mundane profiling of hotspots.
- Among the more outlandish proposals in those days were using the D-Bus daemon directly as a service manager, having initscripts themselves register D-Bus interfaces, and writing a script to generate other initscripts. Most major distributions were surprisingly strict and insistent about adhering to LSB initscript headers.
- The primary bottleneck around revamping the init system was a social one, that of distribution packaging guidelines.
- Init systems live and die by the level of indifference of the package maintainers who must work with them. Only when the complexity of adding more cruft to initscripts started to reach a tipping point did a desire for radical change appear.
2. systemd: the design philosophy and politics
2.1. The frustrated visionaries
In this atmosphere of chaos and disarray, largely of the distribution developers’ own making, the yearning for decisive action to break free of the yoke of technical debt was quite understandable. It took 8 years for Debian to parallelize its initscripts – something had to budge.
Prototyped under the name “BabyKit,” in reference to being a process babysitter, systemd arrived on the scene in March 2010 and rose to hegemonic status with unusual swiftness, becoming the subject of unprecedentedly strong controversy in free software circles that has never truly ended. One may easily speak of a permanent demarcation between a pre-systemd and a post-systemd era.
There are at least three questions that need to be answered so as to put systemd in its proper historical context: firstly, at what point did it shift from being merely an “init system” to a general middleware and platform (and was it ever meant to be limited to a given scope at all?); secondly, were the ambition and vainglory of its developers a decisive factor in its success; thirdly, was it a radical break from the status quo, or did it capitalize on embryonic trends and existing influences?
One of the earliest public debates surrounding systemd was in a May 2010 thread at the Fedora developer list, shortly after the initial Rethinking PID1 story in April. In the Fedora thread, Lennart Poettering shares that the initial impetus for systemd came from an inability to persuade Scott James Remnant of Upstart to adopt certain features and principles:
Kay and I and a couple of others sat down at various LPC
and GUADEC and discussed what we would like to see in an init
system. And we had long discussions, but ultimately most of our ideas
were outright rejected by Scott, such as the launchd-style activation
and the cgroup stuff, two of the most awesome features in systemd
now. (That said, we actually managed to convince him on other points,
i.e. I believe we played a role in turning him from a D-Bus-hater into a
D-Bus-lover).

But anyway, these discussions did happen, over years. But there is no
recorded archive of that, no mailing list discussion I could point you
to, sorry. You can ask Kay and me and Scott about it though.
The most interesting part of the discussion is the back-and-forth between Lennart Poettering and Casey Dahlin, an Upstart developer, on such subjects as the merit of cgroups, dependencies versus events, etc. Not only that, but the infamous issue of user sessions with nohup and screen/tmux being killed on logout was presciently brought up there. At that point, Lennart had not made up his mind as to proposing systemd for Fedora 14. Near the end of the thread, Lennart blows up at Scott James Remnant.
By far the major aspect of systemd’s design that Lennart routinely hyped in the early stages was that systemd is not a dependency-based init, despite being commonly understood as such. (As recently as October 2019, systemd developer Michal Sekletar was presenting it as a “dependency based execution engine.”) Instead, the use of socket activation was meant to obviate explicit dependency information entirely:
systemd-style activation is about parallelizing
startup of (mostly) local services and making fully written dependencies
obsolete. And that’s what is so magic about it. And an init systemd [sic]
should be designed around this at its core! And systemd is.
This is something that needs to be reiterated throughout, as it reflects a certain vision that systemd had which was never quite fulfilled and subsequently minimized.
In an LWN article that covered the aforementioned mailing list thread, Lennart Poettering (alias mezcalero) can be seen throughout the comments. The use of X-activation as a replacement for dependencies is a theme he hammers continuously. Lennart refers to every unit type as a form of “activation” in its own right: “Yes, we currently handle socket-triggered, bus-triggered, file-triggered, mount-triggered, automount-triggered, device-triggered, swap-triggered, timer-triggered and service-triggered activation.”
He unambiguously declares that dependencies are only intended for early boot:
However, for most cases for normal services you will not need to manually configure any dependencies, as the various ways of activation will deal with that automatically. Manual dependency configuration is really only necessary for units necessary for the very early boot, or the very late shut down.
Or in other words: we use dependencies internally, we also expose them to the user, but he should only very seldom need to make use of them.
This is a bit different from launchd, which really knows no dependencies at all. For that they pay a certain price though: their early boot-up is done completely differently from the actual service startup. Since our Linux start-up procedure is more complicated then theirs we hence decided to provide both: the dependency-less main boot-up plus manual dependencies for a few special services involved with very early boot.
Lennart also dismisses the need for a generalized event broker, which systemd is again commonly interpreted as (for instance, the Debian Reference describes systemd as an “event-based init(8) daemon for concurrency”):
So, I fail to see why you’d want any generalized eventing system beyond the dependency system that systemd offers to you and the various notification systems the kernel already provides, such as inotify, netlink, udev, poll() on /proc/mount, and similar. If apps want those events they should use those notification facilities natively, there is little need to involve systemd in that.
Once more on dependencies:
First of all, you create the impression that systemd’s core design was about dependencies. Well, it is not. Dependencies are supported by systemd only for the purpose of early boot. Normal services should not really use that for much. One of the nice things in systemd is that the kernel will get the dependencies and ordering right for you, out of the box.
Also, claiming that launchd’s or systemd’s core design was around on-demand loading of services is misleading. While we support that too, we do socket-based activation mostly to parellelize boot-up and to get rid of explicitly configured dependencies. Only a few services will actually be started on-demand in a systemd system. Most will be started in any case, but fully in parallel due to the socket activation logic.
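Concretely, the pattern Lennart describes amounts to a pair of units – names here hypothetical – where systemd itself binds the socket early at boot and hands it to the service on first use, so the service’s clients need no explicit dependency on it:

# foo.socket: systemd binds /run/foo.sock itself, long before
# foo.service runs; clients can connect at once and simply block.
[Socket]
ListenStream=/run/foo.sock

[Install]
WantedBy=sockets.target

# foo.service: started on first connection, or at boot in parallel
# with everything else; it inherits the already-bound socket as a
# file descriptor via the sd_listen_fds(3) protocol.
[Service]
ExecStart=/usr/sbin/food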
This was also the rationale for the introduction of the DefaultDependencies= directive in July 2010.
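By default, a service unit is implicitly ordered after sysinit.target and basic.target and against shutdown.target; an early-boot unit opts out with DefaultDependencies=no and declares its few real orderings by hand. A rough, hypothetical sketch:

# Hypothetical early-boot unit that must run before normal services.
[Unit]
Description=Example early-boot setup
DefaultDependencies=no
Before=sysinit.target

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/early-setup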
Another implication of systemd’s ruthless embrace of parallelization is that its startup sequence is a dataflow, without discrete points in time like “first” or “last,” and without precise priority-based ordering guarantees. This was confusing enough that Lennart and Kay repeated the point often, e.g. in November 2010:
Well, since we start everything in parallel and without waiting there
isn’t really a point in time where we know we finished start-up. Such a
point simply does not exist.

The big problem here is that there isn’t really a well defined point in
time anymore where bootup is finished. In traditionally sysvinit startup
messages were only printed on the console for proper sysv services and
only when started at boot time. However, in the much more dynamic
systemd we print them for all services (including D-Bus services, which
actually account for more services than SysV on most setups right
now). The effect of that is that services come all the time and there’s
little point in synchronizing getty startup to that.
In January 2012, Kay Sievers wrote: “The order for non-dependent services is not defined.”
Again by Lennart in October 2015:
There’s no concept of running things “first” or “last” during boot-up
or shutdown, and there’s no concept of freezing execution of other
units if yours is running, as that would be pretty weirdly defined in
a mostly parallel scheme like systemd where connecting to another
service or invoking another method on another service might result in
automatic activation.
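The corollary is that ordering in systemd is only ever pairwise and relative: After= orders a unit against another without pulling it in, and Wants=/Requires= pull a unit in without ordering anything. A unit that genuinely needs another declares both – a hypothetical sketch:

# bar.service: without After=, both units would start concurrently;
# without Wants=, foo.service might never be started at all.
[Unit]
Wants=foo.service
After=foo.service

[Service]
ExecStart=/usr/sbin/bard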
(An early bug that drove this point home: having the rootfs-remounting service pull in the swap target turned out not to be equivalent to pulling in the individual swap units, until an implicit dependency for them was created upstream. Besides that, many people have struggled to shoehorn imperative tasks – like unmounting NFS shares at a very specific point in the shutdown sequence – into this model.)
A more subtle implication of the parallel dataflow approach that Lennart shared in a March 2011 post is the inability to make guarantees about which set of services will be started for a given target:
I am not convinced this would really be useful. We should conisder [sic] a
system fully dynamic these days: the set of services running is no
longer the the one that has been started on boot, but the sum of all
those which got triggered sometime during the past. And triggers can
even work differently if they are used in conjunction. Hence it is
really difficult to answer questions like “Hey, will this service be
running if I boot into multi-user.target”, because the answer is mostly
“Well, depends, if the user started application foo, are plugged in
hardware bar…”.

I really don’t want to create the impression that we could reliably tell
people if a specific service will be running if they start a specific
target, because it’s impossible to do.
(This was also the reason why the snapshot unit type was eventually removed in systemd-228.)
Prior to systemd-8, Type=oneshot was called Type=finish, RemainAfterExit= was called ValidNoProcess=, and RefuseManualStart= was called OnlyByDependency=. An August 2010 thread suggests these names were influenced by Upstart’s “tasks” – and, by extension, that Upstart was the primary genealogical influence on systemd. The use of “job” objects for state-change transitions is another similarity.
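In their modern form, the renamed directives typically appear together in one-off task units – a minimal, hypothetical example:

# Runs once at boot; RemainAfterExit=yes keeps the unit reported
# as "active" after the process exits, so dependents stay satisfied.
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/local/sbin/apply-tunables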
Before the introduction of the journal, systemd shipped a small helper service called systemd-kmsg-syslogd which forwarded the contents of /dev/log to the kernel ring buffer, which would then be flushed to disk by a syslog implementation.
An early glimpse into the character of systemd development is Kay Sievers’ response to a Gentoo user who asked whether systemd would offer similar levels of flexibility to OpenRC’s /etc/conf.d/. Instead of answering the question, Sievers responds with a snarky misdirection, simply listing off a bunch of systemd features. Subsequently, Sievers openly acknowledges his contempt for the questioner, while adding that flexibility is actually a misfeature: “Honestly, most of the “flexibility” of sysv, and probably openrc too, i don’t know the details here, is because nobody ever came up with something that makes almost all of this “flexibility” absolutely needless in the real world setups.”
timedated, localed and logind were all introduced in systemd-30. Sievers initially hinted that ConsoleKit would be replaced as a systemd daemon in September 2010 when responding to a question about getting rid of statically initialized gettys: “Maybe we can do that when we move most of the session-tracking stuff from ConsoleKit into systemd and kill the ConsoleKit daemon.”
By July 2010, Lennart embarked on a daemon patching spree to exploit systemd-style socket activation to its fullest:
Note that native systemd support has already been merged into quite a
number of projects now. I have recently started to post my remaining
patches to the various projects, so expect them there soon too. I’ll
also begin to encourage people to include systemd service files for the
remaining Fedora default install services now, i.e. will push for this
on fedora-devel. It would be great if other distributors could start
this push, too.
On fedora-devel during the same period:
At this point the following packages in rawhide have been updated to
provide socket based activation [Hint: in case you are wondering, socket
activation is one of the amazing things in systemd, see the original
announcement for details]: dbus, udev, avahi, rtkit. Before F14 I want
to at least add rpcbind and rsyslog to this list for socket based
activation, and most likely cups. For rpcbind/rsyslog the patches have
been submitted upstream, and even have partly been merged already).

It would be great if we could ship native unit files (as replacement for
the current sysvinit scripts) for as many packages in Fedora as we can
for F14, and in particular for all those services that are installed by
default.
On to the central thesis of this section: behind all of these design choices was a radical vision, often left only implicitly stated, for systemd as the Linux equivalent of the Mach bootstrap server. On macOS, this is launchd. launchd has no explicit dependency directives, instead expecting each service to register itself via IPC and be started on demand, i.e. daemons cooperatively resolve their own dependencies. launchd is tightly coupled to various undocumented features of the XNU kernel, like “coalitions” for adaptive scheduling policies. Certain events like availability of storage volumes and network interfaces have special object representations for use in service property lists. “Technical Note TN2083: Daemons and Agents” from the Apple documentation archive elaborates on this model in detail. Different bootstrap namespaces and hence available services exist for different user sessions. In the brave new world that systemd’s developers would have liked, every service would have registered itself over kdbus for system-wide or user bus activation, sent resource control requests via IPC to systemd’s cgroup broker in PID1, delegated IPC access control decisions analogously to XPC services on macOS, used method calls over systemd D-Bus services instead of direct system calls for many configuration tasks, etc. – all in the vein of “plumbing layer as the new kernel.”
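For comparison, a launchd daemon declares no dependencies at all: its property list only describes the endpoints through which it can be demand-started. A sketch, with hypothetical label and paths:

<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
    <!-- launchd owns the socket and starts the program on the first
         connection; no ordering or dependencies are declared. -->
    <key>Label</key>
    <string>org.example.food</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/libexec/food</string>
    </array>
    <key>Sockets</key>
    <dict>
        <key>Listener</key>
        <dict>
            <key>SockPathName</key>
            <string>/var/run/food.sock</string>
        </dict>
    </dict>
</dict>
</plist>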
A comment in March 2011 by Lennart hinted at this:
My look on the whole situation is actually that these targets are
stopgaps in a world where not all daemons are socket activatable and
where some daemons are written in a way that they assume that the
network is always up and does not change dynamically.

In a world where where [sic] all services are socket activatable and all
daemons subscribe properly to netlink we don’t need neither
syslog.target nor network.target, because the syslog daemon is always
accessible and it doesn’t matter anymore when the network is configured
and when not.
Netlink would have been superseded by kdbus, of course.
In practice, this vision did not quite pan out. systemd is overwhelmingly used as a dependency-based init, downstream vendors like Debian and Ubuntu apply numerous distro-specific patches, and daemon writers have not been completely steered, even if many flagship projects like Flatpak make extensive use of systemd features now that it is the standard. The distros have consented to a great deal of unification with the promise of easing their maintenance burden, but not so much as to lose their identity. With the recent Debian GR (and its ensuing sense of despondence) opening up the possibility of many of systemd’s configuration management features becoming part of packaging infrastructure, this may change yet again, but it remains to be seen.
Part of this is that systemd as an upstream still operates as a node in a bazaar, acting without direct regard for specific Linux vendor interests, with the inevitable antagonisms that follow, no matter how much it may have tried to overcome this and corner the bazaar into being a vertically integrated megacorp reaping the benefits of scale economies. There is an almost dialectical relationship in the way that trying to unify a bazaar only reinforces its contradictions more strongly, as the participants gain a greater self-consciousness of their position within the software distribution channel. Lennart confessed as much back in October 2014:
Well, systemd is a suite of components that people build OSes from. As
such it isn’t really an app you install on top of your OS, it’s more a
toolset for distro and device builders. Now, if end users have
questions about details how they can uses these devices and distros,
then I figure they should always contact the manufacturers of these
devices, and the distro developers first.

Or in other words: we are not the final product that people should
interface with, we just provide a set of components where other people
can build final products out, and by doing so they also need to take
the responsibility for providing a first level of help for it.
The systemd developers largely demoted the primacy of this vision after 2015 or so, once they had secured their ubiquity and once the commercial interests behind containerization shifted the focus to stateless systems, immutable image-based deployments and the like. However, it played a crucial role from about late 2012 to 2014 in systemd’s rise to power, and manifested itself in numerous widely publicized developments, which are the subject of the next section.
2.2. The rise to power
The systemd developers wasted no time in proselytizing. Apart from the initial socket activation patching for daemons mentioned above and the spats at fedora-devel, it was made the default in Fedora Rawhide by July 2010.
A Fedora development thread from August 2010, concerning packaging guidelines for systemd acceptance, has Lennart in a belligerent posture toward anyone showing the slightest commitment to backwards compatibility or stability. In response to Matthew Miller’s concern over inittab(5) compatibility, he quips: “Maybe we should check AUTOEXEC.BAT first, too?”
When Daniel J. Walsh inquires about respecting SELinux labels, Lennart says: “This is not fair! Upstart never did this.”
When Bill Nottingham takes issue with his dismissing distro concerns as being upstream problems and not his, we hear from him: “Yay, thanks that you don’t care. You are aware that by putting
everything on a single man’s shoulders and then telling him “you don’t care” you make him feel really welcome and make him wonder why he even bothers with this shit?”
Finally, when Matthew Miller tells him that there are release requirements to be fulfilled in order to build an integrated system, Lennart sardonically fires back:
Great, I geuss [sic] I’ll become an X hacker then. Apparently if KMS is borked
it’s now my duty to fix it. Yay!
The entire thread demonstrates a callous and infantile lack of consideration on Lennart’s part for the fact that system integrators have many moving parts to iron out before they can add in the shiniest new toy. Naturally, this soap opera was also covered by LWN.net.
Be that as it may, the high ambitions started early. In July 2010, Lennart was already confident that the distros were converging:
7) Q: Are the other distros switching too?
A: Well, they are not as fast, but it looks very much like it. I have
been working very closely with Kay Sievers from Suse to integrate
systemd equally well into the OpenSuse semantics, and it will
enter their distro as soon as their development cycle
reopens. Debian, Gentoo, ArchLinux have packages in their
archives, but it’s not the default there (yet?). I guess at least
for Debian/Gentoo it is very hard to make decisions about dfaults
like this. Meego has someone working on this, but I have
not followed that closely. And there are some smaller distros
which have adopted it too, and I know at least one (Pardus) that
plans to make it the default in the next release. And regarding
Ubuntu I leave it to you figuring out what is going on (Hint: you
might want to check out for what company the main developer of
Upstart – which systemd replaces – works for…). And never
forget that Fedora is of course the leader in development, so it
should be us who lead here… 😉
In September 2010, he explicitly pledged cross-distro unification as a goal:
Well, it is definitely our intention to gently push the distributions in
the same direction so that they stop supporting deviating solutions for
these things where there’s really no point at all in doing so.

Due to that our plan is to enable all this by default in “make
install”. Packagers may then choose to disable it by doing an “rm” after
the “make install”, but we want to put the burden on the packagers, so
that eventually we end up with the same base system on all
distributions, and we put an end to senseless configuration differences
between the distros for the really basic stuff.

If a distro decides that for example the random seed save/restore is not
good enough for it, then it’s their own job to disable ours and plug in
their own instead. Sooner or later they’ll hopefully notice that it’s
not worth it and cross-distro unification is worth more.
The four major trends giving leverage to systemd were: the udev merge, kdbus, logind and GNOME’s usage thereof, and the single-writer proposal for the unified cgroup hierarchy.
The first indication of intent by the systemd developers to integrate with GNOME was as early as January 2011:
My rough plan is to introduce systemd sooner or later as session manager
into GNOME and then come to a new definition of what an app is along the
lines of “one cgroup per app, matching one .desktop file”. That
information would then be available to things like gnome-shell to match
processes to apps to desktop files to windows, instead of the current
heuristics everybody uses. The goal in the long run is definitely to
give the foreground app an extra CPU boost, where gnome-shell decides
what the foreground app is.
The first major step towards an implementation came when Lennart lobbied the GNOME development list in May 2011, in what proved to be a long and contentious thread. In this post, Lennart indicates that localed and timedated were meant specifically for desktop widgets, and proposes that gdm and gnome-session make use of logind and a per-user systemd instance, respectively. Further, he claims regarding systemd’s prevalence that “[the] majority of the big and small distributions however has switched by now or is planning to switch in their next versions, or at least provides packages in the distribution. The one exception is Ubuntu.”
By GNOME 3.4 in January 2012, hostnamed, timedated and localed from the systemd package were used by default.
(As it turns out, there was a little-known Freedesktop-affiliated project called xdg-hostname in 2009 which was a start towards creating such D-Bus “utility daemons” for use in desktop widgets and others, all as a standalone project. Had this been followed through instead of having the utility daemons end up being part of the systemd source tree, a good deal of political acrimony could have been avoided – though at the cost of reducing systemd’s leverage.)
It would not be until October 2012 that more significant controversy would ignite as gnome-settings-daemon was officially migrated over to logind starting from GNOME 3.8. Objections were raised by developers from Gentoo, Ubuntu, OpenBSD, Solaris and others. Work on adding further systemd integration for user session management began soon afterwards (see also), which affected components like gnome-session and gnome-shell. As of GNOME 3.34, it is fully systemd-managed, and running a full GNOME desktop without systemd requires extensive patching.
A great deal of confusion at the time stemmed from uncertainty over just which components were the dependent ones, whether the dependencies were build-time or runtime, and the entire semantic distinction between depending on systemd and depending on an interface exported by a component of the systemd suite. At the time, the systemd Interface Portability and Stability Chart listed logind as not independently reimplementable. Note also GNOME developer Olav Vitters’ contemporary equivocation on the subject.
logind’s use of scope and slice units made the question further entangled with the future status of the cgroups subsystem, a hot topic in 2013 and 2014 discussed below.
By 2013 and 2014, Debian and Ubuntu were nearly the last holdouts among the major distributions, and the GNOME issue was, among others, a significant motive for the former’s acceptance of systemd.
(Arch Linux had switched by August 2012 with some drama, but mostly localized. Within Arch, then-maintainer of initscripts Tom Gundersen was a major voice for systemd and an upstream systemd developer, later the main architect behind systemd-networkd. He expressed ‘cross-distro collaboration’ as an important goal of his: “IMHO, a nice goal would be to increase cross-distro collaboration. How well are the different major distributions represented in your contributor base? I think a strong point of systemd is that they have active contributors from pretty much all major distros (including gentoo and arch, but possibly with the exception of ubuntu, I’m not sure).”)
For instance, Debian developer Ansgar Burchardt spoke at the TC in January 2014 of the wider ecosystem around Desktop Linux nudging toward systemd:
On the other hand even when using upstart as an init replacement, we’ll
continue to use large chunks of systemd (logind, other dbus
services). I personally think “less is more” would only be a convincing
argument if we actually would not need the aditional [sic] features.

I also have one question: your mail doesn’t mention the integration
problems with logind into a system that uses upstart and not systemd as
init. Do you think this will not be an issue? Given it means ongoing
work instead of a one-time investment, this is one of my main gripes
with upstart. I feel that minor technical differences between the init
replacements are not work [sic] committing to long-time maintaince [sic] of a
systemd-logind branch that works outside of systemd. There are more
interesting areas we can invest our resources into.

Note that this might also include session management functions in the
future. As you mentioned yourself in [1], DEs are looking into using
advanced session supervision. So far both kwin and GNOME seem to target
systemd for this. So this would be another area that we would need to
invest resources into to maintain an upstart replacement.
Josselin Mouette, then a GNOME packager for Debian, wrote his statement in October 2013:
Systemd is becoming a de facto
standard in Linux distributions (at least Fedora, SuSE and Arch), and is
getting excellent upstream support in many packages. So far, only Gentoo
uses OpenRC (and it doesn’t have most of the features I’d like to have),
and only Ubuntu uses Upstart. Therefore using OpenRC would mean
maintaining many patches on our own, and using Upstart would mean that
our upstream would become Ubuntu.

…Finally, I say this as one of the GNOME packages’ maintainers. GNOME in
jessie will need systemd as the init system to work with all its
features, just like it needs the network configuration to be handled by
NetworkManager. While it is (and can remain) possible, just like in the
NM case, to install it without systemd and lose functionality, I think
it is unreasonable to ask for a default GNOME installation without it.
Most pertinently of all, Russ Allbery’s highly influential summary of the Debian init situation in December 2013, in section 3.1. “Ecosystem Reality Check” conceded that the real debate was never systemd-vs-the-alternatives, but how-much-of-systemd:
One of the points that I think may have been obscured in the discussion,
but which is important to highlight, is that basically all parties have
agreed that Debian will adopt large portions of systemd. systemd is an
umbrella project that includes multiple components, some more significant
than others. Most of those components are clearly superior to anything we
have available now on Linux platforms and will be used in the distribution
going forward.

In other words, this debate is not actually about systemd vs. upstart in
the most obvious sense. Rather, the question, assuming one has narrowed
the choices to those two contenders, is between adopting all the major
components of systemd including the init system, or adopting most of the
major components of systemd but replacing the init system with upstart.
Either way, we’ll be running udev, logind, some systemd D-Bus services,
and most likely timedated and possibly hostnamed for desktop environments.

I think this changes the nature of the discussion in some key ways. We’re
not really talking about choosing between two competing ecosystems.
Rather, we’re talking about whether or not to swap out a core component of
an existing integrated ecosystem with a component that we like better.
And so it happened.
Allbery subsequently reiterated this point with a wider digression on the nature of volunteer labor in a community project. His proposal that the question reach a GR was over this very social awareness: “That’s not a technical question; that’s a question of overall project direction, a question about whether we force loose coupling even when our upstreams are preferring tight coupling.”
The other big push factor was the udev merge. It was formally announced in April 2012. At that time, Kay Sievers promised that builds for usage in non-systemd systems would be officially supported and that the merger largely amounted to a build system change. Yet only months later in August Lennart was openly declaring non-systemd usage of udev a “dead end”:
Yes, udev on non-systemd systems is in our eyes a dead end, in case you
haven’t noticed it yet. I am looking forward to the day when we can drop
that support entirely.
This concerning declaration by the Fed chairman, along with other grievances, paved the way for an acrimonious fork in the form of eudev.
The uncertainty was exacerbated in May 2014 when the infamous “Gentoo folks, this is your wakeup call” message was published, dealing with the proposed transition of udev to using kdbus as transport:
Also note that at that point we intend to move udev onto kdbus as
transport, and get rid of the userspace-to-userspace netlink-based
tranport [sic] udev used so far. Unless the systemd-haters prepare another
kdbus userspace until then this will effectively also mean that we will
not support non-systemd systems with udev anymore starting at that
point. Gentoo folks, this is your wakeup call.
This was further clarified to include libudev, reneging on earlier promises of compatibility:
Anyway, as soon as kdbus is merged this i [sic] how we will maintain udev, you
have ample time to figure out some solution that works for you, but we
will not support the udev-on-netlink case anymore. I see three options:
a) fork things, b) live with systemd, c) if hate systemd that much, but
love udev so much, then implement an alternative userspace for kdbus to
do initialiuzation [sic]/policy/activation.

Also note that this will not be a change that is just internal between
udev and libudev. We expect that clients will soonishly just start doing
normal bus calls to the new udev, like they’d do them to any other
system service instead of using libudev.
kdbus (an effort unveiled in January 2014) itself was desired for several reasons. One was to rid systemd itself of its cyclic dependency on the D-Bus daemon and hence remove the code for the /run/systemd/private socket as well as fix some ordering issues with the journal on shutdown. Increasing logging throughput and more general performance improvements were also on the table.
(The basic dependency loop is that D-Bus needs a logging service [the journal], which requires systemd, which requires D-Bus. In fact, to this day on early bootup systemd exposes certain unit properties in /run/systemd/units specifically for the journal to consume before it can query D-Bus, because as stated in a source code comment in src/core/unit.c: “Ideally, journald would use IPC to query this, like everybody else, but that’s hard, as long as the IPC system itself and PID 1 also log to the journal.” Had a kernel-level D-Bus existed, it would have been available on early boot and hence broken the loop. Of course, the right thing would have been not to (ab)use D-Bus for PID1 at all, but that ship sailed long ago.)
The initial motivations for kdbus appear to have been for use as a capability-based IPC mechanism for application sandboxing, particularly for a planned GNOME application feature called “portals,” similar to powerboxes in the capability-based security literature. This scheme would ultimately be implemented over plain D-Bus for xdg-app (later Flatpak).
Additionally, in a Google+ post by Lennart from October 2013, it was revealed that it would have also been used as service discovery for daemons:
The other thing is kdbus. The userspace of kdbus pretty much lives inside of systemd. Bus activation work will be using the same mechanisms as socket activation in systemd, and again you cannot isolate this bit out of systemd. Basically the D-Bus daemon has been subsumed by systemd itself (and the kernel), and the logic cannot be reproduced without it. In fact, the logic even spills into the various daemons as they will start shipping systemd .busname and .service unit files rather than the old bus activation files if they want to be bus activatable. Then, one of the most complex bits of the whole thing, which is the remarshalling service that translates old dbus1 messages to kdbus GVariant marshalling and back is a systemd socket service, so no chance to rip this out of systemd.
By June 2015 with systemd-221, kdbus support was no longer optional at build time, and its use was being encouraged in development and testing branches of upstream distributions. kdbus was included in Fedora Rawhide kernels from July 2015 until being dropped in November.
Last but not least, the cgroupv2 redesign toward a unified hierarchy. The gist of it at the time is explained here – instead of each cgroup controller being attached to an independent tree/hierarchy, this is all flattened into one. But the bigger issue was that at the time this was proposed in 2013, the idea was that cgroups were to have a strict single-writer constraint with only one process on the system allowed to write to the tree, and no possibility of subtree delegation. Since systemd-the-PID1 has cgroups as a core feature and was approaching universal adoption, this would have made systemd the de facto monopoly on the use of cgroups in most GNU/Linux systems to the detriment of container runtime developers and advanced users in general – unless, of course, they ditched systemd.
This was a story one could not have missed in 2013, with LWN devoting massive volumes of coverage, but has surprisingly been forgotten as a pivotal moment in the systemd saga. It seeped in with the rest of GNOME/logind, kdbus and udev to create an interlock. One of the main defenses on the part of GNOME and systemd developers, that GNOME did not depend on systemd but only on logind which could in principle be emulated (even though systemd’s own interface stability chart claimed otherwise), was rendered moot in the light of these new developments. Olav Vitters of GNOME admitted this. Josselin Mouette, the aforementioned GNOME packager for Debian, explained in December 2013:
Systemd developers are getting ready to part 3 [when cgroupv1 is deprecated] by working closely with
the kernel cgroups developers. It is not clear to me whether cgmanager
will be able to do the same: from my discussions with more knowledgeable
people, it is merely exposing the current cgroups API in D-Bus calls.
This approach cannot work transparently when the API changes. Therefore,
we might only have one available cgroups arbitrator in the end: systemd.

These other parts have to migrate to a D-Bus-based interface. The
problem is that systemd and cgmanager developers have not been able so
far to agree on a common API. The consequences for those
cgroups-consuming services are easy to infer.
Some services will only support systemd.
Some will use more complex code in order to support both.
Some will wait until a “standard” emerges and will not work
towards the transition.
The same person, in a January 2014 back-and-forth with Upstart developer Steve Langasek, then dropped all pretense of logind being a separable component from systemd:
The fact that logind
used to work without systemd as init was purely coincidental, since
logind was designed as an integral part of systemd from the very
beginning (particularly because of the cgroups design). This specific
change might has [sic] been triggered by anticipation for a kernel change (a
change that still doesn’t exist), but if not for cgroups, it would have
been for another reason.
Lennart himself, in the previously linked Google+ post from Oct 2013:
The kernel folks want userspace to have a single arbitrator component for cgroups, and on systemd systems that is now systemd, and you cannot isolate this bit out of systemd. The Upstart world has exactly nothing in this area, not even concreter plans. There are dreams of having some secondary daemon taking the cgroup arbitration role, but that’s a complex task and is nothing you can just do at the side. I am pretty sure the Ubuntu guys don’t even remotely understand the complexities of this. Control groups of course are at the center of what a modern server needs to do. Resource management of services is one of the major parts (if not the biggest) of service management, and if you want to stay relevant on the server you must have something in this area. The control group stuff exposes an API. The API systemd exposes is very systemd-specific, it is unlikely that whatever solution Upstart one day might come up with will be compatible to that. Of course, at that time a major part of the Linux ecosystem will already use the systemd APIs…
A June 2013 post on the systemd mailing list talks of what the D-Bus API for systemd as cgroups single-writer would have looked like. Worth noting is that kernel hacker Andy Lutomirski proposed a possible solution to this conundrum by beefing up subreapers for reliable process tracking so as to avoid systemd’s use of cgroups entirely and free up management of cgroups to different processes – the systemd devs expressed no interest.
Regardless, the initial announcement by Lennart of this upcoming change was remarkably bombastic:
This hierarchy becomes private property of systemd. systemd will set
it up. Systemd will maintain it. Systemd will rearrange it. Other
software that wants to make use of cgroups can do so only through
systemd’s APIs. This single-writer logic is absolutely necessary, since
interdependencies between the various controllers, the various
attributes, the various cgroups are non-obvious and we simply cannot
allow that cgroup users alter the tree independently of each other
forever. Due to all this: The “Pax Cgroup” document is a thing of the
past, it is dead.
This predictably led to a massive flame war in its aftermath, most notably on the LKML between Lennart Poettering and Google developer Tim Hockin. Highlights include Lennart interjecting for a moment to say that “systemd is certainly not monolithic for almost any definition of that term” right as modular toolkit approaches like libcgroup were about to be obsoleted. Tim Hockin later reported being rebuffed in trying to standardize a common cgroup API between the developers of systemd and those of cgmanager, an alternative cgroup writer at the time examined with interest by LXC and Ubuntu.
Thus, when Lennart said in October 2013 that the Linux userspace plumbing layer resided in systemd, he was quite right:
I believe ultimately this really boils down to this: the Linux userspace plumbing layer is nowadays developed to a big part in the systemd source tree. Ignoring this means you constantly have to work with half-baked systems, where you combine outdated components which do not belong to together into something that in many facets might work but is hardly integrated or consistent. Or to put this another way: you are at a road fork: either you take the path where you use the stuff that the folks doing most of the Linux core OS development work on (regardless if they work at Red Hat, Intel, Suse, Samsung or wherever else) or you use the stuff Canonical is working on (which in case it isn’t obvious is well… “limited”).
People on the email thread have claimed we had an agenda. That’s actually certainly true, everybody has one. Ours is to create a good, somewhat unified, integrated operating system. And that’s pretty much all that is to our agenda. What is not on our agenda though is “destroying UNIX”, “land grabbing”, or “lock-in”. Note that logind, kdbus or the cgroup stuff is new technology, we didn’t break anything by simply writing it. Hence we are not regressing, we are just adding new components that we believe are highly interesting to people (and they apparently are, because people are making use of it now). For us having a simple design and a simple code base is a lot more important than trying to accommodate for distros that want to combine everything with everything else. I understand that that is what matters to many Debian people, but it’s admittedly not a priority for us.
Additionally, in October 2014, core systemd developer Zbigniew Jędrzejewski-Szmek compared running different init systems to running different processor architectures, demonstrating systemd’s highly subsuming vision of the world:
For such basic functionality that influences the whole OS, if the maintainer uses a different
init, it is like being on a different architecture.
In the position of Debian or any other community distro, the situation in 2013-4 was this: the major desktop environments depended on logind for power management, which in turn meant depending on systemd due to the single-writer constraint for cgroups; systemd in turn hosted the userspace and service activation layer for kdbus (which also encompassed hotplug with udev), over which daemon writers would have registered interfaces. This was the seemingly inexorable direction, and hence the decision was sealed before it was deliberated.
2.3. Complacency and the loss of purpose
Except, of course, this didn’t happen. kdbus became the kdbuswreck after meeting resistance from kernel maintainers such as Andy Lutomirski and Eric W. Biederman, much to the chagrin of hopeful Desktop Linux buffs all over. Subsequent endeavors like BUS1 did not succeed, either. The cgroupv2 API did gain subtree delegation, with systemd exposing it as well, though many of the other redesigns made in anticipation of the single-writer changes were left intact. Also, for 4 years starting from September 2013, the developers left up this misleading document on the unified cgroup hierarchy on their wiki, as if it were an accomplished fact and not an unfulfilled proposal, until in November 2017 it was updated to mention the Delegate=yes option.
Regardless, these developments served their propaganda value, and by 2015 systemd had decisively secured its place.
Their victory came at the price of their vision. The world of “systemd as Mach server” would be set back by the failures of kdbus and the single-writer cgroupv2 API. Dependencies were used far beyond early boot and were dominant over the various X-activation paradigms. However, many of the auxiliaries like host/timedate/locale/logind were becoming successful, including the container tools like machined and nspawn. tmpfiles (and later sysusers) would go on to be used in configuration management.
Once the thumotic energy for conquest is spent, a protracted period of mediocrity ensues. Yet the ideological imperative of “making progress” and “pushing things forward” remains, and with it new directions and justifications must be found to maintain the illusion for the developers and their followers.
Things really jumped the shark when systemd received its own annual conference called systemd.conf in 2015, rebranded to “All Systems Go!” in 2017 with a more general Linux userspace focus. Still, witnessing the project’s own marketing evolution throughout the years is instructive.
That cross-distro unification had been a goal since 2010 was documented above. In Poettering’s January 2011 LCA presentation “Beyond init,” systemd is described as a “system and session manager for Linux,” but at the same time already a “basic OS building block” and for “cross-distribution standardization.” The future tasks listed at that time were fairly modest: session management and automatic initrd fallback.
At LinuxCon Europe 2012, a surprisingly adulatory LWN piece states that: “…the developers redefined systemd a little more, to be not just an init system, but also a platform.” Moreover:
Unfortunately, Lennart ran out of time, so that he was unable to go over his thoughts about the future of systemd. However, after two years, it’s clear that systemd is an established part of the Linux landscape, and there are increasing signs that it is moving toward becoming an essential part of the operating system.
In 2013, Lennart delivers a status report on “systemd: The First Two Years”, where systemd is called both an init system and a platform. It is targeted towards all platforms: mobile, embedded, desktop and server. Here, the future tasks are becoming more ambitious, which coincides with our analysis of 2013-4 as the “critical period” or peak of systemd’s development: container support, cloud/cluster support, kdbus and the vaguely specified “apps” are what were promised as directions further on.
In Lennart’s 2014 GNOME Asia talk, systemd is a “system and service manager, a platform, the glue between the applications and the kernel.” Universal adoption had been reached. The objectives are much more self-flattering and grandiose: “Turning Linux from a bag of bits into a competitive General Purpose Operating System,” “Building the Internet’s Next Generation OS,” “Unifying pointless differences between distributions,” “Bringing innovation back to the core OS,” “Auto discovery, plug and play is key.” Moreover, it is now emphatically an open-ended project: “Never finished, never complete, but tracking progress of technology.” And as a sly wink: “Never the cathedral, just the building blocks to build it.” Future directions listed are: network management, kdbus, NTP, containers, sandboxing, stateless systems/instantiable systems/factory reset, integration with the cloud.
One can already tell by the explosion of buzzwords that a lack of focus is setting in – the initial rush of boundless dreaming of future glory after a recent victory.
In October 2014, we read: “Our intention with systemd is to provide a strong platform. One platform. If people want to use our code in other contexts, then that’s totally fine, but please understand that I am not going to do any work for that, I am not going to maintain it, and I don’t want to see that in my code.”
In late 2014, we are treated to a talk about stateless systems. Here things are much lighter on specifics. tmpfiles and sysusers are shown off, but the rest is conjectural discussion about btrfs subvolumes and dynamically populating /etc and /var. As a matter of fact, “Stateless Linux” is a project that Red Hat developers had been flirting with on and off since 2004.
2015 was the year of systemd’s first conference, but also relatively incremental in terms of development pace. The merging of gummiboot as systemd-boot was a highlight, as well as networkd improvements and the inclusion of systemd-importd. systemd-resolved emerged as an offshoot from networkd, as the latter began to expand in scope from its humble beginnings in November 2013.
The highlight for 2016 is portable services, which are effectively a container format for system services based on raw disk images or btrfs subvolumes.
In 2017, systemd.conf is renamed to All Systems Go! The content begins to grow increasingly bland. Lennart’s keynote for that year is Containers without a Container Manager, with systemd, largely summarizing systemd’s namespacing, seccomp-bpf and bind-mounting features. A UID randomization feature called “Dynamic Users” is also introduced to ambivalent reception.
2018 and 2019 alike continue with an underwhelming and haphazard selection of talks. The most notable new addition to systemd is systemd-homed, effectively a new name service with motives similar to those of Sun with NIS/YP in the 90s, but specifically for home directories.
Overall, following its developmental peak around 2014 and the subsequent loss of its identity, systemd appears to have mostly shifted its emphasis to refining tooling for containerized deployments, in line with prevailing business interests among Linux Foundation members. Poettering himself dropped an instructive hint as early as November 2015:
sysusers is definitely something we should make a Fedora default, that
is used distro wide, as it makes user registration portable, and is
also what Atomic wants.
“Atomic” being Project Atomic, a Red Hat cloud distribution project that overlapped with projects such as rpm-ostree and later served as the base for Fedora CoreOS and Silverblue. An earlier September 2015 Q&A with Lennart by the CoreOS team is much more explicit about this shift.
This direction was also presaged by the infamous post on Revisiting How We Put Together Linux Systems published September 2014, which largely set systemd’s high-level development goals over the next several years. Much of the work that came out of it was outside systemd proper: Flatpak, OSTree, etc. Within systemd, it culminated in portable services and systemd-homed.
It is an open question how systemd will evolve with newer kernel developments like pidfds, the redesigned mount API, and the general trend of eBPF turning Linux into an extensible hybrid kernel of sorts. Can it march on, or has its vitality been sapped? In any event, only an internal dissension could ever rupture its dominant position.
Summarizing our historical section on systemd:
- systemd had far-reaching ambitions for cross-distro standardization right from the beginning, and no later than January 2011 was already referring to itself as a “basic OS building block.” Integration with GNOME went into planning at around the same time. The window of time during which it was ever “just” an init system was quite brief, at most half a year.
- The ultimate reasons for its adoption were as much social and network effects as technical evaluation. Numerous distro developers were burned out by the nature of distributed bazaar development and welcomed an opportunity to consolidate low-level userspace into a central upstream source tree. Years of piling cruft onto initscripts had produced increasingly difficult-to-maintain arcana, and systemd’s ‘clean reset’ of unit files, in addition to its incursion into major projects, finally gave distributors the green light to speed up their stagnant on-and-off efforts at init modernization. Upstart was much less effective partly because it allowed initscripts to be executed verbatim within script stanzas, and partly because of its esoteric event model.
- Numerous concrete or planned developments (GNOME’s increasing number of runtime dependencies on systemd components, the seemingly inevitable arrival of kdbus and the attendant overhaul of the entire D-Bus ecosystem, the planned end of udev support on non-systemd systems and the switch to a kdbus API for libudev, and the single-writer constraint of the redesigned cgroupv2 API) all intersected within a relatively short period of time, which created the appearance of an unstoppable current: distributions would either assimilate or become irrelevant.
- systemd was built around a grand vision of its original authors that was usually not spelled out in its entirety, but can be reconstructed by reading earlier source materials from its developers. Its intended mode of use (ubiquitous socket and bus activation) did not coincide with its actual usage, which, along with several setbacks involving the kernel maintainers, led to a loss of direction, with increasingly ad hoc and reactive development goals for the project as a whole, bordering on stagnation (which could more charitably be called ‘maturity’).
- systemd did have a certain preexisting appeal to the GNOME/Red Hat/SUSE contingent, owing to the latter’s various unsuccessful attempts over the prior years at basing service management around D-Bus interfaces and bus activation; this tipped the initial scales in its favor. By contrast, this was never a priority for Upstart or any other alternative init system until relatively late.
- Machiavellianism and coalitional politics do not magically wither away just because the software is free.
3. systemd: a technical critique
The proposition that systemd is not a dependency-based init appears to be rather strange. After all, it exports a large number of dependency types to unit file writers. Counting the exact number is difficult, since what systemd internally treats as a dependency differs from what it exports, in addition to the myriad of options that either carry dependency-like side effects or end up being converted to dependencies proper.
10 years on, no satisfactory systematic overview of the systemd architecture exists – one must go through mailing list posts, bug reports and the source code to get any real idea. Consider this bug report of a person bewildered as to how a failed service can still have its Wants= satisfied, who is told vaguely by Poettering that “systemd is a job engine,” and again by Andrei Borzenkov that “systemd really is job engine and ordering dependencies are defined between jobs.” systemd’s developers and almost all public documentation revolve around a “unit-centric” definition of systemd (except when discussing bugs), but my position is that for systemd-the-service-manager the most important construct is not the unit but the job, and hence one ought to understand systemd in a “job-centric” way. The root of many of systemd’s complexities is that in practice units and the jobs queued for them do not have a 1:1 correspondence in terms of their semantics.
With that said, I propose the following quick definition of systemd:
3.1. systemd defined concretely
systemd is an event-driven object manager with dependency-like side effects, which ‘boxes’ primitive kernel resources and userspace subsystems into a generic object type called Unit. These Unit objects are scheduled through the state propagation mechanism of ‘jobs’ and dynamically dispatched via a singleton object called Manager, responsible for launching jobs in ‘transactions’ which perform merging, ordering-cycle detection and consistency checks, and which serve as the main point at which unit dependencies are pulled in. Unit startup is executed as a non-idempotent parallel dataflow with weak ordering guarantees at the job level, mostly independent of the active state of dependent units.
3.2. Units
Units are the grand abstraction by which systemd models the world for the end user.
Units are objects used to represent a unit of work with common methods like start, stop, reload, etc. embedded in a polymorphic vtable dispatched for every individual unit type, of which there are 11: services, sockets, targets, devices, mounts, automounts, swaps, timers, paths, slices and scopes. Units are associated with a Manager, an object that describes an instance of systemd itself, either system-wide or per-user-session. Units hold a load state, an active state, metadata like description and documentation, a hash table of its dependencies, and a list of its condition and assertion checks. Each unit has a job slot, which represents a state change request. They also contain references to a load queue, run queue and D-Bus queue associated with the Manager that runs them. Besides that, there is a variety of data related to a unit’s execution state and booleans that produce ad hoc alterations in its behavior for special cases.
Unit active states include ‘active,’ ‘activating,’ ‘inactive,’ ‘deactivating’, ‘failed,’ ‘reloading’ and ‘maintenance.’ Every unit type has a state translation table that maps type-specific active states to the generic active states.
Unit dependencies can be divided into several categories: ordering versus requirements, forward versus inverse, reload propagators, and others. Ordering is needed due to parallel startup, and can affect which failure states are triggered depending on which job finishes first. Requirements are used to pull in jobs and hence propagate state changes.
Ordering is controlled by Before= and After=. Forward dependencies are Requires=, Wants=, BindsTo=, and Requisite= with their respective inverse dependencies of RequiredBy=, WantedBy=, BoundBy= and RequisiteOf=, used internally. PropagatesReloadTo= and its inverse ReloadPropagatedFrom= are sui generis. Conflicts= and its inverse ConflictedBy= are also a sui generis “negative” dependency. PartOf= is actually an inverse dependency with its forward ConsistsOf= purely internal. OnFailure= and JoinsNamespaceOf= are also considered by systemd to be unit dependency types.
Socket, path, timer and automount activation is facilitated by means of Triggers= and TriggeredBy= dependencies, again not directly available to end users. There is also References= and ReferencedBy= for purposes of garbage collecting units.
RequiresMountsFor= path dependencies are stored in their own separate hash table and treated separately from other dependency types.
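To make these categories concrete, here is a minimal sketch of how the user-facing directives appear in a unit file (unit names hypothetical; the inverse variants like RequiredBy= and BoundBy= are computed by systemd rather than written in the [Unit] section):

    # a.service
    [Unit]
    Requires=b.service             # requirement: pulls a start job for b into the transaction
    After=b.service                # ordering: a's job waits for b's job to complete
    Conflicts=d.service            # 'negative' dependency: queues a stop job for d
    PropagatesReloadTo=c.service   # reload propagator
    RequiresMountsFor=/srv/data    # path dependency, kept in its own hash table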
The addition of every unit dependency is marked by a unit dependency mask denoting its origin. Dependencies added from unit files are marked with UNIT_DEPENDENCY_FILE, but most masks are for dependencies that systemd synthesizes programmatically without direct user input, such as UNIT_DEPENDENCY_UDEV for devices, UNIT_DEPENDENCY_PROC_SWAP for swaps, and UNIT_DEPENDENCY_MOUNTINFO_* for mounts (in both implicit and default variants, denoting implicit and default dependencies respectively).
Unit files are merely one way of loading a Unit object, via explicit manifest. They have their own specific ‘unit installation’ logic ([Install] directives) which is purely lexicographic and distinct from the rest of systemd’s unit and job machinery. This is needed so that systemd can kick off a ‘goal service’ like the default target from which units and their dependencies can be recursively loaded for an initial boot transaction.
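A minimal sketch of that separation (binary path hypothetical): the [Install] section below is never consulted by the job engine; ‘systemctl enable’ merely reads it and manipulates symlinks, which is what later lets the boot transaction pull the unit in recursively:

    # example.service
    [Unit]
    Description=Example daemon

    [Service]
    ExecStart=/usr/bin/exampled

    [Install]
    WantedBy=multi-user.target
    # 'systemctl enable example.service' symlinks this file into
    # /etc/systemd/system/multi-user.target.wants/ and nothing more.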
Beyond representing a generic ‘node’ object for purposes of propagating state changes, units are not a very cohesive abstraction – they can differ on everything from what implicit and default dependencies they pull in, whether they can be created from files, whether they are perpetual, can be run only once, support ordering or other specific dependencies like triggers, support being started, stopped or reloaded at all, whether they encapsulate processes (and if they have ‘main’, ‘control’ or both processes), whether they can be synthesized as transient units, or if they even have a failure state.
3.3. Jobs
A job is a state change request for a unit, associated with a Manager object; its side effects are what actually resolve a unit dependency.
There are four properties to a job: types, states, modes and results.
A job type is an action for transitioning a unit to a different state. These include JOB_START, JOB_STOP, JOB_RESTART (which is JOB_STOP patched to then become JOB_START), JOB_RELOAD, JOB_TRY_RESTART and JOB_VERIFY_ACTIVE. In fact, the latter is not a state transition but a state check, and is what Requisite= queues. Only one job at a time can run for a given unit. Dependencies in systemd are mostly a matter of what jobs get propagated where.
Some complex job types like JOB_TRY_RESTART, JOB_TRY_RELOAD and JOB_RELOAD_OR_START are collapsed respectively to JOB_RESTART, JOB_RELOAD, and JOB_RELOAD or JOB_START, depending on a unit’s active state.
A job state is simple: either ‘waiting’ or ‘running’. For instance, a restart job is set back to ‘waiting’ once its stop phase reaches the ‘done’ result, at which point its type is changed to a start job.
A job mode, as also documented in the --job-mode flag for systemctl(1), affects how a job should preempt other already queued jobs. This extends not only to whether a conflict with a pending job (e.g. a waiting start job about to be turned into a stop job) should fail or be successfully replaced, but to more global changes at the unit level as well, like JOB_ISOLATE for stopping all other units except the unit to be isolated, or JOB_IGNORE_DEPENDENCIES to force a job irrespective of ordering and requirements.
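For illustration, the modes as exposed on the command line (unit and target names hypothetical):

    systemctl --job-mode=fail start foo.service     # error out rather than replace a conflicting queued job
    systemctl --job-mode=ignore-dependencies start foo.service   # JOB_IGNORE_DEPENDENCIES
    systemctl isolate multi-user.target             # JOB_ISOLATE: stop everything the target does not pull in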
A job result is the outcome of a job, which can encompass various things like JOB_DONE, JOB_CANCELED, dependency failures (JOB_DEPENDENCY), timeouts, being skipped, etc. Error codes from the unit method state machine (start/stop/reload/etc.) are propagated downwards to job results, such that job results often mirror unit method errors and have varying meanings like ‘unit not loaded,’ ‘unit doesn’t support starting,’ ‘can’t be started a second time,’ ‘operation already in progress,’ etc.
Jobs can be enqueued explicitly with the service manager, whether through the bus, as part of dependency addition during a transaction, or via the other ways jobs get pulled in by the transactions the Manager builds – this is, for instance, the normal path when a unit is started from a file. In addition, every unit type calls into a unit_notify method with optional type-specific notification flags; this is done for every kind of low-level or type-specific state change (like process exit, kill signal, or timeout expiry), which hence includes those that do not originate from jobs, but that will still lead to jobs being queued. This is how OnFailure= dependencies get propagated for service units with auto-restart, for instance. We can dub these “implicit jobs.”
3.4. Transactions and the Manager
The Manager is the singleton object (every systemd auxiliary daemon like logind and networkd also has such an object) which dispatches jobs and transactions in its run, load and D-Bus queues. It also contains device, mount and swap-specific data. All explicit jobs, including those started from systemctl(1), go through the Manager. The Manager object is also responsible for global system state transitions, such as poweroff, reboot, halt, isolate, and so on.
Manager-triggered jobs are started in so-called ‘transactions’ (the transaction builder actually resided in the same source file until systemd-183). A transaction always starts from an ‘anchor job’ (the one requested by the caller) and from that point recursively adds jobs for dependent units. A transaction is intended to perform certain sanity checks like detecting ordering cycles, preventing conflicting jobs from running, and attempting to resolve conflicts by means of job merging rules, e.g. a JOB_VERIFY_ACTIVE and a JOB_START on a unit will be merged into the latter. By extension, dependencies in systemd are typically not additive constraints, but follow a certain precedence hierarchy.
An important subtlety about systemd transactions is that they are computed independently of current unit run state, and hence start jobs are non-idempotent with ‘wakeups’ of dependencies for already started units happening by design. systemd developer Zbigniew Jędrzejewski-Szmek explains this like so:
A bit of background from the systemd side: when starting a service, systemd walks the full dependency
tree recursively, even for services and targets which are already started. So if e.g. at some
point we have a job like httpd.service/start, we’ll go into all the deps of that, including
usually sysinit.target, and then local-fs.target, and call a start job for any unit in that
tree which isn’t running (or hasn’t run in case of Type=oneshot/RemainAfterExit=yes units).

Doing things like this increases robustness, because new dependencies will often be started
after being configured even without explicit restarting of targets, and if things fail, they
will often be started again. OTOH, it makes pid1 go through the whole unit tree every time
something is started. It also has the downside that units will be started if they are part
of the dep tree even in cases where we don’t expect. This has been discussed before, and
I think it’d be interesting to explore if we can change this behaviour, but it’d be a very
risky change to fundamentals and I’m not even sure if it would make things better. So for the
foreseeable future this will not change.
It was not until January 2019 that systemd(1) was updated, following a PR by Jonathon Kowalski, to state that:
Note that transactions are generated independently of a unit’s state at runtime, hence, for example, if a start job is requested on an already started unit, it will still generate a transaction and wake up any inactive dependencies (and cause propagation of other jobs as per the defined relationships). This is because the enqueued job is at the time of execution compared to the target unit’s state and is marked successful and complete when both satisfy. However, this job also pulls in other dependencies due to the defined relationships and thus leads to, in our example, start jobs for any of those inactive units getting queued as well.
Also in systemctl(1) for --show-transaction:
Note that the output will only include jobs immediately part of the transaction requested. It is possible that service start-up program code run as effect of the enqueued jobs might request further jobs to be pulled in. This means that completion of the listed jobs might ultimately entail more jobs than the listed ones.
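A minimal way to observe these wakeups (unit names hypothetical; assume a.service has Wants=b.service and both are already running):

    systemctl stop b.service        # take the dependency down by hand
    systemctl start a.service       # a is already active, yet a full
                                    # transaction is still built...
    systemctl is-active b.service   # ...and b has been woken again, since
                                    # jobs are computed from the dependency
                                    # graph, not from current run state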
A notable example of spurious wakeups and non-idempotence is the case of JOB_ISOLATE, the job mode behind ‘systemctl isolate’ used to emulate the functionality of runlevels. An isolate job will re-run oneshots without RemainAfterExit=yes, kill user services in scopes, bring down socket-activated services, and take down hardware-specific targets. This has made it something that even core systemd developers are hesitant to recommend. Recall again Lennart’s statements from March 2011: “the set of services running is no longer the the [sic] one that has been started on boot, but the sum of all those which got triggered sometime during the past. And triggers can even work differently if they are used in conjunction.”
In addition, since a) ordering dependencies are evaluated on the job and not unit level (something which has confused even Canonical devs who work on Snappy), and b) systemd “transactions” do not actually coalesce multiple units at the same time, the order in which you call multiple units for a start/stop operation matters. Service starts and restarts can thus be nondeterministic, as described here in a trivial case.
A much more subtle case arose in the rpcbind.service and rpcbind.socket files. A call to ‘systemctl restart rpcbind.service rpcbind.socket’ would sometimes succeed and sometimes fail, causing an upgrade-breaking bug in Debian.
Poettering explains, referencing another similar case with syslog.socket and rsyslog.service:
Note that the command will first enqueue the restart job for the first
mentioned service, then the restart job for the second service. It
will then wait for both jobs to complete. Depending on the deps it
might happen in the second case, that the service is first stopped,
and then the socket stopped, and then the service started again. Now,
when the socket is about to be started again too, the service will
already be up, but in non-socket-activation mode, at which point the
socket unit refuses to start up, in order to not corrupt the socket
the service created on its own without usage of socket activation.
Hence, the interaction between ordering and unit type-specific policy can create some interesting race windows.
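Reduced to the command line, the race looks roughly like this; the outcome is timing-dependent, so treat it as a sketch rather than a reliable reproducer:

    systemctl restart rpcbind.service rpcbind.socket
    # Two restart jobs are enqueued in argument order. If the service's
    # stop/start cycle completes before the socket's start job runs, the
    # daemon comes back up having bound its socket itself, and the socket
    # unit then refuses to start, failing the whole command.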
3.5. Naming inconsistencies and abstraction failure in systemd
All of systemd’s official documentation, the publicly available unit file directives, and systemctl(1) reveal a very ambivalent picture of how systemd chooses to expose its internals. By and large, systemd expects you to think purely in terms of ‘units’ and dependencies between units.
Yet, at the same time, systemctl(1) allows you to select job modes when queueing a job type (without ever concretely explaining job types and results themselves), and systemd.unit(5) gives one a CollectMode= option to tweak unit GC logic, as well as an OnFailureJobMode= option mostly used in upstream-bundled targets with a mode of ‘replace-irreversibly’. Dependency directives are explained fairly vaguely, without specifying exactly what job types are propagated and what failure states are raised. You will never read that Requisite= queues a JOB_VERIFY_ACTIVE job, for instance, despite this making the meaning of the option much clearer.
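Spelled out in a unit file (names hypothetical), the difference from Requires= becomes obvious: Requires= would pull b up, Requisite= merely verifies it:

    # c.service
    [Unit]
    Requisite=b.service   # queues JOB_VERIFY_ACTIVE: fail c's start job
                          # unless b is already active
    After=b.service       # as with Requires=, racy without ordering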
This means that most people have an incorrect ‘folk’ mental model of systemd’s operation, which the developers have not seen fit to remedy with a proper specification, despite a decade having passed and systemd being a well-entrenched standard.
Let’s start off with some of the more trivial examples, before we dive into the details of systemd’s dependency directives.
Socket units do not encapsulate just sockets, but other IPC endpoints such as POSIX message queues and FIFOs, as well as character devices and virtual files. And, most strangely, something as specific as USB GadgetFS descriptors. This suggests a certain lack of extensibility.
Mount units are a bit involved, and their properties differ somewhat depending on how they’re synthesized. systemd automatically generates mount units from /proc/self/mountinfo; however, mount units loaded from .mount unit files actually work by directly executing the /bin/mount binary from util-linux as a control process (MountExecCommand), as shown in the ExecMount D-Bus property and visible from ‘systemctl status’. One has to keep this distinction in mind, as it makes, for instance, drop-ins for mount units nonsensical in the former case but not necessarily in the latter. Swap units have the same separation between those synthesized from /proc/swaps and those configured from swap files, the latter running SwapExecCommand at /sbin/swapon. In addition, there are several “extrinsic mounts” that are excluded from having mount units generated for them, but the user cannot, so far as I know, declare their own mount points extrinsic so as to get the mount unit state machine off their back. Additionally, the logic that creates units from /proc/self/mountinfo has led to a notorious years-long issue of mount storms (also described in this Jane Street article) that can easily DoS a machine. An attempt at fixing them in 2018 was reverted owing to regression test failures. Mount units also don’t yet have separate start and stop timeouts.
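For reference, a file-defined mount unit of the second kind (device and mount point illustrative; the file name must encode the mount point, here srv-data.mount). Starting it forks /bin/mount as a control process, unlike units synthesized from /proc/self/mountinfo:

    # srv-data.mount
    [Unit]
    Description=Example data mount

    [Mount]
    What=/dev/disk/by-label/data
    Where=/srv/data
    Type=ext4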
Device units also have numerous sources, but at core they are derived from udev tags, hence making them a dependency propagator for udev of sorts. Device units do not support ordering dependencies like Before= and After=. Additionally, since devices have no stop or restart jobs associated with them, PartOf= does not work on them because devices do not go through a ‘stopping’ state, instead directly to ‘inactive’. BindsTo= does work, however.
The interaction between mount and device units has been plagued by a long-standing bug in which systemd unmounts manual mounts, owing to its inability to update stale information about the relationships between mounts and devices.
Devices and targets have no failure state (u->can_fail is not set to true), and hence ostensibly cannot be the source of an OnFailure= dependency. This is despite the fact that for years, up until recently with systemd-245, the upstream systemd target files provided such OnFailure= directives on targets like local-fs and initrd, which made them no-ops!
OnFailure= and JoinsNamespaceOf= are internally regarded as dependency types. Following this logic, we must also include at least OnFailureJobMode= and StopWhenUnneeded=. In addition, all of the *Directory= options are reduced to RequiresMountsFor= dependencies, which is itself treated separately with its own dedicated hash table. The Unit= option in path and timer units actually creates a Triggers= dependency on said unit, which in turn has its own JOB_TRIGGERING mode, itself inconsistent with the fact that dependencies tend to be propagated as job types and not modes. Triggers= is of course not directly usable by end users, even though it could in principle be useful for creating generic lazy activation relationships.
The meaning of DefaultDependencies= is overloaded, which relates to ‘dependency’ in systemd as a whole being overloaded to cover any kind of job type or state propagation at all. Though default dependencies can be turned off if need be, every unit type also contains implicit dependencies which cannot.
A unique thing about PartOf= is that it’s the only dependency type to only have an inverse variant available to the user. The forward dependency ConsistsOf= is purely internal, despite its useful potential in creating ‘virtual services,’ i.e. a ‘provides’ relationship. Unit file templating and presets are the other options for achieving this, which don’t actually involve the job and transactional dependency propagation logic directly at all.
systemd’s job result of JOB_DONE does not actually mean ‘successful’; the same result is returned even if Condition*= failures exist. Or, as Lennart clarified: “A unit A with Requires=B doesn’t actually care if B is up or not. All that matters is that the start job for it succeeded. But that can succeed even without B actually being up if some ConditionXYZ= failed for the service. After all conditoins [sic] are considered “non-fatal”. They permit a unit’s job to succeed cleanly even if the condition doesn’t hold, but of course the unit won’t be up afterwards.”
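A minimal sketch of the trap (names and paths hypothetical): b's start job finishes ‘done’ despite b never running, so a starts as though its requirement were met:

    # b.service
    [Unit]
    ConditionPathExists=/does/not/exist   # fails, yet the start job is 'done'
    [Service]
    ExecStart=/usr/bin/b

    # a.service
    [Unit]
    Requires=b.service   # satisfied by b's completed job, not by b being up
    After=b.service
    [Service]
    ExecStart=/usr/bin/a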
systemd has no idiomatic OnDependencyFailure= option to force service restarts when a stop job or failure state is triggered by the Manager rather than from a supervised process failure. This is because there is no distinction between a manual ‘systemctl stop’ and a stop job enqueued in a transaction to satisfy a propagative dependency from some distant requirement. Numerous brittle workarounds exist.
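One commonly seen shape of such a workaround, sketched with hypothetical names: hang an OnFailure= helper off the dependency and restart the dependent out of band. It only covers genuine failures (not stop jobs) and bypasses the transaction machinery entirely, which is exactly what makes it brittle:

    # b.service (the dependency)
    [Unit]
    OnFailure=restart-a.service

    # restart-a.service (the helper)
    [Service]
    Type=oneshot
    ExecStart=/usr/bin/systemctl --no-block try-restart a.service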
Ordering dependencies are supposed to be orthogonal to requirement dependencies, but this is violated for targets, which automatically get After= deps on units they Want, are PartOf or are Requisite to.
Scope units can only be started once, and there is even a special job result of JOB_ONCE solely to report that constraint. Slice units do not have the same restriction for some reason. Both scopes and slices are perpetual units, and cannot be stopped.
Conflicts= is a bidirectional relationship with an implicit ConflictedBy=. They also have special integration in the transaction building logic. It is notoriously brittle and ineffective, trivially allowing the Manager object to start both the conflicting and conflicted service on bootup, making it unreliable for creating exclusive services. Since Conflicts= really just queues a stop job on the conflicted unit, a subsequent transaction can preempt it, as it isn’t any kind of hard constraint. Conflicts= is overwhelmingly used mostly as a default dependency on shutdown.target, so as to calculate a shutdown order inverse of the one on startup. Indeed, Lennart Poettering himself has essentially advised that it isn’t recommended for any other use. One gets the impression that it should never have been exposed to end users at all, but has to remain so due to compatibility reasons.
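Its one blessed use, for the record, is the pair of directives that DefaultDependencies=yes adds to most units, giving the shutdown transaction a stop order inverse to the startup order:

    [Unit]
    Conflicts=shutdown.target   # starting shutdown.target queues our stop job
    Before=shutdown.target      # ...and orders that stop before the target completes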
Reload jobs propagated implicitly by the Manager, such as those in response to device state changes, as well as those explicitly done from PropagatesReloadTo=, appear to be enqueued with the JOB_IGNORE_DEPENDENCIES mode, making them sui generis as well. On numerous occasions Poettering has called JOB_IGNORE_DEPENDENCIES an “awful invention,” “frickin’ ugly” and also a “horrid invention,” so there you go. I believe the reason for this Lovecraftian invention is that reload jobs in systemd are synchronous, as discussed in the linked mailing list posts, which manifested in deadlocks such as this one.
In fact, the origin of PropagatesReloadTo= and ReloadPropagatedFrom=, in all their INTERCAL-like glory, is that BindsTo= doesn’t propagate reload jobs, which was requested as a feature. Except, given the nature of what ‘dependency’ means in systemd, the consistent thing to do would have been to expose Propagates= directives for all job types. This would probably lead to unit files becoming much more confusing to read, but that is the nature of systemd’s architecture.
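In unit file form, the feature that prompted all this is as plain as can be (names hypothetical):

    # a.service
    [Unit]
    PropagatesReloadTo=b.service   # reloading a also enqueues a reload job for b;
                                   # BindsTo= alone would not propagate it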
Similarly, one ought to expect RefuseManualRestart= and RefuseManualReload= options, as there are for start and stop, but they do not exist. Other inconsistencies include the absence of ExecReloadPre=/ExecReloadPost= or ExecRestartPre=/ExecRestartPost=, TimeoutStopSec= not being available for all units, and Type=oneshot units not supporting ExecStopPost= or RestartForceExitStatus= (until recently they did not support Restart= at all).
3.5.1. Dependency hell-based init
Starting a unit from its start method and queueing a start job for a unit are two distinct operations. Whether or not a unit can be started has no bearing on whether or not a start job can be queued for it. A systemd source code comment in unit.c says that “That’s because .device units and suchlike are not startable by us but may appear due to external events, and it thus makes sense to permit enqueing [sic] jobs for it.” Hence, it is routine for the systemd Manager to pull in units ‘ambiently’ that the user cannot start directly. Nor is there a clear separation as to whether dependencies get resolved on the unit or the job level (as part of a transaction), as BindsTo=, for instance, gets rechecked on any low-level unit state change – this, I believe, is why it can be used on device units where PartOf= cannot.
Furthermore, canceling a start job doesn’t necessarily halt the activation of a unit, since the JOB_CANCELED result is not the same as a failure condition. In fact, it’s explicitly used to avoid triggering OnFailure= dependencies when queueing a start job of mode JOB_ISOLATE, like when isolating a target.
The best description of systemd’s dependency primitives available is in Jonathon Kowalski’s writeup. Though messily written and unorganized, it is the closest there is to an informal spec of systemd’s job engine.
Requires= has three distinct effects: it queues a start job, it propagates stop jobs to RequiredBy= units, and it fails dependent units with the JOB_DEPENDENCY result. It is almost always used with After=, the reason being:
Requires= alone without After= has interesting interaction with systemd’s job machinery. When you use Requires= alone, say from a to b, two start jobs will be queued for both as part of the transaction, but the one for a will not wait for the one for b to complete running. Therefore, both go in parallel, and at one point, if a completes before b, it will start up as usual. However, if b fails before a completes, the start job for a is canceled with JOB_DEPENDENCY job result. Hence, in your case, the job is dispatched right away and completes before b fails.
This is also why targets that usually gain deps through filesystem symlinks have an implicit ordering by default, as defining ordering through that mechanism is not possible.
Now, answering why explicitly stopping b stops a? This is how Requires= is actually supposed to work. Infact [sic], this is perhaps the only noticeable difference for end users between Requires= and BindsTo=…
Conventional wisdom would hold that there is no valid use case for Requires= without After=. However, due to the multitude of side effects it exhibits, there is. Requires= with Before= will effectively disable failure on dependency errors by not waiting for the required job to complete, while keeping the behavior of propagating a start job on startup a la Wants=, but also, unlike Wants=, propagating stop jobs to RequiredBy= units.
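A sketch of that inversion (names hypothetical):
a.service:
[Unit]
Description=Requires, but does not wait
Requires=b.service
Before=b.service
[Service]
ExecStart=/bin/sleep infinity
Starting a pulls in a start job for b, and stopping b will stop a; but because a’s start job is dispatched without waiting on b’s, a failure of b can no longer fail a.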
Wants= is quite special. systemd.unit(5) calls it a requirement dependency and frames it as the weaker version of Requires=. Core systemd developers have said that Requires= without After= effectively downgrades to Wants=. This is very misleading. The only thing that Wants= does is queue a start job, nothing else. More interesting is what Wants= doesn’t do: it doesn’t regard “unit not found,” “unit masked” or “job type not applicable” as errors, and will finish the job regardless. As such, Wants= isn’t really much of a “dependency” at all; it’s an unconditional start. This is also why dependents tend to pull themselves into targets via WantedBy=: since targets have no failure state, and Wants= is weak enough that ordering cycles can easily be broken, it’s a way of ensuring that a target acting as a synchronization point can run no matter what. At the same time, the promiscuity of Wants= makes it easy to generate perfectly valid infinite loops.
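A sketch of just how weak this is (names hypothetical): mask the wanted unit, and the wanting unit starts without complaint.
a.service:
[Unit]
Description=Wants a masked unit
Wants=b.service
After=b.service
[Service]
ExecStart=/bin/sleep infinity
Then:
systemctl mask b.service
systemctl start a.service
The start job for b.service finishes without the mask being treated as an error, and a.service comes up regardless; substitute Requires=b.service and the same request fails.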
systemd provides no idiomatic option in between the extremes of Wants= and Requires=.
We dealt with Conflicts= above. Its only real valid use case is to obtain ConflictedBy= relationships relative to shutdown.target; it isn’t useful as an exclusion mechanism beyond that. The reload propagators were also covered satisfactorily.
PartOf= (introduced in systemd-188, ostensibly for grouping targets) extends Requires= by propagating both stop and restart jobs. It doesn’t support units like devices that have no explicit stop and restart operations, which again is distinct from ‘being pulled in by stop and restart jobs,’ since the latter can occur implicitly through low-level unit state changes sent through unit_notify, which differ for each unit type. Nor does it pull in a start job on its own, unlike Wants= and Requires=. It is the only systemd dependency directive available solely as an inverse dependency, with its forward equivalent, ConsistsOf=, left inaccessible.
BindsTo= is basically a ‘catch-all’ option that tracks all explicit and implicit start, stop and restart jobs, hence its use in devices. Unlike PartOf=, it also queues a start job, so it is not a complement to PartOf=, and it has therefore been criticized as “[conflating] two orthogonal concepts (start job propagation from the subject and stop/restart propagation on all those state changes that either make it go to inactive/failed or due to jobs triggered on unit).” A subtlety of BindsTo= is that if used without After=, it gets skipped in a check that occurs on the unit level, not just on the job level as with most other dependencies. More broadly, it tracks the activation state of a unit, and units that go straight from activating to inactive with no intermediary state will be failed, unlike with Requires=.
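This tracking is what makes BindsTo= plus After= the stock idiom for tying a service’s lifetime to a device, as in this sketch (device and daemon hypothetical):
[Unit]
Description=Daemon bound to a hotpluggable serial device
BindsTo=dev-ttyUSB0.device
After=dev-ttyUSB0.device
[Service]
ExecStart=/usr/bin/serial-daemon /dev/ttyUSB0
If the device disappears, the service is stopped along with it; drop the After= and, per the unit-level skip just described, the binding no longer behaves the same way.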
Requisite= has no special dependency handling at all; it just fires JOB_VERIFY_ACTIVE. It should more properly be called AssertStarted= or something of that sort. Used without ordering, it is highly racy. Quoting Kowalski:
Requisite= internally triggers a job of type JOB_VERIFY_ACTIVE for that unit you reference. This job then is responsible to fail your unit’s start job (as an example) if the JOB_VERIFY_ACTIVE job doesn’t yeild [sic] success. However, not specifying ordering will mean either of these can be dispatched in arbitrary order when you request a start job on your unit, and depending on who completes first, it may or may not fail with a JOB_DEPENDENCY job result.
After= ensures the scheduler puts your unit’s job in the JOB_WAITING state until that job completes, so that means it can deterministically fail your job. This however means that then you will wait on every job for that unit, be it a start job/stop job/etc. So, if you use Wants= without After= (just to pull in a unit, not wait on it, and not fail if it dies), and also want to use Requisite=, you lose the property of not waiting on it, just as a consequence of the poor implementation internally.
A consequence of Requisite= is that for a unit that doesn’t reach ‘active’, as in a oneshot without RemainAfterExit=yes, it will always fail, since after all a ‘verify-active’ job will naturally always fail for such a unit. However, due to job merging rules, if you use Requisite= and Wants= together, the JOB_VERIFY_ACTIVE of the former will be merged with the JOB_START of the latter to produce JOB_START and always succeed, as it downgrades to Wants=. See this example.
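A sketch of that merge (names hypothetical):
c.service:
[Unit]
Description=Should only run if d is already active
Requisite=d.service
After=d.service
[Service]
ExecStart=/bin/sleep infinity
With d.service inactive, ‘systemctl start c’ fails with a dependency error, as intended. Add Wants=d.service to the [Unit] section and the verify-active job is merged into a start job: d.service simply gets started, c.service comes up, and the ‘check’ has silently vanished.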
What the Requisite= example vividly shows is that systemd dependencies are not additive, nor are they constraints, invariants or ‘checks’ of any kind that one might intuitively expect. Combining dependencies overrides and supersedes behavior rather than enforcing more constraints, since all of these dependency directives are coarse-grained ad hoc instructions with numerous side effects that range from events, relationships, and ordering to their propagation effects. None of this is easy to reason about.
Additionally, unlike Wants=, Requisite= will fail dependent jobs with the JOB_DEPENDENCY result, which means they will trigger OnFailure= conditions. One therefore cannot use this directive simply to asynchronously queue a verify-active job and ignore dependency failures as a form of soft check. Only Wants=/WantedBy= appear to be unique in their minimalism and permissiveness. Almost every dependency operation in systemd is either too coarse or too thin, with practically no way of telling the Manager the exact behavior you want from its state machine.
3.6. Case studies
The above may sound rather theoretical and like so much nitpicking, so let’s illustrate with a few examples.
3.6.1. Valid transaction with nonexistent unit
Suppose I have a service file foo.service:
[Unit]
Description=Nonexistent dependency
Requires=nonexistent.service
After=nonexistent.service
[Service]
Type=simple
ExecStart=/bin/sleep 999
[Install]
WantedBy=multi-user.target
If I start this manually with ‘systemctl start foo,’ the requirement dependency on nonexistent.service will not be found and the service will fail with a ‘unit not found’ error.
If however I enable it through the unit file installation logic, the lexicographic resolution will find nothing wrong. When I reboot, systemd will pull the unit with the nonexistent dependency into the initial boot transaction, load it and start it just fine. A restart will fail with a ‘unit not found’ error, but a stop will fortunately take it down with SIGTERM.
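The reproduction, sketched as a command sequence:
systemctl start foo.service    # fails: nonexistent.service cannot be found
systemctl enable foo.service   # merely symlinks into multi-user.target.wants, no validation
reboot                         # foo.service is loaded and started despite the missing dependency
systemctl restart foo.service  # fails with ‘unit not found’ again
systemctl stop foo.service     # works, takes it down with SIGTERM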
More fascinating on my Manjaro Linux system was what happened when I changed the WantedBy= directive to graphical.target (the default target) and Type= to oneshot. I would then have four pending start jobs on my system after reaching the display manager: the unit with the nonexistent dependency in an ‘activating’ state, and multi-user.target, graphical.target and tlp.service in a waiting state. Canceling them manually got rid of them without triggering a failure state, as discussed above.
This has been reported several times: 1, 2, 3. It has been reported to affect BindsTo= and RequiresMountsFor=, as well.
As far as I can tell, there are several things going on here. First, recall that Wants= and WantedBy= are effectively unconditional starts that by design have almost no sanity or error checking, as they ignore unit-not-found, job-type-not-applicable and unit-masked errors. Second, targets have no failure state also by design, so that they can always be reached. As such, you get what you asked for, since systemd’s job engine just trusts that you give sensible input for this kind of dependency. Job propagation is not additive nor atomic (‘commit when all constraints are satisfied’), so the target’s Wants= preempts the harder requirement. This can be particularly surprising if one Requires= an unavailable mount unit.
A related and fascinating problem: systemd will inconsistently propagate nested Wants= and Requires= dependencies depending on the order in which the services are started, due to some as yet enigmatic flaw in the garbage collection logic.
3.6.2. Transaction with conflicting job types
Based on issue #11440, say we have one.service:
[Unit]
Description=one
[Service]
ExecStart=/bin/sleep infinity
two.service:
[Unit]
Description=two
Conflicts=one.service
[Service]
ExecStart=/bin/sleep infinity
three.service:
[Unit]
Description=three
After=one.service two.service
PartOf=one.service two.service
[Service]
ExecStart=/bin/sleep infinity
We have a service one, a service two that conflicts with (stops) one, and a service three ordered after one and two, to which stop and restart jobs on one and two are propagated without three pulling either in on its own – i.e. the semantics of PartOf=.
Running ‘systemctl start one two three’ works fine: two and three are up, while one is inactive after being taken down by two’s stop job.
Now run ‘systemctl restart two’. It will fail with “Failed to restart two.service: Transaction contains conflicting jobs ‘restart’ and ‘stop’ for three.service. Probably contradicting requirement dependencies configured.”
Two things here: again, job propagation is not atomic, but also recall that transactions are generated without accounting for a unit’s active state. Or, more specifically, unit state is checked at time of job dispatch and not when enqueued in a transaction to begin with.
With three having PartOf=one two, both one and two now have ConsistsOf=three. Now, we request a restart job on two. This restart job becomes our anchor. We follow the Conflicts=one.service and queue a stop job for one – which, remember, is already inactive, so we are propagating a stop job onto an already stopped unit. Since one ConsistsOf=three, it also gets a stop. We then walk down again to two, which also ConsistsOf=three, and propagate JOB_TRY_RESTART to three.
The stop job gets propagated before the try-restart for the same unit, which fails mergeability rules, and the transaction errors out.
This may seem like a contrived corner case, but in fact this same scenario was the cause of a long-standing bug in the service files for fail2ban and firewalld. It was reported in Fedora in 2016, in Debian in 2017, and in openSUSE in 2019. fail2ban would have PartOf=iptables.service firewalld.service, and firewalld would have Conflicts=iptables.service. The reasoning was valid: fail2ban can work with either firewalld or iptables, but only one can be up at a time; furthermore, fail2ban should be restarted if either firewalld or iptables is restarted. Intuitively, it seemed that this ought to work, but it was based on an incorrect mental model of systemd dependencies as invariants.
3.6.3. Dependency propagation overrides explicit restart policy
Suppose we have a service-specific foo.target which BindsTo=foo.service:
[Unit]
Description=foo target
BindsTo=foo.service
foo.service:
[Unit]
Description=foo service
PartOf=foo.target
[Service]
ExecStart=/bin/sleep 999
ExecStopPost=/bin/false
Restart=always
We then run ‘systemctl start foo.target.’ foo.service is up and running with it.
Now manually kill foo.service’s main process with -KILL, -TERM, -SEGV or what have you. Since this is a kill signal which triggers a low-level unit state change separate from an explicitly Manager-triggered job, one would expect ‘Restart=always’ to fire. But it does not: the service stays down.
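A sketch of the reproduction (systemctl kill signals the process without enqueuing a job, so killing the main PID directly behaves the same):
systemctl start foo.target
systemctl kill --signal=SIGKILL foo.service
systemctl status foo.service   # dead, and Restart=always never fired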
Here we have an unexpected dependency propagation coupled with a race, which in turn interacts with a quirk of the service unit type’s state machine. When foo.service goes down, the BoundBy= target gets a stop job queued and goes down too. But the PartOf= in foo.service picks this up and walks the ConsistsOf= in foo.target to queue a stop job on foo.service itself. That is a Manager-triggered job, and it inhibits the restart timer from firing: the ExecStopPost= directive (which can be anything) holds the service in a deactivating state, so the final transition to inactive/failed is credited to an explicitly systemd-queued job rather than to an external state change.
This is discussed in issue #11456.
Kowalski also clarifies:
Units are only restarted when they change their state on process exit, killing through signals, or reaching a timeout set for it. However, if a user explicitly asks for a unit to be stopped, or BindsTo= triggers it to be stopped (i.e. both being result of operation by the manager), the subject unit is not restarted. This is because Restart= only acts on implicit state changes, and not on explicit jobs causing a state change (and all of propagation dependencies are explicit i.e. enforced by systemd).
From the user’s perspective, an explicit ‘systemctl stop’ and a propagation from BindsTo= would seem different, one explicit and the other implicit, but internally both are explicit. Hence the workarounds needed to compensate for the absence of an OnDependencyFailure= option, or for the inability to selectively extend restart policies to cover Manager-triggered jobs.
3.6.4. Bugs as features: destructive transactions, implicit .wants, and PartOf= intransitivity
Throughout the years, people have either grown accustomed to systemd’s failure conditions, bugs and quirks, or found interesting ways to make features out of them, which can make attempts to “fix” the semantics to be more consistent break many use cases that depended on said inconsistency.
The Red Hat Customer Portal for RHEL 7, for instance, officially recommends deliberately creating a destructive transaction in systemd to act as a “reboot guard,” so that a root user can be prevented from rebooting until some action is complete. Should systemd’s job mergeability rules ever be tweaked, or multi-unit atomic transactions ever become a thing, this could stop working.
Up until systemd-242, device units would get an implicit .wants dependency on their corresponding mount units, the appearance of a device thus always leading to it being immediately mounted. This would cause mount units to be started right after being explicitly stopped by the user, if a device unit state change (like a ‘changed’ event) occurred to pull them in. After the removal of this “feature,” it turned out many had been relying on it as a poor man’s hotplug.
Quite recently as of this writing, from systemd-245 onward, a patch that modified the GC logic for jobs was merged. It was suggested as a response to the issue of PartOf= dependencies being intransitive: with a chain of PartOf= dependencies A->B->C, stopping C while B was inactive did not propagate to A, even though stopping the already-inactive B directly did.
The patch led to a massive wave of regressions that is still ongoing as I write this. The Debian package for systemd currently contains a patch reverting this specific change. Among the strange side effects were the Plymouth bootsplash being restarted in the middle of a graphical session, with Debian systemd maintainer Michael Biebl going ballistic at the systemd developers over this. Additionally, services with failing condition checks would be continuously restarted (also reported here), thus clogging the journal.
What this debacle shows is that seemingly minor enhancements to systemd’s state machine can have disproportionate effects on the entire user ecosystem that core developers cannot foresee until they actually deploy the changes and watch the outcome. This raises serious doubts about the possibility of systemd being reformed or overhauled in any non-trivial fashion.
3.7. Illusions of declarative configuration
It’s worth noting that most people will be unfamiliar with all of these intricacies, for the simple reason that most people only ever use a small subset of systemd’s features – and this goes for the default setups of major Linux distros, too. Informally grepping through /usr/lib/systemd/system/ and probing systemctl(1) on Manjaro, Ubuntu Server 20.04 on QEMU and Fedora 29 on JSLinux, I found that only about a fifth to a quarter of services are socket-activated once systemd’s own services are excluded (and with only a few socket unit options used); that cron jobs and timer units often still coexist; that reload propagators are practically never used; that Conflicts= appears overwhelmingly on shutdown.target, as in upstream units; that BindsTo= appears mostly on device units, but also heavily in libvirtd-related sockets and services; that Requisite= appears exactly once, for systemd-update-utmp-runlevel; that PartOf= is rare outside of nfs-utils; and that RequiresMountsFor= is similarly rare outside of upstream units.
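The probing amounted to little more than the likes of (a sketch of the method, not the exact commands used):
grep -rl 'Conflicts=' /usr/lib/systemd/system/
grep -rl 'Requisite=' /usr/lib/systemd/system/
grep -rl 'PartOf=' /usr/lib/systemd/system/
ls /usr/lib/systemd/system/*.socket
systemctl list-units --type=socket --all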
Evidently, people avoid getting smacked by the job engine by not using much of it.
Still, the commonly cited advantage of systemd unit files being ‘declarative’ is hard to square with a dependency model that does not allow you to think in terms of ‘goals’, ‘constraints’ and ‘invariants’, as one would expect from declarative programming. I’ve noticed that nowadays ‘declarative’ gets stretched to refer to just about any simple configuration language without explicit control flow constructs, which would render the term completely trivial.
systemd’s job semantics make it highly stateful and effectful, with its non-idempotent transactions actually behaving quite like how systemd’s developers criticized Upstart: “When we looked closer at Upstart we eventually realized that its fundamental design was backwards – at least in our opinion. We thought a system manager should calculate the minimal amount of work to do during boot-up, while Upstart was actually (in a way) designed to do the maximum amount of work, and left the developers and administrators in charge to calculate what precisely should be done when.”
Operating as it does on 11 different unit types with a wide variety of unique properties and interactions, systemd habitually intertwines global system state with the state of services and of many other objects that are effectively fictions of systemd’s own making – fictions it must nonetheless maintain in order to enforce ordering dependencies within its model. Its architecture makes the dependency graph an entirely transient and non-reproducible artifact, dependent on the implicit propagation of state changes from the ‘ambient’ system environment.
Most systemd dependency types couple what job types the subject propagates, what job types are propagated to the object, and what job result is returned to the caller, doing so non-atomically and with different outcomes depending on unit type-specific state. Hence one cannot predict or reproduce the state of a system from its explicitly configured manifests. Your dependency graph is not your reference graph.
These kinds of AdHocCoarseGrainedOptions= are hardly limited to unit dependencies, but apply across the board to options for modifying execution state, as well (compare and contrast the chain loading approach). User requests for custom systemd verbs/actions have been frequent and consistently denied: e.g. 1, 2, 3.
4. What comes after?
4.1. Utopia Banished: HAL, DeviceKit and the other vision that wasn’t
In open source development, history has a tendency to repeat itself – perpetually as farce.
Around 2004, an initiative by the semi-facetious name of “Project Utopia” appeared on the scene. Its goal was to radically reshape the state of hotplug and hardware autodetection on Linux at the time, a rather hairy situation in those days, involving /sbin/hotplug, lots of shell script glue, vendor-specific tools like Kudzu on RHEL and Fedora Core, and kludges like a “supermount” kernel module.
As described by one of its two leading developers, Robert Love of Novell (the other being Joe Shaw):
Joey and I decided to create an umbrella project—a meta-project. The plan was to spur development of HAL-aware applications that can provide hardware policy on the desktop. Never should a user need to configure hardware. It should happen automatically in response to the user plugging the hardware in. Never should the user (or even the programmer) have to mess with device nodes and esoteric settings. HAL should provide all of that, on the fly, to the applications. Never should the user have to guess how to use new hardware. If I plug in a camera, my photo application should run. If I insert a DVD, it should start playing. All of this should happen magically, automatically and cleanly.
I coined the name Project Utopia. It was, after all, a bit utopian.
We did not have a central Web site or source repository or cute logo. Project Utopia was a cause and a way of thinking. We had a goal and a set of use cases and a growing disgust toward things not working. We blogged and spoke at conferences and wrote code. One by one, piece by piece, we started to build a set of policy pieces on top of HAL, guided by the following rules:
Make hardware just work.
Use HAL, udev, sysfs and 2.6 Linux kernel as our base.
Tie it all together with D-BUS.
No polling, no hacks—everything should be event-driven and automatic.
Carefully divide infrastructure into system and user level.
System level should be platform-agnostic; user level, GNOME-based.
The key behind this just-works-event-driven-automagic utopia was “halification,” i.e. “the act of converting a program to use HAL, either simply to reduce code size or to add new functionality (ideally, both).”
It was a GNOME-centric vision as well, with GNOME Volume Manager being the centerpiece of it all. The components working together comprised HAL, D-Bus, udev, GNOME Volume Manager and NetworkManager.
But especially HAL – the Hardware Abstraction Layer. From about 2005 to 2010, HAL was a ubiquitous object-oriented RPC mammoth (though one actually having some semblance of a specification) used by many applications to query hardware metadata. HAL was a system daemon maintaining a database of device objects (each holding a unique identifier, properties and interfaces), introspectable over D-Bus, which read from device information files. It supported its own ad hoc service management by means of “addons,” daemons bound to HAL device objects that HAL would launch on demand, and “callouts,” oneshot jobs to add metadata in response to device-added and device-removed events.
Robert Love confidently boasted that:
Today, the Project Utopia mindset continues to foster new applications, interesting hacks and fresh projects aimed at making hardware just work. Linux distributions from Novell, Red Hat and others sport powerful HAL-based infrastructures. The GNOME Project is integrating HAL and D-BUS across the board. The Project Utopia cause is spreading beyond GNOME too, as other platforms implement HAL-based solutions in a similar vein.
Linux development has never stood still, however. Like a rabid cheetah, development sprints forward toward better, faster, simpler solutions. Support for new hardware continues to roll in, and solutions in the spirit of Project Utopia are continually implemented to provide a seamless user experience.
Cute hacks such as having your music player mute when your Bluetooth-enabled cell phone receives a call are not a dream but the reality in which we live. What cute hacks will tomorrow bring? What new hardware will we support next? What application will be halified next? Join in and answer those questions yourself!
In an April 2004 mailing list post, Love spoke of his desires to unify the Linux ecosystem around Utopia:
At some very high level, unless distributions unify their settings, these things are going to remain vendor-dependent. Take networking configuration, for example. I am working on callout code for that now. Different vendors can obviously share HAL, the callout scripts I am writing, and maybe some other glue. But since we have different configuration files, and different configuration utilities, that stuff will remain separate.
But that is not any change from the status quo today – e.g., whether or not Project Utopia is implemented on a given distribution, you still have vendor-specific configuration utilities. If vendors ever unify around a single configuration utility, then that code too will be shared.
The further up the stack you get, the more vendor-specific and policy-specific you get, so naturally the less and less that will be shared. I think our goal is to make the infrastructure as rich, flexible, and kick ass as possible so as little nontrivial stuff as possible is not shared.
For example, take the current system stack in Red Hat: the kernel, MAKEDEV, kudzu (and all of that stuff), and the redhat-config tools. Plus all the RH-specific stuff such as initscripts, networking scripts, configuration files, and other magic.
Project Utopia could unify almost all of that, but not all of it. The big thing to get rid of in the above is kudzu, imo 😉
Much of this, then, lived and died on the basis of HAL.
Yet by May 2008, core HAL developer David Zeuthen posted a retrospective.
In it, he said that HAL was “a huge kitchen sink that hasn’t seen any real rewrites,” full of crufty marshalling code; that “because it does a lot, no single developer has a 100% overview of the code base”; that it was “too abstract/generic” and “inefficient”; and that it had “huge overlap with underlying components” (namely udev).
But despite this damning verdict, he still expressed his belief in the “idea,” and that Project Utopia was right on the “conceptual” level. Then-recent developments like ConsoleKit, PolicyKit and D-Bus system bus activation had only proven the existence of this “real trend.” He also announced the introduction of DeviceKit, a much simplified layer over sysfs to replace HAL in the long run. DeviceKit would later evolve into udisks2 and upower.
This announcement triggered what in Ubuntu was called the “Halsectomy” where many programs that had once used HAL backends were switched over to libudev or reading from sysfs directly. Fedora also led its HalRemoval initiative, summarizing the state of things as “HAL is a behemoth, do-it-all, daemon to access hardware. It is now obsoleted by udisks and upower, as well as libudev for device discovery.”
Kay Sievers in April 2009, relaying the upcoming transition:
If things work out as planned, DeviceKit, the main daemon, will go away. Subsystem daemons will subscribe directly to device evens [sic] with libudev. Udev/the kernel will do the event multiplexing/filtering, there will be no D-Bus involved. It will be part of main udev, not udev-extras.
And so, a hulking HAL daemon and its surrounding D-Bus service stack was ultimately replaced by a much smaller event multiplexer working over a kernel Netlink socket. The introduction of devtmpfs was an important milestone.
The HAL story is an interesting one because it is the story of a highly ambitious and overdesigned userland endeavor pursued because insufficiently expressive kernel mechanisms were the root cause of the problem. Once /sys, devtmpfs and other kernel subsystems were improved and reworked, the entire journey was revealed to be a dead end, and what followed was extensive surgical incision to remove the traces of a sprawling master daemon. Wayland replacing X has similar broad outlines.
If systemd is another HAL, then the driving cause of change among the Linux developer patriciate will not be another ‘platform’ or ‘basic building block’ to replace what exists, but extensive kernel changes that pull the developers into a paradigm shift of offloading as much work as possible to the kernel while keeping a thin event broker as the interface to userspace.
In all likelihood, this will be BPF.
4.2. Closing thoughts
At this point, systemd endures as a platform, sustained by the network effects of said platform, which include, from Josh Triplett’s joyful listing:
systemd user sessions, socket activation, sysusers, dynamic users, systemd-homed, temporary directory setup, transient units, anything talking to the slice or control group APIs, containerization, firstboot, systemd’s whole “preset” system-wide configuration and policy mechanism for admins to say “what services do I want launched when installed and what services do I want to leave stopped until I configure them”, “stateless system” capabilities, and I’m probably forgetting another dozen.
In effect, even if systemd as an init system begins to be seen as lackluster, this wouldn’t make much of a difference given the wide range of the project. And simply having a “better init system” would certainly be of no import. When distributions signed up for systemd, they weren’t adopting a “better init system.” They were adopting a platform. Projects like Flatpak and Snappy have significant integration with systemd these days, and it is a done deal.
Upstart suffered from nasty bugs that, at their core, boiled down to having its ad hoc job and event engine out of sync with the underlying kernel process model, producing nonsensical system states. systemd suffers from the same class of issues.
On one hand, we must ask ourselves: do init systems really matter?
I don’t think they do all that much.
First, a little conundrum: we know that mainstream distributions stuck with brittle initscripts for years, well past their sell-by date, and resorted to ineffective incremental changes when a major redesign was needed. How is it, then, that the same people who persisted in doing the wrong thing for so long suddenly all did the right thing at the exact same time? How can we trust this? I think the only honest answer is that distributions adapt to the incentives posed by the upstream channels they package, hence we can draw no conclusions about technical progress from their decisions.
But let’s take a look at, say, ChromeOS. It still uses Upstart very extensively to this day despite it being long abandoned by its original developers. Has the use of such a buggy and brittle init system impeded the hegemony of Chromebooks in public education and increasingly the laptop market as a whole? Are Chromebooks completely unusable? Evidently not.
Android still boots from a monolithic init.rc file which Rob Landley once aptly described (paraphrasing) as “looks like a shell script, but isn’t.” It also has a reverse-dependency model based on ‘actions,’ ‘events’ and ‘triggers,’ which again resembles Upstart more closely than anything else. And who cares about the init system on their portable tracking device, except the psychotically obsessive?
Aye, init systems don’t matter. But systemd is unique in that it does matter. Which is strange. Why should it? What right does it have to matter?
Perhaps as BPF subsumes Linux into making it a managed-runtime hybrid kernel with subsystems becoming increasingly componentized and instrumented (with early signs in the proposal to add Linux Security Module hooks to BPF programs), as pidfd/process descriptors allow for reliable supervision to be distributed across self-contained processes, as the native Linux mount API becomes more event-driven on its own, and as a generation of Rust fanatics in their undying religious zeal insist that all men are obligated to offer sacrifices to the borrow checker, a new shift may emerge where once again, the init system is made to cease mattering.
One thing I’m certain of is that this shift cannot emerge from dilettantes, outsiders and proverbial basement hackers. One does not unseat a platform without already being part of the patriciate that calls the shots on what gets integrated where across the largest nodes in the ecosystem.
Just as Poettering et al. rose like lions to depose the foxes that lived complacently, so too will they now in place of the foxes be overthrown by a new breed of lions of their own making. Who, from where and doing what are unknowns about which I can only idly speculate.