Topic: [EN] Systemd vs. Docker
04-03-2016, 07:55 #1
LWN on systemd vs Docker is now free. Plus about a million comments in the thread.
Josh Berkus | Red Hat
February 24, 2016
There were many different presentations at DevConf.cz, the developer conference sponsored by Red Hat in Brno, Czech Republic this year, but containers were the biggest theme of the conference. Most of the presentations were practical, either tutorials showing how to use various container technologies like Kubernetes and Atomic.app, or guided tours of new products like Cockpit.
However, the presentation about containers that was unquestionably the most entertaining was given by Dan Walsh, Red Hat's head of container engineering. He presented on one of the core conflicts in the Linux container world: systemd versus the Docker daemon. This is far from a new issue; it has been brewing since Ubuntu adopted systemd, and CoreOS introduced Rocket, a container system built around systemd.
Systemd vs. Docker
"This is Lennart Poettering," said Walsh, showing a picture. "This is Solomon Hykes," he continued, showing another. "Neither one of them is willing to compromise much. And I get to be in the middle between them."
Since Walsh was tasked with getting systemd to work with Docker, he detailed a history of code, personal, and operational conflicts between the two systems. In many ways, it was also a history of patch conflicts between Red Hat and Docker Inc. Poettering is the primary author of systemd and works for Red Hat, while Hykes is a founder and CTO of Docker, Inc.
According to Walsh's presentation, the root cause of the conflict is that the Docker daemon is designed to take over a lot of the functions that systemd also performs for Linux. These include initialization, service activation, security, and logging. "In a lot of ways Docker wants to be systemd," he claimed. "It dreams of being systemd."
The first conflict he detailed was about service initialization and restart. In the systemd model, all of this is controlled by systemd; in the Docker world, it is all controlled by the Docker daemon. For example, services can be defined in systemd unit files as "docker run" statements to run them as containers, or they can be defined as "autorestart" containers in the Docker daemon. Either approach can work, but mixing them doesn't. The Docker documentation recommends Docker autorestart, except when mixing containerized services with services not in a container; there it recommends systemd or Upstart.
Where this breaks down, however, is when services running as containers depend on other containerized services. For regular services, systemd has a feature called sd_notify that passes messages about when services are ready, so that services that depend on them can then be started. However, Docker has a client-server architecture. Commands such as docker run are invoked in the client for each user session, but the containers are started and managed in the Docker daemon (the "server" in this relationship). The client can't send sd_notify status messages because it doesn't actually manage the container service and doesn't know when the services are up, and the daemon can't send them because it wasn't called by the systemd unit file. This resulted in Walsh's team attempting an elaborate workaround to enable sd_notify:
- systemd requests sd_notify from the Docker client
- That client sends an sd_notify message to the Docker daemon
- The daemon sets up a container to do sd_notify
- The daemon gets an sd_notify from the container
- The daemon sends an sd_notify message to the client
- The client sends an sd_notify message to tell systemd that the Docker container is ready
Walsh was unsurprised when the patches to enable this byzantine system were not accepted by the Docker project. sd_notify does work for the Docker daemon itself, so systemd services can depend on the daemon running. But there is still no way to do sd_notify for individual containerized services, so the Docker project still has no reliable way to manage containerized service dependency startup order.
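For reference, the readiness message at the center of this dispute is itself very small. The sketch below is not from Walsh's talk; it assumes the documented sd_notify protocol, in which systemd exports a NOTIFY_SOCKET environment variable to a service of Type=notify and waits for a datagram such as READY=1 on that Unix socket. The function name and the environ parameter are illustrative choices.

```python
import os
import socket


def sd_notify(state, environ=os.environ):
    """Send a state string such as "READY=1" to the socket in NOTIFY_SOCKET.

    Returns True if a message was sent, False if no notify socket is set.
    """
    path = environ.get("NOTIFY_SOCKET")
    if not path:
        return False  # not started by systemd as a notify-type service
    if path.startswith("@"):
        path = "\0" + path[1:]  # Linux abstract-namespace socket
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode("utf-8"), path)
    return True
```

Only a process that inherits NOTIFY_SOCKET from systemd can call something like sd_notify("READY=1"). With Docker, that process is the client, while the knowledge of whether the service is up sits in the daemon and the container, which is the gap the six-step relay tried to bridge.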
Systemd has a feature called "socket activation", where services start automatically upon receiving a request to a particular network socket. This lets servers support "occasionally needed" services without running them all the time. There used to be support for socket activation of the Docker daemon itself, but the feature was disabled because it interfered with Docker autorestart.
Walsh's team was more interested in socket activation of individual containers. This would have the benefit of eliminating the overhead of "always on" containers. However, the developers realized that they'd have to do something similar to the sd_notify workaround, only they'd be passing around a socket instead of just a message. They didn't even try to implement it.
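To make the "passing around a socket" problem concrete, here is a minimal sketch of the receiving side of socket activation. It assumes the convention documented in sd_listen_fds(3): systemd sets LISTEN_PID and LISTEN_FDS and hands over already-listening sockets as file descriptors starting at 3. The function names are illustrative, not part of any Docker or systemd API.

```python
import os
import socket

SD_LISTEN_FDS_START = 3  # first inherited descriptor, per sd_listen_fds(3)


def listen_fds(environ=os.environ, pid=None):
    """Return the file descriptor numbers passed in by systemd, or []."""
    pid = os.getpid() if pid is None else pid
    try:
        if int(environ.get("LISTEN_PID", "")) != pid:
            return []  # the descriptors were meant for another process
        count = int(environ.get("LISTEN_FDS", ""))
    except ValueError:
        return []  # variables missing or malformed: not socket-activated
    return list(range(SD_LISTEN_FDS_START, SD_LISTEN_FDS_START + count))


def activated_sockets():
    """Wrap each inherited descriptor in a socket object ready for accept()."""
    return [socket.socket(fileno=fd) for fd in listen_fds()]
```

The LISTEN_PID check shows why the client-server split hurts here: the descriptors are addressed to the process systemd started, which would be the Docker client, and they would have to be forwarded through the daemon to reach the containerized service.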
Linux control groups, or cgroups, let you define system resource allocations per service, such as CPU, memory, and I/O limits. Systemd allows defining cgroup limits in the initialization files, so that you can define resource profiles for services when they start. With Docker, though, this runs afoul of the client-server model again. The systemd cgroup settings affect only the client; they do not affect the daemon process, where the container is actually running. Instead, each container inherits the cgroup settings of the Docker daemon. Users can set cgroup limits by passing flags to the docker run statement instead, which works but does not integrate with the overall administrative policies for the system.
The only success story Walsh had to relate was regarding logging. Docker logs also didn't work with systemd's journald. Logging of container output was local to each container, which would cause all logs to be automatically erased whenever a container was deleted. This was a major failing in the eyes of security auditors. Docker 1.9 now supports the --log-driver=journald switch, which logs to journald instead. However, using journald is not the default for Docker containers, so the switch needs to be passed each time.
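Both per-container settings discussed above, resource limits and the journald log driver, end up on the docker run command line rather than in a unit file. The following is a small sketch of assembling such a command. The helper name, the image name example/httpd, and the limit values are hypothetical; --memory, --cpu-shares, and --log-driver are existing docker run options.

```python
def docker_run_argv(image, memory=None, cpu_shares=None, log_driver=None):
    """Build an argv list for a detached "docker run" with optional settings."""
    argv = ["docker", "run", "-d"]
    if memory is not None:
        argv += ["--memory", memory]  # memory limit, e.g. "512m"
    if cpu_shares is not None:
        argv += ["--cpu-shares", str(cpu_shares)]  # relative CPU weight
    if log_driver is not None:
        argv.append("--log-driver=" + log_driver)  # e.g. "journald"
    argv.append(image)  # options must come before the image name
    return argv


if __name__ == "__main__":
    # Hypothetical image name and limits, printed rather than executed.
    print(" ".join(docker_run_argv("example/httpd", "512m", 512, "journald")))
```

Because every caller has to remember these flags on each invocation, the limits and the log destination live outside whatever policy the administrator has configured through systemd.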
04-03-2016, 07:56 #2
Systemd inside containers
Walsh also wanted to get systemd working in Fedora, Red Hat Enterprise Linux (RHEL), and CentOS container base images, partly because many packages require the systemctl utility in order to install correctly. His first effort was something called "fakesystemd" that replaced systemctl with a service that satisfied the systemctl requirement for packages and did nothing else. This turned out to cause problems for users and he soon abandoned it, but not soon enough to prevent it from being released in RHEL 7.0.
In RHEL 7.1, the team added something called "systemd-container", which was a substantially reduced version of systemd. This still caused problems for users who needed full systemd for their software, and Poettering pressured the container team to change it. As of RHEL 7.2, containers have real systemd, installed with fewer dependencies so that it can be a little smaller. Walsh's team is working on reducing these dependencies further.
The biggest problem with not having systemd in the container, according to Walsh, is that it goes "back to the days before init scripts." Each image author creates his or her own crazy startup script for the application inside the container, instead of using the startup scripts crafted by the packagers. He showed how easily service initialization is done inside a container that has systemd available, by showing the three-line Dockerfile that is all that is required to create a container running the Apache httpd server:
RUN yum -y install httpd; yum clean all; systemctl enable httpd;
CMD [ "/sbin/init" ]
There is a major roadblock to making systemd inside Docker work, though: running a container with systemd inside requires running it with the --privileged flag, which makes it insecure. This is because the Docker daemon requires the "service" application run by the container to always be PID 1. In a container running systemd, systemd is PID 1 and the application has some other PID, which causes Docker to think the container has failed and shut it down.
Poettering says that PID 1 has special requirements. One of these is reaping "zombie" processes: children that have exited, and orphans that are re-parented to PID 1, remain in the process table until PID 1 collects their exit status. This is a real problem for Docker since the application runs as PID 1 and does not handle the zombie processes. For example, containers running the Oracle database can end up with thousands of zombie processes. Another requirement is writing to syslog, which goes to /dev/null unless you've configured the container to log to journald.
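A minimal sketch of the reaping duty described above, assuming a POSIX system. An init process would typically run a loop like this whenever it receives SIGCHLD; the function name is illustrative and this is not how systemd itself is written.

```python
import os


def reap_zombies():
    """Collect every child that has already exited, without blocking.

    Returns a list of (pid, raw_wait_status) tuples.
    """
    reaped = []
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none have exited yet
        reaped.append((pid, status))
    return reaped
```

An application that was never written to be PID 1 usually waits only for the children it started itself, so orphans re-parented to it are never collected and accumulate as zombies.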
Walsh tried several approaches to make systemd work in non-privileged containers, submitting four different pull requests (7685, 10994, 13525, and 13526) to the Docker project. Each of these pull requests (PRs) was rejected by the Docker maintainers. Arguments around these changes peaked when Jessie Frazelle, a Docker committer, came to DockerCon.EU 2015 with the phrase "I say no to systemd specific PRs" printed on her badge.
The future of systemd and containers
The Red Hat container team has also been heavily involved in developing the runC tool of the Open Container Project. That project is the practical output of the Open Container Initiative (OCI), the non-profit council established through the Linux Foundation in 2015 in order to set industry standards for container APIs. The OCI also maintains libcontainer, the library that Docker uses to launch containers. According to Walsh, Docker will eventually need to adopt runC as part of its stack in order to be able to operate on other platforms, particularly Windows.
Using work from runC, Red Hat staff have created a patch set called "oci-hooks" that adds a lot of the systemd-supporting functionality to Docker. It makes use of a "hook" that can activate any executables found in a specific directory between the time the container starts up and when the application is running. Among the things executed by this method is the RegisterMachine hook, which notifies systemd's machinectl on the host that the container is running. This lets users see all Docker containers, as well as runC containers, using the machinectl command:
# machinectl
MACHINE                          CLASS     SERVICE
9a65036e4a6dc769d0e40fa80871f95a container docker
fd493b71a79c2b7913be54a1c9c77f1c container runc

2 machines listed.
Walsh also pointed out that cgroups, sd_notify, and socket activation all work out-of-the-box with runC. This is because runC does not use Docker's client-server model; it is just an executable. He does not see the breach between Docker Inc. and Red Hat over systemd healing over in the future. Walsh predicted that Red Hat would probably be moving more toward runC and away from the Docker daemon. According to him, Docker is working on "containerd", its new alternative to systemd, which will take over the functions of the init system.
Given the rapid changes in the Linux container ecosystem in the short time since the Docker project was launched, though, it is almost impossible to predict what the relationship between systemd, Docker, and runC will look like a year from now. Undoubtedly there will be plenty more changes and conflicts to report.