How Linux Kernel Cgroups And Namespaces Made Modern Containers Possible

The last two years have seen an explosion of interest in Linux Containers, with many tools emerging, including Docker, LXC, lmctfy, Kubernetes and more.

These tools provide different management interfaces, but in all cases the Linux Containers that they run are powered by two underlying Linux Kernel technologies: cgroups and namespaces. Once namespaces matured around Linux 3.8, these two technologies together made modern Linux Containers possible.

What are cgroups and namespaces?

cgroups, which stands for control groups, are a kernel mechanism for limiting and measuring the total resources used by a group of processes running on a system. For example, you can apply CPU, memory, network or IO quotas. cgroups were originally developed by Paul Menage and Rohit Seth of Google, and their first features were merged into Linux 2.6.24.

Namespaces are a kernel mechanism for limiting the visibility that a group of processes has of the rest of a system. For example, you can limit visibility to certain process trees, network interfaces, user IDs or filesystem mounts. Namespaces were originally developed by Eric Biederman, and the final major namespace was merged into Linux 3.8.

How are modern Containers built from cgroups and namespaces?

Both cgroups and namespaces can apply to any process running on a Linux system, and they are granular: each limit can be applied separately. For example, you can use cgroups to set a CPU limit on a single process in a more sophisticated way than “nice” would achieve.

However, when you apply a full set of cgroups and namespaces, you end up with a group of processes running inside a fully isolated environment within a Linux system. This is what makes a Linux Container. Linux Containers are limited with a full set of namespaces so that they can only see the directory from which they booted, their own processes, their own user IDs and any network interfaces which they have been allowed to access. Similarly, Linux Containers are limited with a full set of cgroups to control their use of CPU, memory, network and IO. With all of these limits applied, the processes running inside the Linux Container cannot see any of the rest of the system, and so behave as if they were a separate server of their own, a more modern equivalent to traditional virtualisation.
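
One way to see both mechanisms on a running system is through /proc. The following is a minimal sketch, assuming a Linux system with /proc mounted, that prints the namespaces and cgroups which the current shell belongs to. The helper names list_namespaces and list_cgroups are made up for this illustration and are not part of any tool.

```shell
#!/bin/sh
# Assumes Linux with /proc mounted. Helper names are illustrative only.

# Print one line per namespace of the given PID: the namespace type,
# then the symlink target (a type name and a bracketed number).
# Processes that share a namespace show the same bracketed number.
list_namespaces() {
    pid=$1
    for ns in /proc/"$pid"/ns/*; do
        printf '%s %s\n' "${ns##*/}" "$(readlink "$ns")"
    done
}

# Print the cgroup membership of the given PID, one hierarchy per line.
list_cgroups() {
    cat /proc/"$1"/cgroup
}

list_namespaces "$$"
list_cgroups "$$"
```

Running the same two helpers from inside a container and from the host should show different namespace numbers and different cgroup paths, which is the isolation described above.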

So, how do I use cgroups and namespaces myself?

In most cases it doesn’t make sense for system administrators to use cgroups and namespaces directly, because a container tool such as Docker, LXC or lmctfy will do this for you. This article is intended to give you an understanding of what’s under the hood, rather than have you working with the kernel technologies directly. However, having said that…

cgroups are controlled via the /sys/fs/cgroup/ filesystem

Here is an example of running tar inside a cgroup with a kernel memory limit:

# mkdir -p /sys/fs/cgroup/test/
# cat /sys/fs/cgroup/cpuset.cpus > /sys/fs/cgroup/test/cpuset.cpus
# cat /sys/fs/cgroup/cpuset.mems > /sys/fs/cgroup/test/cpuset.mems
# echo $((1<<26)) >/sys/fs/cgroup/test/memory.kmem.limit_in_bytes
# echo $$ > /sys/fs/cgroup/test/tasks
# tar xfz linux-3.14.1.tar.gz

The first four lines set up the “test” cgroup: creating it, allowing access to all CPUs and memory nodes, but limiting kernel memory use.

The fifth line puts the bash prompt which you are currently using (and any child processes run from that bash) inside the “test” cgroup.

The sixth line runs tar in the normal fashion, but is within the cgroup and so subject to the cgroup limits.
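
The limit on the fourth line is written as a shell arithmetic expression, which can be checked on its own. The short sketch below works out the value (the variable name limit_bytes is just for illustration) and then shows the cgroup membership of the current shell, assuming /proc is mounted.

```shell
#!/bin/sh
# 1<<26 shifts 1 left by 26 bits, i.e. 2^26 bytes.
limit_bytes=$((1<<26))
echo "$limit_bytes"                        # 67108864
echo "$((limit_bytes / 1024 / 1024)) MiB"  # 64 MiB

# Show which cgroup(s) this shell is in (assumes /proc is mounted).
# After the fifth line of the example, the path should mention "test".
cat /proc/$$/cgroup
```

Note that the file names used in the example (tasks, memory.kmem.limit_in_bytes) belong to the original cgroup interface. Systems that use the newer unified cgroup v2 hierarchy expose different file names, such as cgroup.procs, so the example may need adapting there.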

There are many cgroup limits available, and the full documentation for these is at: https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt

Namespaces are controlled with the “unshare” command

The “unshare” command to manipulate namespaces is available in util-linux-ng 2.17 and later. As an example, if you run:
# unshare --mount /bin/bash
# mount /dev/sda2 /mnt

Then the first line starts a bash prompt inside an isolated mount namespace. This means that the filesystem which you have mounted on the second line is visible from the bash prompt which you are currently using (and any child processes run from that bash), but that the rest of the system cannot see this filesystem mount.
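
To confirm that a shell really is in a different mount namespace, you can compare the namespace identifiers under /proc. This is a small sketch, assuming /proc is mounted. same_ns is a hypothetical helper written for this article, not part of util-linux.

```shell
#!/bin/sh
# same_ns TYPE PID1 PID2: succeed if both PIDs are in the same namespace
# of the given type (mnt, net, pid, ...). Returns 2 if either link
# cannot be read. Hypothetical helper for illustration only.
same_ns() {
    a=$(readlink "/proc/$2/ns/$1") || return 2
    b=$(readlink "/proc/$3/ns/$1") || return 2
    [ "$a" = "$b" ]
}

# A process always shares its own namespaces, so this check succeeds.
if same_ns mnt "$$" "$$"; then
    echo "same mount namespace"
fi
```

Run as root inside the bash started by unshare, a call such as same_ns mnt $$ 1 would be expected to fail, because that bash no longer shares the mount namespace of PID 1.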

Please see the man pages for unshare for a list of other namespaces which you can manipulate.

Putting it back together

If we combine our two simple examples above, we could create a bash prompt with limited kernel memory use and with filesystem mounts that are private from the rest of the system.
There are many more cgroups and namespaces for limiting other resources. As you can start to imagine, when a full set of cgroups and namespaces is applied, you end up with total isolation between the software running inside the limits and the rest of the system. This is a Linux Container.
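
As a rough sketch of that combination, the function below simply chains the commands from the two examples. It assumes it is run as root on a system whose /sys/fs/cgroup layout matches the cgroup example, and the function name limited_private_shell is invented for this illustration. It is only defined here, not called.

```shell
#!/bin/sh
# Sketch only: combines the cgroup and unshare examples from this article.
# Must be run as root, and assumes the /sys/fs/cgroup layout used earlier.
limited_private_shell() {
    cg=/sys/fs/cgroup/test
    # Create the cgroup and allow all CPUs and memory nodes.
    mkdir -p "$cg" &&
    cat /sys/fs/cgroup/cpuset.cpus > "$cg/cpuset.cpus" &&
    cat /sys/fs/cgroup/cpuset.mems > "$cg/cpuset.mems" &&
    # Limit kernel memory to 2^26 bytes, then move this shell into the cgroup.
    echo $((1<<26)) > "$cg/memory.kmem.limit_in_bytes" &&
    echo $$ > "$cg/tasks" &&
    # Replace this shell with a bash in its own mount namespace.
    # The process stays in the "test" cgroup.
    exec unshare --mount /bin/bash
}
```

Calling limited_private_shell would leave you at a bash prompt that is subject to the kernel memory limit and whose mounts are invisible to the rest of the system. If any step fails, the chain stops before the exec.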

It’s important to remember that there is a distinction in Linux Containers: Application Containers like Docker provide flexibility and agility for developers and ISVs, while Operating System (OS) Containers essentially replace the functions of virtual machines (VMs) for Linux users. OS Containers make more dynamic use of computing resources and allow greater insight into the server itself than VMs, and so are particularly suited for creating elastic infrastructure in IT systems.

In practice, you are best off continuing to use container tools such as Docker, LXC or lmctfy. But hopefully this article has given some interesting insight into the underlying Linux Kernel technologies which these all use.

Duncan Macrae

Duncan MacRae is former editor and now a contributor to TechWeekEurope. He previously edited Computer Business Review's print/digital magazines and CBR Online, as well as Arabian Computer News in the UAE.
