Namespaces and cgroups: How Containers Actually Work

Docker, Podman, and Kubernetes all run containers built on two Linux kernel features: namespaces (which isolate what a process can see) and cgroups (which limit how much it can use). The kernel has no “container” abstraction. A container is just a normal process with a specific combination of namespaces and cgroups applied.

Namespaces: isolation

A namespace makes a process see its own private version of a system resource. Linux has eight kinds:

Namespace   Isolates
PID         Process IDs (the container's main process sees itself as PID 1)
Mount       Filesystem mount points (the container has its own /)
Network     Network interfaces, routes, sockets
UTS         Hostname and domain name
IPC         System V shared memory, semaphores, and message queues
User        UIDs and GIDs (root in the container ≠ root on the host)
cgroup      The cgroup hierarchy (the container sees its own cgroup as the root)
Time        CLOCK_BOOTTIME and CLOCK_MONOTONIC offsets

See your namespaces

# Your shell's namespaces
ls -la /proc/$$/ns

# Compare with another process
sudo ls -la /proc/1/ns        # PID 1 (init) namespaces

# If the inode numbers differ, you're in different namespaces.
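
To compare namespaces in a script rather than by eye, the inode number can be pulled out of the link target. This is a minimal sketch that assumes the usual type:[inode] link format; ns_inode is a hypothetical helper name, not a standard tool.

```shell
# ns_inode LINK_TARGET
# Print the inode number from a namespace link target such as "uts:[4026531838]".
# Prints nothing if the argument does not look like a namespace link.
ns_inode() {
    printf '%s\n' "$1" | sed -n 's/^[a-z_]*:\[\([0-9]*\)]$/\1/p'
}
```

For example, ns_inode "$(readlink /proc/$$/ns/uts)" prints the inode of this shell's UTS namespace, which can then be compared with the value for another PID.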

Create a namespace by hand

# PID + mount + UTS namespace; runs bash inside
sudo unshare --pid --fork --mount --uts bash

# Inside, run:
hostname container-test           # only changes inside namespace
ps -ef                            # still lists host processes: /proc is still the host's
mount -t proc proc /proc          # mount a fresh /proc for this PID namespace
ps -ef                            # now only this namespace's processes (bash and ps)
exit

You just made (most of) a container.
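
One way to confirm the effect of the new /proc mount without reading ps output is to count the per-process directories it exposes. The sketch below assumes every process appears as a directory whose name starts with a digit; count_pids is a hypothetical helper name.

```shell
# count_pids [PROC_DIR]
# Count the entries in a proc mount whose names start with a digit
# (one directory per visible process). Defaults to /proc.
count_pids() {
    n=0
    for entry in "${1:-/proc}"/[0-9]*; do
        if [ -d "$entry" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}
```

Running count_pids on the host and again inside the namespace (after mounting /proc) should show the count drop from the full process list to a handful.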

cgroups: resource limits

cgroups (control groups) limit and account for resource usage: CPU time, memory, block I/O, and the number of processes. Combined with namespaces, you get isolated processes that can’t hog the machine.

cgroup v2 (modern, default on most distros)

The cgroup tree lives at /sys/fs/cgroup/. Each subdirectory is a control group; the PIDs of its member processes are listed in its cgroup.procs file.

# See all current cgroups
ls /sys/fs/cgroup/

# What cgroup is THIS process in?
cat /proc/self/cgroup
# 0::/user.slice/user-1000.slice/user@1000.service/...
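
The path after 0:: is relative to /sys/fs/cgroup, so it can be turned into the directory that holds the process's control files. This is a small sketch that assumes a cgroup v2 (unified) entry is present; cgroup_v2_path is a hypothetical helper name.

```shell
# cgroup_v2_path [FILE]
# Print the cgroup v2 path from a /proc/<pid>/cgroup file,
# i.e. the part after "0::". Defaults to /proc/self/cgroup.
cgroup_v2_path() {
    sed -n 's/^0:://p' "${1:-/proc/self/cgroup}"
}
```

For example, ls "/sys/fs/cgroup$(cgroup_v2_path)" lists the control files of the cgroup the current shell belongs to.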

Create a cgroup and limit memory

sudo mkdir /sys/fs/cgroup/myapp
# memory.max exists only if the parent lists "memory" in cgroup.subtree_control
echo "100M" | sudo tee /sys/fs/cgroup/myapp/memory.max

# Move the current shell into it
echo $$ | sudo tee /sys/fs/cgroup/myapp/cgroup.procs

# Now this shell, and anything it starts, is capped at 100 MiB
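
To see how close a group is to its cap, memory.current can be compared with memory.max. The sketch below assumes both files hold a plain byte count, except that memory.max holds the literal string max when no limit is set; mem_usage_pct is a hypothetical helper name.

```shell
# mem_usage_pct CGROUP_DIR
# Print memory.current as a whole-number percentage of memory.max,
# or "unlimited" when memory.max is the literal string "max".
mem_usage_pct() {
    limit=$(cat "$1/memory.max") || return 1
    current=$(cat "$1/memory.current") || return 1
    if [ "$limit" = "max" ]; then
        echo unlimited
    else
        echo $(( current * 100 / limit ))
    fi
}
```

For the group created above, mem_usage_pct /sys/fs/cgroup/myapp reports how much of the 100 MiB is in use.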

Limit CPU

# Allow 50% of one CPU (50000 microseconds out of every 100000)
echo "50000 100000" | sudo tee /sys/fs/cgroup/myapp/cpu.max
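
The two numbers in cpu.max are a quota and a period, so the effective limit is quota divided by period. This sketch converts a cpu.max value to a percentage of one CPU; cpu_max_percent is a hypothetical helper name.

```shell
# cpu_max_percent "QUOTA PERIOD"
# Convert a cpu.max value to a percentage of one CPU.
# A quota of "max" means no limit.
cpu_max_percent() {
    quota=${1%% *}     # text before the first space
    period=${1##* }    # text after the last space
    if [ "$quota" = "max" ]; then
        echo unlimited
    else
        echo $(( quota * 100 / period ))
    fi
}
```

For example, cpu_max_percent "$(cat /sys/fs/cgroup/myapp/cpu.max)" reads the limit back from the group created earlier. A result above 100 means the group may use more than one CPU.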

The systemd way

systemd integrates deeply with cgroups. To set limits in a unit file:

[Service]
MemoryMax=512M
CPUQuota=25%
TasksMax=50

Or one-shot via systemd-run:

sudo systemd-run --slice=myapp.slice -p MemoryMax=100M -p CPUQuota=50% sleep 1000

systemd-cgtop                      # live view of cgroup resource usage
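
systemd parses the K, M, G, and T suffixes on MemoryMax with base 1024, and the limit ends up as a byte count in the unit's memory.max file. The sketch below does the same conversion, which helps when matching a unit setting against the value in /sys/fs/cgroup; size_to_bytes is a hypothetical helper name.

```shell
# size_to_bytes SIZE
# Convert a size such as 512M to bytes using base-1024 suffixes K, M, G, T.
# A value without a suffix is printed unchanged.
size_to_bytes() {
    case $1 in
        *K) echo $(( ${1%K} * 1024 )) ;;
        *M) echo $(( ${1%M} * 1024 * 1024 )) ;;
        *G) echo $(( ${1%G} * 1024 * 1024 * 1024 )) ;;
        *T) echo $(( ${1%T} * 1024 * 1024 * 1024 * 1024 )) ;;
        *)  echo "$1" ;;
    esac
}
```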

Putting it together: a minimal container

Modern container runtimes (runc, crun) basically do this:

  1. Create namespaces (PID, mount, network, UTS, IPC, and optionally user and cgroup).
  2. Set up a root filesystem (chroot/pivot_root to an image).
  3. Apply cgroups to limit resources.
  4. Drop capabilities (limit which privileged operations the process may perform).
  5. Apply seccomp/AppArmor/SELinux profiles for further restriction.
  6. Exec the container’s command.
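
Steps 1, 2, and 6 can be approximated with unshare and chroot. The sketch below is a rough illustration only, not how runc or crun are implemented: it skips cgroups, capabilities, and seccomp, and uses chroot where real runtimes normally use pivot_root. The function name run_in_namespaces, the DRY_RUN variable, and the /srv/rootfs path are all hypothetical.

```shell
# run_in_namespaces ROOTFS CMD [ARGS...]
# Start CMD in new PID, mount, UTS, IPC, and network namespaces,
# chrooted into ROOTFS (an unpacked image directory).
# Set DRY_RUN=1 to print the command instead of running it.
run_in_namespaces() {
    rootfs=$1
    shift
    # Rebuild the argument list as the full command line to execute.
    set -- unshare --pid --fork --mount --uts --ipc --net chroot "$rootfs" "$@"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        sudo "$@"
    fi
}
```

For example, with DRY_RUN=1 set, run_in_namespaces /srv/rootfs /bin/sh prints the unshare command it would run. /proc still has to be mounted inside the new root before tools like ps work there.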

Docker is a friendly wrapper around all of this.

Inspect a Docker container’s namespaces

docker run -d --name mynginx nginx
PID=$(docker inspect -f '{{.State.Pid}}' mynginx)
sudo ls -la /proc/$PID/ns

# Compare with host
sudo ls -la /proc/1/ns

# Most inode numbers differ: those are separate namespaces
# (the user and time namespaces are typically shared with the host)
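
Comparing two long listings by eye is error-prone, so the comparison can be scripted. The sketch below prints only the namespace types that differ between two processes; ns_diff is a hypothetical helper name, and the optional third argument exists only so the function can be pointed at a directory other than /proc.

```shell
# ns_diff PID_A PID_B [PROC_DIR]
# Print each namespace type whose link target differs between two processes.
# Types that cannot be read for either process are skipped.
# Reading another user's /proc/<pid>/ns usually requires root.
ns_diff() {
    proc=${3:-/proc}
    for t in cgroup ipc mnt net pid time user uts; do
        a=$(readlink "$proc/$1/ns/$t" 2>/dev/null) || continue
        b=$(readlink "$proc/$2/ns/$t" 2>/dev/null) || continue
        if [ "$a" != "$b" ]; then
            echo "$t"
        fi
    done
}
```

Run from a root shell, ns_diff "$PID" 1 lists the namespaces in which the container differs from the host's PID 1.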

Why this knowledge matters

  • Debug “container can’t see X” issues — usually a namespace mismatch.
  • Understand security boundaries — root inside a container is NOT root on the host (with user namespaces).
  • Resource issues — “container is OOM-killed” maps directly to memory.max in its cgroup.
  • Build minimal containers — once you know what’s actually happening, you can craft tighter setups.
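
For the OOM case in particular, a cgroup v2 group records kills in its memory.events file. This is a minimal sketch that reads the oom_kill counter; oom_kill_count is a hypothetical helper name, and the cgroup directory of a given container depends on the runtime and its cgroup driver.

```shell
# oom_kill_count CGROUP_DIR
# Print the oom_kill counter from the cgroup's memory.events file:
# the number of processes in the group killed by the OOM killer.
oom_kill_count() {
    awk '$1 == "oom_kill" { print $2 }' "$1/memory.events"
}
```

A non-zero result means the group hit its memory.max and the kernel killed at least one of its processes.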

Useful tools

lsns                   # list all namespaces
nsenter                # enter another process's namespaces
unshare                # create new namespaces
systemd-cgtop          # live cgroup resource usage
systemd-cgls           # cgroup tree

Common mistakes

  • Treating containers like VMs — they share a kernel with the host. A kernel exploit escapes the container.
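  • Running as root inside a container without user namespaces: that root is the same UID 0 as root on the host, so it has more host access than you might expect.
  • Forgetting cgroup memory limits in production: one runaway container can exhaust host memory, and the kernel's OOM killer may then kill unrelated processes.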

What to learn next

Now that the kernel mechanics are clear, Docker — the most common way people actually use these features — is the next and final stop on the roadmap.
