Namespaces and cgroups: How Containers Actually Work
Docker, Podman, and Kubernetes all build containers on two Linux kernel features: namespaces (which isolate what a process can see) and cgroups (which limit how much it can use). There is no “container” abstraction in the kernel. A container is just a normal process with a specific combination of namespaces and cgroups applied.
Namespaces: isolation
A namespace makes a process see its own private version of a system resource. Linux has eight kinds:
| Namespace | Isolates |
|---|---|
| PID | Process IDs (your container sees PID 1 = its main process) |
| Mount | Filesystem mounts (your container has its own / ) |
| Network | Network interfaces, routes, sockets |
| UTS | Hostname and domain name |
| IPC | System V IPC objects (shared memory, semaphores, message queues) and POSIX message queues |
| User | UIDs and GIDs (root in container ≠ root on host) |
| cgroup | The cgroup hierarchy as seen by the process (its own cgroup appears as the root) |
| Time | CLOCK_BOOTTIME and CLOCK_MONOTONIC offsets |
See your namespaces
# Your shell's namespaces
ls -la /proc/$$/ns
# Compare with another process
sudo ls -la /proc/1/ns # PID 1 (init) namespaces
# If the inode numbers differ, you're in different namespaces.
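The comparison can also be scripted. The sketch below assumes a Linux /proc filesystem; the helper names ns_id and same_ns are made up for this example and are not part of any standard tool.

```shell
# ns_id PID TYPE: print the namespace identifier of a process, e.g. "pid:[4026531836]".
# TYPE is one of the entries under /proc/PID/ns (pid, mnt, net, uts, ipc, user, cgroup, time).
ns_id() {
    readlink "/proc/$1/ns/$2" 2>/dev/null
}

# same_ns PID_A PID_B TYPE: succeed if both processes share the given namespace.
# Returns 2 if either link cannot be read (no such process, or insufficient privileges).
same_ns() (
    a=$(ns_id "$1" "$3") || exit 2
    b=$(ns_id "$2" "$3") || exit 2
    [ "$a" = "$b" ]
)
```

For example, `same_ns $$ 1 uts && echo shared` run as root reports whether the current shell shares its UTS namespace with PID 1.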
Create a namespace by hand
# PID + mount + UTS namespace; runs bash inside
sudo unshare --pid --fork --mount --uts bash
# Inside, run:
hostname container-test # only changes inside namespace
ps -ef # still lists host processes, because /proc is still the host's
mount -t proc proc /proc # mount a fresh /proc for THIS PID namespace
ps -ef # now only bash and ps are listed
exit
You just made (most of) a container.
cgroups: resource limits
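cgroups (control groups) limit and account for resources such as CPU time, memory, block I/O, and the number of processes. Network bandwidth has no cgroup v2 controller of its own and is shaped by other means. Combined with namespaces, you get isolated processes that can’t hog the machine.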
cgroup v2 (modern, default on most distros)
The cgroup tree lives at /sys/fs/cgroup/. Each subdirectory is a control group; the PIDs of its member processes are listed in its cgroup.procs file.
# See all current cgroups
ls /sys/fs/cgroup/
# What cgroup is THIS process in?
cat /proc/self/cgroup
# 0::/user.slice/user-1000.slice/user@1000.service/...
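To go from that line to the matching directory, the path after `0::` can be appended to the cgroup mount point. This is a small sketch that assumes cgroup v2 is mounted at /sys/fs/cgroup; the function names are hypothetical.

```shell
# cgroup_v2_path [FILE]: print the cgroup v2 path from a /proc/PID/cgroup file.
# The v2 entry has hierarchy ID 0 and an empty controller list: "0::<path>".
cgroup_v2_path() {
    sed -n 's/^0:://p' "${1:-/proc/self/cgroup}"
}

# cgroup_dir [FILE]: map that path to a directory under /sys/fs/cgroup.
# Assumption: the unified hierarchy is mounted at the usual location.
cgroup_dir() (
    path=$(cgroup_v2_path "$@")
    [ -n "$path" ] || exit 1
    # Strip a trailing slash so the root cgroup "/" maps to /sys/fs/cgroup itself.
    printf '%s\n' "/sys/fs/cgroup${path%/}"
)
```

Running `ls "$(cgroup_dir)"` then lists the control files of the cgroup the current shell belongs to.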
Create a cgroup and limit memory
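sudo mkdir /sys/fs/cgroup/myapp
# memory.max exists only if the parent enables the memory controller; if it is missing:
#   echo "+memory" | sudo tee /sys/fs/cgroup/cgroup.subtree_control
echo "100M" | sudo tee /sys/fs/cgroup/myapp/memory.max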
# Move a running process into it
echo $$ | sudo tee /sys/fs/cgroup/myapp/cgroup.procs
# Now this shell and anything it starts share a 100 MiB cap; exceeding it triggers the OOM killer inside the cgroup
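To check how close a cgroup is to its limit, memory.current can be read alongside memory.max. The helper below is an illustrative sketch (cgroup_mem_status is a made-up name) and takes the cgroup directory as its argument.

```shell
# cgroup_mem_status DIR: print current usage and limit for a cgroup directory.
# memory.current is in bytes; memory.max is in bytes or the literal word "max".
cgroup_mem_status() (
    dir=$1
    current=$(cat "$dir/memory.current") || exit 1
    limit=$(cat "$dir/memory.max") || exit 1
    printf 'current=%s max=%s\n' "$current" "$limit"
    # "max" means unlimited, so a percentage only makes sense for numeric limits.
    if [ "$limit" != "max" ]; then
        printf 'used=%s%%\n' "$(( $current * 100 / $limit ))"
    fi
)
```

With the cgroup created above, `cgroup_mem_status /sys/fs/cgroup/myapp` would report the limit as 104857600, which is how the kernel stores the "100M" that was written.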
Limit CPU
# Allow 50% of one CPU (50000 microseconds out of every 100000)
echo "50000 100000" | sudo tee /sys/fs/cgroup/myapp/cpu.max
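The two numbers in cpu.max are a quota and a period, both in microseconds, so any percentage can be converted with simple arithmetic. A minimal sketch, with cpu_max_value as a hypothetical helper name and integer percentages assumed:

```shell
# cpu_max_value PERCENT [PERIOD_US]: print the "<quota> <period>" string for cpu.max.
# PERCENT is relative to one CPU, so 200 means two full CPUs.
# PERIOD_US defaults to 100000 microseconds, the value used in the example above.
cpu_max_value() (
    percent=$1
    period=${2:-100000}
    printf '%s %s\n' "$(( $percent * $period / 100 ))" "$period"
)
```

For instance, `cpu_max_value 50 | sudo tee /sys/fs/cgroup/myapp/cpu.max` writes the same value as the command above.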
The systemd way
systemd integrates cgroups deeply. To set limits in a unit file:
[Service]
MemoryMax=512M
CPUQuota=25%
TasksMax=50
Or one-shot via systemd-run:
sudo systemd-run --slice=myapp.slice -p MemoryMax=100M -p CPUQuota=50% sleep 1000
systemd-cgtop # live view of cgroup resource usage
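For a unit that already exists, the same [Service] settings can go into a drop-in file instead of the unit file itself. The sketch below only writes the file; write_limits_dropin and the file name limits.conf are hypothetical, and the base directory is a parameter so the function can be tried outside /etc/systemd/system.

```shell
# write_limits_dropin BASE_DIR UNIT MEMORY_MAX CPU_QUOTA TASKS_MAX
# Writes BASE_DIR/UNIT.d/limits.conf with a [Service] section.
# On a real host BASE_DIR would be /etc/systemd/system (root required).
write_limits_dropin() (
    dropin_dir="$1/$2.d"
    mkdir -p "$dropin_dir" || exit 1
    printf '[Service]\nMemoryMax=%s\nCPUQuota=%s\nTasksMax=%s\n' \
        "$3" "$4" "$5" > "$dropin_dir/limits.conf"
)
```

After something like `write_limits_dropin /etc/systemd/system myapp.service 512M 25% 50` (myapp.service being a placeholder unit), `systemctl daemon-reload` followed by a restart of the unit would be needed for the limits to apply.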
Putting it together: a minimal container
Modern container runtimes (runc, crun) basically do this:
- Create namespaces (PID, mount, network, UTS, IPC, user).
- Set up a root filesystem (chroot/pivot_root to an image).
- Apply cgroups to limit resources.
- Drop capabilities (restrict which privileged operations root may perform).
- Apply seccomp/AppArmor/SELinux profiles for further restriction.
- Exec the container’s command.
Docker is a friendly wrapper around all of this.
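The first two steps can be approximated with unshare and chroot alone. The following is a rough sketch and not how runc is implemented: it skips the user namespace, cgroups, capabilities, and seccomp, and it uses chroot where real runtimes prefer pivot_root. The function name and the DRY_RUN switch are inventions for this example.

```shell
# mini_container ROOTFS CMD [ARGS...]
# Runs CMD in fresh PID, mount, UTS, IPC and network namespaces, chrooted into ROOTFS.
# With DRY_RUN=1 it only prints the command it would run; otherwise it needs root.
mini_container() (
    rootfs=$1
    shift
    # Build the full command line: unshare creates the namespaces, chroot changes the root.
    set -- unshare --pid --fork --mount --uts --ipc --net \
        chroot "$rootfs" "$@"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        exec "$@"
    fi
)
```

Run as root with an unpacked image directory, for example `mini_container /srv/rootfs /bin/sh` (the path is a placeholder). Inside, /proc still has to be mounted by hand, as in the unshare example earlier.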
Inspect a Docker container’s namespaces
docker run -d --name mynginx nginx
PID=$(docker inspect -f '{{.State.Pid}}' mynginx)
sudo ls -la /proc/$PID/ns
# Compare with host
sudo ls -la /proc/1/ns
# Most inode numbers differ (pid, mnt, net, uts, ipc). The user namespace matches the host unless user-namespace remapping is enabled.
Why this knowledge matters
- Debug “container can’t see X” issues — usually a namespace mismatch.
- Understand security boundaries — root inside a container is NOT root on the host (with user namespaces).
- Resource issues — “container is OOM-killed” maps directly to memory.max in its cgroup.
- Build minimal containers — once you know what’s actually happening, you can craft tighter setups.
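For the OOM case in particular, the cgroup keeps a counter in memory.events. A small sketch for reading it, assuming cgroup v2 and using a made-up function name:

```shell
# oom_kill_count DIR: print how many processes in the cgroup were killed by the OOM killer.
# memory.events holds one "<key> <count>" pair per line (low, high, max, oom, oom_kill, ...).
oom_kill_count() {
    awk '$1 == "oom_kill" { print $2 }' "$1/memory.events"
}
```

A non-zero result for a container's cgroup directory suggests the process hit memory.max rather than crashing for some other reason.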
Useful tools
lsns # list all namespaces
nsenter # enter another process's namespaces
unshare # create new namespaces
systemd-cgtop # live cgroup resource usage
systemd-cgls # cgroup tree
Common mistakes
- Treating containers like VMs: they share a kernel with the host, so a kernel exploit can escape the container.
- Running as root inside the container without user namespaces: that root is UID 0 on the host as well, held back only by dropped capabilities, seccomp, and LSM profiles.
- Forgetting cgroup memory limits in production: one runaway container can exhaust host memory, and the kernel's global OOM killer may then kill unrelated processes.
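Whether a container's root is also the host's root can be checked from its UID map. Each line of /proc/PID/uid_map reads "inside-ID outside-ID length". The sketch below prints the outside ID that UID 0 maps to; root_host_uid is a hypothetical name.

```shell
# root_host_uid [UID_MAP_FILE]: print the UID that in-namespace UID 0 maps to
# outside the namespace. Exits non-zero if UID 0 is not mapped at all.
root_host_uid() {
    awk '$1 == 0 { print $2; found = 1; exit }
         END { if (!found) exit 1 }' "${1:-/proc/self/uid_map}"
}
```

Called from the host as `root_host_uid /proc/$PID/uid_map` for a container's PID, a result of 0 means root in the container is UID 0 on the host, while a value such as 100000 indicates a user namespace with remapped IDs.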
What to learn next
Now that the kernel mechanics are clear, Docker — the most common way people actually use these features — is the next and final stop on the roadmap.