
runsc/cgroup: discover and populate spec files for subcontainer compat dirs on systemd v2 #13070

Draft

a7i wants to merge 2 commits into google:master from a7i:fix/cadvisor-systemd-v2-subcontainer-cgroups

Conversation


@a7i a7i commented May 4, 2026

What

A two-commit fix that makes runsc pods on cgroup v2 + systemd report container_* cAdvisor series equivalent in shape and spec values to what runc-managed pods produce on the same node:

  • Commit 1 — runsc/cgroup: create host-side compat dir for subcontainers on systemd v2: mkdir an empty per-subcontainer cgroup directory under the pod slice so cAdvisor (and other inotify-based discoverers under /sys/fs/cgroup) report metrics for non-pause containers.
  • Commit 2 — runsc/cgroup: populate spec files on subcontainer compat dirs: thread spec.Linux.Resources through the dispatcher so the limit/spec interface files cAdvisor reads as container_spec_* (memory.max, cpu.max, cpu.weight, memory.swap.max, memory.low, pids.max) are populated from the OCI resources on a best-effort basis.

Why

cAdvisor discovers per-container cgroups by inotify-watching /sys/fs/cgroup, and reads spec values for the container_spec_* series from the leaf cgroup files. Tools that consume those metrics (kubelet's /metrics/cadvisor, kubectl top, container_cpu_usage_seconds_total, container_memory_working_set_bytes, container_network_*, VPA recommendations sourced from cAdvisor, etc.) all depend on a host-side cgroup directory existing for each user container and the spec files being populated.

#6500 / #6657 added empty subcontainer cgroup directories so cAdvisor would discover containers running inside a runsc sandbox. That fix only covers cgroup v1 (and non-systemd v2). On a cgroup v2 host with the systemd cgroup driver — the default for kubelet on most current distros — per-container cAdvisor metrics regress to "pause-only" for every gVisor pod, and even where the directory does exist (cgroup v1 today), the spec files end up empty because Install({}) is passed empty resources.

Reproduction and broader analysis are in the parent issue (#13067).

Commit 1: discoverability — root cause and fix

setupCgroupForSubcontainer calls cgroupInstall(...).Install({}) intending to mkdir an empty subcontainer cgroup directory for cAdvisor compat. On systemd v2, that lands in cgroupSystemd.Install (in runsc/cgroup/systemd.go), which only stages dbus properties; the cgroup directory is otherwise created by Join() via StartTransientUnitContext. Join() is wrong here: the compat cgroup is intentionally process-less, so registering a transient unit for it would conflict with systemd's lifecycle expectations, and the directory would be reaped the moment the dbus connection drops.

So no host-side directory is ever created for non-pause containers in a runsc pod on systemd v2. cAdvisor's inotify watcher under /sys/fs/cgroup therefore never discovers them, and per-container series are missing from /metrics/cadvisor. The pause container's scope is visible because containerd creates it itself before invoking the shim, independent of runsc.

Fix:

  1. Add cgroupSystemd.installCompatDir which os.MkdirAll's the resolved scope path under the parent slice and tracks it in c.Own so the inherited cgroupV2.Uninstall reaps it at container destroy. Idempotent (won't double-track on retries).
  2. Expose a single dispatcher cgroup.InstallSubcontainerCompatDir that routes systemd v2 cgroups to installCompatDir and falls back to the existing Install path for v1 / non-systemd v2 (which already mkdir the directory inside Install).
  3. Wire setupCgroupForSubcontainer to use the dispatcher.
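
A minimal sketch of the shape steps 1 and 2 describe. The stand-in types exist only to keep the snippet self-contained; the real patch targets runsc/cgroup's existing Cgroup interface, *cgroupSystemd type, and Own bookkeeping, whose exact fields may differ:

```go
package cgroup

import (
	"os"

	specs "github.com/opencontainers/runtime-spec/specs-go"
)

// Cgroup mirrors just the two methods of runsc's cgroup.Cgroup that the
// dispatcher needs (stand-in for this sketch).
type Cgroup interface {
	Install(res *specs.LinuxResources) error
	MakePath(controllerName string) string
}

// cgroupSystemd stands in for runsc's systemd-driver cgroup type.
type cgroupSystemd struct {
	Cgroup
	Own map[string]bool // paths we created and must reap at Uninstall
}

// installCompatDir mkdirs the resolved scope path under the parent slice
// directly, without registering a transient unit: the compat cgroup is
// process-less, so StartTransientUnitContext is the wrong tool for it.
func (c *cgroupSystemd) installCompatDir() error {
	path := c.MakePath("")
	if c.Own[path] {
		return nil // idempotent: don't double-track on retries
	}
	if err := os.MkdirAll(path, 0o755); err != nil {
		return err
	}
	c.Own[path] = true // the inherited Uninstall reaps it at container destroy
	return nil
}

// InstallSubcontainerCompatDir routes systemd v2 cgroups to the compat-dir
// path; v1 and non-systemd v2 keep the existing Install path, which already
// mkdirs the directory. Commit 2 later threads *specs.LinuxResources
// through this signature.
func InstallSubcontainerCompatDir(cg Cgroup) error {
	if sd, ok := cg.(*cgroupSystemd); ok {
		return sd.installCompatDir()
	}
	return cg.Install(&specs.LinuxResources{})
}
```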

The shim's setPodCgroup is left alone; it doesn't need to change for this fix. Non-root containers reach setupCgroupForSubcontainer regardless of dev.gvisor.spec.cgroup-parent, and the pause scope is already created by containerd before the shim runs.

Commit 2: populate spec files

After commit 1, the new compat directories exist but their interface files are empty (or read kernel defaults of max), so cAdvisor's container_spec_* series for runsc pods read 0 / missing for the limit-bearing fields. The same shape applies on cgroup v1 today — Install({}) is passed empty resources, so v1 compat dirs also have no spec values.

Thread spec.Linux.Resources through InstallSubcontainerCompatDir:

  • v1 / non-systemd v2: Install(res) already iterates per-controller set() methods that write the limit files; just hand it real resources instead of empty.
  • systemd v2: extend installCompatDir to call controllers2["cpu"|"memory"|"pids"].set(res, path) after mkdir (sketched below). This reuses the existing runc-compatible conversions (convertCPUSharesToCgroupV2Value, convertMemorySwapToCgroupV2Value, cpu.max formatting). Limited to {cpu, memory, pids} — the controllers whose interface files cAdvisor reads as spec values; cpuset / io / hugetlb are intentionally excluded (they don't surface as container_spec_* and would widen the failure surface on hosts where they aren't enabled in the parent slice's cgroup.subtree_control).
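
As a sketch of that per-controller dispatch (controllers2 and the log package are runsc internals, named in the text above; the exact set signature may differ):

```go
// Commit-2 extension to installCompatDir, as a sketch: populate the leaf
// interface files from the OCI resources after the mkdir above.
var compatControllers = []string{"cpu", "memory", "pids"}

func (c *cgroupSystemd) populateCompatSpecFiles(res *specs.LinuxResources, path string) {
	if res == nil {
		return
	}
	for _, name := range compatControllers {
		// cpuset/io/hugetlb are deliberately absent from the list: they have
		// no container_spec_* series and may not be enabled in the parent
		// slice's cgroup.subtree_control.
		if err := controllers2[name].set(res, path); err != nil {
			// Best-effort: log and keep going. The compat path must never
			// block container start (the #6657 invariant).
			log.Warningf("populating compat spec files for %q in %q: %v", name, path, err)
		}
	}
}
```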

Mapping (cgroup file written ↔ cAdvisor series populated):

| cgroup file (v2 / v1) | cAdvisor series | OCI source |
| --- | --- | --- |
| memory.max / memory.limit_in_bytes | container_spec_memory_limit_bytes | Memory.Limit |
| memory.swap.max / memory.memsw.limit_in_bytes | container_spec_memory_swap_limit_bytes | Memory.Swap (computed runc-style on v2) |
| memory.low / memory.soft_limit_in_bytes | container_spec_memory_reservation_limit_bytes | Memory.Reservation |
| cpu.max (quota) / cpu.cfs_quota_us | container_spec_cpu_quota | CPU.Quota |
| cpu.max (period) / cpu.cfs_period_us | container_spec_cpu_period | CPU.Period |
| cpu.weight / cpu.shares | container_spec_cpu_shares | CPU.Shares (back-converted on v2) |
| pids.max | (no cAdvisor series today; written for runc parity) | Pids.Limit |
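
The v2 column relies on the runc-compatible conversions named above. A sketch with the formulas as in runc's libcontainer/cgroups, which runsc mirrors; the swap helper is simplified and elides some unset/unlimited edge cases runc handles:

```go
package cgroup

import "errors"

// convertCPUSharesToCgroupV2Value maps the v1 cpu.shares range [2, 262144]
// onto the v2 cpu.weight range [1, 10000]. The k8s default of 2 shares for
// a container with no CPU request lands on weight 1.
func convertCPUSharesToCgroupV2Value(shares uint64) uint64 {
	if shares == 0 {
		return 0 // unset: write nothing
	}
	return 1 + ((shares-2)*9999)/262142
}

// convertMemorySwapToCgroupV2Value converts OCI Memory.Swap (v1 semantics:
// memory+swap combined) to v2's swap-only memory.swap.max.
func convertMemorySwapToCgroupV2Value(memorySwap, memory int64) (int64, error) {
	switch {
	case memorySwap == 0 || memorySwap == -1:
		return memorySwap, nil // unset or unlimited passes through
	case memory <= 0:
		return 0, errors.New("swap limit requires a memory limit")
	case memorySwap < memory:
		return 0, errors.New("memory+swap limit must be >= memory limit")
	}
	return memorySwap - memory, nil
}
```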

The compat cgroup is process-less, so any limits written here have no kernel-side accounting effect; they exist solely so cAdvisor's container_spec_* series report real values for runsc pods, matching what runc produces on the same node.

Best-effort writes on systemd v2. If a controller is not enabled in the parent slice's cgroup.subtree_control, setValue returns ENOENT / EROFS / EACCES, which we swallow and log. The compat path must never block container start (#6657 invariant).
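
A hypothetical helper capturing that swallow-and-log policy (it assumes the errors and golang.org/x/sys/unix packages; the real patch may classify errors differently):

```go
import (
	"errors"

	"golang.org/x/sys/unix"
)

// ignorableCompatWriteErr reports whether a spec-file write failure should
// be swallowed (and logged) rather than surfaced to the caller.
func ignorableCompatWriteErr(err error) bool {
	return errors.Is(err, unix.ENOENT) || // controller not in subtree_control
		errors.Is(err, unix.EROFS) || // read-only cgroupfs mount
		errors.Is(err, unix.EACCES) // insufficient privileges
}
```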

Backwards compatibility

Commit 1 only diverges from the existing path when the underlying cgroup is *cgroupSystemd; v1 and non-systemd v2 still flow through the existing Install path. Compat directories created by either commit are tracked in c.Own so existing Uninstall removes them at container destroy. No new lifecycle.

Commit 2 changes InstallSubcontainerCompatDir's signature from (cg Cgroup) error to (cg Cgroup, res *specs.LinuxResources) error. The dispatcher has a single internal caller (setupCgroupForSubcontainer); no external callers.

Tests

Commit 1:

  • TestInstallCompatDir: directory is created at MakePath(""), tracked in c.Own, second call is idempotent (no double-track), Uninstall removes it.
  • TestInstallSubcontainerCompatDirSystemd: public dispatcher routes systemd v2 cgroups to the compat-dir path.

Commit 2:

  • TestInstallCompatDirSpecFiles: pre-touch leaf interface files (simulating kernel auto-creation when controllers are enabled in the parent's subtree_control on a real cgroupfs mount), pass a full LinuxResources, assert each interface file contains the expected serialized value (incl. the runc-style swap-only computation and the cpu.shares → cpu.weight conversion).
  • TestInstallCompatDirBestEffort: deliberately do not seed leaf files, assert installCompatDir(res) swallows the resulting ENOENTs and still returns success with the directory created and tracked.
  • TestInstallSubcontainerCompatDirSystemd (extended): public dispatcher propagates non-nil resources end-to-end through to the per-controller set() methods.
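
For the cpu.shares to cpu.weight assertion specifically, a table-driven illustration against the conversion sketched earlier, with expected weights computed from the runc formula (not the actual test from this PR):

```go
package cgroup

import "testing"

func TestConvertCPUSharesToCgroupV2Value(t *testing.T) {
	for _, tc := range []struct{ shares, weight uint64 }{
		{0, 0},          // unset passes through
		{2, 1},          // k8s default when a container has no CPU request
		{1024, 39},      // runc's default shares value
		{262144, 10000}, // top of the v1 range maps to top of the v2 range
	} {
		if got := convertCPUSharesToCgroupV2Value(tc.shares); got != tc.weight {
			t.Errorf("shares=%d: got weight %d, want %d", tc.shares, got, tc.weight)
		}
	}
}
```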

Built and tested locally on aarch64 (lima). Unit tests pass; manual end-to-end verification on a Kubernetes cluster running cgroup v2 + systemd is in the comment below.

Refs: #6500, #6657, #13067

Out of scope: runtime accounting series (follow-up)

This PR restores cAdvisor discoverability (commit 1) and populates the cgroup limit files cAdvisor reads as container_spec_* (commit 2). It does not populate kernel-accounted runtime series — container_cpu_*_seconds_total, container_memory_* (instantaneous gauges), container_pressure_*, container_processes / container_threads / container_sockets / container_file_descriptors, network counters, etc.

Those stay zero because the compat scopes this PR creates are intentionally process-less: the user workload runs inside the gVisor sandbox (a single Linux process), so the host kernel has nothing to attribute to the per-container scopes. The data isn't lost — gVisor publishes per-container values via runsc events --stats, which is what containerd's CRI-stats plugin already consumes for /stats/summary (powering kubectl top, metrics-server, HPA). The remaining gap is plumbing those values into cAdvisor's output.

A reasonable follow-up shape: a coordinated cAdvisor + gVisor change that registers a cAdvisor ContainerHandlerFactory for runsc-managed cgroups, bypassing libcontainer's /sys/fs/cgroup reads and consulting runsc events --stats (or a stable equivalent), emitting the same metric names with the same label set as the kernel-cgroup-backed path so consumers need no code changes. Happy to drive both PRs as a follow-up to this one if maintainers think that's the right direction. The same shape would also help Kata and other runtimes whose user processes don't live in host cgroups.

This PR remains a strict prerequisite for that follow-up: without per-container scope dirs and populated spec files on the host, cAdvisor has no ContainerHandler to attach a runsc-aware stats source to.

runsc/cgroup: create host-side compat dir for subcontainers on systemd v2

cAdvisor discovers per-container cgroups by inotify-watching /sys/fs/cgroup.
Tools that consume its metrics (kubelet's /metrics/cadvisor, kubectl top,
container_cpu_usage_seconds_total, container_memory_working_set_bytes,
container_network_*, VPA recommendations sourced from cAdvisor, etc.) all
depend on a host-side cgroup directory existing for each user container.
Issue google#6500 / PR google#6657 added empty subcontainer cgroup directories so
cAdvisor would discover containers running under runsc; that fix only
covers cgroup v1 (and non-systemd v2). On a cgroup v2 host with the
systemd cgroup driver -- the default for kubelet on most current
distros -- per-container cAdvisor metrics regress to "pause-only" for
every gVisor pod, even though kubelet's CRI-backed /stats/summary reports
CPU/memory for those containers fine.

Root cause: setupCgroupForSubcontainer reaches cgroupInstall(...).Install({})
intending to mkdir an empty subcontainer cgroup directory for cAdvisor
compat. On systemd v2, that lands in cgroupSystemd.Install (in
runsc/cgroup/systemd.go), which only stages dbus properties; the cgroup
directory is otherwise created by Join() via StartTransientUnitContext,
which is inappropriate for a process-less compat cgroup (and would
conflict with systemd's lifecycle expectations).

Result: no host-side directory is ever created for non-pause containers
in a runsc pod on systemd v2. cAdvisor's inotify watcher under
/sys/fs/cgroup therefore never discovers them, and per-container series
are missing from /metrics/cadvisor. The pause container's scope is
visible because containerd creates that scope itself before invoking the
shim, independent of runsc.

Fix:

  1. Add cgroupSystemd.installCompatDir which os.MkdirAll's the resolved
     scope path under the parent slice and tracks it in c.Own so the
     inherited cgroupV2.Uninstall reaps it at container destroy.
     Idempotent (won't double-track on retries).

  2. Expose a single dispatcher cgroup.InstallSubcontainerCompatDir that
     routes systemd v2 cgroups to installCompatDir and falls back to the
     existing Install({}) path for v1 / non-systemd v2 (which already
     mkdir the directory inside Install).

  3. Wire setupCgroupForSubcontainer to use the dispatcher.

Backwards compatibility: only diverges from the existing path when the
underlying cgroup is *cgroupSystemd; v1 and non-systemd v2 still flow
through the existing Install({}) path. Compat directories created here
are tracked in c.Own so existing Uninstall removes them at container
destroy. No new lifecycle.

Tests: add TestInstallCompatDir asserting the directory is created at
MakePath(""), tracked in c.Own, that a second call is idempotent (no
double-track), and that Uninstall removes it; add
TestInstallSubcontainerCompatDirSystemd asserting the public dispatcher
routes to the compat-dir path.

Refs: google#6500, google#6657, google#13067
@a7i a7i force-pushed the fix/cadvisor-systemd-v2-subcontainer-cgroups branch from 219299a to 6772750 on May 4, 2026 19:05
@a7i a7i changed the title from "shim,runsc: surface gVisor subcontainers to cAdvisor on cgroup v2 + systemd" to "runsc/cgroup: create host-side compat dir for subcontainers on systemd v2" on May 4, 2026

a7i commented May 4, 2026

Verified end-to-end on a Kubernetes node running cgroup v2 + systemd with the patched runsc (build release-20260427.0-13-g41855b3f08, both commits in this PR applied).

A pod under RuntimeClass: gvisor now produces:

  1. Per-container scope dirs under the pod slice, discoverable by cAdvisor's inotify watcher (commit 1).
  2. Populated container_spec_* series reflecting the real OCI limits, written into the new compat dirs (commit 2).
$ kubectl get --raw "/api/v1/nodes/${NODE}/proxy/metrics/cadvisor" \
    | grep 'namespace="amir"' | grep 'container="perf"'

container_spec_cpu_period{...,container="perf",...}                              100000
container_spec_cpu_quota{...,container="perf",...}                               100000
container_spec_cpu_shares{...,container="perf",...}                              2
container_spec_memory_limit_bytes{...,container="perf",...}                      1.073741824e+09
container_spec_memory_reservation_limit_bytes{...,container="perf",...}          0
container_spec_memory_swap_limit_bytes{...,container="perf",...}                 0
container_start_time_seconds{...,container="perf",...}                           1.778029974e+09
container_ulimits_soft{...,container="perf",ulimit="max_open_files",...}         1.048576e+06

# kernel-accounted runtime series — all zero by design (see "Out of scope" below):
container_cpu_user_seconds_total                                                 0
container_cpu_system_seconds_total                                               0
container_cpu_cfs_throttled_periods_total / _seconds_total                       0
container_memory_usage_bytes / _working_set_bytes / _rss / _cache / _swap        0
container_memory_max_usage_bytes / _kernel_usage / _mapped_file                  0
container_memory_failures_total / _failcnt                                       0
container_memory_total_active_file_bytes / _total_inactive_file_bytes            0
container_processes / container_threads / container_sockets                      0
container_file_descriptors / container_threads_max                               0
container_oom_events_total                                                       0
container_pressure_cpu_*  / _memory_*  / _io_*                                   0
container_tasks_state{state="running"|"sleeping"|"stopped"|...}                  0

The id label resolves to the per-container scope inside the pod slice:

/kubepods.slice/
  kubepods-burstable.slice/
    kubepods-burstable-pod<uid>.slice/
      cri-containerd-<container_id>.scope    <-- created by commit 1
        memory.max         = 1073741824        <-- written by commit 2
        memory.swap.max    = 0
        memory.low         = 0
        cpu.max            = 100000 100000
        cpu.weight         = 1                 <-- 2 shares back-converts to weight 1
        pids.max           = max

— which is the host-side directory cAdvisor's inotify watcher needs.

| series | before this PR | after this PR | source |
| --- | --- | --- | --- |
| container_spec_memory_limit_bytes | 0 | 1073741824 (real limit) | memory.max written by commit 2 |
| container_spec_cpu_quota | absent | 100000 (real limit) | cpu.max quota half written by commit 2 |
| container_spec_memory_swap_limit_bytes | 0 | 0 (no swap on this host) | memory.swap.max written by commit 2 |
| container_spec_memory_reservation_limit_bytes | 0 | 0 (no Memory.Reservation in spec) | memory.low written by commit 2 |
| container_spec_cpu_period | 100000 | 100000 | unchanged (also written by commit 2) |
| container_spec_cpu_shares | OCI value | OCI value | unchanged (also written by commit 2; this pod has no CPU request → k8s default 2) |

kubectl top pod -n amir gvisor and CRI-backed /stats/summary were already correct on this cluster (containerd's CRI-stats plumbing doesn't go through /sys/fs/cgroup); they remain correct, and /metrics/cadvisor now reports real spec limits in addition to matching the runc label shape.

Out of scope: runtime accounting series

The kernel-accounted runtime series (container_cpu_*_seconds_total, instantaneous container_memory_*, container_pressure_*, process/thread/socket/fd gauges, network and disk counters, etc.) remain 0. The compat scope is process-less: the user workload runs inside the gVisor sandbox process, which lives in a single host cgroup, so the host kernel has nothing to attribute to the per-container scopes.

The data exists — gVisor already produces per-container CPU/memory/network values via runsc events --stats, which is what containerd's CRI-stats plugin already consumes for /stats/summary (so kubectl top, metrics-server, and HPA all work correctly today). Plumbing those values into cAdvisor's /metrics/cadvisor output is a separate, coordinated cAdvisor + gVisor change.

A reasonable follow-up shape:

  1. gVisor side: stabilize runsc events --stats <container-id> as a public contract (it already serves CRI-stats), or add a new runsc stats mode formatted exactly to cAdvisor's v2.ContainerStats field set (cpu_usage_user_us, memory_working_set_bytes, memory_rss, memory_cache, processes, threads, tasks_state.*, pressure.*, oom_events_total, …). Option 1 is smaller and reuses an interface containerd already depends on; option 2 is cleaner but creates a second stats interface.
  2. cAdvisor side: register a ContainerHandlerFactory for runsc-managed cgroups that bypasses libcontainer's manager.GetStats() path against /sys/fs/cgroup and consults the gVisor interface instead. Detection signal could be "scope is empty AND there's a matching runsc container" or, more reliably, the runtime label containerd already attaches.
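
To make option 1 concrete, a minimal sketch of the consumer side: shell out to runsc events --stats and decode the event envelope. It assumes runsc is on PATH with a --root matching the shim's state directory, and it abbreviates the payload to json.RawMessage rather than modeling the full schema:

```go
package statsbridge

import (
	"encoding/json"
	"os/exec"
)

// statsEvent is a trimmed view of the JSON envelope runsc emits; the full
// data schema is not modeled here.
type statsEvent struct {
	Type string          `json:"type"` // "stats"
	ID   string          `json:"id"`
	Data json.RawMessage `json:"data"` // per-container CPU/memory/... values
}

// containerStats asks runsc for one stats sample for the given container.
func containerStats(root, id string) (*statsEvent, error) {
	out, err := exec.Command("runsc", "--root", root, "events", "--stats", id).Output()
	if err != nil {
		return nil, err
	}
	var ev statsEvent
	if err := json.Unmarshal(out, &ev); err != nil {
		return nil, err
	}
	return &ev, nil
}
```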

The result: same metric names, same labels, real numbers. Consumers need no code changes. The same shape would also help Kata and other runtimes whose user processes don't live in host cgroups.

Question for maintainers

Does that direction make sense for the runtime-counter follow-up? Any preference on which gVisor-side interface should be the cAdvisor contract? Happy to drive both PRs as follow-ups to this one once it lands and there's alignment on shape.

@a7i a7i marked this pull request as ready for review May 4, 2026 19:22

a7i commented May 4, 2026

@ayushr2 would appreciate your input on this PR, thanks!

@a7i a7i marked this pull request as draft May 6, 2026 00:10
runsc/cgroup: populate spec files on subcontainer compat dirs

After the parent commit creates the host-side compat directory for
subcontainers on systemd v2, the directory is empty: cAdvisor's
container_spec_* series read 0 (or absent) for those containers because
the leaf cgroup interface files have no values written to them. Same
shape on cgroup v1 today via Install({}) -- no spec values are
threaded through.

Thread spec.Linux.Resources through InstallSubcontainerCompatDir so the
existing per-controller set() methods (v1, non-systemd v2) and a new
best-effort path (systemd v2) populate the cgroup interface files
cAdvisor reads as container_spec_*:

  cgroup file (v2 / v1)             -> cAdvisor series
  --------------------------------- -> -----------------------------------
  memory.max / memory.limit_in_bytes
                                       container_spec_memory_limit_bytes
  memory.swap.max / memory.memsw.limit_in_bytes
                                       container_spec_memory_swap_limit_bytes
  memory.low / memory.soft_limit_in_bytes
                                       container_spec_memory_reservation_limit_bytes
  cpu.max (quota) / cpu.cfs_quota_us
                                       container_spec_cpu_quota
  cpu.max (period) / cpu.cfs_period_us
                                       container_spec_cpu_period
  cpu.weight / cpu.shares              container_spec_cpu_shares
  pids.max                             (no cAdvisor series today; written
                                        for parity with runc)

The compat cgroup is process-less, so any limits written here have no
kernel-side accounting effect; they exist solely so cAdvisor's
container_spec_* series report real values for runsc pods, matching
what runc produces on the same node. Runtime counter series
(container_*_total, container_memory_*, container_pressure_*, ...)
remain zero by design and need a separate follow-up that delegates
GetStats to a runsc-aware ContainerHandler (google#13067).

On systemd v2 the writes are best-effort: if a controller is not
enabled in the parent slice's cgroup.subtree_control, setValue returns
ENOENT/EROFS/EACCES, which we swallow and log. The compat path must
never block container start (google#6657 invariant). Limited to
{cpu, memory, pids} -- the controllers whose interface files cAdvisor
reads as spec values; cpuset/io/hugetlb are intentionally excluded.

Tests:
  - TestInstallCompatDirSpecFiles asserts each interface file contains
    the expected serialized value (incl. the runc-style swap-only
    computation and the cpu_shares -> cpu.weight conversion).
  - TestInstallCompatDirBestEffort asserts errors are swallowed when
    leaf files don't exist.
  - TestInstallSubcontainerCompatDirSystemd is extended to assert
    end-to-end propagation of resources through the public dispatcher.

Refs: google#6500, google#6657, google#13067
@a7i a7i changed the title from "runsc/cgroup: create host-side compat dir for subcontainers on systemd v2" to "runsc/cgroup: discover and populate spec files for subcontainer compat dirs on systemd v2" on May 6, 2026
