
runsc/cgroup: discover and populate spec files for subcontainer compat dirs on systemd v2 #13070

Draft

a7i wants to merge 2 commits into google:master from a7i:fix/cadvisor-systemd-v2-subcontainer-cgroups

Conversation


@a7i a7i commented May 4, 2026

What

A two-commit fix that makes runsc pods on cgroup v2 + systemd report container_* cAdvisor series equivalent in shape and spec values to what runc-managed pods produce on the same node:

  • Commit 1 — runsc/cgroup: create host-side compat dir for subcontainers on systemd v2: mkdir an empty per-subcontainer cgroup directory under the pod slice so cAdvisor (and other inotify-based discoverers under /sys/fs/cgroup) report metrics for non-pause containers.
  • Commit 2 — runsc/cgroup: populate spec files on subcontainer compat dirs: thread spec.Linux.Resources through the dispatcher so the limit/spec interface files cAdvisor reads as container_spec_* (memory.max, cpu.max, cpu.weight, memory.swap.max, memory.low, pids.max) are populated from the OCI resources on a best-effort basis.

Why

cAdvisor discovers per-container cgroups by inotify-watching /sys/fs/cgroup, and reads spec values for the container_spec_* series from the leaf cgroup files. Tools that consume those metrics (kubelet's /metrics/cadvisor, kubectl top, container_cpu_usage_seconds_total, container_memory_working_set_bytes, container_network_*, VPA recommendations sourced from cAdvisor, etc.) all depend on a host-side cgroup directory existing for each user container and the spec files being populated.

#6500 / #6657 added empty subcontainer cgroup directories so cAdvisor would discover containers running inside a runsc sandbox. That fix only covers cgroup v1 (and non-systemd v2). On a cgroup v2 host with the systemd cgroup driver — the default for kubelet on most current distros — per-container cAdvisor metrics regress to "pause-only" for every gVisor pod, and even where the directory does exist (cgroup v1 today), the spec files end up empty because Install({}) is passed empty resources.

Reproduction and broader analysis are in the parent issue (#13067).

Commit 1: discoverability — root cause and fix

setupCgroupForSubcontainer calls cgroupInstall(...).Install({}) intending to mkdir an empty subcontainer cgroup directory for cAdvisor compat. On systemd v2, that lands in cgroupSystemd.Install (in runsc/cgroup/systemd.go), which only stages dbus properties; the cgroup directory is otherwise created by Join() via StartTransientUnitContext. Join() is wrong here: the compat cgroup is intentionally process-less, so registering a transient unit for it would conflict with systemd's lifecycle expectations, and the directory would be reaped the moment the dbus connection drops.

So no host-side directory is ever created for non-pause containers in a runsc pod on systemd v2. cAdvisor's inotify watcher under /sys/fs/cgroup therefore never discovers them, and per-container series are missing from /metrics/cadvisor. The pause container's scope is visible because containerd creates it itself before invoking the shim, independent of runsc.

Fix:

  1. Add cgroupSystemd.installCompatDir which os.MkdirAll's the resolved scope path under the parent slice and tracks it in c.Own so the inherited cgroupV2.Uninstall reaps it at container destroy. Idempotent (won't double-track on retries).
  2. Expose a single dispatcher cgroup.InstallSubcontainerCompatDir that routes systemd v2 cgroups to installCompatDir and falls back to the existing Install path for v1 / non-systemd v2 (which already mkdir the directory inside Install).
  3. Wire setupCgroupForSubcontainer to use the dispatcher.
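
A minimal sketch of the shape steps 1 and 2 describe. The stand-in types exist only to keep the snippet self-contained; the real patch targets runsc/cgroup's existing Cgroup interface, *cgroupSystemd type, and Own bookkeeping, whose exact fields may differ:

```go
package cgroup

import (
	"os"

	specs "github.com/opencontainers/runtime-spec/specs-go"
)

// Cgroup mirrors just the two methods of runsc's cgroup.Cgroup that the
// dispatcher needs (stand-in for this sketch).
type Cgroup interface {
	Install(res *specs.LinuxResources) error
	MakePath(controllerName string) string
}

// cgroupSystemd stands in for runsc's systemd-driver cgroup type.
type cgroupSystemd struct {
	Cgroup
	Own map[string]bool // paths we created and must reap at Uninstall
}

// installCompatDir mkdirs the resolved scope path under the parent slice
// directly, without registering a transient unit: the compat cgroup is
// process-less, so StartTransientUnitContext is the wrong tool for it.
func (c *cgroupSystemd) installCompatDir() error {
	path := c.MakePath("")
	if c.Own[path] {
		return nil // idempotent: don't double-track on retries
	}
	if err := os.MkdirAll(path, 0o755); err != nil {
		return err
	}
	c.Own[path] = true // the inherited Uninstall reaps it at container destroy
	return nil
}

// InstallSubcontainerCompatDir routes systemd v2 cgroups to the compat-dir
// path; v1 and non-systemd v2 keep the existing Install path, which already
// mkdirs the directory. Commit 2 later threads *specs.LinuxResources
// through this signature.
func InstallSubcontainerCompatDir(cg Cgroup) error {
	if sd, ok := cg.(*cgroupSystemd); ok {
		return sd.installCompatDir()
	}
	return cg.Install(&specs.LinuxResources{})
}
```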

The shim's setPodCgroup is left alone; it doesn't need to change for this fix. Non-root containers reach setupCgroupForSubcontainer regardless of dev.gvisor.spec.cgroup-parent, and the pause scope is already created by containerd before the shim runs.

Commit 2: populate spec files

After commit 1, the new compat directories exist but their interface files are empty (or read kernel defaults of max), so cAdvisor's container_spec_* series for runsc pods read 0 / missing for the limit-bearing fields. The same shape applies on cgroup v1 today — Install({}) is passed empty resources, so v1 compat dirs also have no spec values.

Thread spec.Linux.Resources through InstallSubcontainerCompatDir:

  • v1 / non-systemd v2: Install(res) already iterates per-controller set() methods that write the limit files; just hand it real resources instead of empty.
  • systemd v2: extend installCompatDir to call controllers2["cpu"|"memory"|"pids"].set(res, path) after mkdir (sketched below). This reuses the existing runc-compatible conversions (convertCPUSharesToCgroupV2Value, convertMemorySwapToCgroupV2Value, cpu.max formatting). Limited to {cpu, memory, pids} — the controllers whose interface files cAdvisor reads as spec values; cpuset / io / hugetlb are intentionally excluded (they don't surface as container_spec_* and would widen the failure surface on hosts where they aren't enabled in the parent slice's cgroup.subtree_control).
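
As a sketch of that per-controller dispatch (controllers2 and the log package are runsc internals, named in the text above; the exact set signature may differ):

```go
// Commit-2 extension to installCompatDir, as a sketch: populate the leaf
// interface files from the OCI resources after the mkdir above.
var compatControllers = []string{"cpu", "memory", "pids"}

func (c *cgroupSystemd) populateCompatSpecFiles(res *specs.LinuxResources, path string) {
	if res == nil {
		return
	}
	for _, name := range compatControllers {
		// cpuset/io/hugetlb are deliberately absent from the list: they have
		// no container_spec_* series and may not be enabled in the parent
		// slice's cgroup.subtree_control.
		if err := controllers2[name].set(res, path); err != nil {
			// Best-effort: log and keep going. The compat path must never
			// block container start (the #6657 invariant).
			log.Warningf("populating compat spec files for %q in %q: %v", name, path, err)
		}
	}
}
```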

Mapping (cgroup file written ↔ cAdvisor series populated):

| cgroup file (v2 / v1) | cAdvisor series | OCI source |
| --- | --- | --- |
| memory.max / memory.limit_in_bytes | container_spec_memory_limit_bytes | Memory.Limit |
| memory.swap.max / memory.memsw.limit_in_bytes | container_spec_memory_swap_limit_bytes | Memory.Swap (computed runc-style on v2) |
| memory.low / memory.soft_limit_in_bytes | container_spec_memory_reservation_limit_bytes | Memory.Reservation |
| cpu.max (quota) / cpu.cfs_quota_us | container_spec_cpu_quota | CPU.Quota |
| cpu.max (period) / cpu.cfs_period_us | container_spec_cpu_period | CPU.Period |
| cpu.weight / cpu.shares | container_spec_cpu_shares | CPU.Shares (back-converted on v2) |
| pids.max | (no cAdvisor series today; written for runc parity) | Pids.Limit |
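
The v2 column relies on the runc-compatible conversions named above. A sketch with the formulas as in runc's libcontainer/cgroups, which runsc mirrors; the swap helper is simplified and elides some unset/unlimited edge cases runc handles:

```go
package cgroup

import "errors"

// convertCPUSharesToCgroupV2Value maps the v1 cpu.shares range [2, 262144]
// onto the v2 cpu.weight range [1, 10000]. The k8s default of 2 shares for
// a container with no CPU request lands on weight 1.
func convertCPUSharesToCgroupV2Value(shares uint64) uint64 {
	if shares == 0 {
		return 0 // unset: write nothing
	}
	return 1 + ((shares-2)*9999)/262142
}

// convertMemorySwapToCgroupV2Value converts OCI Memory.Swap (v1 semantics:
// memory+swap combined) to v2's swap-only memory.swap.max.
func convertMemorySwapToCgroupV2Value(memorySwap, memory int64) (int64, error) {
	switch {
	case memorySwap == 0 || memorySwap == -1:
		return memorySwap, nil // unset or unlimited passes through
	case memory <= 0:
		return 0, errors.New("swap limit requires a memory limit")
	case memorySwap < memory:
		return 0, errors.New("memory+swap limit must be >= memory limit")
	}
	return memorySwap - memory, nil
}
```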

The compat cgroup is process-less, so any limits written here have no kernel-side accounting effect; they exist solely so cAdvisor's container_spec_* series report real values for runsc pods, matching what runc produces on the same node.

Best-effort writes on systemd v2. If a controller is not enabled in the parent slice's cgroup.subtree_control, setValue returns ENOENT / EROFS / EACCES, which we swallow and log. The compat path must never block container start (#6657 invariant).
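
A hypothetical helper capturing that swallow-and-log policy (it assumes the errors and golang.org/x/sys/unix packages; the real patch may classify errors differently):

```go
import (
	"errors"

	"golang.org/x/sys/unix"
)

// ignorableCompatWriteErr reports whether a spec-file write failure should
// be swallowed (and logged) rather than surfaced to the caller.
func ignorableCompatWriteErr(err error) bool {
	return errors.Is(err, unix.ENOENT) || // controller not in subtree_control
		errors.Is(err, unix.EROFS) || // read-only cgroupfs mount
		errors.Is(err, unix.EACCES) // insufficient privileges
}
```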

Backwards compatibility

Commit 1 only diverges from the existing path when the underlying cgroup is *cgroupSystemd; v1 and non-systemd v2 still flow through the existing Install path. Compat directories created by either commit are tracked in c.Own so existing Uninstall removes them at container destroy. No new lifecycle.

Commit 2 changes InstallSubcontainerCompatDir's signature from (cg Cgroup) error to (cg Cgroup, res *specs.LinuxResources) error. The dispatcher has a single internal caller (setupCgroupForSubcontainer); no external callers.

Tests

Commit 1:

  • TestInstallCompatDir: directory is created at MakePath(""), tracked in c.Own, second call is idempotent (no double-track), Uninstall removes it.
  • TestInstallSubcontainerCompatDirSystemd: public dispatcher routes systemd v2 cgroups to the compat-dir path.

Commit 2:

  • TestInstallCompatDirSpecFiles: pre-touch leaf interface files (simulating kernel auto-creation when controllers are enabled in the parent's subtree_control on a real cgroupfs mount), pass a full LinuxResources, assert each interface file contains the expected serialized value (incl. the runc-style swap-only computation and the cpu.shares → cpu.weight conversion).
  • TestInstallCompatDirBestEffort: deliberately do not seed leaf files, assert installCompatDir(res) swallows the resulting ENOENTs and still returns success with the directory created and tracked.
  • TestInstallSubcontainerCompatDirSystemd (extended): public dispatcher propagates non-nil resources end-to-end through to the per-controller set() methods.
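
For the cpu.shares to cpu.weight assertion specifically, a table-driven illustration against the conversion sketched earlier, with expected weights computed from the runc formula (not the actual test from this PR):

```go
package cgroup

import "testing"

func TestConvertCPUSharesToCgroupV2Value(t *testing.T) {
	for _, tc := range []struct{ shares, weight uint64 }{
		{0, 0},          // unset passes through
		{2, 1},          // k8s default when a container has no CPU request
		{1024, 39},      // runc's default shares value
		{262144, 10000}, // top of the v1 range maps to top of the v2 range
	} {
		if got := convertCPUSharesToCgroupV2Value(tc.shares); got != tc.weight {
			t.Errorf("shares=%d: got weight %d, want %d", tc.shares, got, tc.weight)
		}
	}
}
```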

Built and tested locally on aarch64 (lima). Unit tests pass; manual end-to-end verification on a Kubernetes cluster running cgroup v2 + systemd is in the comment below.

Refs: #6500, #6657, #13067

Out of scope: runtime accounting series (follow-up)

This PR restores cAdvisor discoverability (commit 1) and populates the cgroup limit files cAdvisor reads as container_spec_* (commit 2). It does not populate kernel-accounted runtime series — container_cpu_*_seconds_total, container_memory_* (instantaneous gauges), container_pressure_*, container_processes / container_threads / container_sockets / container_file_descriptors, network counters, etc.

Those stay zero because the compat scopes this PR creates are intentionally process-less: the user workload runs inside the gVisor sandbox (a single Linux process), so the host kernel has nothing to attribute to the per-container scopes. The data isn't lost — gVisor publishes per-container values via runsc events --stats, which is what containerd's CRI-stats plugin already consumes for /stats/summary (powering kubectl top, metrics-server, HPA). The remaining gap is plumbing those values into cAdvisor's output.

A reasonable follow-up shape: a coordinated cAdvisor + gVisor change that registers a cAdvisor ContainerHandlerFactory for runsc-managed cgroups, bypassing libcontainer's /sys/fs/cgroup reads and consulting runsc events --stats (or a stable equivalent), emitting the same metric names with the same label set as the kernel-cgroup-backed path so consumers need no code changes. Happy to drive both PRs as a follow-up to this one if maintainers think that's the right direction. The same shape would also help Kata and other runtimes whose user processes don't live in host cgroups.

This PR remains a strict prerequisite for that follow-up: without per-container scope dirs and populated spec files on the host, cAdvisor has no ContainerHandler to attach a runsc-aware stats source to.

runsc/cgroup: create host-side compat dir for subcontainers on systemd v2

cAdvisor discovers per-container cgroups by inotify-watching /sys/fs/cgroup.
Tools that consume its metrics (kubelet's /metrics/cadvisor, kubectl top,
container_cpu_usage_seconds_total, container_memory_working_set_bytes,
container_network_*, VPA recommendations sourced from cAdvisor, etc.) all
depend on a host-side cgroup directory existing for each user container.
Issue google#6500 / PR google#6657 added empty subcontainer cgroup directories so
cAdvisor would discover containers running under runsc; that fix only
covers cgroup v1 (and non-systemd v2). On a cgroup v2 host with the
systemd cgroup driver -- the default for kubelet on most current
distros -- per-container cAdvisor metrics regress to "pause-only" for
every gVisor pod, even though kubelet's CRI-backed /stats/summary reports
CPU/memory for those containers fine.

Root cause: setupCgroupForSubcontainer reaches cgroupInstall(...).Install({})
intending to mkdir an empty subcontainer cgroup directory for cAdvisor
compat. On systemd v2, that lands in cgroupSystemd.Install (in
runsc/cgroup/systemd.go), which only stages dbus properties; the cgroup
directory is otherwise created by Join() via StartTransientUnitContext,
which is inappropriate for a process-less compat cgroup (and would
conflict with systemd's lifecycle expectations).

Result: no host-side directory is ever created for non-pause containers
in a runsc pod on systemd v2. cAdvisor's inotify watcher under
/sys/fs/cgroup therefore never discovers them, and per-container series
are missing from /metrics/cadvisor. The pause container's scope is
visible because containerd creates that scope itself before invoking the
shim, independent of runsc.

Fix:

  1. Add cgroupSystemd.installCompatDir which os.MkdirAll's the resolved
     scope path under the parent slice and tracks it in c.Own so the
     inherited cgroupV2.Uninstall reaps it at container destroy.
     Idempotent (won't double-track on retries).

  2. Expose a single dispatcher cgroup.InstallSubcontainerCompatDir that
     routes systemd v2 cgroups to installCompatDir and falls back to the
     existing Install({}) path for v1 / non-systemd v2 (which already
     mkdir the directory inside Install).

  3. Wire setupCgroupForSubcontainer to use the dispatcher.

Backwards compatibility: only diverges from the existing path when the
underlying cgroup is *cgroupSystemd; v1 and non-systemd v2 still flow
through the existing Install({}) path. Compat directories created here
are tracked in c.Own so existing Uninstall removes them at container
destroy. No new lifecycle.

Tests: add TestInstallCompatDir asserting the directory is created at
MakePath(""), tracked in c.Own, that a second call is idempotent (no
double-track), and that Uninstall removes it; add
TestInstallSubcontainerCompatDirSystemd asserting the public dispatcher
routes to the compat-dir path.

Refs: google#6500, google#6657, google#13067
@a7i a7i force-pushed the fix/cadvisor-systemd-v2-subcontainer-cgroups branch from 219299a to 6772750 on May 4, 2026 19:05
@a7i a7i changed the title from "shim,runsc: surface gVisor subcontainers to cAdvisor on cgroup v2 + systemd" to "runsc/cgroup: create host-side compat dir for subcontainers on systemd v2" on May 4, 2026

a7i commented May 4, 2026

Verified end-to-end on a Kubernetes node running cgroup v2 + systemd with the patched runsc (build release-20260427.0-13-g41855b3f08, both commits in this PR applied).

A pod under RuntimeClass: gvisor now produces:

  1. Per-container scope dirs under the pod slice, discoverable by cAdvisor's inotify watcher (commit 1).
  2. Populated container_spec_* series reflecting the real OCI limits, written into the new compat dirs (commit 2).
$ kubectl get --raw "/api/v1/nodes/${NODE}/proxy/metrics/cadvisor" \
    | grep 'namespace="amir"' | grep 'container="perf"'

container_spec_cpu_period{...,container="perf",...}                              100000
container_spec_cpu_quota{...,container="perf",...}                               100000
container_spec_cpu_shares{...,container="perf",...}                              2
container_spec_memory_limit_bytes{...,container="perf",...}                      1.073741824e+09
container_spec_memory_reservation_limit_bytes{...,container="perf",...}          0
container_spec_memory_swap_limit_bytes{...,container="perf",...}                 0
container_start_time_seconds{...,container="perf",...}                           1.778029974e+09
container_ulimits_soft{...,container="perf",ulimit="max_open_files",...}         1.048576e+06

# kernel-accounted runtime series — all zero by design (see "Out of scope" below):
container_cpu_user_seconds_total                                                 0
container_cpu_system_seconds_total                                               0
container_cpu_cfs_throttled_periods_total / _seconds_total                       0
container_memory_usage_bytes / _working_set_bytes / _rss / _cache / _swap        0
container_memory_max_usage_bytes / _kernel_usage / _mapped_file                  0
container_memory_failures_total / _failcnt                                       0
container_memory_total_active_file_bytes / _total_inactive_file_bytes            0
container_processes / container_threads / container_sockets                      0
container_file_descriptors / container_threads_max                               0
container_oom_events_total                                                       0
container_pressure_cpu_*  / _memory_*  / _io_*                                   0
container_tasks_state{state="running"|"sleeping"|"stopped"|...}                  0

The id label resolves to the per-container scope inside the pod slice:

/kubepods.slice/
  kubepods-burstable.slice/
    kubepods-burstable-pod<uid>.slice/
      cri-containerd-<container_id>.scope    <-- created by commit 1
        memory.max         = 1073741824        <-- written by commit 2
        memory.swap.max    = 0
        memory.low         = 0
        cpu.max            = 100000 100000
        cpu.weight         = 1                 <-- 2 shares back-converts to weight 1
        pids.max           = max

— which is the host-side directory cAdvisor's inotify watcher needs.

| series | before this PR | after this PR | source |
| --- | --- | --- | --- |
| container_spec_memory_limit_bytes | 0 | 1073741824 (real limit) | memory.max written by commit 2 |
| container_spec_cpu_quota | absent | 100000 (real limit) | cpu.max quota half written by commit 2 |
| container_spec_memory_swap_limit_bytes | 0 | 0 (no swap on this host) | memory.swap.max written by commit 2 |
| container_spec_memory_reservation_limit_bytes | 0 | 0 (no Memory.Reservation in spec) | memory.low written by commit 2 |
| container_spec_cpu_period | 100000 | 100000 | unchanged (also written by commit 2) |
| container_spec_cpu_shares | OCI value | OCI value | unchanged (also written by commit 2; this pod has no CPU request → k8s default 2) |

kubectl top pod -n amir gvisor and CRI-backed /stats/summary were already correct on this cluster (containerd's CRI-stats plumbing doesn't go through /sys/fs/cgroup); they remain correct, and /metrics/cadvisor now reports real spec limits in addition to matching the runc label shape.

Out of scope: runtime accounting series

The kernel-accounted runtime series (container_cpu_*_seconds_total, instantaneous container_memory_*, container_pressure_*, process/thread/socket/fd gauges, network and disk counters, etc.) remain 0. The compat scope is process-less: the user workload runs inside the gVisor sandbox process, which lives in a single host cgroup, so the host kernel has nothing to attribute to the per-container scopes.

The data exists — gVisor already produces per-container CPU/memory/network values via runsc events --stats, which is what containerd's CRI-stats plugin already consumes for /stats/summary (so kubectl top, metrics-server, and HPA all work correctly today). Plumbing those values into cAdvisor's /metrics/cadvisor output is a separate, coordinated cAdvisor + gVisor change.

A reasonable follow-up shape:

  1. gVisor side: stabilize runsc events --stats <container-id> as a public contract (it already serves CRI-stats), or add a new runsc stats mode formatted exactly to cAdvisor's v2.ContainerStats field set (cpu_usage_user_us, memory_working_set_bytes, memory_rss, memory_cache, processes, threads, tasks_state.*, pressure.*, oom_events_total, …). Option 1 is smaller and reuses an interface containerd already depends on; option 2 is cleaner but creates a second stats interface.
  2. cAdvisor side: register a ContainerHandlerFactory for runsc-managed cgroups that bypasses libcontainer's manager.GetStats() path against /sys/fs/cgroup and consults the gVisor interface instead. Detection signal could be "scope is empty AND there's a matching runsc container" or, more reliably, the runtime label containerd already attaches.
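
To make option 1 concrete, a minimal sketch of the consumer side: shell out to runsc events --stats and decode the event envelope. It assumes runsc is on PATH with a --root matching the shim's state directory, and it abbreviates the payload to json.RawMessage rather than modeling the full schema:

```go
package statsbridge

import (
	"encoding/json"
	"os/exec"
)

// statsEvent is a trimmed view of the JSON envelope runsc emits; the full
// data schema is not modeled here.
type statsEvent struct {
	Type string          `json:"type"` // "stats"
	ID   string          `json:"id"`
	Data json.RawMessage `json:"data"` // per-container CPU/memory/... values
}

// containerStats asks runsc for one stats sample for the given container.
func containerStats(root, id string) (*statsEvent, error) {
	out, err := exec.Command("runsc", "--root", root, "events", "--stats", id).Output()
	if err != nil {
		return nil, err
	}
	var ev statsEvent
	if err := json.Unmarshal(out, &ev); err != nil {
		return nil, err
	}
	return &ev, nil
}
```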

The result: same metric names, same labels, real numbers. Consumers need no code changes. The same shape would also help Kata and other runtimes whose user processes don't live in host cgroups.

Question for maintainers

Does that direction make sense for the runtime-counter follow-up? Any preference on which gVisor-side interface should be the cAdvisor contract? Happy to drive both PRs as follow-ups to this one once it lands and there's alignment on shape.

@a7i a7i marked this pull request as ready for review May 4, 2026 19:22

a7i commented May 4, 2026

@ayushr2 would appreciate your input on this PR, thanks!

@a7i a7i marked this pull request as draft May 6, 2026 00:10
runsc/cgroup: populate spec files on subcontainer compat dirs

After the parent commit creates the host-side compat directory for
subcontainers on systemd v2, the directory is empty: cAdvisor's
container_spec_* series read 0 (or absent) for those containers because
the leaf cgroup interface files have no values written to them. Same
shape on cgroup v1 today via Install({}) -- no spec values are
threaded through.

Thread spec.Linux.Resources through InstallSubcontainerCompatDir so the
existing per-controller set() methods (v1, non-systemd v2) and a new
best-effort path (systemd v2) populate the cgroup interface files
cAdvisor reads as container_spec_*:

  cgroup file (v2 / v1)             -> cAdvisor series
  --------------------------------- -> -----------------------------------
  memory.max / memory.limit_in_bytes
                                       container_spec_memory_limit_bytes
  memory.swap.max / memory.memsw.limit_in_bytes
                                       container_spec_memory_swap_limit_bytes
  memory.low / memory.soft_limit_in_bytes
                                       container_spec_memory_reservation_limit_bytes
  cpu.max (quota) / cpu.cfs_quota_us
                                       container_spec_cpu_quota
  cpu.max (period) / cpu.cfs_period_us
                                       container_spec_cpu_period
  cpu.weight / cpu.shares              container_spec_cpu_shares
  pids.max                             (no cAdvisor series today; written
                                        for parity with runc)

The compat cgroup is process-less, so any limits written here have no
kernel-side accounting effect; they exist solely so cAdvisor's
container_spec_* series report real values for runsc pods, matching
what runc produces on the same node. Runtime counter series
(container_*_total, container_memory_*, container_pressure_*, ...)
remain zero by design and need a separate follow-up that delegates
GetStats to a runsc-aware ContainerHandler (google#13067).

On systemd v2 the writes are best-effort: if a controller is not
enabled in the parent slice's cgroup.subtree_control, setValue returns
ENOENT/EROFS/EACCES, which we swallow and log. The compat path must
never block container start (google#6657 invariant). Limited to
{cpu, memory, pids} -- the controllers whose interface files cAdvisor
reads as spec values; cpuset/io/hugetlb are intentionally excluded.

Tests:
  - TestInstallCompatDirSpecFiles asserts each interface file contains
    the expected serialized value (incl. the runc-style swap-only
    computation and the cpu_shares -> cpu.weight conversion).
  - TestInstallCompatDirBestEffort asserts errors are swallowed when
    leaf files don't exist.
  - TestInstallSubcontainerCompatDirSystemd is extended to assert
    end-to-end propagation of resources through the public dispatcher.

Refs: google#6500, google#6657, google#13067
@a7i a7i changed the title from "runsc/cgroup: create host-side compat dir for subcontainers on systemd v2" to "runsc/cgroup: discover and populate spec files for subcontainer compat dirs on systemd v2" on May 6, 2026
