runsc/cgroup: discover and populate spec files for subcontainer compat dirs on systemd v2 #13070
a7i wants to merge 2 commits into google:master
…d v2

cAdvisor discovers per-container cgroups by inotify-watching
/sys/fs/cgroup. Tools that consume its metrics (kubelet's
/metrics/cadvisor, kubectl top, container_cpu_usage_seconds_total,
container_memory_working_set_bytes, container_network_*, VPA
recommendations sourced from cAdvisor, etc.) all depend on a host-side
cgroup directory existing for each user container.

Issue google#6500 / PR google#6657 added empty subcontainer cgroup
directories so cAdvisor would discover containers running under runsc;
that fix only covers cgroup v1 (and non-systemd v2). On a cgroup v2 host
with the systemd cgroup driver -- the default for kubelet on most
current distros -- per-container cAdvisor metrics regress to
"pause-only" for every gVisor pod, even though kubelet's CRI-backed
/stats/summary reports CPU/memory for those containers fine.

Root cause: setupCgroupForSubcontainer reaches
cgroupInstall(...).Install({}) intending to mkdir an empty subcontainer
cgroup directory for cAdvisor compat. On systemd v2, that lands in
cgroupSystemd.Install (in runsc/cgroup/systemd.go), which only stages
dbus properties; the cgroup directory is otherwise created by Join() via
StartTransientUnitContext, which is inappropriate for a process-less
compat cgroup (and would conflict with systemd's lifecycle
expectations).

Result: no host-side directory is ever created for non-pause containers
in a runsc pod on systemd v2. cAdvisor's inotify watcher under
/sys/fs/cgroup therefore never discovers them, and per-container series
are missing from /metrics/cadvisor. The pause container's scope is
visible because containerd creates that scope itself before invoking
the shim, independent of runsc.

Fix:
  1. Add cgroupSystemd.installCompatDir which os.MkdirAll's the resolved
     scope path under the parent slice and tracks it in c.Own so the
     inherited cgroupV2.Uninstall reaps it at container destroy.
     Idempotent (won't double-track on retries).
  2. Expose a single dispatcher cgroup.InstallSubcontainerCompatDir that
     routes systemd v2 cgroups to installCompatDir and falls back to the
     existing Install({}) path for v1 / non-systemd v2 (which already
     mkdir the directory inside Install).
  3. Wire setupCgroupForSubcontainer to use the dispatcher.

Backwards compatibility: only diverges from the existing path when the
underlying cgroup is *cgroupSystemd; v1 and non-systemd v2 still flow
through the existing Install({}) path. Compat directories created here
are tracked in c.Own so existing Uninstall removes them at container
destroy. No new lifecycle.

Tests: add TestInstallCompatDir asserting the directory is created at
MakePath(""), tracked in c.Own, that a second call is idempotent (no
double-track), and that Uninstall removes it; add
TestInstallSubcontainerCompatDirSystemd asserting the public dispatcher
routes to the compat-dir path.

Refs: google#6500, google#6657, google#13067
force-pushed from 219299a to 6772750
Verified end-to-end on a Kubernetes node running cgroup v2 + systemd with the patched … A pod under …
The … which is the host-side directory cAdvisor's inotify watcher needs.

Out of scope: runtime accounting series
The kernel-accounted runtime series (…
The data exists: gVisor already produces per-container CPU/memory/network values via …
A reasonable follow-up shape: …
The result: same metric names, same labels, real numbers. Consumers need no code changes. The same shape would also help Kata and other runtimes whose user processes don't live in host cgroups.

Question for maintainers
Does that direction make sense for the runtime-counter follow-up? Any preference on which gVisor-side interface should be the cAdvisor contract? Happy to drive both PRs as follow-ups to this one once it lands and there's alignment on shape.
@ayushr2 would appreciate your input on this PR, thanks!
After the parent commit creates the host-side compat directory for
subcontainers on systemd v2, the directory is empty: cAdvisor's
container_spec_* series read 0 (or absent) for those containers because
the leaf cgroup interface files have no values written to them. Same
shape on cgroup v1 today via Install({}) -- no spec values are
threaded through.
Thread spec.Linux.Resources through InstallSubcontainerCompatDir so the
existing per-controller set() methods (v1, non-systemd v2) and a new
best-effort path (systemd v2) populate the cgroup interface files
cAdvisor reads as container_spec_*:
  cgroup file (v2 / v1)                          -> cAdvisor series
  ---------------------------------------------  -> ---------------------------------------------
  memory.max / memory.limit_in_bytes             -> container_spec_memory_limit_bytes
  memory.swap.max / memory.memsw.limit_in_bytes  -> container_spec_memory_swap_limit_bytes
  memory.low / memory.soft_limit_in_bytes        -> container_spec_memory_reservation_limit_bytes
  cpu.max (quota) / cpu.cfs_quota_us             -> container_spec_cpu_quota
  cpu.max (period) / cpu.cfs_period_us           -> container_spec_cpu_period
  cpu.weight / cpu.shares                        -> container_spec_cpu_shares
  pids.max                                       -> (no cAdvisor series today; written
                                                    for parity with runc)
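
As a rough sketch of how OCI values end up in the v2 files listed above (this is not the runsc code): the function names are hypothetical, the cpu.shares mapping assumes the linear formula historically used by runc, and the swap helper is a simplified model of runc's "swap-only" computation.

```go
package main

import (
	"errors"
	"fmt"
)

// sharesToWeight maps cgroup v1 cpu.shares [2, 262144] onto cgroup v2
// cpu.weight [1, 10000]. Assumption: the linear mapping historically used
// by runc. Zero means "unset" and is passed through.
func sharesToWeight(shares uint64) uint64 {
	if shares == 0 {
		return 0
	}
	if shares < 2 {
		shares = 2 // clamp to the kernel minimum to avoid unsigned underflow
	}
	if shares > 262144 {
		shares = 262144
	}
	return 1 + ((shares-2)*9999)/262142
}

// formatCPUMax renders the "<quota> <period>" pair written to cpu.max.
// A non-positive quota means unlimited ("max"); a zero period falls back
// to the kernel default of 100000 microseconds.
func formatCPUMax(quota int64, period uint64) string {
	q := "max"
	if quota > 0 {
		q = fmt.Sprintf("%d", quota)
	}
	if period == 0 {
		period = 100000
	}
	return fmt.Sprintf("%s %d", q, period)
}

// swapOnlyLimit converts an OCI memory+swap limit into the swap-only value
// that memory.swap.max expects. -1 (unlimited) and 0 (unset) pass through.
func swapOnlyLimit(memorySwap, memory int64) (int64, error) {
	if memorySwap == -1 || memorySwap == 0 {
		return memorySwap, nil
	}
	if memory <= 0 {
		return 0, errors.New("swap limit requires a memory limit")
	}
	if memorySwap < memory {
		return 0, errors.New("memory+swap limit must be >= memory limit")
	}
	return memorySwap - memory, nil
}

func main() {
	fmt.Println("cpu.weight:", sharesToWeight(1024))
	fmt.Println("cpu.max:", formatCPUMax(50000, 100000))
	swap, err := swapOnlyLimit(300, 100)
	fmt.Println("memory.swap.max:", swap, err)
}
```

cAdvisor reports container_spec_cpu_shares in v1 units, which is why the cpu.weight value has to be converted back on the reading side.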
The compat cgroup is process-less, so any limits written here have no
kernel-side accounting effect; they exist solely so cAdvisor's
container_spec_* series report real values for runsc pods, matching
what runc produces on the same node. Runtime counter series
(container_*_total, container_memory_*, container_pressure_*, ...)
remain zero by design and need a separate follow-up that delegates
GetStats to a runsc-aware ContainerHandler (google#13067).
On systemd v2 the writes are best-effort: if a controller is not
enabled in the parent slice's cgroup.subtree_control, setValue returns
ENOENT/EROFS/EACCES, which we swallow and log. The compat path must
never block container start (google#6657 invariant). Limited to
{cpu, memory, pids} -- the controllers whose interface files cAdvisor
reads as spec values; cpuset/io/hugetlb are intentionally excluded.
Tests:
  - TestInstallCompatDirSpecFiles asserts each interface file contains
    the expected serialized value (incl. the runc-style swap-only
    computation and the cpu_shares -> cpu.weight conversion).
  - TestInstallCompatDirBestEffort asserts errors are swallowed when
    leaf files don't exist.
  - TestInstallSubcontainerCompatDirSystemd is extended to assert
    end-to-end propagation of resources through the public dispatcher.
Refs: google#6500, google#6657, google#13067
What
Two-commit fix that makes runsc pods on cgroup v2 + systemd report container_* cAdvisor series equivalent in shape and spec values to what runc-managed pods produce on the same node:
1. runsc/cgroup: create host-side compat dir for subcontainers on systemd v2: mkdir an empty per-subcontainer cgroup directory under the pod slice so cAdvisor (and other inotify-based discoverers under /sys/fs/cgroup) report metrics for non-pause containers.
2. runsc/cgroup: populate spec files on subcontainer compat dirs: thread spec.Linux.Resources through the dispatcher so the limit/spec interface files cAdvisor reads as container_spec_* (memory.max, cpu.max, cpu.weight, memory.swap.max, memory.low, pids.max) are populated from the OCI resources on a best-effort basis.

Why
cAdvisor discovers per-container cgroups by inotify-watching /sys/fs/cgroup, and reads spec values for the container_spec_* series from the leaf cgroup files. Tools that consume those metrics (kubelet's /metrics/cadvisor, kubectl top, container_cpu_usage_seconds_total, container_memory_working_set_bytes, container_network_*, VPA recommendations sourced from cAdvisor, etc.) all depend on a host-side cgroup directory existing for each user container and the spec files being populated.
#6500 / #6657 added empty subcontainer cgroup directories so cAdvisor would discover containers running inside a runsc sandbox. That fix only covers cgroup v1 (and non-systemd v2). On a cgroup v2 host with the systemd cgroup driver (the default for kubelet on most current distros), per-container cAdvisor metrics regress to "pause-only" for every gVisor pod, and even where the directory does exist (cgroup v1 today), the spec files end up empty because Install({}) is passed empty resources.
Reproduction and broader analysis are in the parent issue (#13067).

Commit 1: discoverability (root cause and fix)
setupCgroupForSubcontainer calls cgroupInstall(...).Install({}) intending to mkdir an empty subcontainer cgroup directory for cAdvisor compat. On systemd v2, that lands in cgroupSystemd.Install (in runsc/cgroup/systemd.go), which only stages dbus properties; the cgroup directory is otherwise created by Join() via StartTransientUnitContext. Join() is wrong here: the compat cgroup is intentionally process-less, registering a transient unit for it would conflict with systemd's lifecycle expectations, and it'd be reaped the moment the dbus connection drops.
So no host-side directory is ever created for non-pause containers in a runsc pod on systemd v2. cAdvisor's inotify watcher under /sys/fs/cgroup therefore never discovers them, and per-container series are missing from /metrics/cadvisor. The pause container's scope is visible because containerd creates it itself before invoking the shim, independent of runsc.
Fix:
1. Add cgroupSystemd.installCompatDir which os.MkdirAll's the resolved scope path under the parent slice and tracks it in c.Own so the inherited cgroupV2.Uninstall reaps it at container destroy. Idempotent (won't double-track on retries).
2. Expose a single dispatcher cgroup.InstallSubcontainerCompatDir that routes systemd v2 cgroups to installCompatDir and falls back to the existing Install path for v1 / non-systemd v2 (which already mkdir the directory inside Install).
3. Wire setupCgroupForSubcontainer to use the dispatcher.
The shim's setPodCgroup is left alone; it doesn't need to change for this fix. Non-root containers reach setupCgroupForSubcontainer regardless of dev.gvisor.spec.cgroup-parent, and the pause scope is already created by containerd before the shim runs.

Commit 2: populate spec files
After commit 1, the new compat directories exist but their interface files are empty (or read kernel defaults of max), so cAdvisor's container_spec_* series for runsc pods read 0 / missing for the limit-bearing fields. The same shape applies on cgroup v1 today: Install({}) is passed empty resources, so v1 compat dirs also have no spec values.
Thread spec.Linux.Resources through InstallSubcontainerCompatDir:
- v1 / non-systemd v2: Install(res) already iterates per-controller set() methods that write the limit files; just hand it real resources instead of empty.
- systemd v2: installCompatDir is extended to call controllers2["cpu"|"memory"|"pids"].set(res, path) after mkdir. Reuses all the existing runc-compatible conversion (convertCPUSharesToCgroupV2Value, convertMemorySwapToCgroupV2Value, cpu.max formatting). Limited to {cpu, memory, pids}, the controllers whose interface files cAdvisor reads as spec values; cpuset / io / hugetlb are intentionally excluded (they don't surface as container_spec_* and would widen the failure surface on hosts where they aren't enabled in the parent slice's cgroup.subtree_control).
Mapping (cgroup file written ↔ cAdvisor series populated):
  cgroup file (v2 / v1)                          cAdvisor series                                spec.Linux.Resources field
  ---------------------------------------------  ---------------------------------------------  ----------------------------------------
  memory.max / memory.limit_in_bytes             container_spec_memory_limit_bytes              Memory.Limit
  memory.swap.max / memory.memsw.limit_in_bytes  container_spec_memory_swap_limit_bytes         Memory.Swap (computed runc-style on v2)
  memory.low / memory.soft_limit_in_bytes        container_spec_memory_reservation_limit_bytes  Memory.Reservation
  cpu.max (quota) / cpu.cfs_quota_us             container_spec_cpu_quota                       CPU.Quota
  cpu.max (period) / cpu.cfs_period_us           container_spec_cpu_period                      CPU.Period
  cpu.weight / cpu.shares                        container_spec_cpu_shares                      CPU.Shares (back-converted on v2)
  pids.max                                       (no cAdvisor series today)                     Pids.Limit
The compat cgroup is process-less, so any limits written here have no kernel-side accounting effect; they exist solely so cAdvisor's container_spec_* series report real values for runsc pods, matching what runc produces on the same node.
Best-effort writes on systemd v2. If a controller is not enabled in the parent slice's cgroup.subtree_control, setValue returns ENOENT/EROFS/EACCES, which we swallow and log. The compat path must never block container start (#6657 invariant).

Backwards compatibility
Commit 1 only diverges from the existing path when the underlying cgroup is *cgroupSystemd; v1 and non-systemd v2 still flow through the existing Install path. Compat directories created by either commit are tracked in c.Own so existing Uninstall removes them at container destroy. No new lifecycle.
Commit 2 changes InstallSubcontainerCompatDir's signature from (cg Cgroup) error to (cg Cgroup, res *specs.LinuxResources) error. The dispatcher has a single internal caller (setupCgroupForSubcontainer); no external callers.

Tests
Commit 1:
- TestInstallCompatDir: directory is created at MakePath(""), tracked in c.Own, second call is idempotent (no double-track), Uninstall removes it.
- TestInstallSubcontainerCompatDirSystemd: public dispatcher routes systemd v2 cgroups to the compat-dir path.
Commit 2:
- TestInstallCompatDirSpecFiles: pre-touch leaf interface files (simulating kernel auto-creation when controllers are enabled in parent's subtree_control on a real cgroupfs mount), pass a full LinuxResources, assert each interface file contains the expected serialized value (incl. the runc-style swap-only computation and the cpu_shares → cpu.weight conversion).
- TestInstallCompatDirBestEffort: deliberately do not seed leaf files, assert installCompatDir(res) swallows the resulting ENOENTs and still returns success with the directory created and tracked.
- TestInstallSubcontainerCompatDirSystemd (extended): public dispatcher propagates non-nil resources end-to-end through to the per-controller set() methods.
Built and tested locally on aarch64 (lima). Unit tests pass; manual end-to-end verification on a Kubernetes cluster running cgroup v2 + systemd is in the comment below.
Refs: #6500, #6657, #13067
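
For reviewers who want the routing at a glance, the dispatcher described under Commit 1 and extended with resources in Commit 2 could be sketched as below. The types are hypothetical, trimmed stand-ins (linuxResources for specs.LinuxResources, cgroupPlain for the v1 / non-systemd v2 implementations) that only record what they were handed; the real methods touch the filesystem.

```go
package main

import "fmt"

// linuxResources is a hypothetical, trimmed stand-in for specs.LinuxResources.
type linuxResources struct {
	MemoryLimit int64
}

// Cgroup is a hypothetical, trimmed stand-in for the cgroup interface.
type Cgroup interface {
	Install(res *linuxResources) error
}

// cgroupPlain models v1 and non-systemd v2, where Install already creates
// the directory and writes the limit files.
type cgroupPlain struct {
	installed *linuxResources
}

func (c *cgroupPlain) Install(res *linuxResources) error {
	c.installed = res
	return nil
}

// cgroupSystemd models systemd v2, where Install only stages properties and
// a separate compat path has to create the directory.
type cgroupSystemd struct {
	staged *linuxResources
	compat *linuxResources
}

func (c *cgroupSystemd) Install(res *linuxResources) error {
	c.staged = res
	return nil
}

func (c *cgroupSystemd) installCompatDir(res *linuxResources) error {
	c.compat = res
	return nil
}

// InstallSubcontainerCompatDir routes systemd v2 cgroups to the compat-dir
// path and everything else to the regular Install path.
func InstallSubcontainerCompatDir(cg Cgroup, res *linuxResources) error {
	if sd, ok := cg.(*cgroupSystemd); ok {
		return sd.installCompatDir(res)
	}
	return cg.Install(res)
}

func main() {
	res := &linuxResources{MemoryLimit: 1 << 20}
	sd := &cgroupSystemd{}
	plain := &cgroupPlain{}
	if err := InstallSubcontainerCompatDir(sd, res); err != nil {
		panic(err)
	}
	if err := InstallSubcontainerCompatDir(plain, res); err != nil {
		panic(err)
	}
	fmt.Println("systemd used compat path:", sd.compat == res && sd.staged == nil)
	fmt.Println("plain used Install:", plain.installed == res)
}
```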

Out of scope: runtime accounting series (follow-up)
This PR restores cAdvisor discoverability (commit 1) and populates the cgroup limit files cAdvisor reads as container_spec_* (commit 2). It does not populate kernel-accounted runtime series: container_cpu_*_seconds_total, container_memory_* (instantaneous gauges), container_pressure_*, container_processes / container_threads / container_sockets / container_file_descriptors, network counters, etc.
Those stay zero because the compat scopes this PR creates are intentionally process-less: the user workload runs inside the gVisor sandbox (a single Linux process), so the host kernel has nothing to attribute to the per-container scopes. The data isn't lost: gVisor publishes per-container values via runsc events --stats, which is what containerd's CRI-stats plugin already consumes for /stats/summary (powering kubectl top, metrics-server, HPA). The remaining gap is plumbing those values into cAdvisor's output.
A reasonable follow-up shape: a coordinated cAdvisor + gVisor change that registers a cAdvisor ContainerHandlerFactory for runsc-managed cgroups, bypassing libcontainer's /sys/fs/cgroup reads and consulting runsc events --stats (or a stable equivalent), emitting the same metric names with the same label set as the kernel-cgroup-backed path so consumers need no code changes. Happy to drive both PRs as a follow-up to this one if maintainers think that's the right direction. The same shape would also help Kata and other runtimes whose user processes don't live in host cgroups.
This PR remains a strict prerequisite for that follow-up: without per-container scope dirs and populated spec files on the host, cAdvisor has no ContainerHandler to attach a runsc-aware stats source to.