With Kubernetes v1.36, Pressure Stall Information (PSI) metrics have reached general availability, offering cloud-native teams deeper insight into resource saturation at node, pod, and container levels. This advancement enhances observability by moving beyond basic utilization stats to track stalled task time and queuing delays with negligible overhead.
- PSI metrics deliver detailed stall time insights for CPU, memory, and I/O
- Performance tests confirm kernel-side PSI accounting stays under roughly 3.1% of node CPU on 4-core machines
- PSI data integrates seamlessly with Prometheus and Kubernetes Summary API
Infrastructure signal
Pressure Stall Information (PSI) metrics are now a stable feature in Kubernetes v1.36. These metrics extend beyond traditional CPU and memory utilization by quantifying how long processes are stalled due to resource contention. PSI covers CPU, memory, and I/O resources, enabling detection of bottlenecks not evident in simple usage percentages. This is vital for forecasting potential outages and performance degradation in cloud-native environments running complex workloads.
Performance evaluations confirm that enabling PSI in the Linux kernel and through the kubelet adds minimal system overhead. Kernel-level PSI tracking incurs at most roughly 3.1% CPU usage on 4-core nodes under heavy contention, while the kubelet's metric-collection work remains lightweight and transient, staying within 2.5% of node capacity. This efficiency means Kubernetes clusters can produce richer telemetry without significant trade-offs in cloud resource consumption or node reliability.
Developer impact
Developers benefit from granular PSI telemetry by gaining visibility into task stalls at the pod and container levels, improving troubleshooting and capacity-planning workflows. Since these metrics include moving averages over 10-, 60-, and 300-second windows, developers can distinguish between short-lived spikes and chronic saturation, leading to higher signal fidelity when diagnosing slowdowns or scheduling delays.
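On Linux, these stall averages originate in the kernel's pressure files such as /proc/pressure/cpu, whose documented line format is "some avg10=... avg60=... avg300=... total=...". A minimal sketch of parsing one such line:

```python
# Minimal sketch: parse one line from a Linux PSI file such as
# /proc/pressure/cpu. avg10/avg60/avg300 are percentages of time tasks
# were stalled over those windows; total is cumulative stall time in
# microseconds.

def parse_psi_line(line: str) -> dict:
    """Parse a PSI line into {'kind': 'some'|'full', 'avg10': ..., 'total': ...}."""
    kind, *fields = line.split()
    result = {"kind": kind}
    for field in fields:
        key, value = field.split("=")
        result[key] = float(value) if key.startswith("avg") else int(value)
    return result

sample = "some avg10=1.25 avg60=0.40 avg300=0.10 total=987654"
print(parse_psi_line(sample))
```

Comparing avg10 against avg300 from the same reading is one simple way to tell a short-lived spike from chronic saturation.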
Kubernetes 1.36 removes the need for feature-gate toggles for PSI metrics, simplifying adoption for dev teams. PSI metrics are accessible through existing endpoints, such as the Prometheus-compatible /metrics/cadvisor endpoint and the Summary API, minimizing integration work within current monitoring stacks. However, PSI is restricted to Linux nodes; Windows nodes omit these metrics, requiring teams to account for hybrid cluster environments.
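As an illustration of consuming PSI through the Summary API, the sketch below walks an already-decoded /stats/summary response and collects per-pod CPU stall averages. The nesting shown (pods → cpu → psi → some) reflects the structure added for PSI support, but treat the exact field names as an assumption and verify them against your cluster's actual output:

```python
# Hedged sketch: extract per-pod CPU PSI averages from a kubelet
# Summary API response (already decoded from JSON). Field names here
# are an assumption; confirm against your cluster's /stats/summary.

def pod_cpu_pressure(summary: dict) -> dict:
    """Map pod name -> 60-second 'some' CPU stall average (percent)."""
    pressures = {}
    for pod in summary.get("pods", []):
        name = pod.get("podRef", {}).get("name", "unknown")
        some = pod.get("cpu", {}).get("psi", {}).get("some", {})
        if "avg60" in some:
            pressures[name] = some["avg60"]
    return pressures

sample_summary = {
    "pods": [
        {"podRef": {"name": "web-0"},
         "cpu": {"psi": {"some": {"avg10": 2.1, "avg60": 1.4,
                                  "avg300": 0.6, "total": 120000}}}},
        {"podRef": {"name": "batch-1"}, "cpu": {}},  # no PSI data reported
    ]
}
print(pod_cpu_pressure(sample_summary))  # includes web-0 only
```

Note the second pod is silently skipped rather than raising, which matters in hybrid clusters where some nodes never report PSI.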
What teams should watch
Cloud operations and reliability teams should monitor PSI metrics for early warnings of node resource pressure before utilization limits are hit. These signals can inform autoscaling policies, pod scheduling decisions, and alerting thresholds calibrated on observed stall duration rather than raw CPU or memory percentage alone, reducing false positives and outages.
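A stall-duration-based alert might look like the following Prometheus rule sketch. The metric name follows cAdvisor's pressure-counter naming, but both the name and the thresholds are illustrative assumptions to be validated against what your scrape actually exposes:

```yaml
groups:
  - name: psi-pressure
    rules:
      - alert: SustainedCPUPressure
        # rate() over the cumulative stall counter approximates the
        # fraction of time tasks were stalled over the window;
        # metric name and 25%/10m thresholds are illustrative.
        expr: rate(container_pressure_cpu_waiting_seconds_total[5m]) > 0.25
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Sustained CPU stall pressure on {{ $labels.container }}"
```

Alerting on stall rate rather than raw utilization is what lets this fire on genuine contention while staying quiet on busy-but-healthy nodes.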
Infrastructure teams should evaluate workload density and node sizing with PSI overhead in mind, since the added kernel bookkeeping remains minor but measurable at the highest densities tested (~80 pods on 4 cores). Teams running mixed-OS environments must also build workflows that handle absent PSI metrics on Windows nodes gracefully to maintain consistent observability.
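One way to handle mixed-OS fleets is to fall back to a coarser utilization signal when a node reports no PSI data. A hedged sketch, where the usageNanoCores field mirrors the Summary API's CPU usage counter and capacityNanoCores is a hypothetical field standing in for node capacity obtained elsewhere:

```python
# Hedged sketch: prefer PSI when a node reports it, otherwise degrade
# to a utilization percentage. "capacityNanoCores" is a hypothetical
# field for illustration; in practice capacity comes from the Node object.

def node_pressure_signal(node_stats: dict) -> tuple:
    """Return ("psi", avg60) when PSI is present, else ("cpu_util", percent)."""
    psi = node_stats.get("cpu", {}).get("psi")
    if psi and "some" in psi:
        return ("psi", psi["some"].get("avg60", 0.0))
    # Fallback for nodes without PSI (e.g. Windows): coarse utilization
    usage = node_stats.get("cpu", {}).get("usageNanoCores", 0)
    capacity = node_stats.get("cpu", {}).get("capacityNanoCores", 1)
    return ("cpu_util", 100.0 * usage / capacity)

linux_node = {"cpu": {"psi": {"some": {"avg60": 2.5}}}}
windows_node = {"cpu": {"usageNanoCores": 500_000_000,
                        "capacityNanoCores": 4_000_000_000}}
print(node_pressure_signal(linux_node))    # PSI-based signal
print(node_pressure_signal(windows_node))  # utilization fallback
```

Tagging each reading with its source ("psi" vs "cpu_util") keeps dashboards honest about which nodes are reporting the higher-fidelity signal.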
Developers should incorporate PSI-aware performance profiling into CI/CD pipelines and incident-response playbooks. By leveraging the detailed stall insights, applications can be optimized for resource-contention scenarios early in development, improving both end-user experience and cloud-spend efficiency.