
Diagnosis — scheduler & concurrency

A system that spent more time waiting to run than actually running: scheduler latency via DTrace, and the misconfigured worker pools behind it

Platform: FreeBSD · 16-core server
Stack: PHP-FPM · background job workers · system daemons
Tools: DTrace sched:::on-cpu · sched:::off-cpu
Aggregated by: execname · pid · off-CPU time distribution
Before: ~100 runnable threads · 16 cores
Hardware change: None

The server wasn't in distress. CPU utilization was unremarkable. Load average was elevated but not alarming. There were no I/O bottlenecks, no lock contention, no obvious errors in any log. By every metric that was being actively monitored, the system was healthy.

But the application felt sluggish. Not broken — sluggish. Response times were higher than they should have been given the apparent load. Things that should have been fast were taking a beat longer than expected. The kind of degradation that's hard to file a ticket about because nothing is technically wrong, but that accumulates into user frustration and eventually reaches the point where someone asks why the server "feels slow."

This is one of the hardest categories of production problem to take seriously, because the standard diagnostic tools show nothing worth worrying about. And it is one of the most common — because the configuration that causes it drifts into place gradually, over months or years, and there is rarely a moment where something breaks. The system just becomes quietly, persistently inefficient.

We found a system spending more time waiting to run than actually running — without ever appearing busy.

The server had all the classic fingerprints of configuration drift. It had originally been deployed on a different system — larger, differently shaped, a different workload mix. Worker counts had been chosen as round numbers, or carried over from what had worked before, and then never revisited. Over time, background processing requirements had grown. More workers were added to handle increased queue depth. None were removed when other things changed. Each individual addition made sense at the time. The cumulative effect was a system running approximately 100 concurrently runnable threads on 16 CPU cores.

Nothing broke when this happened. The OS scheduler — which is very good at its job — kept everything moving. But "moving" and "efficient" are not the same thing. A scheduler managing 100 threads across 16 cores is spending a meaningful fraction of its time making scheduling decisions, performing context switches, and losing warm cache and TLB state with every thread switch. None of that overhead appears in CPU utilization graphs, because context switch overhead looks like CPU work. It looks indistinguishable from useful computation until you measure it directly.

Instrumenting the scheduler with DTrace

The investigation used two DTrace scheduler probes: sched:::on-cpu, which fires when a thread is scheduled onto a CPU core, and sched:::off-cpu, which fires when a thread is descheduled. By recording timestamps at both events and aggregating by execname and pid, it was possible to measure not just how much CPU time each process was using, but how it was using it — specifically the ratio of time spent executing versus time spent waiting to be scheduled.

This is a fundamentally different question from "how much CPU is this process using." A process can use very little CPU and still be a significant source of scheduler pressure, if it generates a high volume of very short execution bursts. Each burst requires a context switch in and a context switch out. Each switch has overhead. At sufficient frequency, that overhead becomes measurable — not in CPU utilization, but in the latency experienced by every other thread on the system.
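In D, that measurement is a short script. A minimal sketch, assuming the stock sched provider probes; aggregation names are ours, and a production version would add predicates to filter out idle and kernel threads:

```d
#!/usr/sbin/dtrace -s
/* Sketch: per-process on-CPU burst lengths and scheduling-event counts.
   Run for a fixed window, then read the aggregations on exit. */

sched:::on-cpu
{
        /* this thread was just placed on a core */
        self->on = timestamp;
}

sched:::off-cpu
/self->on/
{
        /* burst length: time between on-cpu and off-cpu for this thread */
        @bursts[execname, pid] = quantize(timestamp - self->on);
        @events[execname] = count();
        self->on = 0;
}
```

The `quantize()` aggregation yields a power-of-two histogram per process, which is what makes burst-length patterns (many sub-millisecond slices vs. a few long ones) visible at a glance.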

What the off-CPU distribution showed

The off-CPU time distribution was the key output. Rather than looking at aggregate CPU consumption, the probe data was aggregated to show the distribution of time each process spent waiting to be scheduled after becoming runnable. The pattern that emerged was clear and specific.

PHP-FPM workers showed very short CPU bursts, execution slices of a millisecond or two, followed by frequent off-CPU transitions. Individual workers were spending roughly 1–2ms on CPU and 3–6ms off CPU, repeated thousands of times per second across the pool. The off-CPU time wasn't I/O wait; it was scheduler wait. Workers that were ready to run were sitting in the run queue, waiting for a core to become available, because the cores were occupied by other workers in the same situation.

Background job workers showed a different pattern: longer individual CPU slices, heavier per-execution work, but constantly preempted by the PHP-FPM pool's high-frequency scheduling pressure. The background workers were being interrupted mid-execution not because their work was done but because the scheduler was allocating time slices across too many competing threads.

The two-dimensional finding

When the off-CPU data was aggregated two ways — by event count and by total off-CPU time — the picture became unambiguous:

PHP-FPM dominated the event count. The sheer number of context switches it was generating was disproportionate to its actual workload — each short-lived request execution was generating multiple scheduling events, creating constant churn in the run queue.

Background workers dominated the total off-CPU time. Their individual waits were longer because the PHP-FPM churn was consistently displacing them. A background job that needed 50ms of uninterrupted CPU time to complete a batch operation was getting 2–3ms slices at a time, interrupted repeatedly, with the overhead of each interruption adding to its total wall-clock execution time without contributing anything to throughput.

Neither process looked bad in isolation. PHP-FPM's per-request CPU consumption was normal. Background workers' per-job CPU consumption was normal. The problem was the interaction between them — two pools with incompatible execution profiles, both oversubscribed relative to the available cores, thrashing the scheduler between them.
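The two-dimensional view comes from aggregating the same off-CPU gap two ways at once: by event count, which shows who generates the churn, and by total wait time, which shows who pays for it. A hedged sketch of that aggregation:

```d
#!/usr/sbin/dtrace -s
/* Sketch: the same off-CPU gap, aggregated two ways.
   High event count = a source of churn; high total wait = a victim of it. */

sched:::off-cpu
{
        self->off = timestamp;
}

sched:::on-cpu
/self->off/
{
        @nevents[execname] = count();
        @twait[execname] = sum(timestamp - self->off);
        self->off = 0;
}

END
{
        printa("%-20s %@d scheduling events\n", @nevents);
        printa("%-20s %@d ns total off-CPU\n", @twait);
}
```

Sorting the first aggregation surfaces PHP-FPM's churn; sorting the second surfaces the background workers' fragmented waits, even though neither looks abnormal in a CPU-time view.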

Before: ~100 runnable threads on 16 cores

PHP-FPM was configured with 70–80 child processes. Background workers were running 20–30 concurrent processes. A handful of system daemons accounted for the remainder. No single number was obviously wrong. The total — approximately 100 runnable threads competing for 16 cores — was the problem, and it had no single origin.

After: aligned to the hardware

PHP-FPM children were reduced to 32–40 — roughly twice the core count, which accounts for the I/O wait time inherent in request processing without massively oversubscribing the scheduler. Background workers were capped at 8–12 concurrent processes — enough to keep the queue moving without competing destructively with PHP-FPM's bursty pattern. Some queue processing was batched to reduce the frequency of background process spawning, which reduced scheduler churn from process lifecycle events.

Nothing exotic. No new hardware. No architectural changes. The changes were made in configuration files and measured against the same DTrace probes used for diagnosis.

What the probe data showed after

The off-CPU event count for PHP-FPM dropped substantially — not because requests were fewer, but because each worker was getting longer uninterrupted execution windows before being descheduled. Background workers' total off-CPU time normalized — they were completing their work in fewer, longer CPU slices rather than many fragmented ones. The on-CPU to off-CPU ratio for all process classes moved toward a healthier balance.

CPU utilization increased. The system was doing more actual work per unit time because it was spending less time on scheduling overhead. This is the counterintuitive result that the DTrace data makes legible: a system can use more CPU and be faster, if the additional CPU usage is productive work rather than context-switch overhead.

top measures CPU utilization — the fraction of time CPUs are executing instructions. Context switches look like CPU execution. The overhead of a context switch — saving registers, flushing the TLB, loading the next thread's state — is real CPU work. It appears in the CPU utilization number as time spent doing something. top cannot distinguish between "CPU time spent on useful application work" and "CPU time spent on scheduler bookkeeping." They look identical.

Load average measures the length of the run queue — how many threads are runnable or running at any given moment. A load average of 6.0 on a 16-core system looks modest. It doesn't reveal that the 6.0 average is the result of 100 threads taking turns in the queue, each getting very short slices, generating hundreds of thousands of context switches per second. The number is low. The efficiency is not.

DTrace's scheduler probes measure something neither metric touches: the distribution of waiting time. Not "are there threads waiting" — load average covers that — but "how long are threads waiting, how often, and which processes are responsible for making others wait." That's the measurement that reveals scheduler pressure, and it's only visible below the abstraction layer that top and load average operate at.

CPU during problem

Unremarkable. Context switch overhead is indistinguishable from useful work in utilization metrics.

Thread ratio

Threads spending 3–6ms waiting for every 1–2ms of execution. Thousands of times per second.

Root cause

~100 runnable threads on 16 cores. PHP-FPM churn creating pressure. Background workers starved of uninterrupted slices.

Fix

PHP-FPM 70–80 → 32–40. Background workers 20–30 → 8–12. Config files. No hardware.

Result

CPU utilization increased. System felt dramatically faster. More useful work per unit time, less scheduling overhead.

CPU utilization is a measure of how busy the CPU is. It is not a measure of how efficiently it is being used. A system running at 40% CPU utilization can be wasting a significant fraction of that 40% on context switch overhead, and the only way to see that waste is to measure it where it lives: in the scheduler, with kernel-level instrumentation. Load average tells you how many threads want to run. It does not tell you how long they wait before they get to. Those are different numbers, and in an oversubscribed system they can diverge dramatically.

The configuration that caused this didn't arrive suddenly. It drifted into place over months — worker counts set to round numbers, copied from other servers, incremented when queues backed up, never decremented when workloads changed. Every individual decision made sense. The cumulative result was a server spending more time shuffling threads than executing them. Two DTrace probes and an afternoon of analysis made that visible. Configuration changes that could have been made at any point in the preceding months fixed it the same day.

Most systems running for more than a year without deliberate concurrency review are probably oversubscribed. The ones that aren't getting slower are the ones nobody's looked at yet.

