Kernel-level diagnosis
The situation
The server looked fine. A FreeBSD host with 64GB of RAM running nginx, PHP-FPM, and a Valkey node — the modern Redis fork — showed CPU utilization in single digits, load average comfortably low, and memory headroom that appeared healthy by every conventional metric. Nothing in top, nothing in htop, nothing in the basic monitoring dashboards suggested a system under any meaningful stress.
And yet: requests stalled intermittently. Latency spiked under traffic loads that the hardware should have handled without blinking. The spikes weren't long enough to trigger uptime alerts. They were short enough that most requests didn't catch them at all. But the ones that did were noticeably slow — slow enough that users noticed, slow enough that the client noticed, not slow enough that any standard diagnostic tool pointed anywhere useful.
This is the hardest category of performance problem to diagnose: a system that is objectively underutilized by every measure that's easy to collect, but that intermittently fails to deliver on what that utilization number implies it should be capable of.
CPU at 8%. Load average under 2. Memory showing gigabytes free. Every dashboard green. And yet something was making requests wait — briefly, invisibly, and consistently enough to matter.
The wrong hypotheses
The initial assumptions followed the standard diagnostic ladder. PHP-FPM pool saturation was the first candidate — a pool that was too small, workers blocking on slow upstream calls, request queuing at the FPM layer. The pool metrics showed headroom. PHP-FPM wasn't saturated.
Network jitter was considered — latency introduced between the load balancer and the application server, or between the application server and the database. Network traces showed nothing consistent with the pattern. The jitter, where it existed, didn't correlate with the latency spikes.
The ZFS ARC was the candidate nobody wanted to examine, because the numbers looked so comfortable: 64GB of RAM, an ARC sitting at what appeared to be a reasonable size, and the standard assumption that plenty of memory meant the ARC was doing its job. That assumption was wrong.
The diagnostic process — going below the application
The usual tools had been exhausted. The problem was invisible to anything that measured at the application or OS process level because it wasn't a process-level problem. The latency was originating somewhere in the interaction between the kernel's memory management, the filesystem cache, and the allocator behavior of a userspace application. To see it, the instrumentation had to cross that boundary.
DTrace is FreeBSD's dynamic tracing framework — it allows arbitrary instrumentation of the running kernel and userspace processes without modifying binaries, without rebooting, and without the overhead of kernel debug builds. Probes can be attached to virtually any function in the kernel, any system call, any scheduler event, and many application-level functions if the application was compiled with DTrace support. The output is programmable: not a raw dump of events, but aggregated, filtered, time-correlated data shaped by the script that collected it.
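As a flavor of what "programmable output" means, here is a minimal D script (not from the incident itself, just an illustration using the standard syscall and profile providers) that aggregates system calls per process over a ten-second window:

```d
/* count_syscalls.d — run with: dtrace -s count_syscalls.d
 * Attaches to every system-call entry probe and builds an
 * in-kernel aggregation keyed by executable name. */
syscall:::entry
{
        @calls[execname] = count();
}

/* The profile provider's tick probe ends the trace after 10s;
 * DTrace prints the aggregation automatically on exit. */
tick-10s
{
        exit(0);
}
```

The aggregation is built in the kernel and only the summarized result crosses into userspace, which is why this kind of tracing is cheap enough to run on a production host.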
The first instrumentation pass targeted the Virtual File System layer, specifically the kernel's read and write paths (on FreeBSD, vn_read and vn_write, reachable via fbt probes), to measure how long individual filesystem operations were taking and whether there was a population of outlier reads that corresponded to the latency spike timing. There was. A subset of read operations was taking significantly longer than the median: not long enough to show up as high I/O wait in aggregate statistics, but long enough to introduce measurable stalls in request handling threads that were waiting on those reads.
The reads that were slow were reads that should have been served from the ZFS ARC — data that had been accessed recently enough that it ought to be in cache. They weren't being served from cache. They were going to disk.
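A sketch of the kind of script behind that first pass. The exact probe points vary by FreeBSD release (the fbt probes on vn_read are one option on recent systems), so treat the names as illustrative rather than authoritative:

```d
/* vn_read_lat.d — per-read latency as a power-of-two histogram.
 * self-> variables are thread-local, so concurrent reads on
 * different threads don't clobber each other's timestamps. */
fbt::vn_read:entry
{
        self->ts = timestamp;
}

fbt::vn_read:return
/self->ts/
{
        /* quantize() buckets latency; reads that went to disk
         * show up as a second mode in the distribution, well
         * separated from the cache-hit population. */
        @lat[execname] = quantize(timestamp - self->ts);
        self->ts = 0;
}
```

The histogram output is what exposes the outlier population: an average would have smoothed it into invisibility.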
The second instrumentation pass targeted ZFS ARC internals using the arc_* DTrace probes available in FreeBSD's ZFS implementation. The ARC — Adaptive Replacement Cache — is ZFS's in-memory read cache, and it is one of the primary reasons ZFS performs well on read-heavy workloads. When it's working correctly, frequently accessed data lives in RAM and disk reads are rare. When it isn't, reads that appear to be cache hits at the application level are actually going to disk because the data was evicted.
The ARC probe data showed active eviction occurring at a rate inconsistent with the apparent memory availability. The ARC was shrinking — shedding cached data — at moments that correlated precisely with the slow VFS reads and with the application-level latency spikes. The ARC was not, in fact, large enough to hold the working set. It was being compressed by something competing for the same physical memory.
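The shape of the second pass, hedged the same way: OpenZFS exposes statically defined tracing (SDT) points for ARC activity, and the names available on a given kernel can be listed with dtrace -l -n 'sdt:::arc-*'. A minimal per-second hit/miss counter might look like:

```d
/* arc_ratio.d — per-second ARC hit/miss counts.
 * Probe names are illustrative; confirm what your kernel
 * exposes with: dtrace -l -n 'sdt:::arc-*' */
sdt:::arc-hit  { @events["hit"]  = count(); }
sdt:::arc-miss { @events["miss"] = count(); }

tick-1s
{
        printa("%-6s %@8d\n", @events);
        trunc(@events);
}
```

Correlating the per-second miss bursts against the vn_read outliers and the application latency spikes is what tied the three layers together.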
A third pass instrumented scheduler wake latency — the time between a thread becoming runnable and actually being scheduled to run. During ARC eviction events, threads that had been sleeping waiting for I/O were experiencing longer-than-expected wake latency. The system wasn't CPU-bound — there were idle cycles available — but the combination of I/O completion events and memory pressure was introducing measurable scheduling jitter. This was the micro-stall mechanism: not CPU saturation, not I/O saturation in isolation, but the combined effect of synchronous reads from disk and the scheduling overhead around them.
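The third pass follows a standard DTrace pattern for wake-to-run latency, pairing sched:::wakeup with sched:::on-cpu. Field names below come from the stable lwpsinfo_t translator; details vary by release, so this is a sketch of the technique rather than the exact script:

```d
/* wake_lat.d — time from "made runnable" to "actually running".
 * In sched:::wakeup, args[0] is the lwpsinfo_t of the thread
 * being woken; record when it became runnable. */
sched:::wakeup
{
        wake[args[0]->pr_lwpid] = timestamp;
}

/* When any thread we saw woken actually gets the CPU, emit the
 * elapsed time as a histogram keyed by process name. */
sched:::on-cpu
/wake[curlwpsinfo->pr_lwpid]/
{
        @lat[execname] = quantize(timestamp - wake[curlwpsinfo->pr_lwpid]);
        wake[curlwpsinfo->pr_lwpid] = 0;
}
```

Run alongside the ARC script, this is what showed the wake-latency tail stretching during eviction events.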
The root cause — two interacting problems
The ARC eviction was being driven by memory pressure from Valkey. The Valkey node's maxmemory configuration was set without accounting for the actual memory footprint of the process — and the actual footprint was substantially larger than the configured limit, because of jemalloc fragmentation.
jemalloc is the default allocator on FreeBSD and is also used by Valkey internally. Fragmentation occurs when allocated memory is freed in patterns that leave gaps in the allocator's internal structures — gaps that count against the process's virtual memory but aren't usable for new allocations. A Valkey process configured for a 20GB maxmemory limit was actually consuming significantly more physical RAM than that limit implied, because the reported RSS included fragmented allocator space that Valkey's own memory accounting didn't recognize as "used."
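One way to see that gap directly (assuming a stock Valkey, whose INFO command is Redis-compatible) is to compare the server's own accounting against the OS view of its RSS:

```sh
# mem_fragmentation_ratio = used_memory_rss / used_memory.
# A ratio well above ~1.1 under steady load points at allocator
# fragmentation rather than genuine data growth.
valkey-cli info memory | \
    grep -E '^(used_memory|used_memory_rss|mem_fragmentation_ratio):'
```

When used_memory sits near the configured maxmemory but used_memory_rss is substantially higher, the difference is memory the OS has committed but Valkey's eviction logic will never reclaim.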
The result: the OS saw a process consuming far more physical memory than Valkey believed it was using. The ZFS ARC — which yields memory to other processes under pressure — responded by evicting cached data to accommodate what looked like legitimate memory demand. From the ARC's perspective, it was behaving correctly. From Valkey's perspective, it was behaving correctly. The problem existed entirely in the gap between the two accounting systems.
Neither ZFS nor Valkey was misbehaving in isolation. The problem was in the boundary between them — an accounting gap that was invisible to any tool that looked at only one side of it.
The fix
The ZFS ARC's size was pinned explicitly: vfs.zfs.arc_max to set a deliberate ceiling, and vfs.zfs.arc_min to reserve a floor that Valkey's pressure could not displace. Previously, the ARC had been sizing itself dynamically based on apparent memory availability, which made it vulnerable to the fragmentation-inflated RSS of the Valkey process. With an explicit floor, the ARC held its working set under pressure.
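Concretely, as loader tunables. The values here are illustrative, not the ones from this incident; the floor should be sized from the measured working set:

```
# /boot/loader.conf — applied at boot; both tunables are also
# visible as sysctls at runtime.
vfs.zfs.arc_max="24G"   # ceiling: the ARC never grows past this
vfs.zfs.arc_min="16G"   # floor: memory pressure cannot shrink it below this
```

The floor is the part that matters here: it converts the ARC from a passive victim of memory pressure into a reserved budget.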
Valkey's maxmemory was adjusted downward to leave explicit headroom for allocator fragmentation overhead — sized based on the observed RSS-to-logical-usage ratio rather than the theoretical maximum. The activedefrag setting and active-defrag-ignore-bytes threshold were configured to allow Valkey to reclaim fragmented allocator space proactively rather than letting fragmentation accumulate until the next eviction cycle.
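And the corresponding configuration sketch on the Valkey side. The directives are the stock Redis-compatible ones; the values are illustrative:

```
# valkey.conf
maxmemory 16gb                    # sized from the observed RSS-to-used ratio,
                                  # leaving headroom for allocator overhead
activedefrag yes                  # reclaim fragmented arenas proactively
active-defrag-ignore-bytes 256mb  # skip defrag below this much waste
```

The combination bounds both sides of the accounting gap: maxmemory limits what Valkey thinks it is using, and active defragmentation limits how far RSS can drift above that.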
With the changes in place, the ARC probe data was collected again under equivalent load. ARC eviction events dropped to near zero for the working set. VFS read latency returned to the sub-millisecond range consistently. Scheduler wake latency normalized. The latency spikes in the application layer disappeared.
The ARC hit ratio — the fraction of reads served from memory rather than disk — stabilized at a level consistent with the working set fitting comfortably in the available cache space. The system that had appeared to have plenty of memory turned out to have been starving its filesystem cache to feed a fragmentation artifact. With the two sides of the memory budget made explicit, both worked correctly.
Visible symptoms: intermittent latency spikes under light load; CPU and memory dashboards showed nothing wrong.
Root cause: jemalloc fragmentation inflating Valkey RSS, displacing the ZFS ARC and forcing synchronous disk reads.
Diagnostic method: DTrace ARC, VFS read, and scheduler probes; kernel-level visibility unavailable to conventional tools.
Fix: explicit ARC sizing, corrected Valkey maxmemory, and an activedefrag policy.
Hardware upgrade: none required. The existing 64GB was sufficient; it was being accounted for incorrectly.
The lesson
CPU utilization, load average, and top-level memory statistics are process-level accounting. They cannot see contention at the boundary between subsystems — between the ZFS ARC and a userspace allocator, between kernel memory reclaim and application memory accounting, between what an allocator reports as free and what the OS actually has available for caching. These boundaries are where the hardest performance problems live, because they're invisible to every tool that looks at only one side.
DTrace crosses those boundaries. The ability to simultaneously instrument vfs_read latency, ARC eviction events, and scheduler wake time — and correlate them in a single aggregated output — is what made this diagnosis possible. Without that cross-layer visibility, the standard diagnostic path would have continued eliminating process-level candidates while the actual cause remained invisible. The system would have eventually been "fixed" with a hardware upgrade that wasn't needed, or left in a state of unexplained intermittent slowness that everyone had learned to tolerate.
The fix was three configuration changes. The diagnosis was the hard part.
Remote-first. Dallas-based. Available until 2am CT.