Diagnosis — load balancer & FastCGI
The situation
Users were hitting 500 and 502 errors. Not constantly — intermittently, under load, in a pattern that resisted easy reproduction. The errors would cluster, resolve, cluster again. Support tickets accumulated. The development team looked at the application logs and found nothing that correlated. The operations team looked at the infrastructure and found nothing obviously wrong.
HAProxy reported all backends healthy. The PHP-FPM pools were up. The load balancer's health check dashboard showed green across the board. From every vantage point that had a dashboard, the system was functioning normally. The 502s were coming from somewhere, but nothing that was being monitored could see where.
The initial theories followed the obvious path. An application bug that only surfaced under concurrency. An upstream timeout — the database taking too long on certain queries, requests timing out before a response was ready. Both plausible, neither confirmed by the evidence available. The logs didn't point anywhere specific. The errors were real and the monitoring was useless.
HAProxy said the backends were healthy. PHP-FPM said the pools were up. Both statements were technically true and operationally meaningless — because the health checks weren't testing the thing that was broken.
Understanding the failure mode
Before reaching for diagnostic tools, it was worth understanding what HAProxy's FastCGI health checks actually verify. HAProxy's built-in FastCGI health check — the standard configuration for PHP-FPM backends — establishes a TCP connection to the FPM socket, completes a FastCGI BEGIN_REQUEST / PARAMS handshake, and marks the backend healthy if that handshake succeeds. It does not send a real request. It does not wait for PHP to execute any code. It does not touch the database, the cache, or any upstream dependency.
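For concreteness, a handshake-level check in a typical configuration looks something like the sketch below (HAProxy 2.1+ fcgi-app support assumed; the backend name, address, and docroot are illustrative, not taken from the incident):

```
# Sketch of a FastCGI backend for PHP-FPM. The bare "check" on the
# server line validates connection-level health only -- it never
# executes any PHP, so a worker blocked on a database read still passes.
fcgi-app php-fpm-app
    docroot /var/www/html
    index index.php

backend php_backend
    mode http
    use-fcgi-app php-fpm-app
    server fpm1 127.0.0.1:9000 proto fcgi check
```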
A PHP-FPM worker that is blocked waiting for a database response will still accept a new TCP connection. It will still complete a FastCGI handshake on that connection. It will pass the health check. And it will be unable to actually process a request, because it is already committed to a blocking operation that hasn't returned.
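The TCP half of this is easy to demonstrate outside PHP-FPM entirely. In the minimal Python sketch below (illustrative, not the production setup), a process that listens but never calls accept(), the analogue of a worker committed to a blocking operation, still completes TCP handshakes, because the kernel finishes them in the listen backlog:

```python
# Illustrative sketch: a "busy" server that never calls accept() still
# passes a connection-level health probe, because the kernel completes
# TCP handshakes in the listen backlog without the process's involvement.
import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("127.0.0.1", 0))
srv.listen(8)                    # kernel will queue handshakes here
port = srv.getsockname()[1]

# The "worker" never accepts -- imagine it blocked on a database socket.
# A naive connectivity check against it still succeeds:
probe = socket.create_connection(("127.0.0.1", port), timeout=2)
print("handshake completed:", probe.fileno() >= 0)   # handshake completed: True
probe.close()
srv.close()
```

The same mechanism is why a deeper protocol handshake can also succeed: nothing about connection establishment requires the application to be able to do useful work.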
This is the black hole: a worker that looks healthy from the outside because it can shake hands, but is functionally unavailable because it's already waiting for something else. As more workers entered this state — blocked on slow or hung database connections — the pool's effective capacity shrank while the reported capacity stayed constant. HAProxy kept sending requests to workers that couldn't handle them. 502s accumulated. Health checks kept passing.
The diagnostic process
ktrace is FreeBSD's kernel trace facility: it records the system calls made by a process (and, optionally, its children), capturing call arguments, return values, and timing. It is lower-level than application logging and more targeted than DTrace for tracing a specific process's interaction with the kernel. Attached to the PHP-FPM worker processes, it produced a detailed record of what each worker was actually doing: which system calls it was making, in what sequence, and, crucially, where it was blocking.
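The invocation is short. A session along these lines (the PID is a placeholder; FreeBSD only) records the trace and produces a readable dump:

```
# Attach to a running PHP-FPM worker and record its syscalls (FreeBSD).
ktrace -p 1234        # 1234: an FPM worker PID, found via ps or pgrep
# ...reproduce the load, then stop tracing:
ktrace -C
# Decode ktrace.out; -T adds timestamps, which make long blocks obvious:
kdump -T | less
```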
The trace showed a consistent pattern. Workers would accept a connection from HAProxy's health check, completing the handshake that marked them healthy. Then, when an actual request arrived, they would read the FastCGI request data, begin processing, make a downstream connection for a database query, and block in kevent, waiting for the database socket to become readable. The worker was now parked in a blocking wait. It had accepted the connection from HAProxy that signaled it was ready, but it was not in any meaningful sense ready. It was stuck.
The kevent blocks were long — far longer than any acceptable request processing time. Some workers were blocked for tens of seconds. During that time, they were accepting health check handshakes on one socket while blocking on a database socket on another. The two operations were happening concurrently and independently, which is exactly why the health checks appeared clean while requests were failing.
HAProxy's detailed logging, enabled with option log-health-checks plus the full timing fields in the access log, was compared against the ktrace output. The correlation was clear: 502s occurred when the number of workers available for new requests dropped below the concurrent request rate, driven by workers stuck in the kevent wait. The health checks reflected none of this; they succeeded independently of worker availability because they were testing a different thing entirely.
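The relevant logging knobs are small. A fragment along these lines (backend name illustrative) makes both check transitions and per-request timing visible:

```
backend php_backend
    log global
    option log-health-checks    # log health check results and state changes
    # In the frontend, "option httplog" adds the per-phase timing fields
    # (Tq/Tw/Tc/Tr/Tt in the classic httplog format) that were lined up
    # against the ktrace timestamps.
```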
The fix — three changes, each addressing a different layer
The HAProxy health check configuration was replaced with a check that exercised a real application endpoint — a lightweight purpose-built status route that performed a representative database read and returned a response indicating actual backend health. This check would fail if the worker was blocked, if the database was unreachable, or if the application layer between them was broken. It would not pass simply because a TCP handshake succeeded.
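In configuration terms, the change is small. A sketch follows; the /status route name and the timing thresholds are illustrative, and it assumes an HAProxy version whose FastCGI mux can carry HTTP-mode health checks to the FPM backend:

```
backend php_backend
    mode http
    use-fcgi-app php-fpm-app
    option httpchk GET /status
    http-check expect status 200
    # inter/fall/rise control how quickly a blocked worker leaves and
    # re-enters rotation; these values are starting points, not tuning advice.
    server fpm1 127.0.0.1:9000 proto fcgi check inter 2s fall 2 rise 3
```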
This change alone eliminated the false-healthy backend state. Workers that were deadlocked on database I/O now failed their health checks and were removed from the backend pool by HAProxy rather than continuing to receive requests they couldn't serve. The 502 rate dropped immediately — not because the underlying blocking was fixed, but because HAProxy was now routing around affected workers rather than continuing to send traffic to them.
The root condition enabling the deadlock was database queries with no enforced timeout. A slow or hung query would hold a worker indefinitely — not for the duration of a request, but potentially until the database connection was dropped by the server's own wait_timeout. On a pool of limited size, a handful of these was enough to saturate available workers.
Query timeouts were introduced at both the application level and the database level. At the application level, connection and query timeouts were set explicitly rather than relying on the database server's defaults. At the database level, wait_timeout and interactive_timeout were tuned to ensure that idle connections were not held indefinitely. The combination ensured that a blocked worker would eventually unblock — and that the maximum time a worker could be held in the blocked state was bounded and predictable.
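On the database side, the bounding settings look like the fragment below (MySQL assumed, since wait_timeout and interactive_timeout are MySQL variable names; every value is a placeholder to tune per workload). The application-side connect and query timeouts live in whatever client library is in use, for example PDO's ATTR_TIMEOUT attribute in PHP.

```
# my.cnf sketch -- values illustrative, not recommendations
[mysqld]
wait_timeout        = 60      # drop idle non-interactive connections after 60s
interactive_timeout = 60      # same bound for interactive sessions
max_execution_time  = 10000   # ms; server-side cap on SELECTs (MySQL >= 5.7.8)
```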
The PHP-FPM pool's max_children setting had been sized based on available memory per worker — a reasonable starting point but one that doesn't account for the database connection limit. If every worker simultaneously opened a database connection, the total connection count would exceed the database's max_connections, causing connection failures for workers that were otherwise healthy. max_children was set to a value that allowed headroom below the database's connection limit, and a connection pool was introduced to manage database connections across the worker pool more efficiently.
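The sizing arithmetic is worth making explicit. With, say, two app servers and a database max_connections of 150 (both numbers illustrative), capping each pool at 40 workers keeps peak demand at 80 connections and leaves roughly 70 of headroom for replication, admin sessions, and the health check route:

```
; www.conf sketch -- values are illustrative, derived from the headroom
; arithmetic above rather than from memory-per-worker alone.
pm = dynamic
pm.max_children = 40              ; 2 servers x 40 = 80 < max_connections (150)
pm.start_servers = 10
pm.min_spare_servers = 5
pm.max_spare_servers = 15
request_terminate_timeout = 30s   ; hard upper bound on any single request
```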
What changed — and why it stayed fixed
The immediate result was the elimination of the false-healthy backend state. With health checks that tested the real request path, HAProxy's backend pool accurately reflected which workers could actually serve requests. Workers that were blocked were removed from rotation. Workers that recovered were returned to rotation. The load balancer's view of the backend was no longer a fiction maintained by a handshake that tested nothing meaningful.
The query timeouts ensured that blocking was bounded. Workers could still block — the underlying database interactions hadn't changed — but now a blocked worker would unblock in a predictable, bounded time rather than potentially staying blocked indefinitely. Combined with the corrected max_children sizing, the pool had sufficient capacity to absorb transient blocking without cascading into pool exhaustion.
The 500/502 error rate stabilized. Request success rate under load became consistent rather than periodic and degraded. The system that had appeared to be functioning normally — because its health checks said so — was now actually functioning normally, with monitoring that could tell the difference.
Health checks during incident
Passing throughout. FastCGI handshake checks validated TCP connectivity, not worker availability.
Root cause
Workers blocked on DB I/O via kevent — still accepting health check connections while unable to serve requests.
Diagnostic method
ktrace on PHP-FPM workers captured the accept → read → kevent blocking sequence, correlated against HAProxy's timing logs.
Primary fix
HAProxy checks replaced with real request path validation — deadlocked workers now correctly marked unhealthy.
Supporting fixes
Query timeouts bounded blocking duration. max_children sized against DB concurrency limit, not memory alone.
The lesson
A health check is a test. The question is what it's actually testing. HAProxy's built-in FastCGI check tests whether a PHP-FPM process can complete a protocol handshake. That is a necessary condition for a healthy worker — but it is not a sufficient one. A worker that is blocked waiting for a database response can complete a handshake and accept new connections while being functionally unable to process any of them. The health check passes. The worker is broken. The two facts coexist without contradiction because they're measuring different things.
The same principle extends to any health check that validates presence rather than function: a process that responds to a ping but can't serve requests, a service that accepts TCP connections but has exhausted its upstream connections, a cache that reports up but has lost its backing store. What the check tests is the boundary of what you know. Anything beyond that boundary is an assumption. When the assumption is wrong, the monitoring lies — not because it's broken, but because it was never testing the right thing to begin with.
The fix for monitoring that lies is monitoring that tests what you actually care about. In this case: not "can this process shake hands" but "can this process serve a request." The distinction is a single configuration change in HAProxy. The diagnostic work to understand why that distinction mattered required tracing worker behavior at the system call level.