
Diagnosis — TCP stack & application threading

The network problem that wasn't: DTrace on the TCP stack and a blocking external API call

Symptom: 5–8 second API delays
Initial diagnosis: Network latency / upstream provider
Tools: DTrace tcp:::send · tcp:::receive · ip:::send
Tracked: Retransmits · socket buffer · send queue · thread state
Root cause: Synchronous blocking external API call
Fix: Async handling · timeouts · circuit breaker

Certain API endpoints were taking 5 to 8 seconds to respond. Not always — intermittently, under specific conditions, and with a pattern that made the network the obvious suspect: the delays correlated with traffic from specific geographic regions. Requests from some locations were consistently fast. Requests from others hit the delay with enough regularity that the pattern was undeniable.

The client's hosting provider was contacted. Traceroutes were run. Routing paths were examined. The provider found nothing wrong on their side — no packet loss, no congestion on the relevant paths, latency measurements consistent with geographic distance. Everything a network-level investigation could look at looked clean.

The leading theory at this point was a routing anomaly with a specific upstream provider — traffic from certain regions transiting a congested peering point, something that would show up as latency but not as loss, something the client's provider couldn't see because it was happening further up the path. A plausible theory. Also wrong.

The geographic correlation was real. The conclusion that it implied a network problem was not. The correlation was pointing at the cause from the wrong direction entirely.

A delay that correlates with traffic source geography doesn't have to be caused by network latency between the client and the server. It can also be caused by latency in a server-side operation that itself depends on where the request comes from — a geolocation lookup, a fraud check, a third-party API call that takes the request's origin as an input. If the server makes an outbound call to an external API as part of processing a request, and that external API responds slowly for certain input parameters, the resulting delay will appear geographic from the outside even though the network path between the end user and the server is entirely clean.

This is the misdirect: the geographic pattern was pointing at the right variable — something that depended on request origin — but the natural interpretation of that pattern pointed at the network rather than at the application logic that was consuming the origin data.

Instrumenting the TCP stack with DTrace

The investigation started at the network layer — not because the network was suspected, but because the network layer is where the hypothesis could be definitively ruled in or out. DTrace probes were attached to the FreeBSD TCP stack: tcp:::send and tcp:::receive to trace packet-level activity on the relevant connections, ip:::send to capture the IP-layer view, and probes tracking retransmit events and socket buffer state.
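The shape of that instrumentation can be sketched as a D script along these lines. This is an illustration, not the actual script used; the listener port (8080) and the one-second output cadence are assumptions:

```d
#!/usr/sbin/dtrace -s
/* Count packet-level TCP activity per remote address on the service port.
   Silence in both aggregations during a delay window means the stack is idle. */

tcp:::send
/ args[4]->tcp_sport == 8080 /
{
	@out[args[2]->ip_daddr] = count();
}

tcp:::receive
/ args[4]->tcp_dport == 8080 /
{
	@in[args[2]->ip_saddr] = count();
}

tick-1s
{
	printa("sent to   %-20s %@d\n", @out);
	printa("recv from %-20s %@d\n", @in);
	trunc(@out);
	trunc(@in);
}
```

With per-second output like this, a 5-to-8-second gap with no sends on an otherwise open connection is immediately visible.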

If the delay was a network problem, the TCP trace would show it clearly: retransmits during the delay window, socket buffer growth indicating backpressure from a congested path, send queue depth increasing as packets waited for acknowledgement. The trace would show the TCP stack working — sending, waiting, retransmitting — during the 5-to-8-second delay.

That's not what the trace showed. During the delay window, the TCP stack was idle. No sends. No retransmits. No buffer pressure. The connection between the client and the server was open, healthy, and doing absolutely nothing. The TCP stack wasn't struggling with a bad network path. It was waiting for the application to give it something to send.

The delay was before the response, not during it

This finding inverted the problem entirely. A network latency problem would show up as delay during transmission — the TCP stack working hard, packets slow to acknowledge, retransmits accumulating. What the trace showed instead was silence at the TCP layer followed by normal fast transmission once the application was ready to respond. The 5-to-8-second window was time the application was spending before it wrote a single byte to the socket. The network wasn't involved at all. The delay was in the application thread, before the response was handed to the TCP stack.

With the network eliminated as a cause, the question became: what was the application doing for 5 to 8 seconds before it was ready to respond? The answer came from correlating the timing of the affected requests with outbound connection events from the server — connections from the application server to an external API endpoint. The affected requests were the ones that triggered an outbound call to an external service. The external service was responding slowly for certain input parameters — specifically those derived from the geographic origin of the incoming request. The application was blocking synchronously on that response, holding the request thread idle while it waited. From the outside, this looked like network latency. From the inside, it was a thread parked on a slow synchronous call.

Async handling for the external API call

The external API call was refactored to be non-blocking. Rather than holding the request thread synchronously while waiting for a response from the external service, the call was restructured so that the application could continue processing and respond to the user while the external API interaction completed independently. For cases where the external response was genuinely required before a response could be sent, a timeout was enforced — a maximum wait time after which the application would proceed with a default behavior rather than blocking indefinitely.
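The restructuring can be sketched as follows. The function names, the "region" input, and the thresholds are illustrative stand-ins, not the client's actual code; the pattern is what matters: submit the external call off the request thread, wait only up to a bounded deadline, and fall back to a default rather than blocking.

```python
import concurrent.futures
import time

# Hypothetical stand-in for the slow external service.
def external_lookup(origin: str) -> str:
    if origin == "region-slow":
        time.sleep(0.5)  # simulates the slow path for certain inputs
    return f"result-for-{origin}"

DEFAULT_RESULT = "default"
POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def handle_request(origin: str, max_wait: float = 0.1) -> str:
    """Submit the external call off-thread; wait at most max_wait seconds,
    then respond with a default instead of parking the request thread."""
    future = POOL.submit(external_lookup, origin)
    try:
        return future.result(timeout=max_wait)
    except concurrent.futures.TimeoutError:
        # Respond now; the external call completes in the background.
        return DEFAULT_RESULT
```

A fast external response still flows through normally; a slow one costs the caller at most `max_wait` rather than the full external latency.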

Explicit timeouts on all external calls

The immediate issue was one external API call without a timeout. The broader issue was an application that made external calls without consistently enforcing time limits on how long those calls could take. Timeouts were introduced as a standard pattern across all external dependency calls — HTTP clients, socket connections, any I/O operation with a latency profile outside the application's direct control. A slow external service now degrades the application gracefully instead of blocking its request threads.
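One way to make the time limit a standard pattern rather than a per-call-site decision is a single wrapper that every external call goes through. This is a minimal sketch under that assumption, with illustrative deadlines:

```python
import concurrent.futures
import functools
import time

_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=8)

def with_timeout(seconds: float):
    """Decorator: run the wrapped call off-thread and raise TimeoutError
    if it exceeds the deadline, instead of blocking the caller forever."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            return _POOL.submit(fn, *args, **kwargs).result(timeout=seconds)
        return wrapper
    return deco

# Hypothetical call sites: every external dependency gets a deadline.
@with_timeout(0.5)
def fast_op():
    return "ok"

@with_timeout(0.1)
def slow_op():
    time.sleep(1)  # simulates an external service that has gone slow
    return "done"
```

The point of centralizing this is that no new call site can quietly omit the deadline.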

Circuit breaker behavior

A timeout prevents indefinite blocking but doesn't prevent the application from repeatedly attempting calls to an external service that is consistently slow or unavailable. A circuit breaker pattern was introduced: after a configurable number of timeouts or errors from a given external endpoint within a time window, the circuit opens and subsequent calls fail immediately with a cached or default response rather than attempting the external call. When the external service recovers, the circuit closes and normal behavior resumes. This prevents a degraded external dependency from creating sustained load on application threads even when timeouts are in place.
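A minimal sketch of that behavior, assuming a simple consecutive-failure threshold and a single-threaded caller (production implementations add locking and richer state; the thresholds here are illustrative):

```python
import time

class CircuitBreaker:
    """After max_failures consecutive failures, fail fast with the fallback
    for reset_after seconds, then allow one probe call through (half-open)."""

    def __init__(self, max_failures=3, reset_after=30.0, fallback=None):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.fallback = fallback
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return self.fallback      # open: no external call is attempted
            self.opened_at = None         # half-open: allow one probe call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
                self.failures = 0
            return self.fallback
        self.failures = 0                 # a success closes the circuit
        return result
```

Once the circuit is open, calls return the fallback immediately and the degraded dependency sees no traffic until the reset window elapses.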

The perceived network latency disappeared because it had never been network latency. The 5-to-8-second delays were the external API's response time for specific input parameters, manifested as application thread blocking time, observable as geographic latency because the input parameters that triggered the slow path were derived from request origin. Once the blocking was eliminated — through async handling and enforced timeouts — response times normalized across all geographic traffic sources.

The geographic pattern ceased to exist, because the geographic variable no longer determined application behavior. Requests from previously-affected regions now behaved identically to requests from anywhere else — because the slow synchronous dependency that had been consuming their processing time was no longer blocking.

The hosting provider, which had spent time investigating a network problem that didn't exist, had been right all along: the problem was never on their side.

Initial hypothesis

Network routing anomaly. Hosting provider investigated and found nothing. They were right.

TCP stack during delay

Completely idle. No retransmits, no buffer pressure. The stack was waiting for the application.

Root cause

Synchronous blocking call to an external API that responded slowly for geography-derived input parameters.

Fix

Async handling + enforced timeouts + circuit breaker on external dependency calls.

Geographic pattern after fix

Gone. All regions responded identically once the blocking dependency was removed from the request path.

A symptom that correlates with geography points at something that varies by geography. That's a useful signal — but the list of things that vary by geography is longer than just network paths. Request origin is an input to application logic. If application logic branches on that input, and one branch is slower than another, the result looks geographic even if the cause is entirely within the application layer. The correct first question is not "what is the network doing during the delay" but "where is time being spent during the delay." DTrace on the TCP stack answered that question definitively in minutes: the network was idle. The delay was before the first byte was sent. Everything that followed from that answer pointed at the application.

The broader pattern — an external dependency called synchronously, without a timeout, in the hot path of a request — is one of the most common causes of latency that presents as something else. The external service is rarely the thing being monitored. It doesn't show up in CPU graphs, memory dashboards, or application error logs until it times out or fails completely. When it's merely slow, it appears as mysterious latency in the system that called it, attributed to whatever characteristic of the request happened to correlate with the slow code path. Finding it requires tracing at the boundary between the application and its dependencies — which is exactly what DTrace on the TCP stack provides.
