Diagnosis & modernization
The situation
The client operated a multinational business with employees and customers across multiple countries. Their internal web application — used daily by paying customers — had developed an intermittent fault: certain users would load a page and get nothing. A completely blank response. No error message, no timeout, just white.
The pattern was elusive. Some customers complained regularly. Others, often in the same organization, never experienced the problem at all. The affected users were concentrated in the Philippines, but not uniformly — within a single enterprise customer, some employees had the problem and others didn't, depending on which office they were working from that day.
At some point, someone discovered that connecting through a VPN fixed it. The workaround spread through the affected user base and became the accepted solution. The root cause went uninvestigated.
The monitoring said the site was up. Pingdom was running, checking from external locations, and showing green. As far as any automated system was concerned, nothing was wrong.
Pingdom makes an HTTP request and checks for a 200 response. Apache returning a blank page to a user in Manila looks identical to Apache serving a healthy page to a user in Dallas. The monitoring was telling the truth — it just wasn't asking the right question.
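The gap can be sketched in a few lines of shell. The values are invented for illustration, and it assumes — as the symptom suggests — that the blank response still carried a 200 status:

```shell
# Pingdom-style probe: pass if the HTTP status is 200, ignore the body.
# (A real probe is roughly: curl -s -o /dev/null -w '%{http_code}' <url>)
status=200   # what the dead worker's response still reported
body=""      # what the affected user actually received

[ "$status" -eq 200 ] && echo "status check: UP"

# A content-validating probe also requires a known marker in the body.
# "Dashboard" is a hypothetical marker string for this app.
case "$body" in
  *"Dashboard"*) echo "content check: UP" ;;
  *)             echo "content check: DOWN" ;;
esac
```

Both probes hit the same response; only the second one notices anything is wrong.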
Why the VPN workaround obscured the cause
A VPN changes your apparent source IP address. When an affected employee in Manila connected through a US VPN exit node, their requests arrived at the server with a US IP address. The problem disappeared. This strongly implied something geographic — but the application itself contained no geographic logic. The team had no idea why routing through a VPN would affect application behavior, and the gap between "VPN fixes it" and "here's why" was wide enough that the investigation stalled.
The missing piece was that the server was running an ip2location module loaded at the Apache level — specifically the PX4 proxy detection database (PX4-IP-PROXYTYPE-COUNTRY-REGION-CITY-ISP.BIN) — to filter foreign bot traffic away from US-only marketing landers. This was a legitimate and intentional configuration. The team responsible for the landers knew about it. The team debugging the application did not, because it lived in Apache rather than in any application code they had access to.
The GeoIP module ran on every request the server handled — including every request to the multinational app. When a Philippine IP address was looked up and that lookup hit a malformed or missing entry in the PX4 database, the ip2location C library triggered a SIGSEGV. Apache couldn't catch it — a segfault in a C extension is a fatal signal, not a catchable exception. It killed the worker process outright. The response came back blank. Apache respawned the worker. Pingdom's next check hit a healthy worker and reported green.
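The mechanics can be simulated without the module: a process killed by SIGSEGV exits with status 128+11 = 139 and produces no output at all — the shell-level analogue of the blank page. The Apache log line in the comment uses a real Apache 2.4 event code, but the pid is illustrative:

```shell
# Child process kills itself with SIGSEGV, as the worker did mid-request.
sh -c 'kill -s SEGV $$'
status=$?
echo "exit status: $status"   # 128 + 11 = 139

# Apache 2.4 records such worker deaths in its error log; the line to
# hunt for looks like:
#   [core:notice] ... AH00052: child pid 12345 exit signal Segmentation fault (11)
```

Nothing reaches the application's own error handling; the only trace is in the parent's log.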
The diagnostic process
The first step was understanding the full dependency tree of the running Apache process. ldd against the Apache binary and its loaded modules revealed the ip2location shared library. ltrace — which traces library calls rather than system calls — confirmed the library was being invoked on incoming requests and identified the specific lookup function in the call chain.
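The first step can be illustrated generically — `/bin/ls` stands in for the httpd binary so the example runs anywhere, and the symbol prefix in the ltrace filter is an assumption about what the library exports:

```shell
# ldd prints the shared libraries a binary is linked against. In the real
# investigation this ran against httpd and every .so in its modules
# directory, which is what surfaced the ip2location dependency.
ldd /bin/ls

# ltrace traces library calls (where strace traces system calls).
# Attaching to a live worker and filtering to the suspect library's
# symbols looked roughly like:
#   ltrace -p <worker-pid> -e 'IP2Location*'
```

The ldd pass tells you *what* is loaded into the process; the ltrace pass tells you whether it is actually being *called* on the request path.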
This established that a third-party C library was running in the Apache process space on every request. Any segfault in that library would kill the worker. The question was whether it was segfaulting, and on what input.
strace was attached to the Apache worker processes while a real user from an affected Philippine IP range loaded a page. strace traces system calls — the lowest-level interface between a process and the kernel — and captures signals delivered to the process. The output was unambiguous: the worker received a SIGSEGV during the ip2location lookup for specific Philippine IP ranges. The process terminated. The response was blank.
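The signature strace revealed can be reproduced with a throwaway process (this assumes strace is installed; the real session attached to live workers with `strace -f -p <worker-pid>`):

```shell
# -e trace=none suppresses syscall output so only signal delivery shows.
strace -e trace=none sh -c 'kill -s SEGV $$' 2>&1 | grep SIGSEGV
# typical lines:
#   --- SIGSEGV {si_signo=SIGSEGV, ...} ---
#   +++ killed by SIGSEGV +++
```

In the real trace the same `+++ killed by SIGSEGV +++` line appeared immediately after the ip2location lookup for the affected ranges.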
The fault was consistent for certain IP ranges and absent for others, which explained why some Philippine employees were affected and others weren't — they were working from different offices on different ISPs, some mapping to IP ranges that triggered the fault and some that didn't.
With the root cause confirmed, the broader problem became visible. The ip2location module had been deployed to solve one specific problem: keeping foreign traffic off US landers. But because it lived at the Apache level rather than in the application, it ran on every vhost, every request, every application on that server. The lander protection and the multinational app shared the same Apache process pool. A database fault that affected Philippine IP lookups silently broke the app for Philippine users — and nobody had made that connection because the two systems were mentally siloed even though they were physically co-located on the same server.
The fix — and the modernization it prompted
The immediate fix was replacing ip2location with MaxMind GeoIP2, which did not exhibit the segfault behavior for the affected IP ranges. This resolved the blank page issue.
But the engagement didn't stop there. The segfault was a symptom of a deeper architectural problem: any C library running in the Apache process space could silently kill worker processes on a fault, serve blank pages to real users, produce no application-level error, and remain invisible to external monitoring. The right answer was process isolation.
The stack was migrated from Apache with mod_php to nginx and PHP-FPM. In this architecture, the web server and PHP runtime are separate processes. A crash in a PHP-FPM worker does not affect nginx. The web server keeps running, the worker pool recovers, and failures are contained and logged rather than silent. GeoIP processing moved to the PHP-FPM layer where failures produce catchable exceptions rather than process-killing signals.
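A hypothetical fragment of the resulting nginx configuration shows where the isolation comes from — the socket path and params file name are assumptions, not the client's actual config:

```
location ~ \.php$ {
    include        fastcgi_params;
    fastcgi_param  SCRIPT_FILENAME $document_root$fastcgi_script_name;
    # PHP runs in a separate PHP-FPM process pool behind this socket.
    # If an FPM worker dies mid-request, nginx returns a logged 502
    # instead of a silent blank page — and nginx itself stays up.
    fastcgi_pass   unix:/run/php/php-fpm.sock;
}
```

The boundary is the socket: the only thing the web server shares with the PHP runtime is a FastCGI connection, not an address space.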
Closing the monitoring gap
The client had Pingdom running throughout the entire period the bug was active. It reported the site as healthy. This wasn't a failure of Pingdom — it was doing exactly what it was configured to do. But an HTTP check that verifies a 200 response from a US IP address cannot detect a segfault affecting only Philippine IP ranges. External availability monitoring answers "is the server responding?" It does not answer "is the application working correctly for all of your users?"
The monitoring stack was rebuilt entirely. At the server level, monit was configured to watch nginx worker count, PHP-FPM pool health, and Redis — verifying process health and responsiveness, not just existence. Monit was configured to automatically restart failed services and alert immediately on anomalies.
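Illustrative monit stanzas for that local layer — pidfile paths, socket paths, and service commands vary by distribution and are assumptions here:

```
check process nginx with pidfile /run/nginx.pid
  start program = "/usr/sbin/service nginx start"
  stop program  = "/usr/sbin/service nginx stop"
  if failed port 80 protocol http then restart

check process php-fpm with pidfile /run/php/php-fpm.pid
  start program = "/usr/sbin/service php-fpm start"
  stop program  = "/usr/sbin/service php-fpm stop"
  if failed unixsocket /run/php/php-fpm.sock then restart

check process redis with pidfile /run/redis/redis-server.pid
  if failed host 127.0.0.1 port 6379 then restart
```

Each check exercises the service's actual interface (HTTP port, FPM socket, Redis port), so a process that exists but no longer responds still trips the restart.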
For external monitoring, a VPS running monit was provisioned for remote HTTP and HTTPS checks with response content validation — not just status codes. SSL certificate expiration and domain expiration monitoring were added across all domains. Application-specific health checks written in Perl ran via cron, verifying that known endpoints returned expected content rather than merely that the server was alive.
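A sketch of the remote side in monit's host-check syntax — hostname, path, and marker string are invented for illustration:

```
check host app with address app.example.com
  # Content validation: a 200 with the wrong (or empty) body still fails.
  if failed
     port 443 protocol https
     request "/health"
     content = "OK"
  then alert
  # Certificate expiry warning well before the cutoff.
  if failed port 443 protocol https and certificate valid > 30 days then alert
```

Domain-expiry checks and the application-specific endpoint probes ran separately as the cron'd Perl scripts described above.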
A monitoring stack built this way would have detected the original problem. A content-validating check sourced from a Philippine IP range would have returned unexpected content, not a clean 200 — and it would have alerted before a single user called to complain.
Root cause: ip2location PX4 SIGSEGV on specific Philippine IP ranges, killing Apache workers silently.
Why it was hidden: GeoIP module shared across all vhosts; Pingdom showed green; the VPN workaround masked the geographic signal.
Stack after: nginx + PHP-FPM replacing Apache/mod_php; MaxMind GeoIP2 replacing ip2location.
Monitoring after: local monit (nginx, PHP-FPM, Redis) plus a remote VPS with content validation and certificate/domain expiry alerting.
The lesson
Shared infrastructure has shared failure modes. A module deployed to solve one problem — bot filtering on marketing landers — silently became part of every other request on the server. The organizational gap between the team that configured it and the team debugging the application was as much a part of the problem as the technical fault. External monitoring that checks from a fixed location and verifies only HTTP status codes cannot detect failures that are geographic, content-dependent, or confined to specific worker processes. When a problem is intermittent, geography-correlated, and a network-routing change makes it disappear — trace the network-dependent code path first, and trace it at the system call level.
Remote-first. Dallas-based. Available until 2am CT.