Most monitoring tells you what's easy to measure. Production monitoring should tell you what's actually wrong — before your users do. The approach here is a three-layer mesh of service recovery, metrics collection, and external validation, with custom instrumentation built around the signals that matter for your specific stack.
The fundamental problem
A dashboard full of green indicators is not evidence that everything is working. It is evidence that everything being checked is working — which is only as good as what you decided to check.
The case studies on this site document several incidents where conventional monitoring showed nothing wrong while real users experienced real failures. Pingdom reported the site as healthy throughout the entire period a GeoIP segfault was serving blank pages to users in the Philippines — because Pingdom was checking whether the server responded, not whether the response was correct. HAProxy health checks showed all backends available while PHP-FPM workers were deadlocked on database connections — because the health checks were testing TCP handshakes, not request completion. CPU and memory graphs looked normal while the ZFS ARC was being silently evicted by a misconfigured Valkey process — because standard metrics don't measure memory at the subsystem boundary.
The pattern across all of these is the same: a monitoring system that checked the right tools in the wrong way, or checked the surface of a system rather than its behavior. Building monitoring that doesn't have this problem requires understanding the specific failure modes of the specific stack — not deploying a generic monitoring product and accepting its default checks as sufficient.
Architecture
The monitoring architecture combines three distinct layers that complement each other's blind spots. No single layer is sufficient on its own.
Layer 1 — Local
monit — service recovery and push alerting
monit watches process health, service availability, and resource consumption on each host and within each jail. When a service fails and recovery is possible, monit restarts it immediately — without waiting for a human to notice. When recovery isn't possible, or when resource thresholds are crossed, it pushes an alert directly to on-call staff via Pushover or Prowl — push notifications to mobile, not emails that wait to be read. Recovery happens in seconds because the check and the recovery action run on the same host with no external dependency. The alert follows within the same cycle if the recovery doesn't hold.
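In practice a monit stanza pairs the check, the recovery action, and the escalation in one place. The following is an illustrative sketch, not a production config — the service name, paths, thresholds, and the push-notification helper script are all assumptions (monit's native alerting is email, so Pushover delivery is typically bridged through a small exec script):

```
# Hypothetical stanza: restart PHP-FPM on a failed socket test,
# escalate via a push-notification helper if restarts don't hold.
check process php-fpm with pidfile /var/run/php-fpm.pid
    start program = "/usr/sbin/service php_fpm start"
    stop program  = "/usr/sbin/service php_fpm stop"
    if failed unixsocket /var/run/php-fpm.sock then restart
    if 3 restarts within 5 cycles then exec "/usr/local/bin/notify-pushover"
    if totalmem > 1 GB for 3 cycles then alert
```

Because the socket test and the restart action live on the same host, recovery doesn't wait on anything external; the exec escalation fires only when the restarts aren't holding.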
Layer 2 — Metrics
collectd → Graphite → Grafana
collectd collects system and application metrics continuously and ships them to Graphite for storage and trending. Grafana provides dashboards and alerting on the metric stream. This layer answers questions that monit can't: not "is the service running" but "how busy is it, how has that changed over time, and what does the current trend imply about the near future." Custom metric emission scripts in Perl extend collectd's built-in collection with application-level signals that no off-the-shelf collector knows to measure.
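Graphite's plaintext protocol keeps custom emission scripts short: one `path value timestamp` line per metric, written to Carbon's line receiver (TCP 2003 by default). The real emitters in this stack are Perl; the sketch below uses Python for illustration, and the metric path and Carbon address are assumptions:

```python
import socket
import time

def graphite_line(path, value, ts=None):
    """Format one metric in Graphite's plaintext protocol: 'path value timestamp\\n'."""
    ts = int(ts if ts is not None else time.time())
    return f"{path} {value} {ts}\n"

def send_metric(path, value, host="127.0.0.1", port=2003):
    """Ship a single metric to a Carbon plaintext listener."""
    with socket.create_connection((host, port), timeout=5) as s:
        s.sendall(graphite_line(path, value).encode())

# send_metric("app.smsqueue.depth", 42)   # hypothetical metric path
```

Anything a script can count — queue depth, API latency, worker utilization — becomes a trendable metric with one line of emission.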
Layer 3 — External
Multi-geo VPS — what real users in different regions see
Monitoring VPS nodes in Miami, Washington DC, and London perform checks from outside the network perimeter simultaneously. HTTP and HTTPS checks with response content validation, custom API endpoint tests, SSL certificate expiry, and domain expiry monitoring — from three geographic vantage points. This catches what a single-location check misses: CDN issues serving stale or broken content to specific regions, routing anomalies affecting only certain geographic paths, and geographic-specific failures like the GeoIP case where a single-location check would never have triggered. Three locations failing the same check is unambiguous; one location failing while two pass points at a regional problem rather than a server problem.
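The distinguishing detail is content validation: a check passes only when the status code and the body both look right. A minimal sketch of such a check (the expected-content marker is whatever string a known-healthy page contains; error handling is simplified):

```python
from urllib.error import HTTPError
from urllib.request import urlopen

def validate(status, body, expected):
    """A 200 with the wrong body is still a failure: require status AND content."""
    if status != 200:
        return (False, f"status {status}")
    if expected not in body:
        return (False, "expected content missing")
    return (True, "ok")

def check_url(url, expected, timeout=10):
    """Fetch a page and validate it the way a content-aware external check would."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return validate(resp.status, body, expected)
    except HTTPError as exc:
        return (False, f"status {exc.code}")
    except OSError as exc:
        return (False, f"unreachable: {exc}")
```

Run from each vantage point, the same check yields the per-location pass/fail set that the three-location comparison needs.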
Custom instrumentation
Off-the-shelf monitoring collects generic signals. The signals that matter most for a specific production environment — the ones that predict failures before they become visible — require custom instrumentation built around how that stack actually behaves.
Queue latency
SMS delivery and email hygiene pipeline
In affiliate marketing stacks where SMS delivery and email hygiene services sit in the processing pipeline, their latency directly affects throughput. When these third-party services slow down, they create backpressure that stalls the entire pipeline — jobs queue, delivery rates drop, revenue is affected. Custom Graphite metrics track queue depth and processing latency for each stage. When latency exceeds configured bounds, automated circuit-breaker behavior halts queuing to prevent the backlog from growing beyond recovery. The metric is the signal that triggers the response; without it, the first indication of a problem is a halted pipeline discovered at reporting time.
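The breaker logic itself is small. A sketch of the idea, with illustrative thresholds (the real bounds are tuned per stage):

```python
import time

class LatencyBreaker:
    """Stop queuing into a stage when its latency exceeds bounds.

    Thresholds are illustrative; `now` is injectable for testing."""

    def __init__(self, max_latency_s=5.0, cooldown_s=60.0):
        self.max_latency_s = max_latency_s
        self.cooldown_s = cooldown_s
        self.open_until = 0.0  # while now < open_until, queuing is halted

    def record(self, latency_s, now=None):
        """Feed each completion's latency in; trip the breaker on a breach."""
        now = time.monotonic() if now is None else now
        if latency_s > self.max_latency_s:
            self.open_until = now + self.cooldown_s

    def allow(self, now=None):
        """Producers call this before queuing more work into the stage."""
        now = time.monotonic() if now is None else now
        return now >= self.open_until
```

The latency metric feeds both Graphite (for the trend) and the breaker (for the response), so the same instrumentation serves observation and automation.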
Third-party API health
Dependency monitoring for services that cause downtime
Any external API that sits in a critical path — payment processors, delivery APIs, hygiene services, data enrichment endpoints — is monitored with custom checks that test the actual response rather than just reachability. A third-party API can be reachable and returning 200 responses while the response content indicates degradation, quota exhaustion, or incorrect data. Custom Perl scripts check specific endpoint behavior on a schedule and emit metrics and alerts based on what the response actually contains. An API that is causing downstream queue buildup raises an alert before the queue depth itself becomes the visible symptom.
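What "test the actual response" means in code: parse the body and look for the degradation signals the provider exposes. The field names below are hypothetical — real checks are written per provider:

```python
import json

def api_degraded(body):
    """Return a reason string when a 200 response still indicates trouble,
    or None when the response looks healthy. Field names are hypothetical."""
    data = json.loads(body)
    if data.get("quota_remaining", 1) <= 0:
        return "quota exhausted"
    status = data.get("status")
    if status not in (None, "ok"):
        return f"provider status: {status}"
    return None
```

The reason string becomes both the alert text and a tagged Graphite metric, so "quota exhausted at 14:02" is on the dashboard before the downstream queue starts backing up.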
PHP-FPM busyness
Scaling signal from pool utilization
PHP-FPM exposes a status endpoint that reports active workers, idle workers, and queue depth. That data is collected as a Graphite metric and used as a scaling signal — when active worker utilization across the pool approaches configured thresholds, it triggers capacity additions before requests start queuing. This is the difference between reactive scaling (add capacity after users experience slowness) and proactive scaling (add capacity when the trend indicates it will be needed). The metric also feeds alerting: a sustained high active worker ratio that isn't being resolved by scaling indicates an upstream problem rather than a capacity shortage.
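With `pm.status_path` enabled, the endpoint's `?json` form reports keys such as `active processes` and `idle processes`. A sketch of the scaling signal (the 80% threshold is illustrative):

```python
import json

def pool_utilization(status_json):
    """Active-worker ratio from PHP-FPM's status endpoint (?json form)."""
    s = json.loads(status_json)
    active = s["active processes"]
    return active / (active + s["idle processes"])

def scale_signal(status_json, threshold=0.8):
    """True when pool utilization says capacity should be added now,
    before requests start landing in the listen queue."""
    return pool_utilization(status_json) >= threshold
```

The ratio is emitted to Graphite on every collection cycle; the scaling decision reads the trend, not a single sample.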
Valkey / Redis
Memory, hit rate, and fragmentation
Valkey metrics tracked include memory usage versus configured maxmemory, cache hit rate, eviction rate, and jemalloc fragmentation ratio. The fragmentation metric in particular is the signal that the DTrace/ARC case study documents the consequences of missing — a fragmentation ratio that climbs over time indicates allocator pressure that will eventually displace ZFS ARC cache, causing the filesystem to read from disk for data that should be cached. The metric catches the drift before the latency spikes become visible.
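The fragmentation signal comes straight out of `INFO memory`, which reports `mem_fragmentation_ratio` as a key:value line. A sketch of the check (the 1.5 threshold is illustrative; the right bound depends on the host):

```python
def parse_info(info_text):
    """Parse the key:value lines of a Valkey/Redis INFO reply into a dict."""
    fields = {}
    for line in info_text.splitlines():
        if ":" in line and not line.startswith("#"):
            key, _, value = line.partition(":")
            fields[key] = value.strip()
    return fields

def frag_alert(info_text, threshold=1.5):
    """True when fragmentation has drifted past the alert bound —
    the precursor signal to the ARC displacement described above."""
    return float(parse_info(info_text)["mem_fragmentation_ratio"]) > threshold
```

Emitted as a Graphite metric, the ratio's slope over weeks is the early warning; the alert threshold is the backstop.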
MariaDB
Replication lag, slow queries, connection pool
MariaDB metrics include replication lag on all replicas (with an alerting threshold that fires before lag reaches a level that would affect a promotion decision), slow query rate trending, active connections versus max_connections, and InnoDB buffer pool hit rate. Connection pool utilization is tracked against the PHP-FPM max_children count — when the two approach each other, it indicates either connection leak behavior or a need to revisit the concurrency configuration before connection exhaustion causes errors.
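The convergence check is a simple comparison once the three numbers are in hand. A sketch, with illustrative thresholds:

```python
def db_capacity_warnings(threads_connected, max_connections,
                         fpm_max_children, reserved=10):
    """Flag the two exhaustion paths: a connection pool filling up, and an
    FPM concurrency ceiling that can outrun max_connections. The 80% bound
    and the reserved-connection count are illustrative."""
    warnings = []
    if threads_connected / max_connections >= 0.8:
        warnings.append("connection pool above 80% of max_connections")
    if fpm_max_children > max_connections - reserved:
        warnings.append("PHP-FPM max_children can exhaust max_connections")
    return warnings
```

The second check is static — it catches a misconfiguration at review time rather than waiting for traffic to prove it.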
Jail-level metrics
Per-jail resource consumption
Each iocage jail is monitored individually — CPU consumption, memory usage against RCTL limits, network I/O, and disk I/O. This provides visibility that host-level metrics obscure: a jail consuming an increasing share of host resources over time is visible as a trend before it affects other jails on the same host. RCTL limit violations are captured and alerted. A jail that is approaching its configured resource ceiling is flagged for review before it hits the ceiling and starts affecting performance.
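Per-jail usage is read from `rctl -u jail:NAME`, which prints one `resource=value` line per counter; the exact resource names depend on the configured rules, so this parser is a sketch against that assumed format:

```python
def parse_rctl(output):
    """Parse the resource=value lines of `rctl -u jail:NAME` into ints."""
    usage = {}
    for line in output.splitlines():
        resource, _, value = line.partition("=")
        if value.isdigit():
            usage[resource] = int(value)
    return usage

def near_limit(usage, limits, headroom=0.9):
    """List resources within 90% of their configured RCTL ceiling —
    jails to flag for review before they hit the limit."""
    return [r for r, used in usage.items()
            if r in limits and used >= headroom * limits[r]]
```

Each parsed counter is also emitted per-jail to Graphite, which is what makes the slow-growth trend visible across hosts.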
External validation
A process that is running correctly on a host can still serve broken content to users. A service that is healthy from the inside can be unreachable from outside the network due to a firewall change, a routing problem, or a CDN misconfiguration. External monitoring — checks made from outside the infrastructure, simulating a real user request — is the only layer that sees what users actually see.
Monitoring VPS nodes in Miami, Washington DC, and London perform HTTP and HTTPS checks with content validation from three geographic vantage points simultaneously. The check verifies not just that a response was received but that the response contains expected content. A page that returns 200 with an error message fails the check. Three locations failing the same check simultaneously is a server problem. One location failing while two pass is a regional routing or CDN problem — a completely different diagnosis that a single-location check would never surface.
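The diagnosis rule reduces to a few lines once the per-location results are collected. A sketch (the location names are the three vantage points above):

```python
def diagnose(results):
    """Map per-location pass/fail results to a diagnosis.

    results: dict of location name -> True (check passed) / False."""
    failures = sorted(loc for loc, ok in results.items() if not ok)
    if not failures:
        return "healthy"
    if len(failures) == len(results):
        return "server problem"
    return "regional problem: " + ", ".join(failures)
```

The value of the rule is in the alert text: on-call staff start from "regional problem: london" rather than from a generic "check failed".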
This multi-geo approach is the architecture that would have caught the GeoIP segfault documented in the GeoIP diagnosis case study — a content-validating check sourced from a Philippine IP range would have returned unexpected content immediately, from the correct geographic vantage point. A US-sourced check, even with content validation, would have passed throughout. Geographic vantage matters as much as check methodology.
For clients who want a managed external monitoring endpoint without running their own VPS infrastructure, Uptime Kuma deployed in Docker provides a self-hosted alternative to commercial monitoring services — with the same content validation capability and full control over check configuration and geographic distribution.
Certificate expiry is one of the most preventable causes of production downtime. A certificate that expires silently takes services offline with no application-level warning — the first indication is typically a user reporting a browser security error, or monitoring detecting a failed HTTPS check. For portfolios of hundreds or thousands of domains across multiple registrars, manual certificate tracking is not viable.
Domain expiry monitoring across portfolios exceeding 1,000 domains spanning multiple registrars uses both registrar API endpoints where available and database-backed tracking where registrar APIs are unavailable or unreliable. Checks run on a schedule; alerting thresholds fire at 60, 30, and 14 days before expiry — long enough to renew through the normal process before the emergency window. The monitoring covers both SSL certificate validity and domain registration expiry independently, since the two can fail on different schedules for the same domain.
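The tiering for both expiry types is the same date arithmetic. A sketch using the thresholds above:

```python
from datetime import date

def expiry_tier(expiry, today, tiers=(60, 30, 14)):
    """Return the tightest alert tier (in days) that expiry has crossed,
    or None while the expiry is still outside every window."""
    days_left = (expiry - today).days
    crossed = [t for t in tiers if days_left <= t]
    return min(crossed) if crossed else None
```

Running the certificate's notAfter date and the registration's expiry date through the function as two independent records per domain preserves the separation the two schedules need.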
When monitoring itself was wrong
These case studies document incidents where the monitoring showed green while something was wrong — and what proper instrumentation would have caught earlier.
External monitoring failure
GeoIP segfault — Pingdom showed green
Apache workers segfaulting on Philippine IP ranges. Pingdom reported the site healthy throughout. A content-validating check from the affected geography would have caught it immediately.
Health check failure
PHP-FPM deadlock — health checks passing
Workers blocked on database I/O while completing FastCGI handshakes. HAProxy health checks showed all backends available. A real request path check would have failed immediately.
Metrics gap
ZFS ARC eviction — dashboards showed nothing
Valkey fragmentation silently displacing ZFS ARC. CPU and memory graphs looked healthy. Fragmentation ratio and ARC eviction rate metrics would have shown the drift weeks earlier.
Metrics gap
Scheduler pressure — CPU looked fine
100 threads competing for 16 cores, system felt sluggish. Context switch rate trending in Graphite would have shown the oversubscription growing as workers were added over time.
We build monitoring stacks from scratch or audit and extend what you already have.