Diagnosis — database, filesystem & kernel
The situation
The database was large — hundreds of gigabytes across InnoDB tables — under a write-heavy workload with continuous transaction volume. By every conventional metric, the server was not under stress. Threads_running was low. CPU utilization was comfortable. The I/O graphs showed normal-looking patterns. Nothing in the MariaDB status variables pointed at a saturated system.
And yet queries were periodically freezing. Not failing — freezing. A query that should return in milliseconds would occasionally take several seconds, then complete normally. The next ten queries would be fast. Then another stall. The pattern was periodic and roughly predictable in frequency, but which query any given stall would hit was unpredictable. Applications sitting on top of the database saw intermittent slow responses. Connection pools started backing up during stall windows. The problem was real and measurable, but its cause was invisible to everything being monitored.
The stalls weren't caused by contention between queries. They weren't caused by disk saturation. They were caused by a deterministic interaction between two configurations that were each correct in isolation — and catastrophic together on this hardware.
The wrong hypotheses
Lock contention was the first candidate — a long-running transaction holding a row or table lock while shorter queries waited. SHOW ENGINE INNODB STATUS and the lock wait instrumentation in the performance schema showed nothing consistent with the stall pattern. Lock waits existed, as they always do, but they weren't correlated with the stall timing.
Disk I/O bottleneck was the second candidate — the spindles saturated, the I/O queue backing up, queries waiting for data that was slow to read from disk. The I/O metrics during stall windows didn't show saturation. Queue depth was normal. Read latency was normal. The disks weren't overwhelmed — they were being asked to do something specific and doing it slowly, but the aggregate statistics obscured that detail entirely.
Both hypotheses were wrong. Finding the right one required getting below the database's own instrumentation and watching what the process was actually doing at the system call level.
The diagnostic process
ktrace was attached to the mysqld process and its worker threads. The trace recorded every system call made by the database — reads, writes, seeks, and critically, fsync calls. fsync is the system call that forces data from the OS write buffer to the physical storage medium. It blocks until the storage confirms the write is durable. On spinning disk, that confirmation takes as long as the disk's rotational latency and write path require — typically several milliseconds per call, but variable and occasionally much longer depending on the I/O queue state at the moment of the call.
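The blocking behavior described above is easy to reproduce in isolation. The sketch below is a standalone illustration, not the diagnostic used here (which was ktrace): it times write-plus-fsync cycles against a scratch file, the same pattern a committing InnoDB thread waits on.

```python
import os
import tempfile
import time

def measure_fsync_latency(path, iterations=50, payload=b"x" * 4096):
    """Time write+fsync cycles: the same blocking pattern InnoDB's
    redo-log flush imposes on every committing thread."""
    latencies = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for _ in range(iterations):
            os.write(fd, payload)
            start = time.monotonic()
            os.fsync(fd)  # blocks until storage confirms the write is durable
            latencies.append(time.monotonic() - start)
    finally:
        os.close(fd)
    return latencies

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        lat = measure_fsync_latency(os.path.join(d, "redo.log"))
        print(f"median fsync: {sorted(lat)[len(lat) // 2] * 1000:.3f} ms")
```

Run on spinning disk, the median lands in the milliseconds; on SSD, in the tens of microseconds. That gap is the entire story that follows.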
The stall timing in the ktrace output correlated precisely with fsync calls issued by the InnoDB storage engine. During stall windows, mysqld was blocked in fsync — not for the duration of a single disk write, but for the duration of a flush cycle that involved multiple sequential fsync calls across the InnoDB redo log. While those fsync calls were in flight, every other thread waiting to commit a transaction was blocked behind them. Queries that had completed their work and were ready to return results were sitting in queue, waiting for the flush cycle to finish before their commit could be acknowledged.
The stalls weren't random. They were periodic because InnoDB's log flush cycle is periodic — driven by innodb_flush_log_at_trx_commit=1, which instructs InnoDB to flush and sync the redo log to disk on every transaction commit. Under the burst write patterns of this workload, commits were accumulating faster than individual flushes could process them, creating a backlog that resolved in periodic bulk flush events rather than smooth continuous writes.
On a conventional filesystem, an fsync call is expensive but bounded. On ZFS, it carries additional overhead: ZFS maintains its own intent log — the ZIL, or ZFS Intent Log — to guarantee write ordering and durability across pool operations. When an application calls fsync, ZFS must flush its intent log to the same storage that holds the data pool. By default, without a dedicated SLOG (Separate LOG device), the ZIL writes go to the same spinning disk vdevs as everything else — competing with data reads and writes for the same I/O queue.
The result was a compounding effect. InnoDB's fsync calls were already expensive on spinning disk. On ZFS without a dedicated SLOG, each fsync also triggered ZIL activity on the same spindles, serializing through the same I/O queue. The stall duration that ktrace measured was longer than the raw disk latency numbers suggested it should be, because each flush cycle was doing more work than a non-ZFS system would do for the same operation. Two layers of durability guarantee, both landing on the same spinning disk, at the same time, triggered by the same commit events.
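A back-of-the-envelope model makes the compounding visible. The latency figures below are illustrative round numbers, not measurements from this incident:

```python
def flush_cycle_ms(fsyncs_per_cycle, device_latency_ms, zil_on_same_device):
    """Rough model of one InnoDB flush cycle. Each fsync pays the device's
    synchronous write latency; when the ZIL shares the device, the intent-log
    flush serializes behind it and roughly doubles the per-call cost."""
    per_call = device_latency_ms * (2 if zil_on_same_device else 1)
    return fsyncs_per_cycle * per_call

# Illustrative numbers: 6 fsyncs per cycle, 8 ms spinning disk, 0.1 ms SSD SLOG
hdd_shared = flush_cycle_ms(6, 8.0, zil_on_same_device=True)    # 96 ms per cycle
ssd_slog   = flush_cycle_ms(6, 0.1, zil_on_same_device=False)   # roughly 0.6 ms
```

A 96 ms pause every flush cycle is exactly the kind of stall that is enormous at the query level and invisible in per-minute I/O averages.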
The fix — two changes at two different layers
The diagnosis identified two interacting problems. The fix addressed both, at their respective layers. Either change alone would have improved the situation. Together they eliminated the stalls entirely.
innodb_flush_log_at_trx_commit=1 is InnoDB's safest setting: every commit is synchronously flushed and synced to the redo log before the client receives acknowledgement. It provides the strongest durability guarantee — in the event of a crash, no committed transaction is lost. It is also the most expensive setting, because it turns every commit into a synchronous I/O operation.
Setting this to 2 changes the behavior: InnoDB writes to the log buffer on every commit and flushes to the OS file cache, but the sync to physical storage happens once per second rather than on every commit. In the event of a crash, up to one second of committed transactions may be lost — they are in the OS buffer but not yet on disk. For many production workloads, this is an acceptable trade-off. For a workload where the data is replicated to a secondary — as this database was — the practical durability difference is minimal, because the replica has an independent copy of the committed data.
Changing to innodb_flush_log_at_trx_commit=2 immediately reduced the frequency of fsync calls by an order of magnitude — from one per commit to approximately one per second. The periodic bulk flush events that were causing stalls became far less frequent and far shorter, because the log buffer had less to flush on each cycle.
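In configuration terms the first fix is one line. A sketch of the my.cnf fragment (file location and surrounding section contents vary by install):

```ini
# /etc/my.cnf (assumed location; adjust for your platform)
[mysqld]
# 1 = fsync the redo log on every commit (strongest durability, highest cost)
# 2 = write to the OS cache per commit, fsync once per second
#     (up to ~1 s of committed transactions at risk on OS or power failure)
innodb_flush_log_at_trx_commit = 2
```

The variable is dynamic, so `SET GLOBAL innodb_flush_log_at_trx_commit = 2;` applies it without a restart; the config entry makes it survive one.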
A 480GB SSD was added to the ZFS pool as a dedicated SLOG device. The SLOG gives ZFS a fast, dedicated location for ZIL writes — synchronous write operations that previously had to compete with data I/O on the spinning disk vdevs now land on SSD first and are committed to the data pool asynchronously. The synchronous write latency for ZFS operations dropped from tens of milliseconds to sub-millisecond, because SSD write latency is measured in microseconds rather than the milliseconds of spinning disk rotational latency.
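On the ZFS side, attaching the SLOG is a single pool operation. Pool and device names below are placeholders, not the actual system's:

```shell
# Attach a dedicated SLOG vdev; da1 stands in for the SSD device name
zpool add tank log da1
# Confirm the log vdev appears under the pool
zpool status tank
```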
The SLOG change addressed the ZFS amplification of the InnoDB fsync cost. Even under the original innodb_flush_log_at_trx_commit=1 setting, the SLOG alone would have significantly reduced stall duration by removing ZIL activity from the spinning disk I/O queue. With both changes in place, the remaining fsync cycles that InnoDB issued landed on fast storage rather than spinning disk, completing in a fraction of the time they had previously taken.
After both changes, ktrace output confirmed that fsync calls were completing in sub-millisecond time — consistent with SSD latency rather than spinning disk latency. The periodic stall pattern in query response time disappeared. MariaDB's own status counters showed improved throughput consistency. Connection pool wait times normalized. The application's slow response events ceased.
Why this only appeared under burst writes
The stalls were not constant because the problem was rate-dependent. Under light write traffic, InnoDB's per-commit fsync calls were infrequent enough that spinning disk could keep up — each flush completed before the next was requested, and no backlog accumulated. Under burst write conditions — concentrated transaction volume over short windows — commits arrived faster than the flush cycle could process them synchronously. The backlog created the periodic bulk flush events. The bulk flush events created the stalls.
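The rate dependence is just queueing arithmetic. The toy model below uses hypothetical timings, but shows how a backlog appears only once commit spacing drops below flush latency:

```python
def backlog_after_burst(commits, interarrival_ms, flush_ms):
    """Toy single-server queue: commits arrive at fixed spacing and each
    synchronous flush takes flush_ms. Returns how many commits had to
    queue behind an earlier flush, i.e. the waiting that surfaces as a stall."""
    next_free = 0.0  # time at which the "disk" is next available
    waiting = 0
    for i in range(commits):
        arrival = i * interarrival_ms
        start = max(arrival, next_free)
        if start > arrival:
            waiting += 1  # this commit queued behind earlier flushes
        next_free = start + flush_ms
    return waiting

# Light traffic: each 8 ms flush finishes before the next commit arrives
assert backlog_after_burst(100, interarrival_ms=20.0, flush_ms=8.0) == 0
# Burst traffic: commits arrive every 2 ms, faster than flushes complete
assert backlog_after_burst(100, interarrival_ms=2.0, flush_ms=8.0) > 0
```

Same hardware, same per-flush cost; only the arrival rate changes. That is why the system looked healthy under light traffic and stalled under bursts.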
This is why the CPU and Threads_running metrics showed nothing alarming. The threads weren't running during the stall — they were sleeping, waiting for fsync to complete. A sleeping thread consumes no CPU. It contributes nothing to load average in the traditional sense. It is invisible to every metric that measures activity rather than waiting. The problem was not that the system was doing too much. It was that the system was waiting — efficiently, quietly, and periodically — for storage that couldn't keep up with the specific I/O pattern InnoDB was generating.
Threads_running during stalls
Low. Threads were sleeping in fsync — consuming no CPU, invisible to activity metrics.
Root cause
innodb_flush_log_at_trx_commit=1 burst-mode fsync amplified by ZFS ZIL on spinning disk — two durability layers competing for the same I/O queue.
Diagnostic method
ktrace on mysqld threads — fsync call timing correlated precisely with application-level stall windows.
Fix layer 1
innodb_flush_log_at_trx_commit=2 — reduced fsync frequency from per-commit to per-second.
Fix layer 2
480GB SSD added as dedicated ZFS SLOG — remaining fsync calls complete in microseconds, not milliseconds.
The lesson
Performance problems that span subsystem boundaries are the hardest to diagnose because no single tool can see both sides simultaneously. A DBA looking at MariaDB metrics sees low thread counts and normal lock activity — nothing wrong. A storage admin looking at disk I/O sees normal-looking queue depths — nothing wrong. Both observations are correct. The problem lives in the contract between them: InnoDB's durability guarantee requires synchronous disk acknowledgement, ZFS's write ordering guarantee adds its own synchronous overhead to the same operation, and spinning disk satisfies both requirements but does so slowly enough that under burst conditions the combination creates periodic, deterministic stalls.
Finding this required instrumenting at the system call level — below InnoDB's own visibility, below ZFS's own logging — to see the actual fsync latency that was blocking the database's commit threads. ktrace provided that visibility. Once the stall timing was correlated with the fsync call pattern, both causes were clear, and the fix at each layer followed directly. A cheap SSD in the right position — as a dedicated SLOG rather than in the data pool — changed the I/O profile for every synchronous write on the system for a fraction of the cost of replacing the spinning disk arrays.