DRBD resync took down Jenkins disk I/O for 8 hours
During a full resync of the Jenkins DRBD partition (1.1 TB), w_await on the primary node shot up to 103ms. Builds started degrading, agents began timing out. The cause: a fixed resync-rate that ignored application I/O. The fix took 15 minutes.
Context
Jenkins runs in active-standby DR: the primary node holds the DRBD partition in Primary/UpToDate state, secondary in Secondary/UpToDate. Both nodes run on OpenStack Ubuntu 22.04, with the jenkins partition occupying 1.1 TB of a 1.3 TB block device.
After planned maintenance, the secondary node fell out of sync. DRBD started a full resync.
Problem
Right after resync started, iostat on the primary showed normal working load:
Device r/s w/s wkB/s w_await %util
vdb 945 2270 12969 5.71 34.4
Ten minutes later, under active Jenkins load:
Device r/s w/s wkB/s w_await %util
vdb 980 54 5613 103.00 76.0
At 103ms, w_await means the disk queue is backing up. Builds started hanging, some agents went into a JNLP reconnect loop.
The resync config at the time of the incident:
drbdsetup disk-options jenkins --c-plan-ahead=0 --resync-rate=512000k
c-plan-ahead=0 disables the adaptive controller. DRBD was reading ~945 r/s from disk to send to the secondary — without throttling back even when Jenkins was writing heavily.
Solution
Switched to adaptive resync on the fly — no Jenkins downtime, no replication interruption:
drbdsetup disk-options jenkins \
--c-plan-ahead=20 \
--c-fill-target=50k \
--c-min-rate=50000k \
--c-max-rate=500000k
What each parameter does:
c-plan-ahead=20 — the controller’s planning horizon in units of 0.1s, i.e. 2 seconds. The docs recommend ≥ 5x RTT; with 0.23ms ping between nodes, 20 is overkill, but harmless.
c-fill-target=50k — how much data to keep in-flight, in sectors (1 sector = 512 bytes, 50k ≈ 25 MB). The controller adjusts read speed to keep the queue from growing.
c-min-rate=50000k — the resync floor. Even when Jenkins is writing hard, the controller won’t drop below ~50 MB/s.
c-max-rate=500000k — the ceiling. When the disk is idle, resync can use up to ~500 MB/s.
Result
iostat two minutes after changing the parameters:
Device r/s w/s wkB/s w_await %util
vdb 870 2269 12965 5.71 34.4 # Jenkins writing actively
vdb 0 0 0 0.00 0.0 # Jenkins idle
vdb 1 1 4 0.00 0.0 # Jenkins idle
w_await peaked at 5.7ms, down from 103ms. Builds stopped hanging.
Resync speed dropped to ~36 MB/s — DRBD steps aside for Jenkins and takes what’s left. Estimated time to full sync: 8.5 hours. Jenkins ran without interruption throughout.
Takeaways
A fixed resync-rate with c-plan-ahead=0 on a production node is a misconfiguration. The adaptive controller is on by default in DRBD 9 (c-plan-ahead=20), but an explicit --c-plan-ahead=0 in drbdsetup disk-options turns it off.
If you’re tuning parameters via drbdsetup, check that c-plan-ahead isn’t zero. Parameters apply without restarting the resource or breaking replication.
Next time: start with adaptive, then decide if a fixed rate is actually needed.