Zero-Downtime Upgrades

Replace a running Dwaar binary without dropping a single connection. Dwaar uses Pingora’s file-descriptor transfer mechanism: the new process inherits the listening sockets from the old one, starts accepting requests immediately, and the old process drains its in-flight connections before exiting.

This works at the OS level — no connection is reset, no client sees a TCP error, no request is lost.

The key property is that both processes share the listening sockets briefly during the handover. The kernel queues incoming connections on the shared socket; the new process services them. No SO_REUSEPORT race, no listen gap.

Trigger an upgrade with the upgrade subcommand:

# Upgrade using the current binary (replaces itself)
dwaar upgrade
# Upgrade to a specific new binary
dwaar upgrade --binary /usr/local/bin/dwaar-1.2.0
# Specify a non-default PID file location
dwaar upgrade --pid-file /run/dwaar/dwaar.pid

Pass --upgrade as a flag when you want to start a process that takes over from a running instance (this is what dwaar upgrade does internally):

dwaar --upgrade --config /etc/dwaar/Dwaarfile

| Flag / Subcommand | Purpose |
| --- | --- |
| dwaar upgrade | Orchestrates the full upgrade: starts the new process, sends SIGQUIT to the old one |
| --upgrade | Tells the new process to inherit listening FDs from the old one |
| --binary PATH | Path to the new binary (default: current executable) |
| --pid-file PATH | Path to the PID file of the running instance (default: /tmp/dwaar.pid) |

Dwaar writes a PID file when running in daemon mode (--daemon). The upgrade subcommand reads this file to find the old process.

Default location: /tmp/dwaar.pid

Configure a production path via the unit file or startup flags:

dwaar --daemon --config /etc/dwaar/Dwaarfile
# PID written to /tmp/dwaar.pid by default

When running under systemd with Type=simple (no --daemon), systemd tracks the main PID itself and exposes it as $MAINPID. Set PIDFile in the unit file and pass the PID explicitly where it is needed:

# In ExecStart, write the PID manually if not using --daemon
ExecStart=/usr/local/bin/dwaar --config /etc/dwaar/Dwaarfile

For daemon mode in production:

ExecStart=/usr/local/bin/dwaar --daemon --config /etc/dwaar/Dwaarfile
PIDFile=/run/dwaar/dwaar.pid

The upgrade subcommand verifies the PID file before trusting it — it checks that the file is owned by the current user and is not world-writable, preventing privilege escalation via a tampered PID file.
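The ownership and permission check described above can be approximated from a shell before running an upgrade by hand. This is a minimal sketch, not Dwaar's actual implementation: the function name pid_file_trusted is hypothetical, and it tests only the two properties the text names (owned by the current user, not world-writable).

```shell
# pid_file_trusted PATH
# Hypothetical helper: succeeds only if PATH is a regular file that is
# owned by the current user and is not world-writable.
pid_file_trusted() {
    pid_file=$1
    [ -f "$pid_file" ] || return 1
    # find prints the path only when both predicates hold:
    #   -user        file owner matches the current user ID
    #   ! -perm -0002  the world-write bit is NOT set
    [ -n "$(find "$pid_file" -prune -user "$(id -u)" ! -perm -0002 -print 2>/dev/null)" ]
}
```

A script could guard the upgrade with it, for example `pid_file_trusted /run/dwaar/dwaar.pid && dwaar upgrade --pid-file /run/dwaar/dwaar.pid`.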

After receiving SIGQUIT, the old process enters a drain phase:

  • Stops calling accept() on all listening sockets.
  • Waits for all in-flight HTTP requests to complete.
  • Exits once the drain window closes or all connections finish, whichever comes first.

The drain window is controlled by drain_timeout_secs in your Dwaarfile:

options {
    drain_timeout_secs 30
}

| Value | Behaviour |
| --- | --- |
| Not set | Default 30 seconds |
| 0 | Drain immediately (may cut active long-poll or streaming responses) |
| 120 | Wait up to 2 minutes for long-running requests to finish |

Pingora also applies its own grace_period_seconds (5 s) and graceful_shutdown_timeout_seconds (5 s) on top of the drain window. In practice, connections that complete within drain_timeout_secs exit cleanly; connections still open at the deadline are closed.

Tune drain_timeout_secs to match your longest expected request duration. For APIs with short timeouts, 30 s is sufficient. For file uploads or long-poll endpoints, increase to 60–120 s.
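Deployment scripts that need to know how long the old process may linger can read the configured drain window from the Dwaarfile. The following is a rough sketch that assumes the directive appears on its own line in the form shown above; the function name drain_timeout is hypothetical, and the fallback of 30 mirrors the documented default.

```shell
# drain_timeout DWAARFILE
# Hypothetical helper: prints the configured drain_timeout_secs value,
# or 30 (the documented default) when the directive is absent.
# Assumes the simple "drain_timeout_secs N" line format shown above.
drain_timeout() {
    awk '
        $1 == "drain_timeout_secs" { print $2; found = 1; exit }
        END { if (!found) print 30 }
    ' "$1"
}
```

For example, `drain_timeout /etc/dwaar/Dwaarfile` gives a script the number of seconds to budget before treating a still-running old process as stuck.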

When Dwaar runs in multi-worker mode, a supervisor process forks child workers and restarts them on crash or reload. As of 0.2.2 the supervisor no longer retires an old worker until the new child has proven it can serve traffic.

What is probed. The supervisor connects to the worker’s admin endpoint — the Unix domain socket path if --admin-socket was passed, otherwise TCP 127.0.0.1:6190 (the always-on admin listener on worker 0). A successful connect() is proof that the child bound its listeners.

Cadence. Poll every 50 ms, capped at a 10 s deadline. The supervisor uses blocking stdlib sockets (std::net::TcpStream, std::os::unix::net::UnixStream) rather than Tokio — the supervisor loop runs before Pingora’s runtime is spun up, so there is no async executor available and none is needed.

Failure modes.

| Condition | Supervisor action |
| --- | --- |
| connect() succeeds before the deadline | Restart is declared successful. Old worker receives SIGQUIT. |
| waitpid(WNOHANG) reports the child exited before readiness | ChildExited: the old worker is left running and serving, and the restart is aborted. Check logs for the child's panic. |
| 10 s deadline elapses with no successful connect | Timeout: the new worker is killed, the old one keeps serving. |

This closes a race that existed before: a worker whose constructor panicked after fork() but before bind could leave the supervisor thinking the child was alive. The waitpid(WNOHANG) arm catches that case cleanly.
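The probe-and-decide loop above can be mirrored in a shell script, for example in a deploy hook that wants the same three outcomes. This is a sketch under assumptions, not the supervisor's Rust code: the function name probe_ready, the PROBE_DEADLINE_MS override, and the return codes are hypothetical. Only the 50 ms cadence and the 10 s deadline come from the text. It also assumes a sleep that accepts fractional seconds.

```shell
# probe_ready CHILD_PID PROBE_COMMAND [ARGS...]
# Hypothetical helper that mirrors the three outcomes in the table:
#   returns 0  probe succeeded before the deadline (ready)
#   returns 2  child is gone before readiness (ChildExited)
#   returns 1  deadline elapsed with no successful probe (Timeout)
probe_ready() {
    child_pid=$1
    shift
    interval_ms=50
    deadline_ms=${PROBE_DEADLINE_MS:-10000}
    elapsed_ms=0
    while [ "$elapsed_ms" -lt "$deadline_ms" ]; do
        # The probe command stands in for connect(): success means ready.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        # kill -0 sends no signal; it only checks that the PID still exists.
        if ! kill -0 "$child_pid" 2>/dev/null; then
            return 2
        fi
        sleep 0.05
        # Only sleep time is counted, so the real deadline is slightly longer.
        elapsed_ms=$((elapsed_ms + interval_ms))
    done
    return 1
}
```

Assuming a netcat that supports -z, a TCP probe of the admin listener would look like `probe_ready "$child_pid" nc -z 127.0.0.1 6190`.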

The supervisor’s SHUTTING_DOWN flag — set by the SIGTERM / SIGINT signal handler and read by the supervisor loop to decide whether to restart a dying child — is now loaded and stored with Ordering::SeqCst on both sides of the exchange.

Relaxed or acquire/release orderings were not strictly wrong for a single atomic. SeqCst is the strongest ordering in the C11 memory model: every SeqCst load and store takes part in one global order, so a signal handler on one thread and a normal load on the supervisor loop thread observe the flag consistently. This is especially relevant under POSIX signal-safety rules, where the signal can interrupt the loop mid-iteration. Without SeqCst, a signal arriving between the flag load and the fork() call could leave a fresh child running after the supervisor had already decided to shut down.

No operator-facing change. Shutdown semantics are unchanged; the fix is a memory-ordering correctness patch that closes a rare race observed under synthetic stress testing.

Follow these steps to perform a production upgrade:

  1. Build or download the new binary.

    # Example: download to a staging path
    curl -Lo /usr/local/bin/dwaar-new https://releases.dwaar.dev/v1.2.0/dwaar-linux-amd64
    chmod +x /usr/local/bin/dwaar-new
  2. Validate the new binary against the live config before touching the running process.

    /usr/local/bin/dwaar-new validate --config /etc/dwaar/Dwaarfile
  3. Run the upgrade.

    dwaar upgrade --binary /usr/local/bin/dwaar-new --pid-file /run/dwaar/dwaar.pid

    You will see output similar to:

    upgrading dwaar (old PID: 12345)
    starting new process: /usr/local/bin/dwaar-new --upgrade --config /etc/dwaar/Dwaarfile
    new process started (PID: 12399)
    sending SIGQUIT to old process (PID: 12345)
    upgrade complete — old process will drain and exit
  4. Verify (see next section).

  5. If the new process fails to start, the old process continues running unaffected — the SIGQUIT is only sent after the new process passes the liveness poll.

  6. Swap the binary symlink once satisfied.

    ln -sf /usr/local/bin/dwaar-new /usr/local/bin/dwaar
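Steps 2 and 3 above can be chained so that the running instance is never touched when validation fails. The wrapper below is a sketch rather than an official tool: the function name safe_upgrade and the DWAAR environment variable (the path of the currently installed binary) are hypothetical, while the validate and upgrade invocations are the ones shown in the steps.

```shell
# safe_upgrade NEW_BINARY CONFIG PID_FILE
# Hypothetical wrapper: validate the config with the new binary first,
# and only then hand over to "dwaar upgrade".
safe_upgrade() {
    new_bin=$1
    config=$2
    pid_file=$3
    if [ ! -x "$new_bin" ]; then
        echo "not executable: $new_bin" >&2
        return 1
    fi
    # Step 2: validate against the live config before touching anything.
    if ! "$new_bin" validate --config "$config"; then
        echo "validation failed; running instance left untouched" >&2
        return 1
    fi
    # Step 3: run the upgrade with the currently installed binary.
    "${DWAAR:-dwaar}" upgrade --binary "$new_bin" --pid-file "$pid_file"
}
```

With the paths from the steps, the call would be `safe_upgrade /usr/local/bin/dwaar-new /etc/dwaar/Dwaarfile /run/dwaar/dwaar.pid`.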

Confirm the upgrade succeeded before removing the old binary.

Check the new PID is running and the version matches:

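# Confirm the new process is alive (the bracket keeps grep from matching itself)
ps aux | grep '[d]waar'
# Check version (use the new binary's path if the symlink has not been swapped yet)
dwaar version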

Check active routes via the admin API (available immediately after the new process starts):

dwaar routes

Check logs for the new process startup message:

journalctl -u dwaar -n 50
# Look for: "starting dwaar" with the new version

Check that the old process has exited:

# Old PID should be gone
kill -0 <old_pid> 2>&1
# Expected: "No such process"
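Because the old process may spend its whole drain window finishing in-flight requests, a single kill -0 right after the upgrade can report it as still alive. A script can poll instead. The helper below is a minimal sketch: the name wait_for_exit is hypothetical, and it assumes the caller is allowed to signal the old process (same user or root), since kill -0 also fails when permission is denied.

```shell
# wait_for_exit PID TIMEOUT_SECS
# Hypothetical helper: polls once per second until PID is gone.
#   returns 0  the process has exited
#   returns 1  the process is still alive after TIMEOUT_SECS
wait_for_exit() {
    pid=$1
    remaining=$2
    # kill -0 sends no signal; it only checks that the PID still exists.
    while kill -0 "$pid" 2>/dev/null; do
        [ "$remaining" -gt 0 ] || return 1
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}
```

A reasonable timeout is the configured drain_timeout_secs plus a few seconds of slack, for example `wait_for_exit 12345 40` for the default 30 s window.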

Send a test request to confirm traffic is being served:

curl -sv https://yourdomain.example/ -o /dev/null

If the new process crashes immediately after launch (visible in journalctl), the old process is still running and serving traffic — SIGQUIT was not sent. Investigate the startup failure, fix it, and retry.

  • Systemd Service — running Dwaar as a managed systemd unit
  • Docker — rolling updates in container environments
  • Timeouts — configuring drain_timeout_secs and connection timeouts