Zero-Downtime Upgrades
Replace a running Dwaar binary without dropping a single connection. Dwaar uses Pingora’s file-descriptor transfer mechanism: the new process inherits the listening sockets from the old one, starts accepting requests immediately, and the old process drains its in-flight connections before exiting.
This works at the OS level — no connection is reset, no client sees a TCP error, no request is lost.
How It Works
The key property is that both processes share the listening sockets briefly during the handover. The kernel queues incoming connections on the shared socket; the new process services them. No SO_REUSEPORT race, no listen gap.
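The shared-socket property can be illustrated inside a single process. This is only a sketch under stated assumptions: it uses `TcpListener::try_clone` from the Rust standard library to obtain a second descriptor for the same socket, whereas Pingora passes descriptors between two separate processes, which is not shown here. The function name `handover_demo` is hypothetical.

```rust
use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};

/// Hypothetical demo: a connection queued while the "old" descriptor is
/// still open is served through the "new" descriptor after the old closes.
pub fn handover_demo() -> std::io::Result<String> {
    // "Old process": owns the listening socket (port chosen by the OS).
    let old = TcpListener::bind("127.0.0.1:0")?;
    let addr = old.local_addr()?;

    // "New process": a second descriptor that refers to the same socket.
    let new = old.try_clone()?;

    // A client connects during the handover; the kernel queues it on the
    // shared socket even though nobody has called accept() yet.
    let mut client = TcpStream::connect(addr)?;
    client.write_all(b"ping")?;

    // The old side closes its descriptor. The socket stays open because
    // the new descriptor still refers to it.
    drop(old);

    // The new side accepts the queued connection and reads the request.
    let (mut conn, _peer) = new.accept()?;
    let mut buf = [0u8; 4];
    conn.read_exact(&mut buf)?;
    Ok(String::from_utf8_lossy(&buf).into_owned())
}
```

Calling `handover_demo()` returns the bytes the client sent, which shows that the queued connection survived the old descriptor being closed.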
CLI Commands
Trigger an upgrade with the `upgrade` subcommand:

    # Upgrade using the current binary (replaces itself)
    dwaar upgrade

    # Upgrade to a specific new binary
    dwaar upgrade --binary /usr/local/bin/dwaar-1.2.0

    # Specify a non-default PID file location
    dwaar upgrade --pid-file /run/dwaar/dwaar.pid

Pass `--upgrade` as a flag when you want to start a process that takes over from a running instance (this is what `dwaar upgrade` does internally):

    dwaar --upgrade --config /etc/dwaar/Dwaarfile

| Flag / Subcommand | Purpose |
|---|---|
| `dwaar upgrade` | Orchestrates the full upgrade: starts the new process, sends SIGQUIT to the old one |
| `--upgrade` | Tells the new process to inherit listening FDs from the old one |
| `--binary PATH` | Path to the new binary (default: current executable) |
| `--pid-file PATH` | Path to the PID file of the running instance (default: `/tmp/dwaar.pid`) |
PID File
Dwaar writes a PID file when running in daemon mode (`--daemon`). The `upgrade` subcommand reads this file to find the old process.

Default location: `/tmp/dwaar.pid`

Configure a production path via the unit file or startup flags:

    dwaar --daemon --config /etc/dwaar/Dwaarfile
    # PID written to /tmp/dwaar.pid by default

When using systemd with `Type=simple` (no `--daemon`), set `PIDFile` in the unit file and use `$MAINPID`. systemd tracks the PID itself, so you can pass it explicitly:

    # In ExecStart, write the PID manually if not using --daemon
    ExecStart=/usr/local/bin/dwaar --config /etc/dwaar/Dwaarfile

For daemon mode in production:

    ExecStart=/usr/local/bin/dwaar --daemon --config /etc/dwaar/Dwaarfile
    PIDFile=/run/dwaar/dwaar.pid

The `upgrade` subcommand verifies the PID file before trusting it: it checks that the file is owned by the current user and is not world-writable, which prevents privilege escalation via a tampered PID file.
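The two checks described above (owner matches, not world-writable) might look roughly like the following on Unix. This is a sketch, not Dwaar's actual code: the function names are hypothetical, and the caller is assumed to supply the current user's UID, since the standard library has no portable way to query it.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::MetadataExt;
use std::path::Path;

/// Hypothetical helper: true when the file is owned by `expected_uid`
/// and the "other" write bit (0o002) is not set.
pub fn pid_file_is_trustworthy(path: &Path, expected_uid: u32) -> io::Result<bool> {
    let meta = fs::metadata(path)?;
    let owned_by_user = meta.uid() == expected_uid;
    let world_writable = meta.mode() & 0o002 != 0;
    Ok(owned_by_user && !world_writable)
}

/// Hypothetical helper: reads the PID only after the checks pass.
/// `None` means the file failed a check or did not contain a number.
pub fn read_trusted_pid(path: &Path, expected_uid: u32) -> io::Result<Option<u32>> {
    if !pid_file_is_trustworthy(path, expected_uid)? {
        return Ok(None);
    }
    let text = fs::read_to_string(path)?;
    Ok(text.trim().parse::<u32>().ok())
}
```

Refusing to read the PID at all when a check fails keeps a tampered file from ever steering which process receives the signal.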
Connection Draining
After receiving SIGQUIT, the old process enters a drain phase:

- Stops calling `accept()` on all listening sockets.
- Waits for all in-flight HTTP requests to complete.
- Exits once the drain window closes or all connections finish, whichever comes first.

The drain window is controlled by `drain_timeout_secs` in your Dwaarfile:

    options {
        drain_timeout_secs 30
    }

| Value | Behaviour |
|---|---|
| Not set | Default 30 seconds |
| `0` | Drain immediately (may cut active long-poll or streaming responses) |
| `120` | Wait up to 2 minutes for long-running requests to finish |
Pingora also applies its own grace_period_seconds (5 s) and graceful_shutdown_timeout_seconds (5 s) on top of the drain window. In practice, connections that complete within drain_timeout_secs exit cleanly; connections still open at the deadline are closed.
Tune drain_timeout_secs to match your longest expected request duration. For APIs with short timeouts, 30 s is sufficient. For file uploads or long-poll endpoints, increase to 60–120 s.
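The exit rule for the drain phase (all connections finished or the window closed, whichever comes first) can be sketched as a small polling loop. This is an illustration only: the in-flight counter, the `drain` function, and the `DrainOutcome` type are hypothetical, and Pingora's own grace periods are not modelled.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical result type for the sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum DrainOutcome {
    /// Every in-flight request completed before the window closed.
    AllFinished,
    /// The window closed with requests still open; they would be cut.
    DeadlineReached { still_open: usize },
}

/// Waits until `in_flight` drops to zero or `drain_timeout` elapses,
/// checking every `poll`. A zero timeout returns on the first check.
pub fn drain(in_flight: &AtomicUsize, drain_timeout: Duration, poll: Duration) -> DrainOutcome {
    let deadline = Instant::now() + drain_timeout;
    loop {
        let open = in_flight.load(Ordering::SeqCst);
        if open == 0 {
            return DrainOutcome::AllFinished;
        }
        if Instant::now() >= deadline {
            return DrainOutcome::DeadlineReached { still_open: open };
        }
        sleep(poll);
    }
}
```

With a timeout of zero and requests still open, the loop reports `DeadlineReached` straight away, which mirrors the table's warning that `0` may cut active long-poll or streaming responses.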
Supervisor Readiness Probe
When Dwaar runs in multi-worker mode, a supervisor process forks child workers and restarts them on crash or reload. As of 0.2.2 the supervisor no longer retires an old worker until the new child has proven it can serve traffic.
What is probed. The supervisor connects to the worker’s admin endpoint — the Unix domain socket path if --admin-socket was passed, otherwise TCP 127.0.0.1:6190 (the always-on admin listener on worker 0). A successful connect() is proof that the child bound its listeners.
Cadence. Poll every 50 ms, capped at a 10 s deadline. The supervisor uses blocking stdlib sockets (std::net::TcpStream, std::os::unix::net::UnixStream) rather than Tokio — the supervisor loop runs before Pingora’s runtime is spun up, so there is no async executor available and none is needed.
Failure modes.
| Condition | Supervisor action |
|---|---|
| `connect()` succeeds before the deadline | Restart is declared successful. Old worker receives SIGQUIT. |
| `waitpid(WNOHANG)` reports the child exited before readiness | `ChildExited`: the old worker is left running and serving, the restart is aborted. Check logs for the child’s panic. |
| 10 s deadline elapses with no successful connect | `Timeout`: the new worker is killed, the old one keeps serving. |
This closes a race that existed before: a worker whose constructor panicked after fork() but before bind could leave the supervisor thinking the child was alive. The waitpid(WNOHANG) arm catches that case cleanly.
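A probe loop with the three outcomes in the table might look like the sketch below. It is not the supervisor's actual code: it covers only the TCP case, the names `wait_until_ready` and `ProbeOutcome` are hypothetical, and the child-exit check is passed in as a closure standing in for `waitpid(WNOHANG)` (for a child spawned through `std::process`, `Child::try_wait` plays that role).

```rust
use std::net::{SocketAddr, TcpStream};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical outcome type mirroring the table's three rows.
#[derive(Debug, PartialEq, Eq)]
pub enum ProbeOutcome {
    Ready,
    ChildExited,
    Timeout,
}

/// Polls `addr` with blocking stdlib sockets until a connect succeeds,
/// the child is reported dead, or `deadline` elapses.
/// `interval` must be non-zero because `connect_timeout` rejects zero.
pub fn wait_until_ready(
    addr: SocketAddr,
    interval: Duration,
    deadline: Duration,
    mut child_exited: impl FnMut() -> bool,
) -> ProbeOutcome {
    let start = Instant::now();
    loop {
        // Stand-in for waitpid(WNOHANG): give up early if the child died.
        if child_exited() {
            return ProbeOutcome::ChildExited;
        }
        // A successful connect is taken as proof the listener is bound.
        if TcpStream::connect_timeout(&addr, interval).is_ok() {
            return ProbeOutcome::Ready;
        }
        if start.elapsed() >= deadline {
            return ProbeOutcome::Timeout;
        }
        sleep(interval);
    }
}
```

With the cadence described above, a caller would pass `Duration::from_millis(50)` as the interval and `Duration::from_secs(10)` as the deadline, and send SIGQUIT to the old worker only on `ProbeOutcome::Ready`.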
Shutdown flag ordering (0.2.3)
The supervisor’s SHUTTING_DOWN flag is set by the SIGTERM / SIGINT signal handler and read by the supervisor loop to decide whether to restart a dying child. It is now loaded and stored with Ordering::SeqCst on both sides of the exchange.
Relaxed or acquire/release orderings were not strictly wrong for a single atomic. SeqCst was chosen because, under the C11 memory model, it is the ordering that guarantees a signal handler on one thread and a normal load on the supervisor loop thread observe the flag in a consistent global order. This is especially relevant under POSIX signal-safety rules, where the signal can interrupt the loop mid-iteration. Without SeqCst, a signal arriving between the flag load and the fork() call could leave a fresh child running after the supervisor had already decided to shut down.
No operator-facing change. Shutdown semantics are unchanged; the fix is a memory-ordering correctness patch that closes a rare race observed under synthetic stress testing.
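The pattern described here, a flag stored by the signal handler and loaded by the supervisor loop with the same ordering on both sides, can be sketched as follows. Only the flag name SHUTTING_DOWN comes from the text above; the two function names are hypothetical, and signal-handler registration is left out.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Process-wide shutdown flag, shared by the handler and the loop.
static SHUTTING_DOWN: AtomicBool = AtomicBool::new(false);

/// What a SIGTERM / SIGINT handler would do (hypothetical name).
/// A single atomic store is safe to perform from a signal handler.
pub fn request_shutdown() {
    SHUTTING_DOWN.store(true, Ordering::SeqCst);
}

/// What the supervisor loop would ask before forking a replacement
/// for a dying child (hypothetical name).
pub fn should_restart_child() -> bool {
    !SHUTTING_DOWN.load(Ordering::SeqCst)
}
```

Using the same ordering for the store and the load keeps the two sides of the exchange symmetrical, which is the property the 0.2.3 change is described as restoring.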
Step-by-Step
Follow these steps to perform a production upgrade:

- Build or download the new binary.

      # Example: download to a staging path
      curl -Lo /usr/local/bin/dwaar-new https://releases.dwaar.dev/v1.2.0/dwaar-linux-amd64
      chmod +x /usr/local/bin/dwaar-new

- Validate the new binary against the live config before touching the running process.

      /usr/local/bin/dwaar-new validate --config /etc/dwaar/Dwaarfile

- Run the upgrade.

      dwaar upgrade --binary /usr/local/bin/dwaar-new --pid-file /run/dwaar/dwaar.pid

  You will see output similar to:

      upgrading dwaar (old PID: 12345)
      starting new process: /usr/local/bin/dwaar-new --upgrade --config /etc/dwaar/Dwaarfile
      new process started (PID: 12399)
      sending SIGQUIT to old process (PID: 12345)
      upgrade complete — old process will drain and exit

- Verify (see next section).
- If the new process fails to start, the old process continues running unaffected: SIGQUIT is only sent after the new process passes the liveness poll.
- Swap the binary symlink once satisfied.

      ln -sf /usr/local/bin/dwaar-new /usr/local/bin/dwaar
Verification
Confirm the upgrade succeeded before removing the old binary.

Check the new PID is running and the version matches:

    # Confirm new process is alive
    ps aux | grep dwaar

    # Check version
    dwaar version

Check active routes via the admin API (available immediately after the new process starts):

    dwaar routes

Check logs for the new process startup message:

    journalctl -u dwaar -n 50
    # Look for: "starting dwaar" with the new version

Check that the old process has exited:

    # Old PID should be gone
    kill -0 <old_pid> 2>&1
    # Expected: "No such process"

Send a test request to confirm traffic is being served:

    curl -sv https://yourdomain.example/ -o /dev/null

If the new process crashes immediately after launch (visible in journalctl), the old process is still running and serving traffic, because SIGQUIT was not sent. Investigate the startup failure, fix it, and retry.
Related
- Systemd Service: running Dwaar as a managed systemd unit
- Docker: rolling updates in container environments
- Timeouts: configuring `drain_timeout_secs` and connection timeouts