Runbook
Startup

    wakeplaned serve --db /var/lib/wakeplane/wakeplane.db --addr :8080

Environment variables:

| Variable | Default | Description |
|---|---|---|
| WAKEPLANE_DB | ./wakeplane.db | SQLite database path |
| WAKEPLANE_ADDR | :8080 | Listen address |
| WAKEPLANE_LOG_LEVEL | info | Log level: debug, info, warn, error |
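
As a pre-flight check, the defaults in the table can be mirrored with shell parameter expansion to see which values the current environment implies. This is only a sketch: `effective_config` is a hypothetical helper, not part of Wakeplane, and it assumes that an unset or empty variable falls back to the listed default. It does not account for how flags such as --db interact with the variables.

```shell
# effective_config: print the settings implied by the current environment,
# using the documented defaults. Hypothetical operator helper; it does not
# query the daemon itself.
effective_config() {
  printf 'db=%s addr=%s log_level=%s\n' \
    "${WAKEPLANE_DB:-./wakeplane.db}" \
    "${WAKEPLANE_ADDR:-:8080}" \
    "${WAKEPLANE_LOG_LEVEL:-info}"
}
```

With all three variables unset, `effective_config` prints `db=./wakeplane.db addr=:8080 log_level=info`.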
Confirm the daemon is healthy:

    curl http://localhost:8080/health   # {"status":"ok"}
    curl http://localhost:8080/ready    # {"status":"ok"} once the loops are initialized

Health endpoints

| Endpoint | Probe | Returns ok when |
|---|---|---|
| /health | Liveness | Process is running |
| /ready | Readiness | Database open, planner and dispatcher loops initialized |
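
A deploy script can poll the readiness endpoint before sending traffic. The sketch below assumes the response body contains `"status":"ok"` exactly as in the curl examples; what the endpoint returns while the daemon is not yet ready is not documented here, so any other body or an HTTP error is treated as "not ready". The names `probe` and `wait_ready` are hypothetical.

```shell
# probe URL: succeed only if the URL answers without an HTTP error
# and the body contains "status":"ok".
probe() {
  curl -fsS --max-time 2 "$1" 2>/dev/null | grep -q '"status":"ok"'
}

# wait_ready BASE_URL [ATTEMPTS] [DELAY_SECONDS]
# Poll BASE_URL/ready until it reports ok or the attempts run out.
wait_ready() {
  base=$1
  attempts=${2:-30}
  delay=${3:-1}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if probe "$base/ready"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

For example, `wait_ready http://localhost:8080 30 1 || echo "daemon not ready" >&2` waits up to roughly 30 seconds.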
Shutdown logging
The normal drain sequence emits structured log lines:

    level=info msg="shutting down" phase=run_loop
    level=info msg="run loop exited"
    level=info msg="dispatcher drained" active_workers=0
    level=info msg="store closed"

If shutdown stalls, look for:

    level=warn msg="dispatcher shutdown stalled" active_workers=N

This indicates non-cooperative executors, where N is the number of workers still active. Extend the shutdown timeout or use process supervision to handle termination.
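
One way to let process supervision handle a stalled drain is to send a termination signal, wait for a grace period, and only then force the process down. The sketch below assumes the daemon starts its drain on SIGTERM, which this runbook does not state explicitly; `stop_with_grace` is a hypothetical helper and the grace period is an operator choice.

```shell
# stop_with_grace PID [GRACE_SECONDS]
# Send SIGTERM, wait up to GRACE_SECONDS for the process to exit,
# then fall back to SIGKILL. Returns 0 on a clean exit, 1 if SIGKILL was needed.
stop_with_grace() {
  pid=$1
  grace=${2:-30}
  # If the signal cannot be delivered, the process is already gone.
  kill -TERM "$pid" 2>/dev/null || return 0
  waited=0
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$grace" ]; then
      kill -KILL "$pid" 2>/dev/null
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

A return value of 1 means the grace period elapsed, which is a reasonable point to check the logs for the stalled-shutdown warning.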
Key metrics
Metrics are exposed at /metrics in Prometheus text format.
| Metric | Alert threshold | Action |
|---|---|---|
| wakeplane_due_runs | > 100 sustained | Dispatcher may be stalled or overloaded |
| wakeplane_running_runs | > expected concurrency | Check for stuck runs |
| wakeplane_dead_letter_runs_total | > 0 | Investigate failed runs |
| wakeplane_expired_claims_total | growing | Workers may be crashing during execution |
| wakeplane_planner_tick_duration_seconds | > 1s | Database pressure |
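
Where no Prometheus server is available, a threshold from the table can be spot-checked by hand. The sketch below reads Prometheus text format on stdin and assumes the metric is exported as a plain `name value` sample without labels, which may not hold for every metric; `metric_value` and `exceeds` are hypothetical helper names.

```shell
# metric_value NAME: read Prometheus text format on stdin and print the value
# of the first unlabeled sample called NAME. Comment lines (# HELP, # TYPE)
# never match because their first field is "#".
metric_value() {
  awk -v name="$1" '$1 == name { print $2; exit }'
}

# exceeds VALUE LIMIT: succeed when VALUE is numerically greater than LIMIT.
exceeds() {
  awk -v v="$1" -v l="$2" 'BEGIN { exit !(v > l) }'
}
```

For example, `due=$(curl -fsS http://localhost:8080/metrics | metric_value wakeplane_due_runs)` followed by `exceeds "$due" 100 && echo "due runs above threshold"` checks a single sample. The "sustained" part of the threshold still needs repeated samples over time.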
Common failures
Stuck running runs
Cause: Executor died without completing. Lease expired.
Detection: wakeplane run list --status running shows old runs. wakeplane_expired_claims_total growing.
Recovery: Expired leases are automatically returned to pending by the planner on the next tick.
Stuck claimed runs
Cause: Dispatcher claimed a run but did not start the executor.
Detection: wakeplane run list --status claimed shows old runs.
Recovery: Expired claimed leases are also returned to pending automatically.
Dead letters accumulating
Cause: Runs are exhausting their retry budget.
Detection: wakeplane_dead_letter_runs_total increasing.
Recovery: Investigate run receipts. Fix the underlying target. Optionally re-trigger manually.
Growing due_runs
Cause: Dispatcher is not keeping up with the planner.
Detection: wakeplane_due_runs increasing steadily.
Recovery: Check dispatcher logs for errors. Check database write latency. Consider reducing concurrency or increasing --tick-interval.
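
To tell a steady climb from a short spike, several samples taken some time apart can be compared. The sketch below is a hypothetical helper that succeeds only when every sample is larger than the one before it; how many samples to take and how far apart is an operator choice, not something Wakeplane prescribes.

```shell
# steadily_increasing V1 V2 [V3 ...]
# Succeed when each value is strictly greater than the previous one.
steadily_increasing() {
  [ "$#" -ge 2 ] || return 1
  prev=$1
  shift
  for cur in "$@"; do
    awk -v p="$prev" -v c="$cur" 'BEGIN { exit !(c > p) }' || return 1
    prev=$cur
  done
  return 0
}
```

For example, `steadily_increasing 40 75 130` succeeds, while `steadily_increasing 40 75 60` fails because the last sample dropped.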
db locked errors
Cause: Multiple writers attempting concurrent SQLite writes.
Detection: The message "database is locked" appears in the logs.
Recovery: Ensure only one wakeplaned process is running against the same database file. Wakeplane uses SetMaxOpenConns(1) but multi-process access is unsupported.
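
One way to guard against a second daemon being started by accident is to wrap startup in an advisory file lock with the `flock` utility. This is a sketch, not a Wakeplane feature: `run_exclusive` is a hypothetical helper, the lock file path is an operator choice, and the guard only covers processes started through the same wrapper.

```shell
# run_exclusive LOCKFILE COMMAND [ARGS...]
# Run COMMAND while holding an exclusive lock on LOCKFILE.
# With -n, flock fails immediately instead of waiting if the lock is taken.
run_exclusive() {
  lock=$1
  shift
  flock -n "$lock" "$@"
}
```

For example, `run_exclusive /var/lib/wakeplane/wakeplane.lock wakeplaned serve --db /var/lib/wakeplane/wakeplane.db --addr :8080` refuses to start while another wrapped instance is still running. The lock file name here is illustrative.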
Schedule not firing
Cause: Schedule is paused, spec is invalid, or timezone is wrong.
Detection: Run wakeplane schedule get <name> and check state, next_run_at, and timezone.
Recovery: Correct the schedule spec or timezone. Resume if paused.
SQLite backup

    # Online backup (safe while daemon is running)
    sqlite3 /var/lib/wakeplane/wakeplane.db ".backup /var/lib/wakeplane/wakeplane.bak"

    # Verify
    sqlite3 /var/lib/wakeplane/wakeplane.bak "PRAGMA integrity_check;"

See also
- Reference: API Contract (status endpoint)
- Concepts: Run Ledger (run state transitions)
- Concepts: Policies (retry and misfire behavior)