Runbook

Startup

wakeplaned serve --db /var/lib/wakeplane/wakeplane.db --addr :8080

Environment variables:

Variable	Default	Description
`WAKEPLANE_DB`	`./wakeplane.db`	SQLite database path
`WAKEPLANE_ADDR`	`:8080`	Listen address
`WAKEPLANE_LOG_LEVEL`	`info`	Log level: `debug`, `info`, `warn`, `error`

Confirm the daemon is healthy:

curl http://localhost:8080/health    # {"status":"ok"}
curl http://localhost:8080/ready     # {"status":"ok"} — loops initialized

Health endpoints

Endpoint	Probe	Returns `ok` when
`/health`	Liveness	Process is running
`/ready`	Readiness	Database open, planner and dispatcher loops initialized
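
For scripted startup (deploy hooks, smoke tests), the readiness probe can be polled until it reports `ok`. The following is a minimal sketch, not part of Wakeplane: the function name `wait_ready`, the attempt count, and the one-second interval are assumptions chosen for illustration. It matches on the response body shown above rather than on the HTTP status code, since the status code returned while not ready is not documented here.

```shell
# Hypothetical helper: poll the readiness endpoint until it reports ok.
# $1: readiness URL (defaults to the address used in this runbook)
# $2: maximum number of attempts, one per second (assumed default: 30)
wait_ready() {
  url=${1:-http://localhost:8080/ready}
  attempts=${2:-30}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f: fail on HTTP errors, -sS: quiet but keep errors, --max-time: per-request cap
    body=$(curl -fsS --max-time 2 "$url" 2>/dev/null) || body=""
    case $body in
      *'"status":"ok"'*) return 0 ;;
    esac
    i=$((i + 1))
    sleep 1
  done
  echo "not ready after $attempts attempts: $url" >&2
  return 1
}

# Usage:
#   wait_ready http://localhost:8080/ready 60 || exit 1
```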

Shutdown logging

Normal drain sequence emits structured log lines:

level=info msg="shutting down" phase=run_loop
level=info msg="run loop exited"
level=info msg="dispatcher drained" active_workers=0
level=info msg="store closed"

If shutdown stalls, look for:

level=warn msg="dispatcher shutdown stalled" active_workers=N

This indicates non-cooperative executors: N workers were still active when the shutdown timeout elapsed. Extend the shutdown timeout, or rely on process supervision to terminate the daemon.
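
Where no supervisor is in place, the same "ask, wait, then force" behaviour can be sketched by hand. This is an illustration only: it assumes the daemon begins its drain on SIGTERM (not confirmed in this runbook), and the helper name `stop_with_grace` and its default grace period are hypothetical. Runs interrupted by the forced kill would then be subject to the lease-expiry recovery described under Common failures.

```shell
# Hypothetical helper: request a graceful stop, escalate after a grace period.
# $1: PID of the daemon
# $2: grace period in seconds (assumed default: 30)
# Returns 0 if the process exited on its own, 1 if SIGKILL was needed.
stop_with_grace() {
  pid=$1
  grace=${2:-30}
  # Assumption: SIGTERM triggers the normal drain sequence.
  kill -TERM "$pid" 2>/dev/null || return 0   # process already gone
  i=0
  while [ "$i" -lt "$grace" ]; do
    kill -0 "$pid" 2>/dev/null || return 0    # exited within the grace period
    sleep 1
    i=$((i + 1))
  done
  kill -0 "$pid" 2>/dev/null || return 0      # final check before escalating
  echo "pid $pid still alive after ${grace}s; sending SIGKILL" >&2
  kill -KILL "$pid" 2>/dev/null
  return 1
}

# Usage:
#   stop_with_grace "$(pgrep -x wakeplaned)" 60
```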

Key metrics

Exposed at /metrics in Prometheus text format.

Metric	Alert threshold	Likely cause / action
`wakeplane_due_runs`	> 100 sustained	Dispatcher may be stalled or overloaded
`wakeplane_running_runs`	> expected concurrency	Check for stuck runs
`wakeplane_dead_letter_runs_total`	> 0	Investigate failed runs
`wakeplane_expired_claims_total`	growing	Workers may be crashing during execution
`wakeplane_planner_tick_duration_seconds`	> 1s	Database pressure
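
For a quick manual check without a Prometheus server, two of the thresholds above can be evaluated against a single scrape. The sketch below makes some assumptions: both metrics are exposed as plain unlabelled samples (`name value`), and the function name `check_metrics` is hypothetical. A single scrape cannot tell whether a value is "sustained", so treat the output as a prompt to look closer rather than as an alert rule.

```shell
# Hypothetical helper: read Prometheus text format on stdin and flag
# values above the thresholds from the table.
# $1: threshold for wakeplane_due_runs (default 100, as in the table)
# Exit status: 0 if nothing is flagged, 1 otherwise.
check_metrics() {
  max_due=${1:-100}
  awk -v max_due="$max_due" '
    BEGIN { bad = 0 }
    /^#/ { next }                      # skip HELP/TYPE comment lines
    $1 == "wakeplane_due_runs" && $2 + 0 > max_due + 0 {
      print "ALERT wakeplane_due_runs=" $2 " (threshold " max_due ")"
      bad = 1
    }
    $1 == "wakeplane_dead_letter_runs_total" && $2 + 0 > 0 {
      print "ALERT wakeplane_dead_letter_runs_total=" $2
      bad = 1
    }
    END { exit bad }
  '
}

# Usage:
#   curl -fsS http://localhost:8080/metrics | check_metrics 100
```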

Common failures

Stuck `running` runs

Cause: The executor died without completing the run, and its lease expired.
Detection: `wakeplane run list --status running` shows old runs, and `wakeplane_expired_claims_total` is growing.
Recovery: Expired leases are automatically returned to pending by the planner on the next tick.

Stuck `claimed` runs

Cause: Dispatcher claimed a run but did not start the executor.
Detection: `wakeplane run list --status claimed` shows old runs.
Recovery: Expired claimed leases are also returned to pending automatically.

Dead letters accumulating

Cause: Runs are exhausting their retry budget.
Detection: `wakeplane_dead_letter_runs_total` is increasing.
Recovery: Investigate run receipts. Fix the underlying target. Optionally re-trigger manually.

Growing `due_runs`

Cause: Dispatcher is not keeping up with the planner.
Detection: `wakeplane_due_runs` is increasing steadily.
Recovery: Check the dispatcher logs for errors and check database write latency. Consider reducing concurrency or increasing `--tick-interval`.
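
"Increasing steadily" can be checked by hand by taking a few samples of the gauge some seconds apart. The sketch below is one way to do that; the helper name `is_growing`, the sample count, and the interval are assumptions, and the usage lines assume the metric is exposed as an unlabelled sample.

```shell
# Hypothetical helper: read one numeric sample per line on stdin and
# succeed only if there are at least two samples and each one is
# strictly greater than the one before it.
is_growing() {
  awk '
    NR > 1 && $1 + 0 <= prev { flat = 1 }   # a sample failed to increase
    { prev = $1 + 0 }
    END { exit (NR < 2 || flat) ? 1 : 0 }
  '
}

# Usage (five samples, ten seconds apart):
#   for n in 1 2 3 4 5; do
#     curl -fsS http://localhost:8080/metrics |
#       awk '$1 == "wakeplane_due_runs" { print $2 }'
#     sleep 10
#   done | is_growing && echo "due_runs is growing"
```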

`db locked` errors

Cause: Multiple writers attempting concurrent SQLite writes.
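Detection: `database is locked` appears in the logs.
Recovery: Ensure only one wakeplaned process is running against the same database file. Wakeplane uses SetMaxOpenConns(1), which limits a single process to one database connection, but it does not coordinate between processes; multi-process access is unsupported.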
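
A quick way to spot a second daemon is to count processes by name. The sketch below is an illustration with a hypothetical helper name, `check_single_writer`. Matching by process name only shows how many daemons are running on the host, not which database file each has open, so confirm the `--db` path of each one before stopping anything.

```shell
# Hypothetical helper: fail if more than one PID is passed in.
# $1: whitespace-separated list of PIDs, e.g. the output of pgrep
check_single_writer() {
  # Intentionally unquoted so the list is split into one argument per PID.
  set -- $1
  if [ "$#" -gt 1 ]; then
    echo "multiple wakeplaned processes: $*" >&2
    return 1
  fi
  return 0
}

# Usage:
#   check_single_writer "$(pgrep -x wakeplaned)" || echo "stop the extra process"
```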

Schedule not firing

Cause: Schedule is paused, spec is invalid, or timezone is wrong.
Detection: Run `wakeplane schedule get <name>` and check state, next_run_at, and timezone.
Recovery: Correct the schedule spec or timezone. Resume if paused.

SQLite backup

# Online backup (safe while daemon is running)
sqlite3 /var/lib/wakeplane/wakeplane.db ".backup /var/lib/wakeplane/wakeplane.bak"

# Verify
sqlite3 /var/lib/wakeplane/wakeplane.bak "PRAGMA integrity_check;"
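
# The two steps above can be combined so that a backup only replaces the
# previous one after it passes the integrity check. This is a sketch: the
# function name backup_db and the temporary-file suffix are assumptions,
# and paths containing single quotes are not handled.

```shell
# Hypothetical helper: online backup to a temporary file, verify, then
# move into place. The previous backup is left untouched on failure.
# $1: source database path
# $2: destination backup path
backup_db() {
  src=$1
  dest=$2
  tmp="$dest.tmp"
  rm -f "$tmp"
  sqlite3 "$src" ".backup '$tmp'" || return 1
  # An absent or empty file would still report "ok", so check it first.
  [ -s "$tmp" ] || { echo "backup file missing or empty: $tmp" >&2; return 1; }
  result=$(sqlite3 "$tmp" "PRAGMA integrity_check;") || return 1
  if [ "$result" = "ok" ]; then
    mv "$tmp" "$dest"
  else
    echo "integrity check failed: $result" >&2
    rm -f "$tmp"
    return 1
  fi
}

# Usage:
#   backup_db /var/lib/wakeplane/wakeplane.db /var/lib/wakeplane/wakeplane.bak
```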