Observability

Sparkwing tracks job health, failure reasons, and resource usage so you can debug failures fast and right-size containers.

Failure reasons

Every failed job carries a failure_reason in its result. The controller classifies failures automatically -- you never have to grep logs to figure out why a build died.

Reason	What happened	What to do
`oom_killed`	Container exceeded its memory limit and was killed by the kernel (exit 137).	Raise the runner memory limit or reduce the pipeline's memory use; check the resource chart.
`timeout`	Job exceeded its configured execution timeout.	Raise the timeout or optimize the pipeline.
`agent_lost`	Runner stopped heartbeating (crashed, evicted, or lost network).	Check pod events with `kubectl describe pod`; may indicate node pressure or a pipeline bug.
`queue_timeout`	No runner claimed the job within the queue timeout (default 15m).	Ensure runners are up and their advertised `--label` set satisfies the pipeline's `requires:` / node `.Requires()`.
`runner_lease_expired`	The runner holding the node's claim stopped renewing its lease, so the controller reclaimed it.	Check the runner's health; the node is safe to retry.
`verify`	The node's action completed, but its `Verify` postcondition returned an error -- the failure is at the verify stage, not the action.	Inspect the `Verify` assertion and the action's actual output.
`logs_auth`	The runner's log-append calls were rejected (401/403) by the controller, so the run's structured logs are unrecoverable.	Check the runner token's `logs.write` scope; the run fails loud rather than reporting success with no output.

A plain pipeline-level failure (a failed test or command) carries no structured failure_reason -- read the logs.

How detection works

The Kubernetes runner polls its Job/pod status while a node runs. When it sees a terminated container (e.g. OOMKilled, non-zero exit), it fails the node immediately with the specific reason rather than waiting for the heartbeat timeout.

For nodes where the pod disappears entirely (node failure, eviction), the controller's heartbeat sweep catches the missed lease and marks the node agent_lost.

API

The failure reason is available in all job responses:

{
  "result": {
    "success": false,
    "failure_reason": "oom_killed",
    "exit_code": 137,
    "logs": "Container \"runner\" was killed by the kernel OOM killer..."
  }
}

Resource usage metrics

While a node runs, the runner samples its own CPU and memory in-process (reading /proc) roughly every 2 seconds. Samples are stored and charted in the dashboard. No cluster metrics-server is involved.

What's measured

CPU: millicores, derived from the runner process's CPU time.
Memory: resident bytes (RSS).

The dashboard charts the samples over time with peak and average in the header.

API

GET /api/v1/runs/{id}/nodes/{nodeID}/metrics (see api-reference.md) returns the sample points:

{
  "points": [
    { "ts": "2026-04-12T10:00:00Z", "cpu_millicores": 450, "memory_bytes": 536870912 },
    { "ts": "2026-04-12T10:00:02Z", "cpu_millicores": 1200, "memory_bytes": 1073741824 }
  ]
}

Using metrics to right-size containers

Run your pipeline a few times
Open the job detail in the dashboard and expand Resources
Compare peak usage to your pod's configured limits:
- If peak memory is close to the limit → increase the limit or optimize memory usage
- If peak CPU is well below the limit → you can safely lower requests to save cluster resources
- If CPU is consistently at the limit → the pipeline is CPU-bound; increase the limit for faster builds

Dashboard

The dashboard shows failure information at every level:

Home page: failure reason badges in the recent builds table
Pipelines page: failure reason badge in the summary header, plus a prominent banner with contextual help text
Resources section: collapsible CPU/memory charts in the job detail panel (auto-refreshes for running jobs)

Data retention

Finished runs (and their metrics) are kept until you prune them. There is no automatic time-based cleanup; use sparkwing runs prune to delete runs past a threshold or by id (see cli-reference.md).

OpenTelemetry

Every sparkwing service initializes OpenTelemetry and exposes a Prometheus /metrics endpoint. Set OTEL_EXPORTER_OTLP_ENDPOINT to additionally export traces and structured logs via OTLP.

Prometheus /metrics

Always active on every service; scrape it with your Prometheus.

OTLP export

When OTEL_EXPORTER_OTLP_ENDPOINT is set, services export over OTLP/HTTP to that endpoint:

Traces via otlptracehttp (run + HTTP spans).
Logs via otlploghttp (structured logs with trace correlation).

Metrics stay on the Prometheus /metrics endpoint. There is no in-cluster OTEL collector required; point the OTLP endpoint at whatever backend you run (e.g. Tempo for traces, Loki for logs).

Metrics reference

Controller (sparkwing-controller, Prometheus):

Metric	Type	Description
`sparkwing_runs_total`	Counter	Runs that reached a terminal state, by pipeline and status
`sparkwing_run_duration_seconds`	Histogram	End-to-end wall time from create to finish
`sparkwing_nodes_claimed_total`	Counter	Successful node claims
`sparkwing_pending_nodes`	Gauge	Claim-queue depth (ready, unclaimed nodes)
`sparkwing_active_runners`	Gauge	Distinct runners with a non-expired lease in the last 2 minutes
`sparkwing_http_requests_total`	Counter	HTTP requests by route, method, status
`sparkwing_http_request_duration_seconds`	Histogram	HTTP latency by route and method

Cache (sparkwing-cache, OTEL meter):

Metric	Type	Description
`sparkwing.gitcache.archives_served`	Counter	Archive downloads
`sparkwing.gitcache.files_served`	Counter	Single-file downloads
`sparkwing.gitcache.fetch_duration`	Histogram	Background fetch time
`sparkwing.gitcache.cache_hits`	Counter	Binary/dependency cache hits
`sparkwing.gitcache.cache_misses`	Counter	Binary/dependency cache misses

Observability

Failure reasonsSection anchor link

How detection worksSection anchor link

APISection anchor link

Resource usage metricsSection anchor link

What's measuredSection anchor link

APISection anchor link

Using metrics to right-size containersSection anchor link

DashboardSection anchor link

Data retentionSection anchor link

OpenTelemetrySection anchor link

Prometheus /metricsSection anchor link

OTLP exportSection anchor link

Metrics referenceSection anchor link