Observability

Sparkwing tracks job health, failure reasons, and resource usage so you can debug failures fast and right-size containers.

Failure reasons

A job that fails for one of the causes below carries a failure_reason in its result. The controller classifies these failures automatically, so you don't have to grep logs to figure out why a build died.

| Reason | What happened | What to do |
| --- | --- | --- |
| oom_killed | Container exceeded its memory limit and was killed by the kernel (exit 137). | Raise the runner memory limit or reduce the pipeline's memory use; check the resource chart. |
| timeout | Job exceeded its configured execution timeout. | Raise the timeout or optimize the pipeline. |
| agent_lost | Runner stopped heartbeating (crashed, evicted, or lost network). | Check pod events with kubectl describe pod; may indicate node pressure or a pipeline bug. |
| queue_timeout | No runner claimed the job within the queue timeout (default 15m). | Ensure runners are up and their advertised --label set satisfies the pipeline's requires: / node .Requires(). |
| runner_lease_expired | The runner holding the node's claim stopped renewing its lease, so the controller reclaimed it. | Check the runner's health; the node is safe to retry. |
| verify | The node's action completed, but its Verify postcondition returned an error -- the failure is at the verify stage, not the action. | Inspect the Verify assertion and the action's actual output. |
| logs_auth | The runner's log-append calls were rejected (401/403) by the controller, so the run's structured logs are unrecoverable. | Check the runner token's logs.write scope; the run fails loud rather than reporting success with no output. |

A plain pipeline-level failure (a failed test or command) carries no structured failure_reason -- read the logs.

How detection works

The Kubernetes runner polls its Job/pod status while a node runs. When it sees a terminated container (e.g. OOMKilled, non-zero exit), it fails the node immediately with the specific reason rather than waiting for the heartbeat timeout.

For nodes where the pod disappears entirely (node failure, eviction), the controller's heartbeat sweep catches the missed lease and marks the node agent_lost.
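
As a rough illustration of the pod-status check described above, the sketch below classifies one entry of a pod's containerStatuses list. The state.terminated.reason and exitCode fields are standard Kubernetes; the function name, the return shape, and the choice to leave other non-zero exits without a structured reason are assumptions for illustration, not Sparkwing's actual implementation.

```python
from typing import Optional, Tuple


def classify_container(container_status: dict) -> Tuple[bool, Optional[str]]:
    """Return (failed, failure_reason) for one Kubernetes containerStatus dict.

    Hypothetical helper: only the OOMKilled -> oom_killed mapping comes from
    the text above; everything else is an assumption.
    """
    terminated = (container_status.get("state") or {}).get("terminated")
    if terminated is None:
        # Container is still running or waiting: nothing to classify yet.
        return (False, None)
    if terminated.get("reason") == "OOMKilled":
        return (True, "oom_killed")
    if terminated.get("exitCode", 0) != 0:
        # Assumed: a plain non-zero exit fails the node without a structured reason.
        return (True, None)
    return (False, None)
```

A poller would call this on each status it reads and fail the node as soon as the first element is True, instead of waiting for the heartbeat sweep.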

API

The failure reason is available in all job responses:

{
  "result": {
    "success": false,
    "failure_reason": "oom_killed",
    "exit_code": 137,
    "logs": "Container \"runner\" was killed by the kernel OOM killer..."
  }
}
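
A client could turn that response into a one-line summary along these lines. This is a minimal sketch: the field names come from the example above and the hints are shortened from the failure reasons table, but summarize_failure and REMEDIATION are hypothetical names, not part of any Sparkwing client.

```python
import json

# Short hints condensed from the failure reasons table.
REMEDIATION = {
    "oom_killed": "raise the runner memory limit or reduce memory use",
    "timeout": "raise the timeout or optimize the pipeline",
    "agent_lost": "check pod events with kubectl describe pod",
    "queue_timeout": "ensure runners are up and their labels match",
    "runner_lease_expired": "check the runner's health; safe to retry",
    "verify": "inspect the Verify assertion and the action's output",
    "logs_auth": "check the runner token's logs.write scope",
}


def summarize_failure(response_body: str) -> str:
    """Summarize a job response body (JSON text) as a single line."""
    result = json.loads(response_body).get("result", {})
    if result.get("success"):
        return "succeeded"
    reason = result.get("failure_reason")
    if reason is None:
        # Plain pipeline-level failure: no structured reason is attached.
        return "failed with no structured reason: read the logs"
    hint = REMEDIATION.get(reason, "unrecognized reason: see the failure reasons table")
    return f"failed ({reason}): {hint}"
```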

Resource usage metrics

While a node runs, the runner samples its own CPU and memory in-process (reading /proc) roughly every 2 seconds. Samples are stored and charted in the dashboard. No cluster metrics-server is involved.

What's measured

  • CPU: millicores, derived from the runner process's CPU time.
  • Memory: resident bytes (RSS).
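
One common way to derive millicores from process CPU time is to divide the CPU seconds consumed between two samples by the wall-clock seconds elapsed, then multiply by 1000. The sketch below shows that arithmetic; it is an assumption about the general technique, not a description of the runner's exact code, and the function name is hypothetical.

```python
def cpu_millicores(cpu_seconds_prev: float, cpu_seconds_now: float,
                   wall_seconds_elapsed: float) -> int:
    """Convert a delta of cumulative CPU time into millicores.

    1000 millicores means one full core was busy for the whole interval.
    """
    if wall_seconds_elapsed <= 0:
        raise ValueError("wall_seconds_elapsed must be positive")
    cpu_delta = cpu_seconds_now - cpu_seconds_prev
    return round(cpu_delta / wall_seconds_elapsed * 1000)
```

For example, a process that used 1.0 s of CPU time over a 2 s sampling interval averaged half a core, or 500 millicores.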

The dashboard charts the samples over time with peak and average in the header.

API

GET /api/v1/runs/{id}/nodes/{nodeID}/metrics (see api-reference.md) returns the sample points:

{
  "points": [
    { "ts": "2026-04-12T10:00:00Z", "cpu_millicores": 450, "memory_bytes": 536870912 },
    { "ts": "2026-04-12T10:00:02Z", "cpu_millicores": 1200, "memory_bytes": 1073741824 }
  ]
}
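
The peak and average figures shown in the dashboard header can be reproduced from these points. A minimal sketch, assuming the response has already been decoded into a dict with the shape shown above; summarize_points is a hypothetical helper name.

```python
from statistics import mean
from typing import Optional


def summarize_points(response: dict) -> Optional[dict]:
    """Compute peak and average CPU/memory from a metrics response dict."""
    points = response.get("points", [])
    if not points:
        return None  # no samples recorded for this node
    cpu = [p["cpu_millicores"] for p in points]
    mem = [p["memory_bytes"] for p in points]
    return {
        "peak_cpu_millicores": max(cpu),
        "avg_cpu_millicores": mean(cpu),
        "peak_memory_bytes": max(mem),
        "avg_memory_bytes": mean(mem),
    }
```

Applied to the two sample points above, this gives a peak of 1200 millicores and 1073741824 bytes (1 GiB), with averages of 825 millicores and 805306368 bytes.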

Using metrics to right-size containers

  1. Run your pipeline a few times
  2. Open the job detail in the dashboard and expand Resources
  3. Compare peak usage to your pod's configured limits:
    • If peak memory is close to the limit → increase the limit or optimize memory usage
    • If peak CPU is well below the limit → you can safely lower requests to save cluster resources
    • If CPU is consistently at the limit → the pipeline is CPU-bound; increase the limit for faster builds

Dashboard

The dashboard shows failure information at every level:

  • Home page: failure reason badges in the recent builds table
  • Pipelines page: failure reason badge in the summary header, plus a prominent banner with contextual help text
  • Resources section: collapsible CPU/memory charts in the job detail panel (auto-refreshes for running jobs)

Data retention

Finished runs (and their metrics) are kept until you prune them. There is no automatic time-based cleanup; use sparkwing runs prune to delete runs past a threshold or by id (see cli-reference.md).

OpenTelemetry

Every sparkwing service initializes OpenTelemetry and exposes a Prometheus /metrics endpoint. Set OTEL_EXPORTER_OTLP_ENDPOINT to additionally export traces and structured logs via OTLP.

Prometheus /metrics

Always active on every service; scrape it with your Prometheus.

OTLP export

When OTEL_EXPORTER_OTLP_ENDPOINT is set, services export over OTLP/HTTP to that endpoint:

  • Traces via otlptracehttp (run + HTTP spans).
  • Logs via otlploghttp (structured logs with trace correlation).

Metrics stay on the Prometheus /metrics endpoint. There is no in-cluster OTEL collector required; point the OTLP endpoint at whatever backend you run (e.g. Tempo for traces, Loki for logs).

Metrics reference

Controller (sparkwing-controller, Prometheus):

| Metric | Type | Description |
| --- | --- | --- |
| sparkwing_runs_total | Counter | Runs that reached a terminal state, by pipeline and status |
| sparkwing_run_duration_seconds | Histogram | End-to-end wall time from create to finish |
| sparkwing_nodes_claimed_total | Counter | Successful node claims |
| sparkwing_pending_nodes | Gauge | Claim-queue depth (ready, unclaimed nodes) |
| sparkwing_active_runners | Gauge | Distinct runners with a non-expired lease in the last 2 minutes |
| sparkwing_http_requests_total | Counter | HTTP requests by route, method, status |
| sparkwing_http_request_duration_seconds | Histogram | HTTP latency by route and method |
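
For a quick check without a full Prometheus setup, a gauge can be read straight from the text exposition format that /metrics serves. The sketch below handles only unlabeled samples. Whether sparkwing_pending_nodes is actually exposed without labels is an assumption, and the sample text in the usage note is illustrative, not captured output.

```python
from typing import Optional


def read_unlabeled_sample(exposition_text: str, name: str) -> Optional[float]:
    """Return the value of an unlabeled sample from Prometheus text format."""
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        metric, _, rest = line.partition(" ")
        if metric == name and rest:
            # The value is the first field; an optional timestamp may follow.
            return float(rest.split()[0])
    return None
```

Given a scrape body containing the line "sparkwing_pending_nodes 3", read_unlabeled_sample(body, "sparkwing_pending_nodes") returns 3.0. A queue depth that stays above zero is worth comparing against sparkwing_active_runners.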

Cache (sparkwing-cache, OTEL meter):

| Metric | Type | Description |
| --- | --- | --- |
| sparkwing.gitcache.archives_served | Counter | Archive downloads |
| sparkwing.gitcache.files_served | Counter | Single-file downloads |
| sparkwing.gitcache.fetch_duration | Histogram | Background fetch time |
| sparkwing.gitcache.cache_hits | Counter | Binary/dependency cache hits |
| sparkwing.gitcache.cache_misses | Counter | Binary/dependency cache misses |