Cache (Gitcache)

sparkwing-cache is sparkwing's in-cluster git cache, blob store, and package proxy. It mirrors repositories from GitHub, serves git clones over HTTP, stores uploaded code tarballs, caches package registry responses, and keeps itself fresh with a background fetch loop.

The cache is read-only for git - pipelines clone from it but push directly to GitHub. This eliminates a class of divergence bugs where the cache's bare repos would drift from upstream.

Architecture

                   ┌─────────────┐
                   │   GitHub    │
                   └──────┬──────┘
                          │ fetch (background, every 30s)
                   ┌──────▼──────┐
 sparkwing CLI ───►│   cache     │◄──── runner (clone + pkg proxy)
 (eager refresh)   │  (read-only │
                   │   + blobs   │
                   │   + proxy)  │
                   └─────────────┘

 runner ──── push gitops ────► GitHub (direct, via GITHUB_TOKEN PAT)

Reads (clone, fetch, file, archive) go through the cache - fast, in-cluster, no GitHub rate limits.

Writes (gitops deploy push) go directly to GitHub via HTTPS + PAT. Runners have GITHUB_TOKEN from the github-config k8s secret.

Repo Registration

Repos are registered by name so pipelines can clone them as http://gitcache/git/<name> without knowing the full URL.

Set GITCACHE_REPOS env var on the cache deployment:

env:
  - name: GITCACHE_REPOS
    value: "gitops=git@github.com:user/gitops.git,app=git@github.com:user/app.git"

On startup, the cache registers the name-to-URL mappings. Repos are cloned on demand when first requested (e.g. via /archive or /upload). If the PVC is wiped, repos are re-cloned automatically on next access.
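
The GITCACHE_REPOS format can be illustrated with a small shell sketch that splits the value into name/URL pairs. The function name parse_gitcache_repos is hypothetical and this is not the cache's actual parser; it only shows how the comma-separated name=url format decomposes.

```shell
# Hypothetical helper, not part of sparkwing: prints one "name url" line
# per entry of a GITCACHE_REPOS-style value.
parse_gitcache_repos() {
  printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r pair; do
    case $pair in
      *=*) ;;            # well-formed entry, handled below
      *)   continue ;;   # skip empty or malformed entries
    esac
    name=${pair%%=*}     # everything before the first "="
    url=${pair#*=}       # everything after the first "="
    printf '%s %s\n' "$name" "$url"
  done
}
```

Calling parse_gitcache_repos "$GITCACHE_REPOS" with the value shown above prints one line per repo, for example "gitops git@github.com:user/gitops.git". Splitting on the first "=" only keeps URLs intact even if they contain "=" themselves.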

Manual registration

curl -X POST "http://sparkwing-cache:8090/git/register?name=gitops&repo=git@github.com:user/repo.git"

Seeding (no SSH required)

If the cache doesn't have SSH access, seed from a machine that does:

git clone --bare git@github.com:user/repo.git /tmp/repo-seed
cd /tmp/repo-seed && git bundle create /tmp/repo.bundle --all
curl -X POST "http://gitcache:8090/sync/seed?repo=git@github.com:user/repo.git" \
  --data-binary @/tmp/repo.bundle

Operator Discovery

Some operator flows talk to the cache pod directly over HTTP: the eager refresh on sparkwing pipeline trigger --profile <controller-profile>, and the profile health probe. They discover the cache pod's URL from the controller, so no per-profile configuration is required on the operator side.

Wire it up on the controller deployment:

env:
  - name: CACHE_POD_URL
    value: "https://cache-sparkwing.example.dev"

(Or pass --cache-pod-url=https://cache-sparkwing.example.dev on the controller's command line.) The controller announces this URL via GET /api/v1/services; operator CLIs fetch it once per session and cache it in-process.

If unset, the announce endpoint returns 404, and operator flows that need the cache pod (eager refresh, health probe) fail loudly with a clear "controller announced no cache pod URL" message.
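
As a rough illustration of the discovery step, the sketch below queries the announce endpoint and distinguishes the 404 case described above. The function names and the controller URL argument are assumptions. The response body format is not specified here, so it is printed unparsed, and any authentication the controller may require is omitted.

```shell
# Hypothetical helper: map the announce endpoint's HTTP status to an outcome.
#   interpret_announce_status <status> <body-file>
interpret_announce_status() {
  case $1 in
    200) cat "$2" ;;
    404) echo "controller announced no cache pod URL" >&2; return 1 ;;
    *)   echo "unexpected status $1 from announce endpoint" >&2; return 1 ;;
  esac
}

# Hypothetical helper: ask the controller which services it announces.
#   discover_cache_pod <controller-url>
discover_cache_pod() {
  body_file=$(mktemp)
  # -s: quiet, -o: write the body to a file, -w: print only the status code
  status=$(curl -s -o "$body_file" -w '%{http_code}' "$1/api/v1/services")
  interpret_announce_status "$status" "$body_file"
  rc=$?
  rm -f "$body_file"
  return $rc
}
```

Keeping the status handling in its own function makes the failure path easy to exercise without a live controller.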

Background Fetch

The cache periodically fetches upstream for all registered bare repos (default: every 30 seconds, configurable via FETCH_INTERVAL env var).

This keeps repos fresh so that:

  • Runner clones see recent commits without cold-start fetches
  • Ancestor negotiation for incremental uploads succeeds more often
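
A background fetch loop of this kind can be sketched in shell as follows. This is an illustration rather than the cache's implementation: the function name fetch_all, the remote name origin, and the explicit refspecs are assumptions. The refspecs are spelled out because a repository created with git clone --bare has no default fetch refspec, so a plain git fetch would not update its branches.

```shell
# Hypothetical sketch: fetch every bare repo found directly under a directory.
fetch_all() {
  repos_dir=$1
  for repo in "$repos_dir"/*/; do
    [ -d "$repo" ] || continue   # the glob matched nothing
    # Update branches and tags in place; one failing repo does not stop the pass.
    git -C "$repo" fetch --prune --quiet origin \
      '+refs/heads/*:refs/heads/*' '+refs/tags/*:refs/tags/*' \
      || echo "fetch failed: $repo" >&2
  done
  return 0
}

# Run a pass every 30 seconds (the documented default interval):
#   while :; do fetch_all /data/repos; sleep 30; done
```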

Code delivery on remote triggers

sparkwing pipeline trigger <pipeline> --profile prod triggers by commit SHA: the CLI sends the branch + SHA to the controller, and the runner clones that SHA from the cache. This leaves a race in git push && sparkwing pipeline trigger, where the cache may not yet have mirrored the just-pushed commit. To close it, the CLI fires a best-effort eager refresh of the repo (POST /git/refresh) before returning; the runner also retries on a stale SHA.

sparkwing CLI -> controller /api/v1/triggers (branch + SHA)
sparkwing CLI -> cache POST /git/refresh    (eager mirror of the pushed SHA)
runner        -> cache /git/<name>          (clone at SHA)

The cache also exposes tarball-upload and ancestor-negotiation endpoints (/upload, /uploads/<id>, /sync/negotiate) for code-sync flows; see the API table below.
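
The refresh-then-retry idea can be sketched in shell as follows. The retry and has_commit helpers are hypothetical, and the real CLI and runner perform these steps internally; the sketch only shows how an eager refresh combined with a bounded retry covers the window in which a just-pushed SHA is not yet visible.

```shell
# Hypothetical helper: run a command up to <attempts> times, sleeping
# <delay> seconds between failed attempts.
#   retry <attempts> <delay> <command> [args...]
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while :; do
    "$@" && return 0
    [ "$n" -ge "$attempts" ] && return 1
    n=$((n + 1))
    sleep "$delay"
  done
}

# Hypothetical helper: fetch in an existing checkout, then succeed only if
# the given commit is present locally.
has_commit() {
  git -C "$1" fetch --quiet origin && git -C "$1" cat-file -e "$2^{commit}"
}

# Example (assumes a repo registered as "app", an in-cluster cache reachable
# as http://gitcache, and a checkout in ./app cloned from the cache):
#   curl -fsS -X POST "http://gitcache/git/refresh?name=app"
#   retry 5 2 has_commit ./app "$SHA"
```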

GitOps Deployment Flow

1. Runner builds Docker image from source
2. Runner pushes image to a registry (ECR, GCR, Docker Hub, etc.)
3. Runner clones the gitops repo from the cache (read cache)
4. Runner updates kustomization.yaml with new image tag
5. Runner pushes the gitops repo directly to GitHub (HTTPS + PAT)
6. ArgoCD detects change, syncs cluster

The runner uses GITHUB_TOKEN (from github-config k8s secret) to authenticate the push. The PAT needs write access to the gitops repo.
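
Steps 3 to 5 might look like the following in shell. This is a sketch under several assumptions: the gitops repo is registered as gitops, its kustomization.yaml sets the tag through an images entry with a newTag field, the target branch is main, and git user.name and user.email are already configured on the runner. The helper names are hypothetical.

```shell
# Hypothetical helper: set every "newTag:" value in a kustomization file.
#   set_image_tag <file> <tag>
set_image_tag() {
  file=$1
  tag=$2
  tmp=$(mktemp)
  # Keep the indentation and key, replace only the value.
  sed "s|^\([[:space:]]*newTag:\).*|\1 $tag|" "$file" > "$tmp" \
    && cat "$tmp" > "$file"
  rc=$?
  rm -f "$tmp"
  return $rc
}

# Hypothetical flow for steps 3-5; defined here but not invoked.
deploy_tag() {
  new_tag=$1
  # Step 3: read through the cache.
  git clone --quiet "http://gitcache/git/gitops" gitops || return 1
  # Step 4: point the manifest at the new image tag.
  set_image_tag gitops/kustomization.yaml "$new_tag" || return 1
  git -C gitops commit --quiet -am "deploy $new_tag" || return 1
  # Step 5: push straight to GitHub over HTTPS, authenticating with the PAT.
  git -C gitops push --quiet \
    "https://x-access-token:${GITHUB_TOKEN}@github.com/user/gitops.git" HEAD:main
}
```

Embedding the token in the push URL keeps the sketch short, but it can expose the token in process listings; a credential helper is a safer choice outside of an example.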

Auth

The cache is exposed externally via ingress at your dashboard host's cache- subdomain. Write endpoints (/upload, /sync/negotiate, /sync/seed) require a bearer token:

Authorization: Bearer <SPARKWING_API_TOKEN>

In-cluster requests (from controller, runners) skip auth - they reach the cache via the k8s Service without the X-Forwarded-For header that the ingress sets.
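
For requests that arrive through the ingress, a write call needs the bearer header. The following is a minimal sketch, assuming the token is exported as SPARKWING_API_TOKEN on the calling machine and reusing the example host from the Operator Discovery section; the helper names are hypothetical.

```shell
# Hypothetical helper: print the Authorization header, or fail if no token is set.
auth_header() {
  if [ -z "${SPARKWING_API_TOKEN:-}" ]; then
    echo "SPARKWING_API_TOKEN is not set" >&2
    return 1
  fi
  printf 'Authorization: Bearer %s' "$SPARKWING_API_TOKEN"
}

# Hypothetical helper: seed a repo from a git bundle through the ingress.
#   seed_repo <cache-url> <repo-url> <bundle-file>
seed_repo() {
  header=$(auth_header) || return 1
  curl -fsS -X POST -H "$header" --data-binary "@$3" "$1/sync/seed?repo=$2"
}

# Example:
#   seed_repo https://cache-sparkwing.example.dev \
#     git@github.com:user/repo.git /tmp/repo.bundle
```

Failing before the request is sent when the token is missing gives a clearer error than an unauthorized response from the ingress.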

API Endpoints

Git Protocol (read-only)

Method  Endpoint                                       Description
POST    /git/register?name=X&repo=Y                    Register a repo name
GET     /git/<name>/info/refs?service=git-upload-pack  Clone/fetch discovery
POST    /git/<name>/git-upload-pack                    Clone/fetch data
POST    /git/<name>/git-receive-pack                   Returns 403 (read-only)
POST    /git/refresh?name=X (or ?repo=Y)               Synchronous fetch of one bare repo (eager refresh)

Archives & Files

Method  Endpoint                                   Description
GET     /archive?repo=X&branch=Y                   Download repo as tar.gz
GET     /file?repo=X&branch=Y&path=Z               Get a single file
GET     /tree-hash?repo=X&branch=Y&path=Z          Content-addressable hash
GET     /branch-contains?repo=X&branch=Y&commit=Z  Check if commit is on branch

Uploads (Code Sync)

Method  Endpoint               Description
POST    /upload                Upload a tarball (auth required)
POST    /upload?repo=X&base=Y  Incremental upload on base commit
GET     /uploads/<id>          Download uploaded tarball
POST    /sync/negotiate        Find common ancestor (auth required)
POST    /sync/seed?repo=X      Seed repo from git bundle (auth required)

Artifacts

Method  Endpoint                   Description
POST    /artifacts/<jobID>?path=X  Upload artifact
GET     /artifacts/<jobID>         List artifacts
GET     /artifacts/<jobID>?glob=X  Download matching artifacts

Binary & Dependency Cache

Method  Endpoint      Description
GET     /bin/<name>   Download cached binary
PUT     /bin/<name>   Upload binary to cache
GET     /cache/<key>  Download cached dependency archive
HEAD    /cache/<key>  Check if cache entry exists
PUT     /cache/<key>  Upload dependency archive to cache
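
A check-then-fetch pattern against the /cache endpoints might look like this. The key scheme (SHA-256 of a lockfile) and the helper names are assumptions made for illustration; nothing here prescribes how keys are derived. CACHE_URL is an assumed variable holding the cache base URL.

```shell
# Hypothetical key scheme: hash a lockfile so the key changes when dependencies do.
cache_key() {
  sha256sum "$1" | cut -d ' ' -f 1
}

# Download the archive for <key> into <file> if the cache has it.
#   cache_restore <key> <file>
cache_restore() {
  # HEAD /cache/<key>: -I sends HEAD, -f turns 4xx/5xx into a non-zero exit.
  curl -fsS -I -o /dev/null "$CACHE_URL/cache/$1" || return 1
  curl -fsS -o "$2" "$CACHE_URL/cache/$1"
}

# Upload <file> as the archive for <key>; curl -T issues a PUT.
#   cache_save <key> <file>
cache_save() {
  curl -fsS -T "$2" "$CACHE_URL/cache/$1"
}

# Example:
#   key=$(cache_key package-lock.json)
#   cache_restore "$key" deps.tar.gz || { npm ci && tar czf deps.tar.gz node_modules && cache_save "$key" deps.tar.gz; }
```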

Status

Method  Endpoint  Description
GET     /health   Health check ({"status":"ok"})
GET     /repos    List registered repos

Deployment

The cache runs as a Deployment in the sparkwing namespace:

  • Image: sparkwing-cache
  • Port: 8090 (service port 80)
  • Storage: PVC at /data
  • SSH: Optional, mounted at /etc/ssh-key from ssh-key secret
  • Ingress: your dashboard host's cache- subdomain

Environment Variables

Variable             Description
SPARKWING_API_TOKEN  Bearer token for write endpoint auth
GITCACHE_REPOS       Comma-separated name=url pairs for auto-registration
FETCH_INTERVAL       Background fetch interval (default: 30s)
DATA_DIR             Override data root (default: /data)
PORT                 Listen port (default: 8090)

Data directories

Path                   Contents
/data/repos/           Bare git repositories (named by content hash)
/data/archives/        Cached repo tarballs
/data/uploads/         Uploaded code tarballs
/data/artifacts/       Job output artifacts
/data/bins/            Compiled pipeline binary cache
/data/cache/           Dependency-archive cache (gems, node_modules, etc.)
/data/proxy/           Package-registry proxy cache (npm, PyPI, Go, etc.)
/data/repo-names.json  Friendly name → URL registry