Build Caching

How sparkwing makes Docker builds fast, and where the time actually goes.

Where build time goes

Benchmark: full-stack Dockerfile installing 225 npm packages + 100 Ruby gems (including native extensions like pg, nio4r, bootsnap). Tested on Apple Silicon Mac and EKS (Graviton arm64).

Local Mac results

| Scenario | Time | What's cached |
|---|---|---|
| True cold (system prune) | 94s | nothing |
| Base images cached | 92s | container images |
| Images + warm cache mounts | 61s | images + compiled deps |
| Full layer cache (unchanged) | 1s | everything |

Derived cost breakdown

| Component | Cost | Notes |
|---|---|---|
| Base image pulls | ~2s | Docker Hub CDN is fast |
| Package downloads (npm + gems) | ~10–15s | registry CDNs are fast on good networks |
| Native extension compilation | ~40–50s | gcc/make for pg, nio4r, bootsnap, sassc... |
| npm resolution + linking | ~10–15s | CPU-bound, not network |
| Docker layer export | ~5–7s | writing to image store |

The bottleneck is CPU, not network. Compiling native extensions accounts for ~50% of a cold build. Package downloads are only ~15% of the total.

EKS results (Graviton arm64, 2 vCPU)

| Scenario | Time | Savings |
|---|---|---|
| Cold build | 105s | -- |
| Layer cache (nothing changed) | 0s | 105s (100%) |
| --no-cache, cache mounts warm | 101s | 4s (4%) |
| Dep change + warm cache mounts | 101s | 2s vs cold dep change |
| Cold build + proxy (cached) | 98s | 7s (7%) |

EKS is slower than the Mac: about 1.1x on a cold build (105s vs 94s) and about 1.7x with warm cache mounts (101s vs 61s), primarily from CPU limits (2 vCPU pod) and EBS disk I/O (network-attached storage).

EKS time breakdown (~105s)

| Component | Time | % of build |
|---|---|---|
| bundle install (download + compile) | ~76s | 72% |
| npm install (resolve + link) | ~15s | 14% |
| Docker layer export | ~13s | 12% |
| Base image + setup | ~2s | 2% |

Native extension compilation (pg, nio4r, bootsnap, sassc) dominates. In these EKS runs, warm cache mounts barely reduced it (101s vs 105s cold) -- only the layer cache (unchanged rebuild) or pre-compiled base images avoid it.

Caching layers -- what each one does

Sparkwing has four caching layers. Each addresses a different failure mode:

1. Docker layer cache (biggest win: ~99% speedup)

When nothing in the Dockerfile changes, every layer is cached and the build completes in ~1s. This is Docker's default behavior -- no sparkwing configuration needed.

Breaks when: any file referenced by COPY changes (code, package.json, Gemfile), Dockerfile changes, or build args change.
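Because a changed COPY input invalidates every layer after it, instruction order decides how much of the cache survives a code change. A minimal sketch of the usual manifest-first ordering, assuming a hypothetical Node app (the node:20 base image and the /app path are illustrative and not taken from the benchmark):

```dockerfile
# syntax=docker/dockerfile:1
# Hypothetical base image and paths, for illustration only.
FROM node:20
WORKDIR /app

# Copy only the dependency manifests first. This layer and the install
# layer below stay cached until package.json or package-lock.json change.
COPY package.json package-lock.json ./
RUN npm ci

# Application code changes invalidate the cache from this line onward,
# so an edit to source files does not re-run npm ci.
COPY . .
```

The same ordering applies to Gemfile and Gemfile.lock ahead of bundle install.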

2. BuildKit cache mounts (second biggest: ~34% speedup)

RUN --mount=type=cache,target=/root/.npm npm install
RUN --mount=type=cache,target=/usr/local/bundle bundle install

Cache mounts persist compiled artifacts and downloaded packages across builds even when the layer cache is busted. The mount directory survives --no-cache and Dockerfile changes -- only docker builder prune -af clears it.

Benchmark proof:

  • Images cached, cold mounts: 92s
  • Images cached, warm mounts: 61s
  • Savings: 31s (34%)

The savings come from skipping native extension recompilation (pg, nio4r, etc.), not from skipping downloads.

Breaks when: the base image's runtime version changes (for example, Ruby 3.2 → 3.3: extensions compiled against one version are incompatible with the other), or the build cache is pruned.

3. Warm PVC pool (multiplier for cache mounts)

The controller pre-warms PVCs with Docker image layers. The DinD sidecar on each runner pod mounts a warm PVC at /var/lib/docker. Since the warmer is additive (never wipes the PVC), BuildKit cache mounts from previous job runs persist on the PVC.

This means cache mounts survive across pipeline runs -- not just within a single build session. The first build on a PVC is cold; every subsequent build benefits from warm mounts.

Breaks when: the PVC is recycled (new PVC from the pool), or the warmer is run with a destructive reset (it currently doesn't -- see warmer.go).

4. Dependency proxy (reliability + bandwidth, modest speed)

sparkwing-cache includes a package proxy that caches npm, pip, gem, Go module, and Alpine package downloads in-cluster. Runners fetch packages from the cache proxy instead of the public internet.

Speed impact: ~7s savings on a 105s cold EKS build (98s with a warm proxy). The proxy eliminates network egress for cached packages, but since package downloads are only ~15% of total build time, the absolute savings are small on fast networks.

Where the proxy matters:

  • Reliability: builds succeed when npmjs.org or rubygems.org have outages (stale-on-error fallback serves cached responses)
  • Bandwidth: 290MB of cached packages are not re-downloaded from the internet on every cold build on every node
  • Constrained networks: air-gapped clusters, metered egress, cross-region builds where registry latency is high
  • Concurrent builds: 10 runners building simultaneously don't each fetch the same 200MB from upstream

Does NOT help when: the bottleneck is compilation (most builds), the network is fast (AWS to npm CDN), or the packages aren't in the cache yet (first build).

Recommendations

For pipeline authors (Dockerfile best practices)

Always use BuildKit cache mounts for package managers:

# Node
RUN --mount=type=cache,target=/root/.npm npm ci

# Ruby
RUN --mount=type=cache,target=/usr/local/bundle bundle install

# Python
RUN --mount=type=cache,target=/root/.cache/pip pip install -r requirements.txt

# Go
RUN --mount=type=cache,target=/go/pkg/mod \
    --mount=type=cache,target=/root/.cache/go-build \
    go build ./...

# Rust
RUN --mount=type=cache,target=/usr/local/cargo/registry \
    --mount=type=cache,target=/app/target \
    cargo build --release
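One caveat worth keeping in mind: the contents of a cache mount are not written into the image layer, so a build output that lands under a mounted path has to be copied elsewhere within the same RUN. A sketch for the Rust case, as a fragment that assumes WORKDIR is /app and a hypothetical binary named myapp:

```dockerfile
# Fragment: assumes an earlier FROM, WORKDIR /app, and COPY of the sources.
# "myapp" is a hypothetical binary name; use the crate's actual binary.
# /app/target is a cache mount, so the binary is copied into the image
# layer before the RUN step ends.
RUN --mount=type=cache,target=/usr/local/cargo/registry \
    --mount=type=cache,target=/app/target \
    cargo build --release && \
    cp /app/target/release/myapp /usr/local/bin/myapp
```

The same applies to any mount target whose files the final image needs at runtime, such as a gem or package install directory.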

For using the proxy in Dockerfiles

ARG PROXY_URL=""

# npm
RUN --mount=type=cache,target=/root/.npm \
    if [ -n "$PROXY_URL" ]; then npm config set registry ${PROXY_URL}/proxy/npm/; fi && \
    npm ci

# pip
RUN --mount=type=cache,target=/root/.cache/pip \
    if [ -n "$PROXY_URL" ]; then \
      export PIP_INDEX_URL="${PROXY_URL}/proxy/pypi/simple/" \
             PIP_TRUSTED_HOST="sparkwing-cache.sparkwing.svc.cluster.local"; \
    fi && \
    pip install -r requirements.txt

# apk (Alpine)
RUN if [ -n "$PROXY_URL" ]; then \
      sed -i "s|https://dl-cdn.alpinelinux.org|${PROXY_URL}/proxy/alpine|g" /etc/apk/repositories; \
    fi && apk add --no-cache git

PROXY_URL is a build arg your pipeline sets -- it is not injected automatically. Default it to empty (no proxy) and, for in-cluster builds, pass the cache's service URL yourself, e.g. --build-arg PROXY_URL=http://sparkwing-cache.sparkwing.svc.cluster.local when invoking the Docker build from your pipeline.
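The same conditional pattern could plausibly be extended to gems and Go modules. The sketch below assumes the rubygems and golang registries are served under the same /proxy/{registry}/ path scheme as npm and alpine; those exact paths are an assumption, so check them against your sparkwing-cache deployment before relying on them.

```dockerfile
# Fragment: place after FROM so the build arg is in scope for RUN.
ARG PROXY_URL=""

# gems: register the proxy as Bundler's mirror for rubygems.org.
# The /proxy/rubygems/ path is assumed, not confirmed by the docs above.
RUN if [ -n "$PROXY_URL" ]; then \
      bundle config set --global mirror.https://rubygems.org "${PROXY_URL}/proxy/rubygems/"; \
    fi && \
    bundle install

# Go modules: GOPROXY takes a comma-separated list; "direct" is tried
# when the proxy answers 404 or 410. The /proxy/golang path is assumed.
RUN if [ -n "$PROXY_URL" ]; then \
      export GOPROXY="${PROXY_URL}/proxy/golang,direct"; \
    fi && \
    go mod download
```

With PROXY_URL left empty, both steps fall through to the package managers' default upstreams.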

What NOT to optimize (and why)

These were all investigated and benchmarked. The savings are real but small because builds are CPU-bound, not network-bound.

  • Base image pull time -- only ~2s on fast networks. The warm PVC pool already pre-pulls common images.
  • Package download caching (proxy/mounts) -- saves 2–7s on a 105s build. Downloads are ~15% of total time; bundle install, which is mostly compilation, is ~72%. The proxy's value is reliability (builds work when registries are down) and bandwidth savings, not speed.
  • P2P artifact distribution -- overkill at current scale. The dependency access pattern (many small files, different per-repo) doesn't benefit from peer-to-peer the way large uniform blobs do.

What WOULD help

  • More CPU for runner pods -- compilation is the bottleneck. Bumping from 2 to 4 vCPU would let bundle install --jobs=4 actually parallelize and could cut ~30% off build time.
  • Faster disk -- EBS adds ~6s to Docker layer exports vs local SSD. Instances with local NVMe (c5d, m5d, or c6gd/m6gd on Graviton) or higher-IOPS gp3 volumes would help.
  • Pre-compiled base images -- a custom base image with common gems pre-installed eliminates compilation entirely for those deps. This is the nuclear option: build time drops to seconds, but you own the base image.
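As a rough illustration of the pre-compiled base image option, the following sketch builds a base image with the gems already installed. The file name, the ruby:3.3 tag, and the --jobs value are assumptions chosen for the example; match --jobs to the vCPUs actually available to the build.

```dockerfile
# syntax=docker/dockerfile:1
# base.Dockerfile (hypothetical name): rebuilt and pushed only when the
# Gemfile changes, so native extensions are compiled once, not per build.
FROM ruby:3.3
WORKDIR /app

# Only the manifests are copied, so application code changes never
# invalidate the compiled gems in this image.
COPY Gemfile Gemfile.lock ./

# --jobs parallelizes gem installation; it only helps when the builder
# has that many cores to use.
RUN bundle install --jobs=4
```

Application Dockerfiles would then start FROM the published base image, and their own bundle install only handles gems that are missing from it.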

Proxy service

The package proxy is part of sparkwing-cache. It supports six upstream registries:

| Registry | Upstream | URL rewriting |
|---|---|---|
| npm | registry.npmjs.org | Yes (tarball URLs in metadata) |
| pypi | pypi.org | Yes (file URLs in simple index) |
| pythonhosted | files.pythonhosted.org | No |
| rubygems | rubygems.org | No |
| golang | proxy.golang.org | No |
| alpine | dl-cdn.alpinelinux.org | No |

Cache policy:

  • Immutable content (.tgz, .whl, .gem, .zip, .jar, .apk): cached with no TTL; entries are dropped only when background cleanup finds them older than PROXY_MAX_AGE
  • Metadata (JSON, HTML): 10-minute TTL, stale-on-error fallback
  • Background cleanup: removes expired entries hourly

Endpoints:

  • GET /proxy/{registry}/{path} -- cached reverse proxy
  • GET /stats -- cache size per registry
  • GET /health -- liveness/readiness

Configuration (env vars):

  • PROXY_CACHE_DIR -- cache directory (default: /data/proxy)
  • PROXY_CACHE_TTL -- metadata TTL (default: 10m)
  • PROXY_MAX_AGE -- cleanup threshold for immutable entries (default: 168h)