mirror of https://github.com/git-pkgs/proxy.git synced 2026-06-02 16:48:16 -04:00
pkg-proxy/docs/architecture.md
Lars Wallenborn 76f41cf271
Add storage backend probe to /health (closes #73) (#119)
2026-05-22 12:14:01 +01:00

Architecture

This document describes the internal architecture of the git-pkgs proxy.

Overview

The proxy is a caching HTTP server that sits between package manager clients and upstream registries. It intercepts requests, checks a local cache, and either serves cached content or fetches from upstream.

┌──────────────────────────────────────────────────────────────────┐
│                          HTTP Server                              │
│  ┌──────────────────────────────────────────────────────────┐    │
│  │                     Router (Chi)                          │    │
│  │  /npm/*     -> NPMHandler      /health  -> healthHandler  │    │
│  │  /cargo/*   -> CargoHandler    /stats   -> statsHandler   │    │
│  │  /gem/*     -> GemHandler      /metrics -> prometheus     │    │
│  │  ...17 ecosystems              /api/*   -> APIHandler     │    │
│  │                                /        -> Web UI         │    │
│  └──────────────────────────────────────────────────────────┘    │
│         │                    │                    │               │
│         ▼                    ▼                    ▼               │
│  ┌───────────┐       ┌─────────────┐      ┌─────────────┐       │
│  │ Database  │       │   Storage   │      │   Upstream  │       │
│  │ SQLite or │       │ Filesystem  │      │  Registries │       │
│  │ Postgres  │       │  or S3      │      │  (Fetcher)  │       │
│  └───────────┘       └─────────────┘      └─────────────┘       │
└──────────────────────────────────────────────────────────────────┘

Request Flow

Metadata Request (npm example)

  1. Client requests GET /npm/lodash
  2. NPMHandler receives request
  3. Handler fetches metadata from upstream registry.npmjs.org/lodash
  4. Handler rewrites tarball URLs in metadata to point at proxy
  5. Handler returns modified metadata to client

By default, metadata is not cached; it is fetched fresh on every request so that clients see new versions immediately. Metadata caching is opt-in (see Design Decisions).

Artifact Download (npm example)

  1. Client requests GET /npm/lodash/-/lodash-4.17.21.tgz

  2. NPMHandler extracts package name and version from URL

  3. Handler calls Proxy.GetOrFetchArtifact()

  4. Proxy checks database for cached artifact:

    Cache Hit:

    • Look up artifact record in database
    • Open file from storage
    • Record hit (increment counter, update last_accessed_at)
    • Return reader to handler
    • Handler streams file to client

    Cache Miss:

    • Resolve download URL using Resolver
    • Fetch artifact from upstream using Fetcher
    • Store artifact in Storage (returns size, hash)
    • Create/update database records (package, version, artifact)
    • Open stored file
    • Return reader to handler
    • Handler streams file to client
┌────────┐  GET /npm/lodash/-/lodash-4.17.21.tgz  ┌─────────────┐
│ Client │ ──────────────────────────────────────▶│ NPMHandler  │
└────────┘                                        └──────┬──────┘
                                                         │
                                                         ▼
                                               ┌─────────────────┐
                                               │ Proxy           │
                                               │ GetOrFetch      │
                                               └────────┬────────┘
                                                        │
                                    ┌───────────────────┼───────────────────┐
                                    │                   │                   │
                                    ▼                   ▼                   ▼
                             ┌───────────┐       ┌───────────┐       ┌───────────┐
                             │ Database  │       │  Storage  │       │ Upstream  │
                             │ (lookup)  │       │  (read)   │       │ (fetch)   │
                             └───────────┘       └───────────┘       └───────────┘

Package Structure

internal/database

SQLite or PostgreSQL database for cache metadata. SQLite uses modernc.org/sqlite (pure Go, no CGO). PostgreSQL uses lib/pq.

The schema is compatible with git-pkgs databases. The proxy adds the artifacts, vulnerabilities, and metadata_cache tables on top of the shared packages and versions tables, so both tools can point at the same database.

Tables:

packages (
    id          INTEGER PRIMARY KEY,  -- SERIAL on Postgres
    purl        TEXT NOT NULL,        -- unique, e.g. pkg:npm/lodash
    ecosystem   TEXT NOT NULL,
    name        TEXT NOT NULL,
    latest_version  TEXT,
    license         TEXT,
    description     TEXT,
    homepage        TEXT,
    repository_url  TEXT,
    registry_url    TEXT,
    supplier_name   TEXT,
    supplier_type   TEXT,
    source          TEXT,
    enriched_at     DATETIME,
    vulns_synced_at DATETIME,
    created_at      DATETIME,
    updated_at      DATETIME
)
-- indexes: purl (unique), (ecosystem, name)

versions (
    id           INTEGER PRIMARY KEY,
    purl         TEXT NOT NULL,       -- unique, e.g. pkg:npm/lodash@4.17.21
    package_purl TEXT NOT NULL,       -- FK to packages.purl
    license      TEXT,
    published_at DATETIME,
    integrity    TEXT,                -- subresource integrity hash
    yanked       INTEGER DEFAULT 0,  -- BOOLEAN on Postgres
    source       TEXT,
    enriched_at  DATETIME,
    created_at   DATETIME,
    updated_at   DATETIME
)
-- indexes: purl (unique), package_purl

artifacts (
    id             INTEGER PRIMARY KEY,
    version_purl   TEXT NOT NULL,
    filename       TEXT NOT NULL,
    upstream_url   TEXT NOT NULL,
    storage_path   TEXT,              -- null until cached
    content_hash   TEXT,              -- SHA-256
    size           INTEGER,           -- BIGINT on Postgres
    content_type   TEXT,
    fetched_at     DATETIME,
    hit_count      INTEGER DEFAULT 0, -- BIGINT on Postgres
    last_accessed_at DATETIME,
    created_at     DATETIME,
    updated_at     DATETIME
)
-- indexes: (version_purl, filename) unique, storage_path, last_accessed_at

vulnerabilities (
    id            INTEGER PRIMARY KEY,
    vuln_id       TEXT NOT NULL,      -- e.g. CVE-2021-1234
    ecosystem     TEXT NOT NULL,
    package_name  TEXT NOT NULL,
    severity      TEXT,
    summary       TEXT,
    fixed_version TEXT,
    cvss_score    REAL,
    "references"  TEXT,               -- JSON array
    fetched_at    DATETIME,
    created_at    DATETIME,
    updated_at    DATETIME
)
-- indexes: (vuln_id, ecosystem, package_name) unique, (ecosystem, package_name)

metadata_cache (
    id            INTEGER PRIMARY KEY,
    ecosystem     TEXT NOT NULL,
    name          TEXT NOT NULL,
    storage_path  TEXT NOT NULL,
    etag          TEXT,
    content_type  TEXT,
    size          INTEGER,           -- BIGINT on Postgres
    fetched_at    DATETIME,
    created_at    DATETIME,
    updated_at    DATETIME
)
-- indexes: (ecosystem, name) unique

On PostgreSQL, INTEGER PRIMARY KEY becomes SERIAL, DATETIME becomes TIMESTAMP, INTEGER DEFAULT 0 booleans become BOOLEAN DEFAULT FALSE, and size/count columns use BIGINT.

The MigrateSchema() function handles backward compatibility with older git-pkgs databases by running named migrations that add missing columns and tables. See migrations.md for how to add new schema changes.

Key operations:

  • GetPackageByPURL() - Look up package by PURL
  • GetVersionByPURL() - Look up version by PURL
  • GetArtifact() - Look up artifact by version + filename
  • UpsertPackage/Version/Artifact() - Insert or update records
  • RecordArtifactHit() - Increment hit counter, update access time
  • GetLeastRecentlyUsedArtifacts() - For cache eviction
  • SearchPackages() - Full-text search across cached packages

internal/storage

File storage abstraction. Backends are available for the local filesystem and S3.

Interface:

type Storage interface {
    Store(ctx, path, reader) (size, hash, error)
    Open(ctx, path) (io.ReadCloser, error)
    Exists(ctx, path) (bool, error)
    Delete(ctx, path) error
    Size(ctx, path) (int64, error)
    UsedSpace(ctx) (int64, error)
}

Filesystem implementation:

  • Stores files in nested directories: {ecosystem}/{name}/{version}/{filename}
  • Atomic writes using temp file + rename
  • Computes SHA256 hash during write
  • Cleans up empty parent directories on delete

Path structure:

cache/artifacts/
├── npm/
│   ├── lodash/
│   │   └── 4.17.21/
│   │       └── lodash-4.17.21.tgz
│   └── @babel/
│       └── core/
│           └── 7.23.0/
│               └── core-7.23.0.tgz
└── cargo/
    └── serde/
        └── 1.0.193/
            └── serde-1.0.193.crate

internal/upstream

Fetches artifacts from upstream registries.

Fetcher:

  • HTTP client with configurable timeout (5 min default for large artifacts)
  • Exponential backoff retry on 429 (rate limit) and 5xx errors
  • Returns streaming reader (doesn't load into memory)
  • Configurable user-agent

Resolver:

  • Determines download URL for a package/version
  • Handles ecosystem-specific URL patterns:
    • npm: https://registry.npmjs.org/{name}/-/{shortname}-{version}.tgz
    • cargo: https://static.crates.io/crates/{name}/{name}-{version}.crate
    • etc.
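
The two URL patterns listed above can be expressed as a small function. This is a sketch only: the function name and signature are hypothetical, and it covers just the npm and cargo patterns shown.

```go
package main

import (
	"fmt"
	"strings"
)

// resolveDownloadURL builds the upstream artifact URL for a package version
// using the npm and cargo patterns. Other ecosystems are out of scope here.
func resolveDownloadURL(ecosystem, name, version string) (string, error) {
	switch ecosystem {
	case "npm":
		// Scoped packages such as @babel/core use only the part after
		// the slash ("core") in the tarball filename.
		short := name
		if i := strings.LastIndex(name, "/"); i >= 0 {
			short = name[i+1:]
		}
		return fmt.Sprintf("https://registry.npmjs.org/%s/-/%s-%s.tgz", name, short, version), nil
	case "cargo":
		return fmt.Sprintf("https://static.crates.io/crates/%s/%s-%s.crate", name, name, version), nil
	default:
		return "", fmt.Errorf("unsupported ecosystem %q", ecosystem)
	}
}
```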

internal/handler

HTTP protocol handlers for each registry type.

Proxy (shared):

  • GetOrFetchArtifact() - Main cache logic
  • Coordinates database, storage, and fetcher
  • Handles cache hit/miss flow

NPMHandler:

  • handlePackageMetadata() - Proxy + rewrite metadata
  • handleDownload() - Serve cached artifact
  • Rewrites tarball URLs to point at proxy

CargoHandler:

  • handleConfig() - Return registry config
  • handleIndex() - Proxy sparse index
  • handleDownload() - Serve cached crate

internal/server

HTTP server setup, web UI, and API handlers.

  • Creates and wires together all components
  • Mounts protocol handlers at ecosystem-specific paths
  • Middleware: request ID, real IP, logging, panic recovery, active request tracking
  • Web UI: dashboard, package browser, source browser, version comparison
  • Templates are embedded in the binary via //go:embed
  • Enrichment API for package metadata, vulnerability scanning, and outdated detection
  • Health, stats, and Prometheus metrics endpoints. /health runs an active write → size-check → read → verify → delete probe against the storage backend and returns a structured JSON response (HealthResponse) with "ok" / "error" status per subsystem. Probe results are cached (default 30 s, configurable via health.storage_probe_interval) to avoid overwhelming remote backends.
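
The write → size-check → read → verify → delete sequence can be sketched against a small in-memory store. The probeStore interface, memStore type, and probeStorage function are hypothetical names for this illustration. Deferring the delete so that it runs after every successful write is an assumption about how to avoid leaving probe objects behind, not a description of the actual handler.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
)

// probeStore is the subset of storage operations the probe needs.
type probeStore interface {
	Store(path string, r io.Reader) (int64, error)
	Open(path string) (io.ReadCloser, error)
	Delete(path string) error
}

// memStore is an in-memory probeStore used only for illustration.
type memStore map[string][]byte

func (m memStore) Store(path string, r io.Reader) (int64, error) {
	data, err := io.ReadAll(r)
	if err != nil {
		return 0, err
	}
	m[path] = data
	return int64(len(data)), nil
}

func (m memStore) Open(path string) (io.ReadCloser, error) {
	data, ok := m[path]
	if !ok {
		return nil, errors.New("not found")
	}
	return io.NopCloser(bytes.NewReader(data)), nil
}

func (m memStore) Delete(path string) error {
	delete(m, path)
	return nil
}

// probeStorage runs write, size-check, read, verify, delete. The error
// message names the step that failed.
func probeStorage(s probeStore, path string, payload []byte) (err error) {
	size, err := s.Store(path, bytes.NewReader(payload))
	if err != nil {
		return fmt.Errorf("write: %w", err)
	}
	// Clean up after every successful write. A delete failure is reported
	// only when no earlier step failed.
	defer func() {
		if derr := s.Delete(path); derr != nil && err == nil {
			err = fmt.Errorf("delete: %w", derr)
		}
	}()
	if size != int64(len(payload)) {
		return fmt.Errorf("size-check: wrote %d bytes, want %d", size, len(payload))
	}
	rc, err := s.Open(path)
	if err != nil {
		return fmt.Errorf("read: %w", err)
	}
	got, err := io.ReadAll(rc)
	rc.Close() // release the handle before the deferred delete runs
	if err != nil {
		return fmt.Errorf("read: %w", err)
	}
	if !bytes.Equal(got, payload) {
		return errors.New("verify: content mismatch")
	}
	return nil
}
```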

internal/metrics

Prometheus metrics for cache performance, upstream latency, storage operations, and active requests. See the Monitoring section of the README for the full metric list.

Cooldown

Version age filtering for supply chain attack mitigation, provided by github.com/git-pkgs/cooldown. Configurable at global, ecosystem, and per-package levels. Supported by npm, PyPI, pub.dev, and Composer handlers.

internal/enrichment

Package metadata enrichment. Fetches license, description, homepage, repository URL, and vulnerability data from upstream registries. Powers the /api/ endpoints and the web UI's package detail pages.

internal/mirror

Selective package mirroring for pre-populating the proxy cache. Supports multiple input sources: individual PURLs (versioned or unversioned), CycloneDX/SPDX SBOM files, and full registry enumeration. Uses a bounded worker pool backed by errgroup to download artifacts in parallel, reusing handler.Proxy.GetOrFetchArtifact() for the actual fetch-and-cache work.

The package also provides a MetadataCache for storing raw upstream metadata blobs so the proxy can serve metadata responses offline. The JobStore manages async mirror jobs exposed via the /api/mirror endpoints.

internal/config

Configuration loading.

  • Supports YAML and JSON files
  • Environment variable overrides (PROXY_ prefix)
  • Command line flag overrides
  • Validation

Extending the Proxy

Adding a New Registry

  1. Add URL resolution in upstream/resolver.go
  2. Create handler in handler/newregistry.go
  3. Mount in server/server.go
  4. Add tests

Adding a New Storage Backend

  1. Implement storage.Storage interface
  2. Add configuration options in config/config.go
  3. Add initialization in server/server.go

Cache Eviction

The database tracks hit_count and last_accessed_at for LRU eviction. Query with:

db.GetLeastRecentlyUsedArtifacts(limit)

Eviction can be implemented as:

  1. Background goroutine checking GetTotalCacheSize()
  2. When over limit, get LRU artifacts
  3. Delete from storage and clear database records

Design Decisions

Why SQLite?

  • Simple deployment (single file)
  • No external dependencies
  • Good performance for this workload
  • Pure Go driver available (no CGO)

Why rewrite metadata URLs?

  • Ensures clients fetch artifacts through proxy
  • The alternative, letting clients fetch directly from upstream, would bypass the cache entirely

Why not cache metadata (by default)?

  • Simplicity - no invalidation logic needed
  • Fresh data - new versions visible immediately
  • Metadata is small, upstream fetch is fast
  • Set cache_metadata: true or use the mirror command to enable metadata caching for offline use via the metadata_cache table

Why stream artifacts?

  • Memory efficient - don't load large files into RAM
  • Better latency - start sending while still receiving

Why atomic writes?

  • Prevents serving partial files
  • Safe concurrent access
  • Clean recovery from crashes