# Architecture
This document describes the internal architecture of the git-pkgs proxy.
## Overview
The proxy is a caching HTTP server that sits between package manager clients and upstream registries. It intercepts requests, checks a local cache, and either serves cached content or fetches from upstream.
```
┌──────────────────────────────────────────────────────────────────┐
│                           HTTP Server                            │
│   ┌──────────────────────────────────────────────────────────┐   │
│   │                       Router (Chi)                       │   │
│   │  /npm/*   -> NPMHandler    /health  -> healthHandler     │   │
│   │  /cargo/* -> CargoHandler  /stats   -> statsHandler      │   │
│   │  /gem/*   -> GemHandler    /metrics -> prometheus        │   │
│   │  ...17 ecosystems          /api/*   -> APIHandler        │   │
│   │                            /        -> Web UI            │   │
│   └──────────────────────────────────────────────────────────┘   │
│         │                      │                      │          │
│         ▼                      ▼                      ▼          │
│   ┌───────────┐         ┌─────────────┐        ┌─────────────┐   │
│   │ Database  │         │   Storage   │        │  Upstream   │   │
│   │ SQLite or │         │ Filesystem  │        │ Registries  │   │
│   │ Postgres  │         │    or S3    │        │  (Fetcher)  │   │
│   └───────────┘         └─────────────┘        └─────────────┘   │
└──────────────────────────────────────────────────────────────────┘
```
## Request Flow
### Metadata Request (npm example)
1. Client requests `GET /npm/lodash`
2. NPMHandler receives request
3. Handler fetches metadata from upstream `registry.npmjs.org/lodash`
4. Handler rewrites tarball URLs in metadata to point at proxy
5. Handler returns modified metadata to client
By default, metadata is not cached: it is fetched fresh on every request, so clients see new versions immediately.
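
The rewrite in step 4 can be sketched as follows. This is a minimal illustration, not the handler's actual code: the function name `rewriteTarballURLs`, the `proxyBase` parameter, and the example proxy address are assumptions made for this sketch. It relies only on the npm metadata convention that each version carries a `dist.tarball` URL.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"path"
	"strings"
)

// rewriteTarballURLs points every versions[*].dist.tarball entry at the proxy.
// proxyBase (for example "http://localhost:8080/npm") is a hypothetical
// parameter introduced for this sketch.
func rewriteTarballURLs(metadata []byte, proxyBase, name string) ([]byte, error) {
	dec := json.NewDecoder(bytes.NewReader(metadata))
	dec.UseNumber() // keep numeric fields exactly as upstream sent them
	var doc map[string]any
	if err := dec.Decode(&doc); err != nil {
		return nil, fmt.Errorf("decode metadata: %w", err)
	}
	versions, _ := doc["versions"].(map[string]any)
	for _, v := range versions {
		ver, ok := v.(map[string]any)
		if !ok {
			continue
		}
		dist, ok := ver["dist"].(map[string]any)
		if !ok {
			continue
		}
		tarball, ok := dist["tarball"].(string)
		if !ok {
			continue
		}
		// Keep only the file name; the proxy route supplies the rest.
		file := path.Base(tarball)
		dist["tarball"] = strings.TrimRight(proxyBase, "/") + "/" + name + "/-/" + file
	}
	return json.Marshal(doc)
}

// tarballOf extracts one version's tarball URL; used by the demo below.
func tarballOf(metadata []byte, version string) string {
	var doc struct {
		Versions map[string]struct {
			Dist struct {
				Tarball string `json:"tarball"`
			} `json:"dist"`
		} `json:"versions"`
	}
	if err := json.Unmarshal(metadata, &doc); err != nil {
		return ""
	}
	return doc.Versions[version].Dist.Tarball
}

func main() {
	upstream := []byte(`{"name":"lodash","versions":{"4.17.21":{"dist":{"tarball":"https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz"}}}}`)
	out, err := rewriteTarballURLs(upstream, "http://localhost:8080/npm", "lodash")
	if err != nil {
		panic(err)
	}
	fmt.Println(tarballOf(out, "4.17.21"))
}
```

The rewritten URL has the same shape as the artifact route shown in the next section, which is what sends subsequent downloads through the cache.
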
### Artifact Download (npm example)
1. Client requests `GET /npm/lodash/-/lodash-4.17.21.tgz`
2. NPMHandler extracts package name and version from URL
3. Handler calls `Proxy.GetOrFetchArtifact()`
4. Proxy checks database for cached artifact:
**Cache Hit:**
- Look up artifact record in database
- Open file from storage
- Record hit (increment counter, update last_accessed_at)
- Return reader to handler
- Handler streams file to client
**Cache Miss:**
- Resolve download URL using Resolver
- Fetch artifact from upstream using Fetcher
- Store artifact in Storage (returns size, hash)
- Create/update database records (package, version, artifact)
- Open stored file
- Return reader to handler
- Handler streams file to client
```
┌────────┐  GET /npm/lodash/-/lodash-4.17.21.tgz  ┌─────────────┐
│ Client │ ──────────────────────────────────────▶│ NPMHandler  │
└────────┘                                        └──────┬──────┘
                                                ┌─────────────────┐
                                                │      Proxy      │
                                                │   GetOrFetch    │
                                                └────────┬────────┘
                                     ┌───────────────────┼───────────────────┐
                                     │                   │                   │
                                     ▼                   ▼                   ▼
                               ┌───────────┐       ┌───────────┐       ┌───────────┐
                               │ Database  │       │  Storage  │       │ Upstream  │
                               │ (lookup)  │       │  (read)   │       │  (fetch)  │
                               └───────────┘       └───────────┘       └───────────┘
```
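
The hit/miss branching above can be sketched with in-memory stand-ins. Everything here is hypothetical: `miniProxy`, `artifactRecord`, and the map-backed "database" and "storage" exist only for illustration and are not the signatures of `Proxy.GetOrFetchArtifact()`. Unlike the real flow, the sketch buffers the artifact in memory instead of streaming it.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

// artifactRecord is a toy stand-in for a row in the artifacts table.
type artifactRecord struct {
	StoragePath string
	HitCount    int
}

// miniProxy replaces the database and storage with maps (assumption for this sketch).
type miniProxy struct {
	records map[string]*artifactRecord // keyed by version PURL + "#" + filename
	blobs   map[string][]byte          // keyed by storage path
	fetch   func(url string) (io.ReadCloser, error)
	fetches int // number of upstream fetches performed
}

func (p *miniProxy) getOrFetch(versionPURL, filename, upstreamURL, storagePath string) (io.Reader, error) {
	key := versionPURL + "#" + filename
	if rec, ok := p.records[key]; ok && rec.StoragePath != "" {
		// Cache hit: record the hit and read from storage.
		rec.HitCount++
		return bytes.NewReader(p.blobs[rec.StoragePath]), nil
	}
	// Cache miss: fetch from upstream, store, then record.
	body, err := p.fetch(upstreamURL)
	if err != nil {
		return nil, fmt.Errorf("fetch %s: %w", upstreamURL, err)
	}
	defer body.Close()
	p.fetches++
	data, err := io.ReadAll(body) // a real store would stream to disk instead
	if err != nil {
		return nil, fmt.Errorf("read upstream body: %w", err)
	}
	p.blobs[storagePath] = data
	p.records[key] = &artifactRecord{StoragePath: storagePath}
	return bytes.NewReader(data), nil
}

// newDemoProxy returns a proxy whose "upstream" always serves the same bytes.
func newDemoProxy() *miniProxy {
	return &miniProxy{
		records: map[string]*artifactRecord{},
		blobs:   map[string][]byte{},
		fetch: func(string) (io.ReadCloser, error) {
			return io.NopCloser(strings.NewReader("tarball bytes")), nil
		},
	}
}

func main() {
	p := newDemoProxy()
	for i := 0; i < 3; i++ {
		if _, err := p.getOrFetch("pkg:npm/lodash@4.17.21", "lodash-4.17.21.tgz",
			"https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz",
			"npm/lodash/4.17.21/lodash-4.17.21.tgz"); err != nil {
			panic(err)
		}
	}
	fmt.Println("upstream fetches:", p.fetches)
}
```
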
## Package Structure
### `internal/database`
SQLite or PostgreSQL database for cache metadata. SQLite uses `modernc.org/sqlite` (pure Go, no CGO). PostgreSQL uses `lib/pq`.
The schema is compatible with [git-pkgs](https://github.com/git-pkgs) databases. The proxy adds the `artifacts` and `vulnerabilities` tables on top of the shared `packages` and `versions` tables, so both tools can point at the same database.
**Tables:**
```sql
packages (
    id INTEGER PRIMARY KEY, -- SERIAL on Postgres
    purl TEXT NOT NULL, -- unique, e.g. pkg:npm/lodash
    ecosystem TEXT NOT NULL,
    name TEXT NOT NULL,
    latest_version TEXT,
    license TEXT,
    description TEXT,
    homepage TEXT,
    repository_url TEXT,
    registry_url TEXT,
    supplier_name TEXT,
    supplier_type TEXT,
    source TEXT,
    enriched_at DATETIME,
    vulns_synced_at DATETIME,
    created_at DATETIME,
    updated_at DATETIME
)
-- indexes: purl (unique), (ecosystem, name)

versions (
    id INTEGER PRIMARY KEY,
    purl TEXT NOT NULL, -- unique, e.g. pkg:npm/lodash@4.17.21
    package_purl TEXT NOT NULL, -- FK to packages.purl
    license TEXT,
    published_at DATETIME,
    integrity TEXT, -- subresource integrity hash
    yanked INTEGER DEFAULT 0, -- BOOLEAN on Postgres
    source TEXT,
    enriched_at DATETIME,
    created_at DATETIME,
    updated_at DATETIME
)
-- indexes: purl (unique), package_purl

artifacts (
    id INTEGER PRIMARY KEY,
    version_purl TEXT NOT NULL,
    filename TEXT NOT NULL,
    upstream_url TEXT NOT NULL,
    storage_path TEXT, -- null until cached
    content_hash TEXT, -- SHA-256
    size INTEGER, -- BIGINT on Postgres
    content_type TEXT,
    fetched_at DATETIME,
    hit_count INTEGER DEFAULT 0, -- BIGINT on Postgres
    last_accessed_at DATETIME,
    created_at DATETIME,
    updated_at DATETIME
)
-- indexes: (version_purl, filename) unique, storage_path, last_accessed_at

vulnerabilities (
    id INTEGER PRIMARY KEY,
    vuln_id TEXT NOT NULL, -- e.g. CVE-2021-1234
    ecosystem TEXT NOT NULL,
    package_name TEXT NOT NULL,
    severity TEXT,
    summary TEXT,
    fixed_version TEXT,
    cvss_score REAL,
    "references" TEXT, -- JSON array
    fetched_at DATETIME,
    created_at DATETIME,
    updated_at DATETIME
)
-- indexes: (vuln_id, ecosystem, package_name) unique, (ecosystem, package_name)

metadata_cache (
    id INTEGER PRIMARY KEY,
    ecosystem TEXT NOT NULL,
    name TEXT NOT NULL,
    storage_path TEXT NOT NULL,
    etag TEXT,
    content_type TEXT,
    size INTEGER, -- BIGINT on Postgres
    fetched_at DATETIME,
    created_at DATETIME,
    updated_at DATETIME
)
-- indexes: (ecosystem, name) unique
```
On PostgreSQL, `INTEGER PRIMARY KEY` becomes `SERIAL`, `DATETIME` becomes `TIMESTAMP`, `INTEGER DEFAULT 0` booleans become `BOOLEAN DEFAULT FALSE`, and size/count columns use `BIGINT`.
The `MigrateSchema()` function handles backward compatibility with older git-pkgs databases by running named migrations that add missing columns and tables. See [migrations.md](migrations.md) for how to add new schema changes.
**Key operations:**
- `GetPackageByPURL()` - Look up package by PURL
- `GetVersionByPURL()` - Look up version by PURL
- `GetArtifact()` - Look up artifact by version + filename
- `UpsertPackage/Version/Artifact()` - Insert or update records
- `RecordArtifactHit()` - Increment hit counter, update access time
- `GetLeastRecentlyUsedArtifacts()` - For cache eviction
- `SearchPackages()` - Full-text search across cached packages
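
Most of these lookups are keyed by PURL, so a short sketch of how the keys are shaped may help. The helper names are hypothetical, and the `%40` encoding of npm scopes follows the general Package URL convention; whether the proxy normalizes scoped names exactly this way is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// packagePURL builds a package-level PURL such as pkg:npm/lodash.
// Encoding "@" as "%40" follows the Package URL convention for npm scopes
// (assumption: the proxy stores scoped names the same way).
func packagePURL(ecosystem, name string) string {
	return "pkg:" + ecosystem + "/" + strings.ReplaceAll(name, "@", "%40")
}

// versionPURL appends the version, matching the versions.purl examples above.
func versionPURL(ecosystem, name, version string) string {
	return packagePURL(ecosystem, name) + "@" + version
}

func main() {
	fmt.Println(packagePURL("npm", "lodash"))
	fmt.Println(versionPURL("npm", "lodash", "4.17.21"))
	fmt.Println(versionPURL("npm", "@babel/core", "7.23.0"))
}
```
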
### `internal/storage`
File storage abstraction. This section documents the local filesystem implementation; the overview diagram also lists S3 as a backend.
**Interface:**
```go
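// Parameter and result types are inferred from the shorthand in the original
// listing; check the package source for the exact signatures.
type Storage interface {
	// Store writes r to path and returns the stored size and content hash.
	Store(ctx context.Context, path string, r io.Reader) (size int64, hash string, err error)
	// Open returns a streaming reader for the object at path.
	Open(ctx context.Context, path string) (io.ReadCloser, error)
	Exists(ctx context.Context, path string) (bool, error)
	Delete(ctx context.Context, path string) error
	Size(ctx context.Context, path string) (int64, error)
	// UsedSpace reports the total number of bytes held by the backend.
	UsedSpace(ctx context.Context) (int64, error)
}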
```
**Filesystem implementation:**
- Stores files in nested directories: `{ecosystem}/{name}/{version}/{filename}`
- Atomic writes using temp file + rename
- Computes SHA256 hash during write
- Cleans up empty parent directories on delete
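
The temp-file-plus-rename write with hashing during the copy can be sketched as below. This is a minimal illustration assuming a hex-encoded SHA-256 result; `storeAtomic` and `demoStore` are hypothetical names, not the package's API.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
)

// storeAtomic writes r to root/relPath via a temp file in the same directory,
// hashing the bytes as they are copied, then renames the file into place.
func storeAtomic(root, relPath string, r io.Reader) (size int64, hash string, err error) {
	dest := filepath.Join(root, filepath.FromSlash(relPath))
	if err := os.MkdirAll(filepath.Dir(dest), 0o755); err != nil {
		return 0, "", err
	}
	// Same directory as dest, so the rename stays on one filesystem.
	tmp, err := os.CreateTemp(filepath.Dir(dest), ".tmp-*")
	if err != nil {
		return 0, "", err
	}
	defer func() {
		if err != nil { // on any failure, leave no partial file behind
			tmp.Close()
			os.Remove(tmp.Name())
		}
	}()
	h := sha256.New()
	size, err = io.Copy(io.MultiWriter(tmp, h), r)
	if err != nil {
		return 0, "", err
	}
	if err = tmp.Close(); err != nil {
		return 0, "", err
	}
	if err = os.Rename(tmp.Name(), dest); err != nil {
		return 0, "", err
	}
	return size, hex.EncodeToString(h.Sum(nil)), nil
}

// demoStore stores "hello" in a throwaway directory and reads it back.
func demoStore() (size int64, hash, content string, err error) {
	root, err := os.MkdirTemp("", "proxy-cache-*")
	if err != nil {
		return 0, "", "", err
	}
	defer os.RemoveAll(root)
	rel := "npm/lodash/4.17.21/lodash-4.17.21.tgz"
	size, hash, err = storeAtomic(root, rel, strings.NewReader("hello"))
	if err != nil {
		return 0, "", "", err
	}
	data, err := os.ReadFile(filepath.Join(root, filepath.FromSlash(rel)))
	if err != nil {
		return 0, "", "", err
	}
	return size, hash, string(data), nil
}

func main() {
	size, hash, content, err := demoStore()
	if err != nil {
		panic(err)
	}
	fmt.Println(size, hash, content)
}
```

Readers only ever see the final path after the rename, so a crash mid-write leaves at most a stray temp file rather than a truncated artifact.
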
**Path structure:**
```
cache/artifacts/
├── npm/
│   ├── lodash/
│   │   └── 4.17.21/
│   │       └── lodash-4.17.21.tgz
│   └── @babel/
│       └── core/
│           └── 7.23.0/
│               └── core-7.23.0.tgz
└── cargo/
    └── serde/
        └── 1.0.193/
            └── serde-1.0.193.crate
```
### `internal/upstream`
Fetches artifacts from upstream registries.
**Fetcher:**
- HTTP client with configurable timeout (5 min default for large artifacts)
- Exponential backoff retry on 429 (rate limit) and 5xx errors
- Returns streaming reader (doesn't load into memory)
- Configurable user-agent
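
A rough sketch of the retry policy described above. The function names, the base delay, and the delay cap are assumptions chosen for illustration; the Fetcher's actual parameters are not specified in this document.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// shouldRetry reports whether a status code is worth retrying:
// 429 (rate limited) or any 5xx.
func shouldRetry(status int) bool {
	return status == http.StatusTooManyRequests || (status >= 500 && status <= 599)
}

// backoffDelay doubles base once per previous attempt, capped at maxDelay.
func backoffDelay(attempt int, base, maxDelay time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		return maxDelay
	}
	return d
}

// fetchWithRetry returns the first non-retryable response. The caller streams
// and closes resp.Body, so the artifact is never buffered in memory here.
func fetchWithRetry(client *http.Client, url string, attempts int) (*http.Response, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := client.Get(url)
		if err == nil && !shouldRetry(resp.StatusCode) {
			return resp, nil
		}
		if err == nil {
			resp.Body.Close()
			lastErr = fmt.Errorf("upstream returned %s", resp.Status)
		} else {
			lastErr = err
		}
		if i < attempts-1 {
			// Illustrative values: 500 ms base, 30 s cap.
			time.Sleep(backoffDelay(i, 500*time.Millisecond, 30*time.Second))
		}
	}
	return nil, fmt.Errorf("giving up after %d attempts: %w", attempts, lastErr)
}

func main() {
	// Print the delay schedule only; no network call is made here.
	for attempt := 0; attempt < 5; attempt++ {
		fmt.Println(attempt, backoffDelay(attempt, 500*time.Millisecond, 30*time.Second))
	}
}
```
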
**Resolver:**
- Determines download URL for a package/version
- Handles ecosystem-specific URL patterns:
  - npm: `https://registry.npmjs.org/{name}/-/{shortname}-{version}.tgz`
  - cargo: `https://static.crates.io/crates/{name}/{name}-{version}.crate`
  - etc.
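
The two URL patterns listed above can be sketched as a single function. `resolveURL` is a hypothetical name, and only the npm and cargo patterns from this document are covered; `{shortname}` is taken to be the package name without its scope.

```go
package main

import (
	"fmt"
	"strings"
)

// resolveURL builds the upstream download URL for the two documented patterns.
func resolveURL(ecosystem, name, version string) (string, error) {
	switch ecosystem {
	case "npm":
		// For scoped packages such as @babel/core, the tarball file name
		// uses only the part after the scope.
		short := name
		if i := strings.LastIndex(name, "/"); i >= 0 {
			short = name[i+1:]
		}
		return fmt.Sprintf("https://registry.npmjs.org/%s/-/%s-%s.tgz", name, short, version), nil
	case "cargo":
		return fmt.Sprintf("https://static.crates.io/crates/%s/%s-%s.crate", name, name, version), nil
	default:
		return "", fmt.Errorf("unsupported ecosystem %q", ecosystem)
	}
}

func main() {
	for _, pkg := range [][3]string{
		{"npm", "lodash", "4.17.21"},
		{"npm", "@babel/core", "7.23.0"},
		{"cargo", "serde", "1.0.193"},
	} {
		u, err := resolveURL(pkg[0], pkg[1], pkg[2])
		if err != nil {
			panic(err)
		}
		fmt.Println(u)
	}
}
```
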
### `internal/handler`
HTTP protocol handlers for each registry type.
**Proxy (shared):**
- `GetOrFetchArtifact()` - Main cache logic
- Coordinates database, storage, and fetcher
- Handles cache hit/miss flow
**NPMHandler:**
- `handlePackageMetadata()` - Proxy + rewrite metadata
- `handleDownload()` - Serve cached artifact
- Rewrites tarball URLs to point at proxy
**CargoHandler:**
- `handleConfig()` - Return registry config
- `handleIndex()` - Proxy sparse index
- `handleDownload()` - Serve cached crate
### `internal/server`
HTTP server setup, web UI, and API handlers.
- Creates and wires together all components
- Mounts protocol handlers at ecosystem-specific paths
- Middleware: request ID, real IP, logging, panic recovery, active request tracking
- Web UI: dashboard, package browser, source browser, version comparison
- Templates are embedded in the binary via `//go:embed`
- Enrichment API for package metadata, vulnerability scanning, and outdated detection
- Health, stats, and Prometheus metrics endpoints. `/health` runs an active write → size-check → read → verify → delete probe against the storage backend and returns a structured JSON response (`HealthResponse`) with `"ok"` / `"error"` status per subsystem. Probe results are cached (default 30 s, configurable via `health.storage_probe_interval`) to avoid overwhelming remote backends.
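
The probe sequence can be sketched against an in-memory backend. `memStorage` and `probeStorage` are hypothetical stand-ins for this sketch; it assumes the probe object is cleaned up on every path after a successful write, and it omits the result caching.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// memStorage is a toy backend; failOpen simulates a broken read path.
type memStorage struct {
	objects  map[string][]byte
	failOpen bool
}

func newMemStorage() *memStorage { return &memStorage{objects: map[string][]byte{}} }

func (s *memStorage) Store(path string, data []byte) error {
	s.objects[path] = append([]byte(nil), data...)
	return nil
}

func (s *memStorage) Size(path string) (int64, error) {
	data, ok := s.objects[path]
	if !ok {
		return 0, errors.New("not found")
	}
	return int64(len(data)), nil
}

func (s *memStorage) Open(path string) ([]byte, error) {
	if s.failOpen {
		return nil, errors.New("simulated read failure")
	}
	data, ok := s.objects[path]
	if !ok {
		return nil, errors.New("not found")
	}
	return data, nil
}

func (s *memStorage) Delete(path string) error {
	delete(s.objects, path)
	return nil
}

// probeStorage runs write -> size-check -> read -> verify -> delete and
// returns the name of the failing step, or "" when the backend is healthy.
func probeStorage(s *memStorage, key string) (step string, err error) {
	payload := []byte("health-probe")
	if err = s.Store(key, payload); err != nil {
		return "write", err
	}
	// After a successful write, always try to delete the probe object.
	// An earlier failure stays the primary error.
	defer func() {
		if delErr := s.Delete(key); delErr != nil && err == nil {
			step, err = "delete", delErr
		}
	}()
	size, err := s.Size(key)
	if err != nil {
		return "size-check", err
	}
	if size != int64(len(payload)) {
		return "size-check", fmt.Errorf("size %d, want %d", size, len(payload))
	}
	got, err := s.Open(key)
	if err != nil {
		return "read", err
	}
	if !bytes.Equal(got, payload) {
		return "verify", errors.New("content mismatch")
	}
	return "", nil
}

func main() {
	s := newMemStorage()
	step, err := probeStorage(s, ".health-probe")
	fmt.Printf("step=%q err=%v leftover=%d\n", step, err, len(s.objects))
}
```
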
### `internal/metrics`
Prometheus metrics for cache performance, upstream latency, storage operations, and active requests. See the Monitoring section of the README for the full metric list.
### Cooldown
Version age filtering for supply chain attack mitigation, provided by [github.com/git-pkgs/cooldown](https://github.com/git-pkgs/cooldown). Configurable at global, ecosystem, and per-package levels. Supported by npm, PyPI, pub.dev, and Composer handlers.
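
The idea behind age filtering can be sketched in a few lines. This does not use or mirror the `github.com/git-pkgs/cooldown` API; the `policy` type, its lookup order (package, then ecosystem, then global), and the durations are assumptions for illustration.

```go
package main

import (
	"fmt"
	"time"
)

type versionInfo struct {
	Version     string
	PublishedAt time.Time
}

// policy is a hypothetical three-level configuration.
type policy struct {
	Global     time.Duration
	Ecosystems map[string]time.Duration // keyed by ecosystem, e.g. "npm"
	Packages   map[string]time.Duration // keyed by "ecosystem/name"
}

// cooldownFor picks the most specific setting (assumed precedence).
func (p policy) cooldownFor(ecosystem, name string) time.Duration {
	if d, ok := p.Packages[ecosystem+"/"+name]; ok {
		return d
	}
	if d, ok := p.Ecosystems[ecosystem]; ok {
		return d
	}
	return p.Global
}

// filterByAge keeps only versions published at least minAge before now.
func filterByAge(versions []versionInfo, minAge time.Duration, now time.Time) []versionInfo {
	var out []versionInfo
	for _, v := range versions {
		if now.Sub(v.PublishedAt) >= minAge {
			out = append(out, v)
		}
	}
	return out
}

// demoCooldown applies a 7-day npm cooldown to two versions of one package.
func demoCooldown() []string {
	now := time.Date(2026, 1, 20, 0, 0, 0, 0, time.UTC)
	p := policy{
		Global:     72 * time.Hour,
		Ecosystems: map[string]time.Duration{"npm": 7 * 24 * time.Hour},
	}
	versions := []versionInfo{
		{Version: "4.17.20", PublishedAt: now.Add(-30 * 24 * time.Hour)},
		{Version: "4.17.21", PublishedAt: now.Add(-2 * 24 * time.Hour)},
	}
	var names []string
	for _, v := range filterByAge(versions, p.cooldownFor("npm", "lodash"), now) {
		names = append(names, v.Version)
	}
	return names
}

func main() {
	fmt.Println(demoCooldown())
}
```
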
### `internal/enrichment`
Package metadata enrichment. Fetches license, description, homepage, repository URL, and vulnerability data from upstream registries. Powers the `/api/` endpoints and the web UI's package detail pages.
### `internal/mirror`
Selective package mirroring for pre-populating the proxy cache. Supports multiple input sources: individual PURLs (versioned or unversioned), CycloneDX/SPDX SBOM files, and full registry enumeration. Uses a bounded worker pool backed by `errgroup` to download artifacts in parallel, reusing `handler.Proxy.GetOrFetchArtifact()` for the actual fetch-and-cache work.
The package also provides a `MetadataCache` for storing raw upstream metadata blobs so the proxy can serve metadata responses offline. The `JobStore` manages async mirror jobs exposed via the `/api/mirror` endpoints.
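
A bounded worker pool of the kind described above can be sketched with the standard library alone. Note the substitution: this uses a semaphore channel and `sync.WaitGroup` rather than `errgroup`, and unlike `errgroup` with a context it does not cancel remaining work after the first error. `mirrorAll` and `demoMirror` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// mirrorAll calls fetch for every PURL with at most `workers` calls in flight
// and returns the first error encountered, if any.
func mirrorAll(purls []string, workers int, fetch func(purl string) error) error {
	if workers < 1 {
		workers = 1
	}
	sem := make(chan struct{}, workers)
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		firstErr error
	)
	for _, p := range purls {
		sem <- struct{}{} // blocks while `workers` fetches are running
		wg.Add(1)
		go func(p string) {
			defer wg.Done()
			defer func() { <-sem }()
			if err := fetch(p); err != nil {
				mu.Lock()
				if firstErr == nil {
					firstErr = fmt.Errorf("%s: %w", p, err)
				}
				mu.Unlock()
			}
		}(p)
	}
	wg.Wait()
	return firstErr
}

// demoMirror runs n fake fetches and reports the peak concurrency observed.
func demoMirror(n, workers int) (peak, fetched int, err error) {
	var mu sync.Mutex
	inFlight := 0
	purls := make([]string, n)
	for i := range purls {
		purls[i] = fmt.Sprintf("pkg:npm/example-%d@1.0.0", i)
	}
	err = mirrorAll(purls, workers, func(string) error {
		mu.Lock()
		inFlight++
		if inFlight > peak {
			peak = inFlight
		}
		mu.Unlock()
		time.Sleep(5 * time.Millisecond) // stand-in for the download
		mu.Lock()
		inFlight--
		fetched++
		mu.Unlock()
		return nil
	})
	return peak, fetched, err
}

func main() {
	peak, fetched, err := demoMirror(20, 4)
	fmt.Println(peak, fetched, err)
}
```
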
### `internal/config`
Configuration loading.
- Supports YAML and JSON files
- Environment variable overrides (PROXY_ prefix)
- Command line flag overrides
- Validation
## Extending the Proxy
### Adding a New Registry
1. Add URL resolution in `upstream/resolver.go`
2. Create handler in `handler/newregistry.go`
3. Mount in `server/server.go`
4. Add tests
### Adding a New Storage Backend
1. Implement `storage.Storage` interface
2. Add configuration options in `config/config.go`
3. Add initialization in `server/server.go`
### Cache Eviction
The database tracks `hit_count` and `last_accessed_at` for LRU eviction. Query with:
```go
db.GetLeastRecentlyUsedArtifacts(limit)
```
Eviction can be implemented as:
1. Background goroutine checking `GetTotalCacheSize()`
2. When over limit, get LRU artifacts
3. Delete from storage and clear database records
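
The selection step of such an eviction pass might look like the sketch below. It assumes the least-recently-used artifacts have already been loaded into memory; the `artifact` type and `planEviction` are hypothetical, and the actual deletion from storage and the database is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// artifact carries only the fields eviction needs (illustrative type).
type artifact struct {
	StoragePath    string
	Size           int64
	LastAccessedAt int64 // Unix seconds
}

// planEviction returns the least recently used artifacts to remove so that
// the total cached size drops to maxBytes or below.
func planEviction(artifacts []artifact, maxBytes int64) []artifact {
	var total int64
	for _, a := range artifacts {
		total += a.Size
	}
	if total <= maxBytes {
		return nil
	}
	// Copy before sorting so the caller's slice is left untouched.
	lru := append([]artifact(nil), artifacts...)
	sort.Slice(lru, func(i, j int) bool {
		return lru[i].LastAccessedAt < lru[j].LastAccessedAt
	})
	var victims []artifact
	for _, a := range lru {
		if total <= maxBytes {
			break
		}
		victims = append(victims, a)
		total -= a.Size
	}
	return victims
}

func main() {
	cached := []artifact{
		{StoragePath: "npm/a/1.0.0/a-1.0.0.tgz", Size: 100, LastAccessedAt: 3},
		{StoragePath: "npm/b/1.0.0/b-1.0.0.tgz", Size: 200, LastAccessedAt: 1},
		{StoragePath: "npm/c/1.0.0/c-1.0.0.tgz", Size: 300, LastAccessedAt: 2},
	}
	for _, v := range planEviction(cached, 350) {
		fmt.Println("evict", v.StoragePath)
	}
}
```
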
## Design Decisions
**Why SQLite?**
- Simple deployment (single file)
- No external dependencies
- Good performance for this workload
- Pure Go driver available (no CGO)
**Why rewrite metadata URLs?**
- Ensures clients fetch artifacts through proxy
- Alternative: Let clients fetch directly, miss cache opportunity
**Why not cache metadata (by default)?**
- Simplicity - no invalidation logic needed
- Fresh data - new versions visible immediately
- Metadata is small, upstream fetch is fast
- Set `cache_metadata: true` or use the mirror command to enable metadata caching for offline use via the `metadata_cache` table
**Why stream artifacts?**
- Memory efficient - don't load large files into RAM
- Better latency - start sending while still receiving
**Why atomic writes?**
- Prevents serving partial files
- Safe concurrent access
- Clean recovery from crashes