pkg-proxy/docs/architecture.md

362 lines
14 KiB
Markdown
Raw Permalink Normal View History

2026-01-20 21:52:44 +00:00
# Architecture
This document describes the internal architecture of the git-pkgs proxy.
## Overview
The proxy is a caching HTTP server that sits between package manager clients and upstream registries. It intercepts requests, checks a local cache, and either serves cached content or fetches from upstream.
```
┌──────────────────────────────────────────────────────────────────┐
│ HTTP Server │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Router (Chi) │ │
│ │ /npm/* -> NPMHandler /health -> healthHandler │ │
│ │ /cargo/* -> CargoHandler /stats -> statsHandler │ │
│ │ /gem/* -> GemHandler /metrics -> prometheus │ │
│ │ ...16 ecosystems /api/* -> APIHandler │ │
│ │ / -> Web UI │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
2026-01-20 21:52:44 +00:00
│ ┌───────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Database │ │ Storage │ │ Upstream │ │
│ │ SQLite or │ │ Filesystem │ │ Registries │ │
│ │ Postgres │ │ or S3 │ │ (Fetcher) │ │
2026-01-20 21:52:44 +00:00
│ └───────────┘ └─────────────┘ └─────────────┘ │
└──────────────────────────────────────────────────────────────────┘
2026-01-20 21:52:44 +00:00
```
## Request Flow
### Metadata Request (npm example)
1. Client requests `GET /npm/lodash`
2. NPMHandler receives request
3. Handler fetches metadata from upstream `registry.npmjs.org/lodash`
4. Handler rewrites tarball URLs in metadata to point at proxy
5. Handler returns modified metadata to client
Metadata is not cached - always fetched fresh. This ensures clients see new versions immediately.
### Artifact Download (npm example)
1. Client requests `GET /npm/lodash/-/lodash-4.17.21.tgz`
2. NPMHandler extracts package name and version from URL
3. Handler calls `Proxy.GetOrFetchArtifact()`
4. Proxy checks database for cached artifact:
**Cache Hit:**
- Look up artifact record in database
- Open file from storage
- Record hit (increment counter, update last_accessed_at)
- Return reader to handler
- Handler streams file to client
**Cache Miss:**
- Resolve download URL using Resolver
- Fetch artifact from upstream using Fetcher
- Store artifact in Storage (returns size, hash)
- Create/update database records (package, version, artifact)
- Open stored file
- Return reader to handler
- Handler streams file to client
```
┌────────┐ GET /npm/lodash/-/lodash-4.17.21.tgz ┌─────────────┐
│ Client │ ──────────────────────────────────────▶│ NPMHandler │
└────────┘ └──────┬──────┘
┌─────────────────┐
│ Proxy │
│ GetOrFetch │
└────────┬────────┘
┌───────────────────┼───────────────────┐
│ │ │
▼ ▼ ▼
┌───────────┐ ┌───────────┐ ┌───────────┐
│ Database │ │ Storage │ │ Upstream │
│ (lookup) │ │ (read) │ │ (fetch) │
└───────────┘ └───────────┘ └───────────┘
```
## Package Structure
### `internal/database`
SQLite or PostgreSQL database for cache metadata. SQLite uses `modernc.org/sqlite` (pure Go, no CGO). PostgreSQL uses `lib/pq`.
The schema is compatible with [git-pkgs](https://github.com/git-pkgs) databases. The proxy adds the `artifacts` and `vulnerabilities` tables on top of the shared `packages` and `versions` tables, so both tools can point at the same database.
2026-01-20 21:52:44 +00:00
**Tables:**
```sql
packages (
id INTEGER PRIMARY KEY, -- SERIAL on Postgres
purl TEXT NOT NULL, -- unique, e.g. pkg:npm/lodash
ecosystem TEXT NOT NULL,
name TEXT NOT NULL,
latest_version TEXT,
license TEXT,
description TEXT,
homepage TEXT,
repository_url TEXT,
registry_url TEXT,
supplier_name TEXT,
supplier_type TEXT,
source TEXT,
enriched_at DATETIME,
vulns_synced_at DATETIME,
created_at DATETIME,
updated_at DATETIME
2026-01-20 21:52:44 +00:00
)
-- indexes: purl (unique), (ecosystem, name)
2026-01-20 21:52:44 +00:00
versions (
id INTEGER PRIMARY KEY,
purl TEXT NOT NULL, -- unique, e.g. pkg:npm/lodash@4.17.21
package_purl TEXT NOT NULL, -- FK to packages.purl
license TEXT,
published_at DATETIME,
integrity TEXT, -- subresource integrity hash
yanked INTEGER DEFAULT 0, -- BOOLEAN on Postgres
source TEXT,
enriched_at DATETIME,
created_at DATETIME,
updated_at DATETIME
2026-01-20 21:52:44 +00:00
)
-- indexes: purl (unique), package_purl
2026-01-20 21:52:44 +00:00
artifacts (
id INTEGER PRIMARY KEY,
version_purl TEXT NOT NULL,
filename TEXT NOT NULL,
upstream_url TEXT NOT NULL,
storage_path TEXT, -- null until cached
content_hash TEXT, -- SHA-256
size INTEGER, -- BIGINT on Postgres
content_type TEXT,
fetched_at DATETIME,
hit_count INTEGER DEFAULT 0, -- BIGINT on Postgres
last_accessed_at DATETIME,
created_at DATETIME,
updated_at DATETIME
)
-- indexes: (version_purl, filename) unique, storage_path, last_accessed_at
vulnerabilities (
id INTEGER PRIMARY KEY,
vuln_id TEXT NOT NULL, -- e.g. CVE-2021-1234
ecosystem TEXT NOT NULL,
package_name TEXT NOT NULL,
severity TEXT,
summary TEXT,
fixed_version TEXT,
cvss_score REAL,
"references" TEXT, -- JSON array
fetched_at DATETIME,
created_at DATETIME,
updated_at DATETIME
2026-01-20 21:52:44 +00:00
)
-- indexes: (vuln_id, ecosystem, package_name) unique, (ecosystem, package_name)
metadata_cache (
id INTEGER PRIMARY KEY,
ecosystem TEXT NOT NULL,
name TEXT NOT NULL,
storage_path TEXT NOT NULL,
etag TEXT,
content_type TEXT,
size INTEGER, -- BIGINT on Postgres
fetched_at DATETIME,
created_at DATETIME,
updated_at DATETIME
)
-- indexes: (ecosystem, name) unique
2026-01-20 21:52:44 +00:00
```
On PostgreSQL, `INTEGER PRIMARY KEY` becomes `SERIAL`, `DATETIME` becomes `TIMESTAMP`, `INTEGER DEFAULT 0` booleans become `BOOLEAN DEFAULT FALSE`, and size/count columns use `BIGINT`.
The `MigrateSchema()` function handles backward compatibility with older git-pkgs databases by running named migrations that add missing columns and tables. See [migrations.md](migrations.md) for how to add new schema changes.
2026-01-20 21:52:44 +00:00
**Key operations:**
- `GetPackageByPURL()` - Look up package by PURL
- `GetVersionByPURL()` - Look up version by PURL
- `GetArtifact()` - Look up artifact by version + filename
- `UpsertPackage/Version/Artifact()` - Insert or update records
- `RecordArtifactHit()` - Increment hit counter, update access time
- `GetLeastRecentlyUsedArtifacts()` - For cache eviction
- `SearchPackages()` - Full-text search across cached packages
2026-01-20 21:52:44 +00:00
### `internal/storage`
File storage abstraction. Current implementation uses local filesystem.
**Interface:**
```go
type Storage interface {
Store(ctx, path, reader) (size, hash, error)
Open(ctx, path) (io.ReadCloser, error)
Exists(ctx, path) (bool, error)
Delete(ctx, path) error
Size(ctx, path) (int64, error)
UsedSpace(ctx) (int64, error)
}
```
**Filesystem implementation:**
- Stores files in nested directories: `{ecosystem}/{name}/{version}/{filename}`
- Atomic writes using temp file + rename
- Computes SHA256 hash during write
- Cleans up empty parent directories on delete
**Path structure:**
```
cache/artifacts/
├── npm/
│ ├── lodash/
│ │ └── 4.17.21/
│ │ └── lodash-4.17.21.tgz
│ └── @babel/
│ └── core/
│ └── 7.23.0/
│ └── core-7.23.0.tgz
└── cargo/
└── serde/
└── 1.0.193/
└── serde-1.0.193.crate
```
### `internal/upstream`
Fetches artifacts from upstream registries.
**Fetcher:**
- HTTP client with configurable timeout (5 min default for large artifacts)
- Exponential backoff retry on 429 (rate limit) and 5xx errors
- Returns streaming reader (doesn't load into memory)
- Configurable user-agent
**Resolver:**
- Determines download URL for a package/version
- Handles ecosystem-specific URL patterns:
- npm: `https://registry.npmjs.org/{name}/-/{shortname}-{version}.tgz`
- cargo: `https://static.crates.io/crates/{name}/{name}-{version}.crate`
- etc.
### `internal/handler`
HTTP protocol handlers for each registry type.
**Proxy (shared):**
- `GetOrFetchArtifact()` - Main cache logic
- Coordinates database, storage, and fetcher
- Handles cache hit/miss flow
**NPMHandler:**
- `handlePackageMetadata()` - Proxy + rewrite metadata
- `handleDownload()` - Serve cached artifact
- Rewrites tarball URLs to point at proxy
**CargoHandler:**
- `handleConfig()` - Return registry config
- `handleIndex()` - Proxy sparse index
- `handleDownload()` - Serve cached crate
### `internal/server`
HTTP server setup, web UI, and API handlers.
2026-01-20 21:52:44 +00:00
- Creates and wires together all components
- Mounts protocol handlers at ecosystem-specific paths
- Middleware: request ID, real IP, logging, panic recovery, active request tracking
- Web UI: dashboard, package browser, source browser, version comparison
- Templates are embedded in the binary via `//go:embed`
- Enrichment API for package metadata, vulnerability scanning, and outdated detection
- Health, stats, and Prometheus metrics endpoints
### `internal/metrics`
Prometheus metrics for cache performance, upstream latency, storage operations, and active requests. See the Monitoring section of the README for the full metric list.
### `internal/cooldown`
Version age filtering for supply chain attack mitigation. Configurable at global, ecosystem, and per-package levels. Supported by npm, PyPI, pub.dev, and Composer handlers.
### `internal/enrichment`
Package metadata enrichment. Fetches license, description, homepage, repository URL, and vulnerability data from upstream registries. Powers the `/api/` endpoints and the web UI's package detail pages.
2026-01-20 21:52:44 +00:00
### `internal/mirror`
Selective package mirroring for pre-populating the proxy cache. Supports multiple input sources: individual PURLs (versioned or unversioned), CycloneDX/SPDX SBOM files, and full registry enumeration. Uses a bounded worker pool backed by `errgroup` to download artifacts in parallel, reusing `handler.Proxy.GetOrFetchArtifact()` for the actual fetch-and-cache work.
The package also provides a `MetadataCache` for storing raw upstream metadata blobs so the proxy can serve metadata responses offline. The `JobStore` manages async mirror jobs exposed via the `/api/mirror` endpoints.
2026-01-20 21:52:44 +00:00
### `internal/config`
Configuration loading.
- Supports YAML and JSON files
- Environment variable overrides (PROXY_ prefix)
- Command line flag overrides
- Validation
## Extending the Proxy
### Adding a New Registry
1. Add URL resolution in `upstream/resolver.go`
2. Create handler in `handler/newregistry.go`
3. Mount in `server/server.go`
4. Add tests
### Adding a New Storage Backend
1. Implement `storage.Storage` interface
2. Add configuration options in `config/config.go`
3. Add initialization in `server/server.go`
### Cache Eviction
The database tracks `hit_count` and `last_accessed_at` for LRU eviction. Query with:
```go
db.GetLeastRecentlyUsedArtifacts(limit)
```
Eviction can be implemented as:
1. Background goroutine checking `GetTotalCacheSize()`
2. When over limit, get LRU artifacts
3. Delete from storage and clear database records
## Design Decisions
**Why SQLite?**
- Simple deployment (single file)
- No external dependencies
- Good performance for this workload
- Pure Go driver available (no CGO)
**Why rewrite metadata URLs?**
- Ensures clients fetch artifacts through proxy
- Alternative: Let clients fetch directly, miss cache opportunity
**Why not cache metadata (by default)?**
2026-01-20 21:52:44 +00:00
- Simplicity - no invalidation logic needed
- Fresh data - new versions visible immediately
- Metadata is small, upstream fetch is fast
- Set `cache_metadata: true` or use the mirror command to enable metadata caching for offline use via the `metadata_cache` table
2026-01-20 21:52:44 +00:00
**Why stream artifacts?**
- Memory efficient - don't load large files into RAM
- Better latency - start sending while still receiving
**Why atomic writes?**
- Prevents serving partial files
- Safe concurrent access
- Clean recovery from crashes