Architecture
This document describes the internal architecture of the git-pkgs proxy.
Overview
The proxy is a caching HTTP server that sits between package manager clients and upstream registries. It intercepts requests, checks a local cache, and either serves cached content or fetches from upstream.
┌──────────────────────────────────────────────────────────────────┐
│                           HTTP Server                            │
│  ┌──────────────────────────────────────────────────────────┐    │
│  │                       Router (Chi)                       │    │
│  │  /npm/*   -> NPMHandler     /health  -> healthHandler    │    │
│  │  /cargo/* -> CargoHandler   /stats   -> statsHandler     │    │
│  │  /gem/*   -> GemHandler     /metrics -> prometheus       │    │
│  │  ...16 ecosystems           /api/*   -> APIHandler       │    │
│  │                             /        -> Web UI           │    │
│  └──────────────────────────────────────────────────────────┘    │
│            │                   │                    │            │
│            ▼                   ▼                    ▼            │
│      ┌───────────┐      ┌─────────────┐      ┌─────────────┐     │
│      │ Database  │      │   Storage   │      │  Upstream   │     │
│      │ SQLite or │      │ Filesystem  │      │ Registries  │     │
│      │ Postgres  │      │    or S3    │      │  (Fetcher)  │     │
│      └───────────┘      └─────────────┘      └─────────────┘     │
└──────────────────────────────────────────────────────────────────┘
Request Flow
Metadata Request (npm example)
- Client requests GET /npm/lodash
- NPMHandler receives request
- Handler fetches metadata from upstream registry.npmjs.org/lodash
- Handler rewrites tarball URLs in metadata to point at proxy
- Handler returns modified metadata to client
By default, metadata is not cached; it is always fetched fresh, so clients see new versions immediately.
Artifact Download (npm example)
- Client requests GET /npm/lodash/-/lodash-4.17.21.tgz
- NPMHandler extracts package name and version from URL
- Handler calls Proxy.GetOrFetchArtifact()
- Proxy checks database for cached artifact:
Cache Hit:
- Look up artifact record in database
- Open file from storage
- Record hit (increment counter, update last_accessed_at)
- Return reader to handler
- Handler streams file to client
Cache Miss:
- Resolve download URL using Resolver
- Fetch artifact from upstream using Fetcher
- Store artifact in Storage (returns size, hash)
- Create/update database records (package, version, artifact)
- Open stored file
- Return reader to handler
- Handler streams file to client
┌────────┐  GET /npm/lodash/-/lodash-4.17.21.tgz  ┌─────────────┐
│ Client │ ──────────────────────────────────────▶│ NPMHandler  │
└────────┘                                        └──────┬──────┘
                                                         │
                                                         ▼
                                                ┌─────────────────┐
                                                │      Proxy      │
                                                │   GetOrFetch    │
                                                └────────┬────────┘
                                                         │
                                     ┌───────────────────┼───────────────────┐
                                     │                   │                   │
                                     ▼                   ▼                   ▼
                               ┌───────────┐       ┌───────────┐       ┌───────────┐
                               │ Database  │       │  Storage  │       │ Upstream  │
                               │ (lookup)  │       │  (read)   │       │  (fetch)  │
                               └───────────┘       └───────────┘       └───────────┘
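The hit/miss branching above can be sketched with in-memory maps standing in for the database and storage. This is a simplified illustration, not the body of Proxy.GetOrFetchArtifact(): the cache type, its fields, and the fetch callback are hypothetical, and the sketch returns byte slices where the flow above returns a reader.

```go
package main

import "fmt"

// cache is a toy stand-in for the database + storage pair.
// It is not safe for concurrent use.
type cache struct {
	blobs map[string][]byte // "storage": key -> content
	hits  map[string]int    // "database": key -> hit counter
}

func newCache() *cache {
	return &cache{blobs: map[string][]byte{}, hits: map[string]int{}}
}

// getOrFetch returns the artifact for key and whether it was a cache hit.
// fetch is only called on a miss.
func (c *cache) getOrFetch(key string, fetch func() ([]byte, error)) ([]byte, bool, error) {
	if data, ok := c.blobs[key]; ok {
		c.hits[key]++ // cache hit: record it
		return data, true, nil
	}
	data, err := fetch() // cache miss: go upstream
	if err != nil {
		return nil, false, err
	}
	c.blobs[key] = data // store the artifact, then create its record
	c.hits[key] = 0
	return data, false, nil
}

func main() {
	c := newCache()
	fetch := func() ([]byte, error) {
		fmt.Println("fetching from upstream")
		return []byte("tarball bytes"), nil
	}
	for i := 0; i < 2; i++ {
		data, hit, err := c.getOrFetch("npm/lodash/4.17.21/lodash-4.17.21.tgz", fetch)
		if err != nil {
			panic(err)
		}
		fmt.Printf("hit=%v size=%d\n", hit, len(data))
	}
}
```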
Package Structure
internal/database
SQLite or PostgreSQL database for cache metadata. SQLite uses modernc.org/sqlite (pure Go, no CGO). PostgreSQL uses lib/pq.
The schema is compatible with git-pkgs databases. The proxy adds the artifacts, vulnerabilities, and metadata_cache tables on top of the shared packages and versions tables, so both tools can point at the same database.
Tables:
packages (
id INTEGER PRIMARY KEY, -- SERIAL on Postgres
purl TEXT NOT NULL, -- unique, e.g. pkg:npm/lodash
ecosystem TEXT NOT NULL,
name TEXT NOT NULL,
latest_version TEXT,
license TEXT,
description TEXT,
homepage TEXT,
repository_url TEXT,
registry_url TEXT,
supplier_name TEXT,
supplier_type TEXT,
source TEXT,
enriched_at DATETIME,
vulns_synced_at DATETIME,
created_at DATETIME,
updated_at DATETIME
)
-- indexes: purl (unique), (ecosystem, name)
versions (
id INTEGER PRIMARY KEY,
purl TEXT NOT NULL, -- unique, e.g. pkg:npm/lodash@4.17.21
package_purl TEXT NOT NULL, -- FK to packages.purl
license TEXT,
published_at DATETIME,
integrity TEXT, -- subresource integrity hash
yanked INTEGER DEFAULT 0, -- BOOLEAN on Postgres
source TEXT,
enriched_at DATETIME,
created_at DATETIME,
updated_at DATETIME
)
-- indexes: purl (unique), package_purl
artifacts (
id INTEGER PRIMARY KEY,
version_purl TEXT NOT NULL,
filename TEXT NOT NULL,
upstream_url TEXT NOT NULL,
storage_path TEXT, -- null until cached
content_hash TEXT, -- SHA-256
size INTEGER, -- BIGINT on Postgres
content_type TEXT,
fetched_at DATETIME,
hit_count INTEGER DEFAULT 0, -- BIGINT on Postgres
last_accessed_at DATETIME,
created_at DATETIME,
updated_at DATETIME
)
-- indexes: (version_purl, filename) unique, storage_path, last_accessed_at
vulnerabilities (
id INTEGER PRIMARY KEY,
vuln_id TEXT NOT NULL, -- e.g. CVE-2021-1234
ecosystem TEXT NOT NULL,
package_name TEXT NOT NULL,
severity TEXT,
summary TEXT,
fixed_version TEXT,
cvss_score REAL,
"references" TEXT, -- JSON array
fetched_at DATETIME,
created_at DATETIME,
updated_at DATETIME
)
-- indexes: (vuln_id, ecosystem, package_name) unique, (ecosystem, package_name)
metadata_cache (
id INTEGER PRIMARY KEY,
ecosystem TEXT NOT NULL,
name TEXT NOT NULL,
storage_path TEXT NOT NULL,
etag TEXT,
content_type TEXT,
size INTEGER, -- BIGINT on Postgres
fetched_at DATETIME,
created_at DATETIME,
updated_at DATETIME
)
-- indexes: (ecosystem, name) unique
On PostgreSQL, INTEGER PRIMARY KEY becomes SERIAL, DATETIME becomes TIMESTAMP, INTEGER DEFAULT 0 booleans become BOOLEAN DEFAULT FALSE, and size/count columns use BIGINT.
The MigrateSchema() function handles backward compatibility with older git-pkgs databases by running named migrations that add missing columns and tables. See migrations.md for how to add new schema changes.
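The idea of named migrations that run at most once can be sketched as follows. This is a toy model, not MigrateSchema() itself: the schema is a map of table name to column names, and the migration names and the applied-set bookkeeping are assumptions made for illustration.

```go
package main

import "fmt"

// migration is a named, idempotent-by-bookkeeping schema change.
type migration struct {
	name  string
	apply func(schema map[string][]string)
}

// migrations is an ordered list; the names here are hypothetical.
var migrations = []migration{
	{name: "add_versions_integrity", apply: func(s map[string][]string) {
		s["versions"] = append(s["versions"], "integrity")
	}},
	{name: "create_artifacts", apply: func(s map[string][]string) {
		s["artifacts"] = []string{"id", "version_purl", "filename"}
	}},
}

// migrate runs every migration not yet recorded in applied and
// returns the names of the ones it ran, in order.
func migrate(schema map[string][]string, applied map[string]bool) []string {
	var ran []string
	for _, m := range migrations {
		if applied[m.name] {
			continue
		}
		m.apply(schema)
		applied[m.name] = true
		ran = append(ran, m.name)
	}
	return ran
}

func main() {
	schema := map[string][]string{"versions": {"id", "purl"}}
	applied := map[string]bool{}
	fmt.Println("first run: ", migrate(schema, applied))
	fmt.Println("second run:", migrate(schema, applied))
}
```

Recording each migration by name is what lets an older database be brought forward without re-running changes it already has.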
Key operations:
- GetPackageByPURL() - Look up package by PURL
- GetVersionByPURL() - Look up version by PURL
- GetArtifact() - Look up artifact by version + filename
- UpsertPackage/Version/Artifact() - Insert or update records
- RecordArtifactHit() - Increment hit counter, update access time
- GetLeastRecentlyUsedArtifacts() - For cache eviction
- SearchPackages() - Full-text search across cached packages
internal/storage
File storage abstraction. Current implementation uses local filesystem.
Interface:
type Storage interface {
Store(ctx, path, reader) (size, hash, error)
Open(ctx, path) (io.ReadCloser, error)
Exists(ctx, path) (bool, error)
Delete(ctx, path) error
Size(ctx, path) (int64, error)
UsedSpace(ctx) (int64, error)
}
Filesystem implementation:
- Stores files in nested directories: {ecosystem}/{name}/{version}/{filename}
- Atomic writes using temp file + rename
- Computes SHA256 hash during write
- Cleans up empty parent directories on delete
Path structure:
cache/artifacts/
├── npm/
│ ├── lodash/
│ │ └── 4.17.21/
│ │ └── lodash-4.17.21.tgz
│ └── @babel/
│ └── core/
│ └── 7.23.0/
│ └── core-7.23.0.tgz
└── cargo/
└── serde/
└── 1.0.193/
└── serde-1.0.193.crate
internal/upstream
Fetches artifacts from upstream registries.
Fetcher:
- HTTP client with configurable timeout (5 min default for large artifacts)
- Exponential backoff retry on 429 (rate limit) and 5xx errors
- Returns streaming reader (doesn't load into memory)
- Configurable user-agent
Resolver:
- Determines download URL for a package/version
- Handles ecosystem-specific URL patterns:
  - npm: https://registry.npmjs.org/{name}/-/{shortname}-{version}.tgz
  - cargo: https://static.crates.io/crates/{name}/{name}-{version}.crate
  - etc.
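The two URL patterns listed above can be expressed as a small function. This is a sketch covering only npm and cargo. The name resolveDownloadURL is hypothetical, and treating {shortname} as the part of a scoped npm name after the slash is an assumption consistent with the @babel/core example in the storage layout.

```go
package main

import (
	"fmt"
	"strings"
)

// resolveDownloadURL builds the upstream artifact URL for a package version.
// Only the two patterns documented above are covered.
func resolveDownloadURL(ecosystem, name, version string) (string, error) {
	switch ecosystem {
	case "npm":
		// For scoped packages such as @babel/core, the file name uses
		// only the part after the scope.
		short := name
		if i := strings.LastIndex(name, "/"); i >= 0 {
			short = name[i+1:]
		}
		return fmt.Sprintf("https://registry.npmjs.org/%s/-/%s-%s.tgz", name, short, version), nil
	case "cargo":
		return fmt.Sprintf("https://static.crates.io/crates/%s/%s-%s.crate", name, name, version), nil
	default:
		return "", fmt.Errorf("unsupported ecosystem %q", ecosystem)
	}
}

func main() {
	for _, p := range [][3]string{
		{"npm", "lodash", "4.17.21"},
		{"npm", "@babel/core", "7.23.0"},
		{"cargo", "serde", "1.0.193"},
	} {
		url, err := resolveDownloadURL(p[0], p[1], p[2])
		if err != nil {
			panic(err)
		}
		fmt.Println(url)
	}
}
```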
internal/handler
HTTP protocol handlers for each registry type.
Proxy (shared):
- GetOrFetchArtifact() - Main cache logic
- Coordinates database, storage, and fetcher
- Handles cache hit/miss flow
NPMHandler:
- handlePackageMetadata() - Proxy + rewrite metadata
- handleDownload() - Serve cached artifact
- Rewrites tarball URLs to point at proxy
CargoHandler:
- handleConfig() - Return registry config
- handleIndex() - Proxy sparse index
- handleDownload() - Serve cached crate
internal/server
HTTP server setup, web UI, and API handlers.
- Creates and wires together all components
- Mounts protocol handlers at ecosystem-specific paths
- Middleware: request ID, real IP, logging, panic recovery, active request tracking
- Web UI: dashboard, package browser, source browser, version comparison
- Templates are embedded in the binary via //go:embed
- Enrichment API for package metadata, vulnerability scanning, and outdated detection
- Health, stats, and Prometheus metrics endpoints
internal/metrics
Prometheus metrics for cache performance, upstream latency, storage operations, and active requests. See the Monitoring section of the README for the full metric list.
Cooldown
Version age filtering for supply chain attack mitigation, provided by github.com/git-pkgs/cooldown. Configurable at global, ecosystem, and per-package levels. Supported by npm, PyPI, pub.dev, and Composer handlers.
internal/enrichment
Package metadata enrichment. Fetches license, description, homepage, repository URL, and vulnerability data from upstream registries. Powers the /api/ endpoints and the web UI's package detail pages.
internal/mirror
Selective package mirroring for pre-populating the proxy cache. Supports multiple input sources: individual PURLs (versioned or unversioned), CycloneDX/SPDX SBOM files, and full registry enumeration. Uses a bounded worker pool backed by errgroup to download artifacts in parallel, reusing handler.Proxy.GetOrFetchArtifact() for the actual fetch-and-cache work.
The package also provides a MetadataCache for storing raw upstream metadata blobs so the proxy can serve metadata responses offline. The JobStore manages async mirror jobs exposed via the /api/mirror endpoints.
internal/config
Configuration loading.
- Supports YAML and JSON files
- Environment variable overrides (PROXY_ prefix)
- Command line flag overrides
- Validation
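A sketch of how environment variables with the PROXY_ prefix might override values loaded from a file. The Config fields and the variable names PROXY_LISTEN and PROXY_CACHE_METADATA are hypothetical; only the prefix and the cache_metadata option appear in this document.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// Config is a two-field stand-in for the real configuration struct.
type Config struct {
	Listen        string
	CacheMetadata bool
}

// applyEnv overlays environment values onto cfg. The lookup function is
// injected so the logic can be exercised without touching the process env.
func applyEnv(cfg *Config, lookup func(string) (string, bool)) error {
	if v, ok := lookup("PROXY_LISTEN"); ok {
		cfg.Listen = v
	}
	if v, ok := lookup("PROXY_CACHE_METADATA"); ok {
		b, err := strconv.ParseBool(v)
		if err != nil {
			return fmt.Errorf("PROXY_CACHE_METADATA: %w", err)
		}
		cfg.CacheMetadata = b
	}
	return nil
}

func main() {
	cfg := Config{Listen: ":8080"} // as if loaded from a YAML or JSON file
	if err := applyEnv(&cfg, os.LookupEnv); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", cfg)
}
```

Command line flags would be applied after this step so that they take precedence over both the file and the environment.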
Extending the Proxy
Adding a New Registry
- Add URL resolution in upstream/resolver.go
- Create handler in handler/newregistry.go
- Mount in server/server.go
- Add tests
Adding a New Storage Backend
- Implement storage.Storage interface
- Add configuration options in config/config.go
- Add initialization in server/server.go
Cache Eviction
The database tracks hit_count and last_accessed_at for LRU eviction. Query with:
db.GetLeastRecentlyUsedArtifacts(limit)
Eviction can be implemented as:
- Background goroutine checking GetTotalCacheSize()
- When over limit, get LRU artifacts
- Delete from storage and clear database records
Design Decisions
Why SQLite?
- Simple deployment (single file)
- No external dependencies
- Good performance for this workload
- Pure Go driver available (no CGO)
Why rewrite metadata URLs?
- Ensures clients fetch artifacts through proxy
- Alternative: Let clients fetch directly, miss cache opportunity
Why not cache metadata (by default)?
- Simplicity - no invalidation logic needed
- Fresh data - new versions visible immediately
- Metadata is small, upstream fetch is fast
- Set cache_metadata: true or use the mirror command to enable metadata caching for offline use via the metadata_cache table
Why stream artifacts?
- Memory efficient - don't load large files into RAM
- Better latency - start sending while still receiving
Why atomic writes?
- Prevents serving partial files
- Safe concurrent access
- Clean recovery from crashes