forked from mirrors/pkg-proxy
Add a `proxy mirror` CLI command and `/api/mirror` API endpoints that pre-populate the cache from various input sources: individual PURLs, SBOM files (CycloneDX and SPDX), or full registry enumeration. The mirror reuses the existing handler.Proxy.GetOrFetchArtifact() pipeline so cached artifacts are identical to those fetched on demand. A bounded worker pool controls download parallelism. Metadata caching is opt-in via `cache_metadata: true` in config (or PROXY_CACHE_METADATA=true). The mirror command always enables it. When enabled, upstream metadata responses are stored for offline fallback with ETag-based conditional revalidation. New internal/mirror package with Source interface, PURLSource, SBOMSource, RegistrySource, and async JobStore. New metadata_cache database table for offline metadata serving.
362 lines
14 KiB
Markdown
362 lines
14 KiB
Markdown
# Architecture
|
|
|
|
This document describes the internal architecture of the git-pkgs proxy.
|
|
|
|
## Overview
|
|
|
|
The proxy is a caching HTTP server that sits between package manager clients and upstream registries. It intercepts requests, checks a local cache, and either serves cached content or fetches from upstream.
|
|
|
|
```
|
|
┌──────────────────────────────────────────────────────────────────┐
|
|
│ HTTP Server │
|
|
│ ┌──────────────────────────────────────────────────────────┐ │
|
|
│ │ Router (Chi) │ │
|
|
│ │ /npm/* -> NPMHandler /health -> healthHandler │ │
|
|
│ │ /cargo/* -> CargoHandler /stats -> statsHandler │ │
|
|
│ │ /gem/* -> GemHandler /metrics -> prometheus │ │
|
|
│ │ ...16 ecosystems /api/* -> APIHandler │ │
|
|
│ │ / -> Web UI │ │
|
|
│ └──────────────────────────────────────────────────────────┘ │
|
|
│ │ │ │ │
|
|
│ ▼ ▼ ▼ │
|
|
│ ┌───────────┐ ┌─────────────┐ ┌─────────────┐ │
|
|
│ │ Database │ │ Storage │ │ Upstream │ │
|
|
│ │ SQLite or │ │ Filesystem │ │ Registries │ │
|
|
│ │ Postgres │ │ or S3 │ │ (Fetcher) │ │
|
|
│ └───────────┘ └─────────────┘ └─────────────┘ │
|
|
└──────────────────────────────────────────────────────────────────┘
|
|
```
|
|
|
|
## Request Flow
|
|
|
|
### Metadata Request (npm example)
|
|
|
|
1. Client requests `GET /npm/lodash`
|
|
2. NPMHandler receives request
|
|
3. Handler fetches metadata from upstream `registry.npmjs.org/lodash`
|
|
4. Handler rewrites tarball URLs in metadata to point at proxy
|
|
5. Handler returns modified metadata to client
|
|
|
|
Metadata is not cached - always fetched fresh. This ensures clients see new versions immediately.
|
|
|
|
### Artifact Download (npm example)
|
|
|
|
1. Client requests `GET /npm/lodash/-/lodash-4.17.21.tgz`
|
|
2. NPMHandler extracts package name and version from URL
|
|
3. Handler calls `Proxy.GetOrFetchArtifact()`
|
|
4. Proxy checks database for cached artifact:
|
|
|
|
**Cache Hit:**
|
|
- Look up artifact record in database
|
|
- Open file from storage
|
|
- Record hit (increment counter, update last_accessed_at)
|
|
- Return reader to handler
|
|
- Handler streams file to client
|
|
|
|
**Cache Miss:**
|
|
- Resolve download URL using Resolver
|
|
- Fetch artifact from upstream using Fetcher
|
|
- Store artifact in Storage (returns size, hash)
|
|
- Create/update database records (package, version, artifact)
|
|
- Open stored file
|
|
- Return reader to handler
|
|
- Handler streams file to client
|
|
|
|
```
|
|
┌────────┐ GET /npm/lodash/-/lodash-4.17.21.tgz ┌─────────────┐
|
|
│ Client │ ──────────────────────────────────────▶│ NPMHandler │
|
|
└────────┘ └──────┬──────┘
|
|
│
|
|
▼
|
|
┌─────────────────┐
|
|
│ Proxy │
|
|
│ GetOrFetch │
|
|
└────────┬────────┘
|
|
│
|
|
┌───────────────────┼───────────────────┐
|
|
│ │ │
|
|
▼ ▼ ▼
|
|
┌───────────┐ ┌───────────┐ ┌───────────┐
|
|
│ Database │ │ Storage │ │ Upstream │
|
|
│ (lookup) │ │ (read) │ │ (fetch) │
|
|
└───────────┘ └───────────┘ └───────────┘
|
|
```
|
|
|
|
## Package Structure
|
|
|
|
### `internal/database`
|
|
|
|
SQLite or PostgreSQL database for cache metadata. SQLite uses `modernc.org/sqlite` (pure Go, no CGO). PostgreSQL uses `lib/pq`.
|
|
|
|
The schema is compatible with [git-pkgs](https://github.com/git-pkgs) databases. The proxy adds the `artifacts` and `vulnerabilities` tables on top of the shared `packages` and `versions` tables, so both tools can point at the same database.
|
|
|
|
**Tables:**
|
|
|
|
```sql
|
|
packages (
|
|
id INTEGER PRIMARY KEY, -- SERIAL on Postgres
|
|
purl TEXT NOT NULL, -- unique, e.g. pkg:npm/lodash
|
|
ecosystem TEXT NOT NULL,
|
|
name TEXT NOT NULL,
|
|
latest_version TEXT,
|
|
license TEXT,
|
|
description TEXT,
|
|
homepage TEXT,
|
|
repository_url TEXT,
|
|
registry_url TEXT,
|
|
supplier_name TEXT,
|
|
supplier_type TEXT,
|
|
source TEXT,
|
|
enriched_at DATETIME,
|
|
vulns_synced_at DATETIME,
|
|
created_at DATETIME,
|
|
updated_at DATETIME
|
|
)
|
|
-- indexes: purl (unique), (ecosystem, name)
|
|
|
|
versions (
|
|
id INTEGER PRIMARY KEY,
|
|
purl TEXT NOT NULL, -- unique, e.g. pkg:npm/lodash@4.17.21
|
|
package_purl TEXT NOT NULL, -- FK to packages.purl
|
|
license TEXT,
|
|
published_at DATETIME,
|
|
integrity TEXT, -- subresource integrity hash
|
|
yanked INTEGER DEFAULT 0, -- BOOLEAN on Postgres
|
|
source TEXT,
|
|
enriched_at DATETIME,
|
|
created_at DATETIME,
|
|
updated_at DATETIME
|
|
)
|
|
-- indexes: purl (unique), package_purl
|
|
|
|
artifacts (
|
|
id INTEGER PRIMARY KEY,
|
|
version_purl TEXT NOT NULL,
|
|
filename TEXT NOT NULL,
|
|
upstream_url TEXT NOT NULL,
|
|
storage_path TEXT, -- null until cached
|
|
content_hash TEXT, -- SHA-256
|
|
size INTEGER, -- BIGINT on Postgres
|
|
content_type TEXT,
|
|
fetched_at DATETIME,
|
|
hit_count INTEGER DEFAULT 0, -- BIGINT on Postgres
|
|
last_accessed_at DATETIME,
|
|
created_at DATETIME,
|
|
updated_at DATETIME
|
|
)
|
|
-- indexes: (version_purl, filename) unique, storage_path, last_accessed_at
|
|
|
|
vulnerabilities (
|
|
id INTEGER PRIMARY KEY,
|
|
vuln_id TEXT NOT NULL, -- e.g. CVE-2021-1234
|
|
ecosystem TEXT NOT NULL,
|
|
package_name TEXT NOT NULL,
|
|
severity TEXT,
|
|
summary TEXT,
|
|
fixed_version TEXT,
|
|
cvss_score REAL,
|
|
"references" TEXT, -- JSON array
|
|
fetched_at DATETIME,
|
|
created_at DATETIME,
|
|
updated_at DATETIME
|
|
)
|
|
-- indexes: (vuln_id, ecosystem, package_name) unique, (ecosystem, package_name)
|
|
|
|
metadata_cache (
|
|
id INTEGER PRIMARY KEY,
|
|
ecosystem TEXT NOT NULL,
|
|
name TEXT NOT NULL,
|
|
storage_path TEXT NOT NULL,
|
|
etag TEXT,
|
|
content_type TEXT,
|
|
size INTEGER, -- BIGINT on Postgres
|
|
fetched_at DATETIME,
|
|
created_at DATETIME,
|
|
updated_at DATETIME
|
|
)
|
|
-- indexes: (ecosystem, name) unique
|
|
```
|
|
|
|
On PostgreSQL, `INTEGER PRIMARY KEY` becomes `SERIAL`, `DATETIME` becomes `TIMESTAMP`, `INTEGER DEFAULT 0` booleans become `BOOLEAN DEFAULT FALSE`, and size/count columns use `BIGINT`.
|
|
|
|
The `MigrateSchema()` function handles backward compatibility with older git-pkgs databases by running named migrations that add missing columns and tables. See [migrations.md](migrations.md) for how to add new schema changes.
|
|
|
|
**Key operations:**
|
|
- `GetPackageByPURL()` - Look up package by PURL
|
|
- `GetVersionByPURL()` - Look up version by PURL
|
|
- `GetArtifact()` - Look up artifact by version + filename
|
|
- `UpsertPackage/Version/Artifact()` - Insert or update records
|
|
- `RecordArtifactHit()` - Increment hit counter, update access time
|
|
- `GetLeastRecentlyUsedArtifacts()` - For cache eviction
|
|
- `SearchPackages()` - Full-text search across cached packages
|
|
|
|
### `internal/storage`
|
|
|
|
File storage abstraction. Current implementation uses local filesystem.
|
|
|
|
**Interface:**
|
|
|
|
```go
|
|
type Storage interface {
|
|
Store(ctx, path, reader) (size, hash, error)
|
|
Open(ctx, path) (io.ReadCloser, error)
|
|
Exists(ctx, path) (bool, error)
|
|
Delete(ctx, path) error
|
|
Size(ctx, path) (int64, error)
|
|
UsedSpace(ctx) (int64, error)
|
|
}
|
|
```
|
|
|
|
**Filesystem implementation:**
|
|
- Stores files in nested directories: `{ecosystem}/{name}/{version}/{filename}`
|
|
- Atomic writes using temp file + rename
|
|
- Computes SHA256 hash during write
|
|
- Cleans up empty parent directories on delete
|
|
|
|
**Path structure:**
|
|
|
|
```
|
|
cache/artifacts/
|
|
├── npm/
|
|
│ ├── lodash/
|
|
│ │ └── 4.17.21/
|
|
│ │ └── lodash-4.17.21.tgz
|
|
│ └── @babel/
|
|
│ └── core/
|
|
│ └── 7.23.0/
|
|
│ └── core-7.23.0.tgz
|
|
└── cargo/
|
|
└── serde/
|
|
└── 1.0.193/
|
|
└── serde-1.0.193.crate
|
|
```
|
|
|
|
### `internal/upstream`
|
|
|
|
Fetches artifacts from upstream registries.
|
|
|
|
**Fetcher:**
|
|
- HTTP client with configurable timeout (5 min default for large artifacts)
|
|
- Exponential backoff retry on 429 (rate limit) and 5xx errors
|
|
- Returns streaming reader (doesn't load into memory)
|
|
- Configurable user-agent
|
|
|
|
**Resolver:**
|
|
- Determines download URL for a package/version
|
|
- Handles ecosystem-specific URL patterns:
|
|
- npm: `https://registry.npmjs.org/{name}/-/{shortname}-{version}.tgz`
|
|
- cargo: `https://static.crates.io/crates/{name}/{name}-{version}.crate`
|
|
- etc.
|
|
|
|
### `internal/handler`
|
|
|
|
HTTP protocol handlers for each registry type.
|
|
|
|
**Proxy (shared):**
|
|
- `GetOrFetchArtifact()` - Main cache logic
|
|
- Coordinates database, storage, and fetcher
|
|
- Handles cache hit/miss flow
|
|
|
|
**NPMHandler:**
|
|
- `handlePackageMetadata()` - Proxy + rewrite metadata
|
|
- `handleDownload()` - Serve cached artifact
|
|
- Rewrites tarball URLs to point at proxy
|
|
|
|
**CargoHandler:**
|
|
- `handleConfig()` - Return registry config
|
|
- `handleIndex()` - Proxy sparse index
|
|
- `handleDownload()` - Serve cached crate
|
|
|
|
### `internal/server`
|
|
|
|
HTTP server setup, web UI, and API handlers.
|
|
|
|
- Creates and wires together all components
|
|
- Mounts protocol handlers at ecosystem-specific paths
|
|
- Middleware: request ID, real IP, logging, panic recovery, active request tracking
|
|
- Web UI: dashboard, package browser, source browser, version comparison
|
|
- Templates are embedded in the binary via `//go:embed`
|
|
- Enrichment API for package metadata, vulnerability scanning, and outdated detection
|
|
- Health, stats, and Prometheus metrics endpoints
|
|
|
|
### `internal/metrics`
|
|
|
|
Prometheus metrics for cache performance, upstream latency, storage operations, and active requests. See the Monitoring section of the README for the full metric list.
|
|
|
|
### `internal/cooldown`
|
|
|
|
Version age filtering for supply chain attack mitigation. Configurable at global, ecosystem, and per-package levels. Supported by npm, PyPI, pub.dev, and Composer handlers.
|
|
|
|
### `internal/enrichment`
|
|
|
|
Package metadata enrichment. Fetches license, description, homepage, repository URL, and vulnerability data from upstream registries. Powers the `/api/` endpoints and the web UI's package detail pages.
|
|
|
|
### `internal/mirror`
|
|
|
|
Selective package mirroring for pre-populating the proxy cache. Supports multiple input sources: individual PURLs (versioned or unversioned), CycloneDX/SPDX SBOM files, and full registry enumeration. Uses a bounded worker pool backed by `errgroup` to download artifacts in parallel, reusing `handler.Proxy.GetOrFetchArtifact()` for the actual fetch-and-cache work.
|
|
|
|
The package also provides a `MetadataCache` for storing raw upstream metadata blobs so the proxy can serve metadata responses offline. The `JobStore` manages async mirror jobs exposed via the `/api/mirror` endpoints.
|
|
|
|
### `internal/config`
|
|
|
|
Configuration loading.
|
|
|
|
- Supports YAML and JSON files
|
|
- Environment variable overrides (PROXY_ prefix)
|
|
- Command line flag overrides
|
|
- Validation
|
|
|
|
## Extending the Proxy
|
|
|
|
### Adding a New Registry
|
|
|
|
1. Add URL resolution in `upstream/resolver.go`
|
|
2. Create handler in `handler/newregistry.go`
|
|
3. Mount in `server/server.go`
|
|
4. Add tests
|
|
|
|
### Adding a New Storage Backend
|
|
|
|
1. Implement `storage.Storage` interface
|
|
2. Add configuration options in `config/config.go`
|
|
3. Add initialization in `server/server.go`
|
|
|
|
### Cache Eviction
|
|
|
|
The database tracks `hit_count` and `last_accessed_at` for LRU eviction. Query with:
|
|
|
|
```go
|
|
db.GetLeastRecentlyUsedArtifacts(limit)
|
|
```
|
|
|
|
Eviction can be implemented as:
|
|
1. Background goroutine checking `GetTotalCacheSize()`
|
|
2. When over limit, get LRU artifacts
|
|
3. Delete from storage and clear database records
|
|
|
|
## Design Decisions
|
|
|
|
**Why SQLite?**
|
|
- Simple deployment (single file)
|
|
- No external dependencies
|
|
- Good performance for this workload
|
|
- Pure Go driver available (no CGO)
|
|
|
|
**Why rewrite metadata URLs?**
|
|
- Ensures clients fetch artifacts through proxy
|
|
- Alternative: Let clients fetch directly, miss cache opportunity
|
|
|
|
**Why not cache metadata (by default)?**
|
|
- Simplicity - no invalidation logic needed
|
|
- Fresh data - new versions visible immediately
|
|
- Metadata is small, upstream fetch is fast
|
|
- Set `cache_metadata: true` or use the mirror command to enable metadata caching for offline use via the `metadata_cache` table
|
|
|
|
**Why stream artifacts?**
|
|
- Memory efficient - don't load large files into RAM
|
|
- Better latency - start sending while still receiving
|
|
|
|
**Why atomic writes?**
|
|
- Prevents serving partial files
|
|
- Safe concurrent access
|
|
- Clean recovery from crashes
|