
Content archive

When you want every storage artefact from an execution in one file — for handoff to external tooling, for offline analysis, for archival — use the content archive API. It’s async: you request, poll, then download.

Five endpoints live under /api/v1/content. Start by requesting an archive:

```sh
curl -X POST http://localhost:8000/api/v1/content/archive/prepare \
  -H 'Content-Type: application/json' \
  -d '{"execution_id": "EXEC_ID"}'
```

Response:

```json
{
  "job_id": "archive-job-uuid",
  "status": "queued"
}
```

The server queues the archive job, walks executions/EXEC_ID/** in storage, bundles into a zip, and writes the zip back to storage under archives/<job_id>.zip.
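A minimal in-memory sketch of the bundling step, assuming the storage walk yields (key, bytes) pairs (the real provider interface may differ):

```python
import io
import zipfile

def bundle_archive(objects):
    """Bundle (storage_key, data) pairs into a zip, returning the zip bytes.

    `objects` is an iterable of (key, bytes) pairs, e.g. the result of
    walking executions/EXEC_ID/** in storage. The job handler would then
    write the result back under archives/<job_id>.zip.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for key, data in objects:
            zf.writestr(key, data)
    return buf.getvalue()
```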

Narrow the archive:

```sh
curl -X POST http://localhost:8000/api/v1/content/archive/prepare \
  -H 'Content-Type: application/json' \
  -d '{
    "execution_id": "EXEC_ID",
    "routes": ["sitemap_scraper"],
    "stages": ["web_scraper"],
    "include_metadata": true
  }'
```
  • routes — only include objects under these route IDs
  • stages — only include objects with these stage names
  • include_metadata — true (default) ships metadata sidecars; false keeps the archive smaller
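If you are scripting against the API, a small helper can assemble this request body. The field names match the request shown above; the helper itself is illustrative:

```python
def build_prepare_request(execution_id, routes=None, stages=None, include_metadata=True):
    """Build the JSON body for POST /api/v1/content/archive/prepare.

    routes/stages are optional filters. include_metadata defaults to
    true on the server, so it is only sent when explicitly disabled.
    """
    body = {"execution_id": execution_id}
    if routes:
        body["routes"] = routes
    if stages:
        body["stages"] = stages
    if not include_metadata:
        body["include_metadata"] = False
    return body
```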
Check the job's progress:

```sh
curl http://localhost:8000/api/v1/content/archive/JOB_ID/status
```

Response:

```json
{
  "job_id": "...",
  "status": "running",
  "progress": 0.42,
  "object_count": 1200,
  "bytes_packed": 34567890
}
```

Statuses: queued, running, completed, failed. progress is a fraction in [0, 1].
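A client-side polling loop over these statuses might look like this; `get_status` stands in for whatever HTTP call fetches the status document:

```python
import time

def wait_for_archive(get_status, poll_interval=2.0, timeout=600.0, sleep=time.sleep):
    """Poll until the archive job completes or fails.

    `get_status` is any callable returning the status dict from
    GET /api/v1/content/archive/{job_id}/status.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status["status"] == "completed":
            return status
        if status["status"] == "failed":
            raise RuntimeError(f"archive job failed: {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError("archive job did not complete in time")
        sleep(poll_interval)
```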

Once status == "completed":

```sh
curl -O http://localhost:8000/api/v1/content/archive/JOB_ID/download
```

Returns the raw zip bytes. Archives can be large, so downloads are streamed in chunks.
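The download can be consumed as a stream. This sketch writes an iterable of byte chunks (e.g. from a streaming HTTP client) to disk without buffering the whole zip in memory:

```python
def save_chunks(chunks, path):
    """Write an iterable of byte chunks to `path`, returning total bytes.

    `chunks` mirrors a streaming download body, so the full archive
    never has to sit in memory at once.
    """
    total = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
            total += len(chunk)
    return total
```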

The CLI hides the polling loop:

```sh
factflow storage download EXEC_ID --output-dir ./out/
```

Internally:

  1. POSTs /content/archive/prepare
  2. Polls /content/archive/{job_id}/status every 2 seconds until completed
  3. Downloads, extracts to --output-dir
  4. Deletes the server-side archive
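Those four steps can be sketched as one function; the `api` object and its method names (prepare, status, download, delete) are stand-ins for a real client, not a documented interface:

```python
import io
import time
import zipfile

def download_execution(api, execution_id, output_dir):
    """Mirror the CLI's prepare -> poll -> download -> delete flow."""
    job_id = api.prepare(execution_id)          # 1. request the archive
    while True:                                 # 2. poll every 2 seconds
        s = api.status(job_id)
        if s["status"] == "completed":
            break
        if s["status"] == "failed":
            raise RuntimeError(f"archive job {job_id} failed")
        time.sleep(2)
    data = api.download(job_id)                 # 3. download and extract
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        zf.extractall(output_dir)
    api.delete(job_id)                          # 4. clean up server-side
```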

If you want to know how big an archive would be before requesting:

```sh
curl 'http://localhost:8000/api/v1/content/count?execution_id=EXEC_ID'
```

Returns the object count and total byte size. It only lists objects rather than downloading them, so it is fast.
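A pre-flight gate over that result might look like this; the `object_count` and `total_bytes` field names are assumptions, not confirmed schema:

```python
def should_archive(count_response, max_bytes):
    """Decide whether an archive request is worth making.

    `count_response` is the /content/count result; field names here are
    illustrative. `max_bytes` is a client-side guess at the server's
    configured size limit, which is deployment-specific.
    """
    return (count_response["object_count"] > 0
            and count_response["total_bytes"] <= max_bytes)
```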

Server-side archives are short-lived:

  • TTL 24 hours by default
  • Cleaned up automatically
  • Re-request if you missed the window
Requests can fail:

  • Execution has no storage — 404 No objects found
  • Archive too large (beyond the configured limit) — 413 Payload too large. Scope the request with routes/stages.
  • Storage provider error mid-archive — job transitions to failed with an error payload; retry is manual
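Mapping these failure modes in a client is straightforward; this helper and its messages are illustrative only:

```python
def explain_archive_error(status_code):
    """Translate a prepare/download status code into a remediation hint,
    following the documented error cases."""
    if status_code == 404:
        return "no objects found: the execution has no storage"
    if status_code == 413:
        return "archive too large: scope the request with routes/stages"
    if 200 <= status_code < 300:
        return "ok"
    return f"unexpected status {status_code}"
```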