
Content archive

When you want every storage artefact from an execution in one file — for handoff to external tooling, for offline analysis, for archival — use the content archive API. It’s async: you request, poll, then download.

Five endpoints live under /api/v1/content. Start by requesting an archive:

```sh
curl -X POST http://localhost:8000/api/v1/content/archive/prepare \
  -H 'Content-Type: application/json' \
  -d '{"execution_id": "EXEC_ID"}'
```

Response:

```json
{
  "job_id": "archive-job-uuid",
  "status": "queued"
}
```

The server queues the archive job, walks executions/EXEC_ID/** in storage, bundles into a zip, and writes the zip back to storage under archives/<job_id>.zip.
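A minimal in-memory sketch of the bundling step, assuming the storage walk yields (key, bytes) pairs (the real provider interface may differ):

```python
import io
import zipfile

def bundle_archive(objects):
    """Bundle (storage_key, data) pairs into a zip, returning the zip bytes.

    `objects` is an iterable of (key, bytes) pairs, e.g. the result of
    walking executions/EXEC_ID/** in storage. The job handler would then
    write the result back under archives/<job_id>.zip.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for key, data in objects:
            zf.writestr(key, data)
    return buf.getvalue()
```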

Narrow the archive:

```sh
curl -X POST http://localhost:8000/api/v1/content/archive/prepare \
  -H 'Content-Type: application/json' \
  -d '{
    "execution_id": "EXEC_ID",
    "routes": ["sitemap_scraper"],
    "stages": ["web_scraper"],
    "include_metadata": true
  }'
```
  • routes — only include objects under these route IDs
  • stages — only include objects with these stage names
  • include_metadata — true (default) ships metadata sidecars; false keeps the archive smaller
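If you are scripting against the API, a small helper can assemble this request body. The field names match the request shown above; the helper itself is illustrative:

```python
def build_prepare_request(execution_id, routes=None, stages=None, include_metadata=True):
    """Build the JSON body for POST /api/v1/content/archive/prepare.

    routes/stages are optional filters. include_metadata defaults to
    true on the server, so it is only sent when explicitly disabled.
    """
    body = {"execution_id": execution_id}
    if routes:
        body["routes"] = routes
    if stages:
        body["stages"] = stages
    if not include_metadata:
        body["include_metadata"] = False
    return body
```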
Check the job's progress:

```sh
curl http://localhost:8000/api/v1/content/archive/JOB_ID/status
```

Response:

```json
{
  "job_id": "...",
  "status": "running",
  "progress": 0.42,
  "object_count": 1200,
  "bytes_packed": 34567890
}
```

Statuses: queued, running, completed, failed. progress is a fraction in [0, 1].
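A client-side polling loop over these statuses might look like this; `get_status` stands in for whatever HTTP call fetches the status document:

```python
import time

def wait_for_archive(get_status, poll_interval=2.0, timeout=600.0, sleep=time.sleep):
    """Poll until the archive job completes or fails.

    `get_status` is any callable returning the status dict from
    GET /api/v1/content/archive/{job_id}/status.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status["status"] == "completed":
            return status
        if status["status"] == "failed":
            raise RuntimeError(f"archive job failed: {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError("archive job did not complete in time")
        sleep(poll_interval)
```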

Once status == "completed":

```sh
curl -O http://localhost:8000/api/v1/content/archive/JOB_ID/download
```

Returns the raw zip bytes. Archives can be large, so downloads are streamed in chunks.
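The download can be consumed as a stream. This sketch writes an iterable of byte chunks (e.g. from a streaming HTTP client) to disk without buffering the whole zip in memory:

```python
def save_chunks(chunks, path):
    """Write an iterable of byte chunks to `path`, returning total bytes.

    `chunks` mirrors a streaming download body, so the full archive
    never has to sit in memory at once.
    """
    total = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
            total += len(chunk)
    return total
```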

The CLI hides the polling loop:

```sh
factflow storage download EXEC_ID --output-dir ./out/
```

Internally:

  1. POSTs /content/archive/prepare
  2. Polls /content/archive/{job_id}/status every 2 seconds until completed
  3. Downloads, extracts to --output-dir
  4. Deletes the server-side archive
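Those four steps can be sketched as one function; the `api` object and its method names (prepare, status, download, delete) are stand-ins for a real client, not a documented interface:

```python
import io
import time
import zipfile

def download_execution(api, execution_id, output_dir):
    """Mirror the CLI's prepare -> poll -> download -> delete flow."""
    job_id = api.prepare(execution_id)          # 1. request the archive
    while True:                                 # 2. poll every 2 seconds
        s = api.status(job_id)
        if s["status"] == "completed":
            break
        if s["status"] == "failed":
            raise RuntimeError(f"archive job {job_id} failed")
        time.sleep(2)
    data = api.download(job_id)                 # 3. download and extract
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        zf.extractall(output_dir)
    api.delete(job_id)                          # 4. clean up server-side
```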

If you want to know how big an archive would be before requesting:

```sh
curl 'http://localhost:8000/api/v1/content/count?execution_id=EXEC_ID'
```

Returns the object count and total byte size. It only lists objects rather than downloading them, so it is fast.
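A pre-flight gate over that result might look like this; the `object_count` and `total_bytes` field names are assumptions, not confirmed schema:

```python
def should_archive(count_response, max_bytes):
    """Decide whether an archive request is worth making.

    `count_response` is the /content/count result; field names here are
    illustrative. `max_bytes` is a client-side guess at the server's
    configured size limit, which is deployment-specific.
    """
    return (count_response["object_count"] > 0
            and count_response["total_bytes"] <= max_bytes)
```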

Server-side archives are short-lived:

  • TTL 24 hours by default
  • Cleaned up automatically
  • Re-request if you missed the window
Requests can fail:

  • Execution has no storage — 404 No objects found
  • Archive too large (beyond the configured limit) — 413 Payload too large. Scope the request with routes/stages.
  • Storage provider error mid-archive — job transitions to failed with an error payload; retry is manual
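Mapping these failure modes in a client is straightforward; this helper and its messages are illustrative only:

```python
def explain_archive_error(status_code):
    """Translate a prepare/download status code into a remediation hint,
    following the documented error cases."""
    if status_code == 404:
        return "no objects found: the execution has no storage"
    if status_code == 413:
        return "archive too large: scope the request with routes/stages"
    if 200 <= status_code < 300:
        return "ok"
    return f"unexpected status {status_code}"
```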