Storage model

Factflow stores every meaningful artefact — scraped pages, converted markdown, segments, embeddings, knowledge maps — in a pluggable storage backend. Two providers ship: filesystem (for dev + single-node prod) and MinIO / S3 (for multi-node prod).

Every storage backend implements StorageProtocol from factflow-protocols:

  • write(key, data, metadata) — persist bytes plus structured metadata
  • read(key) — fetch bytes
  • exists(key) — predicate
  • list(prefix) — async iterator of keys matching a prefix
  • delete(key) — remove
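The five methods above can be sketched as a Python Protocol. This is a sketch, not the real factflow-protocols definition: the async signatures and type hints are assumptions (the page only states that list(prefix) is an async iterator).

```python
from typing import AsyncIterator, Protocol, runtime_checkable


@runtime_checkable
class StorageProtocol(Protocol):
    """Hypothetical sketch of the factflow-protocols storage interface."""

    async def write(self, key: str, data: bytes, metadata: dict) -> None:
        """Persist bytes plus structured metadata."""
        ...

    async def read(self, key: str) -> bytes:
        """Fetch bytes for a key."""
        ...

    async def exists(self, key: str) -> bool:
        """Predicate: is the key present?"""
        ...

    def list(self, prefix: str) -> AsyncIterator[str]:
        """Async iterator of keys matching a prefix."""
        ...

    async def delete(self, key: str) -> None:
        """Remove the object."""
        ...
```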

Providers are URI-selected:

  • file://<path> → FilesystemStorageProvider
  • minio://<bucket> or s3://<bucket> → MinioStorageProvider

factflow_infra.storage.create_storage_provider(settings) dispatches at construction time.
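The dispatch can be sketched as scheme matching on the parsed URI. Note the real create_storage_provider takes a settings object, so the direct-URI signature and the provider constructors here are assumptions:

```python
from urllib.parse import urlparse


# Hypothetical stand-ins for the real provider classes.
class FilesystemStorageProvider:
    def __init__(self, root: str) -> None:
        self.root = root


class MinioStorageProvider:
    def __init__(self, bucket: str) -> None:
        self.bucket = bucket


def create_storage_provider(storage_uri: str):
    """Pick a provider by URI scheme at construction time."""
    parsed = urlparse(storage_uri)
    if parsed.scheme == "file":
        return FilesystemStorageProvider(parsed.path)
    if parsed.scheme in ("minio", "s3"):
        return MinioStorageProvider(parsed.netloc)
    raise ValueError(f"unknown storage scheme: {parsed.scheme!r}")
```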

Every object has a companion .meta.json sidecar file (on the filesystem backend) or native object metadata (on S3/MinIO). The schema is defined in StorageMetadataConfig and captures:

  • Provenance — which execution, which route, which adapter wrote this, when
  • Config hash — the config snapshot’s hash, for reproducibility
  • Lineage reference — the lineage row id
  • Custom fields — per-pipeline extensions (e.g., source URL for webscraper, embedding model for embeddings)

Sidecars are additive. A pipeline can extend the schema with its own fields without breaking consumers; unknown fields are preserved on read.
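A hypothetical sidecar for a webscraper artefact, round-tripped through JSON. Every field name here is illustrative; only the four categories above come from the schema description. The point of the sketch is the additive contract: a consumer that parses the whole document keeps fields it doesn't recognise.

```python
import json

# Illustrative sidecar content; real field names live in StorageMetadataConfig.
sidecar = {
    "execution_id": "exec-01",
    "route_id": "web-ingest",
    "adapter": "webscraper",
    "written_at": "2024-01-01T00:00:00Z",
    "config_hash": "abc123",
    "lineage_id": 42,
    "source_url": "https://example.com/page",  # per-pipeline extension field
}


def read_sidecar(raw: str) -> dict:
    """Parse a .meta.json sidecar; unknown extension fields survive the round trip."""
    return json.loads(raw)


raw = json.dumps(sidecar)
restored = read_sidecar(raw)
```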

factflow storage metadata "executions/ID/ROUTE/STAGE/MSG"

Returns the sidecar JSON.

Every shipped adapter writes under this pattern:

executions/<execution-id>/<route-id>/<stage-name>/<message-id>[.ext]

This is convention, not enforced by the protocol. Deviating from it is possible, but it breaks every downstream consumer (replay, lineage lookups, CLI browsers). Don’t break it without a reason.
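The convention is trivial to encode in one place, which is the easiest way not to break it. A sketch (the function name is hypothetical):

```python
def storage_key(
    execution_id: str,
    route_id: str,
    stage: str,
    message_id: str,
    ext: str = "",
) -> str:
    """Build the conventional key:
    executions/<execution-id>/<route-id>/<stage-name>/<message-id>[.ext]
    """
    suffix = f".{ext}" if ext else ""
    return f"executions/{execution_id}/{route_id}/{stage}/{message_id}{suffix}"
```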

The single load-bearing invariant: storage must contain the artefact if replay wants to use it.

Replay reads executions/SRC/ROUTE/STAGE/* and republishes the contents to target route queues. If an object is missing, replay fails — storage is not a cache.
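A minimal sketch of that read path, with an in-memory stand-in for a provider. A missing key raises instead of returning a fallback, matching the "storage is not a cache" rule; the publish callback stands in for the target route queues:

```python
import asyncio


class InMemoryStorage:
    """Stand-in for a real provider, just enough to show the replay read path."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    async def write(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    async def read(self, key: str) -> bytes:
        # KeyError here is the replay failure mode: storage is not a cache.
        return self._objects[key]

    async def list(self, prefix: str):
        for key in list(self._objects):
            if key.startswith(prefix):
                yield key


async def replay(storage, src_execution: str, route: str, stage: str, publish) -> int:
    """Read executions/SRC/ROUTE/STAGE/* and republish each artefact."""
    prefix = f"executions/{src_execution}/{route}/{stage}/"
    count = 0
    async for key in storage.list(prefix):
        payload = await storage.read(key)  # raises if the object vanished
        await publish(key, payload)
        count += 1
    return count
```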

This shapes retention policy:

  • Don’t auto-expire objects referenced by executions you might replay
  • Do prune executions you’ve written off (cascade: delete storage keys by prefix first, then delete lineage rows, then delete the execution row)
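The prune cascade above can be sketched directly; the order comes from this page, while the DB calls are hypothetical stand-ins for whatever deletes lineage and execution rows:

```python
async def prune_execution(storage, db, execution_id: str) -> None:
    """Cascade delete a written-off execution: storage keys by prefix first,
    then lineage rows, then the execution row."""
    async for key in storage.list(f"executions/{execution_id}/"):
        await storage.delete(key)
    await db.delete_lineage_rows(execution_id)   # hypothetical DB API
    await db.delete_execution_row(execution_id)  # hypothetical DB API
```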

Writes optionally publish StorageEvent messages to a configured event channel. Subscribers:

  • The frontend’s “live execution view” via /api/v1/executions/<id>/storage-stream SSE
  • The CLI’s factflow storage watch --execution <id> stream
  • The webhook dispatcher (for subscribed URLs)
  • Replay (to trigger when an expected object arrives late)

Events are best-effort. A missed event never causes data loss — the artefact is still in storage.
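The best-effort contract amounts to write-then-publish with a swallowed publish failure: the artefact is durable before the event is attempted. A sketch, with a hypothetical event-channel API:

```python
import logging

log = logging.getLogger("factflow.storage")


async def write_with_event(storage, events, key: str, data: bytes, metadata: dict) -> None:
    """Persist first, then publish a StorageEvent best-effort.

    A failed publish is logged and dropped: subscribers may miss the event,
    but the artefact is already in storage.
    """
    await storage.write(key, data, metadata)
    try:
        await events.publish({"kind": "StorageEvent", "key": key})  # hypothetical channel API
    except Exception:
        log.warning("storage event dropped for %s", key)
```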

Performance characteristics of the two providers:

  • Filesystem is fast for writes, slow for list(prefix) when prefixes have many objects. Fine for dev, acceptable for single-node prod.
  • MinIO / S3 has constant-time listing but higher per-request latency. Use for multi-node or when you need external durability guarantees.
  • Neither is optimised for small-object random reads. Pipelines that need frequent random access should materialise into Postgres (for structured data) or a vector DB (for embeddings) rather than re-reading storage.
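The materialise-once pattern in the last bullet can be sketched as a single pass from storage into a relational table, after which all random access hits the database. sqlite3 stands in for Postgres here, and the table shape is illustrative:

```python
import sqlite3


def materialise(rows, db_path: str = ":memory:") -> sqlite3.Connection:
    """Load (key, body) artefacts into a table once, instead of re-reading
    storage on every access. rows: iterable of (storage_key, text_body)."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS segments (key TEXT PRIMARY KEY, body TEXT)")
    conn.executemany("INSERT OR REPLACE INTO segments VALUES (?, ?)", rows)
    conn.commit()
    return conn
```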