Storage model

Factflow stores every meaningful artefact — scraped pages, converted markdown, segments, embeddings, knowledge maps — in a pluggable storage backend. Two providers ship: filesystem (for dev + single-node prod) and MinIO / S3 (for multi-node prod).

Every storage backend implements StorageProtocol from factflow-protocols:

  • write(key, data, metadata) — persist bytes plus structured metadata
  • read(key) — fetch bytes
  • exists(key) — predicate
  • list(prefix) — async iterator of keys matching a prefix
  • delete(key) — remove
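The five methods above can be sketched as a Python Protocol. This is a sketch, not the real factflow-protocols definition: the async signatures and type hints are assumptions (the page only states that list(prefix) is an async iterator).

```python
from typing import AsyncIterator, Protocol, runtime_checkable


@runtime_checkable
class StorageProtocol(Protocol):
    """Hypothetical sketch of the factflow-protocols storage interface."""

    async def write(self, key: str, data: bytes, metadata: dict) -> None:
        """Persist bytes plus structured metadata."""
        ...

    async def read(self, key: str) -> bytes:
        """Fetch bytes for a key."""
        ...

    async def exists(self, key: str) -> bool:
        """Predicate: is the key present?"""
        ...

    def list(self, prefix: str) -> AsyncIterator[str]:
        """Async iterator of keys matching a prefix."""
        ...

    async def delete(self, key: str) -> None:
        """Remove the object."""
        ...
```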

Providers are URI-selected:

  • file://<path> → FilesystemStorageProvider
  • minio://<bucket> or s3://<bucket> → MinioStorageProvider

factflow_infra.storage.create_storage_provider(settings) dispatches at construction time.
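The dispatch can be sketched as scheme matching on the parsed URI. Note the real create_storage_provider takes a settings object, so the direct-URI signature and the provider constructors here are assumptions:

```python
from urllib.parse import urlparse


# Hypothetical stand-ins for the real provider classes.
class FilesystemStorageProvider:
    def __init__(self, root: str) -> None:
        self.root = root


class MinioStorageProvider:
    def __init__(self, bucket: str) -> None:
        self.bucket = bucket


def create_storage_provider(storage_uri: str):
    """Pick a provider by URI scheme at construction time."""
    parsed = urlparse(storage_uri)
    if parsed.scheme == "file":
        return FilesystemStorageProvider(parsed.path)
    if parsed.scheme in ("minio", "s3"):
        return MinioStorageProvider(parsed.netloc)
    raise ValueError(f"unknown storage scheme: {parsed.scheme!r}")
```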

Every object has a companion .meta.json sidecar file (on the filesystem backend) or native object metadata (on S3/MinIO). The schema is defined in StorageMetadataConfig and captures:

  • Provenance — which execution, which route, which adapter wrote this, when
  • Config hash — the config snapshot’s hash, for reproducibility
  • Lineage reference — the lineage row id
  • Custom fields — per-pipeline extensions (e.g., source URL for webscraper, embedding model for embeddings)

Sidecars are additive. A pipeline can extend the schema with its own fields without breaking consumers; unknown fields are preserved on read.
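A hypothetical sidecar for a webscraper artefact, round-tripped through JSON. Every field name here is illustrative; only the four categories above come from the schema description. The point of the sketch is the additive contract: a consumer that parses the whole document keeps fields it doesn't recognise.

```python
import json

# Illustrative sidecar content; real field names live in StorageMetadataConfig.
sidecar = {
    "execution_id": "exec-01",
    "route_id": "web-ingest",
    "adapter": "webscraper",
    "written_at": "2024-01-01T00:00:00Z",
    "config_hash": "abc123",
    "lineage_id": 42,
    "source_url": "https://example.com/page",  # per-pipeline extension field
}


def read_sidecar(raw: str) -> dict:
    """Parse a .meta.json sidecar; unknown extension fields survive the round trip."""
    return json.loads(raw)


raw = json.dumps(sidecar)
restored = read_sidecar(raw)
```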

factflow storage metadata "executions/ID/ROUTE/STAGE/MSG"

Returns the sidecar JSON.

Every shipped adapter writes under this pattern:

executions/<execution-id>/<route-id>/<stage-name>/<message-id>[.ext]

This is convention, not enforced by the protocol. Deviating from it is possible, but it breaks every downstream consumer (replay, lineage lookups, CLI browsers). Don’t break it without a reason.
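The convention is trivial to encode in one place, which is the easiest way not to break it. A sketch (the function name is hypothetical):

```python
def storage_key(
    execution_id: str,
    route_id: str,
    stage: str,
    message_id: str,
    ext: str = "",
) -> str:
    """Build the conventional key:
    executions/<execution-id>/<route-id>/<stage-name>/<message-id>[.ext]
    """
    suffix = f".{ext}" if ext else ""
    return f"executions/{execution_id}/{route_id}/{stage}/{message_id}{suffix}"
```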

The single load-bearing invariant: storage must contain the artefact if replay wants to use it.

Replay reads executions/SRC/ROUTE/STAGE/* and republishes the contents to target route queues. If an object is missing, replay fails — storage is not a cache.
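A minimal sketch of that read path, with an in-memory stand-in for a provider. A missing key raises instead of returning a fallback, matching the "storage is not a cache" rule; the publish callback stands in for the target route queues:

```python
import asyncio


class InMemoryStorage:
    """Stand-in for a real provider, just enough to show the replay read path."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    async def write(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    async def read(self, key: str) -> bytes:
        # KeyError here is the replay failure mode: storage is not a cache.
        return self._objects[key]

    async def list(self, prefix: str):
        for key in list(self._objects):
            if key.startswith(prefix):
                yield key


async def replay(storage, src_execution: str, route: str, stage: str, publish) -> int:
    """Read executions/SRC/ROUTE/STAGE/* and republish each artefact."""
    prefix = f"executions/{src_execution}/{route}/{stage}/"
    count = 0
    async for key in storage.list(prefix):
        payload = await storage.read(key)  # raises if the object vanished
        await publish(key, payload)
        count += 1
    return count
```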

This shapes retention policy:

  • Don’t auto-expire objects referenced by executions you might replay
  • Do prune executions you’ve written off (cascade: delete storage keys by prefix first, then delete lineage rows, then delete the execution row)
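The prune cascade above can be sketched directly; the order comes from this page, while the DB calls are hypothetical stand-ins for whatever deletes lineage and execution rows:

```python
async def prune_execution(storage, db, execution_id: str) -> None:
    """Cascade delete a written-off execution: storage keys by prefix first,
    then lineage rows, then the execution row."""
    async for key in storage.list(f"executions/{execution_id}/"):
        await storage.delete(key)
    await db.delete_lineage_rows(execution_id)   # hypothetical DB API
    await db.delete_execution_row(execution_id)  # hypothetical DB API
```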

Writes optionally publish StorageEvent messages to a configured event channel. Subscribers:

  • The frontend’s “live execution view” via /api/v1/executions/<id>/storage-stream SSE
  • The CLI’s factflow storage watch --execution <id> stream
  • The webhook dispatcher (for subscribed URLs)
  • Replay (to trigger when an expected object arrives late)

Events are best-effort. A missed event never causes data loss — the artefact is still in storage.
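The best-effort contract amounts to write-then-publish with a swallowed publish failure: the artefact is durable before the event is attempted. A sketch, with a hypothetical event-channel API:

```python
import logging

log = logging.getLogger("factflow.storage")


async def write_with_event(storage, events, key: str, data: bytes, metadata: dict) -> None:
    """Persist first, then publish a StorageEvent best-effort.

    A failed publish is logged and dropped: subscribers may miss the event,
    but the artefact is already in storage.
    """
    await storage.write(key, data, metadata)
    try:
        await events.publish({"kind": "StorageEvent", "key": key})  # hypothetical channel API
    except Exception:
        log.warning("storage event dropped for %s", key)
```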

Performance characteristics of the two providers:

  • Filesystem is fast for writes, slow for list(prefix) when prefixes have many objects. Fine for dev, acceptable for single-node prod.
  • MinIO / S3 has constant-time listing but higher per-request latency. Use for multi-node or when you need external durability guarantees.
  • Neither is optimised for small-object random reads. Pipelines that need frequent random access should materialise into Postgres (for structured data) or a vector DB (for embeddings) rather than re-reading storage.
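The materialise-once pattern in the last bullet can be sketched as a single pass from storage into a relational table, after which all random access hits the database. sqlite3 stands in for Postgres here, and the table shape is illustrative:

```python
import sqlite3


def materialise(rows, db_path: str = ":memory:") -> sqlite3.Connection:
    """Load (key, body) artefacts into a table once, instead of re-reading
    storage on every access. rows: iterable of (storage_key, text_body)."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS segments (key TEXT PRIMARY KEY, body TEXT)")
    conn.executemany("INSERT OR REPLACE INTO segments VALUES (?, ?)", rows)
    conn.commit()
    return conn
```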