Storage model
Factflow stores every meaningful artefact — scraped pages, converted markdown, segments, embeddings, knowledge maps — in a pluggable storage backend. Two providers ship: filesystem (for dev + single-node prod) and MinIO / S3 (for multi-node prod).
The protocol
Every storage backend implements StorageProtocol from factflow-protocols:
- write(key, data, metadata) — persist bytes plus structured metadata
- read(key) — fetch bytes
- exists(key) — predicate
- list(prefix) — async iterator of keys matching a prefix
- delete(key) — remove
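The five methods above can be sketched as a Python Protocol. This is a reading of the contract, not the actual factflow-protocols source; the parameter and return types are assumptions.

```python
# Sketch of the StorageProtocol surface. Method names match the docs;
# exact signatures and async-ness are assumptions, not the shipped contract.
from typing import AsyncIterator, Protocol


class StorageProtocol(Protocol):
    async def write(self, key: str, data: bytes, metadata: dict) -> None:
        """Persist bytes plus structured metadata."""
        ...

    async def read(self, key: str) -> bytes:
        """Fetch bytes."""
        ...

    async def exists(self, key: str) -> bool:
        """Predicate: does the key exist?"""
        ...

    def list(self, prefix: str) -> AsyncIterator[str]:
        """Async iterator of keys matching a prefix."""
        ...

    async def delete(self, key: str) -> None:
        """Remove the object."""
        ...
```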
Providers are URI-selected:
- file://<path> → FilesystemStorageProvider
- minio://<bucket> or s3://<bucket> → MinioStorageProvider
factflow_infra.storage.create_storage_provider(settings) dispatches at construction time.
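The dispatch logic reduces to a scheme switch. A minimal sketch, assuming standard URI parsing; the helper name and the string return values are illustrative, the real factory is factflow_infra.storage.create_storage_provider:

```python
# Hypothetical sketch of URI-based provider selection. Returns the provider
# class name as a string so the example stays self-contained; the real
# factory constructs the provider from settings.
from urllib.parse import urlparse


def select_provider_class(storage_uri: str) -> str:
    scheme = urlparse(storage_uri).scheme
    if scheme == "file":
        return "FilesystemStorageProvider"
    if scheme in ("minio", "s3"):
        return "MinioStorageProvider"
    raise ValueError(f"unsupported storage scheme: {scheme!r}")
```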
Sidecar metadata
Every object has a companion .meta.json file (for filesystem) or object metadata (for S3/MinIO). The schema is defined in StorageMetadataConfig. It captures:
- Provenance — which execution, which route, which adapter wrote this, when
- Config hash — the config snapshot’s hash, for reproducibility
- Lineage reference — the lineage row id
- Custom fields — per-pipeline extensions (e.g., source URL for webscraper, embedding model for embeddings)
Sidecars are additive. A pipeline can extend the schema with its own fields without breaking consumers; unknown fields are preserved on read.
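The additive property matters for round-tripping. A toy demonstration, with invented field names that stand in for the StorageMetadataConfig schema plus one pipeline extension:

```python
# Why additive sidecars are safe: a consumer that parses only the fields it
# knows, but serialises the whole dict back, preserves extensions it has
# never heard of. Field names here are illustrative, not the real schema.
import json

sidecar = json.loads("""{
  "execution_id": "exec-123",
  "config_hash": "sha256:abc",
  "lineage_id": 42,
  "source_url": "https://example.com/page"
}""")

# This consumer only cares about provenance...
execution = sidecar["execution_id"]

# ...but writes back the full dict, so the pipeline-specific
# "source_url" extension survives the round trip untouched.
round_tripped = json.loads(json.dumps(sidecar))
```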
factflow storage metadata "executions/ID/ROUTE/STAGE/MSG" returns the sidecar JSON.
Key convention
Every shipped adapter writes under this pattern:
executions/<execution-id>/<route-id>/<stage-name>/<message-id>[.ext]

This is a convention, not something the protocol enforces. You can break it, but every downstream consumer (replay, lineage lookups, CLI browsers) depends on it. Don’t break it without a reason.
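The convention is simple enough to capture in one helper. A sketch, with a hypothetical function name; the pattern is the documented one:

```python
# Build a storage key following the shipped-adapter convention:
# executions/<execution-id>/<route-id>/<stage-name>/<message-id>[.ext]
# The helper name is hypothetical; the key layout is the documented pattern.
def artefact_key(execution_id: str, route_id: str, stage: str,
                 message_id: str, ext: str = "") -> str:
    suffix = f".{ext}" if ext else ""
    return f"executions/{execution_id}/{route_id}/{stage}/{message_id}{suffix}"
```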
Storage is the replay source of truth
The single load-bearing invariant: storage must contain the artefact if replay wants to use it.
Replay reads executions/SRC/ROUTE/STAGE/* and republishes the contents to target route queues. If an object is missing, replay fails — storage is not a cache.
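The read-and-republish loop implied here can be sketched as follows. Every call is a hypothetical stand-in for the real replay machinery; the point is that read() failures propagate rather than degrade:

```python
# Hedged sketch of a replay pass over one stage prefix. `storage` follows the
# StorageProtocol surface; `publish` stands in for queue publication. Both are
# assumptions, not the factflow API.
async def replay_stage(storage, publish, src_execution: str,
                       route: str, stage: str) -> None:
    prefix = f"executions/{src_execution}/{route}/{stage}/"
    async for key in storage.list(prefix):
        # Storage is the source of truth: a missing object is a hard
        # failure, not a cache miss, so read() errors are left to propagate.
        data = await storage.read(key)
        await publish(route, data)
```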
This shapes retention policy:
- Don’t auto-expire objects referenced by executions you might replay
- Do prune executions you’ve written off (cascade: delete storage keys by prefix first, then delete lineage rows, then delete the execution row)
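The prune cascade above, sketched as ordered steps. Every call here is a hypothetical stand-in for the real storage and database operations; only the ordering (storage keys, then lineage rows, then the execution row) is taken from the text:

```python
# Prune a written-off execution in the documented cascade order.
# `storage` follows the StorageProtocol surface; the `db` methods are
# invented placeholders for the real lineage/execution deletes.
async def prune_execution(storage, db, execution_id: str) -> None:
    # 1. Delete storage keys by prefix first.
    async for key in storage.list(f"executions/{execution_id}/"):
        await storage.delete(key)
    # 2. Then delete the lineage rows.
    await db.delete_lineage_rows(execution_id)
    # 3. Then delete the execution row itself.
    await db.delete_execution(execution_id)
```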
Storage events
Writes optionally publish StorageEvent messages to a configured event channel. Subscribers:
- The frontend’s “live execution view”, via the /api/v1/executions/<id>/storage-stream SSE endpoint
- The CLI’s factflow storage watch --execution <id> stream
- The webhook dispatcher (for subscribed URLs)
- Replay (to trigger when an expected object arrives late)
Events are best-effort. A missed event never causes data loss — the artefact is still in storage.
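“Best-effort” translates to a concrete coding rule: a failed publish is logged and swallowed, never propagated into the write path. A sketch under that assumption; the StorageEvent fields and the channel API are invented for illustration:

```python
# Best-effort event publication: publish failures never fail the write,
# because the artefact is already durable in storage. Event fields and the
# channel interface are assumptions, not the real factflow types.
import logging
from dataclasses import dataclass

logger = logging.getLogger("factflow.storage")


@dataclass
class StorageEvent:
    key: str
    execution_id: str


async def publish_event(channel, event: StorageEvent) -> None:
    try:
        await channel.publish(event)
    except Exception:
        # A missed event is a logging matter, not a data-loss event:
        # subscribers can always re-read the artefact from storage.
        logger.warning("storage event dropped for %s", event.key)
```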
Performance notes
- Filesystem is fast for writes, slow for list(prefix) when a prefix holds many objects. Fine for dev, acceptable for single-node prod.
- MinIO / S3 has constant-time listing but higher per-request latency. Use it for multi-node deployments or when you need external durability guarantees.
- Neither is optimised for small-object random reads. Pipelines that need frequent random access should materialise into Postgres (for structured data) or a vector DB (for embeddings) rather than re-reading storage.
Related
- factflow-infra reference — provider implementations
- factflow-protocols reference — the contracts
- Storage guide — CLI commands and day-to-day ops
- Replay guide — how storage powers replay