Overview
Factflow is a YAML-configured pipeline orchestration platform for AI data processing. You declare pipelines as YAML, the server turns them into running Python adapter chains, and messages flow between stages through a message broker while the lineage service records every hop.
This page is the big picture. Each subsystem has its own dedicated Guide with the “how it works” and “how to operate it” in one place.
The mental model
Three moving parts, one invariant:
```mermaid
flowchart LR
    Y[YAML config] --> O[Orchestrator]
    O --> R1[Route processor]
    O --> R2[Route processor]
    R1 -->|messages on queue| R2
    R1 --> S[(Storage)]
    R1 --> L[(Lineage)]
    R2 --> S
    R2 --> L
```
- Declarative. A pipeline is a YAML file. Operators iterate on configs without a deploy.
- Queue-scoped per execution. Multiple executions share one broker without cross-talk.
- Commit-independent lineage. Lineage records every message without ever blocking the main flow.
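The declarative side can be pictured with a toy config. This is a sketch only: the field names, adapter ids, and queue names below are assumptions for illustration, not Factflow's real schema.

```yaml
# Hypothetical pipeline config: two routes chained via a queue.
id: docs-ingest
routes:
  - name: scrape
    adapter: webscraper.fetch   # adapter id is illustrative
    out: scrape.out             # queue the next route subscribes to
  - name: segment
    adapter: markdown.segment
    in: scrape.out
    out: segment.out
```

Editing a file like this and re-running it is the whole iteration loop: no deploy, no code change.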
Four things Factflow is not, to set expectations:
- Not a generic task queue (Celery, RQ) — pipelines are graphs, not one-off tasks.
- Not a workflow engine like Airflow — the unit of work is a message through an adapter, not a DAG tick.
- Not a pile of LLM scripts — LLM calls are abstracted, rate-limited, and routed by config.
- Not schedule-driven — executions are reactive, not polled.
Package tiers
Factflow is a uv workspace of 16 Python packages organised into three tiers. Packages import only from the same tier or below.
```mermaid
flowchart TB
    subgraph Workflows["WORKFLOWS (8) — pipeline adapters"]
        WS[factflow-webscraper]
        MD[factflow-markdown]
        EM[factflow-embeddings]
        BO[factflow-boost]
        TR[factflow-translator]
        KN[factflow-knowledge]
        SP[factflow-sharepoint]
        RP[factflow-replay]
    end
    subgraph Shared["SHARED SERVICES (7)"]
        SRV[factflow-server]
        EX[factflow-execution]
        EN[factflow-engine]
        INF[factflow-infra]
        LLM[factflow-llm]
        LIN[factflow-lineage]
        FN[factflow-foundation]
    end
    subgraph Core["CORE PROTOCOLS (1)"]
        PR[factflow-protocols]
    end
    Workflows --> Shared
    Shared --> Core
    style Core fill:#e6f2ff,stroke:#2563eb,color:#111
    style Shared fill:#f0fdf4,stroke:#16a34a,color:#111
    style Workflows fill:#fef3c7,stroke:#d97706,color:#111
```
Core protocols (1)
`factflow-protocols` — zero factflow workspace dependencies. Defines abstract contracts: `PipelineAdapter`, `QueueProtocol`, `StorageProtocol`, `LLMClientProtocol`, `EmbeddingClientProtocol`, plus supporting types. Every other package implements one or more of these.
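Since the contracts are plain Python abstractions, a structural-typing sketch conveys the idea. The method names and signatures below are assumptions inferred from this page, not the actual `factflow-protocols` definitions:

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class PipelineAdapter(Protocol):
    """Domain adapter: one message in, one result out (hypothetical shape)."""

    async def process(self, ctx: Any) -> Any: ...


@runtime_checkable
class QueueProtocol(Protocol):
    """Broker abstraction a provider (Artemis/RabbitMQ/Pulsar) would implement."""

    async def publish(self, queue: str, payload: bytes) -> None: ...
    async def subscribe(self, queue: str) -> Any: ...
```

With `typing.Protocol`, an implementing package never imports the adapter base class explicitly; anything exposing the right methods satisfies the contract.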
Shared services (7)
Section titled “Shared services (7)”Concrete services that implement the protocols and run the system:
- `factflow-foundation` — config framework, paths, observability
- `factflow-lineage` — lineage service and repository
- `factflow-llm` — LLM + embedding client factory with AIMD rate limiting
- `factflow-infra` — queue (Artemis / RabbitMQ / Pulsar) + storage (filesystem / MinIO) providers
- `factflow-engine` — pipeline orchestration core
- `factflow-execution` — execution lifecycle + `ExecutionScopedQueue`
- `factflow-server` — FastAPI app, 91-endpoint HTTP surface, chat, webhooks
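AIMD rate limiting follows the classic additive-increase / multiplicative-decrease pattern from TCP congestion control: grow the request budget linearly while calls succeed, cut it sharply on throttling. A minimal sketch, with constants and method names that are illustrative rather than the package's API:

```python
class AIMDRateLimiter:
    """Additive-increase / multiplicative-decrease request budget (sketch)."""

    def __init__(self, rate: float = 10.0, increase: float = 1.0,
                 decrease: float = 0.5, floor: float = 1.0,
                 ceiling: float = 100.0):
        self.rate = rate          # current allowed requests/sec
        self.increase = increase  # added after each success
        self.decrease = decrease  # multiplied in on a throttle response
        self.floor = floor
        self.ceiling = ceiling

    def on_success(self) -> None:
        """Probe upward gently: linear growth up to the ceiling."""
        self.rate = min(self.ceiling, self.rate + self.increase)

    def on_throttle(self) -> None:
        """Back off hard: halve (by default) down to the floor."""
        self.rate = max(self.floor, self.rate * self.decrease)
```

The asymmetry (slow up, fast down) is what lets many clients converge on a provider's real capacity without coordinating.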
Workflow packages (8)
Each is a self-contained, domain-specific feature package with its own adapters. Workflows never import each other — they compose via queues.
- `factflow-webscraper` — web scraping + adaptive rate strategies (also hosts the `web_crawler` adapter for JS rendering)
- `factflow-markdown` — HTML → markdown → segments, token-aware
- `factflow-embeddings` — multi-model embedding with pgvector
- `factflow-boost` — Boost.AI chatbot export ingestion
- `factflow-translator` — LLM-based translation with markdown preservation
- `factflow-knowledge` — concept detection, consolidation, knowledge diff
- `factflow-sharepoint` — Graph-based ingest + document conversion
- `factflow-replay` — storage replay + recovery workflows
Each workflow has its own Guides section (e.g. Workflow: Webscraper) covering the adapters it provides and how they compose.
Execution at a glance
What actually happens when you run `factflow config run <id>`:
```mermaid
sequenceDiagram
    actor Op as Operator
    participant CLI as factflow CLI
    participant API as factflow-server
    participant Mgr as OrchestratorManager
    participant Orch as PipelineOrchestrator
    participant P as ReactiveRouteProcessor
    participant Q as QueueProtocol (ExecScoped)
    participant A as PipelineAdapter
    participant Store as StorageProtocol
    participant Lin as LineageService
    Op->>CLI: factflow config run ID
    CLI->>API: POST /executions
    API->>Mgr: start(execution)
    Mgr->>Orch: new PipelineOrchestrator
    Orch->>P: start route processors
    P->>Q: subscribe exec:EXEC:route.in
    Orch->>Q: publish init_message
    Q->>P: deliver message
    P->>A: await adapter.process(ctx)
    A->>Store: write artefact
    A->>Lin: record lineage row
    A-->>P: AdapterResult
    P->>Q: publish to next route
    P-->>Q: ACK original message
    Orch->>API: completion signal
    API-->>CLI: SSE status=completed
```
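The heart of this sequence is the route processor's consume → process → publish → ack loop. A deliberately simplified synchronous sketch (the real processor is reactive and async; the function and parameter names here are assumptions):

```python
import queue


def run_route(inbox: "queue.Queue", outbox: "queue.Queue",
              adapter, lineage: list) -> None:
    """Drain one route's inbox, acking only after the adapter returns."""
    while not inbox.empty():
        msg = inbox.get()
        result = adapter(msg)          # stand-in for `await adapter.process(ctx)`
        lineage.append((msg, result))  # lineage records every hop
        outbox.put(result)             # hand off to the next route's queue
        inbox.task_done()              # handler-return ack: only after process()
```

If `adapter` raises, the loop exits without acking, so the broker would redeliver the message — there is no fire-and-forget path.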
Every arrow above lives under a deeper page:
- Queue details (scoping, providers, ack, payloads) → Guides / Queue Protocol
- Orchestrator internals (lifecycle, backpressure, circuit breaker) → Guides / Pipeline Orchestrator Engine
- Storage layout and replay → Guides / Storage Protocol
- Lineage commit-independence → Guides / Lineage Tracking
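Commit-independence means a lineage failure can never fail the pipeline write, and vice versa. A toy sketch of the decoupling, assuming a hypothetical `record_hop` helper (not the real `LineageService` API):

```python
import logging

logger = logging.getLogger("factflow.lineage")


def record_hop(write, row: dict) -> bool:
    """Best-effort lineage write: log and continue instead of propagating.

    `write` stands in for the lineage repository's insert; the caller's
    pipeline commit has already happened (or will happen) independently.
    """
    try:
        write(row)
        return True
    except Exception:
        logger.exception("lineage write failed; pipeline flow continues")
        return False
```

The pipeline never inspects the return value on its hot path; at most, a metric counts dropped lineage rows.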
Design decisions worth knowing
The non-obvious calls that shape day-to-day use:
- Flat package namespace. Every package imports as `factflow_<name>` (no dotted `factflow.X`). Chosen for IDE tooling.
- Protocol / Provider / Adapter naming. Abstract contract / infrastructure implementation / domain adapter. The three-way split keeps each package's blast radius small.
- `ExecutionScopedQueue` over per-execution brokers. One broker, isolated namespaces — operational simplicity at zero correctness cost.
- Config snapshot is immutable per execution. In-flight executions are unaffected by config edits. Replay resolves routes from the parent execution's snapshot, not the global directory.
- Handler-return ack. Adapters return `AdapterResult` or raise; the queue ack happens only after return. No fire-and-forget.
- Commit-independent lineage. Lineage writes are decoupled from pipeline writes. Either can fail without affecting the other.
- Reactive, not polled. No central scheduler. Processors consume-and-ack. Natural backpressure.
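The `ExecutionScopedQueue` decision can be pictured as name-prefixing on a shared broker. A toy sketch, assuming the `exec:<EXEC>:<route>` naming seen in the execution sequence (the class shapes are illustrative, not the `factflow-execution` API):

```python
class InMemoryBroker:
    """Stand-in for the one shared broker (Artemis/RabbitMQ/Pulsar in prod)."""

    def __init__(self):
        self.queues: dict[str, list] = {}

    def publish(self, name: str, msg) -> None:
        self.queues.setdefault(name, []).append(msg)


class ExecutionScopedQueue:
    """Prefixes every queue name with the execution id: no cross-talk."""

    def __init__(self, broker: InMemoryBroker, execution_id: str):
        self.broker = broker
        self.execution_id = execution_id

    def _scoped(self, route: str) -> str:
        return f"exec:{self.execution_id}:{route}"

    def publish(self, route: str, msg) -> None:
        self.broker.publish(self._scoped(route), msg)
```

Two executions publishing to the "same" route land on disjoint physical queues, which is the whole isolation story.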
Where to go next
- Package map — the dependency graph in detail
- Execution flow — detailed sequence of a running pipeline
- Message lifecycle — one message end-to-end through a route
- Multi-execution topology — how isolation actually works
- Extension points — what you can plug in
- Deployment topology — what runs where in prod