Overview
Factflow is a YAML-configured pipeline orchestration platform for AI data processing. You declare pipelines as YAML, the server turns them into running Python adapter chains, and messages flow between stages through a message broker while the lineage service records every hop.
This page is the big picture. Each subsystem has its own dedicated Guide with the “how it works” and “how to operate it” in one place.
The mental model
Three moving parts, one invariant:
```mermaid
flowchart LR
    Y[YAML config] --> O[Orchestrator]
    O --> R1[Route processor]
    O --> R2[Route processor]
    R1 -->|messages on queue| R2
    R1 --> S[(Storage)]
    R1 --> L[(Lineage)]
    R2 --> S
    R2 --> L
```
- Declarative. A pipeline is a YAML file. Operators iterate on configs without a deploy.
- Queue-scoped per execution. Multiple executions share one broker without cross-talk.
- Commit-independent lineage. Lineage records every message without ever blocking the main flow.
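The declarative side can be pictured with a toy config. This is a sketch only: the field names, adapter ids, and queue names below are assumptions for illustration, not Factflow's real schema.

```yaml
# Hypothetical pipeline config: two routes chained via a queue.
id: docs-ingest
routes:
  - name: scrape
    adapter: webscraper.fetch   # adapter id is illustrative
    out: scrape.out             # queue the next route subscribes to
  - name: segment
    adapter: markdown.segment
    in: scrape.out
    out: segment.out
```

Editing a file like this and re-running it is the whole iteration loop: no deploy, no code change.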
Four things Factflow is not, to set expectations:
- Not a generic task queue (Celery, RQ) — pipelines are graphs, not one-off tasks.
- Not a workflow engine like Airflow — the unit of work is a message through an adapter, not a DAG tick.
- Not a pile of LLM scripts — LLM calls are abstracted, rate-limited, and routed by config.
- Not schedule-driven — executions are reactive, not polled.
Package tiers
Factflow is a uv workspace of 16 Python packages organised into three tiers. Packages import only from the same tier or below.
```mermaid
flowchart TB
    subgraph Workflows["WORKFLOWS (8) — pipeline adapters"]
        WS[factflow-webscraper]
        MD[factflow-markdown]
        EM[factflow-embeddings]
        BO[factflow-boost]
        TR[factflow-translator]
        KN[factflow-knowledge]
        SP[factflow-sharepoint]
        RP[factflow-replay]
    end
    subgraph Shared["SHARED SERVICES (7)"]
        SRV[factflow-server]
        EX[factflow-execution]
        EN[factflow-engine]
        INF[factflow-infra]
        LLM[factflow-llm]
        LIN[factflow-lineage]
        FN[factflow-foundation]
    end
    subgraph Core["CORE PROTOCOLS (1)"]
        PR[factflow-protocols]
    end
    Workflows --> Shared
    Shared --> Core
    style Core fill:#e6f2ff,stroke:#2563eb,color:#111
    style Shared fill:#f0fdf4,stroke:#16a34a,color:#111
    style Workflows fill:#fef3c7,stroke:#d97706,color:#111
```
Core protocols (1)
`factflow-protocols` — zero factflow workspace dependencies. Defines abstract contracts: `PipelineAdapter`, `QueueProtocol`, `StorageProtocol`, `LLMClientProtocol`, `EmbeddingClientProtocol`, plus supporting types. Every other package implements one or more of these.
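Since the contracts are plain Python abstractions, a structural-typing sketch conveys the idea. The method names and signatures below are assumptions inferred from this page, not the actual `factflow-protocols` definitions:

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class PipelineAdapter(Protocol):
    """Domain adapter: one message in, one result out (hypothetical shape)."""

    async def process(self, ctx: Any) -> Any: ...


@runtime_checkable
class QueueProtocol(Protocol):
    """Broker abstraction a provider (Artemis/RabbitMQ/Pulsar) would implement."""

    async def publish(self, queue: str, payload: bytes) -> None: ...
    async def subscribe(self, queue: str) -> Any: ...
```

With `typing.Protocol`, an implementing package never imports the adapter base class explicitly; anything exposing the right methods satisfies the contract.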
Shared services (7)
Section titled “Shared services (7)”Concrete services that implement the protocols and run the system:
- `factflow-foundation` — config framework, paths, observability
- `factflow-lineage` — lineage service and repository
- `factflow-llm` — LLM + embedding client factory with AIMD rate limiting
- `factflow-infra` — queue (Artemis / RabbitMQ / Pulsar) + storage (filesystem / MinIO) providers
- `factflow-engine` — pipeline orchestration core
- `factflow-execution` — execution lifecycle + `ExecutionScopedQueue`
- `factflow-server` — FastAPI app, 91-endpoint HTTP surface, chat, webhooks
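AIMD rate limiting follows the classic additive-increase / multiplicative-decrease pattern from TCP congestion control: grow the request budget linearly while calls succeed, cut it sharply on throttling. A minimal sketch, with constants and method names that are illustrative rather than the package's API:

```python
class AIMDRateLimiter:
    """Additive-increase / multiplicative-decrease request budget (sketch)."""

    def __init__(self, rate: float = 10.0, increase: float = 1.0,
                 decrease: float = 0.5, floor: float = 1.0,
                 ceiling: float = 100.0):
        self.rate = rate          # current allowed requests/sec
        self.increase = increase  # added after each success
        self.decrease = decrease  # multiplied in on a throttle response
        self.floor = floor
        self.ceiling = ceiling

    def on_success(self) -> None:
        """Probe upward gently: linear growth up to the ceiling."""
        self.rate = min(self.ceiling, self.rate + self.increase)

    def on_throttle(self) -> None:
        """Back off hard: halve (by default) down to the floor."""
        self.rate = max(self.floor, self.rate * self.decrease)
```

The asymmetry (slow up, fast down) is what lets many clients converge on a provider's real capacity without coordinating.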
Workflow packages (8)
Each is a self-contained, domain-specific feature package with its own adapters. Workflows never import each other — they compose via queues.
- `factflow-webscraper` — web scraping + adaptive rate strategies (also hosts the `web_crawler` adapter for JS rendering)
- `factflow-markdown` — HTML → markdown → segments, token-aware
- `factflow-embeddings` — multi-model embedding with pgvector
- `factflow-boost` — Boost.AI chatbot export ingestion
- `factflow-translator` — LLM-based translation with markdown preservation
- `factflow-knowledge` — concept detection, consolidation, knowledge diff
- `factflow-sharepoint` — Graph-based ingest + document conversion
- `factflow-replay` — storage replay + recovery workflows
Each workflow has its own Guides section (e.g. Workflow: Webscraper) covering the adapters it provides and how they compose.
Execution at a glance
What actually happens when you run `factflow config run <id>`:
```mermaid
sequenceDiagram
    actor Op as Operator
    participant CLI as factflow CLI
    participant API as factflow-server
    participant Mgr as OrchestratorManager
    participant Orch as PipelineOrchestrator
    participant P as ReactiveRouteProcessor
    participant Q as QueueProtocol (ExecScoped)
    participant A as PipelineAdapter
    participant Store as StorageProtocol
    participant Lin as LineageService
    Op->>CLI: factflow config run ID
    CLI->>API: POST /executions
    API->>Mgr: start(execution)
    Mgr->>Orch: new PipelineOrchestrator
    Orch->>P: start route processors
    P->>Q: subscribe exec:EXEC:route.in
    Orch->>Q: publish init_message
    Q->>P: deliver message
    P->>A: await adapter.process(ctx)
    A->>Store: write artefact
    A->>Lin: record lineage row
    A-->>P: AdapterResult
    P->>Q: publish to next route
    P-->>Q: ACK original message
    Orch->>API: completion signal
    API-->>CLI: SSE status=completed
```
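The heart of this sequence is the route processor's consume → process → publish → ack loop. A deliberately simplified synchronous sketch (the real processor is reactive and async; the function and parameter names here are assumptions):

```python
import queue


def run_route(inbox: "queue.Queue", outbox: "queue.Queue",
              adapter, lineage: list) -> None:
    """Drain one route's inbox, acking only after the adapter returns."""
    while not inbox.empty():
        msg = inbox.get()
        result = adapter(msg)          # stand-in for `await adapter.process(ctx)`
        lineage.append((msg, result))  # lineage records every hop
        outbox.put(result)             # hand off to the next route's queue
        inbox.task_done()              # handler-return ack: only after process()
```

If `adapter` raises, the loop exits without acking, so the broker would redeliver the message — there is no fire-and-forget path.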
Every arrow above lives under a deeper page:
- Queue details (scoping, providers, ack, payloads) → Guides / Queue Protocol
- Orchestrator internals (lifecycle, backpressure, circuit breaker) → Guides / Pipeline Orchestrator Engine
- Storage layout and replay → Guides / Storage Protocol
- Lineage commit-independence → Guides / Lineage Tracking
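Commit-independence means a lineage failure can never fail the pipeline write, and vice versa. A toy sketch of the decoupling, assuming a hypothetical `record_hop` helper (not the real `LineageService` API):

```python
import logging

logger = logging.getLogger("factflow.lineage")


def record_hop(write, row: dict) -> bool:
    """Best-effort lineage write: log and continue instead of propagating.

    `write` stands in for the lineage repository's insert; the caller's
    pipeline commit has already happened (or will happen) independently.
    """
    try:
        write(row)
        return True
    except Exception:
        logger.exception("lineage write failed; pipeline flow continues")
        return False
```

The pipeline never inspects the return value on its hot path; at most, a metric counts dropped lineage rows.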
Design decisions worth knowing
The non-obvious calls that shape day-to-day use:
- Flat package namespace. Every package imports as `factflow_<name>` (no dotted `factflow.X`). Chosen for IDE tooling.
- Protocol / Provider / Adapter naming. Abstract contract / infrastructure implementation / domain adapter. The three-way split keeps each package's blast radius small.
- `ExecutionScopedQueue` over per-execution brokers. One broker, isolated namespaces — operational simplicity at zero correctness cost.
- Config snapshot is immutable per execution. In-flight executions are unaffected by config edits. Replay resolves routes from the parent execution's snapshot, not the global directory.
- Handler-return ack. Adapters return `AdapterResult` or raise; the queue ack happens only after return. No fire-and-forget.
- Commit-independent lineage. Lineage writes are decoupled from pipeline writes. Either can fail without affecting the other.
- Reactive, not polled. No central scheduler. Processors consume-and-ack. Natural backpressure.
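The `ExecutionScopedQueue` decision can be pictured as name-prefixing on a shared broker. A toy sketch, assuming the `exec:<EXEC>:<route>` naming seen in the execution sequence (the class shapes are illustrative, not the `factflow-execution` API):

```python
class InMemoryBroker:
    """Stand-in for the one shared broker (Artemis/RabbitMQ/Pulsar in prod)."""

    def __init__(self):
        self.queues: dict[str, list] = {}

    def publish(self, name: str, msg) -> None:
        self.queues.setdefault(name, []).append(msg)


class ExecutionScopedQueue:
    """Prefixes every queue name with the execution id: no cross-talk."""

    def __init__(self, broker: InMemoryBroker, execution_id: str):
        self.broker = broker
        self.execution_id = execution_id

    def _scoped(self, route: str) -> str:
        return f"exec:{self.execution_id}:{route}"

    def publish(self, route: str, msg) -> None:
        self.broker.publish(self._scoped(route), msg)
```

Two executions publishing to the "same" route land on disjoint physical queues, which is the whole isolation story.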
Where to go next
- Package map — the dependency graph in detail
- Execution flow — detailed sequence of a running pipeline
- Message lifecycle — one message end-to-end through a route
- Multi-execution topology — how isolation actually works
- Extension points — what you can plug in
- Deployment topology — what runs where in prod