# factflow-boost

Boost.AI chatbot export processing — ingestion, deduplication, clustering, and cataloguing of Boost conversation data for downstream knowledge consolidation.
## Tier and role

- Tier: workflow
- Import name: `factflow_boost`
- Source: `backend/packages/workflows/factflow-boost/`

Runs as a batch pipeline against a Boost export folder. Not typically composed with other workflows — it is a closed pipeline producing a structured catalogue.
## Context

Boost.AI exports are folder trees named by chatbot and date. `extract_chatbot_origin()` normalises every naming convention to a canonical `boost:{NAME}` origin tag. The `CHATBOT_ALIASES` dict maps legacy names (e.g. `DNBSERVICEDESK` → `FIX`) to current conventions.
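The normalisation described above can be sketched roughly as follows. This is an illustrative stdlib sketch, not the package's actual implementation: the folder-name format (a `YYYY-MM-DD` date suffix) and the alias contents are assumptions beyond the one documented mapping.

```python
import re

# Assumed contents; the real CHATBOT_ALIASES maps more legacy names.
CHATBOT_ALIASES = {"DNBSERVICEDESK": "FIX"}

def extract_chatbot_origin(folder_name: str) -> str:
    """Map an export folder name such as 'FIX_2024-01-15' to 'boost:FIX'."""
    # Strip a trailing date suffix if present (assumed naming convention).
    name = re.sub(r"[_-]\d{4}-\d{2}-\d{2}$", "", folder_name).upper()
    # Translate legacy chatbot names to the current convention.
    name = CHATBOT_ALIASES.get(name, name)
    return f"boost:{name}"
```

A name without a date suffix passes through the same path: `extract_chatbot_origin("DNBSERVICEDESK")` still resolves to `boost:FIX` via the alias table.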
The pipeline runs through:

```
enumerator → filter → norwegian_filter → deduplicate → clustering → catalog → storage_writer → renderer
```

Each stage is a discrete adapter under `factflow_boost.boost_processor`. Post-processing routines (`factflow_boost.boost_routines`) render the catalogue to shareable formats.
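Conceptually, the stage chain is a fold: each adapter consumes the previous stage's output. A minimal sketch, assuming plain functions (the real stages are engine-registered adapters with their own interfaces, and these names and signatures are illustrative only):

```python
from functools import reduce

def enumerator(export_paths):
    # Yield one raw record per conversation file found in the export folder.
    return [{"id": p, "lang": "no", "text": f"contents of {p}"} for p in export_paths]

def norwegian_filter(records):
    # Keep only records detected as Norwegian (detection stubbed out here).
    return [r for r in records if r.get("lang") == "no"]

def run_pipeline(initial_input, stages):
    # Fold each stage over the output of the previous one.
    return reduce(lambda data, stage: stage(data), stages, initial_input)
```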
## Rationale

- Closed pipeline. Boost ingestion is complete — it doesn’t feed other workflow packages today. That makes it a good sandbox for new adapter patterns before promoting them elsewhere.
- Fuzzy deduplication. `rapidfuzz` + `datasketch` MinHash identify near-duplicate conversations; clustering groups them for downstream triage.
- Language-specific filtering. `norwegian_filter` uses a language detector to scope processing to Norwegian content (the primary Boost use case at DNB).
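The fuzzy-deduplication idea can be illustrated with a plain-Python sketch. The real adapter uses `datasketch` MinHash and `rapidfuzz`; this stdlib version computes exact Jaccard similarity over word shingles instead, which MinHash approximates at scale:

```python
def shingles(text: str, k: int = 3) -> set:
    # k-word shingles; MinHash would compress these sets into short signatures.
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a: set, b: set) -> float:
    # Fraction of shared shingles; 1.0 means identical shingle sets.
    return len(a & b) / len(a | b) if a | b else 0.0

def deduplicate(conversations, threshold=0.8):
    # Keep a conversation only if it is not a near-duplicate of one already kept.
    kept = []
    for text in conversations:
        sig = shingles(text)
        if all(jaccard(sig, shingles(k)) < threshold for k in kept):
            kept.append(text)
    return kept
```

The quadratic scan here is the part MinHash LSH replaces: signatures are bucketed so only likely matches are compared.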
## Public API

The top-level `__init__.py` exports:

```python
from factflow_boost import extract_chatbot_origin, CHATBOT_ALIASES
```

Adapters and internal processors are not re-exported at the package root; they are registered via the engine’s discovery mechanism from their subpackage (`factflow_boost.boost_processor`).
## Subpackages

- `factflow_boost.boost_processor` — the pipeline adapters (`batch_processor`, `catalog`, `clustering`, `deduplicate`, `enumerator`, `filter`, `norwegian_filter`, `storage_writer`, `filtered_collector`, `stored_keys_fanout`, …)
- `factflow_boost.boost_routines` — post-pipeline rendering (`renderer`, `parser`, `models`)
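To make the split concrete, post-pipeline rendering might look like the sketch below. This is purely illustrative: the actual `renderer`, `parser`, and `models` interfaces in `factflow_boost.boost_routines` are not documented here, and the entry fields are assumptions.

```python
def render_catalog(entries) -> str:
    # Render catalogue entries as a shareable markdown summary.
    lines = ["# Boost catalogue", ""]
    for e in entries:
        lines.append(f"- **{e['origin']}**: {e['summary']} ({e['count']} conversations)")
    return "\n".join(lines)
```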
## Dependencies

- Runtime: `datasketch` (MinHash), `rapidfuzz` (string similarity), `scikit-learn` + `scipy` (clustering)
- Workspace: `factflow-protocols`, `factflow-foundation`, `factflow-engine`, `factflow-llm`
- External services: storage provider (for the export folder and the written catalogue)
## Testing

Tests live at `backend/packages/workflows/factflow-boost/tests/`. The test-cli harness includes a Boost scenario (`s3`), run with `scripts/test-cli/run.sh s3`.
## Related

- `factflow-knowledge` — downstream consumer of the cleaned catalogue
- `factflow-markdown` — sometimes used to segment long conversations