# factflow-boost

Boost.AI chatbot export processing — ingestion, deduplication, clustering, and cataloguing of Boost conversation data for downstream knowledge consolidation.
## Tier and role

- Tier: workflow
- Import name: `factflow_boost`
- Source: `backend/packages/workflows/factflow-boost/`

Runs as a batch pipeline against a Boost export folder. Not typically composed with other workflows — it is a closed pipeline producing a structured catalogue.
## Context

Boost.AI exports are folder trees named by chatbot and date. `extract_chatbot_origin()` normalises every naming convention to a canonical `boost:{NAME}` origin tag. The `CHATBOT_ALIASES` dict maps legacy names (e.g. `DNBSERVICEDESK` → `FIX`) to current conventions.
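The normalisation described above can be sketched roughly as follows. This is an illustrative stdlib sketch, not the package's actual implementation: the folder-name format (a `YYYY-MM-DD` date suffix) and the alias contents are assumptions beyond the one documented mapping.

```python
import re

# Assumed contents; the real CHATBOT_ALIASES maps more legacy names.
CHATBOT_ALIASES = {"DNBSERVICEDESK": "FIX"}

def extract_chatbot_origin(folder_name: str) -> str:
    """Map an export folder name such as 'FIX_2024-01-15' to 'boost:FIX'."""
    # Strip a trailing date suffix if present (assumed naming convention).
    name = re.sub(r"[_-]\d{4}-\d{2}-\d{2}$", "", folder_name).upper()
    # Translate legacy chatbot names to the current convention.
    name = CHATBOT_ALIASES.get(name, name)
    return f"boost:{name}"
```

A name without a date suffix passes through the same path: `extract_chatbot_origin("DNBSERVICEDESK")` still resolves to `boost:FIX` via the alias table.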
The pipeline runs through:

```
enumerator → filter → norwegian_filter → deduplicate → clustering → catalog → storage_writer → renderer
```

Each stage is a discrete adapter under `factflow_boost.boost_processor`. Post-processing routines (`factflow_boost.boost_routines`) render the catalogue to shareable formats.
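Conceptually, the stage chain is a fold: each adapter consumes the previous stage's output. A minimal sketch, assuming plain functions (the real stages are engine-registered adapters with their own interfaces, and these names and signatures are illustrative only):

```python
from functools import reduce

def enumerator(export_paths):
    # Yield one raw record per conversation file found in the export folder.
    return [{"id": p, "lang": "no", "text": f"contents of {p}"} for p in export_paths]

def norwegian_filter(records):
    # Keep only records detected as Norwegian (detection stubbed out here).
    return [r for r in records if r.get("lang") == "no"]

def run_pipeline(initial_input, stages):
    # Fold each stage over the output of the previous one.
    return reduce(lambda data, stage: stage(data), stages, initial_input)
```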
## Rationale

- Closed pipeline. Boost ingestion is complete — it doesn’t feed other workflow packages today. That makes it a good sandbox for new adapter patterns before promoting them elsewhere.
- Fuzzy deduplication. `rapidfuzz` + `datasketch` MinHash identify near-duplicate conversations; clustering groups them for downstream triage.
- Language-specific filtering. `norwegian_filter` uses a language detector to scope processing to Norwegian content (the primary Boost use case at DNB).
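The fuzzy-deduplication idea can be illustrated with a plain-Python sketch. The real adapter uses `datasketch` MinHash and `rapidfuzz`; this stdlib version computes exact Jaccard similarity over word shingles instead, which MinHash approximates at scale:

```python
def shingles(text: str, k: int = 3) -> set:
    # k-word shingles; MinHash would compress these sets into short signatures.
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a: set, b: set) -> float:
    # Fraction of shared shingles; 1.0 means identical shingle sets.
    return len(a & b) / len(a | b) if a | b else 0.0

def deduplicate(conversations, threshold=0.8):
    # Keep a conversation only if it is not a near-duplicate of one already kept.
    kept = []
    for text in conversations:
        sig = shingles(text)
        if all(jaccard(sig, shingles(k)) < threshold for k in kept):
            kept.append(text)
    return kept
```

The quadratic scan here is the part MinHash LSH replaces: signatures are bucketed so only likely matches are compared.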
## Public API

The top-level `__init__.py` exports:

```python
from factflow_boost import extract_chatbot_origin, CHATBOT_ALIASES
```

Adapters and internal processors are not re-exported at the package root; they are registered via the engine’s discovery mechanism from their subpackage (`factflow_boost.boost_processor`).
## Subpackages

- `factflow_boost.boost_processor` — the pipeline adapters (`batch_processor`, `catalog`, `clustering`, `deduplicate`, `enumerator`, `filter`, `norwegian_filter`, `storage_writer`, `filtered_collector`, `stored_keys_fanout`, …)
- `factflow_boost.boost_routines` — post-pipeline rendering (`renderer`, `parser`, `models`)
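To make the split concrete, post-pipeline rendering might look like the sketch below. This is purely illustrative: the actual `renderer`, `parser`, and `models` interfaces in `factflow_boost.boost_routines` are not documented here, and the entry fields are assumptions.

```python
def render_catalog(entries) -> str:
    # Render catalogue entries as a shareable markdown summary.
    lines = ["# Boost catalogue", ""]
    for e in entries:
        lines.append(f"- **{e['origin']}**: {e['summary']} ({e['count']} conversations)")
    return "\n".join(lines)
```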
## Dependencies

- Runtime: `datasketch` (MinHash), `rapidfuzz` (string similarity), `scikit-learn` + `scipy` (clustering)
- Workspace: `factflow-protocols`, `factflow-foundation`, `factflow-engine`, `factflow-llm`
- External services: storage provider (for the export folder and the written catalogue)
## Testing

Tests live at `backend/packages/workflows/factflow-boost/tests/`. The test-cli harness includes a Boost scenario (`s3`), run with `scripts/test-cli/run.sh s3`.
## Related

- `factflow-knowledge` — downstream consumer of the cleaned catalogue
- `factflow-markdown` — sometimes used to segment long conversations