
factflow-markdown

Markdown processing pipeline: HTML → Markdown conversion, token-aware segmentation, and storage I/O for the segmented chunks.

Users wire these adapters in pipeline YAML. Direct imports are rare; MarkdownSegmenter and TokenCounter are occasionally pulled into ad-hoc scripts for token budgeting.

Five-adapter chain covering the “give me vector-ready text from an HTML page” path:

  1. storage_retriever — reads an HTML payload from storage by lineage reference
  2. html_to_markdown — conversion via a GitHub-flavoured markdown parser
  3. smart_segmenter — token-aware splits using tiktoken so each segment fits an LLM context budget
  4. segment_publisher — fan-out: one queue message per segment
  5. markdown_storage_writer — persist the canonical markdown + each segment
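As a rough orientation, the five steps can be sketched as plain functions. Everything below is a stand-in: the function names, the toy regex "conversion", and the whitespace "tokens" are illustrative assumptions, not the adapters' real APIs.

```python
# Illustrative sketch only: function names, the regex "conversion", and
# whitespace "tokens" are stand-ins, not the adapters' real APIs.
import re

def retrieve_html(lineage_ref: str) -> str:
    """Step 1 (storage_retriever): read the HTML payload by lineage reference."""
    return "<h1>Title</h1><p>Some body text.</p>"  # stand-in payload

def html_to_markdown(html: str) -> str:
    """Step 2 (html_to_markdown): toy stand-in for the GFM conversion."""
    md = re.sub(r"<h1>(.*?)</h1>", r"# \1\n", html)
    md = re.sub(r"<p>(.*?)</p>", r"\1", md)
    return md.strip()

def segment(markdown: str, budget: int = 8) -> list[str]:
    """Step 3 (smart_segmenter): split so each segment fits a token budget.
    Whitespace words stand in for tiktoken tokens here."""
    words = markdown.split()
    return [" ".join(words[i:i + budget]) for i in range(0, len(words), budget)]

def publish(segments: list[str]) -> list[dict]:
    """Step 4 (segment_publisher): fan-out, one queue message per segment."""
    return [{"segment_index": i, "body": s} for i, s in enumerate(segments)]

markdown = html_to_markdown(retrieve_html("lineage://example"))
messages = publish(segment(markdown))
# Step 5 (markdown_storage_writer) would persist `markdown` plus each segment.
```

In a real pipeline these steps are configured in YAML and executed by the engine, not called directly.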
```python
from factflow_markdown import (
    StorageRetrieverAdapter,      # type: storage_retriever
    HtmlToMarkdownAdapter,        # type: html_to_markdown
    SmartSegmenterAdapter,        # type: smart_segmenter
    SegmentPublisherAdapter,      # type: segment_publisher
    MarkdownStorageWriterAdapter, # type: markdown_storage_writer
    SegmentStorageAdapter,        # segment-specific storage helper
    register_markdown_adapters,   # bulk-register all of the above
)
```
```python
from factflow_markdown import (
    MarkdownStorageConfig,
    MergeStrategy,  # enum for merging adjacent segments
)
```
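MergeStrategy governs how adjacent segments are combined, but its actual members aren't documented here. A plausible sketch, where both the member names and the merge logic are assumptions:

```python
from enum import Enum

class MergeStrategy(Enum):
    # Member names are guesses for illustration only.
    NONE = "none"          # keep segments as produced
    ADJACENT = "adjacent"  # greedily merge neighbours under the budget

def merge_adjacent(segments: list[str], budget: int,
                   strategy: MergeStrategy = MergeStrategy.ADJACENT) -> list[str]:
    """Greedily merge neighbouring segments while the combined word count
    (a stand-in for a real tiktoken count) stays within `budget`."""
    if strategy is MergeStrategy.NONE or not segments:
        return list(segments)
    merged = [segments[0]]
    for seg in segments[1:]:
        candidate = merged[-1] + " " + seg
        if len(candidate.split()) <= budget:
            merged[-1] = candidate
        else:
            merged.append(seg)
    return merged
```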
```python
from factflow_markdown import (
    MarkdownSegmenter,    # standalone segmenter usable outside a pipeline
    TextSegment,
    TokenCounter,         # tiktoken-backed token counter
    GitHubMarkdownParser,
    Node,
    NodeType,
)
```
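For the ad-hoc token-budgeting scripts mentioned above, the idea looks roughly like this. The real TokenCounter is tiktoken-backed; the whitespace counter below is a stdlib stand-in, and `segment_markdown` is a hypothetical helper, not MarkdownSegmenter's actual API:

```python
class WhitespaceTokenCounter:
    """Stand-in for the tiktoken-backed TokenCounter (real API not shown).
    With tiktoken you would use len(enc.encode(text)) instead."""
    def count(self, text: str) -> int:
        return len(text.split())

def segment_markdown(markdown: str, budget: int,
                     counter: WhitespaceTokenCounter) -> list[str]:
    """Pack whole paragraphs into segments that stay within the token
    budget; a single over-budget paragraph becomes its own segment."""
    segments: list[str] = []
    current = ""
    for para in filter(None, (p.strip() for p in markdown.split("\n\n"))):
        candidate = f"{current}\n\n{para}" if current else para
        if current and counter.count(candidate) > budget:
            segments.append(current)
            current = para
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments
```

Packing on paragraph boundaries (rather than mid-sentence) keeps each segment self-contained for downstream embedding, which is presumably why the segmenter is "smart" rather than a fixed-width splitter.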
  • Runtime: no additional direct deps (tiktoken arrives transitively via factflow-engine)
  • Workspace: factflow-protocols, factflow-foundation, factflow-engine
  • External services: storage provider

Tests at backend/packages/workflows/factflow-markdown/tests/. Token counting is deterministic — unit tests assert exact token counts for canonical inputs.