
factflow-markdown

Markdown processing pipeline: HTML → Markdown conversion, token-aware segmentation, and storage I/O for the segmented chunks.

Users wire these adapters in pipeline YAML. Direct imports are rare; MarkdownSegmenter and TokenCounter are occasionally pulled into ad-hoc scripts for token budgeting.

Five-adapter chain covering the “give me vector-ready text from an HTML page” path:

  1. storage_retriever — reads an HTML payload from storage by lineage reference
  2. html_to_markdown — conversion via a GitHub-flavoured markdown parser
  3. smart_segmenter — token-aware splits using tiktoken so each segment fits an LLM context budget
  4. segment_publisher — fan-out: one queue message per segment
  5. markdown_storage_writer — persist the canonical markdown + each segment
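As a rough orientation, the five steps can be sketched as plain functions. Everything below is a stand-in: the function names, the toy regex "conversion", and the whitespace "tokens" are illustrative assumptions, not the adapters' real APIs.

```python
# Illustrative sketch only: function names, the regex "conversion", and
# whitespace "tokens" are stand-ins, not the adapters' real APIs.
import re

def retrieve_html(lineage_ref: str) -> str:
    """Step 1 (storage_retriever): read the HTML payload by lineage reference."""
    return "<h1>Title</h1><p>Some body text.</p>"  # stand-in payload

def html_to_markdown(html: str) -> str:
    """Step 2 (html_to_markdown): toy stand-in for the GFM conversion."""
    md = re.sub(r"<h1>(.*?)</h1>", r"# \1\n", html)
    md = re.sub(r"<p>(.*?)</p>", r"\1", md)
    return md.strip()

def segment(markdown: str, budget: int = 8) -> list[str]:
    """Step 3 (smart_segmenter): split so each segment fits a token budget.
    Whitespace words stand in for tiktoken tokens here."""
    words = markdown.split()
    return [" ".join(words[i:i + budget]) for i in range(0, len(words), budget)]

def publish(segments: list[str]) -> list[dict]:
    """Step 4 (segment_publisher): fan-out, one queue message per segment."""
    return [{"segment_index": i, "body": s} for i, s in enumerate(segments)]

markdown = html_to_markdown(retrieve_html("lineage://example"))
messages = publish(segment(markdown))
# Step 5 (markdown_storage_writer) would persist `markdown` plus each segment.
```

In a real pipeline these steps are configured in YAML and executed by the engine, not called directly.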
```python
from factflow_markdown import (
    StorageRetrieverAdapter,      # type: storage_retriever
    HtmlToMarkdownAdapter,        # type: html_to_markdown
    SmartSegmenterAdapter,        # type: smart_segmenter
    SegmentPublisherAdapter,      # type: segment_publisher
    MarkdownStorageWriterAdapter, # type: markdown_storage_writer
    SegmentStorageAdapter,        # segment-specific storage helper
    register_markdown_adapters,   # bulk-register all of the above
)
```
```python
from factflow_markdown import (
    MarkdownStorageConfig,
    MergeStrategy,  # enum for merging adjacent segments
)
```
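MergeStrategy governs how adjacent segments are combined, but its actual members aren't documented here. A plausible sketch, where both the member names and the merge logic are assumptions:

```python
from enum import Enum

class MergeStrategy(Enum):
    # Member names are guesses for illustration only.
    NONE = "none"          # keep segments as produced
    ADJACENT = "adjacent"  # greedily merge neighbours under the budget

def merge_adjacent(segments: list[str], budget: int,
                   strategy: MergeStrategy = MergeStrategy.ADJACENT) -> list[str]:
    """Greedily merge neighbouring segments while the combined word count
    (a stand-in for a real tiktoken count) stays within `budget`."""
    if strategy is MergeStrategy.NONE or not segments:
        return list(segments)
    merged = [segments[0]]
    for seg in segments[1:]:
        candidate = merged[-1] + " " + seg
        if len(candidate.split()) <= budget:
            merged[-1] = candidate
        else:
            merged.append(seg)
    return merged
```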
```python
from factflow_markdown import (
    MarkdownSegmenter,    # standalone segmenter usable outside a pipeline
    TextSegment,
    TokenCounter,         # tiktoken-backed token counter
    GitHubMarkdownParser,
    Node,
    NodeType,
)
```
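For the ad-hoc token-budgeting scripts mentioned above, the idea looks roughly like this. The real TokenCounter is tiktoken-backed; the whitespace counter below is a stdlib stand-in, and `segment_markdown` is a hypothetical helper, not MarkdownSegmenter's actual API:

```python
class WhitespaceTokenCounter:
    """Stand-in for the tiktoken-backed TokenCounter (real API not shown).
    With tiktoken you would use len(enc.encode(text)) instead."""
    def count(self, text: str) -> int:
        return len(text.split())

def segment_markdown(markdown: str, budget: int,
                     counter: WhitespaceTokenCounter) -> list[str]:
    """Pack whole paragraphs into segments that stay within the token
    budget; a single over-budget paragraph becomes its own segment."""
    segments: list[str] = []
    current = ""
    for para in filter(None, (p.strip() for p in markdown.split("\n\n"))):
        candidate = f"{current}\n\n{para}" if current else para
        if current and counter.count(candidate) > budget:
            segments.append(current)
            current = para
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments
```

Packing on paragraph boundaries (rather than mid-sentence) keeps each segment self-contained for downstream embedding, which is presumably why the segmenter is "smart" rather than a fixed-width splitter.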
  • Runtime: no additional direct deps (tiktoken arrives transitively via factflow-engine)
  • Workspace: factflow-protocols, factflow-foundation, factflow-engine
  • External services: storage provider

Tests at backend/packages/workflows/factflow-markdown/tests/. Token counting is deterministic — unit tests assert exact token counts for canonical inputs.