factflow-markdown
Markdown processing pipeline — HTML → markdown conversion, token-aware segmentation, storage I/O for the segmented chunks.
Tier and role
- Tier: workflow
- Import name: factflow_markdown
- Source: backend/packages/workflows/factflow-markdown/
Users wire these adapters in pipeline YAML. Direct imports are rare — MarkdownSegmenter and TokenCounter are occasionally pulled into ad-hoc scripts for token budgeting.
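For the ad-hoc token budgeting case, a minimal sketch might look like the following. It calls tiktoken directly rather than the package's TokenCounter, whose signature is not documented here. The `cl100k_base` encoding name and the helpers `count_tokens` and `fits_budget` are assumptions for illustration, not part of factflow_markdown.

```python
"""Ad-hoc token budgeting sketch (illustrative; not the factflow_markdown API)."""

try:
    import tiktoken  # third-party; assumed to be installed with the workspace
except ImportError:  # keep the script usable when tiktoken is missing
    tiktoken = None


def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Return the token count for text, or a rough word count as a fallback.

    The encoding name is an assumption; use whichever encoding the target
    model expects.
    """
    if tiktoken is not None:
        try:
            encoding = tiktoken.get_encoding(encoding_name)
            return len(encoding.encode(text))
        except Exception:
            # Encoding data could not be loaded or the text was rejected;
            # fall through to the approximate count below.
            pass
    return len(text.split())


def fits_budget(text: str, budget: int) -> bool:
    """True when text needs at most `budget` tokens."""
    return count_tokens(text) <= budget


if __name__ == "__main__":
    sample = "Markdown processing pipeline for vector-ready text."
    print(count_tokens(sample), fits_budget(sample, budget=64))
```

The whitespace fallback is only an approximation; it keeps a throwaway script running but should not be used where exact counts matter.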
Context
Five-adapter chain covering the “give me vector-ready text from an HTML page” path:
- storage_retriever: reads an HTML payload from storage by lineage reference
- html_to_markdown: conversion via a GitHub-flavoured markdown parser
- smart_segmenter: token-aware splits using tiktoken so each segment fits an LLM context budget
- segment_publisher: fan-out, one queue message per segment
- markdown_storage_writer: persist the canonical markdown + each segment
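The segmentation and fan-out steps above can be sketched roughly as follows. This is an illustration of the idea under stated assumptions, not the package's implementation: `Segment`, `segment_markdown`, `fan_out`, the message fields, and the whitespace-based counter are all hypothetical stand-ins (the real adapters count tokens with tiktoken).

```python
from dataclasses import dataclass
from typing import Callable, Iterator


@dataclass(frozen=True)
class Segment:
    """Hypothetical segment record; not the package's TextSegment."""
    index: int
    text: str
    token_count: int


def whitespace_count(text: str) -> int:
    """Stand-in counter; a tiktoken-backed counter can be passed in instead."""
    return len(text.split())


def segment_markdown(
    markdown: str,
    max_tokens: int,
    count: Callable[[str], int] = whitespace_count,
) -> list[Segment]:
    """Greedily pack blank-line-separated blocks into segments.

    A single block larger than max_tokens is emitted as its own oversized
    segment; a fuller segmenter would split it further.
    """
    segments: list[Segment] = []
    current: list[str] = []

    def flush() -> None:
        if current:
            text = "\n\n".join(current)
            segments.append(Segment(len(segments), text, count(text)))
            current.clear()

    for block in (b.strip() for b in markdown.split("\n\n")):
        if not block:
            continue
        candidate = "\n\n".join(current + [block])
        if current and count(candidate) > max_tokens:
            flush()  # close the current segment before it overflows
        current.append(block)
    flush()
    return segments


def fan_out(segments: list[Segment], lineage_id: str) -> Iterator[dict]:
    """Yield one message per segment, mirroring the fan-out step."""
    for seg in segments:
        yield {
            "lineage_id": lineage_id,  # hypothetical field names
            "segment_index": seg.index,
            "text": seg.text,
            "token_count": seg.token_count,
        }


if __name__ == "__main__":
    doc = "# Title\n\nFirst paragraph of text.\n\nSecond paragraph, a bit longer."
    for message in fan_out(segment_markdown(doc, max_tokens=8), "doc-1"):
        print(message["segment_index"], message["token_count"])
```

Passing the counter in as a function keeps the packing logic independent of the tokenizer, which makes it easy to test with a deterministic stand-in.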
Public API
Adapters (type: values in YAML)

    from factflow_markdown import (
        StorageRetrieverAdapter,       # type: storage_retriever
        HtmlToMarkdownAdapter,         # type: html_to_markdown
        SmartSegmenterAdapter,         # type: smart_segmenter
        SegmentPublisherAdapter,       # type: segment_publisher
        MarkdownStorageWriterAdapter,  # type: markdown_storage_writer
        SegmentStorageAdapter,         # segment-specific storage helper
        register_markdown_adapters,    # bulk-register all of the above
    )

Config

    from factflow_markdown import (
        MarkdownStorageConfig,
        MergeStrategy,  # enum for merging adjacent segments
    )

Building blocks

    from factflow_markdown import (
        MarkdownSegmenter,  # standalone segmenter usable outside a pipeline
        TextSegment,
        TokenCounter,       # tiktoken-backed token counter
        GitHubMarkdownParser,
        Node,
        NodeType,
    )

Dependencies
- Runtime: no additional direct deps (tiktoken is pulled in transitively via factflow-engine)
- Workspace: factflow-protocols, factflow-foundation, factflow-engine
- External services: storage provider
Testing
Tests live at backend/packages/workflows/factflow-markdown/tests/. Token counting is deterministic, so unit tests assert exact token counts for canonical inputs.
Related
- factflow-webscraper: typical upstream producer
- factflow-embeddings: typical downstream consumer of segments
- Rule: .claude/rules/adapter-conventions.md