# Markdown workflow
The markdown workflow turns raw HTML into vector-ready text. It chains five adapters:

`storage_retriever` → `html_to_markdown` → `smart_segmenter` → `segment_publisher` → `markdown_storage_writer`

Typical upstream: webscraper. Typical downstream: embeddings.
## Minimal pipeline

```yaml
version: "1.0"

routes:
  markdown_processor:
    inbound:
      queue: "/queue/markdown.input"
      subscription: "markdown-processors"
      concurrency: 10

adapters:
  - type: "storage_retriever"
    config:
      key_field: "storage_key"
  - type: "html_to_markdown"
    config:
      preserve_images: false
  - type: "smart_segmenter"
    config:
      max_tokens: 2048
      tokenizer: "cl100k_base"
  - type: "segment_publisher"
  - type: "markdown_storage_writer"
```

## Adapters
| Type | Purpose |
|---|---|
| `storage_retriever` | Read HTML (or other input) from storage by lineage reference |
| `html_to_markdown` | GitHub-flavoured conversion; preserves structure |
| `smart_segmenter` | Token-aware splitting using tiktoken; each segment fits an LLM context |
| `segment_publisher` | Fan-out: one message per segment |
| `markdown_storage_writer` | Persist canonical markdown + each segment |
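The fan-out step can be sketched as follows. This is an illustrative sketch only, not the actual `segment_publisher` implementation: the `Segment` shape and message fields are assumptions, chosen to show how one inbound document becomes one outbound message per segment while carrying lineage for tracing.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical segment record (fields assumed for illustration)."""
    index: int
    text: str
    token_count: int


def fan_out(segments: list[Segment], lineage_key: str) -> list[dict]:
    """Produce one message per segment, each carrying the document lineage
    and its position in the whole so downstream consumers can reassemble."""
    return [
        {
            "lineage_key": lineage_key,
            "segment_index": seg.index,
            "segment_total": len(segments),
            "text": seg.text,
            "token_count": seg.token_count,
        }
        for seg in segments
    ]
```

Each message is self-describing (`segment_index` / `segment_total`), so a consumer such as the embeddings workflow can process segments independently and out of order.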
## Tokens and segmentation

`smart_segmenter` respects `max_tokens` while keeping markdown structure intact (it splits on section boundaries, not mid-sentence). Tokenizer choice:

- `cl100k_base` — OpenAI GPT-4
- `o200k_base` — OpenAI GPT-4o / o-series
- Others — see `factflow_markdown.token_counter.TokenCounter`

Pick the tokenizer that matches your downstream embedding model.
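The splitting behaviour can be sketched in a few lines. This is a simplified sketch, not the `smart_segmenter` implementation: a whitespace word count stands in for a real tokenizer (in practice tiktoken's `cl100k_base` or `o200k_base` would supply the count), and only top-level heading boundaries are considered.

```python
def count_tokens(text: str) -> int:
    # Stand-in for a real tokenizer such as tiktoken's cl100k_base.
    return len(text.split())


def segment_markdown(markdown: str, max_tokens: int) -> list[str]:
    """Greedily pack heading-delimited sections into segments that stay
    under max_tokens, so splits land on section boundaries, never
    mid-sentence."""
    # Split the document into sections at heading lines.
    sections: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))

    # Greedily merge consecutive sections while they fit the budget.
    segments: list[str] = []
    buffer = ""
    for section in sections:
        candidate = f"{buffer}\n{section}" if buffer else section
        if buffer and count_tokens(candidate) > max_tokens:
            segments.append(buffer)
            buffer = section
        else:
            buffer = candidate
    if buffer:
        segments.append(buffer)
    return segments
```

A single section that alone exceeds `max_tokens` would still be emitted whole here; a production segmenter would recurse into subsections or paragraphs in that case.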
## Merge strategy

If segments end up very small, `MergeStrategy` determines whether to merge adjacent ones back together. Configure it per pipeline via the segmenter config.
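A merge pass of this kind can be sketched as below. The rule shown (merge a too-small segment into its predecessor when the combined size still fits the budget) is an assumption for illustration; the function name and thresholds are not the actual `MergeStrategy` API.

```python
def merge_small_segments(
    segments: list[str],
    min_tokens: int,
    max_tokens: int,
    count=lambda text: len(text.split()),  # stand-in token counter
) -> list[str]:
    """Merge each segment smaller than min_tokens into the previous
    segment, provided the merged result still fits max_tokens."""
    merged: list[str] = []
    for seg in segments:
        if (
            merged
            and count(seg) < min_tokens
            and count(merged[-1]) + count(seg) <= max_tokens
        ):
            merged[-1] = merged[-1] + "\n" + seg
        else:
            merged.append(seg)
    return merged
```

A single greedy backward pass like this keeps segment order stable, which matters when downstream consumers reassemble segments by index.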
## Direct use of building blocks

Outside a pipeline, import the segmenter or token counter for ad-hoc work:

```python
from factflow_markdown import MarkdownSegmenter, TokenCounter

counter = TokenCounter(tokenizer="cl100k_base")
print(counter.count("Some text."))

segmenter = MarkdownSegmenter(max_tokens=1000, tokenizer="cl100k_base")
for segment in segmenter.segment(markdown_text):
    print(segment.text, segment.token_count)
```

## Related
- factflow-markdown reference
- Webscraper workflow — typical source of HTML
- Embeddings workflow — typical consumer of segments