
# Markdown workflow

The markdown workflow turns raw HTML into vector-ready text. Five adapters in a chain:

```
storage_retriever → html_to_markdown → smart_segmenter → segment_publisher → markdown_storage_writer
```

Typical upstream: webscraper. Typical downstream: embeddings.

```yaml
version: "1.0"
routes:
  markdown_processor:
    inbound:
      queue: "/queue/markdown.input"
      subscription: "markdown-processors"
      concurrency: 10
    adapters:
      - type: "storage_retriever"
        config:
          key_field: "storage_key"
      - type: "html_to_markdown"
        config:
          preserve_images: false
      - type: "smart_segmenter"
        config:
          max_tokens: 2048
          tokenizer: "cl100k_base"
      - type: "segment_publisher"
      - type: "markdown_storage_writer"
```
| Type | Purpose |
| --- | --- |
| `storage_retriever` | Read HTML (or other input) from storage by lineage reference |
| `html_to_markdown` | GitHub-flavoured conversion; preserves structure |
| `smart_segmenter` | Token-aware splitting using tiktoken; each segment fits an LLM context |
| `segment_publisher` | Fan-out: one message per segment |
| `markdown_storage_writer` | Persist canonical markdown plus each segment |
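The fan-out step in the table can be pictured as a pure function from one document message to N segment messages, each carrying lineage back to the parent. A minimal sketch — the helper name `publish_segments` and the message fields are illustrative assumptions, not the factflow API:

```python
# Illustrative sketch of segment fan-out: one inbound document message
# becomes one outbound message per segment. The message shape and the
# helper name are assumptions, not the factflow API.
def publish_segments(message: dict, segments: list[str]) -> list[dict]:
    """Produce one message per segment, keeping a reference to the parent."""
    return [
        {
            "parent_key": message["storage_key"],  # lineage back to the source
            "segment_index": i,
            "segment_count": len(segments),
            "text": text,
        }
        for i, text in enumerate(segments)
    ]

msgs = publish_segments({"storage_key": "doc-42"}, ["Intro…", "Body…"])
```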

`smart_segmenter` respects `max_tokens` while keeping markdown structure intact (it splits on section boundaries, not mid-sentence). Tokenizer choice:

- `cl100k_base` — OpenAI GPT-4
- `o200k_base` — OpenAI GPT-4o / o-series
- Others — see `factflow_markdown.token_counter.TokenCounter`

Pick the tokenizer matching your downstream embedding model.
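The boundary-respecting split can be sketched as greedily packing whole heading-delimited sections into segments until the budget is reached. This sketch uses a whitespace word count as a stand-in for the tiktoken token count; the real segmenter counts tokens with the configured tokenizer:

```python
import re

def split_on_headings(markdown: str, max_words: int) -> list[str]:
    """Greedy packer: keep heading-delimited sections whole, starting a new
    segment when the word budget would be exceeded. Word count is a stand-in
    for a real tiktoken token count."""
    # Split just before each ATX heading, keeping the heading with its body.
    sections = [s for s in re.split(r"(?m)^(?=#{1,6} )", markdown) if s.strip()]
    segments, current, used = [], [], 0
    for section in sections:
        n = len(section.split())
        if current and used + n > max_words:
            segments.append("".join(current))
            current, used = [], 0
        current.append(section)
        used += n
    if current:
        segments.append("".join(current))
    return segments

doc = "# A\none two three\n# B\nfour five\n# C\nsix seven eight"
parts = split_on_headings(doc, max_words=6)
```

Each output segment starts at a heading, so no segment begins mid-sentence — the property the real segmenter guarantees while counting actual tokens.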

If segments end up very small, `MergeStrategy` determines whether to merge adjacent ones back together. Configure it per pipeline via the segmenter config.

Outside a pipeline, import the segmenter or token counter for ad-hoc work:

```python
from factflow_markdown import MarkdownSegmenter, TokenCounter

counter = TokenCounter(tokenizer="cl100k_base")
print(counter.count("Some text."))

segmenter = MarkdownSegmenter(max_tokens=1000, tokenizer="cl100k_base")
for segment in segmenter.segment(markdown_text):  # markdown_text: your document
    print(segment.text, segment.token_count)
```