Skip to content

factflow-webscraper

Web scraping pipeline adapters — sitemap parsing, URL expansion, HTTP fetching with adaptive rate limiting, and storage writing.

Users configure these adapters in pipeline YAML. Adapter authors sometimes import SimpleHttpClient or RateLimitStrategyRegistry when building new adapters that need the same HTTP + rate-limiting behaviour.

Canonical pipeline:

sitemap_parser → url_expander → web_scraper → web_content_storage

Each step is an adapter; queues pass messages between them.

The rate limiter is adaptive — each domain gets its own token bucket, tuned at runtime based on response timings. RateLimitStrategyRegistry lets downstream packages register alternative strategies (e.g. per-endpoint quotas) without modifying the core adapter.

Every symbol in __all__:

from factflow_webscraper import (
SitemapParserAdapter, # type: sitemap_parser
URLExpanderAdapter, # type: url_expander
WebScraperAdapter, # type: web_scraper
WebCrawlerAdapter, # type: web_crawler (crawl4ai-based)
WebContentStorageAdapter, # type: web_content_storage
)
from factflow_webscraper import (
WebCrawlerConfig,
WebContentStorageConfig,
WebScraperSettings,
)

Each adapter’s config: block in YAML validates against its config class (see .claude/skills/backend/adapter-schema/).

from factflow_webscraper import (
SimpleHttpClient, # httpx-based async client
SitemapParser, # standalone sitemap XML parser
PageData, # scraped-page model
RateLimitStrategyRegistry, # register alternative rate strategies
)
  • Runtime: httpx[http2], beautifulsoup4, crawl4ai>=0.8.0 (for the JS-rendered crawler adapter)
  • Workspace: factflow-protocols, factflow-foundation, factflow-engine
  • External services: storage provider (for WebContentStorageAdapter); crawl4ai needs Chromium available for the crawler adapter

Tests at backend/packages/workflows/factflow-webscraper/tests/. Uses the pipeline-testing skill patterns.

  • factflow-engine — adapter discovery picks these up at startup
  • factflow-markdown — typical downstream stage (HTML → markdown)
  • Rule: .claude/rules/adapter-conventions.md