factflow-webscraper

Web scraping pipeline adapters — sitemap parsing, URL expansion, HTTP fetching with adaptive rate limiting, and storage writing.

Tier and role

Tier: workflow
Import name: factflow_webscraper
Source: backend/packages/workflows/factflow-webscraper/

Users configure these adapters in pipeline YAML. Adapter authors sometimes import SimpleHttpClient or RateLimitStrategyRegistry when building new adapters that need the same HTTP + rate-limiting behaviour.

Context

Canonical pipeline:

sitemap_parser → url_expander → web_scraper → web_content_storage

Each step is an adapter; queues pass messages between them.
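To make the queue-between-adapters idea concrete, here is a minimal sketch using plain asyncio queues. It does not use the factflow engine or the real adapters: the `stage` helper, the stand-in handler functions, the `SENTINEL` end-of-stream marker, and the example URL are all assumptions made for illustration.

```python
import asyncio

SENTINEL = None  # end-of-stream marker (an assumption of this sketch)


async def stage(handler, inbox, outbox):
    """Read messages from inbox, apply handler, forward each result to outbox."""
    while True:
        message = await inbox.get()
        if message is SENTINEL:
            if outbox is not None:
                await outbox.put(SENTINEL)  # propagate shutdown downstream
            return
        for result in handler(message):
            if outbox is not None:
                await outbox.put(result)


# Stand-in handlers: each takes one message and returns a list of messages.
def parse_sitemap(sitemap_url):
    base = sitemap_url.rsplit("/", 1)[0]
    return [f"{base}/page-{i}" for i in (1, 2)]


def expand_url(url):
    return [url]  # a real expander might emit several URLs per input


def scrape(url):
    return [{"url": url, "html": f"<html>{url}</html>"}]


def make_store(sink):
    def store(page):
        sink.append(page)  # last stage: persist and emit nothing
        return []
    return store


async def run_pipeline(sitemap_url):
    stored = []
    handlers = [parse_sitemap, expand_url, scrape, make_store(stored)]
    queues = [asyncio.Queue() for _ in handlers]
    tasks = []
    for i, handler in enumerate(handlers):
        outbox = queues[i + 1] if i + 1 < len(queues) else None
        tasks.append(asyncio.create_task(stage(handler, queues[i], outbox)))
    await queues[0].put(sitemap_url)
    await queues[0].put(SENTINEL)
    await asyncio.gather(*tasks)
    return stored


pages = asyncio.run(run_pipeline("https://example.com/sitemap.xml"))
```

Each stage only knows its inbox and outbox, which is the property that lets a step be swapped out without touching its neighbours.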

The rate limiter is adaptive — each domain gets its own token bucket, tuned at runtime based on response timings. RateLimitStrategyRegistry lets downstream packages register alternative strategies (e.g. per-endpoint quotas) without modifying the core adapter.
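The per-domain token bucket can be sketched as follows. This is not the package's implementation: the class names, the default parameters, and the tuning rule (halve the rate after a slow response, raise it by 10% otherwise) are assumptions chosen to illustrate the technique.

```python
import time
from urllib.parse import urlparse


class AdaptiveTokenBucket:
    """Token bucket whose refill rate is nudged by observed response times."""

    def __init__(self, rate=2.0, capacity=5.0, min_rate=0.1, max_rate=10.0,
                 slow_threshold=1.0, clock=time.monotonic):
        self.rate = rate                      # tokens added per second
        self.capacity = capacity              # maximum burst size
        self.min_rate = min_rate
        self.max_rate = max_rate
        self.slow_threshold = slow_threshold  # seconds; slower responses cut the rate
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def _refill(self):
        now = self._clock()
        earned = (now - self._last) * self.rate
        self._tokens = min(self.capacity, self._tokens + earned)
        self._last = now

    def try_acquire(self):
        """Take one token if available; return False when the caller should wait."""
        self._refill()
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False

    def record_response(self, elapsed_seconds):
        """Halve the rate after a slow response, otherwise raise it by 10%."""
        self._refill()  # settle tokens earned at the old rate first
        if elapsed_seconds > self.slow_threshold:
            self.rate = max(self.min_rate, self.rate / 2)
        else:
            self.rate = min(self.max_rate, self.rate * 1.1)


class PerDomainLimiter:
    """Lazily creates one bucket per domain (the URL's netloc)."""

    def __init__(self, **bucket_kwargs):
        self._bucket_kwargs = bucket_kwargs
        self._buckets = {}

    def bucket_for(self, url):
        domain = urlparse(url).netloc
        if domain not in self._buckets:
            self._buckets[domain] = AdaptiveTokenBucket(**self._bucket_kwargs)
        return self._buckets[domain]
```

A caller would check `bucket_for(url).try_acquire()` before each request and pass the measured latency to `record_response` afterwards, so a slow domain is throttled without affecting the others.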

Public API

Every symbol in __all__:

Adapters (used as `type:` values in YAML)

from factflow_webscraper import (
    SitemapParserAdapter,           # type: sitemap_parser
    URLExpanderAdapter,             # type: url_expander
    WebScraperAdapter,              # type: web_scraper
    WebCrawlerAdapter,              # type: web_crawler (crawl4ai-based)
    WebContentStorageAdapter,       # type: web_content_storage
)

Config classes

from factflow_webscraper import (
    WebCrawlerConfig,
    WebContentStorageConfig,
    WebScraperSettings,
)

Each adapter’s `config:` block in YAML is validated against its config class (see .claude/skills/backend/adapter-schema/).

Building blocks (for adapter authors)

from factflow_webscraper import (
    SimpleHttpClient,              # httpx-based async client
    SitemapParser,                 # standalone sitemap XML parser
    PageData,                      # scraped-page model
    RateLimitStrategyRegistry,     # register alternative rate strategies
)
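The README does not show the method signatures of RateLimitStrategyRegistry, so the following is only a sketch of the general registry pattern it names. `StrategyRegistry`, `register`, `create`, and both example strategies are hypothetical and should not be read as the package's API.

```python
from typing import Callable, Dict
from urllib.parse import urlparse

# In this sketch a strategy maps a request URL to a minimum delay in seconds.
Strategy = Callable[[str], float]


class StrategyRegistry:
    """Hypothetical name-to-factory registry (not the real RateLimitStrategyRegistry)."""

    def __init__(self):
        self._factories: Dict[str, Callable[..., Strategy]] = {}

    def register(self, name):
        """Decorator that registers a strategy factory under a name."""
        def decorator(factory):
            if name in self._factories:
                raise ValueError(f"strategy {name!r} already registered")
            self._factories[name] = factory
            return factory
        return decorator

    def create(self, name, **options):
        """Build a strategy by name, passing options to its factory."""
        try:
            factory = self._factories[name]
        except KeyError:
            raise KeyError(f"unknown strategy {name!r}") from None
        return factory(**options)


registry = StrategyRegistry()


@registry.register("fixed_delay")
def fixed_delay(seconds=1.0):
    return lambda url: seconds


@registry.register("per_endpoint")
def per_endpoint(delays, default=1.0):
    # delays maps a URL path prefix to a delay, e.g. {"/api/": 5.0}
    def strategy(url):
        path = urlparse(url).path
        for prefix, delay in delays.items():
            if path.startswith(prefix):
                return delay
        return default
    return strategy


quota = registry.create("per_endpoint", delays={"/api/": 5.0}, default=0.5)
```

The point of the pattern is that a downstream package adds a new strategy by registering a factory, and the code that consumes strategies only ever looks them up by name.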

Dependencies

Runtime: httpx[http2], beautifulsoup4, crawl4ai>=0.8.0 (for the JS-rendered crawler adapter)
Workspace: factflow-protocols, factflow-foundation, factflow-engine
External services: storage provider (for WebContentStorageAdapter); crawl4ai needs Chromium available for the crawler adapter

Testing

Tests live at backend/packages/workflows/factflow-webscraper/tests/ and follow the pipeline-testing skill patterns.

Related

factflow-engine: adapter discovery picks these adapters up at startup
factflow-markdown: typical downstream stage (HTML → markdown)
Rule: .claude/rules/adapter-conventions.md