factflow-webscraper
Web scraping pipeline adapters — sitemap parsing, URL expansion, HTTP fetching with adaptive rate limiting, and storage writing.
Tier and role
- Tier: workflow
- Import name: factflow_webscraper
- Source: backend/packages/workflows/factflow-webscraper/
Users configure these adapters in pipeline YAML. Adapter authors sometimes import SimpleHttpClient or RateLimitStrategyRegistry when building new adapters that need the same HTTP + rate-limiting behaviour.
Context
Canonical pipeline:
sitemap_parser → url_expander → web_scraper → web_content_storage
Each step is an adapter; queues pass messages between them.
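As a rough illustration of the stage-and-queue layout above, the following sketch chains four stages with asyncio.Queue. The transform functions (parse_sitemap, expand_url, scrape, store), run_stage, run_pipeline, and the STOP sentinel are all hypothetical stand-ins invented for this example; the real adapters and the engine's queueing mechanism may work differently.

```python
import asyncio

STOP = object()  # hypothetical sentinel that tells a stage to shut down


async def run_stage(transform, inbox, outbox):
    """Read messages from inbox, apply transform, forward each result to outbox."""
    while True:
        msg = await inbox.get()
        if msg is STOP:
            await outbox.put(STOP)  # propagate shutdown to the next stage
            return
        for out in transform(msg):
            await outbox.put(out)


# Hypothetical transforms standing in for the four adapters.
def parse_sitemap(sitemap_url):
    base = sitemap_url.rsplit("/", 1)[0]
    return [f"{base}/page-{i}" for i in (1, 2)]  # pretend the sitemap lists two pages


def expand_url(url):
    return [url, url + "?print=1"]  # pretend each URL has one variant


def scrape(url):
    return [{"url": url, "html": f"<html>{url}</html>"}]  # no real HTTP request


def store(page):
    return [f"stored:{page['url']}"]  # no real storage write


async def run_pipeline(sitemap_url):
    stages = [parse_sitemap, expand_url, scrape, store]
    # One queue in front of each stage, plus one that collects the final output.
    queues = [asyncio.Queue() for _ in range(len(stages) + 1)]
    tasks = [
        asyncio.create_task(run_stage(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]
    await queues[0].put(sitemap_url)
    await queues[0].put(STOP)
    await asyncio.gather(*tasks)

    results = []
    while True:
        item = queues[-1].get_nowait()
        if item is STOP:
            break
        results.append(item)
    return results
```

Calling asyncio.run(run_pipeline("https://example.com/sitemap.xml")) returns one "stored:" entry per expanded URL. Because each stage only sees its own inbox and outbox, a stage can be swapped or removed without touching its neighbours.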
The rate limiter is adaptive — each domain gets its own token bucket, tuned at runtime based on response timings. RateLimitStrategyRegistry lets downstream packages register alternative strategies (e.g. per-endpoint quotas) without modifying the core adapter.
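The per-domain token bucket idea can be sketched as below. This is a minimal illustration and not the package's implementation: TokenBucket, PerDomainLimiter, the halve-on-slow and add-on-fast tuning rule, and every numeric default are assumptions chosen for the example. The clock is passed in explicitly so the behaviour is easy to follow.

```python
from urllib.parse import urlparse


class TokenBucket:
    """Hypothetical token bucket whose refill rate is tuned from response timings."""

    def __init__(self, rate, capacity=5.0):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start with a full bucket
        self.updated = 0.0        # timestamp of the last refill

    def try_acquire(self, now):
        """Refill for the elapsed time, then take one token if available."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def observe(self, response_seconds, slow_threshold=1.0):
        """Assumed tuning rule: halve the rate after a slow response, else add 0.5."""
        if response_seconds > slow_threshold:
            self.rate = max(0.1, self.rate / 2)
        else:
            self.rate = min(10.0, self.rate + 0.5)


class PerDomainLimiter:
    """Keeps one independent bucket per domain (the URL's netloc)."""

    def __init__(self, initial_rate=2.0):
        self.initial_rate = initial_rate
        self.buckets = {}

    def bucket_for(self, url):
        domain = urlparse(url).netloc
        if domain not in self.buckets:
            self.buckets[domain] = TokenBucket(rate=self.initial_rate)
        return self.buckets[domain]
```

A caller would check limiter.bucket_for(url).try_acquire(now=time.monotonic()) before each request and pass the measured latency to observe() afterwards, so a slow domain is throttled without affecting the others.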
Public API
Every symbol in __all__:
Adapters (used as type: values in YAML)
    from factflow_webscraper import (
        SitemapParserAdapter,      # type: sitemap_parser
        URLExpanderAdapter,        # type: url_expander
        WebScraperAdapter,         # type: web_scraper
        WebCrawlerAdapter,         # type: web_crawler (crawl4ai-based)
        WebContentStorageAdapter,  # type: web_content_storage
    )

Config classes
    from factflow_webscraper import (
        WebCrawlerConfig,
        WebContentStorageConfig,
        WebScraperSettings,
    )

Each adapter’s config: block in YAML validates against its config class (see .claude/skills/backend/adapter-schema/).
Building blocks (for adapter authors)
    from factflow_webscraper import (
        SimpleHttpClient,           # httpx-based async client
        SitemapParser,              # standalone sitemap XML parser
        PageData,                   # scraped-page model
        RateLimitStrategyRegistry,  # register alternative rate strategies
    )

Dependencies
- Runtime: httpx[http2], beautifulsoup4, crawl4ai>=0.8.0 (for the JS-rendered crawler adapter)
- Workspace: factflow-protocols, factflow-foundation, factflow-engine
- External services: storage provider (for WebContentStorageAdapter); crawl4ai needs Chromium available for the crawler adapter
Testing
Tests live in backend/packages/workflows/factflow-webscraper/tests/ and use the pipeline-testing skill patterns.
Related
- factflow-engine: adapter discovery picks these up at startup
- factflow-markdown: typical downstream stage (HTML → markdown)
- Rule: .claude/rules/adapter-conventions.md