
Webscraper workflow

The webscraper workflow ingests HTML content from websites that publish a sitemap. Canonical chain:

sitemap_parser → url_expander → web_scraper → web_content_storage

Rate limiting is adaptive per domain; there is no hardcoded requests-per-second limit.

```yaml
version: "1.0"
routes:
  sitemap_scraper:
    name: "Sitemap Web Scraper"
    inbound:
      queue: "/queue/webscraper.sitemap"
      subscription: "sitemap-processors"
      concurrency: 5
      prefetch: 10
    adapters:
      - type: "sitemap_parser"
        config:
          max_urls: 1000
      - type: "url_expander"
      - type: "web_scraper"
        config:
          follow_redirects: true
          timeout: 30
      - type: "web_content_storage"
init_message:
  route: "sitemap_scraper"
  payload:
    sitemap_url: "https://example.com/sitemap.xml"
```
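At its core, the sitemap_parser step fetches the sitemap XML and extracts its `<loc>` entries, capped at `max_urls`. A minimal sketch of that extraction using only the standard library (the adapter's real implementation, fetching, and sitemap-index recursion are not shown):

```python
import xml.etree.ElementTree as ET

# Sitemaps use this XML namespace for both urlset and sitemapindex documents.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def extract_urls(sitemap_xml: str, max_urls: int = 1000) -> list[str]:
    """Extract <loc> URLs from a sitemap document, capped at max_urls."""
    root = ET.fromstring(sitemap_xml)
    urls = [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]
    return urls[:max_urls]

sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc></url>
  <url><loc>https://example.com/b</loc></url>
</urlset>"""

print(extract_urls(sitemap))  # ['https://example.com/a', 'https://example.com/b']
```

For a sitemap index, the same `<loc>` scan yields child sitemap URLs rather than page URLs; the adapter recurses into those.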
| Type | Purpose |
| --- | --- |
| `sitemap_parser` | Fetch a sitemap, extract URLs; supports sitemap indexes |
| `url_expander` | Fan out: one incoming URL list → one outgoing message per URL |
| `web_scraper` | Fetch one URL, return HTML + metadata |
| `web_crawler` | Fetch with JS rendering via crawl4ai (requires Chromium) |
| `web_content_storage` | Persist HTML + metadata to storage |

See the adapter catalog for each adapter’s full config shape.
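The url_expander fan-out is a pure one-to-many transform: one message carrying a URL list becomes one message per URL, each inheriting the rest of the parent message. A sketch of that semantics (the message shape here is illustrative, not the adapter's actual schema):

```python
def expand(message: dict) -> list[dict]:
    """Fan out one URL-list message into one message per URL."""
    # Everything except the list itself is carried over to each child message.
    meta = {k: v for k, v in message.items() if k != "urls"}
    return [{**meta, "url": u} for u in message["urls"]]

batch = {"route": "sitemap_scraper",
         "urls": ["https://example.com/a", "https://example.com/b"]}
print(expand(batch))
# [{'route': 'sitemap_scraper', 'url': 'https://example.com/a'},
#  {'route': 'sitemap_scraper', 'url': 'https://example.com/b'}]
```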

Rate limiting default: a per-domain token bucket that probes capacity via AIMD (see the rate limiting guide). No configuration is needed unless you need per-URL quotas; in that case, register a custom strategy via RateLimitStrategyRegistry.
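The AIMD probing amounts to: grow the per-domain rate additively while requests succeed, cut it multiplicatively when the server pushes back (429/503). A simplified sketch of that control loop (the constants are illustrative; the real strategy lives behind RateLimitStrategyRegistry):

```python
class AIMDRate:
    """Additive-increase / multiplicative-decrease rate controller for one domain."""

    def __init__(self, rate: float = 1.0, increase: float = 0.1,
                 decrease: float = 0.5, max_rate: float = 50.0):
        self.rate = rate          # current requests/second for this domain
        self.increase = increase  # additive step on success
        self.decrease = decrease  # multiplicative factor on throttling
        self.max_rate = max_rate

    def on_response(self, status: int) -> float:
        if status in (429, 503):
            # Server is throttling: back off hard, but keep a small floor.
            self.rate = max(0.1, self.rate * self.decrease)
        else:
            # Success: probe for more capacity, up to the ceiling.
            self.rate = min(self.max_rate, self.rate + self.increase)
        return self.rate
```

A token bucket then refills at `rate` tokens per second for that domain, spending one token per request.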

Some sites require JS execution to produce useful content. Use web_crawler instead of web_scraper:

```yaml
- type: "web_crawler"
  config:
    wait_for_selector: "article.content"
    timeout: 60
```

Requires Chromium. In embedded mode, Chromium auto-installs; in AWS deployments, use the unclecode/crawl4ai image.

Webscraper output feeds the markdown workflow: HTML → markdown → segments → embeddings.
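The first hop of that pipeline (HTML → markdown) can be pictured with the standard library's HTMLParser; the real conversion uses a dedicated converter, and this sketch deliberately handles headings and paragraphs only:

```python
from html.parser import HTMLParser

class MiniMarkdown(HTMLParser):
    """Toy HTML-to-markdown converter: headings and paragraphs only."""

    def __init__(self):
        super().__init__()
        self.lines: list[str] = []
        self.prefix = None  # markdown prefix for the next text chunk

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self.prefix = "#" * int(tag[1]) + " "
        elif tag == "p":
            self.prefix = ""

    def handle_data(self, data):
        if self.prefix is not None and data.strip():
            self.lines.append(self.prefix + data.strip())
            self.prefix = None

    def handle_endtag(self, tag):
        self.prefix = None  # ignore text outside the tags we track

def html_to_markdown(html: str) -> str:
    parser = MiniMarkdown()
    parser.feed(html)
    return "\n\n".join(parser.lines)

print(html_to_markdown("<h1>Title</h1><p>Body text.</p>"))
# → "# Title\n\nBody text."
```

Downstream, the markdown workflow splits that output into segments and computes embeddings per segment.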