# Webscraper workflow
The webscraper workflow ingests HTML content from websites that publish a sitemap. Canonical chain:

`sitemap_parser` → `url_expander` → `web_scraper` → `web_content_storage`

Rate limiting is adaptive per domain; there is no hardcoded RPS.
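To make the first hop concrete, here is a minimal sketch of what parsing a sitemap involves: detect whether the document is a `<sitemapindex>` (pointing at child sitemaps) or a `<urlset>`, then collect the `<loc>` entries. This is illustrative only; the real `sitemap_parser` adapter's implementation and config live in the adapter catalog.

```python
import xml.etree.ElementTree as ET

# Namespace used by https://www.sitemaps.org/protocol.html
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap(xml_text: str) -> tuple[str, list[str]]:
    """Return ("index" | "urlset", list of <loc> URLs)."""
    root = ET.fromstring(xml_text)
    kind = "index" if root.tag == f"{NS}sitemapindex" else "urlset"
    locs = [el.text.strip() for el in root.iter(f"{NS}loc") if el.text]
    return kind, locs

urlset = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc></url>
  <url><loc>https://example.com/b</loc></url>
</urlset>"""

kind, urls = parse_sitemap(urlset)
# kind == "urlset"; urls holds both <loc> values
```

For a sitemap index, the same function returns `"index"` and the child sitemap URLs, which would then be fetched and parsed recursively (bounded by `max_urls`).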
## Minimal pipeline

```yaml
version: "1.0"

routes:
  sitemap_scraper:
    name: "Sitemap Web Scraper"
    inbound:
      queue: "/queue/webscraper.sitemap"
      subscription: "sitemap-processors"
      concurrency: 5
      prefetch: 10
    adapters:
      - type: "sitemap_parser"
        config:
          max_urls: 1000
      - type: "url_expander"
      - type: "web_scraper"
        config:
          follow_redirects: true
          timeout: 30
      - type: "web_content_storage"

init_message:
  route: "sitemap_scraper"
  payload:
    sitemap_url: "https://example.com/sitemap.xml"
```

## Adapters
| Type | Purpose |
|---|---|
| `sitemap_parser` | Fetch a sitemap and extract URLs; supports sitemap indexes |
| `url_expander` | Fan out: one incoming URL list → one outgoing message per URL |
| `web_scraper` | Fetch one URL, return HTML + metadata |
| `web_crawler` | Fetch with JS rendering via crawl4ai (requires Chromium) |
| `web_content_storage` | Persist HTML + metadata to storage |
See the adapter catalog for each adapter’s full config shape.
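The fan-out step in the table above can be sketched as follows. The message shapes here are hypothetical stand-ins for illustration; the actual adapter contract is defined in the adapter catalog.

```python
def expand_urls(message: dict) -> list[dict]:
    """One incoming message carrying a URL list -> one outgoing message per URL.
    Shared metadata (everything except "urls") is copied onto each message."""
    meta = {k: v for k, v in message.items() if k != "urls"}
    return [{**meta, "url": url} for url in message["urls"]]

out = expand_urls({
    "sitemap_url": "https://example.com/sitemap.xml",
    "urls": ["https://example.com/a", "https://example.com/b"],
})
# Two messages, each with one "url" plus the shared "sitemap_url"
```

Downstream adapters such as `web_scraper` then process each message independently, which is what lets `concurrency` apply per URL rather than per sitemap.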
## Rate limiting

Default: a per-domain token bucket that probes capacity via AIMD (see the rate limiting guide). No config is needed unless you need per-URL quotas; in that case, register a custom strategy via `RateLimitStrategyRegistry`.
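As a simplified sketch of that default (see the rate limiting guide for the real strategy), a per-domain limiter can pair a token bucket with AIMD: grow the refill rate additively after each success, and cut it multiplicatively on a throttle signal such as HTTP 429.

```python
class AimdBucket:
    """Per-domain token bucket whose refill rate is tuned by AIMD (sketch)."""

    def __init__(self, rate=1.0, burst=5.0, increase=0.1, backoff=0.5):
        self.rate, self.burst = rate, burst          # tokens/sec, bucket size
        self.increase, self.backoff = increase, backoff
        self.tokens, self.last = burst, 0.0          # start full

    def try_acquire(self, now: float) -> bool:
        """Refill by elapsed time, then spend one token if available."""
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def on_success(self):
        self.rate += self.increase                   # additive increase: probe capacity

    def on_throttle(self):
        self.rate = max(0.1, self.rate * self.backoff)  # multiplicative decrease

buckets: dict[str, AimdBucket] = {}                  # one bucket per domain
bucket = buckets.setdefault("example.com", AimdBucket())
```

A rejected `try_acquire` means the caller waits and retries; over time the per-domain rate converges near whatever the origin tolerates without 429s.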
## JavaScript-rendered pages

Some sites require JS execution to produce useful content. Use `web_crawler` instead of `web_scraper`:

```yaml
- type: "web_crawler"
  config:
    wait_for_selector: "article.content"
    timeout: 60
```

Requires Chromium. In embedded mode, Chromium auto-installs; in AWS deployments, use the `unclecode/crawl4ai` image.
## Typical downstream

Webscraper output feeds the markdown workflow: HTML → markdown → segments → embeddings.
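The first hop of that handoff can be pictured with a toy converter. This is an illustrative stdlib-only reduction of HTML to markdown-style text, not the markdown workflow's actual converter:

```python
from html.parser import HTMLParser

class HtmlToMarkdown(HTMLParser):
    """Toy HTML -> markdown converter: headings and paragraphs only."""

    def __init__(self):
        super().__init__()
        self.out = []
        self._prefix = ""          # markdown prefix for the next text node

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self._prefix = "#" * int(tag[1]) + " "
        elif tag == "p":
            self._prefix = ""

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.out.append(self._prefix + text)
            self._prefix = ""

def to_markdown(html: str) -> str:
    parser = HtmlToMarkdown()
    parser.feed(html)
    return "\n\n".join(parser.out)

# to_markdown("<h1>Title</h1><p>Body</p>") -> "# Title\n\nBody"
```

In the real pipeline the resulting markdown is then split into segments and embedded, so scraped pages become searchable chunks.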