
Webscraper workflow

The webscraper workflow ingests HTML content from websites that publish a sitemap. Canonical chain:

sitemap_parser → url_expander → web_scraper → web_content_storage

Rate limiting is adaptive per domain; there is no hardcoded requests-per-second limit.

```yaml
version: "1.0"
routes:
  sitemap_scraper:
    name: "Sitemap Web Scraper"
    inbound:
      queue: "/queue/webscraper.sitemap"
      subscription: "sitemap-processors"
      concurrency: 5
      prefetch: 10
    adapters:
      - type: "sitemap_parser"
        config:
          max_urls: 1000
      - type: "url_expander"
      - type: "web_scraper"
        config:
          follow_redirects: true
          timeout: 30
      - type: "web_content_storage"
init_message:
  route: "sitemap_scraper"
  payload:
    sitemap_url: "https://example.com/sitemap.xml"
```
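At its core, the sitemap_parser step fetches the sitemap XML and extracts its `<loc>` entries, capped at `max_urls`. A minimal sketch of that extraction using only the standard library (the adapter's real implementation, fetching, and sitemap-index recursion are not shown):

```python
import xml.etree.ElementTree as ET

# Sitemaps use this XML namespace for both urlset and sitemapindex documents.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def extract_urls(sitemap_xml: str, max_urls: int = 1000) -> list[str]:
    """Extract <loc> URLs from a sitemap document, capped at max_urls."""
    root = ET.fromstring(sitemap_xml)
    urls = [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]
    return urls[:max_urls]

sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc></url>
  <url><loc>https://example.com/b</loc></url>
</urlset>"""

print(extract_urls(sitemap))  # ['https://example.com/a', 'https://example.com/b']
```

For a sitemap index, the same `<loc>` scan yields child sitemap URLs rather than page URLs; the adapter recurses into those.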
| Type | Purpose |
| --- | --- |
| `sitemap_parser` | Fetch a sitemap, extract URLs; supports sitemap indexes |
| `url_expander` | Fan out: one incoming URL list → one outgoing message per URL |
| `web_scraper` | Fetch one URL, return HTML + metadata |
| `web_crawler` | Fetch with JS rendering via crawl4ai (requires Chromium) |
| `web_content_storage` | Persist HTML + metadata to storage |

See the adapter catalog for each adapter’s full config shape.
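The url_expander fan-out is a pure one-to-many transform: one message carrying a URL list becomes one message per URL, each inheriting the rest of the parent message. A sketch of that semantics (the message shape here is illustrative, not the adapter's actual schema):

```python
def expand(message: dict) -> list[dict]:
    """Fan out one URL-list message into one message per URL."""
    # Everything except the list itself is carried over to each child message.
    meta = {k: v for k, v in message.items() if k != "urls"}
    return [{**meta, "url": u} for u in message["urls"]]

batch = {"route": "sitemap_scraper",
         "urls": ["https://example.com/a", "https://example.com/b"]}
print(expand(batch))
# [{'route': 'sitemap_scraper', 'url': 'https://example.com/a'},
#  {'route': 'sitemap_scraper', 'url': 'https://example.com/b'}]
```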

Rate limiting default: a per-domain token bucket that probes capacity via AIMD (see the rate limiting guide). No configuration is needed unless you need per-URL quotas; in that case, register a custom strategy via RateLimitStrategyRegistry.
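The AIMD probing amounts to: grow the per-domain rate additively while requests succeed, cut it multiplicatively when the server pushes back (429/503). A simplified sketch of that control loop (the constants are illustrative; the real strategy lives behind RateLimitStrategyRegistry):

```python
class AIMDRate:
    """Additive-increase / multiplicative-decrease rate controller for one domain."""

    def __init__(self, rate: float = 1.0, increase: float = 0.1,
                 decrease: float = 0.5, max_rate: float = 50.0):
        self.rate = rate          # current requests/second for this domain
        self.increase = increase  # additive step on success
        self.decrease = decrease  # multiplicative factor on throttling
        self.max_rate = max_rate

    def on_response(self, status: int) -> float:
        if status in (429, 503):
            # Server is throttling: back off hard, but keep a small floor.
            self.rate = max(0.1, self.rate * self.decrease)
        else:
            # Success: probe for more capacity, up to the ceiling.
            self.rate = min(self.max_rate, self.rate + self.increase)
        return self.rate
```

A token bucket then refills at `rate` tokens per second for that domain, spending one token per request.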

Some sites require JS execution to produce useful content. Use web_crawler instead of web_scraper:

```yaml
- type: "web_crawler"
  config:
    wait_for_selector: "article.content"
    timeout: 60
```

Requires Chromium. In embedded mode, Chromium auto-installs; in AWS deployments, use the unclecode/crawl4ai image.

Webscraper output feeds the markdown workflow: HTML → markdown → segments → embeddings.
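The first hop of that pipeline (HTML → markdown) can be pictured with the standard library's HTMLParser; the real conversion uses a dedicated converter, and this sketch deliberately handles headings and paragraphs only:

```python
from html.parser import HTMLParser

class MiniMarkdown(HTMLParser):
    """Toy HTML-to-markdown converter: headings and paragraphs only."""

    def __init__(self):
        super().__init__()
        self.lines: list[str] = []
        self.prefix = None  # markdown prefix for the next text chunk

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self.prefix = "#" * int(tag[1]) + " "
        elif tag == "p":
            self.prefix = ""

    def handle_data(self, data):
        if self.prefix is not None and data.strip():
            self.lines.append(self.prefix + data.strip())
            self.prefix = None

    def handle_endtag(self, tag):
        self.prefix = None  # ignore text outside the tags we track

def html_to_markdown(html: str) -> str:
    parser = MiniMarkdown()
    parser.feed(html)
    return "\n\n".join(parser.lines)

print(html_to_markdown("<h1>Title</h1><p>Body text.</p>"))
# → "# Title\n\nBody text."
```

Downstream, the markdown workflow splits that output into segments and computes embeddings per segment.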