Webcrawler workflow

For sites where web_scraper isn’t enough — client-side rendered pages, single-page applications, content behind JS-driven interactions — use the web_crawler adapter from the webscraper package. It drives a real Chromium browser via crawl4ai.

Use web_crawler when:

  • web_scraper returns nearly-empty HTML (an SPA skeleton)
  • Content appears only after specific selectors load
  • The site blocks non-browser user agents
A minimal route configuration:

```yaml
version: "1.0"
routes:
  crawler:
    inbound:
      queue: "/queue/webcrawler.urls"
      subscription: "crawler-processors"
      concurrency: 2          # browser is heavy
    adapters:
      - type: "web_crawler"
        config:
          wait_for_selector: "article.content"
          timeout: 60
          viewport_width: 1280
      - type: "web_content_storage"
init_message:
  route: "crawler"
  payload:
    url: "https://example.com/spa-page"
```

crawl4ai requires:

  • Chromium — auto-installed in embedded mode; in prod, use the unclecode/crawl4ai base image
  • Disk space for browser caches (~500MB)
  • More memory than web_scraper (~500MB per concurrent browser)

One web_crawler instance ≈ one Chromium tab. Keep concurrency low (2–4 per CPU core). For high-volume scraping, consider whether web_scraper with careful retry logic would suffice.

A common pattern: try the cheap path first, fall back to the expensive one:

```yaml
routes:
  cheap_scraper:
    adapters:
      - type: "web_scraper"
      - type: "content_size_filter"   # drop suspiciously-empty results
      - type: "web_content_storage"
  heavy_scraper:
    adapters:
      - type: "web_crawler"
      - type: "web_content_storage"
```

Route URLs that fall through the size filter into the heavy route (via a condition or explicit fan-out).
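The fall-through check itself can be as simple as measuring how much visible text the cheap path produced. Here is a stdlib-only sketch of that idea — the threshold, the function name, and the assumption that content_size_filter behaves roughly like this are all illustrative, not the package's actual implementation:

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> bodies."""
    def __init__(self):
        super().__init__()
        self._skip = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def needs_browser(html: str, min_text_chars: int = 200) -> bool:
    """True if a page looks like an SPA skeleton: lots of markup, little text."""
    parser = _TextExtractor()
    parser.feed(html)
    text = "".join(parser.chunks).strip()
    return len(text) < min_text_chars
```

URLs for which a check like this returns True would be re-published to the heavy route's queue (e.g. /queue/webcrawler.urls); everything else stays on the cheap path.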