Webcrawler workflow

For sites where web_scraper isn’t enough — client-side rendered pages, single-page applications, content behind JS-driven interactions — use the web_crawler adapter from the webscraper package. It drives a real Chromium browser via crawl4ai.

Use web_crawler when:

  • web_scraper returns nearly-empty HTML (an SPA skeleton)
  • Content appears only after specific selectors load
  • The site blocks non-browser user agents
A minimal route configuration:

```yaml
version: "1.0"
routes:
  crawler:
    inbound:
      queue: "/queue/webcrawler.urls"
      subscription: "crawler-processors"
      concurrency: 2          # browser is heavy
    adapters:
      - type: "web_crawler"
        config:
          wait_for_selector: "article.content"
          timeout: 60
          viewport_width: 1280
      - type: "web_content_storage"
init_message:
  route: "crawler"
  payload:
    url: "https://example.com/spa-page"
```

crawl4ai requires:

  • Chromium — auto-installed in embedded mode; in prod, use the unclecode/crawl4ai base image
  • Disk space for browser caches (~500MB)
  • More memory than web_scraper (~500MB per concurrent browser)

One web_crawler instance ≈ one Chromium tab. Keep concurrency low (2–4 per CPU core). For high-volume scraping, consider whether web_scraper with careful retry logic would suffice.

A common pattern: try the cheap path first, fall back to the expensive one:

```yaml
routes:
  cheap_scraper:
    adapters:
      - type: "web_scraper"
      - type: "content_size_filter"   # drop suspiciously-empty results
      - type: "web_content_storage"
  heavy_scraper:
    adapters:
      - type: "web_crawler"
      - type: "web_content_storage"
```

Route URLs that fall through the size filter into the heavy route (via a condition or explicit fan-out).
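The fall-through check itself can be as simple as measuring how much visible text the cheap path produced. Here is a stdlib-only sketch of that idea — the threshold, the function name, and the assumption that content_size_filter behaves roughly like this are all illustrative, not the package's actual implementation:

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> bodies."""
    def __init__(self):
        super().__init__()
        self._skip = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def needs_browser(html: str, min_text_chars: int = 200) -> bool:
    """True if a page looks like an SPA skeleton: lots of markup, little text."""
    parser = _TextExtractor()
    parser.feed(html)
    text = "".join(parser.chunks).strip()
    return len(text) < min_text_chars
```

URLs for which a check like this returns True would be re-published to the heavy route's queue (e.g. /queue/webcrawler.urls); everything else stays on the cheap path.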