# Webcrawler workflow
For sites where `web_scraper` isn’t enough — client-side rendered pages, single-page applications, content behind JS-driven interactions — use the `web_crawler` adapter from the webscraper package. It drives Chromium via crawl4ai.
## When to use

- `web_scraper` returns nearly-empty HTML (SPA skeleton)
- Content appears only after specific selectors load
- The site blocks non-browser user agents
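The first symptom can be checked programmatically. A minimal sketch of such a heuristic — the function name, tag stripping, and 200-character threshold are illustrative assumptions, not part of the package:

```python
import re

def looks_like_spa_skeleton(html: str, min_text_chars: int = 200) -> bool:
    """Heuristic: strip scripts, styles, and tags, then check how much
    visible text remains. A near-empty body usually means the page is
    rendered client-side and needs a real browser."""
    body = re.sub(r"(?is)<(script|style)[^>]*>.*?</\1>", "", html)
    text = re.sub(r"(?s)<[^>]+>", " ", body)
    text = re.sub(r"\s+", " ", text).strip()
    return len(text) < min_text_chars

# A typical SPA shell: one mount point, all content loaded by JS.
skeleton = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
# A server-rendered page with real text.
article = "<html><body><article>" + "Real content. " * 30 + "</article></body></html>"
```

A cheap check like this is how you would decide, after a `web_scraper` fetch, whether the URL is worth re-queuing to the crawler route.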
## Minimal pipeline

```yaml
version: "1.0"

routes:
  crawler:
    inbound:
      queue: "/queue/webcrawler.urls"
      subscription: "crawler-processors"
    concurrency: 2 # browser is heavy
    adapters:
      - type: "web_crawler"
        config:
          wait_for_selector: "article.content"
          timeout: 60
          viewport_width: 1280
      - type: "web_content_storage"

init_message:
  route: "crawler"
  payload:
    url: "https://example.com/spa-page"
```
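Messages injected into the route mirror the `init_message` shape above. A sketch of building them per URL in Python — the `make_init_message` helper is hypothetical, with the dict shape inferred from the YAML:

```python
def make_init_message(url: str, route: str = "crawler") -> dict:
    """Build a message matching the init_message shape in the pipeline
    config: a target route plus a payload carrying the URL to crawl."""
    return {"route": route, "payload": {"url": url}}

# One message per URL to enqueue on /queue/webcrawler.urls.
msgs = [make_init_message(u) for u in (
    "https://example.com/spa-page",
    "https://example.com/other-page",
)]
```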
## Dependencies

crawl4ai requires:

- Chromium — auto-installed in embedded mode; in production, use the `unclecode/crawl4ai` base image
- Disk space for browser caches (~500 MB)
- More memory than `web_scraper` (~500 MB per concurrent browser)
## Performance

One `web_crawler` instance ≈ one Chromium tab. Keep concurrency low (2–4 per CPU core). For high-volume scraping, consider whether `web_scraper` with careful retry logic would suffice.
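The 2–4 tabs-per-core guideline can be turned into a small sizing helper. A sketch — the function and the memory-motivated cap are assumptions, not part of the package:

```python
import os

def crawler_concurrency(tabs_per_core: int = 2, cap: int = 8) -> int:
    """Size web_crawler concurrency from the CPU count, following the
    2-4 tabs per core guideline, with a hard cap to bound memory use
    (each concurrent browser costs roughly 500 MB)."""
    cores = os.cpu_count() or 1
    return max(1, min(cores * tabs_per_core, cap))
```

Feeding the result into the route's `concurrency` field keeps browser memory predictable across differently sized hosts.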
## Combine with web_scraper

A common pattern: try the cheap path first, fall back to the expensive one:
```yaml
routes:
  cheap_scraper:
    adapters:
      - type: "web_scraper"
      - type: "content_size_filter" # drop suspiciously-empty results
      - type: "web_content_storage"

  heavy_scraper:
    adapters:
      - type: "web_crawler"
      - type: "web_content_storage"
```

Route URLs that fell through the size filter into the heavy route (via a condition or explicit fan-out).
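The explicit fan-out can be sketched as a pure function over the cheap route's results — the function name and the 200-character threshold are illustrative assumptions:

```python
def fan_out(results: dict[str, int], min_chars: int = 200) -> list[str]:
    """Given url -> extracted-text length from the cheap_scraper route,
    return the URLs that fell through the size filter and should be
    re-queued to the heavy_scraper route."""
    return [url for url, n in results.items() if n < min_chars]

# Page "b" came back nearly empty, so only it pays the browser cost.
needs_browser = fan_out({
    "https://example.com/a": 5000,
    "https://example.com/b": 12,
})
```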
## Related

- Webscraper workflow — the lightweight path
- factflow-webscraper reference