
Crawler

The Quant Crawler creates static snapshots of websites, storing all content and assets in the Quant static edge. It supports both standard HTTP crawling and JavaScript-enabled headless browser mode for dynamic sites.

Common use cases:

  • Static archive: Preserve a website before decommissioning
  • Failover/disaster recovery: Keep a fresh static copy for emergency fallback
  • Revision history: Track all content changes via the revisions viewer
  • Static serving: Crawl daily and serve public traffic from the static edge
  • CMS integration: Crawl WordPress, Drupal, or other CMS sites without plugins

To create a crawler configuration:

  1. Select your project in the dashboard
  2. Navigate to Crawler > Configs
  3. Click Add
  4. Enter a descriptive name and the domain to crawl (e.g., https://www.example.com)
  5. Enable Headless browser if your site requires JavaScript to render
  6. Click Save

From Crawler > Configs, click Run > All URLs to start a crawl.

Monitor progress from Crawler > Runs. The crawler processes pages in parallel and reports discovered vs. crawled page counts.

After completion, preview your static copy on the project’s preview domain (found in the Domains section).

Setting | Description
Name | Descriptive name for this configuration
Domain | Full URL including protocol (e.g., https://www.example.com)
Headless browser | Enable JavaScript rendering via headless Chrome

These options control how the crawler discovers and processes pages.

Setting | Default | Description
Workers | 2 | Concurrent requests (1-20). More workers = faster crawls.
Delay | 4s | Seconds between requests per worker. Reduce for faster crawls.
Depth | -1 | Link depth limit. -1 = unlimited, 0 = starting pages only.
Max pages | 50* | Maximum HTML pages to crawl. 0 = unlimited.
Max requests | 0 | Total request limit (HTML + assets). 0 = unlimited.

*Unverified domains are limited to 50 pages. Verify your domain to unlock higher limits.
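
To get a feel for how Workers and Delay interact, the sketch below computes a rough lower bound on crawl time. The function name and the timing model are illustrative assumptions (pages split evenly across workers, response time ignored), not a description of the crawler's actual scheduler.

```python
import math


def estimate_crawl_seconds(pages: int, workers: int = 2, delay: float = 4.0) -> float:
    """Rough lower bound on crawl time, ignoring response and processing time.

    Assumes (hypothetically) that pages are split evenly across workers and
    that each worker pauses `delay` seconds between consecutive requests.
    """
    if pages <= 0:
        return 0.0
    if not 1 <= workers <= 20:
        raise ValueError("workers must be between 1 and 20")
    requests_per_worker = math.ceil(pages / workers)
    # A worker waits between requests, so n requests incur n - 1 delays.
    return (requests_per_worker - 1) * delay


print(estimate_crawl_seconds(50))                         # defaults: 2 workers, 4 s delay
print(estimate_crawl_seconds(50, workers=20, delay=1.0))  # a verified-domain style setup
```

Under these assumptions the default settings give about 96 seconds of waiting for 50 pages, while 20 workers with a 1-second delay give about 2 seconds. Real crawls take longer because each request also has to be fetched and stored.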

Setting | Description
Starting URLs | Paths to begin crawling from (e.g., /, /products/). Crawler discovers links from these pages.
Individual URLs | Specific paths to crawl. When set, the crawler only visits these URLs (no link discovery).
Exclude patterns | URL patterns to skip (e.g., /admin/*, *?sessionid=*)
Include patterns | Only crawl URLs matching these patterns
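
The exclude and include patterns above can be tested locally before a crawl. The sketch below assumes that `*` matches any run of characters, that every other character (including `?`) is literal, and that exclude patterns win over include patterns. The exact matching rules are not stated here, so treat `glob_to_regex` and `should_crawl` as hypothetical helpers.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Compile a pattern where `*` matches any run of characters.

    Every other character, including `?`, is matched literally. This is an
    assumed interpretation, not the crawler's documented matching rules.
    """
    parts = (re.escape(part) for part in pattern.split("*"))
    return re.compile(".*".join(parts))


def should_crawl(url_path: str, include: list[str], exclude: list[str]) -> bool:
    """Return True if the path passes the exclude and include filters."""
    if any(glob_to_regex(p).fullmatch(url_path) for p in exclude):
        return False
    # An empty include list is assumed to mean "no restriction".
    return not include or any(glob_to_regex(p).fullmatch(url_path) for p in include)


exclude = ["/admin/*", "*?sessionid=*"]
print(should_crawl("/products/shoes", [], exclude))         # True
print(should_crawl("/admin/users", [], exclude))            # False
print(should_crawl("/cart?sessionid=abc", [], exclude))     # False
print(should_crawl("/blog/post", ["/products/*"], exclude)) # False
```

Running candidate paths through a check like this is a cheap way to catch a pattern that is broader or narrower than intended.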

Setting | Description
Headers | Custom HTTP headers to send with requests (one per line, Header-Name: value)
User agent | Custom user agent string (non-browser mode only)
Status codes | Acceptable response codes (default: 200). Comma-separated list.
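
The Headers and Status codes fields take simple text formats. As a sanity check before pasting values into the dashboard, the sketch below parses both formats; `parse_headers` and `parse_status_codes` are hypothetical helper names, and how the dashboard itself validates input is not covered here.

```python
def parse_headers(text: str) -> dict[str, str]:
    """Parse one `Header-Name: value` entry per line into a dict."""
    headers: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first colon only, so values may contain colons (URLs).
        name, sep, value = line.partition(":")
        if not sep or not name.strip():
            raise ValueError(f"Malformed header line: {line!r}")
        headers[name.strip()] = value.strip()
    return headers


def parse_status_codes(text: str) -> set[int]:
    """Parse a comma-separated list such as "200, 301" into a set of ints."""
    return {int(code) for code in text.split(",") if code.strip()}


print(parse_headers("X-Env: staging\nReferer: https://www.example.com/"))
print(parse_status_codes("200, 301, 404"))
```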

Enable crawling across multiple domains for sites with assets on CDNs or subdomains:

Setting | Description
Allowed domains | Additional domains to crawl (e.g., cdn.example.com, assets.example.com)

When allowed domains are set, assets from those domains are included in the static snapshot.
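
To illustrate the scoping rule, the sketch below decides whether a discovered URL belongs to the snapshot. It assumes exact hostname matching against the configured domain and the Allowed domains list. Whether the crawler also matches subdomains or wildcards is not stated here, and `is_in_scope` is a hypothetical name.

```python
from urllib.parse import urlparse


def is_in_scope(url: str, primary_domain: str, allowed_domains: list[str]) -> bool:
    """Return True if the URL's host is the crawled domain or an allowed domain."""
    host = (urlparse(url).hostname or "").lower()
    primary_host = (urlparse(primary_domain).hostname or "").lower()
    allowed = {domain.lower() for domain in allowed_domains}
    # Exact host comparison: an assumption, subdomains are not implied.
    return host == primary_host or host in allowed


allowed = ["cdn.example.com", "assets.example.com"]
print(is_in_scope("https://cdn.example.com/app.js", "https://www.example.com", allowed))   # True
print(is_in_scope("https://other.example.net/x.js", "https://www.example.com", allowed))   # False
```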

Setting | Default | Description
Max errors | 100 | Stop crawling after this many errors. 0 = continue regardless of errors.

Enable headless browser mode for sites that:

  • Render content with JavaScript (React, Vue, Angular, etc.)
  • Use client-side routing (SPAs)
  • Load content dynamically via AJAX
  • Have JavaScript-dependent navigation

When enabled, the crawler:

  • Waits for JavaScript to execute before capturing content
  • Intercepts network requests to capture dynamically-loaded assets
  • Renders pages as they appear in a real browser

In browser mode, enable Execute JavaScript to:

  • Wait for dynamic content to load
  • Capture AJAX-loaded resources
  • Process JavaScript-driven navigation

This adds processing time but ensures complete page capture for dynamic sites.

Advanced options (workers, delay, page limits) require domain verification, which proves you control the domain and helps prevent abuse. Two verification methods are available:

DNS record: Add a TXT record to your domain’s DNS.

File verification: Place a file with a specific token at a defined URL path.

Instructions appear in the dashboard when you click on the domain in the Domain verification column of your crawler configuration.

Without verification:

  • Maximum 50 pages per crawl
  • Maximum 20 starting URLs
  • Fixed 2 workers with 4-second delay
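
The limits above can be modelled as a clamp applied to whatever settings were requested. The sketch below is only an illustration of the listed numbers: the `CrawlSettings` class is hypothetical, and whether the service silently clamps values or rejects them is an assumption.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CrawlSettings:
    """Hypothetical container for the settings affected by verification."""
    workers: int = 2
    delay: float = 4.0
    max_pages: int = 50
    starting_urls: tuple[str, ...] = ("/",)


def apply_unverified_limits(settings: CrawlSettings) -> CrawlSettings:
    """Clamp settings to the limits listed for unverified domains."""
    # Max pages 0 means unlimited, which an unverified domain cannot use.
    max_pages = 50 if settings.max_pages == 0 else min(settings.max_pages, 50)
    return replace(
        settings,
        workers=2,                                    # fixed at 2 workers
        delay=4.0,                                    # fixed 4-second delay
        max_pages=max_pages,                          # at most 50 pages
        starting_urls=settings.starting_urls[:20],    # at most 20 starting URLs
    )


requested = CrawlSettings(workers=10, delay=1.0, max_pages=0,
                          starting_urls=tuple(f"/section-{i}/" for i in range(30)))
effective = apply_unverified_limits(requested)
print(effective.workers, effective.delay, effective.max_pages, len(effective.starting_urls))
```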

Receive notifications when crawler events occur.

Setting | Description
Webhook URL | Endpoint to receive POST notifications
Authorization header | Optional auth header value (e.g., Bearer token123)
Extra variables | Additional key-value pairs to include in webhook payload

The crawler sends POST requests for these events:

  • Crawl started: Job has begun processing
  • Crawl completed: All pages processed successfully
  • Crawl failed: Job terminated with errors
  • Page indexed: Individual page successfully stored (when tracking enabled)

Example payload:

{
  "event": "crawl_completed",
  "config_name": "Production site",
  "domain": "https://www.example.com",
  "pages_crawled": 150,
  "pages_errored": 2,
  "started_at": "2024-01-15T10:00:00Z",
  "completed_at": "2024-01-15T10:15:30Z"
}
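
A receiving endpoint might handle this payload roughly as sketched below. The function is framework-agnostic and takes the request headers and raw body. It assumes the configured Authorization header value arrives in an `Authorization` request header, and `EXPECTED_AUTH` and `process_webhook` are hypothetical names. Only the `crawl_completed` fields shown in the example payload are relied on.

```python
import hmac
import json

# Hypothetical secret; it should match the Authorization header setting.
EXPECTED_AUTH = "Bearer token123"


def process_webhook(headers: dict[str, str], body: bytes) -> tuple[int, str]:
    """Validate and summarise a crawler webhook; returns (HTTP status, message)."""
    supplied = headers.get("Authorization", "")
    # compare_digest avoids leaking the secret through timing differences.
    if not hmac.compare_digest(supplied.encode(), EXPECTED_AUTH.encode()):
        return 401, "unauthorized"
    try:
        payload = json.loads(body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return 400, "invalid JSON"
    if not isinstance(payload, dict):
        return 400, "expected a JSON object"
    event = payload.get("event")
    if event == "crawl_completed":
        crawled = payload.get("pages_crawled", 0)
        errored = payload.get("pages_errored", 0)
        return 200, f"{payload.get('domain')}: {crawled} pages crawled, {errored} errors"
    # Other event names are not shown in the example payload, so just acknowledge them.
    return 200, f"received event: {event}"


sample = json.dumps({"event": "crawl_completed", "domain": "https://www.example.com",
                     "pages_crawled": 150, "pages_errored": 2}).encode()
print(process_webhook({"Authorization": "Bearer token123"}, sample))
```

The return value maps directly onto an HTTP response, so the function can be called from whichever web framework hosts the endpoint. Header lookup here is case-sensitive, so normalise header names first if your framework does not.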

Automate crawls to keep your static snapshot up to date.

  1. Navigate to Crawler > Schedules
  2. Click Add
  3. Select a crawler configuration
  4. Set the schedule (hourly, daily, weekly, or custom cron expression)
  5. Click Save

Pattern | Description
Hourly | Run every hour on the hour
Daily | Run once per day at a specified time
Weekly | Run once per week on a specified day and time
Custom | Cron expression for advanced scheduling (e.g., 0 2 * * * for 2:00 AM daily)
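
To show how a five-field cron expression such as `0 2 * * *` reads, the sketch below checks whether a given time matches an expression. It is deliberately simplified: each field must be `*` or a single integer, and the time zone the scheduler uses is not covered here. `cron_matches` is a hypothetical helper, not part of the product.

```python
from datetime import datetime


def cron_matches(expression: str, moment: datetime) -> bool:
    """Check whether `moment` matches a five-field cron expression.

    Simplified: each field must be `*` or a single integer. Ranges, lists,
    and step values (e.g. `1-5`, `1,15`, `*/10`) are not handled.
    """
    fields = expression.split()
    if len(fields) != 5:
        raise ValueError("expected five fields: minute hour day month weekday")
    actual = (
        moment.minute,
        moment.hour,
        moment.day,
        moment.month,
        moment.isoweekday() % 7,  # cron counts Sunday as 0
    )
    return all(f == "*" or int(f) == value for f, value in zip(fields, actual))


print(cron_matches("0 2 * * *", datetime(2024, 1, 15, 2, 0)))   # True: 02:00
print(cron_matches("0 2 * * *", datetime(2024, 1, 15, 10, 0)))  # False: 10:00
```

The fields are, in order, minute, hour, day of month, month, and day of week, so `0 2 * * *` means minute 0 of hour 2 on every day.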

By default, the crawler checks /sitemap.xml and follows all URLs found there. Sitemaps are processed recursively (sitemap index files are supported).

To disable sitemap processing or specify a different sitemap location, adjust the configuration via the API.
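
The recursive sitemap processing described above can be pictured with the following sketch, which follows the standard sitemaps.org XML format. It is not the crawler's implementation: `collect_urls` and the in-memory `DOCS` mapping are hypothetical, and `fetch` stands in for whatever HTTP client retrieves the XML.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def collect_urls(sitemap_url, fetch, seen=None):
    """Return page URLs from a sitemap, following sitemap index entries.

    `fetch` is a caller-supplied function mapping a URL to XML text, so the
    sketch stays independent of any particular HTTP client.
    """
    seen = set() if seen is None else seen
    if sitemap_url in seen:
        return []  # guard against index files that reference each other
    seen.add(sitemap_url)
    root = ET.fromstring(fetch(sitemap_url))
    urls = []
    # A <sitemapindex> lists child sitemaps; a <urlset> lists pages.
    for loc in root.findall("sm:sitemap/sm:loc", NS):
        if loc.text:
            urls.extend(collect_urls(loc.text.strip(), fetch, seen))
    for loc in root.findall("sm:url/sm:loc", NS):
        if loc.text:
            urls.append(loc.text.strip())
    return urls


XMLNS = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
DOCS = {  # stand-in for real HTTP responses
    "https://www.example.com/sitemap.xml":
        f"<sitemapindex {XMLNS}><sitemap>"
        "<loc>https://www.example.com/sitemap-pages.xml</loc>"
        "</sitemap></sitemapindex>",
    "https://www.example.com/sitemap-pages.xml":
        f"<urlset {XMLNS}>"
        "<url><loc>https://www.example.com/</loc></url>"
        "<url><loc>https://www.example.com/products/</loc></url>"
        "</urlset>",
}
print(collect_urls("https://www.example.com/sitemap.xml", DOCS.__getitem__))
```

Here the index file points at one child sitemap, and the result is the two page URLs listed in that child.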

Problem: Pages are missing from the crawl.

Cause: Links aren’t in the HTML or require JavaScript to render.

Solutions:

  • Enable headless browser mode
  • Provide a sitemap
  • List pages manually in Individual URLs

Problem: Assets are missing from the snapshot.

Cause: Assets are on a different domain.

Solutions:

  • Add the domain to Allowed domains
  • Ensure the domain is accessible from the crawler

Problem: The crawl stops before all pages are processed.

Cause: Page limit reached or too many errors.

Solutions:

  • Verify your domain to increase page limits
  • Check Max errors setting
  • Review error logs in the run details

Problem: JavaScript-rendered content is missing.

Cause: Browser mode is not enabled, or JavaScript did not finish executing.

Solutions:

  • Enable Headless browser mode
  • Enable Execute JavaScript option
  • Check for JavaScript errors in the page

Problem: Crawls are slow.

Solutions:

  • Verify your domain to increase workers (up to 20)
  • Reduce delay between requests
  • Ensure the origin server can handle the load

Crawler configurations can also be managed via the Quant API. See the API documentation for details on programmatic access.

Role | Capabilities
Developer | View configs, edit existing, trigger runs, view history
Administrator | All of the above, plus create/delete configs
Organization owner | All capabilities
Content manager | No crawler access
Read only | View configs and history only