Crawler

The Quant Crawler creates static snapshots of websites, storing all content and assets in the Quant static edge. It supports both standard HTTP crawling and JavaScript-enabled headless browser mode for dynamic sites.

Common use cases:

  • Static archive: Preserve a website before decommissioning
  • Failover/disaster recovery: Keep a fresh static copy for emergency fallback
  • Revision history: Track all content changes via the revisions viewer
  • Static serving: Crawl daily and serve public traffic from the static edge
  • CMS integration: Crawl WordPress, Drupal, or other CMS sites without plugins
To create a crawler configuration:

  1. Select your project in the dashboard
  2. Navigate to Crawler > Configs
  3. Click Add
  4. Enter a descriptive name and the domain to crawl (e.g., https://www.example.com)
  5. Enable Headless browser if your site requires JavaScript to render
  6. Click Save

From Crawler > Configs, click Run > All URLs to start a crawl.

Monitor progress from Crawler > Runs. The crawler processes pages in parallel and reports discovered vs. crawled page counts.

After completion, preview your static copy on the project’s preview domain (found in the Domains section).

Each crawler configuration starts with these basic settings:

| Setting | Description |
| --- | --- |
| Name | Descriptive name for this configuration |
| Domain | Full URL including protocol (e.g., https://www.example.com) |
| Headless browser | Enable JavaScript rendering via headless Chrome |

These options control how the crawler discovers and processes pages.

| Setting | Default | Description |
| --- | --- | --- |
| Workers | 2 | Concurrent requests (1-20). More workers = faster crawls. |
| Delay | 4s | Seconds between requests per worker. Reduce for faster crawls. |
| Depth | -1 | Link depth limit. -1 = unlimited, 0 = starting pages only. |
| Max pages | 50* | Maximum HTML pages to crawl. 0 = unlimited. |
| Max requests | 0 | Total request limit (HTML + assets). 0 = unlimited. |

*Unverified domains are limited to 50 pages. Verify your domain to unlock higher limits.
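As a rough illustration of how workers and delay interact (my own arithmetic, not a product guarantee): with W workers each waiting D seconds between requests, throughput is about W/D pages per second, ignoring download time.

```python
def estimated_crawl_seconds(pages: int, workers: int = 2, delay_s: float = 4.0) -> float:
    """Rough lower bound on crawl duration: each worker fetches
    roughly one page per delay interval. Real crawls take longer,
    since this ignores download and rendering time.
    """
    pages_per_second = workers / delay_s
    return pages / pages_per_second

# Unverified defaults (2 workers, 4 s delay): 50 pages take at least ~100 s.
print(estimated_crawl_seconds(50))             # 100.0
# A verified domain with 20 workers and a 1 s delay: 1000 pages in ~50 s.
print(estimated_crawl_seconds(1000, 20, 1.0))  # 50.0
```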

These settings control which URLs the crawler visits:

| Setting | Description |
| --- | --- |
| Starting URLs | Paths to begin crawling from (e.g., /, /products/). The crawler discovers links from these pages. |
| Individual URLs | Specific paths to crawl. When set, the crawler only visits these URLs (no link discovery). |
| Exclude patterns | URL patterns to skip (e.g., /admin/*, *?sessionid=*) |
| Include patterns | Only crawl URLs matching these patterns |
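
Assuming glob-style wildcard semantics (an assumption on my part; the crawler's exact matcher isn't specified here), the example exclude patterns behave like this sketch:

```python
from fnmatch import fnmatch

# Illustrative matcher mirroring the example patterns above; the crawler's
# real wildcard semantics may differ (glob-style matching is an assumption).
EXCLUDE = ["/admin/*", "*?sessionid=*"]

def is_excluded(path: str) -> bool:
    """True when the path matches any exclude pattern."""
    return any(fnmatch(path, pattern) for pattern in EXCLUDE)

print(is_excluded("/admin/login"))             # True
print(is_excluded("/products/?sessionid=42"))  # True
print(is_excluded("/products/widget"))         # False
```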
These options customize the HTTP requests the crawler sends:

| Setting | Description |
| --- | --- |
| Headers | Custom HTTP headers to send with requests (one per line, Header-Name: value) |
| User agent | Custom user agent string (non-browser mode only) |
| Status codes | Acceptable response codes (default: 200). Comma-separated list. |
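
The one-per-line `Header-Name: value` format can be parsed like this sketch (my own illustration of the format, not the product's parser):

```python
def parse_header_lines(raw: str) -> dict[str, str]:
    """Parse one 'Header-Name: value' pair per line, skipping blank lines."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers

raw = """X-Crawl-Token: secret123
Accept-Language: en-AU"""
print(parse_header_lines(raw))
# {'X-Crawl-Token': 'secret123', 'Accept-Language': 'en-AU'}
```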

Enable crawling across multiple domains for sites with assets on CDNs or subdomains:

| Setting | Description |
| --- | --- |
| Allowed domains | Additional domains to crawl (e.g., cdn.example.com, assets.example.com) |

When allowed domains are set, assets from those domains are included in the static snapshot.
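
Conceptually, an asset stays in scope when its host is the primary domain or one of the allowed domains. A minimal sketch of that check, using the example hosts above (my own illustration, not the crawler's code):

```python
from urllib.parse import urlparse

PRIMARY = "www.example.com"
ALLOWED = {"cdn.example.com", "assets.example.com"}

def in_scope(url: str) -> bool:
    """True when the URL's host is the primary domain or an allowed domain."""
    host = urlparse(url).hostname
    return host == PRIMARY or host in ALLOWED

print(in_scope("https://cdn.example.com/logo.png"))    # True
print(in_scope("https://tracker.example.net/px.gif"))  # False
```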

Error tolerance is controlled by a single setting:

| Setting | Default | Description |
| --- | --- | --- |
| Max errors | 100 | Stop crawling after this many errors. 0 = continue regardless of errors. |

Enable headless browser mode for sites that:

  • Render content with JavaScript (React, Vue, Angular, etc.)
  • Use client-side routing (SPAs)
  • Load content dynamically via AJAX
  • Have JavaScript-dependent navigation

When enabled, the crawler:

  • Waits for JavaScript to execute before capturing content
  • Intercepts network requests to capture dynamically-loaded assets
  • Renders pages as they appear in a real browser

In browser mode, enable Execute JavaScript to:

  • Wait for dynamic content to load
  • Capture AJAX-loaded resources
  • Process JavaScript-driven navigation

This adds processing time but ensures complete page capture for dynamic sites.

Advanced options (workers, delay, page limits) require domain verification. This proves you control the domain and prevents abuse.

Two verification methods are available:

DNS record: Add a TXT record to your domain’s DNS.

File verification: Place a file with a specific token at a defined URL path.

Instructions appear in the dashboard when you click on the domain in the Domain verification column of your crawler configuration.

Without verification:

  • Maximum 50 pages per crawl
  • Maximum 20 starting URLs
  • Fixed 2 workers with 4-second delay

Receive notifications when crawler events occur.

| Setting | Description |
| --- | --- |
| Webhook URL | Endpoint to receive POST notifications |
| Authorization header | Optional auth header value (e.g., Bearer token123) |
| Extra variables | Additional key-value pairs to include in webhook payload |

The crawler sends POST requests for these events:

  • Crawl started: Job has begun processing
  • Crawl completed: All pages processed successfully
  • Crawl failed: Job terminated with errors
  • Page indexed: Individual page successfully stored (when tracking enabled)
Example payload:

    {
      "event": "crawl_completed",
      "config_name": "Production site",
      "domain": "https://www.example.com",
      "pages_crawled": 150,
      "pages_errored": 2,
      "started_at": "2024-01-15T10:00:00Z",
      "completed_at": "2024-01-15T10:15:30Z"
    }

Automate crawls to keep your static snapshot up to date.

  1. Navigate to Crawler > Schedules
  2. Click Add
  3. Select a crawler configuration
  4. Set the schedule (hourly, daily, weekly, or custom cron expression)
  5. Save

Schedule patterns:

| Pattern | Description |
| --- | --- |
| Hourly | Run every hour on the hour |
| Daily | Run once per day at a specified time |
| Weekly | Run once per week on a specified day and time |
| Custom | Cron expression for advanced scheduling (e.g., 0 2 * * * for 2am daily) |
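
A five-field cron expression reads minute, hour, day-of-month, month, day-of-week. A small sketch decoding the `0 2 * * *` example, assuming standard cron syntax:

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")

def describe_cron(expr: str) -> dict[str, str]:
    """Map each field of a standard five-field cron expression to its value."""
    values = expr.split()
    if len(values) != len(CRON_FIELDS):
        raise ValueError("expected 5 fields")
    return dict(zip(CRON_FIELDS, values))

print(describe_cron("0 2 * * *"))
# {'minute': '0', 'hour': '2', 'day_of_month': '*', 'month': '*', 'day_of_week': '*'}
```

So `0 2 * * *` fires at minute 0 of hour 2, every day of every month: 2am daily, as the table notes.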

By default, the crawler checks /sitemap.xml and follows all URLs found there. Sitemaps are processed recursively (sitemap index files are supported).
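
Recursive processing means sitemap index files that point at child sitemaps are expanded too. A sketch of that idea over in-memory XML (my own illustration, not the crawler's implementation):

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def extract_urls(xml_text: str, fetch) -> list[str]:
    """Collect page URLs, recursing into <sitemap> entries of an index file.
    `fetch` maps a sitemap URL to its XML text (a real crawler would do HTTP).
    """
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.findall("sm:sitemap/sm:loc", NS):  # sitemap index entry
        urls.extend(extract_urls(fetch(loc.text), fetch))
    for loc in root.findall("sm:url/sm:loc", NS):      # ordinary page entry
        urls.append(loc.text)
    return urls

child = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
</urlset>"""
index = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.example.com/child.xml</loc></sitemap>
</sitemapindex>"""
print(extract_urls(index, fetch=lambda url: child))
# ['https://www.example.com/']
```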

To disable sitemap processing or specify a different sitemap location, adjust the configuration via the API.

Pages missing from the crawl

Cause: Links aren’t in the HTML or require JavaScript to render.

Solutions:

  • Enable headless browser mode
  • Provide a sitemap
  • List pages manually in Individual URLs

Assets missing from the snapshot

Cause: Assets are on a different domain.

Solutions:

  • Add the domain to Allowed domains
  • Ensure the domain is accessible from the crawler

Crawl stops before finishing

Cause: Page limit reached or too many errors.

Solutions:

  • Verify your domain to increase page limits
  • Check Max errors setting
  • Review error logs in the run details

JavaScript content not captured

Cause: Browser mode not enabled or JS not fully executed.

Solutions:

  • Enable Headless browser mode
  • Enable Execute JavaScript option
  • Check for JavaScript errors in the page

Crawls are slow

Cause: Conservative worker and delay settings (the unverified default is 2 workers with a 4-second delay).

Solutions:

  • Verify your domain to increase workers (up to 20)
  • Reduce delay between requests
  • Ensure the origin server can handle the load

Crawler configurations can also be managed via the Quant API. See the API documentation for details on programmatic access.
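
As a hedged sketch of what programmatic access might look like (the base URL, endpoint path, and token below are placeholders of my own, not documented Quant API values; consult the API documentation for the real ones):

```python
import json
import urllib.request

# Placeholder values; the real endpoint and credentials come from the Quant API docs.
API_BASE = "https://api.example.com"  # hypothetical base URL
TOKEN = "your-api-token"              # hypothetical credential

def build_trigger_request(config_name: str) -> urllib.request.Request:
    """Build (but do not send) a POST that would trigger a crawl run."""
    payload = json.dumps({"config": config_name}).encode()
    return urllib.request.Request(
        f"{API_BASE}/crawlers/run",   # hypothetical path
        data=payload,
        headers={"Authorization": f"Bearer {TOKEN}",
                 "Content-Type": "application/json"},
        method="POST",
    )

req = build_trigger_request("Production site")
print(req.get_method(), req.full_url)  # POST https://api.example.com/crawlers/run
```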

Crawler access depends on your role:

| Role | Capabilities |
| --- | --- |
| Developer | View configs, edit existing, trigger runs, view history |
| Administrator | All of the above, plus create/delete configs |
| Organization owner | All capabilities |
| Content manager | No crawler access |
| Read only | View configs and history only |