Crawler
The Quant Crawler creates static snapshots of websites, storing all content and assets in the Quant static edge. It supports both standard HTTP crawling and JavaScript-enabled headless browser mode for dynamic sites.
Use cases
- Static archive: Preserve a website before decommissioning
- Failover/disaster recovery: Keep a fresh static copy for emergency fallback
- Revision history: Track all content changes via the revisions viewer
- Static serving: Crawl daily and serve public traffic from the static edge
- CMS integration: Crawl WordPress, Drupal, or other CMS sites without plugins
Getting started
Create a crawler configuration
- Select your project in the dashboard
- Navigate to Crawler > Configs
- Click Add
- Enter a descriptive name and the domain to crawl (e.g., https://www.example.com)
- Enable Headless browser if your site requires JavaScript to render
- Click Save
Run the crawler
From Crawler > Configs, click Run > All URLs to start a crawl.
Monitor progress from Crawler > Runs. The crawler processes pages in parallel and reports discovered vs. crawled page counts.
View results
After completion, preview your static copy on the project’s preview domain (found in the Domains section).
Configuration options
Basic settings
| Setting | Description |
|---|---|
| Name | Descriptive name for this configuration |
| Domain | Full URL including protocol (e.g., https://www.example.com) |
| Headless browser | Enable JavaScript rendering via headless Chrome |
Crawl behaviour
These options control how the crawler discovers and processes pages.
| Setting | Default | Description |
|---|---|---|
| Workers | 2 | Concurrent requests (1-20). More workers = faster crawls. |
| Delay | 4s | Seconds between requests per worker. Reduce for faster crawls. |
| Depth | -1 | Link depth limit. -1 = unlimited, 0 = starting pages only. |
| Max pages | 50* | Maximum HTML pages to crawl. 0 = unlimited. |
| Max requests | 0 | Total request limit (HTML + assets). 0 = unlimited. |
*Unverified domains are limited to 50 pages. Verify your domain to unlock higher limits.
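As a rough back-of-envelope check (not an official formula), a crawl's minimum duration is about pages × delay ÷ workers, since each worker waits out the delay between its own requests and asset fetches and render time add on top:

```python
def estimate_crawl_seconds(pages: int, delay: float = 4.0, workers: int = 2) -> float:
    """Rough lower bound on crawl duration: each worker fetches one
    page, then waits `delay` seconds before its next request."""
    if workers < 1:
        raise ValueError("workers must be >= 1")
    return pages * delay / workers

# 50 pages at the unverified defaults (2 workers, 4s delay)
print(estimate_crawl_seconds(50))        # 100.0 seconds
# The same 50 pages after verification, with 10 workers and a 1s delay
print(estimate_crawl_seconds(50, 1, 10)) # 5.0 seconds
```

Lowering the delay shifts load onto your origin server, so balance crawl speed against what the origin can handle.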
URL configuration
| Setting | Description |
|---|---|
| Starting URLs | Paths to begin crawling from (e.g., /, /products/). Crawler discovers links from these pages. |
| Individual URLs | Specific paths to crawl. When set, the crawler only visits these URLs (no link discovery). |
| Exclude patterns | URL patterns to skip (e.g., /admin/*, *?sessionid=*) |
| Include patterns | Only crawl URLs matching these patterns |
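The exact matching rules are the crawler's own, but patterns like /admin/* and *?sessionid=* behave like shell-style wildcards. As an illustrative sketch, Python's fnmatch models that behaviour:

```python
from fnmatch import fnmatch

EXCLUDE = ["/admin/*", "*?sessionid=*"]

def is_excluded(path: str) -> bool:
    """Return True if the path matches any exclude pattern."""
    # Note: fnmatch treats '?' as a single-character wildcard, which
    # still matches the literal '?' in a query string.
    return any(fnmatch(path, pat) for pat in EXCLUDE)

print(is_excluded("/admin/users"))            # True
print(is_excluded("/products/?sessionid=42")) # True
print(is_excluded("/products/widget"))        # False
```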
Request configuration
| Setting | Description |
|---|---|
| Headers | Custom HTTP headers to send with requests (one per line, Header-Name: value) |
| User agent | Custom user agent string (non-browser mode only) |
| Status codes | Acceptable response codes (default: 200). Comma-separated list. |
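To illustrate the one-per-line Header-Name: value format (the X-Crawl-Token header below is a made-up example), a sketch of how such a block parses into individual headers:

```python
def parse_headers(text: str) -> dict[str, str]:
    """Parse 'Header-Name: value' lines (one per line) into a dict,
    splitting on the first colon and ignoring blank lines."""
    headers = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers

config = "X-Crawl-Token: secret123\nAccept-Language: en-AU"
print(parse_headers(config))
# {'X-Crawl-Token': 'secret123', 'Accept-Language': 'en-AU'}
```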
Multi-domain crawling
Enable crawling across multiple domains for sites with assets on CDNs or subdomains:
| Setting | Description |
|---|---|
| Allowed domains | Additional domains to crawl (e.g., cdn.example.com, assets.example.com) |
When allowed domains are set, assets from those domains are included in the static snapshot.
Error handling
| Setting | Default | Description |
|---|---|---|
| Max errors | 100 | Stop crawling after this many errors. 0 = continue regardless of errors. |
Headless browser mode
Enable headless browser mode for sites that:
- Render content with JavaScript (React, Vue, Angular, etc.)
- Use client-side routing (SPAs)
- Load content dynamically via AJAX
- Have JavaScript-dependent navigation
When enabled, the crawler:
- Waits for JavaScript to execute before capturing content
- Intercepts network requests to capture dynamically-loaded assets
- Renders pages as they appear in a real browser
Execute JavaScript option
In browser mode, enable Execute JavaScript to:
- Wait for dynamic content to load
- Capture AJAX-loaded resources
- Process JavaScript-driven navigation
This adds processing time but ensures complete page capture for dynamic sites.
Domain verification
Advanced options (workers, delay, page limits) require domain verification. This proves you control the domain and prevents abuse.
Verification methods
DNS record: Add a TXT record to your domain’s DNS.
File verification: Place a file with a specific token at a defined URL path.
Instructions appear in the dashboard when you click on the domain in the Domain verification column of your crawler configuration.
Unverified domain limits
Without verification:
- Maximum 50 pages per crawl
- Maximum 20 starting URLs
- Fixed 2 workers with 4-second delay
Webhooks
Receive notifications when crawler events occur.
| Setting | Description |
|---|---|
| Webhook URL | Endpoint to receive POST notifications |
| Authorization header | Optional auth header value (e.g., Bearer token123) |
| Extra variables | Additional key-value pairs to include in webhook payload |
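A minimal receiving endpoint only needs to accept a JSON POST and check the configured Authorization header. This stdlib-only Python sketch (the token value is a placeholder) illustrates the idea:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Must match the Authorization header configured for the webhook.
EXPECTED_AUTH = "Bearer token123"

def accept_event(auth_header, body: bytes):
    """Return the parsed payload if the auth header matches, else None."""
    if auth_header != EXPECTED_AUTH:
        return None
    return json.loads(body)

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        event = accept_event(self.headers.get("Authorization"), body)
        if event is None:
            self.send_response(401)  # reject unauthenticated callers
        else:
            print("crawler event:", event.get("event"))
            self.send_response(204)  # acknowledge with no body
        self.end_headers()

# To run locally (blocks forever):
# HTTPServer(("", 8000), WebhookHandler).serve_forever()
```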
Webhook events
The crawler sends POST requests for these events:
- Crawl started: Job has begun processing
- Crawl completed: All pages processed successfully
- Crawl failed: Job terminated with errors
- Page indexed: Individual page successfully stored (when tracking enabled)
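On the consumer side, the started_at and completed_at timestamps in a completed-crawl payload can be turned into a run duration; a small sketch:

```python
from datetime import datetime

payload = {
    "event": "crawl_completed",
    "pages_crawled": 150,
    "pages_errored": 2,
    "started_at": "2024-01-15T10:00:00Z",
    "completed_at": "2024-01-15T10:15:30Z",
}

def crawl_duration_seconds(event: dict) -> float:
    """Duration of a completed crawl, from its ISO-8601 timestamps.
    The Z suffix is rewritten for compatibility with older Pythons."""
    start = datetime.fromisoformat(event["started_at"].replace("Z", "+00:00"))
    end = datetime.fromisoformat(event["completed_at"].replace("Z", "+00:00"))
    return (end - start).total_seconds()

print(crawl_duration_seconds(payload))  # 930.0 (15 minutes 30 seconds)
```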
Example webhook payload
```json
{
  "event": "crawl_completed",
  "config_name": "Production site",
  "domain": "https://www.example.com",
  "pages_crawled": 150,
  "pages_errored": 2,
  "started_at": "2024-01-15T10:00:00Z",
  "completed_at": "2024-01-15T10:15:30Z"
}
```
Scheduling
Automate crawls to keep your static snapshot up to date.
Create a schedule
- Navigate to Crawler > Schedules
- Click Add
- Select a crawler configuration
- Set the schedule (hourly, daily, weekly, or custom cron expression)
- Save
Schedule patterns
| Pattern | Description |
|---|---|
| Hourly | Run every hour on the hour |
| Daily | Run once per day at a specified time |
| Weekly | Run once per week on a specified day and time |
| Custom | Cron expression for advanced scheduling (e.g., 0 2 * * * for 2am daily) |
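To sanity-check a custom expression, recall that the five cron fields are minute, hour, day-of-month, month, and day-of-week. This simplified matcher (plain numbers and * only; real cron also supports ranges, lists, and steps) shows why 0 2 * * * means 2am daily:

```python
from datetime import datetime

def cron_matches(expr: str, when: datetime) -> bool:
    """Check a datetime against a 5-field cron expression.
    Simplified sketch: each field is either '*' or a single number."""
    fields = expr.split()
    values = [when.minute, when.hour, when.day, when.month,
              when.isoweekday() % 7]  # cron uses 0 = Sunday
    return all(f == "*" or int(f) == v for f, v in zip(fields, values))

print(cron_matches("0 2 * * *", datetime(2024, 1, 15, 2, 0)))   # True
print(cron_matches("0 2 * * *", datetime(2024, 1, 15, 14, 0)))  # False
```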
Sitemap support
By default, the crawler checks /sitemap.xml and follows all URLs found there. Sitemaps are processed recursively (sitemap index files are supported).
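The recursive processing can be pictured as: fetch a sitemap, collect its loc entries, and if the root element is a sitemapindex rather than a urlset, fetch each entry as another sitemap. An illustrative sketch (not the crawler's actual implementation) using Python's xml.etree:

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def extract_locs(xml_text: str) -> tuple[str, list[str]]:
    """Return ('urlset' or 'sitemapindex', list of <loc> values).
    A urlset entry is a page URL to crawl; a sitemapindex entry
    points at another sitemap to fetch and parse recursively."""
    root = ET.fromstring(xml_text)
    kind = root.tag.split("}")[-1]  # strip the XML namespace
    locs = [el.text.strip() for el in root.iter("{%s}loc" % SITEMAP_NS)]
    return kind, locs

sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/products/</loc></url>
</urlset>"""
print(extract_locs(sample))
# ('urlset', ['https://www.example.com/', 'https://www.example.com/products/'])
```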
To disable sitemap processing or specify a different sitemap location, adjust the configuration via the API.
Troubleshooting
Pages not discovered
Cause: Links aren’t in the HTML or require JavaScript to render.
Solutions:
- Enable headless browser mode
- Provide a sitemap
- List pages manually in Individual URLs
Assets missing from snapshot
Cause: Assets are on a different domain.
Solutions:
- Add the domain to Allowed domains
- Ensure the domain is accessible from the crawler
Crawl stops early
Cause: Page limit reached or too many errors.
Solutions:
- Verify your domain to increase page limits
- Check Max errors setting
- Review error logs in the run details
JavaScript content not captured
Cause: Browser mode not enabled or JS not fully executed.
Solutions:
- Enable Headless browser mode
- Enable Execute JavaScript option
- Check for JavaScript errors in the page
Slow crawl performance
Solutions:
- Verify your domain to increase workers (up to 20)
- Reduce delay between requests
- Ensure the origin server can handle the load
API access
Crawler configurations can also be managed via the Quant API. See the API documentation for details on programmatic access.
Permissions
| Role | Capabilities |
|---|---|
| Developer | View configs, edit existing, trigger runs, view history |
| Administrator | All of the above, plus create/delete configs |
| Organization owner | All capabilities |
| Content manager | No crawler access |
| Read only | View configs and history only |