Create a new crawler
POST /api/v2/organizations/{organization}/projects/{project}/crawlers
Authorizations
Parameters
Path Parameters
organization: Organization identifier
project: Project identifier
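As a minimal sketch of how the two path parameters fit into the endpoint, the helper below fills in the `{organization}` and `{project}` placeholders and percent-encodes them. The base host `https://api.example.com` and the function name `crawler_collection_url` are assumptions for illustration only; substitute the real API host.

```python
from urllib.parse import quote

# Assumption: placeholder host, not taken from this reference page.
API_BASE = "https://api.example.com"


def crawler_collection_url(organization: str, project: str, base: str = API_BASE) -> str:
    """Build the crawler collection URL for the given organization and project."""
    # Percent-encode each identifier so characters such as spaces or slashes
    # cannot change the structure of the path.
    org = quote(organization, safe="")
    proj = quote(project, safe="")
    return f"{base.rstrip('/')}/api/v2/organizations/{org}/projects/{proj}/crawlers"


if __name__ == "__main__":
    print(crawler_collection_url("my-org", "my-project"))
```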
Request Body (required)
object
Crawler name
Example: Test Crawler
Domain to crawl
Example: test-domain.com
Enable browser mode
URLs to crawl
Example: ["/", "/about", "/contact"]
Starting URLs for crawl
Example: ["/", "/blog"]
Custom headers
object
Example: { "Authorization": "Bearer token123", "X-Custom-Header": "value" }
URL patterns to exclude (regex)
Example: ["/admin/*", "/private/*"]
URL patterns to include (regex)
Example: ["/blog/*", "/products/*"]
Webhook URL for notifications
Example: https://example.com/webhook
Authorization header for webhook
Example: Bearer token123
Extra variables for webhook
Example: key1=value1&key2=value2
Number of concurrent workers (default: 2, non-default requires verification)
Example: 2
Delay between requests in seconds (default: 4, non-default requires verification)
Example: 4
Maximum crawl depth, -1 for unlimited
Example: -1
Maximum total requests, 0 for unlimited (default: 0, non-default requires verification)
Maximum HTML pages, 0 for unlimited (default: org limit, non-default requires verification)
Example: 50
HTTP status codes that will result in content being captured and pushed to Quant
Example: [200, 201]
Sitemap configuration
object
Sitemap URL
Example: /sitemap.xml
Recursively follow sitemap links
Example: true
Example (full sitemap configuration): [ { "url": "/sitemap.xml", "recursive": true } ]
Allowed domains for multi-domain crawling, automatically enables merge_domains
Example: ["example.com", "assets.example.com"]
Custom user agent, only when browser_mode is false
Example: Mozilla/5.0...
Asset harvesting configuration
object
Network intercept configuration for asset collection
object
Enable network intercept
Example: true
Request timeout in seconds
Example: 30
Execute JavaScript during asset collection
Parser configuration for asset extraction
object
Enable parser
Example: true
Example (full asset harvesting configuration): { "network_intercept": { "enabled": true, "timeout": 30, "execute_js": false }, "parser": { "enabled": true } }
Maximum errors before stopping crawl
Example: 100
Responses
200
The request has succeeded.
object
Crawler ID
Example: 456
Crawler name
Example: Test Crawler
Project ID
Example: 789
Crawler UUID
Example: 550e8400-e29b-41d4-a716-446655440000
Crawler configuration (YAML)
Example: "domain: test-domain.com\nconfig:\n max_html: 100\n browser_mode: false"
Crawler domain
Example: test-domain.com
Domain verification status
Example: 1
URLs list (YAML)
Example: "single_url:\n - /\n - /about\n - /contact"
Webhook URL for notifications
Example: https://example.com/webhook
Authorization header for webhook
Example: Bearer token123
Extra variables for webhook
Example: key1=value1&key2=value2
Browser mode enabled
Number of concurrent workers
Example: 2
Delay between requests in seconds
Example: 4
Maximum crawl depth
Example: -1
Maximum total requests
Maximum HTML pages
Example: 50
HTTP status codes for content capture
Example: [200]
Custom user agent
Example: Mozilla/5.0...
Maximum errors before stopping
Example: 100
Starting URLs
Example: ["/", "/blog"]
URLs list
Example: ["/", "/about"]
Custom headers
object
Example: { "Authorization": "Bearer token" }
URL patterns to exclude
Example: ["/admin/*"]
URL patterns to include
Example: ["/blog/*"]
Sitemap configuration
object
Sitemap URL
Example: /sitemap.xml
Recursively follow sitemap links
Example: true
Example (full sitemap configuration): [ { "url": "/sitemap.xml", "recursive": true } ]
Allowed domains
Example: ["example.com"]
Asset harvesting configuration
object
Network intercept configuration for asset collection
object
Enable network intercept
Example: true
Request timeout in seconds
Example: 30
Execute JavaScript during asset collection
Parser configuration for asset extraction
object
Enable parser
Example: true
Example (full asset harvesting configuration): { "network_intercept": { "enabled": true, "timeout": 30, "execute_js": false }, "parser": { "enabled": true } }
Creation timestamp
Example: 2024-01-20T09:15:00Z
Last update timestamp
Example: 2024-10-11T16:45:00Z
Deletion timestamp
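The timestamps in the 200 response use ISO 8601 with a trailing `Z`, and the deletion timestamp has no example value, which suggests it may be empty for a live crawler. A minimal sketch of parsing such values follows; the helper name `parse_timestamp` is hypothetical, and treating a missing value as `None` is an assumption rather than documented behavior.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_timestamp(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 timestamp such as '2024-01-20T09:15:00Z'.

    Returns None when the value is missing or empty (assumed to mean
    "not set", for example a crawler that has not been deleted).
    """
    if not value:
        return None
    # datetime.fromisoformat only accepts a trailing 'Z' from Python 3.11,
    # so normalize it to an explicit UTC offset for older versions.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


if __name__ == "__main__":
    created = parse_timestamp("2024-01-20T09:15:00Z")
    print(created, created.tzinfo == timezone.utc)
```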
400
The server could not understand the request due to invalid syntax.
object
Error message
Example: The requested resource was not found
Error flag
Example: true
403
Access is forbidden.
object
Error message
Example: The requested resource was not found
Error flag
Example: true
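Putting the request and the three documented responses together, the sketch below builds the POST request with the standard library and maps the 200, 400, and 403 status codes to a short description. Several details are assumptions, not taken from this page: the bearer-token authentication scheme (the Authorizations section lists no details), the JSON key names `name`, `domain`, and `message`, and all function names.

```python
import json
import urllib.error
import urllib.request


def build_create_request(url: str, token: str, payload: dict) -> urllib.request.Request:
    """Build the POST request that creates a crawler."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Assumption: bearer-token auth; check the real scheme.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def describe_result(status: int, body: dict) -> str:
    """Summarize the documented responses (200, 400, 403)."""
    # Assumption: the error message is stored under a "message" key.
    message = str(body.get("message", "no message"))
    if status == 200:
        return "created"
    if status == 400:
        return "bad request: " + message
    if status == 403:
        return "forbidden: " + message
    return f"unexpected status {status}"


def create_crawler(url: str, token: str, payload: dict) -> tuple:
    """Send the request and return (status, parsed JSON body)."""
    request = build_create_request(url, token, payload)
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, json.loads(response.read())
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx responses; the body is still readable.
        return err.code, json.loads(err.read() or b"{}")


# Hypothetical payload: key names are illustrative, values come from the examples.
EXAMPLE_PAYLOAD = {"name": "Test Crawler", "domain": "test-domain.com"}
```

Calling `create_crawler(url, token, EXAMPLE_PAYLOAD)` and passing the result to `describe_result` would then yield a one-line summary; no network call happens until `create_crawler` is invoked.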