Create a new crawler

Authorizations

BearerAuth

Path Parameters

organization

required

string

Organization identifier

Example

test-org

project

required

string

Project identifier

Example

test-project

Request Body

required

object

name

Crawler name

string

Test Crawler

domain

required

Domain to crawl

string

test-domain.com

browser_mode

Enable browser mode

boolean

urls

URLs to crawl

Array<string>

[
  "/",
  "/about",
  "/contact"
]

start_urls

Starting URLs for crawl

Array<string>

[
  "/",
  "/blog"
]

headers

Custom headers

object

key

additional properties

string

{
  "Authorization": "Bearer token123",
  "X-Custom-Header": "value"
}

exclude

URL patterns to exclude (regex)

Array<string>

[
  "/admin/*",
  "/private/*"
]

include

URL patterns to include (regex)

Array<string>

[
  "/blog/*",
  "/products/*"
]
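The include and exclude fields are described as regex, so the example patterns can behave differently from shell-style globs. The sketch below is only an illustration: the `is_selected` helper, the unanchored `re.search` matching, and the rule that exclude wins over include are all assumptions, not documented crawler behaviour.

```python
import re


def is_selected(path, include, exclude):
    """Hypothetical filter: unanchored regex search, exclude takes priority."""
    if any(re.search(pattern, path) for pattern in exclude):
        return False
    # An empty include list is treated here as "include everything".
    return not include or any(re.search(pattern, path) for pattern in include)


include = ["/blog/*", "/products/*"]
exclude = ["/admin/*", "/private/*"]

blog_ok = is_selected("/blog/post-1", include, exclude)
admin_ok = is_selected("/admin/users", include, exclude)
# In a regex, "*" repeats the preceding "/", so "/admin/*" also matches "/administrator".
lookalike_ok = is_selected("/administrator", [], exclude)
```

If only paths under `/admin/` should be excluded, an anchored pattern such as `^/admin/` is stricter under regex semantics.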

webhook_url

Webhook URL for notifications

string

https://example.com/webhook

webhook_auth_header

Authorization header for webhook

string

Bearer token123

webhook_extra_vars

Extra variables for webhook

string

key1=value1&key2=value2
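The example value looks like a URL query string. Assuming that encoding is what the API expects (the reference does not say so explicitly), the string can be built and decoded with the standard library:

```python
from urllib.parse import parse_qsl, urlencode

# Assumption: webhook_extra_vars uses standard query-string encoding.
extra_vars = urlencode({"key1": "value1", "key2": "value2"})

# Decoding the same string back into a dictionary.
decoded = dict(parse_qsl(extra_vars))
```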

workers

Number of concurrent workers (default: 2, non-default requires verification)

integer

>= 1 <= 20

delay

Delay between requests in seconds (default: 4, non-default requires verification)

number format: float

<= 10

depth

Maximum crawl depth, -1 for unlimited

integer

>= -1

-1
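The ranges listed for workers, delay, and depth can be checked on the client before a request is sent. This is a minimal sketch: the function name and messages are illustrative, and the server remains the authority on what it accepts.

```python
def validate_crawl_limits(workers=None, delay=None, depth=None):
    """Return a list of problems; an empty list means the values are in range.

    Ranges are copied from the field descriptions above. Fields left as
    None are skipped, since all three are optional.
    """
    problems = []
    if workers is not None and not (isinstance(workers, int) and 1 <= workers <= 20):
        problems.append("workers must be an integer from 1 to 20")
    if delay is not None and delay > 10:
        problems.append("delay must be at most 10 seconds")
    if depth is not None and not (isinstance(depth, int) and depth >= -1):
        problems.append("depth must be an integer >= -1")
    return problems


defaults_ok = validate_crawl_limits(workers=2, delay=4, depth=-1)
out_of_range = validate_crawl_limits(workers=21, delay=11, depth=-2)
```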

max_hits

Maximum total requests, 0 for unlimited (default: 0, non-default requires verification)

integer

0

max_html

Maximum HTML pages, 0 for unlimited (default: org limit, non-default requires verification)

integer

status_ok

HTTP status codes that will result in content being captured and pushed to Quant

Array<integer>

sitemap

Sitemap configuration

Array<object>

object

url

Sitemap URL

string

/sitemap.xml

recursive

Recursively follow sitemap links

boolean

true

[
  {
    "url": "/sitemap.xml",
    "recursive": true
  }
]

allowed_domains

Allowed domains for multi-domain crawling. Setting this field automatically enables merge_domains.

Array<string>

[
  "example.com",
  "assets.example.com"
]

user_agent

Custom user agent. Applies only when browser_mode is false.

string

Mozilla/5.0...

assets

Asset harvesting configuration

object

network_intercept

Network intercept configuration for asset collection

object

enabled

Enable network intercept

boolean

true

timeout

Request timeout in seconds

integer

execute_js

Execute JavaScript during asset collection

boolean

parser

Parser configuration for asset extraction

object

enabled

Enable parser

boolean

true

{
  "network_intercept": {
    "enabled": true,
    "timeout": 30,
    "execute_js": false
  },
  "parser": {
    "enabled": true
  }
}

max_errors

Maximum errors before stopping crawl

integer
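Putting the path parameters, bearer authorization, and body fields together, a request could be assembled roughly as follows. The base URL and route template are placeholders, since this page does not state them, and `build_create_crawler_request` is a hypothetical helper. The request is built but not sent.

```python
import json
import urllib.request
from urllib.parse import quote

# Assumption: placeholder host and route; substitute the real values for your API.
BASE_URL = "https://api.example.com"
ROUTE = "/organizations/{organization}/projects/{project}/crawlers"


def build_create_crawler_request(organization, project, token, body):
    """Build (but do not send) a POST request that creates a crawler."""
    if "domain" not in body:
        raise ValueError("'domain' is the only required body field")
    url = BASE_URL + ROUTE.format(
        organization=quote(organization, safe=""),
        project=quote(project, safe=""),
    )
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


request = build_create_crawler_request(
    "test-org",
    "test-project",
    "YOUR_TOKEN",
    {
        "name": "Test Crawler",
        "domain": "test-domain.com",
        "start_urls": ["/", "/blog"],
        "exclude": ["/admin/*"],
        "sitemap": [{"url": "/sitemap.xml", "recursive": True}],
    },
)
```

Passing `request` to `urllib.request.urlopen` would send it. Omitted optional fields are left to the server defaults described above.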

Responses

200

The request has succeeded.

object

id

required

Crawler ID

integer

name

Crawler name

string

Test Crawler

project_id

required

Project ID

integer

uuid

required

Crawler UUID

string

550e8400-e29b-41d4-a716-446655440000

config

required

Crawler configuration (YAML)

string

domain: test-domain.com\nconfig:\n  max_html: 100\n  browser_mode: false

domain

required

Crawler domain

string

test-domain.com

domain_verified

Domain verification status

integer

urls_list

URLs list (YAML)

string

single_url:\n  - /\n  - /about\n  - /contact

webhook_url

Webhook URL for notifications

string

https://example.com/webhook

webhook_auth_header

Authorization header for webhook

string

Bearer token123

webhook_extra_vars

Extra variables for webhook

string

key1=value1&key2=value2

browser_mode

Browser mode enabled

boolean

workers

Number of concurrent workers

integer

delay

Delay between requests in seconds

number format: float

depth

Maximum crawl depth

integer

-1

max_hits

Maximum total requests

integer

0

max_html

Maximum HTML pages

integer

status_ok

HTTP status codes for content capture

Array<integer>

[
  200
]

user_agent

Custom user agent

string

Mozilla/5.0...

max_errors

Maximum errors before stopping

integer

start_urls

Starting URLs

Array<string>

[
  "/",
  "/blog"
]

urls

URLs list

Array<string>

[
  "/",
  "/about"
]

headers

Custom headers

object

key

additional properties

string

{
  "Authorization": "Bearer token"
}

exclude

URL patterns to exclude

Array<string>

[
  "/admin/*"
]

include

URL patterns to include

Array<string>

[
  "/blog/*"
]

sitemap

Sitemap configuration

Array<object>

object

url

Sitemap URL

string

/sitemap.xml

recursive

Recursively follow sitemap links

boolean

true

[
  {
    "url": "/sitemap.xml",
    "recursive": true
  }
]

allowed_domains

Allowed domains

Array<string>

[
  "example.com"
]

assets

Asset harvesting configuration

object

network_intercept

Network intercept configuration for asset collection

object

enabled

Enable network intercept

boolean

true

timeout

Request timeout in seconds

integer

execute_js

Execute JavaScript during asset collection

boolean

parser

Parser configuration for asset extraction

object

enabled

Enable parser

boolean

true

{
  "network_intercept": {
    "enabled": true,
    "timeout": 30,
    "execute_js": false
  },
  "parser": {
    "enabled": true
  }
}

created_at

Creation timestamp

string format: date-time

2024-01-20T09:15:00Z

updated_at

Last update timestamp

string format: date-time

2024-10-11T16:45:00Z

deleted_at

Deletion timestamp

string format: date-time

nullable
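The three timestamp fields use the date-time format, and deleted_at may be null. A small sketch of parsing them with the standard library follows; the helper name is illustrative.

```python
from datetime import datetime, timezone


def parse_timestamp(value):
    """Parse a date-time string such as 2024-01-20T09:15:00Z; pass None through."""
    if value is None:
        return None
    # Older Python versions reject a trailing "Z", so normalise it to an offset.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


created = parse_timestamp("2024-01-20T09:15:00Z")
deleted = parse_timestamp(None)
```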

400

The server could not understand the request due to invalid syntax.

object

message

required

Error message

string

The requested resource was not found

error

required

Error flag

boolean

true

{
  "message": "The requested resource was not found",
  "error": true
}

403

Access is forbidden.

object

message

required

Error message

string

The requested resource was not found

error

required

Error flag

boolean

true

{
  "message": "The requested resource was not found",
  "error": true
}
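Both error responses share the same shape, a message string plus an error flag, so a client can handle them in one place. The exception class and function below are hypothetical names used for illustration only.

```python
import json


class CrawlerApiError(Exception):
    """Hypothetical exception carrying the HTTP status and the API message."""

    def __init__(self, status, message):
        super().__init__(f"{status}: {message}")
        self.status = status
        self.message = message


def raise_for_error(status, raw_body):
    """Return the decoded body on success; raise CrawlerApiError on 400 or 403."""
    payload = json.loads(raw_body)
    if status in (400, 403) or payload.get("error") is True:
        raise CrawlerApiError(status, payload.get("message", "unknown error"))
    return payload


crawler = raise_for_error(200, '{"id": 1, "domain": "test-domain.com"}')
```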
