Crawler options

Use robots.txt and "nofollow" directives
Select this option to make the program ignore links on pages that the robots.txt file forbids robots to index. It also makes the program ignore links marked "nofollow" in the Robots META tag, the X-Robots-Tag HTTP header, or a link's rel attribute.
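
Below is a minimal sketch, using only the Python standard library, of how a crawler can honor robots.txt and "nofollow" directives. It is an illustration, not the program's actual implementation, and the robots.txt URL is a placeholder:

```python
from urllib import robotparser
from html.parser import HTMLParser

# Fetch and parse the site's robots.txt (placeholder URL).
rp = robotparser.RobotFileParser("https://www.example.com/robots.txt")
rp.read()

def allowed_by_robots(url):
    # Pages disallowed here would be skipped when the option is on.
    return rp.can_fetch("*", url)

class NofollowAwareParser(HTMLParser):
    """Collects links, skipping rel="nofollow" anchors and flagging pages
    marked with <meta name="robots" content="nofollow">."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.page_nofollow = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and (a.get("name") or "").lower() == "robots" \
                and "nofollow" in (a.get("content") or "").lower():
            self.page_nofollow = True
        elif tag == "a" and a.get("href"):
            if "nofollow" not in (a.get("rel") or "").lower():
                self.links.append(a["href"])
```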

Only check the first URL
This option restricts the check to the page specified in the Start URL field. The program will not follow the links it finds on this page; it will only check their availability.

Ignore case when comparing links
Normally, links are compared case-sensitively. Select this option to make the program ignore letter case when comparing links.
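
A minimal sketch of the difference, assuming the option simply lowercases both URLs before comparing (the program's exact normalization may differ):

```python
def links_equal(a, b, ignore_case=False):
    # With the option on, letter case is ignored for the whole URL.
    if ignore_case:
        a, b = a.lower(), b.lower()
    return a == b

assert not links_equal("http://example.com/Page", "http://example.com/page")
assert links_equal("http://example.com/Page", "http://example.com/page",
                   ignore_case=True)
```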

Do not accept cookies from the scanned site
Do not accept cookies for the currently scanned site.

Do not ignore "www." prefix
When this option is selected, the program does not ignore the "www." prefix: for example, "example.com" and "www.example.com" will be treated as different sites.
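
A sketch of hostname normalization, under the assumption that the program strips a leading "www." by default and keeps it when this option is selected:

```python
from urllib.parse import urlparse

def normalized_host(url, keep_www=False):
    host = urlparse(url).hostname or ""
    if not keep_www and host.startswith("www."):
        host = host[len("www."):]   # default: treat both forms as one site
    return host

# Same site by default, different sites with the option on:
assert normalized_host("http://example.com/") == \
       normalized_host("http://www.example.com/")
assert normalized_host("http://example.com/", keep_www=True) != \
       normalized_host("http://www.example.com/", keep_www=True)
```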

Check files in parent directory
Normally, if you specify a start URL such as "http://www.example.com/subdir/", pages outside "subdir" will not be checked. With this option, the entire site is checked regardless of the start directory.
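
A sketch of how the start-directory restriction might be tested; the path comparison is an assumed implementation:

```python
from urllib.parse import urlparse
import posixpath

def in_start_dir(url, start_url):
    # Keep only URLs whose path lies under the start URL's directory.
    start_dir = posixpath.dirname(urlparse(start_url).path) + "/"
    return urlparse(url).path.startswith(start_dir)

start = "http://www.example.com/subdir/"
assert in_start_dir("http://www.example.com/subdir/page.html", start)
assert not in_start_dir("http://www.example.com/page.html", start)
```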

Do not send HEAD requests
When this option is selected, the program uses GET requests everywhere. This may be useful if your web server does not support HEAD requests for some reason.
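
A sketch of the default behavior this option disables: try a cheap HEAD request first and fall back to GET if the server rejects it. Standard library only; not the program's actual code:

```python
import urllib.request, urllib.error

def status_of(url, use_head=True):
    req = urllib.request.Request(url, method="HEAD" if use_head else "GET")
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        if use_head and e.code in (405, 501):
            # Server does not support HEAD; retry the check with GET.
            return status_of(url, use_head=False)
        return e.code
```

With the option selected, the equivalent of use_head=False would be forced for every request.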

Do not send referer headers
No referer headers will be sent.

Do not follow HTTP redirects
HTTP redirects will be ignored.
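
A sketch of one way to stop redirects from being followed with the standard library; returning None from the redirect handler makes urllib report the 3xx response as an HTTPError instead of fetching the target:

```python
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # do not build a follow-up request for the redirect

opener = urllib.request.build_opener(NoRedirect)
# opener.open(url) now surfaces 301/302/... responses directly
# rather than silently landing on the redirect target.
```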

Process all links in queue
When enabled, all links are processed through the queue. This option is not recommended for large websites, as it significantly slows down the scan by saving a large amount of usually unneeded data.

Do not check if Internet is down
When enabled, all connection-related errors are added to the scan results. Otherwise the program waits until the Internet connection is restored, then tries to recover from such errors.
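
A sketch, under assumed behavior, of pausing until connectivity returns instead of recording every connection error as a broken link; the probe host and interval are placeholders:

```python
import socket, time

def wait_for_internet(probe=("example.com", 80), interval=30):
    # Re-probe until a TCP connection succeeds, then resume the scan.
    while True:
        try:
            socket.create_connection(probe, timeout=5).close()
            return
        except OSError:
            time.sleep(interval)
```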

Slowdown factor
This option slows down the scanning process. You can use it to reduce server load during the scan.
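
One plausible reading of a slowdown factor, sketched below: sleep in proportion to how long the last request took, so a slow (loaded) server automatically gets longer pauses. The exact formula the program uses is an assumption here:

```python
import time

def fetch_politely(fetch, url, slowdown_factor=0.0):
    start = time.monotonic()
    result = fetch(url)                     # fetch() is a hypothetical helper
    elapsed = time.monotonic() - start
    time.sleep(elapsed * slowdown_factor)   # factor 0 disables the delay
    return result
```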

Max. directory level
When checking links on large sites, you can restrict the level of subdirectories to be checked. For example, if you specify "3" in this field while checking "www.example.com", the program will not check links on pages below the third directory level, so pages located in "www.example.com/level_1/level_2/level_3/" will not be checked. This lets you check large sites in parts.
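
A sketch of the level check under the interpretation above (pages whose directory depth reaches the limit are skipped):

```python
from urllib.parse import urlparse

def within_level(url, max_level):
    # Count directory segments in the path, excluding the file name.
    dirs = [p for p in urlparse(url).path.split("/")[:-1] if p]
    return len(dirs) < max_level

# With "3", pages in /level_1/level_2/ are still checked,
# but pages in /level_1/level_2/level_3/ are not:
assert within_level("http://www.example.com/level_1/level_2/page.html", 3)
assert not within_level(
    "http://www.example.com/level_1/level_2/level_3/page.html", 3)
```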

Max. link depth
To prevent endless loops, this option limits the number of links or redirects the program follows in sequence from the start URL.
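
A sketch of a depth limit in a breadth-first crawl: each queued link records how many hops it is from the start URL, and links past the limit are not followed. extract_links() is a hypothetical helper:

```python
from collections import deque

def crawl(start_url, extract_links, max_depth=20):
    seen, queue = {start_url}, deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # the depth cap is what breaks endless link loops
        for link in extract_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen
```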

Stop scan if found/processed/queued links exceed number
Stops the scan when the number of found, processed, or queued links exceeds the specified value.

Check only the following links
Skip all resources that do not match the specified wildcards. See Wildcard matching and Resource matching for the wildcard format; a combined sketch of this filter and the next one appears after the following option.

Skip the following links
Skip all resources that match the specified wildcards. See Wildcard matching and Resource matching for the wildcard format.
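
A combined sketch of both filters using fnmatch-style patterns; the program's actual wildcard syntax is the one described under Wildcard matching and Resource matching:

```python
from fnmatch import fnmatch

def should_check(url, only_patterns=(), skip_patterns=()):
    # "Check only the following links": at least one pattern must match.
    if only_patterns and not any(fnmatch(url, p) for p in only_patterns):
        return False
    # "Skip the following links": no pattern may match.
    return not any(fnmatch(url, p) for p in skip_patterns)

assert should_check("http://example.com/blog/post.html",
                    only_patterns=["*example.com*"],
                    skip_patterns=["*.pdf"])
assert not should_check("http://example.com/file.pdf",
                        skip_patterns=["*.pdf"])
```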

Status codes and content types

HTTP success codes
This list contains HTTP status codes that are treated as success codes.

CSS content types
Content types for CSS documents.

HTML content types
Content types for HTML documents.

Parser options

Default charset
The charset that is used when the page's charset cannot be detected automatically.

Always use default charset
The charset from the Default charset option will be used for all pages.
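
A sketch of how these two options might interact when decoding a page; detected stands for whatever charset the program found in the HTTP headers or the page itself:

```python
def decode_page(raw, detected, default_charset="utf-8",
                always_default=False):
    # Use the detected charset unless it is missing or overridden.
    charset = default_charset if (always_default or not detected) \
              else detected
    return raw.decode(charset, errors="replace")
```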

Ignore FORM tag
Ignore links from <form> tags.
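
A sketch of skipping form-related links during parsing; whether the option skips the <form action="..."> URL, links inside forms, or both is an assumption here:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects link targets; omits <form action="..."> URLs when
    ignore_forms is set."""
    def __init__(self, ignore_forms=False):
        super().__init__()
        self.ignore_forms = ignore_forms
        self.links = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and a.get("href"):
            self.links.append(a["href"])
        elif tag == "form" and a.get("action") and not self.ignore_forms:
            self.links.append(a["action"])
```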
