How does SiteSpect handle robots, bots, and crawlers?

Reverse-Proxy Implementations

For reverse-proxy implementations, SiteSpect identifies requests from robots and crawlers in two ways: a known list configured in the pass-through settings, and Automatic Robot Detection (ARD). Requests identified by either method are excluded from all Campaign data. If you are using the SiteSpect Engine API, the approach differs; you can find more information on this use case here.

ARD is a SiteSpect feature that provides very accurate detection of robots, crawlers, and other User-Agents that do not explicitly identify themselves as such. These are often referred to as cloaked robots since they masquerade as official browsers, and are not filtered by SiteSpect's explicit Pass-Through settings.

The ARD feature works by inserting a short block of JavaScript code into the first web page viewed by the user (or robot). When the page is viewed in a legitimate web browser operated by a human, the JavaScript code runs and signals to SiteSpect that the browser executed the code and accepted SiteSpect's tracking cookies. If SiteSpect inserts the JavaScript and does not receive the signal, it treats the visitor as a robot that should be filtered, because the visitor did not execute the JavaScript, did not pass back the tracking cookies, or both.
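The decision described above can be sketched in a few lines of Python. This is only an illustration of the logic, not SiteSpect's implementation: the Visit class, its field names, and is_probable_robot are hypothetical names introduced for this example, and treating a visitor as "not yet judged" before the snippet is inserted is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Visit:
    """Hypothetical record of what the server observed for one visitor."""
    snippet_inserted: bool = False  # detection JavaScript was added to the first page
    signal_received: bool = False   # the browser ran the JavaScript and called back
    cookies_returned: bool = False  # the tracking cookies came back with the callback


def is_probable_robot(visit: Visit) -> bool:
    """Return True when the visitor should be filtered as a robot.

    Assumption: a visitor is only judged once the snippet has been inserted.
    It counts as human when the signal arrived together with the cookies.
    """
    if not visit.snippet_inserted:
        return False  # nothing to judge yet
    return not (visit.signal_received and visit.cookies_returned)


browser = Visit(snippet_inserted=True, signal_received=True, cookies_returned=True)
crawler = Visit(snippet_inserted=True)  # never ran the JavaScript

print(is_probable_robot(browser))  # False
print(is_probable_robot(crawler))  # True
```

A cloaked robot that sends a browser-like User-Agent still ends up in the second case, because it never executes the inserted code.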

To use the ARD feature, which requires system administrator privileges:

  1. Select Site > Configuration > Site Settings.
  2. Select the User Tracking tab and scroll to Automatic Robot Detection settings.

Robot Detection Location

Robot Detection Location determines where you want to inject the JavaScript code. The choices are:

  • Prepend to top of page – This is the recommended (and default) setting, and accurately filters robots while still capturing even those users who quickly enter/exit a site (i.e., a bounce). The test code is inserted just after the opening of the <head> tag where available, or at the top of the page otherwise.
  • Append to bottom of page – This accurately filters robots and most rapid bounces, where a human user clicks through to a site, views only one page, then quickly presses the Back button to return to the referring page (typically a search engine). You may want to use this setting if you do not want rapid bounces counted towards your Campaign data. The test code is inserted just before the closing </body> tag where available, or at the end of the page otherwise.
  • Append to absolute bottom of page – Similar to the prior option, but only attempts to append the test code to the end of the page. This option is recommended only for Sites where the prior two options are problematic.
  • Off – ARD is disabled, and only those robots that are explicitly filtered by pass-through settings or HTTP Request Exclusions are caught. When ARD is disabled, its settings lower on this page are not visible.
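As a rough illustration of how the location options differ, the following Python sketch places a snippet into an HTML string. It is not SiteSpect's code: inject_snippet, the location keywords, and the sample snippet are hypothetical, and the sketch assumes that the "bottom" anchor is the closing </body> tag.

```python
import re


def inject_snippet(html: str, snippet: str, location: str = "prepend") -> str:
    """Insert snippet into html according to a (hypothetical) location keyword."""
    if location == "off":
        return html  # detection disabled: page is left untouched

    if location == "prepend":
        # Just after the opening <head> tag, else at the very top of the page.
        match = re.search(r"<head\b[^>]*>", html, flags=re.IGNORECASE)
        if match:
            return html[:match.end()] + snippet + html[match.end():]
        return snippet + html

    if location == "append":
        # Just before the last closing </body> tag (assumed), else at the end.
        matches = list(re.finditer(r"</body\s*>", html, flags=re.IGNORECASE))
        if matches:
            start = matches[-1].start()
            return html[:start] + snippet + html[start:]
        return html + snippet

    if location == "append_absolute":
        # Always the very end of the page, regardless of markup.
        return html + snippet

    raise ValueError(f"unknown location: {location!r}")


page = "<html><head><title>t</title></head><body><p>hi</p></body></html>"
snippet = "<script>/* detection code */</script>"

for option in ("prepend", "append", "append_absolute", "off"):
    print(option, "->", inject_snippet(page, snippet, option))
```

The earlier the snippet sits in the page, the sooner it can run, which is why the top-of-page option still captures visitors who leave almost immediately.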

Robot Detection Method

Select the method you want to use to inject SiteSpect's Automatic Robot Detection onto the page:

  • new Image() – This is a non-blocking method for injecting code onto the page using a new HTMLImageElement instance. While it does not block the drawing of the page or subsequent HTTP requests, it can block/delay the load event on the window.
  • AJAX () – This is a non-blocking method for injecting JavaScript code that allows a timeout (provided in milliseconds) to delay the AJAX call. When you select this option, a box opens where you can enter a number of milliseconds from 0 to 10000. The default is 0.
  • document.write() – We do not recommend this choice, since it is a blocking injection and will be deprecated soon.

Disable Robot Detection Header Name and Value

To tell SiteSpect not to perform robot detection on specific requests, use the Disable Robot Detection Header Name and Disable Robot Detection Header Value fields. When this header is present in a request, SiteSpect does not perform its usual bot check. Disabling robot detection is important for AJAX and mobile app requests, which normally never execute JavaScript, if you want SiteSpect to count them in Campaigns rather than mark them as bots.
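The header check can be pictured with the short Python sketch below. It is an assumption-laden illustration rather than SiteSpect's logic: the header name X-Disable-Robot-Detection, the value 1, and the function robot_detection_disabled are all hypothetical, and exact matching on the value is assumed. The only general fact relied on is that HTTP header names are case-insensitive.

```python
from typing import Mapping

# Hypothetical values, standing in for whatever is entered in the
# Disable Robot Detection Header Name / Value fields.
CONFIGURED_NAME = "X-Disable-Robot-Detection"
CONFIGURED_VALUE = "1"


def robot_detection_disabled(
    headers: Mapping[str, str],
    header_name: str = CONFIGURED_NAME,
    header_value: str = CONFIGURED_VALUE,
) -> bool:
    """Return True if the request carries the configured header and value."""
    for name, value in headers.items():
        # Header names are compared case-insensitively, values exactly (assumed).
        if name.lower() == header_name.lower():
            return value.strip() == header_value
    return False


# A mobile app or AJAX client would attach the header to each request.
app_request = {"User-Agent": "ExampleApp/2.0", "x-disable-robot-detection": "1"}
page_request = {"User-Agent": "Mozilla/5.0"}

print(robot_detection_disabled(app_request))   # True
print(robot_detection_disabled(page_request))  # False
```

Requests for which the function returns True would skip the JavaScript check and remain eligible for Campaign counting.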

IAB/ABC International Spiders and Bots List

In addition to the pass-through list and Automatic Robot Detection (ARD), SiteSpect also supports bot detection through a managed list from IAB. The level of IAB detection required depends on your SiteSpect configuration, so please speak to your SiteSpect Consultant or Account Manager, or email helpdesk@sitespect.com, to discuss enabling this.

The robots and crawlers that SiteSpect identifies using the methods above are automatically excluded from the usage calculation pertaining to Visit entitlements.