How to use search engine robots

    October 17, 2019

    How to configure and use search engine crawlers with either a robots.txt file or an .htaccess file.

    What you need 

    • FTP access to your Nexcess server. For details about how to use FileZilla, a popular FTP client, refer to How to use FileZilla.  
    • A Nexcess account in a physical (non-cloud) environment.

    Search engine robots in robots.txt 

    Locating 

    1. Using your preferred FTP client, navigate to your site's directory, then to its /html directory.
    2. Within this directory, locate the robots.txt file. If you are unable to locate this file, create a text file with the name robots.txt.

    Adding functions  

    The following sections describe the formatting for allowing or disallowing crawlers to access specific folders on your web site.

    ATTENTION: Search engine crawlers do not scan the robots.txt file each time they crawl your site, so changes to your robots.txt file might not be read by the search engine for up to a week.

    Blocking search-engine crawlers 

    If you are performing development work on your site and would prefer that search engines such as Google or Bing not crawl it, you can block your site from search engines.

    1. The first line of the robots.txt file will be User-agent: followed by the name of the search engine you want to block.
    2. On the next line, type Disallow: followed by the folders and files you want to block the bot from crawling. For example:
      User-agent: googlebot
      Disallow: /photos

    Allowing search engines to crawl specific folders of your site

    If you would like search engine crawlers to access only specific folders of your site, configure an allow rule.

    1. The first line of the robots.txt file will be User-agent:, followed by the name of the search engine crawler.
    2. On the next line, type Allow:, followed by the name of the folder you would like to allow the bot to crawl.
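
    For example, the following entry blocks Google's crawler from the entire site except a single folder. The /blog folder name is only an illustration; substitute your own. Note that Allow is not part of the original robots.txt standard, but major crawlers such as googlebot honor it:

      User-agent: googlebot
      Disallow: /
      Allow: /blog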

    Adding crawl delays for search-engine robots

    If your site is experiencing a large amount of traffic, and it appears to be caused by multiple search engine crawlers simultaneously visiting your site, configure a search engine crawler delay.

    ATTENTION: Adding a crawl delay to your robots.txt file is considered a non-standard entry, and some search engines do not abide by this rule. Check the documentation for the specific search engine you want to delay.

    1. The first line of your robots.txt addition will be User-agent: followed by the name of the search engine.

    2. The second line will be Crawl-delay: followed by a number between 1 and 30. This number is the delay, in seconds, that the crawling search engine should wait between requests to your site. If your site is being crawled by multiple bots simultaneously, consider adding a crawl delay of 10 seconds or more.
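
    For example, the following entry asks Bing's crawler to wait 10 seconds between requests. Bing is used here only as an illustration; be aware that not all search engines honor this directive:

      User-agent: bingbot
      Crawl-delay: 10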

    The following table is a list of search engines and their corresponding bot names:

    Search Engine        Search Bot Name
    Google               googlebot
    Bing                 bingbot
    Baidu                baiduspider
    MSN Bot              msnbot
    Yandex.ru            yandex
    All search engines   *

    For example, to block Google's bot from viewing your /photos folder, add the following lines to your robots.txt file:

    User-agent: googlebot
    Disallow: /photos

    Search engine robots in .htaccess

    Depending on the way your website is configured, your robots.txt file might not properly work with search engine crawlers. You can make changes to your .htaccess file instead.


    Locating .htaccess

    1. Using your preferred FTP client, navigate to your site's directory, then to its /html directory.

    2. Within this directory, locate the .htaccess file. If the file does not exist, create a text file with the name .htaccess.

    Adding functions  

    1. Once you have located or created the .htaccess file, open the file in your preferred text editor.

    2. If you are creating a new .htaccess file, the first line should be RewriteEngine On.

    3. If the .htaccess file already exists and you are editing it, ensure the RewriteEngine On line appears at the top of the file.

    4. The following line reads the visiting client's user agent string and matches it against the name provided in the .htaccess file. Ordinary visitors will not match this condition and therefore will not be blocked. Replace [crawler] with the name of the search engine bot.

      RewriteCond %{HTTP_USER_AGENT} ^[crawler]$ [NC]

    5. The final line tells the server what to do with a user agent that has been matched; in this example, it returns a 403 Forbidden response.
      RewriteRule .* - [R=403,L]

    For example, to block Yandex from crawling any pages of your site, the .htaccess file will look something like this:

    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} ^Yandex$ [NC]
    RewriteRule .* - [R=403,L]

    Adding crawl delays for search engine robots

    If adding a crawl delay to the robots.txt file was unsuccessful, add the following to your .htaccess file:

    SetEnvIf User-Agent [botname] GoAway=1 
    Order allow,deny 
    Allow from all 
    Deny from env=GoAway
    • The first line checks the user agent, where [botname] is the name of the bot:
      SetEnvIf User-Agent [botname] GoAway=1 
    • The second and third lines allow all traffic not matched by the first line:
      Order allow,deny
      Allow from all
    • The fourth line denies all traffic that matches the GoAway variable.
      Deny from env=GoAway
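
    As an illustration, if you wanted to turn away msnbot (the bot name here is only an example), the complete addition would read:

      SetEnvIf User-Agent msnbot GoAway=1
      Order allow,deny
      Allow from all
      Deny from env=GoAway

    Note that Order, Allow, and Deny are Apache 2.2 access-control directives; on Apache 2.4 they require the mod_access_compat module.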


    For 24-hour assistance any day of the year, contact our support team by email or through your Client Portal.

