
Robots.txt customization

Every account has its own limit on the number of pages that can be crawled. How do you make smart use of that limit, especially if you need certain pages to be crawled? The solution is simple: you can set rules that tell our crawler which pages you want scanned. All you need to do is specify the robots.txt rules in the Site Auditor settings section.


What is robots.txt?

The robots.txt file is one of the primary ways of telling crawlers where they can and can’t go on a website. To customize our bot, you should set certain rules.

How to customize robots.txt for our crawler?

There are common rules for all bots. Let's take a closer look!

Disallow: (left empty) or Allow: / - allows any crawler to scan the whole website without exceptions.

Disallow: / - restricts scanning of the entire website.

Disallow: /directory/file.html - restricts scanning of a specific file.

Disallow: /dir/ - our crawler won’t scan anything in that directory.

Disallow: /dir - restricts crawling of every URL whose path starts with /dir.
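
For example, a minimal robots.txt combining some of these directives might look like the sketch below; the paths are just the placeholders from the rules above, and the User-Agent: * line applies the group to any bot:

User-Agent: *
Disallow: /dir/
Disallow: /directory/file.html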

Also, you can specify a start point page. What does this mean? Our bot will start its work on your website from the page whose URL you have specified. You will find this especially useful if you have, for instance, a multilingual website or one with categories that don’t need to be scanned.


Let’s consider the website ‘example.com’, which is in English with French as a second version. You want our crawler to scan each version separately. In that case, you should specify rules in the robots.txt.

French version. First of all, enter the start point (the URL where the crawler should start scanning):

example.com/fr/

After that, specify rules like this:

Disallow: /

Allow: /fr/

In this case, only the pages allowed by the rule (those under /fr/) will be scanned.
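
As a complete standalone file, this French-only setup would look like the sketch below (the User-Agent: * line applies the rules to any bot; you can also target our crawler specifically, as shown later in this article):

User-Agent: *
Disallow: /
Allow: /fr/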

If you want to retrieve results only for the English version of the website (‘example.com’), you should set the following directive:

Disallow: /fr/

In this case, the French version will be ignored, and you won’t see data for /fr/ pages that might confuse you.
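
The corresponding standalone sketch for the English-only scan is simply:

User-Agent: *
Disallow: /fr/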

Another case is when you have dated content on your website, such as an infinite calendar. There may be thousands of pages, if not millions, that look like this:

/history/2016/12

/history/2016/11

/history/1900/11

If you want to disallow the ‘history’ category, you should create a rule like this:

Disallow: /history/

If you want to scan only specific years, such as 2015 and 2016, you have to add these directives:

Allow: /history/2016/

Allow: /history/2015/

With these permissions, the crawler will scan only the /history/2016/ and /history/2015/ paths from the /history/ directory.
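
Put together, this part of the robots.txt looks like the sketch below; the more specific Allow directives take precedence over the broader Disallow for crawlers that follow the standard longest-match behavior:

Disallow: /history/
Allow: /history/2016/
Allow: /history/2015/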

You can also allow or disallow scanning by our crawler in the same robots.txt file where you set the rules for all bots. All you need to do is specify directives for RSiteAuditor (our crawler).

Furthermore, you can set separate rules for our crawler, distinct from other crawling bots, using a simple User-Agent directive:

User-Agent: GoogleBot

Allow: /

User-Agent: RSiteAuditor # priority!

Disallow: /

If for some reason you have restricted access to your website for the RSiteAuditor bot and later want it to scan your website again, don’t forget to check that your robots.txt allows it to do so.
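
For instance, to explicitly re-allow our crawler while leaving your rules for other bots unchanged, you could add a group like this (a sketch):

User-Agent: RSiteAuditor
Allow: /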
