What Is A Robots.txt File?
Robots.txt is a text file containing a set of instructions that tell search engines how to crawl the pages on a website. It’s a useful way of ‘locking away’ certain pages, e.g. a user admin page or the shopping basket on an e-commerce website, from search engines such as Google, by using a Disallow crawl directive for either all user agents or specific ones.
What Does A Robots.txt File Look Like?
A basic e-commerce robots.txt file may look like this:
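The exact contents vary from site to site, but a file matching the Shopify-style example described below might look something like this (the comment and the specific paths are illustrative assumptions):

```
# robots.txt file for an example Shopify site

User-agent: *
Disallow: /admin
Disallow: /search
Disallow: /cart

Sitemap: https://www.example.com/sitemap.xml
```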
The top line starting with # is ignored by crawlers; it simply lets humans know that it’s a Shopify site. The file then says that all user agents, e.g. Googlebot, Bingbot and Baiduspider, can crawl every page on the website except for any URLs containing /admin, /search or /cart. The final directive tells the search engine crawlers where this website’s sitemap can be found, which makes it easier for the bots to find all of the website’s indexable pages.
How Robots.txt Works
To understand how robots.txt works, remember that search engines have two key jobs:
- Crawling the web to discover content.
- Indexing the crawled content so it can be displayed when users perform searches.
A robots.txt file influences the first of these steps.
Typically a bot uses a technique called spidering, crawling from one website to another by following each site’s links. Each time the bot discovers a new website, it first looks for the robots.txt file to find out where it can and can’t crawl, along with any crawl delays, before using these instructions to continue the crawl.
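Python’s standard library includes urllib.robotparser, which implements exactly this check. Here is a minimal sketch of a well-behaved bot’s first step; the example.com rules are assumptions for illustration:

```python
from urllib import robotparser

# Rules the bot has just downloaded from https://example.com/robots.txt
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /admin
Disallow: /cart
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Before requesting a URL, the bot asks whether its user agent may crawl it.
print(parser.can_fetch("Googlebot", "https://example.com/products/shoes"))  # True
print(parser.can_fetch("Googlebot", "https://example.com/admin/users"))     # False

# It can also honour the requested wait between requests (in seconds).
print(parser.crawl_delay("Googlebot"))  # 5
```

In a real crawler you would call `parser.set_url(...)` and `parser.read()` to fetch the live file instead of parsing a hard-coded string.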
Why You Need A Robots.txt File
Strictly speaking, a robots.txt file isn’t vital for smaller websites, but it’s nice to have: it gives you some degree of control over how and where bots crawl your website, which is especially useful for large websites looking to maximise crawl budget.
A robots.txt file can help with the following:
- Preventing search engines from wasting crawl budget.
- Preventing a website’s internal search results pages from being crawled.
- Keeping sections of a website (such as /cart/) or an entire site (such as a staging environment) private.
- Reducing the likelihood of server overload by specifying a crawl delay.
- Making bots aware of the XML sitemaps by referencing them in the robots.txt file.
Robots Directives Explained
The most common directives are:
- User-agent: This is used to specify which web crawler(s) you’re giving crawl instructions to, e.g. Googlebot or Bingbot.
- Disallow: Used to tell a user agent not to crawl a particular URL or section of a website.
- Allow: Used to tell a crawler it can access a page or subfolder even though the parent page/subfolder is disallowed. It’s important to remember that Googlebot follows this command but not every crawler does.
- Crawl-delay: This command outlines how many seconds a crawler should wait before loading and crawling the content on a page. Note that Googlebot does not support this directive.
- Sitemap: Used to highlight the location of the XML sitemap(s).
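Putting these directives together, a hypothetical robots.txt file using all five might look like the following (the paths and sitemap URL are illustrative):

```
User-agent: Bingbot
Crawl-delay: 10

User-agent: Googlebot
Allow: /private/public-page.html
Disallow: /private/

Sitemap: https://www.example.com/sitemap.xml
```

Here Bingbot is asked to wait ten seconds between requests, while Googlebot may crawl the single allowed page inside the otherwise disallowed /private/ section.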
Robots.txt Best Practices
When it comes to robots.txt there are a number of best practices to follow:
- As a robots.txt file is case sensitive, it can only be named ‘robots.txt’ (all lower case) in order to be discovered.
- As well as being named robots.txt, it’s crucial for the file to sit at the top level of the website (e.g. https://www.example.com/robots.txt).
- Always use a new line for each directive.
- Where possible, use wildcards (‘*’) to simplify the robots.txt instructions.
- Only declare each user agent once.
- Every root domain and subdomain should have its own robots.txt file.
- Always reference the location of the XML sitemaps within the robots.txt file.
- Only block URLs or sections of a website that you do not want crawled.
- Make your robots.txt file understandable for humans by adding comments starting with a hash (#), which crawlers will ignore.
- Search engines typically cache robots.txt content, so after updating the file it should always be resubmitted in Google Search Console.
- A robots.txt file is publicly available, so no private or sensitive user information should appear in it; such pages should be password protected instead.
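As an example of the wildcard and comment practices above, a snippet like this (the pattern is illustrative) blocks every URL containing a query string with a single rule:

```
# Block all URLs containing a query string
User-agent: *
Disallow: /*?
```

Bear in mind that wildcard matching is supported by major crawlers such as Googlebot and Bingbot, but not by every bot.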