What is Robots.txt and how to configure it?

Chandra Shekar

If you work in tech or manage a website, it is important to understand what a robots.txt file is and how to configure it. This file plays an important role in how search engines crawl your site. Here we cover the essentials so that you fully understand what robots.txt is. Let us begin!

What is the robots.txt file?

A robots.txt file is not required for Google to crawl your website, but having one in the root of your domain lets you tell Google's crawlers which content they may access.

Furthermore, the robots.txt file is also used to keep Google's bots out of specific parts of the website that we don’t want crawled. Keep in mind that blocking crawling does not guarantee that a page stays out of the index: a blocked URL can still appear in search results if other pages link to it.

According to Google, this is the exact definition of robots.txt: “A robots.txt file is a file that sits in the root of a site and indicates which parts you don’t want search engine crawlers to access. The file uses the Robot Exclusion Standard, which is a protocol with a small set of commands that can be used to indicate access to the website by section and by specific types of web crawlers.”
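
Because the file always sits in the root of the site, its address can be derived from any page URL. The following is a minimal Python sketch of that idea; the function name robots_txt_url and the example.com address are illustrative assumptions, not part of any official tool.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; drop the path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://www.example.com/blog/post.html?id=7"))
```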

What is the robots.txt file for?

In addition to its main function, robots.txt has several specific uses:

Control access to image files: A robots.txt file can prevent images on your site from appearing as individual search results.

Control access to web pages: In addition to blocking access to restricted or irrelevant pages, it also helps keep the server from being overwhelmed by crawler requests.

Block access to resource files: Robots.txt can also block access to unimportant style and script files.

How does robots.txt work?

Robots.txt works in a much simpler way than one might imagine.

Let’s start from the fact that the file consists of lines of text that give recommendations, not orders, to Google's robots about what to crawl and what to skip.

Robots.txt Commands

To create a robots.txt file and understand how it works, you need to know four commands. Here they are:

Disallow Command

This command tells crawlers which pages they should not access, which generally keeps those pages out of the search engine results pages (SERPs).

To keep Google's bots from accessing the “beta.php” page of your site, this is the command you should use:

Disallow: /beta.php

You can also block access to specific folders with this command:

Disallow: /files/

You can also restrict access to everything whose path begins with a specific string. For example, if you want to prevent access to all files and directories that start with the letter “a”, you could use the following command:

Disallow: /a
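
To see how these Disallow rules behave, here is a small sketch using Python's standard urllib.robotparser module. The "User-agent: *" line, the example.com domain, and the helper name allowed are assumptions added for illustration; the three Disallow lines are the ones shown above.

```python
from urllib.robotparser import RobotFileParser

# The three Disallow examples from above, grouped under a catch-all
# User-agent line (an assumption; a rule needs a group to belong to).
rules = """\
User-agent: *
Disallow: /beta.php
Disallow: /files/
Disallow: /a
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())


def allowed(path: str) -> bool:
    """Check a path on a hypothetical site against the rules above."""
    return parser.can_fetch("Googlebot", "https://www.example.com" + path)


for path in ["/beta.php", "/files/report.pdf", "/about.html", "/contact.html"]:
    print(path, "->", "allowed" if allowed(path) else "blocked")
```

Only /contact.html is reported as allowed here: /about.html is blocked because its path starts with "/a", which shows that Disallow matches by prefix.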

Allow Command

This is the exact opposite of the previous command: it tells Google which pages it may crawl and display in search results.

If there is no Allow or Disallow command, pages can be crawled by default, so we recommend using Allow only to open up a specific page, file, or folder within a blocked directory.

If you need to restrict access to a specific folder but still allow access to a specific page inside it, you can use commands such as the following:

Disallow: /files/

Allow: /files/products.php

If you want to block access to the “files” folder but need to allow access to the “projects” folder inside it, you can use the following commands:

Disallow: /files/

Allow: /files/projects/
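
When an Allow and a Disallow rule both match a URL, Google documents that the most specific rule (the longest matching path) wins, and that Allow wins a tie. The following Python function is a simplified sketch of that precedence rule, not Google's implementation: it handles plain path prefixes only (no wildcards), and the name is_allowed and the rule format are assumptions made for this example. Other parsers may resolve conflicts differently, for instance by file order.

```python
def is_allowed(path, rules):
    """Decide whether path may be crawled.

    rules is a list of (directive, prefix) pairs such as
    ("disallow", "/files/"). Prefix matching only, no wildcards.
    """
    best_len = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        if not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        # A longer prefix is more specific; on a tie, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


rules = [
    ("disallow", "/files/"),
    ("allow", "/files/products.php"),
    ("allow", "/files/projects/"),
]

for path in ["/files/products.php", "/files/projects/site.html", "/files/secret.pdf"]:
    print(path, "->", "allowed" if is_allowed(path, rules) else "blocked")
```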

Sitemap Command

This is one of the most useful commands: it tells Google where your sitemap is located.

However, it is used less often today because the sitemap can also be submitted in other ways, for example through Google Search Console.
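
As an illustration, a Sitemap line takes the full URL of the sitemap file. The sketch below assumes a hypothetical sitemap at https://www.example.com/sitemap.xml and reads it back with the site_maps() method of Python's standard urllib.robotparser (available in Python 3.8 and later).

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt with one rule group and a Sitemap line.
rules = """\
User-agent: *
Disallow: /files/

Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# site_maps() returns a list of sitemap URLs, or None if there are none.
print(parser.site_maps())
```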

User-agent Command

You can set particular instructions for each search robot on the market within the robots.txt file by using the User-agent command, which identifies the robot that the rules below it apply to.

If you want to know the specific name of each User-agent, you can consult the Web Robots database, which presents a list of the best-known search robots in the industry.

It is important to note that Googlebot is the main search robot used by the Google search engine.

If you want to give it specific commands, the command you should enter in your robots.txt would be this:

User-agent: Googlebot

If, instead, you want to leave specific commands for the Bing search bot, the command is this:

User-agent: Bingbot
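
Each User-agent line starts a group, and the Allow and Disallow lines under it apply only to that robot. The sketch below combines the two User-agent lines above with Disallow rules chosen purely for illustration, and checks them with Python's standard urllib.robotparser; the example.com domain and the "OtherBot" name are assumptions.

```python
from urllib.robotparser import RobotFileParser

# Two groups with different rules for each robot (rules are illustrative).
rules = """\
User-agent: Googlebot
Disallow: /files/

User-agent: Bingbot
Disallow: /beta.php
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

site = "https://www.example.com"
for bot in ["Googlebot", "Bingbot", "OtherBot"]:
    for path in ["/files/report.pdf", "/beta.php"]:
        verdict = "allowed" if parser.can_fetch(bot, site + path) else "blocked"
        print(bot, path, "->", verdict)
```

A robot that matches no group, such as the hypothetical OtherBot here, is not restricted by either set of rules.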