In this post, you’ll learn all about robots.txt.
What Is Robots.txt?
The robots.txt file is a plain text file saved on your website’s server, and it follows a standard known as the “Robots Exclusion Protocol”.
The file tells search engine robots (also known as crawlers) how to crawl particular pages within a website.
When a search engine crawler reads the robots.txt file, it learns which URLs within a website it can and can’t crawl.
All major search engines respect the robots.txt file, so you shouldn’t have to worry about any of them ignoring it.
What Is The Robots.txt File Used For?
Generally speaking, the file is used to manage crawler traffic to your site, and sometimes exclude a page from search results depending on the file type.
There are three page types that you can add to a robots.txt file, and the functionality differs between each of them.
The first type is web pages. This includes HTML, PDF, and other non-media formats that are readable by Google.
The robots.txt file can be used to manage crawler traffic to web pages, but not restrict them from being indexed in search results.
You must not use the robots.txt file to attempt to hide a web page from search results. If the page is linked to by other pages on the internet, it could still be indexed by search engines.
The second type is media files. This includes image, video and audio files.
The robots.txt file can be used to manage crawler traffic, and prevent these types of files from appearing in search results.
Excluding a media file from search results will not prevent other pages from being able to link to it.
The third type is resource files. This includes image, script, or style files.
Adding a resource file URL to your robots.txt file can manage its crawler traffic, and stop it from being indexed in search results depending on the parameters that you set.
It is important that you only block unimportant resource files that will not significantly affect the functionality of webpages that use the particular file you are blocking.
Why Should You Create A Robots.txt File?
Before search engine crawlers crawl a web page on a domain that they have not previously discovered, they will read its robots.txt file.
Upon reading the robots.txt file, they will see which URLs they can and can’t crawl.
If you have a web page that you do not want a particular search engine to crawl, you can add it to your robots.txt file.
Typically, search engines will refresh the robots.txt file daily, so any changes you make will come into effect relatively swiftly.
How To Check If You Have A Robots.txt File
If you are unsure whether or not your website already has a robots.txt file, you can find out in no time at all.
Simply type your domain into your browser’s address bar, and add “/robots.txt” to the end.
It should look like this: “www.website.com/robots.txt”
Then load this URL.
If your site does have a robots.txt file, you will see a plain text page listing its rules.
If your site doesn’t yet have a robots.txt file you will just see an error.
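The same check can be scripted. The sketch below is a minimal Python example, assuming the site is reachable over HTTPS when no scheme is given; the helper names `robots_url` and `has_robots_txt` are illustrative and not part of any library.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def robots_url(site: str) -> str:
    """Return the robots.txt URL for a bare domain or a full URL."""
    # Assumption: a bare domain such as "www.website.com" is served over HTTPS.
    if "://" not in site:
        site = "https://" + site
    parts = urlsplit(site)
    # robots.txt sits at the root, so any existing path is dropped.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def has_robots_txt(site: str, timeout: float = 10.0) -> bool:
    """Return True if the robots.txt URL answers with HTTP 200."""
    try:
        with urlopen(robots_url(site), timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # HTTP errors (such as 404), unreachable hosts and timeouts are all
        # OSError subclasses; they are treated here as "no file found".
        return False


print(robots_url("www.website.com"))
```

Calling `has_robots_txt("www.website.com")` makes a real network request, so the result depends on the site being reachable at the time.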
Basic Format, Syntax & Rules
When you create a robots.txt file there is some basic information that you should understand first.
The robots.txt file usually uses these three keywords: User-agent, Disallow and Allow.
User-agent: This is used to identify a specific crawler or set of crawlers.
Disallow: This rule is added when you want to block a search engine robot from crawling a particular file or folder.
Allow: You add the allow rule when you want to allow a robot to crawl a file or folder.
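To see how Allow and Disallow can work together, here is a small sketch using Python’s standard-library `urllib.robotparser`. The folder name, file name and the `can_crawl` helper are made up for illustration. This parser checks rules in file order, so the Allow line is placed before the Disallow line; individual search engines may resolve overlapping rules differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block a folder but allow one page inside it.
RULES = """\
User-agent: *
Allow: /private/public-page.html
Disallow: /private/
"""


def can_crawl(rules: str, agent: str, path: str) -> bool:
    """Parse robots.txt text and report whether `agent` may crawl `path`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


for path in ("/private/public-page.html", "/private/other.html", "/index.html"):
    print(path, can_crawl(RULES, "AnyBot", path))
```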
Let’s look at four basic examples.
Each example starts with User-agent followed by “*”, which means the rules apply to all search engine crawlers.
On the second line of each example, we have the disallow rule.
The disallow rule is used in four different ways.
First, leaving it empty will disallow nothing, so every crawler can access everything.
In the second example, we just have a forward slash. This basically says “everything”, so it blocks all folders and files.
For the third example, we have used the disallow rule to block one specific folder.
Then, for the final example, we have used the disallow rule to block a specific file.
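The four cases described above can be written out and checked with Python’s standard-library `urllib.robotparser`. This is only a sketch: the folder name `/folder/`, the file name `/file.html` and the crawler name `AnyBot` are placeholders chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# One robots.txt body per case; the folder and file names are placeholders.
EXAMPLES = {
    "allow everything": "User-agent: *\nDisallow:",
    "block everything": "User-agent: *\nDisallow: /",
    "block one folder": "User-agent: *\nDisallow: /folder/",
    "block one file": "User-agent: *\nDisallow: /file.html",
}

PATHS = ["/", "/folder/page.html", "/file.html"]


def crawl_results(rules: str, agent: str = "AnyBot") -> dict:
    """Map each test path to True (crawlable) or False (blocked)."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return {path: parser.can_fetch(agent, path) for path in PATHS}


for name, rules in EXAMPLES.items():
    print(name, crawl_results(rules))
```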
Now let’s look at some specific crawler examples.
Here the only change is that the user-agent line names a specific crawler, Googlebot, instead of “*”.
Googlebot is Google’s web crawler.
The rules themselves are the same as in the basic examples.
Because a user-agent is specified, these rules will only apply to Google’s crawler, Googlebot.
This is a useful feature if you want to block a specific file or folder from a specific user-agent.
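As a rough illustration of a rule aimed at one crawler, the sketch below parses a Googlebot-only group with Python’s standard-library `urllib.robotparser` and compares it against a second crawler. The folder name and the second crawler name (`OtherBot`) are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules that only name Googlebot.
RULES = """\
User-agent: Googlebot
Disallow: /folder/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Googlebot matches the group, so the folder is blocked for it.
googlebot_allowed = parser.can_fetch("Googlebot", "/folder/page.html")
# No group matches OtherBot, so this parser treats the path as crawlable.
otherbot_allowed = parser.can_fetch("OtherBot", "/folder/page.html")

print("Googlebot:", googlebot_allowed)
print("OtherBot:", otherbot_allowed)
```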
It’s likely that you won’t remember the names of all user-agents, so you can view the complete list of user-agents here
A few basic rules apply to every robots.txt file:
- The robots.txt file must be located at the root of the website
- The file must be named “robots.txt”
- Your website must only have one robots.txt file
- You can add comments to the file after a “#”
Read through the file after writing it, and test it to ensure there are no errors.
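One way to test a file before uploading it is to list the results you expect and compare them with what a parser reports. The sketch below uses Python’s standard-library `urllib.robotparser`; the `failed_checks` helper, the `/drafts/` folder and the crawler name are hypothetical, and a passing check here does not guarantee that every search engine reads the file the same way.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file, including comments that the parser should ignore.
RULES = """\
# Keep crawlers out of the drafts folder
User-agent: *
Disallow: /drafts/  # text after a "#" is a comment
"""

# Each entry is (crawler name, path, whether crawling should be allowed).
EXPECTATIONS = [
    ("AnyBot", "/drafts/post.html", False),
    ("AnyBot", "/blog/post.html", True),
]


def failed_checks(rules: str, expectations: list) -> list:
    """Return the (agent, path) pairs whose result differs from expectations."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return [
        (agent, path)
        for agent, path, expected in expectations
        if parser.can_fetch(agent, path) != expected
    ]


print(failed_checks(RULES, EXPECTATIONS))
```

An empty list means every expectation matched what the parser reported.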
How To Create A Robots.txt File
To create a robots.txt file you just need a text editor. I will use Notepad.
First, create a new text document, name it “robots.txt”, and then open the file.
In the robots.txt file that we create, we are going to block specific crawlers from specific webpages, and then place our sitemap URL at the bottom of the file.
By adding the sitemap URL, we are telling crawlers where our sitemap is located.
We are going to block the “wp-admin” page from all crawlers.
This is the page that admins will use to log in to the WordPress dashboard, so we do not want crawlers to crawl this webpage.
Then we will block the “contact” page from Google and MSN crawlers.
This isn’t needed, but it is just an example to show you how to block specific pages from specific crawlers.
In your case, you only want to block pages or folders that you need to block, so don’t just copy me as this is just an example.
So, now that we have an empty file, let’s start writing.
First, I’m going to add a comment. Crawlers will ignore this, so it won’t affect the file.
To add a comment, just add a “#” symbol, and then write anything you want on the same line.
You do not need to add a comment. Usually, they are used in more complicated robots.txt files to explain what does what.
Next, we will add the text that will block all crawlers from the wp-admin page.
To do this we will put “User-Agent:”, then add a space followed by an “*”.
This says “we are talking to all crawlers”.
Then, on the next line we will put “Disallow:”, followed by a space, and then “/wp-admin”.
This will disallow this specific page from being crawled by all crawlers.
That is the first part done.
Now, we will disallow the Googlebot from crawling the “contact” page.
So, type “User-Agent:”, followed by a space, and then “Googlebot”.
This line is telling the Googlebot to listen to this specific part of the file, and all other crawlers will ignore it.
On the next line type “Disallow:”, followed by a space, and then “/contact”.
This will disallow the contact page from being crawled by the Googlebot.
And now we will do the same thing with the MSNBot, and block it from crawling the contact page.
Once again, type “User-Agent:” followed by a space, and this time add “MSNBot”.
On the next line type “Disallow:”, space, then “/contact”.
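Now we have successfully blocked the Googlebot and the MSNBot from crawling the contact page.
Keep in mind that a crawler with its own named section follows only that section and ignores the “*” section, so if you also want the Googlebot and the MSNBot kept out of wp-admin, repeat that rule under their user-agent lines.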
Finally, we will tell crawlers where our sitemap is located.
To do this, type in “Sitemap:”, space, and then enter the URL for your sitemap file.
This will tell all crawlers where our sitemap.xml file is located.
Save your file, and ensure that it is saved as a text document, and the file is named “robots.txt”.
You have now successfully created a robots.txt file for your website.
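Put together, the file from this walkthrough could look like the text in the sketch below, which saves it as “robots.txt” in a temporary folder and reads it back with Python’s standard-library `urllib.robotparser`. The sitemap URL, the `load_rules` helper and the crawler name `SomeBot` are placeholders, and the results show how this particular parser reads the file, not a guarantee for every crawler.

```python
import tempfile
from pathlib import Path
from urllib.robotparser import RobotFileParser

# The sitemap URL is a placeholder; replace it with your own.
ROBOTS_TXT = """\
# Example robots.txt from this walkthrough
User-Agent: *
Disallow: /wp-admin

User-Agent: Googlebot
Disallow: /contact

User-Agent: MSNBot
Disallow: /contact

Sitemap: https://www.website.com/sitemap.xml
"""


def load_rules(text: str) -> RobotFileParser:
    """Save the text as robots.txt in a temp folder, then parse it back."""
    with tempfile.TemporaryDirectory() as folder:
        path = Path(folder) / "robots.txt"
        path.write_text(text, encoding="utf-8")
        saved = path.read_text(encoding="utf-8")
    parser = RobotFileParser()
    parser.parse(saved.splitlines())
    return parser


parser = load_rules(ROBOTS_TXT)
checks = [
    ("SomeBot", "/wp-admin"),
    ("SomeBot", "/contact"),
    ("Googlebot", "/contact"),
    ("MSNBot", "/contact"),
    # Googlebot has its own group, so this parser does not apply
    # the "*" group (and its /wp-admin rule) to it.
    ("Googlebot", "/wp-admin"),
]
for agent, path in checks:
    print(agent, path, parser.can_fetch(agent, path))

# site_maps() is available in Python 3.8 and later.
print(parser.site_maps())
```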
All that’s left is to upload the file to the root of your website, with all of your other webpages.
It is important to remember that you only need to block the pages and folders from crawlers that you require to be blocked.
You shouldn’t just copy exactly what I’ve done, and you definitely shouldn’t block your contact page from crawlers as it could result in fewer people finding that page organically.
This is just an example to help you understand how to create a robots.txt file.
You should now understand how to create a simple robots.txt file that is sufficient for your needs.
Having a robots.txt file is crucial, and now that you know how to create one, your website should have one.
Create a robots.txt file for your site, upload it to your web server, and let us know if you need any help.
Did we miss anything?
Leave a comment below to let us know.