How do you control how search engine robots crawl your site for SEO? This article covers the essentials of controlling web crawlers and robots. You can use several tools — robots.txt, robots meta tags, rel="nofollow", and rel="canonical" — to control how content on your website is crawled and indexed.
Using the robots.txt file
Robots.txt is a file you place in the root of your site which follows the Robots Exclusion Protocol (REP), also known as the Robots Exclusion Standard. REP is a set of web standards regulating the behaviour of web robots, used to control search engine crawling and indexing. If you are delving deep into robots.txt for SEO, you might want to check out the Moz best-practice guide to robots.txt.
Robots.txt files are commonly used to block resources for SEO purposes, and there are a few ways to block search engines and control their behaviour on a given domain. Here are the main uses of robots.txt:
- Block bots from accessing private directories
- Block bots from crawling unimportant content
- Give bots the URL of your sitemap
- Avoid 404 errors in your server logs (crawlers request /robots.txt, and a missing file shows up as a 404)
There are many uses for the robots.txt file, but here we are mainly interested in its blocking features. There is a more detailed beginners' guide to robots.txt on Woorank.
Block entire site with Robots.txt
Blocking the entire site with robots.txt tells all robots not to crawl any resource on the website.
This can be useful if you are developing your site, or making changes you don’t want to be indexed.
User-agent: *
Disallow: /
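You can sanity-check rules like this programmatically. A minimal sketch using Python's standard-library `urllib.robotparser` (the domain here is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# The full-site block shown above, as a list of lines.
rules = [
    "User-agent: *",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(rules)

# Every URL on the site is now off-limits to compliant crawlers.
print(parser.can_fetch("*", "https://www.example.com/"))          # False
print(parser.can_fetch("*", "https://www.example.com/any-page"))  # False
```

This is handy for verifying a staging-site block before you deploy it.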
Block specifically with robots.txt
You can be a lot more specific within your robots.txt file, by specifying particular robots you want to block. You can also be more specific with the directories you want to block.
User-agent: googlebot         # Google's main bot
Disallow: /private-directory/ # block this directory

User-agent: googlebot-news    # the news bot
Disallow: /                   # block everything

User-agent: *                 # wildcard for every robot
Disallow: /admin/             # block the admin directory
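A quick way to confirm that per-robot rules like these do what you expect is to query them with Python's standard-library parser. Note one quirk of this sketch: `urllib.robotparser` uses the first group whose name matches the queried user agent, so the more specific agent (googlebot-news) is listed first here; major search engines instead pick the most specific matching group themselves. The domain and bot names are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Per-agent rules; the most specific agent comes first for urllib's
# first-match group selection.
rules = [
    "User-agent: googlebot-news",
    "Disallow: /",
    "",
    "User-agent: googlebot",
    "Disallow: /private-directory/",
    "",
    "User-agent: *",
    "Disallow: /admin/",
]

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("googlebot", "https://www.example.com/private-directory/"))  # False
print(parser.can_fetch("googlebot-news", "https://www.example.com/any-page.html"))  # False
print(parser.can_fetch("somebot", "https://www.example.com/admin/"))                # False
print(parser.can_fetch("somebot", "https://www.example.com/blog/"))                 # True
```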
Allow with robots.txt
You can use the robots.txt file to allow access to specific directories and URLs even when you have blocked the parent.
Allow: /directory/file.html # bots can access
Disallow: /directory/       # bots cannot access
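The Allow exception can also be verified with the standard-library parser. One caveat of this sketch: `urllib.robotparser` applies rules in file order (first match wins), so the Allow line must come before the Disallow line here; major search engines generally apply the most specific matching rule instead, so both readings agree for this file. Domain and paths are placeholders.

```python
from urllib.robotparser import RobotFileParser

# An Allow exception inside a blocked directory; a User-agent line is
# required for the rules to apply.
rules = [
    "User-agent: *",
    "Allow: /directory/file.html",
    "Disallow: /directory/",
]

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("*", "https://www.example.com/directory/file.html"))   # True: the Allow wins
print(parser.can_fetch("*", "https://www.example.com/directory/other.html"))  # False
```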
Sitemap url in robots.txt
You can give search engine bots the URL of your sitemap. To do this, add a Sitemap directive to your robots.txt with the location of your sitemap.
If you have a small, well-structured site with a clean link structure, you may not need to go to the trouble of creating an XML sitemap. For a larger site, however, XML sitemaps are a helpful tool for getting important content indexed. SEObook’s robots.txt guide explains the XML sitemap well.
Because search engines check for the existence of a sitemap, it has become common practice for SEOs to place the Sitemap directive in the robots.txt file.
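As a sketch of what that looks like, here is a hypothetical robots.txt containing a Sitemap directive (placeholder URL), parsed with Python's standard library; `RobotFileParser.site_maps()` (Python 3.8+) returns the Sitemap URLs it found:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with a Sitemap directive (placeholder URL).
rules = [
    "User-agent: *",
    "Disallow: /admin/",
    "",
    "Sitemap: https://www.example.com/sitemap.xml",
]

parser = RobotFileParser()
parser.parse(rules)

# Python 3.8+: returns the Sitemap URLs listed in the file.
print(parser.site_maps())  # ['https://www.example.com/sitemap.xml']
```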
Using meta tags to control web robots
You can also use the robots meta tag, applied on a page-by-page basis in the code of your website. This lets you control how HTML pages are crawled and indexed, but it cannot be used for other file types such as text files, PDFs, or images. It is very useful when you want to control specific pages on your site.
If you want a page not to be indexed and its links not to be followed, you can do both with a single tag:
<meta name="robots" content="noindex,nofollow">
Blocking with rel="nofollow"
Links are one of the most important ranking factors and indicate the quality of your website.
For example, when a blog is open to comments that can contain links, the quality of the links placed in those comments can have an effect on the perceived quality of the website.
The nofollow attribute tells bots not to follow a link: the bot still crawls the page, but understands that you don’t want those links followed.
<a href="http://www.mysite.com/" rel="nofollow">link text</a>
Duplicate content with rel="canonical"
Your website may sometimes need to display the same content on different pages, or multiple versions of the same content on different URLs. Search engines need to be told the location of the original content if you want link juice to be passed to it effectively. Your rankings will also suffer if search engines cannot figure out which version of the content is the original.
You can add rel="canonical" to pass the authority, trust, and ranking power back to the original content.
<link rel="canonical" href="http://www.mysite.com/original-page" />
We all build our websites wanting to appear in the top organic search results, and this often requires an expert understanding of how to direct the search engine robots that check your site. It is easy to forget that websites are intended to be used by humans: the end user performs a search using a keyword phrase, and the search engine answers it by showing the website data that best matches the query. Robots.txt and the meta tags of the REP can instruct search engines to display the most relevant details from your website and to ignore irrelevant and unwanted data or links.
About the Guest Author
Neha Bhatia is passionate about innovative marketing strategies, and loves all things to do with Digital Marketing including SEM and SEO.