OPTIMISE WORDPRESS ROBOTS.TXT FILE FOR GREAT SEO BOTS

Google Search has undergone a remarkable change since it first began operations. It has come a long way from the days when it simply fetched a Website’s HTML while ignoring your styling and JavaScript.

Today, it fetches everything and, what is more, renders your web pages completely. This means that when you deny Google access to your CSS or JavaScript files, it cannot render your pages properly, and your rankings may suffer as a result.

Because of this, the old practice wherein Robots.txt used to block access to your wp-includes directory as well as your Plug-Ins directory no longer stands valid. This is precisely why, in WordPress 4.0, a patch was written to remove wp-includes/.* from the default WordPress robots.txt.

Like Google itself, the Robots.txt file too has come a long way since its inception. Today, it is a very powerful file, especially when used while working on your Website’s SEO. Used with due care, it allows you to deny Search Engines access to certain files and folders. Over the years, however, an improved Google has done away with most of the old practices and changed the way it crawls the web.

SEO too has undergone a lot of change, and most of the old, restrictive practices have been done away with. Today, SEO comprises hundreds of elements, one of which is the Robots.txt file.

Robots.txt, a small text file that sits at the root of your Website, does a lot of work; most importantly, it helps in optimizing your Website, making it better, faster and, above all, more effective.

Coming to WordPress blogs, it must be mentioned that the WordPress Robots.txt file plays a vital role in search engine ranking, in the sense that it helps you stop search engine bots from indexing and crawling parts of your blog that should stay out of search results.

At times, a wrongly configured Robots.txt file can make your site and blog disappear from the search engines. In such a situation, you need to change your Robots.txt file and ensure that it is well-optimized and does not block access to important parts of your blog.

OPTIMISING ROBOTS.TXT

The majority of webmasters tend to avoid editing the Robots.txt file. But as a Website grows, editing it becomes unavoidable: there will be parts of the site that you would not want to be visible and accessible to everyone, and modifying the Robots.txt file is one of the methods to achieve that.

The whole process can be done with ease. Anybody with basic knowledge can create and edit the Robots.txt file.

WORDPRESS ROBOTS.TXT – WHAT IT IS & WHAT IT’S USED FOR

The Robots.txt file is a text file that resides on your Website’s server. It communicates directly with the Search Engines and contains rules for indexing your Website, telling the Search Engine bots/crawlers/spiders which parts of your Website to crawl (where they can go) and which parts to avoid (where they cannot go). In short, it can be described as an ‘Instruction Manual’ that tells the bots/crawlers/spiders of search engines like Google, Bing, Yahoo, etc., what they are allowed to see on your Website and what they are not.

The importance of the Robots.txt file lies in the fact that when a search engine crawls (or visits) your Website, the first thing its Search Bot or Spider does is look for your Robots.txt file. The Robots.txt file:

  • Tells the Search Engine whether it may crawl your Website or not, though it does not actually block access to your Website.
  • Indicates which pages of the Website to index and which not to.
  • Indicates the location of your XML SiteMap. Following this, the Search Engine sends its ‘Bot’ or ‘Robot’ or ‘Spider’ to crawl your Website as directed in the Robots.txt file.
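To make these three roles concrete, here is a minimal sketch of what such a file might look like (the disallowed path and the SiteMap URL are placeholders, not recommendations):

```
User-agent: *
Disallow: /private/

Sitemap: https://www.example.com/sitemap.xml
```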

It’s important to note that though honourable and legitimate bots will honour your directives on whether they may visit or not, some rogue bots may simply ignore the Robots.txt file.

If you’re

  • Using WordPress, you will find the Robots.txt file in the root of your WordPress installation.
  • Using a static Website and have not yet created a Robots.txt file, you need to create one and place it in your Root folder.

NOTE: It is not enough to simply create a new Notepad file and name it robots.txt; you must also upload it into the Root directory of your Domain using FTP.

For a better idea of what a Robots.txt file looks like, just open the Emblixsolutions Robots.txt file, where you will be able to see its content and also its location at the root of the domain – https://www.emblixsolutions.com/robots.txt

REASONS FOR BEING ASKED NOT TO CRAWL

Many wonder why anyone would tell Google or other search engines not to crawl their Website. Isn’t that harmful from an SEO perspective?
Well, to be honest, there are many reasons. Three common ones are:

  • The Website is still in its development stage.
  • The Website is a staging version of your live Website, that is, a copy where changes are tried out before committing them to the Live Version.
  • There are some files on your server that you would not want to pop up on the Internet, for such files are meant only for your users.

IS HAVING A ROBOTS.TXT FILE NECESSARY?

The answer is ‘NO’, because your WordPress Website can be indexed by search engines even without a Robots.txt file. It must also be mentioned that WordPress itself already serves a virtual Robots.txt file.

That said, it’s still recommended that a Robots.txt file be created, for it will make things much easier and better.

NOTE

  • Though the Robots.txt file is recognized and even respected by major Search Engines, it can be disregarded completely by malicious crawlers and low-quality search crawlers.
  • Refrain from creating an incorrect Robots.txt file, else the search engines will disregard it or perform wrong operations.

SEARCH ENGINES AND BOTS

Most search engines have their own Bots. Google’s is called ‘GoogleBot’, while Microsoft Bing’s is called ‘BingBot’. Likewise, other search engines like Lycos, Excite, Alexa and AskJeeves also have their own bots.

Though most Bots come from the search engines, some sites also send out Bots for various reasons. For instance, a site could ask you to put a Code on your Website to verify that you are its owner, and later send a Bot to see whether you have put the Code on your Website.

MAKING OF A ROBOTS.TXT FILE

Now we will learn how to create a Robots.txt file. All you need to do is OPEN up any kind of Text Editor.

NOTE:

  • Refrain from using WYSIWYG software (Web Page Design Software), for such tools may add extra code that you would not want.
  • Use any one of the following and keep the whole process simple:
  • NotePad.
  • Notepad ++.
  • Brackets.
  • Text Wrangler.
  • TextMate.
  • Sublime Text.
  • Vim.
  • Atom, etc.
  • Open the text editor and create a Robots.txt file with one or more records, each of which contains important information for the search engine.

For example:

User-agent: googlebot
Disallow: /cgi-bin

If the above lines are written in the Robots.txt file, GoogleBot will not crawl the /cgi-bin directory but is allowed to index every other page of your Website.
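If you want to sanity-check rules like these without touching a live site, Python’s standard urllib.robotparser module can parse them offline. A small sketch (the domain and paths are placeholders):

```python
from urllib.robotparser import RobotFileParser

# Parse the two example lines directly, without any network access
rules = [
    "User-agent: googlebot",
    "Disallow: /cgi-bin",
]
rp = RobotFileParser()
rp.parse(rules)

# /cgi-bin is blocked for Googlebot; every other path stays crawlable
print(rp.can_fetch("googlebot", "https://www.example.com/cgi-bin/script"))  # False
print(rp.can_fetch("googlebot", "https://www.example.com/about/"))          # True
```

The same parser is what Python tools use when they honour robots.txt, so it is a convenient way to preview how a compliant bot will read your file.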

  • Once Notepad is opened, enter your ‘RULES’.
  • After entering all the ‘RULES’, save the file with the name ‘robots’ and ensure that it is saved with the ‘.txt’ extension (a plain Text Document).

NOTE:

  • The cgi-bin folder of the Root Directory is disallowed from indexing, which means that GoogleBot cannot index the cgi-bin folder.
  • Using the ‘Disallow’ option, you can restrict any Search Bot or Spider from indexing any page or folder. For instance, many Sites disallow their Archive folder or pages so as not to create duplicate content.

WHAT KIND OF RULES SHOULD BE ENTERED IN ROBOTS.TXT?

Basically, it depends upon what you would like to accomplish. Once you are clear on that, you need to decide which parts of your Website you want to ‘Block’ or ‘Hide’ from being crawled. Only then should you enter the ‘RULES’.

Also, note that the folders on your Website that you would not want crawled and indexed in the search engine results typically include the following:

  • Site-search pages.
  • Checkout/Ecommerce sections.
  • User log-in areas.
  • Sensitive data
  • Testing/Staging/Duplicate data, etc.

Once you have worked through this process, setting up your Rules becomes easy.

FINDING OUT WHETHER YOU HAVE A ROBOTS.TXT FILE OR NOT

If your Website already has a Robots.txt file, then it’s better to leave it as it is; this way you will not override anything that is currently there.

If you do not know whether your Website has a Robots.txt file, just visit your domain followed by ‘/robots.txt’. For example: www.emblix.com/robots.txt – replace ‘emblix’ with your own domain name.

NOTE: Your Robots.txt file must always be located at the ‘Root’ or ‘Home’ level of your Website. This means it should always be present in the same folder as your site’s Home Page or Index Page.

Now, if you do not see anything when you visit that URL, your Website does not have a Robots.txt file. On the other hand, if you see information at that URL, then your Website does have a current Robots.txt file.

EDIT / ADD / DELETE RULES

  • When you ‘EDIT’ or ‘ADD’ rules, ensure that you do not delete anything you currently have, for that may end up ‘messing up’ your Website.
  • To avoid accidental deletions or changes, take a backup of your Robots.txt file before you go ahead and edit it.

RULES OF THE WORDPRESS ROBOTS.TXT FILE

There is a standard format for creating rules in the WordPress Robots.txt file:

  • An asterisk is used as a wildcard: *.
  • The ‘Allow’ rule is used to allow areas of your Website to be crawled.
  • The ‘Disallow’ rule is used to stop areas of your Website from being crawled.

Now suppose you own a Website called www.emblix.com, and it has a sub-folder containing duplicate information, testing material or other stuff that you would like to keep totally private.

In such a case, you could set this sub-folder up as a staging or testing area (let’s call the sub-folder ‘staging’).

Once done, your Robots.txt file would look as shown below:

  • User-agent: *
  • Disallow: /staging/

To understand the above better, let us elaborate:

  • Start off your Robots.txt with ‘User-agent: *’.
  • The User-agent definition addresses the search engine spiders, while the asterisk is used as a wildcard. This rule therefore instructs ALL spiders from ALL search engines to follow ALL the rules that come afterwards.
  • The previous rule applies until another User-agent declaration appears further down the robots.txt file (if you use one again).
  • The very next rule is: Disallow: /staging/
  • The Disallow rule tells the search engine spiders that they are not allowed to crawl anything on your Website that resides in the ‘staging’ folder.
  • In other words, nothing under www.emblix.com/staging/ will be crawled.
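The staging example above can be verified the same way with Python’s urllib.robotparser (the domain and file names are illustrative):

```python
from urllib.robotparser import RobotFileParser

# The staging rules from the example, parsed offline
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /staging/"])

# Anything under /staging/ is off-limits to every compliant bot
print(rp.can_fetch("Googlebot", "https://www.emblix.com/staging/draft.html"))  # False
print(rp.can_fetch("Googlebot", "https://www.emblix.com/blog/post.html"))      # True
```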

NOTE

  • Bear in mind that just because a certain section of your Website is disallowed from being crawled does not mean it will never show up in a search engine’s index. If it was previously crawled, and you had allowed those pages to be indexed, they can still show up in the index.
  • If you do not want that, combine the Disallow rule with a ‘noindex’ meta tag added to your web pages.
  • If the web pages that you do not want crawled are already displayed in a search engine’s index, you will have to remove them manually through the WebMaster Tools area of the search engine concerned (be it Google or Bing).

WHERE CAN YOU GET THE NAMES OF SEARCH BOTS?

You can find them in your own Website’s logs.

If you want lots of visitors from the search engines, you should facilitate the entry of every search bot; every search bot thus allowed will index your Website.

You can write User-agent: * to allow every search bot.

Example:

  • User-agent: *
  • Disallow: /cgi-bin

With the above, every Search Bot may index your Website, except for the /cgi-bin directory.

RULES TO FOLLOW
  • Avoid using comments in Robots.txt file.
  • Avoid keeping a space at the beginning of any line, and avoid stray spaces inside the directives.

For example:

Bad Practice:

– User-agent: *
– Dis allow: /support

Good Practice:

– User-agent: *
– Disallow: /support

Avoid changing the order of the directives. For example:

Bad Practice:

– Disallow: /support
– User-agent: *

Good Practice:

– User-agent: *
– Disallow: /support

If you do not want more than one directory or page indexed, do not write them all on a single line:

Bad Practice:

– User-agent: *
– Disallow: /support /cgi-bin /images/

Good Practice:

– User-agent: *
– Disallow: /support
– Disallow: /cgi-bin
– Disallow: /images
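A quick offline check with Python’s urllib.robotparser confirms that the one-Disallow-per-line form blocks each directory (the domain and file names are placeholders):

```python
from urllib.robotparser import RobotFileParser

# The "good practice" rules: one Disallow line per directory
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /support",
    "Disallow: /cgi-bin",
    "Disallow: /images",
])

for path in ("/support/faq", "/cgi-bin/run", "/images/logo.png"):
    # Each of these directories is blocked for every compliant bot
    print(path, rp.can_fetch("Googlebot", "https://www.example.com" + path))  # all False

print(rp.can_fetch("Googlebot", "https://www.example.com/blog/"))  # True
```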

Always use capital and small letters properly. For example, suppose you want to block the ‘download’ directory but wrongly write it as ‘Download’ in the Robots.txt file; the search bot will not match the folder, because paths are case-sensitive.
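This case sensitivity is easy to demonstrate with Python’s urllib.robotparser (the directory name is hypothetical):

```python
from urllib.robotparser import RobotFileParser

# Paths in robots.txt are case-sensitive: /download and /Download differ
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /download"])

print(rp.can_fetch("Googlebot", "https://www.example.com/download/file.zip"))  # False (blocked)
print(rp.can_fetch("Googlebot", "https://www.example.com/Download/file.zip"))  # True (not matched)
```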
If you want every page and directory of your Website to be indexed, then just write:

– User-agent: *
– Disallow:

If you do not want any page or directory of your Website to be indexed, then just write:

– User-agent: *
– Disallow: /
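The difference between the empty Disallow (allow everything) and Disallow: / (block everything) can also be sketched with urllib.robotparser (the URL is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# An empty Disallow permits everything
allow_all = RobotFileParser()
allow_all.parse(["User-agent: *", "Disallow:"])

# "Disallow: /" blocks the whole site
block_all = RobotFileParser()
block_all.parse(["User-agent: *", "Disallow: /"])

url = "https://www.example.com/any/page"
print(allow_all.can_fetch("Googlebot", url))  # True
print(block_all.can_fetch("Googlebot", url))  # False
```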

Once you have edited the Robots.txt file, upload it via any FTP software to the Root or Home Directory of your Website.

ROBOTS.TXT FOR WORDPRESS

You can either

  • Edit your WordPress Robots.txt file by logging into your server’s FTP account; OR
  • Use a plug-in like Robots Meta to edit the robots.txt file from the WordPress dashboard.

NOTE:

  • You need to add a few things to your Robots.txt file, along with your SiteMap URL.
  • Adding a SiteMap URL helps the Search Engine Bots locate your SiteMap file, thus facilitating faster indexing of your web pages.
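Compliant parsers pick Sitemap lines up regardless of where they sit in the file. For instance, Python’s urllib.robotparser (3.8+) exposes them via site_maps() (the URL is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# A Sitemap line is independent of any User-agent group
rp = RobotFileParser()
rp.parse([
    "Sitemap: https://www.example.com/sitemap.xml",
    "User-agent: *",
    "Disallow: /wp-admin/",
])

print(rp.site_maps())  # ['https://www.example.com/sitemap.xml']
```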

As an example, here is a sample Robots.txt file for a WordPress domain.
In the SiteMap line, replace the example URL with your own Blog’s SiteMap URL: Sitemap: https://www.example.com/sitemap.xml

User-agent: *
# disallow all files in these directories
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /archives/
Disallow: /*?*
Disallow: *?replytocom
Disallow: /comments/feed/

User-agent: Mediapartners-Google*
Allow: /

User-agent: Googlebot-Image
Allow: /wp-content/uploads/

User-agent: Adsbot-Google
Allow: /

User-agent: Googlebot-Mobile
Allow: /

HOW TO FIND OUT WHETHER YOUR SITE’S CONTENT HAS BEEN AFFECTED BY YOUR NEW ROBOTS.TXT FILE?

Once you have made changes to your Robots.txt file, the next step is to check whether any of your content is impacted by the update.
To find out, use the Google WebMaster tool’s ‘Fetch as Googlebot’ feature to see whether your content can still be accessed under the new Robots.txt file. Follow these steps:

  • Log in to Google Webmaster Tools.
  • Go to Diagnostics > Fetch as Googlebot.
  • Add your site’s posts and check whether there is any issue accessing them.

In addition, you can check for crawl errors caused by the Robots.txt file under the Crawl Errors section of GWT. All you need to do is:

  • Go to Diagnostics > Crawl errors.
  • Select ‘Restricted by Robots.txt’.
  • You will then see which links have been denied by the Robots.txt file.

For example, you may see that ‘replytocom’ links have been rejected by Robots.txt; such links therefore cannot become part of Google’s index.

FYI, the Robots.txt file is an essential element of SEO, and you can avoid many post duplication issues by keeping your Robots.txt file updated.
