Creating the ultimate WordPress robots.txt file

March 26th, 2007 · 78 Comments

One of the things I’d been meaning to do for some time was to get round to creating a robots.txt file for this site. I’d looked at various sources of information including A Standard For Robot Exclusion and How to create a robots.txt file but was still unsure as to which bits should be excluded so, in my usual fashion, I didn’t bother doing anything at all!

However the subject has come up again a couple of times over the last few days at both Wolf-Howl and Connected Internet and pricked my conscience so I thought it was time to revisit it.

One of the main reasons for creating a robots.txt file is to prevent the search engines from reaching your content in more than one location (e.g. in your monthly archives, your category folders, your XML feed and on your front page) because this could lead to duplicate content issues. Another reason is that you may have a private directory which you don’t want the world to read, but that’s something for Big G to explain. Today we’re only looking at the SEO reasons for creating the file so, with that in mind, what should be included?

There seem to be a number of different viewpoints on what should and shouldn’t be included in the robots.txt file. Some say that you should include the wp-content, wp-admin, wp-includes and feed information, whereas others say those are fine to index but you shouldn’t let the Googlebot anywhere near your archives. Another train of thought is to disallow access to your images folder, whilst others warn that your AdSense could go belly up if you make sweeping changes to your robots file. In fact, no matter where you look, people are giving conflicting advice on the right way to create a robots.txt file for WordPress. No wonder I hadn’t done anything about it until now!

If you were creating a regular website you’d include robots meta tags to prevent Google from indexing certain pages. However, you don’t have that option out of the box within WordPress because all of the meta data is contained within a single file (header.php) which appears on every single WP page, so any “noindex,nofollow” rule would be applied across the whole site and you wouldn’t get indexed at all.

Taking it back to basics, the reason for creating the darned thing in the first place is to prevent Google from potentially ditching your content into the supplemental index, so what should you include in the file to prevent that from happening?

By including the bare minimum in your robots.txt file.

Why do I say that? Well because unless you know your WordPress install inside out you could end up shooting yourself in the foot. Every theme, plugin and tweak you’ve made to your site affects how your site is structured. If you change any of these parameters and don’t change your robots.txt file, you could end up seriously screwing yourself over.

A more sensible way of preventing duplicate content is to be more disciplined in the way that you structure your site. Michael Gray offers some excellent advice in his video blog about making WordPress search engine friendly: only apply one category per post. I’d not even thought about that before but he’s right. Previously I was applying categories all over the shop, which meant that Google could find the same content in half a dozen different places, so now I’m only using one category per post and will tidy up my archives shortly.

I’ve decided to implement a very basic version of the robots.txt file for the time being and will review the results in a few weeks’ time. I decided to keep the images folder accessible to Google because I do get some traffic through image search and, whilst it’s not uber sticky, it’s still traffic at the end of the day. Equally, I was specific about disallowing Googlebot from indexing /page/ while still allowing the AdSense bot to look through archived pages.
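
To give a flavour of what that split looks like, here’s a rough sketch rather than my exact file. It assumes the AdSense crawler identifies itself as Mediapartners-Google, and a User-agent group with an empty Disallow line means nothing is off limits to that bot:

    # AdSense crawler: free to read everything so ad targeting keeps working
    User-agent: Mediapartners-Google
    Disallow:

    # Googlebot: kept out of the paginated /page/ archives
    User-agent: Googlebot
    Disallow: /page/

Bear in mind that once a bot has its own User-agent section it ignores the generic User-agent: * rules, so anything else you still want blocked for Googlebot has to be repeated in its own section.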

I’m still confused as to whether to disallow access to the categories or dated archives to prevent possible issues. Looking at my Google data, I can’t work out which bit Big G doesn’t like so I’m going to implement the basic version of the robots.txt file first and then see what happens from there.

I may well have got it horribly wrong – as I’ve said before I’m no expert – but it’s a start and a key part of Twenty Steps is reporting my mistakes so you don’t have to make them! I’m pretty sure we’ll be revisiting this ol’ chestnut again ;)

If you’ve not had enough already reading about robots and meta tags, here’s some more reading for you :D

Meta Robots Tag 101: Blocking Spiders, Cached Pages & More
Google Webmaster Central – All about Googlebot
Inside Google sitemaps – Analyzing a robots.txt file
Official Google Blog – Robots Exclusion Protocol

UPDATE: I’ve tinkered around with the robots.txt file and you can read about the changes made by following this link


Tags: Search Marketing

78 responses so far ↓

  • Eric Giguere // Mar 26, 2007 at 6:19 pm

    Instead of guessing, use Google’s own robots.txt validator to make sure that the Googlebot and the AdSense crawler are both seeing the “right” parts of your site. It takes a lot of the guesswork out by letting you play with the robots.txt until it does what you want it to do. See the last part of my series about AdSense and robots.txt for the details.

  • Mike // Mar 26, 2007 at 7:42 pm

    Hi Eric,

    Thanks for stopping by. I enjoyed reading through your series earlier today.

    I picked up the tip on the validator and checked through a few of my pages earlier but it didn’t seem to function in the way I expected so I was going to wait until Google had picked up the new robots.txt file.

    I think, with hindsight, the best option when I started here would have been to run with the SEO-friendly permalink structure suggested by Jim Westergren (thanks for the tip, Eric) as that would have enabled me to disallow the indexing of date-related archive posts as well as the categories.

    Of course I could still do that and redirect the old pages to the new ones but that just opens up another can of worms that I’ve left on the shelf…

    Mike

  • Eric Giguere // Mar 27, 2007 at 3:48 am

    You don’t have to wait for the validator to pick up your robots.txt file, you can just paste the contents of your updated file into the text box they give you and then run your test URLs through it…. it’s pretty easy, trust me!

  • Mike // Mar 27, 2007 at 3:24 pm

    The point I was trying to make was that it didn’t reflect what I was expecting. Big G are now displaying the actual robots.txt file and yet still crawling the specific directories I’d asked them not to (i.e. wp-admin).

    :neutral:

  • Eric Giguere // Mar 28, 2007 at 3:51 pm

    Ah, well I don’t know how often they look at the robots.txt file, so it could just be that they haven’t updated their internal status. As long as you’ve used the validator to make sure that Googlebot’s excluded from those directories, changes will eventually take effect.

  • askApache // Mar 31, 2007 at 8:28 am

    This is the best attempt at figuring this whole thing out that I’ve come across.

    I checked out your many links and sources and ended up completely redoing my robots.txt, optimized for WordPress, including using meta tags.

  • Mike // Apr 1, 2007 at 6:17 pm

    Thanks for stopping by and leaving a comment, askApache and for the tip on adding the meta tags in the header file :)

  • Mike // May 22, 2007 at 4:53 pm

    Some more robots.txt reading can be had over at 10e20 where they look at the specifics of stopping different agents.

  • Simon // May 27, 2007 at 1:13 am

    Thanks for the info. We’re looking to move our site to WordPress so this is great info.

  • Business // Jun 2, 2007 at 5:31 pm

    Thanks .. I have been looking for the right one that is simple enough and will do the right job!

  • PocketSEO // Jun 9, 2007 at 12:25 am

    I wouldn’t block the main feed from Googlebot. If you add a unique custom excerpt to each post when you write them it will keep the content in the main feed 100% different from the home page of your blog. The custom excerpt should also put different content on the category pages.

    Also, other search engines rely on your main RSS feed, so if you do block the main feed, I recommend blocking it for Googlebot only.
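
    A Googlebot-only rule for that would be something along these lines (a rough sketch, assuming the feed lives at the default /feed/ path):

    # keep only Googlebot out of the main feed; other bots can still fetch it
    User-agent: Googlebot
    Disallow: /feed/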

  • Mike // Jun 9, 2007 at 12:28 pm

    I’m interested to know your thought process behind not blocking the feed. In my mind it’s duplication and doesn’t really provide a benefit to any of the search engines, so I can’t see a reason to let them index it.

  • PocketSEO // Jun 13, 2007 at 5:48 pm

    If you make a custom excerpt for each post there is no front page duplicate content. The main feed will contain the content of the custom excerpts.

    Some bots rely on the RSS feeds, so I would at least give them the main feed. (for example, Technorati: http://technorati.com/help/tags.html )

    Google loves RSS feeds — the only issue is the duplicate content problem. The custom excerpt prevents duplicate content.

  • Mike // Jun 15, 2007 at 12:04 am

    OK, well, my robots.txt file disallows all agents from reading the /feed/ directory. I use Feedburner for my RSS so, in effect, the URL for my feed is actually one that belongs to Feedburner.

    What Google gets is the ping to their feed aggregator for the main content. They’ll pick up the content from their regular indexing anyway, so by disallowing the feed I’m still getting the Big G love but without the potential duplicate content, because the feed is actually on a Feedburner URL.

    Well, at least that’s how it seems to me at 1am ;)

  • David Airey :: Creative Design :: // Jun 29, 2007 at 1:22 pm

    A point you mention, about shooting yourself in the foot, is exactly the reason why I’ve been reluctant to tamper with my robots.txt file.

    Now perhaps I am missing out, but what’s more of a priority is forming my site to look how I want (I’m continuously tweaking) and to hold content that reflects how I want others to see me.

    This is an interesting post though and I’m glad of the discussion that’s followed.

  • Mike // Jun 29, 2007 at 1:34 pm

    I think the main point to consider, David, is whether you have a problem with supplemental results before tinkering with your robots.txt file. If Big G are penalising you because of perceived duplicate content then it’s worth doing.

    Without a doubt tweaking the site as you mention is an important part of the process. However so is getting the visitors to your site via the search engines in the first place.

    It’s great that the post has provoked a healthy discussion. That’s all part of the fun of blogging :D

  • Chris // Jul 4, 2007 at 7:36 pm

    Wow, you do have a slim robots file.

    Mine’s a tad fatter, and it’s based on the one from notsoboringlife.com

    I’ve kept an eye on what it’s done for my blog, and it’s moved about 400 pages from the supplemental to the main index, with no apparent ill-effects, in about a month.

    Why have you disallowed the comments in yours?

  • Mike // Jul 4, 2007 at 9:14 pm

    Crap in a hat! That’s a meaty ol’ robots.txt file you have there!

    I’m not disallowing comments at all. As you know, I subscribe to the DoFollow philosophy that all comments should receive some link love. What I’ve done, though, is prevent Big G from indexing both the main content of the page and the individual URLs that WordPress generates for comments, due to duplicate content issues.

    In other words I’m preventing Big G from indexing both http://www.mydomain.com/most-excellent-post/ as well as http://www.mydomain.com/most-excellent-post/comment543.

    Whilst on the surface the second URL is only looking at the comment, the Googlebot is also reading everything else on the page, so it could flag it as dupe.

    I’m waiting a couple more weeks to re-evaluate my current robots.txt as well as a number of other factors. I recently changed my URL structure so I want to see an accurate G reflection of my site before I make any further changes.

  • Chris // Jul 5, 2007 at 8:35 am

    “That’s a meaty ol’ robots.txt file you have there!”

    Why thank you sir [blush] ;-)

    I see. When I saw ‘Disallow: /comments/’ in your robots.txt, I presumed it was preventing the individual comments from being indexed.

    I wonder if there is a definitive answer for this issue? Like you say, time will tell.

  • Ravi // Jul 23, 2007 at 3:46 am

    Thanks for this, I needed to figure out how to get Google stop indexing my duplicate content.

  • Mike // Jul 23, 2007 at 2:17 pm

    Hope it works out for you, Ravi :)

  • Earn Health And Money Online // Sep 30, 2007 at 5:33 pm

    Dunno the SEO impact of putting “Disallow: /page/” in your robots.txt twice.

  • Mike // Oct 1, 2007 at 1:08 pm

    Oops! Not noticed it was in there twice :oops: Thanks for spotting it!

  • Earn Health And Money Online // Oct 1, 2007 at 2:34 pm

    Similarly, I have a doubt about the command:
    Disallow: /wp-

    I suspect it will prevent the content itself from getting spidered. I’m just recovering from such a problem: only the titles were showing up in the index while the content was not. I’m not sure of it yet, though, and I’m still doing trial and error.

    Disallow: /date/ and Disallow: /2007/ are almost the same command; I prefer the latter, doing it year by year. That blocks the day, month and year archives all in one.

    It’s better to have a separate set of instructions for Googlebot so you have better control. Google allows wildcard characters while the rest do not.
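
    For example, a Googlebot-only section with wildcards could look something like this rough sketch (the paths are just placeholders):

    # Googlebot understands * wildcards, an extension to the original robots.txt standard
    User-agent: Googlebot
    Disallow: /*?          # any URL containing a query string
    Disallow: /*/trackback # the trackback URL under each post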

  • Mike // Oct 2, 2007 at 5:28 pm

    The reason behind disallowing /wp- is to prevent some of the back-end pages of the install from being indexed by the search engines. None of the content that I want to be searchable sits in those directories. All my images are in a separate directory rather than in the wp-content directory.

    Disallowing all the date-based directories just seemed to make sense to me. Sure, it’s an extra line in the code, but I wanted to ensure that the only way Big G could find the content was via the front page or via the category extracts.

  • ComparetheLoan // Oct 15, 2007 at 2:02 pm

    Thanks for the info. We’re looking to move our site to WordPress so this is great info.

  • Mike // Oct 16, 2007 at 10:51 am

    I’ve removed the URL from your comment. Please have a read of the comment policy regarding acceptable user names. Thanks.

  • bLuefRogX // Nov 18, 2007 at 2:23 pm

    I’ll give this a shot later, looks promising :)

  • Indigo Clothing // Dec 2, 2007 at 11:02 am

    Thanks for this info. I was looking this up as I’m now concerned that, since WP 2.3 introduced tags, there is an even higher risk of dupe content, so I have added the /tag/ directory to our robots.txt as well as some of the items in yours.

  • Clement // Dec 29, 2007 at 2:08 pm

    Very informative article. Stumbled and reviewed.

  • Mike // Dec 29, 2007 at 7:19 pm

    Thanks for the Stumble, Clement :)

  • Bollywood // Feb 9, 2008 at 11:13 pm

    Nice information. So far, I have seen more than 10 different robots.txt files on various sites. Each explains why it included this and excluded that in a very convincing way. Now I have to create my own file based on the most convincing information on these sites. Of course, making some deletions and additions and comparing the results should hopefully give me the best robots.txt file. Thank you for sharing your views.

  • Mike // Feb 12, 2008 at 5:17 pm

    As always, the thing here is to test, test, test. Try something out, have a look at how the changes have affected your rankings and spread of link juice.

    Hope it works out well for you.

  • Ibnu Asad // Feb 16, 2008 at 12:43 pm

    Personally, I think that Google is smart enough to filter those ‘duplicate content’ pages on your site even without a robots.txt… well, because there are A LOT of WordPress-based sites and I’m sure Google has already figured it out ;)

    But on the other hand, for SEO purposes… it is advisable to prevent archive pages from being indexed and cached.

    FYI, I use the All-In-One SEO plugin :)

  • Mike // Feb 17, 2008 at 10:48 am

    Hi Ibnu. Thanks for stopping by and leaving some comments.

    I’m not so sure Big G *is* smart enough to work it out, y’know. Matt ‘The Enforcer’ Cutts has suggested on his site that robots.txting out archives and any other potential duplication is a sensible idea for WP sites. I’m pretty sure I read something by Matt Mullenweg, the head honcho at WordPress, on the subject too.

    I guess from my perspective, there’s no harm in putting in place a solid robots.txt even if it isn’t absolutely essential.

    How are you getting on with the All in One SEO Pack? Same kind of results as I’ve had?

    How I Doubled My Search Engine Traffic In 5 Minutes

  • Gr.Zhang // Mar 16, 2008 at 3:52 pm

    WONDERFUL! THANKS!

  • Moncef // May 2, 2008 at 8:03 am

    Hi Mike, and everyone else who is wondering about disallowing comments:

    This line in your robots.txt has absolutely no effect on your site:

    Disallow: /comments/

    What this line is telling the user-agent to do is to not crawl this specific page (or URL): http://www.twentysteps.com/comments/

    That URL does not even exist in your site, so whether or not you disallow it is not going to make any difference (unless you do actually create such a page in the future).

    Before anyone uploads a robots.txt file to their server, they should analyze it via Google Webmaster Tools. Their “Analyze robots.txt” tool allows you to test it without actually having a live robots.txt on your site.

    Mike, if you had tested your site, you would have noticed that even though you thought you were blocking comments, this URL would still be allowed:

    http://www.twentysteps.com/creating-the-ultimate-wordpress-robotstxt-file/#comments

    Also, if you pay attention to your comment URLs, you will notice that a URL such as:
    http://www.mydomain.com/most-excellent-post/comment543

    does not exist on a WordPress blog (at least not in the default installation). All comment links use the fragment identifier (#) and bots stop crawling at the "#". If you view the source of this page, or if you hover your mouse over the dates of each comment, you will see that each comment has a unique id and an href of "#comment-id". For example, to jump directly to Mike's first comment, you would use this URL:
    http://www.twentysteps.com/creating-the-ultimate-wordpress-robotstxt-file/#comment-2481

    but the robot stops crawling when it reaches the "#", so you don't have to worry about duplicate content from comment links.

  • Mike // May 7, 2008 at 11:16 am

    I must admit that I’d forgotten that the /comments/ directory was even in there.

    Some very valid points regarding testing using Webmaster Tools which I think I’ve mentioned in subsequent articles. Perhaps it’s time to revisit the whole article and put everything together in one place.

    Thanks for the feedback. Much appreciated.

  • John Pash // Jun 29, 2008 at 2:05 pm

    I’ve recently put a line in my robots.txt to show spiders where my sitemap and feed are located. Reading these comments, I’m beginning to wonder if I should maybe remove them.

  • Mike // Jun 30, 2008 at 2:07 pm

    Hi John. I’d say leave in the sitemap reference but drop the rss feeds. Big G will find and index your feeds anyway via your onpage links.

  • Aaron // Jul 9, 2008 at 8:36 pm

    Mike, it’s been over a year since you wrote this post. How is the “ultimate” robots file working for you? Great post, I agree about the need to test what works for you. This post is definitely worth a stumble.

  • Mike // Jul 11, 2008 at 12:32 am

    Blimey. Has it been a year already? Time flies, eh?

    Well in answer to your question I noticed an upturn in search traffic when I initially implemented the changes. What muddies the water a little bit is that I then made some further tweaks to the onpage SEO factors and that might have made a change but fundamentally I think the robots.txt I created is probably still a good place to start.

    Thanks for dropping by and I see you’ve just moved to DoFollow on your blog. Will stop by and have a proper look round when I get a chance.

  • Andy - Creative Caravan // Jul 23, 2008 at 8:18 am

    Thanks for an excellent post. I have been having a nightmare trying to get my wordpress blog and robots file optimised correctly. There’s a lot of conflicting information out there which has been confusing me and the fact that Google’s search results are bouncing up and down at the moment certainly doesn’t help matters.

    Rest assured I’m going to be going back to basics, starting again and reviewing things, so I’ll let you know how I get on.

  • Mike // Jul 23, 2008 at 12:14 pm

    Using a simple robots.txt to stop duplicate content (i.e. date archives, category archives, author archives) is probably as much as most folks will ever need.

    Thanks for dropping by, Andy. Look forward to hearing how you get on.

  • Chris Spires // Jul 31, 2008 at 2:58 am

    Ok, that clears things up. Robots.txt files have always perplexed me. I’m sure this will help my internet business wordpress blog.

  • anomtejo // Aug 21, 2008 at 3:51 am

    Nice info! I will implement it on my WordPress blog.

    thanks a lot! :)

  • SEOGranted // Aug 21, 2008 at 6:55 pm

    Ok, this clears things up. I found the plugins I had installed for my SEO blog weren’t that efficient. Thanks Twenty Steps, keep posting the same great material.

  • Webagentur // Aug 26, 2008 at 8:03 am

    Thank you, this tutorial has helped me a lot.

  • UKJim // Aug 27, 2008 at 2:23 pm

    Moncef’s comment about the /comments/ entry is generally correct, but I have a feeling, Mike, that you included it because of the RSS feed for comments.

    If you include the META portion in the sidebar on later WordPress blogs, it will show two links labelled “Entries RSS” and “Comments RSS” with links like…
    http://www.mydomain.com/feed/
    http://www.mydomain.com/comments/feed/

    So adding an entry for /comments/ or more specifically /comments/feed/ can be used to allow or disallow the Comments RSS feed.
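
    In robots.txt terms that entry would be something along these lines (assuming the default /comments/feed/ path shown above):

    # stop bots crawling the comments RSS feed
    User-agent: *
    Disallow: /comments/feed/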

  • UKJim // Aug 27, 2008 at 2:41 pm

    One thing I have been trying to research is what to do in the situation of a WordPress blog on a Windows 2003 Server running under IIS6, which does NOT support mod-rewrite.

    Mod-rewrite add-ons can be obtained, but the simplest fix is to amend php.ini, as follows…
    http://codex.wordpress.org/Permalinks#Permalinks_without_mod_rewrite

    So the PATHINFO "Almost pretty" WordPress permalinks are used instead, but they have to have the text "/index.php/" in the permalink structure.

    Nobody seems to have discussed this anywhere with regard to robots.txt.

    Typically this results in permalinks like:-
    http://www.mydomain.com/wordpress/index.php/my-blog-post-title/

    So presumably we need to modify the robots.txt to include the index.php portion?

    e.g.
    Disallow: /wordpress/index.php/author/
    Disallow: /wordpress/index.php/tags/
    etc.

    Anyone got any experience of this?

  • Mike // Aug 28, 2008 at 8:53 am

    Jim – You’re probably right on the /comments/ part. To be honest it’s been so long since I put this together that I’ve forgotten what’s in there and why!

    I can’t help you with your other query, I’m afraid. I’ve not used IIS for over 8 years now. I started to write some suggestions and then realised all of them related to Linux. Sorry.

  • Robots.txt File Link Analysis // Oct 27, 2008 at 10:26 am

    Indeed, the robots.txt file is very important in terms of SEO. You can find this file under the root directory of the domain: http://domain.com/robots.txt. As you mention the robots meta tag, it is also very important for link analysis. However, I don't see many people who understand this fact. You may wish to see link analysis with the Robots Meta tag.

  • arshad // Feb 3, 2009 at 12:14 pm

    This is definitely a nice way to avoid duplicate content issues. Thanks for sharing.

  • Josh Galvan // Feb 15, 2009 at 3:42 pm

    I am updating the robots.txt on my company site as well as our client sites and looking around to see how others are doing this.

    If you use WordPress conditional statements in your header to correctly nofollow and follow the appropriate posts, pages, categories, tags, etc., then I don't see where all the confusion on duplicate content is.

    The robots.txt file should be used for the admin-oriented files (wp-content, wp-admin, wp-includes, etc.) only. And use the meta noindex,follow tags in the header for the rest.

    This prevents the huge issue of dangling pages and diluting pagerank.
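
    Just to illustrate, the admin-only robots.txt I have in mind would be a sketch along these lines (default directory names assumed):

    # block the WordPress back-end directories; handle the content pages with meta tags instead
    User-agent: *
    Disallow: /wp-admin/
    Disallow: /wp-includes/
    Disallow: /wp-content/

    (One caveat: uploaded images usually live under /wp-content/uploads/, so blocking the whole of /wp-content/ will also keep them out of image search.)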

  • Mike // Feb 16, 2009 at 1:42 pm

    Thanks for stopping by and leaving your comment, Josh.

    What you’re suggesting would certainly work but I think the solution would bring a lot of people out in a cold sweat. I’m sure I have seen plugins which allow conditional statements and, in truth, you can do it yourself on your specific php files. However I wonder how many regular WP users would go down that route?

    IMHO editing the robots.txt file is a quicker way of doing things. You can see the potential results right in front of you and can also check very quickly via Webmaster Tools if a change you’ve made could mess things up.

  • Psicologo // Mar 8, 2009 at 6:35 pm

    Thanks for this, I needed to figure out how to get Google stop indexing my duplicate content.

  • ender saraç // Apr 4, 2009 at 6:10 am

    Keep up the excellent work! Your website helps to keep me from boredom as well.

  • DSMBlogger // Apr 8, 2009 at 1:40 pm

    Thank you Mike for this post. I started a blog site a few weeks ago and finally made it ready for publishing; however, I was having problems getting it indexed the right way by Google, so I started digging into the robots.txt issue. Your post helped me understand this process better.

    Regards
    dsmb

Trackbacks

  1. fiLi's tech
  2. WordPress robots.txt optimized for SEO
  3. Revisiting robots.txt | Twenty Steps
  4. Robots and WordPress | ChillyCool Web Digger
  5. June Is Busting Out All Over
  6. How To Validate Your Robots.txt File · Make Money Online With CMB
  7. Robots.txt - When Pros Get It Wrong
  8. Does DoFollow Increase Your Comments?
  9. WordPress SEO Techniques to Avoid Duplicate Content | Better Blogging with Michael Martine
  10. Keep A Test Site Of Your WordPress Blog
  11. WordPress robots.txt file optimized for SEO and Google
  12. Updated WordPress robots.txt example optimized for Google and SEO
  13. Google and the Marching Robots : JoeyPrimiani.com
  14. Sécuriser Son Blog WordPress [Etape 1] - Robots.txt | FabNet Revenue
  15. Add Robots.txt to get traffic increase | Sha Money Maker
  16. LTNS
  17. WordPress robots.txt file optimized for SEO and Google | Mustilife Paylaşım Blogu
  18. Rebecca’s Blog » Blog Archive » Google SEO for Your Blog
  19. 29 easy ways to fine tune your blog » malcolm coles
  20. The Obligatory Round Up Post | Twenty Steps