One of the things I’d been meaning to do for some time was to get round to creating a robots.txt file for this site. I’d looked at various sources of information including A Standard For Robot Exclusion and How to create a robots.txt file but was still unsure as to which bits should be excluded so, in my usual fashion, I didn’t bother doing anything at all!
One of the main reasons for creating a robots.txt file is to prevent the search engines from reaching your content from more than one location (i.e. in your monthly archives, your category folders, your XML feed and on your front page) because this could lead to duplicate content issues. Another reason is that you may have a private directory which you don’t want the world to read, but that’s something for Big G to explain. Today we’re only looking at the SEO reasons for creating the file, so with that in mind, what should be included?
There seem to be a number of different viewpoints on what should and shouldn’t be included in the robots.txt file. Some say that you should include the wp-content, wp-admin, wp-includes and feed information whereas others say that they’re fine to index but don’t let the Googlebot anywhere near your archives. Another school of thought is to disallow access to your images folder, whilst others warn you that your AdSense could go belly up if you make sweeping changes to your robots file. In fact, no matter where you look, people are giving conflicting advice on the right way to create a robots.txt file for WordPress. No wonder I hadn’t done anything about it until now!
If you were creating a regular web site you’d include robot meta tags to prevent Google from indexing certain pages. However, you don’t get this option out of the box with WordPress because all of the meta data is contained within only one file (header.php) which appears on every single WP page, so any blanket “noindex,nofollow” rule added there would be applied across the whole site and you wouldn’t get indexed at all.
Taking it back to basics, the reason for creating the darned thing in the first place is to prevent Google from potentially ditching your content into the supplemental index, so what should you include in the file to prevent that from happening?
By including the bare minimum in your robots.txt file.
Why do I say that? Well, because unless you know your WordPress install inside out, you could end up shooting yourself in the foot. Every theme, plugin and tweak you’ve made affects how your site is structured. If you change any of these and don’t update your robots.txt file to match, you could end up seriously screwing yourself over.
A more sensible way of preventing duplicate content is to be more disciplined in the way that you structure your site. Michael Gray comes up with some excellent advice in his video blog about making WordPress search engine friendly, and that is to apply only one category per post. I’d not even thought about that before but he’s right. Previously I was applying categories all over the shop, which meant that Google could find the same content in half a dozen different places. So now I’m only using one category per post and will tidy up my archives shortly.
I’ve decided to implement a very basic version of the robots.txt file for the time being and will review the results in a few weeks’ time. I decided to keep the images folder accessible to Google because I do get some traffic through image search and whilst it’s not uber sticky, it’s still traffic at the end of the day. Equally, I specifically disallowed Googlebot from indexing /page/ while still allowing the AdSense bot to look through archived pages.
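For illustration only, here’s a rough sketch of what a bare-minimum file along those lines might look like, checked with Python’s standard urllib.robotparser module. The exact rules are an assumption rather than a copy of the file used on this site. Mediapartners-Google is the user-agent token used by the AdSense crawler, and the sample paths are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# ASSUMPTION: a guess at a "bare minimum" robots.txt matching the description
# above. It is not the actual file from this site.
ROBOTS_TXT = """\
User-agent: Mediapartners-Google
Disallow:

User-agent: Googlebot
Disallow: /page/

User-agent: *
Disallow:
"""


def build_parser(text):
    """Load robots.txt rules from a string instead of fetching them over HTTP."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Hypothetical paths, chosen only to exercise each rule.
checks = [
    ("Googlebot", "/page/2/"),                      # paged archive: blocked
    ("Googlebot", "/wp-content/uploads/pic.jpg"),   # images: still allowed
    ("Mediapartners-Google", "/page/2/"),           # AdSense bot: allowed
]

for agent, path in checks:
    verdict = "allowed" if parser.can_fetch(agent, path) else "blocked"
    print(f"{agent:22} {path:30} {verdict}")
```

An empty Disallow line means "nothing is off limits", so only Googlebot is kept out of /page/ and everything else stays open. Running a quick check like this before uploading the file is one way to avoid locking out more than you intended.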
I’m still confused as to whether to disallow access to the categories or dated archives to prevent possible issues. Looking at my Google data, I can’t work out which bit Big G doesn’t like so I’m going to implement the basic version of the robots.txt file first and then see what happens from there.
I may well have got it horribly wrong - as I’ve said before, I’m no expert - but it’s a start, and a key part of Twenty Steps is reporting my mistakes so you don’t have to make them! I’m pretty sure we’ll be revisiting this ol’ chestnut again.
If you’ve not had enough already of reading about robots and meta tags, here’s some more reading for you:
Meta Robots Tag 101: Blocking Spiders, Cached Pages & More
Google Webmaster Central - All about Googlebot
Inside Google sitemaps - Analyzing a robots.txt file
Official Google Blog - Robots Exclusion Protocol
UPDATE: I’ve tinkered around with the robots.txt file and you can read about the changes made by following this link