Mar 3 2008
Jeremy Schoemaker

Wordpress robots.txt tips against duplicate content

By Jeremy Schoemaker 88 comments

Been getting some questions about my robots.txt file and what certain things do.

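Thankfully some regex-style pattern matching is supported in robots.txt (but not much): the major search engines understand the * wildcard and the $ end anchor.

$ in regex means the end of the string; in robots.txt that is the end of the URL. So if you put .php$ in your robots.txt, it will match any URL that ends in .php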

This is really handy when you want to block all .exe, .php, or other files. For example:

Disallow: /*.PDF$
Disallow: /*.jpeg$
Disallow: /*.exe$
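To make the * and $ behaviour concrete, here is a minimal Python sketch of how a crawler might translate one of these patterns into a real regular expression. It assumes the wildcard extension described above (only * and a trailing $ are special, everything else is literal); the function names and the sample paths are made up for illustration, and real crawlers may differ in the details.

```python
import re

def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex.

    Assumption: '*' matches any run of characters and a trailing '$'
    anchors the pattern to the end of the URL. Everything else,
    including '?' and '.', is treated as a literal character.
    """
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    # Escape each literal piece, then rejoin the pieces with '.*'.
    body = ".*".join(re.escape(piece) for piece in core.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))

def is_match(pattern: str, url_path: str) -> bool:
    # Rules match from the start of the path, so use match(), not search().
    return robots_pattern_to_regex(pattern).match(url_path) is not None

if __name__ == "__main__":
    # Hypothetical paths, only here to exercise the pattern.
    for path in ["/files/report.PDF",
                 "/files/report.pdf",
                 "/files/report.PDF?page=2"]:
        print(path, is_match("/*.PDF$", path))
```

Note that the match is case-sensitive, so under this sketch /*.PDF$ does not cover a lowercase .pdf, and the $ stops the rule from matching once a query string follows the extension.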

Specifically, these are some of the things I use in my robots.txt:

Disallow: /*? - This blocks all URLs with a ? in them. A good way to avoid duplicate content issues with WordPress blogs. Obviously you only want to use this if you have changed your URL structure so that it is not 100% ?= query strings.

Disallow: /*.php$ - This blocks all .php files. Another good way to avoid duplicate content with a wordpress blog.

Disallow: /*.inc$ - you should not be showing .inc or include files to bots (google code search will eat you alive)

Disallow: /*.css$ - Showing CSS files for indexing seems silly. The wildcard is used here in case there are many CSS files.

Disallow: */feed/ - Feeds being indexed dilute your site equity. The wildcard * is used in case there are preceding characters.

Disallow: */trackback/ - No reason a trackback URL should be indexed. The wildcard * is used in case there are preceding characters.

Disallow: /page/ - assloads of duplicate content in pages for wordpress.

Disallow: /tag/ - more duplicate content.

Disallow: /category/ - even more duplicate content.
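As a rough way to sanity-check a rule list like the one above before uploading it, the following Python sketch runs some sample paths through the Disallow patterns. It is an illustration only: the matcher assumes the * and trailing $ extension, and the sample WordPress paths are hypothetical, not taken from any real site.

```python
import re

# The Disallow patterns listed in the post.
DISALLOW = ["/*?", "/*.php$", "/*.inc$", "/*.css$",
            "*/feed/", "*/trackback/", "/page/", "/tag/", "/category/"]

def matches(pattern: str, path: str) -> bool:
    """Assumed semantics: '*' is a wildcard, trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(piece) for piece in core.split("*"))
    return re.match(regex + (r"\Z" if anchored else ""), path) is not None

def blocking_rule(path: str):
    """Return the first Disallow pattern that matches path, or None."""
    for pattern in DISALLOW:
        if matches(pattern, path):
            return pattern
    return None

if __name__ == "__main__":
    # Hypothetical paths for a blog using date-based permalinks.
    samples = ["/2008/03/03/robots-txt-tips/",
               "/2008/03/03/robots-txt-tips/feed/",
               "/?p=123",
               "/tag/seo/",
               "/index.php",
               "/wp-content/themes/style.css"]
    for path in samples:
        print(f"{path:40} -> {blocking_rule(path) or 'crawlable'}")
```

Running a handful of your own permalinks through a check like this is a cheap way to confirm the post URLs themselves stay crawlable.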

So what if you want to ALLOW a page? For instance, my SERPs tool is serps.php, and under the above rules it would be blocked.

Allow: /serps.php - this does the trick!
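Why the Allow line wins can be sketched in a few lines of Python. This assumes the longest-match precedence that Google documents (the longest matching pattern wins, and Allow wins a tie); some older crawlers instead applied the first matching line, so treat this as one possible interpretation rather than how every bot behaves. The function names are illustrative.

```python
import re

# The two rules in play: the blanket .php block and the exception.
RULES = [("disallow", "/*.php$"), ("allow", "/serps.php")]

def matches(pattern: str, path: str) -> bool:
    """Assumed semantics: '*' is a wildcard, trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(piece) for piece in core.split("*"))
    return re.match(regex + (r"\Z" if anchored else ""), path) is not None

def is_allowed(path: str, rules=RULES) -> bool:
    """Longest matching pattern wins; on a tie, allow wins.

    A path that matches no rule at all is allowed.
    """
    best = None  # (pattern length, is_allow) of the strongest match so far
    for kind, pattern in rules:
        if matches(pattern, path):
            candidate = (len(pattern), kind == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]

if __name__ == "__main__":
    for path in ["/serps.php", "/index.php", "/about/"]:
        print(path, "allowed" if is_allowed(path) else "blocked")
```

Under this rule /serps.php (10 characters) outranks /*.php$ (7 characters), so the tool page stays crawlable while every other .php URL remains blocked.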

Keep in mind I am not an SEO but I have picked up a few tricks along the way.

  1. bob said on March 3rd, 2008 at 5:14 am

    I never mess around with this stuff, but does duplicate content reduce how well your site ranks overall?

  2. Keith Cash said on March 3rd, 2008 at 5:20 am

    Some really good info, if you are not an SEO you are pretty darn close

  3. bob c said on March 3rd, 2008 at 5:23 am

    I’m about to implement this (how does it look?):
    User-agent: *
    Disallow: /cgi-bin
    Disallow: /wp-admin
    Disallow: /wp-includes
    Disallow: /wp-content/plugins
    Disallow: /wp-content/cache
    Disallow: /wp-content/themes
    Disallow: /trackback
    Disallow: /comments
    Disallow: /category/*/*
    Disallow: */trackback
    Disallow: */comments
    Disallow: /*?*
    Disallow: /*?
    Allow: /wp-content/uploads

  4. ShoeMoney said on March 3rd, 2008 at 5:30 am

    bob not sure why you need the extra /*/* after category. just /category/ should get that and all sub directories of category.
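The point about /category/ can be checked with a small Python sketch. It assumes rules are prefix matches with * as a wildcard, and the category paths are hypothetical examples.

```python
import re

def matches(pattern: str, path: str) -> bool:
    """Assumed semantics: prefix match, '*' wildcard, trailing '$' anchor."""
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(piece) for piece in core.split("*"))
    return re.match(regex + (r"\Z" if anchored else ""), path) is not None

# Hypothetical category URLs at different depths.
PATHS = ["/category/", "/category/seo", "/category/seo/",
         "/category/seo/page/2/"]

if __name__ == "__main__":
    for path in PATHS:
        short = matches("/category/", path)
        long_form = matches("/category/*/*", path)
        print(f"{path:26} /category/={short}  /category/*/*={long_form}")
```

In this sketch every path the longer rule catches is also caught by plain /category/, and the shorter rule additionally covers the category index itself.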

  5. Ian said on March 3rd, 2008 at 5:44 am

    Thanks for the tips shoe. A lot of people don’t realize how much duplicate content on your site can really hurt you.

  6. bob c said on March 3rd, 2008 at 6:01 am

    Thanks, I copied that from another blog so I’ll fix that.

  7. RacerX said on March 3rd, 2008 at 6:15 am

    Big Help! Thanks. This should help a bunch.

  8. RacerX said on March 3rd, 2008 at 6:16 am

    I am not an SEO, but I play one on the internet…

    If Shoe isn’t an expert…he is the closest thing that will talk to us!

  9. Arejay said on March 3rd, 2008 at 6:30 am

    Very Nice post! We all know how many sites leave this simple step out (like the ebook sales people, where u do a simple site: search and u find the members download area). You put it out plain and simple!!! Don’t you find it funny how people who are non seo people like you, make more money than the seo people. LOL. Have a fantastic week Shoe and everyone else! Make that $$$$$

  10. brad said on March 3rd, 2008 at 6:38 am

    thx for the great tips for the robots.txt and wordpress blogs

  11. Michelle said on March 3rd, 2008 at 6:49 am

    Thanks for the excellent tips Shoe. One of my blogs had been performing amazingly until Google decided to hate it last week. These tips are just what I need to try and work out if it’s a duplicate content issue..

  12. TheMadHat said on March 3rd, 2008 at 7:08 am

    I disagree with this assessment on some level. Sure, you don’t want duplicate content and it will negatively impact your site, but using the robots.txt file to fix the problem wouldn’t be my way to go.

    The robots file tells Google not to even crawl the page. A better scenario would be to use the meta noindex and follow. This tells Google not to index the page, but it can and will still accumulate link juice to pass it on (unless this page is a dead end, then it’s pointless).

    See this interview with Matt from a few months ago for a little more in-depth conversation.

  13. Solo Programmer said on March 3rd, 2008 at 7:15 am

    I have the all-in-one seo pack which applies noindex, nofollow meta tags on the actual archive/category/tag pages. I wonder if this is still worth doing but I guess it can’t hurt.

  14. Hustle Strategy said on March 3rd, 2008 at 8:12 am

    it can.

  15. Mayank Rocks said on March 3rd, 2008 at 8:35 am

    Thanks a lot for the tips Jeremy

  16. Mayank Rocks said on March 3rd, 2008 at 8:37 am

    I agree there totally with the above person.

  17. Paid Surveys Reviewed said on March 3rd, 2008 at 8:39 am

    Thanks for that, really need to get to grips with this robots stuff, I am sure it helps with SEO although don’t quite understand how. :-)

  18. Money Blog said on March 3rd, 2008 at 9:22 am

    thanks, very helpful

  19. Exposed SEO said on March 3rd, 2008 at 9:40 am

    lol at all the spammy comments. “I totally agree with everyone” lol

  20. eMarketing Chat said on March 3rd, 2008 at 9:50 am

    This is very helpful! Thanks for sharing.

  21. Guy said on March 3rd, 2008 at 9:53 am

    Disallow /category/ is a good one to add. Just make *extra* sure your Permalink structure isn’t set up to include “category”, otherwise nothing will be indexed.

    To help reduce DC, I also recommend blocking the archives (just add a new line for each year your blog has been online)

    # Block Duplicate Content from Archives
    Disallow: /2006/
    Disallow: /2007/
    Disallow: /2008/

    I also have this
    Disallow: /*?*

    instead of this;
    Disallow: /*?

  22. TheOfficeCubicle said on March 3rd, 2008 at 10:02 am

    Thanks Shoe! I appreciate all you have done.

    :)

  23. Guy said on March 3rd, 2008 at 10:05 am

    Blocking /category/ is a good one. Just need to be careful that your Permalink structure isn’t setup to include “category” — otherwise nothing will get indexed.

    I also use the following to block the archives. Just add a new line for each year your blog has been online.

    # Block Duplicate Content From Archives
    Disallow: /2006/
    Disallow: /2007/
    Disallow: /2008/

    One more is that I use;
    Disallow: /*?*

    instead of;
    Disallow: /*?
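Whether /*?* and /*? actually behave differently can be tested with a short Python sketch. It assumes rules are prefix matches with * as a wildcard, in which case a trailing * adds nothing; the sample paths are made up, and a crawler with different matching rules could disagree.

```python
import re

def matches(pattern: str, path: str) -> bool:
    """Assumed semantics: prefix match, '*' wildcard, trailing '$' anchor."""
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(piece) for piece in core.split("*"))
    return re.match(regex + (r"\Z" if anchored else ""), path) is not None

# Hypothetical paths with and without query strings.
SAMPLE_PATHS = ["/?p=123", "/index.php?cat=4&paged=2",
                "/2008/03/03/robots-txt-tips/", "/search?"]

def rules_agree(rule_a: str, rule_b: str, paths) -> bool:
    """True when both rules give the same verdict on every path."""
    return all(matches(rule_a, p) == matches(rule_b, p) for p in paths)

if __name__ == "__main__":
    print("same verdicts:", rules_agree("/*?", "/*?*", SAMPLE_PATHS))
```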

  24. Homefinding Book said on March 3rd, 2008 at 10:53 am

    Great tutorial - more of this please! No matter what you say, its pretty good SEO stuff.

  25. Paul said on March 3rd, 2008 at 11:02 am

    Thank you for the tips.

  26. Terry Tay said on March 3rd, 2008 at 11:09 am

    Excellent post Jeremy! Every single day I’m learning something new from you it seems. Just the other day with the link rel= and now today with the robot.txt file.

    I’ve just read the basics about the robot.txt file and never really thought much more into it. It’s good we have people like you helping us out along the way.

    Thanks!
    ~Terry

  27. jtGraphic said on March 3rd, 2008 at 11:12 am

    Thanks for the tip. I guess I have the same question as someone above. How does duplicate content hurt your ranking? Is it a consequence of PR being spread across multiple pages - or is it just a case of being penalized for duplication? I’ll have to do more research. Thanks again.

  28. Deibson Albernas said on March 3rd, 2008 at 11:37 am

    yes ok, Thanks, use in 28 blogs maide in brasil

  29. anty said on March 3rd, 2008 at 11:40 am

    Interesting that the question mark doesn’t have to be escaped. Normally a question mark would be a RegEx meta character, but I just looked it up in the Google guidelines: a question mark is treated as a regular character.
    An important note: Not every crawler understands RegEx in the robots.txt. So you are “protecting” your sites against the major search engines, but not from normal bots. This is ok to avoid duplicate content, I guess.

  30. anty said on March 3rd, 2008 at 11:43 am

    I wonder if Google isn’t already good at detecting a wordpress installation and can therefore react on the duplicate content accordingly (like ignoring part of the sites, indexing after a schema normal wp blogs will follow)… Just a thought :)

  31. oakling said on March 3rd, 2008 at 11:44 am

    OMG. Will this keep spammers from doing that obnoxious thing where they copy a whole journal entry (or the majority of one) into their fake blogs, making it look like they are quoting it (“Someone said something great over at blahblahblah dot com, ‘entire post here,’”) with no other content? Just to get on google and steal my links? I’m sure they’re using robots at some stage….

  32. ShoeMoney said on March 3rd, 2008 at 11:46 am

    well.. just had a conversation with mr cutts about this and many other things 3 days ago.

    You are getting the Disallow and noindex tags confused in the robots.txt. Disallow will still let the bots visit and index them but not take in the content.

  33. ShoeMoney said on March 3rd, 2008 at 11:46 am

    well its not really true regex… its just a somewhat adaptation

  34. ShoeMoney said on March 3rd, 2008 at 12:19 pm

    I doubt its going to keep spammers out ;)

  35. TheMadHat said on March 3rd, 2008 at 12:40 pm

    Agreed that disallow will allow the bots to visit but not take the content. Maybe I said this wrong.

    Say for example you’ve got links coming into a page you’ve disallowed in robots.txt. This wastes any link juice that (linking) page is giving you. Using “meta noindex” will allow the bots to follow the links on the “meta noindexed” page and pass on the link juice, and also alleviate any dup issues.

    So has he changed his stance on a “meta noindexed” page accumulating and passing page rank? On a robots disallowed page the bots won’t take the content, so there will be nowhere to pass page rank to.

    The way I understand it is this:

    meta noindex - don’t index but follow and pass pr
    meta nofollow - index but don’t follow links or pass any pr on entire page
    href nofollow - don’t pass pr on that link
    robots disallow - don’t index or follow or pass pr (they can reference the url still, just without content there is nowhere to pass any link juice).

  36. Syed Balkhi said on March 3rd, 2008 at 1:10 pm

    Great list of tips shoe … i can bet this helps alot.

  37. Syed Balkhi said on March 3rd, 2008 at 1:10 pm

    yeah nothing keeps them out

  38. Gary R. Hess said on March 3rd, 2008 at 1:50 pm

    Matt Cutts says it does.

  39. Gary R. Hess said on March 3rd, 2008 at 1:59 pm

    For smaller blogs this might not be the best thing to do when it comes to SEO. If implementing everything this way, you are relying on Google to find older posts (if they don’t have links to them) by going directly through the homepage. Requiring Google to go back 20 pages to find an article is a good way to end up in the supplemental index (which, of course they claim doesn’t exist anymore, but IMO it does).

  40. Affiliate Confession said on March 3rd, 2008 at 2:01 pm

    Thanks for the list and explaining it. I need to add a robots.txt file to my blog.

  41. Douglas Karr said on March 3rd, 2008 at 2:42 pm

    Thanks for these tips - I hadn’t even thought of leveraging the robots file against duplicate content (much easier than disabling those features!). Thanks!

  42. Tom Beaton said on March 3rd, 2008 at 3:29 pm

    I shall have to take another look at my robots.txt!

  43. David Harrison said on March 3rd, 2008 at 3:56 pm

    Typo in the title? Or am i seeing things

  44. Squeaky said on March 3rd, 2008 at 4:14 pm

    Thank you for posting these tips for WordPress on the robot.txt file.

  45. Charlie said on March 3rd, 2008 at 4:32 pm

    Bob,
    Why do you need to block /comments/, I thought having comments indexed would be a good thing. This is new to me so any pointers would be great.

    Thanks.

  46. Uzair said on March 3rd, 2008 at 4:47 pm

    Thats great. But don’t you think you are getting off topic.

  47. Uzair said on March 3rd, 2008 at 4:50 pm

    It does. Duplicate content ruins your site.

  48. Uzair said on March 3rd, 2008 at 4:52 pm

    You can also use
    Disallow: /wp
    instead of all those others like
    Disallow: /wp-admin
    Disallow: /wp-includes
    Disallow: /wp-content/plugins
    Disallow: /wp-content/cache
    Disallow: /wp-content/themes
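Before collapsing the list into a single Disallow: /wp, it is worth checking what else that prefix would catch. The Python sketch below compares the two rule sets using plain prefix matching; the sample paths (including the /wpmu-review/ post slug) are hypothetical.

```python
# Wildcard-free rules are plain prefix matches, so startswith() is enough here.
SPECIFIC = ["/wp-admin", "/wp-includes", "/wp-content/plugins",
            "/wp-content/cache", "/wp-content/themes"]
BROAD = ["/wp"]

def blocked(path: str, rules) -> bool:
    """True if any rule is a prefix of the path."""
    return any(path.startswith(rule) for rule in rules)

# Hypothetical paths: an admin page, an uploaded image, the login page,
# and a post whose slug happens to start with "wp".
PATHS = ["/wp-admin/options.php",
         "/wp-content/uploads/2008/03/chart.png",
         "/wp-login.php",
         "/wpmu-review/"]

if __name__ == "__main__":
    for path in PATHS:
        print(f"{path:42} specific={blocked(path, SPECIFIC)} "
              f"broad={blocked(path, BROAD)}")
```

In this sketch the single /wp rule also blocks uploaded images and any post whose slug starts with "wp", which the longer list leaves crawlable, so the shortcut is not an exact substitute.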

  49. Affiliate Confession said on March 3rd, 2008 at 5:26 pm

    I was wondering that as well. They can find and react to all sorts of things, I think they would know about WP installs and the issues it has.

  50. Dexter | Techathand.net said on March 3rd, 2008 at 9:20 pm

    This is only applicable if your Permalink is not structured to have the year in it. Or else this will result in a mess..

  51. Dexter | Techathand.net said on March 3rd, 2008 at 9:23 pm

    Hello to all I just want to share my post regarding Robots.txt that really helps my site

  52. Reynder (SEO) said on March 3rd, 2008 at 9:25 pm

    Thanks! Very useful again. Avoiding duplicate content really helped me ranking well.

  53. Too Much Vodka said on March 3rd, 2008 at 11:00 pm

    Well, this seems to be a better version than all noindex plugins going around. Def. will give it a try!

  54. John said on March 4th, 2008 at 2:11 am

    Very helpful post for me as I have been looking how to use the robots.txt file in this way for some time.

  55. Nullamatix said on March 4th, 2008 at 4:56 am

    I didn’t initially include a robots.txt in my blog and never had any issues with dupe content. It wasn’t until just recently I decided to add one, more for experimental purposes. So far, search engine traffic hasn’t improved or declined either way. Wordpress out of the box isn’t great for SEO purposes, but with minor tweaks, I find that a robots.txt isn’t really necessary.

    -Guy
    http://www.nullamatix.com

  56. Nullamatix said on March 4th, 2008 at 4:59 am

    Uzair,

    How is this off topic? If Shoe thinks a robots.txt will help in SERPs, your site will get more traffic, and ultimately earn more cash. Isn’t that one of the focuses of this blog? “Skills to Pay the Bills” right?

    -Guy
    http://www.nullamatix.com

  57. Nullamatix said on March 4th, 2008 at 5:00 am

    Um, no. The only way to prevent those types of attacks would involve IP based content delivery.

  58. RacerX said on March 4th, 2008 at 5:47 am

    Do you have some before /after stats you can share? I understand the penalty, but just want to understand how it improves.

  59. Yiwu said on March 4th, 2008 at 5:13 pm

    Ya,I dont use Disfollows..

  60. Yiwu said on March 4th, 2008 at 5:15 pm

    Why my post cann’t be displayed.

  61. Too Much Vodka said on March 4th, 2008 at 10:04 pm

    I agree, disallowing category and page is not the smartest move to let google find old content.

  62. Andy Beard said on March 5th, 2008 at 4:20 am

    Shoe is making an “SEO Linking Gotcha”

    All the pages blocked with robots.txt will still gather juice and can still rank

    Simple proof is that my Wordpress SEO Masterclass page is still ranking after being blocked by robots.txt for a couple of weeks as it was written as a paid post - actually it is ranking higher than Joost’s similar page.

    This article explains why so many people have got this wrong for years
    http://andybeard.eu/2007/11/seo-linking-gotchas-even-the-pros-make.html

    It gets worse when people start mixing this kind of advice with their “All in one SEO” because the noindex statements added don’t get seen by googlebot.
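The interaction described here can be sketched in Python as a toy crawl step. This is a simplified model, not how any particular search engine is implemented: the crawl() function, its return shape, and the sample page are all assumptions for illustration. The idea is only that the robots.txt check happens before the fetch, so a blocked page's meta tags are never read.

```python
from html.parser import HTMLParser

class MetaRobotsParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip())

def crawl(path: str, html: str, disallowed_prefixes):
    """Toy crawl step: consult robots.txt first, parse the page only if allowed."""
    if any(path.startswith(prefix) for prefix in disallowed_prefixes):
        # The page is never fetched, so its meta robots tag is never seen.
        return {"fetched": False, "directives": set()}
    parser = MetaRobotsParser()
    parser.feed(html)
    return {"fetched": True, "directives": parser.directives}

# Hypothetical tag archive page carrying a plugin-generated noindex tag.
PAGE = ('<html><head><meta name="robots" content="noindex,follow">'
        '</head><body>tag archive</body></html>')

if __name__ == "__main__":
    print(crawl("/tag/seo/", PAGE, ["/tag/"]))  # blocked in robots.txt
    print(crawl("/tag/seo/", PAGE, []))         # not blocked
```

In this model, combining Disallow: /tag/ with a noindex plugin means the noindex is never read, which is the gotcha being pointed out.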

  63. Downloading... said on March 7th, 2008 at 5:08 am

    Thanks for this Jeremy. I have been looking for a good robots.txt file. I have no idea what to put in, so this will help.

  64. Secrets Of Cash Gifting said on March 7th, 2008 at 2:44 pm

    Thats good that they added it, duplication is bad.

  65. Erica DeWolf said on March 9th, 2008 at 6:45 pm

    Great post with some great descriptions of what these certain words will “do.” Thanks for the post!

  66. HardGeek said on March 11th, 2008 at 8:40 pm

    wow!!! Never knew that..??

  67. Chip said on March 13th, 2008 at 3:37 am

    Great tips, I’ll enhance my robots.txt file ASAP

  68. SEO hosting said on April 11th, 2008 at 5:50 pm

    Shoe, I just checked your actual robots.txt. Why do you have;

    Disallow: /sitemap.xml

    That seems like trouble?

  69. links for 2008-03-04 said on March 3rd, 2008 at 1:27 pm

    [...] Wordpress robots.txt tips against douplicate content - ShoeMoney® Some useful tips on updating your robots.txt file to avoid duplicate content problems with Wordpress. (tags: seo wordpress) [...]

  70. [...] Read more of this article at ShoeMoney.com [...]

  71. [...] Wordpress robots.txt tips against duplicate content [...]

  72. [...] a few days ago I read an article by Jeremy Shoemaker on this topic. He proposed using a robots.txt file, which is [...]

  73. Wordpress robots.txt tips against duplicate content said on March 4th, 2008 at 2:21 am

    [...] Wordpress robots.txt tips against duplicate content Disallow: /*? - this blocks all urls with a ? in them. A good way to avoid duplicate content issues with wordpress blogs. Obviously you only want to use this if you have changed your url structure to not be 100% ?=.   [...]

  74. ein-uwe.de » getunte Wordpress robots.txt said on March 4th, 2008 at 10:03 am

    [...] is. To avoid this effectively, I found a neat and, above all, quick solution at Shoemoney.com. He uses these [...]

  75. WP: Doppelten Inhalt vermeiden - im Designpicks Blog said on March 6th, 2008 at 4:05 pm

    [...] A few days ago Shoemoney reported on how to avoid such duplicates with a few entries in robots.txt. In this case the list of directives is tailored to Wordpress, but it can also be used for other systems (adjustments may be needed). [...]

  76. Weekly Links - March 7th | Vandelay Website Design said on March 7th, 2008 at 11:30 am

    [...] WordPress Robots.txt Tips Against Duplicate Content from Shoemoney. [...]

  77. [...] So should you use the plugins listed above or the robots.txt method? To understand the differences, you have to dive a little deeper into the SEO world: while the plugins described insert the noindex or nofollow syntax provided by Google into the header of the files concerned, the robots.txt variant ensures that the pages concerned are never accessed at all. Whether the two variants make a difference in practice is something SEO experts are currently arguing about - see also the discussion on the relevant post at Shoemoney. [...]

  78. 99 Ways to Improve Your Blog | PureBlogging said on March 10th, 2008 at 8:31 am

    [...] are less likely to suffer from the penalties of duplicate content. WordPress users should see the article on Shoemoney about robots.txt [...]

  79. Speedlinking - Back To The Basics » Derek Semmler dot com said on March 10th, 2008 at 10:17 pm

    [...] to basics” type post when he shared his tips on how to use the robots.txt in WordPress to prevent duplicate content. This is a great reference to use when editing your robots.txt to tweak your site and ensure you [...]

  80. meckator » Doppelten Content vermeiden said on March 25th, 2008 at 3:29 pm

    [...] Save as robots.txt. This has all been optimized for Wordpress. [...]

  81. Tweaking Your robots.txt File said on March 31st, 2008 at 12:49 pm

    [...] talks about his robots.txt file and how it guards against duplicate content in search engine results. Most of the strategies he’s using can be replicated for Movable Type and TypePad users. [...]

  82. 7 Ways To Improve SEO Optimization | Digital Tips said on April 9th, 2008 at 12:09 am

    [...] likely to suffer from the penalties of duplicate content. WordPress users should see the article on Shoemoney about robots.txt [...]

  83. Optimize WordPress for Search Engines with robots.txt said on April 24th, 2008 at 8:20 am

    [...] more tips on optimizing robots.txt for WordPress, check out Shoemoney’s suggestions. And keep in mind that like Shoemoney, I am not an SEO. I’ve just been using this method for [...]

  84. [...] Line: We looked at Josh’s robots.txt post, as well as at ShoeMoney’s robot.txt post to figure out what we want our robots.txt file to look [...]

  85. [...] Sort out your Robots.txt file to make sure Google doesn’t index that RSS or other content in Wordpress that can cause dupe content horror for you and your blog. Shoemoney said it better than I can here. [...]

  86. How to Setup a WordPress Blog | Niche Store Strategies said on July 31st, 2008 at 2:57 pm

    [...] Wordpress robots.txt tips against duplicate content [...]

  87. [...] Wordpress robots.txt tips against duplicate content [...]

  88. [...] to reading a post by Jeremy at ShoeMoney.com and reading about different User-agents, I was able to create a robots.txt file that will help [...]
