Friday, February 17, 2017

Google: Disable certain querystring in robots.txt

http://stackoverflow.com/questions/19113788/google-disable-certain-querystring-in-robots-txt



http://www.site.com/shop/maxi-dress?colourId=94&optId=694
http://www.site.com/shop/maxi-dress?colourId=94&optId=694&product_type=sale
I have thousands of URLs like the ones above, with different combinations and names. I also have duplicates of these URLs that carry the query string product_type=sale.
I want to stop Google from indexing anything with product_type=sale.
Is this possible in robots.txt?
This question appears to be off-topic because it is about SEO – John Conde Oct 1 '13 at 11:38

2 Answers

Accepted answer (15 votes)
Google supports wildcards in robots.txt. The following directive in robots.txt will prevent Googlebot from crawling any page that has any parameters:
Disallow: /*?
This won't stop many other spiders from crawling these URLs, because wildcards are not part of the original robots.txt standard.
Google may take its time removing the URLs that you have blocked from the search index. The extra URLs may still be indexed for months. You can speed the process up by using the "Remove URLs" feature in Webmaster Tools after they have been blocked. But that is a manual process where you have to paste in each individual URL that you want to have removed.
Using this robots.txt rule may also hurt your site's Google rankings if Googlebot doesn't find the version of the URL without parameters. If you commonly link to the versions with parameters, you probably don't want to block them in robots.txt. It would be better to use one of the other options below.

A better option is to use the rel canonical link tag on each of your pages.
So both your example URLs would have the following in the head section:
<link rel="canonical" href="http://www.site.com/shop/maxi-dress">
That tells Googlebot not to index so many variations of the page, only to index the "canonical" version of the URL that you choose. Unlike using robots.txt, Googlebot will still be able to crawl all your pages and assign value to them, even when they use a variety of URL parameters.
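If the pages are generated by a script, the canonical tag can be built from the request URL. The following is only a minimal Python sketch, assuming (as in the example above) that none of the query parameters change the page content, so the whole query string can be dropped. The function name canonical_link_tag is made up for illustration.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_link_tag(url: str) -> str:
    """Build a canonical <link> tag that points at the URL without its query string.

    Assumption: every query parameter is irrelevant to the page content.
    If a parameter does change the page, it should be kept in the canonical URL.
    """
    parts = urlsplit(url)
    # Keep scheme, host and path; drop the query string and the fragment.
    canonical = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return f'<link rel="canonical" href="{escape(canonical, quote=True)}">'


if __name__ == "__main__":
    for example in (
        "http://www.site.com/shop/maxi-dress?colourId=94&optId=694",
        "http://www.site.com/shop/maxi-dress?colourId=94&optId=694&product_type=sale",
    ):
        print(canonical_link_tag(example))
```

Both example URLs from the question produce the same tag, which is the point: every variation names one canonical page.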

Another option is to log into Google Webmaster Tools and use the "URL Parameters" feature that is in the "Crawl" section.
Once there, click on "Add parameter". You can set "product_type" to "Does not affect page content" so that Google doesn't crawl and index pages with that parameter.
Do the same for each of the parameters that you use that don't change the page.
   
This should be combined with the answer from moobot. I think it is fair to award him/her since they are new. – TheBlackBenzKid Oct 1 '13 at 11:22
   
If I added this to my robots.txt file: User-Agent: * Disallow: /flickering/*? Would that just disallow queries for pages under the flickering folder, while still indexing all the pages in that folder? – Richard Young Jun 1 '16 at 15:00
   
Looks right to me @Richard – Stephen Ostermiller Jun 1 '16 at 15:25
Yes, this is quite straightforward to do. Add the following line to your robots.txt file:
Disallow: /*product_type=sale
The leading wildcard (*) means that any URL containing product_type=sale will no longer be crawled by Google.
URLs that were already in Google's index may stay there, but Google will no longer crawl them, and when they show up in a Google search the result will say: A description for this result is not available because of this site's robots.txt – learn more.
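To check which URLs a wildcard rule would catch before deploying it, the matching can be approximated in a few lines of Python. This is only a rough sketch of Google-style matching as described in this thread (* matches any run of characters, a trailing $ anchors the end, rules match from the start of the path). It is not Google's actual implementation, and the helper names rule_to_regex and is_disallowed are invented here.

```python
import re
from urllib.parse import urlsplit


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a wildcard robots.txt path rule into a compiled regex.

    '*' becomes '.*', a trailing '$' anchors the end of the URL, and all
    other characters are matched literally.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(url: str, rule: str) -> bool:
    """Return True if the path plus query string of `url` matches `rule`."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # re.match anchors at the start, mirroring prefix matching of rules.
    return rule_to_regex(rule).match(target) is not None


if __name__ == "__main__":
    plain = "http://www.site.com/shop/maxi-dress?colourId=94&optId=694"
    sale = plain + "&product_type=sale"
    for rule in ("/*product_type=sale", "/*?"):
        for url in (plain, sale):
            print(rule, url, is_disallowed(url, rule))
```

With the rule /*product_type=sale only the sale URL is reported as blocked, while /*? blocks both example URLs because each carries a query string. Note that the rule is a plain substring match, so it would also catch a value such as product_type=sale2.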
Further reading here: Robots.txt Specifications
   
How can you remove them from the index? – TheBlackBenzKid Oct 1 '13 at 11:10
Adding URLs to robots.txt will NOT remove them from the index promptly. It will just prevent Googlebot from crawling them again. To remove URLs from Google's index you need to add a noindex meta tag or a rel canonical link tag (and let Google crawl them), or manually enter each one into a Webmaster Tools removal request after they are in robots.txt. – Stephen Ostermiller Oct 1 '13 at 11:32
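For the noindex route mentioned in the comment above, the tag only needs to appear on the product_type=sale variants. A minimal Python sketch, assuming the page template can call a helper with the request URL (the name robots_meta_tag is hypothetical), and assuming those URLs are not blocked in robots.txt so that Googlebot can actually see the tag:

```python
from urllib.parse import parse_qs, urlsplit


def robots_meta_tag(url: str) -> str:
    """Return a noindex meta tag for product_type=sale URLs, else an empty string."""
    query = parse_qs(urlsplit(url).query)
    # parse_qs maps each parameter name to a list of its values.
    if "sale" in query.get("product_type", []):
        return '<meta name="robots" content="noindex">'
    return ""


if __name__ == "__main__":
    print(robots_meta_tag(
        "http://www.site.com/shop/maxi-dress?colourId=94&optId=694&product_type=sale"
    ))
```

Unlike the robots.txt substring rule, this checks the parsed parameter value, so only an exact product_type=sale triggers the tag.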