So, I have a web-site with some automatically-generated pages. The pages include very mildly personal information about people; for the sake of argument, let’s say it is their shoe-size.
This information is intended only for a select group of shoe-enthusiasts, doing research. I don’t want this web-site to appear in the search engines. People randomly searching for someone shouldn’t find their shoe-size.
The first option would be to password-protect it (or otherwise authenticate each user). However, that is hugely onerous on the legitimate users of the site, for something as mild as shoe-size.
So, I went for the honour system provided by robots.txt. “Please, please,” I asked the search engines, “Don’t go looking in my shoe-size directory.” While compliance is not guaranteed, the robots.txt convention is widely followed by bots.
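For illustration, here is roughly what that request looks like and how a well-behaved bot would interpret it. This is only a sketch using Python's standard `urllib.robotparser`; the `/shoe-size/` directory name, the `example.com` URLs and the `may_fetch` helper are assumptions made up for the example, not the real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the /shoe-size/ directory name is an assumption.
ROBOTS_TXT = """\
User-agent: *
Disallow: /shoe-size/
"""


def may_fetch(user_agent: str, url: str) -> bool:
    """Return True if a compliant bot with this user agent may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A compliant bot stays out of the shoe-size directory...
    print(may_fetch("Googlebot", "http://example.com/shoe-size/kevin-rudd.html"))
    # ...but may still read the rest of the site.
    print(may_fetch("Googlebot", "http://example.com/index.html"))
```

Note that this only says whether a bot may fetch a URL. It says nothing about whether the bot may list a URL it learned about from someone else's link.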
Google honoured the letter of my request, but not the spirit. (I only pick Google out because I use it, not because it is necessarily the only offender.) It doesn’t search the page, but when someone links to it, Google will still return it. (For example, a link such as <a href="wherever.html">Kevin Rudd's Shoe-Size</a> will be found when searching for Kevin Rudd’s shoe-size, even though the URL itself is never visited by Google’s bots.)
How do you get around this? Google explain that there is another convention – the noindex meta tag.
I need to put in every page a note to the robots to say “Hey, don’t include this very page in your index at all.” But, the only way that they will see that note is if I let the search engines read the file – I need to remove the robots.txt restriction.
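To make the catch concrete, here is a minimal sketch of how a bot might look for that note, written with Python's standard `html.parser`. The `RobotsMetaParser` class, the `is_noindex` helper and the sample page are all hypothetical; real crawlers will differ in the details. The point is that the check needs the HTML of the page as its input, so the bot has to download the page first.

```python
from html.parser import HTMLParser

# Hypothetical generated page carrying the robots meta tag.
PAGE = """\
<html>
  <head>
    <meta name="robots" content="noindex, nofollow">
    <title>Shoe-size report</title>
  </head>
  <body>Size: (redacted)</body>
</html>
"""


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def is_noindex(html: str) -> bool:
    """Return True if the page asks robots not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


if __name__ == "__main__":
    print(is_noindex(PAGE))
```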
Now, I am in a dilemma for two reasons.
The first is that I have to let Google crawl all around my database, which may have hundreds of thousands of records and reports of shoe-sizes. That is going to cost me in bandwidth and CPU, just to say “This page shouldn’t appear in your list”.
The second is that there are sure to be robots out there that support the robots.txt convention but not the newer noindex meta tag convention. I don’t know who they are, so I can’t exclude them in robots.txt while still letting through the bots that comply with the newer convention.
We need an updated standard for robots.txt that says “Not only shouldn’t you look in this directory, you shouldn’t even admit that such URLs exist in your database.”
Comment by Richard on April 9, 2008
The robots meta tag with its “noindex, nofollow” content has been around since at least ’98: I was using it then fairly successfully with wget, the GNU web spider, which by now is probably the template of most link harvesters. It seems this tag was defined as part of the W3C recommendation for HTML 4.0, and that was in 1997. The text of the spec also has this interesting statement (even in the 4.01 update):
It’s been more than ten years, so I think you can stop worrying about whether it’s supported. In fact, I think any spider that doesn’t respect this tag isn’t going to respect your robots.txt either.
Comment by Aristotle Pagaltzis on April 9, 2008
Support OpenID?
Comment by Sunny Kalsi on April 9, 2008
There’s this sort-of club near where I live where they talk about shoe sizes. They meet up in this abandoned building with broken windows and stuff. There’s no “club leader” as such, and no “club registry”. People just politely assume that they belong to the club when they wander in.
Unfortunately, sometimes people tell other people about the club, or sometimes people see other people wander into the building and get curious. Worst case it sometimes gets into shoe mags and sometimes a bunch of randoms show up and it’s really awkward… The people don’t want to make it a proper club, because that makes it too much effort, and they want it to be casual.
They instituted a rule:
The first rule of shoe club is that nobody talks about shoe club.
The second rule: If this is your first night, you have to… shoe…
True story.
Comment by Julian on April 12, 2008
Aristotle,
I may be missing something, but supporting OpenID won’t solve the problem, for two reasons.
It will mean that (depending how I configure it) users won’t need yet another username and password for my site, but they would still need to authenticate somehow.
Similarly, there is still an initial registration. As a regular web-surfer, I still do not yet have an OpenID account (to my knowledge? perhaps some of the web accounts I have are OpenID-ready?). I don’t expect the occasional visitor will have one. I hope both of those facts change in the next five years.
More importantly, while Googlebot won’t be able to read the contents of the page, it will still serve links to it. Kevin Rudd seekers will find a link to his shoe-size.
Comment by Julian on April 12, 2008
Richard,
You have assuaged my fear that common search bots might not support the NoIndex meta tag.
It still leaves the problem that each bot will need to visit each of the 10,000-or-so generated pages just to find out it shouldn’t index the page.
Perhaps I am being too miserly with CPU and bandwidth? 10,000 page hits per bot spread over time is possibly not worth worrying over. The pages aren’t large (assuming you don’t download the images).
Oh dear! Images! Suppose you don’t link to Kevin Rudd’s generated database page, but instead link straight to the image of Kevin Rudd’s footprint. While I can block the images with a judicious robots.txt file, I can’t include a NoIndex clause in a JPEG file.
Comment by Aristotle Pagaltzis on April 12, 2008
You may not have an account for it, but you already have an OpenID… or four. 😉
Comment by configurator on June 1, 2008
The footprint image will only be shown when someone embeds that image in their site, not when they link to it. So if your site has noindex and nobody embeds the image, it will not be indexed. And if they do embed it, you can’t really stop indexing from -their- site, can you?
Comment by Julian on June 1, 2008
Configurator,
If they link directly to the image (e.g. <a href="mysite.com/wherever.jpg">Kevin Rudd's Shoe-Size</a>), it will be found when you search for Kevin Rudd. I have no option to include a NoIndex modifier. You claim that is indexing their site. I claim that it is linking the name Kevin Rudd to my site, which I want to prevent. I can see these are two different ways of looking at the same thing, but there doesn’t seem to be a way for me to control the indexing of my images here, which seems wrong.
I could modify my server to give 404s to Google and other known bots when they look at images, but that seems over the top.
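For what it is worth, the server-side rule I have in mind would look something like the sketch below. It is plain Python rather than any particular server's configuration; the `status_for` function, the list of bot names and the image extensions are illustrative assumptions, and the bot list is certainly incomplete. It also relies on bots identifying themselves honestly in the User-Agent header.

```python
# Hypothetical, incomplete list of substrings found in search-bot user agents.
KNOWN_BOT_TOKENS = ("googlebot", "slurp", "msnbot")

# File extensions treated as images for the purpose of this sketch.
IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif")


def status_for(path: str, user_agent: str) -> int:
    """Return the HTTP status to send: 404 for known bots asking for images."""
    is_image = path.lower().endswith(IMAGE_SUFFIXES)
    agent = user_agent.lower()
    is_known_bot = any(token in agent for token in KNOWN_BOT_TOKENS)
    if is_image and is_known_bot:
        return 404
    return 200


if __name__ == "__main__":
    bot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    print(status_for("/wherever.jpg", bot))
    print(status_for("/wherever.jpg", "Mozilla/5.0 (Windows; U; Windows NT 5.1)"))
```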