Garfield: Managing the Spiders

Submitted by Daniel Henry on 09/08/2011 - 3:29pm
Have you heard of spiders?

Ah, no, not the ones squished onto the wall. I mean robots.

Oh, goodness, no, I wasn’t trying to give you nightmares about Terminators shaped like ten-foot-tall tarantulas. I’m not entirely sure how you even got to that image. Let me try again. Web crawlers? Sound familiar?

If you have or plan to have a website, these things matter to you. And not because you need to find a better method than a newspaper if you expect to kill a Terminator spider. The spiders I’m talking about are a tool that search engines use.

They are programs running on servers that crawl the web, visiting any site they find. Once they find one, they catalogue its content and metadata, and that information feeds into search engine rankings and such. These spiders are the kind you don’t so much mind having around, unlike that one that always shows up in the shower when you least expect it.

But then, there are some pages that you don’t really want showing up in a Google search. Like, say, a private FTP area that’s publicly reachable but password-protected. Only the people with the password ought to even realize that page exists. Or, perhaps, for some (such as Drupal users), admin login pages. Those aren’t the kinds of pages anyone wants the average, recreational Google searcher stumbling across. And, of course, the fact that most of us (who really ought to just admit that we’ve googled our own names, because everybody already knows we have) have never come across one of these kinds of pages in our internet ramblings is proof enough that there’s a way to keep the spiders out. That spider-stopper is the robots.txt file.

These files aren’t complicated and, this being the internet and not your basement, they actually work: well-behaved crawlers respect them. They’re simple text files that tell search engines which pages not to crawl and index. And for a single site, that’s all anyone needs.
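
For the curious, a robots.txt is about as plain as a file gets. The paths below are purely illustrative (swap in whatever you actually want hidden); the User-agent line says which crawlers the rules apply to, and each Disallow line asks them to stay out of a path:

    User-agent: *
    Disallow: /admin/
    Disallow: /user/login
    Disallow: /private-files/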

Yes, of course it can get more complicated than that.

Drupal allows multiple sites to be installed on the same core install, known as a multi-site configuration. That means one code base hosts multiple websites. They need not share a theme or even functionality; they simply share the same Drupal core install. Normally, with multiple sites on the same Drupal install, the robots.txt file goes into the root directory and applies to all the sites. But maybe you don’t want the same set of rules to apply to every one of them. Maybe you want closer control of each individual site. What then?
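
If it helps to picture it, a multi-site install typically looks something like the sketch below; the domain names are made up, and note there is just the one robots.txt sitting at the root next to core:

    drupal/                    (the shared Drupal core: index.php, includes/, modules/, ...)
      robots.txt               (a single file, answering for every site)
      sites/
        example-one.com/       (settings.php, files, themes for site one)
        example-two.com/       (settings.php, files, themes for site two)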

Well, Drupal provides a contributed module that lets you customize the robots.txt file for each site in the install, and it stores that file in the database. This matters for two reasons. First, the file gets picked up by your regular database backups, so every change is recorded and, if something gets messed up, you can simply revert to a previous version. Second, the file can be edited online, from within the admin pages, without an FTP client, in a matter of seconds.
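
If you want a feel for the mechanics, here’s a rough sketch assuming the contributed RobotsTxt module and Drush; the module name, site URI, and exact steps are illustrative and may vary with your Drupal version:

    # Enable the module for one site in the multi-site install
    drush --uri=example-one.com en robotstxt -y
    # Remove the physical file so Drupal can answer requests for /robots.txt itself
    rm robots.txt

After that, each site serves its own robots.txt straight out of its database, editable from that site’s admin pages.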

So if, say, you decided to host a development version of your site on the side, you could make sure that the search engines don’t index it. Or, if we’re on a job to transfer a site over from a single- to a multi-site install of Drupal, we can temporarily block the spiders from the new sites while we’re developing and working on them, then let them back in once they’re finished. And remember that bit about how spiders are linked to search engine rankings? Making use of the robots.txt file this way keeps SEO intact.
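
And blocking an entire development site is about as small as a robots.txt gets; two lines like these ask every crawler to stay away from everything:

    User-agent: *
    Disallow: /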

So, let the robots.txt keep the spiders out of your admin pages. I’ll be over here trying to figure out how to get a non-text robot to keep the regular spiders out of the house. Or how to train a cat to do it. Either one’s fine.