The Complete Guide to Robots.txt


Robots.txt is a small text file that lives in the root directory of a website. It tells well-behaved crawlers whether to crawl certain parts of the site or not. The file uses a simple syntax so that it’s easy for crawlers to parse (which makes it easy for webmasters to write, too). Write it well, and you’ll be in indexed heaven. Write it poorly, and you might end up hiding your entire site from search engines.

There is no official standard for the file. Robotstxt.org is often treated as the authoritative resource, but it only describes the original standard from 1994. It’s a place to start, but you can do more with robots.txt than the site outlines, such as using wildcards, sitemap links, and the “Allow” directive. All major search engines support these extensions.

In a perfect world, no one would need robots.txt. If all pages on a site are intended for public consumption, then, ideally, search engines should be allowed to crawl all of them. But we don’t live in a perfect world. Many sites have spider traps, canonical URL issues, and non-public pages that need to be kept out of search engines. Robots.txt is used to move your site closer to perfect.

How Robots.txt Works

If you’re already familiar with the directives of robots.txt but worried you’re doing it wrong, skip on down to the Common Mistakes section. If you’re new to the whole thing, read on.

The file

Make a robots.txt file using any plain text editor. It must live in the root directory of the site and must be named “robots.txt” (yes, this is obvious). You cannot use the file in a subdirectory.

If the domain is example.com, then the robots.txt URL should be:
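http://example.com/robots.txt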

The HTTP specification defines ‘user-agent’ as the thing that is sending the request (as opposed to the ‘server’ which is the thing that is receiving the request). Strictly speaking, a user-agent can be anything that requests web pages, including search engine crawlers, web browsers, or obscure command line utilities.

User-agent directive

In a robots.txt file, the user-agent directive is used to specify which crawler should obey a given set of rules. This directive can be either a wildcard to specify that rules apply to all crawlers:
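User-agent: *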

Or it can be the name of a specific crawler:
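User-agent: Googlebot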

Learn more about giving directives to multiple user-agents in Other user-agent pitfalls.

Disallow directive

Follow the user-agent line with one or more disallow directives:
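User-agent: *
Disallow: /junk-page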

The above example will block all URLs whose path starts with “/junk-page”:
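http://example.com/junk-page
http://example.com/junk-page?usefulness=0
http://example.com/junk-page/whatever-follows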

It will not block any URL whose path does not start with “/junk-page”. The following URL will not be blocked:
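http://example.com/subdir/junk-page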

The key thing here is that disallow is a simple text match. Whatever comes after the “Disallow:” is treated as a simple string of characters (with the notable exceptions of * and $, which I’ll get to below). This string is compared to the beginning of the path part of the URL (everything from the first slash after the domain to the end of the URL) which is also treated as a simple string. If they match, the URL is blocked. If they don’t, it isn’t.

Allow directive

The Allow directive is not part of the original standard, but it is now supported by all major search engines.

You can use this directive to specify exceptions to a disallow rule, if, for example, you have a subdirectory you want to block but you want one page within that subdirectory crawled:
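# (directory and file names are illustrative)
User-agent: *
Disallow: /blocked-directory/
Allow: /blocked-directory/allowed-page.html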

This example will block the following URLs:
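http://example.com/blocked-directory/
http://example.com/blocked-directory/some-other-page.html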

But it will not block any of the following:
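http://example.com/blocked-directory/allowed-page.html
http://example.com/blocked-directory/allowed-page.html?id=123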

Again, this is a simple text match. The text after the “Allow:” is compared to the beginning of the path part of the URL. If they match, the page will be allowed even when there is a disallow somewhere else that would normally block it.

Wildcards

The wildcard operator is also supported by all major search engines. This allows you to block pages when part of the path is unknown or variable. For example:
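# (the path is illustrative)
User-agent: *
Disallow: /users/*/settings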

The * (asterisk) means “match any text.” The above directive will block all the following URLs:
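http://example.com/users/alice/settings
http://example.com/users/bob/settings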

Be careful! The above will also block the following URLs (which might not be what you want):
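http://example.com/users/alice/settings/public-profile
http://example.com/users/sign-up?return-to=/settings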

End-of-string operator

Another useful extension is the end-of-string operator:

Disallow: /useless-page$

The $ means the URL must end at that point. This directive will block the following URL:
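http://example.com/useless-page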

But it will not block any of the following:
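http://example.com/useless-page/
http://example.com/useless-page?id=123
http://example.com/useless-pages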

Blocking everything

But let’s say you’re really shy. You might want to block everything using robots.txt for a staging site (more on this later) or a mirror site. If you have a private site for use by a few people who know how to find it, you’d also want to block the whole site from being crawled.

To block the entire site, use a disallow followed by a slash:
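User-agent: *
Disallow: /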

Allowing everything

I can think of two reasons you might choose to create a robots.txt file when you plan to allow everything:

  • As a placeholder, to make it clear to anyone else who works on the site that you are allowing everything on purpose.
  • To prevent failed requests for robots.txt from showing up in the request logs.

To allow the entire site, you can use an empty disallow:
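User-agent: *
Disallow: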

Alternatively, you can just leave the robots.txt file blank, or not have one at all. Crawlers will crawl everything unless you tell them not to.

Sitemap directive

Though it’s optional, many robots.txt files will include a sitemap directive:
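Sitemap: http://example.com/sitemap.xml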

This specifies the location of a sitemap file. A sitemap is a specially formatted file that lists all the URLs you want to be crawled. It’s a good idea to include this directive if your site has an XML sitemap.

Common Mistakes Using Robots.txt

I see many, many incorrect uses of robots.txt. The most serious of those are trying to use the file to keep certain directories secret or trying to use it to block hostile crawlers.

The most serious consequence of misusing robots.txt is accidentally hiding your entire site from crawlers. Pay close attention to these things.

Forgetting to un-hide when you go to production

All staging sites (that are not already hidden behind a password) should have robots.txt files that block everything, because they’re not intended for public viewing. But when your site goes live, you’ll want everyone to see it. Don’t forget to remove or edit this file.

Otherwise, the entire live site will vanish from search results.

You can check the live robots.txt file when you test, or set things up so you don’t have to remember this extra step. Put the staging server behind a password using a simple protocol like Digest Authentication. Then you can give the staging server the same robots.txt file that you intend to deploy on the live site. When you deploy, you just copy everything. As a bonus, you won’t have members of the public stumbling across your staging site.

Trying to block hostile crawlers

I have seen robots.txt files that try to explicitly block known bad crawlers, like this:
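# (crawler names are made up for illustration)
User-agent: BadBot
Disallow: /

User-agent: EvilScraperBot
Disallow: /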


This is pointless. It’s like leaving a note on the dashboard of your car that says: “Dear thieves: Please do not steal this car. Thanks!”

Robots.txt is strictly voluntary. Polite crawlers like search engines will obey it. Hostile crawlers, like email harvesters, will not. Crawlers are under no obligation to follow the guidelines in robots.txt, but major ones choose to do so.

If you’re trying to block bad crawlers, use user-agent blocking or IP blocking instead.

Trying to keep directories secret

If you have files or directories that you want to keep hidden from the public, do not EVER just list them all in robots.txt like this:
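# (directory names are illustrative)
User-agent: *
Disallow: /secret-stuff/
Disallow: /compromising-photos/
Disallow: /hidden-bank-statements/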

This will do more harm than good, for obvious reasons. It gives hostile crawlers a quick, easy way to find the files that you do not want them to find.

It’s like leaving a note on your car that says: “Dear thieves: Please do not look in the yellow envelope marked ‘emergency cash’ hidden in the glove compartment of this car. Thanks!”

The only reliable way to keep a directory hidden is to put it behind a password. If you absolutely cannot put it behind a password, here are three band-aid solutions.

  1. Block based on the first few characters of the directory name.
    If the directory is “/xyz-secret-stuff/” then block it as shown in example 1 below.
  2. Block with a robots meta tag.
    Add a tag like example 2 below to the HTML code of each page.
  3. Block with the X-Robots-Tag header.
    Add something like example 3 below to the directory’s .htaccess file.
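Here is roughly what each of those looks like. The “/xyz-” prefix continues the example above; the meta tag and header use the standard noindex/nofollow values; the .htaccess line assumes an Apache server with mod_headers enabled:

Example 1 – partial-path disallow (blocks the directory without revealing its full name):

User-agent: *
Disallow: /xyz-

Example 2 – robots meta tag (goes in the <head> of every page in the directory):

<meta name="robots" content="noindex, nofollow">

Example 3 – X-Robots-Tag header via .htaccess:

Header set X-Robots-Tag "noindex, nofollow"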

Again, these are band-aid solutions. None of these are substitutes for actual security. If it really needs to be kept secret, then it really needs to be behind a password.

Accidentally blocking unrelated pages

Suppose you need to block the page:
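http://example.com/admin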

And also everything in the directory:
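http://example.com/admin/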

The obvious way would be to do this:
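User-agent: *
Disallow: /admin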

This will block the things you want, but now you’ve also accidentally blocked an article page about pet care:
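http://example.com/administer-medication-to-your-cat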

This article will disappear from the search results along with the pages you were actually trying to block.

Yes, it’s a contrived example, but I have seen this sort of thing happen in the real world. The worst part is that it usually goes unnoticed for a very long time.

The safest way to block both /admin and /admin/ without blocking anything else is to use two separate lines:
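User-agent: *
Disallow: /admin$
Disallow: /admin/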

Remember, the dollar sign is an end-of-string operator that says “URL must end here.” The directive will match /admin but not /administer.

Trying to put robots.txt in a subdirectory

Suppose you only have control over one subdirectory of a huge website.

If you need to block some pages, you may be tempted to try to add a robots.txt file like this:
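http://example.com/some-subdirectory/robots.txt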

This does not work. The file will be ignored. The only place you can put a robots.txt file is the site root.

If you do not have access to the site root, you can’t use robots.txt. Some alternative options are to block the pages using robots meta tags. Or, if you have control over the .htaccess file (or equivalent), you can also block pages using the X-Robots-Tag header.

Trying to target specific subdomains

Suppose you have a site with many different subdomains:
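http://www.example.com/
http://blog.example.com/
http://store.example.com/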

You may be tempted to create a single robots.txt file and then try to block the subdomains from it, like this:
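# (an attempt like this, continuing the illustrative subdomains above)
User-agent: *
Disallow: blog.example.com
Disallow: store.example.com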

This does not work. There is no way to specify a subdomain (or a domain) in a robots.txt file. A given robots.txt file applies only to the subdomain it was loaded from.

So is there a way to block certain subdomains? Yes. To block some subdomains and not others, you need to serve different robots.txt files from the different subdomains.

These robots.txt files would block everything:
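At http://blog.example.com/robots.txt and http://store.example.com/robots.txt:

User-agent: *
Disallow: /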

And these would allow everything:
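At http://www.example.com/robots.txt:

User-agent: *
Disallow: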

Using inconsistent type case

Paths are case sensitive.
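User-agent: *
Disallow: /acme/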

Will not block “/Acme/” or “/ACME/”.

If you need to block them all, you need a separate disallow line for each:
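User-agent: *
Disallow: /acme/
Disallow: /Acme/
Disallow: /ACME/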

Forgetting the user-agent line

The user-agent line is critical to using robots.txt. A file must have a user-agent line before any allows or disallows. If the entire file looks like this:
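Disallow: /this
Disallow: /that
Disallow: /whatever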

Nothing will actually be blocked, because there is no user-agent line at the top. This file must read:
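User-agent: *
Disallow: /this
Disallow: /that
Disallow: /whatever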

Other user-agent pitfalls

There are other pitfalls of incorrect user-agent use. Say you have three directories that need to be blocked for all crawlers, and also one page that should be explicitly allowed on Google only. The obvious (but incorrect) approach might be to try something like this:
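# (directory and page names are illustrative)
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /dont-crawl/

User-agent: Googlebot
Allow: /dont-crawl/exception-page.html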

This file actually allows Google to crawl everything on the site. Googlebot (and most other crawlers) will only obey the rules under the most specific user-agent line and will ignore all others. In this example, it will obey the rules under “User-agent: Googlebot” and ignore the rules under “User-agent: *”.

To accomplish this goal, you need to repeat the same disallow rules for each user-agent block, like this:
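User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /dont-crawl/

User-agent: Googlebot
Disallow: /admin/
Disallow: /private/
Disallow: /dont-crawl/
Allow: /dont-crawl/exception-page.html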

Forgetting the leading slash in the path

Suppose you want to block the URL:
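http://example.com/banana.html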

And you have the following (incorrect) robots.txt file:
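User-agent: *
Disallow: banana.html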

This will not block anything at all. The path must start with a slash. If it does not, it can never match anything. The correct way to block a URL is:
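User-agent: *
Disallow: /banana.html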

Tips for Using Robots.txt

Now that you know how not to send hostile crawlers right to your secret stuff or disappear your site from search results, here are some tips to help you improve your robots.txt files. Doing it well isn’t going to boost your ranking (that’s what strategic SEO and content are for, silly), but at least you’ll know the crawlers are finding what you want them to find.

Competing allows and disallows

The allow directive is used to specify exceptions to a disallow rule. The disallow rule blocks an entire directory (for example), and the allow rule unblocks some of the URLs within that directory. This raises the question, if a given URL can match either of two rules, how does the crawler decide which one to use?

Not all crawlers handle competing allows and disallows exactly the same way, but Google gives priority to the rule whose path is longer (in terms of character count). It really is that simple. If both paths are the same length, then allow has priority over disallow. For example, suppose the robots.txt file is:
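User-agent: *
Disallow: /baddir/
Allow: /baddir/goodpage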

The path “/baddir/goodpage” is 16 characters long, and the path “/baddir/” is only 8 characters long. In this case, the allow wins over the disallow.

The following URLs will be allowed:
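http://example.com/baddir/goodpage
http://example.com/baddir/goodpage?id=123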

And the following will be blocked:
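http://example.com/baddir/
http://example.com/baddir/some-other-page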

Now consider the following example:
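User-agent: *
Allow: /some
Disallow: /*page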

Will these directives block the following URL?
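http://example.com/somepage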

Yes. The path “/some” is 5 characters long, and the path “/*page” is 6 characters long, so the disallow wins. The allow is ignored, and the URL will be blocked.

Block a specific query parameter

Suppose you want to block all URLs that include the query parameter “id,” such as:
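http://example.com/article?id=123
http://example.com/products?id=432&type=6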

You might be tempted to do something like this:
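Disallow: /*id=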

This will block the URLs you want, but it will also block URLs where any other query parameter ends with “id”:
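http://example.com/article?userid=a9f4
http://example.com/products?bid=57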

So how do you block “id” without blocking “userid” or “bid”?

If you know “id” will always be the first parameter, use a question mark, like this:
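Disallow: /*?id=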

This directive will block:
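http://example.com/article?id=123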

But it will not block:
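http://example.com/products?type=6&id=432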

If you know “id” will never be the first parameter, use an ampersand, like this:
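Disallow: /*&id=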

This directive will block:
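http://example.com/products?type=6&id=432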

But it will not block:
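http://example.com/article?id=123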

The safest approach is to do both:
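Disallow: /*?id=
Disallow: /*&id=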

There is no reliable way to match both with a single line.

Blocking URLs that contain unsafe characters

Suppose you need to block a URL that contains characters that are not URL safe. One common scenario where this can happen is when server-side template code is accidentally exposed to the web. For example:
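http://example.com/product?id=<% product_id %>

(The exact template syntax here is illustrative; any URL containing spaces, angle brackets, or other unsafe characters behaves the same way.)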

If you try to block that URL like this, it won’t work:
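User-agent: *
Disallow: /product?id=<% product_id %>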

If you test this directive in Google’s robots.txt testing tool (available in Search Console), you will find that it does not block the URL. Why? Because the directive is actually checked against the URL:
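http://example.com/product?id=%3C%25%20product_id%20%25%3E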

All web user-agents, including crawlers, will automatically URL-encode any characters that are not URL-safe. Those characters include: spaces, less-than or greater-than signs, single-quotes, double-quotes, and non-ASCII characters.

The correct way to block a URL containing unsafe characters is to block the escaped version:
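User-agent: *
Disallow: /product?id=%3C%25%20product_id%20%25%3E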

The easiest way to get the escaped version of the URL is to click on the link in a browser and then copy & paste the URL from the address field.

How to match a dollar sign

Suppose you want to block all URLs that contain a dollar sign, such as:
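http://example.com/store/products?price=$50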

The following will not work:
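Disallow: /*$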

This directive will actually block everything on the site. A dollar sign, when used at the end of a directive, means “URL ends here.” So the above will block every URL whose path starts with a slash, followed by zero or more characters, followed by the end of the URL. This rule applies to any valid URL. To get around it, the trick is to put an extra asterisk after the dollar sign, like this:
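Disallow: /*$*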

Here, the dollar sign is no longer at the end of the path, so it loses its special meaning. This directive will match any URL that contains a literal dollar sign. Note that the sole purpose of the final asterisk is to prevent the dollar sign from being the last character.

An Addendum

Fun fact: Google, in its journey toward semantic search, will often correctly interpret misspelled or malformed directives. For example, Google will accept any of the following without complaint:
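For instance, directives along these lines (the exact misspellings shown are illustrative, not an exhaustive list):

useragent: *
disalow: /this
dissallow: /that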

This does NOT mean you should neglect the formatting and spelling of directives, but if you do make a mistake, Google will often let you get away with it. However, other crawlers probably won’t.

Pet peeve: People often use trailing wildcards in robots.txt files. This is harmless, but it’s also useless; I consider it bad form.

For instance:
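Disallow: /somedir/*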

Does exactly the same thing as:
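Disallow: /somedir/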

When I see this, I think, “This person does not understand how robots.txt works.” I see it a LOT.

Summary

Remember, robots.txt has to be in the root directory, has to start with a user-agent line, cannot block hostile crawlers, and should not be used to keep directories secret. Much of the confusion around using this file stems from the fact that people expect it to be more complex than it is. It’s really, really simple.

Now, go forth and block your pages with confidence. Just not your live site, your secret stuff, or from hostile crawlers. I hope this guide prepared you to use robots.txt without screwing something up, but if you need more guidance, check out robotstxt.org or Google’s Robots.txt Specifications.

Field Guide to Spider Traps: An SEO’s Companion


If search engines can’t crawl your site, SEO efforts don’t amount to much. One of the problems I see most often is the ‘spider trap’. Traps kill crawls and hurt indexation. Here’s how to find and fix them:

What is a spider trap?

A spider trap is a structural issue that causes a web crawler to get stuck in a loop, loading meaningless ‘junk’ pages forever. The junk pages might be any combination of the following:

  • Exact duplicates (endless different URLs that point to the same page)
  • Near-duplicates (pages that differ only by some small detail, e.g. the crumb trail)
  • The same information presented in endless different ways (e.g. millions of different ways to sort and filter a list of 1000 products)
  • Pages that are technically unique, but provide no useful information. (e.g. an event calendar that goes thousands of years into the future)

The special (worst) case: E-commerce sites

E-commerce sites are particularly good at creating spider traps. They often have product category pages you can sort and filter using multiple criteria such as price, color, style and product type. These pages often have URLs like “www.site.com/category?pricerange=1020&color=blue,red&style=long&type=pencils.”

If you have, say, ten product types, seven brands, six colors and ten price ranges, and four ways to sort it all, then you’ll have (I had to take my socks off for this one) 2^33 on/off filter combinations times four sort orders, or 34,359,738,368 possible permutations. This number will vary depending on the number of options. The point is, it’s a big number.

That’s not ‘infinite,’ but it’s ‘an awful lot.’

Different causes, same result

Whatever the cause, a spider trap can make it impossible for a search engine to index all of the content on the site, and can prevent the pages that do get indexed from ranking well. There are a few reasons why this is bad for SEO:

  • It forces the search engines to waste most of their crawl budget loading useless near-duplicate pages. As a result, the search engines are often so busy with this, they never get around to loading all of the real pages that might otherwise rank well.
  • If the trap-generated pages are duplicates of a ‘real’ page (e.g. a product page, blog post etc.) then this may prevent the original page from ranking well by diluting link equity.
  • Quality-ranking algorithms like Google Panda may give the site a bad score because the site appears to consist mostly of low-quality or duplicate pages.

The result is the same: Lousy rankings. Lost revenue. Fewer leads. Unhappy bosses.

How to identify a spider trap

The best way to determine if a site has a spider trap is to use a crawler-based tool like Xenu’s Link Sleuth or Screaming Frog:

  1. Start a crawl of the site and let it run for a while.
  2. If the crawl eventually finishes by itself, then there is no spider trap.
  3. If the crawl keeps running for a very long time, then there might be a spider trap (or the site might just be very large).
  4. Stop the crawl.
  5. Export a list of URLs.
  6. If you find a pattern where all of the new URLs look suspiciously similar to each other, then a spider trap is likely.
  7. Spot-check a few of these suspiciously similar URLs in a browser.
  8. If the URLs all return exactly the same page, then the site definitely has a spider trap.
  9. If the URLs return pages that are technically slightly different, but contain the same basic information, then a spider trap is very likely.

There are a lot of ways to create spider traps. Every time I think I have seen them all, our crawler finds another. These are the most common:
 

Expanding URL Trap

Crawl Chart: Expanding URL Trap (single)

Identification

An expanding URL trap can be especially difficult to see in a browser, because it is usually caused by one or more malformed links buried deeply in the site. As with any spider trap, the easiest way to spot it is to crawl the site with a crawler-based tool. If the site has this issue, the crawl will reveal the following things:

  • At first the crawl will run normally. The spider trap will be invisible until the crawler finishes crawling most of the normal (non-trap) pages on the site. If the site is very large, this may take a while.
  • At some point in the crawl, the list of crawled URLs will get stuck in an unnatural-looking pattern in which each new URL is a slightly longer near-copy of the previous one. For example:
    http://example.com/somepage.php
    http://example.com/abcd/somepage.php
    http://example.com/abcd/abcd/somepage.php
    http://example.com/abcd/abcd/abcd/somepage.php
    http://example.com/abcd/abcd/abcd/abcd/somepage.php
    http://example.com/abcd/abcd/abcd/abcd/abcd/somepage.php
    http://example.com/abcd/abcd/abcd/abcd/abcd/abcd/somepage.php
    http://example.com/abcd/abcd/abcd/abcd/abcd/abcd/abcd/somepage.php…
  • Each new URL will contain an area of repeating characters that gets longer with each new step.
  • As the crawl continues, the URLs will get longer and longer until they are hundreds or thousands of characters long.

Causes

In most expanding URL spider traps, the file path is the part of the URL that gets longer. There are three ingredients that must all be present in order for this to happen:

Ingredient #1:
The site uses URL rewrite rules to convert path components into query parameters. For example, if the public URL is:
http://example.com/products/12345/xl/extra-large-blue-widget
then, on the server side, the rewrite rules might convert this to:
http://example.com/store/products.php?prod_id=12345&size=xl
In this example the “/extra-large-blue-widget” part is discarded, because it is ‘decorative’ text, added solely to get keywords into the URL.

Ingredient #2:
The rewrite rules are configured to ignore anything beyond the part of the URL they care about. For example, in:
http://example.com/products/12345/xl/extra-large-blue-widget
the rewrite rules would silently discard everything after “/products/12345/xl/”. You could change the URL to:
http://example.com/products/12345/xl/
or even:
http://example.com/products/12345/xl/here/is/junk/text/that/has/no/effect/on/the/output
and the server would return exactly the same page.
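To make this concrete, here is a rough sketch of the kind of rewrite rule that produces ingredients #1 and #2. This example assumes an Apache server with mod_rewrite and an .htaccess file; it is purely illustrative and not taken from any particular site:

# Maps any URL beginning with /products/<digits>/<letters>/ onto the PHP handler
# and silently ignores whatever follows (exactly the behavior described above).
RewriteEngine On
RewriteRule ^products/([0-9]+)/([a-z]+)/ /store/products.php?prod_id=$1&size=$2 [L]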

Ingredient #3:
The final ingredient is a malformed relative link that accidentally adds new directory levels to the current URL. There are many different ways this can happen. For example:

  1. If the page:
    http://example.com/products/12345/xl/extra-large-blue-widget
    contains a link that is supposed to look like this:
    <a href="/about/why-you-should-care-about-blue-widgets">
    which would point to the URL:
    http://example.com/about/why-you-should-care-about-blue-widgets
    but the author accidentally leaves out the leading slash:
    <a href="about/why-you-should-care-about-blue-widgets">
    This link actually points to the URL:
    http://example.com/products/12345/xl/about/why-you-should-care-about-blue-widgets
    If you repeatedly click on this link, you will be taken through the following URLs:
    http://example.com/products/12345/xl/about/about/why-you-should-care-about-blue-widgets
    http://example.com/products/12345/xl/about/about/about/why-you-should-care-about-blue-widgets
    http://example.com/products/12345/xl/about/about/about/about/why-you-should-care-about-blue-widgets
    http://example.com/products/12345/xl/about/about/about/about/about/why-you-should-care-about-blue-widgets
    http://example.com/products/12345/xl/about/about/about/about/about/about/why-you-should-care-about-blue-widgets
    http://example.com/products/12345/xl/about/about/about/about/about/about/about/why-you-should-care-about-blue-widgets
  2. If the page:
    http://example.com/products/12345/xl/extra-large-blue-widget
    contains a link that is supposed to look like this:
    <a href="http://www.othersite.com/">
    which would point to the URL:
    http://www.othersite.com/
    but the author leaves out the “http://”:
    <a href="www.othersite.com/">
    This link actually points to the URL:
    http://example.com/products/12345/xl/www.othersite.com/
    If you repeatedly click on this link, you will be taken through the following URLs:
    http://example.com/products/12345/xl/www.othersite.com/
    http://example.com/products/12345/xl/www.othersite.com/www.othersite.com/
    http://example.com/products/12345/xl/www.othersite.com/www.othersite.com/www.othersite.com/
    http://example.com/products/12345/xl/www.othersite.com/www.othersite.com/www.othersite.com/www.othersite.com/
  3. If the page:
    http://example.com/products/12345/xl/extra-large-blue-widget
    contains a link that is supposed to look like this:
    <a href="http://www.othersite.com/">
    which would point to the URL:
    http://www.othersite.com/
    but the HTML was pasted from a word processor, which silently converted the quote marks into curly quotes:
    <a href=“http://www.othersite.com/”>
    This looks the same to us human beings, but to a browser or a search engine, those curly quotes are not valid quotation marks. From the browser/crawler’s point of view, this tag will look like:
    <a href=(SOME UNICODE CHARACTER)http://www.othersite.com/(SOME UNICODE CHARACTER)>
    As a result, the final URL will become the following mangled mess:
    http://example.com/products/12345/xl/%E2%80%9Chttp://www.othersite.com/%E2%80%9D
    If you repeatedly click on this link, you will be taken through the following URLs:
    http://example.com/products/12345/xl/%E2%80%9Chttp://www.othersite.com/%E2%80%9Chttp://www.othersite.com/%E2%80%9D
    http://example.com/products/12345/xl/%E2%80%9Chttp://www.othersite.com/%E2%80%9Chttp://www.othersite.com/%E2%80%9Chttp://www.othersite.com/%E2%80%9D
    http://example.com/products/12345/xl/%E2%80%9Chttp://www.othersite.com/%E2%80%9Chttp://www.othersite.com/%E2%80%9Chttp://www.othersite.com/%E2%80%9Chttp://www.othersite.com/%E2%80%9D

Treatment

This issue can be challenging to fix. Here are some things that may help:

  • Track down and fix the malformed link(s) that are creating the extra directory levels. This will fix the problem for now. Be aware that the problem is likely to return in the future if/when another bad link is added.
  • If you have the technical skill, you can add rules to the server config to limit rewrites to URLs with a specific number of slashes in them. Any URL with the wrong number of slashes should not be rewritten. This will cause malformed relative links to return a 404 error (as they should). See the sketch after this list.
  • If all else has failed, you may be able to block the trap URLs using robots.txt.
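Here is a rough sketch of the second option, continuing the illustrative Apache example from the Causes section. The pattern is anchored so that URLs with extra path segments no longer match and fall through to a 404:

# Illustrative only: rewrites /products/<id>/<size>/<one optional slug segment> and nothing deeper.
# A malformed link that adds extra /about/about/... levels no longer matches, so it returns 404 instead.
RewriteEngine On
RewriteRule ^products/([0-9]+)/([a-z]+)/([^/]*)$ /store/products.php?prod_id=$1&size=$2 [L]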

 

Mix & Match Trap

Crawl Chart: Mix & Match Trap

Identification

This issue happens when a site has a finite number of items that can be sorted and filtered in a virtually unlimited number of different ways. This is most common on large online stores that offer multiple ways to filter and sort lists of products.

The key things that define a mix & match trap are:

  • The site offers many, many different ways to sort/filter the same list of products. For example: by category, manufacturer, color, size, price range, special offers etc.
  • These filters are implemented through normal, crawlable links that take the user to a new page. If the filters are powered by JavaScript, or the filters require the user to submit a form, then it’s probably not a spider trap.
  • It is possible to mix filter types. For example, you could view a list of all products of a specific brand that are also a specified color. If it is only possible to filter by brand or filter by color, but not by both, then this is probably not a spider trap.
  • Often, it is also possible to arbitrarily combine multiple choices from the same filter type. For example, by viewing a single list of all products that are any of red, blue, or mauve. A site can still have a mix & match trap without this, but it makes the issue much, much worse. It can easily increase the number of possible pages a trillion times or more.

Causes

This type of filter creates a trap because each option multiplies the number of possibilities by two or more. If there are many filters, the number of combinations gets extremely large very quickly. For example, if there are 40 different on/off filtering options, then there will be 2^40, or over a trillion, different ways to sort the same list of products.

A more concrete example:

  • Suppose an online store has a few thousand products. The list can be sorted by any of: price ascending, price descending, name ascending, or name descending. That’s four possible views.
  • There is also an option to limit the results to just items that are on sale. This doubles the number of possible views, so the total number is now 8.
  • Results can also be limited to any combination of four price ranges: $0–$10, $10–$50, $50–$200, and $200–$1000. The user may select any combination of these (e.g. they can select both $0–$10 and $200–$1000 at the same time). This increases the number of possibilities by 2^4, or 16 times, so the total number of views is now 128.
  • Results can also be limited to any combination of five sizes: XS, S, M, L, or XL. This increases the number of possibilities by 2^5, or 32 times, so the total number of views is now 4096.
  • Results can be limited to any combination of 17 available colors. This increases the number of possibilities by 2^17, or 131,072 times, so the total number of views is now 536,870,912.
  • Last but not least, the results can also be limited to any combination of 26 possible brand names. This increases the number of possibilities by 2^26, or 67,108,864 times, so the total number of views is now 36,028,797,018,963,968. (!!!)

That’s 36 quadrillion different ways to view the same list of a few thousand products. For all practical purposes, this can be considered infinite. To make matters worse, the vast majority of these will contain zero or one items.

Treatment

This issue can be extremely difficult to fix. The best way to deal with it is to not create the issue in the first place.

Some options:

  • Consider offering fewer options. Seriously. More choices is not always better.
  • Implement the mix & match filtering in JavaScript.
  • Depending on the URL scheme, it may be possible to limit the extent of the trap by using robots.txt to block any page with more than a minimum number of filters. This must be done very carefully. Block too much, and the crawler will no longer be able to find all of the products. Block too little, and the site will still be effectively infinite. (mere billions of pages instead of quadrillions)
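As a rough sketch of that last option: if the filtered URLs on a hypothetical store look like /category?color=red&size=xl&brand=acme, a wildcard rule can block any URL carrying three or more filter parameters while leaving the one- and two-filter pages crawlable. The URL pattern here is purely illustrative and would need to be adapted to the site’s actual URL scheme:

User-agent: *
# Blocks any /category URL whose query string contains at least two "&" characters
# (i.e. three or more parameters). Verify in a robots.txt tester before deploying.
Disallow: /category?*&*&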

 

Calendar trap

Crawl Chart: Calendar Trap

Identification

This issue is easy to identify. If the site has a calendar page, go to it. Try clicking the ‘next year’ (or ‘next month’) button repeatedly. If you can eventually go centuries into the future, then the site has a calendar trap.

Causes

This issue happens when an event calendar is capable of showing any future date, far beyond the point where there could plausibly be any events to display. From the search engine’s point of view, this creates an infinite supply of ‘junk’ pages—pages that contain no useful information. As spider traps go, this is a comparatively minor one, but still worth fixing.

Treatment

Some options:

  • Add code to selectively insert a robots noindex tag into the calendar page when it is more than X years into the future.
  • Block a distant future time range using robots.txt. (For example “Disallow: /calendar?year=203” to block the years 2030 through 2039—just make sure you change this before the year 2030 actually happens.)

 

SessionID trap

Crawl Chart: Session ID Trap

Identification

This trap can be identified by crawling the site, and looking at the list of crawled URLs for something like this:

http://example.com/somepage?jsessionid=E8B8EA9BACDBEBB5EDECF64F1C3868D3
http://example.com/otherpage?jsessionid=E8B8EA9BACDBEBB5EDECF64F1C3868D3
http://example.com/somepage?jsessionid=3B95930229709341E9D8D7C24510E383
http://example.com/otherpage?jsessionid=3B95930229709341E9D8D7C24510E383
http://example.com/somepage?jsessionid=85931DF798FEC39D18400C5A459A9373
http://example.com/otherpage?jsessionid=85931DF798FEC39D18400C5A459A9373

The key things to look for:

  • After the crawl has been running a while, all of the new URLs will include a query parameter with a name like ‘jsessionid’, ‘sessid’, ‘sid’, or the like.
  • The value of this parameter will always be a long random-looking code string.
  • These code strings will all be the same length.
  • The code strings might all be unique, or some of them might be duplicates.
  • If you open two URLs that differ only by the value of the code string they will return exactly the same page.

Causes

This session ID query parameter is used by some sites as a way to keep track of user sessions without the use of cookies. For the record, this is a very bad idea. It can cause a long list of major SEO problems, and it also does a lousy job of keeping track of sessions. (because the ID value tends to change — see items #4 and #5 below)

How this works:

  1. If a URL that does not have a session ID is requested, the site will redirect the request to a version of the URL that has a session ID appended.
  2. If a request URL does have a session ID, then the server returns the requested page, but it appends the session ID to each of the internal links on the page.
  3. In theory, if #1 and #2 above were implemented perfectly without missing any links or redirects, then all of the URLs in the crawl would wind up with the same session ID and the crawl would end normally. In actual practice, this almost never happens. The implementation inevitably misses at least a few links.
  4. If the site contains even one internal link that is missing the session ID (because #2 above was implemented incompletely), then the site will generate a brand new session ID each time this link is followed. From the crawler’s point of view, each time it follows the link it will be taken to a whole new copy of the site, with a new session ID.
  5. If there are any URLs that are not properly redirected when requested without a session ID (because item #1 above was implemented incompletely), and the URL can also be reached through a link that does not have a session ID (because item #2 was also implemented incompletely), then every link on this new page will effectively point to a brand new copy of the site.

To further complicate things, on some sites all of the above is implemented conditionally—the site first attempts to store session info in a cookie, and if this fails, then it redirects to a URL with a session ID. This really just makes the problem harder to find, because it hides the issue when the site is viewed by a human being with a browser.

Treatment

To deal with this issue, you will need to remove the session IDs from all redirects and all links. Details of how to do this depend on implementation. It is critical to remove all of them. If you overlook even one source of session IDs, the crawl will still have infinite URLs.

 

Conclusion

Spider traps can have a variety of causes, and they can vary in severity from “less than optimal” to “biblical disaster”. The one thing they all have in common is they all unnecessarily throw obstacles in the search engines’ path. This will inevitably lead to incomplete crawling and a lower rank than the site deserves. The search engines are your friends. They have the potential to bring in huge amounts of highly qualified traffic. It is in your interests to make their job as easy as possible.

Spider traps are also one of the most difficult SEO problems to find, diagnose and fix. Try the techniques I have outlined above. If you have questions, please leave a comment below.

Happy trapping!
