I’ve broken this out into lots of steps. You could do it all in one or two steps with a shell script or other geekery. I wrote this to keep each step simple, and get you into Excel as quickly as possible, instead.
I’ve railed about fixing broken links for years now. I’ve presented webinars about it, talked about all manner of fancy tools and generally made myself a pest.
What I’ve never done, though, is shown folks how they can quickly find those busted external links using basic tools. So, here goes:
Why bother?
With a log file, you can find broken external links that Google hasn’t found. Google Webmaster Tools only shows you broken links found by Google. GWT ignores:
- Old broken links that Google assumes are no longer relevant;
- New incoming links that are broken, from sites like bit.ly;
- Broken social media links, if they’re not driving many clicks.
Don’t you want all those links from Twitter? How about all the old .edu links you used to have, but lost when you took down the target pages?
Hell yes. Here’s how you can find them using your log files:
Get your tools together
If you’re using OS X or Linux, you have everything you need except, possibly, a spreadsheet program. Google Docs will work, or OpenOffice for big files, or Excel for the coolest stuff (like pivot tables).
If you’re on Windows, you’ll want to install Cygwin, which gives you all of the command-line tools I talk about in this post.
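Either way, you can confirm the basic tools are in place by asking the shell where they live (run this in Terminal, or in the Cygwin terminal on Windows):
which cat gunzip grep
If each one prints a path, you’re good to go.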
1: Get access to the log files
If you run your own site, you can download the log files yourself. Otherwise, though, you’re going to have to ask someone else to get ’em for you, and that’s rarely popular. Here’s how you can make the process less painful:
- Explain why you need them: To improve sales. Log files will give you the best potential linking ‘wins’, and reveal the biggest site indexation problems.
- Explain the value: The log files will let you more accurately spot ‘big two’ issues (links and indexation) than any other method. Both have huge implications for site traffic. Which has huge implications for sales.
- Explain exactly what you need: Don’t just ask for ‘the log files’. Let them know you just need a 5-10 day slice of the files or, if the site’s really busy, just a day or two.
- Provide them an easy secure location to upload the zipped files. An FTP or Dropbox folder should work fine, and it saves them a step.
- Assure them you’ll delete the logs the moment you’re done.
The key here: Make this an easy process. Whoever you ask for the files will immediately wonder two things: “Is this a lot of work for me?” and “Is this a security issue?” Answer both concerns before they’re raised.
1b: If you can’t get the files
I’ve spent weeks, literally, trying to get log file access from a client. Usually, that’s because no one knows what I’m talking about. If you run into this, try these steps, in this order:
If the site’s located with a hosting company:
- Read the company’s tech support docs. You may find the information you need there.
- Check the site’s control panel. It probably has an area for log file management, or a file manager where you can click around and find the log file folder.
- If all else fails, contact the hosting provider’s tech support team. Pick up the phone. Talk to a human being. You’d be amazed how well that works.
If the site’s self-hosted or managed by an internal team:
- Get in touch with whoever manages the server day-to-day. Whether they know it or not, they’ll have the info you need to get the logs.
- If they can’t find the files, but they’re willing to let you get access, get SSH or Remote Desktop permissions on the server. You can then click around and find the log files, or go directly into the IIS control panel/Apache configuration file and find the log file location there.
- If they can’t find the files and they won’t give you access, find out their server platform. Then research possible log file locations on that platform, and ask them to look there.
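On an Apache server, for example, the log location is usually set by a CustomLog directive, so a recursive search of the usual config folders can turn it up. Something like this should work on most Linux boxes (the paths are the common Debian and Red Hat locations; yours may differ):
grep -ri "customlog" /etc/apache2/ /etc/httpd/ 2>/dev/null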
Got the files? Great! Time to get to work.
2: Extract the log files
Now, you can go download the log files. You probably have a bunch of compressed files up on a server somewhere. They’ll look like the right-hand side of my FTP window:

[Screenshot: Transferring files via FTP]
Download them to your machine. Decompress them using whatever utility makes sense. If these are .gz files, you can extract them using the GUNZIP command:
gunzip *.gz
That will extract every file in this folder with a .gz on the end, and leave you with something like this:

[Screenshot: Extracting files with GUNZIP]
Log files may be compressed using ZIP, or something else. You can find the right extraction tool using, I dunno, Google?
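If they arrived as .zip or .bz2 files, for instance, something like this should handle them (assuming the usual unzip and bzip2 utilities are installed):
for f in *.zip; do unzip "$f"; done
bunzip2 *.bz2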
3: Combine the log files
Ideally, you need a single log file. To combine the log files, use the CAT command:
cat access_log* > biglog.txt
The above command will:
- Read each file that has a name starting with ‘access_log’.
- Write the contents of all of those files into a file named ‘biglog.txt’.
- The single ‘>’ tells the shell to wipe any pre-existing file named ‘biglog.txt’ and start over. If you use ‘>>’ instead, the output gets appended to the existing file.
If the files are really huge you may have to keep them separate. But that’ll only be an issue if, once combined, the final file is multiple gigabytes in size. GREP is really good at processing huge files.
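Want a quick sanity check on what you just built? These two commands show the combined file’s size and line count:
ls -lh biglog.txt
wc -l biglog.txt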
Interlude: What you need from this file
You need to find all of the broken external links. So, you’ll need four pieces of data:
- The response code. A web server responds to a request for a broken link with a 404 error code, which then gets stored in the log files you just combined. The response code will let us filter for broken links.
- The referrer. It also stores the referrer—the URL of the linking page. We’ll use this to figure out the value of the broken link.
- The request. It stores the request—the URL of the linked page. The request will tell us which pages we need to replace or redirect.
- The user agent. Finally, it stores the user agent—the type of browser or bot that made the request. This will let us exclude Googlebot visits.
With those four items, you can find all of the external broken links visited by browsers other than Googlebot.
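For reference, here’s what one entry looks like in Apache’s ‘combined’ log format, with the request and response code in the middle and the referrer and user agent at the end (the IP and URLs below are made up, and your server’s format may differ):
203.0.113.7 - - [10/Oct/2012:13:55:36 -0700] "GET /old-page.html HTTP/1.1" 404 512 "http://example.com/links.html" "Mozilla/5.0 (Windows NT 6.1)"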
4: Use GREP to find the 404 errors
Now to the good stuff. You’ve got one gigantic log file. You can use the GREP command to search through that file at super speed.
Use this command, changing the .htm pattern and the file names as relevant:
grep '\.html*[^"]*" 404 ' biglog.txt > errors.txt
This command will:
- Find every line where the request is for a .htm or .html page and the response code right after it is 404. It uses a regular expression, or regex. I kinda suck at regex, so go to this site if you want to learn more.
- Write that to a file called errors.txt.
This can take a minute or two.
You may need to change the .htm part. We’re using it to exclude all of the requests for .gif, .png and other non-HTML files. We only care about pages this time around. If your site uses PHP, and all of the URIs end with .php, you’ll have to swap the .html* part of the pattern for .php, as in the example below.
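Here’s that PHP version, with the pattern left loose enough to tolerate query strings:
grep '\.php[^"]*" 404 ' biglog.txt > errors.txt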
5: Get rid of Googlebot
We need to remove all 404 errors generated by Googlebot. GREP can do the job, again. Use this command:
grep -v "Googlebot" errors.txt > errors-no-google.txt
This command will:
- Search through the file you generated in step 4.
- Find every line that does not include “Googlebot”. The -v inverts the search, so GREP finds all lines that don’t match the search criteria.
- Output that line to a new file called errors-no-google.txt. If the file exists, it’ll wipe that file and create a new one. Use >> if you want to append to the existing file instead.
Notice how fast GREP ran that command? Pretty nifty, huh?
When I ran through this exercise on my laptop, I took a 0.5 GB biglog.txt file and trimmed it down to a 904 KB file that just contained the errors I needed. It took a total of 5 minutes, start to finish. Try this in Excel and you’ll see smoke rising from your computer. GREP is so cool that I’ve written about it before.
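Googlebot isn’t the only crawler in your logs, either. If you want to drop several common bots in one pass, a variation like this should do it (those are typical bot user-agent strings; check what actually shows up in your own file, and import errors-no-bots.txt in the next step if you use it):
grep -v -E "Googlebot|bingbot|Baiduspider|YandexBot" errors.txt > errors-no-bots.txt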
6: Prepare your spreadsheet
Using whatever spreadsheet software you prefer, import the errors-no-google.txt file as a space-delimited text file:

[Screenshot: A space-delimited import in Excel]
You won’t need most of the columns. Only three columns really matter:
- The column that includes GET or HEAD and a URL. That’s the request—the page on this site that someone tried to load.
- The column that includes a three-digit number. It usually comes right after the request. That’s the response code—the server’s reply to the request. If GREP did its job, the response should be 404 for every line in the sheet. Sometimes it goes wrong, though, because of a ‘404’ somewhere else in the row. Poop happens.
- The next column should be a URL, or a dash. That’s the referrer. If someone clicked a link on another page, that other page’s URL is the referring URL. It’s shown in this column. If they typed in the page address, or if their browser is set up to hide the referrer, the referrer is ‘-’.
You can delete the rest of the columns. Then insert a new row at the top of the page and label the columns:

That’ll let you indulge in some data processing niftiness later on.
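If you’d rather trim things down to just those three columns before the file ever touches a spreadsheet, awk can do it. This assumes the standard Apache combined log format, where the requested URL, the response code and the referrer are the 7th, 9th and 11th space-separated fields; count the fields in your own file before trusting it:
awk '{print $7, $9, $11}' errors-no-google.txt > errors-columns.txt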
Oh, and save the damned spreadsheet. Nothing sadder than losing all your data because your cat strolled across the keyboard.
7: Set up filtering
Put your cursor in the heading row you created in step 6 and click the filter button:

[Screenshot: The filter button in Excel]
Now you can sort and/or filter out stuff you don’t need. For example, I may not want to see all of those ‘-’ referrers:

[Screenshot: Filtering out 'dash' referrers in Excel]
And I probably only want to see external broken links, so I can filter out all referrers that include this site’s domain name:

[Screenshot: Filtering out a domain name in Excel]
Note that I used ‘does not contain’ for the second filter. Read up on Excel’s filter tool. It’s your friend.
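The same filtering works at the command line, too, if you want to skip a spreadsheet step. Something like this should do it, with example.com standing in for this site’s own domain:
grep -v '"-"' errors-no-google.txt | grep -v "example.com" > external-errors.txt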
8: Find the broken external links
Phew. Finally. We can find some external links. Take a look at the result:

[Screenshot: The final spreadsheet - a link goldmine!]
It’s a link goldmine!!! Every row represents a broken link from another site.
Now you can use a pivot table or other spreadsheet awesomeness to find the biggest problems:

[Screenshot: Pivot table report showing most-requested broken links]
Or, you can just browse through the raw data. Either way, you’ll find great, easy incoming links.
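If pivot tables aren’t your thing, a one-liner gets you a similar ‘most-requested broken pages’ report. This assumes the combined log format again, with the requested URL in field 7:
awk '{print $7}' errors-no-google.txt | sort | uniq -c | sort -rn | head -20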
9: Prioritize the links
Prioritize broken links like this:
- Broken links from high-authority sites get fixed first. These links could really give you a rankings boost.
- Broken links with a high number of requests get fixed next. A lot of people are still clicking them.
- Everything else.
10: How to fix the links
None of this work means a thing if no one fixes the links! Here are the ways to fix them, from best to worst:
- Rebuild the missing page. If the broken link points at a deleted page, replace that page. If the site’s an online store and the link points at a product that’s out of stock or no longer available, put up a page, at that URL, that says ‘This product is out of stock’ or ‘This product is no longer available’. Then provide links to other relevant pages, or to customer support, or to the category page.
- Build a new page. If the broken link points at a page that never existed or had to be deleted, create something new (but relevant) there.
- Build a detour page. Create a page that summarizes what the old page said and then says ‘But this page is gone now. Sniff. Instead, go over here.’ Then link to an alternative.
- Use a permanent redirect. Create a 301 redirect from the broken link URL to a relevant page. Do not simply redirect to your home page! That just confuses your visitors.
Always use options 1-3 before 4. A permanent redirect is a very imperfect solution, and best applied when you have no other options. 301 redirects will reroute authority for a while, but eventually the authority ‘decays’. Plus, a high number of 301 redirects on a site can wreak havoc with Google and Bing. Both search engines’ crawlers will give up if they see too many redirect ‘hops’.
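If you do end up using option 4 anyway, a single permanent redirect on an Apache server can be one line in an .htaccess file. The paths and domain below are placeholders, and IIS and nginx use their own syntax:
Redirect 301 /old-page.html http://www.example.com/new-page/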
Put away that letter opener…
This post has over 1800 words. At this point you’re probably ready to stab me. Please don’t. I like my insides in.
And, this isn’t nearly as hard as it seems. With practice, you’ll be zipping through all these steps in under an hour. It’s by far the quickest, easiest way to improve site authority.
Nicely done!
I hope a geeky commandline linux summary will show up in the comments 😉
In addition, the grep only looks for *.html files, yet images and other stuff like rewritten URLs do not end in *.html and _do_ have a great impact on server load
Yup, and we use log files for that, too. This was just about links and links to pages.
Sorry Ian,
I didn’t read this…I only clicked on it in my feeds because I thought the title of the post was, “How to: Mine server logs for broken limbs”.
I was giddy with excitement!
I’m sure it is a great post, but not as fun as finding broken limbs.
Now you’ve given me an idea for my next post…
excellent tutorial 🙂
I think it would be interesting to add an option to step 10: “reach out by email or social media, if possible, and ask them to correct the link”. This offers an excellent opportunity to create a relationship and maybe suggest more awesome content to link to!
Nice tutorial, very easy to follow.
One note: you may not get what you expect in access.log. Those files may only contain internal connections; the interesting logs can be other_vhosts_access.log*, depending on the Apache configuration. When asking the system administrator for the log files, make sure you get the appropriate files right away.