Using SiteSucker For Testing Redirects

My boss threw down the gauntlet Monday morning during our weekly meeting. In relaunching one of our decade-old platforms, we couldn’t afford to get bashed by the fickle finger of Google Search Results, and I needed to take extra care to ensure all redirects were properly made. With over twenty thousand pages, it was no small task, and I struggled to find a way to automate it.

Ultimately, I had to make a compromise between feasibility and correctness, but I’m pretty satisfied with the results. In fact, SiteSucker was able to confirm that, besides images, we did manage a 100% conversion for all the existing URLs.

Google Analytics & Full Realization

At first, I spent about a day going through the top 1000 URLs according to the Google Analytics tracker. This was horribly redundant, mind-numbing work, but I didn’t see a way to export all 5 MILLION URLs that Google Analytics had on record for the last month.

I calculated that, at the rate I had gone through the top 1000, I’d need almost 7 years to double-check the rest. And, of course, I knew we didn’t have that many valid pages on our site.

SiteSucker to the Rescue

I’d used SiteSucker a few times in recent months to double-check the health of our site’s link structure. It’s extremely fast and the user interface is very lean (it makes use of Mac OS’s Console logging application). What I wondered was whether I could run a web crawl not from a site, but from a saved file. Turns out, it’s very easy!

First, make sure SiteSucker’s settings are configured to log the download history (and to save that log).

Then, enter the original site name in the Web URL input and hit Enter. You’re off to the races!

For ~20k links, it took almost 20 minutes for SiteSucker to grab them all. Take the finished log output and snip away the unneeded text to the left and right of each URL. Go ahead and take the extra time to wrap each one in an anchor tag (this will help in the next step).
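If snipping 20k lines by hand sounds painful, the trimming can be scripted. Below is a minimal Python sketch, not the method I used at the time. It assumes nothing about the exact layout of the SiteSucker log beyond each entry containing a full http or https URL; the names `extract_urls` and `to_anchor` are made up for this example.

```python
import re
from html import escape

# Assumption: every log entry of interest contains an absolute http(s) URL.
# Everything to the left and right of that URL is treated as noise.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def extract_urls(log_text):
    """Return the unique URLs found in the log text, in first-seen order."""
    seen = set()
    urls = []
    for match in URL_PATTERN.finditer(log_text):
        # Drop punctuation that often trails a URL inside a sentence.
        url = match.group(0).rstrip(".,;)")
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def to_anchor(url):
    """Wrap a URL in an anchor tag, escaping characters such as '&'."""
    safe = escape(url, quote=True)
    return f'<a href="{safe}">{safe}</a>'
```

To use it, read the saved log into a string, pass it to `extract_urls`, and write `to_anchor(url)` for each result to a new file, one per line.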

Copy this file and rename it to reflect the new site where you want to test your redirects. In your favorite editor, search and replace the old domain with the test (or new) domain. Wrap the links in basic html and body tags and change the file extension to .html. Now we have a very basic HTML page containing all the links from the old site.
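The search/replace and the wrapping can also be done in a few lines of Python. This is only a sketch under assumptions: the host names in the usage note are placeholders, and `swap_host` and `build_page` are hypothetical helper names. It replaces the host part of each URL only, which is a little safer than a blind text replace if the old domain also happens to appear inside a path or query string.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def swap_host(url, old_host, new_host):
    """Point a URL at new_host if it currently points at old_host."""
    parts = urlsplit(url)
    if parts.netloc.lower() != old_host.lower():
        return url  # leave links to other sites untouched
    return urlunsplit(parts._replace(netloc=new_host))


def build_page(urls, old_host, new_host):
    """Return a bare-bones HTML page with one anchor per rewritten URL."""
    lines = ["<html>", "<body>"]
    for url in urls:
        target = escape(swap_host(url, old_host, new_host), quote=True)
        lines.append(f'<a href="{target}">{target}</a><br>')
    lines += ["</body>", "</html>"]
    return "\n".join(lines)
```

For example, `build_page(urls, "old.example.com", "test.example.com")` returns the page as a string, which can then be written to a file with an .html extension.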

In SiteSucker, go back into the settings and check the Limits. We want to enforce a maximum level of 1 now. This is because we already have all the relevant links in our file – no point in asking SiteSucker to recrawl the entire site for every original link (this would take days).

Finally, drag the HTML file you created (containing the test or new domain name and all links) into the Web URL bar and let go. SiteSucker dutifully follows every link and reports its findings. Hopefully you won’t have too many ERRORs, but if you do, they are now easy to fix, since you have a log showing exactly which redirects failed!
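With a long log, it helps to pull out just the failures. The sketch below assumes that failed entries contain the word ERROR somewhere on the line and that the affected URL appears on the same line; check this against your own log before trusting it. The function name `failed_urls` is made up for this example.

```python
import re

# Assumption: a failed request is logged on a single line that contains
# the marker text (default "ERROR") together with the URL that failed.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def failed_urls(log_text, marker="ERROR"):
    """Return the URLs found on log lines that contain the marker."""
    failures = []
    for line in log_text.splitlines():
        if marker not in line:
            continue
        match = URL_PATTERN.search(line)
        if match:
            failures.append(match.group(0))
    return failures
```

The resulting list is the set of old URLs that still need a redirect rule.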

Don’t let huge numbers of links frighten you into thinking you can’t make your site better. There are plenty of web crawlers available on many platforms. I have also used and recommend Xenu on Wine. Thinking a bit outside the box can make it easy to turn that mountain back into a molehill.

How do you maintain your sites’ link “healthiness”? Let us know in the comments.
