May 23, 2005

In defense (sort of) of Statistics

Nick Finck says, "when it comes to web statistics, be very skeptical." I'm inclined to agree with Nick, but I also believe that web stats are a very powerful and useful tool. But, like any power tool, one needs to know how to use it before running around and applying it willy-nilly to whatever one comes across. Thus, I thought it might be nice to spend a few moments expounding on the subject of web stats.

User Centered Design


I've been on the user-centered design bus for quite some time now. Looking at user behavior can help create a more useful experience, and the most valuable insights are often those that come directly from the users. Many techniques, from personas to usability tests, are quite abstracted from the users themselves. You need to create something of an artificial scenario to, for example, run a usability test. This doesn't mean the technique isn't valid; it just means that you need to realize that you're observing a simulation, not real-life behavior.

But when using server logs (or other forms of web statistics), you are observing real-life behavior. This direct connection to what the user is actually doing makes the use of server logs an important tool. Important, but, as pointed out by Nick and Tim Bray, not necessarily straightforward.

Know Your Goals


Make sure you've defined goals before you embark on a trip into the world of stats. It is too easy to wander aimlessly amongst the pretty, flashing lights. Goals can help anchor you. Are you looking to measure the effect of a design tweak? Or perhaps measure an outside activity, like a training or marketing campaign? Or do you just want to know what percentage of your users are using Internet Explorer 5.0? Different goals will dictate the use of different tools or methods of analysis.

Focus on Trends, Not Numbers


This won't help Tim, who was asked to come up with an exact number of "feed users," but when using stats, I tend to focus on trends, not exact numbers. I'm interested in change over time. Is usage going up? Did the newsletter sent out on Tuesday translate into increased usage of a particular section? Did the re-positioning of this element increase or decrease usage? Answering these types of questions bypasses the inherent fuzziness of determining things like unique visitors. As long as the calculation for determining visitors remains constant, you'll likely have enough information to answer your questions.
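To make the "trends, not numbers" idea concrete, here is a minimal sketch in Python. It assumes you can already extract one date per visit from your logs using whatever visitor definition your stats package applies; the function names `weekly_counts` and `percent_change` are made up for illustration.

```python
from collections import Counter
from datetime import date


def weekly_counts(visit_dates):
    """Count visits per ISO (year, week), sorted chronologically."""
    counts = Counter()
    for d in visit_dates:
        iso = d.isocalendar()
        counts[(iso[0], iso[1])] += 1
    return dict(sorted(counts.items()))


def percent_change(before, after):
    """Relative change from one period to the next, in percent."""
    return (after - before) / before * 100.0


# Hypothetical data: two visits in one week, three in the next.
visits = [date(2005, 5, 16)] * 2 + [date(2005, 5, 23)] * 3
per_week = list(weekly_counts(visits).values())
change = percent_change(per_week[0], per_week[1])
```

The absolute counts may be fuzzy, but as long as the same counting rule feeds every week, the direction and rough size of the change are still meaningful.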

Define the Terms


Each web stats package is going to be different. They'll define terms (like "hit", "page", and "visitor") differently. It is up to you, as the analyst, to figure out how each package uses the terms, and adjust accordingly. The good ones will let you adjust the algorithms to, for example, filter out IP addresses belonging to your organization's employees (assuming they're not the target audience).
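If your package can't do that kind of filtering for you, the idea is simple enough to sketch in Python. The network range below is a documentation-only placeholder, and the hit records (dictionaries with an `ip` key) are an assumed format, not the output of any particular tool.

```python
import ipaddress

# Hypothetical: replace with the ranges your organization actually uses.
INTERNAL_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def is_internal(ip):
    """True if the address falls inside one of the internal networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in INTERNAL_NETWORKS)


def external_hits(hits):
    """Keep only the hits that did not come from internal addresses."""
    return [h for h in hits if not is_internal(h["ip"])]


hits = [
    {"ip": "192.0.2.10", "path": "/"},        # employee, filtered out
    {"ip": "198.51.100.7", "path": "/about"},  # outside visitor, kept
]
outside = external_hits(hits)
```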

A more complicated case comes with dynamic pages. Are page.cgi?id=1 and page.cgi?id=2 the same page or two different pages? Obviously, this depends on your setup, but you'd better be able to tell your stats package which is which.

I remember one case where, at first glance, it looked like one prominent feature on the site wasn't being used much at all. But, I realized that I needed to tell the stats system to account for the query string, and lo and behold, those pages were being used. Good thing we didn't take rash action before I figured out how to use the system!
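Here is a rough Python sketch of what "accounting for the query string" can mean. It assumes you know which parameters actually select content (here `id`) and which are noise (a session token, say); `page_key` and its `significant_params` argument are names invented for this example, not a feature of any stats package.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit


def page_key(url, significant_params=()):
    """Reduce a URL to the key used to decide what counts as one page.

    Only query parameters listed in significant_params are kept, so
    parameters that don't change the content don't split the counts.
    """
    parts = urlsplit(url)
    kept = sorted(
        (k, v) for k, v in parse_qsl(parts.query) if k in significant_params
    )
    return parts.path + ("?" + urlencode(kept) if kept else "")


# With "id" marked as significant, these are two different pages.
first = page_key("/page.cgi?id=1&session=abc", ("id",))
second = page_key("/page.cgi?id=2", ("id",))
# With nothing marked, everything collapses into one page.
collapsed = page_key("/page.cgi?id=1")
```

Whether `id` belongs in that list depends entirely on your setup, which is the point: the tool can't guess, so you have to tell it.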

Multiple Sources


Assuming you can analyze all of it, I'm a fan of using multiple sources of information. At work, we use a number of methods to learn more about our users.

I've recently written about my purchase of ClickTracks, which is neat click stream analysis software. It gives a page-by-page account of where users are clicking. This will, I think, be very useful in gauging the success of re-designs and newsletter campaigns. It also presents the data in a very nice, non-threatening way. But it doesn't meet all of the criteria I laid out for my ideal stats system. So, we use a second (and maybe a third) server log analysis tool to help with these other items. Each tool gives us a different viewport onto what is happening on the site.

We store queries entered into the site's search engine to get a better idea about the terms people use to search our site. We check those logs to see if the users are actually finding what they're looking for, and if they're entering queries that we don't have answers for. It turns out they are, and so we're working to address this by adding a different type of search functionality (long story).
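As an illustration only, checking a search log for unanswered queries might look like the Python sketch below. It assumes each log entry is a pair of the query text and the number of results returned, which may not match what your search engine records.

```python
from collections import Counter


def unanswered_queries(search_log, top=10):
    """Most common queries that returned no results.

    search_log is an iterable of (query, result_count) pairs; queries
    are trimmed and lowercased so trivial variants are counted together.
    """
    misses = Counter(
        query.strip().lower() for query, results in search_log if results == 0
    )
    return misses.most_common(top)


# Hypothetical log entries.
log = [("Widgets", 0), ("widgets ", 0), ("gadgets", 5), ("foo", 0)]
gaps = unanswered_queries(log)
```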

Our website is a portal, even though I don't care for that term. So, we care about the usage of the resources we link to. I built a little home-grown stats collection tool to track this information. This gives us a real nice view into how our site gets used, and by whom. We can see the effect of, for example, training activity on usage. And, if we see an unexpected spike in usage, we can investigate. Often, there is an interesting story behind dramatically increasing usage.

Assuming each view is actually shedding light on your goals, the more views you can muster, the better.

Use Your Head


Don't throw out common sense just because a stats program prints a nice pie chart. Make sure you have a good sample size, both in sheer numbers and in time. Too few hits will likely lead to distortion. And a short time window could bring other factors into play that might not exist when viewing multiple weeks or months of data. Don't forget that the stats program you're using quite likely isn't infallible. Don't make big decisions based on a percentage point or two. At the end of the day, you'll probably want to consider other non-stats factors in any decision.
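For a rough feel of why a percentage point or two shouldn't drive a decision, here is a back-of-the-envelope Python sketch using the standard normal-approximation margin of error for a proportion. It treats hits as an independent random sample, which log data usually isn't, so read the result as an optimistic lower bound on the uncertainty.

```python
import math


def margin_of_error(share, n, z=1.96):
    """Approximate 95% margin of error for a proportion observed in n hits."""
    return z * math.sqrt(share * (1 - share) / n)


# A 50% share measured on 100 hits vs. on 10,000 hits.
small_sample = margin_of_error(0.5, 100)
large_sample = margin_of_error(0.5, 10_000)
```

With 100 hits the margin is close to ten percentage points either way; with 10,000 it drops to about one. A two-point difference means very little in the first case.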


What do you think? Did I miss any obvious points? Let me know!