Right Data, Wrong Conclusion: Traffic Analytics and the Distribution Problem
Web traffic data is a veritable goldmine of information for online businesses. From small-time blogs to e-commerce giants, companies rely on traffic data analytics to increase efficiency, attract new customers, and grow revenue streams. But there's a catch. Most statistical tests, especially the tests most familiar to business leaders, rely on a set of assumptions about the data they analyze. Arguably the most important of these assumptions is the expectation that the data follow a normal distribution. The problem is that many important web traffic metrics are, by their very nature, unlikely to meet this condition. Applying standard-issue statistical tests to this kind of data can produce extremely misleading, and potentially disastrous, results.
The Normal Distribution and Parametric Statistics
In statistics, a variable's distribution describes how often each possible value of the variable is observed in a dataset. A normal, or Gaussian, distribution follows the familiar shape of the bell curve: most observations are clustered in the middle around the mean, and few observations have extreme values below or above the mean. If college students' scores on an economics exam follow a normal distribution with an average of 80 (and a standard deviation of 5), then about two-thirds of students would have scores between 75 and 85. Fewer than 5% of students would have scores higher than 90 or lower than 70.
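The exam-score figures can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the variable names are illustrative, and the mean of 80 and standard deviation of 5 come straight from the example above.

```python
from statistics import NormalDist

# Exam scores from the example: mean 80, standard deviation 5.
scores = NormalDist(mu=80, sigma=5)

# Share of students expected to score between 75 and 85
# (within one standard deviation of the mean).
within_one_sd = scores.cdf(85) - scores.cdf(75)

# Share expected to score below 70 or above 90
# (more than two standard deviations from the mean).
beyond_two_sd = scores.cdf(70) + (1 - scores.cdf(90))

print(f"Between 75 and 85: {within_one_sd:.1%}")
print(f"Below 70 or above 90: {beyond_two_sd:.1%}")
```

The first share comes out at roughly 68%, or about two-thirds, and the second at a little under 5%.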
Many common web traffic metrics can be expected to follow a normal distribution. This is particularly true for metrics denominated in units of time, like page views per day. Over the last 100 days, a website that averages 5,000 page views per day has probably attracted between 4,000 and 6,000 views on most observed days in the dataset. But there would probably be a few bad days with fewer than 2,500 views and some good days with more than 7,500 views mixed into the data. Other metrics like page views per week or unique visitors per day are likely to follow the same shape, even though observed values will differ.
Many of the most widely used statistical tools, including the T-test, the F-test, analysis of variance (ANOVA), and ordinary least squares (OLS) regression, assume that the data adhere to a normal distribution. These are called parametric tests, and they're excellent tools for analyzing the web traffic metrics discussed above. In practice, of course, real data are always at least slightly non-normal, but these tests are robust to slight deviations. The problems start when analysts apply parametric tests to data that are not even approximately normally distributed.
Traffic Data Violates the Normality Assumption
The difficulty for data analysts is that many important web traffic metrics are unlikely to follow normal distributions. In fact, it is impossible for some metrics to be "normal." Using parametric tests to analyze these types of variables is a recipe for disaster.
First, many variables of interest to marketers, blogs, and e-commerce websites follow a binomial distribution, meaning that each individual observation has only two possible values. These are usually "yes or no" variables. A classic example is click-through rate: either the user clicks on the link, or the user does not click on the link. Another example is bounce rate: either the user immediately exits the page or the user stays. There's no way for this type of variable to be normally distributed since there are only two possible values.
Other significant traffic metrics, specifically those that count the number of times an event occurs, are likely to follow a Poisson distribution. Businesses that generate revenue from ads, for example, are likely to be interested in how many different pages their users access on each visit. Most visitors are likely to view one or two pages. A smaller number will view three or four pages, and a still smaller number will view five or more. The same pattern holds for many important e-commerce metrics, like the number of items each customer buys: most customers will purchase one item, some will purchase two or three, and a few will buy many more.
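To see the shape this implies, the sketch below computes Poisson probabilities directly from the formula. Two things here are assumptions made for illustration, not part of the discussion above: the average of one extra page per visit, and the choice to model pages viewed beyond the landing page (a Poisson count can be zero, but a visit always includes at least one page).

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """Probability of exactly k events when the average count is `mean`."""
    return math.exp(-mean) * mean**k / math.factorial(k)


# Hypothetical average: one additional page per visit beyond the landing page.
mean_extra_pages = 1.0

for extra in range(6):
    share = poisson_pmf(extra, mean_extra_pages)
    # `extra + 1` converts "extra pages" back into total pages viewed.
    print(f"{extra + 1} page(s): {share:.1%}")
```

Under these assumptions, the shares fall off quickly as the page count rises, and nothing about the result resembles a symmetric bell curve.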
Attempting to analyze these types of variables with parametric statistics can introduce all sorts of problems. These tests lose statistical power when data are not normally distributed, meaning that they are less able to detect meaningful statistical relationships. In an A/B test assessing the click-through rate for two different ads, for example, a T-test may indicate no significant difference between the ads even when a large difference exists.
Parametric tests can also become highly sensitive to outliers when normality is violated. Suppose items per customer follow a Poisson distribution, with most customers purchasing one or two items, but one customer buys 100 items. The test will tend to overvalue the information provided by that outlier. In essence, the presence of that single observation of 100 items purchased tricks the test into thinking that many other customers bought 40, 50, or 60 items.
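A small, made-up dataset shows how much weight a single large order carries in the mean and standard deviation, the two quantities parametric tests lean on. The order counts below are hypothetical and chosen only to mirror the example.

```python
from statistics import mean, median, stdev

# Hypothetical items-per-customer data: 19 ordinary orders...
typical_orders = [1, 1, 1, 2, 1, 1, 2, 1, 3, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2]
# ...plus a single customer who buys 100 items.
with_outlier = typical_orders + [100]

for label, orders in [("without outlier", typical_orders),
                      ("with outlier", with_outlier)]:
    print(
        f"{label}: mean={mean(orders):.2f}, "
        f"median={median(orders)}, stdev={stdev(orders):.2f}"
    )
```

With these numbers the mean jumps from about 1.4 items to 6.3 items and the standard deviation balloons, while the median stays at one item. Rank-based summaries barely notice the outlier. Moment-based ones are dominated by it.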
Dealing with the Non-Normal Data Problem
Parametric tests are vulnerable to failure when dealing with non-normal data, but that doesn't mean the data are useless. Clever analysts can still extract valuable information from traffic metrics like click-through rate and page-view counts by employing two techniques.
First, researchers have created analysis tools that can substitute for parametric statistics in most circumstances. These non-parametric tests can often provide the same information as parametric tests without relying on the assumption of normality. The commonly used parametric T-test, for instance, can be replaced with the Mann-Whitney test. Whereas the T-test asks whether the difference between two sample means is larger than chance alone would explain, the Mann-Whitney test assesses the chance that a random observation from one sample is less than or greater than a random observation from a second sample. Choosing the correct non-parametric test can be complicated because the appropriate test can depend on the type of data and the underlying distribution. When used appropriately, however, these tests provide reliable and valuable information.
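The idea behind the Mann-Whitney test can be sketched in plain Python by comparing every observation in one sample with every observation in the other. The function name, the two samples, and the "design A / design B" framing are all hypothetical. This sketch computes only the U statistic and the implied probability, not a p-value; in practice a statistics library (SciPy's `scipy.stats.mannwhitneyu`, for example) would supply the significance test.

```python
from itertools import product


def mann_whitney_u(sample_a, sample_b):
    """U statistic for sample_a: the number of (a, b) pairs in which
    a is larger than b, with ties counted as half a win."""
    u = 0.0
    for a, b in product(sample_a, sample_b):
        if a > b:
            u += 1.0
        elif a == b:
            u += 0.5
    return u


# Hypothetical pages-per-visit counts for two page designs.
design_a = [1, 1, 2, 1, 3, 1, 2, 1]
design_b = [2, 3, 1, 4, 2, 5, 3, 2]

u_a = mann_whitney_u(design_a, design_b)
# Estimated chance that a random visit under design A has more page views
# than a random visit under design B (ties split evenly).
prob_a_beats_b = u_a / (len(design_a) * len(design_b))
print(f"U = {u_a}, P(A > B) is about {prob_a_beats_b:.2f}")
```

Because the statistic depends only on which observation in each pair is larger, a single extreme value cannot move it any more than a slightly larger one would.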
The second technique available to analysts is data transformation. This method is particularly useful when a variable's distribution is relatively close to normal but is positively or negatively skewed. A good example is engaged minutes: most visitors might spend two or three minutes on a site, but some are likely to spend 20 or 30 minutes, which creates a positively skewed distribution, characterized by a long tail extending to the right-hand side of the histogram. This variable could be transformed to give it a more normal distribution by taking the square root. The bulk of the observations, those in the two-to-three-minute range, would change only slightly, but the observations responsible for the skew would shrink quite a bit. An observation of 25 minutes would be reduced to a value of five, for instance.
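The effect of a square-root transformation can be checked by measuring skewness before and after. The engaged-minutes values below are invented for illustration, and the `skewness` helper is a simple population-skewness formula written for this sketch rather than a library function.

```python
import math
from statistics import mean, pstdev


def skewness(values):
    """Population skewness: the average cubed z-score.
    Zero for symmetric data, positive for a long right tail."""
    mu, sigma = mean(values), pstdev(values)
    return mean([((v - mu) / sigma) ** 3 for v in values])


# Hypothetical engaged minutes: mostly short visits with a long right tail.
engaged_minutes = [1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 6, 8, 12, 20, 30]
sqrt_minutes = [math.sqrt(m) for m in engaged_minutes]

print(f"skewness before: {skewness(engaged_minutes):.2f}")
print(f"skewness after:  {skewness(sqrt_minutes):.2f}")
```

For data like these, the skewness should drop noticeably after the transformation, though it need not reach zero. Whether the result is close enough to normal for a parametric test is still a judgment call.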
Transforming data can seem like cheating, but power transformations retain 100% of the information contained in the original variable, because the original values can always be recovered by reversing the transformation. In that sense, this type of change is loosely analogous to converting units, like changing degrees Fahrenheit to degrees Celsius. Transformation won't help with every kind of web traffic data, however. Variables with binomial distributions can't be transformed usefully, for example. Additionally, it can be difficult to interpret the results of tests performed with modified data because the units are unfamiliar.
Despite their shortcomings, both non-parametric tests and data transformation are vastly superior to using parametric tests on non-normal data. The application of parametric tests to data that substantially violate the normality assumption is never appropriate. Such applications can often give "correct" answers, but this is the result of coincidence, not reliable data analytics practices.