Statistics and Web Analytics: When Do I Need What?

Management Summary

Whenever we work with large amounts of data or tables, it is tempting to turn to statistics for help. Web analytics is no exception. But statistics can only be applied under certain conditions, and it cannot answer every question. In this article I would like to clarify when statistics can be used sensibly in web analytics. One more thing before we start: to everyone who fears that the screen will soon disappear behind a wall of formulas, I promise that I will get by in this article without any formulas or arithmetic.

What does statistics actually do?

The word statistics can have many meanings. The one I am referring to here is inference from a sample to an underlying population. Strictly speaking, this type of statistics is called “inferential statistics.” Every investigation, scientific or not, begins with something we want to find out. There are entities I want to know about, such as visitors to a website. The idea of statistics is to look at just a few representative units instead of all of them, so only the units of a clearly defined sample are examined. The aim is to make statements about the population with at least a certain degree of accuracy, without having to bear the cost of surveying the entire population.

Very important concepts here are the sample variance, the fluctuation range and the statistical significance. What do they mean?

Sample variance

If we look at a characteristic, for example the session duration, then of course it is different for every visitor to a website. While you can calculate an average session length, hardly any visitor will match that average exactly. Some will have longer sessions, others shorter ones. Statisticians call the fact that not everyone is like the average variance. Everyone else simply calls it scatter. And it is not just the units of the population that are scattered: the units of the sample are also scattered around the so-called sample mean.
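As a small illustration of mean and scatter, the following Python sketch uses the standard library's `statistics` module. The session durations are made-up values chosen for the example, not real measurements.

```python
import statistics

# Hypothetical session durations in minutes for five sampled visitors.
session_minutes = [3.0, 8.5, 11.0, 16.5, 21.0]

sample_mean = statistics.mean(session_minutes)
# variance() and stdev() use the sample formulas (division by n - 1).
sample_variance = statistics.variance(session_minutes)
sample_stdev = statistics.stdev(session_minutes)

print(f"sample mean: {sample_mean:.1f} min")
print(f"sample variance: {sample_variance:.2f} min^2")
print(f"sample standard deviation: {sample_stdev:.2f} min")
```

None of the five made-up visitors matches the sample mean exactly, and the variance summarizes how far they are spread around it.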

Range of fluctuation

Because our observations are spread out, statements derived from a sample are also subject to uncertainty. We could be unlucky when selecting the sample and happen to catch only visitors who were on our website for a particularly short time. Then the sample mean would indicate a short session duration. However, it is quite unlikely that a sample contains only visitors with short session durations. The more extreme the result, the less likely it is that the sample consists only of such extreme observations. That is why statistical key figures are always reported together with their range of fluctuation (often called the margin of error). It gives us a range in which the actual average of the population should lie with the usual 95 percent statistical certainty.

So if I take a sample to find out the average visit time on my website, I might find, for example, that the average session length on my website is 12 minutes +/- 2 minutes of variation. With this knowledge, I can then say that the average session duration for all website visitors is between 10 and 14 minutes with 95 percent statistical certainty.
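For readers who would like to see how such a range can be computed, here is a minimal Python sketch. It assumes a simple random sample that is large enough for a normal approximation; for small samples a t-distribution would normally be used instead. The function name `mean_with_margin` is made up for this illustration.

```python
import math
import statistics
from statistics import NormalDist


def mean_with_margin(values, confidence=0.95):
    """Return (sample mean, range of fluctuation) using a normal approximation."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.fmean(values)
    # Standard error: sample standard deviation divided by sqrt(n).
    standard_error = statistics.stdev(values) / math.sqrt(n)
    # Two-sided critical value, roughly 1.96 for 95 percent certainty.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, z * standard_error
```

Called as `mean, margin = mean_with_margin(session_minutes)`, the interval from `mean - margin` to `mean + margin` corresponds to the "12 minutes +/- 2 minutes" style of statement above.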

Statistical significance

So far so good. However, we are often more interested in differences between groups, for example whether my target group of 18-34 year olds stays on my website longer than older visitors. If we work on the basis of samples here, statistics comes into play again. Comparing groups essentially means calculating the sample mean and its range of fluctuation, in this case separately for each of the groups being compared. If the resulting intervals (mean plus or minus range of fluctuation) do not overlap, then one can say that the difference will most likely be found not only in the sample, but also in the population.

It should already be clear, but to be on the safe side: it is entirely possible that a difference does not show up in the sample even though it exists in the population. This is the familiar bad luck when drawing a sample. If we are unlucky, we only pick people from both the young and the old target group who were on the website for a short time. Then there would be no difference in the sample, even though one exists in the population. Of course, the same thing can happen the other way around: a difference shows up in the sample that does not exist in the population. That is bad luck again. But statistical significance helps us determine how likely it is that we have simply been unlucky.

And that is exactly what is meant by statistical significance, just expressed as a number (the p-value). The p-value indicates how likely a difference at least as large as the observed one would be if there were no difference in the population. The smaller the p-value, the more certain you can be that the difference is not just in the sample.
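A group comparison of this kind can be sketched in a few lines of Python. This is only one possible approach: it uses a normal approximation that is reasonable for large samples, whereas small samples would usually call for a t-test. The function name `two_sample_p_value` is an assumption made for this example.

```python
import math
import statistics
from statistics import NormalDist


def two_sample_p_value(group_a, group_b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    mean_diff = statistics.fmean(group_a) - statistics.fmean(group_b)
    # Standard error of the difference, allowing unequal variances.
    standard_error = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    if standard_error == 0:
        raise ValueError("no variation in either group; the test is undefined")
    z = mean_diff / standard_error
    # Probability of a difference at least this extreme if the
    # population means were actually equal.
    return 2 * (1 - NormalDist().cdf(abs(z)))
```

Here `group_a` and `group_b` would be the session durations of the sampled younger and older visitors. A result below the conventional 0.05 threshold corresponds to the 95 percent certainty mentioned earlier.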

When do we need statistics and when not?

The answer to the question of when we can use statistics in web analytics is actually quite simple: it depends on whether the data that needs to be analyzed is a sample from a larger population or the population itself. The problem, however, is that it is often not so easy to say. It also depends very much on the question.

There are four typical questions:

Statements about a specific period of time

When making statements about a certain period of time, I want to know how strongly a feature occurred during that period. For example, how many visitors did my website have in the last 30 days? I usually don’t need statistics for this. Rather, one looks at the entire period and accepts the result as such. No fluctuation ranges, no significance. Theoretically, it would be possible to make this statement based on a sample: first you randomly select a few days, then you calculate the average over these selected days, and finally you determine the fluctuation range. As I said, this is technically possible, but there is no reason to proceed this way because every analysis tool can provide the numbers for the last 30 days at the push of a button.

Comparison of two time periods

Things don’t get much more exciting here. If I want to know whether my website grew more in the last month than in the previous month, then I operate with the population again. I calculate the total of all visitors in the last month and also the total of all visitors in the month before. So no sample is taken and there is no fluctuation range and no significance in this question. Differences that are observed are actual differences. Whether their size has practical significance or not cannot be judged by statistics.

Comparison of many time periods

Things get more exciting when we want to compare several time periods. Let’s say I run an internationally successful website and want to know whether the usage figures during the day are very different from those at night. Here I can again calculate the daily difference between day and night visitors based on the population without any statistics and then compare them. If I answer this question without a sample, then the same applies as above: no sample, no statistics.

But I could also take a representative random sample and determine the number of day and night visitors for just a few days. A statistical test then helps me determine whether the differences found are large enough that I can assume they also occur in the population (and not just in the sample). So here there is a sample variance (namely the respective spread of the daytime and nighttime visitor numbers), a range of fluctuation (namely the range in which the respective number of visitors in the population is likely to lie), and a statistical significance, namely the certainty with which I can say that the difference that occurred in the sample also occurs in the population.
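One possible test for such sampled days is a sign-flip randomization test on the daily differences (day visitors minus night visitors). The sketch below is an illustration rather than a prescribed method: the function name, the number of resamples and the fixed seed are all assumptions made for this example.

```python
import random


def paired_sign_flip_p_value(differences, n_resamples=10_000, seed=0):
    """Two-sided randomization p-value for paired daily differences."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(differences)
    observed = abs(sum(differences) / n)
    at_least_as_extreme = 0
    for _ in range(n_resamples):
        # If day and night do not differ, each daily difference is
        # equally likely to be positive or negative.
        flipped = sum(d if rng.random() < 0.5 else -d for d in differences)
        if abs(flipped / n) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_resamples
```

The input would be one difference per sampled day. If day traffic is consistently higher on every sampled day, only very few random sign patterns produce a mean difference as large as the observed one, so the p-value becomes small.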

Comparison of two variants

From a statistical perspective, the most exciting thing is certainly the experiment, i.e. the comparison of two or more variants with each other. For example, does one banner lead to more sales (conversions times conversion value) than another? Here we have the first case where it is easier to answer this question using a sample.

Let’s say I run the experiment for 2,000 impressions: Banner A is shown to 1,000 unique clients, and Banner B is shown to 1,000 other unique clients. I then determine which banner a visitor saw, whether there was a conversion during their visit and finally what value this conversion had. Here you can now calculate the respective sample means, sample variances and fluctuation ranges for the Banner A visitors and the Banner B visitors. Finally, I can use a statistical test to determine whether the sales from Banner A are statistically significantly different from the sales from Banner B. And only if there is a statistically significant result can I apply this difference to the population. Only then can I assume that an increase in conversions among the Banner B visitors in the sample would also be found in the population.
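The banner experiment can be sketched in Python as follows. The sketch assumes at most one conversion per visitor and again relies on a normal approximation. Since revenue per visitor is usually very skewed (most visitors generate no revenue), the approximation needs reasonably large samples. Both function names are made up for this illustration.

```python
import math
import statistics
from statistics import NormalDist


def revenue_per_visitor(n_visitors, conversion_values):
    """One revenue figure per visitor; visitors without a conversion count as 0."""
    if len(conversion_values) > n_visitors:
        raise ValueError("more conversions than visitors")
    padding = [0.0] * (n_visitors - len(conversion_values))
    return list(conversion_values) + padding


def compare_variants(revenue_a, revenue_b, confidence=0.95):
    """Return (difference B - A in mean revenue, range of fluctuation, significant?)."""
    diff = statistics.fmean(revenue_b) - statistics.fmean(revenue_a)
    standard_error = math.sqrt(
        statistics.variance(revenue_a) / len(revenue_a)
        + statistics.variance(revenue_b) / len(revenue_b)
    )
    margin = NormalDist().inv_cdf(0.5 + confidence / 2) * standard_error
    # Significant if the interval diff +/- margin does not contain zero.
    return diff, margin, abs(diff) > margin
```

With 1,000 visitors per banner, `revenue_per_visitor(1000, values)` turns a list of conversion values into per-visitor revenues, and `compare_variants` then reports whether the difference in average revenue per visitor is large compared with its range of fluctuation.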

Summary

So we have seen that the application of statistics in web analytics depends very much on the question. The entire database can often be used to answer the question. Then no statistical methods are necessary. Statistics are only used when working with samples to determine whether findings from the sample can also be transferred to the population.

The procedures described all refer to describing the past. To make predictions, such as how many visitors I will have on my website in two months, econometrics is used instead of the classic statistical toolkit. It has its own rules and ideas. But more about that another time.

Christoph Waldhauser is a data scientist at HEROLD Business Data. He deals with modeling customer behavior and geographical conditions.

More about this at: Herald Dialogue and Data
