Practical Use for Standard Deviation & Percentiles

Your vision is to offer your customers the most efficient way to purchase from you while setting yourself up for repeat business. You've been thinking about a subscription model. Standard deviation and percentiles will help you here.

Your goal is to get a baseline and understand how often a customer shops with you in a year. You want to take this information and determine if a subscription-based model may make sense. A few customers have mentioned it here, and there is feedback, but now it’s time to dive into the data and see how often they purchase.

You could figure this out by taking the total order quantity in a period of time such as the year 2021 and divide that by your customer count and you would be done. Or would you? Sometimes average isn’t the best route to go because you may have data that can skew the average. If you use a few additional calculations, you can ensure visibility into outliers.

The solution to this is to look into descriptive statistics, by not only looking at the averages but also include mode, min, max, standard deviation, and percentiles.

This link is to a sample spreadsheet. Let’s walk through the definition and purpose for each item in the spreadsheet. The focus will be on total order quantity.

Key descriptive statistics terminology

Mean: The most straightforward and commonly used. Find the total order quantity and divide by the number of customers. In this case, you see in column H, or for the average order quantity, it’s rounded to 7.5. This is also referred to as the mean.

Mode: The most commonly occurring number, in this case, column I, and the most commonly occurring order quantity is 10. So most often, your customers ordered 10 times in 2021.

Min: The fewest number of orders, column, J, 1.

Max: The highest numbers of orders, column K, 12.

Range: Mentioning range in definitions as it is part of descriptive statistics but this wasn't used in our calculations because it doesn't provide much additional value here. You would take max-min to get this, and in our case, it would be 11. This tells you the difference between the lowest and highest values.

Percentile: This can be used to look at your data set and determine a relative rank. It’s not going to tell you the rank of any particular customer, but instead, as a whole. You could sort your data and divide it out to find the percentiles manually, but Excel and Google Sheets make it very easy with the formulas provided. The 50th percentile is also known as the median.

If you have kids, you go to the doctor for check-ups and they tell you where their height and weight rank among other kids. So if your child is in the 90th percentile, then only 10% of kids are taller or weigh more than yours. Your child is taller and weighs more than 90% of other comparable-age kids.

Standard Deviation: This is going to tell you how spread out your data is around the average or mean. Why would you want to know this? It will tell you where most of your customer data will fall. Again, you could do this manually, but for the purpose of this article, you need to understand why you would use this and you may use Excel or Google Sheets to quickly do this. As long as you have a normal distribution in your sales data, you can follow the 68- 95-99.7 rule, where 68% of your data falls within 1 standard deviation, 95% within 2 standard deviations, and 99.7% within three.

Let’s look at calculating the standard deviation to better understand how we arrive here.

If you want to manually find the standard deviation, you can look at columns E and F.

Step 1: Find the mean or average order quantity, 7.511.

Step 2: For each number (each of the 1000 customer IDs) subtract the mean from the total order quantity, column E.

Step 3: Square each of those values, column F.

Step 4: Find the average of the squared differences, which is called variance, cell H12.

Step 5: Take the square root of the variance, cell H14.

Using this method, or the formula in Google Sheets, we get 2.4 as the standard deviation.

Let's summarize our descriptive statistics

We have all these extra numbers now on the spreadsheet, what does this mean for the total order quantity? The purpose of looking at additional figures is to put your average in context, and also do a spot check for outliers in your calculation that could skew your average. How would you do that? Look and min and max. Your minimum order quantity is 1 and the max is 12. If you adjust a random field in column C and let's do the example of replacing cell C3 with 30,000, you will see the max go up to 30,000 and your average goes up to 37.5. It’s recommended that before you start looking at your numbers, you check that the min and max make sense and if you have one or two extreme outliers, they get removed from the data set and note that in your assumptions. Luckily, our numbers make sense, but just think if you told someone the average order quantity is 37.5 just because of one single data entry mistake! Sometimes it’s not a mistake, you may have a single customer with extreme order volume, but either way, that is not the ‘norm’ and if you really want to understand what your typical volume is, you should remove the couple of extreme outliers and document it.

You see that each customer on average ordered from you 7.5 times in 2021. However, the mode is 10. So, 10 is the most commonly occurring count of orders. Why would the average differ so much from the mean? You likely have more order quantities that are on the low end, which is skewing your average down a bit.

Additionally, let’s look at the percentiles. The 50th percentile is 8, meaning that half your customers ordered more than that and half less. The 90th percentile is 10, meaning that just 10% of your customers ordered from you more than 10 times, 90% of them ordered 10 or fewer times.

So now what do you do with a standard deviation of 2.4? 68% of data will fall within one standard deviation of the mean. So, go back to the average or mean of 7.5 + 2.4 standard deviation and 7.5 - 2.4 = 9.9 and 5.1. 68% of your customers ordered between 5.1 and 9.9 times. 95% of your customers between 2.7 and 12.3 orders and 99.7% between .3 and 14.7 orders.

Conclusion on descriptive statistics and your subscription model idea

Descriptive statistics isn’t directly a predictive tool, however, you now have a baseline based on past performance at what to expect for the number of times a customer will purchase from you.

So going back to the original question, does a subscription model make sense? You’re on your way to figuring this out. You can see that 90% of your customers order 10 or fewer times a year from you, half of them ordered 8 times, so you may want to look further at the times of the year they order, but you may be able to offer a bimonthly or monthly subscription. You can plug in the 99th percentile to find that is 12. So 99% of your customers ordered 12 times or fewer, and with 52 weeks in a year, it’s highly unlikely that your customers would sign up for a weekly subscription.

Try applying descriptive statistics the next time you have a business problem to solve that you would normally just use an average for and comment how it went!

Learn more about how I can help.

Written by Nicole Hullihen, July 31st, 2021

References recommended to learn more about descriptive statistics and percentiles.

https://www.investopedia.com/terms/d/descriptive_statistics.asp

https://bookdown.org/ejvanholm/Textbook/descriptive-statistics.html

https://www.mathsisfun.com/data/standard-deviation.html

https://www.dynatrace.com/news/blog/why-averages-suck-and-percentiles-are-great/

https://www.thoughtco.com/what-is-a-percentile-3126238