Quantifying scatter of continuous data#

Introduction#

In the realm of biostatistics, understanding the concept of scatter is crucial. Scatter refers to the spread of data points around the central value. It provides insights into the variability or diversity in a data set.

In this chapter, we will explore various methods to quantify scatter, such as the range, interquartile range, variance, and standard deviation. These measures help us understand the dispersion in our data and are fundamental to many statistical tests and models.

Here’s a sneak peek into what we will cover with Python:

  1. Range: the simplest measure of scatter, defined as the difference between the maximum and minimum values in a dataset.

  2. Interquartile Range (IQR): this measure gives us the range within which the central 50% of our data lies.

  3. Variance (Var) and standard deviation: these are more sophisticated measures of scatter that take into account how far each data point is from the mean.

Definitions#

Variance#

The variance is a measure of how much the data varies around its mean. There are two different definitions of variance.

The population variance (\(\sigma^2\)) assumes that we are working with the entire population and is defined as the average squared difference from the mean:

\[\operatorname{Var}_p(x) = \sigma_p^2(x) = \sigma^2(x) = \frac{1}{n}\sum_{i = 1}^{n}(x_i - \overline{x})^2\]

More generally, the variance of a random variable \(X\) is the expected value of the squared deviation from the mean of \(X\), \(\mu =\operatorname{E}[X]\):

\[\operatorname{Var}(X) = \operatorname{E} \left[(X - \mu)^2\right]\]

The sample variance (\(s^2\)) assumes that we are working with a sample and attempts to estimate the variance of a larger population by applying Bessel’s correction to account for potential sampling error. The sample variance is:

\[\operatorname{Var}_s(x)= \sigma_s^2(x) = s^2(x) = \frac{1}{n-1}\sum_{i = 1}^{n}(x_i - \overline{x})^2\]

One can understand Bessel’s correction in terms of the degrees of freedom of the residuals vector (residuals, not errors, because the population mean is unknown): \((x_1 - \bar{x}, \dots, x_n - \bar{x})\), where \(\bar{x}\) is the sample mean. While there are \(n\) independent observations in the sample, there are only \(n - 1\) independent residuals, as they sum to 0.

We can see that \(\sigma^2(x) = \frac{n - 1}{n} s^2(x)\), so as the data set gets larger, the sample variance and the population variance become less and less distinguishable, which intuitively makes sense.

Note that variance can be decomposed to show that it is also the difference of the mean of squares to the mean squared:

\[\operatorname{Var}(x) = \frac{1}{n}\sum_{i = 1}^{n}x_i^2 - \overline{x}^2\]
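
As a quick sanity check, here is a minimal sketch (using an arbitrary illustrative sample) that verifies both the relationship between the population and sample variances and the decomposition above with NumPy:

import numpy as np

# An arbitrary illustrative sample
x = np.array([2.0, 4.0, 4.0, 5.0, 7.0])
n = len(x)

pop_var = np.var(x, ddof=0)     # population variance (divides by n)
sample_var = np.var(x, ddof=1)  # sample variance (divides by n - 1)

# Population variance equals (n - 1)/n times the sample variance
print(np.isclose(pop_var, (n - 1) / n * sample_var))        # True
# Variance equals the mean of the squares minus the square of the mean
print(np.isclose(pop_var, np.mean(x**2) - np.mean(x)**2))   # True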

Standard deviation#

Variance does not have an intuitive scale relative to the data being studied, because it is based on a squared distance metric. We can therefore take its square root to obtain a measure of ‘deviance’ on the same scale as the data. We call this the standard deviation (SD) or \(\sigma(x)\), where \(\operatorname{Var}(x) = \sigma^2(x)\).

The standard deviation summarizes the variation among the values and provides an estimate of the spread of the distribution:

\[\sigma(x) = \sqrt{\frac{1}{n}\sum_{i = 1}^{n}(x_i - \overline{x})^2}\]

As with variance, standard deviation has both population (\(\sigma\)) and sample (\(s\)) versions; most statistical software calculates the sample version by default. Conversion between the two takes the form

\[\sigma(x) = \sqrt{\frac{n-1}{n}}s(x)\]

Standard error#

The standard error (SE), also known as the standard error of the mean (SEM) in the context of a mean estimate, or the standard deviation of the mean (\(\sigma_{\overline{x}}\)), is a measure of how far our sample mean is likely to be from the true population mean.

Let’s derive the formula for the standard error of the mean of \( x = x_1, x_2, \dots, x_n \), assuming the observations are independent and identically distributed (so that the variance of their sum is the sum of their variances):

\[\begin{split} \begin{aligned} \operatorname{Var}(\overline{x}) &= \operatorname{Var}(\frac{1}{n}\sum_{i = 1}^{n}x_i) \\ &= \sum_{i = 1}^{n}\operatorname{Var}(\frac{1}{n}x_i) \\ &= \frac{1}{n^2}\sum_{i = 1}^{n}\operatorname{Var}(x_i) \\ &= \frac{1}{n^{\cancel{2}}} \cancel{n} \operatorname{Var}(x) \\ &= \frac{1}{n}\operatorname{Var}(x) \end{aligned} \end{split}\]

The standard error of the mean is the standard deviation of the mean, which is the square root of the variance of the mean:

\[\begin{split} \begin{aligned} \sigma_{\overline{x}} &= \sqrt{\operatorname{Var}(\overline{x})} \\ &= \sqrt{\frac{\operatorname{Var}(x)}{n}} \\ &= \sqrt{\frac{\sigma^2(x)}{n}} \\ &= \frac{\sigma(x)}{\sqrt{n}} \end{aligned} \end{split}\]

The population standard deviation \(\sigma\) is typically unknown in real-world situations. Consequently, the standard error of the mean is usually estimated by substituting the sample standard deviation (\(s\)) for \(\sigma\) in the formula:

\[\sigma_{\overline{x}} \approx \frac{s}{\sqrt{n}}\]

The SE gets smaller as the samples get larger. This implies that as we collect more data, the estimate of the mean gets more precise, which makes sense intuitively. The confidence interval, derived from the SEM, is a range of values that is likely to contain the true value of a parameter (like the mean); its half-width (margin of error) is \(W = z^\ast \times \sigma_{\overline{x}}\), as illustrated in the next chapter.
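
To make this concrete, here is a small simulation sketch (with an arbitrary normal population, chosen only for illustration) showing that the standard deviation of many sample means is close to \(\sigma / \sqrt{n}\):

import numpy as np

rng = np.random.default_rng(42)
sigma, n = 10.0, 25  # illustrative population SD and sample size

# Draw 10,000 samples of size n and compute each sample's mean
sample_means = rng.normal(loc=50.0, scale=sigma, size=(10_000, n)).mean(axis=1)

print(sample_means.std(ddof=0))  # empirical SD of the sample means
print(sigma / np.sqrt(n))        # theoretical SEM = 2.0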

Practical examples using NumPy statistical functions#

Let’s calculate the mean, standard deviation, and other statistics for a given dataset using numpy.

import numpy as np
from scipy import stats

# The dataset
data = np.array([64.7, 65.4, 88.3, 64, 71.9])

# Calculate the mean
mean = np.mean(data)
print(f"The mean of the dataset is {mean:.2f}")

# Calculate the variance
variance = np.var(data, ddof=1) # ddof=0 by default
print(f"The sample (ddof=1) variance of the dataset is {variance:.3f}")

# Calculate the standard error of the mean
sem = np.std(data, ddof=1) / np.sqrt(len(data))
print(f"The sample (ddof=1) standard error of the mean is {sem:.3f}")

# Calculate the 95% confidence interval (t-based margin of error around the mean)
margin = stats.sem(data) * stats.t.ppf((1 + 0.95) / 2., len(data) - 1)
lower_bound = np.mean(data) - margin
upper_bound = np.mean(data) + margin
print(f"The corresponding 95% confidence interval is ({lower_bound:.3f}, {upper_bound:.3f})")

# Calculate the standard deviation with ddof=0
std_dev = np.std(data, ddof=0)
print(f"The population standard deviation (ddof=0) is {std_dev:.3f}")
The mean of the dataset is 70.86
The sample (ddof=1) variance of the dataset is 105.013
The sample (ddof=1) standard error of the mean is 4.583
The corresponding 95% confidence interval is (58.136, 83.584)
The population standard deviation (ddof=0) is 9.166

The ddof parameter stands for Delta Degrees of Freedom. When calculating the standard deviation, the sum of squared deviations from the mean is usually divided by [n - ddof], where n is the number of elements in the dataset.

Setting ddof=0 means that we divide simply by n, which gives us the population standard deviation. This assumes that the dataset represents the entire population.

On the other hand, if we set ddof=1, we would be dividing by [n - 1], giving the sample standard deviation. This is used when our data is a sample from a larger population.

In most cases, if we’re working with a sample, we would use ddof=1 to get an unbiased estimate of the actual population standard deviation. However, if our data represents the entire population, we would use ddof=0.
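
The two conventions differ only by the factor \(\sqrt{(n-1)/n}\), which we can confirm directly on the data array defined above:

n = len(data)
# Population SD equals sqrt((n - 1)/n) times the sample SD
print(np.isclose(np.std(data, ddof=0),
                 np.sqrt((n - 1) / n) * np.std(data, ddof=1)))  # True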

The 3-sigma rule#

The 3-sigma (3-σ) rule, also known as the empirical rule or 68-95-99.7 rule, is a statistical rule which states that for a normal distribution:

  1. Approximately 68% of the data falls within one standard deviation of the mean.

  2. Approximately 95% of the data falls within two standard deviations of the mean.

  3. Approximately 99.7% of the data falls within three standard deviations of the mean.

This rule is a quick way to get an understanding of where most of the values in the dataset are likely to fall, assuming the data follows a normal distribution.

Here’s how it relates to the standard deviation:

  • The standard deviation is a measure of the amount of variation or dispersion in a set of values. A low standard deviation means that the values tend to be close to the mean, while a high standard deviation means that the values are spread out over a wider range.

  • When we talk about the “3-σ” or “three standard deviations from the mean”, we’re talking about the range of values that covers about 99.7% of the data (assuming a normal distribution). This means that the probability of a value being outside of this range is very low (0.3%).

So, the 3-sigma rule gives us a way to understand the standard deviation in terms of the percentage of data that falls within these ranges. It’s a useful rule of thumb for understanding the distribution of the data when working with statistics.
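
We can recover these percentages directly from the standard normal distribution; here is a short sketch using SciPy:

from scipy import stats

# Proportion of a normal distribution within k standard deviations of the mean
for k in (1, 2, 3):
    coverage = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(f"Within {k} SD of the mean: {coverage:.1%}")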

Geometric mean and standard deviation#

The geometric mean is a type of average that is calculated by multiplying all the values together and then taking the nth root of the product. It is particularly useful when comparing quantities on very different scales, or when dealing with values that grow multiplicatively (exponentially).

The geometric standard deviation is a measure of spread used when the geometric mean is the preferred average. It describes the multiplicative factor by which the values typically deviate from the geometric mean, and it is likewise used for values that grow multiplicatively.

If a dataset follows a log-normal distribution, the logarithm of the values will have a normal distribution. In a log-normal distribution, the multiplicative spread of values (i.e., the ratio between different data points) is more important than the absolute difference between them. This is where the geometric mean and geometric standard deviation come into play.

For a log-normal distribution, about 2/3 of the values fall between the geometric mean divided by the geometric standard deviation and the geometric mean multiplied by the geometric standard deviation. This is analogous to the rule for a normal distribution, where about 2/3 of the values fall within one standard deviation of the mean.

When we take the logarithm of a log-normally distributed dataset, it becomes normally distributed (this is the “log converts multiplicative scatter to additive” part). The geometric mean corresponds to the arithmetic mean of the logged data, and the geometric standard deviation corresponds to the standard deviation of the logged data. So, the rule that 2/3 of the data falls within one standard deviation of the mean in a normal distribution translates to the rule we mentioned for a log-normal distribution.
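
The following sketch simulates a log-normal sample (with arbitrary parameters, chosen only for illustration) and checks that roughly two-thirds of the values fall between the geometric mean divided by the geometric SD and the geometric mean multiplied by the geometric SD:

import numpy as np

rng = np.random.default_rng(111)
sample = rng.lognormal(mean=1.0, sigma=0.5, size=100_000)  # illustrative parameters

gmean = np.exp(np.mean(np.log(sample)))
gstd = np.exp(np.std(np.log(sample), ddof=1))

# Fraction of values within one geometric SD of the geometric mean (~0.68)
within = np.mean((sample >= gmean / gstd) & (sample <= gmean * gstd))
print(f"Fraction within one geometric SD: {within:.2f}")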

This property is very useful in many fields, including finance, biology, and engineering, where variables often exhibit exponential growth and one is interested in rates or multiplicative factors rather than raw differences. For example in biology, we can use geometric mean and standard deviation for potency of drug (EC50, IC50, Km, Ki etc.), blood serum concentrations of many natural or toxic compounds, etc.

While NumPy provides a lot of useful statistical functions, it does not directly provide a function to calculate the geometric mean or the geometric standard deviation. However, we can calculate these using NumPy’s other functions.

# Calculate the geometric mean
gmean = np.exp(np.mean(np.log(data)))
print(f"The geometric mean is {gmean:.3f}")

# Calculate the geometric standard deviation
gstd = np.exp(np.std(np.log(data), ddof=1))
print(f"The geometric standard deviation is {gstd:.4f}")
The geometric mean is 70.319
The geometric standard deviation is 1.1451

Weighted statistics#

Weighted statistics are a type of statistics where some data points contribute more than others. The weight of a data point refers to the relative importance or frequency of that point in the dataset.

For example, if we’re calculating the average test score in a class, but some tests count for more of the final grade than others, we would use a weighted mean, where each test score is multiplied by the weight of that test before summing and dividing by the total weight.

# The dataset
data = np.array([64.7, 65.4, 88.3, 64, 71.9])

# Weights for each data point
weights = np.array([0.1, 0.2, 0.3, 0.1, 0.3])

# Calculate the weighted mean
weighted_mean = np.average(data, weights=weights)
print(f"The weighted mean is {weighted_mean}")
The weighted mean is 74.01
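
For reference, np.average with weights is equivalent to dividing the weighted sum by the total weight, which we can verify with the arrays defined above:

# Manual weighted mean: weighted sum divided by the total weight
manual_weighted_mean = np.sum(data * weights) / np.sum(weights)
print(np.isclose(manual_weighted_mean, weighted_mean))  # True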

Statistical analysis with other packages#

SciPy#

The scipy.stats module contains a large number of probability distributions as well as a growing library of statistical functions:

  • Interquartile range (IQR) is a measure of statistical dispersion, being equal to the difference between the upper and lower quartiles. IQR is a measure of variability, based on dividing a data set into quartiles.

  • Descriptive summary provides several descriptive statistics of the data. This includes measures such as the number of elements in the data set, the minimum and maximum value, the mean, variance, skewness, and kurtosis.

  • Z-score is a statistical measurement that describes a value’s relationship to the mean of a group of values. It is measured in terms of standard deviations from the mean.

  • Coefficient of variation (CV) is a statistical measure of the relative dispersion of data points in a data series around the mean, with \(\operatorname{CV} = \sigma / \mu\). For example, if CV = 0.25, we know that the SD is 25% of the mean.

  • Geometric mean is a kind of average of a set of numbers that is different from the arithmetic average. The geometric mean is calculated by multiplying all the numbers together and then taking the nth root of the product.

  • Geometric standard deviation is a measure of spread used when the geometric mean is the preferred average; it describes the factor by which the values typically deviate from the geometric mean.

Here’s how we can use it to calculate various statistical measures:

# Interquartile range (IQR)
iqr = stats.iqr(data)
print(f"The interquartile range (IQR) is {iqr:.2f}")

# Standard Error of the Mean (SEM)
sem = stats.sem(data)
print(f"The standard error of the mean (SEM) is {sem:.3f}")

# Descriptive summary
summary = stats.describe(data)
print(f"The descriptive summary is:\n{summary}")

# Z-score
z_score = stats.zscore(data)
print(f"The z-scores are {z_score}")

# Coefficient of Variation (CV)
cv = stats.variation(data)
print(f"The coefficient of variation (CV) is {cv:.4f}")

# Geometric mean
gmean = stats.gmean(data)
print(f"The geometric mean is {gmean:.2f}")

# Geometric standard deviation
gstd = stats.gstd(data)
print(f"The geometric standard deviation is {gstd:.3f}")
The interquartile range (IQR) is 7.20
The standard error of the mean (SEM) is 4.583
The descriptive summary is:
DescribeResult(nobs=5, minmax=(64.0, 88.3), mean=70.86000000000001, variance=105.01299999999995, skewness=1.1912013156755001, kurtosis=-0.24972334599204293)
The z-scores are [-0.6720695  -0.59569796  1.90274222 -0.74844103  0.11346628]
The coefficient of variation (CV) is 0.1293
The geometric mean is 70.32
The geometric standard deviation is 1.145

Pandas#

We can perform similar statistical calculations using the Pandas library, which provides a DataFrame structure to hold and manipulate data. These methods are very similar to the ones provided by NumPy and SciPy, but they are methods of the DataFrame or Series object instead of standalone functions.

import pandas as pd

df = pd.DataFrame(data, columns=['Data'])

# Calculate the mean
mean = df['Data'].mean()
print(f"The mean of the dataset is {mean:.2f}")

# Calculate the standard deviation
std_dev = df['Data'].std(ddof=1)
print(f"The standard deviation of the dataset is {std_dev:.3f}")

# Calculate the Standard Error of the Mean (SEM)
sem = df['Data'].sem()
print(f"The standard error of the mean (SEM) is {sem:.3f}")

# Calculate the Interquartile Range (IQR)
iqr = df['Data'].quantile(0.75) - df['Data'].quantile(0.25)
print(f"The interquartile range (IQR) is {iqr:.2f}")

# Descriptive summary
summary = df['Data'].describe()
print(f"The descriptive summary is:\n{summary}")

# Z-score
df['Z_score'] = (df['Data'] - df['Data'].mean()) / df['Data'].std(ddof=1)
print(f"The z-scores are:\n{df['Z_score']}")

# Coefficient of Variation (CV)
cv = df['Data'].std(ddof=1) / df['Data'].mean()
print(f"The coefficient of variation (CV) is {cv:.4f}")
The mean of the dataset is 70.86
The standard deviation of the dataset is 10.248
The standard error of the mean (SEM) is 4.583
The interquartile range (IQR) is 7.20
The descriptive summary is:
count     5.000000
mean     70.860000
std      10.247585
min      64.000000
25%      64.700000
50%      65.400000
75%      71.900000
max      88.300000
Name: Data, dtype: float64
The z-scores are:
0   -0.601117
1   -0.532808
2    1.701864
3   -0.669426
4    0.101487
Name: Z_score, dtype: float64
The coefficient of variation (CV) is 0.1446

Statsmodels#

DescrStatsW is a class in the Statsmodels library in Python that provides descriptive statistics and statistical tests for weighted data (case weights). It assumes that the data is 1D or 2D with observations in rows and variables in columns, and that the same weight applies to each column.

from statsmodels.stats.weightstats import DescrStatsW

# Create a DescrStatsW object
d1 = DescrStatsW(data, weights=weights, ddof=0)

# Calculate the weighted mean
mean = d1.mean
print(f"The weighted mean is {mean}")

# Calculate the variance
variance = d1.var
print(f"The variance is {variance:.4f}")

# Calculate the standard deviation
std = d1.std
print(f"The standard deviation is {std:.4f}")
The weighted mean is 74.01
The variance is 96.1109
The standard deviation is 9.8036

Interlude: understanding key statistical formulas#

In this section, we will explore some fundamental statistical formulas that are crucial in data analysis. These include the mean of a sum, the variance of a linear combination, and the standard error of the difference between means.

Mean of a sum#

The mean, or average, is a measure of central tendency. If we have two random variables, \(X\) and \(Y\), the mean of their sum is simply the sum of their means. Mathematically, this is expressed as \(E(X+Y)=E(X)+E(Y)\). We demonstrate this relation as follows:

Let \(x = x_1, x_2, \dots, x_n\) and \(y = y_1, y_2, \dots, y_m\) be samples of two random variables of length \(n\) and \(m\) respectively. If \(m = n\) and \(x + y\) is formed from the element-wise sum of \(x\) and \(y\), it follows directly that the mean of \(x + y\) is equal to the sum of the mean of \(x\) and the mean of \(y\):

\[\frac{1}{n}\sum_{i=1}^n x_i + \frac{1}{n}\sum_{i=1}^n y_i = \frac{1}{n}\sum_{i=1}^n (x_i + y_i)\]
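
A quick numerical check of this identity, using two arbitrary samples of equal length:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([10.0, 20.0, 30.0, 40.0])

# Mean of the element-wise sum equals the sum of the means
print(np.isclose(np.mean(x + y), np.mean(x) + np.mean(y)))  # True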

Variance of a linear combination#

The variance measures how spread out a set of data is. If we have a linear combination of two random variables, \(X\) and \(Y\), with constants \(a\) and \(b\), the variance is given by \(\mathrm{Var}(aX+bY) = a^2\mathrm{Var}(X) + b^2\mathrm{Var}(Y) + 2ab\mathrm{Cov}(X,Y)\).

We can first demonstrate that if all values are scaled by a constant, the variance is scaled by the square of that constant \(\operatorname{Var}(aX) = a^2 \operatorname{Var}(X)\):

\[\begin{split} \begin{aligned} \operatorname{Var}(aX) &= \frac{1}{n}\sum_{i = 1}^{n}(ax_i - \overline{ax})^2 \\ &= \frac{1}{n}\sum_{i = 1}^{n}(ax_i - a\overline{x})^2 \\ &= \frac{1}{n}\sum_{i = 1}^{n}a^2(x_i - \overline{x})^2 \\ &= a^2\left(\frac{1}{n}\sum_{i = 1}^{n}(x_i - \overline{x})^2\right) \\ &= a^2\operatorname{Var}(x) \end{aligned} \end{split}\]

The variance of a sum or difference of two random variables is given by \( \operatorname{Var}(aX+bY) = a^2 \operatorname{Var}(X) + b^2 \operatorname{Var}(Y) + 2ab \operatorname{Cov}(X,Y) \) and \( \operatorname{Var}(aX-bY) = a^2 \operatorname{Var}(X) + b^2 \operatorname{Var}(Y) - 2ab \operatorname{Cov}(X,Y) \). In the case of independent variables, the covariance term drops out, and these simplify to:

\[\operatorname{Var}(X \pm Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)\]
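
We can also check the general formula (including the covariance term) numerically, using the population (ddof=0) definitions of variance and covariance, with arbitrary values for a and b and two small paired samples:

import numpy as np

a, b = 2.0, -3.0
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

lhs = np.var(a * x + b * y, ddof=0)
cov_xy = np.cov(x, y, ddof=0)[0, 1]
rhs = a**2 * np.var(x, ddof=0) + b**2 * np.var(y, ddof=0) + 2 * a * b * cov_xy

print(np.isclose(lhs, rhs))  # True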

Standard error of the difference between means#

If we have two independent random variables, \(X\) and \(Y\), the variance of the difference between their means is given by \(\operatorname{Var}(\overline{X} - \overline{Y}) = \operatorname{Var}(\overline{X}) + \operatorname{Var}(\overline{Y})\), as demonstrated above.

Since each \(x_i\) is independent and identically distributed, the standard error of the mean is the standard deviation of the mean, which is the square root of the variance of the mean, as we saw before: \(\sigma_{\overline{x}} = \sqrt{\frac{\sigma^2(x)}{n}}\)

Finally, the standard error of the difference between the means of \(x\) and \(y\) (see chapter 30 - Comparing means) can be derived as

\[\sigma_{\overline{x} - \overline{y}} = \sqrt{\frac{\sigma^2(x)}{n} + \frac{\sigma^2(y)}{m}}\]

In essence, this formula reflects the fact that the uncertainty in the difference between two sample means is a combination of the uncertainties in each individual mean.
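
Here is a small simulation sketch (with arbitrary population parameters, chosen only for illustration) confirming that the empirical spread of the difference between two sample means matches this formula:

import numpy as np

rng = np.random.default_rng(7)
sigma_x, sigma_y, n, m = 8.0, 5.0, 30, 50  # illustrative values

# Draw many pairs of samples and compute the difference of their means
x_means = rng.normal(100.0, sigma_x, size=(20_000, n)).mean(axis=1)
y_means = rng.normal(90.0, sigma_y, size=(20_000, m)).mean(axis=1)

print(np.std(x_means - y_means, ddof=0))         # empirical SE of the difference
print(np.sqrt(sigma_x**2 / n + sigma_y**2 / m))  # theoretical value, about 1.62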

Conclusion#

In this chapter, we delved into the world of descriptive statistics and explored various measures of central tendency and dispersion. We started with the concept of scatter and how it is quantified using measures like range, interquartile range, variance, and standard deviation. We then discussed the importance of these measures in understanding the spread and variability of our data.

We learned how to calculate these measures using Python and its powerful libraries like NumPy, SciPy and Pandas. We also touched upon the concept of weighted statistics, where some data points contribute more than others based on their relative importance or frequency, and introduced the Statsmodels library and its DescrStatsW class for weighted descriptive statistics, demonstrating how to use it to calculate various statistical measures.

Finally, we took a mathematical interlude to understand key statistical formulas, including the mean of a sum, the variance of a linear combination, and the standard error of the difference between means. These formulas provide the mathematical foundation for many statistical tests and measures.

Understanding these concepts and being able to calculate and interpret these measures are fundamental skills in data analysis and statistics. They allow us to summarize and understand our data, and form the basis for many advanced statistical tests and models.

Cheat sheet#

This cheat sheet provides a quick reference for essential code snippets used in this chapter.

Statistics with NumPy#

import numpy as np

# Defining the sample dataset
data = np.array([64.7, 65.4, 88.3, 64, 71.9])

# Calculate basic statistics
mean = np.mean(data)
var = np.var(data, ddof=1) # ddof=0 by default
std = np.std(data, ddof=1)
sem = std / np.sqrt(len(data))

# Geometric mean and std
gmean = np.exp(np.mean(np.log(data)))
gstd = np.exp(np.std(np.log(data), ddof=1))

# Weighted statistics
# Weights for each corresponding point in data
weights = np.array([0.1, 0.2, 0.3, 0.1, 0.3])

weighted_mean = np.average(data, weights=weights)

Statistics with SciPy#

from scipy import stats

# Basic statistics
summary = stats.describe(data) # descriptive summary
iqr = stats.iqr(data)
sem = stats.sem(data)
zscore = stats.zscore(data)
cv = stats.variation(data)

# Geometric mean and std
gmean = stats.gmean(data)
gstd = stats.gstd(data)

Statistics with Pandas#

import pandas as pd

# Define the dataframe
df = pd.DataFrame(data, columns=['values'])

# Basic statistics
summary = df['values'].describe() # descriptive summary
mean = df['values'].mean()
std = df['values'].std(ddof=1)
sem = df['values'].sem()
iqr = df['values'].quantile(0.75) - df['values'].quantile(0.25)
df['zscore'] = (df['values'] - df['values'].mean()) / df['values'].std(ddof=1)
cv = df['values'].std(ddof=1) / df['values'].mean()

Statistics with Statsmodels#

from statsmodels.stats.weightstats import DescrStatsW

# Create a DescrStatsW object
d = DescrStatsW(
    data,
    weights=weights, # 'None' if no weights
    ddof=1)

# Basic statistics
mean = d.mean
variance = d.var
std = d.std

Session information#

The output below details all packages and versions necessary to reproduce the results in this report.

!python --version
print("-------------")

from importlib.metadata import version

# List of packages we want to check the version
packages = ['numpy', 'pandas', 'scipy', 'statsmodels']

# Initialize an empty list to store the versions
versions = []

# Loop over the packages
for package in packages:
    try:
        # Get the version of the package
        package_version = version(package)
        # Append the version to the list
        versions.append(package_version)
    except Exception:  # Use a more general exception for broader compatibility
        versions.append('Not installed')

# Print the versions
for package, package_version in zip(packages, versions):
    print(f'{package}: {package_version}')
Python 3.12.7
-------------
numpy: 1.26.4
pandas: 2.2.2
scipy: 1.14.1
statsmodels: 0.14.2