The normal distribution, also known as the Gaussian distribution, is one of the most widely used probability distributions in statistics and machine learning. Understanding its core properties, mean and variance, is important for interpreting data and modelling real-world phenomena. In this article, we will dig into the concepts of mean and variance as they relate to the normal distribution, exploring their significance and how they define the shape and behaviour of this ubiquitous probability distribution.
What is a Normal Distribution?
A normal distribution is a continuous probability distribution characterized by its bell-shaped curve, symmetric around its mean (μ). The equation defining its probability density function (PDF) is:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
Where:
- μ: the mean (center of the distribution),
- σ²: the variance (spread of the distribution),
- σ: the standard deviation (square root of variance).
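To make the density concrete, here is a minimal sketch that evaluates the standard normal PDF formula, 1/(σ√(2π)) · exp(−(x−μ)²/(2σ²)), using only Python's math module. The function name normal_pdf is our own choice for illustration, not a library function.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)

# Standard normal (mu = 0, sigma = 1) evaluated at its mean
peak = normal_pdf(0.0, 0.0, 1.0)
print(f"Density at the mean: {peak:.4f}")  # 0.3989

# Symmetry: points equally far from mu have the same density
left = normal_pdf(-1.5, 0.0, 1.0)
right = normal_pdf(1.5, 0.0, 1.0)
print(f"Densities at -1.5 and 1.5: {left:.4f}, {right:.4f}")
```

The density is highest at x = μ and takes equal values at points mirrored around μ, which is what the bell shape expresses.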
Mean of the Normal Distribution
The mean (μ) is the central value of the distribution. It indicates the location of the peak and acts as a balance point where the distribution is symmetric.
Key points about the mean:
- The distribution is symmetric about μ, so the mean, median, and mode all coincide at μ.
- In real-world data, μ often represents the “average” of a dataset.
- For a normal distribution, about 68% of the data lies within one standard deviation (μ±σ).
Example: If a dataset of heights has a normal distribution with μ=170 cm, the average height is 170 cm, and the distribution is symmetric around this value.
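The 68% figure can be checked numerically. The sketch below relies on the known identity that, for any normal distribution, the probability of falling within k standard deviations of the mean equals erf(k/√2). The helper name fraction_within is an assumption made for this example.

```python
import math

def fraction_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3):
    print(f"Within {k} standard deviation(s): {fraction_within(k):.4f}")
# Within 1 standard deviation(s): 0.6827
# Within 2 standard deviation(s): 0.9545
# Within 3 standard deviation(s): 0.9973
```

The result does not depend on μ or σ, which is why the same percentages apply to every normal distribution.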
Variance of the Normal Distribution
The variance (σ²) quantifies the spread of data around the mean. A smaller variance indicates that the data points are closely clustered around μ, while a larger variance suggests a wider spread.
Key points about variance:
- Variance is the average squared deviation from the mean, $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$, where $x_i$ are the individual data points and $N$ is the number of points.
- The standard deviation (σ) is the square root of the variance, making it easier to interpret in the same units as the data.
- Variance controls the “width” of the bell curve. For higher variance:
- The curve becomes flatter and wider.
- Data is more dispersed.
Example: If the heights dataset has σ²=25, the standard deviation (σ) is 5, meaning about 68% of heights fall within 170±5 cm.
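As a rough illustration of how variance changes clustering, the sketch below uses statistics.NormalDist (available in Python 3.8+) to compute the share of heights between 165 cm and 175 cm. The second variance of 100 is a hypothetical value added only for comparison; it does not come from the example above.

```python
import math
from statistics import NormalDist

mu = 170.0  # mean height in cm, as in the example

shares = {}
for variance in (25.0, 100.0):  # 100 is a hypothetical wider spread for comparison
    sigma = math.sqrt(variance)
    heights = NormalDist(mu=mu, sigma=sigma)
    # Probability of a height between 165 cm and 175 cm
    shares[variance] = heights.cdf(175.0) - heights.cdf(165.0)
    print(f"variance = {variance}: sigma = {sigma} cm, "
          f"share in 165-175 cm = {shares[variance]:.4f}")
```

With the larger variance, a noticeably smaller share of heights falls inside the same fixed interval, which is what "more dispersed" means in practice.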
Relationship Between Mean and Variance
- Independent properties: Mean and variance independently influence the shape of the normal distribution. Adjusting μ shifts the curve left or right, while adjusting σ² changes the spread.
- Data insights: Together, these parameters define the overall structure of the distribution and are critical for predictive modelling, hypothesis testing, and decision-making.
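The same separation between location and spread shows up in data. In the sketch below, which uses a small illustrative dataset, adding a constant to every point moves the mean but leaves the variance untouched, while multiplying every point by a factor c multiplies the variance by c² (and the mean by c).

```python
import statistics

data = [4, 8, 6, 5, 9]  # small illustrative dataset

variants = {
    "original": data,
    "shifted": [x + 10 for x in data],  # add a constant to every point
    "scaled": [2 * x for x in data],    # multiply every point by 2
}

summary = {}
for name, values in variants.items():
    # pvariance is the population variance (divides by n)
    summary[name] = (statistics.mean(values), statistics.pvariance(values))
    print(f"{name:>8}: mean = {summary[name][0]}, variance = {summary[name][1]}")
```

The shifted data keeps the original variance, and the scaled data has four times the original variance, since 2² = 4.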
Practical Applications
Here are the practical applications:
- Data Analysis: Many natural phenomena (e.g., heights, test scores) approximately follow a normal distribution, allowing for straightforward analysis using μ and σ².
- Machine Learning: In algorithms like Gaussian Naive Bayes, each feature is modeled per class with a normal distribution whose mean and variance are estimated from the training data.
- Standardization: Transforming data to z-scores, z = (x − μ)/σ, gives it μ=0 and σ²=1, which puts variables measured on different scales on a common footing for comparison.
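Standardization is simple to sketch in code. The example below computes z-scores for a small illustrative dataset with the statistics module, using the population standard deviation (pstdev), and then checks that the result has a mean of roughly 0 and a variance of roughly 1.

```python
import statistics

data = [4, 8, 6, 5, 9]  # small illustrative dataset

mu = statistics.mean(data)
sigma = statistics.pstdev(data)  # population standard deviation (divides by n)

# z-score: how many standard deviations each point lies from the mean
z_scores = [(x - mu) / sigma for x in data]

z_mean = statistics.mean(z_scores)
z_variance = statistics.pvariance(z_scores)

print("z-scores:", [round(z, 3) for z in z_scores])
print("mean of z-scores:", round(z_mean, 10))
print("variance of z-scores:", round(z_variance, 10))
```

Small floating-point rounding means the mean and variance come out extremely close to 0 and 1 rather than exactly equal to them.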
Visualizing the Impact of Mean and Variance
- Changing the Mean: The peak of the distribution shifts horizontally.
- Changing the Variance: The curve widens or narrows. A smaller σ² results in a taller peak, while a larger σ² flattens the curve.
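The link between variance and peak height can be made exact: the normal PDF reaches its maximum at x = μ, where its value is 1/√(2πσ²). The short sketch below evaluates that expression for a few variances; the helper name peak_height is our own.

```python
import math

def peak_height(variance):
    """Maximum of the normal PDF, reached at x = mu: 1 / sqrt(2 * pi * variance)."""
    return 1.0 / math.sqrt(2.0 * math.pi * variance)

for variance in (1, 4, 9):
    print(f"variance = {variance}: peak height = {peak_height(variance):.4f}")
# variance = 1: peak height = 0.3989
# variance = 4: peak height = 0.1995
# variance = 9: peak height = 0.1330
```

Because the total area under the curve must stay equal to 1, a wider curve has to be lower, and the peak height does not depend on μ at all.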
Implementation in Python
Now let’s see how to calculate the mean and variance, and how to visualize their impact on the distribution, using Python:
1. Calculate the Mean
The mean is calculated by summing all data points and dividing by the number of points. Here’s how to do it step-by-step in Python:
Step 1: Define the dataset
data = [4, 8, 6, 5, 9]
Step 2: Calculate the sum of the data
total_sum = sum(data)
Step 3: Count the number of data points
n = len(data)
Step 4: Compute the mean
mean = total_sum / n
print(f"Mean: {mean}")
Mean: 6.4
Alternatively, the built-in mean function in the statistics module calculates the mean directly:
import statistics
# Define the dataset
data = [4, 8, 6, 5, 9]
# Calculate the mean using the built-in function
mean = statistics.mean(data)
print(f"Mean: {mean}")
Mean: 6.4
2. Calculate the Variance
The variance measures the spread of data around the mean. Follow these steps:
Step 1: Calculate deviations from the mean
deviations = [(x - mean) for x in data]
Step 2: Square each deviation
squared_deviations = [dev**2 for dev in deviations]
Step 3: Sum the squared deviations
sum_squared_deviations = sum(squared_deviations)
Step 4: Compute the variance
variance = sum_squared_deviations / n
print(f"Variance: {variance}")
Variance: 3.44
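We can also use a built-in function from the statistics module. Because the manual calculation above divides by n, the matching function is pvariance (population variance); statistics.variance divides by n − 1 and returns the sample variance instead.
import statistics
# Define the dataset
data = [4, 8, 6, 5, 9]
# Calculate the population variance using the built-in function
variance = statistics.pvariance(data)
print(f"Variance: {variance}")
Variance: 3.44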
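Python's statistics module offers two variance functions that differ only in the divisor, and it is easy to mix them up. The sketch below compares them on the same dataset: pvariance divides the sum of squared deviations by n (population variance), while variance divides by n − 1 (sample variance).

```python
import statistics

data = [4, 8, 6, 5, 9]

population_variance = statistics.pvariance(data)  # divides by n
sample_variance = statistics.variance(data)       # divides by n - 1

print(f"Population variance: {population_variance}")  # 3.44
print(f"Sample variance: {sample_variance}")          # 4.3
```

Use the population version when the data is the entire population of interest, and the sample version when the data is a sample used to estimate the variance of a larger population.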
3. Visualize the Impact of Mean and Variance
Now, let’s visualize how changing the mean and variance affects the shape of a normal distribution:
Code:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
Step 1: Define a range of x values
x = np.linspace(-10, 20, 1000)
Step 2: Define distributions with different means (mu) but same variance
means = [0, 5, 10] # Different means
constant_variance = 4
constant_std_dev = np.sqrt(constant_variance)
Step 3: Define distributions with the same mean but different variances
constant_mean = 5
variances = [1, 4, 9] # Different variances
std_devs = [np.sqrt(var) for var in variances]
Step 4: Plot distributions with varying means
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
for mu in means:
    y = norm.pdf(x, mu, constant_std_dev) # Normal PDF
    plt.plot(x, y, label=f"Mean = {mu}, Variance = {constant_variance}")
plt.title("Impact of Changing the Mean (Constant Variance)", fontsize=14)
plt.xlabel("x")
plt.ylabel("Probability Density")
plt.legend()
plt.grid()
Step 5: Plot distributions with varying variances
plt.subplot(1, 2, 2)
for var, std in zip(variances, std_devs):
    y = norm.pdf(x, constant_mean, std) # Normal PDF
    plt.plot(x, y, label=f"Mean = {constant_mean}, Variance = {var}")
plt.title("Impact of Changing the Variance (Constant Mean)", fontsize=14)
plt.xlabel("x")
plt.ylabel("Probability Density")
plt.legend()
plt.grid()
plt.tight_layout()
plt.show()
Inference from the graph
Impact of Changing the Mean:
- The mean (μ) determines the central location of the distribution.
- Observation: As the mean changes:
- The entire curve shifts horizontally along the x-axis.
- The overall shape (spread and height) remains unchanged because the variance is constant.
- Conclusion: The mean affects where the distribution is centered but does not impact the spread or width of the curve.
Impact of Changing the Variance:
- The variance (σ²) determines the spread or dispersion of the data.
- Observation: As the variance changes:
- A larger variance creates a wider and flatter curve, indicating more spread-out data.
- A smaller variance creates a narrower and taller curve, indicating less spread and more concentration around the mean.
- Conclusion: Variance affects how much the data is spread around the mean, influencing the width and height of the curve.
Key points:
- The mean (μ) determines the centre of the normal distribution.
- The variance (σ²) determines its spread.
- Together, they provide a complete description of the normal distribution’s shape, allowing for precise data modeling.
Common Mistakes When Interpreting Mean and Variance
- Misinterpreting Variance: Higher variance doesn’t always indicate worse data; it may reflect natural diversity in the dataset.
- Ignoring Outliers: Outliers can distort the mean and inflate the variance.
- Assuming Normality: Not all datasets are normally distributed, and applying mean/variance-based models to non-normal data can lead to errors.
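The effect of an outlier is easy to see on a small dataset. In the sketch below, the value 100 is a hypothetical extreme point added purely for illustration; the median is printed alongside the mean as a more robust reference.

```python
import statistics

clean = [4, 8, 6, 5, 9]
with_outlier = clean + [100]  # hypothetical extreme value added for illustration

for name, values in [("clean", clean), ("with outlier", with_outlier)]:
    print(
        f"{name:>12}: mean = {statistics.mean(values):.2f}, "
        f"median = {statistics.median(values):.2f}, "
        f"variance = {statistics.pvariance(values):.2f}"
    )
```

A single extreme value pulls the mean far from the bulk of the data and inflates the variance by orders of magnitude, while the median barely moves. That gap between mean and median is a useful warning sign before fitting a normal model.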
Conclusion
The mean (μ) determines the centre of the normal distribution, while the variance (σ²) controls its spread. Adjusting the mean shifts the curve horizontally, whereas changing the variance alters its width and height. Together, they define the shape and behaviour of the distribution, making them essential for analyzing data, building models, and making informed decisions in statistics and machine learning.
Frequently Asked Questions
Q1. What does the mean represent in a normal distribution?
Ans. The mean determines the centre of the distribution. It represents the point of symmetry and the average of the data.
Q2. How are the mean and variance of a normal distribution related?
Ans. The mean determines the central location of the distribution, while the variance controls its spread. Adjusting one does not affect the other.
Q3. What happens to the normal distribution when the mean changes?
Ans. Changing the mean shifts the curve horizontally along the x-axis but does not alter its shape or spread.
Q4. What happens if the variance is zero?
Ans. If the variance is zero, all data points are identical, and the distribution collapses into a single point at the mean.
Q5. Why are the mean and variance important in a normal distribution?
Ans. The mean and variance define the shape of the normal distribution and are essential for statistical analysis, predictive modelling, and understanding data variability.
Q6. How does variance affect the shape of the normal distribution?
Ans. Higher variance leads to a flatter, wider bell curve, showing more spread-out data, while lower variance results in a taller, narrower curve, indicating tighter clustering around the mean.
By Analytics Vidhya, November 26, 2024.