Summary statistics are the first thing every data scientist looks at when meeting a new dataset — they give you a quick snapshot of the data’s central tendency, spread, shape, and potential issues like outliers or missing values.
In Python in 2026, you’ll mostly use Pandas for this (with Polars as the high-speed alternative for large data). These numbers help you spot patterns, decide next steps, and communicate insights before diving into modeling or visualization.
1. Loading a Sample Dataset
Let’s use a simple dataset to demonstrate:
import pandas as pd
import numpy as np
# Example dataset
data = {
'scores': [85, 92, 78, 95, 88, 67, 91, 82, 99, 76, 45, 88, 92, 81],
'age': [22, 25, 19, 28, 24, 30, 23, 21, 27, 26, 35, 24, 25, 23],
'hours_study': [5, 8, 3, 10, 6, 2, 7, 4, 9, 5, 1, 6, 8, 4]
}
df = pd.DataFrame(data)
print(df.head())
2. The Most Important Summary Stats
Mean (Arithmetic Average)
The mean is sensitive to outliers, so it works best for roughly symmetric data without extreme values.
print(df['scores'].mean()) # ~82.79
print(df.mean(numeric_only=True)) # all numeric columns
Median (Middle Value)
Robust to outliers — often more representative for skewed data.
print(df['scores'].median()) # 86.5 (less affected by the low score of 45 than the mean)
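One way to see this robustness, as a small illustrative check rather than a formal test, is to drop the lowest score (45) and compare how far each statistic moves. The names `trimmed`, `mean_shift`, and `median_shift` are placeholders introduced here.

```python
import pandas as pd

scores = pd.Series([85, 92, 78, 95, 88, 67, 91, 82, 99, 76, 45, 88, 92, 81])

# Drop the single lowest score (45) to see how each statistic reacts
trimmed = scores[scores != scores.min()]

mean_shift = trimmed.mean() - scores.mean()
median_shift = trimmed.median() - scores.median()

print(f"mean:   {scores.mean():.2f} -> {trimmed.mean():.2f} (shift {mean_shift:.2f})")
print(f"median: {scores.median():.2f} -> {trimmed.median():.2f} (shift {median_shift:.2f})")
```

On this data the mean moves by roughly twice as much as the median when that one point is removed.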
Mode (Most Frequent Value)
Useful for categorical or discrete data.
print(df['scores'].mode()) # 88 and 92 (both appear twice)
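Because ties are possible, .mode() always returns a Series rather than a single value. The sketch below uses a hypothetical categorical column (`grades` is not part of the dataset above) to show the mode next to the full frequency table from value_counts().

```python
import pandas as pd

# Hypothetical categorical data, made up for illustration
grades = pd.Series(["B", "A", "C", "A", "B", "A"])

modes = grades.mode()           # a Series, even when there is only one mode
counts = grades.value_counts()  # frequency of every category, most common first

print(modes.tolist())
print(counts)
```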
Range (Max - Min)
Simple measure of spread — but very sensitive to outliers.
print(df['scores'].max() - df['scores'].min()) # 54
Standard Deviation & Variance
Standard deviation measures the typical distance of values from the mean, which makes it the key number for understanding spread. Variance is its square.
print(df['scores'].std()) # ~13.75
print(df['scores'].var()) # ~189.10 (variance = std²)
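One detail worth knowing: Pandas computes the sample standard deviation by default (ddof=1, dividing by n - 1), while NumPy's np.std defaults to the population version (ddof=0, dividing by n). A minimal sketch of the difference on the scores column:

```python
import numpy as np
import pandas as pd

scores = pd.Series([85, 92, 78, 95, 88, 67, 91, 82, 99, 76, 45, 88, 92, 81])

sample_std = scores.std()              # Pandas default: ddof=1 (divide by n - 1)
population_std = scores.std(ddof=0)    # divide by n instead
numpy_std = np.std(scores.to_numpy())  # NumPy default: ddof=0

print(f"sample: {sample_std:.2f}, population: {population_std:.2f}, numpy: {numpy_std:.2f}")
```

For these 14 scores the sample value is about 13.75 and the population value about 13.25, so it pays to check which convention a number was computed with before comparing results across libraries.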
Quantiles & Percentiles
Show distribution shape (e.g. 25th, 50th, 75th percentiles).
print(df['scores'].quantile([0.25, 0.5, 0.75])) # Q1, median, Q3
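Quartiles also give a simple way to flag possible outliers. The sketch below applies the common 1.5 × IQR rule of thumb (Tukey's fences); this rule is an addition for illustration, not something the steps above depend on, and the threshold of 1.5 is a convention rather than a law.

```python
import pandas as pd

scores = pd.Series([85, 92, 78, 95, 88, 67, 91, 82, 99, 76, 45, 88, 92, 81])

q1, q3 = scores.quantile(0.25), scores.quantile(0.75)
iqr = q3 - q1  # interquartile range: spread of the middle 50% of values

# Rule of thumb: flag points more than 1.5 * IQR beyond the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = scores[(scores < lower) | (scores > upper)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(outliers.tolist())
```

On this dataset only the score of 45 falls outside the fences.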
3. All-in-One Summary with .describe()
The fastest way to get the count, mean, standard deviation, min, max, and quartiles of every numeric column at once.
print(df.describe())
Output for this dataset:
          scores        age  hours_study
count  14.000000  14.000000    14.000000
mean   82.785714  25.142857     5.571429
std    13.751523   4.016450     2.651974
min    45.000000  19.000000     1.000000
25%    78.750000  23.000000     4.000000
50%    86.500000  24.500000     5.500000
75%    91.750000  26.750000     7.750000
max    99.000000  35.000000    10.000000
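If the default quartiles are not what you need, .describe() accepts a percentiles argument, and .agg() lets you pick an exact set of statistics. A short sketch using the same dataset (the particular percentiles and statistics chosen here are arbitrary examples):

```python
import pandas as pd

df = pd.DataFrame({
    "scores": [85, 92, 78, 95, 88, 67, 91, 82, 99, 76, 45, 88, 92, 81],
    "age": [22, 25, 19, 28, 24, 30, 23, 21, 27, 26, 35, 24, 25, 23],
    "hours_study": [5, 8, 3, 10, 6, 2, 7, 4, 9, 5, 1, 6, 8, 4],
})

# Custom percentiles instead of the default 25/50/75
custom = df.describe(percentiles=[0.1, 0.5, 0.9]).round(2)
print(custom)

# Or choose exactly the statistics you want with .agg()
summary = df.agg(["mean", "median", "std"]).round(2)
print(summary)
```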
4. Modern Alternative in 2026: Polars
For large datasets, Polars is often faster and more memory-efficient.
import polars as pl
df_pl = pl.DataFrame(data)
print(df_pl.describe())
5. Common Pitfalls & Best Practices
- Always check df.info() first: data types and missing values matter
- Mean is misleading with outliers: prefer median
- Use describe(include='object') for categorical columns
- Look at df.skew() and df.kurt() for distribution shape
- Visualize: pair summary stats with histograms/boxplots
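Several of these checks can be sketched in a few lines. The small DataFrame below is hypothetical (it includes one deliberately missing score) and exists only to show what the calls report.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing score, made up for illustration
df = pd.DataFrame({
    "scores": [85, 92, np.nan, 95, 45],
    "hours_study": [5, 8, 3, 10, 1],
})

df.info()                  # dtypes and non-null counts per column
missing = df.isna().sum()  # number of missing values per column
print(missing)

# Pandas skips NaN by default (skipna=True), so this mean uses 4 values
print(df["scores"].mean())

print(df.skew())  # negative skew means a longer tail toward low values
print(df.kurt())
```

Note that the missing value silently shrinks the sample used for each statistic, which is why checking counts first matters.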
Conclusion
Summary statistics are your first look at any dataset — they reveal central tendency, spread, outliers, and shape in seconds. In 2026, use Pandas .describe() for quick insights and Polars for speed on big data. Master these numbers, and you’ll make smarter decisions about cleaning, modeling, and visualization from the very beginning.
Next time you load a dataset, run .describe() — it’s the fastest way to understand what you’re working with.