Summary statistics

Summary statistics are the first thing every data scientist looks at when meeting a new dataset — they give you a quick snapshot of the data’s central tendency, spread, shape, and potential issues like outliers or missing values.

In Python, as of 2026, you’ll mostly use Pandas for this, with Polars as a faster alternative for large data. These numbers help you spot patterns, decide next steps, and communicate insights before diving into modeling or visualization.

1. Loading a Sample Dataset

Let’s use a simple dataset to demonstrate:


import pandas as pd
import numpy as np

# Example dataset
data = {
    'scores': [85, 92, 78, 95, 88, 67, 91, 82, 99, 76, 45, 88, 92, 81],
    'age': [22, 25, 19, 28, 24, 30, 23, 21, 27, 26, 35, 24, 25, 23],
    'hours_study': [5, 8, 3, 10, 6, 2, 7, 4, 9, 5, 1, 6, 8, 4]
}

df = pd.DataFrame(data)
print(df.head())

2. The Most Important Summary Stats

Mean (Arithmetic Average)

The mean is the sum of the values divided by their count. It is sensitive to outliers, so it describes roughly symmetric data (such as normally distributed values) best.


print(df['scores'].mean())          # ~82.79
print(df.mean(numeric_only=True))   # all numeric columns

Median (Middle Value)

Robust to outliers — often more representative for skewed data.


print(df['scores'].median())        # 86.5 (better than mean here)
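
To make the robustness claim concrete, here is a minimal sketch that makes the lowest score even more extreme and compares how far each statistic moves. The replacement value of 5 is a hypothetical data-entry error, not part of the original dataset.

```python
import pandas as pd

scores = pd.Series([85, 92, 78, 95, 88, 67, 91, 82, 99, 76, 45, 88, 92, 81])

# Hypothetical scenario: the lowest score (45) was mistyped as 5.
extreme = scores.replace(45, 5)

print(f"mean:   {scores.mean():.2f} -> {extreme.mean():.2f}")      # 82.79 -> 79.93
print(f"median: {scores.median():.2f} -> {extreme.median():.2f}")  # 86.50 -> 86.50
```

The mean drops by almost three points while the median stays where it was, because the median depends only on the order of the values, not on how far the extremes lie from the center.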

Mode (Most Frequent Value)

Useful for categorical or discrete data.


print(df['scores'].mode())          # 88 and 92 (both appear twice)
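
Because a dataset can have several modes, mode() returns a Series rather than a single number. The sketch below shows one way to unpack it, with value_counts() as a cross-check on the frequencies.

```python
import pandas as pd

scores = pd.Series([85, 92, 78, 95, 88, 67, 91, 82, 99, 76, 45, 88, 92, 81])

# mode() returns every value that ties for most frequent, in sorted order.
modes = scores.mode()
print(modes.tolist())        # [88, 92]

# value_counts() lists how often each value occurs, most frequent first.
counts = scores.value_counts()
print(counts.head(3))
```

If you need a single number, you have to pick one explicitly (for example modes.iloc[0]); which one is appropriate depends on the analysis.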

Range (Max - Min)

Simple measure of spread — but very sensitive to outliers.


print(df['scores'].max() - df['scores'].min())  # 54

Standard Deviation & Variance

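Standard deviation measures how far values typically lie from the mean (it is the square root of the variance), which makes it key for understanding spread. Pandas computes the sample version by default (ddof=1, dividing by n - 1).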


print(df['scores'].std())           # ~13.75 (sample std, ddof=1)
print(df['scores'].var())           # ~189.10 (variance = std²)
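
The choice of denominator matters on small datasets. As a rough sketch, the snippet below compares the Pandas default (ddof=1), the population version (ddof=0), NumPy's default, and a hand-computed value so you can see where each number comes from.

```python
import numpy as np
import pandas as pd

scores = pd.Series([85, 92, 78, 95, 88, 67, 91, 82, 99, 76, 45, 88, 92, 81])

sample_std = scores.std()                # Pandas default: ddof=1, divide by n - 1
population_std = scores.std(ddof=0)      # divide by n instead
numpy_std = np.std(scores.to_numpy())    # NumPy default is ddof=0

# Sample standard deviation computed by hand from its definition.
deviations = scores - scores.mean()
manual_std = np.sqrt((deviations ** 2).sum() / (len(scores) - 1))

print(f"sample (ddof=1):     {sample_std:.2f}")      # 13.75
print(f"population (ddof=0): {population_std:.2f}")  # 13.25
print(f"numpy default:       {numpy_std:.2f}")       # 13.25
print(f"manual (ddof=1):     {manual_std:.2f}")      # 13.75
```

With only 14 rows the two conventions differ by about half a point, which is a common source of mismatched numbers when results from Pandas and NumPy are compared.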

Quantiles & Percentiles

Show distribution shape (e.g. 25th, 50th, 75th percentiles).


print(df['scores'].quantile([0.25, 0.5, 0.75]))  # Q1 = 78.75, median = 86.5, Q3 = 91.75
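
Quartiles also give a simple way to flag potential outliers. The sketch below uses the common 1.5 × IQR rule of thumb; the 1.5 multiplier is a convention assumed here, not something prescribed by Pandas.

```python
import pandas as pd

scores = pd.Series([85, 92, 78, 95, 88, 67, 91, 82, 99, 76, 45, 88, 92, 81])

q1 = scores.quantile(0.25)   # 78.75
q3 = scores.quantile(0.75)   # 91.75
iqr = q3 - q1                # interquartile range: 13.0

# Rule-of-thumb fences: anything outside them is flagged for a closer look.
lower = q1 - 1.5 * iqr       # 59.25
upper = q3 + 1.5 * iqr       # 111.25

outliers = scores[(scores < lower) | (scores > upper)]
print(outliers.tolist())     # [45]
```

A flagged value is a candidate for inspection, not automatically an error; whether to keep, cap, or drop it depends on what the data represents.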

3. All-in-One Summary with .describe()

The fastest way to get the count, mean, standard deviation, minimum, maximum, and quartiles of every numeric column at once.


print(df.describe())

Typical output:


          scores        age  hours_study
count  14.000000  14.000000    14.000000
mean   82.785714  25.142857     5.571429
std    13.751523   4.016450     2.651974
min    45.000000  19.000000     1.000000
25%    78.750000  23.000000     4.000000
50%    86.500000  24.500000     5.500000
75%    91.750000  26.750000     7.750000
max    99.000000  35.000000    10.000000
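
When the default table is not quite what you need, two variations are worth knowing. This sketch passes custom percentiles to describe() and uses agg() to build a table from hand-picked statistics; the particular percentiles and statistics chosen here are only illustrative.

```python
import pandas as pd

data = {
    'scores': [85, 92, 78, 95, 88, 67, 91, 82, 99, 76, 45, 88, 92, 81],
    'age': [22, 25, 19, 28, 24, 30, 23, 21, 27, 26, 35, 24, 25, 23],
    'hours_study': [5, 8, 3, 10, 6, 2, 7, 4, 9, 5, 1, 6, 8, 4]
}
df = pd.DataFrame(data)

# Ask for the 10th and 90th percentiles instead of the default quartiles.
tails = df.describe(percentiles=[0.1, 0.9])
print(tails)

# Build a custom summary: one row per statistic, one column per variable.
custom = df.agg(['mean', 'median', 'std', 'skew'])
print(custom)
```

agg() is handy when a report needs statistics that describe() leaves out, such as skewness.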

4. Modern Alternative in 2026: Polars

For large datasets, Polars is often faster and more memory-efficient.


import polars as pl

df_pl = pl.DataFrame(data)
print(df_pl.describe())

5. Common Pitfalls & Best Practices

Always check df.info() first: data types and missing values affect every statistic that follows.
The mean can be misleading when outliers are present, so compare it with the median.
Use describe(include='object') to summarize categorical (text) columns.
Look at df.skew() and df.kurt() to gauge distribution shape.
Visualize: pair summary stats with histograms and boxplots.
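
Several of these checks can be sketched in a few lines. The small frame below is hypothetical (one missing score and one text column), built only to show what each call reports.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing score and a text column of letter grades.
df_check = pd.DataFrame({
    'scores': [85, 92, np.nan, 95, 45, 88],
    'grade': ['B', 'A', 'C', 'A', 'F', 'B'],
})

df_check.info()                          # dtypes and non-null counts per column

missing = df_check.isna().sum()          # number of missing values per column
print(missing)

# For a text column, describe() reports count, unique, top, and freq.
grade_summary = df_check['grade'].describe()
print(grade_summary)

skewness = df_check['scores'].skew()     # negative: long tail toward low scores
kurtosis = df_check['scores'].kurt()     # excess kurtosis (0 for a normal distribution)
print(skewness, kurtosis)
```

Most Pandas statistics skip missing values by default, so the count row is worth checking whenever columns might contain gaps.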

Conclusion

Summary statistics are your first look at any dataset — they reveal central tendency, spread, outliers, and shape in seconds. In 2026, use Pandas .describe() for quick insights and Polars for speed on big data. Master these numbers, and you’ll make smarter decisions about cleaning, modeling, and visualization from the very beginning.

Next time you load a dataset, run .describe() — it’s the fastest way to understand what you’re working with.