Detecting missing values

Detecting missing values is one of the first, and often most revealing, steps in any data analysis or modeling workflow. Missing data (NaN, None, null) isn’t just an inconvenience: it can hide patterns, introduce bias, crash algorithms, or mislead conclusions. In 2026, with larger, messier datasets, fast and visual detection is essential.

Here’s a complete, up-to-date guide to detecting missing values using pandas (classic), Polars (fast & modern), and powerful visualizations that reveal not just counts, but patterns and correlations of missingness.

1. Quick Detection & Summary (Pandas)


import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load example (Titanic - classic missingness dataset)
df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")

# 1. Count missing per column
print("Missing counts:\n", df.isna().sum())

# 2. Percentage missing
print("\nMissing %:\n", (df.isna().mean() * 100).round(2))

# 3. Total missing cells
print(f"\nTotal missing values: {df.isna().sum().sum()} / {df.size} ({df.isna().mean().mean()*100:.2f}%)")
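The three printouts above can also be folded into one table. The following is a minimal sketch rather than a standard pandas API: the helper name missing_summary and the toy frame are made up for illustration, so it runs without downloading the Titanic file.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column that has gaps: count, percentage, dtype."""
    counts = frame.isna().sum()
    summary = pd.DataFrame(
        {
            "n_missing": counts,
            "pct_missing": (counts / len(frame) * 100).round(2),
            "dtype": frame.dtypes.astype(str),
        }
    )
    # Keep only columns that actually have missing values, worst first.
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)


# Hypothetical toy data (not the Titanic dataset) so the sketch runs offline.
toy = pd.DataFrame(
    {
        "age": [22.0, np.nan, 35.0, np.nan],
        "cabin": [None, None, "C85", None],
        "fare": [7.25, 71.28, 8.05, 53.10],
    }
)
toy_summary = missing_summary(toy)
print(toy_summary)
```

Calling missing_summary(df) on the Titanic frame should give the same information as the prints above, in a form that is easier to sort, filter, or export.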

2. Visual Detection (Heatmap & Bar Plot)

Counts alone can mislead; plots expose what they hide: clustered missingness, patterns by column or row, and correlations between columns.


# Missingness heatmap (yellow = missing)
plt.figure(figsize=(12, 8))
sns.heatmap(df.isna(), cbar=False, cmap='viridis', yticklabels=False)
plt.title('Missing Values Heatmap (Yellow = Missing)', fontsize=14)
plt.xlabel('Columns')
plt.tight_layout()
plt.show()

# Bar plot of missing percentages
missing_pct = df.isna().mean() * 100
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)

plt.figure(figsize=(10, 6))
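# hue + legend=False (seaborn 0.13+): passing `palette` without `hue` is deprecated
sns.barplot(x=missing_pct.values, y=missing_pct.index, hue=missing_pct.index, palette='viridis', legend=False)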
plt.title('Columns with Missing Values (%)', fontsize=14)
plt.xlabel('Missing Percentage')
for i, v in enumerate(missing_pct.values):
    plt.text(v + 0.5, i, f"{v:.1f}%", va='center')
plt.tight_layout()
plt.show()

3. Modern & Fast: Polars (2026 Speed Favorite)


import polars as pl

# Load same data with Polars
df_pl = pl.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")

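# Missing counts & percentages
# Name the transposed columns explicitly; the default value column is "column_0", not "column_1"
missing_summary = df_pl.null_count().transpose(
    include_header=True, header_name="column", column_names=["null_count"]
)
missing_summary = missing_summary.with_columns(
    (pl.col("null_count") / df_pl.height * 100).alias("missing_pct")
).sort("null_count", descending=True)

print(missing_summary.filter(pl.col("null_count") > 0))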

4. Advanced: Missingness Patterns & Correlations

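Is the missingness completely random (MCAR), dependent on other observed variables (MAR), or dependent on the unobserved value itself (MNAR)? A correlation matrix of the missingness indicators shows which columns tend to be missing together. It cannot establish the mechanism on its own, but it tells you where to look.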


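# Missingness correlation heatmap
# Keep only columns with gaps; fully observed columns have zero variance and yield NaN correlations
miss = df.isna()
miss = miss.loc[:, miss.any()]
plt.figure(figsize=(10, 8))
sns.heatmap(miss.corr(), annot=True, cmap='coolwarm', vmin=-1, vmax=1, fmt='.2f')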
plt.title('Correlation of Missingness Across Columns', fontsize=14)
plt.tight_layout()
plt.show()

# Grouped missingness (e.g., by Survived)
print("\nMissing Age by Survived:\n", df.groupby('Survived')['Age'].apply(lambda x: x.isna().mean() * 100))
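Pairwise correlations can be complemented by counting which combinations of columns are missing together in the same row. Below is a small sketch of that idea on a made-up toy frame; the helper missing_pattern is a hypothetical name introduced for illustration, not a pandas function.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the Titanic dataset) so the sketch runs offline.
toy = pd.DataFrame(
    {
        "age": [22.0, np.nan, 35.0, np.nan],
        "cabin": [None, None, "C85", None],
        "fare": [7.25, 71.28, 8.05, 53.10],
    }
)


def missing_pattern(row: pd.Series) -> str:
    """Label a row by the names of the columns that are missing in it."""
    cols = list(row.index[row.to_numpy()])
    return ", ".join(cols) if cols else "complete"


# One label per row, then count how often each combination occurs.
pattern_counts = toy.isna().apply(missing_pattern, axis=1).value_counts()
print(pattern_counts)
```

A pattern that dominates the counts (for example, two columns that are almost always missing together) is a hint that the gaps share a cause and are unlikely to be completely random.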

5. Quick Decision Tree: What to Do After Detection

0–2% missing, random → drop rows (dropna())
Numeric, no strong skew → median fill
Time series → interpolate / forward-fill
Categorical → mode or new 'Unknown' category
Missingness correlated with target → create missing indicator + advanced imputation (KNN/MICE)
>30–50% missing in column → consider dropping column
Large data → Polars + lazy evaluation for speed
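Part of the list above can be written down as a crude rule-of-thumb function. Treat this as a sketch under stated assumptions: the function name suggest_action, the exact cutoffs (2% and 50%), and the toy data are illustrative choices, and the time-series and target-correlation branches are left out because they need context a single column cannot provide.

```python
import numpy as np
import pandas as pd


def suggest_action(col: pd.Series, drop_rows_below: float = 2.0,
                   drop_col_above: float = 50.0) -> str:
    """Map one column's missing percentage and dtype to a first suggestion."""
    pct = col.isna().mean() * 100
    if pct == 0:
        return "nothing to do"
    if pct > drop_col_above:
        return "consider dropping column"
    if pct <= drop_rows_below:
        return "drop rows"
    if pd.api.types.is_numeric_dtype(col):
        return "median fill"
    return "mode or 'Unknown' category"


# Hypothetical toy data so the sketch runs offline.
toy = pd.DataFrame(
    {
        "age": [22.0, np.nan, 35.0, 41.0],
        "cabin": [None, None, "C85", None],
        "embarked": ["S", None, "C", "S"],
        "fare": [7.25, 71.28, 8.05, 53.10],
    }
)
suggestions = {name: suggest_action(toy[name]) for name in toy.columns}
print(suggestions)
```

The output is a starting point for discussion, not a decision: whether the gaps are random still has to be judged from the plots and grouped checks.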

Best Practices & Pitfalls (2026 Edition)

Never skip visualization — counts alone hide clustering (e.g., all Age missing for one group)
Use missingno library for fancier visuals: import missingno as msno; msno.matrix(df)
Create missing flags: df['Age_missing'] = df['Age'].isna().astype(int) — models can learn from it
Pitfall: mean imputation on skewed data → distorts distribution → prefer median or model-based imputation
Pitfall: fill before EDA → can hide real patterns → detect first, then decide
Production: use scikit-learn SimpleImputer or IterativeImputer in Pipeline
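As a minimal sketch of the last two ideas combined (missing flags plus imputation inside a Pipeline), scikit-learn's SimpleImputer can append indicator columns itself via add_indicator=True. The toy arrays, the median strategy, and the choice of LogisticRegression are assumptions made for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy features: column 0 = age, column 1 = fare.
X = np.array(
    [
        [22.0, 7.25],
        [np.nan, 71.28],
        [35.0, 8.05],
        [np.nan, 53.10],
        [28.0, 10.00],
        [40.0, np.nan],
    ]
)
y = np.array([0, 1, 0, 1, 0, 1])

model = Pipeline(
    steps=[
        # Median fill, plus one 0/1 indicator column per feature that had gaps.
        ("impute", SimpleImputer(strategy="median", add_indicator=True)),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)
model.fit(X, y)

# Inspect what the classifier actually sees: 2 imputed columns + 2 indicators.
X_imputed = model.named_steps["impute"].transform(X)
print(X_imputed)
```

Keeping the imputer inside the Pipeline means the medians are learned from the training data only, which avoids leaking information from validation or test rows.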

Conclusion

Detecting missing values isn’t a chore — it’s your first chance to understand data quality, mechanisms of missingness (MCAR/MAR/MNAR), and potential biases. In 2026, start every dataset with isna().sum(), heatmaps, bar plots, and missingness correlations. Use pandas for exploration, Polars for speed, and never move to modeling without visualizing and deciding how to treat NaNs thoughtfully. Master detection, and you’ll build more robust, trustworthy analyses and models.

Next time you load a dataset — plot the missingness first. It often tells the most important story.