Missing values (NaN, None, null) are one of the most common — and most dangerous — realities in real-world data. They appear from sensor failures, non-responses in surveys, data entry errors, filtering bugs, or intentional non-collection. Ignoring them leads to biased models, crashed algorithms, or misleading insights. In 2026, handling missing data intelligently is still a core skill for any data scientist or analyst.
Here’s a practical, up-to-date guide to detecting, understanding, and treating missing values using pandas (classic), Polars (fast modern alternative), and visualization tools.
1. Quick Detection & Summary
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Example: Titanic-like dataset with realistic missingness
df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
# Fast missing overview
print("Missing counts:\n", df.isna().sum())
print("\nMissing %:\n", df.isna().mean() * 100)
# Visual heatmap (very useful!)
plt.figure(figsize=(10, 6))
sns.heatmap(df.isna(), cbar=False, cmap='viridis', yticklabels=False)
plt.title('Missing Values Heatmap (Yellow = Missing)', fontsize=14)
plt.tight_layout()
plt.show()
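Beyond counts and heatmaps, a small reusable helper keeps the overview handy across notebooks. A minimal sketch: the missing_report function and the toy DataFrame below are illustrative stand-ins for the Titanic data, not part of the original code.

```python
import numpy as np
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing counts and percentages, worst columns first."""
    report = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "pct_missing": df.isna().mean() * 100,
    })
    return report.sort_values("pct_missing", ascending=False)

# Toy frame standing in for the Titanic dataset
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, np.nan, np.nan, "C85"],
    "Fare": [7.25, 71.28, 8.05, 53.1],
})
print(missing_report(toy))
```

Sorting by percentage surfaces the worst-affected columns first, which is usually where a drop-the-column decision gets made.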
2. Common Strategies: When to Use What (Decision Guide 2026)
| Scenario | Best Method | When to Avoid | Code Example |
|---|---|---|---|
| Very few missing (<1–2%) | Drop rows | Small dataset or important rows | df = df.dropna() |
| Numeric, random missing | Mean / Median imputation | Strong skew or outliers | df['Age'] = df['Age'].fillna(df['Age'].median()) |
| Time series / ordered data | Forward-fill, backward-fill, linear interpolation | Long gaps | df['value'] = df['value'].interpolate(method='linear') |
| Categorical / low cardinality | Mode or new 'missing' category | High cardinality | df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0]) |
| Predictive power matters (modeling) | KNN / Iterative imputation (sklearn) | Very large data (slow) | See advanced section |
| Large data & speed critical | Polars + simple fill / forward fill | Need complex logic | df.with_columns(pl.col('col').fill_null(strategy='forward')) |
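The pandas strategies from the table can be sketched end to end on a toy frame. The column names mirror the table's examples; the data itself is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],    # numeric, random missing
    "Embarked": ["S", None, "C", "S"],    # low-cardinality categorical
    "value": [1.0, np.nan, np.nan, 4.0],  # ordered / time-series-like
})

# Median imputation: robust to skew and outliers
df["Age"] = df["Age"].fillna(df["Age"].median())

# Mode imputation for a categorical column
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Linear interpolation for ordered data
df["value"] = df["value"].interpolate(method="linear")

print(df)
```

Assigning the result back (rather than using inplace=True on a selected column) avoids chained-assignment warnings in modern pandas.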
3. Advanced Imputation: When Simple Isn’t Enough
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 -- must run before importing IterativeImputer
from sklearn.impute import KNNImputer, IterativeImputer
# KNN imputation (uses nearest neighbors based on other features)
imputer = KNNImputer(n_neighbors=5)
df_numeric = df.select_dtypes(include='number')
df_imputed = pd.DataFrame(imputer.fit_transform(df_numeric), columns=df_numeric.columns)
# Iterative (model-based, like MICE)
iter_imputer = IterativeImputer(max_iter=10, random_state=42)
df_iter = pd.DataFrame(iter_imputer.fit_transform(df_numeric), columns=df_numeric.columns)
4. Visualizing Missingness Patterns
# Missingness correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(df.isna().corr(), annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation of Missingness Across Columns', fontsize=14)
plt.tight_layout()
plt.show()
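A groupby view complements the correlation heatmap: if missingness rates differ sharply across groups, the data is likely MAR rather than MCAR. A sketch on a toy frame (Pclass and Age mirror the Titanic columns; the values are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3, 3],
    "Age": [38.0, 35.0, np.nan, np.nan, np.nan, 27.0],
})

# Share of missing Age per passenger class: a large gap between
# groups suggests the missingness depends on observed data (MAR)
rates = df.groupby("Pclass")["Age"].apply(lambda s: s.isna().mean())
print(rates)
```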
Best Practices & Common Pitfalls (2026 Edition)
- Always explore why values are missing: MCAR (missing completely at random), MAR (missing at random, dependent on observed data), or MNAR (missing not at random)? The mechanism determines the right strategy
- Never blindly fill before understanding patterns — use heatmaps and groupby
- Create a missing indicator column — helps models learn from missingness itself
- For time series — prefer interpolation / forward-fill over mean
- Compare model performance before/after imputation — sometimes dropping is better
- Large data? Use Polars: its fill_null strategies are blazing fast
- Production? Use sklearn Pipeline with SimpleImputer or custom transformers
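The indicator-column and pipeline advice combine naturally: SimpleImputer's add_indicator=True appends a binary missingness flag for each column that had missing values, so downstream models can learn from the missingness itself. A minimal sketch on toy data:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

X = np.array([[1.0, np.nan],
              [2.0, 10.0],
              [np.nan, 30.0]])

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median", add_indicator=True)),
])

# Result: 2 imputed columns followed by 2 indicator columns
# (one per input column that contained missing values)
Xt = pipe.fit_transform(X)
print(Xt)
```

In production, the fitted pipeline can be persisted and reused so training-time medians are applied consistently at inference time.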
Conclusion
Missing values are not just noise — they are information. In 2026, start every dataset with detection (heatmap + counts), understand mechanisms, then choose the right treatment: drop for tiny missingness, simple fill for speed, interpolation for time series, or advanced (KNN/Iterative) when modeling performance matters. Visualize patterns, create indicators, compare strategies, and never assume “mean fill and move on” is enough. Master missing data handling, and your models will be more robust, less biased, and far more trustworthy.
Next time you see NaNs — don’t panic. Plot them, understand them, treat them thoughtfully. Your future self (and your models) will thank you.