Replacing missing values

Replacing missing values (imputation via fillna() in pandas or fill_null() in Polars) is often the best compromise when dropping rows/columns would destroy too much data. The key is choosing the right strategy: simple fills preserve row count but can distort distributions; advanced methods preserve realism but take more compute.

In 2026, start simple (mean/median/mode/ffill), then move to model-based imputation (KNN/MICE) when accuracy matters. Always visualize before and after — never assume the fill is harmless.

1. Basic Replacement in Pandas


import pandas as pd
import matplotlib.pyplot as plt

# Realistic example: sales data with common missing patterns
data = {
    'date': pd.date_range('2025-01-01', periods=6),
    'sales': [120, None, 180, 210, None, 150],
    'region': ['North', 'South', 'East', None, 'West', 'North'],
    'price': [15.5, 14.9, None, 16.2, 15.0, 14.8]
}
df = pd.DataFrame(data)

print("Original missing count:\n", df.isna().sum())

# Option 1: Fill numeric columns with median
df_median = df.copy()
df_median['sales'] = df_median['sales'].fillna(df_median['sales'].median())
df_median['price'] = df_median['price'].fillna(df_median['price'].median())

# Option 2: Fill categorical with mode (most frequent)
df_median['region'] = df_median['region'].fillna(df_median['region'].mode()[0])

print("\nAfter median/mode fill:\n", df_median)

**Typical output (after fill):**

Original missing count:
 date      0
sales     2
region    1
price     1
dtype: int64

After median/mode fill:
         date  sales region  price
0 2025-01-01  120.0  North   15.5
1 2025-01-02  165.0  South   14.9
2 2025-01-03  180.0   East   15.0
3 2025-01-04  210.0  North   16.2
4 2025-01-05  165.0   West   15.0
5 2025-01-06  150.0  North   14.8
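
Mode fill makes the gap in region look like a real "North" row. As an alternative, the sketch below keeps the gap visible with an explicit 'Unknown' label and records which rows were filled. The column names `sales_was_missing` and `region_was_missing` are hypothetical, chosen only for illustration, and the frame is rebuilt so the snippet runs on its own.

```python
import pandas as pd

# Same toy sales frame as above, rebuilt so this snippet is self-contained
df = pd.DataFrame({
    'date': pd.date_range('2025-01-01', periods=6),
    'sales': [120, None, 180, 210, None, 150],
    'region': ['North', 'South', 'East', None, 'West', 'North'],
    'price': [15.5, 14.9, None, 16.2, 15.0, 14.8],
})

df_flagged = df.copy()

# Record which rows were missing BEFORE filling (hypothetical column names)
df_flagged['sales_was_missing'] = df_flagged['sales'].isna()
df_flagged['region_was_missing'] = df_flagged['region'].isna()

# Explicit label instead of the mode, so the gap stays visible downstream
df_flagged['region'] = df_flagged['region'].fillna('Unknown')

# Numeric column still gets the median, but the flag preserves the information
df_flagged['sales'] = df_flagged['sales'].fillna(df_flagged['sales'].median())

print(df_flagged[['sales', 'sales_was_missing', 'region', 'region_was_missing']])
```

The indicator columns can later be passed to a model as features or used to filter imputed rows out of a report.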

2. Time-Series Friendly: Forward-Fill & Backward-Fill


# Forward-fill (carry last valid observation forward), useful for ordered time series
# Note: fillna(method='ffill') is deprecated in recent pandas; use .ffill() / .bfill()
df_ffill = df.ffill()

# Backward-fill (carry next valid observation backward)
df_bfill = df.bfill()

print("After forward-fill:\n", df_ffill)
print("\nAfter backward-fill:\n", df_bfill)
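
Forward-fill repeats the last value, which can flatten a trend. A minimal sketch of two common variations follows: linear interpolation between neighbours, and a forward-fill capped with `limit` so long gaps are not silently bridged. It assumes evenly spaced daily observations and reuses the sales values from the example above.

```python
import pandas as pd

# Daily sales with two single-day gaps (same values as the example above)
sales = pd.Series(
    [120, None, 180, 210, None, 150],
    index=pd.date_range('2025-01-01', periods=6),
    name='sales',
)

# Linear interpolation: with evenly spaced rows, each single gap
# becomes the midpoint of its two neighbours
sales_interp = sales.interpolate(method='linear')

# Forward-fill, but bridge at most one consecutive missing value
sales_ffill_limited = sales.ffill(limit=1)

comparison = pd.DataFrame({
    'original': sales,
    'linear': sales_interp,
    'ffill_limit_1': sales_ffill_limited,
})
print(comparison)
```

Here the gap on 2025-01-02 becomes 150.0 under interpolation and 120.0 under forward-fill; which one is more plausible depends on how the quantity actually behaves between observations.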

3. Fast & Modern: Filling in Polars (2026 Large-Data Choice)


import polars as pl

df_pl = pl.from_pandas(df)

# Fill numeric with median, categorical with mode
df_pl_filled = df_pl.with_columns([
    pl.col('sales').fill_null(pl.col('sales').median()),
    pl.col('price').fill_null(pl.col('price').median()),
    pl.col('region').fill_null(pl.col('region').mode().first())
])

print("After median/mode fill (Polars):\n", df_pl_filled)

4. Before & After Visual Check (Critical Step)


import missingno as msno

# Before: pass figsize to msno.bar itself, which otherwise applies its own default size
msno.bar(df, figsize=(10, 4), color='teal')
plt.title('Missing Values BEFORE Replacement', fontsize=14)
plt.show()

# After
msno.bar(df_median, figsize=(10, 4), color='darkorange')
plt.title('Missing Values AFTER Median/Mode Fill', fontsize=14)
plt.show()
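
The bar chart only confirms that the gaps are gone; it says nothing about whether the fill changed the shape of the data. A minimal numeric check, sketched here on the sales values from the first example, compares summary statistics before and after a median fill.

```python
import pandas as pd

# Sales values from the first example
sales = pd.Series([120, None, 180, 210, None, 150], name='sales')
sales_filled = sales.fillna(sales.median())

# describe() skips NaN, so 'before' summarizes only the observed values
summary = pd.DataFrame({
    'before': sales.describe(),
    'after': sales_filled.describe(),
})

print(summary.loc[['count', 'mean', '50%', 'std']])
```

On this toy data the mean and median stay at 165.0, but the standard deviation drops from roughly 38.7 to 30.0, because the two filled rows sit exactly at the center of the distribution. That shrinkage in spread is the distortion a constant fill introduces.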

When to Use Each Replacement Strategy (2026 Decision Framework)

Scenario	Best Method	Why / Risk
Numeric, low skew	Mean fill	Preserves mean; distorts if outliers
Numeric, skewed/outliers	Median fill	Robust to outliers
Categorical	Mode or 'Unknown' category	'Unknown' preserves missingness info
Time-series / ordered	ffill / bfill / linear interpolate	Preserves temporal continuity
Modeling performance critical	KNNImputer or IterativeImputer	Uses other features for realistic fill
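
The last row of the table can be sketched with scikit-learn, assuming it is installed. Both imputers expect numeric input, so only the sales and price columns are used. The settings `n_neighbors=2`, `max_iter=10`, and `random_state=0` are illustrative assumptions, not tuned recommendations, and six rows are far too few for a meaningful model-based fill.

```python
import numpy as np
import pandas as pd
# IterativeImputer is still experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

# Numeric columns from the sales example
num = pd.DataFrame({
    'sales': [120, np.nan, 180, 210, np.nan, 150],
    'price': [15.5, 14.9, np.nan, 16.2, 15.0, 14.8],
})

# KNN: fill each gap with the average of the 2 most similar rows
knn = KNNImputer(n_neighbors=2)
num_knn = pd.DataFrame(knn.fit_transform(num), columns=num.columns)

# Iterative (MICE-style): model each column from the others, round-robin
iterative = IterativeImputer(max_iter=10, random_state=0)
num_iter = pd.DataFrame(iterative.fit_transform(num), columns=num.columns)

print("KNN imputed:\n", num_knn)
print("\nIterative imputed:\n", num_iter)
```

Because KNN relies on distances, columns on very different scales (as sales and price are here) are usually standardized first; that step is omitted to keep the sketch short.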

Best Practices & Common Pitfalls

Always visualize before/after with missingno.bar() — confirms you filled what you intended
Fill numeric with median (not mean) unless data is symmetric — protects against outliers
For time-series: prefer ffill() or interpolate(method='linear') over global stats
Pitfall: filling before EDA → can hide real patterns (e.g., missing salary for unemployed group)
Pitfall: global fill on grouped data → use groupby + transform for group-aware imputation
Large data? Consider Polars fill_null(), which is often faster than pandas fillna() on large frames
Production: log filled values & strategy — audit trail for reproducibility
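
Two of the points above, group-aware imputation and an audit trail, can be sketched together. The data, the `sales_filled` column, and the structure of the `audit` record are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Two regions with very different sales levels (illustrative data)
df_g = pd.DataFrame({
    'region': ['North', 'North', 'North', 'South', 'South', 'South'],
    'sales': [100.0, None, 120.0, 300.0, 320.0, None],
})

# Remember which rows were missing before anything is filled
was_missing = df_g['sales'].isna()

# transform returns one value per row: the median of that row's own region
group_median = df_g.groupby('region')['sales'].transform('median')
df_g['sales_filled'] = df_g['sales'].fillna(group_median)

# Minimal audit record: what was filled, how, and where
audit = {
    'column': 'sales',
    'strategy': 'median per region',
    'n_filled': int(was_missing.sum()),
    'rows': df_g.index[was_missing].tolist(),
}

print(df_g)
print(audit)
```

A global median would have put 210.0 in both gaps, a value typical of neither region; the group-aware fill gives 110.0 for North and 310.0 for South.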

Conclusion

Replacing missing values is a judgment call: mean/median/mode/ffill for speed and simplicity, group-aware or model-based (KNN/Iterative) when realism matters. In 2026, always visualize before and after, choose a method based on data type and the missingness mechanism, and test the impact on model performance. Done right, imputation preserves data volume and signal; done wrong, it introduces noise or bias. Master replacement strategies, and your datasets stay powerful and trustworthy.

Next time you see missing values — don’t just drop or mean-fill. Choose thoughtfully, visualize the change, and let the data guide you.