Counting missing values is one of the most important early steps in data quality assessment. After detecting whether missing values exist (.isna().any()), the next question is: how many are there, and where? Accurate counting helps you decide whether to drop rows/columns, impute, or investigate why data is missing.
In 2026, you still rely on .isna().sum() in pandas for most work, but Polars offers faster alternatives for large datasets. Here’s a complete, practical guide with real examples, visuals, and best practices.
1. Basic Counting in Pandas
import pandas as pd
# Realistic example: survey data with various missingness
data = {
'respondent_id': [101, 102, 103, 104, 105],
'age': [28, None, 45, 33, None],
'income': [55000, 72000, None, 61000, 48000],
'city': ['New York', 'Chicago', 'Los Angeles', None, 'Seattle'],
'satisfaction': [8, 7, None, 9, 6]
}
df = pd.DataFrame(data)
# Count missing values per column
print("Missing count per column:")
print(df.isna().sum())
# Total missing values in the entire DataFrame
total_missing = df.isna().sum().sum()
print(f"\nTotal missing values: {total_missing} out of {df.size} cells "
f"({total_missing / df.size * 100:.1f}%)")
**Typical output:**
Missing count per column:
respondent_id    0
age              2
income           1
city             1
satisfaction     1
dtype: int64
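Per-row counts are just as useful as per-column counts: a respondent with many blanks may need to be dropped entirely. A quick sketch using the same survey data and `.isna().sum(axis=1)`:

```python
import pandas as pd

df = pd.DataFrame({
    'respondent_id': [101, 102, 103, 104, 105],
    'age': [28, None, 45, 33, None],
    'income': [55000, 72000, None, 61000, 48000],
    'city': ['New York', 'Chicago', 'Los Angeles', None, 'Seattle'],
    'satisfaction': [8, 7, None, 9, 6],
})

# Missing values per row (axis=1 sums across columns)
row_missing = df.isna().sum(axis=1)
print(row_missing)

# Rows missing more than one value: candidates for closer inspection
print(df[row_missing > 1])
```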
2. Percentage Missing (Most Useful View)
# Missing percentage per column (sorted descending)
missing_pct = (df.isna().mean() * 100).round(2).sort_values(ascending=False)
print("Missing percentage per column:")
print(missing_pct[missing_pct > 0])
# Quick bar plot of missing percentages
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8, 5))
sns.barplot(x=missing_pct[missing_pct > 0].values, y=missing_pct[missing_pct > 0].index, palette='viridis')
plt.title('Columns with Missing Values (%)', fontsize=14)
plt.xlabel('Missing Percentage')
for i, v in enumerate(missing_pct[missing_pct > 0].values):
    plt.text(v + 0.5, i, f"{v}%", va='center')
plt.tight_layout()
plt.show()
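A heatmap of the boolean missingness mask complements the bar plot by showing where the gaps cluster across rows. A minimal sketch using matplotlib's imshow directly, so no extra dependency is needed (the missingno library's matrix plot is a richer alternative):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'age': [28, None, 45, 33, None],
    'income': [55000, 72000, None, 61000, 48000],
    'city': ['New York', 'Chicago', 'Los Angeles', None, 'Seattle'],
    'satisfaction': [8, 7, None, 9, 6],
})

# Missing cells render as dark blocks; clustering jumps out visually
fig, ax = plt.subplots(figsize=(6, 4))
ax.imshow(df.isna(), aspect='auto', cmap='gray_r', interpolation='none')
ax.set_xticks(range(df.shape[1]))
ax.set_xticklabels(df.columns, rotation=45, ha='right')
ax.set_ylabel('Row index')
ax.set_title('Missing-value heatmap')
plt.tight_layout()
plt.show()
```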
3. Fast Counting with Polars (2026 Speed Choice for Large Data)
import polars as pl
df_pl = pl.from_pandas(df)
# Missing count & percentage per column (very fast)
missing_summary = df_pl.null_count().transpose(include_header=True, header_name="column", column_names=["count"])
missing_summary = missing_summary.with_columns(
(pl.col("count") / df_pl.height * 100).round(2).alias("pct")
).filter(pl.col("count") > 0).sort("count", descending=True)
print("Missing summary (Polars):")
print(missing_summary)
4. Counting Missing Values by Group (Real-World Insight)
# Example: missing age by gender (if we had that column)
# df.groupby('gender')['age'].apply(lambda x: x.isna().sum())
# Or with Polars (cleaner syntax)
print(df_pl.group_by('city').agg(pl.col('age').null_count().alias('missing_age')))
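The same group-wise view works in pandas too. One detail worth knowing: groupby drops rows whose group key is itself missing unless you pass dropna=False. A sketch on the survey columns:

```python
import pandas as pd

df = pd.DataFrame({
    'age': [28, None, 45, 33, None],
    'city': ['New York', 'Chicago', 'Los Angeles', None, 'Seattle'],
})

# dropna=False keeps the group whose key (city) is itself missing
missing_by_city = df.groupby('city', dropna=False)['age'].apply(lambda s: s.isna().sum())
print(missing_by_city)
```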
5. Quick Decision Guide: What to Do After Counting
- 0–2% missing → usually safe to drop rows (dropna())
- 5–20% missing → simple imputation (median, mode, forward-fill)
- >30% missing in a column → consider dropping the column unless critical
- Missingness varies strongly by group → imputation should be group-aware or use advanced methods (KNN/MICE)
- Time series → never use global mean/median; prefer interpolation or forward/backward fill
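The decision guide above can be turned into a small helper that suggests an action per column from its missing percentage. The thresholds below follow the list; the function name and labels are illustrative, and real pipelines should tune the cutoffs:

```python
import pandas as pd

def missing_action(df: pd.DataFrame) -> pd.Series:
    """Suggest a handling strategy per column based on its missing percentage."""
    pct = df.isna().mean() * 100

    def classify(p: float) -> str:
        if p == 0:
            return 'ok'
        if p <= 2:
            return 'drop rows'
        if p <= 20:
            return 'simple imputation'
        if p > 30:
            return 'consider dropping column'
        return 'investigate'  # 20-30%: no clear-cut rule

    return pct.apply(classify)

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],           # 0% missing
    'b': [1, None, 3, 4, 5],        # 20% missing
    'c': [None, None, 1, None, 2],  # 60% missing
})
print(missing_action(df))
```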
Best Practices & Common Pitfalls (2026 Edition)
- Always report both **count** and **percentage** — raw counts mislead on large datasets
- Sort by missingness descending — focus effort on worst columns first
- Visualize after counting — bar plot or heatmap reveals clustering
- Pitfall: .isna().sum().sum() counts total missing cells; use .isna().any().sum() to count affected columns
- Pitfall: forgetting to handle missings before modeling; many sklearn models crash or give poor results
- Large data? Use Polars: null_count() is significantly faster than pandas .isna().sum()
- Production pipelines: log missing percentages and alert if they exceed a threshold
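For the production-pipeline point, here is a minimal sketch of a check that logs every column's missing percentage and flags those over a limit. The logger name, function name, and 20% default threshold are illustrative assumptions:

```python
import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('data_quality')  # illustrative logger name

def check_missingness(df: pd.DataFrame, threshold_pct: float = 20.0) -> list:
    """Log missing percentage per column and return columns over the threshold."""
    pct = (df.isna().mean() * 100).round(2)
    flagged = []
    for col, p in pct.items():
        logger.info('%s: %.2f%% missing', col, p)
        if p > threshold_pct:
            logger.warning('%s exceeds %.1f%% missing threshold', col, threshold_pct)
            flagged.append(col)
    return flagged

df = pd.DataFrame({'x': [1, None, None, 4], 'y': [1, 2, 3, 4]})
print(check_missingness(df))  # x is 50% missing
```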
Conclusion
Counting missing values with .isna().sum() (or Polars null_count()) is the bridge from “there’s a problem” to “how big is the problem and where?” In 2026, do it immediately after loading data, always report percentages, visualize the results, and use the numbers to decide on drop vs fill vs advanced imputation. Master this step, and you’ll catch data quality issues early — saving time, avoiding bias, and building more reliable models.
Next time you load a DataFrame — count missing values first. The numbers you get will guide every decision that follows.