Dropping duplicate pairs (or combinations of columns) is a critical cleaning step when dealing with real-world data — think duplicate customer records, repeated transactions, or merged datasets with overlapping entries. In Pandas, the .drop_duplicates() method lets you remove rows that have identical values in one or more specified columns (a "pair" or "group" of columns), while keeping control over which duplicate to retain.
In 2026, this operation is essential for data quality in analytics, machine learning prep, and reporting. Here’s a practical guide with real examples.
1. Basic Setup & Sample Data
import pandas as pd
data = {
'Name': ['John', 'Mary', 'Peter', 'Anna', 'John', 'Mike'],
'Age': [25, 32, 18, 47, 25, 23],
'Salary': [50000, 80000, 35000, 65000, 50000, 45000]
}
df = pd.DataFrame(data)
print(df)
Output (notice the duplicate 'John' row with the same age and same salary):
Name Age Salary
0 John 25 50000
1 Mary 32 80000
2 Peter 18 35000
3 Anna 47 65000
4 John 25 50000
5 Mike 23 45000
2. Drop Duplicates Based on a Pair of Columns (Name + Age)
Remove rows that match on both 'Name' and 'Age', which is useful when the same person shouldn't appear twice at the same age.
# Keep first occurrence of each unique (Name, Age) pair
df_clean = df.drop_duplicates(subset=['Name', 'Age'])
print(df_clean)
Output (second 'John' row removed because Name + Age match):
Name Age Salary
0 John 25 50000
1 Mary 32 80000
2 Peter 18 35000
3 Anna 47 65000
5 Mike 23 45000
Keep last occurrence instead:
df_clean_last = df.drop_duplicates(subset=['Name', 'Age'], keep='last')
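With keep='last' the surviving row flips: the later 'John' row (index 4) is kept and index 0 is dropped. A minimal self-contained sketch on the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Mary', 'Peter', 'Anna', 'John', 'Mike'],
    'Age': [25, 32, 18, 47, 25, 23],
    'Salary': [50000, 80000, 35000, 65000, 50000, 45000],
})

# keep='last' retains the final occurrence of each (Name, Age) pair,
# so index 0 is dropped and index 4 survives
df_clean_last = df.drop_duplicates(subset=['Name', 'Age'], keep='last')
print(df_clean_last.index.tolist())  # [1, 2, 3, 4, 5]
```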
3. Drop Duplicates Based on Multiple Columns (Name + Age + Salary)
Only consider rows duplicate if all three columns match — stricter deduplication.
df_multi = df.drop_duplicates(subset=['Name', 'Age', 'Salary'], keep='first')
print(df_multi)
Output (a row is dropped only when all three values match; the second 'John' row still goes here because its Salary is identical too, so the result matches section 2):
Name Age Salary
0 John 25 50000
1 Mary 32 80000
2 Peter 18 35000
3 Anna 47 65000
5 Mike 23 45000
4. In-Place Modification & Keeping Only Unique Rows
Modify the original DataFrame or drop rows that appear more than once anywhere.
# In-place removal (modifies df directly)
df.drop_duplicates(subset=['Name', 'Age'], inplace=True)
# Keep only rows that are completely unique across ALL columns
df_unique_all = df.drop_duplicates(keep=False) # drops any row that appears more than once
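For contrast, here is what keep=False does when scoped to the (Name, Age) pair. This sketch rebuilds the original sample data (so it runs independently of the in-place drop above): both 'John' rows disappear, because neither copy of a duplicated pair is kept.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Mary', 'Peter', 'Anna', 'John', 'Mike'],
    'Age': [25, 32, 18, 47, 25, 23],
    'Salary': [50000, 80000, 35000, 65000, 50000, 45000],
})

# keep=False removes every row whose (Name, Age) pair occurs more
# than once: both John rows are gone, not just the second one
df_strict = df.drop_duplicates(subset=['Name', 'Age'], keep=False)
print(df_strict['Name'].tolist())  # ['Mary', 'Peter', 'Anna', 'Mike']
```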
5. Modern Alternative in 2026: Polars
For large datasets, Polars is often faster and more memory-efficient.
import polars as pl
df_pl = pl.DataFrame(data)
df_pl_unique = df_pl.unique(subset=["Name", "Age"])
print(df_pl_unique)
Best Practices & Common Pitfalls
- Always check duplicates first: df.duplicated(subset=['Name', 'Age']).sum()
- Use subset; otherwise every column is compared (the strictest check)
- Decide between keep='first' (the default) and keep='last' based on which record you trust more (e.g., the latest data)
- Handle case sensitivity: normalize with df['Name'].str.lower() before deduping if needed
- After deduping, reset the index: df = df.reset_index(drop=True)
- For huge data, Polars or chunked processing is often better
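Putting these together, a typical normalize / pre-check / dedupe / reset pipeline might look like this (a sketch; name_key is a hypothetical helper column added only for the case-insensitive match):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['john', 'Mary', 'John', 'Mike'],
    'Age': [25, 32, 25, 23],
    'Salary': [50000, 80000, 50000, 45000],
})

# 1) Normalize case into a temporary key column
df['name_key'] = df['Name'].str.lower()

# 2) Pre-check: count duplicate (name_key, Age) pairs before dropping
print(df.duplicated(subset=['name_key', 'Age']).sum())  # 1

# 3) Dedupe on the key, drop the helper column, reset the index
df = (df.drop_duplicates(subset=['name_key', 'Age'])
        .drop(columns='name_key')
        .reset_index(drop=True))
print(df['Name'].tolist())  # ['john', 'Mary', 'Mike']
```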
Conclusion
Dropping duplicate pairs (or groups of columns) with .drop_duplicates() is a quick, essential step for cleaner data — especially when focusing on keys like Name + Age, ID + Date, or Email + Phone. In 2026, use Pandas for most work and Polars for scale. Master subsetting, keep logic, and pre-checks, and you'll avoid one of the most common sources of analysis errors.
Next time you see repeated names with matching attributes — reach for .drop_duplicates() — it's simple but prevents massive downstream issues.