Dropping duplicate pairs (or combinations of columns) is a critical cleaning step when dealing with real-world data — think duplicate customer records, repeated transactions, or merged datasets with overlapping entries. In Pandas, the .drop_duplicates() method lets you remove rows that have identical values in one or more specified columns (a "pair" or "group" of columns), while keeping control over which duplicate to retain.
In 2026, this operation is essential for data quality in analytics, machine learning prep, and reporting. Here’s a practical guide with real examples.
1. Basic Setup & Sample Data
import pandas as pd
data = {
'Name': ['John', 'Mary', 'Peter', 'Anna', 'John', 'Mike'],
'Age': [25, 32, 18, 47, 25, 23],
'Salary': [50000, 80000, 35000, 65000, 50000, 45000]
}
df = pd.DataFrame(data)
print(df)
Output (notice the duplicate 'John' row with the same age and same salary):
Name Age Salary
0 John 25 50000
1 Mary 32 80000
2 Peter 18 35000
3 Anna 47 65000
4 John 25 50000
5 Mike 23 45000
2. Drop Duplicates Based on a Pair of Columns (Name + Age)
Remove rows that match on both 'Name' and 'Age', which is useful when the same person shouldn't appear twice at the same age.
# Keep first occurrence of each unique (Name, Age) pair
df_clean = df.drop_duplicates(subset=['Name', 'Age'])
print(df_clean)
Output (second 'John' row removed because Name + Age match):
Name Age Salary
0 John 25 50000
1 Mary 32 80000
2 Peter 18 35000
3 Anna 47 65000
5 Mike 23 45000
Keep last occurrence instead:
df_clean_last = df.drop_duplicates(subset=['Name', 'Age'], keep='last')
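With keep='last' the surviving row flips: the later 'John' row (index 4) is kept and index 0 is dropped. A minimal self-contained sketch on the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Mary', 'Peter', 'Anna', 'John', 'Mike'],
    'Age': [25, 32, 18, 47, 25, 23],
    'Salary': [50000, 80000, 35000, 65000, 50000, 45000],
})

# keep='last' retains the final occurrence of each (Name, Age) pair,
# so index 0 is dropped and index 4 survives
df_clean_last = df.drop_duplicates(subset=['Name', 'Age'], keep='last')
print(df_clean_last.index.tolist())  # [1, 2, 3, 4, 5]
```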
3. Drop Duplicates Based on Multiple Columns (Name + Age + Salary)
Only consider rows duplicate if all three columns match — stricter deduplication.
df_multi = df.drop_duplicates(subset=['Name', 'Age', 'Salary'], keep='first')
print(df_multi)
Output (a row is dropped only when all three values match; the second 'John' row still goes here because its Salary is identical too, so the result matches section 2):
Name Age Salary
0 John 25 50000
1 Mary 32 80000
2 Peter 18 35000
3 Anna 47 65000
5 Mike 23 45000
4. In-Place Modification & Keeping Only Unique Rows
Modify the original DataFrame or drop rows that appear more than once anywhere.
# In-place removal (modifies df directly)
df.drop_duplicates(subset=['Name', 'Age'], inplace=True)
# Keep only rows that are completely unique across ALL columns
df_unique_all = df.drop_duplicates(keep=False) # drops any row that appears more than once
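For contrast, here is what keep=False does when scoped to the (Name, Age) pair. This sketch rebuilds the original sample data (so it runs independently of the in-place drop above): both 'John' rows disappear, because neither copy of a duplicated pair is kept.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Mary', 'Peter', 'Anna', 'John', 'Mike'],
    'Age': [25, 32, 18, 47, 25, 23],
    'Salary': [50000, 80000, 35000, 65000, 50000, 45000],
})

# keep=False removes every row whose (Name, Age) pair occurs more
# than once: both John rows are gone, not just the second one
df_strict = df.drop_duplicates(subset=['Name', 'Age'], keep=False)
print(df_strict['Name'].tolist())  # ['Mary', 'Peter', 'Anna', 'Mike']
```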
5. Modern Alternative in 2026: Polars
For large datasets, Polars is often faster and more memory-efficient.
import polars as pl
df_pl = pl.DataFrame(data)
df_pl_unique = df_pl.unique(subset=["Name", "Age"])
print(df_pl_unique)
Best Practices & Common Pitfalls
- Always check duplicates first: df.duplicated(subset=['Name', 'Age']).sum()
- Use subset; otherwise every column is compared (the strictest check)
- Decide between keep='first' (the default) and keep='last' based on which record you trust more (e.g., the latest data)
- Handle case sensitivity: normalize with df['Name'].str.lower() before deduping if needed
- After deduping, reset the index: df = df.reset_index(drop=True)
- For huge data, Polars or chunked processing is often better
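Putting these together, a typical normalize / pre-check / dedupe / reset pipeline might look like this (a sketch; name_key is a hypothetical helper column added only for the case-insensitive match):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['john', 'Mary', 'John', 'Mike'],
    'Age': [25, 32, 25, 23],
    'Salary': [50000, 80000, 50000, 45000],
})

# 1) Normalize case into a temporary key column
df['name_key'] = df['Name'].str.lower()

# 2) Pre-check: count duplicate (name_key, Age) pairs before dropping
print(df.duplicated(subset=['name_key', 'Age']).sum())  # 1

# 3) Dedupe on the key, drop the helper column, reset the index
df = (df.drop_duplicates(subset=['name_key', 'Age'])
        .drop(columns='name_key')
        .reset_index(drop=True))
print(df['Name'].tolist())  # ['john', 'Mary', 'Mike']
```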
Conclusion
Dropping duplicate pairs (or groups of columns) with .drop_duplicates() is a quick, essential step for cleaner data — especially when focusing on keys like Name + Age, ID + Date, or Email + Phone. In 2026, use Pandas for most work and Polars for scale. Master subsetting, keep logic, and pre-checks, and you'll avoid one of the most common sources of analysis errors.
Next time you see repeated names with matching attributes — reach for .drop_duplicates() — it's simple but prevents massive downstream issues.