Dropping duplicates is one of the most frequent cleaning steps in data analysis — especially when dealing with names, IDs, emails, or any key column that should be unique. In Pandas, the .drop_duplicates() method makes this fast and flexible, letting you remove duplicate rows based on one column, multiple columns, or the entire row.
In 2026, when datasets are often messy from merges, scraping, or user input, mastering duplicate removal saves time and prevents errors downstream. Here’s a practical guide with real examples.
1. Basic Setup & Sample Data
import pandas as pd
data = {
    'Name': ['John', 'Mary', 'Peter', 'Anna', 'John', 'Mike'],
    'Age': [25, 32, 18, 47, 25, 23],
    'Salary': [50000, 80000, 35000, 65000, 50000, 45000]
}
df = pd.DataFrame(data)
print(df)
Output (notice duplicate 'John' rows):
Name Age Salary
0 John 25 50000
1 Mary 32 80000
2 Peter 18 35000
3 Anna 47 65000
4 John 25 50000
5 Mike 23 45000
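Before dropping anything, it helps to confirm how many duplicates actually exist. A minimal sketch using the sample data above (.duplicated() marks each repeat of an earlier row as True, so summing it counts the duplicates):

```python
import pandas as pd

data = {
    'Name': ['John', 'Mary', 'Peter', 'Anna', 'John', 'Mike'],
    'Age': [25, 32, 18, 47, 25, 23],
    'Salary': [50000, 80000, 35000, 65000, 50000, 45000]
}
df = pd.DataFrame(data)

# Count full-row duplicates: row 4 repeats row 0, so the count is 1
print(df.duplicated().sum())                # 1
# Count duplicates on the Name column alone
print(df.duplicated(subset='Name').sum())   # 1
```

If either count is unexpectedly large, inspect the flagged rows with df[df.duplicated(keep=False)] before deciding which copies to keep.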
2. Drop Duplicates Based on One Column (e.g., Name)
Keep only the first occurrence of each unique name — common for deduplicating records.
# Keep first occurrence, drop later duplicates
df_clean = df.drop_duplicates(subset='Name')
print(df_clean)
Output:
Name Age Salary
0 John 25 50000
1 Mary 32 80000
2 Peter 18 35000
3 Anna 47 65000
5 Mike 23 45000
Keep last occurrence instead:
df_clean_last = df.drop_duplicates(subset='Name', keep='last')
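To see the difference keep='last' makes, note which row index survives. A quick sketch with the same sample data: the later 'John' row (index 4) is kept and the earlier one (index 0) is dropped.

```python
import pandas as pd

data = {
    'Name': ['John', 'Mary', 'Peter', 'Anna', 'John', 'Mike'],
    'Age': [25, 32, 18, 47, 25, 23],
    'Salary': [50000, 80000, 35000, 65000, 50000, 45000]
}
df = pd.DataFrame(data)

# keep='last' retains the final occurrence of each Name,
# so index 4 ('John') survives instead of index 0
df_last = df.drop_duplicates(subset='Name', keep='last')
print(df_last.index.tolist())  # [1, 2, 3, 4, 5]
```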
3. Drop Duplicates Based on Multiple Columns
Only treat rows as duplicates if both Name and Age match, which is useful when different records can legitimately share the same name but have different ages.
df_multi = df.drop_duplicates(subset=['Name', 'Age'], keep='first')
print(df_multi)
Output (both 'John' rows would be kept if their ages differed; here the ages match, so only one is kept):
Name Age Salary
0 John 25 50000
1 Mary 32 80000
2 Peter 18 35000
3 Anna 47 65000
5 Mike 23 45000
4. In-Place Modification & Keeping Only Unique
Modify the original DataFrame in place with inplace=True (note that the call then returns None), or keep only rows with no duplicates at all.
# In-place removal (modifies df directly)
df.drop_duplicates(subset='Name', inplace=True)
# Keep only rows that are completely unique across all columns
df_unique_all = df.drop_duplicates(keep=False) # drops any row that appears more than once
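keep=False behaves differently from keep='first'/'last': it removes every row involved in a duplicate group rather than keeping one representative. A short sketch with the sample data, where both 'John' rows are full-row duplicates of each other:

```python
import pandas as pd

data = {
    'Name': ['John', 'Mary', 'Peter', 'Anna', 'John', 'Mike'],
    'Age': [25, 32, 18, 47, 25, 23],
    'Salary': [50000, 80000, 35000, 65000, 50000, 45000]
}
df = pd.DataFrame(data)

# keep=False drops *all* members of a duplicate group,
# so both 'John' rows disappear entirely
df_unique_all = df.drop_duplicates(keep=False)
print(df_unique_all['Name'].tolist())  # ['Mary', 'Peter', 'Anna', 'Mike']
```

Use this when any duplicated record is suspect and you would rather discard the whole group than guess which copy is correct.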
5. Modern Alternative in 2026: Polars
For large datasets, Polars is often faster and more memory-efficient.
import polars as pl
df_pl = pl.DataFrame(data)
df_pl_unique = df_pl.unique(subset=["Name"])
print(df_pl_unique)
Best Practices & Common Pitfalls
- Always check df.duplicated().sum() first to see how many duplicates exist
- Specify subset, otherwise every column is compared (very strict)
- Use keep='first' (the default) or keep='last' depending on which record you trust more
- Handle case sensitivity: normalize with df['Name'].str.lower() before deduping if needed
- After deduping, reset the index if needed: df = df.reset_index(drop=True)
- For huge data, Polars or chunked processing is often better
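The case-sensitivity and reset-index tips above can be combined into one small pattern. A sketch that dedupes on a lowercased key without changing the displayed names (the '_key' helper column is a hypothetical name, dropped after the dedup):

```python
import pandas as pd

# 'John' and 'JOHN' are the same person but differ only by case
df = pd.DataFrame({'Name': ['John', 'JOHN', 'Mary'],
                   'Age': [25, 25, 32]})

# Dedupe on a normalized helper column, then drop it and reset the index
df['_key'] = df['Name'].str.lower()
df_clean = (df.drop_duplicates(subset='_key')
              .drop(columns='_key')
              .reset_index(drop=True))
print(df_clean['Name'].tolist())  # ['John', 'Mary']
```

Keeping the original column untouched means the cleaned data still shows names exactly as they were entered.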
Conclusion
Dropping duplicates with .drop_duplicates() is a quick win for cleaner data — especially when focusing on key columns like names, IDs, or emails. In 2026, use Pandas for most work and Polars for scale. Master subsetting, keep logic, and pre-checks, and you'll avoid one of the most common sources of analysis errors.
Next time you see duplicate names or IDs — reach for .drop_duplicates() — it's simple but saves massive headaches later.