Dropping duplicates is one of the most frequent cleaning steps in data analysis — especially when dealing with names, IDs, emails, or any key column that should be unique. In Pandas, the .drop_duplicates() method makes this fast and flexible, letting you remove duplicate rows based on one column, multiple columns, or the entire row.
In 2026, when datasets are often messy from merges, scraping, or user input, mastering duplicate removal saves time and prevents errors downstream. Here’s a practical guide with real examples.
1. Basic Setup & Sample Data
import pandas as pd
data = {
    'Name': ['John', 'Mary', 'Peter', 'Anna', 'John', 'Mike'],
    'Age': [25, 32, 18, 47, 25, 23],
    'Salary': [50000, 80000, 35000, 65000, 50000, 45000]
}
df = pd.DataFrame(data)
print(df)
Output (notice duplicate 'John' rows):
Name Age Salary
0 John 25 50000
1 Mary 32 80000
2 Peter 18 35000
3 Anna 47 65000
4 John 25 50000
5 Mike 23 45000
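Before dropping anything, it helps to confirm how many duplicates actually exist. A minimal sketch using the sample data above (.duplicated() marks each repeat of an earlier row as True, so summing it counts the duplicates):

```python
import pandas as pd

data = {
    'Name': ['John', 'Mary', 'Peter', 'Anna', 'John', 'Mike'],
    'Age': [25, 32, 18, 47, 25, 23],
    'Salary': [50000, 80000, 35000, 65000, 50000, 45000]
}
df = pd.DataFrame(data)

# Count full-row duplicates: row 4 repeats row 0, so the count is 1
print(df.duplicated().sum())                # 1
# Count duplicates on the Name column alone
print(df.duplicated(subset='Name').sum())   # 1
```

If either count is unexpectedly large, inspect the flagged rows with df[df.duplicated(keep=False)] before deciding which copies to keep.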
2. Drop Duplicates Based on One Column (e.g., Name)
Keep only the first occurrence of each unique name — common for deduplicating records.
# Keep first occurrence, drop later duplicates
df_clean = df.drop_duplicates(subset='Name')
print(df_clean)
Output:
Name Age Salary
0 John 25 50000
1 Mary 32 80000
2 Peter 18 35000
3 Anna 47 65000
5 Mike 23 45000
Keep last occurrence instead:
df_clean_last = df.drop_duplicates(subset='Name', keep='last')
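To see the difference keep='last' makes, note which row index survives. A quick sketch with the same sample data: the later 'John' row (index 4) is kept and the earlier one (index 0) is dropped.

```python
import pandas as pd

data = {
    'Name': ['John', 'Mary', 'Peter', 'Anna', 'John', 'Mike'],
    'Age': [25, 32, 18, 47, 25, 23],
    'Salary': [50000, 80000, 35000, 65000, 50000, 45000]
}
df = pd.DataFrame(data)

# keep='last' retains the final occurrence of each Name,
# so index 4 ('John') survives instead of index 0
df_last = df.drop_duplicates(subset='Name', keep='last')
print(df_last.index.tolist())  # [1, 2, 3, 4, 5]
```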
3. Drop Duplicates Based on Multiple Columns
Only treat rows as duplicates if both Name and Age match, which is useful when different records can legitimately share the same name but have different ages.
df_multi = df.drop_duplicates(subset=['Name', 'Age'], keep='first')
print(df_multi)
Output (both 'John' rows would be kept if their ages differed; here the ages match, so only one is kept):
Name Age Salary
0 John 25 50000
1 Mary 32 80000
2 Peter 18 35000
3 Anna 47 65000
5 Mike 23 45000
4. In-Place Modification & Keeping Only Unique
Modify the original DataFrame in place with inplace=True (note that the call then returns None), or keep only rows with no duplicates at all.
# In-place removal (modifies df directly)
df.drop_duplicates(subset='Name', inplace=True)
# Keep only rows that are completely unique across all columns
df_unique_all = df.drop_duplicates(keep=False) # drops any row that appears more than once
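keep=False behaves differently from keep='first'/'last': it removes every row involved in a duplicate group rather than keeping one representative. A short sketch with the sample data, where both 'John' rows are full-row duplicates of each other:

```python
import pandas as pd

data = {
    'Name': ['John', 'Mary', 'Peter', 'Anna', 'John', 'Mike'],
    'Age': [25, 32, 18, 47, 25, 23],
    'Salary': [50000, 80000, 35000, 65000, 50000, 45000]
}
df = pd.DataFrame(data)

# keep=False drops *all* members of a duplicate group,
# so both 'John' rows disappear entirely
df_unique_all = df.drop_duplicates(keep=False)
print(df_unique_all['Name'].tolist())  # ['Mary', 'Peter', 'Anna', 'Mike']
```

Use this when any duplicated record is suspect and you would rather discard the whole group than guess which copy is correct.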
5. Modern Alternative in 2026: Polars
For large datasets, Polars is often faster and more memory-efficient.
import polars as pl
df_pl = pl.DataFrame(data)
df_pl_unique = df_pl.unique(subset=["Name"])
print(df_pl_unique)
Best Practices & Common Pitfalls
- Always check df.duplicated().sum() first to see how many duplicates exist
- Specify subset, otherwise every column is compared (very strict)
- Use keep='first' (the default) or keep='last' depending on which record you trust more
- Handle case sensitivity: normalize with df['Name'].str.lower() before deduping if needed
- After deduping, reset the index if needed: df = df.reset_index(drop=True)
- For huge data, Polars or chunked processing is often better
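The case-sensitivity and reset-index tips above can be combined into one small pattern. A sketch that dedupes on a lowercased key without changing the displayed names (the '_key' helper column is a hypothetical name, dropped after the dedup):

```python
import pandas as pd

# 'John' and 'JOHN' are the same person but differ only by case
df = pd.DataFrame({'Name': ['John', 'JOHN', 'Mary'],
                   'Age': [25, 25, 32]})

# Dedupe on a normalized helper column, then drop it and reset the index
df['_key'] = df['Name'].str.lower()
df_clean = (df.drop_duplicates(subset='_key')
              .drop(columns='_key')
              .reset_index(drop=True))
print(df_clean['Name'].tolist())  # ['John', 'Mary']
```

Keeping the original column untouched means the cleaned data still shows names exactly as they were entered.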
Conclusion
Dropping duplicates with .drop_duplicates() is a quick win for cleaner data — especially when focusing on key columns like names, IDs, or emails. In 2026, use Pandas for most work and Polars for scale. Master subsetting, keep logic, and pre-checks, and you'll avoid one of the most common sources of analysis errors.
Next time you see duplicate names or IDs — reach for .drop_duplicates() — it's simple but saves massive headaches later.