Substitution in Regular Expressions – Complete Guide for Data Science 2026
Substitution is one of the most useful parts of regular expressions in Python. Where re.search() and re.findall() only locate patterns, re.sub() finds every match and replaces it with new text, or with a value computed dynamically from the match. In data science, substitution is used routinely for cleaning messy logs, anonymizing data, standardizing formats, fixing typos, and transforming raw text into structured features.
TL;DR — Core Substitution Tools
- re.sub(pattern, repl, text) → replace all matches
- re.subn(pattern, repl, text) → replace + return count
- Backreferences: \1, \g<1>
- Callable replacement function (dynamic logic)
- pandas .str.replace(..., regex=True) for vectorized substitution
1. Basic Substitution
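import re
text = "Order ORD-12345 placed for $1,250.75 on 2026-03-19"
# Simple replacement: every order ID becomes a fixed placeholder
clean = re.sub(r"ORD-\d+", "ORDER_ID", text)
print(clean)  # Order ORDER_ID placed for $1,250.75 on 2026-03-19
# Backreference substitution: \1 re-inserts the captured date
clean = re.sub(r"(\d{4}-\d{2}-\d{2})", r"DATE:\1", text)
print(clean)  # Order ORD-12345 placed for $1,250.75 on DATE:2026-03-19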
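One detail worth illustrating: when the replacement text puts a digit directly after a group reference, \1 becomes ambiguous, and \g<1> removes the ambiguity. The following is a minimal sketch; the sample string and the variable names `padded` and `dmy` are made up for illustration.

```python
import re

text = "Order ORD-12345 placed on 2026-03-19"

# Goal: <captured prefix> + "0" + <captured number>.
# Writing r"\10\2" would be read as a reference to group 10,
# so \g<1> is used to end the group reference explicitly.
padded = re.sub(r"(ORD-)(\d+)", r"\g<1>0\g<2>", text)
print(padded)  # Order ORD-012345 placed on 2026-03-19

# Named groups use the same \g<...> syntax and keep longer
# replacements readable.
dmy = re.sub(
    r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})",
    r"\g<day>/\g<month>/\g<year>",
    text,
)
print(dmy)  # Order ORD-12345 placed on 19/03/2026
```

For plain reordering with no adjacent digits, \1 and \g<1> behave the same, so the shorter form is usually fine.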
2. Real-World Data Science Examples with Pandas
import pandas as pd
df = pd.read_csv("logs.csv")  # expects a text column named "log"
# Example 1: Standardize currency format ($1,250.75 -> USD 1,250.75)
# Note: "amount" receives the whole log line with the currency rewritten
df["amount"] = df["log"].str.replace(r"\$(\d+(?:,\d{3})*(?:\.\d+)?)",
                                     r"USD \1", regex=True)
# Example 2: Anonymize emails
df["log"] = df["log"].str.replace(r"\S+@\S+", "[EMAIL]", regex=True)
# Example 3: Convert dates to ISO format (assumes MM/DD/YYYY input)
df["log"] = df["log"].str.replace(r"(\d{2})/(\d{2})/(\d{4})", r"\3-\1-\2", regex=True)
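Since logs.csv is not shown here, a self-contained sketch with a small in-memory DataFrame may be easier to experiment with. The sample log lines and the `clean` column name are invented for illustration, and the date step assumes the input is in MM/DD/YYYY order.

```python
import pandas as pd

# Hypothetical sample data standing in for logs.csv
df = pd.DataFrame({
    "log": [
        "03/19/2026 alice@example.com paid $1,250.75",
        "03/20/2026 bob@example.org paid $40",
    ]
})

df["clean"] = (
    df["log"]
    # $1,250.75 -> USD 1,250.75 (the dollar sign must be escaped)
    .str.replace(r"\$(\d+(?:,\d{3})*(?:\.\d+)?)", r"USD \1", regex=True)
    # Replace anything that looks like an email address
    .str.replace(r"\S+@\S+", "[EMAIL]", regex=True)
    # MM/DD/YYYY -> YYYY-MM-DD (assumed input order)
    .str.replace(r"(\d{2})/(\d{2})/(\d{4})", r"\3-\1-\2", regex=True)
)

print(df["clean"].tolist())
# ['2026-03-19 [EMAIL] paid USD 1,250.75', '2026-03-20 [EMAIL] paid USD 40']
```

Chaining the calls keeps each rule on its own line, which makes it easier to reorder or disable a single step while debugging.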
3. Advanced Substitution with Backreferences & Functions
# Backreferences with groups
text = "Contact: Alice (alice@example.com)"
print(re.sub(r"(\w+) \((\S+@\S+)\)", r"\1 <\2>", text))  # Contact: Alice <alice@example.com>
# Callable function (dynamic substitution)
def mask_number(match):
    # Replace each run of digits with the same number of asterisks
    return "*" * len(match.group(0))
print(re.sub(r"\d+", mask_number, "Order ID 98765, amount 1250.75"))  # Order ID *****, amount ****.**
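A callable becomes more useful once it reads individual groups from the match. The sketch below shows one possible partial-masking scheme; the helper names `mask_email` and `keep_last_two`, the simplified email pattern, and the sample text are all assumptions made for illustration, not a complete anonymization method.

```python
import re

def mask_email(match: re.Match) -> str:
    """Keep the first character and the domain, mask the rest of the local part."""
    local, domain = match.group(1), match.group(2)
    return local[0] + "*" * (len(local) - 1) + "@" + domain

def keep_last_two(match: re.Match) -> str:
    """Mask all but the last two digits of a number."""
    digits = match.group(0)
    return "*" * (len(digits) - 2) + digits[-2:]

text = "Contact alice@example.com about order 98765"

# Group 1 is the local part, group 2 is the domain (simplified pattern).
masked = re.sub(r"([\w.+-]+)@([\w-]+(?:\.[\w-]+)+)", mask_email, text)
# Only numbers with at least three digits are masked.
masked = re.sub(r"\d{3,}", keep_last_two, masked)
print(masked)  # Contact a****@example.com about order ***65
```

Because the function receives the full match object, it can branch on group contents, look values up in a dictionary, or do arithmetic, none of which a plain replacement string can express.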
4. Best Practices in 2026
- Use raw strings r"..." for every regex pattern
- Prefer backreferences \1 or \g<1> for simple reordering
- Use a callable function when logic is complex
- Use re.subn() when you need to know how many changes were made
- On DataFrames, prefer pandas .str.replace(..., regex=True) over row-by-row Python loops
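Since re.subn() is listed above but not demonstrated, here is a minimal sketch of using its count to report how many replacements a cleaning pass made. The sample lines and variable names are invented for illustration.

```python
import re

lines = [
    "ORD-12345 shipped, ORD-67890 pending",
    "no order ids here",
]

total = 0
cleaned = []
for line in lines:
    # subn returns a (new_string, number_of_substitutions) tuple
    new_line, n = re.subn(r"ORD-\d+", "ORDER_ID", line)
    cleaned.append(new_line)
    total += n

print(cleaned)  # ['ORDER_ID shipped, ORDER_ID pending', 'no order ids here']
print(total)    # 2
```

A count of zero on data where matches are expected is a cheap signal that the pattern or the input format has drifted.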
Conclusion
Substitution turns regular expressions from simple search tools into powerful data transformation engines. In 2026 data science projects, mastering re.sub(), backreferences, callable replacements, and pandas vectorized substitution is essential for cleaning, anonymizing, standardizing, and feature-engineering text at scale. These techniques complete the regex workflow and prepare your data for analysis and modeling.
Next steps:
- Take one of your current text-cleaning scripts and replace manual string operations with re.sub() for cleaner, more powerful results
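As a starting point for that exercise, the sketch below compares a chain of manual string operations with a single re.sub() call on the same input. The sample string and the choice of separators are assumptions for illustration; a real script may need a different character set.

```python
import re

raw = "  user_name--first.last   2026  "

# Manual approach: one .replace() call per separator, then split/join
# to collapse the repeated spaces that are left behind.
manual = raw.strip().replace("_", " ").replace("-", " ").replace(".", " ")
manual = " ".join(manual.split())

# Regex approach: one character class covers every separator,
# and + collapses runs of them into a single space.
with_regex = re.sub(r"[\s_.-]+", " ", raw).strip()

print(manual)      # user name first last 2026
print(with_regex)  # user name first last 2026
```

Both versions give the same result here; the regex version keeps the full list of separators in one place, so adding another one is a one-character change.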