Grouping and capturing with the re module

Grouping and capturing is one of the most powerful features of Python’s re module. Parentheses () create groups that capture matched substrings for later use, so you can extract specific parts of a match, apply quantifiers to subpatterns, or reference captured text in replacements. Groups are numbered from 1, left to right, by the position of their opening parenthesis; group 0 is the whole match. You access them through the Match object’s .group(n) and .groups() methods, or .groupdict() for named groups. Named groups, written (?P<name>pattern), make code more readable and self-documenting. In 2026, grouping and capturing remain essential: they are used constantly in data extraction (names, emails, IDs), log parsing, text transformation, validation, and vectorized pandas/Polars string operations, where subpattern capture is applied across whole columns.

Here’s a complete, practical guide to grouping and capturing in Python regex: basic groups, named groups, capturing vs non-capturing, backreferences, real-world patterns, and modern best practices with raw strings, flags, compilation, and pandas/Polars integration.

Basic grouping with () — captures matched text and allows quantifiers to apply to the group.


import re

text = "John Doe, jane doe, Jim Smith"

# Group first and last names
pattern = r"(\w+) (\w+)"
matches = re.findall(pattern, text)
print(matches)
# [('John', 'Doe'), ('jane', 'doe'), ('Jim', 'Smith')]

# Access groups from search/match
match = re.search(pattern, text)
if match:
    print(match.group(0))   # full match: John Doe
    print(match.group(1))   # first group: John
    print(match.group(2))   # second group: Doe
    print(match.groups())   # ('John', 'Doe')
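
The point about quantifiers applying to a whole group can be sketched as follows. This is a minimal illustration with a made-up sample string (not from the examples above); it also shows that a repeated capturing group keeps only its last repetition.

```python
import re

# Hypothetical sample input, chosen only for illustration.
sample = "ababab-cd"

# The + applies to the whole group "ab", not just to the "b".
m = re.match(r"(ab)+", sample)
full_match = m.group(0)     # 'ababab'
last_repeat = m.group(1)    # 'ab' (only the final repetition is kept)

# To capture the whole repeated run, repeat a non-capturing group
# inside an outer capturing group.
m2 = re.match(r"((?:ab)+)", sample)
whole_run = m2.group(1)     # 'ababab'

print(full_match, last_repeat, whole_run)
```

If you need every repetition separately, re.findall() or re.finditer() on the inner pattern is usually the simpler route.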

Named groups with (?P<name>pattern): access captures by name with .group('name') or .groupdict().


pattern_named = r"(?P<first>\w+) (?P<last>\w+)"
match = re.search(pattern_named, text)
if match:
    print(match.group('first'))   # John
    print(match.groupdict())      # {'first': 'John', 'last': 'Doe'}
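
When there are several matches, named groups combine naturally with re.finditer(), which yields one Match object per hit. The sketch below assumes its own sample string and variable names (sample, name_re), which are illustrative only. It also shows that named groups still carry ordinary numbers.

```python
import re

# Hypothetical sample input and names, for illustration only.
sample = "John Doe, jane doe, Jim Smith"
name_re = re.compile(r"(?P<first>\w+) (?P<last>\w+)")

# One dict per match, keyed by group name.
people = [m.groupdict() for m in name_re.finditer(sample)]
print(people)
# [{'first': 'John', 'last': 'Doe'}, {'first': 'jane', 'last': 'doe'},
#  {'first': 'Jim', 'last': 'Smith'}]

# Named groups are numbered too; groupindex maps name -> number.
print(dict(name_re.groupindex))   # {'first': 1, 'last': 2}

first_hit = name_re.search(sample)
same_value = first_hit.group("last") == first_hit.group(2)   # True
```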

Non-capturing groups (?:pattern) — group for structure/quantifiers without capturing (no group number or name).


# Non-capturing for alternation without capturing
pattern_noncap = r"(?:Mr|Mrs|Ms)\s+(\w+)"
names = re.findall(pattern_noncap, "Mr Smith, Mrs Jones, Ms Davis")
print(names)   # ['Smith', 'Jones', 'Davis'] (only the name captured)
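
To see why the choice matters, it can help to compare what re.findall() returns for the same input as the number of capturing groups changes. This is a small sketch; the variable names are illustrative.

```python
import re

titles = "Mr Smith, Mrs Jones, Ms Davis"

# Two capturing groups: findall returns a list of tuples.
both = re.findall(r"(Mr|Mrs|Ms)\s+(\w+)", titles)
print(both)        # [('Mr', 'Smith'), ('Mrs', 'Jones'), ('Ms', 'Davis')]

# One capturing group: findall returns just that group's text.
name_only = re.findall(r"(?:Mr|Mrs|Ms)\s+(\w+)", titles)
print(name_only)   # ['Smith', 'Jones', 'Davis']

# No capturing groups: findall returns the whole match.
whole = re.findall(r"(?:Mr|Mrs|Ms)\s+\w+", titles)
print(whole)       # ['Mr Smith', 'Mrs Jones', 'Ms Davis']

# A compiled pattern reports how many capturing groups it has.
group_count = re.compile(r"(?:Mr|Mrs|Ms)\s+(\w+)").groups   # 1
```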

Backreferences — reuse captured groups in the pattern with \1, \2, etc. (or (?P=name) for named groups).


# Find repeated words
repeated = re.findall(r'\b(\w+)\s+\1\b', "hello hello world world")
print(repeated)   # ['hello', 'world']

# Swap first and last names using backreferences in replacement
swapped = re.sub(r"(\w+) (\w+)", r"\2, \1", text)
print(swapped)
# Doe, John, doe, jane, Smith, Jim
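
The named form mentioned above, (?P=name), works the same way inside a pattern, and in a replacement string the matching syntax is \g<name>. A minimal sketch with illustrative variable names:

```python
import re

# (?P=word) refers back to the named group within the pattern.
dup_re = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b")

# \g<word> refers to the named group in the replacement string.
deduped = dup_re.sub(r"\g<word>", "hello hello world world")
print(deduped)   # hello world

# \g<1> also avoids ambiguity when a digit follows the reference:
# r"\10" would be read as group 10, while r"\g<1>0" is group 1 then "0".
padded = re.sub(r"(\d)", r"\g<1>0", "a1b2")
print(padded)    # a10b20
```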

Real-world pattern: extracting structured data in pandas — vectorized .str.extract() uses capturing groups to pull fields into new columns efficiently.


import pandas as pd

df = pd.DataFrame({
    'log': [
        "ERROR: connection failed at 2023-03-15",
        "INFO: data loaded successfully",
        "WARNING: low memory at 14:30"
    ]
})

# Extract level and message using named groups
df[['level', 'message']] = df['log'].str.extract(r'^(?P<level>ERROR|INFO|WARNING):\s+(?P<message>.*)')

# Show only the extracted columns (the full log column is omitted for width)
print(df[['level', 'message']])
#      level                          message
# 0    ERROR  connection failed at 2023-03-15
# 1     INFO         data loaded successfully
# 2  WARNING              low memory at 14:30

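Best practices make grouping and capturing safe, readable, and performant. Use capturing groups only when you need the text, and use non-capturing (?:...) groups for pure structure so that group numbering stays stable and nothing is captured needlessly. Prefer named groups, (?P<name>pattern), which are clearer and self-documenting. Use raw strings, r'pattern', to avoid double-escaping backslashes. Compile patterns with re.compile() when they are reused; the re module already caches recently used patterns, so the gain is mostly clarity plus a small saving per call. Use backreferences to reuse captured text: \1 or (?P=name) inside a pattern, \1 or \g<name> in a replacement. Handle the no-match case: re.search() and re.match() return None, so check the result before calling .group(), while re.findall() simply returns an empty list. Use re.escape() when inserting literal substrings into a pattern. Add type hints (str for scalars, pd.Series for columns) to help static analysis. For tabular data, combine groups with pandas: df['col'].str.extract(r'(?P<name>pattern)') extracts over a whole column and names the output column after the group. For large text columns, consider Polars, where pl.col("text").str.extract(pattern, group_index=1) returns a single group and .str.extract_groups(pattern) returns all groups as a struct; it is often considerably faster than pandas .str.extract(), though the speedup depends on the data and the pattern.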
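
Several of these practices can be combined in one short sketch using only the standard library. The helper names (LOG_RE, parse_line, value_for) and the sample lines are hypothetical, introduced only for illustration; this is one possible arrangement, not a prescribed one.

```python
import re
from typing import Dict, Optional

# Compiled once at import time and reused for every line.
LOG_RE = re.compile(r"^(?P<level>[A-Z]+):\s+(?P<message>.*)$")

def parse_line(line: str) -> Optional[Dict[str, str]]:
    """Return the named groups as a dict, or None when the line does not match."""
    match = LOG_RE.match(line)
    return match.groupdict() if match else None

def value_for(key: str, line: str) -> Optional[str]:
    """Return the token after 'key=' in line, treating key as literal text."""
    # re.escape() stops characters such as '.' in key from acting as regex syntax.
    match = re.search(rf"{re.escape(key)}=(\S+)", line)
    return match.group(1) if match else None

print(parse_line("ERROR: connection failed"))
# {'level': 'ERROR', 'message': 'connection failed'}
print(parse_line("no level here"))                    # None
print(value_for("user.id", "user.id=42 status=ok"))   # 42
print(value_for("user.id", "userXid=7"))              # None
```

Without re.escape(), the "." in "user.id" would match any character, so the last call would return "7" instead of None.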

Grouping and capturing with () and named groups (?P<name>...) let you extract, structure, and reuse matched text in regex. In 2026, use named groups for clarity, non-capturing groups for structure, compile patterns, vectorize in pandas/Polars, and use raw strings. Master grouping and capturing, and you’ll extract, transform, and validate text patterns with precision and efficiency.

Next time you need to pull specific parts from a match — use groups. It’s Python’s cleanest way to say: “Capture this piece and use it later.”
