Greedy vs. nongreedy matching is one of the most important concepts to understand in regular expressions — it determines how quantifiers (*, +, ?, {m,n}) behave when matching repeating patterns. By default, quantifiers are greedy: they match as much text as possible while still allowing the overall pattern to succeed. Adding a question mark after the quantifier makes it nongreedy (lazy): it matches as little text as possible while still allowing the pattern to succeed. This difference is crucial for avoiding over-matching (greedy) or under-matching (nongreedy), especially when parsing HTML, XML, logs, quoted strings, or any text with nested or repeated structures. In 2026, mastering greedy vs nongreedy behavior remains essential: it comes up constantly in data extraction, cleaning, validation, and vectorized pandas/Polars string operations, where precise control over repetition scales efficiently across large datasets.
Here’s a complete, practical guide to greedy vs nongreedy matching in Python’s re module: how greediness works, lazy quantifiers, possessive quantifiers, real-world patterns, and modern best practices with raw strings, flags, compilation, and pandas/Polars integration.
Greedy quantifiers match as much as possible — they expand outward until the rest of the pattern can still match, often consuming more text than intended.
import re
html = '<b>Hello</b> <i>World!</i>'
# Greedy: .* matches as much as possible
greedy = re.findall(r'<.*>', html)
print(greedy)
# ['<b>Hello</b> <i>World!</i>'] (over-matches: one match from the first < to the last >)
Nongreedy (lazy) quantifiers (*?, +?, ??, {m,n}?) match as little as possible — they expand only enough for the overall pattern to succeed, stopping at the first valid match.
lazy = re.findall(r'<.*?>', html)
print(lazy)
# ['<b>', '</b>', '<i>', '</i>'] (stops at each closing >)
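Quoted strings show the same contrast. A minimal sketch (the sample sentence here is an illustrative assumption, not from the original):

```python
import re

line = 'say "hello" and "world" now'

# Greedy: .* runs all the way to the LAST closing quote
print(re.findall(r'".*"', line))   # ['"hello" and "world"']

# Nongreedy: .*? stops at the FIRST closing quote after each opening one
print(re.findall(r'".*?"', line))  # ['"hello"', '"world"']
```

The greedy version collapses both quoted phrases into one over-long match; the lazy version captures each quoted phrase separately.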
Possessive quantifiers (*+, ++, ?+, {m,n}+ in Python 3.11+) are greedy and prevent backtracking — they match as much as possible and never give back characters, improving performance when backtracking would fail anyway.
text = "aaaa"
print(re.findall(r'a*+', text)) # ['aaaa', ''] (possessive, no backtracking; the trailing '' is the empty match at the end of the string)
Real-world pattern: parsing HTML-like tags or quoted strings in pandas — nongreedy quantifiers prevent over-matching across multiple tags or quotes.
import pandas as pd
df = pd.DataFrame({
    'html': [
        '<b>Hello</b> <i>World</i>',
        '<div>Test</div>',
        '<a href="#">Click</a>',
    ]
})
# Greedy — over-matches
df['greedy_tags'] = df['html'].str.findall(r'<.*>')
# Nongreedy — matches each tag
df['lazy_tags'] = df['html'].str.findall(r'<.*?>')
print(df)
Best practices make greedy/nongreedy matching safe, readable, and performant. Prefer nongreedy quantifiers (*?, +?) when matching bounded structures (HTML tags, quoted strings) — this prevents over-consuming text. Use greedy quantifiers when you want maximal matching (e.g., everything between the first and last markers). Modern tip: for large text columns, consider Polars — pl.col("text").str.extract_all(r'<.*?>') is often substantially faster than pandas .str.findall(). Add type hints — str or pd.Series[str] — to improve static analysis. Compile patterns with re.compile() for repeated use — faster and clearer. Use raw strings r'pattern' to avoid double-escaping backslashes. Use possessive quantifiers like *+ (Python 3.11+) for performance when backtracking is unwanted. Avoid catastrophic backtracking — limit nested quantifiers or use atomic groups (?>...), also Python 3.11+. Combine with pandas .str — df['col'].str.findall(r'<.*?>') for vectorized tag extraction. Use re.escape() when embedding literal substrings in patterns.
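The compile and re.escape() tips above can be sketched together. The strip_tags helper and the 'price ($)' marker are hypothetical examples chosen for illustration:

```python
import re

# Compile once, reuse across many strings — faster in loops and clearer
TAG = re.compile(r'<.*?>')  # nongreedy: one match per tag

def strip_tags(s: str) -> str:
    """Remove every HTML-like tag from a string, nongreedily."""
    return TAG.sub('', s)

print(strip_tags('<b>bold</b> and <i>italic</i>'))  # bold and italic

# re.escape() makes a literal substring safe to embed in a pattern:
# '(', ')', and '$' would otherwise be regex metacharacters
marker = 'price ($)'
pat = re.compile(re.escape(marker) + r':\s*(\d+)')
print(pat.search('price ($): 42').group(1))  # 42
```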
Greedy vs nongreedy matching controls how quantifiers consume text — greedy takes as much as possible, nongreedy as little as possible, possessive prevents backtracking. In 2026, use nongreedy for bounded structures, greedy for maximal, compile patterns, vectorize in pandas/Polars, and use raw strings. Master greedy/nongreedy behavior, and you’ll parse, extract, and clean text patterns with precision and efficiency.
Next time you match repeating patterns — choose greedy or nongreedy wisely. It’s Python’s cleanest way to say: “Match as much/little as needed.”