The Scrapy Selector, Scrapy's built-in parsing engine, is a powerful and flexible tool for extracting structured data from HTML and XML in web scraping projects. The Selector class (and its companion SelectorList) lets you navigate and query document trees using both XPath and CSS selectors, with built-in support for regular expressions, attribute extraction, text cleaning, and chaining. It handles lists of elements seamlessly, integrates natively with Scrapy spiders, items, pipelines, and middleware, and in most cases is faster and more robust than standalone BeautifulSoup or lxml, which makes it well suited to large-scale, production-grade scraping where reliability, speed, and maintainability matter.
Here’s a complete, practical introduction to the Scrapy Selector: core concepts, XPath vs CSS, common extraction patterns, real-world examples, advanced usage, and modern best practices with type hints, performance, error handling, and pandas/Polars integration.
Basic usage: create a Selector from raw HTML text or a Scrapy response, then query it with .xpath() or .css().
from scrapy import Selector
# From raw HTML string
html = """
<html>
  <head><title>My Web Page</title></head>
  <body>
    <h1>Welcome!</h1>
    <p class="intro">This is the intro.</p>
    <p>Main content here.</p>
  </body>
</html>
"""
sel = Selector(text=html)
# Extract title with XPath
title = sel.xpath('//title/text()').get()
print(title) # My Web Page
# Extract intro paragraph with CSS
intro = sel.css('p.intro::text').get()
print(intro) # This is the intro.
# Get all paragraphs' text
content = sel.css('p::text').getall()
print(content) # ['This is the intro.', 'Main content here.']
From a Scrapy response (real spider context): the response object exposes the same .xpath()/.css() API, delegating to the Selector held in its .selector attribute.
# In a Scrapy spider
def parse(self, response):
    # response delegates .xpath()/.css() to its underlying Selector
    title = response.xpath('//title/text()').get()
    intro = response.css('p.intro::text').get()
    all_links = response.css('a::attr(href)').getall()
    yield {
        'title': title,
        'intro': intro,
        'links_count': len(all_links),
    }
Advanced selector features: chaining, relative XPath/CSS, regex, attribute extraction, and text cleaning.
# Chaining selectors
container = sel.css('.container')
h1_text = container.xpath('.//h1/text()').get() # relative XPath with .
# Regex extraction (.re() already returns a list of strings; no .getall() needed)
emails = sel.xpath('//a[contains(@href, "mailto")]/@href').re(r'mailto:([\w\.-]+@[\w\.-]+)')
# Attribute extraction: hrefs of every link that has one
links = sel.css('a[href]::attr(href)').getall()
# Clean text (remove whitespace)
clean_intro = sel.css('p.intro::text').get(default='').strip()
Real-world pattern: structured extraction in Scrapy spiders. Yield items with cleaned, typed data for pipelines and export.
# Example item extraction
def parse_product(self, response):
    yield {
        'name': response.css('h1.product-name::text').get(default='').strip(),
        'price': response.css('.price::text').re_first(r'[\d,.]+'),  # first numeric match
        'url': response.url,
        'description': ' '.join(response.css('.description p::text').getall()).strip(),
    }
Best practices make Selector usage safe, readable, and performant:
- Prefer CSS selectors for readability, e.g. .css('div.container h1::text'); reach for XPath when you need complex traversal.
- Use .get() (first match) and .getall() (all matches); the older .extract()/.extract_first() aliases still work, but the newer names are the recommended API.
- Use relative selectors (.xpath('.//p') or .css('p')) after narrowing scope to a container.
- Handle missing data explicitly: .get(default='') for single values; .getall() already returns [] when nothing matches.
- Use the ::text and ::attr(name) CSS pseudo-elements; they are cleaner than XPath for common cases.
- Pass a precompiled pattern (re.compile(r'pattern')) to .re()/.re_first() when the same regex runs many times.
- Combine Selector-based extraction with scrapy.linkextractors.LinkExtractor for smart link following.
- Add type hints, e.g. response: scrapy.http.TextResponse, to improve spider clarity.
- For large-scale post-processing, hand Scrapy's output to pandas or Polars (e.g. pl.from_pandas(df)) for fast cleaning and aggregation.
- Respect robots.txt and rate limits; Scrapy supports both through settings such as ROBOTSTXT_OBEY and DOWNLOAD_DELAY.
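Several of these practices (precompiled regexes, explicit handling of missing values, typed output for pipelines) come together when post-processing scraped strings. A stdlib-only sketch of cleaning a price fragment like the one .re_first(r'[\d,.]+') would return; the function name is hypothetical:

```python
import re
from typing import Optional

# Same numeric pattern as the re_first example above, precompiled for reuse
PRICE_RE = re.compile(r'[\d,.]+')

def parse_price(raw: Optional[str]) -> Optional[float]:
    """Turn a scraped fragment like ' $1,299.00 ' into a float, or None."""
    if not raw:
        return None
    match = PRICE_RE.search(raw)
    if not match:
        return None
    text = match.group().replace(',', '')  # drop thousands separators
    try:
        return float(text)
    except ValueError:  # e.g. a stray '.' with no digits
        return None

print(parse_price(' $1,299.00 '))   # 1299.0
print(parse_price('no price here')) # None
```

Doing this conversion in the spider (or an item pipeline) means downstream pandas/Polars code receives real numbers instead of strings.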
The Scrapy Selector is a powerful HTML/XML extraction tool: XPath and CSS queries, regex, chaining, relative queries, and seamless handling of element lists. Prefer CSS for readability and XPath for power, use .get()/.getall(), precompile frequently used regexes, and integrate with Polars for scale. Master the Selector and you'll extract structured data from almost any website reliably and efficiently.
Next time you parse a webpage in Scrapy, reach for the Selector. It's Python's cleanest way to say: "Find and extract exactly what I need from this HTML."