Setting up a Selector in Scrapy is the first step toward powerful, structured web scraping. The Selector (and its list counterpart, SelectorList) is Scrapy's built-in parsing engine: it lets you query HTML/XML with XPath and CSS selectors, extract text and attributes, apply regexes, clean data, and chain queries efficiently. In 2026, the Selector remains the heart of Scrapy spiders: built on lxml, it is faster and more robust than standalone BeautifulSoup in most production cases, with native support for responses, relative queries, namespaces, and integration with Items, Pipelines, and exporters. Because every query returns a SelectorList, it handles lists of elements seamlessly and scales well for large-scale crawling and extraction.
Here’s a complete, practical guide to setting up and using the Scrapy Selector: importing and instantiation, from strings/responses/URLs, basic XPath/CSS extraction, real-world setup patterns, and modern best practices with type hints, performance, error handling, and pandas/Polars integration.
Import Selector from scrapy and create instances from raw HTML strings, Scrapy response objects, or live URLs fetched with requests.
from scrapy import Selector
import requests
# 1. From raw HTML string (standalone usage)
html = """
<html>
  <head><title>My Web Page</title></head>
  <body>
    <h1>Welcome!</h1>
    <p class="intro">This is the intro.</p>
    <p>Main content here.</p>
  </body>
</html>
"""
sel = Selector(text=html)
# Extract title with XPath
title = sel.xpath('//title/text()').get()
print(title) # My Web Page
# Extract intro with CSS
intro = sel.css('p.intro::text').get()
print(intro) # This is the intro.
In a real Scrapy spider there is no need to instantiate a Selector manually: the response exposes .xpath() and .css() directly through its built-in .selector attribute.
# Inside a Scrapy spider
import scrapy

class PageSpider(scrapy.Spider):
    name = "page"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # response supports .xpath()/.css() directly
        title = response.xpath('//title/text()').get()
        intro = response.css('p.intro::text').get()
        all_paras = response.css('p::text').getall()
        yield {
            'title': title,
            'intro': intro,
            'para_count': len(all_paras),
        }
From live URL (standalone) — fetch with requests, then wrap in Selector.
url = "https://example.com"
headers = {"User-Agent": "Mozilla/5.0 (compatible; MyScraper/1.0)"}
response = requests.get(url, headers=headers)
response.raise_for_status()
sel = Selector(text=response.text)
# Or from response object directly (if using scrapy.http.TextResponse)
from scrapy.http import TextResponse
scrapy_resp = TextResponse(url=url, body=response.content, encoding='utf-8')
sel = Selector(response=scrapy_resp)
Real-world pattern: flexible Selector setup for both spider and standalone scripts — extract structured data cleanly for export or analysis.
# Standalone -> pandas export
import pandas as pd

products = []
for item in sel.css('.product-item'):
    name = item.css('.name::text').get(default='').strip()
    price = item.css('.price::text').re_first(r'[\d,.]+') or '0'
    # Strip thousands separators before converting, e.g. '1,299.00'
    products.append({"name": name, "price": float(price.replace(',', ''))})

df = pd.DataFrame(products)
df.to_parquet("products.parquet")
print(df.head())
Best practices make Selector setup safe, readable, and performant:
- Prefer response in spiders: it already carries URL context and encoding, and supports .css()/.xpath() directly.
- Use .get(default='') for single values; .getall() already returns an empty list when nothing matches, so missing elements are handled gracefully.
- Modern tip: use Polars for large-scale post-processing — pl.from_pandas(df) after Selector extraction gives fast cleaning and aggregation.
- Add type hints (response: scrapy.http.TextResponse) to improve spider clarity.
- Prefer CSS selectors for readability (.css('div.container h1::text')); reach for XPath for complex traversal or namespaces.
- Use relative selectors (.xpath('.//p') or .css('p')) after narrowing scope with .css('.container').
- Handle encoding: rely on response.encoding, or force utf-8 when building responses yourself.
- Use ::text and ::attr(name) CSS pseudo-elements: cleaner than XPath for common cases.
- Combine with scrapy.linkextractors.LinkExtractor for smart link following.
- Respect robots.txt and rate limits: Scrapy handles both via settings.
Setting up a Selector is the gateway to powerful, structured scraping — from raw HTML, responses, or URLs, with XPath/CSS, regex, chaining, and relative queries. In 2026, use response in spiders, Selector(text=...) standalone, prefer CSS for readability, and integrate with Polars for scale. Master the Selector, and you’ll extract clean, reliable data from any website with speed and maintainability.
Next time you parse HTML in Scrapy — set up the Selector. It’s Python’s cleanest way to say: “Give me the power to query this page precisely.”