Selecting Selectors in Scrapy means choosing between XPath and CSS (or combining them) to query HTML/XML documents with the best balance of power, readability, and performance. Scrapy’s Selector supports both: XPath offers deep structural traversal, complex conditions, relative paths, and full DOM access, while CSS provides concise, familiar tag/class/ID/pseudo-element syntax ideal for most everyday scraping tasks. In 2026, selector choice remains a core skill: CSS dominates for conciseness and maintainability, XPath shines for precision and edge cases, and hybrid usage (CSS to narrow, XPath to refine) is common in production spiders. The right selector makes code clearer, faster to write and debug, and more robust against site changes.
Here’s a complete, practical guide to selecting selectors in Scrapy: XPath vs CSS head-to-head, strengths/weaknesses, decision guidelines, hybrid patterns, real-world examples, and modern best practices with type hints, performance, and pandas/Polars integration.
XPath selectors — expressive and powerful — use path-like syntax to navigate the tree, filter by attributes, text, position, or relationships.
# XPath examples
title = response.xpath('//title/text()').get()  # title text
first_p = response.xpath('(//p)[1]/text()').get()  # text of the first <p> in the document
intro = response.xpath('//p[@class="intro"]/text()').get()  # text of <p class="intro">
links = response.xpath('//a/@href').getall()  # all hrefs
deep = response.xpath('//div[@id="content"]//h2/text()').getall()  # all h2 under #content
contains = response.xpath('//p[contains(., "important")]/text()').getall()  # paragraphs containing "important" (contains(., ...) checks the full string value, unlike contains(text(), ...))
CSS selectors — concise and web-developer friendly — use tag, class, ID, attribute, and pseudo-element syntax for quick, readable selections.
# CSS examples
title = response.css('title::text').get() # title text
intro = response.css('p.intro::text').get() # p.intro text
links = response.css('a::attr(href)').getall() # all hrefs
articles = response.css('div.article h2::text').getall() # h2 inside .article
first_para = response.css('p:first-of-type::text').get() # first p text
Decision guide — when to choose XPath vs CSS (or both).
- Use CSS when: simple tag/class/ID selection, ::text/::attr, standard pseudo-classes (:first-child, :nth-of-type), and readability/maintainability are the priority. (Note: :contains() is not standard CSS and is not supported by Scrapy's CSS engine; use XPath contains() instead.)
- Use XPath when: complex conditions (text contains, position, parent/sibling/ancestor), namespaces, relative paths from narrowed scope, full DOM traversal needed.
- Use both together: narrow with CSS (.css('.container')), then refine with a relative XPath (.xpath('.//p[contains(., "key")]/text()')). This is often the fastest to write and the clearest to read.
Real-world pattern: structured extraction in Scrapy spiders — choose selectors based on site structure and maintainability.
def parse_product(self, response):
    # CSS for common, stable fields
    yield {
        'name': response.css('h1.product-title::text').get(default='').strip(),
        'price': response.css('.price-amount::text').re_first(r'[\d,.]+') or '0.00',
        'rating': response.css('.rating-stars::attr(data-rating)').get(default='N/A'),
        # XPath for complex/conditional fields (e.g., stock status with a text check)
        'stock': response.xpath('//span[contains(@class, "stock") and contains(., "In stock")]/text()').get(default='Out of stock').strip(),
    }
Best practices for selector choice make scraping reliable and maintainable:
- Prefer CSS selectors for most cases: shorter and easier to read, write, and debug. (Scrapy compiles CSS to XPath internally via cssselect, so raw performance is essentially identical; the real win is maintainability.)
- Fall back to XPath when CSS can't express the condition: text content, position, parent/sibling axes, namespaces, complex logic.
- Use relative selectors (.xpath('.//p') or .css('p')) after narrowing scope to avoid full-tree searches.
- Prefer ::text and ::attr(name) CSS pseudo-elements: cleaner than XPath for common extractions.
- Test selectors in Scrapy Shell (scrapy shell 'url') to iterate quickly.
- Handle missing data: .get(default='') for single values; .getall() already returns [] when nothing matches.
- Add type hints (response: scrapy.http.TextResponse) to improve spider clarity.
- Combine with scrapy.linkextractors.LinkExtractor for smart link following.
- For post-processing at scale, load exported items into Polars (e.g., pl.from_pandas(df)) for fast cleaning and aggregation.
- Respect robots.txt and rate limits: Scrapy's ROBOTSTXT_OBEY and AutoThrottle settings cover this.
Choosing between XPath and CSS selectors in Scrapy balances power and simplicity: CSS for brevity and readability, XPath for advanced filtering. In 2026, prefer CSS for everyday tasks and XPath for complex needs, use relative queries, test in Scrapy Shell, and integrate with Polars for scale. Master selector selection, and you’ll write fast, clean, maintainable spiders that extract accurate data from any site structure.
Next time you need to query a webpage in Scrapy — pick the right selector. It’s Python’s cleanest way to say: “Find exactly what I need, in the most effective way.”