Extracting data from a SelectorList is the core skill in Scrapy scraping: once you've used .xpath() or .css() to select elements, you get a SelectorList (a list-like object of Selector instances). From there, you extract text, attributes, or HTML using methods such as .get(), .getall(), .re(), .re_first(), and the legacy .extract()/.extract_first(). Mastering SelectorList extraction is how you turn raw HTML into structured data (titles, prices, links, reviews) for Items, Pipelines, exports, and analysis in pandas/Polars. Choosing the right extraction method keeps output clean, consistent, and performant even on messy or large-scale sites.
Here’s a complete, practical guide to extracting data from a SelectorList in Scrapy: key methods with examples, differences and use cases, chaining and cleaning, real-world patterns, and modern best practices with type hints, performance, error handling, and pandas/Polars integration.
.get() returns the first result as a string (text/attribute value) — returns None if empty; safest for single-value fields.
# Extract first title text
title = response.css('title::text').get()
print(title) # e.g., "My Web Page" or None
# Extract first href attribute
first_link = response.css('a::attr(href)').get()
print(first_link) # e.g., "https://example.com" or None
.getall() returns a list of all results as strings — empty list if no matches; ideal for multi-value fields (all links, paragraphs, prices).
paragraphs = response.css('p::text').getall()
print(paragraphs) # ['Intro text.', 'Main content.', ...] or []
images = response.css('img::attr(src)').getall()
print(images) # ['img1.jpg', 'img2.png', ...] or []
.extract() and .extract_first() are older aliases — .extract() returns list (like .getall()), .extract_first() returns first or None (like .get()). Use .get()/.getall() in new code — more consistent and explicit.
# Legacy style (still works)
titles_old = response.xpath('//h1/text()').extract() # list
first_title_old = response.xpath('//h1/text()').extract_first() # first or None
# Modern style (recommended)
titles_new = response.xpath('//h1/text()').getall() # list
first_title_new = response.xpath('//h1/text()').get() # first or None
Real-world pattern: structured item extraction in Scrapy spiders — use .get() for single fields, .getall() for lists, and clean/normalize for Items.
def parse_product(self, response):
    yield {
        'name': response.css('h1.product-title::text').get(default='').strip(),
        'price': response.css('.price-amount::text').re_first(r'[\d,.]+') or '0.00',
        'rating': response.css('.rating-stars::attr(data-rating)').get(default='N/A'),
        'features': [li.strip() for li in response.css('.features li::text').getall() if li.strip()],
        'image_urls': response.css('img.product-image::attr(src)').getall(),
    }
Best practices make SelectorList extraction safe, readable, and performant:
- Prefer .get() for single values and .getall() for lists; both are clearer than the legacy .extract()/.extract_first().
- Use default='' (or fall back with "or []") to handle missing elements gracefully.
- Add type hints (response: scrapy.http.TextResponse) to improve spider clarity.
- Chain selectors (response.css('.container').xpath('.//p/text()').getall()) to narrow scope efficiently.
- Use the ::text and ::attr(name) CSS pseudo-elements; they are cleaner than XPath for common cases.
- Clean output with .strip(), .re_first(), or list comprehensions to remove whitespace and extract numbers.
- Handle multiple matches with .getall() or by looping over the SelectorList.
- Combine with ItemLoader to process and clean inputs automatically for complex items.
- For large-scale post-processing, consider Polars (e.g., pl.from_pandas(df) after exporting items) for fast cleaning and aggregation.
- Respect robots.txt and rate limits; Scrapy's ROBOTSTXT_OBEY and download-delay settings handle this.
Extracting data from a SelectorList turns raw selections into clean, structured output: use .get()/.getall(), clean and normalize values, and chain selectors wisely. Prefer the modern methods, handle missing data with defaults, push heavy post-processing into pandas/Polars, and add type hints for safety. Master SelectorList extraction and you'll build reliable, maintainable spiders that yield high-quality structured data from almost any site.
Next time you have a SelectorList — extract with .get() or .getall(). It’s Scrapy’s cleanest way to say: “Give me the data, clean and ready.”