A classy spider in Scrapy is the recommended way to write clean, organized, and maintainable spiders: one that inherits from scrapy.Spider and defines key attributes (name, start_urls, allowed_domains) and methods (start_requests(), parse(), custom callbacks). Classy spiders provide full control over request generation, response handling, link following, pagination, and data extraction while keeping code modular and testable. In 2026, classy spiders remain the foundation of most production Scrapy projects — they integrate seamlessly with items, loaders, pipelines, middleware, feeds, and extensions, making them ideal for scalable, ethical crawling and structured data extraction.
Here’s a complete, practical guide to writing classy spiders in Scrapy: structure and attributes, start_requests vs start_urls, parse() and custom callbacks, link following, pagination, real-world patterns, and modern best practices with type hints, settings, logging, and pandas/Polars integration.
Classy spider structure — inherit from scrapy.Spider, define name (unique identifier), start_urls (initial seeds), allowed_domains (restrict crawl scope), and core methods.
import scrapy

class MyClassySpider(scrapy.Spider):
    name = 'myclassy'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/']

    def start_requests(self):
        # Optional: generate initial requests dynamically
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                headers={'User-Agent': 'MyClassySpider/1.0'},
                meta={'depth': 0},  # optional depth tracking
            )

    def parse(self, response):
        # Extract data from the current page
        yield {
            'title': response.css('title::text').get(default='').strip(),
            'url': response.url,
            'h1_count': len(response.css('h1')),
        }
        # Follow links (simple example)
        for href in response.css('a::attr(href)').getall():
            yield response.follow(
                href,
                callback=self.parse,  # or a custom callback
                meta={'depth': response.meta.get('depth', 0) + 1},
            )
start_requests() vs start_urls — use start_urls for simple lists; override start_requests() for dynamic generation, headers, meta, cookies, or authentication.
# Inside your Spider subclass:
    def start_requests(self):
        # Example: authenticated start requests
        yield scrapy.FormRequest(
            url='https://example.com/login',
            formdata={'username': 'user', 'password': 'pass'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Check login success, then yield the real start requests
        if 'welcome' in response.text.lower():
            for url in self.start_urls:
                yield scrapy.Request(url, callback=self.parse)
        else:
            self.logger.error('Login failed')
Real-world pattern: full-featured classy spider with pagination, custom callback, and item yielding.
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    allowed_domains = ['shop.example.com']
    start_urls = ['https://shop.example.com/categories/all?page=1']

    def parse(self, response):
        # Extract products on the current page
        for product in response.css('.product-item'):
            yield {
                'name': product.css('h3::text').get(default='').strip(),
                'price': product.css('.price::text').re_first(r'[\d,.]+') or '0.00',
                'url': product.css('a::attr(href)').get(),
            }
        # Pagination: follow the next page, if any
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
Best practices keep classy spiders robust, scalable, and maintainable:
- Always define name and allowed_domains to prevent accidental crawling outside the target site.
- Use start_requests() for dynamic or authenticated starts, parse() for generic handling, and custom callbacks for specialized pages.
- Add type hints (e.g., response: scrapy.http.TextResponse) to improve spider clarity and editor support.
- Yield dicts or Items early; use ItemLoader for cleaning and processing.
- Follow links with response.follow(), which resolves relative URLs and hands requests to the scheduler.
- Use meta for passing data between callbacks (e.g., depth, referer, priority).
- Enable ROBOTSTXT_OBEY, DOWNLOAD_DELAY, and AUTOTHROTTLE_ENABLED for ethical, ban-resistant crawling.
- Log meaningfully with self.logger.info() instead of print().
- Handle duplicates with DUPEFILTER_CLASS (request fingerprinting is on by default).
- Export with FEEDS: JSON, JSON Lines, CSV, and XML are built in; Parquet requires a custom exporter or third-party plugin.
- Modern tip: for large-scale post-processing, read the exported feed into Polars (e.g., pl.read_ndjson() for a JSON Lines feed) for fast analysis.
- Test selectors quickly in Scrapy Shell: scrapy shell 'url'.
A classy spider in Scrapy is clean, organized, and powerful — define name, start_urls, allowed_domains, start_requests(), parse(), and follow links with callbacks. In 2026, use start_requests for dynamic starts, yield items early, respect robots.txt, rate limit, export via FEEDS, and integrate with Polars for scale. Master classy spiders, and you’ll build maintainable, efficient crawlers that extract high-quality data from any site.
Next time you need to crawl a website systematically — write a classy spider. It’s Scrapy’s cleanest way to say: “Start here, follow everything, and extract structured data.”