Crawling is the automated process of systematically discovering and visiting web pages, following links, and extracting data at scale — the foundation of large web scraping projects. In Scrapy, crawling is handled by Spiders: Python classes that define start URLs, parsing logic, link following, and data yielding. Spiders can crawl entire sites or domains, respect robots.txt, handle pagination, avoid duplicates via seen URLs, and export structured items to JSON, CSV, XML, or databases. In 2026, crawling with Scrapy remains the gold standard for production-grade scraping — built-in concurrency, middleware, pipelines, link extractors, and depth limiting make it efficient, scalable, and maintainable compared to manual requests loops or other tools.
Here’s a complete, practical guide to crawling with Scrapy: spider structure, start_urls and parse(), link following, pagination, rules/link extractors, real-world patterns, and modern best practices with settings, items, pipelines, and pandas/Polars integration.
Basic Spider structure — define name, start_urls, and parse() to extract data and follow links.
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    def parse(self, response):
        # Extract data from the current page
        yield {
            'title': response.css('title::text').get(default='').strip(),
            'url': response.url,
        }
        # Follow all links on the page
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)
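Assuming the spider file lives inside a Scrapy project, you can run it and export the yielded items in one step (this is the standard CLI; the file names are illustrative):

```shell
# -O overwrites the output file, -o appends to it
scrapy crawl myspider -O items.json
scrapy crawl myspider -O items.csv
```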
Link following with response.follow() — handles relative/absolute URLs, adds to request queue, calls parse() (or custom callback).
# Custom callback for detail pages
def parse(self, response):
    # Yield page data
    yield {'title': response.css('title::text').get()}
    # Follow detail links with a different callback
    for href in response.css('.product a::attr(href)').getall():
        yield response.follow(href, callback=self.parse_product)

def parse_product(self, response):
    yield {
        'name': response.css('h1::text').get(),
        'price': response.css('.price::text').get(),
    }
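response.follow() accepts the raw href strings extracted above because it resolves them against response.url. The resolution follows the standard relative-URL rules, the same ones implemented by the stdlib's urllib.parse.urljoin, which makes the behavior easy to sanity-check offline:

```python
from urllib.parse import urljoin

# response.follow(href) resolves relative hrefs against the current page URL,
# following the same RFC 3986 rules as urllib.parse.urljoin:
base = "https://example.com/products/?page=1"

print(urljoin(base, "?page=2"))          # https://example.com/products/?page=2
print(urljoin(base, "/category/books"))  # https://example.com/category/books
print(urljoin(base, "https://other.example/x"))  # absolute URLs pass through unchanged
```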
Pagination and link extractors — use CrawlSpider + Rule + LinkExtractor for automatic following of next pages or categories.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ProductSpider(CrawlSpider):
    name = 'products'
    start_urls = ['https://example.com/categories']

    rules = (
        Rule(LinkExtractor(allow=r'/products/'), callback='parse_product', follow=True),
        # No callback here: pagination pages are only followed, not parsed.
        # Never use 'parse' as a Rule callback — CrawlSpider relies on that
        # method internally, and overriding it breaks the rule machinery.
        Rule(LinkExtractor(allow=r'/page/\d+'), follow=True),
    )

    def parse_product(self, response):
        yield {'name': response.css('h1::text').get()}
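LinkExtractor's allow/deny values are regular expressions matched anywhere in each extracted link's absolute URL (re.search semantics). A quick, self-contained sanity check of the two patterns used above:

```python
import re

# Mirror what the two Rules above would accept or skip:
product_pat = r'/products/'
page_pat = r'/page/\d+'

print(bool(re.search(product_pat, 'https://example.com/products/widget-1')))  # True
print(bool(re.search(page_pat, 'https://example.com/category/page/2')))       # True
print(bool(re.search(product_pat, 'https://example.com/about')))              # False
```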
Real-world pattern: full-site crawl with data export — crawl categories, follow product links, yield structured items, save to JSON/CSV/Parquet via pipelines or feed exports.
# settings.py excerpt
FEEDS = {
    'products.json': {'format': 'json'},
    'products.csv': {'format': 'csv'},
}
# Or use a pipeline for custom processing/export to a database or Polars
import polars as pl

class SaveToParquetPipeline:
    def open_spider(self, spider):
        self.items = []

    def process_item(self, item, spider):
        # dict() handles both plain dicts and scrapy.Item instances
        self.items.append(dict(item))
        return item

    def close_spider(self, spider):
        df = pl.DataFrame(self.items)
        df.write_parquet(f"{spider.name}_products.parquet")
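For the pipeline to run, it must be registered in settings.py. The module path below is illustrative and depends on your project layout:

```python
# settings.py — enable the pipeline; the integer is an order value
# (lower numbers run first, conventionally 0-1000)
ITEM_PIPELINES = {
    'myproject.pipelines.SaveToParquetPipeline': 300,
}
```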
Best practices for effective crawling in Scrapy:
- Define clear start_urls and use LinkExtractor with allow/deny to control crawl scope.
- Use CrawlSpider with Rule for automatic following: set follow=True on category/index pages, follow=False on detail pages.
- Modern tip: use Polars for large-scale post-processing; export to Parquet via a pipeline for fast analytics.
- Add type hints (response: scrapy.http.TextResponse) to improve spider clarity.
- Respect robots.txt: enable ROBOTSTXT_OBEY = True in settings.
- Rate limit with DOWNLOAD_DELAY, AUTOTHROTTLE_ENABLED, or a lower CONCURRENT_REQUESTS to prevent bans.
- Use DEPTH_LIMIT for bounded crawls.
- Handle duplicates with DUPEFILTER_CLASS and DUPEFILTER_DEBUG (Scrapy deduplicates requests by default).
- Yield items early, and use ItemLoader for processing/cleaning.
- Export with FEEDS: JSON, JSON Lines, CSV, and XML are built in; Parquet needs a custom exporter or a pipeline like the one above.
- Log with self.logger.info() rather than print().
- Monitor with stats: check the crawl stats Scrapy logs at shutdown, or enable TELNETCONSOLE_ENABLED.
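As an example of the kind of cleaning an ItemLoader input processor would wrap, here is a small standalone price normalizer. The currency symbols and thousands-separator format it handles are assumptions about the target site, not a general-purpose parser:

```python
def clean_price(raw: str) -> float:
    """Normalize a scraped price string like ' $1,299.00 ' to a float.

    Assumes a leading currency symbol ($, €, or £) and comma thousands
    separators — adjust for the site you are actually scraping.
    """
    return float(raw.strip().lstrip('$€£').replace(',', ''))

print(clean_price(' $1,299.00 '))  # 1299.0
print(clean_price('€49'))          # 49.0
```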
Crawling in Scrapy systematically discovers and extracts data — define spiders, start URLs, parse methods, follow links, use rules/extractors, and export structured items. In 2026, use CrawlSpider for automation, respect robots.txt, rate limit, export to Parquet/Polars, and type hints for safety. Master crawling, and you’ll build scalable, ethical scrapers that map entire sites and yield clean, valuable data.
Next time you need to explore a whole website — start crawling with Scrapy. It’s Python’s cleanest way to say: “Discover and extract everything I need automatically.”