Crawling is the heart of any serious web scraping project. In Scrapy (still the #1 framework for structured crawling in 2026), crawling means systematically following links across pages, handling pagination, respecting depth limits, and extracting data at scale, all while avoiding blocks and staying ethical.
This updated 2026 guide explains how to build robust crawlers with Scrapy 2.14+, including modern async patterns, pagination strategies, CrawlSpider rules, depth control, and best practices to stay under the radar.
What Does "Crawling" Mean in Web Scraping?
Crawling = discovering and visiting new pages by following hyperlinks. In Scrapy, this happens through:
- start_urls + parse() → manual following
- response.follow() → link following
- CrawlSpider + Rule → automatic link extraction & pagination
Option 1: Manual Crawling with Basic Spider (Simple Pagination)
# spiders/news_crawler.py
import scrapy

class NewsCrawler(scrapy.Spider):
    name = "news"
    start_urls = ["https://example.com/news/page/1"]

    def parse(self, response):
        # Extract articles on the current page
        for article in response.css("article.post"):
            yield {
                "title": article.css("h2::text").get(),
                "url": article.css("h2 a::attr(href)").get(),
                "date": article.css(".date::text").get(),
            }

        # Follow the next page (pagination)
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
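A detail worth knowing: response.follow() accepts relative hrefs and resolves them against the current page URL, so the next_page value can be site-absolute or page-relative. Conceptually, the resolution works like the standard library's urljoin (a sketch with illustrative URLs, not Scrapy internals):

```python
from urllib.parse import urljoin

page_url = "https://example.com/news/page/1"

# site-absolute href, as often found in a "next" link
print(urljoin(page_url, "/news/page/2"))  # https://example.com/news/page/2

# page-relative href resolves against the current path
print(urljoin(page_url, "2"))             # https://example.com/news/page/2
```

Either form ends up as the same absolute request URL, which is why you rarely need to normalize pagination links by hand.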
Option 2: Automatic Crawling with CrawlSpider + Rules (Recommended 2026)
Use CrawlSpider for automatic link discovery and pagination handling.
# spiders/blog_spider.py
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BlogSpider(CrawlSpider):
    name = "blog"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/blog/"]

    rules = (
        # Extract all article links and parse them
        Rule(LinkExtractor(allow=r"/blog/article/"), callback="parse_article", follow=True),
        # Follow numbered pagination links (common pattern)
        Rule(LinkExtractor(allow=r"/page/\d+/", deny=r"/page/1000/"), follow=True),
    )

    def parse_article(self, response):
        yield {
            "title": response.css("h1.article-title::text").get(),
            "content": response.css("div.content").get(),
            "author": response.css(".author::text").get(),
            "url": response.url,
        }
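The allow= and deny= arguments of LinkExtractor are regular expressions matched against the absolute URL, which makes the patterns easy to sanity-check outside Scrapy before running a crawl (sample URLs below are illustrative):

```python
import re

# The same patterns used in the rules above
article_re = re.compile(r"/blog/article/")
page_re = re.compile(r"/page/\d+/")

print(bool(article_re.search("https://example.com/blog/article/scrapy-tips/")))  # True
print(bool(page_re.search("https://example.com/blog/page/3/")))                  # True
print(bool(page_re.search("https://example.com/blog/about/")))                   # False
```

Testing the regexes this way catches mistakes like a missing backslash in \d+ before they silently cause a rule to match nothing.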
Depth & Performance Control in 2026
Set these in settings.py to prevent infinite crawling:
DEPTH_LIMIT = 5 # max link depth
DEPTH_PRIORITY = 1 # higher depth = lower priority
CONCURRENT_REQUESTS = 8
DOWNLOAD_DELAY = 1.2 # polite delay
RANDOMIZE_DOWNLOAD_DELAY = True
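Beyond a fixed DOWNLOAD_DELAY, Scrapy's built-in AutoThrottle extension adjusts the delay dynamically based on server latency, which is usually gentler on the target site. A minimal sketch for settings.py (the numeric values here are illustrative starting points, not recommendations from the Scrapy docs):

```python
# settings.py (continued) — AutoThrottle adapts the delay to observed latency
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0        # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10.0         # ceiling when the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0 # average parallel requests to aim for
```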
2026 Best Practices for Reliable Crawling
- Use async def parse() when possible (Scrapy 2.13+)
- Integrate scrapy-playwright for JavaScript-rendered pagination
- Rotate User-Agents & headers via the scrapy-user-agents middleware
- Enable ROBOTSTXT_OBEY = True by default
- Add realistic delays & concurrent request limits
- Save output incrementally: scrapy crawl blog -o data.jl -s FEED_EXPORT_ENCODING=utf-8
Ethical & Legal Reminders for Crawling
Always respect robots.txt, avoid aggressive crawling, and prefer official APIs when available. High-volume crawling of protected sites can lead to IP bans or legal notices.
Last updated: March 19, 2026 – Scrapy 2.14 brings better coroutine support and improved link extractors, making crawling more efficient and reliable than ever.