Crawling is the heart of any serious web scraping project. In Scrapy (still the #1 framework for structured crawling in 2026), crawling means systematically following links across pages, handling pagination, respecting depth limits, and extracting data at scale, all while avoiding blocks and staying ethical.
This updated 2026 guide explains how to build robust crawlers with Scrapy 2.14+, including modern async patterns, pagination strategies, CrawlSpider rules, depth control, and best practices to stay under the radar.
What Does "Crawling" Mean in Web Scraping?
Crawling = discovering and visiting new pages by following hyperlinks. In Scrapy, this happens through:
- start_urls + parse() → manual following
- response.follow() → link following
- CrawlSpider + Rule → automatic link extraction & pagination
Option 1: Manual Crawling with Basic Spider (Simple Pagination)
# spiders/news_crawler.py
import scrapy

class NewsCrawler(scrapy.Spider):
    name = "news"
    start_urls = ["https://example.com/news/page/1"]

    def parse(self, response):
        # Extract articles on the current page
        for article in response.css("article.post"):
            yield {
                "title": article.css("h2::text").get(),
                "url": article.css("h2 a::attr(href)").get(),
                "date": article.css(".date::text").get(),
            }

        # Follow the next page (pagination)
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
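A detail worth knowing: response.follow() accepts relative hrefs and resolves them against the current page URL, so the next_page value can be site-absolute or page-relative. Conceptually, the resolution works like the standard library's urljoin (a sketch with illustrative URLs, not Scrapy internals):

```python
from urllib.parse import urljoin

page_url = "https://example.com/news/page/1"

# site-absolute href, as often found in a "next" link
print(urljoin(page_url, "/news/page/2"))  # https://example.com/news/page/2

# page-relative href resolves against the current path
print(urljoin(page_url, "2"))             # https://example.com/news/page/2
```

Either form ends up as the same absolute request URL, which is why you rarely need to normalize pagination links by hand.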
Option 2: Automatic Crawling with CrawlSpider + Rules (Recommended 2026)
Use CrawlSpider for automatic link discovery and pagination handling.
# spiders/blog_spider.py
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BlogSpider(CrawlSpider):
    name = "blog"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/blog/"]

    rules = (
        # Extract all article links and parse them
        Rule(LinkExtractor(allow=r"/blog/article/"), callback="parse_article", follow=True),
        # Follow numbered pagination links (common pattern)
        Rule(LinkExtractor(allow=r"/page/\d+/", deny=r"/page/1000/"), follow=True),
    )

    def parse_article(self, response):
        yield {
            "title": response.css("h1.article-title::text").get(),
            "content": response.css("div.content").get(),
            "author": response.css(".author::text").get(),
            "url": response.url,
        }
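The allow= and deny= arguments of LinkExtractor are regular expressions matched against the absolute URL, which makes the patterns easy to sanity-check outside Scrapy before running a crawl (sample URLs below are illustrative):

```python
import re

# The same patterns used in the rules above
article_re = re.compile(r"/blog/article/")
page_re = re.compile(r"/page/\d+/")

print(bool(article_re.search("https://example.com/blog/article/scrapy-tips/")))  # True
print(bool(page_re.search("https://example.com/blog/page/3/")))                  # True
print(bool(page_re.search("https://example.com/blog/about/")))                   # False
```

Testing the regexes this way catches mistakes like a missing backslash in \d+ before they silently cause a rule to match nothing.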
Depth & Performance Control in 2026
Set these in settings.py to prevent infinite crawling:
DEPTH_LIMIT = 5 # max link depth
DEPTH_PRIORITY = 1 # higher depth = lower priority
CONCURRENT_REQUESTS = 8
DOWNLOAD_DELAY = 1.2 # polite delay
RANDOMIZE_DOWNLOAD_DELAY = True
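Beyond a fixed DOWNLOAD_DELAY, Scrapy's built-in AutoThrottle extension adjusts the delay dynamically based on server latency, which is usually gentler on the target site. A minimal sketch for settings.py (the numeric values here are illustrative starting points, not recommendations from the Scrapy docs):

```python
# settings.py (continued) — AutoThrottle adapts the delay to observed latency
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0        # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10.0         # ceiling when the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0 # average parallel requests to aim for
```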
2026 Best Practices for Reliable Crawling
- Use async def parse() when possible (Scrapy 2.13+)
- Integrate scrapy-playwright for JavaScript-rendered pagination
- Rotate User-Agents & headers via the scrapy-user-agents middleware
- Enable ROBOTSTXT_OBEY = True by default
- Add realistic delays & concurrent request limits
- Save output incrementally: scrapy crawl blog -o data.jl -s FEED_EXPORT_ENCODING=utf-8
Ethical & Legal Reminders for Crawling
Always respect robots.txt, avoid aggressive crawling, and prefer official APIs when available. High-volume crawling of protected sites can lead to IP bans or legal notices.
Last updated: March 19, 2026 – Scrapy 2.14 brings better coroutine support and improved link extractors, making crawling more efficient and reliable than ever.