A Classy Spider in Python 2026: Building Web Crawlers with Elegance & Best Practices
Building a web crawler (often called a "spider") is a classic Python project that teaches asynchronous I/O, data extraction, rate limiting, and respectful crawling. In 2026, with improved async support, better libraries (httpx, BeautifulSoup4, Playwright, Scrapy), and stricter ethical guidelines, writing a "classy spider" means creating clean, efficient, respectful, and maintainable crawlers.
This March 24, 2026 update walks through building a modern, classy spider in Python using best practices: asynchronous requests, proper headers, rate limiting, data validation, error handling, and ethical considerations.
TL;DR — Key Takeaways 2026
- Use `httpx` + `asyncio` for fast asynchronous crawling
- Always respect `robots.txt` and add proper delays
- Use `BeautifulSoup` or `Playwright` for parsing
- Implement robust error handling and logging
- Store data cleanly (JSON, CSV, or database)
- Make your spider "classy" — modular, configurable, and respectful
1. Modern Classy Spider Structure
```python
import asyncio
import httpx
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse


class ClassySpider:
    def __init__(self, start_url: str, delay: float = 1.0):
        self.start_url = start_url
        self.delay = delay
        self.visited = set()
        self.client = httpx.AsyncClient(timeout=10.0, headers={
            "User-Agent": "ClassySpider/2026 (+https://www.pyinns.com)"
        })

    async def fetch(self, url: str) -> str:
        await asyncio.sleep(self.delay)  # respectful delay between requests
        response = await self.client.get(url)
        response.raise_for_status()
        return response.text

    async def parse(self, html: str, base_url: str) -> list[str]:
        soup = BeautifulSoup(html, "html.parser")
        links = []
        for a in soup.find_all("a", href=True):
            link = urljoin(base_url, a["href"])
            if self.is_valid_url(link):
                links.append(link)
        return links

    def is_valid_url(self, url: str) -> bool:
        # stay on the same domain as the start URL
        parsed = urlparse(url)
        return parsed.netloc == urlparse(self.start_url).netloc

    async def crawl(self, max_pages: int = 50):
        queue = [self.start_url]
        while queue and len(self.visited) < max_pages:
            url = queue.pop(0)
            if url in self.visited:
                continue
            self.visited.add(url)
            print(f"Crawling: {url}")
            try:
                html = await self.fetch(url)
                new_links = await self.parse(html, url)
                queue.extend(new_links)
            except Exception as e:
                print(f"Error crawling {url}: {e}")
        await self.client.aclose()
        print(f"Crawl finished. Visited {len(self.visited)} pages.")


# Usage
async def main():
    spider = ClassySpider("https://www.example.com")
    await spider.crawl(max_pages=30)

if __name__ == "__main__":
    asyncio.run(main())
```
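The parsing and domain-filtering logic above can be sanity-checked offline against a small HTML snippet, without touching the network. This sketch mirrors `parse` and `is_valid_url` as standalone functions; the HTML and URLs are made up for illustration:

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

HTML = """
<html><body>
  <a href="/about">About</a>
  <a href="https://www.example.com/contact">Contact</a>
  <a href="https://other.example.org/page">External</a>
</body></html>
"""

START = "https://www.example.com/"

def same_domain(url: str) -> bool:
    # mirrors ClassySpider.is_valid_url: keep only same-netloc links
    return urlparse(url).netloc == urlparse(START).netloc

def extract_links(html: str, base_url: str) -> list[str]:
    # mirrors ClassySpider.parse: resolve relative hrefs, then filter by domain
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"])
            for a in soup.find_all("a", href=True)
            if same_domain(urljoin(base_url, a["href"]))]

links = extract_links(HTML, START)
print(links)  # the external other.example.org link is filtered out
```

Checking this kind of logic in isolation makes it much easier to debug a spider before pointing it at a live site.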
2. Key Best Practices in 2026
- Respect robots.txt — use `urllib.robotparser` or `reppy`
- Add a User-Agent — identify your crawler clearly
- Rate limiting — use delays and async sleep
- Error handling — graceful failures with logging
- Data storage — save to JSON, CSV, or database incrementally
- Ethical crawling — avoid aggressive scraping on production sites
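The robots.txt rule from the list above can be enforced with the standard library's `urllib.robotparser`. A minimal sketch, with made-up rules fed directly so the example stays offline (a real spider would call `set_url(".../robots.txt")` and `read()` instead):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# feed rules directly instead of fetching them, to keep the example offline
rp.parse([
    "User-agent: *",
    "Crawl-delay: 2",
    "Disallow: /private/",
])

agent = "ClassySpider/2026"
print(rp.can_fetch(agent, "https://www.example.com/blog/post"))  # True: allowed
print(rp.can_fetch(agent, "https://www.example.com/private/x"))  # False: disallowed
print(rp.crawl_delay(agent))  # 2: the site's requested delay between requests
```

Calling `can_fetch` before every request, and honoring `crawl_delay` when it is set, is the simplest way to keep a spider on the polite side.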
3. When to Use Advanced Tools
| Tool | Use Case | When to Choose |
|---|---|---|
| httpx + BeautifulSoup | Static pages | Simple, fast crawling |
| Playwright / Selenium | JavaScript-heavy sites | Dynamic content |
| Scrapy | Large-scale crawling | Production-grade spiders |
| asyncio + httpx | High performance | Modern async spiders |
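Whichever tool you choose, the "store data incrementally" practice looks the same: append one record per page as it finishes, so a crash never loses the whole crawl. A minimal JSON Lines sketch (the file name and records are illustrative):

```python
import json
import os

PATH = "crawl_results.jsonl"  # illustrative file name

def save_record(record: dict, path: str = PATH) -> None:
    # JSON Lines: append one JSON object per line as each page finishes
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# start the demo from a clean file
if os.path.exists(PATH):
    os.remove(PATH)

save_record({"url": "https://www.example.com/", "title": "Example Domain"})
save_record({"url": "https://www.example.com/about", "title": "About"})

# each line is independently parseable, so partial files are still usable
with open(PATH, encoding="utf-8") as f:
    records = [json.loads(line) for line in f]
print(len(records))  # 2
```

Appending line by line avoids holding the whole result set in memory and plays well with tools like `jq` or pandas' `read_json(lines=True)`.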
Conclusion — A Classy Spider in 2026
Building a classy spider means writing respectful, efficient, maintainable, and ethical crawlers. In 2026, combine asyncio with httpx, use proper delays, respect robots.txt, and implement clean architecture with classes. Whether for learning, data collection, or research, a well-written spider demonstrates both technical skill and responsibility.
Next steps:
- Build your own classy spider starting with the example above
- Related articles: Efficient Python Code 2026 • Python Built-ins Overview 2026