Crawl in Python 2026: Building Modern Web Crawlers with Best Practices
Web crawling (also known as spidering) is the process of systematically browsing the internet to collect data. In 2026, building a responsible and efficient crawler involves asynchronous I/O, respectful rate limiting, proper user-agent identification, robots.txt compliance, and clean data pipelines.
This March 24, 2026 guide shows how to build a modern, classy Python crawler using current best practices with httpx, asyncio, BeautifulSoup, and ethical considerations.
TL;DR — Key Takeaways 2026
- Use asynchronous requests with httpx for speed
- Always respect robots.txt and add delays between requests
- Implement proper error handling and logging
- Use BeautifulSoup or Playwright for parsing
- Save data incrementally to avoid memory issues
- Make your crawler identifiable and polite
1. Modern Classy Crawler Example
```python
import asyncio
from urllib.parse import urljoin, urlparse

import httpx
from bs4 import BeautifulSoup


class ClassyCrawler:
    def __init__(self, start_url: str, delay: float = 1.0, max_pages: int = 50):
        self.start_url = start_url
        self.delay = delay
        self.max_pages = max_pages
        self.visited = set()
        self.client = httpx.AsyncClient(
            timeout=15.0,
            headers={
                # Identify the crawler and give site operators a contact URL
                "User-Agent": "ClassyCrawler/2026 (+https://www.pyinns.com)"
            },
        )

    async def fetch(self, url: str) -> str:
        await asyncio.sleep(self.delay)  # polite delay before every request
        response = await self.client.get(url)
        response.raise_for_status()
        return response.text

    async def parse_links(self, html: str, base_url: str) -> list[str]:
        soup = BeautifulSoup(html, "html.parser")
        links = []
        for a in soup.find_all("a", href=True):
            full_url = urljoin(base_url, a["href"])
            if self.is_same_domain(full_url):
                links.append(full_url)
        return links

    def is_same_domain(self, url: str) -> bool:
        return urlparse(url).netloc == urlparse(self.start_url).netloc

    async def crawl(self):
        queue = [self.start_url]
        while queue and len(self.visited) < self.max_pages:
            url = queue.pop(0)
            if url in self.visited:
                continue
            self.visited.add(url)
            print(f"Crawling: {url}")
            try:
                html = await self.fetch(url)
                new_links = await self.parse_links(html, url)
                queue.extend(new_links)
            except Exception as e:
                # Never let one bad page stop the whole crawl
                print(f"Failed to crawl {url}: {e}")
        await self.client.aclose()
        print(f"Crawl completed. Visited {len(self.visited)} pages.")


async def main():
    crawler = ClassyCrawler("https://www.example.com", delay=1.0, max_pages=30)
    await crawler.crawl()


if __name__ == "__main__":
    asyncio.run(main())
```
2. Key Best Practices in 2026
- Respect robots.txt — check before crawling
- Polite delays — use asyncio.sleep between requests
- Clear User-Agent — identify your crawler
- Graceful error handling — never crash on single page failure
- Incremental saving — write data as you crawl
- Domain restriction — stay within allowed domains
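The first bullet can be sketched with the standard library's urllib.robotparser. Here the robots.txt content is supplied inline for illustration; a real crawler would fetch it from the site's /robots.txt before requesting any pages:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, supplied inline for the example
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Check each URL against the rules before fetching it
print(rp.can_fetch("ClassyCrawler/2026", "https://example.com/public/page"))   # True
print(rp.can_fetch("ClassyCrawler/2026", "https://example.com/private/data"))  # False
print(rp.crawl_delay("ClassyCrawler/2026"))  # 2
```

A polite crawler would also honor the Crawl-delay value by passing it into the delay parameter instead of a hardcoded default.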
3. When to Choose Different Tools
| Tool | Use Case | Recommendation |
|---|---|---|
| httpx + BeautifulSoup | Static sites | Fast and simple |
| Playwright / Selenium | JavaScript-heavy sites | When dynamic content is needed |
| Scrapy | Large-scale crawling | Production-grade spiders |
| Asyncio + httpx | High performance | Modern custom crawlers |
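For the asyncio + httpx row, the usual high-performance pattern is bounded concurrency: fire many requests at once but cap how many are in flight. A sketch of the pattern where asyncio.sleep stands in for a real httpx.AsyncClient.get call, and the limit of 5 is an arbitrary example value:

```python
import asyncio


async def fetch(url: str, sem: asyncio.Semaphore) -> str:
    # The semaphore caps how many requests are in flight at once
    async with sem:
        await asyncio.sleep(0.01)  # stand-in for a real HTTP request
        return f"fetched {url}"


async def crawl_all(urls: list[str], limit: int = 5) -> list[str]:
    sem = asyncio.Semaphore(limit)
    # gather schedules all tasks concurrently; the semaphore enforces the cap
    return await asyncio.gather(*(fetch(u, sem) for u in urls))


urls = [f"https://www.example.com/page/{i}" for i in range(20)]
results = asyncio.run(crawl_all(urls))
print(len(results))  # 20
```

Without the semaphore, gather would open all 20 connections simultaneously; with it, the crawler stays fast while remaining polite to the target server.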
Conclusion — Building a Classy Crawler in 2026
A classy crawler is respectful, efficient, maintainable, and ethical. In 2026, the combination of asyncio, httpx, and BeautifulSoup provides an excellent foundation. Always prioritize politeness, error handling, and clean architecture. Whether for research, data collection, or learning, a well-built spider demonstrates both technical skill and responsibility.
Next steps:
- Build your own crawler starting with the example above
- Related articles: Efficient Python Code 2026 • Python Built-ins Overview 2026