Python Web Scraping: From First Request to Production Pipeline
Python web scraping guide covering requests, BeautifulSoup, Scrapy, and Playwright with working code examples, anti-bot strategies, and production workflows.


Python web scraping is the automated extraction of data from websites using a script that sends HTTP requests, downloads HTML, and pulls out the fields you care about. The most common beginner stack is requests for fetching pages and BeautifulSoup for parsing them. When pages rely on JavaScript to render content, Playwright is now the preferred choice over the older Selenium.
The global web scraping market reached $754 million in 2024 and is projected to exceed $2.87 billion by 2034, driven by competitive intelligence, price monitoring, and AI dataset construction.
This guide covers the complete Python scraping stack: library selection, working code for static and dynamic sites, anti-bot countermeasures, legal boundaries, and the mistakes that get beginners' IPs banned in minutes.
Web scraping is how you collect structured data from websites that don't offer an official API. Your script mimics what a browser does: it sends an HTTP GET request, receives an HTML document, and parses out the elements you need. Those elements (prices, titles, links, phone numbers, reviews) get saved wherever you need them.
Python became the dominant scraping language because its syntax keeps scripts readable, its library ecosystem covers every stage of the workflow, and its data science stack (pandas, NumPy, SQLAlchemy) integrates directly with whatever you extract. You can go from raw HTML to a clean CSV in fewer lines of Python than any other general-purpose language.
The web scraping industry maintains a 14.3% CAGR as more businesses treat automated data collection as a competitive capability rather than a niche developer trick. 36% of investment firms now rely on scraped alternative data for portfolio decisions. 51% of online shoppers say price-comparison sites influenced their last purchase, and every one of those comparison sites runs on scrapers.
For Python developers specifically, web scraping is a skill that creates direct business value. Price monitoring, lead enrichment, ML dataset construction, and news aggregation are all hiring categories that list web scraping as a core requirement.
Every scraping project follows the same five stages, regardless of which libraries you use. Getting all five right is what separates a scraper that runs once from one that runs in production.
1. Inspect the target. Before writing a single line of Python, open your browser's DevTools (F12) and examine the page's HTML structure. Look for the elements that contain your target data: their tag names, CSS classes, IDs, and data attributes. If the Network tab shows XHR or Fetch calls that return JSON, you can often hit those API endpoints directly with requests and skip HTML parsing entirely.
2. Fetch the page. Use requests for static HTML and Playwright or Scrapy for everything more complex. This step gets you the raw HTML string. If response.text contains the data you want, you have a static page and the simple stack is enough. If it looks like a skeleton with no content, the page renders via JavaScript and you need a browser automation tool.
3. Parse the HTML. BeautifulSoup and lxml turn the raw HTML string into a traversable document tree. You select elements by CSS selector (soup.select('.price')) or by tag and attribute (soup.find_all('a', class_='product-link')). XPath via lxml gives more precise control for deeply nested structures.
4. Handle pagination and errors. Real scrapers need to follow "next page" links, handle 404s gracefully, retry on 429 rate-limit responses, and deal with session cookies. Scrapy's AutoThrottle and retry middleware handle most of this automatically. For requests-based scrapers, you implement exponential backoff manually.
5. Store the results. For small runs, pandas.DataFrame.to_csv() is enough. For production pipelines that run on a schedule, write to SQLite (small) or PostgreSQL (large). JSON works for APIs that consume the data downstream; a storage sketch follows this list.
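As a minimal sketch of the storage stage, here is one way to append scraped records to a local SQLite file with pandas; the table name, database filename, and record fields below are illustrative, not from a specific project.

import sqlite3
import pandas as pd

# Illustrative records; in a real run these come from the parsing stage
records = [
    {'url': 'https://example.com/item/1', 'price': '51.77'},
    {'url': 'https://example.com/item/2', 'price': '53.74'},
]

df = pd.DataFrame(records)

# Append to a local SQLite file; the table is created on the first run
with sqlite3.connect('scrape_results.db') as conn:
    df.to_sql('prices', conn, if_exists='append', index=False)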
requests is one of the most downloaded third-party Python packages. It handles GET and POST requests, redirects, cookies, sessions, headers, timeouts, and authentication. Its one limitation: it downloads HTML exactly as the server sends it, with no JavaScript execution.
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
}
response = requests.get('https://books.toscrape.com/', headers=headers)
response.raise_for_status() # Raise an exception for 4xx/5xx
print(response.text[:500])

Always set a realistic User-Agent. The default python-requests/2.X string is blocked by virtually every anti-bot system on day one.
BeautifulSoup parses the HTML string that requests downloads and lets you navigate the document tree. Use CSS selectors for most extraction work: they're concise, match browser DevTools syntax, and handle most real-world pages.
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'lxml')
# CSS selector: all book titles on page 1
books = soup.select('article.product_pod h3 a')
for book in books:
    print(book['title'], book['href'])
# By class: all price elements
prices = soup.find_all('p', class_='price_color')
for price in prices:
    print(price.text)

Install both libraries with: pip install requests beautifulsoup4 lxml. The lxml parser is faster and more lenient with malformed HTML than Python's built-in html.parser.
lxml is the parser backend that makes BeautifulSoup faster, but you can also use it standalone with XPath queries. XPath is more expressive than CSS selectors for complex targeting (specific attribute values, parent-child constraints, text matching) and is the preferred option in Scrapy spiders.
from lxml import html
tree = html.fromstring(response.content)
prices = tree.xpath('//p[@class="price_color"]/text()')
print(prices)

Use lxml directly when you need maximum parsing speed on large documents or when XPath precision matters more than syntax brevity.
Scrapy is a full crawling framework, not just a parsing library. It manages a request queue, handles concurrency, respects robots.txt by default, and gives you a pipeline architecture for cleaning and storing data. It's the right choice when you need to crawl more than a few dozen pages systematically.
A minimal Scrapy spider for book titles:
import scrapy
class BooksSpider(scrapy.Spider):
    name = 'books'
    start_urls = ['https://books.toscrape.com/']

    def parse(self, response):
        for book in response.css('article.product_pod'):
            yield {
                'title': book.css('h3 a::attr(title)').get(),
                'price': book.css('.price_color::text').get(),
            }
        # Follow next-page link
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

Run it with scrapy crawl books -o books.csv. Add ROBOTSTXT_OBEY = True and DOWNLOAD_DELAY = 1 to your settings.py to respect the target site from the start. Scrapy's AutoThrottle extension adjusts request rates dynamically based on server latency.
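A minimal settings.py excerpt with those polite defaults plus AutoThrottle enabled; the specific delay values are illustrative starting points, not tuned recommendations.

# settings.py (excerpt): polite crawling defaults
ROBOTSTXT_OBEY = True        # respect the target site's robots.txt
DOWNLOAD_DELAY = 1           # base delay in seconds between requests to the same domain
AUTOTHROTTLE_ENABLED = True  # adjust the delay dynamically from server latency
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10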
Playwright launches a real browser (Chromium, Firefox, or WebKit) and gives you programmatic control over it. Unlike requests, it executes JavaScript, waits for dynamic content to load, interacts with forms, and handles cookies and sessions exactly as a real user would. It became the preferred alternative to Selenium after its 2020 release because of its async-first API and more reliable element detection.
Setup:
pip install playwright
playwright install chromium

A synchronous example:
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://example.com/products')
    page.wait_for_selector('.product-grid')
    products = page.query_selector_all('.product-card')
    for product in products:
        print(product.inner_text())
    browser.close()

For high-concurrency scraping, use the async API (async_playwright) with asyncio. Playwright is RAM-intensive: each browser instance uses 200-400MB. Spin up one instance and reuse it across pages rather than creating a new browser per URL.
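A sketch of that reuse pattern with the sync API: one browser for the whole run, one navigation per URL. The URLs and the selector are placeholders.

from playwright.sync_api import sync_playwright

urls = [
    'https://example.com/products?page=1',  # placeholder URLs
    'https://example.com/products?page=2',
]

with sync_playwright() as p:
    # One browser instance for the whole run, rather than one per URL
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    for url in urls:
        page.goto(url)
        page.wait_for_selector('.product-card')  # placeholder selector
        print(len(page.query_selector_all('.product-card')), 'products on', url)
    browser.close()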
| Library | Best For | JS Support | Learning Curve | Scale |
|---|---|---|---|---|
| requests | Static pages, REST APIs | No | Low | Medium |
| BeautifulSoup | HTML parsing (paired with requests) | No | Low | Medium |
| lxml | Fast parsing, XPath | No | Medium | High |
| Scrapy | Large-scale crawls, pipelines | Limited | High | Enterprise |
| Selenium | JS sites (legacy projects) | Yes | Medium | Medium |
| Playwright | JS sites (new projects) | Yes | Medium | High |
| httpx | Async HTTP without a browser | No | Low | High |
Modern single-page applications built with React, Vue, or Angular load data after the initial HTML response. When you scrape one of these sites with requests, you get the skeleton HTML with empty containers: the data hasn't loaded yet.
Your first move should always be the Network tab in DevTools. Filter by XHR or Fetch requests and look for JSON API calls that contain your target data.
If the site loads products from an internal API endpoint, hit that endpoint directly with requests and skip browser automation entirely. This approach is faster, cheaper on resources, and more stable than browser-based scraping.
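As an illustration, suppose DevTools revealed an internal endpoint that returns JSON; the URL, query parameter, and response field names below are hypothetical, but the pattern is the same for any JSON API you find in the Network tab.

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}

# Hypothetical internal endpoint discovered in the Network tab
response = requests.get(
    'https://spa-example.com/api/products?page=1',
    headers=headers,
    timeout=10,
)
response.raise_for_status()

data = response.json()  # already structured, no HTML parsing needed
for item in data.get('products', []):  # hypothetical response shape
    print(item.get('name'), item.get('price'))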
When there's no accessible API (data is rendered client-side with no clean network call), use Playwright:
import asyncio
from playwright.async_api import async_playwright
async def scrape_dynamic():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto('https://spa-example.com/products')
        # Wait for the product list to render
        await page.wait_for_selector('.product-list', timeout=10000)
        titles = await page.eval_on_selector_all(
            '.product-title', 'elements => elements.map(el => el.textContent)'
        )
        await browser.close()
        return titles

asyncio.run(scrape_dynamic())

The wait_for_selector call is critical. Without it, you capture the page before JavaScript finishes populating the DOM and get empty results.
About 30% of all web traffic comes from bots and scrapers. Every site worth scraping has seen this traffic pattern before, and most have implemented countermeasures. Here's how to work within them responsibly.
The single most important thing you can do is set a User-Agent that looks like a real browser. Go further and add the Accept headers a real browser sends:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
}

Sending 100 requests per second to a single domain is how you get banned within minutes. Add a random delay between requests to mimic human browsing behavior:
import time
import random
def polite_get(url, headers):
    time.sleep(random.uniform(1.5, 3.5))  # Random 1.5-3.5 second delay
    return requests.get(url, headers=headers)

In Scrapy, set DOWNLOAD_DELAY = 2 and enable AUTOTHROTTLE_ENABLED = True in settings.py. AutoThrottle adjusts the delay dynamically based on the server's response latency.
When a server returns 429 Too Many Requests, back off exponentially and retry:
from tenacity import retry, stop_after_attempt, wait_exponential
@retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=2, min=4, max=60))
def fetch_with_retry(url, headers):
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    return response

For large-scale scraping where IP rotation is necessary, use a residential proxy service like Bright Data, Oxylabs, or Zyte. These route your requests through different IP addresses to avoid IP-level bans.
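A minimal sketch of routing requests through a proxy pool with the requests library; the proxy URLs below are placeholders for whatever endpoints your provider gives you.

import random
import requests

# Placeholder proxy endpoints supplied by your proxy provider
PROXIES = [
    'http://user:pass@proxy1.example.com:8000',
    'http://user:pass@proxy2.example.com:8000',
]

def get_via_proxy(url, headers):
    proxy = random.choice(PROXIES)
    # Route both HTTP and HTTPS traffic through the chosen proxy
    return requests.get(
        url,
        headers=headers,
        proxies={'http': proxy, 'https': proxy},
        timeout=10,
    )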
For projects where you'd rather not manage proxies yourself, consider a scraping API like ScraperAPI or ScrapingBee, which handles proxies, CAPTCHAs, and JavaScript rendering through a single API endpoint.
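The integration pattern is usually a single GET to the provider's endpoint with your API key and the target URL as parameters. The endpoint and parameter names below are generic placeholders, not any specific provider's API; check your provider's docs for the exact format.

import requests

# Generic pattern for a managed scraping API; endpoint and params are placeholders
API_ENDPOINT = 'https://api.scraping-provider.example/scrape'
API_KEY = 'your-api-key'

response = requests.get(
    API_ENDPOINT,
    params={
        'api_key': API_KEY,
        'url': 'https://example.com/products',
        'render_js': 'true',  # many providers expose some JS-rendering flag
    },
    timeout=60,
)
response.raise_for_status()
html = response.text  # parse with BeautifulSoup as usual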
The hiQ v. LinkedIn decision in April 2022 held that scraping publicly available data does not violate the Computer Fraud and Abuse Act. But "not illegal" and "unrestricted" are different things.
Website Terms of Service still govern what you can do with the data, and violating them can expose you to civil claims.
Check robots.txt before you run a single request. Most sites publish this file at example.com/robots.txt; it lists which paths crawlers are allowed and disallowed to access.
In Scrapy, ROBOTSTXT_OBEY = True (the default) enforces this automatically. For requests-based scrapers, check it manually.
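For the manual check, the standard library's urllib.robotparser works; a minimal sketch, using an arbitrary user-agent name:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser('https://books.toscrape.com/robots.txt')
rp.read()  # fetch and parse the robots.txt file

user_agent = 'MyScraperBot'  # whatever name your scraper identifies as
url = 'https://books.toscrape.com/catalogue/page-2.html'

if rp.can_fetch(user_agent, url):
    print('Allowed to fetch', url)
else:
    print('Disallowed by robots.txt:', url)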
Respect the spirit of the rules beyond robots.txt. Don't scrape data behind an authentication wall without permission, and don't collect personal data without a lawful basis (GDPR applies even if your scraper runs outside Europe). Don't hammer a site's servers at a rate that degrades performance for real users.
When in doubt, look for an official API first. If the data is available via API, use the API. It's faster, more stable, and structurally documented.
| Tool | Type | Best For |
|---|---|---|
| Scrapy | Open-source framework | Production crawls, pipeline architecture |
| Playwright | Browser automation | JS-heavy sites, form interaction |
| ScraperAPI | Paid API | Managed proxies and CAPTCHA solving |
| ScrapingBee | Paid API | Headless browser via API, simpler integration |
| Bright Data | Proxy network | Enterprise IP rotation at scale |
| | Cloud platform | Hosting Scrapy/Playwright scrapers |
| Zyte | Cloud platform | Scrapy creators, managed crawl infrastructure |
| | Monitoring tool | Scraper health monitoring and proxy comparison |
Paid scraping APIs make sense when the cost of managing proxies and debugging blocks exceeds the API subscription cost. For intermittent or low-volume scraping, the open-source stack is sufficient.
The most common real-world scraping use case is price monitoring: tracking competitor or supplier prices across e-commerce sites to inform your own pricing decisions. 82% of e-commerce companies use some form of this.
A minimal price monitor for a product listing page might look like this:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
import time
import random
def get_price(url, headers):
    time.sleep(random.uniform(2, 4))
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'lxml')
    price_el = soup.select_one('.price_color')
    return price_el.text.strip() if price_el else None
urls = [
    'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
    'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
]
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
records = []
for url in urls:
    price = get_price(url, headers)
    records.append({'url': url, 'price': price, 'scraped_at': datetime.utcnow()})
df = pd.DataFrame(records)
df.to_csv('prices.csv', index=False)
print(df)

Schedule this with cron on Linux (or Task Scheduler on Windows) to run nightly and you have a functional competitive price tracker with zero infrastructure cost.
The default python-requests/2.32.0 User-Agent is blocked by virtually every anti-bot system. It announces "I'm a bot" before the server even checks your IP. Always set a realistic browser User-Agent header.
Scraping paths listed as Disallow in robots.txt risks IP bans and, depending on jurisdiction, civil liability. Check the file before scraping and filter your URL list accordingly.
Many beginners get their first scraper's IP banned within 10 minutes: the cause is sending requests with no delay. Even a 1-2 second random sleep cuts the block rate dramatically on most sites.
HTML structures change. A CSS selector like .product-price breaks the moment the site redesigns. Write selectors against stable attributes (data attributes, IDs, ARIA labels) when possible, and build monitoring that alerts you when extraction returns empty results.
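A small guard like the one below can catch silent breakage by failing loudly when extraction returns nothing; the data-testid selector is a hypothetical example of a stable attribute, with the article's .price_color class as a fallback.

from bs4 import BeautifulSoup

def extract_prices(html):
    soup = BeautifulSoup(html, 'lxml')
    # Prefer a stable data attribute (hypothetical) over a styling class
    elements = soup.select('[data-testid="price"]') or soup.select('.price_color')
    prices = [el.get_text(strip=True) for el in elements]
    if not prices:
        # Fail loudly so a scheduler or alerting hook can flag the stale selector
        raise ValueError('Price extraction returned no results; selectors may be stale')
    return prices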
Playwright launches a full Chromium browser process and uses 200-400MB of RAM per instance. Using it on a site that serves plain HTML is like driving a tank to the grocery store. Check whether requests + BeautifulSoup returns the data before reaching for browser automation.
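A quick probe like this, using a selector you already confirmed in DevTools, tells you whether the plain requests + BeautifulSoup stack is enough before you reach for Playwright:

import requests
from bs4 import BeautifulSoup

def static_fetch_works(url, css_selector, headers):
    # If the selector matches on the raw HTML, the page is static enough for requests
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'lxml')
    return len(soup.select(css_selector)) > 0

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
print(static_fetch_works('https://books.toscrape.com/', 'article.product_pod', headers))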
Production scrapers encounter 403s, 429s, 503s, timeouts, and malformed HTML. Without error handling, a single failed request crashes the whole run. Wrap requests in try/except, check response.raise_for_status(), and implement retry logic with exponential backoff.
BeautifulSoup is excellent for 10-100 pages. At 10,000 pages, you need Scrapy's concurrent request queue, built-in retry middleware, and pipeline architecture. Choosing the right tool for the scale of your project upfront prevents painful rewrites later.
Python web scraping is a layered skill. requests + BeautifulSoup covers the majority of real-world use cases: static pages, paginated listings, and sites with accessible JSON APIs. Playwright handles the rest, namely JavaScript-heavy SPAs where HTML alone doesn't contain the data. Scrapy is the tool you reach for when page count goes from dozens to thousands and you need a production architecture with pipelines, retry logic, and configurable concurrency.
Your next step: build a scraper against books.toscrape.com, a purpose-built practice site with no rate limiting or bot detection. Extract all 1,000 books across 50 pages, save them to a CSV, and you'll have worked through pagination, error handling, and data export in a single controlled exercise.
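A starting sketch for that exercise, reusing the selectors from the examples above and following the catalogue's "next" link until it runs out:

import time
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
url = 'https://books.toscrape.com/'
records = []

while url:
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'lxml')

    for book in soup.select('article.product_pod'):
        records.append({
            'title': book.select_one('h3 a')['title'],
            'price': book.select_one('.price_color').get_text(strip=True),
        })

    # Follow the "next" link until there isn't one
    next_link = soup.select_one('li.next a')
    url = urljoin(url, next_link['href']) if next_link else None
    time.sleep(1)  # stay polite even on a practice site

pd.DataFrame(records).to_csv('books.csv', index=False)
print(len(records), 'books saved')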
When you're ready to go deeper, explore the related Pynions guides on scraping specific platforms (Amazon, LinkedIn, Google). Topics covered include automating data pipelines with Python and deploying a Scrapy spider to a cloud server.
