Python Web Scraping: From First Request to Production Pipeline
Python web scraping guide covering requests, BeautifulSoup, Scrapy, and Playwright with working code examples, anti-bot strategies, and production workflows.


Python web scraping is the automated extraction of data from websites using a script that sends HTTP requests, downloads HTML, and pulls out the fields you care about. The most common beginner stack is requests for fetching pages and BeautifulSoup for parsing them. When pages rely on JavaScript to render content, Playwright is now the preferred choice over the older Selenium.
The global web scraping market reached $754 million in 2024 and is projected to exceed $2.87 billion by 2034, driven by competitive intelligence, price monitoring, and AI dataset construction.
This guide covers the complete Python scraping stack: library selection, working code for static and dynamic sites, anti-bot countermeasures, legal boundaries, and the mistakes that get beginners' IPs banned in minutes.
Web scraping is how you collect structured data from websites that don't offer an official API. Your script mimics what a browser does: it sends an HTTP GET request, receives an HTML document, and parses out the elements you need. Those elements (prices, titles, links, phone numbers, reviews) get saved wherever you need them.
Python became the dominant scraping language because its syntax keeps scripts readable, its library ecosystem covers every stage of the workflow, and its data science stack (pandas, NumPy, SQLAlchemy) integrates directly with whatever you extract. You can go from raw HTML to a clean CSV in fewer lines of Python than any other general-purpose language.
The web scraping industry maintains a 14.3% CAGR as more businesses treat automated data collection as a competitive capability rather than a niche developer trick. 36% of investment firms now rely on scraped alternative data for portfolio decisions. 51% of online shoppers say price-comparison sites influenced their last purchase, and every one of those comparison sites runs on scrapers.
For Python developers specifically, web scraping is a skill that creates direct business value. Price monitoring, lead enrichment, ML dataset construction, and news aggregation are all hiring categories that list web scraping as a core requirement.
Every scraping project follows the same five stages, regardless of which libraries you use. Getting all five right is what separates a scraper that runs once from one that runs in production.
1. Inspect the target. Before writing a single line of Python, open your browser's DevTools (F12) and examine the page's HTML structure. Look for the elements that contain your target data: their tag names, CSS classes, IDs, and data attributes. If the Network tab shows XHR or Fetch calls that return JSON, you can often hit those API endpoints directly with requests and skip HTML parsing entirely.
2. Fetch the page. Use requests for static HTML and Playwright or Scrapy for everything more complex. This step gets you the raw HTML string. If response.text contains the data you want, you have a static page and the simple stack is enough. If it looks like a skeleton with no content, the page renders via JavaScript and you need a browser automation tool.
3. Parse the HTML. BeautifulSoup and lxml turn the raw HTML string into a traversable document tree. You select elements by CSS selector (soup.select('.price')) or by tag and attribute (soup.find_all('a', class_='product-link')). XPath via lxml gives more precise control for deeply nested structures.
4. Handle pagination and errors. Real scrapers need to follow "next page" links, handle 404s gracefully, retry on 429 rate-limit responses, and deal with session cookies. Scrapy's AutoThrottle and retry middleware handle most of this automatically. For requests-based scrapers, you implement exponential backoff manually.
5. Store the results. For small runs, pandas.DataFrame.to_csv() is enough. For production pipelines that run on a schedule, write to SQLite (small) or PostgreSQL (large). JSON works for APIs that consume the data downstream; a storage sketch follows this list.
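As a minimal sketch of the storage stage, here is one way to append scraped records to a local SQLite file with pandas; the table name, database filename, and record fields below are illustrative, not from a specific project.

import sqlite3
import pandas as pd

# Illustrative records; in a real run these come from the parsing stage
records = [
    {'url': 'https://example.com/item/1', 'price': '51.77'},
    {'url': 'https://example.com/item/2', 'price': '53.74'},
]

df = pd.DataFrame(records)

# Append to a local SQLite file; the table is created on the first run
with sqlite3.connect('scrape_results.db') as conn:
    df.to_sql('prices', conn, if_exists='append', index=False)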
requests is one of the most downloaded third-party Python packages. It handles GET and POST requests, redirects, cookies, sessions, headers, timeouts, and authentication. Its one limitation: it downloads HTML exactly as the server sends it, with no JavaScript execution.
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
}
response = requests.get('https://books.toscrape.com/', headers=headers)
response.raise_for_status() # Raise an exception for 4xx/5xx
print(response.text[:500])

Always set a realistic User-Agent. The default python-requests/2.X string is blocked by virtually every anti-bot system on day one.
BeautifulSoup parses the HTML string that requests downloads and lets you navigate the document tree. Use CSS selectors for most extraction work: they're concise, match browser DevTools syntax, and handle most real-world pages.
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'lxml')
# CSS selector: all book titles on page 1
books = soup.select('article.product_pod h3 a')
for book in books:
    print(book['title'], book['href'])
# By class: all price elements
prices = soup.find_all('p', class_='price_color')
for price in prices:
    print(price.text)

Install both libraries with: pip install requests beautifulsoup4 lxml. The lxml parser is faster and more lenient with malformed HTML than Python's built-in html.parser.
lxml is the parser backend that makes BeautifulSoup faster, but you can also use it standalone with XPath queries. XPath is more expressive than CSS selectors for complex targeting (specific attribute values, parent-child constraints, text matching) and is the preferred option in Scrapy spiders.
from lxml import html
tree = html.fromstring(response.content)
prices = tree.xpath('//p[@class="price_color"]/text()')
print(prices)

Use lxml directly when you need maximum parsing speed on large documents or when XPath precision matters more than syntax brevity.
Scrapy is a full crawling framework, not just a parsing library. It manages a request queue, handles concurrency, respects robots.txt by default, and gives you a pipeline architecture for cleaning and storing data. It's the right choice when you need to crawl more than a few dozen pages systematically.
A minimal Scrapy spider for book titles:
import scrapy
class BooksSpider(scrapy.Spider):
    name = 'books'
    start_urls = ['https://books.toscrape.com/']

    def parse(self, response):
        for book in response.css('article.product_pod'):
            yield {
                'title': book.css('h3 a::attr(title)').get(),
                'price': book.css('.price_color::text').get(),
            }
        # Follow next-page link
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

Run it with scrapy crawl books -o books.csv. Add ROBOTSTXT_OBEY = True and DOWNLOAD_DELAY = 1 to your settings.py to respect the target site from the start. Scrapy's AutoThrottle extension adjusts request rates dynamically based on server latency.
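A minimal settings.py excerpt with those polite defaults plus AutoThrottle enabled; the specific delay values are illustrative starting points, not tuned recommendations.

# settings.py (excerpt): polite crawling defaults
ROBOTSTXT_OBEY = True        # respect the target site's robots.txt
DOWNLOAD_DELAY = 1           # base delay in seconds between requests to the same domain
AUTOTHROTTLE_ENABLED = True  # adjust the delay dynamically from server latency
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10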
Playwright launches a real browser (Chromium, Firefox, or WebKit) and gives you programmatic control over it. Unlike requests, it executes JavaScript, waits for dynamic content to load, interacts with forms, and handles cookies and sessions exactly as a real user would. It became the preferred alternative to Selenium after its 2020 release because of its async-first API and more reliable element detection.
Setup:
pip install playwright
playwright install chromium

A synchronous example:
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://example.com/products')
    page.wait_for_selector('.product-grid')
    products = page.query_selector_all('.product-card')
    for product in products:
        print(product.inner_text())
    browser.close()

For high-concurrency scraping, use the async API (async_playwright) with asyncio. Playwright is RAM-intensive: each browser instance uses 200-400MB. Spin up one instance and reuse it across pages rather than creating a new browser per URL.
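A sketch of that reuse pattern with the sync API: one browser for the whole run, one navigation per URL. The URLs and the selector are placeholders.

from playwright.sync_api import sync_playwright

urls = [
    'https://example.com/products?page=1',  # placeholder URLs
    'https://example.com/products?page=2',
]

with sync_playwright() as p:
    # One browser instance for the whole run, rather than one per URL
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    for url in urls:
        page.goto(url)
        page.wait_for_selector('.product-card')  # placeholder selector
        print(len(page.query_selector_all('.product-card')), 'products on', url)
    browser.close()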
| Library | Best For | JS Support | Learning Curve | Scale |
|---|---|---|---|---|
| requests | Static pages, REST APIs | No | Low | Medium |
| BeautifulSoup | HTML parsing (paired with requests) | No | Low | Medium |
| lxml | Fast parsing, XPath | No | Medium | High |
| Scrapy | Large-scale crawls, pipelines | Limited | High | Enterprise |
| Selenium | JS sites (legacy projects) | Yes | Medium | Medium |
| Playwright | JS sites (new projects) | Yes | Medium | High |
| httpx | Async HTTP without a browser | No | Low | High |
Modern single-page applications built with React, Vue, or Angular load data after the initial HTML response. When you scrape one of these sites with requests, you get the skeleton HTML with empty containers: the data hasn't loaded yet.
Your first move should always be the Network tab in DevTools. Filter by XHR or Fetch requests and look for JSON API calls that contain your target data.
If the site loads products from an internal API endpoint, hit that endpoint directly with requests and skip browser automation entirely. This approach is faster, cheaper on resources, and more stable than browser-based scraping.
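As an illustration, suppose DevTools revealed an internal endpoint that returns JSON; the URL, query parameter, and response field names below are hypothetical, but the pattern is the same for any JSON API you find in the Network tab.

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}

# Hypothetical internal endpoint discovered in the Network tab
response = requests.get(
    'https://spa-example.com/api/products?page=1',
    headers=headers,
    timeout=10,
)
response.raise_for_status()

data = response.json()  # already structured, no HTML parsing needed
for item in data.get('products', []):  # hypothetical response shape
    print(item.get('name'), item.get('price'))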
When there's no accessible API (data is rendered client-side with no clean network call), use Playwright:
import asyncio
from playwright.async_api import async_playwright
async def scrape_dynamic():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto('https://spa-example.com/products')
        # Wait for the product list to render
        await page.wait_for_selector('.product-list', timeout=10000)
        titles = await page.eval_on_selector_all(
            '.product-title', 'elements => elements.map(el => el.textContent)'
        )
        await browser.close()
        return titles

asyncio.run(scrape_dynamic())

The wait_for_selector call is critical. Without it, you capture the page before JavaScript finishes populating the DOM and get empty results.
About 30% of all web traffic comes from bots and scrapers. Every site worth scraping has seen this traffic pattern before, and most have implemented countermeasures. Here's how to work within them responsibly.
The single most important thing you can do is set a User-Agent that looks like a real browser. Go further and add the Accept headers a real browser sends:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
}

Sending 100 requests per second to a single domain is how you get banned within minutes. Add a random delay between requests to mimic human browsing behavior:
import time
import random
def polite_get(url, headers):
    time.sleep(random.uniform(1.5, 3.5))  # Random 1.5-3.5 second delay
    return requests.get(url, headers=headers)

In Scrapy, set DOWNLOAD_DELAY = 2 and enable AUTOTHROTTLE_ENABLED = True in settings.py. AutoThrottle adjusts the delay dynamically based on the server's response latency.
When a server returns 429 Too Many Requests, back off exponentially and retry:
from tenacity import retry, stop_after_attempt, wait_exponential
@retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=2, min=4, max=60))
def fetch_with_retry(url, headers):
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    return response

For large-scale scraping where IP rotation is necessary, use a residential proxy service like Bright Data, Oxylabs, or Zyte. These route your requests through different IP addresses to avoid IP-level bans.
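A minimal sketch of routing requests through a proxy pool with the requests library; the proxy URLs below are placeholders for whatever endpoints your provider gives you.

import random
import requests

# Placeholder proxy endpoints supplied by your proxy provider
PROXIES = [
    'http://user:pass@proxy1.example.com:8000',
    'http://user:pass@proxy2.example.com:8000',
]

def get_via_proxy(url, headers):
    proxy = random.choice(PROXIES)
    # Route both HTTP and HTTPS traffic through the chosen proxy
    return requests.get(
        url,
        headers=headers,
        proxies={'http': proxy, 'https': proxy},
        timeout=10,
    )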
For projects where you'd rather not manage proxies yourself, consider a scraping API like ScraperAPI or ScrapingBee, which handles proxies, CAPTCHAs, and JavaScript rendering through a single API endpoint.
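The integration pattern is usually a single GET to the provider's endpoint with your API key and the target URL as parameters. The endpoint and parameter names below are generic placeholders, not any specific provider's API; check your provider's docs for the exact format.

import requests

# Generic pattern for a managed scraping API; endpoint and params are placeholders
API_ENDPOINT = 'https://api.scraping-provider.example/scrape'
API_KEY = 'your-api-key'

response = requests.get(
    API_ENDPOINT,
    params={
        'api_key': API_KEY,
        'url': 'https://example.com/products',
        'render_js': 'true',  # many providers expose some JS-rendering flag
    },
    timeout=60,
)
response.raise_for_status()
html = response.text  # parse with BeautifulSoup as usual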
The hiQ v. LinkedIn decision in April 2022 held that scraping publicly available data does not violate the Computer Fraud and Abuse Act. But "not illegal" and "unrestricted" are different things.
Website Terms of Service still govern what you can do with the data, and violating them can expose you to civil claims.
Check robots.txt before you run a single request. Most sites publish this file at example.com/robots.txt; it lists which paths crawlers are allowed and disallowed to access.
In Scrapy, ROBOTSTXT_OBEY = True (the default) enforces this automatically. For requests-based scrapers, check it manually.
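For the manual check, the standard library's urllib.robotparser works; a minimal sketch, using an arbitrary user-agent name:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser('https://books.toscrape.com/robots.txt')
rp.read()  # fetch and parse the robots.txt file

user_agent = 'MyScraperBot'  # whatever name your scraper identifies as
url = 'https://books.toscrape.com/catalogue/page-2.html'

if rp.can_fetch(user_agent, url):
    print('Allowed to fetch', url)
else:
    print('Disallowed by robots.txt:', url)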
Respect the spirit of the rules beyond robots.txt. Don't scrape data behind an authentication wall without permission, and don't collect personal data without a lawful basis (GDPR applies even if your scraper runs outside Europe). Don't hammer a site's servers at a rate that degrades performance for real users.
When in doubt, look for an official API first. If the data is available via API, use the API. It's faster, more stable, and structurally documented.
| Tool | Type | Best For |
|---|---|---|
| Scrapy | Open-source framework | Production crawls, pipeline architecture |
| Playwright | Browser automation | JS-heavy sites, form interaction |
| ScraperAPI | Paid API | Managed proxies and CAPTCHA solving |
| ScrapingBee | Paid API | Headless browser via API, simpler integration |
| Bright Data | Proxy network | Enterprise IP rotation at scale |
| | Cloud platform | Hosting Scrapy/Playwright scrapers |
| Zyte | Cloud platform | Scrapy creators, managed crawl infrastructure |
| | Monitoring tool | Scraper health monitoring and proxy comparison |
Paid scraping APIs make sense when the cost of managing proxies and debugging blocks exceeds the API subscription cost. For intermittent or low-volume scraping, the open-source stack is sufficient.
The most common real-world scraping use case is price monitoring: tracking competitor or supplier prices across e-commerce sites to inform your own pricing decisions. 82% of e-commerce companies use some form of this.
A minimal price monitor for a product listing page might look like this:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
import time
import random
def get_price(url, headers):
    time.sleep(random.uniform(2, 4))
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'lxml')
    price_el = soup.select_one('.price_color')
    return price_el.text.strip() if price_el else None
urls = [
    'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
    'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
]
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
records = []
for url in urls:
    price = get_price(url, headers)
    records.append({'url': url, 'price': price, 'scraped_at': datetime.utcnow()})
df = pd.DataFrame(records)
df.to_csv('prices.csv', index=False)
print(df)

Schedule this with cron on Linux (or Task Scheduler on Windows) to run nightly and you have a functional competitive price tracker with zero infrastructure cost.
The default python-requests/2.32.0 User-Agent is blocked by virtually every anti-bot system. It announces "I'm a bot" before the server even checks your IP. Always set a realistic browser User-Agent header.
Scraping paths listed as Disallow in robots.txt risks IP bans and, depending on jurisdiction, civil liability. Check the file before scraping and filter your URL list accordingly.
Many beginners get their first scraper's IP banned within 10 minutes: the cause is sending requests with no delay. Even a 1-2 second random sleep cuts the block rate dramatically on most sites.
HTML structures change. A CSS selector like .product-price breaks the moment the site redesigns. Write selectors against stable attributes (data attributes, IDs, ARIA labels) when possible, and build monitoring that alerts you when extraction returns empty results.
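A small guard like the one below can catch silent breakage by failing loudly when extraction returns nothing; the data-testid selector is a hypothetical example of a stable attribute, with the article's .price_color class as a fallback.

from bs4 import BeautifulSoup

def extract_prices(html):
    soup = BeautifulSoup(html, 'lxml')
    # Prefer a stable data attribute (hypothetical) over a styling class
    elements = soup.select('[data-testid="price"]') or soup.select('.price_color')
    prices = [el.get_text(strip=True) for el in elements]
    if not prices:
        # Fail loudly so a scheduler or alerting hook can flag the stale selector
        raise ValueError('Price extraction returned no results; selectors may be stale')
    return prices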
Playwright launches a full Chromium browser process and uses 200-400MB of RAM per instance. Using it on a site that serves plain HTML is like driving a tank to the grocery store. Check whether requests + BeautifulSoup returns the data before reaching for browser automation.
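A quick probe like this, using a selector you already confirmed in DevTools, tells you whether the plain requests + BeautifulSoup stack is enough before you reach for Playwright:

import requests
from bs4 import BeautifulSoup

def static_fetch_works(url, css_selector, headers):
    # If the selector matches on the raw HTML, the page is static enough for requests
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'lxml')
    return len(soup.select(css_selector)) > 0

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
print(static_fetch_works('https://books.toscrape.com/', 'article.product_pod', headers))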
Production scrapers encounter 403s, 429s, 503s, timeouts, and malformed HTML. Without error handling, a single failed request crashes the whole run. Wrap requests in try/except, check response.raise_for_status(), and implement retry logic with exponential backoff.
BeautifulSoup is excellent for 10-100 pages. At 10,000 pages, you need Scrapy's concurrent request queue, built-in retry middleware, and pipeline architecture. Choosing the right tool for the scale of your project upfront prevents painful rewrites later.
Python web scraping is a layered skill. requests + BeautifulSoup covers the majority of real-world use cases: static pages, paginated listings, and sites with accessible JSON APIs. Playwright handles the rest, namely JavaScript-heavy SPAs where HTML alone doesn't contain the data. Scrapy is the tool you reach for when page count goes from dozens to thousands and you need a production architecture with pipelines, retry logic, and configurable concurrency.
Your next step: build a scraper against books.toscrape.com, a purpose-built practice site with no rate limiting or bot detection. Extract all 1,000 books across 50 pages, save them to a CSV, and you'll have worked through pagination, error handling, and data export in a single controlled exercise.
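A starting sketch for that exercise, reusing the selectors from the examples above and following the catalogue's "next" link until it runs out:

import time
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
url = 'https://books.toscrape.com/'
records = []

while url:
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'lxml')

    for book in soup.select('article.product_pod'):
        records.append({
            'title': book.select_one('h3 a')['title'],
            'price': book.select_one('.price_color').get_text(strip=True),
        })

    # Follow the "next" link until there isn't one
    next_link = soup.select_one('li.next a')
    url = urljoin(url, next_link['href']) if next_link else None
    time.sleep(1)  # stay polite even on a practice site

pd.DataFrame(records).to_csv('books.csv', index=False)
print(len(records), 'books saved')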
When you're ready to go deeper, explore the related Pynions guides on scraping specific platforms (Amazon, LinkedIn, Google). Topics covered include automating data pipelines with Python and deploying a Scrapy spider to a cloud server.
