Python Automation Master Part 3: Web Scraping Basics
Extracting Data from the Web with Python
Introduction: What is Web Scraping?
Web scraping is a technique for automatically extracting desired data from websites. Closely related to web crawling (automatically traversing pages by following links), it enables efficient collection of large amounts of web data and is used in fields such as data analysis, price monitoring, and news aggregation.
In this part, we will learn the fundamentals of web scraping using Python. We'll explore how to fetch web pages with the requests library and parse HTML with BeautifulSoup to extract the data we need step by step.
1. Legal/Ethical Considerations
Before starting web scraping, there are legal and ethical considerations you must know.
1.1 Checking robots.txt
robots.txt is a file located in the root directory of a website that specifies which pages web crawlers can access and which they should avoid.
# robots.txt example (https://example.com/robots.txt)
User-agent: *
Disallow: /private/
Disallow: /admin/
Allow: /public/
Crawl-delay: 10
- User-agent: Specifies which crawlers the rules apply to. * means all crawlers.
- Disallow: Specifies paths where crawling is prohibited.
- Allow: Specifies paths where crawling is allowed.
- Crawl-delay: Specifies the wait time (in seconds) between requests.
1.2 Web Scraping Ethics Guidelines
- Respect robots.txt: Always check and follow the website's robots.txt rules.
- Minimize server load: Add appropriate delays between requests to avoid overloading the server.
- Check Terms of Service: Review the website's terms of service for scraping-related clauses.
- Protect Personal Information: Do not collect personal information without authorization.
- Respect Copyright: Respect the copyright of collected data and use it appropriately.
1.3 Checking robots.txt with Python
from urllib.robotparser import RobotFileParser
from urllib.parse import urlparse

def check_robots_txt(url, user_agent='*'):
    """Checks robots.txt and returns crawling permission status."""
    # Build the robots.txt URL from the page URL
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    # Set up RobotFileParser
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    # Check whether crawling is allowed for this URL
    can_fetch = rp.can_fetch(user_agent, url)
    crawl_delay = rp.crawl_delay(user_agent)
    return {
        'can_fetch': can_fetch,
        'crawl_delay': crawl_delay
    }

# Usage example
result = check_robots_txt('https://www.google.com/search')
print(f"Crawling allowed: {result['can_fetch']}")
print(f"Crawl delay: {result['crawl_delay']} seconds")
2. The requests Library
requests is the most popular library for making HTTP requests in Python. It provides a simple and intuitive API for easily fetching web pages.
2.1 Installation
# Install using pip
pip install requests
2.2 HTTP Methods
Let's learn about the main HTTP methods and how to use them with requests.
GET Request
The most common request method, used to retrieve data from a server.
import requests
# Basic GET request
response = requests.get('https://httpbin.org/get')
print(response.text)
# GET request with query parameters
params = {
    'search': 'python',
    'page': 1,
    'limit': 10
}
response = requests.get('https://httpbin.org/get', params=params)
print(response.url) # https://httpbin.org/get?search=python&page=1&limit=10
POST Request
Used to send data to a server. You can send form data or JSON data.
import requests
# Send form data
form_data = {
    'username': 'user123',
    'password': 'pass456'
}
response = requests.post('https://httpbin.org/post', data=form_data)
print(response.json())
# Send JSON data
json_data = {
    'name': 'John Doe',
    'email': 'john@example.com'
}
response = requests.post('https://httpbin.org/post', json=json_data)
print(response.json())
Setting Headers
You can customize requests by setting HTTP headers like User-Agent.
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Referer': 'https://www.google.com/'
}
response = requests.get('https://httpbin.org/headers', headers=headers)
print(response.json())
2.3 Response Handling
requests provides various ways to process responses.
import requests
response = requests.get('https://httpbin.org/get')
# Check status code
print(f"Status code: {response.status_code}") # 200
# Check if successful
if response.ok:  # True if status_code is in the 200-299 range
    print("Request successful!")
# Response body (text)
print(response.text)
# Response body (JSON)
data = response.json()
print(data)
# Response body (binary) - for images, etc.
content = response.content
# Response headers
print(response.headers)
print(response.headers['Content-Type'])
# Encoding
print(response.encoding)  # Encoding inferred from the response headers (may be None)
response.encoding = 'utf-8'  # Override the encoding explicitly
HTTP Status Codes
| Status Code | Meaning | Description |
|---|---|---|
| 200 | OK | Request successful |
| 301 | Moved Permanently | Permanently redirected |
| 302 | Found | Temporary redirect |
| 400 | Bad Request | Invalid request |
| 403 | Forbidden | Access denied |
| 404 | Not Found | Page not found |
| 500 | Internal Server Error | Server internal error |
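The table generalizes: the first digit of a status code determines its class. A small helper (the name `status_category` is invented here for illustration) makes this explicit:

```python
def status_category(code):
    """Return a coarse label for an HTTP status code based on its range."""
    if 200 <= code < 300:
        return 'success'
    if 300 <= code < 400:
        return 'redirect'
    if 400 <= code < 500:
        return 'client error'
    if 500 <= code < 600:
        return 'server error'
    return 'unknown'
```

This is essentially what `response.ok` does internally for the 2xx range, and what `raise_for_status()` checks for the 4xx/5xx ranges.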
2.4 Error Handling and Timeout
import requests
from requests.exceptions import RequestException, Timeout, HTTPError
def safe_request(url, timeout=10):
    """Performs a safe HTTP request."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # Raises HTTPError for 4xx/5xx responses
        return response
    except Timeout:
        print(f"Timeout occurred: {url}")
    except HTTPError as e:
        print(f"HTTP error: {e.response.status_code}")
    except RequestException as e:
        print(f"Request failed: {e}")
    return None
# Usage example
response = safe_request('https://httpbin.org/get')
if response:
    print(response.text)
3. HTML Parsing with BeautifulSoup
BeautifulSoup is a Python library for parsing HTML and XML documents. It makes it easy to extract desired data from complex HTML structures.
3.1 Installation
# Install BeautifulSoup and lxml parser
pip install beautifulsoup4 lxml
3.2 Basic Usage
from bs4 import BeautifulSoup
import requests
# Fetch web page
url = 'https://example.com'
response = requests.get(url)
# Create BeautifulSoup object
soup = BeautifulSoup(response.text, 'lxml') # or 'html.parser'
# View HTML structure (pretty print)
print(soup.prettify())
3.3 Finding Elements by Tag
from bs4 import BeautifulSoup
html = """
<html>
<head><title>Test Page</title></head>
<body>
<h1>Main Title</h1>
<p>First paragraph</p>
<p>Second paragraph</p>
<a href="https://example.com">Link1</a>
<a href="https://google.com">Link2</a>
</body>
</html>
"""
soup = BeautifulSoup(html, 'lxml')
# Find first tag
title = soup.find('title')
print(title.text) # Test Page
h1 = soup.find('h1')
print(h1.text) # Main Title
# Find all tags
paragraphs = soup.find_all('p')
for p in paragraphs:
    print(p.text)

links = soup.find_all('a')
for link in links:
    print(link.text, link['href'])
3.4 Finding Elements by Class and ID
from bs4 import BeautifulSoup
html = """
<html>
<body>
<div id="header">Header Area</div>
<div class="content">
<p class="intro">Introduction text</p>
<p class="main-text">Main content</p>
</div>
<div class="sidebar">Sidebar</div>
<ul class="menu">
<li class="menu-item active">Home</li>
<li class="menu-item">About</li>
<li class="menu-item">Contact</li>
</ul>
</body>
</html>
"""
soup = BeautifulSoup(html, 'lxml')
# Find by ID
header = soup.find(id='header')
print(header.text) # Header Area
# Find by class
content = soup.find(class_='content')
print(content.text)
# Find multiple elements by class name
menu_items = soup.find_all(class_='menu-item')
for item in menu_items:
    print(item.text)
# Find with compound conditions
active_item = soup.find('li', class_='active')
print(active_item.text) # Home
# Use CSS selectors (select)
main_text = soup.select_one('.content .main-text')
print(main_text.text) # Main content
all_menu_items = soup.select('ul.menu li')
for item in all_menu_items:
    print(item.text)
3.5 Accessing Attributes and Extracting Text
from bs4 import BeautifulSoup
html = """
<html>
<body>
<a href="https://example.com" title="Example Link" data-id="123">
<span>Link</span> text
</a>
<img src="image.jpg" alt="Image description">
<div class="article">
<h2>Title</h2>
<p>First paragraph</p>
<p>Second paragraph</p>
</div>
</body>
</html>
"""
soup = BeautifulSoup(html, 'lxml')
# Access attributes
link = soup.find('a')
print(link['href']) # https://example.com
print(link['title']) # Example Link
print(link.get('data-id')) # 123
print(link.get('class')) # None (returns None if not found)
# Get all attributes
print(link.attrs) # {'href': 'https://example.com', 'title': 'Example Link', 'data-id': '123'}
# Image tag attributes
img = soup.find('img')
print(img['src']) # image.jpg
print(img['alt']) # Image description
# Extract text
print(link.text) # Link text (includes all child text)
print(link.string) # None (None if multiple children)
print(link.get_text()) # Link text
# Strip whitespace
print(link.get_text(strip=True)) # Linktext
# Join with separator
article = soup.find(class_='article')
print(article.get_text(separator=' | ', strip=True))
# Title | First paragraph | Second paragraph
3.6 Advanced CSS Selector Usage
from bs4 import BeautifulSoup
html = """
<html>
<body>
<table id="data-table">
<tr><th>Name</th><th>Age</th><th>Job</th></tr>
<tr><td>John</td><td>30</td><td>Developer</td></tr>
<tr><td>Jane</td><td>25</td><td>Designer</td></tr>
<tr><td>Bob</td><td>35</td><td>Manager</td></tr>
</table>
<div class="products">
<div class="product" data-price="10000">
<span class="name">Product A</span>
</div>
<div class="product" data-price="20000">
<span class="name">Product B</span>
</div>
</div>
</body>
</html>
"""
soup = BeautifulSoup(html, 'lxml')
# Descendant selector (space)
cells = soup.select('#data-table td')
for cell in cells:
    print(cell.text)
# Direct child selector (>)
rows = soup.select('#data-table > tr')
# Attribute selector
products = soup.select('div[data-price]')
for product in products:
    name = product.select_one('.name').text
    price = product['data-price']
    print(f"{name}: ${price}")
# nth element selector
first_row = soup.select_one('#data-table tr:nth-child(2)') # First data row
print([td.text for td in first_row.find_all('td')])
# Compound selector
products_with_high_price = soup.select('.product[data-price="20000"]')
for product in products_with_high_price:
    print(product.select_one('.name').text)
4. Practical Examples: Building a Simple Scraper
Let's combine what we've learned to build practical scrapers.
4.1 News Headline Collector
import requests
from bs4 import BeautifulSoup
import time
class NewsHeadlineScraper:
    """Scraper that collects news headlines"""

    def __init__(self):
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        self.session = requests.Session()
        self.session.headers.update(self.headers)

    def fetch_page(self, url):
        """Fetches a web page."""
        try:
            response = self.session.get(url, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.RequestException as e:
            print(f"Page request failed: {e}")
            return None

    def parse_headlines(self, html, selector):
        """Extracts headlines from HTML."""
        soup = BeautifulSoup(html, 'lxml')
        headlines = []
        elements = soup.select(selector)
        for element in elements:
            title = element.get_text(strip=True)
            link = element.get('href', '')
            if title:
                headlines.append({
                    'title': title,
                    'link': link
                })
        return headlines

    def scrape(self, url, selector, delay=1):
        """Scrapes headlines from the given URL."""
        print(f"Starting scrape: {url}")
        html = self.fetch_page(url)
        if not html:
            return []
        headlines = self.parse_headlines(html, selector)
        print(f"Found {len(headlines)} headlines.")
        time.sleep(delay)  # Prevent server overload
        return headlines
# Usage example (check the site's terms of service before actual use)
if __name__ == '__main__':
    scraper = NewsHeadlineScraper()
    # Example (modify URL and selector for the target site)
    headlines = scraper.scrape(
        url='https://example.com/news',
        selector='a.headline'
    )
    for idx, headline in enumerate(headlines, 1):
        print(f"{idx}. {headline['title']}")
        print(f"   Link: {headline['link']}")
4.2 Table Data Extractor
import requests
from bs4 import BeautifulSoup
import csv
def extract_table_data(url, table_selector='table'):
    """Extracts table data from a web page."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'lxml')
    table = soup.select_one(table_selector)
    if not table:
        print("Table not found.")
        return []
    data = []
    rows = table.find_all('tr')
    for row in rows:
        # Extract header cells or data cells
        cells = row.find_all(['th', 'td'])
        row_data = [cell.get_text(strip=True) for cell in cells]
        if row_data:
            data.append(row_data)
    return data

def save_to_csv(data, filename):
    """Saves data to a CSV file."""
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerows(data)
    print(f"Data saved to {filename}.")
# Usage example
if __name__ == '__main__':
    # Example URL (change to an appropriate URL for actual use)
    data = extract_table_data(
        url='https://example.com/data-table',
        table_selector='#main-table'
    )
    if data:
        for row in data[:5]:  # Print only the first 5 rows
            print(row)
        save_to_csv(data, 'extracted_data.csv')
4.3 Image Downloader
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
import os
import time
class ImageDownloader:
    """Class that downloads images from a web page"""

    def __init__(self, download_dir='images'):
        self.download_dir = download_dir
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        # Create the download directory if it does not exist
        os.makedirs(download_dir, exist_ok=True)

    def get_image_urls(self, page_url, img_selector='img'):
        """Extracts image URLs from a page."""
        response = requests.get(page_url, headers=self.headers, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'lxml')
        images = soup.select(img_selector)
        image_urls = []
        for img in images:
            src = img.get('src') or img.get('data-src')
            if src:
                # Convert a relative URL to an absolute URL
                full_url = urljoin(page_url, src)
                image_urls.append(full_url)
        return image_urls

    def download_image(self, url, filename=None):
        """Downloads an image."""
        try:
            response = requests.get(url, headers=self.headers, timeout=30)
            response.raise_for_status()
            # If no filename was given, derive one from the URL
            if not filename:
                parsed = urlparse(url)
                filename = os.path.basename(parsed.path) or 'image.jpg'
            filepath = os.path.join(self.download_dir, filename)
            with open(filepath, 'wb') as f:
                f.write(response.content)
            print(f"Download complete: {filename}")
            return filepath
        except requests.RequestException as e:
            print(f"Download failed ({url}): {e}")
            return None

    def download_all(self, page_url, img_selector='img', delay=1):
        """Downloads all images from a page."""
        image_urls = self.get_image_urls(page_url, img_selector)
        print(f"Found {len(image_urls)} images.")
        downloaded = []
        for idx, url in enumerate(image_urls, 1):
            print(f"[{idx}/{len(image_urls)}] Downloading...")
            filepath = self.download_image(url)
            if filepath:
                downloaded.append(filepath)
            time.sleep(delay)  # Prevent server overload
        return downloaded
# Usage example
if __name__ == '__main__':
    downloader = ImageDownloader(download_dir='downloaded_images')
    # Example URL (change to an appropriate URL for actual use)
    downloaded_files = downloader.download_all(
        page_url='https://example.com/gallery',
        img_selector='div.gallery img',
        delay=2
    )
    print(f"\nTotal {len(downloaded_files)} images downloaded.")
5. Tips and Precautions
5.1 Tips for Efficient Scraping
- Use Session: When making multiple requests to the same site, use requests.Session() to reuse connections.
- Appropriate Delays: Use time.sleep() between requests to avoid overloading the server.
- Error Handling: Add appropriate exception handling for network errors, timeouts, etc.
- Utilize Caching: Cache results to avoid making repeated requests to the same page.
- Logging: Log the scraping process for debugging when issues occur.
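The first three tips can be combined in one small helper. The sketch below is illustrative, not a production crawler: `PoliteFetcher` is a name invented here, the cache is a plain in-memory dict, and the `get` parameter exists so the class can be exercised without touching the network:

```python
import time
import requests

class PoliteFetcher:
    """Reuses one session, caches responses per URL, and throttles requests."""

    def __init__(self, delay=1.0, get=None):
        self.session = requests.Session()
        self.delay = delay
        self.cache = {}
        # Allow injecting a custom getter (useful for offline testing)
        self._get = get or self.session.get
        self._last_request = 0.0

    def fetch(self, url):
        if url in self.cache:
            return self.cache[url]  # Cache hit: no request, no delay
        # Wait out the remainder of the delay since the last request
        wait = self.delay - (time.monotonic() - self._last_request)
        if wait > 0:
            time.sleep(wait)
        response = self._get(url)
        self._last_request = time.monotonic()
        self.cache[url] = response
        return response
```

Repeated calls for the same URL hit the cache and return instantly; only genuinely new URLs cost a request and a delay.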
5.2 Common Problems and Solutions
- Encoding Issues: Explicitly set response.encoding or use the chardet library.
- 403 Forbidden: Set the User-Agent header or add other headers.
- Dynamic Content: Content loaded by JavaScript cannot be fetched with requests. We'll learn how to solve this with Selenium in the next part.
- IP Blocking: Reduce request frequency or use proxies.
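For transient failures and gentle recovery after rate limiting, a retry loop with exponential backoff is a common pattern. A minimal sketch (`get_with_retry` is a name invented here; the injectable `get` parameter is only there so the function can be tested offline):

```python
import time
import requests

def get_with_retry(url, retries=3, backoff=2.0, get=requests.get):
    """GET a URL, retrying on failure with exponentially growing waits."""
    for attempt in range(retries):
        try:
            response = get(url, timeout=10)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == retries - 1:
                raise  # Out of retries: propagate the last error
            # Wait backoff, 2*backoff, 4*backoff, ... between attempts
            time.sleep(backoff * (2 ** attempt))
```

Each failed attempt doubles the wait, so a briefly overloaded or rate-limiting server gets progressively more breathing room before the next try.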
Conclusion
In this part, we learned the fundamentals of Python web scraping. We explored how to fetch web pages with the requests library and parse HTML with BeautifulSoup to extract desired data.
In the next Part 4, we'll cover more complex web scraping scenarios. We'll learn about Selenium for handling dynamically generated content with JavaScript, scraping sites that require login, and various methods for storing collected data.
Series Notice: The Python Automation Master series continues. Stay tuned for more advanced web scraping techniques in the next part!