Introduction: What is Web Scraping?

Web scraping is a technique for automatically extracting desired data from websites. It is closely related to web crawling (crawling focuses on discovering and traversing pages, while scraping focuses on extracting data from them) and enables efficient collection of vast amounts of web data, used in fields such as data analysis, price monitoring, and news aggregation.

In this part, we will learn the fundamentals of web scraping using Python. We'll explore how to fetch web pages with the requests library and parse HTML with BeautifulSoup to extract the data we need step by step.

1. Legal/Ethical Considerations

Before starting web scraping, there are legal and ethical considerations you must know.

1.1 Checking robots.txt

robots.txt is a file located in the root directory of a website that specifies which pages web crawlers can access and which they should avoid.

# robots.txt example (https://example.com/robots.txt)
User-agent: *
Disallow: /private/
Disallow: /admin/
Allow: /public/

Crawl-delay: 10

  • User-agent: Specifies which crawlers the rules apply to. * means all crawlers.
  • Disallow: Specifies paths where crawling is prohibited.
  • Allow: Specifies paths where crawling is allowed.
  • Crawl-delay: Specifies the wait time (in seconds) between requests.

1.2 Web Scraping Ethics Guidelines

  • Respect robots.txt: Always check and follow the website's robots.txt rules.
  • Minimize server load: Add appropriate delays between requests to avoid overloading the server.
  • Check Terms of Service: Review the website's terms of service for scraping-related clauses.
  • Protect Personal Information: Do not collect personal information without authorization.
  • Respect Copyright: Respect the copyright of collected data and use it appropriately.
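The "minimize server load" guideline above can be sketched as a small rate limiter that enforces a minimum gap between successive requests (a minimal illustration; the delay value is a placeholder — honor the site's Crawl-delay when one is given):

```python
import time

class RateLimiter:
    """Enforces a minimum delay between successive requests."""

    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay
        self._last = 0.0

    def wait(self):
        """Sleep just long enough to honor min_delay, then record the call."""
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)
        self._last = time.monotonic()

# Usage: call limiter.wait() before each requests.get(...)
limiter = RateLimiter(min_delay=0.1)
start = time.monotonic()
for _ in range(3):
    limiter.wait()
duration = time.monotonic() - start  # first call is free; two waits follow
```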

1.3 Checking robots.txt with Python

from urllib.robotparser import RobotFileParser

def check_robots_txt(url, user_agent='*'):
    """Checks robots.txt and returns crawling permission status."""
    # Generate robots.txt URL
    from urllib.parse import urlparse
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"

    # Set up RobotFileParser
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()

    # Check if crawling is allowed for the URL
    can_fetch = rp.can_fetch(user_agent, url)
    crawl_delay = rp.crawl_delay(user_agent)

    return {
        'can_fetch': can_fetch,
        'crawl_delay': crawl_delay
    }

# Usage example
result = check_robots_txt('https://www.google.com/search')
print(f"Crawling allowed: {result['can_fetch']}")
print(f"Crawl delay: {result['crawl_delay']} seconds")  # None if no Crawl-delay is specified

2. The requests Library

requests is the most popular library for making HTTP requests in Python. It provides a simple and intuitive API for easily fetching web pages.

2.1 Installation

# Install using pip
pip install requests

2.2 HTTP Methods

Let's learn about the main HTTP methods and how to use them with requests.

GET Request

The most common request method, used to retrieve data from a server.

import requests

# Basic GET request
response = requests.get('https://httpbin.org/get')
print(response.text)

# GET request with query parameters
params = {
    'search': 'python',
    'page': 1,
    'limit': 10
}
response = requests.get('https://httpbin.org/get', params=params)
print(response.url)  # https://httpbin.org/get?search=python&page=1&limit=10

POST Request

Used to send data to a server. You can send form data or JSON data.

import requests

# Send form data
form_data = {
    'username': 'user123',
    'password': 'pass456'
}
response = requests.post('https://httpbin.org/post', data=form_data)
print(response.json())

# Send JSON data
json_data = {
    'name': 'John Doe',
    'email': 'john@example.com'
}
response = requests.post('https://httpbin.org/post', json=json_data)
print(response.json())

Setting Headers

You can customize requests by setting HTTP headers like User-Agent.

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Referer': 'https://www.google.com/'
}

response = requests.get('https://httpbin.org/headers', headers=headers)
print(response.json())

2.3 Response Handling

requests provides various ways to process responses.

import requests

response = requests.get('https://httpbin.org/get')

# Check status code
print(f"Status code: {response.status_code}")  # 200

# Check if successful
if response.ok:  # True if status_code is less than 400
    print("Request successful!")

# Response body (text)
print(response.text)

# Response body (JSON)
data = response.json()
print(data)

# Response body (binary) - for images, etc.
content = response.content

# Response headers
print(response.headers)
print(response.headers['Content-Type'])

# Encoding
print(response.encoding)  # UTF-8
response.encoding = 'utf-8'  # Set encoding

HTTP Status Codes

Status Code | Meaning               | Description
200         | OK                    | Request successful
301         | Moved Permanently     | Permanently redirected
302         | Found                 | Temporarily redirected
400         | Bad Request           | Invalid request
403         | Forbidden             | Access denied
404         | Not Found             | Page not found
500         | Internal Server Error | Server-side error
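Redirect statuses like 301 and 302 are followed automatically by requests; the intermediate responses are kept in response.history, and automatic following can be turned off with allow_redirects=False (httpbin URLs used purely for illustration):

```python
import requests

# requests follows redirects by default; the intermediate responses
# are collected in response.history (oldest first).
response = requests.get('https://httpbin.org/redirect/2', timeout=10)
print(response.status_code)          # final status after redirects
for hop in response.history:
    print(hop.status_code, hop.url)  # the 302 hops along the way

# Disable automatic following to see the raw redirect response
raw = requests.get('https://httpbin.org/redirect/1',
                   allow_redirects=False, timeout=10)
print(raw.status_code)               # the redirect status itself
print(raw.headers.get('Location'))   # where it points
```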

2.4 Error Handling and Timeout

import requests
from requests.exceptions import RequestException, Timeout, HTTPError

def safe_request(url, timeout=10):
    """Performs a safe HTTP request."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # Raises exception for 4xx, 5xx errors
        return response
    except Timeout:
        print(f"Timeout occurred: {url}")
    except HTTPError as e:
        print(f"HTTP error: {e.response.status_code}")
    except RequestException as e:
        print(f"Request failed: {e}")
    return None

# Usage example
response = safe_request('https://httpbin.org/get')
if response:
    print(response.text)
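For transient failures (for example 503s from a busy server), a Session can be configured to retry automatically with exponential backoff. This is a sketch using the urllib3 Retry class that requests bundles; the retry counts and status list are example values to tune per site:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_retrying_session(retries=3, backoff=0.5):
    """Build a Session that retries failed requests with exponential backoff."""
    retry = Retry(
        total=retries,
        backoff_factor=backoff,                       # ~0.5s, 1s, 2s between attempts
        status_forcelist=[429, 500, 502, 503, 504],   # retry on these statuses
        allowed_methods=['GET', 'HEAD'],              # only retry idempotent methods
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session

session = make_retrying_session()
# session.get(url) now retries transient failures before raising
```

Note that allowed_methods requires urllib3 1.26 or newer (older versions called it method_whitelist).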

3. HTML Parsing with BeautifulSoup

BeautifulSoup is a Python library for parsing HTML and XML documents. It makes it easy to extract desired data from complex HTML structures.

3.1 Installation

# Install BeautifulSoup and lxml parser
pip install beautifulsoup4 lxml

3.2 Basic Usage

from bs4 import BeautifulSoup
import requests

# Fetch web page
url = 'https://example.com'
response = requests.get(url)

# Create BeautifulSoup object
soup = BeautifulSoup(response.text, 'lxml')  # or 'html.parser'

# View HTML structure (pretty print)
print(soup.prettify())

3.3 Finding Elements by Tag

from bs4 import BeautifulSoup

html = """
<html>
<head><title>Test Page</title></head>
<body>
    <h1>Main Title</h1>
    <p>First paragraph</p>
    <p>Second paragraph</p>
    <a href="https://example.com">Link1</a>
    <a href="https://google.com">Link2</a>
</body>
</html>
"""

soup = BeautifulSoup(html, 'lxml')

# Find first tag
title = soup.find('title')
print(title.text)  # Test Page

h1 = soup.find('h1')
print(h1.text)  # Main Title

# Find all tags
paragraphs = soup.find_all('p')
for p in paragraphs:
    print(p.text)

links = soup.find_all('a')
for link in links:
    print(link.text, link['href'])

3.4 Finding Elements by Class and ID

from bs4 import BeautifulSoup

html = """
<html>
<body>
    <div id="header">Header Area</div>
    <div class="content">
        <p class="intro">Introduction text</p>
        <p class="main-text">Main content</p>
    </div>
    <div class="sidebar">Sidebar</div>
    <ul class="menu">
        <li class="menu-item active">Home</li>
        <li class="menu-item">About</li>
        <li class="menu-item">Contact</li>
    </ul>
</body>
</html>
"""

soup = BeautifulSoup(html, 'lxml')

# Find by ID
header = soup.find(id='header')
print(header.text)  # Header Area

# Find by class
content = soup.find(class_='content')
print(content.text)

# Find multiple elements by class name
menu_items = soup.find_all(class_='menu-item')
for item in menu_items:
    print(item.text)

# Find with compound conditions
active_item = soup.find('li', class_='active')
print(active_item.text)  # Home

# Use CSS selectors (select)
main_text = soup.select_one('.content .main-text')
print(main_text.text)  # Main content

all_menu_items = soup.select('ul.menu li')
for item in all_menu_items:
    print(item.text)

3.5 Accessing Attributes and Extracting Text

from bs4 import BeautifulSoup

html = """
<html>
<body>
    <a href="https://example.com" title="Example Link" data-id="123">
        <span>Link</span> text
    </a>
    <img src="image.jpg" alt="Image description">
    <div class="article">
        <h2>Title</h2>
        <p>First paragraph</p>
        <p>Second paragraph</p>
    </div>
</body>
</html>
"""

soup = BeautifulSoup(html, 'lxml')

# Access attributes
link = soup.find('a')
print(link['href'])        # https://example.com
print(link['title'])       # Example Link
print(link.get('data-id')) # 123
print(link.get('class'))   # None (returns None if not found)

# Get all attributes
print(link.attrs)  # {'href': 'https://example.com', 'title': 'Example Link', 'data-id': '123'}

# Image tag attributes
img = soup.find('img')
print(img['src'])  # image.jpg
print(img['alt'])  # Image description

# Extract text
print(link.text)          # Link text (includes all child text)
print(link.string)        # None (None if multiple children)
print(link.get_text())    # Link text

# Strip whitespace
print(link.get_text(strip=True))  # Linktext

# Join with separator
article = soup.find(class_='article')
print(article.get_text(separator=' | ', strip=True))
# Title | First paragraph | Second paragraph

3.6 Advanced CSS Selector Usage

from bs4 import BeautifulSoup

html = """
<html>
<body>
    <table id="data-table">
        <tr><th>Name</th><th>Age</th><th>Job</th></tr>
        <tr><td>John</td><td>30</td><td>Developer</td></tr>
        <tr><td>Jane</td><td>25</td><td>Designer</td></tr>
        <tr><td>Bob</td><td>35</td><td>Manager</td></tr>
    </table>
    <div class="products">
        <div class="product" data-price="10000">
            <span class="name">Product A</span>
        </div>
        <div class="product" data-price="20000">
            <span class="name">Product B</span>
        </div>
    </div>
</body>
</html>
"""

soup = BeautifulSoup(html, 'lxml')

# Descendant selector (space)
cells = soup.select('#data-table td')
for cell in cells:
    print(cell.text)

# Direct child selector (>)
# Note: some parsers (e.g. html5lib, like browsers) insert an implicit
# <tbody> around table rows; if this matches nothing, use '#data-table tr'.
rows = soup.select('#data-table > tr')

# Attribute selector
products = soup.select('div[data-price]')
for product in products:
    name = product.select_one('.name').text
    price = product['data-price']
    print(f"{name}: ${price}")

# nth element selector
first_row = soup.select_one('#data-table tr:nth-child(2)')  # First data row
print([td.text for td in first_row.find_all('td')])

# Compound selector
products_with_high_price = soup.select('.product[data-price="20000"]')
for product in products_with_high_price:
    print(product.select_one('.name').text)

4. Practical Examples: Building a Simple Scraper

Let's combine what we've learned to build practical scrapers.

4.1 News Headline Collector

import requests
from bs4 import BeautifulSoup
import time

class NewsHeadlineScraper:
    """Scraper that collects news headlines"""

    def __init__(self):
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        self.session = requests.Session()
        self.session.headers.update(self.headers)

    def fetch_page(self, url):
        """Fetches a web page."""
        try:
            response = self.session.get(url, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.RequestException as e:
            print(f"Page request failed: {e}")
            return None

    def parse_headlines(self, html, selector):
        """Extracts headlines from HTML."""
        soup = BeautifulSoup(html, 'lxml')
        headlines = []

        elements = soup.select(selector)
        for element in elements:
            title = element.get_text(strip=True)
            link = element.get('href', '')
            if title:
                headlines.append({
                    'title': title,
                    'link': link
                })

        return headlines

    def scrape(self, url, selector, delay=1):
        """Scrapes headlines from the given URL."""
        print(f"Starting scrape: {url}")

        html = self.fetch_page(url)
        if not html:
            return []

        headlines = self.parse_headlines(html, selector)
        print(f"Found {len(headlines)} headlines.")

        time.sleep(delay)  # Prevent server overload
        return headlines

# Usage example (check the site's terms of service before actual use)
if __name__ == '__main__':
    scraper = NewsHeadlineScraper()

    # Example (modify URL and selector for target site)
    headlines = scraper.scrape(
        url='https://example.com/news',
        selector='a.headline'
    )

    for idx, headline in enumerate(headlines, 1):
        print(f"{idx}. {headline['title']}")
        print(f"   Link: {headline['link']}")

4.2 Table Data Extractor

import requests
from bs4 import BeautifulSoup
import csv

def extract_table_data(url, table_selector='table'):
    """Extracts table data from a web page."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }

    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, 'lxml')
    table = soup.select_one(table_selector)

    if not table:
        print("Table not found.")
        return []

    data = []
    rows = table.find_all('tr')

    for row in rows:
        # Extract header cells or data cells
        cells = row.find_all(['th', 'td'])
        row_data = [cell.get_text(strip=True) for cell in cells]
        if row_data:
            data.append(row_data)

    return data

def save_to_csv(data, filename):
    """Saves data to a CSV file."""
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerows(data)
    print(f"Data saved to {filename}.")

# Usage example
if __name__ == '__main__':
    # Example URL (change to appropriate URL for actual use)
    data = extract_table_data(
        url='https://example.com/data-table',
        table_selector='#main-table'
    )

    if data:
        for row in data[:5]:  # Print only first 5 rows
            print(row)

        save_to_csv(data, 'extracted_data.csv')

4.3 Image Downloader

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
import os
import time

class ImageDownloader:
    """Class that downloads images from a web page"""

    def __init__(self, download_dir='images'):
        self.download_dir = download_dir
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }

        # Create download directory
        if not os.path.exists(download_dir):
            os.makedirs(download_dir)

    def get_image_urls(self, page_url, img_selector='img'):
        """Extracts image URLs from a page."""
        response = requests.get(page_url, headers=self.headers, timeout=10)
        response.raise_for_status()

        soup = BeautifulSoup(response.text, 'lxml')
        images = soup.select(img_selector)

        image_urls = []
        for img in images:
            src = img.get('src') or img.get('data-src')
            if src:
                # Convert relative URL to absolute URL
                full_url = urljoin(page_url, src)
                image_urls.append(full_url)

        return image_urls

    def download_image(self, url, filename=None):
        """Downloads an image."""
        try:
            response = requests.get(url, headers=self.headers, timeout=30)
            response.raise_for_status()

            # If no filename, extract from URL
            if not filename:
                parsed = urlparse(url)
                filename = os.path.basename(parsed.path) or 'image.jpg'

            filepath = os.path.join(self.download_dir, filename)

            with open(filepath, 'wb') as f:
                f.write(response.content)

            print(f"Download complete: {filename}")
            return filepath

        except requests.RequestException as e:
            print(f"Download failed ({url}): {e}")
            return None

    def download_all(self, page_url, img_selector='img', delay=1):
        """Downloads all images from a page."""
        image_urls = self.get_image_urls(page_url, img_selector)
        print(f"Found {len(image_urls)} images.")

        downloaded = []
        for idx, url in enumerate(image_urls, 1):
            print(f"[{idx}/{len(image_urls)}] Downloading...")
            filepath = self.download_image(url)
            if filepath:
                downloaded.append(filepath)
            time.sleep(delay)  # Prevent server overload

        return downloaded

# Usage example
if __name__ == '__main__':
    downloader = ImageDownloader(download_dir='downloaded_images')

    # Example URL (change to appropriate URL for actual use)
    downloaded_files = downloader.download_all(
        page_url='https://example.com/gallery',
        img_selector='div.gallery img',
        delay=2
    )

    print(f"\nTotal {len(downloaded_files)} images downloaded.")

5. Tips and Precautions

5.1 Tips for Efficient Scraping

  • Use Session: When making multiple requests to the same site, use requests.Session() to reuse connections.
  • Appropriate Delays: Use time.sleep() between requests to avoid overloading the server.
  • Error Handling: Add appropriate exception handling for network errors, timeouts, etc.
  • Utilize Caching: Cache results to avoid making repeated requests to the same page.
  • Logging: Log the scraping process for debugging when issues occur.
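The caching tip above can be sketched as a tiny in-memory cache keyed by URL (a minimal illustration with a hypothetical cached_fetch helper; real projects might prefer the requests-cache package or an on-disk store):

```python
def cached_fetch(url, fetch, cache={}):
    """Fetch a URL at most once; later calls for the same URL hit the cache.

    The shared mutable default acts as a simple process-wide cache here;
    pass an explicit dict if you need isolated caches.
    """
    if url not in cache:
        cache[url] = fetch(url)
    return cache[url]

# Real usage (network):
#   text = cached_fetch(url, lambda u: requests.get(u, timeout=10).text)

# Demonstration with a stand-in fetch function:
calls = []
def fake_fetch(u):
    calls.append(u)          # record how often the network would be hit
    return 'page:' + u

first = cached_fetch('https://example.com', fake_fetch)
second = cached_fetch('https://example.com', fake_fetch)  # served from cache
```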

5.2 Common Problems and Solutions

  • Encoding Issues: Explicitly set response.encoding or use the chardet library.
  • 403 Forbidden: Set the User-Agent header or add other headers.
  • Dynamic Content: Content loaded by JavaScript cannot be fetched with requests. We'll learn how to solve this with Selenium in the next part.
  • IP Blocking: Reduce request frequency or use proxies.
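For the encoding issue above, requests can re-detect the charset from the response body itself via response.apparent_encoding (backed by the bundled charset detection library), which is often enough without pulling in chardet directly:

```python
import requests

response = requests.get('https://httpbin.org/encoding/utf8', timeout=10)

# If the decoded text looks garbled, the declared charset may be wrong or
# missing (requests falls back to ISO-8859-1 for text/* without a charset);
# apparent_encoding re-detects the encoding from the raw bytes.
if not response.encoding or response.encoding.lower() == 'iso-8859-1':
    response.encoding = response.apparent_encoding

print(response.text[:30])
```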

Conclusion

In this part, we learned the fundamentals of Python web scraping. We explored how to fetch web pages with the requests library and parse HTML with BeautifulSoup to extract desired data.

In the next Part 4, we'll cover more complex web scraping scenarios. We'll learn about Selenium for handling dynamically generated content with JavaScript, scraping sites that require login, and various methods for storing collected data.

Series Notice: The Python Automation Master series continues. Stay tuned for more advanced web scraping techniques in the next part!