Introduction: The Importance of Spreadsheet Automation

Excel and CSV files are the most widely used data formats in business environments. Every day, vast amounts of data are created, processed, and reported in spreadsheet form. By automating these repetitive tasks with Python, you can save time and reduce human errors.

In this Part 5, we'll learn how to efficiently process CSV and Excel files using Python's csv module, pandas, and openpyxl libraries. We'll cover practical skills you can immediately apply in your work, from basic file reading/writing to cell styling, chart creation, and handling large files.

1. CSV File Processing Basics (csv Module)

Python's built-in csv module allows you to process CSV files without any additional installation.

1.1 Reading CSV Files

import csv

# Basic CSV reading
with open('data.csv', 'r', encoding='utf-8') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)  # Each row is returned as a list

# Reading as dictionary (header-based)
with open('data.csv', 'r', encoding='utf-8') as file:
    reader = csv.DictReader(file)
    for row in reader:
        print(row['name'], row['age'])  # Access by column name

# Specifying delimiter
with open('data.tsv', 'r', encoding='utf-8') as file:
    reader = csv.reader(file, delimiter='\t')  # Tab-delimited
    for row in reader:
        print(row)

1.2 Writing CSV Files

import csv

# Basic CSV writing
data = [
    ['Name', 'Age', 'City'],
    ['John Smith', 25, 'New York'],
    ['Jane Doe', 30, 'Los Angeles'],
    ['Mike Johnson', 28, 'Chicago']
]

with open('output.csv', 'w', encoding='utf-8-sig', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(data)  # Write multiple rows at once

# Writing as dictionary
data_dict = [
    {'Name': 'John Smith', 'Age': 25, 'City': 'New York'},
    {'Name': 'Jane Doe', 'Age': 30, 'City': 'Los Angeles'},
    {'Name': 'Mike Johnson', 'Age': 28, 'City': 'Chicago'}
]

with open('output_dict.csv', 'w', encoding='utf-8-sig', newline='') as file:
    fieldnames = ['Name', 'Age', 'City']
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()  # Write header
    writer.writerows(data_dict)
Tip: To prevent character encoding issues when opening CSV files in Excel on Windows, use encoding='utf-8-sig'. This adds a BOM (Byte Order Mark) that Excel recognizes correctly.

2. Handling CSV with pandas

pandas is a library optimized for data analysis that provides powerful features for CSV file processing.

2.1 Installing and Basic Usage of pandas

# Installation
pip install pandas
import pandas as pd

# Reading CSV
df = pd.read_csv('data.csv', encoding='utf-8')

# Basic information
print(df.head())        # First 5 rows
print(df.info())        # Data types and missing values info
print(df.describe())    # Statistical summary

# Selecting specific columns
print(df['Name'])           # Single column
print(df[['Name', 'Age']])  # Multiple columns

# Filtering by condition
adults = df[df['Age'] >= 20]
new_yorkers = df[df['City'] == 'New York']

2.2 Data Preprocessing and Transformation

import pandas as pd

df = pd.read_csv('sales_data.csv')

# Handling missing values
df.fillna(0, inplace=True)              # Fill with 0
df.dropna(subset=['required_column'], inplace=True)  # Drop rows with missing values in specific column

# Adding new columns
df['Total'] = df['UnitPrice'] * df['Quantity']
df['DiscountedPrice'] = df['Total'] * 0.9

# Renaming columns
df.rename(columns={'old_name': 'new_name'}, inplace=True)

# Sorting data
df.sort_values(by='Total', ascending=False, inplace=True)

# Group aggregation
summary = df.groupby('Category').agg({
    'Total': 'sum',
    'Quantity': 'mean',
    'OrderID': 'count'
}).rename(columns={'OrderID': 'OrderCount'})

# Saving to CSV
df.to_csv('processed_data.csv', index=False, encoding='utf-8-sig')

2.3 Merging Multiple CSV Files

import pandas as pd
import glob

# Merging multiple CSV files with the same format
csv_files = glob.glob('data/*.csv')
df_list = [pd.read_csv(f) for f in csv_files]
combined_df = pd.concat(df_list, ignore_index=True)

# Merging two DataFrames (JOIN)
df1 = pd.read_csv('customers.csv')
df2 = pd.read_csv('orders.csv')

# Inner join (only matching records)
merged = pd.merge(df1, df2, on='CustomerID', how='inner')

# Outer join (all records included)
merged_all = pd.merge(df1, df2, on='CustomerID', how='outer')

3. Creating/Modifying Excel Files with openpyxl

openpyxl is a library specialized for creating and modifying Excel files (.xlsx).

3.1 Installation and Basic Usage

# Installation
pip install openpyxl
from openpyxl import Workbook, load_workbook

# Create new workbook
wb = Workbook()
ws = wb.active
ws.title = "Sales Data"

# Input data
ws['A1'] = 'Product Name'
ws['B1'] = 'Quantity'
ws['C1'] = 'Unit Price'

# Add multiple rows
data = [
    ['Laptop', 10, 1200],
    ['Mouse', 50, 25],
    ['Keyboard', 30, 45]
]
for row in data:
    ws.append(row)

# Save
wb.save('sales.xlsx')

# Open existing file
wb = load_workbook('sales.xlsx')
ws = wb.active

# Read cell values
print(ws['A1'].value)
print(ws.cell(row=2, column=1).value)

# Read range
for row in ws.iter_rows(min_row=2, max_row=4, min_col=1, max_col=3):
    for cell in row:
        print(cell.value, end=' ')
    print()

3.2 Cell Styling (Fonts, Colors, Borders)

from openpyxl import Workbook
from openpyxl.styles import Font, Fill, PatternFill, Border, Side, Alignment

wb = Workbook()
ws = wb.active

# Input data
headers = ['Product Name', 'Quantity', 'Unit Price', 'Total']
for col, header in enumerate(headers, 1):
    ws.cell(row=1, column=col, value=header)

# Font styles
bold_font = Font(bold=True, size=12, color='FFFFFF')
normal_font = Font(size=11)

# Background color
header_fill = PatternFill(start_color='4472C4', end_color='4472C4', fill_type='solid')
alt_fill = PatternFill(start_color='D9E2F3', end_color='D9E2F3', fill_type='solid')

# Borders
thin_border = Border(
    left=Side(style='thin'),
    right=Side(style='thin'),
    top=Side(style='thin'),
    bottom=Side(style='thin')
)

# Alignment
center_align = Alignment(horizontal='center', vertical='center')
right_align = Alignment(horizontal='right')

# Apply styles to headers
for col in range(1, 5):
    cell = ws.cell(row=1, column=col)
    cell.font = bold_font
    cell.fill = header_fill
    cell.border = thin_border
    cell.alignment = center_align

# Style data rows
data = [
    ['Laptop', 10, 1200000],
    ['Mouse', 50, 25000],
    ['Keyboard', 30, 45000]
]

for row_idx, row_data in enumerate(data, 2):
    for col_idx, value in enumerate(row_data, 1):
        cell = ws.cell(row=row_idx, column=col_idx, value=value)
        cell.border = thin_border
        if row_idx % 2 == 0:
            cell.fill = alt_fill

# Adjust column widths
ws.column_dimensions['A'].width = 15
ws.column_dimensions['B'].width = 10
ws.column_dimensions['C'].width = 15
ws.column_dimensions['D'].width = 15

# Adjust row height
ws.row_dimensions[1].height = 25

wb.save('styled_sales.xlsx')

3.3 Inserting Formulas

from openpyxl import Workbook

wb = Workbook()
ws = wb.active

# Input data
ws['A1'] = 'Quantity'
ws['B1'] = 'Unit Price'
ws['C1'] = 'Total'

data = [
    [10, 1200000],
    [50, 25000],
    [30, 45000]
]

for row_idx, (qty, price) in enumerate(data, 2):
    ws.cell(row=row_idx, column=1, value=qty)
    ws.cell(row=row_idx, column=2, value=price)
    # Insert formula (Total = Quantity * Unit Price)
    ws.cell(row=row_idx, column=3, value=f'=A{row_idx}*B{row_idx}')

# Sum formula
last_row = len(data) + 1
ws.cell(row=last_row + 1, column=1, value='Total')
ws.cell(row=last_row + 1, column=3, value=f'=SUM(C2:C{last_row})')

# Conditional formula
ws['D1'] = 'Grade'
for row_idx in range(2, last_row + 1):
    ws.cell(row=row_idx, column=4,
            value=f'=IF(C{row_idx}>=10000000,"A",IF(C{row_idx}>=1000000,"B","C"))')

wb.save('formulas.xlsx')

4. Chart Creation

With openpyxl, you can programmatically create various charts in Excel.

from openpyxl import Workbook
from openpyxl.chart import BarChart, LineChart, PieChart, Reference

wb = Workbook()
ws = wb.active

# Input data
data = [
    ['Month', 'Revenue', 'Cost', 'Profit'],
    ['Jan', 5000000, 3000000, 2000000],
    ['Feb', 6000000, 3500000, 2500000],
    ['Mar', 5500000, 3200000, 2300000],
    ['Apr', 7000000, 4000000, 3000000],
    ['May', 8000000, 4500000, 3500000],
    ['Jun', 7500000, 4200000, 3300000]
]

for row in data:
    ws.append(row)

# Bar chart
bar_chart = BarChart()
bar_chart.type = "col"  # Vertical bars
bar_chart.style = 10
bar_chart.title = "Monthly Revenue Status"
bar_chart.y_axis.title = "Amount ($)"
bar_chart.x_axis.title = "Month"

# Specify data range
data_ref = Reference(ws, min_col=2, min_row=1, max_col=4, max_row=7)
categories = Reference(ws, min_col=1, min_row=2, max_row=7)

bar_chart.add_data(data_ref, titles_from_data=True)
bar_chart.set_categories(categories)
bar_chart.shape = 4  # Chart style

ws.add_chart(bar_chart, "F2")  # Chart position

# Line chart
line_chart = LineChart()
line_chart.title = "Monthly Trend"
line_chart.style = 13
line_chart.y_axis.title = "Amount"
line_chart.x_axis.title = "Month"

line_chart.add_data(data_ref, titles_from_data=True)
line_chart.set_categories(categories)

ws.add_chart(line_chart, "F18")

# Pie chart (June data only)
pie_chart = PieChart()
pie_chart.title = "June Cost Structure"

labels = Reference(ws, min_col=2, max_col=4, min_row=1)
pie_data = Reference(ws, min_col=2, max_col=4, min_row=7)

pie_chart.add_data(pie_data, titles_from_data=False)
pie_chart.set_categories(labels)

ws.add_chart(pie_chart, "N2")

wb.save('charts.xlsx')

5. Working with Multiple Sheets

from openpyxl import Workbook, load_workbook

# Create new workbook with multiple sheets
wb = Workbook()

# First sheet (default sheet)
ws1 = wb.active
ws1.title = "Summary"

# Create additional sheets
ws2 = wb.create_sheet("Q1")
ws3 = wb.create_sheet("Q2")
ws4 = wb.create_sheet("Q3", 0)  # Insert at the beginning

# Input data for each sheet
quarters = {
    'Q1': [['Jan', 100], ['Feb', 120], ['Mar', 110]],
    'Q2': [['Apr', 130], ['May', 140], ['Jun', 135]],
    'Q3': [['Jul', 150], ['Aug', 160], ['Sep', 155]]
}

for sheet_name, data in quarters.items():
    ws = wb[sheet_name]
    ws.append(['Month', 'Sales'])
    for row in data:
        ws.append(row)

# Reference each quarter's total in summary sheet
summary = wb['Summary']
summary.append(['Quarter', 'Total Sales'])
summary.append(['Q1', "=SUM('Q1'!B2:B4)"])
summary.append(['Q2', "=SUM('Q2'!B2:B4)"])
summary.append(['Q3', "=SUM('Q3'!B2:B4)"])

# Change sheet order
wb._sheets = [wb['Summary'], wb['Q1'], wb['Q2'], wb['Q3']]

# Copy sheet
source = wb['Q1']
target = wb.copy_worksheet(source)
target.title = "Q1_Backup"

# Delete sheet
# del wb['Q1_Backup']

wb.save('multi_sheets.xlsx')

# Iterate through all sheets in existing file
wb = load_workbook('multi_sheets.xlsx')
for sheet_name in wb.sheetnames:
    ws = wb[sheet_name]
    print(f"Sheet: {sheet_name}")
    for row in ws.iter_rows(values_only=True):
        print(row)

6. Merging Excel Files

import pandas as pd
from openpyxl import load_workbook
import glob

# Merging Excel files using pandas
def merge_excel_files(file_pattern, output_file):
    """Merge multiple Excel files into one"""
    excel_files = glob.glob(file_pattern)
    df_list = []

    for file in excel_files:
        df = pd.read_excel(file)
        df['SourceFile'] = file  # Mark source
        df_list.append(df)

    combined_df = pd.concat(df_list, ignore_index=True)
    combined_df.to_excel(output_file, index=False)
    print(f"Merge complete: {len(excel_files)} files -> {output_file}")

merge_excel_files('reports/*.xlsx', 'combined_report.xlsx')

# Merge all sheets in one Excel file into single sheet
def merge_sheets(input_file, output_file):
    """Merge all sheets within an Excel file into one"""
    xl = pd.ExcelFile(input_file)
    df_list = []

    for sheet_name in xl.sheet_names:
        df = pd.read_excel(xl, sheet_name=sheet_name)
        df['SheetName'] = sheet_name
        df_list.append(df)

    combined_df = pd.concat(df_list, ignore_index=True)
    combined_df.to_excel(output_file, index=False)

merge_sheets('multi_sheets.xlsx', 'merged_single_sheet.xlsx')

# Conditional merge (based on specific column)
def conditional_merge(files, key_column, output_file):
    """Merge multiple files based on a specific key column"""
    base_df = pd.read_excel(files[0])

    for file in files[1:]:
        df = pd.read_excel(file)
        base_df = pd.merge(base_df, df, on=key_column, how='outer')

    base_df.to_excel(output_file, index=False)

conditional_merge(['customers.xlsx', 'orders.xlsx', 'payments.xlsx'],
                  'CustomerID', 'customer_full_data.xlsx')

7. Handling Large Files (Chunking)

When processing files with hundreds of thousands of rows, you need to consider memory efficiency.

import pandas as pd
from openpyxl import load_workbook
from openpyxl.utils.dataframe import dataframe_to_rows

# Processing large CSV files in chunks
def process_large_csv(input_file, output_file, chunk_size=10000):
    """Process large CSV files in chunks"""
    processed_chunks = []

    for chunk in pd.read_csv(input_file, chunksize=chunk_size):
        # Process each chunk
        chunk['Processed'] = True
        chunk['Total'] = chunk['Quantity'] * chunk['UnitPrice']
        processed_chunks.append(chunk)
        print(f"Processed rows: {len(chunk)}")

    # Save results
    result_df = pd.concat(processed_chunks, ignore_index=True)
    result_df.to_csv(output_file, index=False)

# Filter and save chunks (memory efficient)
def filter_large_csv(input_file, output_file, condition_func, chunk_size=10000):
    """Extract only rows matching a condition from large CSV"""
    first_chunk = True

    for chunk in pd.read_csv(input_file, chunksize=chunk_size):
        filtered = chunk[condition_func(chunk)]

        if first_chunk:
            filtered.to_csv(output_file, index=False, mode='w')
            first_chunk = False
        else:
            filtered.to_csv(output_file, index=False, mode='a', header=False)

# Example usage: Extract rows where Total >= 1,000,000
filter_large_csv(
    'large_sales.csv',
    'high_value_sales.csv',
    lambda df: df['Total'] >= 1000000
)

# Reading large Excel files (read_only mode)
def read_large_excel(file_path):
    """Read large Excel files memory-efficiently"""
    wb = load_workbook(file_path, read_only=True)
    ws = wb.active

    for row in ws.iter_rows(values_only=True):
        yield row  # Return one row at a time using generator

    wb.close()

# Writing large Excel files (write_only mode)
from openpyxl import Workbook

def write_large_excel(data_generator, output_file):
    """Write large data to Excel efficiently"""
    wb = Workbook(write_only=True)
    ws = wb.create_sheet()

    for row in data_generator:
        ws.append(row)

    wb.save(output_file)

8. Practical Example: Automated Monthly Report Generation

Now let's combine everything we've learned to create an automated monthly sales report generation system that you can use in practice.

import pandas as pd
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Border, Side, Alignment
from openpyxl.chart import BarChart, LineChart, PieChart, Reference
from openpyxl.utils.dataframe import dataframe_to_rows
from datetime import datetime
import os

class MonthlyReportGenerator:
    """Automated Monthly Sales Report Generation Class"""

    def __init__(self, data_file, output_dir='reports'):
        self.data_file = data_file
        self.output_dir = output_dir
        self.df = None
        self.wb = None

        # Create output directory
        os.makedirs(output_dir, exist_ok=True)

        # Define styles
        self.header_font = Font(bold=True, size=11, color='FFFFFF')
        self.header_fill = PatternFill(start_color='2F5496', end_color='2F5496', fill_type='solid')
        self.border = Border(
            left=Side(style='thin'),
            right=Side(style='thin'),
            top=Side(style='thin'),
            bottom=Side(style='thin')
        )
        self.center_align = Alignment(horizontal='center', vertical='center')
        self.number_format = '#,##0'
        self.currency_format = '$#,##0'

    def load_data(self):
        """Load data"""
        self.df = pd.read_csv(self.data_file, parse_dates=['OrderDate'])
        self.df['Month'] = self.df['OrderDate'].dt.to_period('M')
        print(f"Data loaded: {len(self.df)} rows")

    def create_summary_sheet(self, ws):
        """Create summary sheet"""
        ws.title = "Summary"

        # Title
        ws['A1'] = 'Monthly Sales Report'
        ws['A1'].font = Font(bold=True, size=16)
        ws.merge_cells('A1:E1')

        # Generation time
        ws['A2'] = f'Generated: {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}'
        ws['A2'].font = Font(italic=True, size=10)

        # Monthly summary data
        monthly_summary = self.df.groupby('Month').agg({
            'OrderID': 'count',
            'Quantity': 'sum',
            'Total': 'sum'
        }).reset_index()
        monthly_summary.columns = ['Month', 'OrderCount', 'SalesQuantity', 'TotalRevenue']

        # Table start position
        start_row = 4
        headers = ['Month', 'Order Count', 'Sales Quantity', 'Total Revenue']

        # Write headers
        for col, header in enumerate(headers, 1):
            cell = ws.cell(row=start_row, column=col, value=header)
            cell.font = self.header_font
            cell.fill = self.header_fill
            cell.border = self.border
            cell.alignment = self.center_align

        # Write data
        for row_idx, row in monthly_summary.iterrows():
            ws.cell(row=start_row + row_idx + 1, column=1, value=str(row['Month']))
            ws.cell(row=start_row + row_idx + 1, column=2, value=row['OrderCount'])
            ws.cell(row=start_row + row_idx + 1, column=3, value=row['SalesQuantity'])
            cell = ws.cell(row=start_row + row_idx + 1, column=4, value=row['TotalRevenue'])
            cell.number_format = self.currency_format

            for col in range(1, 5):
                ws.cell(row=start_row + row_idx + 1, column=col).border = self.border

        # Total row
        total_row = start_row + len(monthly_summary) + 1
        ws.cell(row=total_row, column=1, value='Total').font = Font(bold=True)
        ws.cell(row=total_row, column=2, value=f'=SUM(B{start_row+1}:B{total_row-1})')
        ws.cell(row=total_row, column=3, value=f'=SUM(C{start_row+1}:C{total_row-1})')
        ws.cell(row=total_row, column=4, value=f'=SUM(D{start_row+1}:D{total_row-1})')
        ws.cell(row=total_row, column=4).number_format = self.currency_format

        for col in range(1, 5):
            ws.cell(row=total_row, column=col).border = self.border

        # Adjust column widths
        ws.column_dimensions['A'].width = 12
        ws.column_dimensions['B'].width = 12
        ws.column_dimensions['C'].width = 14
        ws.column_dimensions['D'].width = 18

        # Add chart
        chart = BarChart()
        chart.type = "col"
        chart.style = 10
        chart.title = "Monthly Revenue"
        chart.y_axis.title = "Revenue ($)"

        data = Reference(ws, min_col=4, min_row=start_row,
                        max_row=start_row + len(monthly_summary))
        cats = Reference(ws, min_col=1, min_row=start_row + 1,
                        max_row=start_row + len(monthly_summary))

        chart.add_data(data, titles_from_data=True)
        chart.set_categories(cats)
        chart.shape = 4

        ws.add_chart(chart, "F4")

        return monthly_summary

    def generate_report(self):
        """Main report generation function"""
        print("Starting report generation...")

        # Load data
        self.load_data()

        # Create workbook
        self.wb = Workbook()

        # Create sheets
        ws_summary = self.wb.active
        self.create_summary_sheet(ws_summary)

        # Save file
        today = datetime.now().strftime('%Y%m')
        output_file = os.path.join(self.output_dir, f'MonthlyReport_{today}.xlsx')
        self.wb.save(output_file)

        print(f"Report generation complete: {output_file}")
        return output_file

# Usage example
if __name__ == "__main__":
    # Generate sample data (use existing CSV in practice)
    import random
    from datetime import timedelta

    # Generate sample data
    categories = ['Electronics', 'Clothing', 'Food', 'Furniture', 'Books']
    products = {
        'Electronics': ['Laptop', 'Smartphone', 'Tablet', 'Earphones'],
        'Clothing': ['T-shirt', 'Jeans', 'Jacket', 'Coat'],
        'Food': ['Snacks', 'Beverages', 'Ramen', 'Coffee'],
        'Furniture': ['Desk', 'Chair', 'Bed', 'Sofa'],
        'Books': ['Novel', 'Self-help', 'Comic', 'Tech Books']
    }

    data = []
    base_date = datetime(2025, 1, 1)

    for i in range(500):
        category = random.choice(categories)
        product = random.choice(products[category])
        qty = random.randint(1, 10)
        price = random.randint(10, 500)

        data.append({
            'OrderID': f'ORD{i+1:05d}',
            'OrderDate': base_date + timedelta(days=random.randint(0, 180)),
            'CustomerName': f'Customer{random.randint(1, 100):03d}',
            'ProductName': product,
            'Category': category,
            'Quantity': qty,
            'UnitPrice': price,
            'Total': qty * price
        })

    sample_df = pd.DataFrame(data)
    sample_df.to_csv('sample_sales_data.csv', index=False, encoding='utf-8-sig')

    # Generate report
    generator = MonthlyReportGenerator('sample_sales_data.csv')
    generator.generate_report()

Conclusion

In this article, we learned the key techniques for automating Excel and CSV files using Python.

  • csv module: Suitable for simple CSV file reading/writing
  • pandas: Provides powerful features for data analysis and preprocessing
  • openpyxl: Specialized for Excel file creation, styling, and chart generation

In practice, you'll combine these three tools according to your needs. A common approach is to use pandas for simple data processing and openpyxl for styling the final report.

In the next Part 6, we'll cover email and notification automation. You'll learn how to automatically send the reports we created today via email, or send notifications through Slack/Telegram.