Python Automation Master Part 5: Excel and CSV File Automation
Mastering Spreadsheet Automation with Python
Introduction: The Importance of Spreadsheet Automation
Excel and CSV files are the most widely used data formats in business environments. Every day, vast amounts of data are created, processed, and reported in spreadsheet form. By automating these repetitive tasks with Python, you can save time and reduce human errors.
In this Part 5, we'll learn how to efficiently process CSV and Excel files using Python's csv module, pandas, and openpyxl libraries. We'll cover practical skills you can immediately apply in your work, from basic file reading/writing to cell styling, chart creation, and handling large files.
1. CSV File Processing Basics (csv Module)
Python's built-in csv module allows you to process CSV files without any additional installation.
1.1 Reading CSV Files
import csv
# Basic CSV reading
with open('data.csv', 'r', encoding='utf-8') as file:
reader = csv.reader(file)
for row in reader:
print(row) # Each row is returned as a list
# Reading as dictionary (header-based)
with open('data.csv', 'r', encoding='utf-8') as file:
reader = csv.DictReader(file)
for row in reader:
print(row['name'], row['age']) # Access by column name
# Specifying delimiter
with open('data.tsv', 'r', encoding='utf-8') as file:
reader = csv.reader(file, delimiter='\t') # Tab-delimited
for row in reader:
print(row)
1.2 Writing CSV Files
import csv
# Basic CSV writing
data = [
['Name', 'Age', 'City'],
['John Smith', 25, 'New York'],
['Jane Doe', 30, 'Los Angeles'],
['Mike Johnson', 28, 'Chicago']
]
with open('output.csv', 'w', encoding='utf-8-sig', newline='') as file:
writer = csv.writer(file)
writer.writerows(data) # Write multiple rows at once
# Writing as dictionary
data_dict = [
{'Name': 'John Smith', 'Age': 25, 'City': 'New York'},
{'Name': 'Jane Doe', 'Age': 30, 'City': 'Los Angeles'},
{'Name': 'Mike Johnson', 'Age': 28, 'City': 'Chicago'}
]
with open('output_dict.csv', 'w', encoding='utf-8-sig', newline='') as file:
fieldnames = ['Name', 'Age', 'City']
writer = csv.DictWriter(file, fieldnames=fieldnames)
writer.writeheader() # Write header
writer.writerows(data_dict)
Tip: To prevent character encoding issues when opening CSV files in Excel on Windows, use encoding='utf-8-sig'. This adds a BOM (Byte Order Mark) that Excel recognizes correctly.
2. Handling CSV with pandas
pandas is a library optimized for data analysis that provides powerful features for CSV file processing.
2.1 Installing and Basic Usage of pandas
# Installation
pip install pandas
import pandas as pd
# Reading CSV
df = pd.read_csv('data.csv', encoding='utf-8')
# Basic information
print(df.head()) # First 5 rows
print(df.info()) # Data types and missing values info
print(df.describe()) # Statistical summary
# Selecting specific columns
print(df['Name']) # Single column
print(df[['Name', 'Age']]) # Multiple columns
# Filtering by condition
adults = df[df['Age'] >= 20]
new_yorkers = df[df['City'] == 'New York']
2.2 Data Preprocessing and Transformation
import pandas as pd
df = pd.read_csv('sales_data.csv')
# Handling missing values
df.fillna(0, inplace=True) # Fill with 0
df.dropna(subset=['required_column'], inplace=True) # Drop rows with missing values in specific column
# Adding new columns
df['Total'] = df['UnitPrice'] * df['Quantity']
df['DiscountedPrice'] = df['Total'] * 0.9
# Renaming columns
df.rename(columns={'old_name': 'new_name'}, inplace=True)
# Sorting data
df.sort_values(by='Total', ascending=False, inplace=True)
# Group aggregation
summary = df.groupby('Category').agg({
'Total': 'sum',
'Quantity': 'mean',
'OrderID': 'count'
}).rename(columns={'OrderID': 'OrderCount'})
# Saving to CSV
df.to_csv('processed_data.csv', index=False, encoding='utf-8-sig')
2.3 Merging Multiple CSV Files
import pandas as pd
import glob
# Merging multiple CSV files with the same format
csv_files = glob.glob('data/*.csv')
df_list = [pd.read_csv(f) for f in csv_files]
combined_df = pd.concat(df_list, ignore_index=True)
# Merging two DataFrames (JOIN)
df1 = pd.read_csv('customers.csv')
df2 = pd.read_csv('orders.csv')
# Inner join (only matching records)
merged = pd.merge(df1, df2, on='CustomerID', how='inner')
# Outer join (all records included)
merged_all = pd.merge(df1, df2, on='CustomerID', how='outer')
3. Creating/Modifying Excel Files with openpyxl
openpyxl is a library specialized for creating and modifying Excel files (.xlsx).
3.1 Installation and Basic Usage
# Installation
pip install openpyxl
from openpyxl import Workbook, load_workbook
# Create new workbook
wb = Workbook()
ws = wb.active
ws.title = "Sales Data"
# Input data
ws['A1'] = 'Product Name'
ws['B1'] = 'Quantity'
ws['C1'] = 'Unit Price'
# Add multiple rows
data = [
['Laptop', 10, 1200],
['Mouse', 50, 25],
['Keyboard', 30, 45]
]
for row in data:
ws.append(row)
# Save
wb.save('sales.xlsx')
# Open existing file
wb = load_workbook('sales.xlsx')
ws = wb.active
# Read cell values
print(ws['A1'].value)
print(ws.cell(row=2, column=1).value)
# Read range
for row in ws.iter_rows(min_row=2, max_row=4, min_col=1, max_col=3):
for cell in row:
print(cell.value, end=' ')
print()
3.2 Cell Styling (Fonts, Colors, Borders)
from openpyxl import Workbook
from openpyxl.styles import Font, Fill, PatternFill, Border, Side, Alignment
wb = Workbook()
ws = wb.active
# Input data
headers = ['Product Name', 'Quantity', 'Unit Price', 'Total']
for col, header in enumerate(headers, 1):
ws.cell(row=1, column=col, value=header)
# Font styles
bold_font = Font(bold=True, size=12, color='FFFFFF')
normal_font = Font(size=11)
# Background color
header_fill = PatternFill(start_color='4472C4', end_color='4472C4', fill_type='solid')
alt_fill = PatternFill(start_color='D9E2F3', end_color='D9E2F3', fill_type='solid')
# Borders
thin_border = Border(
left=Side(style='thin'),
right=Side(style='thin'),
top=Side(style='thin'),
bottom=Side(style='thin')
)
# Alignment
center_align = Alignment(horizontal='center', vertical='center')
right_align = Alignment(horizontal='right')
# Apply styles to headers
for col in range(1, 5):
cell = ws.cell(row=1, column=col)
cell.font = bold_font
cell.fill = header_fill
cell.border = thin_border
cell.alignment = center_align
# Style data rows
data = [
['Laptop', 10, 1200000],
['Mouse', 50, 25000],
['Keyboard', 30, 45000]
]
for row_idx, row_data in enumerate(data, 2):
for col_idx, value in enumerate(row_data, 1):
cell = ws.cell(row=row_idx, column=col_idx, value=value)
cell.border = thin_border
if row_idx % 2 == 0:
cell.fill = alt_fill
# Adjust column widths
ws.column_dimensions['A'].width = 15
ws.column_dimensions['B'].width = 10
ws.column_dimensions['C'].width = 15
ws.column_dimensions['D'].width = 15
# Adjust row height
ws.row_dimensions[1].height = 25
wb.save('styled_sales.xlsx')
3.3 Inserting Formulas
from openpyxl import Workbook
wb = Workbook()
ws = wb.active
# Input data
ws['A1'] = 'Quantity'
ws['B1'] = 'Unit Price'
ws['C1'] = 'Total'
data = [
[10, 1200000],
[50, 25000],
[30, 45000]
]
for row_idx, (qty, price) in enumerate(data, 2):
ws.cell(row=row_idx, column=1, value=qty)
ws.cell(row=row_idx, column=2, value=price)
# Insert formula (Total = Quantity * Unit Price)
ws.cell(row=row_idx, column=3, value=f'=A{row_idx}*B{row_idx}')
# Sum formula
last_row = len(data) + 1
ws.cell(row=last_row + 1, column=1, value='Total')
ws.cell(row=last_row + 1, column=3, value=f'=SUM(C2:C{last_row})')
# Conditional formula
ws['D1'] = 'Grade'
for row_idx in range(2, last_row + 1):
ws.cell(row=row_idx, column=4,
value=f'=IF(C{row_idx}>=10000000,"A",IF(C{row_idx}>=1000000,"B","C"))')
wb.save('formulas.xlsx')
4. Chart Creation
With openpyxl, you can programmatically create various charts in Excel.
from openpyxl import Workbook
from openpyxl.chart import BarChart, LineChart, PieChart, Reference
wb = Workbook()
ws = wb.active
# Input data
data = [
['Month', 'Revenue', 'Cost', 'Profit'],
['Jan', 5000000, 3000000, 2000000],
['Feb', 6000000, 3500000, 2500000],
['Mar', 5500000, 3200000, 2300000],
['Apr', 7000000, 4000000, 3000000],
['May', 8000000, 4500000, 3500000],
['Jun', 7500000, 4200000, 3300000]
]
for row in data:
ws.append(row)
# Bar chart
bar_chart = BarChart()
bar_chart.type = "col" # Vertical bars
bar_chart.style = 10
bar_chart.title = "Monthly Revenue Status"
bar_chart.y_axis.title = "Amount ($)"
bar_chart.x_axis.title = "Month"
# Specify data range
data_ref = Reference(ws, min_col=2, min_row=1, max_col=4, max_row=7)
categories = Reference(ws, min_col=1, min_row=2, max_row=7)
bar_chart.add_data(data_ref, titles_from_data=True)
bar_chart.set_categories(categories)
bar_chart.shape = 4 # Chart style
ws.add_chart(bar_chart, "F2") # Chart position
# Line chart
line_chart = LineChart()
line_chart.title = "Monthly Trend"
line_chart.style = 13
line_chart.y_axis.title = "Amount"
line_chart.x_axis.title = "Month"
line_chart.add_data(data_ref, titles_from_data=True)
line_chart.set_categories(categories)
ws.add_chart(line_chart, "F18")
# Pie chart (June data only)
pie_chart = PieChart()
pie_chart.title = "June Cost Structure"
labels = Reference(ws, min_col=2, max_col=4, min_row=1)
pie_data = Reference(ws, min_col=2, max_col=4, min_row=7)
pie_chart.add_data(pie_data, titles_from_data=False)
pie_chart.set_categories(labels)
ws.add_chart(pie_chart, "N2")
wb.save('charts.xlsx')
5. Working with Multiple Sheets
from openpyxl import Workbook, load_workbook
# Create new workbook with multiple sheets
wb = Workbook()
# First sheet (default sheet)
ws1 = wb.active
ws1.title = "Summary"
# Create additional sheets
ws2 = wb.create_sheet("Q1")
ws3 = wb.create_sheet("Q2")
ws4 = wb.create_sheet("Q3", 0) # Insert at the beginning
# Input data for each sheet
quarters = {
'Q1': [['Jan', 100], ['Feb', 120], ['Mar', 110]],
'Q2': [['Apr', 130], ['May', 140], ['Jun', 135]],
'Q3': [['Jul', 150], ['Aug', 160], ['Sep', 155]]
}
for sheet_name, data in quarters.items():
ws = wb[sheet_name]
ws.append(['Month', 'Sales'])
for row in data:
ws.append(row)
# Reference each quarter's total in summary sheet
summary = wb['Summary']
summary.append(['Quarter', 'Total Sales'])
summary.append(['Q1', "=SUM('Q1'!B2:B4)"])
summary.append(['Q2', "=SUM('Q2'!B2:B4)"])
summary.append(['Q3', "=SUM('Q3'!B2:B4)"])
# Change sheet order
wb._sheets = [wb['Summary'], wb['Q1'], wb['Q2'], wb['Q3']]
# Copy sheet
source = wb['Q1']
target = wb.copy_worksheet(source)
target.title = "Q1_Backup"
# Delete sheet
# del wb['Q1_Backup']
wb.save('multi_sheets.xlsx')
# Iterate through all sheets in existing file
wb = load_workbook('multi_sheets.xlsx')
for sheet_name in wb.sheetnames:
ws = wb[sheet_name]
print(f"Sheet: {sheet_name}")
for row in ws.iter_rows(values_only=True):
print(row)
6. Merging Excel Files
import pandas as pd
from openpyxl import load_workbook
import glob
# Merging Excel files using pandas
def merge_excel_files(file_pattern, output_file):
"""Merge multiple Excel files into one"""
excel_files = glob.glob(file_pattern)
df_list = []
for file in excel_files:
df = pd.read_excel(file)
df['SourceFile'] = file # Mark source
df_list.append(df)
combined_df = pd.concat(df_list, ignore_index=True)
combined_df.to_excel(output_file, index=False)
print(f"Merge complete: {len(excel_files)} files -> {output_file}")
merge_excel_files('reports/*.xlsx', 'combined_report.xlsx')
# Merge all sheets in one Excel file into single sheet
def merge_sheets(input_file, output_file):
"""Merge all sheets within an Excel file into one"""
xl = pd.ExcelFile(input_file)
df_list = []
for sheet_name in xl.sheet_names:
df = pd.read_excel(xl, sheet_name=sheet_name)
df['SheetName'] = sheet_name
df_list.append(df)
combined_df = pd.concat(df_list, ignore_index=True)
combined_df.to_excel(output_file, index=False)
merge_sheets('multi_sheets.xlsx', 'merged_single_sheet.xlsx')
# Conditional merge (based on specific column)
def conditional_merge(files, key_column, output_file):
"""Merge multiple files based on a specific key column"""
base_df = pd.read_excel(files[0])
for file in files[1:]:
df = pd.read_excel(file)
base_df = pd.merge(base_df, df, on=key_column, how='outer')
base_df.to_excel(output_file, index=False)
conditional_merge(['customers.xlsx', 'orders.xlsx', 'payments.xlsx'],
'CustomerID', 'customer_full_data.xlsx')
7. Handling Large Files (Chunking)
When processing files with hundreds of thousands of rows, you need to consider memory efficiency.
import pandas as pd
from openpyxl import load_workbook
from openpyxl.utils.dataframe import dataframe_to_rows
# Processing large CSV files in chunks
def process_large_csv(input_file, output_file, chunk_size=10000):
"""Process large CSV files in chunks"""
processed_chunks = []
for chunk in pd.read_csv(input_file, chunksize=chunk_size):
# Process each chunk
chunk['Processed'] = True
chunk['Total'] = chunk['Quantity'] * chunk['UnitPrice']
processed_chunks.append(chunk)
print(f"Processed rows: {len(chunk)}")
# Save results
result_df = pd.concat(processed_chunks, ignore_index=True)
result_df.to_csv(output_file, index=False)
# Filter and save chunks (memory efficient)
def filter_large_csv(input_file, output_file, condition_func, chunk_size=10000):
"""Extract only rows matching a condition from large CSV"""
first_chunk = True
for chunk in pd.read_csv(input_file, chunksize=chunk_size):
filtered = chunk[condition_func(chunk)]
if first_chunk:
filtered.to_csv(output_file, index=False, mode='w')
first_chunk = False
else:
filtered.to_csv(output_file, index=False, mode='a', header=False)
# Example usage: Extract rows where Total >= 1,000,000
filter_large_csv(
'large_sales.csv',
'high_value_sales.csv',
lambda df: df['Total'] >= 1000000
)
# Reading large Excel files (read_only mode)
def read_large_excel(file_path):
"""Read large Excel files memory-efficiently"""
wb = load_workbook(file_path, read_only=True)
ws = wb.active
for row in ws.iter_rows(values_only=True):
yield row # Return one row at a time using generator
wb.close()
# Writing large Excel files (write_only mode)
from openpyxl import Workbook
def write_large_excel(data_generator, output_file):
"""Write large data to Excel efficiently"""
wb = Workbook(write_only=True)
ws = wb.create_sheet()
for row in data_generator:
ws.append(row)
wb.save(output_file)
8. Practical Example: Automated Monthly Report Generation
Now let's combine everything we've learned to create an automated monthly sales report generation system that you can use in practice.
import pandas as pd
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Border, Side, Alignment
from openpyxl.chart import BarChart, LineChart, PieChart, Reference
from openpyxl.utils.dataframe import dataframe_to_rows
from datetime import datetime
import os
class MonthlyReportGenerator:
"""Automated Monthly Sales Report Generation Class"""
def __init__(self, data_file, output_dir='reports'):
self.data_file = data_file
self.output_dir = output_dir
self.df = None
self.wb = None
# Create output directory
os.makedirs(output_dir, exist_ok=True)
# Define styles
self.header_font = Font(bold=True, size=11, color='FFFFFF')
self.header_fill = PatternFill(start_color='2F5496', end_color='2F5496', fill_type='solid')
self.border = Border(
left=Side(style='thin'),
right=Side(style='thin'),
top=Side(style='thin'),
bottom=Side(style='thin')
)
self.center_align = Alignment(horizontal='center', vertical='center')
self.number_format = '#,##0'
self.currency_format = '$#,##0'
def load_data(self):
"""Load data"""
self.df = pd.read_csv(self.data_file, parse_dates=['OrderDate'])
self.df['Month'] = self.df['OrderDate'].dt.to_period('M')
print(f"Data loaded: {len(self.df)} rows")
def create_summary_sheet(self, ws):
"""Create summary sheet"""
ws.title = "Summary"
# Title
ws['A1'] = 'Monthly Sales Report'
ws['A1'].font = Font(bold=True, size=16)
ws.merge_cells('A1:E1')
# Generation time
ws['A2'] = f'Generated: {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}'
ws['A2'].font = Font(italic=True, size=10)
# Monthly summary data
monthly_summary = self.df.groupby('Month').agg({
'OrderID': 'count',
'Quantity': 'sum',
'Total': 'sum'
}).reset_index()
monthly_summary.columns = ['Month', 'OrderCount', 'SalesQuantity', 'TotalRevenue']
# Table start position
start_row = 4
headers = ['Month', 'Order Count', 'Sales Quantity', 'Total Revenue']
# Write headers
for col, header in enumerate(headers, 1):
cell = ws.cell(row=start_row, column=col, value=header)
cell.font = self.header_font
cell.fill = self.header_fill
cell.border = self.border
cell.alignment = self.center_align
# Write data
for row_idx, row in monthly_summary.iterrows():
ws.cell(row=start_row + row_idx + 1, column=1, value=str(row['Month']))
ws.cell(row=start_row + row_idx + 1, column=2, value=row['OrderCount'])
ws.cell(row=start_row + row_idx + 1, column=3, value=row['SalesQuantity'])
cell = ws.cell(row=start_row + row_idx + 1, column=4, value=row['TotalRevenue'])
cell.number_format = self.currency_format
for col in range(1, 5):
ws.cell(row=start_row + row_idx + 1, column=col).border = self.border
# Total row
total_row = start_row + len(monthly_summary) + 1
ws.cell(row=total_row, column=1, value='Total').font = Font(bold=True)
ws.cell(row=total_row, column=2, value=f'=SUM(B{start_row+1}:B{total_row-1})')
ws.cell(row=total_row, column=3, value=f'=SUM(C{start_row+1}:C{total_row-1})')
ws.cell(row=total_row, column=4, value=f'=SUM(D{start_row+1}:D{total_row-1})')
ws.cell(row=total_row, column=4).number_format = self.currency_format
for col in range(1, 5):
ws.cell(row=total_row, column=col).border = self.border
# Adjust column widths
ws.column_dimensions['A'].width = 12
ws.column_dimensions['B'].width = 12
ws.column_dimensions['C'].width = 14
ws.column_dimensions['D'].width = 18
# Add chart
chart = BarChart()
chart.type = "col"
chart.style = 10
chart.title = "Monthly Revenue"
chart.y_axis.title = "Revenue ($)"
data = Reference(ws, min_col=4, min_row=start_row,
max_row=start_row + len(monthly_summary))
cats = Reference(ws, min_col=1, min_row=start_row + 1,
max_row=start_row + len(monthly_summary))
chart.add_data(data, titles_from_data=True)
chart.set_categories(cats)
chart.shape = 4
ws.add_chart(chart, "F4")
return monthly_summary
def generate_report(self):
"""Main report generation function"""
print("Starting report generation...")
# Load data
self.load_data()
# Create workbook
self.wb = Workbook()
# Create sheets
ws_summary = self.wb.active
self.create_summary_sheet(ws_summary)
# Save file
today = datetime.now().strftime('%Y%m')
output_file = os.path.join(self.output_dir, f'MonthlyReport_{today}.xlsx')
self.wb.save(output_file)
print(f"Report generation complete: {output_file}")
return output_file
# Usage example
if __name__ == "__main__":
# Generate sample data (use existing CSV in practice)
import random
from datetime import timedelta
# Generate sample data
categories = ['Electronics', 'Clothing', 'Food', 'Furniture', 'Books']
products = {
'Electronics': ['Laptop', 'Smartphone', 'Tablet', 'Earphones'],
'Clothing': ['T-shirt', 'Jeans', 'Jacket', 'Coat'],
'Food': ['Snacks', 'Beverages', 'Ramen', 'Coffee'],
'Furniture': ['Desk', 'Chair', 'Bed', 'Sofa'],
'Books': ['Novel', 'Self-help', 'Comic', 'Tech Books']
}
data = []
base_date = datetime(2025, 1, 1)
for i in range(500):
category = random.choice(categories)
product = random.choice(products[category])
qty = random.randint(1, 10)
price = random.randint(10, 500)
data.append({
'OrderID': f'ORD{i+1:05d}',
'OrderDate': base_date + timedelta(days=random.randint(0, 180)),
'CustomerName': f'Customer{random.randint(1, 100):03d}',
'ProductName': product,
'Category': category,
'Quantity': qty,
'UnitPrice': price,
'Total': qty * price
})
sample_df = pd.DataFrame(data)
sample_df.to_csv('sample_sales_data.csv', index=False, encoding='utf-8-sig')
# Generate report
generator = MonthlyReportGenerator('sample_sales_data.csv')
generator.generate_report()
Conclusion
In this article, we learned the key techniques for automating Excel and CSV files using Python.
- csv module: Suitable for simple CSV file reading/writing
- pandas: Provides powerful features for data analysis and preprocessing
- openpyxl: Specialized for Excel file creation, styling, and chart generation
In practice, you'll combine these three tools according to your needs. A common approach is to use pandas for simple data processing and openpyxl for styling the final report.
In the next Part 6, we'll cover email and notification automation. You'll learn how to automatically send the reports we created today via email, or send notifications through Slack/Telegram.