Python Data Analysis for Beginners - From Pandas to Visualization
A Complete Guide to Data Analysis with Python, Pandas, Matplotlib, and Seaborn
Introduction: Why Data Analysis Matters and Why Python Is the Best Choice
We live in an era where massive amounts of data are generated every day. From corporate sales data to social media user behavior logs and government public statistics - meaningful patterns and insights are hidden within all this data. Data analysis is the process of systematically organizing and analyzing raw data to extract information that aids in decision-making. From developing marketing strategies to improving products and predicting risks, data analysis has become a core competency in virtually every industry.
So why has Python become the leading language for data analysis among so many programming languages? First, Python has intuitive and readable syntax, making it easy for beginners to learn quickly. Second, it has a powerful library ecosystem specialized for data analysis, including Pandas, NumPy, Matplotlib, and Seaborn. Third, through the interactive development environment called Jupyter Notebook, you can immediately see the results of code execution while working on your analysis. Fourth, it has the largest developer community worldwide, making it easy to find resources for problem-solving.
This article provides a step-by-step guide to the fundamentals of Python data analysis, from environment setup to a hands-on project. Even complete beginners with no programming experience will be able to perform basic data analysis by following this guide.
1. Environment Setup: The First Step Toward Analysis
1.1 Installing Python
To start data analysis, you first need to install Python. As of 2026, Python 3.12 or later is recommended. You can download the installer for your operating system from the official website (python.org). During installation, make sure to check the "Add Python to PATH" option so you can run Python directly from the terminal.
Once installation is complete, verify the version in the terminal (command prompt).
# Check Python version
python --version
# Example output: Python 3.12.x
# Check pip version
pip --version
1.2 All-in-One Installation with Anaconda
If installing the required libraries individually feels cumbersome, we recommend the Anaconda distribution. Anaconda installs Python along with all the major packages needed for data science, including Pandas, NumPy, Matplotlib, and Jupyter Notebook, all at once.
# Create a virtual environment after installing Anaconda
conda create -n data_analysis python=3.12
conda activate data_analysis
# Required packages are already included, but if additional installation is needed
conda install pandas numpy matplotlib seaborn jupyter
1.3 Getting Started with Jupyter Notebook
Jupyter Notebook is an essential tool for data analysts. It allows you to execute code cell by cell and see results immediately, making it ideal for exploratory data analysis (EDA). You can also use Markdown cells to document your analysis process.
# Install with pip
pip install jupyter notebook
# Launch Jupyter Notebook
jupyter notebook
# Or launch JupyterLab (next-generation version)
pip install jupyterlab
jupyter lab
A browser window will automatically open with the Jupyter dashboard. Click "New > Python 3" in the top right to create a new notebook.
2. Pandas Basics: The Core Library for Data Analysis
2.1 Understanding DataFrame and Series
Pandas is the core library for Python data analysis. Understanding Pandas' two main data structures - Series and DataFrame - is the first step. A Series is a one-dimensional array (a single column), and a DataFrame is a two-dimensional table with rows and columns (similar to an Excel spreadsheet).
import pandas as pd
import numpy as np
# Create a Series
s = pd.Series([10, 20, 30, 40, 50],
              index=['a', 'b', 'c', 'd', 'e'])
print(s)
# Output:
# a 10
# b 20
# c 30
# d 40
# e 50
# dtype: int64
# Create a DataFrame
data = {
    'Name': ['Kim', 'Lee', 'Park', 'Choi', 'Jung'],
    'Age': [25, 30, 28, 35, 22],
    'City': ['Seoul', 'Busan', 'Daejeon', 'Seoul', 'Incheon'],
    'Score': [85, 92, 78, 95, 88]
}
df = pd.DataFrame(data)
print(df)
# Output:
# Name Age City Score
# 0 Kim 25 Seoul 85
# 1 Lee 30 Busan 92
# 2 Park 28 Daejeon 78
# 3 Choi 35 Seoul 95
# 4 Jung 22 Incheon 88
2.2 Reading and Writing Data
In real-world data analysis, you need to read files in various formats such as CSV, Excel, and JSON. Pandas provides convenient functions for this.
# Read a CSV file
df_csv = pd.read_csv('data.csv', encoding='utf-8')
# Read an Excel file
df_excel = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# Read a JSON file
df_json = pd.read_json('data.json')
# Read directly from a web URL
url = 'https://example.com/data.csv'
df_web = pd.read_csv(url)
# Check basic DataFrame information
print(df_csv.shape) # (number of rows, number of columns)
print(df_csv.info()) # Column names, types, missing value info
print(df_csv.describe()) # Statistical summary of numeric columns
print(df_csv.head(10)) # Display first 10 rows
# Save to file
df_csv.to_csv('output.csv', index=False, encoding='utf-8-sig')
df_csv.to_excel('output.xlsx', index=False)
2.3 Indexing and Filtering
Selecting and extracting specific parts of data is the most fundamental task in data analysis. Pandas provides various indexing methods.
# Select columns
print(df['Name']) # Single column (returns Series)
print(df[['Name', 'Score']]) # Multiple columns (returns DataFrame)
# Select rows - loc (label-based)
print(df.loc[0]) # Row at index 0
print(df.loc[0:2]) # Rows from index 0 to 2 (inclusive)
# Select rows - iloc (integer position-based)
print(df.iloc[0]) # First row
print(df.iloc[0:2]) # First to second row (exclusive end)
# Conditional filtering
print(df[df['Age'] >= 28]) # Age 28 or older
print(df[df['City'] == 'Seoul']) # Seoul residents
print(df[(df['Age'] >= 25) & (df['Score'] >= 90)]) # Compound condition
# Filter by inclusion in a list
cities = ['Seoul', 'Busan']
print(df[df['City'].isin(cities)])
# Filter by string containment
print(df[df['Name'].str.contains('Kim')])
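As a complementary style to boolean masks, pandas also offers DataFrame.query(), which expresses conditions as a string and can be easier to read for compound filters. A minimal sketch using the same sample DataFrame as above:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Kim', 'Lee', 'Park', 'Choi', 'Jung'],
    'Age': [25, 30, 28, 35, 22],
    'City': ['Seoul', 'Busan', 'Daejeon', 'Seoul', 'Incheon'],
    'Score': [85, 92, 78, 95, 88]
})

# query() takes the condition as a string -- often more readable
# than chained boolean masks with & and |
adults = df.query('Age >= 25 and Score >= 90')
print(adults)

# External Python variables are referenced with the @ prefix
min_age = 28
print(df.query('Age >= @min_age'))
```

Both styles produce the same result; which one to use is largely a matter of taste, though query() avoids the repeated `df[...]` references.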
3. Data Preprocessing: The Key Stage That Determines Analysis Quality
Real-world data is never perfect. Various issues exist, such as missing values, duplicate data, and incorrect formats. Data preprocessing is the process of fixing these problems to prepare data in a suitable form for analysis. It is known that preprocessing accounts for 60-80% of total work time in real data analysis tasks, highlighting its importance.
3.1 Handling Missing Values
# Create sample data with missing values
data = {
    'Name': ['Kim', 'Lee', None, 'Choi', 'Jung'],
    'Age': [25, np.nan, 28, 35, 22],
    'City': ['Seoul', 'Busan', 'Daejeon', None, 'Incheon'],
    'Score': [85, 92, np.nan, 95, 88]
}
df = pd.DataFrame(data)
# Check missing values
print(df.isnull().sum()) # Missing value count per column
print(df.isnull().sum().sum()) # Total missing value count
# Remove missing values
df_dropped = df.dropna() # Drop all rows with missing values
df_dropped_col = df.dropna(axis=1) # Drop all columns with missing values
df_thresh = df.dropna(thresh=3) # Keep rows with at least 3 non-null values
# Fill missing values
df['Age'] = df['Age'].fillna(df['Age'].mean()) # Replace with mean
df['City'] = df['City'].fillna('Unknown') # Replace with specific value
df['Score'] = df['Score'].ffill() # Forward fill (fillna(method=...) is deprecated in pandas 2.x)
df['Name'] = df['Name'].fillna('Unknown') # String replacement
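One refinement worth knowing: instead of a single global mean, you can fill each missing value with the mean of its own group, which is often a better estimate when groups differ systematically. A minimal sketch with a made-up DataFrame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'City': ['Seoul', 'Seoul', 'Busan', 'Busan', 'Seoul'],
    'Score': [80, np.nan, 90, 94, 88]
})

# transform('mean') returns a Series aligned with the original rows,
# so each NaN is filled with the mean of its own city
df['Score'] = df['Score'].fillna(
    df.groupby('City')['Score'].transform('mean')
)
print(df)
```

Here the missing Seoul score is filled with 84.0 (the mean of 80 and 88), not the global mean of all four known scores.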
3.2 Removing Duplicate Data
# Check for duplicates
print(df.duplicated().sum()) # Total duplicate row count
print(df.duplicated(subset=['Name'])) # Check duplicates by specific column
# Remove duplicates
df_unique = df.drop_duplicates() # Based on all columns
df_unique_name = df.drop_duplicates(subset=['Name']) # Based on 'Name' column
df_unique_last = df.drop_duplicates(
    subset=['Name'], keep='last' # Keep last occurrence
)
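To see these methods in action, here is a small self-contained example with one deliberately duplicated row:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Kim', 'Lee', 'Kim', 'Park'],
    'Score': [85, 92, 85, 78]
})

# Row 2 is an exact copy of row 0, so exactly one duplicate is reported
print(df.duplicated().sum())

df_unique = df.drop_duplicates()
print(len(df_unique))

# keep='last' retains the final occurrence of each 'Name' instead of the first
df_last = df.drop_duplicates(subset=['Name'], keep='last')
print(df_last.index.tolist())
```

Note that drop_duplicates() preserves the original index of the kept rows, which is why `keep='last'` leaves indices 1, 2, 3 rather than 0, 1, 3.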
3.3 Data Type Conversion
# Check current data types
print(df.dtypes)
# Type conversion
df['Age'] = df['Age'].astype(int) # Convert to integer
df['Score'] = df['Score'].astype(float) # Convert to float
df['Name'] = df['Name'].astype(str) # Convert to string
# Convert to datetime
df['Date'] = pd.to_datetime(df['date_string'], format='%Y-%m-%d')
# Convert to category (effective for memory savings)
df['City'] = df['City'].astype('category')
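The memory-savings claim for the category dtype is easy to verify with memory_usage(deep=True). A quick sketch (exact byte counts vary by platform and pandas version):

```python
import pandas as pd

# A column with few distinct values repeated many times --
# the typical case where 'category' pays off
cities = pd.Series(['Seoul', 'Busan', 'Daejeon'] * 10000)

# deep=True counts the actual string storage, not just pointers
as_object = cities.memory_usage(deep=True)
as_category = cities.astype('category').memory_usage(deep=True)

print(f'object:   {as_object:,} bytes')
print(f'category: {as_category:,} bytes')
```

The category version stores each row as a small integer code plus a single copy of the three labels, so the savings grow with the number of repeated rows.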
3.4 Grouping and Aggregation
Grouping data by specific criteria and aggregating it is a core technique for discovering meaningful patterns in data.
# Average score by city
print(df.groupby('City')['Score'].mean())
# Calculate multiple statistics at once by city
print(df.groupby('City')['Score'].agg(['mean', 'max', 'min', 'count']))
# Group by multiple columns (assumes the DataFrame also has a 'Gender' column)
print(df.groupby(['City', 'Gender'])['Score'].mean())
# Pivot table (similar to Excel pivot tables)
pivot = df.pivot_table(
    values='Score',
    index='City',
    columns='Gender',
    aggfunc='mean',
    fill_value=0
)
print(pivot)
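A close relative of pivot_table is pd.crosstab, which defaults to counting occurrences and is handy for quick frequency tables. A minimal sketch on a small hypothetical DataFrame (the 'Gender' column here is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['Seoul', 'Seoul', 'Busan', 'Busan', 'Seoul'],
    'Gender': ['F', 'M', 'F', 'M', 'M'],
    'Score': [85, 92, 78, 95, 88]
})

# With no values/aggfunc, crosstab simply counts row combinations
print(pd.crosstab(df['City'], df['Gender']))

# With values and aggfunc, it behaves like a small pivot table
print(pd.crosstab(df['City'], df['Gender'],
                  values=df['Score'], aggfunc='mean'))
```

For simple counts, crosstab saves you from writing groupby(...).size().unstack() by hand.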
4. Data Visualization: Matplotlib and Seaborn
Numbers alone make it difficult to intuitively grasp patterns and trends in data. Visualization represents data as graphs and charts, making complex information understandable at a glance. In Python, Matplotlib (basic visualization) and Seaborn (statistical visualization) are the most widely used libraries.
4.1 Matplotlib Basics
import matplotlib.pyplot as plt
# Font settings (for Windows)
plt.rcParams['font.family'] = 'Malgun Gothic'
plt.rcParams['axes.unicode_minus'] = False # Prevent minus sign rendering issue
# 1. Line Plot - Suitable for tracking changes over time
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
sales = [150, 180, 200, 170, 220, 250]
plt.figure(figsize=(10, 6))
plt.plot(months, sales, marker='o', linewidth=2, color='#3498db')
plt.title('Monthly Sales Trend', fontsize=16, fontweight='bold')
plt.xlabel('Month', fontsize=12)
plt.ylabel('Sales (10K KRW)', fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('line_chart.png', dpi=150)
plt.show()
# 2. Bar Chart - Suitable for comparing categories
categories = ['Electronics', 'Clothing', 'Food', 'Furniture', 'Books']
values = [450, 320, 280, 190, 150]
plt.figure(figsize=(10, 6))
bars = plt.bar(categories, values, color=['#3498db', '#2ecc71', '#e74c3c',
                                          '#f39c12', '#9b59b6'])
plt.title('Sales by Category', fontsize=16, fontweight='bold')
plt.xlabel('Category', fontsize=12)
plt.ylabel('Sales Volume', fontsize=12)
# Display values on top of each bar
for bar, val in zip(bars, values):
    plt.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 5,
             str(val), ha='center', va='bottom', fontsize=11)
plt.tight_layout()
plt.show()
# 3. Pie Chart - Suitable for showing proportions
labels = ['Seoul', 'Gyeonggi', 'Busan', 'Daejeon', 'Others']
sizes = [35, 25, 15, 10, 15]
colors = ['#3498db', '#2ecc71', '#e74c3c', '#f39c12', '#9b59b6']
explode = (0.05, 0, 0, 0, 0) # Slightly separate the Seoul slice
plt.figure(figsize=(8, 8))
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=90)
plt.title('User Distribution by Region', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()
4.2 Advanced Visualization with Seaborn
Seaborn is built on top of Matplotlib but allows you to create more beautiful and statistically meaningful charts with simpler code.
import seaborn as sns
# Set Seaborn default theme
sns.set_theme(style='whitegrid', font='Malgun Gothic')
# Create sample data
np.random.seed(42)
df_sample = pd.DataFrame({
    'Department': np.random.choice(['Development', 'Marketing', 'Sales', 'Design'], 200),
    'Experience(yrs)': np.random.randint(1, 15, 200),
    'Salary(10K KRW)': np.random.normal(5000, 1500, 200).astype(int),
    'Satisfaction': np.random.uniform(1, 5, 200).round(1)
})
# 1. Histogram - Check data distribution
plt.figure(figsize=(10, 6))
sns.histplot(data=df_sample, x='Salary(10K KRW)', bins=20, kde=True, color='#3498db')
plt.title('Salary Distribution', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()
# 2. Box Plot - Compare salary distribution by department
plt.figure(figsize=(10, 6))
sns.boxplot(data=df_sample, x='Department', y='Salary(10K KRW)',
            hue='Department', palette='Set2', legend=False)
# In seaborn 0.13+, palette requires hue; assigning hue to the x variable
# with legend=False reproduces the old behavior without a warning
plt.title('Salary Distribution by Department', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()
# 3. Scatter Plot - Relationship between experience and salary
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df_sample, x='Experience(yrs)', y='Salary(10K KRW)',
                hue='Department', style='Department', s=80, alpha=0.7)
plt.title('Experience vs Salary (by Department)', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()
# 4. Heatmap - Correlation visualization
plt.figure(figsize=(8, 6))
numeric_cols = df_sample.select_dtypes(include=[np.number])
sns.heatmap(numeric_cols.corr(), annot=True, cmap='coolwarm',
            center=0, fmt='.2f', linewidths=0.5)
plt.title('Correlation Between Variables', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()
5. Hands-on Project: Learning Data Analysis with Public Data
Let's put together everything we've learned so far by conducting a mini project analyzing real public data. In this example, we use fictional Seoul district population statistics. You can download actual public data from the Korea Open Data Portal (data.go.kr) or the Seoul Open Data Plaza (data.seoul.go.kr).
5.1 Data Preparation and Exploration
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Korean font settings
plt.rcParams['font.family'] = 'Malgun Gothic'
plt.rcParams['axes.unicode_minus'] = False
sns.set_theme(style='whitegrid', font='Malgun Gothic')
# Create fictional Seoul district population data
data = {
    'District': ['Gangnam', 'Gangdong', 'Gangbuk', 'Gangseo', 'Gwanak',
                 'Gwangjin', 'Guro', 'Geumcheon', 'Nowon', 'Dobong',
                 'Dongdaemun', 'Dongjak', 'Mapo', 'Seodaemun', 'Seocho',
                 'Seongdong', 'Seongbuk', 'Songpa', 'Yangcheon', 'Yeongdeungpo',
                 'Yongsan', 'Eunpyeong', 'Jongno', 'Jung', 'Jungnang'],
    'Population': [527641, 426067, 303065, 577586, 497702,
                   356174, 411084, 230696, 518922, 324686,
                   347780, 392163, 378476, 314109, 429154,
                   311024, 436592, 667483, 457806, 397789,
                   229431, 481623, 149479, 125177, 398690],
    'Households': [231547, 183224, 143628, 258921, 247896,
                   161438, 189362, 115892, 214567, 137852,
                   168234, 177621, 184523, 148762, 185423,
                   145234, 193458, 278456, 186742, 189234,
                   113256, 201234, 75623, 65892, 172345],
    'Area': [39.50, 24.59, 23.60, 41.44, 29.57,
             17.06, 20.12, 13.01, 35.44, 20.70,
             14.22, 16.35, 23.84, 17.61, 46.98,
             16.85, 24.57, 33.87, 17.41, 24.55,
             21.87, 29.71, 23.91, 9.96, 18.50],
    'ElderlyRatio': [12.8, 14.2, 19.5, 13.6, 14.8,
                     13.1, 15.3, 13.9, 16.2, 18.7,
                     16.8, 14.1, 12.4, 16.1, 11.9,
                     13.5, 17.2, 11.8, 14.5, 14.9,
                     15.2, 16.4, 18.1, 17.5, 16.9]
}
df = pd.DataFrame(data)
# Basic exploration
print("=== Basic Data Information ===")
print(f"Data shape: {df.shape}")
print(f"\nData types:\n{df.dtypes}")
print(f"\nBasic statistics:\n{df.describe()}")
print(f"\nFirst 5 rows:\n{df.head()}")
5.2 Creating Derived Variables and Analysis
# Calculate population density (Population / Area)
df['PopDensity'] = (df['Population'] / df['Area']).round(0).astype(int)
# Calculate population per household
df['PopPerHousehold'] = (df['Population'] / df['Households']).round(2)
# Top 5 districts by population
top5_pop = df.nlargest(5, 'Population')
print("=== Top 5 Districts by Population ===")
print(top5_pop[['District', 'Population', 'PopDensity']])
# Top 5 districts by population density
top5_density = df.nlargest(5, 'PopDensity')
print("\n=== Top 5 Districts by Population Density ===")
print(top5_density[['District', 'PopDensity', 'Area']])
# Classify by elderly population ratio
df['AgingStage'] = pd.cut(df['ElderlyRatio'],
                          bins=[0, 7, 14, 20, 100],
                          labels=['Below Aging', 'Aging Society',
                                  'Aged Society', 'Super-aged Society'])
print("\n=== Number of Districts by Aging Stage ===")
print(df['AgingStage'].value_counts())
5.3 Comprehensive Visualization
# Arrange 4 charts in a single Figure
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Seoul District Population Analysis Dashboard', fontsize=20, fontweight='bold', y=1.02)
# 1) Top 10 districts by population - Horizontal bar chart
top10 = df.nlargest(10, 'Population')
axes[0, 0].barh(top10['District'], top10['Population'], color='#3498db')
axes[0, 0].set_title('Top 10 Districts by Population', fontsize=14, fontweight='bold')
axes[0, 0].set_xlabel('Population')
axes[0, 0].invert_yaxis()
# 2) Population density distribution - Histogram
axes[0, 1].hist(df['PopDensity'], bins=10, color='#2ecc71', edgecolor='white')
axes[0, 1].set_title('Population Density Distribution', fontsize=14, fontweight='bold')
axes[0, 1].set_xlabel('Population Density (per km2)')
axes[0, 1].set_ylabel('Number of Districts')
axes[0, 1].axvline(df['PopDensity'].mean(), color='red', linestyle='--',
                   label=f'Mean: {df["PopDensity"].mean():.0f}')
axes[0, 1].legend()
# 3) Area vs Population scatter plot
scatter = axes[1, 0].scatter(df['Area'], df['Population'],
                             c=df['ElderlyRatio'], cmap='RdYlGn_r',
                             s=100, alpha=0.7, edgecolors='gray')
axes[1, 0].set_title('Area vs Population (Color: Elderly Ratio)', fontsize=14, fontweight='bold')
axes[1, 0].set_xlabel('Area (km2)')
axes[1, 0].set_ylabel('Population')
plt.colorbar(scatter, ax=axes[1, 0], label='Elderly Ratio(%)')
# 4) Aging stage distribution - Pie chart
stage_counts = df['AgingStage'].value_counts()
# value_counts() on a categorical column also reports zero-count categories;
# drop them so empty slices don't clutter the pie chart
stage_counts = stage_counts[stage_counts > 0]
axes[1, 1].pie(stage_counts.values, labels=stage_counts.index,
               autopct='%1.1f%%', colors=['#2ecc71', '#f39c12', '#e74c3c', '#9b59b6'],
               startangle=90, shadow=True)
axes[1, 1].set_title('District Distribution by Aging Stage', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('seoul_population_dashboard.png', dpi=150, bbox_inches='tight')
plt.show()
print("\nAnalysis complete! 'seoul_population_dashboard.png' has been saved.")
TIP: Try downloading actual data from the Korea Open Data Portal (data.go.kr) and use it instead of the fictional data in the code above. Many datasets are provided in CSV format, so you can read them directly with pd.read_csv().
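One practical caveat when using such files: CSVs from Korean public-data portals are frequently encoded in CP949 (EUC-KR) rather than UTF-8, and reading them with the default encoding raises a UnicodeDecodeError. The sketch below simulates such a file with an in-memory buffer (the column names and values are invented); with a real download, you would pass the filename instead:

```python
import io
import pandas as pd

# Simulate a CSV saved in CP949, as many Korean public datasets are
raw = '구분,값\n서울,100\n부산,200\n'.encode('cp949')

# pd.read_csv(..., encoding='utf-8') would fail here;
# passing encoding='cp949' decodes it correctly
df = pd.read_csv(io.BytesIO(raw), encoding='cp949')
print(df)
```

If you are unsure of a file's encoding, trying 'utf-8', 'utf-8-sig', and 'cp949' in that order covers most Korean datasets.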
Conclusion: A Learning Roadmap for the Next Steps
What we covered in this article represents the fundamentals of Python data analysis. From environment setup to data manipulation with Pandas, preprocessing, and visualization with Matplotlib and Seaborn, you have experienced the entire data analysis workflow. Through the hands-on project, you were also able to see how all these processes connect together.
Don't stop here - here is a suggested learning roadmap for the next steps. First, study NumPy in depth. Since NumPy operates internally within Pandas, understanding array operations and vectorization concepts will help you write much more efficient code. Second, learn the basics of machine learning through the Scikit-learn library. By applying fundamental algorithms such as regression, classification, and clustering to your data, you can expand into the realm of prediction and pattern discovery. Third, learn SQL fundamentals alongside Python. In practice, most data is stored in databases, so the workflow of extracting data with SQL and analyzing it with Pandas is extremely common.
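The vectorization point above can be made concrete with a quick timing comparison; a rough sketch (exact timings vary by machine):

```python
import time
import numpy as np

values = np.random.rand(1_000_000)

# Element-by-element Python loop
start = time.perf_counter()
squared_loop = [v * v for v in values]
loop_time = time.perf_counter() - start

# NumPy vectorized operation -- the loop runs in compiled C code
start = time.perf_counter()
squared_vec = values ** 2
vec_time = time.perf_counter() - start

print(f'loop:       {loop_time:.4f}s')
print(f'vectorized: {vec_time:.4f}s')
```

On typical hardware, the vectorized version is one to two orders of magnitude faster, which is why replacing explicit loops with array operations is the single biggest performance habit to build.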
Additionally, by expanding your learning scope to include interactive visualization libraries like Plotly and Bokeh, building data dashboards with Streamlit or Dash, and large-scale data processing with Apache Spark's PySpark, you can systematically build your capabilities as a data analyst or data scientist. The most important thing is consistent practice. Find datasets that interest you on the Korea Open Data Portal, Kaggle, and other platforms, and try conducting your own analysis projects. You will develop practical skills that cannot be gained from theory alone. The world of data analysis is vast, and your journey has just begun.