Python Data Engineering Tutorial: Build a Simple ETL Pipeline


This Python data engineering tutorial walks through a small but realistic ETL pipeline. Instead of only talking about theory, we will extract data from a CSV file and an HTTP API, transform it with pandas, and finally load it into a SQL database. By the end, you will have a repeatable Python ETL workflow that you can adapt to real projects.


[Figure: A simple ETL pipeline diagram, from raw data through extract, transform and load steps into a SQL warehouse.]

In this guide we will cover:

  • Why Python is so widely used for data engineering
  • Setting up a small Python environment for ETL work
  • Extracting data from CSV and an HTTP API
  • Transforming data with pandas (cleaning, typing and aggregation)
  • Loading results into a relational database using SQL
  • Visualizing a basic metrics table directly in WordPress

For reference, it is useful to keep the official Python documentation and the pandas documentation handy. Later you can connect this tutorial to batch processing with Apache Spark, streaming with Apache Flink, and storage systems covered in the SQL basics tutorial and the MariaDB tutorial for developers.


1. Why Python for data engineering?

Python has become a default choice for data engineering because it balances readability, ecosystem and flexibility. On one side, you have clear syntax that is easy to maintain in large codebases. On the other side, you get powerful libraries for ETL, orchestration and analytics.

In practice, a typical Python data engineering stack might combine:

  • pandas for in-memory data wrangling and quick experiments.
  • SQLAlchemy for talking to databases such as MariaDB or PostgreSQL.
  • Airflow or similar tools for scheduled workflows.
  • Apache Spark for jobs that outgrow a single machine.

Because Python integrates well with tools like Linux, CI/CD pipelines and React dashboards, you can often use one language across large parts of your stack. This tutorial focuses on the ETL core of that story.


2. Set up a Python ETL environment

To keep things clean, we will create a virtual environment and install just the tools we need. This lightweight setup is usually enough to run local jobs or prototype pipelines before they move into Airflow or a containerized environment.

2.1 Create a virtual environment

First, create and activate a virtualenv:

# create and activate a virtualenv (Unix/macOS)
python -m venv venv
source venv/bin/activate

# on Windows PowerShell:
# python -m venv venv
# .\venv\Scripts\Activate.ps1

2.2 Install core data engineering packages

Next, install pandas and SQLAlchemy. We will also install a MariaDB client so that we can write data to a real database later:

pip install pandas sqlalchemy mariadb python-dotenv requests

At this point, the environment for this Python data engineering tutorial is ready for ETL code. We can now start extracting data.
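
If you want to be able to recreate this environment later (for example in CI or on another machine), you can pin the installed versions to a requirements file. This is an optional step:

# pin the installed versions so the environment can be recreated
pip freeze > requirements.txt

# later, on another machine or in CI:
# pip install -r requirements.txt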


3. Extract: get data from CSV and an API

In many pipelines, “extract” steps pull data from a mix of files, APIs and databases. To keep the example compact, we will use a CSV file for transactional data and a simple HTTP API for country metadata.

3.1 Extract data from CSV with pandas

First, imagine you have a CSV file called orders.csv with basic ecommerce orders:

order_id,user_id,country,amount_usd,created_at
1,u1,US,39.99,2025-11-01T12:01:00Z
2,u2,CA,19.99,2025-11-01T12:05:00Z
3,u1,US,59.99,2025-11-01T13:20:00Z
4,u3,DE,15.00,2025-11-02T09:10:00Z

To extract this into a pandas DataFrame, you can write:

import pandas as pd

orders = pd.read_csv("orders.csv", parse_dates=["created_at"])
print(orders.dtypes)
print(orders.head())

Here, parse_dates ensures that created_at becomes a proper datetime column. This small step pays off later when you group by time or join with other time-based tables.
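
Because created_at is now a real datetime column, quick time-based checks take a single line of pandas. For example, as a sanity check that is not part of the final pipeline:

# Quick sanity check: number of orders per calendar day
print(orders.groupby(orders["created_at"].dt.date).size())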

3.2 Enrich with metadata from an HTTP API

To demonstrate combining sources, we will call a toy API that returns country metadata. In a real pipeline, this might be an internal service or a public dataset. For our purposes, we will treat it as a simple JSON API.

import requests

# Request only the fields we need, with a timeout for safety
resp = requests.get(
    "https://restcountries.com/v3.1/all?fields=cca2,name,region",
    timeout=30,
)
resp.raise_for_status()
countries_raw = resp.json()

countries = pd.DataFrame([
    {
        "country_code": c.get("cca2"),
        "country_name": c.get("name", {}).get("common"),
        "region": c.get("region"),
    }
    for c in countries_raw
])

print(countries.head())
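
In real extract steps, network calls usually get a timeout and a small retry loop so a transient failure does not kill the whole job. A minimal sketch of that idea (the fetch_json helper is just an illustration, not part of any library):

import time

def fetch_json(url, retries=3, timeout=30):
    """Fetch JSON with a timeout and a few simple retries."""
    for attempt in range(retries):
        try:
            resp = requests.get(url, timeout=timeout)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff

countries_raw = fetch_json("https://restcountries.com/v3.1/all?fields=cca2,name,region")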

Now we have two extracted datasets: transactional orders from CSV and country metadata from an API. The next step in our Python data engineering tutorial is to transform and join them.


4. Transform: clean, type and aggregate with pandas

The “transform” stage is where a lot of data engineering work happens. During this stage, you typically normalize types, handle missing values, derive new fields and compute aggregates. We will illustrate a few common patterns.

4.1 Clean and normalize orders

First, we ensure the numeric types and handle any obvious data quality issues:

# Ensure amount_usd is numeric
orders["amount_usd"] = pd.to_numeric(orders["amount_usd"], errors="coerce")

# Drop rows that are clearly broken
orders = orders.dropna(subset=["user_id", "country", "amount_usd", "created_at"])

# Standardize country codes
orders["country"] = orders["country"].str.upper()

Even though this looks simple, these transformations are exactly the sort of logic that often ends up in production ETL jobs built with Airflow, Spark or tools discussed in the CI/CD pipeline tutorial.

4.2 Join with country metadata

Next, we join orders with country metadata to get human-readable country names and regions:

orders_enriched = orders.merge(
    countries,
    left_on="country",
    right_on="country_code",
    how="left",
)

print(orders_enriched[["order_id", "country", "country_name", "region", "amount_usd"]].head())
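
Because this is a left join, any order whose country code the API does not recognize will end up with missing metadata. A quick, optional check for that (just a sketch):

# Find orders whose country code did not match any API record
unmatched = orders_enriched[orders_enriched["country_name"].isna()]
if not unmatched.empty:
    print(f"{len(unmatched)} orders have unrecognized country codes:",
          unmatched["country"].unique())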

At this point, you have a richer dataset that can feed dashboards, machine learning models or downstream batch jobs (for example, ones implemented with Apache Spark).

4.3 Aggregate revenue by country and day

Finally, let us compute a simple metric: total revenue per country per day. This is the kind of query stakeholders ask for continually, so it is useful to bake it into a reusable ETL job.

orders_enriched["order_date"] = orders_enriched["created_at"].dt.date

revenue_by_country_day = (
    orders_enriched
    .groupby(["order_date", "country", "country_name"], as_index=False)
    .agg(total_revenue_usd=("amount_usd", "sum"),
         orders_count=("order_id", "count"))
    .sort_values(["order_date", "total_revenue_usd"], ascending=[True, False])
)

print(revenue_by_country_day.head())

This DataFrame is exactly what we will load into a SQL database next.


5. Load: write results into a SQL database

For the “load” step, we will push our aggregated metrics into a relational database. Although you can choose any engine you like, we will use MariaDB here, since it appears elsewhere on this site and plays nicely with Python via SQLAlchemy and the MariaDB driver.

5.1 Create a SQLAlchemy engine

First, define a connection URL and engine. In production you would probably read credentials from environment variables or a .env file.

from sqlalchemy import create_engine

# Example credentials - replace with your actual settings
user = "etl_user"
password = "strong_password"
host = "localhost"
port = 3306
database = "analytics"

engine = create_engine(
    f"mariadb+mariadbconnector://{user}:{password}@{host}:{port}/{database}"
)
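
Since python-dotenv was installed earlier, one way to keep these credentials out of the code is a small .env file. The variable names below are just an example, not a convention required by any of the libraries:

# .env (keep this file out of version control)
# DB_USER=etl_user
# DB_PASSWORD=strong_password
# DB_HOST=localhost
# DB_PORT=3306
# DB_NAME=analytics

import os

from dotenv import load_dotenv
from sqlalchemy import create_engine

load_dotenv()  # reads the .env file into environment variables

engine = create_engine(
    "mariadb+mariadbconnector://{u}:{p}@{h}:{port}/{db}".format(
        u=os.environ["DB_USER"],
        p=os.environ["DB_PASSWORD"],
        h=os.environ["DB_HOST"],
        port=os.environ.get("DB_PORT", "3306"),
        db=os.environ["DB_NAME"],
    )
)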

Make sure you have a database ready (for example, using tools discussed in the MariaDB tutorial for developers). Once that is set, we can load the DataFrame.

5.2 Load the aggregated metrics table

Now we can write our aggregated table straight into MariaDB using to_sql. In a real job you would usually use append mode and possibly manage partitions, but this example keeps it simple.

table_name = "daily_revenue_by_country"

# if_exists options: 'fail', 'replace', or 'append'
revenue_by_country_day.to_sql(
    table_name,
    engine,
    if_exists="replace",
    index=False,
)

print(f"Wrote {len(revenue_by_country_day)} rows to {table_name}")

At this point, the ETL flow is complete: Python extracted, transformed and loaded data into a warehouse-style table that other tools can query. From here, you can point BI tools, dashboards or even advertising systems (such as those discussed in the Google Analytics / Ads tracking tutorial) at the same database.


6. Quick HTML “chart” of ETL results

To tie this Python data engineering tutorial back to WordPress, we can render a tiny metrics table directly in a post. Suppose the top three rows of daily_revenue_by_country look like this:

order_date,country,country_name,total_revenue_usd,orders_count
2025-11-01,US,United States,99.98,2
2025-11-01,CA,Canada,19.99,1
2025-11-02,DE,Germany,15.00,1

You could render that snippet as a simple HTML table with small bars for relative revenue. This is not a full charting library, but it gives you a quick sense of scale. Later you might replace it with a React or Chart.js visualization like the ones described in your other tech stack posts.

Date         Country         Revenue (USD)   Orders   Relative Revenue
2025-11-01   United States   $99.98          2        ████████████████████
2025-11-01   Canada          $19.99          1        ████
2025-11-02   Germany         $15.00          1        ███
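
If you prefer to generate that HTML from the DataFrame instead of writing it by hand, a minimal sketch could look like this (the function name and inline styling are just illustrations):

def revenue_table_html(df):
    """Render revenue rows as a simple HTML table with proportional bars."""
    max_rev = df["total_revenue_usd"].max()
    rows = []
    for _, r in df.iterrows():
        width = int(100 * r["total_revenue_usd"] / max_rev)
        rows.append(
            f"<tr><td>{r['order_date']}</td><td>{r['country_name']}</td>"
            f"<td>${r['total_revenue_usd']:.2f}</td><td>{r['orders_count']}</td>"
            f"<td><div style='background:#4a90d9;width:{width}%'>&nbsp;</div></td></tr>"
        )
    header = ("<tr><th>Date</th><th>Country</th><th>Revenue (USD)</th>"
              "<th>Orders</th><th>Relative Revenue</th></tr>")
    return "<table>" + header + "".join(rows) + "</table>"

html_snippet = revenue_table_html(revenue_by_country_day.head(3))
print(html_snippet)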

Even this small table highlights a key idea: once your ETL logic is solid, surfacing the numbers is mostly a front-end problem. Consequently, you can plug the same ETL pipeline into dashboards, APIs or reporting tools without rewriting the core Python code.

Together, these steps form a compact but realistic Python data engineering tutorial: extract from multiple sources, transform with pandas, and load into a SQL database ready for analytics.
