Hive and Hadoop Tutorial: Data Warehousing on HDFS


This Hive and Hadoop tutorial introduces the classic big data stack for batch analytics. We will look at how Hadoop stores data in HDFS, how Apache Hive lets you query that data with SQL, and how partitions and file layout affect performance. By the end, you will have a working mental model for building a simple data warehouse on Hadoop that you can connect to tools like Spark and BI dashboards.

In this guide we will cover:

  • The relationship between Hive, Hadoop and HDFS
  • Setting up a small local Hadoop/Hive environment
  • Creating Hive databases and external tables on HDFS
  • Using partitions to speed up queries
  • Running basic Hive SQL queries and inspecting results
  • Rendering a simple aggregated metrics table in HTML

For additional background, it helps to skim the Apache Hadoop project page and the Apache Hive documentation. You can also tie these concepts to our Apache Spark tutorial and Scala for data engineering tutorial, and to downstream storage posts like the SQL basics tutorial and the MariaDB tutorial for developers.


1. How Hive and Hadoop fit together

To start, it is worth clarifying how these technologies relate. Hadoop is an ecosystem of tools, but the two that matter most for this tutorial are:

  • HDFS (Hadoop Distributed File System) – a distributed file system that stores large files across many nodes.
  • YARN / MapReduce – a resource manager and processing model for running batch jobs over data in HDFS.

Hive sits on top of Hadoop and provides a SQL-like layer. Instead of hand-writing MapReduce jobs, you write SELECT statements, and Hive compiles them into jobs that run on Hadoop (traditional MapReduce, Tez, or Spark, depending on your configuration).
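If you want to check or change which engine your session uses, Hive exposes it through a standard configuration property. A minimal sketch follows; which engines are actually available depends on how your cluster was installed:

-- Show the execution engine for the current session
SET hive.execution.engine;

-- Switch the session to Tez (or 'mr' / 'spark', if your cluster supports them)
SET hive.execution.engine=tez;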

In practice, a Hive and Hadoop tutorial like this focuses on three core ideas:

  1. Tables – logical schemas that map over raw files in HDFS.
  2. Partitions – directory layouts that speed up queries by skipping irrelevant files.
  3. File formats – choosing between CSV, ORC, Parquet and others for performance and compression (see the short example after this list).
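As a quick illustration of the third idea, the file format is usually just one clause in the table definition. The sketch below uses an illustrative events table that is not tied to the rest of the tutorial:

-- Illustrative only: the same schema stored as columnar ORC
CREATE TABLE IF NOT EXISTS events_orc (
  event_id   BIGINT,
  user_id    STRING,
  event_type STRING
)
STORED AS ORC;  -- swap ORC for PARQUET or TEXTFILE to compare formats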

Once you understand those, you can combine Hive/Hadoop with modern processing tools like Apache Spark or Apache Flink for more advanced workloads.


2. Set up a small Hive/Hadoop environment

There are many ways to run Hive and Hadoop (local installs, Docker images, cloud distributions). For this tutorial, the exact distribution is less important than the commands; you can adapt them to your environment. A simple approach is to spin up a prebuilt Docker image that bundles Hadoop, HDFS and Hive together.

2.1 Example: start Hadoop + Hive with Docker

The following commands assume a Docker image that exposes HDFS and HiveServer2 (check your specific image docs). The goal is simply to have:

  • An HDFS namenode and datanodes
  • A Hive metastore and HiveServer2
  • A CLI such as hive or beeline to run queries

# example only: your image/tag may differ
# 9870 = HDFS NameNode web UI, 10000 = HiveServer2 (JDBC/beeline)
docker run -d --name hadoop-hive \
  -p 9870:9870 \
  -p 10000:10000 \
  some-hadoop-hive-image:latest

# exec into the container to run the hdfs/hive CLIs
docker exec -it hadoop-hive /bin/bash

Inside the container, you should be able to run simple commands like:

hdfs dfs -ls /
hive --version

Once this works, you are ready to define Hive databases and tables on top of HDFS files.


3. Create a Hive database and external table

In this section of the Hive and Hadoop tutorial, we will create a database for analytics and define a simple external table representing ecommerce orders. An external table tells Hive that the data lives in HDFS under a path that Hive does not own: dropping the table removes only the table metadata, while the underlying files stay in place.

3.1 Prepare data in HDFS

First, create a small CSV file on your local machine called orders.csv:

order_id,user_id,country,amount_usd,created_at
1,u1,US,39.99,2025-11-01T12:01:00Z
2,u2,CA,19.99,2025-11-01T12:05:00Z
3,u1,US,59.99,2025-11-01T13:20:00Z
4,u3,DE,15.00,2025-11-02T09:10:00Z

Then, copy it into HDFS under a warehouse-style directory:

hdfs dfs -mkdir -p /warehouse/raw/orders
hdfs dfs -put orders.csv /warehouse/raw/orders/

3.2 Define the database and external table

Next, open the Hive shell and create a database plus an external table mapped to that HDFS location:

CREATE DATABASE IF NOT EXISTS analytics;

USE analytics;

CREATE EXTERNAL TABLE IF NOT EXISTS orders_raw (
  order_id     BIGINT,
  user_id      STRING,
  country      STRING,
  amount_usd   DOUBLE,
  created_at   STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar"     = "\""
)
LOCATION 'hdfs:///warehouse/raw/orders'
TBLPROPERTIES ("skip.header.line.count" = "1");  -- skip the CSV header row

At this point, Hive knows that analytics.orders_raw is backed by CSV files in /warehouse/raw/orders on HDFS. One caveat: OpenCSVSerde treats every column as a string internally, so Hive casts values such as amount_usd on the fly when you aggregate them. You can run a quick sanity check:

SELECT * FROM analytics.orders_raw LIMIT 5;

This pattern—defining external tables over HDFS paths—is the core of many Hive and Hadoop based data warehouses.
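If you want to double-check how Hive registered the table, the standard metadata commands show the table type and the HDFS location it maps to:

-- Confirm the table is EXTERNAL and points at /warehouse/raw/orders
DESCRIBE FORMATTED analytics.orders_raw;

-- List everything defined in the analytics database
SHOW TABLES IN analytics;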


4. Partitioned Hive tables for better performance

As your data grows, scanning every file for every query becomes expensive. That is where partitions come in. A partitioned table organizes files into directories based on one or more columns (for example, date or country). When you filter on those columns, Hive can skip entire directories.

4.1 Create a partitioned table by order date

To illustrate, suppose we want a partitioned table by order date. We can create a table that uses order_date as a partition key and stores one folder per date.

CREATE EXTERNAL TABLE IF NOT EXISTS orders_by_date (
  order_id     BIGINT,
  user_id      STRING,
  country      STRING,
  amount_usd   DOUBLE,
  created_at   STRING
)
PARTITIONED BY (order_date STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar"     = "\""
)
LOCATION 'hdfs:///warehouse/curated/orders_by_date';

In a production pipeline, you would write files into date-specific directories like order_date=2025-11-01. For this tutorial, you can also load partitions manually using ALTER TABLE ... ADD PARTITION once you have files in place.
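For example, assuming you have already copied a file into an order_date=2025-11-01 subdirectory of the table location (the path below is illustrative), you could register partitions like this:

-- Register a single partition whose files already sit on HDFS
ALTER TABLE analytics.orders_by_date
  ADD IF NOT EXISTS PARTITION (order_date = '2025-11-01')
  LOCATION 'hdfs:///warehouse/curated/orders_by_date/order_date=2025-11-01';

-- Or let Hive discover all order_date=... directories under the table location
MSCK REPAIR TABLE analytics.orders_by_date;

-- Check which partitions the metastore now knows about
SHOW PARTITIONS analytics.orders_by_date;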

4.2 Query with partition pruning

Once data is laid out by partition key, queries that filter on that key become much faster because Hive reads fewer files. For example:

SELECT
  country,
  SUM(amount_usd) AS total_revenue_usd,
  COUNT(*)        AS orders_count
FROM analytics.orders_by_date
WHERE order_date = '2025-11-01'
GROUP BY country
ORDER BY total_revenue_usd DESC;

Behind the scenes, Hive only touches the partition directory for order_date=2025-11-01, which can be a massive win at scale.
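If you want to verify the pruning rather than take it on faith, Hive can report which inputs a query would read. EXPLAIN DEPENDENCY lists the tables and partitions involved, so a well-pruned query should show only the single order_date partition:

-- Lists input tables and partitions without running the query
EXPLAIN DEPENDENCY
SELECT country, SUM(amount_usd) AS total_revenue_usd
FROM analytics.orders_by_date
WHERE order_date = '2025-11-01'
GROUP BY country;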


5. Basic Hive SQL for analytics

Now that we have tables defined, we can write some straightforward Hive SQL to compute the kind of metrics stakeholders actually care about. The examples here mirror patterns from our Python data engineering and Apache Spark tutorials, but run directly in Hive on top of HDFS.

5.1 Total revenue by country

SELECT
  country,
  SUM(amount_usd) AS total_revenue_usd,
  COUNT(*)        AS orders_count
FROM analytics.orders_raw
GROUP BY country
ORDER BY total_revenue_usd DESC;

This query computes basic revenue and volume by country. Even on large datasets, Hive and Hadoop handle the heavy lifting by distributing the scan and aggregation across the cluster.

5.2 Daily revenue trend

Next, we can derive a simple daily revenue trend by extracting a date from the timestamp string. In production, you would usually normalize timestamps into proper types during ingestion, but this example keeps things inline for clarity.

SELECT
  substr(created_at, 1, 10) AS order_date,
  SUM(amount_usd)           AS total_revenue_usd,
  COUNT(*)                  AS orders_count
FROM analytics.orders_raw
GROUP BY substr(created_at, 1, 10)
ORDER BY order_date;

These kinds of queries form the backbone of many Hive-based reports. You can materialize them into tables or views that downstream tools, such as the data visualizations discussed in our React dashboard tutorial, can query efficiently.
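As a minimal sketch of that idea, the daily trend query can be wrapped in a view (the name analytics.daily_revenue is just an example), casting the date string to a proper DATE so downstream tools receive a typed column:

-- Illustrative view name; adjust to your own naming conventions
CREATE VIEW IF NOT EXISTS analytics.daily_revenue AS
SELECT
  CAST(substr(created_at, 1, 10) AS DATE) AS order_date,
  SUM(amount_usd)                         AS total_revenue_usd,
  COUNT(*)                                AS orders_count
FROM analytics.orders_raw
GROUP BY CAST(substr(created_at, 1, 10) AS DATE);

-- Downstream tools can now query the view like any other table
SELECT * FROM analytics.daily_revenue ORDER BY order_date;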


6. Simple HTML metrics table

To connect this Hive and Hadoop tutorial back to a WordPress post (or any other front end), you can represent query results as a small HTML table. Suppose your Hive query for total revenue by country produces this output:

country,total_revenue_usd,orders_count
US,99.98,2
CA,19.99,1
DE,15.00,1
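If you are wondering how to get that CSV out of Hive in the first place, one common approach is to write the query result to a local directory. The output path below is just an example, and note that Hive does not emit a header row, so you add the column names yourself:

-- Dump the aggregated result as comma-separated files under /tmp/revenue_by_country
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/revenue_by_country'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT
  country,
  SUM(amount_usd) AS total_revenue_usd,
  COUNT(*)        AS orders_count
FROM analytics.orders_raw
GROUP BY country
ORDER BY total_revenue_usd DESC;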

You can translate that into a simple table with relative bars to give readers an intuition for the differences without a full charting library:

Country              Revenue (USD)   Orders   Relative Revenue (% of top)
United States (US)   $99.98          2        100%
Canada (CA)          $19.99          1        20%
Germany (DE)         $15.00          1        15%

Even this minimal visualization demonstrates the full loop: Hadoop stores the raw data, Hive defines tables and SQL, and your front end (WordPress, React or another UI) presents the metrics in a way that non-engineers can consume.

Together, these steps form a concise but practical Hive and Hadoop tutorial for building a data warehouse on HDFS. As your needs grow, you can add more tables, introduce ORC or Parquet formats, and integrate jobs written in Spark, Python or Scala—all anchored on the same Hive/Hadoop foundation.
