Abdullah Haggag

The Journey From a CSV File to Apache Hive Table

Introduction

I am Abdullah, a Data Engineer passionate about building, understanding, and experimenting with data solutions.

In my previous blog post, I introduced the Big-data Ecosystem Sandbox I’ve been building over the last two months. Today, we’ll take a deeper dive and get hands-on with the sandbox, demonstrating how to import a CSV file into a Hive table. Along the way, we will explore the various tools in the sandbox and how to work with them.

Contents

  • Introduction to Hadoop & Hive
  • Hands-On: Importing a CSV File into a Hive Table

Let’s begin with a brief introduction to the core components we will be using for this demo.

Introduction to Hadoop: HDFS and YARN

What is Hadoop?

Apache Hadoop is an open-source framework that enables the distributed storage and processing of large datasets across clusters of computers. It is designed to scale from a single server to thousands of machines, each providing local computation and storage capabilities. Hadoop’s architecture is built to handle massive amounts of data efficiently.

Hadoop Distributed File System (HDFS)

HDFS is the primary storage system used by Hadoop applications. It is designed to store large data files across a distributed system, breaking data into smaller blocks, replicating these blocks, and distributing them across multiple nodes in a cluster. This design makes storage reliable and enables efficient parallel processing.

Key features of HDFS include:

  • Fault tolerance: HDFS automatically replicates data to ensure fault tolerance.
  • Cost-efficiency: It is designed to run on commodity hardware.
  • High throughput: Provides high throughput access to application data.
  • Scalability: Can handle large datasets efficiently, even in the petabyte range.
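
If you want to poke at HDFS directly, the hdfs command-line client is the usual entry point. A few illustrative commands (run from any container that has the Hadoop client configured, such as the hive-server container we use later in this demo):

# Show overall capacity and usage of the filesystem
hdfs dfs -df -h /

# List the contents of a directory
hdfs dfs -ls /user/hive/warehouse/

# Report files, blocks, and replication health for a path
hdfs fsck /user/hive/warehouse/ -files -blocks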

YARN (Yet Another Resource Negotiator)

YARN is Hadoop's resource management system. It is responsible for allocating system resources to applications and scheduling tasks across a cluster, enabling better resource utilization.

Key benefits of YARN:

  • Improved cluster utilization: Dynamically manages resource allocation.
  • Scalability: Supports a large number of nodes and applications.
  • Multi-tenancy: Allows multiple applications to share cluster resources.
  • Compatibility: Works well with MapReduce and other Hadoop ecosystem projects.

Together, HDFS and YARN form the core components of Hadoop, providing a robust platform for distributed data storage and processing.
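
To see YARN in action, the yarn command-line client can list the nodes and applications it is managing. A couple of illustrative commands (assuming the Hadoop/YARN client is available in the container you are working from):

# List the NodeManagers registered with the ResourceManager
yarn node -list

# List the applications currently running on the cluster
yarn application -list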

Introduction to Apache Hive

While HDFS stores large files, querying and analyzing these files efficiently requires a data warehouse system like Apache Hive. Hive provides an SQL-like interface (HiveQL) for querying data stored in various file systems, including HDFS, giving users an easier way to interact with large datasets.

In short, HDFS stores the data files, while Hive keeps the metadata that says "the data for this table lives in this directory," along with statistics and other metadata about those files.
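
You can see this mapping for yourself once a table exists: DESCRIBE FORMATTED prints the metadata Hive keeps for a table, including the HDFS location of its data files. For example, against the staging table we create later in this demo:

-- Show the metadata Hive stores for a table, including its HDFS Location
DESCRIBE FORMATTED stg.orders;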

Key Features of Hive

  • SQL-like queries: Allows users to write queries in HiveQL, similar to SQL.
  • Scalability: Hive can handle massive datasets with ease.
  • Compatibility: Works seamlessly with the Hadoop ecosystem.
  • Support for various file formats: Handles different data storage formats such as CSV, ORC, Parquet, and more.

Hands-On: Importing a CSV File into a Hive Table

This section provides a step-by-step guide to loading a CSV file into a Hive table.

Prerequisites

  1. Docker and Docker Compose installed
  2. Basic knowledge of the Linux operating system and Docker

Step 1: Setup the Playground Environment on Docker

To simulate a Hadoop and Hive environment for this hands-on, we'll use a big-data sandbox that I created. You can find the setup details in the following GitHub repository:

Big Data Ecosystem Sandbox GitHub Repository

To start only the required services for this demo, follow the commands below:

git clone https://github.com/amhhaggag/bigdata-ecosystem-sandbox.git
cd bigdata-ecosystem-sandbox

docker compose up -d hive-server

This will start the following components required for Hive:

  • Hadoop HDFS Namenode and Datanode
  • YARN Resource Manager and Node Manager
  • PostgreSQL for Hive Metastore Database
  • Hive Metastore & Hive Server2

Verify that the services are running:

docker ps

Ensure that all of the containers listed above are up and running before moving on.
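
If the plain docker ps output is hard to read, a trimmed view showing only names and status makes it easier to confirm that everything is healthy:

# Show only container names and their status
docker ps --format "table {{.Names}}\t{{.Status}}"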

Step 2: Prepare the Sample CSV File

In the repository's sample-files directory, you will find a sample CSV file containing randomly generated data. Here's a glimpse of the first few records:

order_id,order_date,customer_id,product_name,product_category,product_price,items_count,total_amount
019-74-9339,2022-11-25,80129,Spinach,Vegetables,2.49,3,7.47
061-83-1476,2023-12-04,164200,Anker Soundcore Liberty Air 2 Pro,Electronics,129.99,1,129.99
...

Step 3: Copy the CSV File into the Hive Server Container

Copy the CSV file into the Hive server container:

docker cp sample-files/orders_5k.csv hive-server:/opt/

This command will transfer the orders_5k.csv file into the Hive server’s /opt/ directory.
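
As a quick sanity check, you can confirm that the file actually landed inside the container before moving on:

# Verify the file is present inside the hive-server container
docker exec hive-server ls -lh /opt/orders_5k.csv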

Step 4: Create the Staging Schema & Table

For the rest of the demo, we'll work inside the hive-server container to create the tables and import the data.

docker exec -it hive-server /bin/bash

## Get into Beeline: The command line tool to interact with hive-server and write queries
beeline -u "jdbc:hive2://hive-server:10000" -n hive -p hive

Before importing the data, we'll create an external table to temporarily store the CSV data.

Managed vs. External Tables

  • External Table: Stores data outside Hive’s default location, typically in HDFS or other storage. Dropping the table only deletes metadata, not the actual data.
  • Managed Table: Stores data in Hive’s warehouse directory. Dropping the table removes both metadata and data.
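
For comparison, here is a minimal sketch of both flavours; the table names and the LOCATION path are purely illustrative and not part of this demo:

-- External table: Hive only tracks the metadata; files at LOCATION survive a DROP TABLE
CREATE EXTERNAL TABLE example_ext (id INT, name STRING)
STORED AS TEXTFILE
LOCATION '/data/example_ext';

-- Managed table: files live under the Hive warehouse directory and are removed on DROP TABLE
CREATE TABLE example_managed (id INT, name STRING)
STORED AS TEXTFILE;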

Creating the Staging Table

We will create a schema and external table for staging the CSV data:

CREATE SCHEMA stg;

CREATE EXTERNAL TABLE stg.orders (
    order_id STRING,
    order_date DATE,
    customer_id INT,
    product_name STRING,
    product_category STRING,
    product_price DOUBLE,
    items_count INT,
    total_amount DOUBLE
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    "separatorChar" = ",",
    "quoteChar"     = "\"",
    "escapeChar"    = "\\"
)
STORED AS TEXTFILE
TBLPROPERTIES ("skip.header.line.count" = "1");

This creates a staging table in the stg schema. The data will be stored in a folder in HDFS corresponding to the table name.

Verifying the HDFS Directory

A stg.db directory should now exist under /user/hive/warehouse/, which is Hive's main warehouse directory, and inside it a new orders directory marking the location of the external table's files.

You can check the HDFS directory for the table:

hdfs dfs -ls /user/hive/warehouse/
hdfs dfs -ls /user/hive/warehouse/stg.db/

Step 5: Import CSV Data into the Staging Table

To load data into the table, copy the CSV file into the HDFS directory representing the orders table:

hdfs dfs -put /opt/orders_5k.csv /user/hive/warehouse/stg.db/orders/

# Check that the file was copied correctly
hdfs dfs -ls /user/hive/warehouse/stg.db/orders/

Now, get back into Beeline and validate that the data has been loaded successfully and that Hive is able to read it:

beeline -u "jdbc:hive2://hive-server:10000" -n hive -p hive

SELECT COUNT(*) FROM stg.orders;

This query should return a count of 5,000, matching the number of records in the file.
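
It is also worth eyeballing a few rows to make sure the SerDe parsed the columns as expected:

-- Preview a handful of rows from the staging table
SELECT * FROM stg.orders LIMIT 5;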

Step 6: Create the Main Schema and Table

We will now create a managed table in Hive to store the data as Parquet files:

CREATE SCHEMA retail;

CREATE TABLE retail.orders (
    order_id STRING,
    customer_id INT,
    product_name STRING,
    product_category STRING,
    product_price DOUBLE,
    items_count INT,
    total_amount DOUBLE,
    order_date DATE
)
STORED AS PARQUET;

Step 7: Move Data from Staging to Main Table

Next, move the data from the staging table into the main table. Note that the SELECT lists the columns in the order defined by retail.orders (with order_date at the end), and the WHERE clause filters out the header row in case it was read as data:

INSERT INTO retail.orders
SELECT order_id, customer_id, product_name, product_category, product_price, items_count, total_amount, order_date
FROM stg.orders
WHERE order_id != 'order_id';

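Optionally, from a shell inside the hive-server container, you can confirm that Parquet files were written for the managed table (the path below assumes the default warehouse location):

# Parquet data files for the managed retail.orders table
hdfs dfs -ls /user/hive/warehouse/retail.db/orders/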

Step 8: Validate Data in the Main Table

You can now validate the data in the main table:

SELECT COUNT(*) FROM retail.orders;
SELECT * FROM retail.orders LIMIT 5;

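With the data now in a Parquet-backed table, regular analytical queries work as you would expect. As a quick illustration (not part of the original steps), here is a simple aggregation over the demo dataset:

-- Total sales per product category
SELECT product_category,
       SUM(total_amount) AS total_sales
FROM retail.orders
GROUP BY product_category
ORDER BY total_sales DESC;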

Conclusion

In this hands-on session, we explored how to leverage the Big-data Ecosystem Sandbox to import and manage data using Hadoop and Hive. By following the steps, we:

  • Set up a Hadoop environment with Hive for data management.
  • Created external and managed Hive tables to efficiently handle and store data.
  • Imported a CSV file into Hive and transformed it into a more optimized format (Parquet).
  • Explored how Hadoop’s HDFS and Hive work together for data storage and querying.

This practical demonstration shows how to manage large datasets using familiar SQL-like commands in Hive, all while benefiting from the scalability and robustness of Hadoop. The sandbox environment offers a powerful platform for learning and experimentation, giving you a solid foundation to build your own big-data solutions.

Stay tuned for more advanced use cases and integrations with other tools in the Big-data Ecosystem Sandbox!

If you have any questions, please don't hesitate to ask them in the comments below!
