<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Abdullah Haggag</title>
    <description>The latest articles on DEV Community by Abdullah Haggag (@abdullah_haggag).</description>
    <link>https://dev.to/abdullah_haggag</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2213434%2Fffe8e503-c373-4a50-80d9-7490d2c818ef.jpeg</url>
      <title>DEV Community: Abdullah Haggag</title>
      <link>https://dev.to/abdullah_haggag</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/abdullah_haggag"/>
    <language>en</language>
    <item>
      <title>The Journey From a CSV File to Apache Hive Table</title>
      <dc:creator>Abdullah Haggag</dc:creator>
      <pubDate>Thu, 24 Oct 2024 03:45:55 +0000</pubDate>
      <link>https://dev.to/abdullah_haggag/the-journey-from-a-csv-file-to-apache-hive-table-45ab</link>
      <guid>https://dev.to/abdullah_haggag/the-journey-from-a-csv-file-to-apache-hive-table-45ab</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I am Abdullah, a Data Engineer passionate about building, understanding, and experimenting with data solutions.&lt;/p&gt;

&lt;p&gt;In my previous blog post, I introduced the Big-data Ecosystem Sandbox I’ve been building over the last two months. Today, we’ll take a deeper dive and get hands-on with the sandbox, demonstrating how to import a CSV file into a Hive table. Along the way, we will explore the various tools in the sandbox and how to work with them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Introduction to Hadoop &amp;amp; Hive&lt;/li&gt;
&lt;li&gt;Hands-On: Importing a CSV File into Hive Table&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s begin with a brief introduction to the core components we will be using for this demo.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction to Hadoop: HDFS and YARN
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is Hadoop?
&lt;/h3&gt;

&lt;p&gt;Apache Hadoop is an open-source framework that enables the distributed storage and processing of large datasets across clusters of computers. It is designed to scale from a single server to thousands of machines, each providing local computation and storage capabilities. Hadoop’s architecture is built to handle massive amounts of data efficiently.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hadoop Distributed File System (HDFS)
&lt;/h3&gt;

&lt;p&gt;HDFS is the primary storage system used by Hadoop applications. It is designed to store large data files across a distributed system, breaking data into smaller blocks, replicating these blocks, and distributing them across multiple nodes in a cluster. This enables efficient and reliable computations.&lt;/p&gt;

&lt;p&gt;Key features of HDFS include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fault tolerance&lt;/strong&gt;: HDFS automatically replicates data to ensure fault tolerance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-efficiency&lt;/strong&gt;: It is designed to run on commodity hardware.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High throughput&lt;/strong&gt;: Provides high throughput access to application data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: Can handle large datasets efficiently, even in the petabyte range.&lt;/li&gt;
&lt;/ul&gt;
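
&lt;p&gt;To make the fault-tolerance point concrete, here are a few standard HDFS shell commands (the paths are placeholders for illustration; run these from a node or container that has the &lt;code&gt;hdfs&lt;/code&gt; client installed):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Copy a local file into HDFS
hdfs dfs -put /tmp/sample.csv /user/data/

# List a directory
hdfs dfs -ls /user/data/

# Show the replication factor and block size of a file
hdfs dfs -stat "replication=%r blocksize=%o" /user/data/sample.csv

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;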

&lt;h3&gt;
  
  
  YARN (Yet Another Resource Negotiator)
&lt;/h3&gt;

&lt;p&gt;YARN is Hadoop's resource management system. It is responsible for allocating system resources to applications and scheduling tasks across a cluster, enabling better resource utilization.&lt;/p&gt;

&lt;p&gt;Key benefits of YARN:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Improved cluster utilization&lt;/strong&gt;: Dynamically manages resource allocation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: Supports a large number of nodes and applications.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-tenancy&lt;/strong&gt;: Allows multiple applications to share cluster resources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compatibility&lt;/strong&gt;: Works well with MapReduce and other Hadoop ecosystem projects.&lt;/li&gt;
&lt;/ul&gt;
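
&lt;p&gt;For a quick look at these capabilities in practice, the standard YARN CLI exposes cluster state (the application ID below is a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# List the nodes registered with the ResourceManager
yarn node -list

# List applications currently running on the cluster
yarn application -list

# Fetch the logs of a completed application
yarn logs -applicationId application_1234567890123_0001

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;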

&lt;p&gt;Together, HDFS and YARN form the core components of Hadoop, providing a robust platform for distributed data storage and processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction to Apache Hive
&lt;/h2&gt;

&lt;p&gt;While HDFS stores large files, querying and analyzing them efficiently requires a data warehouse system like Apache Hive. Hive provides an SQL-like interface (HiveQL) to query data stored in various file systems, including HDFS, giving users an easier way to interact with large datasets.&lt;/p&gt;

&lt;p&gt;In short, HDFS stores the data files, while Hive keeps the metadata that says “you can find the data for this table in this directory,” along with statistics about those data files.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Features of Hive
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SQL-like queries&lt;/strong&gt;: Allows users to write queries in HiveQL, similar to SQL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: Hive can handle massive datasets with ease.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compatibility&lt;/strong&gt;: Works seamlessly with the Hadoop ecosystem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Support for various file formats&lt;/strong&gt;: Handles different data storage formats such as CSV, ORC, Parquet, and more.&lt;/li&gt;
&lt;/ul&gt;
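
&lt;p&gt;As a taste of HiveQL, here is a hypothetical aggregation query (the table is an example, not one created in this demo); Hive compiles it into distributed jobs that read the underlying files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Top product categories by revenue
SELECT product_category,
       SUM(total_amount) AS revenue
FROM orders
GROUP BY product_category
ORDER BY revenue DESC
LIMIT 10;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;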

&lt;h2&gt;
  
  
  Hands-On: Importing a CSV File to a Hive Table
&lt;/h2&gt;

&lt;p&gt;This section provides a step-by-step guide to importing a CSV file into a Hive table.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Docker &amp;amp; Docker Compose Installed&lt;/li&gt;
&lt;li&gt;Basic knowledge of Linux Operating System &amp;amp; Docker&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 1: Setup the Playground Environment on Docker
&lt;/h3&gt;

&lt;p&gt;To simulate a Hadoop and Hive environment for this hands-on, we'll use a big-data sandbox that I created. You can find the setup details in the following GitHub repository:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/amhhaggag/bigdata-ecosystem-sandbox" rel="noopener noreferrer"&gt;Big Data Ecosystem Sandbox GitHub Repository&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To start only the required services for this demo, follow the commands below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/amhhaggag/bigdata-ecosystem-sandbox.git
&lt;span class="nb"&gt;cd &lt;/span&gt;bigdata-ecosystem-sandbox

docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt; hive-server

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will start the following components required for Hive:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Hadoop HDFS Namenode and Datanode&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;YARN Resource Manager and Node Manager&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PostgreSQL for Hive Metastore Database&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hive Metastore &amp;amp; Hive Server2&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Verify that the services are running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker ps

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ensure all of these containers are up and in a running state before moving on.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Prepare the Sample CSV File
&lt;/h3&gt;

&lt;p&gt;In the repository's &lt;code&gt;sample-files&lt;/code&gt; directory, you will find a sample CSV file containing randomly generated data. Here's a glimpse of the first few records:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;order_id,order_date,customer_id,product_name,product_category,product_price,items_count,total_amount
019-74-9339,2022-11-25,80129,Spinach,Vegetables,2.49,3,7.47
061-83-1476,2023-12-04,164200,Anker Soundcore Liberty Air 2 Pro,Electronics,129.99,1,129.99
...

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Copy the CSV File into the Hive Server Container
&lt;/h3&gt;

&lt;p&gt;Copy the CSV file into the Hive server container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;cp &lt;/span&gt;sample-files/orders_5k.csv hive-server:/opt/

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command will transfer the &lt;code&gt;orders_5k.csv&lt;/code&gt; file into the Hive server’s &lt;code&gt;/opt/&lt;/code&gt; directory.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Create the Staging Schema &amp;amp; Table
&lt;/h3&gt;

&lt;p&gt;For the rest of the demo, work inside the hive-server container, where we will create the tables and import the data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; hive-server /bin/bash

&lt;span class="c"&gt;## Get into Beeline: The command line tool to interact with hive-server and write queries&lt;/span&gt;
beeline &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="s2"&gt;"jdbc:hive2://hive-server:10000"&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; hive &lt;span class="nt"&gt;-p&lt;/span&gt; hive

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before importing the data, we'll create an external table to temporarily store the CSV data.&lt;/p&gt;

&lt;h4&gt;
  
  
  Managed vs. External Tables
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;External Table&lt;/strong&gt;: Stores data outside Hive’s default location, typically in HDFS or other storage. Dropping the table only deletes metadata, not the actual data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Managed Table&lt;/strong&gt;: Stores data in Hive’s warehouse directory. Dropping the table removes both metadata and data.&lt;/li&gt;
&lt;/ul&gt;
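
&lt;p&gt;A quick way to see the difference (both tables and the &lt;code&gt;LOCATION&lt;/code&gt; path here are illustrative only):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- External: DROP removes only the metadata;
-- the files under LOCATION stay in HDFS
CREATE EXTERNAL TABLE demo_ext (id INT)
LOCATION '/user/data/demo_ext';
DROP TABLE demo_ext;

-- Managed: DROP removes the metadata AND
-- the data files in the Hive warehouse directory
CREATE TABLE demo_managed (id INT);
DROP TABLE demo_managed;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;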

&lt;h4&gt;
  
  
  Creating the Staging Table
&lt;/h4&gt;

&lt;p&gt;We will create a schema and external table for staging the CSV data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;stg&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;EXTERNAL&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;stg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_category&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_price&lt;/span&gt; &lt;span class="nb"&gt;DOUBLE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;items_count&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;total_amount&lt;/span&gt; &lt;span class="nb"&gt;DOUBLE&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ROW&lt;/span&gt; &lt;span class="n"&gt;FORMAT&lt;/span&gt; &lt;span class="n"&gt;SERDE&lt;/span&gt; &lt;span class="s1"&gt;'org.apache.hadoop.hive.serde2.OpenCSVSerde'&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;SERDEPROPERTIES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;"separatorChar"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;","&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;"quoteChar"&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="nv"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;"escapeChar"&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="nv"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;STORED&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;TEXTFILE&lt;/span&gt;
&lt;span class="n"&gt;TBLPROPERTIES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"skip.header.line.count"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a staging table in the &lt;code&gt;stg&lt;/code&gt; schema. The data will be stored in a folder in HDFS corresponding to the table name.&lt;/p&gt;

&lt;h4&gt;
  
  
  Verifying the HDFS Directory
&lt;/h4&gt;

&lt;p&gt;A &lt;code&gt;stg.db&lt;/code&gt; directory should now exist under &lt;code&gt;/user/hive/warehouse/&lt;/code&gt;, which is the main Hive warehouse directory.&lt;/p&gt;

&lt;p&gt;Inside it, a new &lt;code&gt;orders&lt;/code&gt; directory represents the location of the external table's files.&lt;/p&gt;

&lt;p&gt;You can check the HDFS directory for the table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hdfs dfs &lt;span class="nt"&gt;-ls&lt;/span&gt; /user/hive/warehouse/
hdfs dfs &lt;span class="nt"&gt;-ls&lt;/span&gt; /user/hive/warehouse/stg.db/

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 5: Import CSV Data into the Staging Table
&lt;/h3&gt;

&lt;p&gt;To load data into the table, copy the CSV file into the HDFS directory representing the &lt;code&gt;orders&lt;/code&gt; table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hdfs dfs &lt;span class="nt"&gt;-put&lt;/span&gt; /opt/orders_5k.csv /user/hive/warehouse/stg.db/orders/

&lt;span class="c"&gt;# Check that the file is copied correctly&lt;/span&gt;
hdfs dfs &lt;span class="nt"&gt;-ls&lt;/span&gt; /user/hive/warehouse/stg.db/orders/

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, return to Beeline and validate that the data has been loaded and that Hive can read it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;beeline&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="nv"&gt;"jdbc:hive2://hive-server:10000"&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="n"&gt;hive&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="n"&gt;hive&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;stg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query should return a count of 5,000 rows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Create the Main Schema and Table
&lt;/h3&gt;

&lt;p&gt;We will now create a managed table in Hive to store the data as Parquet files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;retail&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;retail&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_category&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_price&lt;/span&gt; &lt;span class="nb"&gt;DOUBLE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;items_count&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;total_amount&lt;/span&gt; &lt;span class="nb"&gt;DOUBLE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;STORED&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;PARQUET&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 7: Move Data from Staging to Main Table
&lt;/h3&gt;

&lt;p&gt;Next, move the data from the staging table to the main table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;retail&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product_category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product_price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;items_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;stg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s1"&gt;'order_id'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 8: Validate Data in the Main Table
&lt;/h3&gt;

&lt;p&gt;You can now validate the data in the main table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;retail&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;retail&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this hands-on session, we explored how to leverage the Big-data Ecosystem Sandbox to import and manage data using Hadoop and Hive. By following the steps, we:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Set up a Hadoop environment with Hive for data management.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Created external and managed Hive tables to efficiently handle and store data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Imported a CSV file into Hive and transformed it into a more optimized format (Parquet).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Explored how Hadoop’s HDFS and Hive work together for data storage and querying.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This practical demonstration shows how to manage large datasets using familiar SQL-like commands in Hive, all while benefiting from the scalability and robustness of Hadoop. The sandbox environment offers a powerful platform for learning and experimentation, giving you a solid foundation to build your own big-data solutions.&lt;/p&gt;

&lt;p&gt;Stay tuned for more advanced use cases and integrations with other tools in the Big-data Ecosystem Sandbox!&lt;/p&gt;

&lt;p&gt;If you have any questions please don't hesitate to ask them in the comments below!&lt;/p&gt;

</description>
      <category>hadoop</category>
      <category>hive</category>
      <category>bigdata</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Building a Big Data Playground Sandbox for Learning</title>
      <dc:creator>Abdullah Haggag</dc:creator>
      <pubDate>Thu, 17 Oct 2024 05:52:21 +0000</pubDate>
      <link>https://dev.to/abdullah_haggag/building-a-big-data-playground-sandbox-for-learning-cgi</link>
      <guid>https://dev.to/abdullah_haggag/building-a-big-data-playground-sandbox-for-learning-cgi</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;As a data engineer, I'm always seeking opportunities to experiment with different data solutions. Whether it's learning a new tool, practicing a solution, or testing ideas in a safe environment, the desire to innovate never ceases. To facilitate this, I've created a personal sandbox using Docker containers, featuring various big data tools. This setup, which I call the "Big-data Ecosystem Sandbox (BES)," leverages open-source big data tools orchestrated within Docker using custom-built images.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sandbox Components
&lt;/h2&gt;

&lt;p&gt;The BES includes a comprehensive set of tools essential for big data processing and analysis:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7aq9bx4jnzu8bfja3d7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7aq9bx4jnzu8bfja3d7.png" alt="Sandbox Components" width="800" height="493"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Storage and Management
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;PostgreSQL: An open-source relational database for structured data storage and complex queries.&lt;/li&gt;
&lt;li&gt;MinIO: A high-performance, distributed object storage system compatible with Amazon S3 API.&lt;/li&gt;
&lt;li&gt;Hadoop: An open-source framework for distributed storage and processing of large datasets.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data Processing and Analytics
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Hive: A data warehouse infrastructure built on Hadoop for querying and managing large datasets.&lt;/li&gt;
&lt;li&gt;Spark: A fast, distributed computing system for large-scale data processing.&lt;/li&gt;
&lt;li&gt;Trino: A distributed SQL query engine for querying data across various sources.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Streaming and Real-time Processing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Kafka: A distributed event streaming platform for building real-time data pipelines.&lt;/li&gt;
&lt;li&gt;Flink: A stream processing framework for real-time data processing and event-driven applications.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data Orchestration and Management
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;NiFi: An easy-to-use, powerful, and reliable system to process and distribute data.&lt;/li&gt;
&lt;li&gt;Airflow: A platform to programmatically author, schedule, and monitor workflows.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;You can find the GitHub Repo through the following link:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/amhhaggag/bigdata-ecosystem-sandbox" rel="noopener noreferrer"&gt;https://github.com/amhhaggag/bigdata-ecosystem-sandbox&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Setup ALL the Sandbox Tools
&lt;/h3&gt;

&lt;p&gt;Make sure your machine has enough CPU and RAM to run all of the containers.&lt;/p&gt;

&lt;p&gt;To set up all the sandbox tools, run the following script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/amhhaggag/bigdata-ecosystem-sandbox.git
&lt;span class="nb"&gt;cd &lt;/span&gt;bigdata-ecosystem-sandbox

./bes-setup.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This script will do the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pull the necessary Docker images from Docker Hub

&lt;ul&gt;
&lt;li&gt;amhhaggag/hadoop-base-3.1.1&lt;/li&gt;
&lt;li&gt;amhhaggag/hive-base-3.1.2&lt;/li&gt;
&lt;li&gt;amhhaggag/spark-3.5.1&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Prepare the PostgreSQL Database for Hive-Metastore Service&lt;/li&gt;
&lt;li&gt;Add the Trino configurations to its mounted volume (a local directory)&lt;/li&gt;
&lt;li&gt;Create &amp;amp; Start all the containers&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now, let’s explain what is included in this repository:&lt;/p&gt;

&lt;h2&gt;
  
  
  Sandbox Architecture
&lt;/h2&gt;

&lt;p&gt;The BES uses a combination of official Docker images and custom-built images to ensure compatibility and integration between tools. The custom images include Apache Hadoop, Hive, Spark, Airflow, and Trino, built in a hierarchical manner to maintain dependencies and ensure smooth integration.&lt;/p&gt;

&lt;p&gt;Below is a diagram illustrating the dependencies between the custom-built images.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzsjrh07spoeiurt5bmgf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzsjrh07spoeiurt5bmgf.png" alt="Custom Images Diagram" width="800" height="853"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Docker Compose Overview
&lt;/h3&gt;

&lt;p&gt;To use the sandbox effectively, you need at least basic knowledge of Docker and Docker Compose. Here is a quick overview of Docker Compose.&lt;/p&gt;

&lt;p&gt;A Docker Compose file, typically named &lt;code&gt;docker-compose.yml&lt;/code&gt;, is a YAML file that defines, configures, and runs multi-container Docker applications. It allows you to manage all your application's services, networks, and volumes in a single place, streamlining deployment and scaling processes.&lt;/p&gt;

&lt;p&gt;Here's the general structure of a Docker Compose file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;service_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;image_name:tag&lt;/span&gt;  &lt;span class="c1"&gt;# Use an existing image&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./path&lt;/span&gt;  &lt;span class="c1"&gt;# Path to the build context&lt;/span&gt;
      &lt;span class="na"&gt;dockerfile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Dockerfile&lt;/span&gt;  &lt;span class="c1"&gt;# Dockerfile to use for building the image&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;host_port:container_port"&lt;/span&gt;  &lt;span class="c1"&gt;# Map host ports to container ports&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;VARIABLE=value&lt;/span&gt;  &lt;span class="c1"&gt;# Set environment variables&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;host_path:container_path&lt;/span&gt;  &lt;span class="c1"&gt;# Mount host paths or volumes&lt;/span&gt;
    &lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;network_name&lt;/span&gt;  &lt;span class="c1"&gt;# Connect to specified networks&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;other_service&lt;/span&gt;  &lt;span class="c1"&gt;# Specify service dependencies&lt;/span&gt;

&lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;network_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;driver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bridge&lt;/span&gt;  &lt;span class="c1"&gt;# Specify the network driver&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;volume_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;driver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;local&lt;/span&gt;  &lt;span class="c1"&gt;# Specify the volume driver&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key Components Explained:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;services&lt;/strong&gt;: Defines individual services (containers) that make up your application.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;service_name&lt;/strong&gt;: A unique identifier for each service.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;image&lt;/strong&gt;: Specifies the Docker image to deploy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;build&lt;/strong&gt;: Instructions for building a Docker image from a Dockerfile.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ports&lt;/strong&gt;: Exposes container ports to the host machine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;environment&lt;/strong&gt;: Sets environment variables within the container.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;volumes&lt;/strong&gt;: Mounts host directories or named volumes into the container.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;networks&lt;/strong&gt;: Connects the service to one or more networks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;depends_on&lt;/strong&gt;: Specifies service dependencies to control startup order.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;networks&lt;/strong&gt;: (Optional) Defines custom networks for your services to communicate.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;network_name&lt;/strong&gt;: The name of the network.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;driver&lt;/strong&gt;: The network driver to use (e.g., bridge, overlay).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;volumes&lt;/strong&gt;: (Optional) Defines named volumes for persistent data storage.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;volume_name&lt;/strong&gt;: The name of the volume.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;driver&lt;/strong&gt;: The volume driver to use.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Practical Example
&lt;/h3&gt;

&lt;p&gt;Below is an example of a Docker Compose file for the PostgreSQL service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:14&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./mnt/postgres:/var/lib/postgresql/data&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;admin"&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;admin"&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;admin"&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5432:5432"&lt;/span&gt;

&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bes-network&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of the Example:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Services

&lt;ul&gt;
&lt;li&gt;Service name: postgres&lt;/li&gt;
&lt;li&gt;image: the Docker image the container will be created from (postgres:14)&lt;/li&gt;
&lt;li&gt;container_name: the container will be created with the name “postgres”&lt;/li&gt;
&lt;li&gt;volumes: the host directory “./mnt/postgres” is mounted into the container directory “/var/lib/postgresql/data”, so the database files persist even if the container is removed and recreated&lt;/li&gt;
&lt;li&gt;environment: the environment variables passed to the container (here, the database name, user, and password)&lt;/li&gt;
&lt;li&gt;ports: the host port 5432 (on the left) is mapped to the container port 5432 (on the right)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Networks

&lt;ul&gt;
&lt;li&gt;the default network is named “bes-network”; all containers attached to this network can communicate with each other&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
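
&lt;p&gt;As a variant of the example above: instead of bind-mounting the host directory &lt;code&gt;./mnt/postgres&lt;/code&gt;, the same service could persist its data in a named volume managed by Docker. The following is a minimal sketch, not part of the sandbox itself; the volume name &lt;code&gt;pg_data&lt;/code&gt; is purely illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;services:
  postgres:
    image: postgres:14
    volumes:
      - pg_data:/var/lib/postgresql/data  # named volume instead of a host path

volumes:
  pg_data:
    driver: local  # Docker manages the storage; data survives container removal
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A named volume keeps the data out of your project directory and survives &lt;code&gt;docker-compose down&lt;/code&gt;, while a bind mount (as in the example above) makes the data files directly visible on the host.&lt;/p&gt;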

&lt;h2&gt;
  
  
  Basic Docker Commands
&lt;/h2&gt;

&lt;p&gt;Here are some fundamental Docker commands to help you interact with containers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;docker ps&lt;/strong&gt;: List running containers
Example: &lt;code&gt;docker ps&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;docker-compose up&lt;/strong&gt;: Create and start the containers defined in docker-compose.yml (&lt;code&gt;-d&lt;/code&gt; runs them in the background)
Example: &lt;code&gt;docker-compose up -d&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;docker start&lt;/strong&gt;: Start a stopped container
Example: &lt;code&gt;docker start my_container&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;docker exec&lt;/strong&gt;: Execute a command in a running container
Example: &lt;code&gt;docker exec -it my_container bash&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;docker logs&lt;/strong&gt;: View the logs of a container
Example: &lt;code&gt;docker logs my_container&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;docker cp&lt;/strong&gt;: Copy files/folders between a container and the local filesystem
Example: &lt;code&gt;docker cp my_container:/path/to/file.txt /local/path/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;docker stop&lt;/strong&gt;: Stop a running container
Example: &lt;code&gt;docker stop my_container&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;docker rm&lt;/strong&gt;: Remove a container
Example: &lt;code&gt;docker rm my_container&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;docker-compose down&lt;/strong&gt;: Stop and remove the containers and networks defined in docker-compose.yml (add &lt;code&gt;-v&lt;/code&gt; to also remove volumes)
Example: &lt;code&gt;docker-compose down&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These commands will help you manage your Docker containers effectively in the Big-data Ecosystem Sandbox.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Applications
&lt;/h2&gt;

&lt;p&gt;The BES opens up a world of possibilities for data engineering experiments and learning. Some potential use cases include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Setting up a data lake using MinIO and processing it with Spark&lt;/li&gt;
&lt;li&gt;Creating real-time data pipelines with Kafka and Flink&lt;/li&gt;
&lt;li&gt;Orchestrating complex data workflows using Airflow&lt;/li&gt;
&lt;li&gt;Performing distributed SQL queries across multiple data sources with Trino&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The Big-data Ecosystem Sandbox provides a comprehensive environment for learning and experimenting with various big data tools. By leveraging Docker and custom integrations, it offers a flexible and powerful platform for data engineers to enhance their skills and explore new ideas. &lt;/p&gt;

&lt;p&gt;In future posts, we'll dive deeper into specific use cases and advanced configurations to help you get the most out of your BES. Stay tuned, and happy data engineering!&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>bigdata</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
