<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Abdullah Haggag</title>
    <description>The latest articles on DEV Community by Abdullah Haggag (@abdullah_haggag).</description>
    <link>https://dev.to/abdullah_haggag</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2213434%2Fffe8e503-c373-4a50-80d9-7490d2c818ef.jpeg</url>
      <title>DEV Community: Abdullah Haggag</title>
      <link>https://dev.to/abdullah_haggag</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/abdullah_haggag"/>
    <language>en</language>
    <item>
      <title>The Journey From a CSV File to Apache Hive Table</title>
      <dc:creator>Abdullah Haggag</dc:creator>
      <pubDate>Thu, 24 Oct 2024 03:45:55 +0000</pubDate>
      <link>https://dev.to/abdullah_haggag/the-journey-from-a-csv-file-to-apache-hive-table-45ab</link>
      <guid>https://dev.to/abdullah_haggag/the-journey-from-a-csv-file-to-apache-hive-table-45ab</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I am Abdullah, a Data Engineer passionate about building, understanding, and experimenting with data solutions.&lt;/p&gt;

&lt;p&gt;In my previous blog post, I introduced the Big-data Ecosystem Sandbox I’ve been building over the last two months. Today, we’ll take a deeper dive and get hands-on with the sandbox, demonstrating how to import a CSV file into a Hive table. Along the way, we will explore the various tools in the sandbox and how to work with them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Introduction to Hadoop &amp;amp; Hive&lt;/li&gt;
&lt;li&gt;Hands-On: Importing a CSV File into Hive Table&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s begin with a brief introduction to the core components we will be using for this demo.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction to Hadoop: HDFS and YARN
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is Hadoop?
&lt;/h3&gt;

&lt;p&gt;Apache Hadoop is an open-source framework that enables the distributed storage and processing of large datasets across clusters of computers. It is designed to scale from a single server to thousands of machines, each providing local computation and storage capabilities. Hadoop’s architecture is built to handle massive amounts of data efficiently.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hadoop Distributed File System (HDFS)
&lt;/h3&gt;

&lt;p&gt;HDFS is the primary storage system used by Hadoop applications. It is designed to store large data files across a distributed system, breaking data into smaller blocks, replicating these blocks, and distributing them across multiple nodes in a cluster. This enables efficient and reliable computations.&lt;/p&gt;

&lt;p&gt;Key features of HDFS include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fault tolerance&lt;/strong&gt;: HDFS automatically replicates data to ensure fault tolerance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-efficiency&lt;/strong&gt;: It is designed to run on commodity hardware.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High throughput&lt;/strong&gt;: Provides high throughput access to application data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: Can handle large datasets efficiently, even in the petabyte range.&lt;/li&gt;
&lt;/ul&gt;
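
&lt;p&gt;To make the fault-tolerance point concrete, here are a few standard HDFS shell commands (the paths are placeholders for illustration; run these from a node or container that has the &lt;code&gt;hdfs&lt;/code&gt; client installed):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Copy a local file into HDFS
hdfs dfs -put /tmp/sample.csv /user/data/

# List a directory
hdfs dfs -ls /user/data/

# Show the replication factor and block size of a file
hdfs dfs -stat "replication=%r blocksize=%o" /user/data/sample.csv

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;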

&lt;h3&gt;
  
  
  YARN (Yet Another Resource Negotiator)
&lt;/h3&gt;

&lt;p&gt;YARN is Hadoop's resource management system. It is responsible for allocating system resources to applications and scheduling tasks across a cluster, enabling better resource utilization.&lt;/p&gt;

&lt;p&gt;Key benefits of YARN:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Improved cluster utilization&lt;/strong&gt;: Dynamically manages resource allocation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: Supports a large number of nodes and applications.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-tenancy&lt;/strong&gt;: Allows multiple applications to share cluster resources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compatibility&lt;/strong&gt;: Works well with MapReduce and other Hadoop ecosystem projects.&lt;/li&gt;
&lt;/ul&gt;
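
&lt;p&gt;For a quick look at these capabilities in practice, the standard YARN CLI exposes cluster state (the application ID below is a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# List the nodes registered with the ResourceManager
yarn node -list

# List applications currently running on the cluster
yarn application -list

# Fetch the logs of a completed application
yarn logs -applicationId application_1234567890123_0001

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;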

&lt;p&gt;Together, HDFS and YARN form the core components of Hadoop, providing a robust platform for distributed data storage and processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction to Apache Hive
&lt;/h2&gt;

&lt;p&gt;While HDFS stores large files, querying and analyzing them efficiently requires a data warehouse system like Apache Hive. Hive provides an SQL-like interface (HiveQL) to query data stored in various file systems, including HDFS, giving users an easier way to interact with large datasets.&lt;/p&gt;

&lt;p&gt;In short, HDFS stores the data files, while Hive keeps the metadata that says “you can find the data for this table in this directory,” along with statistics about those data files.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Features of Hive
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SQL-like queries&lt;/strong&gt;: Allows users to write queries in HiveQL, similar to SQL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: Hive can handle massive datasets with ease.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compatibility&lt;/strong&gt;: Works seamlessly with the Hadoop ecosystem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Support for various file formats&lt;/strong&gt;: Handles different data storage formats such as CSV, ORC, Parquet, and more.&lt;/li&gt;
&lt;/ul&gt;
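
&lt;p&gt;As a taste of HiveQL, here is a hypothetical aggregation query (the table is an example, not one created in this demo); Hive compiles it into distributed jobs that read the underlying files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Top product categories by revenue
SELECT product_category,
       SUM(total_amount) AS revenue
FROM orders
GROUP BY product_category
ORDER BY revenue DESC
LIMIT 10;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;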

&lt;h2&gt;
  
  
  Hands-On: Importing a CSV File to a Hive Table
&lt;/h2&gt;

&lt;p&gt;This section provides a step-by-step guide to importing a CSV file into a Hive table.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Docker &amp;amp; Docker Compose Installed&lt;/li&gt;
&lt;li&gt;Basic knowledge of Linux Operating System &amp;amp; Docker&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 1: Setup the Playground Environment on Docker
&lt;/h3&gt;

&lt;p&gt;To simulate a Hadoop and Hive environment for this hands-on, we'll use a big-data sandbox that I created. You can find the setup details in the following GitHub repository:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/amhhaggag/bigdata-ecosystem-sandbox" rel="noopener noreferrer"&gt;Big Data Ecosystem Sandbox GitHub Repository&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To start only the required services for this demo, follow the commands below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/amhhaggag/bigdata-ecosystem-sandbox.git
&lt;span class="nb"&gt;cd &lt;/span&gt;bigdata-ecosystem-sandbox

docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt; hive-server

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will start the following components required for Hive:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Hadoop HDFS Namenode and Datanode&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;YARN Resource Manager and Node Manager&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PostgreSQL for Hive Metastore Database&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hive Metastore &amp;amp; Hive Server2&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Verify that the services are running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker ps

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ensure all of these containers are up and in a running state before moving on.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Prepare the Sample CSV File
&lt;/h3&gt;

&lt;p&gt;In the repository's &lt;code&gt;sample-files&lt;/code&gt; directory, you will find a sample CSV file containing randomly generated data. Here's a glimpse of the first few records:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;order_id,order_date,customer_id,product_name,product_category,product_price,items_count,total_amount
019-74-9339,2022-11-25,80129,Spinach,Vegetables,2.49,3,7.47
061-83-1476,2023-12-04,164200,Anker Soundcore Liberty Air 2 Pro,Electronics,129.99,1,129.99
...

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Copy the CSV File into the Hive Server Container
&lt;/h3&gt;

&lt;p&gt;Copy the CSV file into the Hive server container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;cp &lt;/span&gt;sample-files/orders_5k.csv hive-server:/opt/

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command will transfer the &lt;code&gt;orders_5k.csv&lt;/code&gt; file into the Hive server’s &lt;code&gt;/opt/&lt;/code&gt; directory.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Create the Staging Schema &amp;amp; Table
&lt;/h3&gt;

&lt;p&gt;For the rest of the demo, work inside the hive-server container, where we will create the tables and import the data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; hive-server /bin/bash

&lt;span class="c"&gt;## Get into Beeline: The command line tool to interact with hive-server and write queries&lt;/span&gt;
beeline &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="s2"&gt;"jdbc:hive2://hive-server:10000"&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; hive &lt;span class="nt"&gt;-p&lt;/span&gt; hive

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before importing the data, we'll create an external table to temporarily store the CSV data.&lt;/p&gt;

&lt;h4&gt;
  
  
  Managed vs. External Tables
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;External Table&lt;/strong&gt;: Stores data outside Hive’s default location, typically in HDFS or other storage. Dropping the table only deletes metadata, not the actual data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Managed Table&lt;/strong&gt;: Stores data in Hive’s warehouse directory. Dropping the table removes both metadata and data.&lt;/li&gt;
&lt;/ul&gt;
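
&lt;p&gt;A quick way to see the difference (both tables and the &lt;code&gt;LOCATION&lt;/code&gt; path here are illustrative only):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- External: DROP removes only the metadata;
-- the files under LOCATION stay in HDFS
CREATE EXTERNAL TABLE demo_ext (id INT)
LOCATION '/user/data/demo_ext';
DROP TABLE demo_ext;

-- Managed: DROP removes the metadata AND
-- the data files in the Hive warehouse directory
CREATE TABLE demo_managed (id INT);
DROP TABLE demo_managed;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;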

&lt;h4&gt;
  
  
  Creating the Staging Table
&lt;/h4&gt;

&lt;p&gt;We will create a schema and external table for staging the CSV data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;stg&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;EXTERNAL&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;stg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_category&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_price&lt;/span&gt; &lt;span class="nb"&gt;DOUBLE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;items_count&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;total_amount&lt;/span&gt; &lt;span class="nb"&gt;DOUBLE&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ROW&lt;/span&gt; &lt;span class="n"&gt;FORMAT&lt;/span&gt; &lt;span class="n"&gt;SERDE&lt;/span&gt; &lt;span class="s1"&gt;'org.apache.hadoop.hive.serde2.OpenCSVSerde'&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;SERDEPROPERTIES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;"separatorChar"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;","&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;"quoteChar"&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="nv"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;"escapeChar"&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="nv"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;STORED&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;TEXTFILE&lt;/span&gt;
&lt;span class="n"&gt;TBLPROPERTIES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"skip.header.line.count"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a staging table in the &lt;code&gt;stg&lt;/code&gt; schema. The data will be stored in a folder in HDFS corresponding to the table name.&lt;/p&gt;

&lt;h4&gt;
  
  
  Verifying the HDFS Directory
&lt;/h4&gt;

&lt;p&gt;A &lt;code&gt;stg.db&lt;/code&gt; directory should now exist under &lt;code&gt;/user/hive/warehouse/&lt;/code&gt;, which is the main Hive warehouse directory.&lt;/p&gt;

&lt;p&gt;Inside it, a new &lt;code&gt;orders&lt;/code&gt; directory represents the location of the external table's files.&lt;/p&gt;

&lt;p&gt;You can check the HDFS directory for the table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hdfs dfs &lt;span class="nt"&gt;-ls&lt;/span&gt; /user/hive/warehouse/
hdfs dfs &lt;span class="nt"&gt;-ls&lt;/span&gt; /user/hive/warehouse/stg.db/

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 5: Import CSV Data into the Staging Table
&lt;/h3&gt;

&lt;p&gt;To load data into the table, copy the CSV file into the HDFS directory representing the &lt;code&gt;orders&lt;/code&gt; table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hdfs dfs &lt;span class="nt"&gt;-put&lt;/span&gt; /opt/orders_5k.csv /user/hive/warehouse/stg.db/orders/

&lt;span class="c"&gt;# Check that the file is copied correctly&lt;/span&gt;
hdfs dfs &lt;span class="nt"&gt;-ls&lt;/span&gt; /user/hive/warehouse/stg.db/orders/

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, return to Beeline and validate that the data has been loaded and that Hive can read it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;beeline&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="nv"&gt;"jdbc:hive2://hive-server:10000"&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="n"&gt;hive&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="n"&gt;hive&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;stg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query should return a count of 5,000 rows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Create the Main Schema and Table
&lt;/h3&gt;

&lt;p&gt;We will now create a managed table in Hive to store the data as Parquet files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;retail&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;retail&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_category&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_price&lt;/span&gt; &lt;span class="nb"&gt;DOUBLE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;items_count&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;total_amount&lt;/span&gt; &lt;span class="nb"&gt;DOUBLE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;STORED&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;PARQUET&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 7: Move Data from Staging to Main Table
&lt;/h3&gt;

&lt;p&gt;Next, move the data from the staging table to the main table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;retail&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product_category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product_price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;items_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;stg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s1"&gt;'order_id'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 8: Validate Data in the Main Table
&lt;/h3&gt;

&lt;p&gt;You can now validate the data in the main table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;retail&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;retail&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this hands-on session, we explored how to leverage the Big-data Ecosystem Sandbox to import and manage data using Hadoop and Hive. By following the steps, we:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Set up a Hadoop environment with Hive for data management.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Created external and managed Hive tables to efficiently handle and store data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Imported a CSV file into Hive and transformed it into a more optimized format (Parquet).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Explored how Hadoop’s HDFS and Hive work together for data storage and querying.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This practical demonstration shows how to manage large datasets using familiar SQL-like commands in Hive, all while benefiting from the scalability and robustness of Hadoop. The sandbox environment offers a powerful platform for learning and experimentation, giving you a solid foundation to build your own big-data solutions.&lt;/p&gt;

&lt;p&gt;Stay tuned for more advanced use cases and integrations with other tools in the Big-data Ecosystem Sandbox!&lt;/p&gt;

&lt;p&gt;If you have any questions please don't hesitate to ask them in the comments below!&lt;/p&gt;

</description>
      <category>hadoop</category>
      <category>hive</category>
      <category>bigdata</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Building a Big Data Playground Sandbox for Learning</title>
      <dc:creator>Abdullah Haggag</dc:creator>
      <pubDate>Thu, 17 Oct 2024 05:52:21 +0000</pubDate>
      <link>https://dev.to/abdullah_haggag/building-a-big-data-playground-sandbox-for-learning-cgi</link>
      <guid>https://dev.to/abdullah_haggag/building-a-big-data-playground-sandbox-for-learning-cgi</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;As a data engineer, I'm always seeking opportunities to experiment with different data solutions. Whether it's learning a new tool, practicing a solution, or testing ideas in a safe environment, the desire to innovate never ceases. To facilitate this, I've created a personal sandbox using Docker containers, featuring various big data tools. This setup, which I call the "Big-data Ecosystem Sandbox (BES)," leverages open-source big data tools orchestrated within Docker using custom-built images.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sandbox Components
&lt;/h2&gt;

&lt;p&gt;The BES includes a comprehensive set of tools essential for big data processing and analysis:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7aq9bx4jnzu8bfja3d7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7aq9bx4jnzu8bfja3d7.png" alt="Sandbox Components" width="800" height="493"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Storage and Management
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;PostgreSQL: An open-source relational database for structured data storage and complex queries.&lt;/li&gt;
&lt;li&gt;MinIO: A high-performance, distributed object storage system compatible with Amazon S3 API.&lt;/li&gt;
&lt;li&gt;Hadoop: An open-source framework for distributed storage and processing of large datasets.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data Processing and Analytics
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Hive: A data warehouse infrastructure built on Hadoop for querying and managing large datasets.&lt;/li&gt;
&lt;li&gt;Spark: A fast, distributed computing system for large-scale data processing.&lt;/li&gt;
&lt;li&gt;Trino: A distributed SQL query engine for querying data across various sources.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Streaming and Real-time Processing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Kafka: A distributed event streaming platform for building real-time data pipelines.&lt;/li&gt;
&lt;li&gt;Flink: A stream processing framework for real-time data processing and event-driven applications.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data Orchestration and Management
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;NiFi: An easy-to-use, powerful, and reliable system to process and distribute data.&lt;/li&gt;
&lt;li&gt;Airflow: A platform to programmatically author, schedule, and monitor workflows.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;You can find the GitHub Repo through the following link:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/amhhaggag/bigdata-ecosystem-sandbox" rel="noopener noreferrer"&gt;https://github.com/amhhaggag/bigdata-ecosystem-sandbox&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Setup ALL the Sandbox Tools
&lt;/h3&gt;

&lt;p&gt;Make sure your machine has enough CPU and RAM to run all of the containers.&lt;/p&gt;

&lt;p&gt;To set up all the sandbox tools, run the following script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/amhhaggag/bigdata-ecosystem-sandbox.git
&lt;span class="nb"&gt;cd &lt;/span&gt;bigdata-ecosystem-sandbox

./bes-setup.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This script will do the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pull the necessary Docker images from Docker Hub

&lt;ul&gt;
&lt;li&gt;amhhaggag/hadoop-base-3.1.1&lt;/li&gt;
&lt;li&gt;amhhaggag/hive-base-3.1.2&lt;/li&gt;
&lt;li&gt;amhhaggag/spark-3.5.1&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Prepare the PostgreSQL Database for Hive-Metastore Service&lt;/li&gt;
&lt;li&gt;Add the Trino configurations to its mounted volume (a local directory)&lt;/li&gt;
&lt;li&gt;Create &amp;amp; Start all the containers&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now, let’s explain what is included in this repository:&lt;/p&gt;

&lt;h2&gt;
  
  
  Sandbox Architecture
&lt;/h2&gt;

&lt;p&gt;The BES uses a combination of official Docker images and custom-built images to ensure compatibility and integration between tools. The custom images include Apache Hadoop, Hive, Spark, Airflow, and Trino, built in a hierarchical manner to maintain dependencies and ensure smooth integration.&lt;/p&gt;

&lt;p&gt;Below is a diagram illustrating the dependencies between the custom-built images.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzsjrh07spoeiurt5bmgf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzsjrh07spoeiurt5bmgf.png" alt="Custom Images Diagram" width="800" height="853"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Docker Compose Overview
&lt;/h3&gt;

&lt;p&gt;To use the sandbox effectively, you need at least basic knowledge of Docker and Docker Compose. Here is a quick overview of Docker Compose.&lt;/p&gt;

&lt;p&gt;A Docker Compose file, typically named &lt;code&gt;docker-compose.yml&lt;/code&gt;, is a YAML file that defines, configures, and runs multi-container Docker applications. It allows you to manage all your application's services, networks, and volumes in a single place, streamlining deployment and scaling processes.&lt;/p&gt;

&lt;p&gt;Here's the general structure of a Docker Compose file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;service_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;image_name:tag&lt;/span&gt;  &lt;span class="c1"&gt;# Use an existing image&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./path&lt;/span&gt;  &lt;span class="c1"&gt;# Path to the build context&lt;/span&gt;
      &lt;span class="na"&gt;dockerfile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Dockerfile&lt;/span&gt;  &lt;span class="c1"&gt;# Dockerfile to use for building the image&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;host_port:container_port"&lt;/span&gt;  &lt;span class="c1"&gt;# Map host ports to container ports&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;VARIABLE=value&lt;/span&gt;  &lt;span class="c1"&gt;# Set environment variables&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;host_path:container_path&lt;/span&gt;  &lt;span class="c1"&gt;# Mount host paths or volumes&lt;/span&gt;
    &lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;network_name&lt;/span&gt;  &lt;span class="c1"&gt;# Connect to specified networks&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;other_service&lt;/span&gt;  &lt;span class="c1"&gt;# Specify service dependencies&lt;/span&gt;

&lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;network_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;driver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bridge&lt;/span&gt;  &lt;span class="c1"&gt;# Specify the network driver&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;volume_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;driver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;local&lt;/span&gt;  &lt;span class="c1"&gt;# Specify the volume driver&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key Components Explained:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;services&lt;/strong&gt;: Defines individual services (containers) that make up your application.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;service_name&lt;/strong&gt;: A unique identifier for each service.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;image&lt;/strong&gt;: Specifies the Docker image to deploy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;build&lt;/strong&gt;: Instructions for building a Docker image from a Dockerfile.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ports&lt;/strong&gt;: Exposes container ports to the host machine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;environment&lt;/strong&gt;: Sets environment variables within the container.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;volumes&lt;/strong&gt;: Mounts host directories or named volumes into the container.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;networks&lt;/strong&gt;: Connects the service to one or more networks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;depends_on&lt;/strong&gt;: Specifies service dependencies to control startup order.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;networks&lt;/strong&gt;: (Optional) Defines custom networks for your services to communicate.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;network_name&lt;/strong&gt;: The name of the network.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;driver&lt;/strong&gt;: The network driver to use (e.g., bridge, overlay).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;volumes&lt;/strong&gt;: (Optional) Defines named volumes for persistent data storage.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;volume_name&lt;/strong&gt;: The name of the volume.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;driver&lt;/strong&gt;: The volume driver to use.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Practical Example
&lt;/h3&gt;

&lt;p&gt;Below is an example of a Docker Compose file for the PostgreSQL service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:14&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./mnt/postgres:/var/lib/postgresql/data&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;admin"&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;admin"&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;admin"&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5432:5432"&lt;/span&gt;

&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bes-network&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of the Example:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Services

&lt;ul&gt;
&lt;li&gt;Service name: postgres&lt;/li&gt;
&lt;li&gt;image: the Docker image the container will be created from (postgres:14)&lt;/li&gt;
&lt;li&gt;container_name: the container will be created with the name “postgres”&lt;/li&gt;
&lt;li&gt;volumes: the host directory “./mnt/postgres” is mounted into the container directory “/var/lib/postgresql/data”, so the database files persist even if the container is removed and recreated&lt;/li&gt;
&lt;li&gt;environment: the environment variables passed to the container (here, the database name, user, and password)&lt;/li&gt;
&lt;li&gt;ports: the host port 5432 (on the left) is mapped to the container port 5432 (on the right)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Networks

&lt;ul&gt;
&lt;li&gt;the default network is named “bes-network”; all containers attached to this network can communicate with each other&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
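
&lt;p&gt;As a variant of the example above: instead of bind-mounting the host directory &lt;code&gt;./mnt/postgres&lt;/code&gt;, the same service could persist its data in a named volume managed by Docker. The following is a minimal sketch, not part of the sandbox itself; the volume name &lt;code&gt;pg_data&lt;/code&gt; is purely illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;services:
  postgres:
    image: postgres:14
    volumes:
      - pg_data:/var/lib/postgresql/data  # named volume instead of a host path

volumes:
  pg_data:
    driver: local  # Docker manages the storage; data survives container removal
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A named volume keeps the data out of your project directory and survives &lt;code&gt;docker-compose down&lt;/code&gt;, while a bind mount (as in the example above) makes the data files directly visible on the host.&lt;/p&gt;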

&lt;h2&gt;
  
  
  Basic Docker Commands
&lt;/h2&gt;

&lt;p&gt;Here are some fundamental Docker commands to help you interact with containers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;docker ps&lt;/strong&gt;: List running containers
Example: &lt;code&gt;docker ps&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;docker-compose up&lt;/strong&gt;: Create and start the containers defined in docker-compose.yml (&lt;code&gt;-d&lt;/code&gt; runs them in the background)
Example: &lt;code&gt;docker-compose up -d&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;docker start&lt;/strong&gt;: Start a stopped container
Example: &lt;code&gt;docker start my_container&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;docker exec&lt;/strong&gt;: Execute a command in a running container
Example: &lt;code&gt;docker exec -it my_container bash&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;docker logs&lt;/strong&gt;: View the logs of a container
Example: &lt;code&gt;docker logs my_container&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;docker cp&lt;/strong&gt;: Copy files/folders between a container and the local filesystem
Example: &lt;code&gt;docker cp my_container:/path/to/file.txt /local/path/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;docker stop&lt;/strong&gt;: Stop a running container
Example: &lt;code&gt;docker stop my_container&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;docker rm&lt;/strong&gt;: Remove a container
Example: &lt;code&gt;docker rm my_container&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;docker-compose down&lt;/strong&gt;: Stop and remove the containers and networks defined in docker-compose.yml (add &lt;code&gt;-v&lt;/code&gt; to also remove volumes)
Example: &lt;code&gt;docker-compose down&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These commands will help you manage your Docker containers effectively in the Big-data Ecosystem Sandbox.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Applications
&lt;/h2&gt;

&lt;p&gt;The BES opens up a world of possibilities for data engineering experiments and learning. Some potential use cases include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Setting up a data lake using MinIO and processing it with Spark&lt;/li&gt;
&lt;li&gt;Creating real-time data pipelines with Kafka and Flink&lt;/li&gt;
&lt;li&gt;Orchestrating complex data workflows using Airflow&lt;/li&gt;
&lt;li&gt;Performing distributed SQL queries across multiple data sources with Trino&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The Big-data Ecosystem Sandbox provides a comprehensive environment for learning and experimenting with various big data tools. By leveraging Docker and custom integrations, it offers a flexible and powerful platform for data engineers to enhance their skills and explore new ideas. &lt;/p&gt;

&lt;p&gt;In future posts, we'll dive deeper into specific use cases and advanced configurations to help you get the most out of your BES. Stay tuned, and happy data engineering!&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>bigdata</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
