<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Gilbert korir</title>
    <description>The latest articles on DEV Community by Gilbert korir (@gilbert_korir).</description>
    <link>https://dev.to/gilbert_korir</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3509560%2Fbbb0aead-88b6-4571-b595-20774007a6e9.png</url>
      <title>DEV Community: Gilbert korir</title>
      <link>https://dev.to/gilbert_korir</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gilbert_korir"/>
    <language>en</language>
    <item>
      <title>Python For Data Engineering</title>
      <dc:creator>Gilbert korir</dc:creator>
      <pubDate>Fri, 10 Oct 2025 09:02:59 +0000</pubDate>
      <link>https://dev.to/gilbert_korir/python-for-data-engineering-6e4</link>
      <guid>https://dev.to/gilbert_korir/python-for-data-engineering-6e4</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyzzihngqkspew018c0hc.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyzzihngqkspew018c0hc.jpg" alt=" " width="299" height="168"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Data engineers are responsible for managing, processing, and transforming raw data into valuable information that businesses can use to make decisions. &lt;br&gt;
Python allows data engineers to write clear and maintainable code, which is crucial for the complex processes involved in ETL (Extract, Transform, Load). Python’s strong community support and rich ecosystem of libraries also provide powerful tools to simplify data extraction, transformation, and loading tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Below is how Python concepts and libraries are essential to data engineering:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Data Processing:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Python is commonly used for data manipulation, cleaning, and transformation tasks, especially when dealing with large datasets. Libraries like Pandas and NumPy are popular choices here.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd


def extract_data(file_path):
    # Read the CSV file into a DataFrame
    data = pd.read_csv(file_path)
    return data


# Usage
data = extract_data('data/source_data.csv')
print(data.head())  # Print the first few rows to check
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Scripting and Automation:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Scripting involves writing small programs, or "scripts," using a scripting language (e.g., Python, Bash, PowerShell). These scripts provide instructions to a computer to perform specific actions. &lt;/p&gt;

&lt;p&gt;Python is great for writing scripts to automate data workflows, such as ETL (Extract, Transform, Load) processes or data pipeline orchestration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;etl_pipeline/
│
├── etl_pipeline.py   # Main script where we'll write our ETL code
└── data/             # Folder to store your data files (e.g., CSVs)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
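
&lt;p&gt;As a rough illustration of what &lt;code&gt;etl_pipeline.py&lt;/code&gt; might contain, here is a minimal sketch of the three ETL steps. It uses only the standard library so it runs anywhere; the sample data, the column names, and the function names (&lt;code&gt;extract&lt;/code&gt;, &lt;code&gt;transform&lt;/code&gt;, &lt;code&gt;load&lt;/code&gt;, &lt;code&gt;run_pipeline&lt;/code&gt;) are assumptions made up for this example, and an in-memory string stands in for a file in &lt;code&gt;data/&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import csv
import io

# Hypothetical sample input, standing in for a CSV file under data/
RAW_CSV = "name,amount\nalice,10\nbob,\ncarol,25\n"


def extract(source):
    # Read CSV rows from a file-like object into a list of dictionaries
    return list(csv.DictReader(source))


def transform(rows):
    # Drop rows with a missing amount and convert the rest to integers
    cleaned = []
    for row in rows:
        if row["amount"]:
            cleaned.append({"name": row["name"].title(), "amount": int(row["amount"])})
    return cleaned


def load(rows, target):
    # Write the cleaned rows to a file-like object as CSV
    writer = csv.DictWriter(target, fieldnames=["name", "amount"])
    writer.writeheader()
    writer.writerows(rows)


def run_pipeline(raw_text):
    # Chain the three steps and return the resulting CSV text
    output = io.StringIO()
    load(transform(extract(io.StringIO(raw_text))), output)
    return output.getvalue()


if __name__ == "__main__":
    print(run_pipeline(RAW_CSV))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping each step in its own function makes it easier to test the steps separately and to swap one out later, for example replacing the CSV reader with a database query.&lt;/p&gt;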



&lt;p&gt;&lt;strong&gt;3. Integration with Big Data Tools:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Big data integration combines data from diverse sources into a unified view for analysis and decision-making. It requires tools with extensive connectors and platforms that can handle high-volume, high-velocity data streams.&lt;br&gt;
Many Big Data frameworks like Apache Spark have Python APIs (PySpark), making Python useful for working with large-scale data processing.&lt;/p&gt;

&lt;h5&gt;
  
  
  Common Integration Methods and Tools
&lt;/h5&gt;

&lt;p&gt;&lt;strong&gt;- API-Based Integration:&lt;/strong&gt; Use APIs to connect data, applications, and other services across different locations and devices, providing flexible and agile connections. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- ETL/ELT Services:&lt;/strong&gt; Leverage Extract, Transform, Load (ETL) or Extract, Load, Transform (ELT) tools and services, such as AWS Glue or Airbyte, to extract data from sources, transform it, and load it into a unified data ecosystem. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Integration Platforms as a Service (iPaaS):&lt;/strong&gt; Platforms like SnapLogic allow for faster, more agile connections, reducing the need for frequent integration adjustments. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Data Visualization and Analytics Tools:&lt;/strong&gt; Tools like Tableau or KNIME offer connectors to various data sources and provide user-friendly interfaces for exploring and visualizing integrated data. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Machine Learning and Data Analysis:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Python is the language of choice for many data scientists and analysts for tasks like statistical analysis, machine learning model development, and exploratory data analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Data APIs and Web Services:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An API (Application Programming Interface) is a set of definitions and protocols for building and integrating application software. It defines the methods, data formats, and rules that software components use to communicate.&lt;br&gt;
Python is often used to call APIs, scrape web pages, and integrate data from various sources.&lt;/p&gt;
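
&lt;p&gt;Many web APIs return JSON, so a common first step is turning a response body into Python records. The sketch below assumes a made-up response shape with a &lt;code&gt;results&lt;/code&gt; key; the payload and the &lt;code&gt;parse_records&lt;/code&gt; helper are hypothetical. In a real pipeline the body would come from an HTTP client rather than a hard-coded string.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

# Hypothetical response body, shaped the way a JSON API might return it
RESPONSE_BODY = '{"results": [{"id": 1, "city": "Nairobi"}, {"id": 2, "city": "Kisumu"}]}'


def parse_records(body):
    # Decode the JSON text and return the list stored under "results"
    payload = json.loads(body)
    return payload.get("results", [])


records = parse_records(RESPONSE_BODY)
cities = [record["city"] for record in records]
print(cities)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;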

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;While the level of Python proficiency required can vary depending on your specific responsibilities and the tools your organization uses (like Azure services), having a good understanding of Python basics and familiarity with libraries relevant to data engineering tasks is typically expected. &lt;/p&gt;

&lt;p&gt;Python is a strong option for ETL pipelines. Its readability, extensive library support, and flexibility make it well suited to the job, and its tools and frameworks help you build efficient, scalable pipelines.&lt;/p&gt;

&lt;p&gt;If you’re already comfortable with Python, continuing to build your skills in areas like data manipulation, scripting, and possibly Big Data frameworks would be beneficial. By gaining proficiency in these areas, you’ll be well-equipped to handle the various tasks and challenges that come with being a data engineer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learning Journey&lt;/strong&gt;&lt;br&gt;
For your data engineering journey, explore the platforms below:&lt;br&gt;
&lt;strong&gt;Coursera, edX, and Udemy offer courses on Python for data engineering.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Happy learning &amp;amp; coding&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;About me?&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://github.com/gilbertKorir" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>dataengineering</category>
      <category>etl</category>
    </item>
    <item>
      <title>Database Fundamentals</title>
      <dc:creator>Gilbert korir</dc:creator>
      <pubDate>Sat, 04 Oct 2025 16:18:09 +0000</pubDate>
      <link>https://dev.to/gilbert_korir/database-fundamentals-4m0j</link>
      <guid>https://dev.to/gilbert_korir/database-fundamentals-4m0j</guid>
      <description>&lt;h2&gt;
  
  
  Introduction to Databases
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is a database?
&lt;/h3&gt;

&lt;p&gt;A database is a tool for collecting and organizing information. Databases can store information about people, products, orders, or anything else.&lt;/p&gt;

&lt;p&gt;A computerized database is a container of objects. One database can contain more than one table. For example, an inventory tracking system that uses three tables is not three databases, but one database that contains three tables. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Types of databases&lt;/strong&gt;&lt;br&gt;
Databases can be classified into two primary types: relational (SQL) and NoSQL databases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs9p9cwql9e180dzjjvdi.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs9p9cwql9e180dzjjvdi.webp" alt=" " width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;NoSQL is then further divided into four types: Document-oriented, Key-Value, Wide-Column, and Graph databases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9j0o83x5fevzx07z756j.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9j0o83x5fevzx07z756j.webp" alt=" " width="400" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Relational Databases (RDBMS)&lt;/strong&gt;&lt;br&gt;
Relational databases organize data into tables made up of rows (records) and columns (fields). They use schemas (blueprints) to define how data is structured and how different tables relate to each other.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strict schema-based structure.&lt;/li&gt;
&lt;li&gt;Primary Keys (unique IDs) and Foreign Keys (relationships between tables).&lt;/li&gt;
&lt;li&gt;Strong ACID compliance (Atomicity, Consistency, Isolation, Durability).&lt;/li&gt;
&lt;li&gt;Ideal for structured data.&lt;/li&gt;
&lt;li&gt;Examples: MySQL, PostgreSQL, Oracle, Microsoft SQL Server.&lt;/li&gt;
&lt;/ul&gt;
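
&lt;p&gt;To see primary and foreign keys in action, here is a small sketch using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module with an in-memory database. The table names, column names, and the &lt;code&gt;insert_orphan&lt;/code&gt; helper are assumptions chosen for illustration. Note that SQLite only enforces foreign keys after &lt;code&gt;PRAGMA foreign_keys = ON&lt;/code&gt;; other relational databases typically enforce them by default.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Each table has a primary key; employees.department_id is a foreign key
conn.execute("CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE employees ("
    "id INTEGER PRIMARY KEY, "
    "name TEXT NOT NULL, "
    "department_id INTEGER NOT NULL REFERENCES departments(id))"
)

conn.execute("INSERT INTO departments (id, name) VALUES (1, 'IT')")
conn.execute("INSERT INTO employees (id, name, department_id) VALUES (1, 'John', 1)")


def insert_orphan():
    # Try to reference department 99, which does not exist
    try:
        conn.execute("INSERT INTO employees (id, name, department_id) VALUES (2, 'Jane', 99)")
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"


print(insert_orphan())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The second insert fails because the foreign key points to a department that is not in the &lt;code&gt;departments&lt;/code&gt; table, which is how the database keeps related tables consistent.&lt;/p&gt;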

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo8qxbr1cnlijbi9vdt9n.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo8qxbr1cnlijbi9vdt9n.jpg" alt=" " width="400" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. NoSQL Databases&lt;/strong&gt;&lt;br&gt;
"NoSQL" stands for "Not Only SQL". These databases are designed to handle unstructured or semi-structured data, such as text, images, videos or sensor data. They don’t rely on the traditional table format.&lt;br&gt;
Key examples include MongoDB, Cassandra, and DynamoDB.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Flexible data models (no fixed schema).&lt;/li&gt;
&lt;li&gt;Scales horizontally for high-volume data.&lt;/li&gt;
&lt;li&gt;Often optimized for specific use cases like graphs or time-series data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Subtypes of NoSQL databases:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ww9t7k7ff8e7egcq3w6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ww9t7k7ff8e7egcq3w6.jpg" alt=" " width="245" height="206"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Document Databases – Store data as JSON-like documents. (Example: MongoDB)&lt;br&gt;
Key-Value Stores – Store simple key–value pairs for fast lookups. (Example: Redis)&lt;br&gt;
Wide-Column Stores – Store rows in column families, where each row can have its own set of columns. (Example: Apache Cassandra)&lt;br&gt;
Graph Databases – Store nodes &amp;amp; relationships for connected data. (Example: Neo4j)&lt;/p&gt;
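
&lt;p&gt;The four data models can be pictured with plain Python structures. This is only a rough sketch of the shape of the data, not how any of these databases work internally, and every name and value in it is made up for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

# Document model: one self-contained, JSON-like record
document = {"_id": 1, "name": "John", "skills": ["sql", "python"]}

# Key-value model: an opaque value looked up by a unique key
key_value_store = {"user:1": json.dumps(document)}

# Wide-column model: each row key maps to its own set of columns
wide_column = {
    "user:1": {"name": "John", "dept": "IT"},
    "user:2": {"name": "Jane"},  # rows need not share the same columns
}

# Graph model: nodes plus explicit relationships (edges)
nodes = {"john", "jane"}
edges = [("john", "MANAGES", "jane")]

# Look up a value by key, then follow relationships that start at "john"
restored = json.loads(key_value_store["user:1"])
managed_by_john = [target for source, label, target in edges if source == "john"]
print(restored["skills"], managed_by_john)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;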
&lt;h2&gt;
  
  
  Database Usage
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Uses of RDBMS.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3xlnvmmrt89yo8sgoi0q.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3xlnvmmrt89yo8sgoi0q.webp" alt=" " width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RDBMS is used in &lt;a href="https://www.geeksforgeeks.org/software-engineering/customer-relationship-management-crm/" rel="noopener noreferrer"&gt;Customer Relationship Management&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;It is used in &lt;a href="https://www.geeksforgeeks.org/power-bi/what-is-business-intelligence/" rel="noopener noreferrer"&gt;Business Intelligence&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;It is used in &lt;a href="https://www.oracle.com/africa/database/what-is-a-data-warehouse/" rel="noopener noreferrer"&gt;Data Warehousing&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;It is used in Online Retail Platforms.&lt;/li&gt;
&lt;li&gt;It is used in Hospital Management Systems.&lt;/li&gt;
&lt;li&gt;Banking and Finance: Handles financial transactions, account balances, and credit card processing. &lt;/li&gt;
&lt;li&gt;Healthcare: Manages patient records, medical information, lab results, and other electronic health data. &lt;/li&gt;
&lt;li&gt;Education: Stores student information, academic records, and course details. &lt;/li&gt;
&lt;li&gt;Airlines: Manages flight schedules, passenger data, and ticket information.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Uses of NoSQL&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Big Data Applications:&lt;/strong&gt; Efficiently stores and processes massive amounts of unstructured and semi-structured data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-Time Analytics:&lt;/strong&gt; Supports fast queries and analysis for use cases like recommendation engines or fraud detection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalable Web Applications:&lt;/strong&gt; Handles high traffic and large user bases by scaling horizontally across servers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexible Data Storage:&lt;/strong&gt; Manages diverse data formats (JSON, key-value, documents, graphs) without rigid schemas.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Database schemas
&lt;/h2&gt;

&lt;p&gt;A database schema provides a comprehensive blueprint for the organization of data, detailing how tables, fields, and relationships are structured. Common schema types include star, snowflake, and relational schemas.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8f7jwg4m2r4rozu0chea.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8f7jwg4m2r4rozu0chea.jpg" alt=" " width="358" height="141"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The key components of a database schema are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Table&lt;/strong&gt; is a collection of related data organized in rows and columns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Field&lt;/strong&gt; is a column that contains information within a table.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data type&lt;/strong&gt; specifies the kind of data a field can contain (e.g., integer, varchar, date).&lt;/li&gt;
&lt;/ul&gt;
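
&lt;p&gt;One way to look at these components is to create a table and ask the database to describe it. The sketch below uses Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module; the &lt;code&gt;employees&lt;/code&gt; table and its fields are assumptions for the example. &lt;code&gt;PRAGMA table_info&lt;/code&gt; is specific to SQLite, and other databases expose the same information in their own ways.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, first_name VARCHAR(255), hired_on DATE)"
)

# PRAGMA table_info returns one row per field:
# (cid, name, type, notnull, dflt_value, pk)
fields = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(employees)")]
print(fields)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each pair in the output is a field name together with its declared data type.&lt;/p&gt;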
&lt;h2&gt;
  
  
  DDL and DML
&lt;/h2&gt;

&lt;p&gt;DDL stands for Data Definition Language and refers to SQL commands used to create, modify, and delete database structures such as tables, indexes, and views. DML stands for Data Manipulation Language and refers to SQL commands used to insert, update, and delete data within a database. SELECT, which reads data, is often grouped with DML, although some references classify it separately as DQL (Data Query Language). &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftm5p2m0ylre22h58o1da.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftm5p2m0ylre22h58o1da.png" alt=" " width="300" height="168"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;DDL Commands in SQL with Examples&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
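&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Create a new table
CREATE TABLE Employees (
    EmployeeID INT,
    FirstName VARCHAR(255),
    LastName VARCHAR(255),
    Department VARCHAR(255)
);

-- Add a column to the existing table
ALTER TABLE Employees
ADD Salary INT;

-- Remove the table and all of its data (run this last)
DROP TABLE Employees;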
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;&lt;strong&gt;DML Commands in SQL with Examples&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Add a row
INSERT INTO Employees (EmployeeID, FirstName, LastName, Department)
VALUES (1, 'John', 'Smith', 'IT');

-- Change an existing row
UPDATE Employees
SET Salary = 50000
WHERE EmployeeID = 1;

-- Read all rows
SELECT * FROM Employees;

-- Delete a row
DELETE FROM Employees
WHERE EmployeeID = 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
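
&lt;p&gt;To try these statements end to end, the sketch below runs them from Python with the built-in &lt;code&gt;sqlite3&lt;/code&gt; module against an in-memory database. The choice of SQLite and the variable names are assumptions for this example; other databases may need small syntax changes. The table is dropped last, because the DML statements need it to exist.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the structure
conn.execute(
    "CREATE TABLE Employees ("
    "EmployeeID INT, FirstName VARCHAR(255), LastName VARCHAR(255), Department VARCHAR(255))"
)
conn.execute("ALTER TABLE Employees ADD Salary INT")

# DML: work with the rows (values are passed as parameters)
conn.execute(
    "INSERT INTO Employees (EmployeeID, FirstName, LastName, Department) VALUES (?, ?, ?, ?)",
    (1, "John", "Smith", "IT"),
)
conn.execute("UPDATE Employees SET Salary = ? WHERE EmployeeID = ?", (50000, 1))
rows_after_update = conn.execute("SELECT * FROM Employees").fetchall()

conn.execute("DELETE FROM Employees WHERE EmployeeID = ?", (1,))
rows_after_delete = conn.execute("SELECT * FROM Employees").fetchall()

# DDL again: drop the table once no statement needs it
conn.execute("DROP TABLE Employees")
print(rows_after_update, rows_after_delete)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Passing values as parameters with &lt;code&gt;?&lt;/code&gt; placeholders, instead of building SQL strings by hand, is a good habit because it guards against SQL injection.&lt;/p&gt;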



</description>
      <category>postgressql</category>
      <category>sql</category>
      <category>nosql</category>
      <category>dataengineering</category>
    </item>
  </channel>
</rss>
