<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Joan</title>
    <description>The latest articles on DEV Community by Joan (@joanwanjiru).</description>
    <link>https://dev.to/joanwanjiru</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F879704%2Ffbf71d6d-395e-4c40-b300-a111cbec37aa.png</url>
      <title>DEV Community: Joan</title>
      <link>https://dev.to/joanwanjiru</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/joanwanjiru"/>
    <language>en</language>
    <item>
      <title>Made easy: Installing dbt and Building Your First Model 'Haay!'</title>
      <dc:creator>Joan</dc:creator>
      <pubDate>Mon, 12 May 2025 19:55:40 +0000</pubDate>
      <link>https://dev.to/joanwanjiru/made-easy-installing-dbt-and-building-your-first-model-haay-1lja</link>
      <guid>https://dev.to/joanwanjiru/made-easy-installing-dbt-and-building-your-first-model-haay-1lja</guid>
      <description>&lt;p&gt;Prerequisites: Python and SQL knowledge.&lt;br&gt;
Install the Python and dbt extensions in VS Code.&lt;br&gt;
Steps:&lt;br&gt;
Open a terminal:&lt;br&gt;
&lt;code&gt;cd &amp;lt;your_dir&amp;gt;&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--create python virtual environment
python -m venv dbt_venv

--activate the env on cmd/powershell
.\dbt_venv\Scripts\activate

--to deactivate the venv
deactivate

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install dbt. In this case I am using the dbt-postgres adapter (you are free to use other integrations; see &lt;a href="https://docs.getdbt.com/docs/core/pip-install" rel="noopener noreferrer"&gt;Install with pip&lt;/a&gt;),&lt;br&gt;
together with dbt Core, an open-source tool that enables data practitioners to transform data, suitable for users who prefer to set up dbt manually and maintain it locally.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;python -m pip install dbt-core dbt-postgres&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Create a &lt;code&gt;.dbt&lt;/code&gt; folder in your user home directory; dbt will create and maintain &lt;code&gt;profiles.yml&lt;/code&gt; there, the dbt configuration file in which database and user credentials are stored:&lt;br&gt;
&lt;code&gt;mkdir $HOME\.dbt&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Initialize the dbt project:&lt;br&gt;
&lt;code&gt;dbt init&lt;/code&gt; and then follow the command prompts that appear.&lt;/p&gt;

&lt;p&gt;Navigate to the project folder that was created:&lt;br&gt;
&lt;code&gt;cd dbt_project&lt;/code&gt;&lt;br&gt;
Verify the connection between dbt and your data platform with the&lt;br&gt;
&lt;code&gt;dbt debug&lt;/code&gt; command. &lt;/p&gt;

&lt;p&gt;Create a dbt model: a SQL query designed to perform a transformation task on the data platform.&lt;br&gt;
It's important to note that dbt makes use of CTEs for improved readability and modularity.&lt;/p&gt;

&lt;p&gt;Create a &lt;code&gt;.sql&lt;/code&gt; file in the models folder, write your query using CTEs, and save it.&lt;br&gt;
To run the model use:&lt;br&gt;
&lt;code&gt;dbt run&lt;/code&gt;&lt;br&gt;
which creates a view in your data platform with the same name as the model.&lt;/p&gt;
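
&lt;p&gt;As a minimal sketch, assuming a hypothetical &lt;code&gt;raw_orders&lt;/code&gt; table already exists in your warehouse, a model file such as &lt;code&gt;models/orders_summary.sql&lt;/code&gt; could look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- models/orders_summary.sql (illustrative example; table names are hypothetical)
with orders as (

    -- stage the raw data
    select order_id, customer_id, amount
    from raw_orders

),

final as (

    -- aggregate per customer
    select
        customer_id,
        count(order_id) as order_count,
        sum(amount)     as total_amount
    from orders
    group by customer_id

)

select * from final
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Running &lt;code&gt;dbt run&lt;/code&gt; would then create a view named &lt;code&gt;orders_summary&lt;/code&gt; in your target schema.&lt;/p&gt;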

&lt;p&gt;It's also important to note that the default materialization for dbt models is a view; it can be changed to a table either in the &lt;code&gt;.yml&lt;/code&gt; file or in the model itself.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6yi9w17jx9t7zcjx6la.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6yi9w17jx9t7zcjx6la.png" alt="Default dbt model materialization" width="800" height="175"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fliovatehiypddj7fhhy9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fliovatehiypddj7fhhy9.png" alt="Updating materialization on .yml file" width="800" height="162"&gt;&lt;/a&gt;&lt;br&gt;
Updating the materialization in the model itself:&lt;br&gt;
&lt;code&gt;{{ config(materialized='table') }}&lt;/code&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Recommendations for Normalization between OLAP and OLTP systems</title>
      <dc:creator>Joan</dc:creator>
      <pubDate>Tue, 10 Sep 2024 08:53:23 +0000</pubDate>
      <link>https://dev.to/joanwanjiru/recommendations-for-normalization-between-olap-vs-oltp-systems-1ea9</link>
      <guid>https://dev.to/joanwanjiru/recommendations-for-normalization-between-olap-vs-oltp-systems-1ea9</guid>
      <description>&lt;p&gt;&lt;strong&gt;OLAP (Online Analytical Processing)&lt;/strong&gt; and &lt;strong&gt;OLTP (Online Transaction Processing)&lt;/strong&gt; systems differ due to their distinct purposes and usage patterns. Here’s a breakdown:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Normalization in OLTP Systems&lt;/strong&gt;:
&lt;/h3&gt;

&lt;p&gt;OLTP systems focus on daily transactional data operations like inserting, updating, and deleting data quickly. Normalization in OLTP databases is critical to ensure data integrity, eliminate redundancy, and improve data efficiency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Recommendations for OLTP:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High Normalization (3NF and above)&lt;/strong&gt;: OLTP databases should follow a highly normalized structure, often up to the Third Normal Form (3NF) or beyond. This helps reduce data redundancy, ensuring that each piece of information is stored only once. It makes updates efficient and maintains consistency across the system.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1NF (First Normal Form)&lt;/strong&gt;: Ensure that the table has no repeating groups, and each field contains atomic values.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2NF (Second Normal Form)&lt;/strong&gt;: All non-key attributes must depend on the primary key, eliminating partial dependency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3NF (Third Normal Form)&lt;/strong&gt;: Eliminate transitive dependencies, where non-key attributes depend on other non-key attributes.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;The goal is to make the system efficient for fast transactional operations like insertions and updates while maintaining data consistency.&lt;/p&gt;
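
&lt;p&gt;As an illustrative sketch (table and column names are hypothetical), a normalized OLTP design stores each fact only once and links tables through keys:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- 3NF sketch: customer details live in one place only
CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    name        VARCHAR(100) NOT NULL,
    email       VARCHAR(100) NOT NULL
);

CREATE TABLE orders (
    order_id    INT PRIMARY KEY,
    customer_id INT NOT NULL REFERENCES customers(customer_id),
    order_date  DATE NOT NULL,
    amount      DECIMAL(10,2) NOT NULL
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Updating a customer's email touches a single row in &lt;code&gt;customers&lt;/code&gt;, with no risk of inconsistent copies.&lt;/p&gt;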

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Normalization in OLAP Systems&lt;/strong&gt;:
&lt;/h3&gt;

&lt;p&gt;OLAP systems are designed for complex queries and reporting, where data is analyzed and aggregated over time. The focus is on read-heavy operations like running complex queries for reports and trends, rather than real-time updates or inserts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Recommendations for OLAP:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Denormalization&lt;/strong&gt;: Unlike OLTP, OLAP systems often use denormalized structures. This means merging related tables and duplicating some data for faster querying and easier aggregation. In OLAP, data redundancy is acceptable because the focus is on optimizing read performance, not minimizing storage or maintaining quick updates.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Star Schema&lt;/strong&gt;: This is a common design where a central fact table is surrounded by dimension tables. Each dimension is denormalized to allow quicker joins and easier reporting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snowflake Schema&lt;/strong&gt;: A variation of the star schema, but more normalized. Dimension tables are further divided into additional related tables. This increases the complexity but reduces redundancy, offering a middle ground.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Denormalization helps OLAP systems avoid the need for multiple joins in complex queries, making analysis faster, especially with large datasets.&lt;/p&gt;
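
&lt;p&gt;A hedged sketch of a star schema (names are hypothetical) shows how denormalized dimensions keep reporting queries down to single joins:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Star schema sketch: one fact table, flat (denormalized) dimension
CREATE TABLE dim_customer (
    customer_key INT PRIMARY KEY,
    name         VARCHAR(100),
    city         VARCHAR(100),
    region       VARCHAR(100)  -- city and region kept inline rather than split out
);

CREATE TABLE fact_sales (
    sale_id      INT PRIMARY KEY,
    customer_key INT REFERENCES dim_customer(customer_key),
    sale_date    DATE,
    amount       DECIMAL(10,2)
);

-- A typical report needs only one join:
SELECT d.region, SUM(f.amount) AS total_sales
FROM fact_sales f
JOIN dim_customer d ON d.customer_key = f.customer_key
GROUP BY d.region;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;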

&lt;h3&gt;
  
  
  Key Differences:
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;OLTP Normalization&lt;/th&gt;
&lt;th&gt;OLAP Denormalization&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Purpose&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fast, frequent transactional operations&lt;/td&gt;
&lt;td&gt;Complex queries, reporting, and analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Normalization Level&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High (up to 3NF or higher)&lt;/td&gt;
&lt;td&gt;Low (denormalized, star or snowflake schema)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Redundancy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Minimized&lt;/td&gt;
&lt;td&gt;Acceptable to improve query performance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Query Complexity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Simple queries involving small datasets&lt;/td&gt;
&lt;td&gt;Complex queries involving large datasets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Update Frequency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Frequent updates and inserts&lt;/td&gt;
&lt;td&gt;Infrequent bulk loading and queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Join Operations&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Efficient joins due to normalized structure&lt;/td&gt;
&lt;td&gt;Avoids multiple joins by denormalizing data&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Why These Differences Matter:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OLTP&lt;/strong&gt;: Normalization is key to ensure consistency and avoid data anomalies, especially when handling frequent updates. It also minimizes storage by eliminating redundant data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OLAP&lt;/strong&gt;: Denormalization is used to optimize read-heavy queries where performance is prioritized. Since updates are less frequent, maintaining multiple copies of data is not a major concern.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In summary:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OLTP systems&lt;/strong&gt; use &lt;strong&gt;highly normalized structures&lt;/strong&gt; for efficient transaction processing and data integrity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OLAP systems&lt;/strong&gt; use &lt;strong&gt;denormalized structures&lt;/strong&gt; to optimize for complex queries and reporting performance.&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Understanding data engineering with Datacamp</title>
      <dc:creator>Joan</dc:creator>
      <pubDate>Wed, 09 Aug 2023 13:07:37 +0000</pubDate>
      <link>https://dev.to/joanwanjiru/understanding-data-engineering-with-datacamp-2ang</link>
      <guid>https://dev.to/joanwanjiru/understanding-data-engineering-with-datacamp-2ang</guid>
      <description>&lt;p&gt;&lt;strong&gt;Data Processing&lt;/strong&gt;: converting raw data into meaningful information.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data processing Value:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Remove unwanted data&lt;/li&gt;
&lt;li&gt;Optimize memory, process, and network costs&lt;/li&gt;
&lt;li&gt;Convert data from one type to another&lt;/li&gt;
&lt;li&gt;Organize data&lt;/li&gt;
&lt;li&gt;To fit into a schema/structure &lt;/li&gt;
&lt;li&gt;Increase productivity&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How data engineers process data:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Data manipulation, cleaning and tidying tasks e.g. dealing with missing values&lt;/li&gt;
&lt;li&gt;Store data in a sanely structured database&lt;/li&gt;
&lt;li&gt;Create views on top of the database tables for easy access to the data&lt;/li&gt;
&lt;li&gt;Normalize the data&lt;/li&gt;
&lt;li&gt;Optimize the performance of the databases, e.g. indexing the data for easier retrieval&lt;/li&gt;
&lt;/ul&gt;
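
&lt;p&gt;For instance (table and column names are hypothetical), a single index can turn a frequent lookup into an index seek instead of a full scan:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Index the column used in frequent lookups
CREATE INDEX idx_employees_last_name
ON employees (last_name);

-- Queries filtering on last_name can now use the index
SELECT * FROM employees WHERE last_name = 'Njeri';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;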

&lt;h4&gt;
  
  
  Tools used in data processing
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxbs80jg5w3hj5vwe122.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxbs80jg5w3hj5vwe122.png" alt="Tools used in data processing" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scheduling&lt;/strong&gt;: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can apply to any task listed in data processing.&lt;/li&gt;
&lt;li&gt;Holds each piece of the pipeline and organizes how they work together.&lt;/li&gt;
&lt;li&gt;Runs tasks in a specific order and resolves all dependencies correctly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scheduling data:&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Manually&lt;/em&gt;: e.g. a manual update of the employee data.&lt;br&gt;
&lt;em&gt;Automatically&lt;/em&gt;: run at a specific time, say update the employee table daily at 6 AM.&lt;br&gt;
&lt;em&gt;Automatically run if a specified condition is met&lt;/em&gt;, known as &lt;em&gt;&lt;strong&gt;sensor scheduling&lt;/strong&gt;&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Ingestion&lt;/strong&gt;:&lt;br&gt;
Batches &amp;amp; streams.&lt;br&gt;
Batch processing: groups records at intervals; often cheaper.&lt;br&gt;
Streaming: sends individual records into the database right away, e.g. a new user signing in.&lt;/p&gt;

&lt;h4&gt;
  
  
  Tools used in scheduling
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Furle1gur5ysynvfujzg2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Furle1gur5ysynvfujzg2.png" alt="Tools used in scheduling" width="800" height="646"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parallel computing/processing&lt;/strong&gt;&lt;br&gt;
It's the basis of modern data processing tools, necessary because of memory and processing-power limits.&lt;br&gt;
How it works:&lt;br&gt;
Split tasks up into several smaller subtasks.&lt;br&gt;
Distribute these subtasks over several computers. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benefits and risks of parallel computing&lt;/strong&gt;&lt;br&gt;
Pros:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extra processing power&lt;/li&gt;
&lt;li&gt;Reduced memory footprint&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Moving data incurs a cost&lt;/li&gt;
&lt;li&gt;Communication time&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cloud Computing vs On premises computing
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv7z20e6h6z9vdxitig6n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv7z20e6h6z9vdxitig6n.png" alt="cloud providers" width="605" height="860"&gt;&lt;/a&gt;&lt;br&gt;
Servers on premises: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Incur costs for equipment&lt;/li&gt;
&lt;li&gt;Need space&lt;/li&gt;
&lt;li&gt;Electrical and maintenance costs&lt;/li&gt;
&lt;li&gt;Must provision enough power for peak moments&lt;/li&gt;
&lt;li&gt;Processing power sits unused at quieter times&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Servers on the cloud:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pay as you go&lt;/li&gt;
&lt;li&gt;No need for space&lt;/li&gt;
&lt;li&gt;Use resources when, and only when, we need them&lt;/li&gt;
&lt;li&gt;The closer to the user, the better the latency&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Cloud Computing for Data storage&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9arqpppeb8u1pusdbwb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9arqpppeb8u1pusdbwb.png" alt="Data storage" width="800" height="431"&gt;&lt;/a&gt;&lt;br&gt;
Pros:&lt;br&gt;
Database reliability, via data replication.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Introduction to Data Engineering in Microsoft Fabric</title>
      <dc:creator>Joan</dc:creator>
      <pubDate>Wed, 02 Aug 2023 06:38:11 +0000</pubDate>
      <link>https://dev.to/joanwanjiru/introduction-to-data-engineering-in-microsoft-fabric-421f</link>
      <guid>https://dev.to/joanwanjiru/introduction-to-data-engineering-in-microsoft-fabric-421f</guid>
      <description>&lt;p&gt;&lt;strong&gt;Data engineering&lt;/strong&gt; in Microsoft Fabric enables users to design, build, and maintain infrastructures and systems that enable their organizations to collect, store, process, and analyze large volumes of data.&lt;/p&gt;

&lt;p&gt;Fabric data engineering enables you to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create and manage your data using a &lt;strong&gt;lakehouse&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Design &lt;strong&gt;data pipelines&lt;/strong&gt; to copy data into your lakehouse&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;Spark job definitions&lt;/strong&gt; to submit batch/streaming jobs to a Spark cluster&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;notebooks&lt;/strong&gt; to write code for ELT processes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwd7m077kqqk44c4d0e1j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwd7m077kqqk44c4d0e1j.png" alt="Fabric data Engineering homepage" width="800" height="250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is a Lakehouse:&lt;/strong&gt;&lt;br&gt;
A data architecture that enables organizations to store and manage structured data in a single location, using tools and frameworks to process and analyze that data, e.g. SQL queries on&lt;br&gt;
the SQL endpoint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is an Apache Spark job definition:&lt;/strong&gt;&lt;br&gt;
A set of instructions that defines how to execute a job on a &lt;strong&gt;Spark cluster&lt;/strong&gt;,&lt;br&gt;
for instance: the input/output data sources, the transformation, and the configuration settings for the Spark application.&lt;br&gt;
Spark job definitions allow data engineers to submit batch/streaming jobs to a Spark cluster, perform transformations on the data hosted in the lakehouse, etc. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is a notebook:&lt;/strong&gt;&lt;br&gt;
An interactive compute environment that allows users to create and share documents containing live code, equations, visualizations, and narrative text.&lt;br&gt;
Users can write code in Python, R, and Scala to perform data ingestion, preparation, analysis, and other data-related tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is a data pipeline:&lt;/strong&gt;&lt;br&gt;
A series of steps used to collect, process, and transform raw data into a format that can be used for analysis and decision-making.&lt;br&gt;
Data pipelines are crucial in that they help move data from its source to its destination in a reliable, scalable, and efficient way.&lt;/p&gt;

&lt;p&gt;Reference, &lt;a href="https://learn.microsoft.com/en-us/fabric/data-engineering/data-engineering-overview" rel="noopener noreferrer"&gt;Data Engineering in Microsoft Fabric&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>SQL Server Recovery Model</title>
      <dc:creator>Joan</dc:creator>
      <pubDate>Fri, 07 Oct 2022 12:52:51 +0000</pubDate>
      <link>https://dev.to/joanwanjiru/sql-server-recovery-model-43ca</link>
      <guid>https://dev.to/joanwanjiru/sql-server-recovery-model-43ca</guid>
      <description>&lt;h2&gt;
  
  
  Introduction to SQL Server Recovery Model
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Recovery Model:&lt;/strong&gt; A database property that controls: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How transactions are logged&lt;/li&gt;
&lt;li&gt;Whether the transaction log requires (and allows) backing up&lt;/li&gt;
&lt;li&gt;Which restore operations are available (SIMPLE, FULL, and BULK_LOGGED recovery models)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Create a &lt;strong&gt;sample DB HR&lt;/strong&gt;, in it create &lt;strong&gt;Table People&lt;/strong&gt; and &lt;strong&gt;insert some values:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Create Database HR
CREATE DATABASE HR;

GO
-- swith the current databse to HR
USE HR;

-- Create Table Poeple in DB HR
CREATE TABLE People(
Id INT IDENTITY PRIMARY KEY,
FristName VARCHAR(50) NOT NULL,
LastName VARCHAR(50) NOT NULL,
);

--Insert some values into Poeple Table
INSERT INTO People (FristName,LastName)
    Values('John', 'Doe'),
            ('Joan', 'Njeri'),
            ('Jane', 'M'),
            ('Kyle', 'G')
GO 
-- Query all items from Table People
SELECT * FROM People;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To view the recovery model of a database use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;USE master;

GO 
/** To view Recovery model for HR DB **/

SELECT name, recovery_model_desc

FROM master.sys.databases

ORDER BY name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgt92btt2m5oa3b4bprva.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgt92btt2m5oa3b4bprva.png" alt="Image description" width="514" height="178"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;NOTE:&lt;/strong&gt; It is possible to change the recovery model using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER DATABASE database_name 
SET RECOVERY recovery_model;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this case, let's try changing the recovery model from &lt;strong&gt;FULL&lt;/strong&gt; to &lt;strong&gt;SIMPLE&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GO 
-- Change Recovery model for HR Database from FULL to SIMPLE

ALTER DATABASE HR
SET RECOVERY SIMPLE;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2g80xphr1ol9l1rlzafq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2g80xphr1ol9l1rlzafq.png" alt="Image description" width="532" height="174"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Differences in Recovery Models
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. SIMPLE Recovery Model&lt;/strong&gt;&lt;br&gt;
SQL Server truncates the transaction log at every checkpoint. This model does not retain log records, making it impossible to use advanced backup strategies to minimize data loss.&lt;br&gt;
Thus, use this model only if the database can be reloaded from other sources, e.g. a database used for reporting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. FULL Recovery Model&lt;/strong&gt;&lt;br&gt;
Unlike the SIMPLE recovery model, under the FULL recovery model SQL Server &lt;em&gt;keeps the transaction log records until a BACKUP LOG statement is executed, which truncates them from the transaction log files&lt;/em&gt;.&lt;br&gt;
This means that if the BACKUP LOG statement is not run regularly, SQL Server keeps all the log records until the transaction log files fill up and the database becomes inaccessible.&lt;br&gt;
The FULL recovery model allows you to restore the database to any point in time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Point: &lt;em&gt;Schedule BACKUP LOG statement to run at regular intervals in cases of FULL Recovery Model&lt;/em&gt;.&lt;/strong&gt;&lt;/p&gt;
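
&lt;p&gt;A minimal sketch of such a log backup (the file path is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Back up the HR transaction log, truncating inactive log records
BACKUP LOG HR
TO DISK = 'C:\Backups\HR_log.trn';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;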

&lt;p&gt;&lt;strong&gt;3. BULK_LOGGED Recovery Model&lt;/strong&gt;&lt;br&gt;
It behaves almost like &lt;strong&gt;FULL&lt;/strong&gt; but is used for bulk operations such as a &lt;code&gt;BULK INSERT&lt;/code&gt; of flat files into a database, recording those operations minimally in the transaction log files. However, it does not allow you to restore the database to an arbitrary point in time. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bulk_logged recovery model scenario:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For a periodic bulk data load on a database that uses the FULL recovery model, set the recovery model to BULK_LOGGED&lt;/li&gt;
&lt;li&gt;Load the data into the DB&lt;/li&gt;
&lt;li&gt;After the data load completes, set the recovery model back to FULL&lt;/li&gt;
&lt;li&gt;Back up the database.
For more, visit &lt;a href="https://learn.microsoft.com/en-us/sql/relational-databases/backup-restore/recovery-models-sql-server?view=sql-server-ver16" rel="noopener noreferrer"&gt;Recovery Models (SQL Server)&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
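
&lt;p&gt;The scenario above can be sketched as follows (file paths are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Switch to minimal logging for the bulk load
ALTER DATABASE HR SET RECOVERY BULK_LOGGED;

-- Load the data
BULK INSERT People
FROM 'C:\Data\people.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');

-- Restore full logging, then take a backup
ALTER DATABASE HR SET RECOVERY FULL;
BACKUP DATABASE HR TO DISK = 'C:\Backups\HR_full.bak';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;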

</description>
      <category>database</category>
      <category>programming</category>
      <category>sql</category>
    </item>
    <item>
      <title>Introduction to Data Structures and Algorithms</title>
      <dc:creator>Joan</dc:creator>
      <pubDate>Mon, 20 Jun 2022 19:20:34 +0000</pubDate>
      <link>https://dev.to/joanwanjiru/data-structures-101-introduction-to-data-structures-and-algorithms-2lhd</link>
      <guid>https://dev.to/joanwanjiru/data-structures-101-introduction-to-data-structures-and-algorithms-2lhd</guid>
      <description>&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Structures&lt;/strong&gt; &lt;br&gt;
A particular way of organizing, storing, and managing data to increase the efficiency (with respect to &lt;strong&gt;time&lt;/strong&gt; and &lt;strong&gt;memory&lt;/strong&gt;) of a program in a computer.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Algorithms&lt;/strong&gt; &lt;br&gt;
A set of instructions to be executed in a certain way &lt;br&gt;
to get the desired output.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Classification of Data Structures
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Primitive Data Structures:&lt;/strong&gt; numbers and characters built into a program, meaning they can be manipulated by machine-level instructions. Ex. &lt;em&gt;integers, characters, Booleans&lt;/em&gt;. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Non-Primitive Data Structures:&lt;/strong&gt; derived from primitive data structures and thus cannot be manipulated directly by machine-level instructions. They form a set of data elements that is either homogeneous (same data types) or heterogeneous (different data types).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Next,&lt;br&gt;
Non-Primitive Data Structures are further divided into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Linear data structures&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Elements in a linear data structure maintain a linear relationship among them; although the data is arranged in linear form, its arrangement in memory may not be sequential.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Ex. &lt;em&gt;Arrays&lt;/em&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Non-Linear data structures&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;In this kind of data structure, data elements form a hierarchical relationship among themselves.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Ex. &lt;em&gt;Trees and graphs&lt;/em&gt;  &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        Classification of Data Structure
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi4cfzwnvx03p359vqfna.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi4cfzwnvx03p359vqfna.png" alt="Classification of Data Structures" width="645" height="491"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Structures can be of two types:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Static Data Structures:&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;The size of this type of structure is fixed, meaning data elements can be modified without changing the memory space allocated to it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;e.g. &lt;em&gt;Arrays&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dynamic Data Structures:&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;This data structure allows the size of the allocated memory to change; the contents of the structure can be modified at runtime, during the operations performed on it. e.g. &lt;em&gt;Linked Lists&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Comparison between Static vs Dynamic Data Structures&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Static Data Structures&lt;/th&gt;
&lt;th&gt;Dynamic Data Structures&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fixed memory size&lt;/td&gt;
&lt;td&gt;Size can be updated during run time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory allocation done prior to program execution&lt;/td&gt;
&lt;td&gt;Memory allocation done during program execution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Overflow cannot occur since memory allocation is fixed&lt;/td&gt;
&lt;td&gt;Overflow or underflow can occur since memory allocation is dynamic&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

</description>
      <category>python</category>
      <category>algorithms</category>
    </item>
  </channel>
</rss>
