<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Paige Tran</title>
    <description>The latest articles on DEV Community by Paige Tran (@bttminhphuc).</description>
    <link>https://dev.to/bttminhphuc</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F898300%2F6c562fa7-d678-427b-b905-dd9f06647d73.jpeg</url>
      <title>DEV Community: Paige Tran</title>
      <link>https://dev.to/bttminhphuc</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/bttminhphuc"/>
    <language>en</language>
    <item>
      <title>Solving the Top 5 DBT Problems with Edge Cases: Streamlining Data Build Processes</title>
      <dc:creator>Paige Tran</dc:creator>
      <pubDate>Sat, 27 May 2023 19:49:54 +0000</pubDate>
      <link>https://dev.to/bttminhphuc/solving-the-top-5-dbt-problems-with-edge-cases-streamlining-data-build-processes-2g4k</link>
      <guid>https://dev.to/bttminhphuc/solving-the-top-5-dbt-problems-with-edge-cases-streamlining-data-build-processes-2g4k</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;:&lt;br&gt;
DBT (Data Build Tool) is a popular SQL modeling tool that enables data engineers and analysts to build, test, and deploy reliable data pipelines. While DBT is a powerful tool, it can encounter certain challenges during its implementation and usage. In this article, we will explore the top 5 problems faced in DBT and discuss how leveraging edge cases can provide effective solutions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Complex Data Transformations:&lt;/strong&gt;&lt;br&gt;
One of the common challenges in DBT is handling complex data transformations. Edge cases can be employed by creating specialized transformations for unique scenarios. By identifying and incorporating edge cases into the data build process, developers can handle complex data transformations more efficiently. These edge cases serve as specific examples that test the robustness and scalability of the transformations, ensuring accurate results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Example of a complex data transformation using edge cases
-- Edge case: Handling null values in the transformation
WITH edge_case AS (
  SELECT 
    column1,
    CASE
      WHEN column2 IS NULL THEN 'N/A'
      ELSE column2
    END AS transformed_column
  FROM source_table
)
SELECT *
FROM edge_case;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Handling Large Volumes of Data:&lt;/strong&gt;&lt;br&gt;
Scaling DBT to handle large volumes of data can be a daunting task. Edge cases can be used to simulate and validate the performance of the data build process under extreme data conditions. By running DBT on subsets of the entire dataset or synthetic datasets representing edge cases, developers can optimize and fine-tune the SQL queries, models, and configurations to handle large-scale data processing effectively.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Example of optimizing a DBT model for large data volumes using edge cases
-- Edge case: Simulating large dataset for testing and performance tuning
WITH edge_case AS (
  SELECT *
  FROM source_table
  WHERE created_date &amp;gt;= '2023-01-01' -- Edge case: Focus on a specific date range
)
SELECT *
FROM edge_case;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Dependency Management:&lt;/strong&gt;&lt;br&gt;
Managing dependencies between DBT models is crucial for maintaining a well-structured and efficient data pipeline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Example of resolving complex dependencies between DBT models using edge cases
-- Edge case: Addressing circular dependencies
-- Model A
SELECT *
FROM model_b

-- Model B
SELECT *
FROM model_a
JOIN model_c ON model_a.id = model_c.id;

-- Model C
SELECT *
FROM source_table;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Edge cases can be leveraged to identify and resolve complex dependencies. By introducing edge cases that cover scenarios with intricate relationships between models, developers can identify potential issues, such as circular dependencies or performance bottlenecks, and implement necessary optimizations to ensure smooth dependency management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Testing and Validation:&lt;/strong&gt;&lt;br&gt;
DBT offers built-in testing capabilities, but ensuring comprehensive testing and validation can still be a challenge.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Example of testing and validating DBT models using edge cases
-- Edge case: Testing edge scenarios with extreme values
-- Model A
SELECT *
FROM source_table
WHERE column1 &amp;gt;= 0 -- Edge case: Testing positive values

-- Model B
SELECT *
FROM model_a
WHERE column2 IS NULL; -- Edge case: Testing null values
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Edge cases can be utilized to design test cases that cover a wide range of scenarios, including edge scenarios that test the limits of the data build process. By incorporating edge cases into the testing strategy, developers can identify and fix issues early on, ensuring the accuracy and reliability of the data pipelines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Collaboration and Version Control:&lt;/strong&gt;&lt;br&gt;
Collaboration and version control are crucial to maintaining a healthy, scalable DBT project. Edge cases can help address collaboration challenges by creating branches for different edge case scenarios. These branches can be used to test and validate alternative approaches, allowing for experimentation without affecting the main production branch. By using version control effectively and incorporating edge cases, developers can streamline collaboration and maintain a reliable history of changes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Example of using version control and collaboration with DBT using edge cases
-- Edge case: Creating a separate branch for experimenting with alternative approaches
-- Main branch
SELECT *
FROM source_table
WHERE column1 = 'A';

-- Experiment branch (edge case)
SELECT *
FROM source_table
WHERE column1 = 'B';

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These coding examples illustrate how edge cases can be incorporated into DBT projects to address the top 5 problems. By considering specific scenarios, such as handling null values, simulating large datasets, resolving complex dependencies, testing extreme values, and utilizing separate branches for experimentation, developers can enhance the effectiveness and reliability of their DBT implementations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt;&lt;br&gt;
DBT is a powerful data modeling tool that streamlines data build processes, but it can encounter various challenges. Leveraging edge cases can provide effective solutions to these problems. By incorporating specialized transformations, simulating large data volumes, managing complex dependencies, implementing comprehensive testing, and optimizing collaboration and version control, developers can enhance the reliability, scalability, and efficiency of DBT projects. With the use of edge cases, data engineers and analysts can build robust data pipelines that meet the evolving needs of data-driven organizations.&lt;/p&gt;

</description>
      <category>dbt</category>
      <category>database</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Optimizing DBT Modeling Code: A Guide to DBT Setup and Hands-on Commands</title>
      <dc:creator>Paige Tran</dc:creator>
      <pubDate>Fri, 17 Feb 2023 13:17:33 +0000</pubDate>
      <link>https://dev.to/bttminhphuc/optimizing-dbt-modeling-code-a-guide-to-dbt-setup-and-hand-on-commands-2h2</link>
      <guid>https://dev.to/bttminhphuc/optimizing-dbt-modeling-code-a-guide-to-dbt-setup-and-hand-on-commands-2h2</guid>
      <description>&lt;h2&gt;
  
  
  1.   What is &lt;a href="https://www.getdbt.com/" rel="noopener noreferrer"&gt;DBT&lt;/a&gt;?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.getdbt.com/" rel="noopener noreferrer"&gt;DBT&lt;/a&gt; (Data Build Tool) is an open-source tool used for data modeling, data transformation, and data management. It is designed to help data teams build, maintain, and document their data models and transformations in an organized and efficient manner.&lt;/p&gt;

&lt;p&gt;DBT helps automate data processing pipelines and provides a framework for modeling data in a declarative way. It supports a variety of databases, including Amazon &lt;a href="https://aws.amazon.com/ru/redshift/" rel="noopener noreferrer"&gt;Redshift&lt;/a&gt;, &lt;a href="https://cloud.google.com/bigquery/" rel="noopener noreferrer"&gt;Google BigQuery&lt;/a&gt;, &lt;a href="https://signup.snowflake.com/" rel="noopener noreferrer"&gt;Snowflake&lt;/a&gt;, and more, making it a versatile tool for managing data in various environments. &lt;/p&gt;

&lt;h2&gt;
  
  
  2.   What code editor is better for DBT?
&lt;/h2&gt;

&lt;p&gt;There are several code editors that can be used for DBT, however, the most popular ones are Sublime Text and Visual Studio Code (VS Code). Here is a comparison between the two:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcfxi1nz6qyt89yf0i53a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcfxi1nz6qyt89yf0i53a.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I personally use VS Code and find it very useful. All in all, the combination of DBT and VS Code makes a suitable environment for developing and maintaining high-quality data models, with powerful debugging and testing capabilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  3.   Setting up DBT with a data warehouse
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Install DBT&lt;/strong&gt;&lt;br&gt;
 DBT is a command-line tool that can be installed using pip, a Python package manager. You can install DBT by running the following command in your terminal: &lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install dbt&lt;/code&gt;&lt;br&gt;
Note: recent versions of DBT are distributed as &lt;code&gt;dbt-core&lt;/code&gt; plus a database adapter package, so you may instead need, for example, &lt;code&gt;pip install dbt-core dbt-redshift&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Set up your data warehouse&lt;/strong&gt;&lt;br&gt;
You will need to set up a data warehouse, such as Amazon Redshift, Google BigQuery, or Snowflake, to store your data. If you do not have a data warehouse set up yet, you will need to create one and upload your data onto it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Configure your data warehouse connection&lt;/strong&gt;&lt;br&gt;
To connect DBT to your data warehouse, create a profile in your &lt;code&gt;profiles.yml&lt;/code&gt; file and reference it by name from your &lt;code&gt;dbt_project.yml&lt;/code&gt;. The profile should include the connection details for your data warehouse, such as the host, database name, username, and password.&lt;/p&gt;
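&lt;p&gt;As a rough sketch (the profile name, host, and credentials below are placeholders), a minimal &lt;code&gt;profiles.yml&lt;/code&gt; for a Redshift warehouse might look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ~/.dbt/profiles.yml -- connection details live here, not in dbt_project.yml
my_profile:            # referenced by the `profile:` key in dbt_project.yml
  target: dev          # default target to run against
  outputs:
    dev:
      type: redshift
      host: my-cluster.example.com
      port: 5439
      user: my_user
      password: my_password
      dbname: analytics
      schema: dbt_dev
      threads: 4
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;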

&lt;p&gt;&lt;strong&gt;Step 4: Create a DBT project&lt;/strong&gt;&lt;br&gt;
You can create a new DBT project by running the following command in your terminal: &lt;br&gt;
&lt;code&gt;dbt init&lt;/code&gt;&lt;br&gt;
This will create a new directory with a basic structure for a DBT project, including a &lt;code&gt;dbt_project.yml&lt;/code&gt; file, which you can use to configure your project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Define your models&lt;/strong&gt;&lt;br&gt;
DBT models are defined in SQL, and you can use them to perform data transformation and modeling tasks. Models live as &lt;code&gt;.sql&lt;/code&gt; files in the &lt;code&gt;models&lt;/code&gt; directory of your DBT project.&lt;/p&gt;
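&lt;p&gt;For illustration (the model and table names here are hypothetical), two simple model files could look like this; the &lt;code&gt;ref()&lt;/code&gt; function is how DBT builds the dependency graph between models:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- models/stg_orders.sql (hypothetical staging model)
SELECT
  id AS order_id,
  user_id AS customer_id,
  created_at AS ordered_at
FROM raw_schema.raw_orders

-- models/customer_orders.sql (references the staging model via ref())
SELECT customer_id, COUNT(*) AS order_count
FROM {{ ref('stg_orders') }}
GROUP BY customer_id
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;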

&lt;p&gt;&lt;strong&gt;Step 6: Run and test DBT&lt;/strong&gt;&lt;br&gt;
Once you have defined your models, you can run DBT by running the following command in your terminal:&lt;br&gt;
&lt;code&gt;dbt run&lt;/code&gt;&lt;br&gt;
This will compile your models and run any tests you have defined. If there are no errors, your models will be deployed to your data warehouse.&lt;/p&gt;

&lt;p&gt;Over time, you may need to make changes to your models or add new models. You can use DBT to manage these changes, track the state of your data models, and ensure that your data remains accurate and up-to-date.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Some hand-on commands you should know for building your model
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;- Create a new branch&lt;/strong&gt; &lt;br&gt;
&lt;code&gt;git checkout -b [new_branch_name]&lt;/code&gt;&lt;br&gt;
In some cases, if your company is using Jira or any product management tool, you can set up a link to the tickets with your PR by including ticket numbers in the names of your new branches. This setup would enable tickets to immediately link with the PR and update other stakeholders on the progress of your changes.&lt;br&gt;
For example: I have ticket &lt;code&gt;spend-123&lt;/code&gt; to solve the bug of transaction data. I would create a new branch by command: &lt;br&gt;
&lt;code&gt;git checkout -b spend-123_transaction_debug&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Switch and update the branch&lt;/strong&gt;&lt;br&gt;
You can switch to master in order to pull the latest changes in your pipeline: &lt;br&gt;
&lt;code&gt;git checkout master&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcb05kyjj5244hp2mbsgm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcb05kyjj5244hp2mbsgm.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;git pull&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhgbzc5ouh9m3l1debo0o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhgbzc5ouh9m3l1debo0o.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To switch back to your branch run&lt;br&gt;
&lt;code&gt;git checkout [branch_name]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Running your model&lt;/strong&gt;&lt;br&gt;
Run your built model &lt;br&gt;
&lt;code&gt;dbt run -m [model_name]&lt;/code&gt;&lt;br&gt;
Or target a specific environment. &lt;br&gt;
For example, to run a model against the preprod target only, execute:&lt;br&gt;
&lt;code&gt;dbt run -m [model_name] -t preprod&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- To test your built model, execute&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;dbt test --models snapshot_core_transactions&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Reset your branch&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;git reset --hard&lt;/code&gt;&lt;br&gt;
Be careful: this discards all uncommitted local changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- SQLFluff fix&lt;/strong&gt;&lt;br&gt;
SQL fluff refers to redundant or unnecessary code in SQL statements. It comes in various forms, such as unnecessary parentheses, redundant clauses, or excessive spaces, and it makes code more difficult to read, maintain, and optimize, leading to slower performance and increased complexity.&lt;br&gt;
When modeling data with DBT, it is important to write clean, efficient, and well-structured SQL so that your data models perform well and are easy to maintain. Reducing or removing SQL fluff simplifies your code and makes it easier to understand.&lt;br&gt;
In some cases, you can use the SQLFluff linter to do a quick fix by applying a specific rule in your run: &lt;br&gt;
&lt;code&gt;sqlfluff fix models/[folder_name]/model_name.sql --rules [rule_name]&lt;/code&gt;&lt;/p&gt;
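&lt;p&gt;For reference, rules can also be configured once per project in a &lt;code&gt;.sqlfluff&lt;/code&gt; file at the repository root (the dialect and rule settings below are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# .sqlfluff -- project-wide linting configuration (illustrative)
[sqlfluff]
dialect = snowflake
templater = dbt
rules = L010,L030

[sqlfluff:rules:L010]
# Keywords such as SELECT / FROM must be upper-case
capitalisation_policy = upper
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;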

&lt;p&gt;&lt;strong&gt;- Commit and push your changes&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;git commit -m "message of your commit"&lt;/code&gt;&lt;br&gt;
&lt;code&gt;git push&lt;/code&gt; &lt;br&gt;
After pushing, Git prints a URL for opening a pull request; copy and open it to fill in the details and requirements of your PR. &lt;/p&gt;

&lt;p&gt;In short, these are some basic tips for setting up and running your models with DBT. &lt;/p&gt;

</description>
      <category>dbt</category>
      <category>datascience</category>
      <category>database</category>
      <category>analytics</category>
    </item>
    <item>
      <title>Start With Product Management — Top Key Metrics You Should Know</title>
      <dc:creator>Paige Tran</dc:creator>
      <pubDate>Mon, 03 Oct 2022 12:42:56 +0000</pubDate>
      <link>https://dev.to/bttminhphuc/start-with-product-management-top-key-metrics-you-should-know-2eh</link>
      <guid>https://dev.to/bttminhphuc/start-with-product-management-top-key-metrics-you-should-know-2eh</guid>
      <description>&lt;p&gt;I know many people switch from different backgrounds to Product Management. It is no doubt that Product management has its charm and that you can learn a lot from engineering, design, and branding to analytics. However, it is indeed still very ambiguous how to start.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa20h7ak2w75esfu2l233.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa20h7ak2w75esfu2l233.png" alt="Image description"&gt;&lt;/a&gt;&lt;br&gt;
When I started to switch to Product, I took the first step — I believe it would be a fundamental step for anyone who wants to switch&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Shifting from depth to breadth&lt;br&gt;
In my case, I had years of working specifically on fraud and payment analytics. Hence, I started to learn about products by looking widely into product analytics.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Key metrics are the simplest and most straightforward way to check a product's health and to know whether you are building an impactful product. This article should benefit anyone who works in the product management space!&lt;/p&gt;

&lt;p&gt;Let’s cut to the chase: here are the top 3 product metric categories I love the most.&lt;/p&gt;
&lt;h2&gt;
  
  
  1. Product Awareness
&lt;/h2&gt;

&lt;p&gt;This is the most important stage in your product funnel.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you are not aware that a product exists, you will never use it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttxu1am2chfuyar8ekc9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttxu1am2chfuyar8ekc9.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So how do you measure this? The fundamental formula is the number of customers who know about your product divided by the total population you advertised to (or your total customer base). We commonly see it in campaign reports, but I personally think it is very important for anyone who builds the product to know it too. Some common metrics to use in this case:&lt;/p&gt;

&lt;p&gt;Active Users (DAU, WAU, MAU): defined as the number of users who use the product within a given time window&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DAU: Daily Active User&lt;/li&gt;
&lt;li&gt;WAU: Weekly Active User&lt;/li&gt;
&lt;li&gt;MAU: Monthly Active User&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also, in some cases, you can expand the concept of &lt;strong&gt;Active users&lt;/strong&gt; in &lt;strong&gt;calendar time&lt;/strong&gt; or &lt;strong&gt;rolling time&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;WAU Calendar&lt;/strong&gt;: Number of users who use the product in a specific calendar week, from Monday to Sunday&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;WAU Rolling&lt;/strong&gt;: Number of users who use the product within the last 7 days, regardless of the current day&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
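&lt;p&gt;As a sketch (the &lt;code&gt;app_events&lt;/code&gt; table and its columns are assumptions), the two flavours of WAU could be computed along these lines:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Calendar WAU: distinct users active in each Monday-to-Sunday week
SELECT
  DATE_TRUNC('week', event_date) AS week_start,
  COUNT(DISTINCT user_id) AS wau_calendar
FROM app_events
GROUP BY 1;

-- Rolling WAU: distinct users active within the last 7 days, as of today
SELECT COUNT(DISTINCT user_id) AS wau_rolling
FROM app_events
WHERE event_date &amp;gt;= CURRENT_DATE - 7;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;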
&lt;h2&gt;
  
  
  How are active user metrics used to measure awareness?
&lt;/h2&gt;

&lt;p&gt;Active user counts narrow down the population that could plausibly know about a change in your product. Marketing campaigns also use this metric to advertise changes to top users, since they have better engagement with your product.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For example: suppose you release a new button in your app. To become aware that the button exists, users need to open the app first, so active user metrics are the most direct way to measure this.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  2. Product Adoption
&lt;/h2&gt;

&lt;p&gt;Product adoption, or user adoption, is the moment when users start to &lt;strong&gt;use your product or features&lt;/strong&gt;. At a basic level, adoption can be defined as the percentage of users who take an action on your product or feature for the first time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwnbnpm2cenyt30pvr3iq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwnbnpm2cenyt30pvr3iq.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For example: following up on the new button from the previous example, you would now like to know how many users tap the button for the first time.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Some of the key metrics in this category would help:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Product Funnel Conversion Rate&lt;/strong&gt;: The conversion ratio at each step is a straightforward way to measure feature/product adoption. The tip is simple: understand your funnel and when a user triggers the CTA (Call to Action)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi3tf6ha2aqt4djotmiwv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi3tf6ha2aqt4djotmiwv.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Speed of Adoption:&lt;/strong&gt; Defined as how long a user takes from the moment they become aware of the product to the moment they act. This metric is very helpful for identifying:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;(1) The product problem: If your product has a technical challenge that prevents the customer from acting&lt;/p&gt;

&lt;p&gt;(2) Product market fit: If the product does not bring benefits or is not helpful to the customer in this market&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Product Retention &amp;amp; Engagement
&lt;/h2&gt;

&lt;p&gt;After customers have used the feature/product, a sustainable business lies in how well you can retain them over time.&lt;/p&gt;

&lt;p&gt;There are &lt;strong&gt;2 key metrics&lt;/strong&gt; in this category:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Retention:&lt;/strong&gt; defined as the percentage of customers who, after first using the product (adoption), continue to use it over time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdeax1p9dkznzah5tlwys.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdeax1p9dkznzah5tlwys.png" alt="Image description"&gt;&lt;/a&gt;&lt;br&gt;
Looking at the example, we see that after the first &lt;strong&gt;CTA (Call to Action)&lt;/strong&gt;, the retention rate for the January cohort is pretty poor: less than 35% actually continue to use the product feature.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Churn:&lt;/strong&gt; One of my favourite metrics, and commonly used for measuring product engagement. It is defined as the percentage of customers who stopped using the product during a certain period.
In most of the companies I have worked at, this is a very important metric because we do not want to lose customers; a rising churn metric usually raises serious concerns about the business.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;For example: if you see a company with an 80% growth rate from month 1 to month 2, it sounds like an amazing result. But if the churn rate the month after is 90%, then the question is: “Is the company really growing?” It is churning customers as fast as it grows.&lt;/p&gt;
&lt;/blockquote&gt;
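&lt;p&gt;To make churn concrete, here is a sketch (the &lt;code&gt;app_events&lt;/code&gt; table and date logic are assumptions) of a monthly churn rate query:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Monthly churn: share of last month's active users who are absent this month
WITH last_month AS (
  SELECT DISTINCT user_id
  FROM app_events
  WHERE DATE_TRUNC('month', event_date) =
        DATE_TRUNC('month', CURRENT_DATE) - INTERVAL '1 month'
),
this_month AS (
  SELECT DISTINCT user_id
  FROM app_events
  WHERE DATE_TRUNC('month', event_date) = DATE_TRUNC('month', CURRENT_DATE)
)
SELECT 1.0 - COUNT(t.user_id) * 1.0 / COUNT(l.user_id) AS churn_rate
FROM last_month l
LEFT JOIN this_month t ON l.user_id = t.user_id;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;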



&lt;p&gt;There are tons of other metrics to look into when building a product feature, and it is definitely case by case. I hope these top product metric categories serve as a reference that helps you approach the problem faster.&lt;/p&gt;

&lt;p&gt;I am always up for an open discussion if you have different ways and I love to learn from others too!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Building your data portfolio is like making a dinner</title>
      <dc:creator>Paige Tran</dc:creator>
      <pubDate>Wed, 07 Sep 2022 13:26:06 +0000</pubDate>
      <link>https://dev.to/bttminhphuc/building-your-data-portfolio-is-like-making-a-dinner-758</link>
      <guid>https://dev.to/bttminhphuc/building-your-data-portfolio-is-like-making-a-dinner-758</guid>
      <description>&lt;p&gt;I wrote an original article: &lt;a href="https://medium.com/@bttminhphuc/building-your-data-portfolio-is-like-making-a-dinner-119f2de11178" rel="noopener noreferrer"&gt;Building your data portfolio is like making a dinner&lt;/a&gt;&lt;br&gt;
Some people asked me to share more in detail about how to build data projects or data portfolios productively after my previous blog about 4 Self-learning steps to get started in Data Analytics&lt;/p&gt;

&lt;p&gt;Hence, I have separated this topic out and written about it in more detail. This article focuses on a basic workflow that I believe will help you build your own portfolio. These steps are the daily process I apply to most of my own tasks. I will use one of my favourite analogies to describe it:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“How do you make a great dish for dinner?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 1: Collecting Data&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fawqfj0ttfrw1es5gnp25.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fawqfj0ttfrw1es5gnp25.png" alt="Image description"&gt;&lt;/a&gt;&lt;br&gt;
This is the stage when you go to the supermarket and shop. Let’s imagine!&lt;/p&gt;

&lt;p&gt;Data is your raw ingredients. Is your meat fresh or spoiled? Do you have all the ingredients, or are some missing? The quality of every item you picked up at the supermarket affects your dish. Sometimes the dish is imperfect because you are missing some ingredients; sometimes you can borrow or combine ingredients from different supermarkets, just as you can improvise by utilizing data from other sources.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How to get the data for your practice?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As shared in the &lt;a href="https://medium.com/@bttminhphuc/4-self-learning-steps-to-get-started-in-data-analytics-c38f6a907516" rel="noopener noreferrer"&gt;previous blog&lt;/a&gt;, you can visit open source communities such as &lt;a href="https://www.kaggle.com/#" rel="noopener noreferrer"&gt;Kaggle&lt;/a&gt; or &lt;a href="https://github.com/" rel="noopener noreferrer"&gt;Github&lt;/a&gt;. Some good references for collecting data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/how-to-collect-data-for-your-analysis-a8bc58043e64" rel="noopener noreferrer"&gt;How to collect data for your analysis&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://towardsdatascience.com/6-web-scraping-tools-that-make-collecting-data-a-breeze-457c44e4411d" rel="noopener noreferrer"&gt;- Tools That Make Collecting Data A Breeze (For higher advanced levels so you can practice Web Scraping and get more data for your project)&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 2: EDA (Exploratory Data Analysis) and Cleaning Data&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi0ehkjd2vb5tc71jegns.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi0ehkjd2vb5tc71jegns.png" alt="Image description"&gt;&lt;/a&gt;&lt;br&gt;
Whenever we talk about EDA or cleaning data, food preparation is one of the best examples of how to do it. Firstly, we need a lot of patience and love for this stage! This is when you clean and prepare your ingredients (cutting, chopping, etc.), understand their quality, decide which parts to use, and see which parts you can improvise with to make the dish better. By comparison, this stage brings data enthusiasts some great value:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understand better the range of data and be able to detect outliers and anomalies&lt;/li&gt;
&lt;li&gt;Identify the missing values&lt;/li&gt;
&lt;li&gt;Decode and conquer imbalance classes&lt;/li&gt;
&lt;li&gt;Re-format variable names and types&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And so on!&lt;/p&gt;
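&lt;p&gt;As a tiny, hedged illustration of these cleaning steps (the ages below are made-up toy data), here is a pure-Python sketch that counts missing values and flags outliers with a simple IQR rule:&lt;/p&gt;

```python
from statistics import quantiles

# Toy "ingredient check": ages with one missing value and one outlier
ages = [25, 31, None, 29, 27, 30, 28, 120]

# Identify the missing values
n_missing = sum(a is None for a in ages)
clean = [a for a in ages if a is not None]

# Detect outliers with a simple 1.5 * IQR rule
q1, _, q3 = quantiles(clean, n=4, method="inclusive")
iqr = q3 - q1
outliers = [a for a in clean if a < q1 - 1.5 * iqr or a > q3 + 1.5 * iqr]
```

&lt;p&gt;Here the value 120 gets flagged for a closer look, which is exactly the kind of “which part should we keep” decision this stage is about.&lt;/p&gt;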

&lt;p&gt;To better understand EDA and data cleaning, I would recommend reading:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://towardsdatascience.com/an-extensive-guide-to-exploratory-data-analysis-ddd99a03199e" rel="noopener noreferrer"&gt;Step by Step Guide to Exploratory Data Analysis&lt;/a&gt;&lt;br&gt;
&lt;a href="https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4" rel="noopener noreferrer"&gt;The Ultimate Guide to Data Cleaning&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  &lt;strong&gt;Step 3: Project techniques&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This is the “cooking” stage, when you select the right cooking method. Is this a stew or a grilled dish? The decision is based mainly on the purpose of the dish (when you went shopping, you aimed to make dish X). However, sometimes you can be more flexible based on the ingredients you have. For example, when half of the veggies are spoiled, you can borrow some from a neighbour or combine them with other veggies in your fridge. By comparison, when you lack data to back up your analysis, you can sometimes lean on manipulating the existing data or on other market research data.&lt;/p&gt;

&lt;p&gt;Some examples of technique selection:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Business analysis&lt;/strong&gt;: Clustering. Even if you only use Excel, SQL, or PowerBI, the most important part is being clear on the methodology you are applying.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recommendation&lt;/strong&gt;: Collaborative Filtering&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Predictions&lt;/strong&gt;: Classification/Regression models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And so on! Some examples of good technique projects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.kaggle.com/code/pierpaolo28/stock-market-analysis-and-time-series-prediction/notebook" rel="noopener noreferrer"&gt;Stock Market Analysis and Time Series Prediction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.kaggle.com/code/laowingkin/netflix-movie-recommendation" rel="noopener noreferrer"&gt;Netflix Movie Recommendation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
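&lt;p&gt;To make the clustering idea concrete, here is a minimal sketch (toy spend numbers and a hand-rolled two-cluster k-means rather than a library, purely for illustration):&lt;/p&gt;

```python
def kmeans_1d(values, iters=20):
    """Minimal two-cluster k-means on a single numeric feature."""
    centers = [min(values), max(values)]  # simple deterministic init
    clusters = [[], []]
    for _ in range(iters):
        clusters = [[], []]
        for v in values:
            # assign each value to its nearest center
            idx = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            clusters[idx].append(v)
        # move each center to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical monthly spend per customer: two obvious segments
centers, clusters = kmeans_1d([10, 12, 11, 95, 100, 98])
```

&lt;p&gt;In practice you would reach for a library such as scikit-learn, but being able to sketch the methodology is the part that matters.&lt;/p&gt;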

&lt;h2&gt;
  &lt;strong&gt;Step 4: Insights and recommendations&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I love this one the most: &lt;strong&gt;“How do you serve your food?”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Many people underestimate this stage, assuming the quality of the cooking above speaks for itself. However, I believe that if you serve Phở with the noodles, soup, and meat in separate bowls, the dish is not the same anymore. Sometimes it even confuses people about how to eat it.&lt;/p&gt;

&lt;p&gt;By comparison, when you present your results, the key question is: &lt;strong&gt;“SO WHAT?”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn2jhdt4xz5yxyao884pv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn2jhdt4xz5yxyao884pv.png" alt="Image description"&gt;&lt;/a&gt;&lt;br&gt;
One of the greatest pieces of advice from my former boss on presenting your analysis:&lt;br&gt;
&lt;strong&gt;“Focus on the storytelling and data insights, rather than the methods”&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;No method is right or wrong when you start learning something. These tips come from my experience; we all have very different approaches. Find the one that suits you.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I hope you find these tips helpful!&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>database</category>
      <category>datascience</category>
      <category>analytics</category>
    </item>
    <item>
      <title>4 Self-learning steps to get started in Data Analytics</title>
      <dc:creator>Paige Tran</dc:creator>
      <pubDate>Mon, 15 Aug 2022 22:39:00 +0000</pubDate>
      <link>https://dev.to/bttminhphuc/4-self-learning-steps-to-get-started-in-data-analytics-45ne</link>
      <guid>https://dev.to/bttminhphuc/4-self-learning-steps-to-get-started-in-data-analytics-45ne</guid>
      <description>&lt;p&gt;A quick intro on my background, I do not study data science or computer science. Hence, I understand how hard it is for some people who want to start in this field (to become a data analyst, business analyst, data engineer, data scientist, etc) but wondering how to start.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I would like to share my simple formula (of course, my journey had a lot more failures and lost feelings, and certainly took more time) for how to start and how to nurture your skill set in Data Analytics. And most importantly, for finding a job that you love. By the way, I love my job 😊&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Before jumping into this long journey, note that these steps can take months or even years (depending on your level of commitment). Thus, make sure you have strong motivation for starting something and pursuing it.&lt;/p&gt;

&lt;p&gt;Let’s cut to the chase and get into my 4 favorite self-learning steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Learning Data Analytics Fundamentals&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The first question many people ask me when they want to switch their career to data analytics is: “What tools/certificates should I start learning?” The answer is always: you should start with the fundamentals.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hRyi0Xlp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yda40va2g85mlaxb26pn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hRyi0Xlp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yda40va2g85mlaxb26pn.png" alt="Image description" width="880" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What are data analytics fundamentals? Basic statistics. They give you the ability to ask the right question and find the right methodology to answer it. In most cases, an analyst needs to break a problem down into smaller questions that are easier to answer.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For example: how should we increase the price of an item without losing our growth (i.e., without stopping new customers from coming, or driving existing customers away for good)? To answer this complex problem, instead of jumping straight into the toolkit and the code, a simple approach can be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What drives the price? Cost and revenue. Do we have data on them?&lt;/li&gt;
&lt;li&gt;What drives the cost? What drives the revenue? Do we have data on them?&lt;/li&gt;
&lt;li&gt;What is the optimal price point that maximises revenue and minimises churn? What methodology should we use to predict the impact of price?&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;After breaking the problem down into smaller questions, how do we explore the data we have?&lt;/p&gt;

&lt;p&gt;This is where most statistics courses come in, enabling you to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Define and recognize key descriptive statistics&lt;/li&gt;
&lt;li&gt;Describe and distinguish between the central limit theorem and the law of large numbers&lt;/li&gt;
&lt;li&gt;Identify strategies for constructing an unbiased sample&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Where to learn statistics? If you have not learned any of this before at university or school, you can refer to some free online courses:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://corporatefinanceinstitute.com/course/statistics-fundamentals/?gclid=Cj0KCQjwl92XBhC7ARIsAHLl9alXZk1nYl0hEHXiwIIzBlxW5aWf2TRG5Bty23j8OlTj2xXOiouQ0XYaAiSTEALw_wcB"&gt;Statistic Fundamentals&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.coursera.org/learn/stanford-statistics"&gt;Introduction to statistic&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Start with the right Toolbox&lt;/strong&gt;&lt;br&gt;
Maybe this is the step where you should slow down and do some research on the different roles in data analytics. I particularly love this infographic; it explained a lot to me about data roles and which tools I should learn. Of course, the same role does not always have the same scope or responsibilities in every company, so keep in mind this is a reference only.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RQsTTr4E--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pnmry3iqhj8nc4xknfum.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RQsTTr4E--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pnmry3iqhj8nc4xknfum.png" alt="Image description" width="880" height="745"&gt;&lt;/a&gt;&lt;br&gt;
Source: &lt;a href="https://www.datacamp.com/blog/infographic-the-anatomy-of-a-data-team-different-data-roles"&gt;Data camp&lt;/a&gt; (This is also one of my favorite sites to learn tools)&lt;/p&gt;

&lt;p&gt;Plus video (5mins) for your reference: &lt;a href="https://www.youtube.com/watch?v=X3paOmcrTjQ"&gt;Data Science in 5 mins&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each role requires a deeper understanding of, and experience with, a different toolbox. Indeed, finding the right tools after collecting data has been a recurring theme throughout my experience working in data analytics.&lt;/p&gt;

&lt;p&gt;If you do not know what role you would follow in the future, then I would recommend starting with the top 3 below:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BI Tools&lt;/strong&gt;&lt;br&gt;
PowerBI and Looker are my top selections. BI tools bring huge support for delivering data insights to more people. As a data enthusiast, it is important to understand how to use these tools and how to choose the best visualization for each case. This boosts your ability to explain a problem and present it to a wider group of stakeholders. There are tons of visualization tools out there (some for your &lt;a href="https://mopinion.com/business-intelligence-bi-tools-overview/"&gt;reference&lt;/a&gt;) that you can start practicing with, such as Tableau, SAS, etc. However, I particularly like PowerBI and Looker.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For PowerBI, you can easily access it via Microsoft, and there are many good practices linking it with Excel. I refer to these &lt;a href="https://www.datacamp.com/courses/introduction-to-power-bi"&gt;BI courses&lt;/a&gt; to learn.&lt;/li&gt;
&lt;li&gt;For Looker, the tool was acquired by Google in 2019 and became one of my favorites because of its friendly UI and practical functions. Here is a free Looker &lt;a href="https://www.coursera.org/learn/analyzing-and-visualizing-data-in-looker"&gt;course&lt;/a&gt; to kick-start with.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;SQL&lt;/strong&gt;&lt;br&gt;
Structured Query Language. Fundamentally, it is used to get your questions answered with data from a relational database (a term you will easily understand after the fundamentals in step 1). This language contributes tremendous value in answering basic questions that a visualization tool may not present or may be missing. It also gives you a more comprehensive picture of the data you currently have. Below are some of the top courses I started with to learn SQL; in my opinion, they give you a strong sense of real practical cases (my top selection from the tons of courses out there).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.datacamp.com/courses/introduction-to-sql"&gt;Introduction of SQL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.datacamp.com/courses/joining-data-in-sql"&gt;Joining Data in SQL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.datacamp.com/courses/intermediate-sql"&gt;Intermediate SQL&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
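&lt;p&gt;For a hedged taste of what SQL looks like in practice, here is a sketch using Python’s built-in sqlite3 module and made-up tables, purely for illustration:&lt;/p&gt;

```python
import sqlite3

# In-memory toy database: hypothetical customers and orders tables
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'An'), (2, 'Binh');
    INSERT INTO orders VALUES (1, 30.0), (1, 20.0), (2, 15.0);
""")

# "Which customer spent the most?" -- the kind of basic question
# a dashboard may not answer directly
row = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY total DESC
    LIMIT 1
""").fetchone()
```

&lt;p&gt;The JOIN, GROUP BY, and aggregate here are exactly the building blocks the courses above drill into.&lt;/p&gt;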

&lt;p&gt;&lt;strong&gt;Python&lt;/strong&gt;&lt;br&gt;
You must have heard of this programming language somewhere, given its popularity in practice. Python is particularly well suited for deploying machine learning at a large scale, and it is a very friendly tool for any data exploration. It basically gives you a deeper understanding of your raw data and how to manipulate it, lets you answer the tougher questions, and, furthermore, enables automation. Let’s start with some of my top references:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.datacamp.com/courses/intro-to-python-for-data-science"&gt;Introduction to Python&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.datacamp.com/courses/intermediate-python"&gt;Intermediate Python&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.datacamp.com/courses/exploratory-data-analysis-in-python"&gt;Exploratory Data Analysis in Python&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Python Data Science Toolbox (&lt;a href="https://www.datacamp.com/courses/python-data-science-toolbox-part-1"&gt;Part 1&lt;/a&gt;) and (&lt;a href="https://www.datacamp.com/courses/python-data-science-toolbox-part-2"&gt;Part 2&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
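&lt;p&gt;A small taste of the data-manipulation side, in plain Python (the CSV content is made up and inlined; in a real project you would read it from a file and likely use pandas):&lt;/p&gt;

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw export, inlined here instead of a real file
raw = io.StringIO("region,revenue\nnorth,100\nsouth,80\nnorth,40\n")

# Aggregate revenue per region -- the kind of manipulation step
# you would later automate for recurring reports
totals = defaultdict(float)
for record in csv.DictReader(raw):
    totals[record["region"]] += float(record["revenue"])
```

&lt;p&gt;Once a step like this is written down as code, rerunning it on next month’s export is the “automation” mentioned above.&lt;/p&gt;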

&lt;p&gt;&lt;strong&gt;3. Practice and Practice! Sharpen your skills&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where to practice?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Certainly, there are many websites for you to practice on, but I personally love 2 communities where you can not only practice and build your portfolio but, most importantly, learn from others.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.kaggle.com/"&gt;Kaggle&lt;/a&gt;&lt;br&gt;
Kaggle is the world’s largest community of data scientists and machine learning enthusiasts. This platform is the fastest way to get started on a new data science project. It also provides a very friendly interface to practice Jupyter notebook with a single click and is easy to build with the backup of the huge repository (free code and data).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Most importantly, I have learned a lot from others’ work. I believe learning from people in the same community is the fastest way to sharpen your skills.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://github.com/"&gt;Github&lt;/a&gt;&lt;br&gt;
I mainly use Github to read and learn experiences from others. Different than Kaggle, Github is more focused on function. Github is one of the biggest host sites for GIT (versioning control system). Basically, it allows you and others to manage the changes in the code at the same time without conflict. I am sure you will use it at least once in a real job when you have multiple people on the same project. I know many great friends are using it and highly recommend it if you want to start your portfolio.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Build your own portfolio&lt;/strong&gt;&lt;br&gt;
I would highly recommend starting to build your portfolio as soon as you plan to practice. I wrote another article about &lt;a href="https://medium.com/@bttminhphuc/building-your-data-portfolio-is-like-making-a-dinner-119f2de11178"&gt;HOW TO BUILD A DATA PORTFOLIO&lt;/a&gt;, which explains the steps below in further detail. It contains a simple workflow that I apply to most of my job tasks.&lt;/p&gt;

&lt;p&gt;Step 1: Collecting Data&lt;br&gt;
Step 2: EDA (Exploratory Data Analysis) and Cleaning Data&lt;br&gt;
Step 3: Project techniques&lt;br&gt;
Step 4: Insights and recommendations&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Focus on the storytelling and data insights, rather than the methods”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ix1twSRh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6s086i7b2yvqbl5ngl3n.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ix1twSRh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6s086i7b2yvqbl5ngl3n.jpeg" alt="Image description" width="880" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What’s Next?&lt;/strong&gt;&lt;br&gt;
Keep learning every day! I have built the habit of reading the news daily and listening to podcasts before going to sleep. It is not an easy habit to build, but once you do, it brings you tremendous value.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--noPwcCJe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c21ump5vhzw4wpoxhdve.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--noPwcCJe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c21ump5vhzw4wpoxhdve.jpeg" alt="Image description" width="880" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where to read?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I will write another post about this. Different topics have different great sources to read, watch, or listen to, so wait for it! Meanwhile, you can always start researching by yourself and build your own habit.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;No method is right or wrong when you start learning something. These tips come from my experience; we all have very different approaches. Find the one that suits you. I hope these tips serve as helpful references.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I am always up for an open discussion if you take a different approach, and I would love to learn from others too!&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>datascience</category>
      <category>analytics</category>
      <category>analyst</category>
    </item>
  </channel>
</rss>
