DEV Community: Chi Cong, Nguyen

Python Best Practices for Data Engineers

Chi Cong, Nguyen — Wed, 10 Jul 2024 15:32:38 +0000

Logging, Error Handling, Environment Variables, and Argument Parsing

In this guide, we'll explore best practices for using various Python functionalities to build robust and maintainable applications. We'll cover logging, exception handling, environment variable management, and argument parsing, with code samples and recommendations for each.

1. Logging
2. Try-Except Handling
3. Using python-dotenv for Environment Variables
4. Argparse

Logging

Logging is a crucial aspect of Python development, as it allows you to track the execution of your program, identify issues, and aid in debugging. Here's how to effectively incorporate logging into your Python projects:

# example.py
import logging

# Configure the logging format and level
logging.basicConfig(
    format='%(asctime)s - %(levelname)s - %(message)s',
    level=logging.INFO
)

# Example usage
logging.info('This is an informational message.')
logging.warning('This is a warning message.')
logging.error('This is an error message.')
logging.debug('This is a debug message.')  # Not shown: DEBUG is below the configured INFO level

Best Practices:

Use meaningful log levels (debug, info, warning, error, critical) to provide valuable context.
Avoid logging sensitive information, such as credentials or personal data.
Ensure log messages are concise and informative, helping you quickly identify and resolve issues.
Configure log file rotation and retention to manage storage and performance.
Use structured logging (e.g., with JSON) for better machine-readability and analysis.
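
The last two points can be sketched together. This is a minimal example, assuming a size-based rotation policy and a hand-written JSON formatter; the names JsonFormatter and build_logger, the field names, and the size limits are illustrative choices, not part of the logging module.

```python
import json
import logging
from logging.handlers import RotatingFileHandler


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(name, log_path, max_bytes=1_000_000, backup_count=3):
    """Return a logger that writes JSON lines to a size-rotated file.

    max_bytes and backup_count are example values; tune them to your storage.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep these records out of the root logger's handlers
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = RotatingFileHandler(
            log_path, maxBytes=max_bytes, backupCount=backup_count
        )
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger
```

Calling build_logger('pipeline', 'pipeline.log') and then logger.info('Loaded %d rows', 42) would append one JSON line to pipeline.log. Once the file reaches max_bytes it is rolled over to pipeline.log.1, and at most backup_count old files are kept.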

Try-Except Handling

Proper exception handling is essential for creating robust and resilient applications. Here's an example of using try-except blocks in Python:

import logging

def divide_numbers(a, b):
    try:
        result = a / b
        return result
    except ZeroDivisionError:
        logging.error('Error: Division by zero')
        return None
    except TypeError:
        logging.error('Error: Invalid input types')
        return None
    except Exception as e:
        logging.error(f'Unexpected error: {e}')
        return None

Best Practices:

Catch specific exceptions (e.g., ZeroDivisionError, TypeError) to handle known issues.
Use a broad Exception block to catch any unexpected errors.
Log the exception details for better debugging and error reporting.
Provide meaningful error messages that help users understand the problem.
Consider using custom exception classes for domain-specific errors.
Implement graceful error handling to ensure your application remains responsive and doesn't crash.
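
The point about custom exception classes could look like the sketch below. DataValidationError, parse_amount, and total_amount are hypothetical names, and the "amount" field is an assumed record layout chosen only for illustration.

```python
import logging


class DataValidationError(Exception):
    """Raised when an input record fails a domain-specific check."""

    def __init__(self, field, reason):
        super().__init__(f"{field}: {reason}")
        self.field = field
        self.reason = reason


def parse_amount(record):
    """Return the record's amount as a float, or raise DataValidationError."""
    if "amount" not in record:
        raise DataValidationError("amount", "missing")
    try:
        return float(record["amount"])
    except (TypeError, ValueError) as exc:
        # Chain the original exception so the traceback keeps the root cause
        raise DataValidationError(
            "amount", f"not a number: {record['amount']!r}"
        ) from exc


def total_amount(records):
    """Sum valid amounts, logging and skipping invalid records."""
    total = 0.0
    for record in records:
        try:
            total += parse_amount(record)
        except DataValidationError as exc:
            logging.warning("Skipping record: %s", exc)
    return total
```

Catching DataValidationError instead of a broad Exception lets the pipeline skip bad records while still failing loudly on errors it does not know how to handle.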

Using dotenv

Environment variables are a common way to store sensitive or configuration-specific data, such as API keys, database credentials, or feature flags. The python-dotenv library makes it easy to manage these variables:

#.env
API_KEY=xxxxxxx
DATABASE_URL=xxxxxxxxx
DB_PASSWORD=xxxxxxxx
DB_NAME=xxxxxx

#main.py
from dotenv import load_dotenv
import os

# Load environment variables from a .env file
load_dotenv()

# Access environment variables
api_key = os.getenv('API_KEY')
database_url = os.getenv('DATABASE_URL')

Best Practices:

Store environment variables in a .env file, which should be excluded from version control.
Use a .env.example file to document the required environment variables.
Load environment variables at the start of your application, before accessing them.
Provide default values or raise clear errors if required environment variables are missing.
Use environment variables for sensitive data, not for general configuration.
Organize environment variables by context (e.g., DATABASE_URL, AWS_ACCESS_KEY).
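
One way to "provide default values or raise clear errors" is a small wrapper around os.environ. This is a sketch; get_env and MissingEnvError are hypothetical helpers, not part of python-dotenv.

```python
import os


class MissingEnvError(RuntimeError):
    """Raised when a required environment variable is not set."""


def get_env(name, default=None, required=False):
    """Read an environment variable, with an optional default.

    If required is True and the variable is unset or empty,
    raise MissingEnvError with a message that names the variable.
    """
    value = os.environ.get(name, default)
    if required and not value:
        raise MissingEnvError(
            f"Required environment variable {name!r} is not set"
        )
    return value
```

After load_dotenv() has run, get_env('API_KEY', required=True) fails fast at startup with a clear message, rather than passing None further into the application.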

Argparse

The argparse module in Python allows you to easily handle command-line arguments and options. Here's an example:

import argparse
import logging

parser = argparse.ArgumentParser(description='My Python Application')
parser.add_argument('-n', '--name', type=str, required=True, help='Name of the user')
parser.add_argument('-a', '--age', type=int, default=30, help='Age of the user')
parser.add_argument('-v', '--verbose', action='store_true', help='Enable verbose output')

args = parser.parse_args()

print(f'Name: {args.name}')
print(f'Age: {args.age}')
if args.verbose:
    logging.info('Verbose mode enabled.')

Best Practices:

Use descriptive and concise argument names.
Provide helpful descriptions for each argument.
Specify the appropriate data types for arguments (e.g., type=str, type=int).
Set required=True for mandatory arguments.
Provide default values for optional arguments.
Use boolean flags (e.g., action='store_true') for toggles or switches.
Integrate argument parsing with your application's logging and error handling.
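
As a sketch of the last point, the parser can be wrapped in functions so that the --verbose flag drives the log level. build_parser and parse_and_configure are hypothetical names; accepting an explicit argv list is a design choice that makes the parser easy to call from tests.

```python
import argparse
import logging


def build_parser():
    """Create the same parser as in the example above."""
    parser = argparse.ArgumentParser(description='My Python Application')
    parser.add_argument('-n', '--name', type=str, required=True, help='Name of the user')
    parser.add_argument('-a', '--age', type=int, default=30, help='Age of the user')
    parser.add_argument('-v', '--verbose', action='store_true', help='Enable verbose output')
    return parser


def parse_and_configure(argv=None):
    """Parse arguments and set the log level from the --verbose flag.

    With argv=None, argparse reads the real command line (sys.argv[1:]).
    """
    args = build_parser().parse_args(argv)
    level = logging.DEBUG if args.verbose else logging.INFO
    # force=True (Python 3.8+) replaces any handlers configured earlier
    logging.basicConfig(
        format='%(asctime)s - %(levelname)s - %(message)s',
        level=level,
        force=True,
    )
    logging.debug('Parsed arguments: %s', vars(args))
    return args
```

For example, parse_and_configure(['-n', 'Ann', '-v']) returns the parsed arguments and switches the root logger to DEBUG.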

Putting It All Together

Here's a combined code snippet that demonstrates the usage of all the functionalities covered in this guide:

import logging
from dotenv import load_dotenv
import os
import argparse

# Configure logging
logging.basicConfig(
    format='%(asctime)s - %(levelname)s - %(message)s',
    level=logging.INFO
)

# Load environment variables
load_dotenv()
api_key = os.getenv('API_KEY')
database_url = os.getenv('DATABASE_URL')

# Define command-line arguments
parser = argparse.ArgumentParser(description='My Python Application')
parser.add_argument('-n', '--name', type=str, required=True, help='Name of the user')
parser.add_argument('-a', '--age', type=int, default=30, help='Age of the user')
parser.add_argument('-v', '--verbose', action='store_true', help='Enable verbose output')
args = parser.parse_args()

# Example function with exception handling
def divide_numbers(a, b):
    try:
        result = a / b
        return result
    except ZeroDivisionError:
        logging.error('Error: Division by zero')
        return None
    except TypeError:
        logging.error('Error: Invalid input types')
        return None
    except Exception as e:
        logging.error(f'Unexpected error: {e}')
        return None

# Example usage
if __name__ == '__main__':
    name = args.name
    age = args.age
    print(f'Name: {name}')
    print(f'Age: {age}')

    if args.verbose:
        logging.info('Verbose mode enabled.')
        # Log only whether the secrets are set, never their values
        logging.info(f'API key loaded: {api_key is not None}')
        logging.info(f'Database URL loaded: {database_url is not None}')

    result = divide_numbers(10, 2)
    if result is not None:
        print(f'Division result: {result}')

This code demonstrates the integration of logging, exception handling, environment variable management, and argument parsing. It includes best practices for each functionality, such as using appropriate log levels, catching specific exceptions, securely accessing environment variables, and providing helpful command-line argument descriptions.

NoSQL

Chi Cong, Nguyen — Wed, 03 Jul 2024 08:26:55 +0000

When to use a NoSQL Database

Need to store different data formats: NoSQL databases were created to handle structured, semi-structured, and unstructured data. JSON and XML documents can all be handled easily with NoSQL.
Large amounts of data: Traditional relational databases are usually not distributed, so they mostly scale vertically by adding more storage or compute to a single machine. NoSQL databases were designed to scale horizontally: the more servers you add to the database, the more data can be hosted with high availability and low latency (fast reads and writes).
Need horizontal scalability: Horizontal scalability is the ability to add more machines or nodes to a system to increase performance and storage space.
Need high throughput: ACID transactions bring benefits, but they also slow down reads and writes. If you need very fast reads and writes, a relational database may not suit your needs.
Need a flexible schema: A flexible schema allows columns to be added that do not have to be used by every row, which saves disk space.
Need high availability: A relational database running on a single server is a single point of failure. When that database goes down, a failover to a backup system must happen, and that takes time.

When NOT to use a NoSQL Database?

When you have a small dataset: NoSQL databases were made for big datasets. They work with small ones, but that is not what they were created for.
When you need ACID transactions: If you need a consistent database with ACID transactions, most NoSQL databases will not be able to serve this need. Many NoSQL databases are eventually consistent and do not provide ACID transactions. There are exceptions, however: some non-relational databases, such as MongoDB, support ACID transactions.
When you need to do JOINs across tables: Most NoSQL databases do not support JOINs, because they would result in full table scans.
When you want to be able to do aggregations and analytics.
When you have changing business requirements: Ad-hoc queries are possible but difficult, because the data model was designed around particular queries.
When your queries are not known in advance and you need flexibility: You need to know your queries in advance. If they are not available, or you need flexibility in how you query your data, you might need to stick with a relational database.

Optimization + Tuning Spark

Chi Cong, Nguyen — Wed, 03 Jul 2024 04:00:20 +0000

Other Issues and How to Address Them

We have also touched on another very common issue with Spark jobs that can be harder to address: everything working fine but just taking a very long time. So what do you do when your Spark job is (too) slow?

Insufficient resources
Often, while there are some possible ways of improvement, processing large data sets simply takes a lot longer than processing smaller ones, even without any big problem in the code or job tuning. Using more resources, either by increasing the number of executors or using more powerful machines, might just not be possible.
When you have a slow job, it’s useful to understand:

how much data you’re actually processing (compressed file formats can be tricky to interpret),
if you can decrease the amount of data to be processed by filtering or aggregating to lower cardinality,
and if resource utilization is reasonable. There are many cases where different stages of a Spark job differ greatly in their resource needs: loading data is typically I/O heavy, some stages might require a lot of memory, others might need a lot of CPU. Understanding these differences might help to optimize the overall performance. Use the Spark UI and logs to collect information on these metrics.

If you run into out of memory errors you might consider increasing the number of partitions. If the memory errors occur over time you can look into why the size of certain objects is increasing too much during the run and if the size can be contained. Also, look for ways of freeing up resources if garbage collection metrics are high.

Certain algorithms (especially ML ones) use the driver to store data the workers share and update during the run. If you see memory issues on the driver check if the algorithm you’re using is pushing too much data there.

Data skew
If you drill down on the Spark UI to the task level you can see if certain partitions process significantly more data than others and if they are lagging behind. Such symptoms usually indicate a skewed data set. Consider implementing the techniques mentioned in this lesson:

add an intermediate data processing step with an alternative key
adjust the spark.sql.shuffle.partitions parameter if necessary

The problem with data skew is that it’s very specific to a data set. You might know ahead of time that certain customers or accounts are expected to generate a lot more activity, but the solution for dealing with the skew might strongly depend on what the data looks like. If you need to implement a more general solution (for example, for an automated pipeline), it’s recommended to take a more conservative approach (assume that your data will be skewed) and then monitor how bad the skew really is.
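
The "alternative key" idea is often called salting. The sketch below is plain Python, not Spark code: it only imitates hash partitioning to show why appending a salt to a hot key spreads its rows over more partitions. The data set, the function names, and the choice of eight partitions and eight salts are assumptions made for illustration.

```python
import zlib
from collections import Counter


def partition_for(key, num_partitions):
    """Assign a key to a partition with a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_loads(keys, num_partitions):
    """Count how many rows land in each partition."""
    loads = Counter(partition_for(key, num_partitions) for key in keys)
    return [loads.get(p, 0) for p in range(num_partitions)]


def salt_keys(keys, num_salts):
    """Append a rotating salt so that one hot key becomes several distinct keys."""
    return [f"{key}_{i % num_salts}" for i, key in enumerate(keys)]


# Hypothetical skewed data set: one customer produces 90% of the rows.
keys = ["customer_A"] * 900 + [f"customer_{n}" for n in range(100)]

unsalted = partition_loads(keys, num_partitions=8)
salted = partition_loads(salt_keys(keys, num_salts=8), num_partitions=8)

print("max rows per partition without salt:", max(unsalted))
print("max rows per partition with salt:   ", max(salted))
```

Without the salt, all 900 rows of the hot key land in a single partition. With the salt they are split across several partitions. The cost is an extra step: results computed per salted key have to be combined again per original key, which is the intermediate processing step mentioned above.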

Inefficient queries
Once your Spark application works it’s worth spending some time to analyze the query it runs. You can use the Spark UI to check the DAG and the jobs and stages it’s built of.

Spark’s query optimizer is called Catalyst. While Catalyst is a powerful tool for turning Python code into an optimized query plan that can run on the JVM, it has some limitations when optimizing your code. It will, for example, push filters in a particular stage as early as possible in the plan, but it won’t move a filter across stages. It’s your job to make sure that, if early filtering is possible without compromising the business logic, you perform this filtering where it’s most appropriate.

It also can’t decide for you how much data you’re shuffling across the cluster. Remember from the first lesson how expensive sending data through the network is. As much as possible try to avoid shuffling unnecessary data. In practice, this means that you need to perform joins and grouped aggregations as late as possible.

When it comes to joins, there is more than one strategy to choose from. If one of your data frames is small, consider using a broadcast hash join instead of a join that shuffles both sides.
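
The idea behind a broadcast hash join can be sketched in plain Python (this is not Spark code). The small side is turned into an in-memory hash map that every worker could hold a copy of, so the large side is only streamed through once and never has to be shuffled. The function name and the example rows are hypothetical.

```python
def broadcast_hash_join(large_rows, small_rows, key):
    """Inner-join two lists of dicts on `key` using a hash map of the small side."""
    # Build phase: index the small side by the join key
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    # Probe phase: stream the large side and look up matches
    joined = []
    for row in large_rows:
        for match in lookup.get(row[key], []):
            joined.append({**match, **row})
    return joined


orders = [
    {"customer_id": 1, "amount": 10},
    {"customer_id": 2, "amount": 5},
    {"customer_id": 3, "amount": 7},  # no matching customer, dropped by the inner join
]
customers = [
    {"customer_id": 1, "name": "Ann"},
    {"customer_id": 2, "name": "Bob"},
]

for row in broadcast_hash_join(orders, customers, "customer_id"):
    print(row)
```

This only pays off when the small side fits comfortably in memory on every worker, which is why it suits small data frames.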

Further reading
Debugging and tuning your Spark application can be a daunting task. There is an ever growing community out there though always sharing new ideas and working on improving Spark itself and tooling that makes using Spark easier. So if you have a complicated issue don’t hesitate to reach out to others (via user mailing lists, forums, and Q&A sites).

You can find more information on tuning Spark and Spark SQL in the documentation.
Source: Udacity Data Engineering Nanodegree courses

Git

Chi Cong, Nguyen — Mon, 01 Jul 2024 08:25:46 +0000

I. Add Commits to A Repo

Configuration

$ git --version

$ git config --global user.name "<NAME>"   
$ git config --global user.email "<EMAIL>"
$ git config --global color.ui auto
$ git config --global merge.conflictstyle diff3
$ git config --global core.editor "code --wait"

Check configuration

$ git config --list

To make a commit, the file or files we want committed need to be on the Staging Index. Use this command to move files from the Working Directory to the Staging Index:

$ git add

This command takes files from the Staging Index and saves them in the repository:

$ git commit

Bypass The Editor With The -m Flag

$ git commit -m "Initial commit"

To see changes that have been made but not yet committed, use:

$ git diff

Good Commit Messages

Do
do keep the message short (less than 60-ish characters)
do explain what the commit does (not how or why!)

Do not
do not explain why the changes are made (more on this below)
do not explain how the changes are made (that's what git log -p is for!)
do not use the word "and"
  if you have to use "and", your commit message is probably doing too many changes - break the changes into separate commits
  e.g. "make the background color pink and increase the size of the sidebar"
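
Two of these rules are mechanical enough to check automatically. The function below is a hypothetical helper, not a Git feature; it treats "60-ish" as a hard limit of 60 characters, which is an assumption.

```python
import re

MAX_SUBJECT_LENGTH = 60  # the "60-ish characters" rule, taken literally


def check_commit_message(subject):
    """Return a list of problems found in a commit subject line."""
    problems = []
    if not subject.strip():
        problems.append("message is empty")
    if len(subject) >= MAX_SUBJECT_LENGTH:
        problems.append(
            f"message has {len(subject)} characters; keep it under {MAX_SUBJECT_LENGTH}"
        )
    # \b matches only the whole word, so "Expand" or "sandbox" are not flagged
    if re.search(r"\band\b", subject, flags=re.IGNORECASE):
        problems.append('message contains "and"; consider splitting the commit')
    return problems
```

For the example above, check_commit_message reports both the length and the use of "and", while a message such as "Add sidebar navigation" passes.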
To explain Why

II. Tagging, Branching, and Merging

Tagging

$ git tag -a v1.0

Verify tag

$ git tag

Delete tag - A Git tag can be deleted with the -d flag

$ git tag -d v1.0

Adding A Tag To A Past Commit

$ git tag -a v1.0 a87984

Branch

Verify branch

$ git branch

Create branch

$ git branch sidebar

Create a branch at a specific commit

$ git branch alt-sidebar-loc 42a69f

This creates the alt-sidebar-loc branch and has it point to the commit with SHA 42a69f.

Create a branch and switch to it right away

$ git checkout -b <new-branch-name> <SHA-or-existing-branch-name>

Switch to desired branch

$ git checkout sidebar

How this command works:

It removes all files and directories from the Working Directory that Git is tracking (files that Git tracks are stored in the repository, so nothing is lost).
It goes into the repository and pulls out all of the files and directories of the commit that the branch points to.

Show branches in the log

$ git log --oneline --decorate

Show all branches in a graph

$ git log --oneline --decorate --graph --all

Delete a branch. You cannot delete the branch you are currently on, so switch to another branch first (use -D instead of -d to force the deletion)

$ git branch -d sidebar

Merge

There are two types of merges:

Fast-forward merge – the branch being merged in must be ahead of the checked out branch. The checked out branch's pointer will just be moved forward to point to the same commit as the other branch.
Regular merge – two divergent branches are combined, and a merge commit is created.

Use git merge to combine branches (it merges some other branch into the current, checked-out branch)

$ git merge <name-of-branch-to-merge-in>

If you make a merge on the wrong branch, use this command to undo the merge:

$ git reset --hard HEAD^

Undoing Changes

To update the most recent commit, either to change its message or to add forgotten files:
Make the required changes to the files and run git add (if any files changed)
Update the message and commit via $ git commit --amend

The git revert command is used to reverse a previously made commit:

$ git revert <SHA-of-commit-to-revert>

This command:

undoes the changes that were made by the provided commit
creates a new commit to record the change

Reset vs Revert

Resetting Is Dangerous

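Reverting creates a new commit that reverts or undoes a previous commit.
Resetting, on the other hand, erases commits! The flag controls where the changes from the erased commits go: --mixed (the default) moves them to the working directory, --soft moves them to the staging index, and --hard discards them.

However, Git keeps track of everything for about 30 days before it completely erases anything. To access this history, use the git reflog command.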

💡 Create a backup branch on the most-recent commit so that I can get back to the commits if I make a mistake:

$ git branch backup

III. Working with remote

A remote repository is a repository that's just like the one you're using but it's just stored at a different location. To manage a remote repository, use the git remote command:

$ git remote

It's possible to have links to multiple different remote repositories.
A shortname is the name that's used to refer to a remote repository's location. Typically the location is a URL, but it could be a file path on the same computer.
git remote add is used to add a connection to a new remote repository.
git remote -v is used to see the details about a connection to a remote.

Git push (sync the remote repository with the local repository)

$ git push <remote-shortname> <branch>

$ git push origin master

If there are changes in a remote repository that you'd like to include in your local repository, pull in those changes:

$ git pull origin master

One case where you want to use git fetch rather than git pull is when your remote branch and your local branch both have changes that the other one doesn't have. In this case, you want to fetch the remote changes to get them into your local repository and then perform a merge manually. Then you can push that new merge commit back to the remote.

$ git fetch origin master

Working On Another Developer's Repository

Git log

$ git shortlog -s -n

$ git log --author=Surma

(matches authors whose name contains "Surma")

$ git log --author="Surma Lewis"

(matches the full name; the quotes are needed because of the space)

$ git log --grep=bug

(finds commits whose message contains "bug")

$ git log --grep="this bug"

(finds commits whose message contains "this bug")

NOTE

Before you start doing any work, make sure to look for the project's CONTRIBUTING.md file.

Next, it's a good idea to look at the GitHub issues for the project:

look at the existing issues to see if one is similar to the change you want to contribute
if necessary create a new issue
communicate the changes you'd like to make to the project maintainer in the issue

When you start developing, commit all of your work on a topic branch:

do not work on the master branch
make sure to give the topic branch a clear, descriptive name

As a general best practice for writing commits:

make frequent, smaller commits
use clear and descriptive commit messages
update the README file, if necessary

Staying in Sync with the Source

When you work with a project that you've forked, the original project's maintainer will continue adding changes to their project. You'll want to keep your fork of their project in sync with theirs so that you can include any changes they make.

To get commits from a source repository into your forked repository on GitHub you need to:

get the cloneable URL of the source repository
create a new remote with the git remote add command
  use the shortname upstream to point to the source repository
  provide the URL of the source repository
fetch the new upstream remote
merge the upstream's branch into a local branch
push the newly updated local branch to your origin repo

GIT standard

Commit Style Requirements

udacity.github.io

Apache Airflow

Chi Cong, Nguyen — Fri, 28 Jun 2024 09:36:36 +0000

I. Airflow Architecture

1. Main components:

Scheduler: The heart of Airflow, responsible for scheduling and executing DAGs. It continuously checks the DAGs and determines which tasks need to run based on their dependencies and schedule.
Executor: Responsible for executing tasks. Airflow provides several executors, such as LocalExecutor (runs tasks on the Airflow server), CeleryExecutor (runs tasks on separate workers), KubernetesExecutor (runs tasks on a Kubernetes cluster), and so on.
Webserver: Provides a web interface for managing, monitoring, and debugging DAGs.
Metadata Database: Stores information about DAGs, tasks, logs, and so on. Airflow supports several databases, such as PostgreSQL, MySQL, and SQLite.
DAGs (Directed Acyclic Graphs): Represent the workflows in Airflow. Each DAG consists of a set of tasks connected in execution order.
Tasks: The smallest units of work in Airflow. Each task represents a specific job, such as running a Python script or sending an email.
Operators: Python classes used to define tasks in DAGs. Airflow provides many operators for common jobs, such as BashOperator, PythonOperator, and EmailOperator.

2. Deploying Airflow components

On a single machine, there are only the scheduler and webserver components.
In a distributed environment, the components run on different machines (which also helps improve security).

II. How It Works

Scheduler (scheduling): continuously checks the DAGs and determines which tasks need to run based on their dependencies and schedule.
Executor (execution): receives tasks from the Scheduler and runs them on the workers.
Monitoring: Airflow tracks the progress of tasks and records the information in the Metadata Database.
Error Handling: Airflow handles errors that occur while tasks run and can retry, skip, or fail tasks depending on the configuration.
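
The dependency part of this mechanism can be illustrated with a toy scheduler built on the standard library's graphlib module (Python 3.9+). This is not Airflow's API or its actual implementation: run_dag and the task names are hypothetical, and scheduling by time, retries, and the metadata database are left out.

```python
from graphlib import TopologicalSorter


def run_dag(dependencies, run_task):
    """Run tasks in dependency order and return the order in which they ran.

    dependencies maps each task name to the set of tasks it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    executed = []
    while sorter.is_active():
        # get_ready() returns every task whose dependencies have all finished
        for task in sorted(sorter.get_ready()):
            run_task(task)        # the "executor" step
            executed.append(task)
            sorter.done(task)     # unblocks tasks that depend on this one
    return executed


# Hypothetical ETL workflow: transform waits for extract, load waits for transform
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}

order = run_dag(dag, run_task=lambda task: print("running", task))
```

Tasks that become ready at the same time could be handed to separate workers, which is roughly the role the Executor plays.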

References:
https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/overview.html#architecture-diagrams

High Availability vs Fault Tolerance vs Disaster Recovery

Chi Cong, Nguyen — Thu, 27 Jun 2024 08:44:50 +0000

I. High Availability:

Similar to having a spare tire in a car, high availability ensures a quick recovery from a component failure. The system has a backup ready to replace the failed component, minimizing downtime.

II. Fault Tolerance:

Like an airplane with multiple engines, a fault-tolerant system can continue operating even if one or more components fail. The system is designed to have redundancy, ensuring that the loss of a single component doesn't bring the entire system down.

III. Disaster Recovery:

This is like the pilot ejecting from a failing aircraft. In a disaster scenario, the entire infrastructure is compromised. Disaster recovery focuses on saving the business's data and operations by moving them to a new, unaffected infrastructure. It's about preserving the business, not the infrastructure itself.

References
https://www.pbenson.net/2014/02/the-difference-between-fault-tolerance-high-availability-disaster-recovery/

Understanding about Spark from Data engineering POV

Chi Cong, Nguyen — Thu, 27 Jun 2024 08:30:33 +0000

Spark is currently one of the most popular tools for big data analytics.
Spark is generally faster than Hadoop. This is because Hadoop writes intermediate results to disk whereas Spark tries to keep intermediate results in memory whenever possible.

The Hadoop ecosystem includes a distributed file storage system called HDFS (Hadoop Distributed File System). Spark, on the other hand, does not include a file storage system. You can use Spark on top of HDFS but you do not have to. Spark can read in data from other sources as well such as Amazon S3.

MapReduce

I. Spark ecosystem includes multiple components

Spark Core: The foundation for distributed data processing.
Spark SQL: Enables structured data processing using SQL-like queries. It allows you to query data stored in various formats like Hive tables, Parquet files, and relational databases.
MLlib: Provides machine learning algorithms for tasks like classification, regression, and clustering.
GraphX: A library for graph processing, enabling analysis of large-scale graphs.

--> Think of Spark as a toolbox for big data. Each component provides specialized tools for different tasks, allowing you to analyze and manipulate data efficiently and effectively.

II. Basic architecture of Apache Spark

Master Node: This node houses the "Driver Program" which contains the Spark Context. The Spark Context is responsible for initializing the Spark application and connecting to the cluster.
Cluster Manager: The Cluster Manager is responsible for allocating resources and managing the worker nodes. It can be a standalone manager or utilize systems like YARN or Mesos.
Worker Nodes: These nodes are the workhorses of the Spark cluster. They execute the tasks assigned by the Driver Program.
Tasks: These are individual units of work that are distributed across the worker nodes.
Cache: Worker nodes maintain a cache for storing frequently accessed data, speeding up processing.

Here is how it works:

The Driver Program, running on the Master Node, submits a Spark application to the Cluster Manager.
The Cluster Manager distributes the application's tasks across the worker nodes.
Worker nodes execute the tasks in parallel, leveraging their resources and the data cached on their local storage.
The Driver Program gathers and aggregates the results from the worker nodes.
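
The flow above can be imitated on one machine with a thread pool. This is an analogy in plain Python, not Spark code: the main function plays the driver, the pool threads stand in for worker nodes, and each partition is one task. The names count_words and run_job and the sample lines are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(partition):
    """Task executed by a worker: count the words in one partition of lines."""
    return sum(len(line.split()) for line in partition)


def run_job(lines, num_partitions=3):
    """Driver: split the data, hand tasks to workers, aggregate the results."""
    # Split the data set into partitions, one task per partition
    partitions = [lines[i::num_partitions] for i in range(num_partitions)]
    # The pool stands in for the worker nodes running tasks in parallel
    with ThreadPoolExecutor(max_workers=num_partitions) as workers:
        partial_counts = list(workers.map(count_words, partitions))
    # The driver gathers the partial results and combines them
    return sum(partial_counts)


lines = ["spark keeps data in memory", "hadoop writes to disk", "hello"]
print(run_job(lines))
```

In a real cluster the partitions live on different machines and the Cluster Manager decides where each task runs, but the split, execute, and aggregate steps follow the same pattern.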

Hi

Chi Cong, Nguyen — Tue, 12 Sep 2023 04:39:18 +0000

Report of airfares for global flights

Chi Cong, Nguyen — Mon, 19 Jul 2021 04:14:19 +0000

Report link
After reading an article about air transportation in Vietnam vnexpress.net/khong-can-them-hang-...

I wondered how expensive airfares are in the countries around Vietnam (cost per kilometer). I then found research from rome2rio that gave me a satisfying answer and even more information (link for more detail). It is based on economy-class data only, which I think is closer to cattle class, with fees added on everything from luggage to seat assignments.

The report is from 2018. Here is the number of carriers in Southeast Asian countries:

Vietnam 4 carriers
Thailand 10 carriers
Philippines 6 carriers
Myanmar 4 carriers
Malaysia 4 carriers
Indonesia 9 carriers
Cambodia 4 carriers

Roads quality in Asia 2006-2019

Chi Cong, Nguyen — Tue, 15 Jun 2021 04:16:40 +0000

In Southeast Asia, the ranking from best to worst road quality is Singapore > Malaysia > Thailand > Indonesia > Laos > Cambodia > Vietnam.

Source: https://www.theglobaleconomy.com/rankings/roads_quality/Asia/

Hello World!

Chi Cong, Nguyen — Wed, 09 Jun 2021 11:20:16 +0000

This is my first post