<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Chi Cong, Nguyen</title>
    <description>The latest articles on DEV Community by Chi Cong, Nguyen (@congnguyen).</description>
    <link>https://dev.to/congnguyen</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F646454%2Fa66a8ae4-d05b-44b8-a4ee-90bebfbb00a5.jpeg</url>
      <title>DEV Community: Chi Cong, Nguyen</title>
      <link>https://dev.to/congnguyen</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/congnguyen"/>
    <language>en</language>
    <item>
      <title>Python Best Practices for Data Engineer</title>
      <dc:creator>Chi Cong, Nguyen</dc:creator>
      <pubDate>Wed, 10 Jul 2024 15:32:38 +0000</pubDate>
      <link>https://dev.to/congnguyen/python-best-practices-for-data-engineer-epm</link>
      <guid>https://dev.to/congnguyen/python-best-practices-for-data-engineer-epm</guid>
      <description>&lt;h2&gt;
  
  
  Logging, Error Handling, Environment Variables, and Argument Parsing
&lt;/h2&gt;

&lt;p&gt;In this guide, we'll explore best practices for using various Python functionalities to build robust and maintainable applications. We'll cover logging, exception handling, environment variable management, and argument parsing, with code samples and recommendations for each.&lt;/p&gt;

&lt;p&gt;1. Logging&lt;br&gt;
2. Try-Except Exception Handling&lt;br&gt;
3. Using python-dotenv for Environment Variables&lt;br&gt;
4. Argparse&lt;/p&gt;
&lt;h3&gt;
  
  
  Logging
&lt;/h3&gt;

&lt;p&gt;Logging is a crucial aspect of Python development, as it allows you to track the execution of your program, identify issues, and aid in debugging. Here's how to effectively incorporate logging into your Python projects:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# example.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;

&lt;span class="c1"&gt;# Configure the logging format and level
&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;basicConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%(asctime)s - %(levelname)s - %(message)s&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFO&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Example usage
&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;This is an informational message.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;This is a warning message.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;This is an error message.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Not shown in the output: DEBUG is below the configured INFO level
&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;debug&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;This is a debug message.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Best Practices:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use meaningful log levels (debug, info, warning, error, critical) to provide valuable context.&lt;/li&gt;
&lt;li&gt;Avoid logging sensitive information, such as credentials or personal data.&lt;/li&gt;
&lt;li&gt;Ensure log messages are concise and informative, helping you quickly identify and resolve issues.&lt;/li&gt;
&lt;li&gt;Configure log file rotation and retention to manage storage and performance.&lt;/li&gt;
&lt;li&gt;Use structured logging (e.g., with JSON) for better machine-readability and analysis.&lt;/li&gt;
&lt;/ul&gt;
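
&lt;p&gt;As a minimal sketch of the last two points, the snippet below combines size-based rotation with one-JSON-object-per-line output using only the standard library. The names JsonFormatter and build_logger, the field names in the JSON payload, and the size and backup limits are illustrative assumptions, not part of the logging module.&lt;/p&gt;

```python
import json
import logging
from logging.handlers import RotatingFileHandler


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(name, log_path, max_bytes=1_000_000, backup_count=3):
    """Create a logger that writes JSON lines to a size-rotated file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records out of the root logger's handlers
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = RotatingFileHandler(
            log_path, maxBytes=max_bytes, backupCount=backup_count
        )
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger
```

&lt;p&gt;Calling build_logger('etl', 'pipeline.log') once at startup and reusing the returned logger keeps the rotation settings in one place; the retention policy here is simply the number of backup files kept.&lt;/p&gt;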

&lt;h3&gt;
  
  
  Try-Except Exception Handling
&lt;/h3&gt;

&lt;p&gt;Proper exception handling is essential for creating robust and resilient applications. Here's an example of using try-except blocks in Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;divide_numbers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;ZeroDivisionError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Error: Division by zero&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;TypeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Error: Invalid input types&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Unexpected error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Best Practices:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Catch specific exceptions (e.g., ZeroDivisionError, TypeError) to handle known issues.&lt;/li&gt;
&lt;li&gt;Use a broad Exception block only as a last resort for unexpected errors, and place it after the specific handlers.&lt;/li&gt;
&lt;li&gt;Log the exception details for better debugging and error reporting.&lt;/li&gt;
&lt;li&gt;Provide meaningful error messages that help users understand the problem.&lt;/li&gt;
&lt;li&gt;Consider using custom exception classes for domain-specific errors.&lt;/li&gt;
&lt;li&gt;Implement graceful error handling to ensure your application remains responsive and doesn't crash.&lt;/li&gt;
&lt;/ul&gt;
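
&lt;p&gt;The points about custom exception classes and logging exception details could look roughly like this. DataValidationError, parse_amount, and load_amounts are hypothetical names chosen for illustration; the sketch assumes that bad rows should be skipped rather than abort the whole run.&lt;/p&gt;

```python
import logging

logger = logging.getLogger(__name__)


class DataValidationError(Exception):
    """Raised when an input record fails a domain-specific check."""


def parse_amount(raw):
    """Convert a raw field to float, raising a domain error on bad input."""
    try:
        return float(raw)
    except (TypeError, ValueError) as exc:
        # Chain the original exception so the root cause stays in the traceback
        raise DataValidationError(f"invalid amount: {raw!r}") from exc


def load_amounts(rows):
    """Parse every row, logging and skipping the ones that fail validation."""
    amounts = []
    for row in rows:
        try:
            amounts.append(parse_amount(row))
        except DataValidationError:
            # logger.exception logs at ERROR level and appends the traceback
            logger.exception("Skipping bad row")
    return amounts
```

&lt;p&gt;Raising a domain-specific error at the low level and deciding what to do with it one level up keeps the recovery policy (skip, retry, abort) out of the parsing code.&lt;/p&gt;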

&lt;h3&gt;
  
  
  Using dotenv
&lt;/h3&gt;

&lt;p&gt;Environment variables are a common way to store sensitive or configuration-specific data, such as API keys, database credentials, or feature flags. The python-dotenv library makes it easy to manage these variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#.env
&lt;/span&gt;&lt;span class="n"&gt;API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;xxxxxxx&lt;/span&gt;
&lt;span class="n"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;xxxxxxxxx&lt;/span&gt;
&lt;span class="n"&gt;DB_PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;xxxxxxxx&lt;/span&gt;
&lt;span class="n"&gt;DB_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;xxxxxx&lt;/span&gt;

&lt;span class="c1"&gt;#main.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="c1"&gt;# Load environment variables from a .env file
&lt;/span&gt;&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Access environment variables
&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;API_KEY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;database_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Best Practices:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Store environment variables in a .env file, which should be excluded from version control.&lt;/li&gt;
&lt;li&gt;Use a .env.example file to document the required environment variables.&lt;/li&gt;
&lt;li&gt;Load environment variables at the start of your application, before accessing them.&lt;/li&gt;
&lt;li&gt;Provide default values or raise clear errors if required environment variables are missing.&lt;/li&gt;
&lt;li&gt;Use environment variables for secrets and deployment-specific settings; keep static, non-sensitive defaults in code or a config file.&lt;/li&gt;
&lt;li&gt;Name environment variables by context so related settings are easy to find (e.g., DATABASE_URL, AWS_ACCESS_KEY).&lt;/li&gt;
&lt;/ul&gt;
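
&lt;p&gt;One way to apply the advice about defaults and clear errors is a pair of small helpers around os.getenv. This is a sketch that assumes load_dotenv() has already run at startup; MissingEnvError, require_env, and optional_env are hypothetical names, not part of python-dotenv. Treating an empty string as missing is also an assumption.&lt;/p&gt;

```python
import os


class MissingEnvError(RuntimeError):
    """Raised when a required environment variable is not set."""


def require_env(name):
    """Return the value of a required variable or fail with a clear message."""
    value = os.getenv(name)
    if not value:  # treats an empty string the same as an unset variable
        raise MissingEnvError(f"Required environment variable {name} is not set")
    return value


def optional_env(name, default):
    """Return the variable's value, falling back to an explicit default."""
    return os.getenv(name, default)
```

&lt;p&gt;Calling require_env('DATABASE_URL') at startup makes a missing setting fail immediately with the variable name in the message, instead of surfacing later as a None value deep inside the application.&lt;/p&gt;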

&lt;h3&gt;
  
  
  Argparse
&lt;/h3&gt;

&lt;p&gt;The argparse module in Python allows you to easily handle command-line arguments and options. Here's an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
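&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;

&lt;span class="c1"&gt;# Without this, the root logger defaults to WARNING and hides info messages
&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;basicConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFO&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
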
&lt;span class="n"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ArgumentParser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;My Python Application&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;required&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Name of the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-a&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--age&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Age of the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-v&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--verbose&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;store_true&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Enable verbose output&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse_args&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Name: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Age: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Verbose mode enabled.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Best Practices:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use descriptive and concise argument names.&lt;/li&gt;
&lt;li&gt;Provide helpful descriptions for each argument.&lt;/li&gt;
&lt;li&gt;Specify the appropriate data types for arguments (e.g., type=str, type=int).&lt;/li&gt;
&lt;li&gt;Set required=True for mandatory arguments.&lt;/li&gt;
&lt;li&gt;Provide default values for optional arguments.&lt;/li&gt;
&lt;li&gt;Use boolean flags (e.g., action='store_true') for toggles or switches.&lt;/li&gt;
&lt;li&gt;Integrate argument parsing with your application's logging and error handling.&lt;/li&gt;
&lt;/ul&gt;
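
&lt;p&gt;To illustrate the last point, here is a sketch in which the --verbose flag drives the logging level. It reuses the arguments from the example above; build_parser, log_level_for, and main are illustrative names, and mapping --verbose to DEBUG is an assumption about what "verbose" should mean.&lt;/p&gt;

```python
import argparse
import logging


def build_parser():
    """Define the command-line interface in one place so it can be tested."""
    parser = argparse.ArgumentParser(description="My Python Application")
    parser.add_argument("-n", "--name", type=str, required=True, help="Name of the user")
    parser.add_argument("-a", "--age", type=int, default=30, help="Age of the user")
    parser.add_argument("-v", "--verbose", action="store_true", help="Enable verbose output")
    return parser


def log_level_for(args):
    """Map the --verbose flag to a logging level."""
    return logging.DEBUG if args.verbose else logging.INFO


def main(argv=None):
    """Parse argv (defaults to sys.argv[1:]) and configure logging from it."""
    args = build_parser().parse_args(argv)
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(message)s",
        level=log_level_for(args),
    )
    logging.debug("Parsed arguments: %s", vars(args))
    logging.info("Hello, %s (age %d)", args.name, args.age)
    return args
```

&lt;p&gt;Accepting argv as a parameter lets tests call main(['--name', 'Ana', '-v']) directly. When a required argument is missing, argparse prints a usage message and exits on its own, so no extra error handling is needed for that case.&lt;/p&gt;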

&lt;h3&gt;
  
  
  Putting It All Together
&lt;/h3&gt;

&lt;p&gt;Here's a combined code snippet that demonstrates the usage of all the functionalities covered in this guide:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;

&lt;span class="c1"&gt;# Configure logging
&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;basicConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%(asctime)s - %(levelname)s - %(message)s&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFO&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Load environment variables
&lt;/span&gt;&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;API_KEY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;database_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Define command-line arguments
&lt;/span&gt;&lt;span class="n"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ArgumentParser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;My Python Application&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;required&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Name of the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-a&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--age&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Age of the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-v&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--verbose&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;store_true&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Enable verbose output&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse_args&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Example function with exception handling
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;divide_numbers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;ZeroDivisionError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Error: Division by zero&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;TypeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Error: Invalid input types&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Unexpected error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="c1"&gt;# Example usage
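&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;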
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
    &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Name: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Age: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Verbose mode enabled.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Log only whether secrets are set, never their values
&lt;/span&gt;        &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;API key configured: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Database URL configured: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;database_url&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;divide_numbers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Division result: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code demonstrates the integration of logging, exception handling, environment variable management, and argument parsing. It includes best practices for each functionality, such as using appropriate log levels, catching specific exceptions, securely accessing environment variables, and providing helpful command-line argument descriptions.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>NoSQL</title>
      <dc:creator>Chi Cong, Nguyen</dc:creator>
      <pubDate>Wed, 03 Jul 2024 08:26:55 +0000</pubDate>
      <link>https://dev.to/congnguyen/nosql-4cef</link>
      <guid>https://dev.to/congnguyen/nosql-4cef</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;When to use a NoSQL Database&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Need to be able to store different data type formats:&lt;/strong&gt; NoSQL databases were designed to handle different data shapes: structured, semi-structured, and unstructured. JSON and XML documents can all be handled easily with NoSQL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large amounts of data:&lt;/strong&gt; Traditional relational databases are typically not distributed, so they scale mainly vertically, by adding more CPU, memory, or storage to a single machine. NoSQL databases were designed to scale horizontally: the more servers you add to the cluster, the more data it can host with high availability and low latency (fast reads and writes).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Need horizontal scalability:&lt;/strong&gt; Horizontal scalability is the ability to add more machines or nodes to a system to increase performance and storage space.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Need high throughput:&lt;/strong&gt; While ACID transactions bring benefits, they also slow down reads and writes. If you need very fast reads and writes, a relational database may not suit your needs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Need a flexible schema:&lt;/strong&gt; Flexible schema can allow for columns to be added that do not have to be used by every row, saving disk space.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Need high availability:&lt;/strong&gt; Relational databases have a single point of failure. When that database goes down, a failover to a backup system must happen and takes time.&lt;/li&gt;
&lt;/ul&gt;
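&lt;p&gt;As a rough illustration of the flexible-schema point above, here is a minimal Python sketch that models documents as plain dictionaries. The records and the &lt;code&gt;field_names&lt;/code&gt; helper are hypothetical and are not tied to any particular NoSQL product.&lt;/p&gt;

```python
import json

# Hypothetical user records: each document carries only the fields it needs.
documents = [
    {"id": 1, "name": "Ada", "email": "ada@example.com"},
    {"id": 2, "name": "Linus", "phone": "555-0100", "tags": ["admin"]},
]


def field_names(docs):
    """Return the set of field names used by at least one document."""
    names = set()
    for doc in docs:
        names.update(doc.keys())
    return names


# Documents serialize to JSON as-is; fields a document does not use are not stored.
serialized = [json.dumps(doc, sort_keys=True) for doc in documents]
all_fields = field_names(documents)
```

&lt;p&gt;In a fixed relational schema, every row would need a slot for all five fields; here the first document simply has no phone or tags entry.&lt;/p&gt;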

&lt;h2&gt;
  
  
  &lt;strong&gt;When NOT to use a NoSQL Database?&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;When you have a small dataset:&lt;/strong&gt; NoSQL databases were made for big datasets, not small ones. They work for small datasets, but that is not what they were created for.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When you need ACID Transactions:&lt;/strong&gt; If you need a consistent database with ACID transactions, most NoSQL databases will not be able to serve this need. Many NoSQL databases are eventually consistent and do not provide ACID transactions. However, there are exceptions: some non-relational databases, such as MongoDB, can support ACID transactions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When you need the ability to do JOINS across tables:&lt;/strong&gt; Most NoSQL databases do not support JOINs, because they would result in full table scans.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;If you want to be able to do aggregations and analytics&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If you have changing business requirements:&lt;/strong&gt; Ad-hoc queries are possible but difficult, as the data model is designed to fit particular queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If your queries are not known in advance and you need flexibility:&lt;/strong&gt; You need to know your queries in advance. If they are not available, or you need flexibility in how you query your data, you might need to stick with a relational database.&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Optimization + Tuning Spark</title>
      <dc:creator>Chi Cong, Nguyen</dc:creator>
      <pubDate>Wed, 03 Jul 2024 04:00:20 +0000</pubDate>
      <link>https://dev.to/congnguyen/optimization-tuning-spark-1dme</link>
      <guid>https://dev.to/congnguyen/optimization-tuning-spark-1dme</guid>
      <description>&lt;p&gt;&lt;strong&gt;Other Issues and How to Address Them&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We have also touched on another very common issue with Spark jobs that can be harder to address: everything working fine but just taking a very long time. So what do you do when your Spark job is (too) slow?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Insufficient resources&lt;/strong&gt;&lt;br&gt;
Often, while there are some possible improvements, processing large data sets simply takes much longer than processing smaller ones, even without any big problem in the code or job tuning. Using more resources, either by increasing the number of executors or by using more powerful machines, might just not be possible.&lt;br&gt;
When you have a slow job it’s useful to understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how much data you’re actually processing (compressed file formats can be tricky to interpret),&lt;/li&gt;
&lt;li&gt;if you can decrease the amount of data to be processed by filtering or aggregating to lower cardinality,&lt;/li&gt;
&lt;li&gt;and if resource utilization is reasonable. 
There are many cases where different stages of a Spark job differ greatly in their resource needs: loading data is typically I/O heavy, some stages might require a lot of memory, others might need a lot of CPU. Understanding these differences might help to optimize the overall performance. Use the Spark UI and logs to collect information on these metrics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you run into out of memory errors you might consider increasing the number of partitions. If the memory errors occur over time you can look into why the size of certain objects is increasing too much during the run and if the size can be contained. Also, look for ways of freeing up resources if garbage collection metrics are high.&lt;/p&gt;

&lt;p&gt;Certain algorithms (especially ML ones) use the driver to store data the workers share and update during the run. If you see memory issues on the driver check if the algorithm you’re using is pushing too much data there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data skew&lt;/strong&gt;&lt;br&gt;
If you drill down on the Spark UI to the task level you can see if certain partitions process significantly more data than others and if they are lagging behind. Such symptoms usually indicate a skewed data set. Consider implementing the techniques mentioned in this lesson:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;add an intermediate data processing step with an alternative key&lt;/li&gt;
&lt;li&gt;adjust the spark.sql.shuffle.partitions parameter if necessary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The problem with data skew is that it’s very specific to a data set. You might know ahead of time that certain customers or accounts are expected to generate a lot more activity, but the solution for dealing with the skew might strongly depend on what the data looks like. If you need to implement a more general solution (for example, for an automated pipeline), it’s recommended to take a more conservative approach (assume that your data will be skewed) and then monitor how bad the skew really is.&lt;/p&gt;
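&lt;p&gt;One common way to add an intermediate step with an alternative key is key salting. The sketch below simulates it in plain Python rather than Spark, so it only shows the idea. Salting is an assumption here (the lesson's own technique is not reproduced in this post), and &lt;code&gt;NUM_SALTS&lt;/code&gt;, &lt;code&gt;salted_counts&lt;/code&gt;, and the sample records are hypothetical.&lt;/p&gt;

```python
import random
from collections import Counter, defaultdict

NUM_SALTS = 4  # assumed number of sub-keys used to spread each original key


def salted_counts(records, num_salts=NUM_SALTS, seed=0):
    """Count records per key in two stages using a salted intermediate key."""
    rng = random.Random(seed)
    # Stage 1: aggregate on (key, salt) so a hot key is spread over several groups.
    partial = Counter()
    for key in records:
        partial[(key, rng.randrange(num_salts))] += 1
    # Stage 2: drop the salt and combine the much smaller partial results.
    totals = defaultdict(int)
    for (key, _salt), count in partial.items():
        totals[key] += count
    return dict(totals), partial


records = ["big_customer"] * 1000 + ["small_customer"] * 10
totals, partial = salted_counts(records)
largest_group = max(partial.values())
```

&lt;p&gt;The final totals match a direct count, but no single intermediate group has to hold all of the hot key's records, which is what relieves the lagging partition.&lt;/p&gt;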

&lt;p&gt;&lt;strong&gt;Inefficient queries&lt;/strong&gt;&lt;br&gt;
Once your Spark application works it’s worth spending some time to analyze the query it runs. You can use the Spark UI to check the DAG and the jobs and stages it’s built of.&lt;/p&gt;

&lt;p&gt;Spark’s query optimizer is called Catalyst. While Catalyst is a powerful tool for turning Python code into an optimized query plan that can run on the JVM, it has some limitations when optimizing your code. For example, it will push filters in a particular stage as early as possible in the plan, but it won’t move a filter across stages. It’s your job to make sure that, if early filtering is possible without compromising the business logic, then you perform this filtering where it’s most appropriate.&lt;/p&gt;

&lt;p&gt;It also can’t decide for you how much data you’re shuffling across the cluster. Remember from the first lesson how expensive sending data through the network is. As much as possible try to avoid shuffling unnecessary data. In practice, this means that you need to perform joins and grouped aggregations as late as possible.&lt;/p&gt;

&lt;p&gt;When it comes to joins, there is more than one strategy to choose from. If one of your data frames is small, consider using a broadcast hash join instead of a join that shuffles both sides (such as a shuffle hash join).&lt;/p&gt;
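&lt;p&gt;To make the broadcast hash join idea concrete, here is a minimal plain-Python sketch of its two phases. It is not Spark code: the function name and sample rows are hypothetical, and a dictionary stands in for the copy of the small table that Spark would send to every executor. In PySpark itself you can request this strategy by wrapping the small data frame in &lt;code&gt;pyspark.sql.functions.broadcast&lt;/code&gt; before joining.&lt;/p&gt;

```python
def broadcast_hash_join(large_rows, small_rows, key):
    """Join two lists of dicts on `key` by hashing the small side."""
    # Build phase: the small table becomes an in-memory lookup,
    # standing in for the copy that would be broadcast to every executor.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)
    # Probe phase: stream over the large table without moving or sorting it.
    joined = []
    for row in large_rows:
        for match in lookup.get(row[key], []):
            joined.append({**match, **row})
    return joined


orders = [
    {"country": "VN", "amount": 10},
    {"country": "DE", "amount": 25},
    {"country": "VN", "amount": 7},
]
countries = [
    {"country": "VN", "name": "Vietnam"},
    {"country": "DE", "name": "Germany"},
]
result = broadcast_hash_join(orders, countries, "country")
```

&lt;p&gt;Only the small table is copied around; the large table stays where it is, which is why this strategy avoids shuffling the big side across the network.&lt;/p&gt;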

&lt;p&gt;&lt;strong&gt;Further reading&lt;/strong&gt;&lt;br&gt;
Debugging and tuning your Spark application can be a daunting task. There is an ever growing community out there though always sharing new ideas and working on improving Spark itself and tooling that makes using Spark easier. So if you have a complicated issue don’t hesitate to reach out to others (via user mailing lists, forums, and Q&amp;amp;A sites).&lt;/p&gt;

&lt;p&gt;You can find more information on tuning &lt;a href="https://spark.apache.org/docs/latest/tuning.html"&gt;Spark&lt;/a&gt; and &lt;a href="https://spark.apache.org/docs/latest/sql-performance-tuning.html"&gt;Spark SQL&lt;/a&gt; in the documentation.&lt;br&gt;
Source: Udacity Data Engineering Nanodegree courses.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Git</title>
      <dc:creator>Chi Cong, Nguyen</dc:creator>
      <pubDate>Mon, 01 Jul 2024 08:25:46 +0000</pubDate>
      <link>https://dev.to/congnguyen/git-2of0</link>
      <guid>https://dev.to/congnguyen/git-2of0</guid>
      <description>&lt;h2&gt;
  
  
  I. Add Commits to A Repo
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Configuration
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;$ git --version&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git config --global user.name "&amp;lt;NAME&amp;gt;"   
$ git config --global user.email "&amp;lt;EMAIL&amp;gt;"
$ git config --global color.ui auto
$ git config --global merge.conflictstyle diff3
$ git config --global core.editor "code --wait"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;Check configuration
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;$ git config --list&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;To make a commit, the file or files we want committed need to be on the Staging Index. The command used to move files from the Working Directory to the Staging Index is:
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;$ git add&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This command takes files from the Staging Index and saves them in the repository:
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;$ git commit&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bypass The Editor With The -m Flag
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;$ git commit -m "Initial commit"&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;To see changes that have been made locally but not yet committed, use:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ git diff&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Good Commit Messages&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Do:&lt;br&gt;
keep the message short (less than 60-ish characters)&lt;br&gt;
explain what the commit does (not how or why!)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Do not:&lt;br&gt;
explain why the changes are made (more on this below)&lt;br&gt;
explain how the changes are made (that's what &lt;code&gt;git log -p&lt;/code&gt; is for!)&lt;br&gt;
use the word "and"; if you have to use "and", your commit is probably doing too many things, so break the changes into separate commits&lt;br&gt;
e.g. "make the background color pink and increase the size of the sidebar"&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;To explain why:&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmtt8nbrf0z4jt0rcg0nr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmtt8nbrf0z4jt0rcg0nr.png" alt="Image description" width="800" height="513"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  II. Tagging, Branching, and Merging
&lt;/h2&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Tagging&lt;/strong&gt;
&lt;/h3&gt;



&lt;p&gt;&lt;code&gt;$ git tag -a v1.0&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Verify tag
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;$ git tag&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Delete tag - A Git tag can be deleted with the -d flag
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;$ git tag -d v1.0&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adding A Tag To A Past Commit
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;$ git tag -a v1.0 a87984&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Branch&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Verify branch
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;git branch&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create branch
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;git branch sidebar&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create Git Branch At Location
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;$ git branch alt-sidebar-loc 42a69f&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This creates the &lt;code&gt;alt-sidebar-loc&lt;/code&gt; branch and has it point to the commit with SHA &lt;code&gt;42a69f&lt;/code&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a branch and switch to it right away
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;$ git checkout -b &amp;lt;new-branch-name&amp;gt; &amp;lt;SHA-or-existing-branch-name&amp;gt;&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Switch to desired branch
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;git checkout sidebar&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How this command works:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;It removes all files and directories from the Working Directory that Git is tracking
(files that Git tracks are stored in the repository, so nothing is lost)&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;It goes into the repository and pulls out all of the files and directories of the commit that the branch points to&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;ul&gt;
&lt;li&gt;Show branch in log
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;$ git log --oneline --decorate&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Show all branches in a graph
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;git log --oneline --decorate --graph --all&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Delete a branch (you must switch to another branch first; use -D to force deletion)
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;$ git branch -d sidebar&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Merge&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;There are two types of merges:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fast-forward merge: the branch being merged in must be ahead of the checked-out branch. The checked-out branch's pointer is simply moved forward to point to the same commit as the other branch.&lt;br&gt;
Regular merge: two divergent branches are combined, and a merge commit is created.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use git merge to combine branches (it merges another branch into the current, checked-out branch)
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;$ git merge &amp;lt;name-of-branch-to-merge-in&amp;gt;&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you merge into the wrong branch, use this command to undo the merge
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;$ git reset --hard HEAD^&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Undoing Changes&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Update the most recent commit by changing its message or adding forgotten files&lt;/li&gt;
&lt;li&gt;Make the required changes to the files (if any) and stage them with

&lt;code&gt;git add&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Update the message and commit the staged files via

&lt;code&gt;$ git commit --amend&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;git revert&lt;/code&gt; command is used to reverse a previously made commit:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ git revert &amp;lt;SHA-of-commit-to-revert&amp;gt;&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;This command: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Undoes the changes that were made by the provided commit&lt;/li&gt;
&lt;li&gt;Creates a new commit to record the change&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Reset vs Revert&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Resetting&lt;/strong&gt; Is Dangerous&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
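&lt;li&gt;Reverting creates a new commit that reverts or undoes a previous commit.&lt;/li&gt;
&lt;li&gt;Resetting, on the other hand, erases commits! The flag decides where the erased commits' changes go: --mixed (the default) leaves them in the working directory, --soft leaves them in the staging index, and --hard discards them.&lt;/li&gt;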
&lt;/ul&gt;

&lt;p&gt;However, Git keeps track of everything for about 30 days before it completely erases anything. To access that content, use the &lt;code&gt;git reflog&lt;/code&gt; command.&lt;/p&gt;

&lt;p&gt;💡 Create a backup branch on the most recent commit so that you can get back to the commits if you make a mistake:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ git branch backup&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;
&lt;h2&gt;
  
  
  III. Working with remote
&lt;/h2&gt;

&lt;p&gt;A remote repository is a repository that's just like the one you're using but it's just stored at a different location. To manage a remote repository, use the git remote command:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ git remote&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;It's possible to have links to multiple different remote repositories.&lt;br&gt;
A shortname is the name that's used to refer to a remote repository's location. Typically the location is a URL, but it could be a file path on the same computer.&lt;br&gt;
&lt;code&gt;git remote add&lt;/code&gt; is used to add a connection to a new remote repository.&lt;br&gt;
&lt;code&gt;git remote -v&lt;/code&gt; is used to see the details about a connection to a remote.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Git push (sync the remote repository with the local repository)
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;$ git push &amp;lt;remote-shortname&amp;gt; &amp;lt;branch&amp;gt;&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;



&lt;p&gt;&lt;code&gt;$ git push origin master&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7d5b4uoxf51rm4fx7o77.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7d5b4uoxf51rm4fx7o77.png" alt="Image description" width="800" height="460"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If there are changes in a remote repository that you'd like to include in your local repository, pull in those changes:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ git pull origin master&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;One situation where you want to use &lt;code&gt;git fetch&lt;/code&gt; rather than &lt;code&gt;git pull&lt;/code&gt; is when your remote branch and your local branch both have changes that the other one does not have. In this case, fetch the remote changes to get them into your local repository and then perform a merge manually. Then you can push that new merge commit back to the remote.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ git fetch origin master&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Working On Another Developer's Repository&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Git log
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;$ git shortlog -s -n&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;



&lt;p&gt;&lt;code&gt;$ git log --author=Surma&lt;/code&gt;&lt;br&gt;
&lt;br&gt;
 (matches part of the author's name)&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ git log --author="Surma Lewis"&lt;/code&gt;&lt;br&gt;
&lt;br&gt;
 (matches the full name)&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;git log --grep=bug&lt;/code&gt;&lt;br&gt;
&lt;br&gt;
 (finds commits whose message contains "bug")&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;git log --grep="this bug"&lt;/code&gt;&lt;br&gt;
&lt;br&gt;
 (finds commits whose message contains "this bug")&lt;/p&gt;
&lt;h2&gt;
  
  
  NOTE
&lt;/h2&gt;

&lt;p&gt;Before you start doing any work, make sure to look for the project's CONTRIBUTING.md file.&lt;/p&gt;

&lt;p&gt;Next, it's a good idea to look at the GitHub issues for the project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;look at the existing issues to see if one is similar to the change you want to contribute&lt;/li&gt;
&lt;li&gt;if necessary, create a new issue&lt;/li&gt;
&lt;li&gt;communicate the changes you'd like to make to the project maintainer in the issue&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you start developing, commit all of your work on a topic branch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;do not work on the master branch&lt;/li&gt;
&lt;li&gt;make sure to give the topic branch a clear, descriptive name&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As a general best practice for writing commits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;make frequent, smaller commits&lt;/li&gt;
&lt;li&gt;use clear and descriptive commit messages&lt;/li&gt;
&lt;li&gt;update the README file, if necessary&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Staying in Sync with the Source
&lt;/h2&gt;

&lt;p&gt;When working with a project that you've forked, the original project's maintainer will continue adding changes to their project. You'll want to keep your fork of their project in sync with theirs so that you can include any changes they make.&lt;/p&gt;

&lt;p&gt;To get commits from a source repository into your forked repository on GitHub you need to:&lt;/p&gt;

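&lt;ul&gt;
&lt;li&gt;get the cloneable URL of the source repository&lt;/li&gt;
&lt;li&gt;create a new remote with the &lt;code&gt;git remote add&lt;/code&gt; command
&lt;ul&gt;
&lt;li&gt;use the shortname &lt;code&gt;upstream&lt;/code&gt; to point to the source repository&lt;/li&gt;
&lt;li&gt;provide the URL of the source repository&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;fetch the new &lt;code&gt;upstream&lt;/code&gt; remote&lt;/li&gt;
&lt;li&gt;merge the upstream's branch into a local branch&lt;/li&gt;
&lt;li&gt;push the newly updated local branch to your &lt;code&gt;origin&lt;/code&gt; repo&lt;/li&gt;
&lt;/ul&gt;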
&lt;h2&gt;
  
  
  GIT standard
&lt;/h2&gt;

&lt;p&gt;Commit Style Requirements &lt;/p&gt;
&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;a href="https://udacity.github.io/git-styleguide/" rel="noopener noreferrer"&gt;
      udacity.github.io
    &lt;/a&gt;
&lt;/div&gt;



</description>
    </item>
    <item>
      <title>Apache Airflow</title>
      <dc:creator>Chi Cong, Nguyen</dc:creator>
      <pubDate>Fri, 28 Jun 2024 09:36:36 +0000</pubDate>
      <link>https://dev.to/congnguyen/apache-airflow-56e0</link>
      <guid>https://dev.to/congnguyen/apache-airflow-56e0</guid>
      <description>&lt;h2&gt;
  
  
  I. Airflow Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8dxtqwubumj4woexsfxs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8dxtqwubumj4woexsfxs.png" alt="Basic architecture" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Main components:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scheduler&lt;/strong&gt;: The heart of Airflow, responsible for scheduling and executing DAGs. It continuously checks the DAGs and determines which tasks need to run based on their dependencies and schedule.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Executor&lt;/strong&gt;: Responsible for executing tasks. Airflow provides several executors, such as LocalExecutor (runs tasks on the Airflow host), CeleryExecutor (runs tasks on separate workers), KubernetesExecutor (runs tasks on a Kubernetes cluster), etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Webserver&lt;/strong&gt;: Provides a web interface for managing, monitoring, and debugging DAGs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata Database&lt;/strong&gt;: Stores information about DAGs, tasks, logs, etc. Airflow supports several databases, such as PostgreSQL, MySQL, and SQLite.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DAGs&lt;/strong&gt; (Directed Acyclic Graphs): Represent workflows in Airflow. Each DAG consists of a set of tasks connected to one another in execution order.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks&lt;/strong&gt;: The smallest units of work in Airflow. Each task represents a specific job, such as running a Python script or sending an email.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operators&lt;/strong&gt;: Python classes used to define tasks in DAGs. Airflow provides many operators for common jobs, such as BashOperator, PythonOperator, EmailOperator, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Deploying Airflow components&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On a single machine, the deployment consists only of the scheduler and webserver components&lt;/li&gt;
&lt;li&gt;In a distributed environment, the components run on different machines (this also helps improve security)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx42rkzsw0hyzdgh5quhu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx42rkzsw0hyzdgh5quhu.png" alt="Distributed architecture" width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  II. How It Works
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scheduler&lt;/strong&gt;: continuously checks the DAGs and determines which tasks need to run based on their dependencies and schedule.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Executor&lt;/strong&gt;: receives tasks from the Scheduler and runs them on the workers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring&lt;/strong&gt;: Airflow tracks the progress of tasks and records the information in the Metadata Database.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error Handling&lt;/strong&gt;: Airflow handles errors that occur while tasks run and can retry, skip, or fail tasks depending on the configuration.&lt;/li&gt;
&lt;/ul&gt;
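&lt;p&gt;The scheduling loop described above can be sketched with Python's standard-library &lt;code&gt;graphlib&lt;/code&gt; module. This is not Airflow code and does not use the Airflow API: the task names, the &lt;code&gt;run_dag&lt;/code&gt; helper, and the no-op task runner are hypothetical. It only illustrates how tasks become ready once the tasks they depend on have finished.&lt;/p&gt;

```python
from graphlib import TopologicalSorter

# Hypothetical DAG: each task maps to the set of tasks it depends on.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}


def run_dag(dependencies, run_task):
    """Run tasks in dependency order, as a scheduler hands ready tasks to an executor."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    finished = []
    while sorter.is_active():
        for task in sorter.get_ready():  # tasks whose upstream tasks are done
            run_task(task)               # "executor" step
            finished.append(task)        # "monitoring" step: record the outcome
            sorter.done(task)            # unblocks downstream tasks
    return finished


order = run_dag(dag, lambda task: None)
```

&lt;p&gt;A real scheduler also takes schedule times, retries, and parallel workers into account; this sketch covers only the dependency part.&lt;/p&gt;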

&lt;p&gt;&lt;u&gt;&lt;em&gt;References:&lt;/em&gt;&lt;/u&gt;&lt;br&gt;
&lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/overview.html#architecture-diagrams"&gt;https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/overview.html#architecture-diagrams&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>High Availability vs Fault Tolerance vs Disaster Recovery</title>
      <dc:creator>Chi Cong, Nguyen</dc:creator>
      <pubDate>Thu, 27 Jun 2024 08:44:50 +0000</pubDate>
      <link>https://dev.to/congnguyen/high-availability-vs-fault-tolerance-vs-disaster-recovery-2m</link>
      <guid>https://dev.to/congnguyen/high-availability-vs-fault-tolerance-vs-disaster-recovery-2m</guid>
      <description>&lt;h1&gt;
  
  
  I. High Availability:
&lt;/h1&gt;

&lt;p&gt;Similar to having a spare tire in a car, high availability ensures a quick recovery from a component failure. The system has a backup ready to replace the failed component, minimizing downtime.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F96g1mtc1a1jciy80q4ly.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F96g1mtc1a1jciy80q4ly.png" alt="High Availability" width="396" height="484"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  II. Fault Tolerance:
&lt;/h1&gt;

&lt;p&gt;Like an airplane with multiple engines, a fault-tolerant system can continue operating even if one or more components fail. The system is designed to have redundancy, ensuring that the loss of a single component doesn't bring the entire system down.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2xv1wjvy2wsqbuvepe1l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2xv1wjvy2wsqbuvepe1l.png" alt="Fault Tolerance" width="644" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  III. Disaster Recovery:
&lt;/h1&gt;

&lt;p&gt;This is like the pilot ejecting from a failing aircraft. In a disaster scenario, the entire infrastructure is compromised. Disaster recovery focuses on saving the business's data and operations by moving them to a new, unaffected infrastructure. It's about preserving the business, not the infrastructure itself.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpj4n95ps8pq2tnvim8gi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpj4n95ps8pq2tnvim8gi.png" alt="Disaster Recovery" width="644" height="459"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;References&lt;br&gt;
&lt;a href="https://www.pbenson.net/2014/02/the-difference-between-fault-tolerance-high-availability-disaster-recovery/"&gt;https://www.pbenson.net/2014/02/the-difference-between-fault-tolerance-high-availability-disaster-recovery/&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Understanding about Spark from Data engineering POV</title>
      <dc:creator>Chi Cong, Nguyen</dc:creator>
      <pubDate>Thu, 27 Jun 2024 08:30:33 +0000</pubDate>
      <link>https://dev.to/congnguyen/understanding-about-spark-from-data-engineering-pov-4hlp</link>
      <guid>https://dev.to/congnguyen/understanding-about-spark-from-data-engineering-pov-4hlp</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu77z5rmv4fe0mr6gxudp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu77z5rmv4fe0mr6gxudp.png" alt="Image description" width="800" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Spark is currently one of the most popular tools for big data analytics.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Spark is generally faster than Hadoop MapReduce. This is because Hadoop MapReduce &lt;em&gt;writes intermediate results to disk&lt;/em&gt; whereas Spark tries to &lt;em&gt;keep intermediate results in memory&lt;/em&gt; whenever possible.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Hadoop ecosystem includes a distributed file storage system called HDFS (&lt;strong&gt;H&lt;/strong&gt;adoop &lt;strong&gt;D&lt;/strong&gt;istributed &lt;strong&gt;F&lt;/strong&gt;ile &lt;strong&gt;S&lt;/strong&gt;ystem). Spark, on the other hand, does not include a file storage system. You can use Spark on top of HDFS but you do not have to. Spark can read in data from other sources as well such as Amazon S3.&lt;/p&gt;

&lt;h2&gt;
  
  
  MapReduce
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx3hf9pyowzj3ihq3texl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx3hf9pyowzj3ihq3texl.png" alt="Image description" width="800" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  I. Spark ecosystem includes multiple components
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2u7z7bmrvzw8ns0f7gyz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2u7z7bmrvzw8ns0f7gyz.png" alt="Spark ecosystem" width="800" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Spark Core&lt;/strong&gt;: The foundation for distributed data processing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spark SQL&lt;/strong&gt;: Enables structured data processing using SQL-like queries. It allows you to query data stored in various formats like Hive tables, Parquet files, and relational databases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MLlib&lt;/strong&gt;: Provides machine learning algorithms for tasks like classification, regression, and clustering.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GraphX&lt;/strong&gt;: A library for graph processing, enabling analysis of large-scale graphs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;--&amp;gt; Think of Spark as a toolbox for big data. Each component provides specialized tools for different tasks, allowing you to analyze and manipulate data efficiently and effectively.&lt;/p&gt;

&lt;h1&gt;
  
  
  II. Basic architecture of Apache Spark
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxbmjwkaqjn8gs1kt0ins.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxbmjwkaqjn8gs1kt0ins.png" alt="Basic architecture of Apache Spark" width="800" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Master Node&lt;/strong&gt;: This node houses the "Driver Program" which contains the Spark Context. The Spark Context is responsible for initializing the Spark application and connecting to the cluster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cluster Manager&lt;/strong&gt;: The Cluster Manager is responsible for allocating resources and managing the worker nodes. It can be a standalone manager or utilize systems like YARN or Mesos.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Worker Nodes&lt;/strong&gt;: These nodes are the workhorses of the Spark cluster. They execute the tasks assigned by the Driver Program.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks&lt;/strong&gt;: These are individual units of work that are distributed across the worker nodes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache&lt;/strong&gt;: Worker nodes maintain a cache for storing frequently accessed data, speeding up processing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Here is how it works:&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The Driver Program, running on the Master Node, requests resources for the Spark application from the Cluster Manager.&lt;/li&gt;
&lt;li&gt;The Cluster Manager allocates resources on the worker nodes, and the Driver Program distributes the application's tasks across them.&lt;/li&gt;
&lt;li&gt;Worker nodes execute the tasks in parallel, leveraging their resources and the data cached on their local storage.&lt;/li&gt;
&lt;li&gt;The Driver Program gathers and aggregates the results from the worker nodes.&lt;/li&gt;
&lt;/ol&gt;
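
&lt;p&gt;The flow above can be imitated on a single machine. The sketch below is not Spark code: it assumes a thread pool can stand in for the worker nodes, and the names &lt;code&gt;run_task&lt;/code&gt;, &lt;code&gt;split_into_partitions&lt;/code&gt;, and &lt;code&gt;driver_program&lt;/code&gt; are hypothetical helpers invented for this example. It only shows the shape of the idea: split the data, run one task per partition in parallel, then aggregate the partial results in the driver.&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(partition):
    """A task: the unit of work a worker runs on one partition of the data."""
    return sum(x * x for x in partition)


def split_into_partitions(data, num_partitions):
    """Driver side: split the data set into roughly equal partitions."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def driver_program(data, num_workers=3):
    """Driver: hand out one task per partition, then aggregate the results."""
    partitions = split_into_partitions(data, num_workers)
    # The thread pool stands in for the worker nodes (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=num_workers) as workers:
        partial_results = list(workers.map(run_task, partitions))
    # Final aggregation happens back in the driver.
    return sum(partial_results)


if __name__ == "__main__":
    # Sum of squares of 0..9, computed across three simulated workers.
    print(driver_program(list(range(10))))
```

&lt;p&gt;In real Spark, the partitions live on different machines and the driver only receives the final results, which is why aggregating on the workers first keeps the driver from becoming a bottleneck.&lt;/p&gt;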

</description>
    </item>
    <item>
      <title>Hi</title>
      <dc:creator>Chi Cong, Nguyen</dc:creator>
      <pubDate>Tue, 12 Sep 2023 04:39:18 +0000</pubDate>
      <link>https://dev.to/congnguyen/hi-5229</link>
      <guid>https://dev.to/congnguyen/hi-5229</guid>
      <description></description>
    </item>
    <item>
      <title>Report of airfares for global flights</title>
      <dc:creator>Chi Cong, Nguyen</dc:creator>
      <pubDate>Mon, 19 Jul 2021 04:14:19 +0000</pubDate>
      <link>https://dev.to/congnguyen/report-of-airfares-for-global-flights-194d</link>
      <guid>https://dev.to/congnguyen/report-of-airfares-for-global-flights-194d</guid>
      <description>&lt;p&gt;&lt;a href="https://www.rome2rio.com/labs/2018-global-flight-price-ranking/#ranking"&gt;Report link&lt;/a&gt;&lt;br&gt;
I came to this report after reading an article about air transportation in Vietnam: vnexpress.net/khong-can-them-hang-...&lt;/p&gt;

&lt;p&gt;That article left me wondering how expensive airfare is in the countries around Vietnam, measured as cost per kilometer. The rome2rio research linked above gave me a satisfying answer and more. It is based on economy-class data only, which I think is closer to cattle class, with fees added on everything from luggage to seat assignments.&lt;/p&gt;

&lt;p&gt;The report is from 2018. Here is the number of carriers in each Southeast Asian country:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vietnam: 4 carriers
&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5mp6s35u30jold0l5oj2.png" alt="Alt Text" width="800" height="188"&gt;
&lt;/li&gt;
&lt;li&gt;Thailand: 10 carriers
&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F40jphal3hexgez9b6d5a.png" alt="Alt Text" width="800" height="400"&gt;
&lt;/li&gt;
&lt;li&gt;Philippines: 6 carriers
&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4zwb1gh6qek2chd10bm8.png" alt="Alt Text" width="800" height="275"&gt;
&lt;/li&gt;
&lt;li&gt;Myanmar: 4 carriers
&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyp9azjwn89nlaos1znu4.png" alt="Alt Text" width="800" height="153"&gt;
&lt;/li&gt;
&lt;li&gt;Malaysia: 4 carriers
&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F34pojzb1ko7ogjm9jbm3.png" alt="Alt Text" width="800" height="196"&gt;
&lt;/li&gt;
&lt;li&gt;Indonesia: 9 carriers
&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy5em4jtxdzueievm9be2.png" alt="Alt Text" width="800" height="362"&gt;
&lt;/li&gt;
&lt;li&gt;Cambodia: 4 carriers
&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67d0lxpgqg4u36pwsoan.png" alt="Alt Text" width="800" height="142"&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Roads quality in Asia 2006-2019</title>
      <dc:creator>Chi Cong, Nguyen</dc:creator>
      <pubDate>Tue, 15 Jun 2021 04:16:40 +0000</pubDate>
      <link>https://dev.to/congnguyen/roads-quality-in-asia-2006-2019-24cj</link>
      <guid>https://dev.to/congnguyen/roads-quality-in-asia-2006-2019-24cj</guid>
      <description>&lt;p&gt;In Southeast Asia, Singapore ranks first, followed by Malaysia, Thailand, Indonesia, Laos, Cambodia, and Vietnam.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F62sdt2c579rqt790wa9g.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F62sdt2c579rqt790wa9g.jpg" alt="Alt Text" width="500" height="901"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Source: &lt;a href="https://www.theglobaleconomy.com/rankings/roads_quality/Asia/"&gt;https://www.theglobaleconomy.com/rankings/roads_quality/Asia/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>logistics</category>
    </item>
    <item>
      <title>Hello World!</title>
      <dc:creator>Chi Cong, Nguyen</dc:creator>
      <pubDate>Wed, 09 Jun 2021 11:20:16 +0000</pubDate>
      <link>https://dev.to/congnguyen/hello-world-1lom</link>
      <guid>https://dev.to/congnguyen/hello-world-1lom</guid>
      <description>&lt;p&gt;This is my first post&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
