<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Bradley Kipkoech</title>
    <description>The latest articles on DEV Community by Bradley Kipkoech (@bradley_kipkoech_f8d47c35).</description>
    <link>https://dev.to/bradley_kipkoech_f8d47c35</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3381012%2F3ecd81ee-9f0a-4647-adc8-3b81a0c63c01.jpg</url>
      <title>DEV Community: Bradley Kipkoech</title>
      <link>https://dev.to/bradley_kipkoech_f8d47c35</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/bradley_kipkoech_f8d47c35"/>
    <language>en</language>
    <item>
      <title>DATA ANALYSIS</title>
      <dc:creator>Bradley Kipkoech</dc:creator>
      <pubDate>Wed, 20 May 2026 11:58:54 +0000</pubDate>
      <link>https://dev.to/bradley_kipkoech_f8d47c35/data-analysis-mhf</link>
      <guid>https://dev.to/bradley_kipkoech_f8d47c35/data-analysis-mhf</guid>
      <description>&lt;p&gt;&lt;strong&gt;What is data analysis?&lt;/strong&gt;&lt;br&gt;
This is basically data cleaning, analysis, reporting and dashboard development.&lt;br&gt;
Data analysis revolves around data quality, data warehouses, dashboarding, documentation and using data to improve decision making.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is a data warehouse?&lt;/strong&gt;&lt;br&gt;
A data warehouse is a centralized system where data from different sources is collected, cleaned, organized and stored for reporting, analysis, dashboards and decision making.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What does data quality mean?&lt;/strong&gt;&lt;br&gt;
It mean ensuring data is fit for use. Good quality data should be complete, accurate, consistent, timely, valid and unique. This are the data quality dimensions.&lt;br&gt;
Complete - is the required data available?&lt;br&gt;
Accuracy - is the data correct?&lt;br&gt;
Consistency - does the data match across systems?&lt;br&gt;
Timeliness - was the data submitted on time?&lt;br&gt;
Validity - is it in the right format?&lt;br&gt;
Uniqueness - are there any duplicates?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to clean a dataset&lt;/strong&gt;&lt;br&gt;
First you need to understand the structure of the dataset and the expected fields. Then check for missing values, duplicates, incorrect formats, inconsistent names, outliers, and invalid outputs(will be covered in excel and python article)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What makes a good dashboard?&lt;/strong&gt;&lt;br&gt;
A good dashboard should be simple, accurate, interactive, and action-oriented. It should show the most important indicators clearly, allow users to filter by relevant categories, and help them identify area that need attention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do you develop a dashboard?&lt;/strong&gt;&lt;br&gt;
Start by understanding the users' information needs and key indicators they want to monitor. Then prepare and clean the data, model relationships between tables, create measures where needed, and design visuals that clearly communicate performance. After building the dashboard, validate the numbers against the source data and collect feedback from the users.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is a data dictionary?&lt;/strong&gt;&lt;br&gt;
It is a documentation that explains the fields in a dataset. It usually includes the column name, description, data type, allowed values, source, and business rules. It helps users understand and interpret data consistently.&lt;/p&gt;

&lt;h2&gt;
  
  
  SQL FOR DATA ANALYSIS
&lt;/h2&gt;

&lt;p&gt;SQL is a language used to communicate with a database.&lt;br&gt;
&lt;strong&gt;GOAl&lt;/strong&gt;&lt;br&gt;
Knowing how to explain and use SQL to extract, summarize, join, and validate data from a database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SELECT&lt;/strong&gt;&lt;br&gt;
Clause used to select either every column from a table or specific columns from a table.&lt;br&gt;
i.e &lt;strong&gt;SELECT *&lt;br&gt;
    FROM table_name;  this selects everything.&lt;/strong&gt;&lt;br&gt;
    &lt;strong&gt;SELECT &lt;br&gt;
      column1, column2&lt;br&gt;
      FROM table_name;  this selects specific columns&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WHERE&lt;/strong&gt;&lt;br&gt;
This filters rows into your specifications&lt;br&gt;
i.e &lt;strong&gt;SELECT *&lt;br&gt;
      FROM table_name&lt;br&gt;
      WHERE column1&amp;lt;80;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ORDER BY&lt;/strong&gt;&lt;br&gt;
It sorts data either in ascending or descending order.&lt;br&gt;
i.e &lt;strong&gt;SELECT &lt;br&gt;
      column1, column2&lt;br&gt;
      FROM table_name&lt;br&gt;
      ORDER BY column1 ASC/DESC&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GROUP BY&lt;/strong&gt;&lt;br&gt;
It groups selected data.&lt;br&gt;
i.e &lt;strong&gt;SELECT *&lt;br&gt;
      FROM table_name&lt;br&gt;
      GROUP BY column1;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HAVING&lt;/strong&gt;&lt;br&gt;
It filters rows after grouping&lt;br&gt;
i.e &lt;strong&gt;SELECT *&lt;br&gt;
      FROM table_name&lt;br&gt;
      GROUP BY column1&lt;br&gt;
      HAVING column1&amp;lt;80;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AGGREGATE FUNCTIONS&lt;/strong&gt;&lt;br&gt;
SUM - adds up a selected column.&lt;br&gt;
COUNT(*) - counts the number of rows in a column.&lt;br&gt;
AVG - finds the average of  a selected column.&lt;br&gt;
GROUP BY - Groups rows.&lt;br&gt;
HAVING - filters rows after grouping.&lt;br&gt;
MIN - finds the minimum value in a row.&lt;br&gt;
MAX - finds the maximum value in a row.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;JOINS&lt;/strong&gt;&lt;br&gt;
They help us connect tables.&lt;br&gt;
For instance you have three tables.&lt;br&gt;
&lt;strong&gt;types&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Inner join&lt;/strong&gt;&lt;br&gt;
Returns only rows that have matching values in both tables.&lt;br&gt;
i.e &lt;strong&gt;SELECT column1&lt;br&gt;
      FROM table_name1&lt;br&gt;
      INNER JOIN table_name2&lt;/strong&gt;&lt;br&gt;
      ON table_name1.column1 = table_name2.column1;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Left join&lt;/strong&gt;&lt;br&gt;
It takes all the records on the left table and attaches matching records from the right table&lt;br&gt;
i.e &lt;strong&gt;SELECT column1, column2, column3&lt;br&gt;
      FROM table_name1&lt;br&gt;
      LEFT JOIN table_name2&lt;br&gt;
        ON table_name1_id = table_name2_id;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Right join&lt;/strong&gt;&lt;br&gt;
Returns all the rows from the right table and only matching rows from the left table.&lt;br&gt;
i.e &lt;strong&gt;SELECT column1&lt;br&gt;
      FROM table_name1&lt;br&gt;
      RIGHT JOIN table_name2&lt;br&gt;
      ON table_name1.column = table_name2.column&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Full join&lt;/strong&gt;&lt;br&gt;
Returns all rows when there is match either in the left or the right table.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>python</category>
      <category>powerbi</category>
      <category>excel</category>
    </item>
    <item>
      <title>DOCKER</title>
      <dc:creator>Bradley Kipkoech</dc:creator>
      <pubDate>Tue, 26 Aug 2025 05:56:26 +0000</pubDate>
      <link>https://dev.to/bradley_kipkoech_f8d47c35/docker-1lio</link>
      <guid>https://dev.to/bradley_kipkoech_f8d47c35/docker-1lio</guid>
      <description>&lt;p&gt;&lt;u&gt;DOCKER&lt;/u&gt;&lt;br&gt;
we first need to understand what docker is&lt;br&gt;
Docker is an open platform that enables developers to build, ship and run applications within an isolated, lightweight environments called containers.it simplifies the process of deploying and managing applications by packaging application and all its dependencies(libraries, system tools, code, runtime) into a single portable unit.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;key concepts&lt;/u&gt;&lt;br&gt;
&lt;u&gt;containers&lt;/u&gt;&lt;br&gt;
this are standalone, executable packages that contain everything an application needs to run.they share the host operating system's kernel, thus making them more lightweight and efficient.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;images&lt;/u&gt;&lt;br&gt;
this are read-onlt templates used to create containers.They act as blueprint, defining the application's environment and dependencies.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Dockerfile&lt;/u&gt;&lt;br&gt;
it is a text file containing instructions for building a docker image. it specifies the base image, dependencies and commands needed to set up the application's environment.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Docker engine&lt;/u&gt;&lt;br&gt;
this is the core component of docker which is responsible for building and running containers. it includes a server(daemon) , APIs and a command-line interface.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Benefits of docker&lt;/u&gt;&lt;br&gt;
portability&lt;br&gt;
efficiency&lt;br&gt;
isolation&lt;br&gt;
scalability&lt;br&gt;
simplified deployment&lt;/p&gt;

</description>
    </item>
    <item>
      <title>DE CONCEPTS</title>
      <dc:creator>Bradley Kipkoech</dc:creator>
      <pubDate>Tue, 26 Aug 2025 05:44:21 +0000</pubDate>
      <link>https://dev.to/bradley_kipkoech_f8d47c35/de-concepts-2fjo</link>
      <guid>https://dev.to/bradley_kipkoech_f8d47c35/de-concepts-2fjo</guid>
      <description>&lt;p&gt;&lt;u&gt;Columnar vs Row-based storage&lt;/u&gt;&lt;br&gt;
row-based storage stores the entire records together, making it efficient for transactional workloads that frequently access complete rows.Optimal for oltp where you typically need all columns of specific records.&lt;br&gt;
Columnar storage groups data by columns, enabling efficient compression and fast analytical queries that only access specific columns.ideal for olap workloads, data warehousing, and scenarios with selective column access patterns.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Partitioning&lt;/u&gt;&lt;br&gt;
Divides large datasets into smaller, managealbe segments based on specific criteria like data ranges, geogrphic regions or hash values. this improves query performance and enables parallel processing and simplifies data management tasks like archiving and backup.  &lt;/p&gt;

&lt;p&gt;&lt;u&gt;CAP theorem&lt;/u&gt;&lt;br&gt;
states that distributed systems can guarantee at most two of the three properties, consistency-all nodes see the same data, availability-system remains operation and partition torelance-system continues despite network failure.Modern systems often provide tunable consisitency levels, allowing different guarantees for different use cases within the same system&lt;/p&gt;

&lt;p&gt;&lt;u&gt;windowing in streaming&lt;/u&gt;&lt;br&gt;
divides continuous data streams into finite chunks for processing, tumbling windows are fixed-size, non-overlapping time intervals.Sliding windows overlap and move continuously. session windows group events based on activity periods with gaps indicating session boundaries&lt;br&gt;
inludes handling late-arriving data, watermark for determining window completeness and trigger conditions for window evaluation.system like apache flink and kafka streams provide sophisticated windowing capabilities with configurable lateness and results updating strategies.proper windowing enbales meaningful aggregations and analysis over unbounded data streams while managing memory usage and computational complexity.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Retry logic &amp;amp; Dead letter queues&lt;/u&gt;&lt;br&gt;
retry logic automatically reattempts failed operations with strategies like exponential backoff, fixed delays, or linear backoff. it handles transient failures and must be implemented carefully to avoid overwhelming systems or creating infinite loops&lt;br&gt;
DLQs capture messages that cannot be processed after exhausting retry attempts. they prevent message loss, enable failure analysis and allow for manual intervention or atlernatives.eg categorizing errors (transient vs permanent), implementing circuit breakers, adding jitter to prevent thundering herd problems, and monitoring retry patterns to identify systemic issues requiring architectural changes.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Backfilling and reprocessing&lt;/u&gt;&lt;br&gt;
processes historical data to populate new datasets or fill gaps in existing ones. common when introducing new features, fixing data quality issues or migrating systems, backfill jobs often process data in reverse chronological order to provide recent data first.&lt;/p&gt;

&lt;p&gt;reprocessing reruns data pipelines on existing data, typically after fixing bugs, updating business logic, or recovering from failures. it requires carefull consideration of downstream impacts and often involves versioning strategies to manage different data generations&lt;br&gt;
Challenges include managing computational resources, ensuring data consistency during the process, handling schema evolution, and coordinating with downstream consumers to prevent conflicts or inconsistencies.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Time travel &amp;amp; data versioning&lt;/u&gt;&lt;br&gt;
Time travel allows quering historical versions of data, enabling analysis of changes over time and recovery from accidental modifications. systems like snowflake, bigquery and delta lake provide built-in time travel capabilities with configurable retention periods&lt;br&gt;
data versioning tracks changes of datasets and schemas, similar to version control for code. it enables reproducible analytics, A/B testing of data transformations and rollback capabilities when issues are discovered.&lt;/p&gt;

&lt;p&gt;Implementation approaches include snapshot-based versioning, log-based change tracking, and copy-on-write mechanisms. These features are crucial for data debugging, compliance auditing, and maintaining data science experiment reproducibility.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Distributed processing concepts&lt;/u&gt;&lt;br&gt;
Distributed processing enables handling large-scale data by spreading computation across multiple machines. Key concepts include data locality (processing data where it's stored), fault tolerance through replication and checkpointing, and coordination mechanisms for task distribution.&lt;br&gt;
Frameworks like Apache Spark use concepts such as resilient distributed datasets (RDDs), lazy evaluation for optimization, and automatic task scheduling. Map-reduce paradigms break complex operations into parallelizable steps, while more modern frameworks support iterative algorithms and real-time processing.&lt;br&gt;
Challenges include managing data shuffling costs, handling stragglers (slow tasks), ensuring fault tolerance, and optimizing resource utilization across the cluster while maintaining data consistency and system reliability.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>DATA WAREHOUSES AND LAKE HOUSES</title>
      <dc:creator>Bradley Kipkoech</dc:creator>
      <pubDate>Mon, 11 Aug 2025 23:01:11 +0000</pubDate>
      <link>https://dev.to/bradley_kipkoech_f8d47c35/data-warehouses-and-lake-houses-282p</link>
      <guid>https://dev.to/bradley_kipkoech_f8d47c35/data-warehouses-and-lake-houses-282p</guid>
      <description>&lt;p&gt;&lt;u&gt;Data warehouses&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;Data warehouse is a centralized repository which is designed to store, manage and analyze large volumes of current and historical data from various departments on an organization. They are optimized for analytical processing and business intelligence&lt;/p&gt;

&lt;p&gt;&lt;u&gt;characteristics of a warehouse&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;they can handle massive amounts of data&lt;br&gt;
they not only deal with current data but also historical data&lt;br&gt;
the system architecture prioritizes query performance and data retrieval speed over transactional processing, making complex analytical queries execute efficiently&lt;br&gt;
they consolidate information from multiple data sources&lt;br&gt;
business intelligence support because of its ability to analyze and visualize data&lt;br&gt;
the structure allows the data in it to be easily accessible.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;benefits&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;improved decision making&lt;br&gt;
enhanced performance&lt;br&gt;
data quality and consistency&lt;/p&gt;

&lt;p&gt;&lt;u&gt;OLTP VS OLAP&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;Online transactional processing systems are designed to handle real-time transactional operations that occur in day-to-day business activities.eg banking systems.&lt;/p&gt;

&lt;p&gt;online analytical processing are optimized for complex analysis, reporting and business intelligence activities.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Key Differences and Implications&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;Transaction vs. Analysis: OLTP systems excel at processing individual transactions quickly and accurately, while OLAP systems specialize in analyzing patterns across large datasets.&lt;/p&gt;

&lt;p&gt;Data Freshness: OLTP systems work with real-time data, whereas OLAP systems typically work with data that may be hours or days old, depending on the ETL schedule.&lt;/p&gt;

&lt;p&gt;Concurrency Requirements: OLTP systems must handle many simultaneous users performing transactions, while OLAP systems typically serve fewer concurrent users running complex queries.&lt;/p&gt;

&lt;p&gt;Failure Impact: OLTP system downtime directly affects business operations, while OLAP system unavailability impacts reporting and analysis capabilities.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;data modelling&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;data modelling is the systematic process of creating abstract representation of data structures, relationships and constraints to support a specific business requirement and analytical needs.&lt;br&gt;
data modelling is like creating an architectural blueprint before constructing a blueprint.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;types of models&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;conceptual model - provides high-level, business oriented view of data requirements without technical implementation details.&lt;br&gt;
Example Elements:&lt;/p&gt;

&lt;p&gt;Customer entity related to Order entity&lt;br&gt;
Product entity connected to Category entity&lt;br&gt;
Employee entity associated with Department entity&lt;/p&gt;

&lt;p&gt;logical model - this adds technical detail while remaining independent to specific database management systems.&lt;br&gt;
Additional Elements:&lt;/p&gt;

&lt;p&gt;Customer_ID (Integer, Primary Key)&lt;br&gt;
Customer_Name (VARCHAR(100), NOT NULL)&lt;br&gt;
Order_Date (DATE, NOT NULL)&lt;/p&gt;

&lt;p&gt;Physical Model - this specifies how data will be stored in a particular database system&lt;/p&gt;

&lt;p&gt;&lt;u&gt;dimensional modelling for data warehousing&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;it is a preferred approach of designing a data warehouse structure because it optimizes for analytical query performance while maintaining business user comprehension.&lt;/p&gt;

&lt;p&gt;fact tables - serves as the central repo for measurable business metrics and form the core dimensional tables.&lt;/p&gt;

&lt;p&gt;dimensional tables - it provides the descriptive context that makes fact table measurements meaningful and analyzable&lt;/p&gt;

</description>
    </item>
    <item>
      <title>CHANGE DATA CAPTURE</title>
      <dc:creator>Bradley Kipkoech</dc:creator>
      <pubDate>Mon, 11 Aug 2025 22:15:32 +0000</pubDate>
      <link>https://dev.to/bradley_kipkoech_f8d47c35/change-data-capture-51b1</link>
      <guid>https://dev.to/bradley_kipkoech_f8d47c35/change-data-capture-51b1</guid>
      <description>&lt;p&gt;CDC indentifies and captures changes made to data in source. it enables incremental data synchronization.it tracks insert, update and deletion operations.involves reading database transaction logs,using database triggers and tracking modification timestamps.Crutial for maintaining data warehouses enabling real-time replication.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Idempotency&lt;/u&gt;&lt;br&gt;
this ensures that performning the same operation multiple times produces identical results. this prevents duplicate records and maintains data consistency when duplicate messages are processed&lt;br&gt;
implementation include using unique keys, upsert operations, deduplication windows in streaming systems and maintaining processing state.eg using combination of timestamp and transactional id as a composite key ensuring reprocessing the same event doesn't create duplicates.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>BATCH VS STREAM INGENSTION</title>
      <dc:creator>Bradley Kipkoech</dc:creator>
      <pubDate>Mon, 11 Aug 2025 22:02:30 +0000</pubDate>
      <link>https://dev.to/bradley_kipkoech_f8d47c35/batch-vs-stream-ingenstion-hcg</link>
      <guid>https://dev.to/bradley_kipkoech_f8d47c35/batch-vs-stream-ingenstion-hcg</guid>
      <description>&lt;p&gt;Batch ingestion processes data in discrete chunks at scheduled intervals.Mostly historical data in large volumes.Suitable where latency isn't critical.Involves etl jobs processing transaction log.&lt;br&gt;
Streaming ingestion processes data as it arrives,enabling rea time analytics for real time analytics and immediate response.essential for fraud dectection, monitoring systems and live dashboards.&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>dataengineering</category>
      <category>streaming</category>
      <category>analytics</category>
    </item>
    <item>
      <title>Data Governance Framework and Data Security.</title>
      <dc:creator>Bradley Kipkoech</dc:creator>
      <pubDate>Mon, 11 Aug 2025 21:54:47 +0000</pubDate>
      <link>https://dev.to/bradley_kipkoech_f8d47c35/data-governance-framework-and-data-security-43pj</link>
      <guid>https://dev.to/bradley_kipkoech_f8d47c35/data-governance-framework-and-data-security-43pj</guid>
      <description>&lt;p&gt;&lt;u&gt;Data Governance&lt;/u&gt;&lt;br&gt;
it is a framework that defines how an organization manages, protects and derive value from its data assets.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Objectives&lt;/u&gt;&lt;br&gt;
Data quality management&lt;br&gt;
Data security and privacy&lt;br&gt;
Data stewardship&lt;br&gt;
Regulatory compliance&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Business value of data goverenance&lt;/u&gt;&lt;br&gt;
Risk mitigation&lt;br&gt;
Improved decision making&lt;br&gt;
operational efficiency&lt;br&gt;
Competitive advantage&lt;br&gt;
Cost reduction&lt;/p&gt;

</description>
      <category>data</category>
      <category>security</category>
      <category>privacy</category>
      <category>compliance</category>
    </item>
    <item>
      <title>ETL pipeline and airflow</title>
      <dc:creator>Bradley Kipkoech</dc:creator>
      <pubDate>Wed, 23 Jul 2025 12:13:03 +0000</pubDate>
      <link>https://dev.to/bradley_kipkoech_f8d47c35/etl-pipeline-2677</link>
      <guid>https://dev.to/bradley_kipkoech_f8d47c35/etl-pipeline-2677</guid>
      <description>&lt;p&gt;Extraction, transform, load is a process of extracting data from various data sources, transforming them to clean formats which then can be loaded in a database.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Extraction &amp;amp; transformation&lt;/u&gt;&lt;br&gt;
in extraction we basically get data from various data sources, the data sources could include databases, api keys, images, videos and various files i.e json, xml.&lt;br&gt;
i use python to extract the files form the data sources. i recently worked with api key where i get the url and assign a variable to it and used the requests library to get the data from the api. Then i change the data into a json format. i the used pandas library to create a data frame for the data i have. which covers for the transformation part.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;load&lt;/u&gt;&lt;br&gt;
loading the data into the database using the engine library where i had to input my database connection string, which includes the database name, the host plus port and the password.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Automation&lt;/u&gt;&lt;br&gt;
say one is trying to get weather data for a particular place, now the weather changes every moment, one needs to run the etl process every now and then in order to gets the latest data. It might be hectic to keep running the file manually, that is where airflow comes in, airflow is an apache which automates an etl process​ A DAG (directed acyclic graph) is a mathematical structure consisting of nodes and edges. In Airflow, a DAG represents a data pipeline or workflow with a start and an end. The mathematical properties of DAGs make them useful for building data pipelines: Directed means there is a clear direction of flow between tasks.&lt;br&gt;
&lt;u&gt;all you need to know about airflow&lt;/u&gt;&lt;br&gt;
first you need to create a virtual environment. In python we use the following command: python -m venv venv(name of you environment) the you need to activate it by the following command: source venv/bin/activate.&lt;br&gt;
You the need to install airflow using the following command: pip install apache airflow==2.8.0(this is the version), if you're using a linux os or wsl you might need to use sudo. After installation is done, you need to adjust some config files to suit you to use. So you need to change path to your airflow folder i.e cd airflow, then you open the configuration file which is normally saved as airflow.cfg by using the following command: nano airflow.cfg. What we need to edit in this file first is the executor we change it to localexecutor because we want our processes to be executed locally, this works when you are not using sqlite as your database. We also edit to show examples and we set it to false so our airflow user interface won't show examples of dags. &lt;br&gt;
the next is the db config where we need to add our connection string, for me i used a postgres database and this is the syntax:db_url = "postgresql://postgres:your password@localhost:5432/db name".&lt;br&gt;
we need to save the file the exit back to our folder, where we can run airflow db migrate to allow the airflow to move from the default sqlite to your preferred database. we then run airflow db init to initialize our database and airflow webserver and follow the port to local host to enable us view our ui.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
