DEV Community: Gacheri-Mutua

Building a simple ETL pipeline with python and PostgreSQL.

Gacheri-Mutua — Mon, 15 Jun 2026 11:45:41 +0000

Data is the backbone of modern applications, but it rarely arrives in a perfectly structured format ready for analysis. Data engineering is the development, execution, and support of systems and processes that take in raw data and produce high-quality, consistent information that supports analysis and decision making.

An ETL (Extract, Transform, Load) pipeline will extract data (from one or multiple sources), transform it (by cleaning or changing the format) and save it in a database. The data pipeline illustrated in this article follows the three phases:
Prerequisite Note: This project assumes a basic familiarity with Linux environments, specifically using the terminal to manage directories and configure environment variables.

Extract: Fetch live news articles about "Apple" via an HTTP GET request and parse the JSON response.

# importing relevant packages
import requests
import os
from dotenv import load_dotenv
import pandas as pd
from sqlalchemy import create_engine



load_dotenv()

url = ('https://newsapi.org/v2/everything?'
       'q=Apple&'
       'from=2026-05-11&'
       'sortBy=popularity&'
       'apiKey=' + os.getenv('NEWS_API_KEY'))

response = requests.get(url)
print (response.json())

#converting the json file into  dictionary
data = response.json()
type(data)
data

#getting the list
articles =data['articles']
type(articles)
articles

Transform: Filter out unnecessary fields and convert the raw JSON into a structured pandas dataframe.

#converting to dataframe
articles_df = pd.DataFrame(articles)
articles_df.head()

#cleaning
articles_df.drop(columns={'source', 'urlToImage'}, inplace=True)
articles_df = articles_df.drop(columns={'source','urlToImage'})

Load: Connect to a PostgreSQL database using SQLAlchemy and save the cleaned data.

DB_HOST = os.getenv('DB_HOST')
DB_USER = os.getenv('DB_USER')
DB_NAME = os.getenv('DB_NAME')
DB_PASSWORD = os.getenv('DB_PASSWORD')
DB_PORT = os.getenv('DB_PORT')

engine = create_engine(f'postgresql+psycopg2://{DB_USER}:{DB_PASSWORD}@{DB_HOST}:{DB_PORT}/{DB_NAME}')

articles_df.to_sql('articles', con=engine,if_exists='replace', index=False)

Code mechanics
The pipeline into distinct extract(), transform(), and load() functions orchestrated by a main(). If the data source changes the extract() function needs adjustment without losing the connection to the database.

The transformed data is successfully loaded into a PostgreSQL and the pipeline uses a .env file in your root directory to securely store your API keys and database credentials and requires to have the necessary libraries installed before writing the script. Having done this, we have successfully built a modular ETL pipeline.

Conclusion
This project gave a hands-on view of how a data pipeline is built and covers the full lifecycle of Extract → Transform → Load.

## Supervised learning — focus on classification

Gacheri-Mutua — Mon, 29 Sep 2025 19:48:34 +0000

Supervised learning is a family of machine learning methods where models learn a mapping from inputs to known outputs using labeled examples. You train a model on a dataset of input features paired with target labels so it can predict labels for new, unseen inputs. The supervision (labels) guides the model to discover patterns, relationships, or decision boundaries that connect features to outcomes.

How classification works

Classification is the branch of supervised learning where the target is categorical (discrete classes). At a high level classification proceeds in these steps:

Data collection and labeling — gather feature vectors and assign class labels.
Preprocessing — clean data, handle missing values, encode categorical variables, scale numeric features, and split into train/validation/test sets.
Model selection and training — pick a classifier and fit it to the training data by minimizing a suitable loss (e.g., cross-entropy, hinge loss) using optimization methods.
Evaluation — measure performance with metrics appropriate for the task (accuracy, precision, recall, F1, ROC AUC, confusion matrix), using validation/test data and possibly cross-validation.
Calibration and thresholding — for probabilistic classifiers, convert scores to calibrated probabilities or choose decision thresholds to trade off precision vs recall.
Deployment and monitoring — deploy the model and monitor drift, performance degradation, and data quality.

Common classification models

Logistic Regression: simple, interpretable, probabilistic linear classifier; effective when classes are linearly separable or after appropriate feature transforms.
Support Vector Machine (SVM): maximizes margin; kernel SVM handles nonlinearity; effective on medium-sized datasets.
Decision Tree: interpretable rules, handles mixed data types, prone to overfitting unless pruned.
Random Forest: ensemble of trees; strong baseline, robust to overfitting, handles missing values and categorical features well.
Gradient Boosting Machines (XGBoost, LightGBM, CatBoost): high-performance tree ensembles, excellent for tabular data.
k-Nearest Neighbors (knn): simple, nonparametric, effective for low-dimensional data but costly at inference for large datasets.
Naive Bayes: fast, works well with high-dimensional sparse data (e.g., text), assumes feature independence.
Neural Networks / Deep Learning: from shallow MLPs to CNNs/RNNs/Transformers; state-of-the-art on images, text, speech, and complex structured data when large labeled datasets are available.
Calibrated and probabilistic variants: Platt scaling, isotonic regression, Bayesian classifiers, and more for uncertainty estimates.

Model selection considerations

Data size and dimensionality: simple models (logistic regression, naive Bayes) often suffice for small datasets; tree ensembles or deep nets require more data.
Feature types: trees handle mixed types and missingness; linear models require careful encoding/scaling.
Interpretability: logistic regression and shallow trees are easier to explain; deep models and ensembles are less transparent.
Latency and resource constraints: k-NN and large ensembles can be slow at inference; model compression or simpler models may be needed.
Imbalanced classes: prefer metrics beyond accuracy (precision/recall, F1, ROC-AUC) and use resampling, class-weighting, focal loss, or one-vs-rest schemes as appropriate.

My views and insights

Start simple and iterate: I find starting with a well-regularized logistic regression or a small decision tree gives a quick baseline, reveals data issues, and informs feature engineering. Only escalate to complex models when simpler baselines plateau.
Feature engineering often matters more than model choice for tabular data: creating informative features, careful encoding, handling missing values, and employing domain knowledge frequently produce larger gains than swapping classifiers.
Ensembles are powerful but come with cost: random forests and gradient boosting reliably boost performance, but they reduce interpretability and increase inference cost; use them when the performance gain justifies complexity.
Probabilities and calibration are underappreciated: in many applications (medical, finance), well-calibrated probabilities matter more than raw accuracy. Calibration methods and evaluating with proper scoring rules (Brier score, log loss) should be standard practice.
Evaluation must align with the real objective: optimize and validate against business or safety-relevant metrics (e.g., cost-sensitive measures, recall at fixed precision) rather than generic accuracy.
Reproducible pipelines win long-term: automated preprocessing, clear train/validation splits (time-based when applicable), and versioned datasets/models reduce surprises when models are deployed.

Challenges I’ve faced with classification

High-dimensional sparse data: in text or categorical-heavy datasets, feature explosion makes some models slow or prone to overfitting; dimensionality reduction or regularization is required.
Overfitting and generalization: tuning complex models without robust validation induces overfitting. Cross-validation, nested CV for hyperparameter tuning, and simple baselines mitigate this.

Practical checklist for a classification project

Split data respecting temporal or group structure if present.
Baseline with simple models (e.g logistic regression).
Engineer and validate features; encode categorical data sensibly.
Choose evaluation metrics that reflect business needs; use cross-validation.
Try robust models (random forest, gradient boosting) and calibrate probabilities.

Rags for dummies

Gacheri-Mutua — Mon, 29 Sep 2025 19:12:25 +0000

Retrieval augmented generation (RAG) is an artificial intelligence (AI) architecture that incorporates external knowledge sources to enhance the capabilities of large language models (LLMs). RAG pulls in relevant information from outside databases and amplifies input from LLM so that the output can include more relevant, accurate, and contextually appropriate responses. This in turn makes LLMs more powerful by combining them with the ability to retrieve real-time data.

How do rags work?

The process begins with an input query or prompt. This could be a question, a statement, or any text that requires a response. The model first analyzes this input to understand its context and intent.
When prompted, the system searches a large set of documents (such as PDFs, FAQs, web pages, or databases) using a retriever model, often based on semantic similarity or keyword matching and selects the most relevant pieces of content.

RAG integrates this information and original input query and the embeddings of the retrieved documents are combined to form a comprehensive context for the generative model.

The retrieved documents are passed to a generator model which uses them to craft a coherent, contextually accurate response. In this way, the responses are plausible and grounded in real data.

Unsupervised Learning: A Focus on Clustering

Gacheri-Mutua — Sat, 27 Sep 2025 19:21:22 +0000

Unsupervised learning is a type of machine learning that deals with data that does not have labeled responses. Unlike supervised learning, where the model is trained on a dataset with known outputs, unsupervised learning aims to find hidden patterns or intrinsic structures in the input data.

In unsupervised learning, the algorithm analyzes the input data to identify patterns or groupings without any prior knowledge of the outcomes. The process typically involves the following steps:

Data Input: The algorithm receives a dataset containing multiple features.
Pattern Recognition: The model processes the data to identify similarities and differences among the data points.
Clustering: Based on the identified patterns, the algorithm groups the data points into clusters, where points in the same cluster are more similar to each other than to those in other clusters.

The primary goal is to explore the data and uncover its structure, which can lead to insights that inform further analysis or decision-making.
Several models are commonly used in clustering within unsupervised learning:

K-Means Clustering: This algorithm partitions the dataset into K distinct clusters based on feature similarity. It iteratively assigns data points to the nearest cluster centroid and updates the centroids until convergence.

Hierarchical Clustering: This method builds a hierarchy of clusters through either agglomerative (bottom-up) or divisive (top-down) approaches. It creates a dendrogram that visually represents the relationships between clusters.

One of the most compelling aspects of unsupervised learning is it allows for the exploration of data without preconceived notions. This can lead to surprising insights that might not have been considered initially. For instance, clustering algorithms can reveal natural groupings in customer data, enabling businesses to tailor their marketing strategies more effectively.

This can serve as a powerful complement to supervised learning where clustering can be used to preprocess data by identifying groups that can then be labeled for supervised learning tasks.

Excel; still an enigma, for now.

Gacheri-Mutua — Wed, 11 Jun 2025 21:47:52 +0000

by someone who only wanted to do summation.

My interaction with Excel this week has shown me that working with it is similar to knowing that one person who doesn't say much but has layers to them that you begin to notice after one interaction.

Initial interaction with excel

On the surface, Excel is a spreadsheet that you organise and format data for storage; however, you find it appealing. It is quite humbling to think you are tech-savvy, then spending the next two hours trying to convert the percentages to integers.

This cordial spreadsheet holds the power to analyse, automate, model, present data, and, for the unfortunate, cause despair. One missing comma and you can't format the whole column in a dataset. Excel is excellent at data analysis, allowing users to explore trends, patterns, and outliers in datasets, which is particularly useful for departments such as sales, marketing, and finance to make informed forecasts.

Exciting features

One feature that stuck out to me was pivot tables. They are easy to navigate and are useful for grouping, filtering, and comparing different information across multiple capacities. Conditional formatting is another handy tool that highlights cells based on specific rules to gain targeted information, such as identifying underperforming sectors in a given field. The most exciting part is that with Excel, different tools can be used in combinations and permutations according to how much you know Excel.

And now,

The more I learn about Excel, the more I see data differently since structures and hidden relationships that went unnoticed are being revealed. Instead of feeling overwhelmed, you begin to ask better questions, and that is a fine indicator that you are engaging with data analytically for interpretation. Then, Excel becomes a lens to spot the missing links, or what escapes the untrained eye.