<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mubarak Mohamed</title>
    <description>The latest articles on DEV Community by Mubarak Mohamed (@moubarakmohame4).</description>
    <link>https://dev.to/moubarakmohame4</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F999794%2F2ee2e2bc-aedc-4707-8b95-6bb0e18ec328.png</url>
      <title>DEV Community: Mubarak Mohamed</title>
      <link>https://dev.to/moubarakmohame4</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/moubarakmohame4"/>
    <language>en</language>
    <item>
      <title>Why Decision Trees Don't Need Feature Scaling (And Why This Matters)</title>
      <dc:creator>Mubarak Mohamed</dc:creator>
      <pubDate>Tue, 24 Feb 2026 12:40:43 +0000</pubDate>
      <link>https://dev.to/moubarakmohame4/why-decision-trees-dont-need-feature-scaling-and-why-this-matters-91d</link>
      <guid>https://dev.to/moubarakmohame4/why-decision-trees-dont-need-feature-scaling-and-why-this-matters-91d</guid>
      <description>&lt;p&gt;Ever spent hours normalizing your dataset only to wonder if it was really necessary? If you're using tree-based algorithms, I've got news for you...&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Decision Trees, Random Forests, XGBoost, and LightGBM don't need feature scaling&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Distance- and gradient-based algorithms (k-NN, SVM, Neural Networks) absolutely do&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Why?&lt;/strong&gt; Trees use threshold comparisons, not distance calculations&lt;/p&gt;

&lt;p&gt;Let's dig into why this is the case and prove it with code!&lt;/p&gt;
&lt;h2&gt;
  
  
  Wait, What's Feature Scaling Again?
&lt;/h2&gt;

&lt;p&gt;Feature scaling transforms your numerical variables to a common scale. The two most popular methods:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Min-Max Scaling&lt;/strong&gt; → squashes values between 0 and 1&lt;br&gt;
&lt;strong&gt;Standardization (Z-score)&lt;/strong&gt; → centers data around 0 with std dev of 1&lt;/p&gt;
&lt;h3&gt;
  
  
  Quick example:
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before scaling
&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;25000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;75000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100000&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# After Min-Max scaling
&lt;/span&gt;&lt;span class="n"&gt;salary_scaled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.33&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.67&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;age_scaled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.21&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.61&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
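
&lt;p&gt;Here's what those two transforms look like in practice, a quick scikit-learn sketch using the salary/age values above:&lt;/p&gt;

```python
# Quick sketch of both transforms with scikit-learn, using the values above
# (columns: salary, age).
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

data = np.array([[25000, 22], [50000, 30], [75000, 45], [100000, 60]], dtype=float)

minmax = MinMaxScaler().fit_transform(data)    # each column squashed into [0, 1]
zscore = StandardScaler().fit_transform(data)  # each column: mean 0, std dev 1

print(minmax[:, 0])  # salary column: 0.0 to 1.0 in equal steps
print(minmax[:, 1])  # age column: [0.0, ~0.21, ~0.61, 1.0]
```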

&lt;h2&gt;
  
  
  🌲 How Decision Trees Actually Work
&lt;/h2&gt;

&lt;p&gt;Here's the key insight: &lt;strong&gt;Decision Trees make decisions based on threshold comparisons, not distances&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;At each node, a tree asks questions like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Is salary &amp;gt; 50000?
  ├─ YES → Is age &amp;gt; 35?
  │        ├─ YES → Prediction A
  │        └─ NO → Prediction B
  └─ NO → Prediction C
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The algorithm:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Tests every possible threshold on every feature&lt;/li&gt;
&lt;li&gt;Calculates a purity metric (Gini, Entropy, or Variance)&lt;/li&gt;
&lt;li&gt;Picks the split that best separates the data&lt;/li&gt;
&lt;/ol&gt;
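
&lt;p&gt;Those three steps can be sketched in a few lines. This is a toy version of the split search (function names like &lt;code&gt;best_split&lt;/code&gt; are mine, not scikit-learn's API): test every candidate threshold on one feature and keep the split with the lowest weighted Gini impurity.&lt;/p&gt;

```python
# Toy split search: try every candidate threshold, score with weighted Gini.
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(feature, labels):
    """Return (threshold, weighted Gini) of the best binary split."""
    order = np.argsort(feature)
    x, y = feature[order], labels[order]
    best_thr, best_score = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # no threshold can separate equal values
        thr = (x[i] + x[i - 1]) / 2  # midpoint between consecutive values
        left, right = y[:i], y[i:]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_thr, best_score = thr, score
    return best_thr, best_score

salary = np.array([30000, 45000, 60000, 75000, 90000])
labels = np.array([0, 0, 0, 1, 1])
thr, score = best_split(salary, labels)
print(f"best threshold: {thr}, weighted Gini: {score}")  # splits perfectly between 60000 and 75000
```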

&lt;h3&gt;
  
  
  The purity metrics:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Gini Impurity&lt;/strong&gt; (classification):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Gini&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;Σ&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_i&lt;/span&gt;&lt;span class="err"&gt;²&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Entropy&lt;/strong&gt; (classification):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Entropy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;Σ&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_i&lt;/span&gt; &lt;span class="err"&gt;×&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="err"&gt;₂&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_i&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Variance Reduction&lt;/strong&gt; (regression):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Variance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="err"&gt;×&lt;/span&gt; &lt;span class="n"&gt;Σ&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;ȳ&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;²&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Critical point:&lt;/strong&gt; None of these calculations involve distances between observations!&lt;/p&gt;
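
&lt;p&gt;Written out directly, each metric is only a few lines (a sketch; the helper names are mine). Note that every input is a set of labels or targets, never a pair of feature vectors:&lt;/p&gt;

```python
# The three purity metrics above, written out directly.
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)            # 1 - Σ p_i²

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))         # -Σ p_i × log₂(p_i)

def variance(y):
    y = np.asarray(y, dtype=float)
    return np.mean((y - y.mean()) ** 2)    # (1/n) × Σ(y_i - ȳ)²

mixed = np.array([0, 0, 1, 1])  # a perfectly mixed two-class node
print(gini(mixed))     # 0.5 (maximum impurity for two classes)
print(entropy(mixed))  # 1.0 (one full bit of uncertainty)
```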

&lt;h2&gt;
  
  
  The Magic: Why Scaling Doesn't Matter
&lt;/h2&gt;

&lt;p&gt;Let's say we're testing a split on salary:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Original data:&lt;/strong&gt; &lt;code&gt;salary &amp;gt; 60000&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;Scaled data:&lt;/strong&gt; &lt;code&gt;salary_scaled &amp;gt; 0.5&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;These two conditions &lt;strong&gt;separate the exact same observations&lt;/strong&gt;! 🎯&lt;/p&gt;
&lt;h3&gt;
  
  
  Here's why:
&lt;/h3&gt;

&lt;p&gt;Scaling is a &lt;strong&gt;strictly monotonic transformation&lt;/strong&gt;: it preserves the order of the values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Original
&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;30000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;45000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;60000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;75000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;90000&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# After Min-Max scaling  
&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.00&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The order stays the same: &lt;code&gt;30000 &amp;lt; 45000 &amp;lt; 60000&lt;/code&gt; → &lt;code&gt;0.00 &amp;lt; 0.25 &amp;lt; 0.50&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Since trees test all possible thresholds, they'll find the same optimal split regardless of scale!&lt;/p&gt;
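
&lt;p&gt;A two-line sanity check makes this concrete: the raw threshold and its scaled counterpart select exactly the same observations.&lt;/p&gt;

```python
# The raw split and the equivalent scaled split pick out the same rows.
import numpy as np

salary = np.array([30000, 45000, 60000, 75000, 90000])
scaled = (salary - salary.min()) / (salary.max() - salary.min())  # Min-Max

mask_raw = salary > 60000    # split on the original scale
mask_scaled = scaled > 0.50  # the equivalent split after scaling

print(np.array_equal(mask_raw, mask_scaled))  # True
```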

&lt;h2&gt;
  
  
  Proof Time: Let's Code!
&lt;/h2&gt;

&lt;p&gt;Let's prove this with a real experiment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;make_classification&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cross_val_score&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.tree&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DecisionTreeClassifier&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StandardScaler&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;accuracy_score&lt;/span&gt;

&lt;span class="c1"&gt;# Set random seed
&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Generate dataset with wildly different scales
&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;make_classification&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                          &lt;span class="n"&gt;n_informative&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create different scales intentionally
&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;      &lt;span class="c1"&gt;# Scale: 0-100
&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;    &lt;span class="c1"&gt;# Scale: 0-10000
&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;3000&lt;/span&gt;  &lt;span class="c1"&gt;# Scale: 1000-5000
&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Feature scales:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Feature 0: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Feature 1: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Feature 2: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Split data
&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Scale data
&lt;/span&gt;&lt;span class="n"&gt;scaler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StandardScaler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;X_train_scaled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scaler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;X_test_scaled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scaler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Model 1: WITHOUT scaling
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dt_raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DecisionTreeClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dt_raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;acc_raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dt_raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;cv_raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cross_val_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dt_raw&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WITHOUT scaling: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;acc_raw&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CV score: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cv_raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (+/- &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cv_raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;std&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Model 2: WITH scaling
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dt_scaled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DecisionTreeClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dt_scaled&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train_scaled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;acc_scaled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dt_scaled&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test_scaled&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;cv_scaled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cross_val_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dt_scaled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_train_scaled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WITH scaling: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;acc_scaled&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CV score: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cv_scaled&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (+/- &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cv_scaled&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;std&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Results:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITHOUT scaling: 0.9400
CV score: 0.9200 (+/- 0.0183)

WITH scaling: 0.9400
CV score: 0.9200 (+/- 0.0183)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Identical performance!&lt;/strong&gt; 🎉&lt;/p&gt;
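
&lt;p&gt;If you want to go one step further than matching accuracy, you can compare the learned trees themselves. Here's a sketch on a fresh toy dataset (not the experiment above): the two trees should split on the same feature at every node, and only the stored threshold values differ, because they live on different scales.&lt;/p&gt;

```python
# Sketch: fit identical trees on raw and Min-Max-scaled copies of one dataset
# and compare the learned structures node for node.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_scaled = MinMaxScaler().fit_transform(X)

tree_raw = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_scaled, y)

# Same feature chosen at every node; thresholds differ only in scale.
same_features = np.array_equal(tree_raw.tree_.feature, tree_scaled.tree_.feature)
same_preds = np.array_equal(tree_raw.predict(X), tree_scaled.predict(X_scaled))
print(same_features, same_preds)
```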

&lt;h2&gt;
  
  
  All Tree-Based Algorithms Follow This Rule
&lt;/h2&gt;

&lt;p&gt;This applies to the entire tree family:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Algorithm&lt;/th&gt;
&lt;th&gt;Needs Scaling?&lt;/th&gt;
&lt;th&gt;Why Not?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Decision Tree&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Threshold comparisons&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Random Forest&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Ensemble of decision trees&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extra Trees&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Random threshold selection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gradient Boosting&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Sequential tree building&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;XGBoost&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Optimized tree splits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LightGBM&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Binning preserves order&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CatBoost&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Categorical encoding + tree splits&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  But These Algorithms DO Need Scaling
&lt;/h2&gt;

&lt;p&gt;For contrast, here's why distance-based algorithms are picky:&lt;/p&gt;

&lt;h3&gt;
  
  
  k-Nearest Neighbors (k-NN)
&lt;/h3&gt;

&lt;p&gt;Uses Euclidean distance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;distance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;√&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="err"&gt;₁&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="err"&gt;₂&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;²&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="err"&gt;₁&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="err"&gt;₂&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;²&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;code&gt;salary: 50000-51000&lt;/code&gt; and &lt;code&gt;age: 30-50&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;distance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;√&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="mi"&gt;50000&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;51000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;²&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;²&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;distance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;√&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1000000&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="err"&gt;≈&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Age is completely dominated by salary!&lt;/strong&gt; Without scaling, age becomes irrelevant.&lt;/p&gt;
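
&lt;p&gt;Plugging those numbers in makes the imbalance concrete: dropping the age term entirely barely changes the distance.&lt;/p&gt;

```python
# The two distances from the example above, computed explicitly: removing the
# age term changes the result by only ~0.2 out of ~1000.
import math

d = math.sqrt((50000 - 51000) ** 2 + (30 - 50) ** 2)
d_salary_only = math.sqrt((50000 - 51000) ** 2)
print(round(d, 1), round(d_salary_only, 1))  # 1000.2 1000.0
```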

&lt;h3&gt;
  
  
  Let's prove it:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.neighbors&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KNeighborsClassifier&lt;/span&gt;

&lt;span class="c1"&gt;# k-NN without scaling
&lt;/span&gt;&lt;span class="n"&gt;knn_raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KNeighborsClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_neighbors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;knn_raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;acc_knn_raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;knn_raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# k-NN with scaling
&lt;/span&gt;&lt;span class="n"&gt;knn_scaled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KNeighborsClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_neighbors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;knn_scaled&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train_scaled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;acc_knn_scaled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;knn_scaled&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test_scaled&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;k-NN WITHOUT scaling: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;acc_knn_raw&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;k-NN WITH scaling: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;acc_knn_scaled&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Results:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;k-NN WITHOUT scaling: 0.8800
k-NN WITH scaling: 0.9633
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;That's an 8.3-point accuracy jump (a ~9.5% relative improvement)!&lt;/strong&gt; Scaling is critical for k-NN.&lt;/p&gt;

&lt;h3&gt;
  
  
  Other sensitive algorithms:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;SVM&lt;/strong&gt; → Optimizes geometric margins&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Logistic Regression&lt;/strong&gt; → Gradient descent sensitive to magnitude&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Neural Networks&lt;/strong&gt; → Gradient stability requires normalized inputs&lt;/p&gt;
&lt;h2&gt;
  
  
  🤓 Edge Cases: When You Might Still Scale Trees
&lt;/h2&gt;

&lt;p&gt;While not necessary, scaling can help in these scenarios:&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Feature Importance Interpretation
&lt;/h3&gt;

&lt;p&gt;Some implementations calculate importance based on total criterion reduction. Variables with larger ranges might appear artificially more important.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; Usually negligible, but worth checking in extreme cases (0-1 vs 0-1000000)&lt;/p&gt;
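&lt;p&gt;This is easy to check yourself. A minimal sketch on synthetic data (the feature ranges and target rule below are made up for illustration): fit the same Random Forest on raw and standardized inputs and compare the importances.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Feature 0 lives in [0, 1], feature 1 in [0, 1_000_000]
X = np.column_stack([rng.uniform(0, 1, 500), rng.uniform(0, 1_000_000, 500)])
y = ((X[:, 0] + X[:, 1] / 1_000_000) * 0.5).round().astype(int)

rf_raw = RandomForestClassifier(random_state=0).fit(X, y)
rf_scaled = RandomForestClassifier(random_state=0).fit(
    StandardScaler().fit_transform(X), y
)

# Standardization is monotonic per feature, so the trees pick the same
# splits and the impurity-based importances come out (near) identical
print(rf_raw.feature_importances_)
print(rf_scaled.feature_importances_)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;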
&lt;h3&gt;
  
  
  2. Regularization in Advanced Models
&lt;/h3&gt;

&lt;p&gt;XGBoost and LightGBM offer L1/L2 regularization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;xgboost&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;xgb&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xgb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;XGBClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;reg_alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# L1 
&lt;/span&gt;    &lt;span class="n"&gt;reg_lambda&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;   &lt;span class="c1"&gt;# L2
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These penalties can be slightly sensitive to feature scale, though the impact is usually marginal.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Mixed Model Pipelines
&lt;/h3&gt;

&lt;p&gt;When combining algorithms:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;VotingClassifier&lt;/span&gt;

&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;VotingClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rf&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;  &lt;span class="c1"&gt;# Doesn't need scaling
&lt;/span&gt;        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;svm&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;SVC&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;                     &lt;span class="c1"&gt;# Needs scaling
&lt;/span&gt;        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;lr&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;LogisticRegression&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;        &lt;span class="c1"&gt;# Needs scaling
&lt;/span&gt;    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Scale everything. It won't hurt the Random Forest, and the SVM and Logistic Regression need it!&lt;/p&gt;
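&lt;p&gt;Alternatively, if you'd rather not scale what doesn't need it, you can wrap only the sensitive estimators in their own pipelines. A sketch using scikit-learn's &lt;code&gt;make_pipeline&lt;/code&gt; (default hyperparameters, for illustration only):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipeline = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier()),                 # scaling irrelevant
        ("svm", make_pipeline(StandardScaler(), SVC())),  # scaled internally
        ("lr", make_pipeline(StandardScaler(), LogisticRegression())),
    ]
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each pipeline fits its own scaler during training and reapplies it at prediction time, so the forest sees raw features while the SVM and Logistic Regression see standardized ones.&lt;/p&gt;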

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;h3&gt;
  
  
  When working with trees:
&lt;/h3&gt;

&lt;p&gt;✅ &lt;strong&gt;Skip scaling&lt;/strong&gt; → One less preprocessing step to build and maintain&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Focus on feature engineering&lt;/strong&gt; → Far more impact on model quality&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Tune hyperparameters&lt;/strong&gt; → &lt;code&gt;max_depth&lt;/code&gt;, &lt;code&gt;learning_rate&lt;/code&gt;, etc.&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Handle missing values&lt;/strong&gt; → Still critical!&lt;/p&gt;

&lt;h3&gt;
  
  
  When working with distance/gradient-based models:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Always scale&lt;/strong&gt; → Non-negotiable&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Standardization usually better&lt;/strong&gt; than Min-Max&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Check your pipeline&lt;/strong&gt; → Ensure consistent preprocessing&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Trees compare thresholds, not distances&lt;/strong&gt; → Scaling is irrelevant&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monotonic transformations preserve order&lt;/strong&gt; → Same splits regardless of scale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;k-NN, SVM, Neural Nets need scaling&lt;/strong&gt; → Distance/gradient calculations are sensitive&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feature engineering &amp;gt; Scaling&lt;/strong&gt; → Focus your efforts where they matter&lt;/li&gt;
&lt;/ol&gt;
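&lt;p&gt;Takeaway #2 is worth verifying once yourself. A minimal sketch: fit the same decision tree on raw and standardized versions of a synthetic dataset and compare the predictions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

tree_raw = DecisionTreeClassifier(random_state=42).fit(X, y)
tree_scaled = DecisionTreeClassifier(random_state=42).fit(X_scaled, y)

# Standardization preserves the ordering of each feature, so both trees
# make the same splits and produce identical predictions
print((tree_raw.predict(X) == tree_scaled.predict(X_scaled)).all())  # True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;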

&lt;h2&gt;
  
  
  🔗 Want to Go Deeper?
&lt;/h2&gt;

&lt;p&gt;Here are some great resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://scikit-learn.org/stable/modules/tree.html" rel="noopener noreferrer"&gt;Scikit-learn Tree Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://xgboost.readthedocs.io/en/stable/parameter.html" rel="noopener noreferrer"&gt;XGBoost Parameters Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=7VeUPuFGJHk" rel="noopener noreferrer"&gt;StatQuest: Decision Trees&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Have you ever wasted time scaling data for tree models? What's your preprocessing workflow? Drop a comment below! 👇&lt;/p&gt;

&lt;p&gt;If this helped you, consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❤️ Giving it a like&lt;/li&gt;
&lt;li&gt;🔖 Bookmarking for later&lt;/li&gt;
&lt;li&gt;🔄 Sharing with your team&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Happy coding!&lt;/strong&gt; 🎉&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Found a typo or have a suggestion? Leave a comment or reach out!&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  📊 Quick Reference Cheatsheet
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ❌ Don't waste time on this for trees
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StandardScaler&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;

&lt;span class="n"&gt;scaler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StandardScaler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;X_scaled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scaler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Unnecessary!
&lt;/span&gt;&lt;span class="n"&gt;rf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;rf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_scaled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Just do this instead
&lt;/span&gt;&lt;span class="n"&gt;rf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;rf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Works perfectly!
&lt;/span&gt;
&lt;span class="c1"&gt;# But DO scale for these
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.neighbors&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KNeighborsClassifier&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.svm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SVC&lt;/span&gt;

&lt;span class="n"&gt;scaler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StandardScaler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;X_scaled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scaler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Critical!
&lt;/span&gt;
&lt;span class="n"&gt;knn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KNeighborsClassifier&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;knn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_scaled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;em&gt;Part of my Machine Learning Fundamentals series. Follow for more deep dives!&lt;/em&gt; 🚀&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>tutorial</category>
      <category>python</category>
    </item>
    <item>
      <title>2026: The Year Data Science Changed Forever (And What It Means for You)</title>
      <dc:creator>Mubarak Mohamed</dc:creator>
      <pubDate>Tue, 10 Feb 2026 12:08:16 +0000</pubDate>
      <link>https://dev.to/moubarakmohame4/2026-the-year-data-science-changed-forever-and-what-it-means-for-you-36ad</link>
      <guid>https://dev.to/moubarakmohame4/2026-the-year-data-science-changed-forever-and-what-it-means-for-you-36ad</guid>
      <description>&lt;p&gt;I've been in Data Science for 5 years, and 2026 feels different. Not "new tool different" — &lt;strong&gt;fundamentally different&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Last week, I watched a marketing manager with zero coding experience build a customer churn prediction model in 20 minutes using a conversational AI interface. Three years ago, that would've taken my team two weeks. &lt;/p&gt;

&lt;p&gt;This isn't just about tools getting better. The entire data profession is being redefined, and if you're not paying attention, you might miss the shift.&lt;/p&gt;

&lt;h2&gt;
  
  
  🚨 Why 2026 Actually Matters
&lt;/h2&gt;

&lt;p&gt;Let me be clear: I'm not here to tell you "AI is taking your job" (it's not). But ignoring what's happening would be like a web developer ignoring JavaScript frameworks in 2015.&lt;/p&gt;

&lt;p&gt;Three seismic shifts are converging &lt;strong&gt;right now&lt;/strong&gt;:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Generative AI isn't just answering questions anymore
&lt;/h3&gt;

&lt;p&gt;It's writing production SQL queries, generating entire analysis pipelines, and explaining statistical concepts better than most tutorials.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# What we used to do (2023)
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sales.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# ... 50 lines of data cleaning ...
# ... 30 lines of visualization code ...
&lt;/span&gt;
&lt;span class="c1"&gt;# What happens now (2026)
# Prompt: "Analyze sales.csv, clean the data, and show me regional trends"
# AI generates the entire pipeline + explains every decision
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  2. AutoML reached production maturity
&lt;/h3&gt;

&lt;p&gt;Platforms like DataRobot and H2O.ai don't just train models — they:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Handle feature engineering automatically&lt;/li&gt;
&lt;li&gt;Select optimal algorithms&lt;/li&gt;
&lt;li&gt;Deploy to production with monitoring&lt;/li&gt;
&lt;li&gt;Explain predictions in plain language&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The technical barrier to ML just collapsed.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  3. No-code ate the analytics market
&lt;/h3&gt;

&lt;p&gt;Your CEO can now ask Tableau: &lt;em&gt;"Why did Q1 revenue drop in the Southeast?"&lt;/em&gt; and get a structured answer with visualizations. No SQL. No Python. No data analyst in the loop.&lt;/p&gt;

&lt;p&gt;Does this mean Data Analysts are obsolete? &lt;strong&gt;Absolutely not.&lt;/strong&gt; But the job description just changed radically.&lt;/p&gt;
&lt;h2&gt;
  
  
  🔍 The 6 Trends You Can't Ignore
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Trend 1: AI Copilots in Every Tool
&lt;/h3&gt;

&lt;p&gt;Every major BI platform now has a conversational interface. This isn't a gimmick — it's changing &lt;strong&gt;who can do analytics&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Impact&lt;/strong&gt;: Your value shifts from &lt;em&gt;creating dashboards&lt;/em&gt; to &lt;em&gt;interpreting insights and guiding strategy&lt;/em&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Trend 2: Real-Time Analytics Becomes Standard
&lt;/h3&gt;

&lt;p&gt;Streaming data (Kafka, Flink) + cloud infrastructure means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dynamic pricing models that adjust in real-time&lt;/li&gt;
&lt;li&gt;Instant fraud detection&lt;/li&gt;
&lt;li&gt;Live personalization engines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The batch processing era is ending.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Trend 3: Augmented Analytics (AI That Thinks Ahead)
&lt;/h3&gt;

&lt;p&gt;This goes beyond automation. The system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Suggests analyses you didn't think to run&lt;/li&gt;
&lt;li&gt;Detects anomalies proactively&lt;/li&gt;
&lt;li&gt;Predicts questions before they're asked&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's like having a junior data scientist monitoring everything 24/7.&lt;/p&gt;
&lt;h3&gt;
  
  
  Trend 4: The Explainability Mandate
&lt;/h3&gt;

&lt;p&gt;With EU AI Act and similar regulations worldwide, "black box" models are becoming liabilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;New essential skill&lt;/strong&gt;: Being able to explain &lt;em&gt;why&lt;/em&gt; the model made a decision, not just &lt;em&gt;what&lt;/em&gt; it predicted.&lt;/p&gt;
&lt;h3&gt;
  
  
  Trend 5: Data Governance Isn't Optional Anymore
&lt;/h3&gt;

&lt;p&gt;Privacy regulations (GDPR, CCPA, etc.) + AI ethics requirements mean:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need to track data lineage&lt;/li&gt;
&lt;li&gt;You must prevent algorithmic bias&lt;/li&gt;
&lt;li&gt;Transparency is legally required&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;This is creating entirely new roles&lt;/strong&gt; (AI Ethics Officer, Data Governance Specialist).&lt;/p&gt;
&lt;h3&gt;
  
  
  Trend 6: Role Evolution is Accelerating
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Old Role&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;New Focus&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Data Analyst&lt;/td&gt;
&lt;td&gt;Strategic advisor + AI orchestrator&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Scientist&lt;/td&gt;
&lt;td&gt;Complex problems + research + innovation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;em&gt;New:&lt;/em&gt; Analytics Engineer&lt;/td&gt;
&lt;td&gt;Bridge between data eng and analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;em&gt;New:&lt;/em&gt; AI Product Manager&lt;/td&gt;
&lt;td&gt;Build data-driven products&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h2&gt;
  
  
  💡 What This Means For You
&lt;/h2&gt;
&lt;h3&gt;
  
  
  If you're a Data Analyst
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Don't panic.&lt;/strong&gt; Your job is evolving, not disappearing.&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;What to learn&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt engineering (seriously, it's a skill)&lt;/li&gt;
&lt;li&gt;Business acumen + domain expertise&lt;/li&gt;
&lt;li&gt;Data storytelling and communication&lt;/li&gt;
&lt;li&gt;Critical thinking to validate AI outputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;❌ &lt;strong&gt;What's becoming less valuable&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Purely technical skills (SQL, Python) without context&lt;/li&gt;
&lt;li&gt;Repetitive dashboard creation&lt;/li&gt;
&lt;li&gt;Manual data cleaning&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  If you're learning Data Science
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Great timing actually.&lt;/strong&gt; The barrier to entry is lower, but the skill ceiling is higher.&lt;/p&gt;

&lt;p&gt;You can now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with no-code tools to learn concepts&lt;/li&gt;
&lt;li&gt;Gradually add technical depth where needed&lt;/li&gt;
&lt;li&gt;Focus on business impact from day one&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Hot take&lt;/strong&gt;: You might become more valuable by mastering Tableau + business strategy than by grinding LeetCode for 6 months.&lt;/p&gt;
&lt;h3&gt;
  
  
  If you're hiring
&lt;/h3&gt;

&lt;p&gt;Stop asking for "5 years Python + PhD in Statistics". &lt;/p&gt;

&lt;p&gt;Start looking for people who can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Translate business problems into data questions&lt;/li&gt;
&lt;li&gt;Critically evaluate AI-generated insights&lt;/li&gt;
&lt;li&gt;Communicate findings to non-technical stakeholders&lt;/li&gt;
&lt;li&gt;Navigate ethical implications of data use&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  🤔 "Should I Still Learn Data Science in 2026?"
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Yes. But differently.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  The essential skills now are:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Technical foundation&lt;/strong&gt; (still needed, just less time):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SQL + data manipulation&lt;/li&gt;
&lt;li&gt;Statistical thinking&lt;/li&gt;
&lt;li&gt;One visualization tool deeply&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;New essentials&lt;/strong&gt; (invest heavily here):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt engineering for data tasks&lt;/li&gt;
&lt;li&gt;Model interpretation &amp;amp; validation&lt;/li&gt;
&lt;li&gt;Communication &amp;amp; storytelling&lt;/li&gt;
&lt;li&gt;Ethics &amp;amp; governance fundamentals&lt;/li&gt;
&lt;li&gt;Business domain knowledge&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pro tip&lt;/strong&gt;: Spend 40% of your learning time on technical skills, 60% on context, communication, and judgment.&lt;/p&gt;
&lt;h2&gt;
  
  
  🎯 Key Takeaways
&lt;/h2&gt;

&lt;p&gt;📌 &lt;strong&gt;Remember&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2026 marks a structural shift, not just new tools&lt;/li&gt;
&lt;li&gt;Technical tasks are getting simpler; strategic thinking is getting more valuable&lt;/li&gt;
&lt;li&gt;Accessibility is increasing (good for beginners)&lt;/li&gt;
&lt;li&gt;New roles are emerging faster than old ones are disappearing&lt;/li&gt;
&lt;li&gt;The skill gap is widening between "technical operators" and "strategic data professionals"&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  🗣️ Let's Discuss
&lt;/h2&gt;

&lt;p&gt;I'm curious about your experience:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Have you used AI to generate code or analysis?&lt;/strong&gt; What worked? What didn't?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data professionals&lt;/strong&gt;: How has your role changed in the past year?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Beginners&lt;/strong&gt;: Does this make you more or less excited to enter the field?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Drop your thoughts in the comments. I genuinely want to hear different perspectives on this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;P.S.&lt;/strong&gt; If you found this valuable, I write deep dives like this regularly on &lt;a href="https://coachdata.dev" rel="noopener noreferrer"&gt;coachdata.dev&lt;/a&gt;. We focus on practical skills and career navigation in the evolving data landscape.&lt;/p&gt;

&lt;p&gt;Happy learning 🚀&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fassets.dev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/codecrafters-io" rel="noopener noreferrer"&gt;
        codecrafters-io
      &lt;/a&gt; / &lt;a href="https://github.com/codecrafters-io/build-your-own-x" rel="noopener noreferrer"&gt;
        build-your-own-x
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Master programming by recreating your favorite technologies from scratch.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;p&gt;&lt;a href="https://codecrafters.io/github-banner" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/d5519a56f2d0fb4feb658af2aaae80023bcacca77f1dcb0c984488cf30d16c80/68747470733a2f2f636f646563726166746572732e696f2f696d616765732f757064617465642d62796f782d62616e6e65722e676966" alt="Banner"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Build your own &amp;lt;insert-technology-here&amp;gt;&lt;/h2&gt;
&lt;/div&gt;

&lt;p&gt;This repository is a compilation of well-written, step-by-step guides for re-creating our favorite technologies from scratch.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;What I cannot create, I do not understand — Richard Feynman.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It's a great way to learn.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-3d-renderer" rel="noopener noreferrer"&gt;3D Renderer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-augmented-reality" rel="noopener noreferrer"&gt;Augmented Reality&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-bittorrent-client" rel="noopener noreferrer"&gt;BitTorrent Client&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-blockchain--cryptocurrency" rel="noopener noreferrer"&gt;Blockchain / Cryptocurrency&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-bot" rel="noopener noreferrer"&gt;Bot&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-command-line-tool" rel="noopener noreferrer"&gt;Command-Line Tool&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-database" rel="noopener noreferrer"&gt;Database&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-docker" rel="noopener noreferrer"&gt;Docker&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-emulator--virtual-machine" rel="noopener noreferrer"&gt;Emulator / Virtual Machine&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-front-end-framework--library" rel="noopener noreferrer"&gt;Front-end Framework / Library&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-game" rel="noopener noreferrer"&gt;Game&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-git" rel="noopener noreferrer"&gt;Git&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-network-stack" rel="noopener noreferrer"&gt;Network Stack&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-neural-network" rel="noopener noreferrer"&gt;Neural Network&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-operating-system" rel="noopener noreferrer"&gt;Operating System&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-physics-engine" rel="noopener noreferrer"&gt;Physics Engine&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-programming-language" rel="noopener noreferrer"&gt;Programming Language&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-regex-engine" rel="noopener noreferrer"&gt;Regex Engine&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-search-engine" rel="noopener noreferrer"&gt;Search Engine&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-shell" rel="noopener noreferrer"&gt;Shell&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-template-engine" rel="noopener noreferrer"&gt;Template Engine&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-text-editor" rel="noopener noreferrer"&gt;Text Editor&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-visual-recognition-system" rel="noopener noreferrer"&gt;Visual Recognition System&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-voxel-engine" rel="noopener noreferrer"&gt;Voxel Engine&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-web-browser" rel="noopener noreferrer"&gt;Web Browser&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#build-your-own-web-server" rel="noopener noreferrer"&gt;Web Server&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/codecrafters-io/build-your-own-x#uncategorized" rel="noopener noreferrer"&gt;Uncategorized&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Tutorials&lt;/h2&gt;
&lt;/div&gt;

&lt;div class="markdown-heading"&gt;
&lt;h4 class="heading-element"&gt;Build your own &lt;code&gt;3D Renderer&lt;/code&gt;
&lt;/h4&gt;
&lt;/div&gt;


&lt;ul&gt;

&lt;li&gt;&lt;a href="https://www.scratchapixel.com/lessons/3d-basic-rendering/introduction-to-ray-tracing/how-does-it-work" rel="nofollow noopener noreferrer"&gt;&lt;strong&gt;C++&lt;/strong&gt;: &lt;em&gt;Introduction to Ray Tracing: a Simple Method for Creating 3D Images&lt;/em&gt;&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href="https://github.com/ssloy/tinyrenderer/wiki" rel="noopener noreferrer"&gt;&lt;strong&gt;C++&lt;/strong&gt;: &lt;em&gt;How OpenGL works: software rendering in 500 lines of code&lt;/em&gt;&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href="http://lodev.org/cgtutor/raycasting.html" rel="nofollow noopener noreferrer"&gt;&lt;strong&gt;C++&lt;/strong&gt;: &lt;em&gt;Raycasting engine of Wolfenstein 3D&lt;/em&gt;&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href="http://www.pbr-book.org/" rel="nofollow noopener noreferrer"&gt;&lt;strong&gt;C++&lt;/strong&gt;: &lt;em&gt;Physically Based Rendering:From Theory To Implementation&lt;/em&gt;&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href="https://raytracing.github.io/books/RayTracingInOneWeekend.html" rel="nofollow noopener noreferrer"&gt;&lt;strong&gt;C++&lt;/strong&gt;: &lt;em&gt;Ray Tracing in One Weekend&lt;/em&gt;&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href="https://www.scratchapixel.com/lessons/3d-basic-rendering/rasterization-practical-implementation/overview-rasterization-algorithm" rel="nofollow noopener noreferrer"&gt;&lt;strong&gt;C++&lt;/strong&gt;: &lt;em&gt;Rasterization: a Practical Implementation&lt;/em&gt;&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;

&lt;a href="https://www.davrous.com/2013/06/13/tutorial-series-learning-how-to-write-a-3d-soft-engine-from-scratch-in-c-typescript-or-javascript/" rel="nofollow noopener noreferrer"&gt;&lt;strong&gt;C#&lt;/strong&gt;&lt;/a&gt;…&lt;/li&gt;

&lt;/ul&gt;
&lt;/div&gt;
&lt;br&gt;
  &lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/codecrafters-io/build-your-own-x" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;





&lt;p&gt;&lt;em&gt;What are you learning in 2026? Share your data journey below 👇&lt;/em&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>ai</category>
      <category>career</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>The Data Analyst's Arsenal in 2025: Mastering the Tools, Data, and Trends to Stand Out</title>
      <dc:creator>Mubarak Mohamed</dc:creator>
      <pubDate>Tue, 02 Sep 2025 10:18:39 +0000</pubDate>
      <link>https://dev.to/moubarakmohame4/larsenal-du-data-analyst-en-2025-maitriser-les-outils-les-donnees-et-les-tendances-pour-se-4lb0</link>
      <guid>https://dev.to/moubarakmohame4/larsenal-du-data-analyst-en-2025-maitriser-les-outils-les-donnees-et-les-tendances-pour-se-4lb0</guid>
      <description>&lt;p&gt;The Data Analyst role is constantly evolving, and in 2025 it is more crucial than ever within companies. It is no longer just about crunching numbers, but about turning mountains of raw data into strategic insights, telling clear stories, and guiding decision-making. Excelling in this field takes more than solid technical skills; you also need to navigate an ecosystem in perpetual flux. From mastering the classic tools to the latest innovations in AI and the cloud, the modern Data Analyst must be versatile and commit to continuous learning.&lt;/p&gt;

&lt;p&gt;This guide is a comprehensive tour of the essential resources that shape the daily work of a Data Analyst in 2025. We will explore the must-have tools, the best data sources, training platforms, communities to follow, and the major trends transforming the profession.&lt;/p&gt;

&lt;h3&gt;
  
  
  Essential Tools: The Foundations of Data Analysis
&lt;/h3&gt;

&lt;p&gt;A Data Analyst is above all a craftsperson of data, and their effectiveness depends directly on the quality of their tools. In 2025, the arsenal has grown richer and more complex.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. The Ever-Powerful Basics: Excel and SQL
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fod3wyt3hita8dake7uv2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fod3wyt3hita8dake7uv2.jpg" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Microsoft Excel&lt;/strong&gt;: Far from obsolete, Excel remains a go-to tool for quick exploratory analysis, managing small to medium datasets, and building simple visualizations. Mastery of &lt;strong&gt;pivot tables&lt;/strong&gt;, functions such as &lt;code&gt;RECHERCHEV&lt;/code&gt; (VLOOKUP) or &lt;code&gt;INDEX+EQUIV&lt;/code&gt; (INDEX+MATCH), and VBA macros is still highly useful for automating repetitive tasks or cleaning data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;SQL (Structured Query Language)&lt;/strong&gt;: The universal language of relational databases and the gateway for querying, extracting, and manipulating data. Mastering SQL is non-negotiable. In 2025, it is crucial to be able to write complex queries, use &lt;strong&gt;window functions&lt;/strong&gt; (&lt;code&gt;WINDOW FUNCTIONS&lt;/code&gt;) for advanced calculations, and understand query optimization.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
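As a quick illustration of the window functions mentioned above, here is a minimal, self-contained sketch using Python's built-in sqlite3 module (SQLite has supported window functions since version 3.25). The sales table and its columns are invented for the demo.

```python
import sqlite3

# In-memory demo database with a hypothetical "sales" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("2025-01-01", "North", 100.0),
     ("2025-01-02", "North", 150.0),
     ("2025-01-01", "South", 80.0),
     ("2025-01-02", "South", 120.0)],
)

# Window function: a running total per region, ordered by day.
rows = conn.execute("""
    SELECT region, day, amount,
           SUM(amount) OVER (PARTITION BY region ORDER BY day) AS running_total
    FROM sales
    ORDER BY region, day
""").fetchall()

for row in rows:
    print(row)
```

Unlike a plain GROUP BY, the window function keeps every row while adding the aggregate alongside it.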

&lt;h4&gt;
  
  
  2. The Era of Business Intelligence (BI)
&lt;/h4&gt;

&lt;p&gt;BI platforms are at the heart of the job. They make it possible to build dynamic dashboards and interactive reports that tell a story with data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff238u9h4164kfmnfdz0s.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff238u9h4164kfmnfdz0s.webp" alt=" " width="768" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Power BI&lt;/strong&gt;: Microsoft's tool is a must-have. It is powerful, integrates seamlessly with the Microsoft ecosystem (Excel, Azure), and offers robust data-modeling (the DAX language) and visualization features. It is ideal for companies already built on the Microsoft ecosystem and has a relatively gentle learning curve.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj216vns8gzs4puc7wvz8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj216vns8gzs4puc7wvz8.png" alt=" " width="575" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Tableau&lt;/strong&gt;: Known for producing striking, polished visualizations, Tableau remains a benchmark. It is particularly appreciated for its drag-and-drop simplicity and its ability to connect to a multitude of data sources.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  3. The Power of Code: Python and Its Libraries
&lt;/h4&gt;

&lt;p&gt;Python has established itself as the language of choice for advanced data analysis, machine learning, and automation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Pandas&lt;/strong&gt;: The indispensable library for manipulating and analyzing tabular data (DataFrames).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Matplotlib and Seaborn&lt;/strong&gt;: For custom data visualizations that go beyond what BI tools offer.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Jupyter Notebook&lt;/strong&gt;: The interactive environment of choice for data exploration. It combines code, visualizations, and explanatory text, producing clear, reproducible analyses.&lt;/li&gt;
&lt;/ul&gt;
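To give a feel for the Pandas workflow, here is a tiny, self-contained sketch; the dataset and column names are invented for the demo.

```python
import pandas as pd

# Tiny invented dataset, just to show the DataFrame workflow.
df = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "revenue": [100.0, 150.0, 80.0, 120.0],
})

# One line of Pandas replaces a manual aggregation loop.
by_region = df.groupby("region")["revenue"].sum()
print(by_region)
```

From here, a single call like `by_region.plot(kind="bar")` hands the result to Matplotlib for a quick chart.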

&lt;h4&gt;
  
  
  4. AI at the Data Analyst's Service
&lt;/h4&gt;

&lt;p&gt;Copilots and generative-AI tools are the new stars of the Data Analyst's arsenal.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;ChatGPT&lt;/strong&gt; and &lt;strong&gt;other code assistants (e.g., Copilot)&lt;/strong&gt;: These tools do not replace the analyst; they dramatically boost productivity. They can help write complex SQL queries, generate Python code snippets, explain statistical concepts, or even summarize analyses. For example, you can ask ChatGPT to write a query to "compute total revenue by region and by month for 2024" and get ready-to-use code that you only need to adapt.&lt;/li&gt;
&lt;/ul&gt;
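For reference, an assistant-generated query along those lines might look like the following sketch, run here against SQLite via Python's standard library. The orders table and its columns are hypothetical; a real generated query would need adapting to your actual schema.

```python
import sqlite3

# Hypothetical "orders" table standing in for a real sales database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_date TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("2024-01-15", "North", 200.0),
     ("2024-01-20", "North", 100.0),
     ("2024-02-05", "South", 300.0)],
)

# Total revenue by region and by month for 2024.
rows = conn.execute("""
    SELECT region,
           strftime('%m', order_date) AS month,
           SUM(amount) AS total_revenue
    FROM orders
    WHERE strftime('%Y', order_date) = '2024'
    GROUP BY region, month
    ORDER BY region, month
""").fetchall()
print(rows)
```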

&lt;h3&gt;
  
  
  Data Sources: The Fuel of Analysis
&lt;/h3&gt;

&lt;p&gt;Without data, there is no analysis. The Data Analyst of 2025 knows where to find quality data, whether public or private.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Kaggle&lt;/strong&gt;: Much more than a data science competition platform, Kaggle is a goldmine of high-quality datasets on topics ranging from COVID-19 to movies. It is the perfect place to practice on real projects and see how other analysts have solved similar problems.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Google Dataset Search&lt;/strong&gt;: This dataset-specific search engine helps you find relevant datasets across the web, whether published by governments, universities, or individuals.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;National and international Open Data portals&lt;/strong&gt;: Each country has its own open data portal.

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;France:&lt;/strong&gt; &lt;code&gt;data.gouv.fr&lt;/code&gt; is the reference for French public data, with information on demographics, health, transport, and more.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;United States:&lt;/strong&gt; &lt;code&gt;data.gov&lt;/code&gt; gathers the US government's datasets.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;European Union:&lt;/strong&gt; &lt;code&gt;data.europa.eu&lt;/code&gt; is the EU's official open data portal.&lt;/li&gt;
&lt;/ul&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h3&gt;
  
  
  Training and Staying Current: Continuous Learning Is a Necessity
&lt;/h3&gt;

&lt;p&gt;The data world moves so fast that learning never stops. To stay relevant, you must commit to ongoing training.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Online training platforms&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Coursera&lt;/strong&gt;: Offers high-level specializations in partnership with prestigious universities. The "Google Data Analytics Professional Certificate," for example, is an excellent entry point.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Udemy&lt;/strong&gt;: Ideal for shorter courses focused on specific skills (e.g., a course on Power BI or automation with Python).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;DataCamp&lt;/strong&gt; and &lt;strong&gt;Dataquest&lt;/strong&gt;: Data-focused platforms offering interactive tracks for learning SQL, Python, or R right in the browser.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Specialized newsletters&lt;/strong&gt;: Subscribing to a few quality newsletters is the best way to stay informed without spending too much time on it.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;Data Elixir&lt;/code&gt;: A weekly selection of the best articles, tools, and tutorials on data science and data analysis.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;Data Is Plural&lt;/code&gt;: A newsletter sharing interesting datasets every week. It is an excellent source of new projects to explore.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Podcasts&lt;/strong&gt;: Listening to experts discuss the field is a convenient way to keep up with the latest trends.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;DataGen&lt;/code&gt;: A French podcast that interviews data professionals.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;Super Data Science&lt;/code&gt;: Interviews with leaders in data, data science, and AI.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Communities and Networks: Building Your Human Capital
&lt;/h3&gt;

&lt;p&gt;Exchanging with other professionals is an invaluable resource for solving problems, finding a job, or simply staying motivated.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;LinkedIn&lt;/strong&gt;: The professional network is a prime platform for staying current and networking. Following thought leaders, joining data-focused groups, and sharing your projects and insights is essential for building your personal brand and staying visible in the community.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Slack and Discord&lt;/strong&gt;: Many communities gather on these platforms for real-time exchanges. Servers such as &lt;code&gt;The Data Science Community&lt;/code&gt; and the Slack channels of specialized companies let you ask questions, get help with technical problems, and share your findings.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;GitHub&lt;/strong&gt;: The modern Data Analyst's portfolio. Hosting your projects there (analyses, Jupyter notebooks, Python scripts) is an excellent way to show your skills to recruiters and collaborate with other developers.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Specialized forums&lt;/strong&gt;: Platforms like &lt;code&gt;Stack Overflow&lt;/code&gt; are first-rate resources for solving specific coding or modeling problems.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  2025 Trends: Anticipating the Future of Data
&lt;/h3&gt;

&lt;p&gt;Tomorrow's Data Analyst must not only master today's tools but also anticipate tomorrow's trends.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automation and Generative AI&lt;/strong&gt;: These technologies are transforming how analyses are carried out. Instead of spending hours cleaning data, analysts will increasingly rely on automated tools (AutoML) and copilots for repetitive tasks. The job is shifting toward higher-value activities: business understanding and storytelling.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cloud Computing&lt;/strong&gt;: Companies are migrating their data infrastructure to the cloud. Mastering tools such as &lt;strong&gt;BigQuery&lt;/strong&gt; (Google Cloud) or &lt;strong&gt;Snowflake&lt;/strong&gt; is now a major asset. These platforms can manage and analyze petabytes of data at high speed, with no need to worry about the underlying infrastructure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Storytelling&lt;/strong&gt;: Producing complex reports is no longer enough. The Data Analyst of 2025 is a storyteller. The ability to turn figures and charts into a convincing, clear story tailored to a non-technical audience has become a crucial skill. Using tools like Power BI or Tableau to build "storyboards" is increasingly common.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Adventure of Continuous Learning
&lt;/h3&gt;

&lt;p&gt;Being a Data Analyst in 2025 is an exciting but demanding adventure. Tools evolve, technologies change, and new methods emerge every day. Versatility, curiosity, and the desire to learn are the qualities that make the difference.&lt;/p&gt;

&lt;p&gt;Build your toolbox by mastering fundamentals like SQL and Python, but don't be afraid to venture onto cloud platforms like BigQuery. Feed your mind through continuous training on platforms like Coursera or by listening to specialized podcasts. Above all, lean on the community: share your projects on GitHub and engage on LinkedIn. Your success as a Data Analyst will depend not only on what you know how to do, but on your ability to evolve with the world of data.&lt;/p&gt;

</description>
      <category>data</category>
      <category>datascience</category>
      <category>chatgpt</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Data Mesh: The Decentralized Revolution That Will Transform Your Data Architecture</title>
      <dc:creator>Mubarak Mohamed</dc:creator>
      <pubDate>Mon, 01 Sep 2025 10:47:55 +0000</pubDate>
      <link>https://dev.to/moubarakmohame4/data-mesh-the-decentralized-revolution-that-will-transform-your-data-architecture-1nn2</link>
      <guid>https://dev.to/moubarakmohame4/data-mesh-the-decentralized-revolution-that-will-transform-your-data-architecture-1nn2</guid>
      <description>&lt;p&gt;Imagine your data team as a bottleneck. Every time a business team needs to access, analyze, or update data, the request goes through this central team, causing delays, frustration, and a loss of agility. This model is the &lt;strong&gt;data monolith&lt;/strong&gt;, often embodied by a single, centralized &lt;strong&gt;data lake&lt;/strong&gt; or &lt;strong&gt;data warehouse&lt;/strong&gt; that quickly becomes unmanageable.&lt;/p&gt;

&lt;p&gt;Product teams are ready to innovate, but they are held back by dependence on a single source of truth, a single team, and a rigid process. The company's pace grinds to a crawl. So how do we solve this puzzle? Should we simply add more people to the central team? Or is the problem deeper, rooted in the very structure of our architecture?&lt;/p&gt;

&lt;h2&gt;
  
  
  Goodbye Monolith, Hello Mesh
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F592vblb8xapla1c2dzn8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F592vblb8xapla1c2dzn8.png" alt=" " width="800" height="530"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Data Mesh&lt;/strong&gt; is not a new technology. It is a &lt;strong&gt;paradigm shift&lt;/strong&gt; in architecture and organization. The idea is simple but powerful: instead of centralizing all data, why not decentralize it and organize it by business domain?&lt;/p&gt;

&lt;p&gt;Inspired by the &lt;strong&gt;Microservices Architecture&lt;/strong&gt;, Data Mesh proposes treating data not as a passive resource, but as a living &lt;strong&gt;product&lt;/strong&gt;. Each business domain (customers, products, logistics, etc.) becomes the owner and steward of its own data.&lt;/p&gt;

&lt;p&gt;This model is based on four fundamental principles that transform how we manage and use data.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Domain-oriented Data Ownership
&lt;/h3&gt;

&lt;p&gt;This is the core of Data Mesh. Instead of a central team that ingests all the organization's data, the business teams themselves are responsible for their data. The team in charge of products is responsible for product data. The marketing team manages data on advertising campaigns.&lt;/p&gt;

&lt;p&gt;This promotes greater &lt;strong&gt;accountability&lt;/strong&gt; and a better &lt;strong&gt;understanding&lt;/strong&gt; of the data's semantics. The people who create the data are also the ones who manage it, ensuring better quality.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Data as a Product
&lt;/h3&gt;

&lt;p&gt;In a Data Mesh architecture, data is not just files in a data lake. It's treated as a product in its own right, with clear characteristics and a focus on &lt;strong&gt;consumability&lt;/strong&gt;. A &lt;strong&gt;data product&lt;/strong&gt; must be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Discoverable&lt;/strong&gt;: Easy to find in a data catalog.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Addressable&lt;/strong&gt;: Accessible via a simple interface (API, Kafka stream, etc.).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interoperable&lt;/strong&gt;: With clear semantics and rich documentation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trustworthy &amp;amp; high quality&lt;/strong&gt;: Tested and maintained by the producing team.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secure&lt;/strong&gt;: Compliant with security and governance policies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This principle ensures that data is no longer a chore but a valuable, ready-to-use resource for any other team.&lt;/p&gt;
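As an illustration only, these characteristics could be recorded as catalog metadata. The following Python sketch uses invented field names and is not a real Data Mesh API; it just shows how a minimal catalog entry might make a data product discoverable and addressable.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: a minimal catalog entry capturing the data-product
# characteristics listed above (all names here are illustrative).
@dataclass
class DataProduct:
    name: str              # discoverable: listed in the catalog under this name
    owner_domain: str      # domain-oriented ownership
    endpoint: str          # addressable: e.g. an API URL or Kafka topic
    schema_doc: str        # interoperable: link to schema and documentation
    quality_checks: list = field(default_factory=list)  # trustworthy: producer-run tests
    access_policy: str = "restricted"                   # secure: governance policy applied

catalog = [
    DataProduct(
        name="customer-behavior",
        owner_domain="customers",
        endpoint="kafka://events/customer-behavior",
        schema_doc="https://example.com/schemas/customer-behavior",
        quality_checks=["no-null-ids", "freshness-under-1h"],
    ),
]

# Discoverability in action: a consumer looks a product up by name.
found = next(p for p in catalog if p.name == "customer-behavior")
print(found.owner_domain, found.endpoint)
```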

&lt;h3&gt;
  
  
  3. A Self-Serve Data Platform
&lt;/h3&gt;

&lt;p&gt;For business teams to be truly autonomous, they need tools. A Data Mesh requires a &lt;strong&gt;self-serve data platform&lt;/strong&gt; that provides the necessary infrastructure to create, manage, and expose their data products. This platform serves as an abstraction, allowing teams to focus on business logic without worrying about the technical details of the underlying infrastructure. It provides tools for ingestion, storage, processing, and governance but manages the complexity for end-users.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Federated Computational Governance
&lt;/h3&gt;

&lt;p&gt;If every team does what it wants, the result is chaos. This is where the last principle comes in. Governance is not centralized; it is &lt;strong&gt;federated&lt;/strong&gt;. A governance group defines global standards and rules (e.g., metadata formats, security policies), but the application of those rules is decentralized. "Computational" governance tools automate their enforcement. This ensures &lt;strong&gt;consistency&lt;/strong&gt; and &lt;strong&gt;security&lt;/strong&gt; while preserving team &lt;strong&gt;autonomy&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Concrete Business Case: Data Mesh at an E-Commerce Company
&lt;/h2&gt;

&lt;p&gt;Let's take the example of a large e-commerce platform. Traditionally, all sales, inventory, and customer data are centralized. With a Data Mesh, the organization could be divided into domains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;"Products" Domain&lt;/strong&gt;: The team responsible for the product catalog owns the product data. It creates a &lt;strong&gt;"Catalog" data product&lt;/strong&gt; that includes descriptions, prices, categories, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Customers" Domain&lt;/strong&gt;: The customer relationship team manages data on customer behavior. It produces a &lt;strong&gt;"Customer Behavior" data product&lt;/strong&gt; containing purchase history, clicks, and reviews.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Logistics" Domain&lt;/strong&gt;: The supply chain team is responsible for inventory and delivery data. It exposes an &lt;strong&gt;"Inventory Status" data product&lt;/strong&gt; updated in real-time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each team exposes its data products in a standardized way (via REST APIs, Kafka streams, shared tables). The marketing team, for example, can consume the "Customer Behavior" data product to personalize campaigns and the "Inventory Status" data product to ensure they don't promote out-of-stock products. All this without going through a central team, in an &lt;strong&gt;autonomous and fast&lt;/strong&gt; way.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Data Mesh Toolkit 🛠️
&lt;/h2&gt;

&lt;p&gt;Implementing a Data Mesh requires an appropriate technical architecture. Here are the types of tools needed, without limiting yourself to a single solution:&lt;/p&gt;

&lt;h3&gt;
  
  
  Ingestion and Streaming Tools
&lt;/h3&gt;

&lt;p&gt;To create and consume data products in real-time.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Apache Kafka&lt;/strong&gt;: The basis for most streaming architectures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confluent&lt;/strong&gt;: An enterprise platform built on Kafka, with connectors and simplified management.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data Platforms
&lt;/h3&gt;

&lt;p&gt;For data storage and processing. Each domain can have its own space, but it must be interoperable.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Databricks&lt;/strong&gt;: A powerful data processing engine that unifies data warehousing and machine learning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snowflake&lt;/strong&gt;: A data cloud that allows for great scalability for storage and analysis.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data Catalogs and Governance
&lt;/h3&gt;

&lt;p&gt;For data products to be discoverable and manageable.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Amundsen&lt;/strong&gt;: An open-source data catalog developed by Lyft.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collibra&lt;/strong&gt;: An enterprise data governance and management platform.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Orchestration Tools
&lt;/h3&gt;

&lt;p&gt;To automate data pipelines within each domain.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dagster&lt;/strong&gt;: A modern orchestrator focused on managing data products.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefect&lt;/strong&gt;: Another orchestration tool that focuses on flexibility and ease of use.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data Mesh is a concrete response to the limitations of traditional data architectures. By decentralizing data ownership, treating it as a product, and providing a self-serve platform, companies can unlock unprecedented agility and scalability.&lt;/p&gt;

&lt;p&gt;It's not a simple project and requires a cultural transformation. But the investment is worth it to free up your teams, accelerate innovation, and make data a true strategic asset.&lt;/p&gt;

&lt;p&gt;And you, how do you manage data in your organization? Would Data Mesh be a solution for your daily challenges? Share your thoughts in the comments below! 👇&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>algorithms</category>
      <category>dataengineering</category>
      <category>programming</category>
    </item>
    <item>
      <title>Google AI Studio: A Free Playground to Experiment with Gemini AI</title>
      <dc:creator>Mubarak Mohamed</dc:creator>
      <pubDate>Fri, 29 Aug 2025 15:38:44 +0000</pubDate>
      <link>https://dev.to/moubarakmohame4/google-ai-studio-a-free-playground-to-experiment-with-gemini-ai-19ki</link>
      <guid>https://dev.to/moubarakmohame4/google-ai-studio-a-free-playground-to-experiment-with-gemini-ai-19ki</guid>
      <description>&lt;p&gt;Artificial intelligence is evolving fast, and both developers and creators are looking for tools that make it easier to test, prototype, and integrate AI into their projects. &lt;strong&gt;Google AI Studio&lt;/strong&gt; is Google’s answer: a free, accessible platform that lets you experiment with &lt;strong&gt;Gemini AI models&lt;/strong&gt; — all without writing a single line of code.&lt;/p&gt;

&lt;p&gt;If you’re curious about AI but find integration too complex, or if you’re a developer who wants to quickly validate ideas before going into production, this tool is definitely worth a try.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Features of Google AI Studio
&lt;/h2&gt;

&lt;p&gt;Google AI Studio comes with four main capabilities that unlock a wide range of possibilities.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Chat with AI
&lt;/h3&gt;

&lt;p&gt;A simple chat interface to interact with Gemini models, test their capabilities, and explore use cases like code explanations, brainstorming ideas, or assisted writing.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Real-Time Stream
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;stream mode&lt;/strong&gt; displays responses as they’re generated in real time. Perfect for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Writing a story or script and watching dialogues unfold live.&lt;/li&gt;
&lt;li&gt;Designing interactive tutorials where answers adapt as you type.&lt;/li&gt;
&lt;li&gt;Providing instant assistance in an application (e.g., customer support chatbots).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Generate Media
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;media generation&lt;/strong&gt; feature allows you to create images or videos from simple text prompts. Example: &lt;em&gt;“Create a futuristic illustration of a smart city at sunset”&lt;/em&gt; → Google AI Studio generates a ready-to-use image.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Build AI-Powered Apps
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;Build&lt;/strong&gt; tab lets you turn experiments into fully working AI applications. With customizable &lt;strong&gt;run settings&lt;/strong&gt;, you can choose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which Gemini model to use,&lt;/li&gt;
&lt;li&gt;Output formats (text, JSON, etc.),&lt;/li&gt;
&lt;li&gt;Voice and resolution for multimedia content.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes it easy to create no-code or low-code AI-powered projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Use Cases
&lt;/h2&gt;

&lt;p&gt;With these features, Google AI Studio can be applied to many scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Content creators&lt;/strong&gt;: generate visuals for blog posts, scripts for YouTube videos, or dialogues for narrative games.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Educators &amp;amp; trainers&lt;/strong&gt;: design interactive tutorials where AI guides learners step by step.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Entrepreneurs &amp;amp; startups&lt;/strong&gt;: quickly prototype a chatbot, customer support interface, or decision-making assistant.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developers&lt;/strong&gt;: test Gemini’s API before integrating it into larger applications.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Who Is Google AI Studio For?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Beginners&lt;/strong&gt;: The intuitive interface makes it possible to experiment without any coding knowledge. You can generate content or set up an assistant in just a few clicks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Experienced developers&lt;/strong&gt;: A fast sandbox to test prompts, configure advanced parameters, and validate ideas before implementing them in production.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why You Should Try Google AI Studio Now
&lt;/h2&gt;

&lt;p&gt;Google AI Studio lowers the barrier to entry for Gemini models: free, intuitive, and powerful enough to cover a wide range of needs. Whether you’re a developer, data analyst, content creator, or simply curious about AI, it’s an &lt;strong&gt;ideal starting point&lt;/strong&gt; to explore how AI can enhance your projects.&lt;/p&gt;

&lt;p&gt;👉 Start experimenting today: &lt;a href="https://aistudio.google.com/" rel="noopener noreferrer"&gt;Google AI Studio&lt;/a&gt; is live and ready to use.&lt;/p&gt;

</description>
      <category>learngoogleaistudio</category>
      <category>gemini</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Preparing Your Data for the ARIMA Model: The Secret Step to Reliable Forecasts</title>
      <dc:creator>Mubarak Mohamed</dc:creator>
      <pubDate>Thu, 28 Aug 2025 10:27:10 +0000</pubDate>
      <link>https://dev.to/moubarakmohame4/preparing-your-data-for-the-arima-model-the-secret-step-to-reliable-forecasts-2h9</link>
      <guid>https://dev.to/moubarakmohame4/preparing-your-data-for-the-arima-model-the-secret-step-to-reliable-forecasts-2h9</guid>
      <description>&lt;p&gt;Before making predictions, we need to make sure our data is ready.&lt;br&gt;
A raw time series often contains trends or fluctuations that can &lt;strong&gt;mislead a forecasting model&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;ARIMA&lt;/strong&gt; model has one key requirement: it only works properly with stationary series.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;stationary series&lt;/strong&gt; is one whose statistical properties (mean, variance, autocorrelation) remain stable over time.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;non-stationary series&lt;/strong&gt;, on the other hand, changes significantly over time (for example, with a strong trend or seasonality).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without this preparation, ARIMA may produce &lt;strong&gt;biased or unreliable forecasts&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In the previous article (&lt;a href="https://dev.to/moubarakmohame4/how-time-series-reveal-the-future-an-introduction-to-the-arima-model-2k81"&gt;How Time Series Reveal the Future: An Introduction to the ARIMA Model&lt;/a&gt;), we explored what a &lt;strong&gt;time series&lt;/strong&gt; is, its components (trend, seasonality, noise), and the intuition behind ARIMA.&lt;br&gt;
We also visualized the AirPassengers dataset, which showed a &lt;strong&gt;steady upward trend&lt;/strong&gt; and yearly seasonality.&lt;/p&gt;

&lt;p&gt;👉 But for ARIMA to work, our data must satisfy one key condition: &lt;strong&gt;stationarity&lt;/strong&gt;.&lt;br&gt;
That’s exactly what this article is about: &lt;strong&gt;transforming a non-stationary series into a stationary one&lt;/strong&gt; using simple techniques (differencing, statistical tests).&lt;br&gt;
In other words: after &lt;strong&gt;observing&lt;/strong&gt;, we now move on to &lt;strong&gt;preparing&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Simplified Theory
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;What is stationarity?&lt;/strong&gt;
A &lt;strong&gt;stationary series&lt;/strong&gt; is one whose statistical properties (mean, variance, autocorrelation) remain &lt;strong&gt;stable over time&lt;/strong&gt;.
👉 Example: daily winter temperatures in a city (around a stable mean with small fluctuations).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A &lt;strong&gt;non-stationary series&lt;/strong&gt; changes too much over time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Trend&lt;/strong&gt; (e.g., a constant increase in smartphone sales).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Seasonality&lt;/strong&gt; (e.g., ice cream sales peaking every summer).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;ARIMA assumes the series is stationary: otherwise, it “believes” past trends will continue forever, leading to &lt;strong&gt;biased forecasts&lt;/strong&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Differencing&lt;/strong&gt;
To make a series stationary, we use differencing:
Y′_t = Y_t − Y_{t−1}
In other words, each value is replaced by the &lt;strong&gt;change between two successive periods&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;This removes linear trends.&lt;/li&gt;
&lt;li&gt;For strong seasonality, we can apply &lt;strong&gt;seasonal differencing&lt;/strong&gt; (e.g., the difference with the value one year earlier).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Example: instead of analyzing raw monthly sales, we analyze the &lt;strong&gt;month-to-month change.&lt;/strong&gt;&lt;/p&gt;
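&lt;p&gt;A minimal sketch of this idea with pandas (the sales numbers are made up for illustration): a series with a steady upward trend turns into a constant series of changes after one difference.&lt;/p&gt;

```python
import pandas as pd

# Hypothetical monthly sales with a steady upward trend (illustrative numbers)
sales = pd.Series([100, 110, 120, 130, 140])

# First difference: each value becomes the change from the previous month
change = sales.diff().dropna()
print(change.tolist())  # → [10.0, 10.0, 10.0, 10.0]: the trend is gone
```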

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Statistical tests (ADF &amp;amp; KPSS)&lt;/strong&gt;
To check whether a series is stationary, we use two complementary tests.
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;ADF (Augmented Dickey-Fuller Test)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Null hypothesis (H₀): the series is non-stationary.&lt;/li&gt;
&lt;li&gt;If p-value &amp;lt; 0.05 → reject H₀ → the series is stationary.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;KPSS (Kwiatkowski-Phillips-Schmidt-Shin Test)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Null hypothesis (H₀): the series is stationary.&lt;/li&gt;
&lt;li&gt;If p-value &amp;lt; 0.05 → reject H₀ → the series is non-stationary.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;In practice:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We apply &lt;strong&gt;both tests&lt;/strong&gt; for robustness.&lt;/li&gt;
&lt;li&gt;If ADF and KPSS disagree, we refine with additional transformations.&lt;/li&gt;
&lt;/ul&gt;
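&lt;p&gt;The combined reading of the two tests can be sketched as a small helper function. This is an illustrative convention (the &lt;code&gt;interpret&lt;/code&gt; name and the 0.05 threshold are our choices, not part of statsmodels):&lt;/p&gt;

```python
def interpret(adf_p, kpss_p, alpha=0.05):
    """Combine ADF and KPSS p-values into a single verdict.
    ADF H0: non-stationary. KPSS H0: stationary."""
    adf_says_stationary = alpha > adf_p      # ADF p-value below alpha: reject H0
    kpss_says_stationary = kpss_p >= alpha   # KPSS p-value above alpha: keep H0
    if adf_says_stationary and kpss_says_stationary:
        return "stationary"
    if not adf_says_stationary and not kpss_says_stationary:
        return "non-stationary"
    return "inconclusive: refine with further transformations"

print(interpret(0.01, 0.30))  # → stationary
print(interpret(0.60, 0.01))  # → non-stationary
```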
&lt;h2&gt;
  
  
  Hands-on in Python
&lt;/h2&gt;

&lt;p&gt;We’ll use a simple time series dataset: the annual flow of the Nile River (built into statsmodels).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller, kpss
from statsmodels.datasets import nile

# Load Nile dataset
data = nile.load_pandas().data
data.index = pd.date_range(start="1871", periods=len(data), freq="YS")  # annual; the bare "Y" alias is deprecated in recent pandas
series = data['volume']

# Plot series
plt.figure(figsize=(10,4))
plt.plot(series)
plt.title("Annual Nile River Flow (1871–1970)")
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8uvyfndfbn2dj4bed0k1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8uvyfndfbn2dj4bed0k1.png" alt=" " width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check stationarity with ADF &amp;amp; KPSS&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def adf_test(series):
    result = adfuller(series, autolag='AIC')
    print("ADF Statistic:", result[0])
    print("p-value:", result[1])
    if result[1] &amp;lt; 0.05:
        print("✅ The series is stationary (ADF).")
    else:
        print("❌ The series is NON-stationary (ADF).")

def kpss_test(series):
    result = kpss(series, regression='c', nlags="auto")
    print("KPSS Statistic:", result[0])
    print("p-value:", result[1])
    if result[1] &amp;lt; 0.05:
        print("❌ The series is NON-stationary (KPSS).")
    else:
        print("✅ The series is stationary (KPSS).")

print("ADF Test")
adf_test(series)
print("\nKPSS Test")
kpss_test(series)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzig3g98us8l6033wuohl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzig3g98us8l6033wuohl.png" alt=" " width="800" height="184"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apply differencing&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;series_diff = series.diff().dropna()

plt.figure(figsize=(10,4))
plt.plot(series_diff)
plt.title("Differenced series (1st difference)")
plt.show()

print("ADF Test after differencing")
adf_test(series_diff)
print("\nKPSS Test after differencing")
kpss_test(series_diff)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg430sm7hieyvlhefhjmc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg430sm7hieyvlhefhjmc.png" alt=" " width="800" height="471"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding ARIMA and its parameters
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;ARIMA(p, d, q)&lt;/strong&gt; model combines three parts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;AR (AutoRegressive, p)&lt;/strong&gt;&lt;br&gt;
Uses past values to predict the future.&lt;br&gt;
Example: if p = 2, the current value depends on the last 2 values.&lt;br&gt;
Formula: Y_t = ϕ₁Y_{t−1} + ϕ₂Y_{t−2} + ϵ_t&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;I (Integrated, d)&lt;/strong&gt;&lt;br&gt;
Number of differences applied to make the series stationary.&lt;br&gt;
Example: d = 0 → no differencing; d = 1 → one difference applied.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;MA (Moving Average, q)&lt;/strong&gt;&lt;br&gt;
Uses past errors (residuals) for prediction.&lt;br&gt;
Example: if q = 2, the prediction depends on the last two errors.&lt;br&gt;
Formula: Y_t = θ₁ϵ_{t−1} + θ₂ϵ_{t−2} + ϵ_t&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In short:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;p = past values memory&lt;/li&gt;
&lt;li&gt;d = differencing degree&lt;/li&gt;
&lt;li&gt;q = past errors memory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example in Python&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from statsmodels.tsa.arima.model import ARIMA

# ARIMA(1,1,1)
model = ARIMA(series, order=(1,1,1))
fit = model.fit()

print(fit.summary())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Typical output:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AR (p) → past values effect&lt;/li&gt;
&lt;li&gt;I (d) → differencing applied&lt;/li&gt;
&lt;li&gt;MA (q) → past errors effect&lt;/li&gt;
&lt;li&gt;AIC/BIC → model quality (lower = better)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Choosing the best parameters (p,d,q)
&lt;/h3&gt;

&lt;p&gt;One of the main challenges with ARIMA is selecting the right p, d, q.&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Choosing p and q with ACF &amp;amp; PACF&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ACF → helps to choose q (MA part).&lt;/li&gt;
&lt;li&gt;PACF → helps to choose p (AR part).
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0vvxpy76j6yapb9uz8nj.png" alt=" " width="579" height="424"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Simple rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PACF cutoff → good candidate for p.&lt;/li&gt;
&lt;li&gt;ACF cutoff → good candidate for q.&lt;/li&gt;
&lt;/ul&gt;
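&lt;p&gt;To build intuition without plotting, pandas can compute individual autocorrelations directly via &lt;code&gt;Series.autocorr&lt;/code&gt;. On a synthetic AR(1) series (illustrative, not the Nile data), the ACF decays gradually instead of cutting off, which is the classic AR signature:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Synthetic AR(1) series with phi = 0.8 (illustrative)
rng = np.random.default_rng(42)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.8 * y[t - 1] + rng.normal()
s = pd.Series(y)

# Sample autocorrelations at lags 1-3 decay gradually (AR signature);
# a sharp PACF cutoff at lag 1 would then point to p = 1
acfs = [round(s.autocorr(lag=lag), 2) for lag in (1, 2, 3)]
print(acfs)
```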

</description>
      <category>beginners</category>
      <category>python</category>
      <category>programming</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>How Time Series Reveal the Future: An Introduction to the ARIMA Model</title>
      <dc:creator>Mubarak Mohamed</dc:creator>
      <pubDate>Wed, 27 Aug 2025 12:12:13 +0000</pubDate>
      <link>https://dev.to/moubarakmohame4/how-time-series-reveal-the-future-an-introduction-to-the-arima-model-2k81</link>
      <guid>https://dev.to/moubarakmohame4/how-time-series-reveal-the-future-an-introduction-to-the-arima-model-2k81</guid>
      <description>&lt;p&gt;Imagine you manage a supermarket. Every Monday, you must decide how much milk, rice, soap, and fruit to order for the week. Too little? Stockouts and unhappy customers. Too much? Excess inventory, waste, and unnecessary costs.&lt;br&gt;
So, how can you predict tomorrow’s demand from yesterday’s purchases?&lt;/p&gt;

&lt;p&gt;That’s the realm of time series—data ordered over time (day by day, month by month). And among the most widely used methods to forecast the future, there’s ARIMA — AutoRegressive Integrated Moving Average.&lt;br&gt;
ARIMA is popular because it’s both interpretable and effective across many real-world domains: sales, weather, energy, healthcare, finance, and more.&lt;/p&gt;

&lt;p&gt;In this opening article, you will:&lt;br&gt;
learn what a time series is and its components (trend, seasonality, noise);&lt;br&gt;
grasp the intuition behind ARIMA (AR, I, MA) without heavy math;&lt;br&gt;
plot a first series in Python to visually detect these patterns.&lt;/p&gt;

&lt;p&gt;Ready? Let’s take it step by step. 👇&lt;/p&gt;
&lt;h2&gt;
  
  
  Transition
&lt;/h2&gt;

&lt;p&gt;This article is the first episode of the series &lt;strong&gt;“Mastering ARIMA for Time Series Analysis and Forecasting.”&lt;/strong&gt;&lt;br&gt;
The goal is clear: to guide you step by step, from the basics to practical applications on real-world projects.&lt;/p&gt;

&lt;p&gt;In every article, you will find:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a simplified theoretical section (to understand without heavy math),&lt;/li&gt;
&lt;li&gt;a hands-on Python example (to directly work with data),&lt;/li&gt;
&lt;li&gt;a small practical project (to apply your knowledge to a real case).
This way, you’ll progress in a &lt;strong&gt;logical, gradual, and practical&lt;/strong&gt; manner.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Simplified Theory (What is a time series? + ARIMA intuition)
&lt;/h2&gt;

&lt;p&gt;What is a time series?&lt;/p&gt;

&lt;p&gt;A time series is a sequence of data collected at regular intervals.&lt;br&gt;
Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;daily sales in a supermarket,&lt;/li&gt;
&lt;li&gt;hourly temperature,&lt;/li&gt;
&lt;li&gt;stock prices every minute,&lt;/li&gt;
&lt;li&gt;monthly internet subscriptions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Three main components describe a time series:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Trend:&lt;/strong&gt; the long-term overall direction.
Example: gradual increase in internet subscribers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Seasonality:&lt;/strong&gt; recurring, regular fluctuations.
Example: ice cream sales peaking every summer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Noise:&lt;/strong&gt; unpredictable, random variations.
Example: a sudden sales spike due to an unexpected event.&lt;/li&gt;
&lt;/ul&gt;
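&lt;p&gt;These three components can be made concrete with a tiny synthetic series (all numbers are invented for illustration):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Four years of monthly data = trend + seasonality + noise
idx = pd.date_range("2020-01-01", periods=48, freq="MS")
rng = np.random.default_rng(0)
trend = np.linspace(100.0, 200.0, 48)                      # long-term direction
seasonality = 10.0 * np.sin(2 * np.pi * (idx.month / 12))  # yearly cycle
noise = rng.normal(scale=3.0, size=48)                     # random variation
series = pd.Series(trend + seasonality + noise, index=idx)
print(series.round(1).head(3))
```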

&lt;p&gt;&lt;strong&gt;The intuition behind ARIMA&lt;/strong&gt;&lt;br&gt;
The &lt;strong&gt;ARIMA&lt;/strong&gt; model combines three simple yet powerful ideas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AR (Auto-Regressive):&lt;/strong&gt; future values depend on past values.
Example: today’s sales are partly influenced by yesterday’s sales.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I (Integrated)&lt;/strong&gt;: to make the series more stable, we remove trends through differencing (computing the changes between periods).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MA (Moving Average):&lt;/strong&gt; future values adjust based on past forecast errors.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 In short:&lt;br&gt;
&lt;strong&gt;ARIMA = memory of the past + trend stabilization + error correction.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Hands-on Python
&lt;/h2&gt;

&lt;p&gt;Before diving into ARIMA, let’s take the first step: visualizing a time series.&lt;br&gt;
We’ll use the famous &lt;a href="https://www.kaggle.com/datasets/rakannimer/air-passengers/data" rel="noopener noreferrer"&gt;AirPassengers&lt;/a&gt; dataset (monthly airline passengers from 1949 to 1960).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import matplotlib.pyplot as plt

# Load AirPassengers dataset
url = "dataset/airline-passengers.csv"
data = pd.read_csv(url, parse_dates=['Month'], index_col='Month')

# Preview first rows
print(data.head())

# Visualization
plt.figure(figsize=(10,5))
plt.plot(data, label='Number of passengers')
plt.title("AirPassengers - Monthly Airline Passengers (1949-1960)")
plt.xlabel("Date")
plt.ylabel("Passengers")
plt.legend()
plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8s9tewu82j5dmtav4086.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8s9tewu82j5dmtav4086.png" alt=" " width="800" height="421"&gt;&lt;/a&gt;&lt;br&gt;
Expected result:&lt;/p&gt;

&lt;p&gt;A curve showing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a rising trend (more passengers over the years),&lt;/li&gt;
&lt;li&gt;a yearly seasonality (summer peaks, winter drops).
This first visualization is essential: it helps us spot the patterns that ARIMA will later model.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Real-world use cases
&lt;/h2&gt;

&lt;p&gt;Time series models like &lt;strong&gt;ARIMA&lt;/strong&gt; are widely used to forecast the future based on past data. Here are some concrete examples:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Finance:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Predict stock prices or market indices.&lt;/li&gt;
&lt;li&gt;Anticipate trends to make better investment decisions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Sales / Retail:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Estimate product demand to avoid shortages or excess stock.&lt;/li&gt;
&lt;li&gt;Plan inventory and promotions according to seasonality.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Public Health:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Track the progression of epidemics such as seasonal flu or COVID-19.&lt;/li&gt;
&lt;li&gt;Forecast medical resource needs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weather / Energy:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Predict temperatures, rainfall, or electricity consumption.&lt;/li&gt;
&lt;li&gt;Help companies and municipalities manage resources efficiently.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Transport / Logistics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Forecast traffic or public transport passenger numbers.&lt;/li&gt;
&lt;li&gt;Optimize schedules and resource allocation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;ARIMA&lt;/strong&gt; is not just theory: it is a &lt;strong&gt;practical tool&lt;/strong&gt; for solving real problems across almost all sectors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluation / Results
&lt;/h2&gt;

&lt;p&gt;At this stage, we haven’t applied the ARIMA model yet, but we can already &lt;strong&gt;draw important insights&lt;/strong&gt; from visual exploration:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Identifying the trend:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The AirPassengers chart clearly shows a steady increase in passengers over the years.&lt;/li&gt;
&lt;li&gt;Understanding this trend helps prepare future forecasts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Identifying seasonality:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every year, there are recurring summer peaks, typical of yearly seasonality.&lt;/li&gt;
&lt;li&gt;Seasonality must be considered for accurate predictions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Recognizing noise:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unpredictable variations appear: some months deviate from the trend or seasonality.&lt;/li&gt;
&lt;li&gt;ARIMA will later help correct past errors and reduce the impact of noise.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Exploratory analysis is the first crucial step in any time series modeling.&lt;br&gt;
Before modeling with ARIMA, it’s essential to understand the series’ structure: trend, seasonality, and noise.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion and recap
&lt;/h2&gt;

&lt;p&gt;In this introductory article, you’ve learned the basics to &lt;strong&gt;understand and explore time series:&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;What a time series is:&lt;/strong&gt; data collected at regular intervals with trend, seasonality, and noise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The intuition behind ARIMA:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AR (Auto-Regressive): memory of the past,&lt;/li&gt;
&lt;li&gt;I (Integrated): series stabilization,&lt;/li&gt;
&lt;li&gt;MA (Moving Average): error correction.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Python visualization:&lt;/strong&gt; using the AirPassengers dataset to observe trend and seasonality.&lt;br&gt;
&lt;strong&gt;Real-world use cases:&lt;/strong&gt; finance, sales, health, weather, transport… ARIMA is everywhere forecasting is needed.&lt;br&gt;
&lt;strong&gt;Exploratory evaluation:&lt;/strong&gt; visualization already helps understand patterns and prepares for modeling.&lt;/p&gt;

&lt;p&gt;This first step is crucial: understanding your data comes before modeling. The quality of your forecasts depends directly on this understanding.&lt;/p&gt;

&lt;p&gt;Now that we’ve &lt;strong&gt;explored and understood the time series&lt;/strong&gt;, it’s time to move to the next step: preparing data for ARIMA.&lt;/p&gt;

&lt;p&gt;In the next article, we’ll cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how to check if a series is &lt;strong&gt;stationary&lt;/strong&gt;,&lt;/li&gt;
&lt;li&gt;how to apply &lt;strong&gt;differencing&lt;/strong&gt; to make a series stationary,&lt;/li&gt;
&lt;li&gt;which statistical tests to use: &lt;strong&gt;ADF (Augmented Dickey-Fuller)&lt;/strong&gt; and &lt;strong&gt;KPSS&lt;/strong&gt;,&lt;/li&gt;
&lt;li&gt;and how to interpret these tests to decide the parameters of our ARIMA model.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This step is crucial: ARIMA works &lt;strong&gt;best on stationary series&lt;/strong&gt;, and proper data preparation ensures more accurate forecasts.&lt;/p&gt;

&lt;p&gt;See you in the next article to &lt;strong&gt;move from exploration to modeling!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>python</category>
      <category>deeplearning</category>
      <category>ai</category>
    </item>
    <item>
      <title>Building an ETL Pipeline with Python Using CoinGecko API</title>
      <dc:creator>Mubarak Mohamed</dc:creator>
      <pubDate>Thu, 20 Feb 2025 08:20:02 +0000</pubDate>
      <link>https://dev.to/moubarakmohame4/building-an-etl-pipeline-with-python-using-coingecko-api-7oc</link>
      <guid>https://dev.to/moubarakmohame4/building-an-etl-pipeline-with-python-using-coingecko-api-7oc</guid>
      <description>&lt;p&gt;Extract, Transform, Load (ETL) is a fundamental process in data engineering used to collect data from various sources, process it, and store it in a structured format for analysis. In this tutorial, we will build a simple ETL pipeline using Python and the CoinGecko API to extract cryptocurrency market data, transform it into a structured format, and load it into a SQLite file for further use.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;To follow along, you need to have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python installed (&amp;gt;=3.7)&lt;/li&gt;
&lt;li&gt;requests and pandas libraries installed&lt;/li&gt;
&lt;li&gt;A CoinGecko API key (optional but recommended for higher request limits)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can install the required libraries using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 1: Extract Data from CoinGecko API
&lt;/h2&gt;

&lt;p&gt;The extraction phase involves fetching cryptocurrency market data from the CoinGecko API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests
import pandas as pd

def extract_data_from_api():
    url = "https://api.coingecko.com/api/v3/coins/markets?vs_currency=usd"

    headers = {
        "accept": "application/json",
        "x-cg-demo-api-key": "YOUR_API_KEY_HERE",  # Replace with your API key
    }

    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        data = response.json()
        return pd.json_normalize(data)  # Convert JSON response to DataFrame
    else:
        raise Exception("Error fetching data from API")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2: Transform the Data
&lt;/h2&gt;

&lt;p&gt;Transformation is necessary to clean and structure the data before loading it. We'll select relevant columns and rename them for clarity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def transform_data(df):
    df_transformed = df[["id", "symbol", "name", "current_price", "market_cap", "total_volume", "price_change_percentage_24h", "last_updated"]].copy()
    df_transformed.columns = ["id", "symbol", "name", "price_usd", "market_cap_usd", "volume_24h_usd", "price_change_24h_percent", "date"]
    df_transformed["date"] = pd.to_datetime(df_transformed["date"]).dt.date
    df_transformed = df_transformed.fillna(0)
    return df_transformed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
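&lt;p&gt;To see the transform in isolation, it can be exercised on a tiny hand-made row (the sample values are invented; the column names mirror the CoinGecko &lt;code&gt;/coins/markets&lt;/code&gt; response used above):&lt;/p&gt;

```python
import pandas as pd

# transform_data from the article, applied to a hand-made one-row sample
def transform_data(df):
    df_t = df[["id", "symbol", "name", "current_price", "market_cap",
               "total_volume", "price_change_percentage_24h", "last_updated"]].copy()
    df_t.columns = ["id", "symbol", "name", "price_usd", "market_cap_usd",
                    "volume_24h_usd", "price_change_24h_percent", "date"]
    df_t["date"] = pd.to_datetime(df_t["date"]).dt.date
    return df_t.fillna(0)

sample = pd.DataFrame([{
    "id": "bitcoin", "symbol": "btc", "name": "Bitcoin",
    "current_price": 97000.0, "market_cap": 1.9e12, "total_volume": 3.2e10,
    "price_change_percentage_24h": None,       # missing value on purpose
    "last_updated": "2025-02-20T08:00:00.000Z",
}])
out = transform_data(sample)
print(out.loc[0, "price_change_24h_percent"])  # → 0.0 (NaN filled with 0)
```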



&lt;h2&gt;
  
  
  Step 3: Load Data into SQLite Database
&lt;/h2&gt;

&lt;p&gt;The final step is to store the processed data into an SQLite database for further analysis.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import sqlite3

def load_data_to_sqlite(df, db_file, table_name):
    # Connect to the database
    conn = sqlite3.connect(db_file)
    cursor = conn.cursor()

    # Create the table if it does not exist
    cursor.execute(f"""
        CREATE TABLE IF NOT EXISTS {table_name} (
            id TEXT PRIMARY KEY,
            symbol TEXT,
            name TEXT,
            price_usd REAL,
            market_cap_usd REAL,
            volume_24h_usd REAL,
            price_change_24h_percent REAL,
            date TEXT
        )
    """)

    # Load the data into the table. Note: if_exists='replace' drops and
    # recreates the table, so the schema above mainly documents the layout.
    df.to_sql(table_name, conn, if_exists='replace', index=False)

    # Commit and close the connection
    conn.commit()
    conn.close()
    print(f"Data successfully loaded into table '{table_name}' in database {db_file}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 4: Putting It All Together
&lt;/h2&gt;

&lt;p&gt;Now, we can orchestrate the entire ETL process using a main function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def etl_pipeline():
    # Extraction 
    df_crypto = extract_data_from_api()
    print("Extraction successful")

    # Transformation
    df_transformed = transform_data(df_crypto)
    print("Transformation successful")

    # Loading
    db_file = "database.db"
    table_name = "crypto_data"
    load_data_to_sqlite(df_transformed, db_file, table_name)
    print("Loading successful")

if __name__ == "__main__":
    etl_pipeline()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tutorial demonstrated how to build a simple ETL pipeline in Python using the CoinGecko API. We covered extracting data from the API, transforming it into a structured format, and loading it into an SQLite database. This pipeline can be extended to store data in a cloud database, automate execution using cron jobs, or integrate with data visualization tools.&lt;/p&gt;

&lt;p&gt;Happy coding!&lt;/p&gt;

</description>
      <category>data</category>
      <category>etl</category>
      <category>cryptocurrency</category>
      <category>programming</category>
    </item>
    <item>
      <title>Understanding Python Terminology: Module, Package, Library, and Framework</title>
      <dc:creator>Mubarak Mohamed</dc:creator>
      <pubDate>Thu, 26 Dec 2024 11:50:26 +0000</pubDate>
      <link>https://dev.to/moubarakmohame4/understanding-python-terminology-module-package-library-and-framework-424f</link>
      <guid>https://dev.to/moubarakmohame4/understanding-python-terminology-module-package-library-and-framework-424f</guid>
      <description>&lt;p&gt;When starting to learn a programming language, one of the first challenges is getting familiar with the terminology. In Python, terms like &lt;strong&gt;module&lt;/strong&gt;, &lt;strong&gt;package&lt;/strong&gt;, &lt;strong&gt;library&lt;/strong&gt;, and &lt;strong&gt;framework&lt;/strong&gt; are commonly used, but their distinctions aren’t always clear to beginners. This article aims to explain these concepts clearly and highlight their differences with examples.&lt;/p&gt;




&lt;h3&gt;
  
  
  1. The Module
&lt;/h3&gt;

&lt;p&gt;A module in Python is simply a file that contains Python code. This file has a &lt;code&gt;.py&lt;/code&gt; extension and can include functions, classes, variables, and executable code. Modules allow you to reuse code by importing it into other files.&lt;/p&gt;

&lt;h4&gt;
  
  
  Example:
&lt;/h4&gt;

&lt;p&gt;Let’s create a file &lt;code&gt;math_utils.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# math_utils.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;subtract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This module can then be imported and used in another script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;math_utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;add&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Outputs 8
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  2. The Package
&lt;/h3&gt;

&lt;p&gt;A package is a folder containing multiple modules and a special file named &lt;code&gt;__init__.py&lt;/code&gt;. This file allows Python to treat the folder as a package. Packages are used to organize code by grouping related modules.&lt;/p&gt;

&lt;h4&gt;
  
  
  Example:
&lt;/h4&gt;

&lt;p&gt;Package structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;math_tools/
    __init__.py
    algebra.py
    geometry.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;algebra.py&lt;/code&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;solve_linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;geometry.py&lt;/code&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;area_circle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;radius&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pi&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;pi&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;radius&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Usage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;math_tools.algebra&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;solve_linear&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;math_tools.geometry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;area_circle&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;solve_linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="c1"&gt;# Outputs 2.0
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;area_circle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;      &lt;span class="c1"&gt;# Outputs 28.27
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  3. The Library
&lt;/h3&gt;

&lt;p&gt;The term &lt;strong&gt;library&lt;/strong&gt; is often used to describe a collection of ready-to-use packages or modules. A library can contain several packages serving various purposes.&lt;/p&gt;

&lt;p&gt;For example, &lt;strong&gt;Requests&lt;/strong&gt; is a popular Python library for making HTTP requests. It includes several internal modules and packages working together to provide a user-friendly interface.&lt;/p&gt;

&lt;h4&gt;
  
  
  Example:
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://api.example.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: Some people use the terms &lt;em&gt;library&lt;/em&gt; and &lt;em&gt;package&lt;/em&gt; interchangeably, and this confusion is understandable. The difference often lies in the scale and context of use.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. The Framework
&lt;/h3&gt;

&lt;p&gt;A framework is a structured library designed with a specific purpose. Unlike a simple library that provides tools, a framework enforces an architecture and a way of working. In Python, frameworks are commonly used for web development, data analysis, or artificial intelligence.&lt;/p&gt;

&lt;h4&gt;
  
  
  Example: Flask (Web Framework)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;flask&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Flask&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Flask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@app.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;home&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Welcome to my website!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;debug&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Flask imposes a minimalist structure but provides essential tools to develop a web application.&lt;/p&gt;




&lt;h3&gt;
  
  
  Summary of Differences
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Module&lt;/td&gt;
&lt;td&gt;Single Python file containing code.&lt;/td&gt;
&lt;td&gt;&lt;code&gt;math_utils.py&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Package&lt;/td&gt;
&lt;td&gt;Folder containing multiple modules and an &lt;code&gt;__init__.py&lt;/code&gt; file.&lt;/td&gt;
&lt;td&gt;&lt;code&gt;math_tools/&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Library&lt;/td&gt;
&lt;td&gt;Collection of modules or packages for various needs.&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;Requests&lt;/code&gt;, &lt;code&gt;NumPy&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Framework&lt;/td&gt;
&lt;td&gt;Structured library with an enforced architecture.&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;Flask&lt;/code&gt;, &lt;code&gt;Django&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;p&gt;These distinctions are essential to better understand the Python ecosystem and organize your projects effectively. However, the boundary between some terms, such as &lt;em&gt;library&lt;/em&gt; and &lt;em&gt;package&lt;/em&gt;, can be blurry, and their usage may vary from person to person.&lt;/p&gt;

&lt;p&gt;I am open to discussions and debates if you have a different perspective or points to add. Feel free to share your ideas or ask questions!&lt;/p&gt;

</description>
      <category>python</category>
      <category>module</category>
      <category>pip</category>
      <category>django</category>
    </item>
    <item>
      <title>10 Statistical Terms to Know as a Data Analyst</title>
      <dc:creator>Mubarak Mohamed</dc:creator>
      <pubDate>Sat, 21 Dec 2024 06:57:13 +0000</pubDate>
      <link>https://dev.to/moubarakmohame4/10-statistical-terms-to-know-as-a-data-analyst-15hh</link>
      <guid>https://dev.to/moubarakmohame4/10-statistical-terms-to-know-as-a-data-analyst-15hh</guid>
      <description>&lt;p&gt;As a data analyst, mastering statistical concepts is essential to explore, interpret, and effectively present data. Here are 10 key terms explained concisely with practical examples to illustrate their utility.&lt;/p&gt;




&lt;h3&gt;
  
  
  1. &lt;strong&gt;Mean (or Arithmetic Mean)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The mean is calculated by dividing the sum of all values by the total number of values. It represents a central tendency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Suppose the daily sales of a product are: 100, 120, 140, 160, 180. The mean is:&lt;br&gt;
Mean = (100 + 120 + 140 + 160 + 180)/5 = 140.&lt;br&gt;
&lt;strong&gt;Utility:&lt;/strong&gt; The mean helps determine a representative value, for example, the average revenue per customer in a business.&lt;/p&gt;
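&lt;p&gt;The sales example can be checked in a couple of lines:&lt;/p&gt;

```python
daily_sales = [100, 120, 140, 160, 180]
mean = sum(daily_sales) / len(daily_sales)  # sum of values over their count
print(mean)  # 140.0
```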




&lt;h3&gt;
  
  
  2. &lt;strong&gt;Median&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The median is the middle value of a sorted dataset. If the number of values is even, it is the average of the two middle values.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; For salaries of €1500, €2000, €2500, €3000, €8000, the median is €2500. It is not influenced by the extreme value of €8000.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Utility:&lt;/strong&gt; The median is useful for analyzing skewed data, such as salaries, often biased by high values.&lt;/p&gt;
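&lt;p&gt;Checking the salary example with the standard library's &lt;code&gt;statistics&lt;/code&gt; module makes the outlier effect visible:&lt;/p&gt;

```python
import statistics

salaries = [1500, 2000, 2500, 3000, 8000]
med = statistics.median(salaries)
avg = statistics.mean(salaries)
print(med)  # 2500 — unaffected by the 8000 outlier
print(avg)  # 3400 — pulled upward by it
```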




&lt;h3&gt;
  
  
  3. &lt;strong&gt;Variance and Standard Deviation&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Variance:&lt;/strong&gt; The average of the squared deviations from the mean; it measures how spread out the data are.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Standard Deviation:&lt;/strong&gt; The square root of the variance, expressed in the same unit as the data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; If the usage times of a mobile app are: 10, 12, 10, 8, 15 minutes, a high standard deviation would indicate that the times vary greatly around the mean.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Utility:&lt;/strong&gt; These measures help understand performance stability, such as website loading times.&lt;/p&gt;
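&lt;p&gt;The app-usage example, computed with the &lt;code&gt;statistics&lt;/code&gt; module (&lt;code&gt;pstdev&lt;/code&gt; is the population standard deviation):&lt;/p&gt;

```python
import statistics

times = [10, 12, 10, 8, 15]  # app usage in minutes
spread = statistics.pstdev(times)  # square root of the population variance
print(round(spread, 2))  # 2.37 minutes around the mean of 11
```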




&lt;h3&gt;
  
  
  4. &lt;strong&gt;Normal Distribution&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A symmetrical bell-shaped distribution around the mean.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Human heights often follow a normal distribution: most people have a height close to the mean, with fewer people being very tall or very short.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Utility:&lt;/strong&gt; Useful for predicting typical behaviors and applying statistical tests like the t-test.&lt;/p&gt;
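&lt;p&gt;A well-known consequence is the 68% rule: about 68% of values fall within one standard deviation of the mean. The standard library's &lt;code&gt;NormalDist&lt;/code&gt; lets us verify this directly:&lt;/p&gt;

```python
from statistics import NormalDist

nd = NormalDist(mu=0, sigma=1)  # standard normal distribution
within_1sd = nd.cdf(1) - nd.cdf(-1)  # probability mass between -1 and +1
print(round(within_1sd, 4))  # 0.6827
```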




&lt;h3&gt;
  
  
  5. &lt;strong&gt;Correlation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Correlation measures the strength of the linear relationship between two variables, expressed between -1 (perfect negative correlation) and +1 (perfect positive correlation).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; A company may find a positive correlation between advertising budget and sales.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Utility:&lt;/strong&gt; Identifying potential relationships to make strategic decisions, such as optimizing marketing campaigns.&lt;/p&gt;
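&lt;p&gt;A quick sketch of the advertising example with made-up figures, using NumPy's &lt;code&gt;corrcoef&lt;/code&gt; (Pearson correlation):&lt;/p&gt;

```python
import numpy as np

ad_budget = [10, 20, 30, 40, 50]  # hypothetical monthly spend
sales = [12, 24, 33, 41, 52]      # hypothetical sales figures
r = np.corrcoef(ad_budget, sales)[0, 1]  # off-diagonal of the 2x2 matrix
print(round(r, 3))  # 0.998: a strong positive correlation
```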




&lt;h3&gt;
  
  
  6. &lt;strong&gt;Probability&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Probability assesses the chance of an event occurring, expressed between 0 (impossible) and 1 (certain).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; If an e-commerce site has 500 visitors and 50 make a purchase, the probability of conversion is:&lt;br&gt;
P(Conversion) = 50/500 = 0.1 = 10%.&lt;br&gt;
&lt;strong&gt;Utility:&lt;/strong&gt; Estimating the likelihood of success for an action, such as click-through rates for an ad campaign.&lt;/p&gt;




&lt;h3&gt;
  
  
  7. &lt;strong&gt;P-value&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The p-value is the probability of observing results at least as extreme as those obtained, assuming the null hypothesis is true.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; In an A/B test, if the p-value is less than 0.05, the null hypothesis (both versions are the same) is rejected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Utility:&lt;/strong&gt; Validates the effectiveness of a change (e.g., a design modification).&lt;/p&gt;
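&lt;p&gt;As a concrete illustration (a coin-flip example rather than the A/B test above), the one-sided p-value for observing 60 or more heads in 100 fair-coin flips can be computed exactly with the standard library:&lt;/p&gt;

```python
from math import comb

n, k = 100, 60
# P(X >= 60) under the null hypothesis of a fair coin
p_value = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
print(round(p_value, 4))
```

Since this p-value is below 0.05, we would reject the hypothesis that the coin is fair.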




&lt;h3&gt;
  
  
  8. &lt;strong&gt;Histogram&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A graph representing the distribution of a variable using value ranges (bars).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; A histogram can show the number of users by age range (20-30 years, 30-40 years, etc.).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Utility:&lt;/strong&gt; Quickly visualize data distribution to identify trends or anomalies.&lt;/p&gt;
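&lt;p&gt;A text-mode sketch of the idea, bucketing made-up user ages into decades with &lt;code&gt;Counter&lt;/code&gt;:&lt;/p&gt;

```python
from collections import Counter

ages = [22, 25, 31, 37, 45, 28, 33, 52, 24, 39]
bins = Counter((age // 10) * 10 for age in ages)  # bucket into 10-year ranges
for start in sorted(bins):
    print(f"{start}-{start + 9}: {'#' * bins[start]}")
```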




&lt;h3&gt;
  
  
  9. &lt;strong&gt;Binomial Distribution&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Models the number of successes in a series of independent trials with two possible outcomes (success/failure).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; If a product has a 20% chance of being defective, the binomial distribution can predict how many out of 100 will be defective.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Utility:&lt;/strong&gt; Predict outcomes in repetitive processes, such as quality tests.&lt;/p&gt;
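&lt;p&gt;The defective-product example follows directly from the binomial formula P(X = k) = C(n, k) p^k (1-p)^(n-k):&lt;/p&gt;

```python
from math import comb

n, p = 100, 0.2
expected = n * p  # expected number of defective items out of 100
p_exactly_20 = comb(n, 20) * p**20 * (1 - p)**80  # binomial pmf at k = 20
print(expected)           # 20.0
print(round(p_exactly_20, 4))
```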




&lt;h3&gt;
  
  
  10. &lt;strong&gt;Hypothesis Testing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A statistical procedure for deciding whether sample data provide enough evidence to reject a hypothesis about a population.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; A company tests if a new interface increases the conversion rate. The null hypothesis is: "The new interface does not improve the conversion rate."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Utility:&lt;/strong&gt; Enables data-driven decision-making while minimizing bias.&lt;/p&gt;




&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;These 10 statistical terms are fundamental for a data analyst. Mastering these concepts allows for effective understanding and communication of analysis results, facilitating data-driven decision-making.&lt;/p&gt;

</description>
      <category>data</category>
      <category>datascience</category>
      <category>statistics</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Build Your Own Artificial Neuron: A Practical Guide for AI Beginners</title>
      <dc:creator>Mubarak Mohamed</dc:creator>
      <pubDate>Wed, 24 Jul 2024 21:50:50 +0000</pubDate>
      <link>https://dev.to/moubarakmohame4/build-your-own-artificial-neuron-a-practical-guide-for-ai-beginners-2lg7</link>
      <guid>https://dev.to/moubarakmohame4/build-your-own-artificial-neuron-a-practical-guide-for-ai-beginners-2lg7</guid>
      <description>&lt;p&gt;Artificial intelligence (AI) is ubiquitous in our daily lives, from product recommendations on e-commerce websites to virtual assistants on our smartphones. But behind these sophisticated technologies lies a fundamental structure: the artificial neuron. Understanding and developing an artificial neuron is a crucial step for anyone looking to dive into the fascinating world of AI. In this article, we will guide you step-by-step through the process of creating your own artificial neuron, breaking down complex concepts into simple terms and providing concrete examples. Whether you're a curious beginner or a technology enthusiast, this practical guide will open the doors to a new dimension of innovation. Get ready to transform your understanding of AI and discover the limitless potential of this rapidly growing field.&lt;/p&gt;

&lt;p&gt;To develop our artificial neuron program, we will start with a dataset containing 100 rows and 2 columns. Each row can be thought of as a plant described by two features: the width and length of its leaves. Our goal is to train the program to distinguish toxic from non-toxic plants using this data. To achieve this, we will follow these steps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe5stklvxtcnvmyyiktxu.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe5stklvxtcnvmyyiktxu.PNG" alt="Image description" width="800" height="543"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 1: Data Acquisition (X, y)&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;X:&lt;/strong&gt; The input data&lt;br&gt;
This is the raw information that the model will process.&lt;br&gt;
Example: For an image recognition model, X could be an array of pixels representing an image. For a house price prediction model, X could include variables such as area, number of rooms, location, etc.&lt;br&gt;
&lt;strong&gt;y:&lt;/strong&gt; The labels&lt;br&gt;
These are the correct answers associated with the input data.&lt;br&gt;
Example: For image recognition, y would be the digit represented in the image. For house price prediction, y would be the actual price of the house.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X, y = make_blobs(n_samples=100, n_features=2, centers=2, random_state=0)
y = y.reshape((y.shape[0], 1))

print('dimensions of X:', X.shape)
print('dimensions of y:', y.shape)

plt.scatter(X[:,0], X[:, 1], c=y, cmap='summer')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0yw3iagttnhrhtkp2zi4.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0yw3iagttnhrhtkp2zi4.PNG" alt="Image description" width="395" height="306"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Initialization&lt;/strong&gt;&lt;br&gt;
initialisation(X):&lt;br&gt;
This function randomly initializes the parameters W (one weight per input feature) and the bias b.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def initialisation(X):
    W = np.random.randn(X.shape[1], 1)
    b = np.random.randn(1)
    return (W, b)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3: Model Construction&lt;/strong&gt;&lt;br&gt;
Model(X, W, b):&lt;br&gt;
The model is a mathematical function that takes the input data X and the model parameters (W and b) to produce a prediction.&lt;br&gt;
W: Weight matrix&lt;br&gt;
Determines the relative importance of each input feature.&lt;br&gt;
b: Bias vector&lt;br&gt;
Allows the model output to be adjusted independently of the inputs.&lt;br&gt;
Activation function:&lt;br&gt;
Transforms the linear output of the model into a non-linear output, allowing for modeling complex relationships.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def model(X, W, b):
    Z = X.dot(W) + b
    A = 1 / (1 + np.exp(-Z))
    return A
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
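&lt;p&gt;To make the model concrete, here is a hand-computed pass through it for a single made-up plant with two features and illustrative weights:&lt;/p&gt;

```python
import numpy as np

W = np.array([[1.0], [-1.0]])  # one weight per feature (illustrative values)
b = np.array([0.0])
X = np.array([[2.0, 1.0]])     # one sample with two features

Z = X.dot(W) + b            # linear part: 2*1 + 1*(-1) = 1
A = 1 / (1 + np.exp(-Z))    # sigmoid squashes Z into (0, 1)
print(round(float(A[0, 0]), 3))  # 0.731: the neuron's "probability" output
```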



&lt;p&gt;&lt;strong&gt;Step 4: Error Calculation&lt;/strong&gt;&lt;br&gt;
Cost(A, y):&lt;br&gt;
The cost function measures the discrepancy between the model's predictions (A) and the true labels (y).&lt;br&gt;
Examples of cost functions:&lt;br&gt;
Mean Squared Error (MSE): Used for regression problems.&lt;br&gt;
Cross-entropy: Used for classification problems.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def log_loss(A, y):
    return 1 / len(y) * np.sum(-y * np.log(A) - (1 - y) * np.log(1 - A))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
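&lt;p&gt;A quick numeric sanity check of the cost function: confident, correct predictions should give a loss close to zero (the values below are illustrative):&lt;/p&gt;

```python
import numpy as np

A = np.array([[0.9], [0.1]])  # confident, correct predictions
y = np.array([[1], [0]])      # true labels
loss = 1 / len(y) * np.sum(-y * np.log(A) - (1 - y) * np.log(1 - A))
print(round(loss, 4))  # 0.1054, i.e. -ln(0.9): small loss for good predictions
```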



&lt;p&gt;&lt;strong&gt;Step 5: Model Optimization&lt;/strong&gt;&lt;br&gt;
Gradients(A, X, y):&lt;br&gt;
The gradients indicate the direction in which the parameters W and b should be modified to minimize the cost function.&lt;br&gt;
Update(W, b, dW, db):&lt;br&gt;
The parameters are iteratively updated by following the opposite direction of the gradient.&lt;br&gt;
Optimization algorithms:&lt;br&gt;
Stochastic Gradient Descent (SGD): Updates the parameters after each individual training example.&lt;br&gt;
Batch Gradient Descent: Updates the parameters once per pass over the entire training set (the training loop below uses this approach).&lt;br&gt;
Mini-batch Gradient Descent: Updates on small batches of examples, a compromise between the two.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def gradients(A, X, y):
    dW = 1 / len(y) * np.dot(X.T, A - y)
    db = 1 / len(y) * np.sum(A - y)
    return (dW, db)

def update(dW, db, W, b, learning_rate):
    W = W - learning_rate * dW
    b = b - learning_rate * db
    return (W, b)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
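&lt;p&gt;As a sanity check that the update rule moves in the right direction, this sketch runs one batch-gradient step on a tiny made-up dataset and verifies that the log loss decreases:&lt;/p&gt;

```python
import numpy as np

X = np.array([[0.0], [1.0], [2.0], [3.0]])  # one feature, four samples
y = np.array([[0], [0], [1], [1]])
W = np.zeros((1, 1))
b = np.zeros(1)

def model(X, W, b):
    return 1 / (1 + np.exp(-(X.dot(W) + b)))

def log_loss(A, y):
    return 1 / len(y) * np.sum(-y * np.log(A) - (1 - y) * np.log(1 - A))

A = model(X, W, b)
before = log_loss(A, y)
dW = 1 / len(y) * np.dot(X.T, A - y)   # gradients as in the article
db = 1 / len(y) * np.sum(A - y)
W, b = W - 0.1 * dW, b - 0.1 * db      # one update step, learning_rate = 0.1
after = log_loss(model(X, W, b), y)
print(bool(before > after))  # True: the step lowered the loss
```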



&lt;p&gt;&lt;strong&gt;Step 6: Model Evaluation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.metrics import accuracy_score

def predict(X, W, b):
    A = model(X, W, b)
    return A &amp;gt;= 0.5

def artificial_neuron(X, y, learning_rate = 0.1, n_iter = 100):
    # initialisation W, b
    W, b = initialisation(X)

    Loss = []

    for i in range(n_iter):
        A = model(X, W, b)
        Loss.append(log_loss(A, y))
        dW, db = gradients(A, X, y)
        W, b = update(dW, db, W, b, learning_rate)

    y_pred = predict(X, W, b)
    print(accuracy_score(y, y_pred))

    plt.plot(Loss)
    plt.show()
    return (W, b)  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;W, b = artificial_neuron(X, y)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6f2yqpnh7wjkn27gw5rs.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6f2yqpnh7wjkn27gw5rs.PNG" alt="Image description" width="414" height="290"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision boundary&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fig, ax = plt.subplots(figsize=(9, 6))
ax.scatter(X[:,0], X[:, 1], c=y, cmap='summer')

x1 = np.linspace(-1, 4, 100)
x2 = ( - W[0] * x1 - b) / W[1]

ax.plot(x1, x2, c='orange', lw=3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5o0pwd6w6bdelcigvme7.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5o0pwd6w6bdelcigvme7.PNG" alt="Image description" width="558" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>ai</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Time Series in Data Science: Analysis of Bitcoin and Ethereum</title>
      <dc:creator>Mubarak Mohamed</dc:creator>
      <pubDate>Fri, 19 Jul 2024 11:14:46 +0000</pubDate>
      <link>https://dev.to/moubarakmohame4/time-series-in-data-science-analysis-of-bitcoin-and-ethereum-51n3</link>
      <guid>https://dev.to/moubarakmohame4/time-series-in-data-science-analysis-of-bitcoin-and-ethereum-51n3</guid>
      <description>&lt;p&gt;Time series play a crucial role in Data Science, especially when analyzing financial data. The price variations of cryptocurrencies like Bitcoin and Ethereum offer an excellent opportunity to explore time series. In this article, we will analyze the price variations of Bitcoin and Ethereum in euros, using datasets ranging from 2012 to 2019 for Bitcoin and from 2015 to 2019 for Ethereum. We will also illustrate the use of some basic time series techniques with concrete examples and practical recommendations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Importing Libraries and Loading Data&lt;/strong&gt;&lt;br&gt;
Before diving into the analysis, we need to import the necessary libraries and load the datasets.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Loading Bitcoin data
btc = pd.read_csv("BTC-EUR.csv", index_col='Date', parse_dates=True)
btc.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkqrwyjm2mtpnm3btkc6h.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkqrwyjm2mtpnm3btkc6h.PNG" alt="Image description" width="346" height="169"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Exploration&lt;/strong&gt;&lt;br&gt;
Let's take a look at the first few rows of the data to get an idea of its structure.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;btc.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkpg6shkjdgw4oh9sedir.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkpg6shkjdgw4oh9sedir.PNG" alt="Image description" width="346" height="169"&gt;&lt;/a&gt;&lt;br&gt;
This allows us to verify that the data has been correctly loaded and that date indexing has been successfully applied.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weekly Variation Analysis&lt;/strong&gt;&lt;br&gt;
Now, let's analyze the weekly variations of Bitcoin's closing prices.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;btc['Close'].resample('W').agg(['mean', 'std'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wq826d50ceqc13liyid.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wq826d50ceqc13liyid.PNG" alt="Image description" width="251" height="326"&gt;&lt;/a&gt;&lt;br&gt;
Recommendation: Resampling is a powerful technique to summarize data at different frequencies (daily, weekly, monthly, etc.). It helps to reveal hidden trends and patterns.&lt;/p&gt;
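&lt;p&gt;If you don't have the BTC-EUR.csv file at hand, the resampling idea can be reproduced on a synthetic series (the dates and values below are made up for illustration):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Fourteen synthetic daily closes; 'W' groups them into weeks ending on Sunday.
idx = pd.date_range("2019-01-01", periods=14, freq="D")
close = pd.Series(np.arange(14.0), index=idx)
weekly = close.resample("W").agg(["mean", "std"])
print(weekly["mean"].iloc[0])  # 2.5: mean of the six days up to Sunday Jan 6
```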

&lt;p&gt;&lt;strong&gt;Data Visualization&lt;/strong&gt;&lt;br&gt;
Visualizing data is crucial to understand trends and anomalies. Let's start by plotting Bitcoin's closing prices.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;btc['Close'].plot(figsize=(9, 6))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvh1dmybi088fkwai1uut.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvh1dmybi088fkwai1uut.PNG" alt="Image description" width="726" height="479"&gt;&lt;/a&gt;&lt;br&gt;
The first time I plotted financial data, I was surprised at how much detail can be hidden in a simple curve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Specific Period Data Analysis&lt;/strong&gt;&lt;br&gt;
We can also focus on specific periods for more detailed analysis.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;btc['2019']['Close'].plot(figsize=(9, 6))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fakj3q41v0n6yojnsqa8y.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fakj3q41v0n6yojnsqa8y.PNG" alt="Image description" width="747" height="493"&gt;&lt;/a&gt;&lt;br&gt;
And for an even shorter period:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;btc['2019-09']['Close'].plot(figsize=(9, 6))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynqy0mdhzt6gvcp1qt7c.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynqy0mdhzt6gvcp1qt7c.PNG" alt="Image description" width="639" height="472"&gt;&lt;/a&gt;&lt;/p&gt;
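&lt;p&gt;This partial-string indexing works on any &lt;code&gt;DatetimeIndex&lt;/code&gt;; here is a self-contained sketch on a synthetic series straddling August and September:&lt;/p&gt;

```python
import pandas as pd

idx = pd.date_range("2019-08-25", periods=14, freq="D")
close = pd.Series(range(14), index=idx, name="Close")
september = close.loc["2019-09"]  # partial string selects the whole month
print(len(september))  # 7: Sep 1 through Sep 7
```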

&lt;p&gt;&lt;strong&gt;Comparison of Monthly and Weekly Averages&lt;/strong&gt;&lt;br&gt;
For deeper analysis, let's compare the monthly and weekly averages of closing prices for the year 2017.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.figure(figsize=(12, 9))
btc.loc['2017', 'Close'].plot()
btc.loc['2017', 'Close'].resample("M").mean().plot(label='Monthly average', lw=2, ls=':', alpha=0.8)
btc.loc['2017', 'Close'].resample("W").mean().plot(label='Weekly average', lw=2, ls='--', alpha=0.8)
plt.legend()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuvvy5ftnegwdrnd4e65i.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuvvy5ftnegwdrnd4e65i.PNG" alt="Image description" width="663" height="510"&gt;&lt;/a&gt;&lt;br&gt;
Tip: comparing averages at different frequencies can reveal seasonal trends or economic cycles that are invisible in the raw daily data.&lt;/p&gt;
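&lt;p&gt;To see concretely how the resampling frequency changes the number of points you plot, here is a small self-contained sketch on synthetic data (the &lt;code&gt;idx&lt;/code&gt; and &lt;code&gt;close&lt;/code&gt; names are illustrative stand-ins for the real &lt;code&gt;btc['Close']&lt;/code&gt; series, not part of the article's dataset):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Synthetic daily closing prices for 2017, standing in for btc['Close']
idx = pd.date_range('2017-01-01', '2017-12-31', freq='D')
close = pd.Series(100 + np.cumsum(np.random.randn(len(idx))),
                  index=idx, name='Close')

# Resampling to a coarser frequency averages away day-to-day noise:
monthly = close.resample('M').mean()  # 12 points, one per month
weekly = close.resample('W').mean()   # 53 points, one per week

print(len(close), len(weekly), len(monthly))
```

&lt;p&gt;Each resampled series has a far coarser index than the daily one, which is exactly why the dotted monthly curve in the plot above looks so much smoother than the raw prices.&lt;/p&gt;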

&lt;p&gt;&lt;strong&gt;Ethereum Analysis&lt;/strong&gt;&lt;br&gt;
Now, let's analyze the Ethereum data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;eth = pd.read_csv('ETH-EUR.csv', index_col='Date', parse_dates=True)
eth.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Merging Bitcoin and Ethereum Data&lt;/strong&gt;&lt;br&gt;
For comparative analysis, we will merge the Bitcoin and Ethereum data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;btc_eth = pd.merge(btc, eth, how='inner', on='Date', suffixes=('_btc', '_eth'))
btc_eth.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr38jx736q5waqtlpl6rk.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr38jx736q5waqtlpl6rk.PNG" alt="Image description" width="796" height="156"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Comparative Visualization of Variations&lt;/strong&gt;&lt;br&gt;
Finally, let's visualize the variations of both cryptocurrencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;btc_eth[['Close_btc', 'Close_eth']].plot(figsize=(12, 8), subplots=True)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9r5l1mt0us8wf0w1t6nr.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9r5l1mt0us8wf0w1t6nr.PNG" alt="Image description" width="752" height="502"&gt;&lt;/a&gt;&lt;br&gt;
Comparing data from different cryptocurrencies can give us insight into their relative behavior and correlation.&lt;/p&gt;
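&lt;p&gt;One way to quantify that relative behavior is to correlate daily &lt;em&gt;returns&lt;/em&gt; rather than raw prices: two trending price series are almost always correlated, while returns reveal actual day-to-day co-movement. A self-contained sketch on synthetic data (the generated &lt;code&gt;btc_close&lt;/code&gt; and &lt;code&gt;eth_close&lt;/code&gt; series are illustrative, not the article's real CSV data):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Synthetic prices standing in for the merged btc_eth DataFrame
idx = pd.date_range('2019-01-01', periods=200, freq='D')
rng = np.random.default_rng(42)
btc_close = pd.Series(8000 + np.cumsum(rng.normal(0, 50, 200)), index=idx)
# eth loosely follows btc, plus its own noise
eth_close = 150 + 0.1 * btc_close + rng.normal(0, 2, 200)
btc_eth = pd.DataFrame({'Close_btc': btc_close, 'Close_eth': eth_close})

# Correlate daily percentage returns, not raw prices
returns = btc_eth.pct_change().dropna()
corr = returns['Close_btc'].corr(returns['Close_eth'])
print(round(corr, 2))
```

&lt;p&gt;With real data, the same two lines (&lt;code&gt;pct_change()&lt;/code&gt; followed by &lt;code&gt;.corr()&lt;/code&gt;) give a single number summarizing how closely the two coins move together.&lt;/p&gt;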

&lt;p&gt;Time series analysis is an indispensable tool in Data Science, particularly for financial data. By using resampling, visualization, and comparison techniques, we can uncover trends and patterns hidden in the data. Cryptocurrencies, with their volatility and growing popularity, offer an ideal learning ground for these techniques.&lt;/p&gt;

</description>
      <category>timeseries</category>
      <category>datascience</category>
      <category>cryptocurrency</category>
      <category>python</category>
    </item>
  </channel>
</rss>
