<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Arpit Kadam</title>
    <description>The latest articles on DEV Community by Arpit Kadam (@arpitkadam).</description>
    <link>https://dev.to/arpitkadam</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2237914%2F42aa6beb-7a43-424e-95a0-be3dfcb7364b.jpeg</url>
      <title>DEV Community: Arpit Kadam</title>
      <link>https://dev.to/arpitkadam</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/arpitkadam"/>
    <language>en</language>
    <item>
      <title>🚀 6 Python Libraries to Perform EDA with One Line of Code 📊</title>
      <dc:creator>Arpit Kadam</dc:creator>
      <pubDate>Tue, 07 Jan 2025 20:25:08 +0000</pubDate>
      <link>https://dev.to/arpitkadam/6-python-libraries-to-perform-eda-with-one-line-of-code-g1d</link>
      <guid>https://dev.to/arpitkadam/6-python-libraries-to-perform-eda-with-one-line-of-code-g1d</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgnpkpzz8waawfjnmyn9y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgnpkpzz8waawfjnmyn9y.png" alt="Image description" width="300" height="168"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Author: Arpit Kadam&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Exploratory Data Analysis (EDA) is the &lt;strong&gt;foundation&lt;/strong&gt; of any successful data science project. It's where you dig into your dataset, uncover its hidden nuances, identify patterns, and understand the relationships between different variables – all before even thinking about modeling. But let’s be honest, EDA can be a &lt;em&gt;time-consuming&lt;/em&gt; endeavor. This is precisely why &lt;strong&gt;automated EDA libraries&lt;/strong&gt; are a game-changer! 🤯&lt;/p&gt;

&lt;p&gt;In this post, I'll introduce you to six powerful Python libraries that can automate the EDA process, allowing you to extract meaningful insights with just a &lt;em&gt;single line of code&lt;/em&gt;. These libraries are a fantastic starting point for any data project, and will save you time while increasing your productivity. The libraries we’ll cover are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;📊&lt;/code&gt; Pandas Profiling&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;🍭&lt;/code&gt; Sweetviz&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;📈&lt;/code&gt; Autoviz&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;🕸️&lt;/code&gt; D-Tale&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;📑&lt;/code&gt; Dataprep&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;👓&lt;/code&gt; Pandas Visual Analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'll provide a quick overview of each library, including installation instructions, usage examples, and their key features. Let's dive in! 👇&lt;/p&gt;


&lt;h3&gt;
  
  
  1. &lt;code&gt;📊&lt;/code&gt; Pandas Profiling
&lt;/h3&gt;

&lt;p&gt;Pandas Profiling is an &lt;strong&gt;open-source powerhouse&lt;/strong&gt; for automated EDA. It generates comprehensive HTML reports packed with information about your dataset, including descriptive statistics, variable properties, and correlation insights.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pypi.org/project/pandas-profiling/" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimg.shields.io%2Fpypi%2Fv%2Fpandas-profiling" alt="PyPI Version" width="78" height="20"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;pandas-profiling
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Usage&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pandas_profiling&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ProfileReport&lt;/span&gt;
&lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ProfileReport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_notebook_iframe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Features&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  ✅ Detailed dataset overview&lt;/li&gt;
&lt;li&gt;  ✅ Variable interaction and correlation analysis&lt;/li&gt;
&lt;li&gt;  ✅ Missing value identification&lt;/li&gt;
&lt;li&gt;  ✅ Visualization of variable distributions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://github.com/pandas-profiling/pandas-profiling" rel="noopener noreferrer"&gt;GitHub Repository for Pandas Profiling&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  2. &lt;code&gt;🍭&lt;/code&gt; Sweetviz
&lt;/h3&gt;

&lt;p&gt;Sweetviz excels at generating visually rich and &lt;strong&gt;interactive HTML reports&lt;/strong&gt; for your data. It shines when comparing different datasets, making it perfect for train-test analysis or before-and-after comparisons.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pypi.org/project/sweetviz/" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimg.shields.io%2Fpypi%2Fv%2Fsweetviz" alt="PyPI Version" width="78" height="20"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;sweetviz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Usage&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sweetviz&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sv&lt;/span&gt;
&lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show_html&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;report.html&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
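&lt;p&gt;The dataset-comparison feature mentioned above takes two DataFrames; as a quick sketch (the &lt;code&gt;train_df&lt;/code&gt; and &lt;code&gt;test_df&lt;/code&gt; names are placeholders for your own split):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import sweetviz as sv

# Compare two DataFrames side by side, e.g. a train/test split
report = sv.compare([train_df, "Train"], [test_df, "Test"])
report.show_html("compare.html")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;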



&lt;p&gt;&lt;strong&gt;Features&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  🎨 High-density, visually appealing visualizations&lt;/li&gt;
&lt;li&gt;  💪 Powerful dataset comparison functionality&lt;/li&gt;
&lt;li&gt;  🧮 Analysis of both categorical and numerical variables&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://github.com/fbdesignpro/sweetviz" rel="noopener noreferrer"&gt;GitHub Repository for Sweetviz&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  3. &lt;code&gt;📈&lt;/code&gt; Autoviz
&lt;/h3&gt;

&lt;p&gt;Autoviz is your go-to library when you need a wide range of visualizations to uncover hidden relationships in your data. It intelligently chooses the appropriate visualization based on the variable types, helping you explore your data efficiently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pypi.org/project/autoviz/" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimg.shields.io%2Fpypi%2Fv%2Fautoviz" alt="PyPI Version" width="92" height="20"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;autoviz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Usage&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;autoviz.AutoViz_Class&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoViz_Class&lt;/span&gt;
&lt;span class="n"&gt;autoviz&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AutoViz_Class&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nc"&gt;AutoViz&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Features&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  📉 Scatter plots for continuous variables&lt;/li&gt;
&lt;li&gt;  📊 Distribution analysis for categorical variables&lt;/li&gt;
&lt;li&gt;  🔥 Heatmaps for correlation matrices&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://github.com/AutoViML/AutoViz" rel="noopener noreferrer"&gt;GitHub Repository for Autoviz&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  4. &lt;code&gt;🕸️&lt;/code&gt; D-Tale
&lt;/h3&gt;

&lt;p&gt;D-Tale offers a unique, &lt;strong&gt;interactive, web-based interface&lt;/strong&gt; for data exploration. You can manipulate your data, create custom filters, and export the code behind your analysis all within the browser.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pypi.org/project/dtale/" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimg.shields.io%2Fpypi%2Fv%2Fdtale" alt="PyPI Version" width="86" height="20"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;dtale
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Usage&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dtale&lt;/span&gt;
&lt;span class="n"&gt;dtale&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Features&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  🖱️ Real-time data interaction within a web browser&lt;/li&gt;
&lt;li&gt;  🎛️ Custom filtering and data type highlighting&lt;/li&gt;
&lt;li&gt;  💻 Code export capabilities for every analysis step&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://github.com/man-group/dtale" rel="noopener noreferrer"&gt;GitHub Repository for D-Tale&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  5. &lt;code&gt;📑&lt;/code&gt; Dataprep
&lt;/h3&gt;

&lt;p&gt;Dataprep focuses on generating &lt;strong&gt;concise and highly readable reports&lt;/strong&gt; with a strong emphasis on data quality and summary statistics. It helps you quickly understand your data's key characteristics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pypi.org/project/dataprep/" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimg.shields.io%2Fpypi%2Fv%2Fdataprep" alt="PyPI Version" width="78" height="20"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;dataprep
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Usage&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataprep.eda&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_report&lt;/span&gt;
&lt;span class="nf"&gt;create_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;show_browser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Features&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  🌐 Interactive visualizations in a browser&lt;/li&gt;
&lt;li&gt;  🔢 Summary statistics for each variable&lt;/li&gt;
&lt;li&gt;  🔗 Correlation matrices&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://github.com/sfu-db/dataprep" rel="noopener noreferrer"&gt;GitHub Repository for Dataprep&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  6. &lt;code&gt;👓&lt;/code&gt; Pandas Visual Analysis
&lt;/h3&gt;

&lt;p&gt;Pandas Visual Analysis bridges the gap between exploratory data analysis and interactive visualization. It provides a user-friendly, &lt;strong&gt;real-time interface&lt;/strong&gt; for exploring your data and creating insightful plots.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pypi.org/project/pandas-visual-analysis/" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimg.shields.io%2Fpypi%2Fv%2Fpandas-visual-analysis" alt="PyPI Version" width="78" height="20"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;pandas-visual-analysis
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Usage&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pandas_visual_analysis&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;VisualAnalysis&lt;/span&gt;
&lt;span class="nc"&gt;VisualAnalysis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Features&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  ⌚ Real-time interaction with the data&lt;/li&gt;
&lt;li&gt;  ✨ Automated interactive visualization dashboard&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://github.com/kanishkarn/pandas_visual_analysis" rel="noopener noreferrer"&gt;GitHub Repository for Pandas Visual Analysis&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Automated EDA libraries are incredibly powerful tools for speeding up your data analysis workflows. While traditional EDA allows for more granular control, these libraries are fantastic for quickly gaining an understanding of new datasets or generating initial insights into complex data. &lt;/p&gt;

&lt;p&gt;Among the libraries we've covered, D-Tale stands out for its interactive features and code export capabilities, which can be very useful when sharing your work. For beginners, I'd recommend starting with Pandas Profiling or Sweetviz because of their user-friendliness and comprehensive reports. They provide a great overview and a good starting point to then dig deeper.&lt;/p&gt;

&lt;p&gt;Ultimately, the best library depends on your specific needs and project. Experiment with a few and see which one fits best into your workflow. Happy exploring! 🚀&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This article is inspired by a piece from &lt;a href="https://towardsdatascience.com/4-libraries-that-can-perform-eda-in-one-line-of-python-code-b13938a06ae" rel="noopener noreferrer"&gt;Towards Data Science&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Building an ETL Pipeline with Airflow, Docker, and Astro</title>
      <dc:creator>Arpit Kadam</dc:creator>
      <pubDate>Tue, 24 Dec 2024 21:04:19 +0000</pubDate>
      <link>https://dev.to/arpitkadam/building-an-etl-pipeline-with-airflow-docker-and-astro-4h75</link>
      <guid>https://dev.to/arpitkadam/building-an-etl-pipeline-with-airflow-docker-and-astro-4h75</guid>
      <description>&lt;p&gt;Efficient data management is a cornerstone of modern analytics and decision-making. In this blog, we will explore how to build a scalable &lt;strong&gt;ETL (Extract, Transform, Load)&lt;/strong&gt; pipeline using &lt;strong&gt;Apache Airflow&lt;/strong&gt;, &lt;strong&gt;Docker&lt;/strong&gt;, and &lt;strong&gt;Astro&lt;/strong&gt;. This project is designed to simplify workflow orchestration, enhance reproducibility, and ensure seamless deployment for better data handling.&lt;/p&gt;

&lt;p&gt;GitHub link: &lt;a href="https://github.com/ArpitKadam/airflow-etl-pipeline.git" rel="noopener noreferrer"&gt;https://github.com/ArpitKadam/airflow-etl-pipeline.git&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding ETL
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;ETL&lt;/strong&gt; stands for &lt;strong&gt;Extract, Transform, and Load&lt;/strong&gt;. It’s a process where data is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Extracted&lt;/strong&gt; from various sources (APIs, databases, flat files, etc.).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transformed&lt;/strong&gt; into a consistent format that is easy to analyze.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loaded&lt;/strong&gt; into a database or data warehouse for downstream analysis.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This process automates the handling of large datasets, ensuring that valuable data is readily available for reporting, analysis, and decision-making.&lt;/p&gt;
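&lt;p&gt;In miniature, those three steps can be sketched with pandas (the file paths and cleaning rules here are illustrative, not taken from the project):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

# Extract: read raw data from a source (a CSV file in this sketch)
raw = pd.read_csv("raw_data.csv")

# Transform: drop incomplete rows and normalize column names
clean = raw.dropna().rename(columns=str.lower)

# Load: write the result to the target store
clean.to_csv("clean_data.csv", index=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;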

&lt;h2&gt;
  
  
  Highlights of the Project
&lt;/h2&gt;

&lt;p&gt;This project focuses on creating an automated ETL pipeline with the following key features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Workflow Automation with Airflow&lt;/strong&gt;: Apache Airflow is used to schedule and monitor ETL tasks. Airflow simplifies managing complex workflows by providing an intuitive user interface for tracking the execution status of tasks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Containerized Development with Docker&lt;/strong&gt;: Docker is used to containerize the project, ensuring consistency across development, testing, and production environments. This makes managing dependencies easier and ensures that the pipeline behaves the same regardless of the environment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Astro Deployment&lt;/strong&gt;: Astro offers a user-friendly interface for managing and scaling Apache Airflow pipelines. With Astro, deploying the pipeline to the cloud becomes seamless, while also enabling efficient monitoring and scalability.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;

&lt;p&gt;The repository contains several essential components to ensure the pipeline works smoothly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;DAGs&lt;/strong&gt;: Directed Acyclic Graphs (DAGs) in Airflow that define the ETL workflow, including tasks like data extraction, transformation, and loading.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dockerfile&lt;/strong&gt;: Defines the environment setup for the project, ensuring all dependencies are installed and the Airflow instance is properly configured.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;docker-compose.yml&lt;/strong&gt;: Configures the Airflow environment locally, making it easier to set up and run the entire pipeline without worrying about individual dependencies.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;requirements.txt&lt;/strong&gt;: Lists the Python dependencies required to run the project, including packages for data transformation and database connections.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;tests/&lt;/strong&gt;: Contains unit tests that verify the integrity and correctness of the data processed through the pipeline.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Extraction&lt;/strong&gt;: The pipeline connects to external APIs or databases to pull raw data. This step ensures that the required data is available for further processing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Transformation&lt;/strong&gt;: Using Python scripts and data manipulation libraries like Pandas, the raw data is cleansed, filtered, and transformed into a standardized format that is ready for analysis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Loading&lt;/strong&gt;: The transformed data is loaded into a target data store, such as a PostgreSQL database or cloud storage, enabling it to be used for downstream analysis.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once the pipeline is set up, Apache Airflow takes over the task of automating and monitoring the entire workflow. Airflow’s intuitive UI allows users to track the progress of each task and intervene if necessary, ensuring that the process runs smoothly.&lt;/p&gt;
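&lt;p&gt;As a minimal sketch (not the repository's actual DAG), the three steps above can be wired together with Airflow's TaskFlow API; the task bodies and names here are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from airflow.decorators import dag, task
import pendulum

@dag(schedule="@daily", start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def etl_pipeline():
    @task
    def extract() -&gt; dict:
        # Pull raw data from an API or database
        return {"rows": [1, 2, 3]}

    @task
    def transform(raw: dict) -&gt; dict:
        # Cleanse and standardize the raw data
        return {"rows": [r * 2 for r in raw["rows"]]}

    @task
    def load(clean: dict) -&gt; None:
        # Write the transformed data to the target store
        print(clean)

    load(transform(extract()))

etl_pipeline()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;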

&lt;h2&gt;
  
  
  Why Use Docker and Astro?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Docker&lt;/strong&gt;: Docker ensures consistency across different environments, whether on local machines or cloud-based deployments. By containerizing the environment, we ensure that all dependencies, configurations, and setups are the same no matter where the pipeline is run.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Astro&lt;/strong&gt;: Astro simplifies deployment to the cloud. It provides tools to easily monitor, manage, and scale your Airflow pipelines. Whether running the pipeline locally or in production, Astro ensures seamless deployment and robust scalability.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Challenges and Learnings
&lt;/h2&gt;

&lt;p&gt;While building this project, a few challenges were encountered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Integration between Airflow and Docker&lt;/strong&gt;: Ensuring smooth integration of Airflow with Docker was initially tricky. However, with careful configuration of the Dockerfile and docker-compose setup, we achieved a stable environment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Resource Management in Cloud Deployments&lt;/strong&gt;: Deploying the pipeline to the cloud required optimizing resource usage. Balancing resource allocation and ensuring efficient execution were key takeaways.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The experience underscored the importance of modular design, testing, and scalability when building real-world data solutions. Thorough testing was essential to handle various data edge cases and ensure the pipeline performs efficiently under different conditions.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Get Started
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Clone the Repository&lt;/strong&gt;:
Start by cloning the repository to your local machine:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   git clone https://github.com/ArpitKadam/airflow-etl-pipeline.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;Build and Start the Docker Containers&lt;/strong&gt;:
Use Docker to build and start the necessary containers:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   docker-compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;Deploy the Pipeline Using Astro&lt;/strong&gt;:
Deploy your pipeline to Astro for cloud management, monitoring, and scalability. Alternatively, you can run the pipeline locally using &lt;code&gt;docker-compose&lt;/code&gt;.
&lt;/li&gt;
&lt;/ol&gt;



&lt;ol start="4"&gt;
&lt;li&gt;
&lt;strong&gt;Follow the README&lt;/strong&gt;:
Detailed setup instructions are provided in the &lt;code&gt;README&lt;/code&gt; file to help you configure and run the pipeline on your system.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This project provides a robust foundation for automating and scaling data pipelines using modern tools like Apache Airflow, Docker, and Astro. It showcases the importance of effective workflow orchestration and the power of containerization for data engineering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Images:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxdvpxwyqo5pkhateqjk5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxdvpxwyqo5pkhateqjk5.png" alt="Image description" width="800" height="423"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbkwky0t52ave6mtmnsml.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbkwky0t52ave6mtmnsml.png" alt="Image description" width="800" height="344"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhkarbavep7siz1ct8d7y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhkarbavep7siz1ct8d7y.png" alt="Image description" width="800" height="387"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1y9l6g6e8uveurqj2idz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1y9l6g6e8uveurqj2idz.png" alt="Image description" width="800" height="378"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsoeai30dgg0eikcjojrp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsoeai30dgg0eikcjojrp.png" alt="Image description" width="800" height="354"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk1w975h1hxdk6akiz7v1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk1w975h1hxdk6akiz7v1.png" alt="Image description" width="800" height="355"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyqgm6dn12xewa9xdk9mr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyqgm6dn12xewa9xdk9mr.png" alt="Image description" width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>🚀 Starting Your Journey as an AI/ML Engineer: My Roadmap and Insights</title>
      <dc:creator>Arpit Kadam</dc:creator>
      <pubDate>Sun, 20 Oct 2024 15:14:54 +0000</pubDate>
      <link>https://dev.to/arpitkadam/starting-your-journey-as-an-aiml-engineer-my-roadmap-and-insights-2k29</link>
      <guid>https://dev.to/arpitkadam/starting-your-journey-as-an-aiml-engineer-my-roadmap-and-insights-2k29</guid>
      <description>&lt;p&gt;Hey Dev community! 👋&lt;/p&gt;

&lt;p&gt;I’m Arpit Kadam, a third-year AIML student passionate about all things artificial intelligence🤖 and machine learning 📊. I’ve learned quite a bit along the way. Today, I want to share my experience and the roadmap that has helped me grow as an AI/ML engineer, hoping it will serve as a useful guide for anyone starting out!🌱&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;📚Start with the Basics: Build a Strong Foundation&lt;/strong&gt;&lt;br&gt;
Before diving into the complex world of AI and ML, it’s crucial to have a solid understanding of programming fundamentals, mathematics, and statistics 🧠.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🧑‍💻Learn Core Machine Learning Concepts&lt;/strong&gt;&lt;br&gt;
Once you’ve got the basics down, it’s time to get hands-on with machine learning algorithms🛠️.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔑 Real-World Projects: The Key to Mastery&lt;/strong&gt;&lt;br&gt;
Theory is important, but nothing beats learning from real-world projects✨. Feel free to check out my projects on my GitHub page: &lt;a href="https://github.com/ArpitKadam" rel="noopener noreferrer"&gt;https://github.com/ArpitKadam&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔬Stay Curious: Explore Advanced Topics&lt;/strong&gt;&lt;br&gt;
I’m constantly trying to enhance my knowledge, not just to shine in projects but also to contribute more effectively to the field.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🤝 Share and Collaborate&lt;/strong&gt;&lt;br&gt;
Finally, I can’t stress this enough: Document your work and share it! 📢Whether through blog posts, GitHub repositories, or presentations, sharing not only helps you retain knowledge but also opens doors to networking and collaboration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;📫 Let's Connect!&lt;/strong&gt;&lt;br&gt;
If you’d like to discuss AI/ML, collaborate on a project, or just want to chat about tech, feel free to reach out to me! Here’s where you can find me:&lt;/p&gt;

&lt;p&gt;Email: &lt;a href="mailto:arpitkadam922@gmail.com"&gt;arpitkadam922@gmail.com&lt;/a&gt; 📧&lt;br&gt;
GitHub: &lt;a href="https://github.com/ArpitKadam" rel="noopener noreferrer"&gt;https://github.com/ArpitKadam&lt;/a&gt; 💻&lt;br&gt;
Phone: +91-8767375722 📞&lt;br&gt;
Instagram: &lt;a href="https://www.instagram.com/arpit__kadam/" rel="noopener noreferrer"&gt;https://www.instagram.com/arpit__kadam/&lt;/a&gt; 📸&lt;br&gt;
I’m always open to learning, connecting with like-minded individuals, and collaborating on interesting projects! Feel free to ping me on any of the platforms above. 😊&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
