<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: DCAI Community</title>
    <description>The latest articles on DEV Community by DCAI Community (@dcai_community).</description>
    <link>https://dev.to/dcai_community</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F601062%2Fdc8b3b15-f5b2-4b21-b89e-b86703c0343e.png</url>
      <title>DEV Community: DCAI Community</title>
      <link>https://dev.to/dcai_community</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dcai_community"/>
    <language>en</language>
    <item>
      <title>Coding Wonderland: Contribute to YData Profiling and YData Synthetic in this Open Source Advent</title>
      <dc:creator>DCAI Community</dc:creator>
      <pubDate>Tue, 05 Dec 2023 13:50:51 +0000</pubDate>
      <link>https://dev.to/dcaicommunity/coding-wonderland-contribute-to-ydata-profiling-and-ydata-synthetic-in-this-advent-of-code-305d</link>
      <guid>https://dev.to/dcaicommunity/coding-wonderland-contribute-to-ydata-profiling-and-ydata-synthetic-in-this-advent-of-code-305d</guid>
      <description>&lt;p&gt;The holiday season is upon us and so is the fantastic &lt;a href="https://zilliz.com/advent-of-code"&gt;Open Source Advent Game&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;To escape the data scientists' naughty list, the Data-Centric AI Community decided to come together and contribute to our favorite data science open-source projects -- &lt;a href="https://github.com/ydataai/ydata-profiling"&gt;ydata-profiling&lt;/a&gt; and &lt;a href="https://github.com/ydataai/ydata-synthetic"&gt;ydata-synthetic&lt;/a&gt; -- and here's why you should, too:&lt;/p&gt;

&lt;p&gt;For each contribution, you can gather points and have the chance to win a fantastic swag package from &lt;a href="https://zilliz.com/blog/advent-of-code-for-open-source"&gt;Zilliz and all the remaining participants&lt;/a&gt;!&lt;/p&gt;

&lt;h2&gt;
  
  
  Not sure how to contribute? Don't hang up your socks just yet!
&lt;/h2&gt;

&lt;p&gt;You don't need to be a super-experienced programmer to start contributing! In open-source, all is fair game and there are plenty of ways to help the developer community:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Send us your North ⭐️:&lt;/strong&gt; &lt;em&gt;"On the first day of Christmas, my true contributor gave to me..."&lt;/em&gt; a star in my GitHub tree! 🎵 If you love these projects too, star &lt;a href="https://github.com/ydataai/ydata-profiling"&gt;ydata-profiling&lt;/a&gt; or &lt;a href="https://github.com/ydataai/ydata-synthetic"&gt;ydata-synthetic&lt;/a&gt; and &lt;a href="https://www.linkedin.com/posts/felipegollnick_eda-in-python-eng-below-estou-chocado-activity-7125206579791953920-MsyU?utm_source=share&amp;amp;utm_medium=member_desktop"&gt;let your friends know why you love it so much&lt;/a&gt;!&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Leave some footprints in the Snow with awesome tutorials&lt;/strong&gt; ❄️ Other developers will surely appreciate your guidance and insights. Why not contribute a short blog post? Here are some of the best contributions the community has made with ydata-profiling -- &lt;a href="https://bryanpaget.medium.com/ydata-profiling-71b23ef5ff07"&gt;YData Profiling: Streamlining Data Analysis&lt;/a&gt; -- and ydata-synthetic -- &lt;a href="https://medium.datadriveninvestor.com/fraud-machine-learning-modelling-improvement-with-synthetic-data-2fb5b8a71f16"&gt;Fraud Machine Learning Modeling Improvement with Synthetic Data&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;How about a Holly Jolly repo?&lt;/strong&gt; That's right! You also win points for creating a repo using the open-source projects and putting it on GitHub! Here are some examples: a &lt;a href="https://github.com/dataprofessor/ydata_profiling"&gt;demo for quick EDA with ydata-profiling&lt;/a&gt; and some &lt;a href="https://github.com/archity/synthetic-data-gan"&gt;experiments with synthetic data with ydata-synthetic&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dazzle it up with an App or a Magic PR&lt;/strong&gt;: If you feel adventurous, both projects have plenty of issues up for grabs (&lt;a href="https://github.com/ydataai/ydata-profiling/issues"&gt;ydata-profiling issues&lt;/a&gt; | &lt;a href="https://github.com/ydataai/ydata-synthetic/issues"&gt;ydata-synthetic issues&lt;/a&gt;). If that feels a bit overwhelming, why not try to improve the docs? There's no such thing as "too many examples"!&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How can you get started?
&lt;/h2&gt;

&lt;p&gt;There are plenty of resources for inspiration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Check this video of how to &lt;a href="https://www.youtube.com/watch?v=fvXZcpTwbtA"&gt;install ydata-profiling&lt;/a&gt; and &lt;a href="https://github.com/ydataai/ydata-profiling/tree/develop/examples"&gt;some examples&lt;/a&gt; to give you some ideas!&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=aESmGcxtBdU"&gt;Installing ydata-synthetic&lt;/a&gt; is also super easy. And you know what? You can also &lt;a href="https://www.youtube.com/watch?v=6Lzi26szKNo"&gt;use the Streamlit app&lt;/a&gt; if you're more into the UI experience. If you like hand-on coding, &lt;a href="https://github.com/ydataai/ydata-synthetic/tree/dev/examples"&gt;have a crack at these examples&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Your sleigh got stuck somewhere? Join the Office Hours!
&lt;/h2&gt;

&lt;p&gt;Throughout your Open Source Advent, you can join office hours with the YData team directly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ask questions anytime&lt;/strong&gt; in the &lt;a href="https://tiny.ydata.ai/devto"&gt;Data-Centric AI Community Discord&lt;/a&gt; throughout the Advent!&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Join the &lt;a href="https://discord.com/invite/ysPJtEZG"&gt;Open Source Advent on Discord&lt;/a&gt; and meet us on &lt;strong&gt;Dec 6 between 6 and 7 PM EST&lt;/strong&gt; to get your questions answered live.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Remember, knowledge is a gift that keeps on giving: may your data be clean, your analyses insightful, and your holiday season filled with the joy of data science! 🎁🎄&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>opensource</category>
      <category>pandas</category>
    </item>
    <item>
      <title>How to Do an EDA for Time-Series</title>
      <dc:creator>DCAI Community</dc:creator>
      <pubDate>Tue, 20 Dec 2022 19:14:20 +0000</pubDate>
      <link>https://dev.to/dcaicommunity/how-to-do-an-eda-for-time-series-43dn</link>
      <guid>https://dev.to/dcaicommunity/how-to-do-an-eda-for-time-series-43dn</guid>
      <description>Original post by Fabiana Clemente, Chief Data Officer at YData

&lt;p&gt;&lt;a href="" class="article-body-image-wrapper"&gt;&lt;img&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Exploring pandas-profiling for time-series exploratory analysis
&lt;/h2&gt;

&lt;p&gt;One of the early steps in the data science development cycle is to understand and explore the data for the problem you're solving.&lt;/p&gt;

&lt;p&gt;Exploratory Data Analysis (EDA) is a crucial step for a better data science workflow, and &lt;a href="https://github.com/ydataai/pandas-profiling" rel="noopener noreferrer"&gt;pandas-profiling&lt;/a&gt; has been our preferred choice to get it done quickly and with a single line of code, while providing the outputs to better understand the data and uncover meaningful insights.&lt;/p&gt;

&lt;p&gt;You have probably been using pandas-profiling for structured tabular data, which is commonly the first type of data that we learn to explore (&lt;em&gt;we all know the Iris dataset, right?&lt;/em&gt; 😁). However, in real-world applications, there's another type of data structure that we commonly find in our day-to-day: &lt;strong&gt;time-series data&lt;/strong&gt;! From traffic data to our daily trajectories or even our electricity and water consumption, all of them have one thing in common — &lt;strong&gt;temporal dependency.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this blogpost, I'll be exploring some key steps in the analysis of a dataset, while leveraging the time-series features of pandas-profiling. The dataset explored refers to &lt;a href="https://www.epa.gov/outdoor-air-quality-data" rel="noopener noreferrer"&gt;Air Quality in the USA&lt;/a&gt; and can be downloaded from the &lt;a href="https://www.epa.gov/outdoor-air-quality-data" rel="noopener noreferrer"&gt;EPA website&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The full code and examples can be found in &lt;a href="https://github.com/ydataai/pandas-profiling/tree/master/examples/usaairquality" rel="noopener noreferrer"&gt;this GitHub repository&lt;/a&gt; so you can follow along with the tutorial.&lt;/p&gt;




&lt;h2&gt;
  
  
  The nature of time-series data
&lt;/h2&gt;

&lt;p&gt;Time-series or sequential data has become one of the most valuable commodities in a world that is more and more data-driven, which makes the ability to perform EDA on and mine time-series data a much-needed skill for data science practitioners.&lt;/p&gt;

&lt;p&gt;Due to the nature of time-series data, exploring the dataset calls for a different type of analysis than when the records are assumed to be independent of one another. The complexity of the analysis grows when more than one entity is present within the same dataset.&lt;/p&gt;




&lt;h2&gt;
  
  
  Analyzing multiple entities in a time-series dataset
&lt;/h2&gt;

&lt;p&gt;The data description says it's the air quality data collected at outdoor monitors across the United States, Puerto Rico, and the U.S. Virgin Islands. With that information, we understand this is a multivariate time-series dataset with several entities that we will need to take into consideration.&lt;/p&gt;

&lt;p&gt;Knowing this, I have some follow-up questions:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How many locations report pollutant measurements? Do all the sensors collect the same amount of data over the same timespan? How are the collected measurements distributed across time and location?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Some of these questions can be easily answered with a heatmap comparing all the measurements and locations against time, as depicted by the code snippet and image below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pandas_profiling.visualisation.plot&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;timeseries_heatmap&lt;/span&gt;

&lt;span class="nf"&gt;timeseries_heatmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataframe&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;entity_column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Site Num&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sortby&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Date Local&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe4f8nq0ivkqvsbiay2hj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe4f8nq0ivkqvsbiay2hj.png" width="720" height="100"&gt;&lt;/a&gt;&lt;br&gt;USA Air Quality dataset heatmap
  &lt;/p&gt;




&lt;p&gt;The diagram above showcases the data points for each entity over time. We can see that not all stations have started collecting data at the same time, and based on the intensity of the heatmap, we can realize that some stations have more data points than others for a given time period. This means that, when modeling the time series, having dynamic timestamps for the training and test datasets might be better than having pre-determined timestamps. We also will have to further investigate the missing records and the scope for imputing records.&lt;/p&gt;
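&lt;p&gt;If you'd like to reproduce this kind of coverage check without the plotting helper, a plain pandas pivot gives the same information in tabular form. The sketch below uses a made-up miniature of the dataset (the column names mirror the EPA data, but the values are invented):&lt;/p&gt;

```python
import pandas as pd

# Made-up miniature of the air-quality data: two monitoring sites,
# one of which starts reporting later and has fewer readings.
df = pd.DataFrame({
    "Site Num": [1, 1, 1, 1, 2, 2],
    "Date Local": pd.to_datetime([
        "2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04",
        "2020-01-03", "2020-01-04",
    ]),
    "NO2 Mean": [10.0, 12.0, 11.0, 13.0, 8.0, 9.0],
})

# Readings per site per day: the tabular analogue of the heatmap above.
coverage = df.pivot_table(index="Site Num", columns="Date Local",
                          values="NO2 Mean", aggfunc="count").fillna(0)

# When did each site start collecting data?
start_dates = df.groupby("Site Num")["Date Local"].min()
```

&lt;p&gt;Sites whose rows start later or contain more zeros in &lt;code&gt;coverage&lt;/code&gt; are exactly the gaps the heatmap makes visible at a glance.&lt;/p&gt;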

&lt;p&gt;With that basic understanding of what our entities time distribution looks like, we can start deep-diving into the data profiling for more insights. Since there are multiple time series, let’s have a look into each entity behavior.&lt;/p&gt;




&lt;h2&gt;
  
  
  A dive into time-series metrics
&lt;/h2&gt;

&lt;p&gt;If you were using pandas-profiling already, you probably know how to generate the profile report.&lt;/p&gt;

&lt;p&gt;The support for time series can be enabled by passing the parameter &lt;code&gt;tsmode=True&lt;/code&gt;, and the library will automatically identify the presence of features with autocorrelation (more on this later). For the analysis to work properly, the dataframe needs to be sorted by entity columns and time; otherwise, you can always leverage the &lt;code&gt;sortby&lt;/code&gt; parameter.&lt;/p&gt;

&lt;p&gt;The code for this is as simple as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pandas_profiling&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ProfileReport&lt;/span&gt;

&lt;span class="n"&gt;profile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ProfileReport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filtered_time_series_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tsmode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sortby&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Date Local&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;profile_report.html&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here’s a preview of the output report using the time-series mode:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5pt78dht8950lkz40mq8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5pt78dht8950lkz40mq8.png" alt="Time-series report" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Seasonal and Non-stationary alerts
&lt;/h2&gt;

&lt;p&gt;Specific to time-series analysis, we can spot two new warnings — &lt;code&gt;NON_STATIONARY&lt;/code&gt; and &lt;code&gt;SEASONAL&lt;/code&gt;. The easiest way to get a quick grasp of your time series is to look at the warnings section. For this particular use case, each profile report will depict the behavior of each USA location with regard to pollutant measurements.&lt;/p&gt;

&lt;p&gt;Here’s how the warnings look in our report:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswpu1ojkkc555g328r5f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswpu1ojkkc555g328r5f.png" alt="Warnings report" width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A time series is said to be &lt;strong&gt;stationary&lt;/strong&gt; when its statistical properties (such as mean and variance) do not change over time. Conversely, a time series is &lt;strong&gt;non-stationary&lt;/strong&gt; when its statistical properties depend on time. For instance, a time series with trends or seasonality (more on this later) is not stationary: these phenomena affect the value of the series at different times.&lt;/p&gt;

&lt;p&gt;Stationary processes are comparatively easier to analyze, as there is a static relationship between time and the variables. In fact, stationarity has become a common assumption in most time-series analyses.&lt;/p&gt;

&lt;p&gt;While there are &lt;a href="https://people.stat.sc.edu/hitchcock/stat520ch5slides.pdf" rel="noopener noreferrer"&gt;models for non-stationary time series&lt;/a&gt;, most ML algorithms expect a static relationship between the input features and the output. When the time series is not stationary, the accuracy of a model fitted to the data will vary at different points in time. This means the modeling choices are affected by the stationary or non-stationary nature of the time series, and different data preparation steps should be applied when you want to &lt;a href="https://analyticsindiamag.com/how-to-make-a-time-series-stationary/" rel="noopener noreferrer"&gt;convert the time series&lt;/a&gt; into a stationary one.&lt;/p&gt;

&lt;p&gt;So this alert will help you identify such columns and pre-process the time series accordingly.&lt;/p&gt;
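&lt;p&gt;To make the definition concrete, here is a minimal sketch (purely illustrative, not part of the profiling report) of what "statistical properties depend on time" looks like: a crude check comparing the mean of the first and second halves of a series flags a trending series but not white noise:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
stationary = rng.normal(0.0, 1.0, 400)               # white noise: constant mean
trending = stationary + np.linspace(0.0, 10.0, 400)  # added trend: mean drifts

def mean_drift(x):
    """Absolute difference between the means of the two halves of a series."""
    half = len(x) // 2
    return abs(x[half:].mean() - x[:half].mean())

# The trending series fails this crude stationarity check; the noise passes.
drift_trend = mean_drift(trending)
drift_noise = mean_drift(stationary)
```

&lt;p&gt;Formal tests such as the augmented Dickey-Fuller test do this job rigorously, but the intuition is the same: the distribution of a non-stationary series shifts depending on when you look at it.&lt;/p&gt;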

&lt;p&gt;Seasonality in a time series is a scenario in which the data experiences regular and predictable changes that recur over a defined cycle. This seasonality may obscure the signal we wish to model, or, conversely, it may provide a strong signal to the models. This alert can help you identify such columns and prompts you to handle the seasonality accordingly.&lt;/p&gt;
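&lt;p&gt;One common way to handle seasonality is seasonal differencing: subtracting the value observed one full cycle earlier. A hedged sketch on a toy series (synthetic data, not from the air-quality dataset) with an exact weekly pattern:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Toy daily series with an exact weekly (period-7) seasonal pattern.
idx = pd.date_range("2021-01-01", periods=70, freq="D")
seasonal = pd.Series(np.tile([0, 1, 2, 3, 2, 1, 0], 10), index=idx, dtype=float)

# Seasonal differencing at lag 7: subtract the value from one cycle earlier.
# A perfectly repeating weekly cycle is removed entirely.
deseasonalized = seasonal.diff(7).dropna()
```

&lt;p&gt;Real data will of course leave a residual after differencing rather than exact zeros; the point is that the repeating cycle no longer dominates the series.&lt;/p&gt;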

&lt;h2&gt;
  
  
  More information on the time-dependent features
&lt;/h2&gt;

&lt;p&gt;The first difference you will notice is that a line plot replaces the histogram for any column identified as time-dependent. Using the line plot, we can better understand the trajectory and nature of the selected column. For this NO2 mean line plot, we see a downward trend in the trajectory, with continuous seasonal variations, and the maximum value recorded in the initial stages of the series.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkuftqhdnmoqrkw6tbnrq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkuftqhdnmoqrkw6tbnrq.png" width="720" height="441"&gt;&lt;/a&gt;&lt;br&gt;Feature details of a column
  &lt;/p&gt;

&lt;p&gt;Next, when we toggle for more details of the column (as shown in the figure above), we’ll see a new tab with autocorrelation and partial-autocorrelation plots.&lt;/p&gt;

&lt;p&gt;For a time series, the autocorrelation shows how its present value relates to its previous values. Partial autocorrelation is the autocorrelation of the series after removing the effect of shorter time lags. This means these plots are crucial for assessing the autocorrelation degree of the series under analysis, as well as its moving-average degree.&lt;/p&gt;
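&lt;p&gt;pandas-profiling draws these plots for you, but you can also probe individual lags directly with pandas' built-in &lt;code&gt;Series.autocorr&lt;/code&gt;. A small sketch on a synthetic autocorrelated series (the 0.9 coefficient is made up for illustration):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Synthetic AR(1)-style series: each value keeps 0.9 of the previous one,
# so nearby observations are strongly correlated.
rng = np.random.default_rng(42)
values = [0.0]
for _ in range(499):
    values.append(0.9 * values[-1] + rng.normal())
series = pd.Series(values)

# Lag-k autocorrelation; for this series it decays gradually with the lag.
acf_lag1 = series.autocorr(lag=1)
acf_lag10 = series.autocorr(lag=10)
```

&lt;p&gt;Evaluating this at a range of lags reproduces the shape of the ACF plot shown in the report.&lt;/p&gt;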

&lt;p&gt;The above ACF and PACF plots are somewhat ambiguous, as expected. Looking through our warnings, we can see that &lt;strong&gt;NO2 mean&lt;/strong&gt; is a &lt;em&gt;non-stationary&lt;/em&gt; time variable, which limits the interpretability of these plots. Nevertheless, the ACF plot is useful to confirm what we already suspected — &lt;strong&gt;NO2 mean&lt;/strong&gt; is &lt;em&gt;non-stationary&lt;/em&gt; — as the ACF values decrease very slowly instead of dropping quickly to zero, as expected for a stationary series.&lt;/p&gt;

&lt;p&gt;The information gathered from the data profiling, the nature of time-series, and the alerts such as non-stationary and seasonality give you a head start in understanding the time-series data you have at your hand. This doesn’t mean you’re done with the exploratory data analysis — the goal is to use these insights as a starting point and work on further in-depth data analysis and further data preparation steps.&lt;/p&gt;

&lt;p&gt;From profiling the air quality dataset, we see several constant columns, which may not add much value when modeled. From the missing values chart, we see the SO2 and CO2 air quality indexes have missing data; we should further explore the impact of this and the scope for imputing or dropping these columns altogether. Several columns were flagged with non-stationary and seasonality alerts; the next steps would be either to make them stationary or to ensure the models we'll be using can handle non-stationary data.&lt;/p&gt;
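&lt;p&gt;Should you want to act on these findings programmatically, constant columns and missingness are easy to quantify in plain pandas. A sketch on toy data (the column names are illustrative, not the real EPA schema):&lt;/p&gt;

```python
import pandas as pd

# Toy frame mimicking the issues above: a constant column and missing values.
df = pd.DataFrame({
    "NO2 Mean": [10.0, 12.0, 11.0],
    "Units": ["ppb", "ppb", "ppb"],   # constant: adds no modeling value
    "SO2 AQI": [3.0, None, 5.0],      # partially missing
})

# Constant columns have a single unique value (counting NaN as a value).
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) == 1]

# Share of missing entries per column.
missing_share = df.isna().mean()
```

&lt;p&gt;From here you can decide, per column, whether to drop, impute, or investigate further.&lt;/p&gt;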

&lt;p&gt;You get the idea — as data scientists, it's important to use profiling tools to quickly get an overall view of the data (in our case, time series), and to further inspect and make informed decisions during the data pre-processing and modeling stages.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;The motto of pandas-profiling has always been the same: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Read the data? Pause. Generate the Pandas Profiling report, and inspect the data. Now start cleaning and re-iterate on exploring the data.”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Though structured tabular data remains the most common data type when taking one's first steps in data science, time-series data is widely used and central to the development of many business and advanced data-driven solutions. Due to the nature of time series, where records depend on time and influence future occurrences, data scientists seek out different kinds of insights during the exploratory data analysis phase.&lt;/p&gt;

&lt;p&gt;Thus, it was only a matter of time before the Pandas Profiling library incorporated a time-series analysis mode to uncover these insights. In this article, we demonstrated everything from the changes required from the user to obtain the time-series-specific profiling report, to the new alerts that flag concerns in the data, to the line plots and correlation graphs that are specific to time-series analysis.&lt;/p&gt;

&lt;p&gt;But the metrics and analysis explored today are only the beginning! More questions remain to be answered. And what about you: what is your usual approach when analyzing time-series data? What do you miss the most when working with sequential datasets?&lt;/p&gt;




&lt;h3&gt;
  
  
  Made with ❤️ by the Data-Centric AI Community
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Thank you for reading! If you enjoyed this tutorial and plan to use pandas-profiling in your data quests, please ⭐️ our &lt;a href="https://github.com/ydataai/pandas-profiling" rel="noopener noreferrer"&gt;repository&lt;/a&gt; and join the discussion on our &lt;a href="https://discord.com/invite/mw7xjJ7b7s" rel="noopener noreferrer"&gt;Discord server&lt;/a&gt;!&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>devops</category>
      <category>career</category>
      <category>learning</category>
    </item>
    <item>
      <title>How to compare 2 datasets with pandas-profiling 🐼</title>
      <dc:creator>DCAI Community</dc:creator>
      <pubDate>Tue, 20 Dec 2022 16:54:52 +0000</pubDate>
      <link>https://dev.to/dcaicommunity/how-to-compare-2-datasets-with-pandas-profiling-469p</link>
      <guid>https://dev.to/dcaicommunity/how-to-compare-2-datasets-with-pandas-profiling-469p</guid>
      <description>&lt;h2&gt;
  
  
  Visualization is the cornerstone of Exploratory Data Analysis
&lt;/h2&gt;

&lt;p&gt;When facing a new, unknown dataset, visual inspection allows us to get a feel of the available information, draw some patterns regarding the data, and diagnose several issues that we might need to address. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/ydataai/pandas-profiling"&gt;pandas-profiling&lt;/a&gt; has been the indispensable swiss-knife in every data scientist’s tool belt. However, something that seemed to be missing was the ability to &lt;strong&gt;compare different reports side-by-side&lt;/strong&gt;, which would help us continuously assess the transformations performed during EDA!&lt;/p&gt;




&lt;h2&gt;
  
  
  Side-by-side comparison: the wait is over!
&lt;/h2&gt;

&lt;p&gt;pandas-profiling now supports a "side-by-side" comparison feature that lets us automate the comparison process with a single line of code. &lt;/p&gt;

&lt;p&gt;In this blogpost, I'll put you up to speed with this new functionality and show you how we can use it to produce faster and smarter transformations on our data. &lt;/p&gt;

&lt;p&gt;I’ll be using the &lt;a href="https://www.kaggle.com/datasets/mrsantos/hcc-dataset"&gt;HCC Dataset&lt;/a&gt;, which I personally collected during my MSc. For this particular use case, I’ve artificially introduced some additional data quality issues to show you how visualization can help us detect them and guide us towards their efficient mitigation.&lt;/p&gt;

&lt;p&gt;The full code and examples can be found on &lt;a href="https://github.com/ydataai/pandas-profiling/tree/master/examples/hcc"&gt;this GitHub repository&lt;/a&gt; so you can follow along with the tutorial.&lt;/p&gt;




&lt;h2&gt;
  
  
  pandas-profiling: EDA at your fingertips
&lt;/h2&gt;

&lt;p&gt;We’ll start by profiling the HCC dataset and investigating the data quality issues suggested in the report:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;profiling&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="mf"&gt;3.5&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pandas_profiling&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ProfileReport&lt;/span&gt;

&lt;span class="c1"&gt;# Read the HCC Dataset
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hcc.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Produce the data profiling report
&lt;/span&gt;&lt;span class="n"&gt;original_report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ProfileReport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Original Data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;original_report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;original_report.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ksQyljtI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/916u0lqo2f11ftkhvgpr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ksQyljtI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/916u0lqo2f11ftkhvgpr.png" width="800" height="357"&gt;&lt;/a&gt;&lt;br&gt;Alerts shown in Pandas Profiling Report.
  &lt;/p&gt;

&lt;p&gt;According to the "Alerts" overview, there are four main types of potential issues that need to be addressed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Duplicates&lt;/strong&gt;: 4 duplicate rows in the data;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Constant&lt;/strong&gt;: constant value “999” in 'O2';&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;High Correlation&lt;/strong&gt;: several features marked as highly correlated;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Missing&lt;/strong&gt;: missing values in ‘Ferritin’.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The validity of each potential problem (as well as the need to find a mitigation strategy for it) depends on the specific use case and domain knowledge. In our case, with the exception of the "high correlation" alerts, which would require further investigation, the remaining alerts seem to reflect true data quality issues and can be tackled using a few practical solutions. Let's see how!&lt;/p&gt;

&lt;h3&gt;
  
  
  Removing Duplicate Rows
&lt;/h3&gt;

&lt;p&gt;Depending on the nature of the domain, there might be records that have the same values without it being an error. However, considering that some of the features in this dataset are quite specific and refer to an individual’s biological measurements (e.g., "Hemoglobin", "MCV",  "Albumin"), it’s unlikely that several patients report the same exact values for all features. Let’s  start by dropping these duplicates from the data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Drop duplicate rows
&lt;/span&gt;&lt;span class="n"&gt;df_transformed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;df_transformed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_transformed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop_duplicates&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
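&lt;p&gt;Before dropping anything, it can be reassuring to count how many fully duplicated rows exist. A minimal sketch using pandas (the tiny dataframe and its values below are hypothetical, not the actual HCC data):&lt;/p&gt;

```python
import pandas as pd

# Hypothetical toy data: the third row duplicates the first
df = pd.DataFrame({
    "Hemoglobin": [12.1, 13.4, 12.1],
    "MCV": [88.0, 91.2, 88.0],
})

# duplicated() marks every repeat of an earlier row (all columns equal)
n_duplicates = df.duplicated().sum()
print(n_duplicates)  # 1

# drop_duplicates() keeps only the first occurrence of each row
df_clean = df.drop_duplicates()
print(len(df_clean))  # 2
```

&lt;p&gt;The same count appears in the "Duplicates" section of the profiling report, so the two views can be cross-checked.&lt;/p&gt;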



&lt;h3&gt;
  
  
  Removing Irrelevant Features
&lt;/h3&gt;

&lt;p&gt;The constant values in "O2" also reflect a true inconsistency in the data. There are two likely reasons for such an error: either the O2 values were measured and stored automatically in the database and the pulse oximeter failed, or the person taking the measurement kept getting repeated error readings and simply coded them as "999", an impossible value (O2 saturation ranges from 0% to 100%). In either case, these values are erroneous and should be removed from the analysis:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Remove O2
&lt;/span&gt;&lt;span class="n"&gt;df_transformed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_transformed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;O2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
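&lt;p&gt;Constant columns such as "O2" can also be flagged programmatically rather than by eye. A hedged sketch (the column names and values here are illustrative, not the real dataset):&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({
    "O2": [999, 999, 999],            # constant sentinel value
    "Hemoglobin": [12.1, 13.4, 11.8],
})

# A column with a single distinct value carries no information
constant_cols = [col for col in df.columns if df[col].nunique(dropna=False) == 1]
print(constant_cols)  # ['O2']

df = df.drop(columns=constant_cols)
```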



&lt;h3&gt;
  
  
  Missing Data Imputation
&lt;/h3&gt;

&lt;p&gt;As frequently happens with medical data, the HCC dataset is also highly susceptible to missing data. A simple way to address this issue (while avoiding removing incomplete records or entire features) is data imputation. We'll use mean imputation to fill in the absent observations, as it is the most common and simplest statistical imputation technique and often serves as a baseline method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Impute Missing Values
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.impute&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SimpleImputer&lt;/span&gt;
&lt;span class="n"&gt;mean_imputer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SimpleImputer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mean&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df_transformed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Ferritin&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mean_imputer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_transformed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Ferritin&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
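&lt;p&gt;To sanity-check the imputation, the missing-value count can be compared before and after. A self-contained sketch with toy values (not the real "Ferritin" measurements):&lt;/p&gt;

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"Ferritin": [100.0, np.nan, 300.0, np.nan]})
print(int(df["Ferritin"].isna().sum()))  # 2 missing before

imputer = SimpleImputer(strategy="mean")
# fit_transform returns a 2D array, so ravel() it back into a column
df["Ferritin"] = imputer.fit_transform(df[["Ferritin"]]).ravel()

print(int(df["Ferritin"].isna().sum()))  # 0 missing after
print(df["Ferritin"].tolist())           # [100.0, 200.0, 300.0, 200.0]
```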






&lt;h2&gt;
  
  
  Side-by-side comparison: faster and smarter iterations on your data
&lt;/h2&gt;

&lt;p&gt;Now for the fun part! After implementing the first batch of transformations to our dataset, we're ready to assess their impact on the overall quality of our data. &lt;/p&gt;

&lt;p&gt;This is where the pandas-profiling report functionality comes in handy: the comparison of the original versus the transformed data can now be performed automatically through the &lt;code&gt;.compare&lt;/code&gt; method of the &lt;code&gt;ProfileReport&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;transformed_report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ProfileReport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_transformed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Transformed Data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;comparison_report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;original_report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compare&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transformed_report&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;comparison_report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;original_vs_transformed.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;How did these transformations impact the quality of our data? And what would we find by further investigating each of the transformations performed? Let’s dive deeper into the comparison results!&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Dataset Overview
&lt;/h3&gt;

&lt;p&gt;The comparison report shows both datasets ("Original Data" and "Transformed Data") and distinguishes them by using blue and red, respectively, in titles and graph plots. &lt;/p&gt;

&lt;p&gt;As shown in the report, the transformed dataset contains one less categorical feature ("O2" was removed), 165 observations (versus the original 171 containing duplicates) and no missing values (in contrast with the 79 missing observations in the original dataset).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wyNTIU5J--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i31tenvtd65p3blkbd5i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wyNTIU5J--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i31tenvtd65p3blkbd5i.png" width="800" height="314"&gt;&lt;/a&gt;&lt;br&gt;Comparison Report: Dataset Statistics.
  &lt;/p&gt;

&lt;h3&gt;
  
  
  Duplicate Records
&lt;/h3&gt;

&lt;p&gt;In contrast to the original data, there are no duplicate patient records in the transformed data: our complete and accurate case base can move on to the modeling pipeline without the risk of overfitting that repeated records introduce.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AsxnI59v--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jrjnsrlnsqjvlicyirya.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AsxnI59v--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jrjnsrlnsqjvlicyirya.png" width="800" height="232"&gt;&lt;/a&gt;&lt;br&gt;Comparison Report: Duplicate Rows.
  &lt;/p&gt;

&lt;h3&gt;
  
  
  Irrelevant Features
&lt;/h3&gt;

&lt;p&gt;Features that have not been subjected to any transformation remain the same (as shown below for "Encephalopathy"): &lt;em&gt;original&lt;/em&gt; and &lt;em&gt;transformed&lt;/em&gt; data summary statistics do not change. In turn, removed features are only presented for the original data (shown in blue), as is the case of "O2".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--s8cobn----/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rxxmyjf0j3sxstz1ty8f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--s8cobn----/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rxxmyjf0j3sxstz1ty8f.png" width="800" height="328"&gt;&lt;/a&gt;&lt;br&gt;Comparison Report: Encephalopathy remains the same.
  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cp-9ucMt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ri7t6pub32cxyakz5bh4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cp-9ucMt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ri7t6pub32cxyakz5bh4.png" width="800" height="258"&gt;&lt;/a&gt;&lt;br&gt;Comparison Report: O2 is only shown for the original data.
  &lt;/p&gt;

&lt;h3&gt;
  
  
  Missing Values
&lt;/h3&gt;

&lt;p&gt;In contrast to the original data, there are no missing observations after the data imputation was performed. Note how both the nullity count and the nullity matrix show the differences between the two versions of the data: in the transformed data, "Ferritin" now has 165 complete values and no blanks appear in the nullity matrix.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mv49fNzD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6n2vbw3clp4r27b5c62s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mv49fNzD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6n2vbw3clp4r27b5c62s.png" width="800" height="340"&gt;&lt;/a&gt;&lt;br&gt;Comparison Report: Missing Values.
  &lt;/p&gt;

&lt;h3&gt;
  
  
  A deeper investigation on data properties
&lt;/h3&gt;

&lt;p&gt;If we were to compare all features before and after the data transformations performed, we would find an insightful detail concerning missing data imputation. &lt;/p&gt;

&lt;p&gt;When analysing the "Ferritin" values in higher detail, we’d see how imputing values with the mean has distorted the original data distribution, which is undesirable:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UOVVg2NO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c24g2zbn4dr72dpv2ra6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UOVVg2NO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c24g2zbn4dr72dpv2ra6.png" width="800" height="371"&gt;&lt;/a&gt;&lt;br&gt;Comparison Report: Ferritin - imputed values seem to distort the original feature distribution.
  &lt;/p&gt;

&lt;p&gt;This artefact is also visible in the interactions and correlations visualisations, where artificial interaction patterns and inflated correlation values emerge in the relationships between "Ferritin" and the remaining features.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oNZtRYuc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bv9dq34aok7x207n9kos.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oNZtRYuc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bv9dq34aok7x207n9kos.png" width="800" height="470"&gt;&lt;/a&gt;&lt;br&gt;Comparison Report: Interactions between Ferritin and Age: imputed values are shown in a vertical line corresponding to the mean.
  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EOYFFD5s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nghlzgw3gsu57ug6uiid.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EOYFFD5s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nghlzgw3gsu57ug6uiid.png" width="800" height="397"&gt;&lt;/a&gt;&lt;br&gt;Comparison Report: Correlations - Ferritin correlation values seem to increase after data imputation.
  &lt;/p&gt;

&lt;p&gt;This goes to show that the comparison report is not only useful for highlighting the differences introduced by data transformations, but also provides several visual cues that lead us towards important insights about those transformations: in this case, a more specialised data imputation strategy should be considered. &lt;/p&gt;
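&lt;p&gt;As one possible next step (a suggestion on my part, not a step from the original pipeline), median imputation is a drop-in replacement that is far less sensitive to the skewed, outlier-prone distributions typical of laboratory measurements such as "Ferritin":&lt;/p&gt;

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy skewed values (illustrative only): one extreme outlier inflates the mean
df = pd.DataFrame({"Ferritin": [10.0, 20.0, 30.0, 1000.0, np.nan]})

median_imputer = SimpleImputer(strategy="median")
imputed = median_imputer.fit_transform(df[["Ferritin"]])

# The median (25.0) ignores the outlier; the mean would have been 265.0
print(imputed.ravel().tolist())  # [10.0, 20.0, 30.0, 1000.0, 25.0]
```

&lt;p&gt;Re-running the comparison report after such a change would show whether the distortion in the "Ferritin" distribution is reduced.&lt;/p&gt;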




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Throughout this small use case, I've covered the usefulness of comparing two sets of data within the same profiling report to highlight the data transformations performed during EDA and evaluate their impact on data quality. Nevertheless, the applications of this functionality are endless, as the need to (re)iterate on feature assessment and visual inspection is vital for data-centric solutions!&lt;/p&gt;




&lt;h3&gt;
  
  
  Made with ❤️ by the Data-Centric AI Community
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Thank you for reading! If you enjoyed this tutorial and plan to use pandas-profiling in your data quests, please ⭐️ our &lt;a href="https://github.com/ydataai/pandas-profiling"&gt;repository&lt;/a&gt; and join the discussion on our &lt;a href="https://discord.com/invite/mw7xjJ7b7s"&gt;Discord server&lt;/a&gt;!&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>python</category>
      <category>beginners</category>
      <category>datascience</category>
      <category>jupyter</category>
    </item>
    <item>
      <title>The Data-Centric AI Community is on Discord 👾</title>
      <dc:creator>DCAI Community</dc:creator>
      <pubDate>Tue, 20 Dec 2022 11:58:31 +0000</pubDate>
      <link>https://dev.to/dcaicommunity/the-data-centric-ai-community-is-on-discord-1l42</link>
      <guid>https://dev.to/dcaicommunity/the-data-centric-ai-community-is-on-discord-1l42</guid>
      <description>&lt;h2&gt;
  
  
  Welcome to our humble home!
&lt;/h2&gt;

&lt;p&gt;This Christmas, you won’t be "home alone"! The &lt;strong&gt;Data-Centric AI Community&lt;/strong&gt; is officially moving to a fun little place of the web — &lt;a href="https://discord.com/invite/mw7xjJ7b7s" rel="noopener noreferrer"&gt;our brand new Discord server&lt;/a&gt; — and you’re invited to join us during the holidays.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0gpu07pzn4smiiklcgry.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0gpu07pzn4smiiklcgry.png" alt="The DCAI Community is the place of all things data&amp;lt;br&amp;gt;
" width="800" height="980"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What to expect from the server?
&lt;/h3&gt;

&lt;p&gt;The DCAI Community is the &lt;em&gt;home of all things data&lt;/em&gt; and is therefore designed with that in mind. &lt;/p&gt;

&lt;p&gt;There are essentially 5 categories you may explore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🚀 &lt;strong&gt;Let's Get Started:&lt;/strong&gt; to get you started in the community, introduce yourself and invite friends;&lt;/li&gt;
&lt;li&gt;💭 &lt;strong&gt;Data-Centric Topics:&lt;/strong&gt; exclusively dedicated to data-centric discussions;&lt;/li&gt;
&lt;li&gt;⭐️ &lt;strong&gt;Community Hub:&lt;/strong&gt; to share ideas and personal projects, find inspiration and job opportunities, and foster partnerships;&lt;/li&gt;
&lt;li&gt;🐼 &lt;strong&gt;Pandas Profiling&lt;/strong&gt; and 🔐 &lt;strong&gt;YData Synthetic:&lt;/strong&gt; dedicated to our open-source users and contributors.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdq9o6nwf3jmqi9h43801.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdq9o6nwf3jmqi9h43801.png" alt="DCAI server includes a dedicated space for data-centric discussion" width="800" height="820"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Everything has its own place
&lt;/h3&gt;

&lt;p&gt;Async communication is already hard, and more so when you have to skim through piles of vendor and promotional content to get truly helpful information and meaningful conversations. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We want our space to be exclusively dedicated to boosting genuine interactions: that’s what a community is for.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;DCAI has a special "#promote" channel, but that’s it. If you are truly invested in getting feedback and collaborators, you can post your work in the discussion forums or look for partners and colleagues in "#partnerships".&lt;/p&gt;

&lt;p&gt;Alternatively, if you found DCAI through &lt;a href="https://github.com/ydataai/pandas-profiling" rel="noopener noreferrer"&gt;pandas-profiling&lt;/a&gt; or &lt;a href="https://github.com/ydataai/ydata-synthetic" rel="noopener noreferrer"&gt;ydata-synthetic&lt;/a&gt; you can find support for your troubleshooting and provide feedback on interesting features!&lt;/p&gt;

&lt;p&gt;So there you have it, everything has its own place, even &lt;em&gt;events&lt;/em&gt;! No need to have links scattered through "random" or "general" channels. Just add your event to the community! It will automatically synchronise with your time zone, so that everyone is always up-to-date. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2299i1mzfncwb1azitv1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2299i1mzfncwb1azitv1.png" alt="Events can be created by anyone on the server to that everyone is always up to date" width="800" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Find your tribe and engage in genuine conversations
&lt;/h3&gt;

&lt;p&gt;Similarly to machine learning algorithms, each of us can belong to one or several tribes, and a community is a place where we can connect with like-minded fellas.&lt;/p&gt;

&lt;p&gt;When entering the DCAI Community, you can join yours by simply reacting to their respective emojis: a moderator will then assign you to the desired role. &lt;/p&gt;

&lt;p&gt;Then, while navigating the community, you can get in touch with your tribe when asking a particular question or sharing an update. Just be careful not to abuse this permission (e.g., using it for promotional content) or it may be revoked!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zqy7wr9ts9rztz0tfzp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zqy7wr9ts9rztz0tfzp.png" alt="Server roles are like tribes: you can find expert help and share updates with yours" width="800" height="737"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How to join
&lt;/h3&gt;

&lt;p&gt;Joining the community is easy-peasy: &lt;a href="https://discord.com/invite/mw7xjJ7b7s" rel="noopener noreferrer"&gt;just click this link&lt;/a&gt;. If you are already a Discord user, the link should take you directly to our server. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you run into any trouble with an “Invalid Invite” error&lt;/strong&gt;, try opening the link in a private browser window, or simply add the server manually and paste the invite link: &lt;code&gt;https://discord.gg/mw7xjJ7b7s&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq8maoldi433q5dt0rrfz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq8maoldi433q5dt0rrfz.png" alt="You can add a server in your discord app by clicking the “+” button in the left tab" width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you don’t have a Discord account yet, you’ll need to create one, and then join the server using the invite link: &lt;code&gt;https://discord.gg/mw7xjJ7b7s&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnypdfbo9tdtpe40r40x8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnypdfbo9tdtpe40r40x8.png" alt="You can also join DCAI when creating your Discord account" width="800" height="290"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  See you soon!
&lt;/h3&gt;

&lt;p&gt;This marks a new beginning for the Data-Centric AI Community!&lt;/p&gt;

&lt;p&gt;We're preparing a year full of exciting initiatives, where each month will be dedicated to a particular data quality issue — what it is, where we can encounter it "in the wild", how it affects data science applications, and of course... how it may be diagnosed and mitigated!&lt;/p&gt;

&lt;p&gt;So if you haven't yet, stay tuned to &lt;a href="https://datacentricai.community/#newsletter" rel="noopener noreferrer"&gt;our newsletter&lt;/a&gt;: we promise to spill all &lt;em&gt;The Gaussip&lt;/em&gt;. &lt;/p&gt;

&lt;p&gt;We hope you can join us and share your journey!👾&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>tooling</category>
    </item>
  </channel>
</rss>
