DEV Community: brightgitari

The Ultimate Guide to Exploratory Data Analysis (EDA).

brightgitari — Thu, 02 Mar 2023 11:38:10 +0000

Exploratory Analysis Ultimate Guide.
Exploratory data analysis (EDA) is an essential step in the data analysis process. It is used to understand the data and find the relationship between the variables in a dataset. There is a range of activities and techniques applied by different analysts and professionals, but in this guide we are going to go through the steps and techniques used to carry out an effective EDA on your dataset.
The code blocks provided in this guide are in Python. They can be modified to carry out your specific task.
1. Knowing the Data.
This is the very first step in EDA. It helps you understand your dataset: its structure, its variables and their properties. Techniques used to know your data include:
Data Summary: This involves calculating summary statistics such as the mean, median, mode, variance and standard deviation to get an overview of the data. It gives you a sense of the range of values, the central tendency and the spread of the data.
Data Visualization: This is the art of creating charts and graphs to represent the data so that anyone, including the client, can see what it contains. Visualization shows how your data is distributed even before you get deep into the numbers, and it helps you identify patterns, trends and anomalies. The most popular visualization techniques include histograms, box plots, bar charts, scatter plots, pie charts, heat maps and count plots.
Data Sampling: Sometimes you obtain extremely large datasets that are too hard to analyze at once. In that case you can take a sample of the data, which makes it easier to find its properties. This technique helps a great deal when it comes to finding biases and sampling errors in the data.

Code Block

import pandas as pd

# Read data from a CSV file
df = pd.read_csv('data.csv')

# Display the first 5 rows of the data
print(df.head())

# Display the last 5 rows of the data
print(df.tail())

# Display the number of rows and columns in the data
print(df.shape)

# Display the column names
print(df.columns)

# Display the data types of each column
print(df.dtypes)

# Display summary statistics for numeric columns
print(df.describe())

# Display unique values in a column
print(df['column_name'].unique())

# Display the count of each unique value in a column
print(df['column_name'].value_counts())

# Display the number of missing values in each column
print(df.isna().sum())

# Display the number of non-missing values in each column
print(df.count())

# Display the correlation matrix for numeric columns
# (numeric_only=True skips text columns, which raise an error in pandas 2.0+)
print(df.corr(numeric_only=True))

The techniques explained above will help you, as an analyst or data professional, to understand your data and point out any anomalies that may need to be fixed before starting the analysis.
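
The code block above does not cover the Data Sampling technique. A minimal sketch of it follows, assuming a synthetic DataFrame with hypothetical `group` and `value` columns standing in for a large dataset:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a large dataset (column names are hypothetical)
rng = np.random.default_rng(42)
big_df = pd.DataFrame({
    "group": rng.choice(["a", "b", "c"], size=10_000),
    "value": rng.normal(loc=50, scale=10, size=10_000),
})

# Simple random sample: 10% of the rows, reproducible through random_state
sample_df = big_df.sample(frac=0.1, random_state=42)

# Stratified sample: 10% of the rows from each group,
# so each group keeps roughly its original share
stratified_df = big_df.groupby("group").sample(frac=0.1, random_state=42)

# Compare the sample with the full data to get a feel for sampling error
print(len(sample_df), len(stratified_df))
print(big_df["value"].mean(), sample_df["value"].mean())
```

If the sample mean drifts far from the full-data mean, the sample is probably too small or biased.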

2. Identifying the missing values and outliers in the data.
Identifying missing values and outliers is the second step when carrying out EDA on any dataset. Missing values and outliers greatly affect the quality and accuracy of the analysis, so they should be checked and handled before the analysis begins.
Techniques used to handle missing values and outliers include:
Identifying missing values: You can find missing values by checking for NaN or null values in each column, and for placeholder values such as zeros where a zero is not a plausible measurement. Missing values can be handled by replacing them with the median or mean, or by deleting the rows that contain them.
Outliers: Outliers can be identified visually using box plots, scatter plots and histograms, or statistically using techniques such as z-scores or Tukey fences. Once outliers are identified, they can either be removed or replaced with a more reasonable value. How you handle an outlier depends on how much it affects the analysis of the dataset.

Code Block

# Check for missing values
print(df.isna().sum())

# Identify outliers using box plots
import seaborn as sns
sns.boxplot(x=df['column_name'])

Identifying outliers and missing values goes a long way toward making sure your data is clean before proceeding with the analysis.
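
The handling techniques described in this step (median replacement, z-scores and Tukey fences) can be sketched as follows. The values in the series are made up for illustration, and the 1.5 multiplier is the conventional choice for Tukey fences rather than something this guide prescribes:

```python
import numpy as np
import pandas as pd

# Small made-up column with one missing value and one extreme value
s = pd.Series([10.0, 12.0, 11.0, 13.0, np.nan, 12.0, 95.0], name="column_name")

# Fill the missing value with the median (less sensitive to outliers than the mean)
filled = s.fillna(s.median())

# Tukey fences: flag values more than 1.5 * IQR outside the quartiles
q1, q3 = filled.quantile(0.25), filled.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
tukey_outliers = filled[(filled < lower) | (filled > upper)]

# Z-scores: distance of each value from the mean, in standard deviations
z_scores = (filled - filled.mean()) / filled.std()

print(tukey_outliers)
print(z_scores.round(2))
```

In small samples a single extreme value inflates the standard deviation, so its z-score can look less dramatic than expected. Quartile-based fences are less affected by this.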
3. Assess the data quality.
This involves examining the data for errors, inconsistencies and other issues that can affect the quality and accuracy of the analysis. The following techniques are used to assess the data quality.
Data Completeness: This involves checking whether the data in the required columns and variables is complete and whether there are any missing values. This ensures that the data necessary for analysis is available.
Data Consistency: This involves checking that the data is consistent across sources. For example, if you have data from different sources, the same entity should have matching values in all of them.
Data Accuracy: Data accuracy can be affirmed by counterchecking it with external data sources or by use of logical reasoning to check if the data is valid and reasonable.
Data Relevance: This is used to check if the data is relevant and that it aligns with the research question or the problem statement. The process ensures that the values and variables in the dataset are appropriate for the analysis.
Code Block

# Check data completeness
print(df.isna().sum())

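# Check data consistency by combining the sources on a shared column
# ('key' is a placeholder; the default inner merge keeps only keys found in every source)
df2 = pd.read_csv('data2.csv')
df3 = pd.read_csv('data3.csv')
df_merged = pd.merge(df, df2, on='key')
df_merged = pd.merge(df_merged, df3, on='key')
# Fewer rows than expected means some keys are missing from a source
print(df_merged.shape)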

# Check data accuracy
print(df['column_name'].describe())

# Check data relevance
# (info() prints its report directly and returns None, so no print() is needed)
df.info()

By assessing the data quality, you can identify potential issues or errors in the data and take appropriate steps to address them. This stage improves the accuracy and reliability of the analysis.
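
The completeness, consistency and accuracy checks can be made more explicit than a plain merge. The sketch below uses two made-up sources with hypothetical `key` and `age` columns. The 0 to 120 age range is an assumption chosen only for illustration:

```python
import pandas as pd

# Two made-up sources that should describe the same customers
source_a = pd.DataFrame({"key": [1, 2, 3, 3], "age": [34, 29, 41, 41]})
source_b = pd.DataFrame({"key": [1, 2, 4], "age": [34, 30, 52]})

# Completeness: count missing values, then drop exact duplicate rows
print(source_a.isna().sum())
source_a = source_a.drop_duplicates()

# Consistency: an outer merge with indicator=True shows which keys
# appear in only one of the sources
merged = pd.merge(source_a, source_b, on="key", how="outer",
                  suffixes=("_a", "_b"), indicator=True)
only_one_source = merged[merged["_merge"] != "both"]

# Rows present in both sources whose values disagree
both = merged[merged["_merge"] == "both"]
conflicts = both[both["age_a"] != both["age_b"]]

# Accuracy: a simple logical range check (the bounds are an assumption)
out_of_range = source_a[~source_a["age"].between(0, 120)]

print(only_one_source)
print(conflicts)
print(out_of_range)
```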

4. Exploring the relationship between variables.
EDA also involves exploring the relationships between variables, which means measuring the correlation between them and looking for patterns in the dataset. The techniques commonly applied include:
Correlation Analysis: Calculating the correlation coefficient between variables measures the strength of the linear relationship between them. This helps identify which variables are strongly correlated and which are weakly correlated.
Scatter Plots: Scatter plots are a great way to visualize the relationship between two variables. They reveal patterns and make the trend between the variables easy to see.
Heat Maps: Heat maps display the correlation matrix as a grid of colors. They make clusters of strongly correlated variables stand out from the weakly correlated ones.
Regression Analysis: Regression analysis is used to identify the relationship between one dependent variable and one or more independent variables. This can help you identify which independent variables are most strongly related to the dependent variable.
Code Block

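# Correlation analysis (numeric columns only)
print(df.corr(numeric_only=True))

# Scatter plot
import matplotlib.pyplot as plt
plt.scatter(x=df['column1'], y=df['column2'])
plt.show()

# Heat map of the correlation matrix
sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.show()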

# Regression analysis
import statsmodels.formula.api as smf
model = smf.ols('dependent_variable ~ independent_variable', data=df).fit()
print(model.summary())

By exploring relationships between variables, you can identify the variables most important to the analysis and how they relate to each other. This stage also helps surface issues or biases in the data, which guides how the rest of the analysis is done.
5. Test Hypotheses
Testing hypotheses is an instrumental step, especially when carrying out analysis for research. It involves formulating a hypothesis about the data based on the analysis and testing it using statistical methods. The following techniques are used to test hypotheses.
T-Tests: T-tests compare the means of two groups and determine whether they are statistically different.
ANOVA: An ANOVA (Analysis of Variance) test compares the means of multiple groups and determines whether they are statistically different.
Chi-square Test: A chi-square test checks whether there is a significant association between two categorical variables.
Regression Analysis: As explained above, regression is carried out to check whether there is a relationship between a dependent variable and one or more independent variables.
Code Block

# T-Tests
from scipy.stats import ttest_ind
group1 = df[df['group'] == 1]['column_name']
group2 = df[df['group'] == 2]['column_name']
t_stat, p_value = ttest_ind(group1, group2)
print(t_stat, p_value)

# ANOVA
import statsmodels.api as sm
from statsmodels.formula.api import ols
model = ols('column_name ~ C(group)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

# Chi-Square Test
from scipy.stats import chi2_contingency
obs = pd.crosstab(df['column1'], df['column2'])
chi2, p_value, dof, expected = chi2_contingency(obs)
print(chi2, p_value)

# Regression Analysis
import statsmodels.formula.api as smf
model = smf.ols('dependent_variable ~ independent_variable', data=df).fit()
print(model.summary())

By testing hypotheses, one can determine if there is a significant difference or relationship between the variables being analyzed. This step helps validate the results of analysis and can provide insights into the data that may not be apparent from exploratory analysis alone.
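
To show how a test result turns into a decision, here is a minimal sketch of a t-test on synthetic data. The group means, the sample sizes and the 0.05 significance level are assumptions chosen for illustration:

```python
import numpy as np
from scipy.stats import ttest_ind

# Two synthetic groups whose true means differ by 5 units
rng = np.random.default_rng(0)
group1 = rng.normal(loc=50, scale=5, size=200)
group2 = rng.normal(loc=55, scale=5, size=200)

# Null hypothesis: both groups have the same mean
result = ttest_ind(group1, group2)

alpha = 0.05  # conventional significance level (an assumption)
if result.pvalue < alpha:
    print("Reject the null hypothesis: the group means differ.")
else:
    print("Fail to reject the null hypothesis.")
```

A p-value below the chosen level only says the difference is unlikely to be due to chance alone. It says nothing about whether the difference is large enough to matter in practice.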

6. Document Findings:
The final step is documenting the findings of the project. Documenting involves summarizing the key insights and conclusions from the analysis and presenting them in a clear, concise manner. The following are examples of techniques for documenting findings.
Summary Statistics: Provide summary statistics such as the mean, median, mode, range, standard deviation and correlation coefficients of each variable analyzed.
Visualization: Present visualizations such as scatter plots, histograms, and box plots to help communicate patterns and trends in the data.
Key Findings: Summarize the key findings of the analysis in a clear, concise manner. This can include insights into relationships between variables, potential biases or issues with the data, and any significant differences or trends identified in the analysis.
Conclusions: Draw conclusions based on the analysis and provide recommendations for next steps or further analysis that may be needed.
Sample Code Block on Summarization

# Summary Statistics
print(df.describe())

# Visualizations
sns.pairplot(df)

# Key Findings
print('There is a strong positive correlation between column1 and column2.')

# Conclusions
print('Based on the analysis, it is recommended that further research be conducted to investigate the relationship between column1 and column2.')

By documenting findings, you can communicate the insights and conclusions from the analysis to others and help inform decision-making based on the data. This step helps to ensure that the analysis is well understood and can be effectively used to support business decisions or research findings.

Python 101! Introduction To Python for Data Science

brightgitari — Sat, 18 Feb 2023 20:12:13 +0000

Data science is the process of digging into data to identify and pull out important insights from it. It uses methods and practices from mathematics, statistics and computer programming to process anything from small to enormous amounts of data.
One of the main aspects of data science is computer programming: the use of a programming language to manipulate data in order to extract useful information from it.
Programming techniques are used throughout data analysis: querying, reading, displaying and cleaning the data, exploring it for insights, building models from it, storing the findings and documenting the whole process.

Several programming languages are used to carry out processes on data. The most prevalent programming language currently is Python.
Python is a general-purpose, object-oriented programming language first released in 1991. It has a simple syntax and a large standard library, which have made it useful to programmers and developers across many fields, such as web development, data analysis and data science, machine learning and artificial intelligence, and scientific computing in academia.
In recent years Python has been gaining users, helped by a very vibrant community that has driven major development of the language and the creation of thousands of new libraries.

In the field of data science, Python has become the most preferred programming language. According to the surveys below, more than 70% of professionals in the data industry use Python:
The annual KDnuggets Data Science Software Poll: In the 2021 poll, Python was the most popular language for data science, with 77.2% of respondents using it. R was the second most popular language, with 29.7% of respondents using it. (Source: https://www.kdnuggets.com/2021/05/poll-data-science-software.html )
and
The 2020 Data Science Survey by Kaggle: In the survey, Python was the most commonly used language for data science, with 79% of respondents using it. R was the second most commonly used language, with 27% of respondents using it. (source: https )

Python is used in the data science field for the following reasons and in the following ways:

Data Types and Structures:
Python can handle several different data types, including:

a) Numbers: Python supports several types of numerical data, including integers, floating-point numbers, and complex numbers.

b) Strings: Strings are sequences of characters, and they are used to represent text in Python.

c) Booleans: Booleans are used to represent logical values, either True or False.

d) Lists: Lists are a type of data structure that can hold an ordered sequence of values. They are created using square brackets, and the values in a list can be of any data type.

e) Tuples: Tuples are similar to lists in that they can hold an ordered sequence of values. However, tuples are immutable, which means they cannot be changed once they are created.

f) Dictionaries: Dictionaries are a type of data structure that allow you to store key-value pairs. They are created using curly braces, and the keys and values can be of any data type.

Data structures such as lists, dictionaries, and tuples are commonly used in data science because they allow you to store and manipulate large amounts of data efficiently. Here's a brief overview of these data structures:

a) Lists: Lists are one of the most commonly used data structures in Python. They allow you to store an ordered sequence of values, and you can access and manipulate individual values within a list using their index.

b) Tuples: Tuples are similar to lists, but they are immutable, which means they cannot be changed once they are created. Tuples are often used to store fixed collections of data, such as the coordinates of a point in space.

c) Dictionaries: Dictionaries are used to store key-value pairs, which can be of any data type. They are often used to represent structured data, such as the attributes of an object in a dataset. You can access individual values within a dictionary using their key.
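
The data types and structures described above can be tried out in a few lines. This is a minimal sketch, and all variable names and values are made up for illustration:

```python
# Basic data types
count = 3              # integer
price = 19.99          # floating-point number
name = "sensor"        # string
is_active = True       # boolean

# List: ordered and mutable, accessed by index
readings = [21.5, 22.0, 19.8]
readings.append(20.1)
first_reading = readings[0]

# Tuple: ordered but immutable, so assignment raises a TypeError
point = (2.0, 3.5)
try:
    point[0] = 1.0
except TypeError as err:
    tuple_error = str(err)

# Dictionary: key-value pairs, accessed by key
record = {"name": name, "readings": readings, "active": is_active}
average = sum(record["readings"]) / len(record["readings"])

print(first_reading, tuple_error, round(average, 2))
```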

Data Manipulation
Python has several powerful libraries for data manipulation, cleaning, and analysis, including NumPy and Pandas. Here are some ways in which these libraries can be used:

NumPy: NumPy is a library that provides support for large, multi-dimensional arrays and matrices, along with a wide range of mathematical functions to manipulate them. NumPy provides several features for indexing, slicing, filtering, and sorting arrays, making it an ideal choice for data analysis. Here are some features of NumPy:

• Indexing and Slicing: You can access elements of an array using indexing and slicing. NumPy also provides advanced indexing, which allows you to select subsets of an array using boolean masks or integer arrays.

• Filtering: You can filter arrays based on certain conditions using NumPy's boolean indexing. This allows you to extract subsets of data that meet specific criteria.

• Mathematical Operations: NumPy provides a wide range of mathematical functions for arrays, including element-wise operations, matrix operations, and statistical functions.
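
The NumPy features listed above can be sketched on a small array. The array contents are made up for illustration:

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

# Indexing and slicing
first_row = arr[0]          # [1 2 3]
middle_col = arr[:, 1]      # [2 5 8]
sub = arr[:2, 1:]           # rows 0-1, columns 1-2

# Advanced indexing with an integer list: rows 0 and 2
picked = arr[[0, 2]]

# Filtering with a boolean mask
evens = arr[arr % 2 == 0]   # [2 4 6 8]

# Mathematical operations
doubled = arr * 2               # element-wise
col_means = arr.mean(axis=0)    # statistical function along columns
product = arr @ arr.T           # matrix multiplication

print(sub, picked, evens, col_means, sep="\n")
```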

Pandas: Pandas is a library that provides easy-to-use data structures and data analysis tools for Python. It's built on top of NumPy and provides powerful indexing, filtering, merging, and pivoting capabilities. Here are some features of Pandas:

• DataFrames: Pandas' primary data structure is a DataFrame, which is a two-dimensional table that can hold data of different types. DataFrames can be indexed and sliced, and they support advanced indexing using boolean masks and integer arrays.

• Filtering: Pandas provides powerful filtering capabilities that allow you to extract subsets of data that meet specific criteria. You can use boolean masks, query expressions, or apply custom functions to filter data.

• Merging: Pandas provides several functions for combining data from multiple DataFrames. You can use merge, join, and concat functions to combine DataFrames based on common columns, indexes, or other criteria.

• Pivoting: Pandas allows you to reshape data using pivoting operations. You can use pivot tables and pivot functions to group and aggregate data based on specific criteria.

Overall, NumPy and Pandas are powerful libraries that provide a wide range of features for data manipulation, cleaning, and analysis. Their capabilities include indexing, slicing, filtering, merging, and pivoting, making them suitable for a wide range of data analysis tasks.
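
The Pandas features listed above (filtering, merging and pivoting) can be sketched in the same way. The `sales` and `managers` tables and their column names are hypothetical:

```python
import pandas as pd

# Made-up sales data and a lookup table
sales = pd.DataFrame({
    "store": ["north", "north", "south", "south"],
    "product": ["tea", "coffee", "tea", "coffee"],
    "units": [10, 25, 7, 30],
})
managers = pd.DataFrame({"store": ["north", "south"],
                         "manager": ["Ann", "Raj"]})

# Filtering with a boolean mask and with a query expression
big_orders = sales[sales["units"] > 9]
coffee = sales.query("product == 'coffee'")

# Merging on a common column
with_manager = sales.merge(managers, on="store", how="left")

# Pivoting: one row per store, one column per product
pivot = sales.pivot_table(index="store", columns="product",
                          values="units", aggfunc="sum")

print(big_orders, coffee, with_manager, pivot, sep="\n\n")
```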

Data visualization
Python provides several powerful libraries for data visualization, including Matplotlib, Seaborn, and Plotly. Here are some ways in which these libraries can be used to create common types of data visualizations:

Matplotlib: Matplotlib is a widely used library for data visualization in Python. It provides a wide range of tools for creating different types of visualizations, including scatter plots, line charts, histograms, and more. Here are some examples:

• Scatter Plots: Matplotlib provides the scatter function to create scatter plots. You can customize the color, size, and shape of the points, and add a regression line using the plot function.

• Line Charts: Matplotlib provides the plot function to create line charts. You can customize the color, line style, and marker style of the lines, and add multiple lines to the same chart.

• Histograms: Matplotlib provides the hist function to create histograms. You can customize the bin size, color, and transparency of the bars, and add multiple histograms to the same chart.

Seaborn: Seaborn is a library that provides a higher-level interface for creating data visualizations than Matplotlib. It provides a wide range of statistical plots and color palettes, making it ideal for creating complex visualizations. Here are some examples:

• Scatter Plots: Seaborn provides the scatterplot function to create scatter plots. You can customize the color, size, and shape of the points, and add a regression line using the regplot function.

• Line Charts: Seaborn provides the lineplot function to create line charts. You can customize the color, line style, and marker style of the lines, and add multiple lines to the same chart.

• Histograms: Seaborn provides the histplot function to create histograms. You can customize the bin size, color, and transparency of the bars, and add multiple histograms to the same chart.

Plotly: Plotly is a library that provides interactive data visualizations, making it ideal for creating web-based visualizations. It provides a wide range of tools for creating different types of visualizations, including scatter plots, line charts, histograms, and more. The functions named below belong to its high-level Plotly Express module (plotly.express). Here are some examples:

• Scatter Plots: Plotly Express provides the scatter function to create scatter plots. You can customize the color, size, and shape of the points, and add interactive tooltips and animations.

• Line Charts: Plotly Express provides the line function to create line charts. You can customize the color, line style, and marker style of the lines, and add multiple lines to the same chart.

• Histograms: Plotly Express provides the histogram function to create histograms. You can customize the bin size, color, and transparency of the bars, and add interactive tooltips.

Overall, Matplotlib, Seaborn, and Plotly are powerful libraries for creating data visualizations in Python. They provide a wide range of tools for creating scatter plots, line charts, histograms, and other types of visualizations, and allow you to customize the appearance and behavior of the visualizations to meet your needs.
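
As a minimal sketch of the three chart types discussed above, the following draws a scatter plot, a line chart and a histogram with Matplotlib. The data is synthetic, and the off-screen `Agg` backend with an in-memory buffer is an assumption made so the example runs without a display:

```python
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: y depends linearly on x plus noise
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)

fig, (ax_scatter, ax_line, ax_hist) = plt.subplots(1, 3, figsize=(12, 3.5))

# Scatter plot with custom point size and color
ax_scatter.scatter(x, y, s=10, color="tab:blue")
ax_scatter.set_title("Scatter plot")

# Line chart with custom color and line style
ax_line.plot(np.sort(x), color="tab:orange", linestyle="--")
ax_line.set_title("Line chart")

# Histogram with custom bin count and transparency
ax_hist.hist(y, bins=20, color="tab:green", alpha=0.7)
ax_hist.set_title("Histogram")

fig.tight_layout()

# Save to an in-memory PNG; use fig.savefig("charts.png") to write a file
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```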

Machine Learning
Python is a popular programming language for building machine learning models, thanks to its ease of use, flexibility, and a wide range of powerful libraries. Here are some of the most popular libraries for building machine learning models in Python:

Scikit-learn: Scikit-learn is a popular machine learning library for Python that provides a wide range of algorithms for building models. It supports various types of models, including regression, classification, and clustering. Here are some examples:

• Regression: Scikit-learn provides algorithms such as linear regression, decision tree regression, and random forest regression for building regression models. These algorithms can be used to predict continuous values, such as the price of a house based on its features.

• Classification: Scikit-learn provides algorithms such as logistic regression, decision tree classification, and random forest classification for building classification models. These algorithms can be used to predict discrete values, such as whether a customer will buy a product or not based on their demographic and purchase history.

• Clustering: Scikit-learn provides algorithms such as K-means clustering and hierarchical clustering for building clustering models. These algorithms can be used to group similar data points together based on their features.

TensorFlow: TensorFlow is a popular open-source library for building and training machine learning models. It is primarily used for building deep learning models, which are neural networks with many layers. Here are some examples:

• Regression: TensorFlow can be used to build neural network architectures such as feedforward neural networks, convolutional neural networks, and recurrent neural networks for regression. These models can be used to predict continuous values, such as the price of a house based on its features.

• Classification: The same architectures can be used to build classification models. These models can be used to predict discrete values, such as whether a customer will buy a product or not based on their demographic and purchase history.

• Clustering: TensorFlow is not primarily a clustering library. A common pattern is to learn a compact representation of the data with a neural network, for example an autoencoder, and then group the resulting features with a clustering algorithm such as K-means from Scikit-learn.

Keras: Keras is a high-level neural network API for building and training machine learning models. It is built on top of TensorFlow and provides a simpler and more user-friendly interface. Here are some examples:

• Regression: Keras lets you assemble feedforward, convolutional, and recurrent neural networks from ready-made layers for regression. These models can be used to predict continuous values, such as the price of a house based on its features.

• Classification: The same layers can be combined into classification models. These models can be used to predict discrete values, such as whether a customer will buy a product or not based on their demographic and purchase history.

• Clustering: Keras itself does not provide clustering algorithms such as K-means or spectral clustering. As with TensorFlow, a network built in Keras can produce features that are then clustered with another library.

Overall, Python provides a wide range of powerful libraries for building machine learning models, including Scikit-learn, TensorFlow, and Keras. Together these libraries cover regression, classification, and clustering, and provide a wide range of algorithms and architectures for building and training these models.
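
As a minimal sketch of the regression workflow described for Scikit-learn, the example below fits a linear regression to synthetic data. The true slope of 3 and intercept of 2 are assumptions built into the generated data, not results from a real dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: y = 3x + 2 plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=0.5, size=200)

# Hold out a quarter of the data to evaluate the fitted model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# R^2 on unseen data: close to 1 means the line explains most of the variance
r2 = model.score(X_test, y_test)
print(model.coef_[0], model.intercept_, r2)
```

Classification and clustering models in Scikit-learn follow the same pattern of creating an estimator and calling `fit`.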

Deep Learning
Python is a powerful programming language for building deep learning models, thanks to its wide range of powerful libraries. Some of the most popular libraries for building deep learning models in Python are PyTorch and TensorFlow.

PyTorch: PyTorch is a popular open-source deep learning library for Python. It is primarily used for building and training neural networks, including convolutional neural networks (CNNs) and recurrent neural networks (RNNs). PyTorch provides a simple and easy-to-use API, making it popular among researchers and practitioners.

• Convolutional neural networks (CNNs): PyTorch provides a wide range of layers and functions for building CNNs. Popular architectures such as LeNet, AlexNet, and VGG can be built with it, and the companion torchvision package provides pre-trained models for tasks such as image classification and object detection.

• Recurrent neural networks (RNNs): PyTorch provides a wide range of layers and functions for building RNNs, including LSTM and GRU layers. Pre-trained models for tasks such as language modeling and speech recognition are available through the wider PyTorch ecosystem.

TensorFlow: TensorFlow is a popular open-source deep learning library for Python. It is primarily used for building and training neural networks, including CNNs and RNNs. TensorFlow provides both a low-level and a high-level API, making it suitable for researchers and practitioners.

• Convolutional neural networks (CNNs): TensorFlow provides a wide range of layers and functions for building CNNs. Popular architectures such as LeNet, AlexNet, and VGG can be built with it, and pre-trained models are available for tasks such as image classification and object detection.

• Recurrent neural networks (RNNs): TensorFlow provides a wide range of layers and functions for building RNNs, including LSTM and GRU layers. Pre-trained models are also available for tasks such as language modeling and speech recognition.

Overall, Python provides a wide range of powerful libraries for building deep learning models, including PyTorch and TensorFlow. These libraries support popular neural network architectures such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), and provide a wide range of layers, functions, and pre-trained models for building and training these models.

Natural Language Processing
Python is a popular programming language for Natural Language Processing (NLP) tasks, such as sentiment analysis, topic modeling, and text classification. Some of the most popular libraries for NLP in Python are NLTK (Natural Language Toolkit) and SpaCy.

Sentiment Analysis: Sentiment analysis is the task of identifying the sentiment expressed in a piece of text. Python can be used to perform sentiment analysis using libraries such as NLTK and SpaCy.

• NLTK: NLTK provides a range of tools and resources for sentiment analysis, including the VADER sentiment analyzer, which scores text using a lexicon and a set of rules.

• SpaCy: SpaCy's standard pipelines do not include a sentiment component. Sentiment analysis with SpaCy is usually done by training its text categorizer on labeled examples or by adding an extension.

Topic Modeling: Topic modeling is the task of identifying the main topics discussed in a collection of texts. NLTK and SpaCy are typically used to prepare the text, while the topic models themselves come from other libraries.

• NLTK: NLTK provides tokenizers, stop-word lists, and stemmers for preparing a corpus. Algorithms such as Latent Dirichlet Allocation (LDA) are then run with libraries such as Gensim or Scikit-learn.

• SpaCy: SpaCy provides tokenization and lemmatization for preparing a corpus. Algorithms such as Non-negative Matrix Factorization (NMF) are available in Scikit-learn.

Text Classification: Text classification is the task of assigning a piece of text to one or more predefined categories. Python can be used to perform text classification using libraries such as NLTK and SpaCy.

• NLTK: NLTK provides a wide range of tools and resources for performing text classification. It provides algorithms such as Naive Bayes and Maximum Entropy for assigning a piece of text to one or more predefined categories.

• SpaCy: SpaCy provides a trainable text categorizer component based on neural networks, including convolutional architectures. Classical algorithms such as Support Vector Machines (SVMs) are available in Scikit-learn and can be trained on features extracted with SpaCy.
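
To make the idea of sentiment analysis concrete, here is a deliberately tiny lexicon-based sketch in plain Python. It is not how NLTK or SpaCy work internally. The word lists and the `tokenize` and `sentiment` helpers are made up for illustration:

```python
import re
from collections import Counter

# Tiny hand-made lexicon (an assumption for illustration only)
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment(text):
    """Label text by counting positive and negative lexicon words."""
    counts = Counter(tokenize(text))
    score = (sum(counts[w] for w in POSITIVE)
             - sum(counts[w] for w in NEGATIVE))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for review in ["I love this great phone",
               "Terrible battery and poor screen",
               "It arrived on Tuesday"]:
    print(review, "->", sentiment(review))
```

Real libraries go further by handling negation, intensity and context, which a plain word count cannot capture.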

Image Processing
Python is a powerful language that can be used for a wide range of tasks, including image processing. Image processing involves manipulating digital images to improve their quality or extract useful information from them. Python provides a number of libraries that can be used for image processing, including Pillow, OpenCV, and scikit-image.

Pillow: Pillow is a popular Python library for image processing that provides a range of functions for opening, manipulating, and saving images. It supports a wide range of image formats and can be used for tasks such as resizing, cropping, rotating, and converting images. Pillow can be installed using pip, and provides an easy-to-use interface for performing basic image processing tasks.
OpenCV: OpenCV (Open Source Computer Vision) is a powerful open-source library for computer vision and image processing. It provides a wide range of functions for tasks such as image filtering, feature detection, object recognition, and motion detection. OpenCV can be installed using pip, and provides a C++ interface as well as a Python interface for performing complex image processing tasks.
scikit-image: scikit-image is a popular Python library for image processing that provides a range of functions for tasks such as image segmentation, feature detection, and color manipulation. It is built on top of the NumPy and SciPy libraries and provides a user-friendly interface for performing complex image processing tasks. scikit-image can be installed using pip, and provides a range of advanced image processing functions that can be used for research and industrial applications.

Overall, Python provides a range of powerful libraries for image processing, including Pillow, OpenCV, and scikit-image. These libraries provide a wide range of functions and algorithms for performing basic and advanced image processing tasks, and can be used for a variety of applications such as computer vision, robotics, and medical imaging. With the help of these libraries, Python can be used to manipulate, analyze, and extract information from digital images with ease.

Overall, Python provides a wide range of powerful libraries for NLP tasks such as sentiment analysis, topic modeling, and text classification, including NLTK and SpaCy. These libraries provide a wide range of tools and resources for performing NLP tasks, and support a variety of algorithms and pre-trained models for achieving high-quality results.
In summary, Python is a powerful and versatile programming language that can be used for a wide range of data science tasks, including data manipulation, visualization, machine learning, deep learning, natural language processing, and image processing. Python is particularly popular in the data science community because of its ease of use, extensive libraries, and the vast community and resources available for learning and development.
Python provides a wide range of libraries and frameworks that are specifically designed for data science, including NumPy, Pandas, Matplotlib, Seaborn, Scikit-learn, TensorFlow, Keras, PyTorch, NLTK, SpaCy, Pillow, OpenCV, and scikit-image. These libraries provide a comprehensive set of tools for data analysis, modeling, and visualization that can be easily applied to a wide range of real-world problems.
Python's popularity in the data science community has led to the development of a vast number of resources for learning and development, including online courses, tutorials, and open-source libraries. These resources make it easy for aspiring data scientists to learn and apply Python to real-world problems, as well as for experienced data scientists to stay up-to-date with the latest developments in the field.
In conclusion, Python is an essential tool for data scientists, offering a wide range of tools and libraries that make it easy to work with data, build models, and create visualizations. With the vast community and resources available for Python in data science, it is no surprise that Python has become the language of choice for many data scientists and researchers.