<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Stacy Gathu</title>
    <description>The latest articles on DEV Community by Stacy Gathu (@stacy_gathu_1197123761ae4).</description>
    <link>https://dev.to/stacy_gathu_1197123761ae4</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1863151%2Fe0c46f13-c16c-4f20-90a5-1b3cbc749ed1.png</url>
      <title>DEV Community: Stacy Gathu</title>
      <link>https://dev.to/stacy_gathu_1197123761ae4</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/stacy_gathu_1197123761ae4"/>
    <language>en</language>
    <item>
      <title>An Introduction to Fundamental Libraries in Python for Data Science.</title>
      <dc:creator>Stacy Gathu</dc:creator>
      <pubDate>Fri, 25 Apr 2025 10:33:49 +0000</pubDate>
      <link>https://dev.to/stacy_gathu_1197123761ae4/an-introduction-to-fundamental-libraries-in-python-for-data-science-59n7</link>
      <guid>https://dev.to/stacy_gathu_1197123761ae4/an-introduction-to-fundamental-libraries-in-python-for-data-science-59n7</guid>
      <description>&lt;p&gt;Getting started with Python as a data scientist can feel overwhelming, with new jargon flying everywhere, even for the most basic tasks. It is therefore helpful to first know which core libraries exist, why they exist, and when to use each one before taking on the task of using them in your code. Here is a brief and hopefully helpful introduction to the most common Python libraries for beginners.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pandas&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Pandas is an open-source library for working with tabular data, created by Wes McKinney in 2008 (&lt;a href="https://wesmckinney.com/book/python-builtin" rel="noopener noreferrer"&gt;see his book here&lt;/a&gt;). Pandas has two main data structures (a data structure is a container that holds data in a specific way): the Series and the DataFrame. A pandas Series is a one-dimensional array of labelled data with an index attached to it, whilst a DataFrame is a two-dimensional table of data consisting of multiple Series. &lt;/p&gt;

&lt;p&gt;Here is an example of a pandas Series:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0          Apple
1         Banana
2         Cherry
3    Dragonfruit
dtype: object
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
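As a minimal sketch (assuming pandas is installed), the Series above can be produced with a couple of lines; the fruit values are just the ones shown in the example:

```python
import pandas as pd

# A Series is a one-dimensional labelled array; pandas attaches
# a default integer index (0, 1, 2, ...) automatically.
fruits = pd.Series(["Apple", "Banana", "Cherry", "Dragonfruit"])
print(fruits)
```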



&lt;p&gt;And here is an example of a DataFrame:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;         fruit  quantity price
0        Apple        25   $30
1       Banana        30   $10
2       Cherry        30   $20
3  Dragonfruit         5   $50
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
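As a sketch, a DataFrame like the one above can be built from a dictionary of columns (column names taken from the example):

```python
import pandas as pd

# Each dictionary key becomes a column, and each column is itself a Series
df = pd.DataFrame({
    "fruit": ["Apple", "Banana", "Cherry", "Dragonfruit"],
    "quantity": [25, 30, 30, 5],
    "price": ["$30", "$10", "$20", "$50"],
})
print(df)
```

Selecting a single column, for example df["fruit"], returns a Series.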



</description>
      <category>datascience</category>
      <category>python</category>
    </item>
    <item>
      <title>Regression with CART trees.</title>
      <dc:creator>Stacy Gathu</dc:creator>
      <pubDate>Mon, 07 Apr 2025 17:39:40 +0000</pubDate>
      <link>https://dev.to/stacy_gathu_1197123761ae4/regression-with-cart-trees-523m</link>
      <guid>https://dev.to/stacy_gathu_1197123761ae4/regression-with-cart-trees-523m</guid>
      <description>&lt;p&gt;CART stands for Classification And Regression Trees. The algorithm builds binary trees, where every split results in exactly two branches. It splits the data recursively, meaning that the dataset is divided along its features again and again until a stopping threshold is met: for example, the maximum depth has been reached, a split yields no further improvement, or a leaf node would contain too few samples.&lt;/p&gt;

&lt;p&gt;With regression, the CART algorithm chooses splits that reduce the Mean Squared Error (MSE). Once the stopping threshold has been met, each leaf node assigns the mean value of its subset of the data. A leaf node represents a cell of the partition: the smallest region of the feature space over which a simple model can be fitted to the data accurately. In other words, a simple model is fitted to a specific subset of the data.&lt;/p&gt;

&lt;p&gt;Unlike a standard regression model, where values are plugged into an equation to obtain a prediction, CART regression works differently: it assigns the mean of the subset of data that the sample falls into.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw1mkkl0etllsnmys9th7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw1mkkl0etllsnmys9th7.png" alt="Image description" width="594" height="549"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For instance, if we were using CART regression to predict the BMI of an individual who is 1.65 m tall and weighs 80 kg, the model would assign the mean of the subset of individuals in the same cell. Say the model had split on individuals weighing &amp;gt;75 kg and measuring &amp;gt;1.60 m, and the mean BMI of that subset was 29.4; then an individual weighing 76 kg and measuring 1.80 m might also be assigned a BMI of 29.4.&lt;/p&gt;

&lt;p&gt;CART regression has several advantages. It is better suited to non-linear relationships, since each leaf is fitted to a specific subset of the data. It does not depend on feature scaling, because splits are made on raw feature values rather than on their scale. It also copes reasonably well with missing data by relying on the features that are present, and techniques such as pruning help it control overfitting.&lt;/p&gt;

&lt;p&gt;All in all, CART simplifies the prediction process and keeps complexity low by using mean values rather than running a regression on every value. It can also predict well, since the tree effectively fits many simple models on recursively chosen subsets of the data. &lt;/p&gt;
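To make the mean-per-leaf idea concrete, here is a minimal pure-Python sketch (not a full CART implementation) that finds the single split of one feature minimising the weighted MSE and reports each resulting leaf's mean; the weight and BMI numbers are made up for illustration:

```python
# Minimal sketch of one CART-style regression split: pick the split of a
# single feature that minimises the weighted Mean Squared Error, then
# use each leaf's mean as its prediction.

def mse(ys):
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def best_split(xs, ys):
    # Sort samples by feature value and try every split position
    pairs = sorted(zip(xs, ys))
    n = len(pairs)

    def cost(k):
        left = [y for _, y in pairs[:k]]
        right = [y for _, y in pairs[k:]]
        return (len(left) * mse(left) + len(right) * mse(right)) / n

    k = min(range(1, n), key=cost)
    threshold = (pairs[k - 1][0] + pairs[k][0]) / 2
    left_mean = sum(y for _, y in pairs[:k]) / k
    right_mean = sum(y for _, y in pairs[k:]) / (n - k)
    return threshold, left_mean, right_mean

# Made-up weights (kg) and BMI values for illustration
weights = [60, 62, 65, 78, 80, 85]
bmis = [21.0, 21.5, 22.0, 28.9, 29.4, 30.1]
print(best_split(weights, bmis))
```

A real CART tree applies this split search recursively to each resulting subset until a stopping threshold is met.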

</description>
      <category>machinelearning</category>
      <category>regression</category>
      <category>decisiontree</category>
    </item>
    <item>
      <title>Classification evaluation metrics.</title>
      <dc:creator>Stacy Gathu</dc:creator>
      <pubDate>Mon, 03 Mar 2025 06:52:12 +0000</pubDate>
      <link>https://dev.to/stacy_gathu_1197123761ae4/classification-evaluation-metrics-5fpn</link>
      <guid>https://dev.to/stacy_gathu_1197123761ae4/classification-evaluation-metrics-5fpn</guid>
      <description>&lt;p&gt;We need various metrics to evaluate our models, depending on what we wish to achieve from a classification problem. In some cases we may require accuracy; in others we may prefer recall or precision. Below are the evaluation metrics used for classification models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accuracy&lt;/strong&gt;&lt;br&gt;
This is a measure of how often the model classifies observations correctly. It is obtained by &lt;br&gt;


&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Accuracy=TP+TNTP+TN+FP+FN
 Accuracy = \dfrac{TP + TN}{TP + TN + FP + FN}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;A&lt;/span&gt;&lt;span class="mord mathnormal"&gt;cc&lt;/span&gt;&lt;span class="mord mathnormal"&gt;u&lt;/span&gt;&lt;span class="mord mathnormal"&gt;r&lt;/span&gt;&lt;span class="mord mathnormal"&gt;a&lt;/span&gt;&lt;span class="mord mathnormal"&gt;cy&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;TP&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;TN&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;FP&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;FN&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;TP&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;TN&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span 
class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Precision&lt;/strong&gt;&lt;br&gt;
This is a measure of how accurate the positive predictions are.&lt;br&gt;
This is calculated by &lt;br&gt;

&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Precision=TPTP+TN
 Precision = \dfrac{TP}{TP + TN}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mord mathnormal"&gt;rec&lt;/span&gt;&lt;span class="mord mathnormal"&gt;i&lt;/span&gt;&lt;span class="mord mathnormal"&gt;s&lt;/span&gt;&lt;span class="mord mathnormal"&gt;i&lt;/span&gt;&lt;span class="mord mathnormal"&gt;o&lt;/span&gt;&lt;span class="mord mathnormal"&gt;n&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;TP&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;TN&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;TP&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;
&lt;br&gt;
This measure is preferred when false positives need to be minimized to avoid unnecessary interventions, wasted resources, and potential harm to individuals or systems. Such instances include fraud detection. For instance, if our model kept flagging legitimate transactions as fraudulent, this might frustrate customers.

&lt;p&gt;&lt;strong&gt;Recall&lt;/strong&gt;&lt;br&gt;
This is a measure of how many of the actual positives the model correctly identified. It is calculated by &lt;br&gt;

&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Recall=TPTP+FN
 Recall = \dfrac{TP}{TP + FN}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;R&lt;/span&gt;&lt;span class="mord mathnormal"&gt;ec&lt;/span&gt;&lt;span class="mord mathnormal"&gt;a&lt;/span&gt;&lt;span class="mord mathnormal"&gt;ll&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;TP&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;FN&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;TP&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;This metric would be preferred when false negatives need to be minimized. A good example would be in diagnostics where a false negative for a patient who has cancer might lead to a delay of treatment which decreases chances of survival.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;F1 score&lt;/strong&gt;&lt;br&gt;
This is the harmonic mean of precision and recall, taking into consideration both metrics.&lt;br&gt;
 
&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;F1score=2∗Precision∗RecallPrecision+Recall
 F1 score = 2 * \dfrac{Precision * Recall}{Precision + Recall}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;F&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mord mathnormal"&gt;score&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;∗&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mord mathnormal"&gt;rec&lt;/span&gt;&lt;span class="mord mathnormal"&gt;i&lt;/span&gt;&lt;span class="mord mathnormal"&gt;s&lt;/span&gt;&lt;span class="mord mathnormal"&gt;i&lt;/span&gt;&lt;span class="mord mathnormal"&gt;o&lt;/span&gt;&lt;span class="mord mathnormal"&gt;n&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;R&lt;/span&gt;&lt;span class="mord mathnormal"&gt;ec&lt;/span&gt;&lt;span class="mord mathnormal"&gt;a&lt;/span&gt;&lt;span class="mord mathnormal"&gt;ll&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mord mathnormal"&gt;rec&lt;/span&gt;&lt;span class="mord mathnormal"&gt;i&lt;/span&gt;&lt;span 
class="mord mathnormal"&gt;s&lt;/span&gt;&lt;span class="mord mathnormal"&gt;i&lt;/span&gt;&lt;span class="mord mathnormal"&gt;o&lt;/span&gt;&lt;span class="mord mathnormal"&gt;n&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;∗&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;R&lt;/span&gt;&lt;span class="mord mathnormal"&gt;ec&lt;/span&gt;&lt;span class="mord mathnormal"&gt;a&lt;/span&gt;&lt;span class="mord mathnormal"&gt;ll&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;A high F1 score indicates both high precision and high recall. It strikes a good balance between the two metrics and is particularly informative on imbalanced classification problems.&lt;/p&gt;
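As a sketch, all four metrics can be computed directly from confusion-matrix counts; the counts below are made-up numbers for illustration:

```python
# Compute the four classification metrics from confusion-matrix counts
def classification_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Made-up counts for a fraud detector: 80 true positives,
# 900 true negatives, 20 false positives, 10 false negatives
acc, prec, rec, f1 = classification_metrics(80, 900, 20, 10)
print(acc, prec, rec, f1)
```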

</description>
      <category>machinelearning</category>
    </item>
    <item>
      <title>A short summary of data protection, privacy and ethics.</title>
      <dc:creator>Stacy Gathu</dc:creator>
      <pubDate>Mon, 03 Feb 2025 05:08:44 +0000</pubDate>
      <link>https://dev.to/stacy_gathu_1197123761ae4/a-short-summary-of-data-protection-privacy-and-ethics-1m2a</link>
      <guid>https://dev.to/stacy_gathu_1197123761ae4/a-short-summary-of-data-protection-privacy-and-ethics-1m2a</guid>
      <description>&lt;p&gt;&lt;strong&gt;Data Protection&lt;/strong&gt;&lt;br&gt;
This involves securing data from loss, theft, corruption, or unauthorized access. Organizations implement encryption, firewalls, secure servers, and user authentication protocols to ensure that sensitive data remains safe.&lt;br&gt;
Personal Protection&lt;br&gt;
This involves protecting individuals from harm, including identity theft, fraud, and exposure to malicious activities (e.g., cyberbullying, harassment).&lt;br&gt;
Network Protection&lt;br&gt;
This is about securing networks from unauthorized intrusions or attacks, such as DDoS (Distributed Denial of Service) attacks, malware, ransomware, and hacking attempts.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Examples of Protection&lt;/em&gt;&lt;br&gt;
Encryption: Data is encrypted so that only authorized users can decrypt and read it.&lt;br&gt;
Firewalls &amp;amp; Antivirus Software: These tools prevent unauthorized access and help detect malicious activity.&lt;br&gt;
Two-Factor Authentication (2FA): Ensures that a user’s account is protected by requiring two forms of identification (e.g., password and a text message code).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Privacy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Privacy refers to an individual's right to control their personal information and to be free from unwarranted surveillance or interference. In the digital era, privacy is often discussed in the context of data privacy, which refers to how personal data is collected, stored, and shared.&lt;/p&gt;

&lt;p&gt;Personal Data&lt;br&gt;
This refers to any information that can identify an individual, such as name, email, IP address, location, or biometric data. Privacy focuses on controlling the collection, sharing, and usage of this data.&lt;br&gt;
Consent and Transparency&lt;br&gt;
Individuals should have control over their personal information, meaning they should be aware of and consent to how their data will be used. This is often addressed through privacy policies, terms of service, and opt-in/opt-out features.&lt;br&gt;
Right to be Forgotten&lt;br&gt;
In some jurisdictions, individuals have the right to request the deletion of personal data that is held about them (such as in the European Union’s GDPR).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Examples of Privacy Protection&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;GDPR (General Data Protection Regulation): A law that ensures companies protect EU citizens' personal data and provide rights such as data access and deletion.&lt;br&gt;
Data Anonymization: Removing personally identifiable information (PII) from data sets to prevent linking data back to an individual.&lt;br&gt;
Privacy Settings: On platforms like social media, users can control who sees their information, posts, and activity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ethics&lt;/strong&gt;&lt;br&gt;
Ethics involves the moral principles and guidelines that govern behavior, especially in regard to how one’s actions affect others. In the context of technology, ethics ensures that data and technology are used in responsible, fair, and transparent ways.&lt;/p&gt;

&lt;p&gt;Fairness&lt;br&gt;
Ethics ensures that individuals or groups are treated fairly and not discriminated against based on data or technology use. For example, algorithms should not perpetuate racial, gender, or socioeconomic biases.&lt;/p&gt;

&lt;p&gt;Accountability&lt;br&gt;
Ethical considerations dictate that organizations and individuals who handle data are responsible for their actions. If something goes wrong (e.g., a data breach), those responsible should be held accountable.&lt;/p&gt;

&lt;p&gt;Transparency&lt;br&gt;
Ethical standards require that organizations are transparent in how they collect, use, and share data, and that they disclose the impact of their actions on privacy and security.&lt;/p&gt;

&lt;p&gt;Informed Consent&lt;br&gt;
Ethics demands that individuals must be informed about what will happen with their data and consent to it explicitly.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Examples of Ethical Challenges&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Bias in Algorithms&lt;br&gt;
Algorithms should be built to be fair and avoid perpetuating discriminatory practices. For example, AI systems used for hiring, lending, or law enforcement should not favor one demographic group over another.&lt;br&gt;
Surveillance&lt;br&gt;
There is an ongoing debate over the ethics of mass surveillance (e.g., facial recognition), especially when privacy may be compromised in the name of security.&lt;br&gt;
Data Ownership&lt;br&gt;
Who owns the data? In some cases, individuals should retain ownership of their personal data, while in other cases, companies may want to collect and monetize it. Ethical dilemmas arise when ownership and control aren’t clear.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Intersection of Protection, Privacy, and Ethics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The relationship between these three concepts is intertwined:&lt;br&gt;
Protection of Data (Security) ensures that personal data is not exposed to unauthorized entities, reducing the risk of breaches and misuse.&lt;br&gt;
Privacy gives individuals the autonomy to decide how their personal information is shared and used, while laws like GDPR or CCPA (California Consumer Privacy Act) establish frameworks for privacy rights.&lt;br&gt;
Ethics governs how organizations and individuals handle both protection and privacy, ensuring that these practices are carried out in a morally sound and fair manner.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Real-World Examples&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Social Media Platforms&lt;br&gt;
Social media companies must protect user data (protection), ensure users’ privacy preferences are respected (privacy), and act ethically by preventing the misuse of user data for purposes like political manipulation or unauthorized sharing with third parties (ethics).&lt;/p&gt;

&lt;p&gt;Healthcare Data&lt;br&gt;
Patient data must be protected from unauthorized access (protection). Patients should have control over who can access their health records and how they’re used (privacy). Healthcare providers must act ethically by using the data to improve patient outcomes without exploiting it for profit or discrimination.&lt;/p&gt;

&lt;p&gt;AI and Automation&lt;br&gt;
AI systems must be protected against misuse or malicious attacks (protection). Individuals should have control over how their data is used by AI systems (privacy). The systems should be designed to avoid discrimination and make transparent, accountable decisions (ethics).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Challenges and Future Considerations&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As technology continues to evolve, new challenges emerge in the areas of protection, privacy, and ethics:&lt;/p&gt;

&lt;p&gt;Data Breaches and Cybersecurity Risks are becoming increasingly sophisticated, challenging organizations to keep up with new forms of hacking and data theft.&lt;br&gt;
AI and Automation pose ethical dilemmas around bias, transparency, and accountability, especially in decision-making processes.&lt;br&gt;
Privacy vs. Security: Balancing the need for personal privacy with the increasing demand for surveillance and security in public spaces or online platforms is a complex issue.&lt;br&gt;
Global Standards: Different countries have different standards for data protection and privacy (e.g., GDPR in the EU, CCPA in California), and creating global agreements on how to approach these issues is a challenge.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>The Ultimate Guide to Data Analytics.</title>
      <dc:creator>Stacy Gathu</dc:creator>
      <pubDate>Fri, 06 Sep 2024 09:03:52 +0000</pubDate>
      <link>https://dev.to/stacy_gathu_1197123761ae4/the-ultimate-guide-to-data-analytics-2511</link>
      <guid>https://dev.to/stacy_gathu_1197123761ae4/the-ultimate-guide-to-data-analytics-2511</guid>
      <description>

&lt;p&gt;&lt;strong&gt;Data Analytics: Unveiling Insights from the Numbers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In today's data-driven world, data analytics has emerged as a critical discipline, transforming raw data into actionable insights and strategic decisions. Whether in business, healthcare, finance, or any other sector, data analytics plays a pivotal role in understanding trends, improving operations, and driving innovation. This article explores the essence of data analytics, its key techniques, and its impact on various industries.&lt;/p&gt;

&lt;p&gt;What is Data Analytics?&lt;br&gt;
Data analytics is the process of examining and interpreting large datasets to uncover hidden patterns, correlations, and insights. It involves using statistical and computational techniques to extract valuable information from data and support decision-making. The goal of data analytics is to turn raw data into meaningful insights that can inform strategies, solve problems, and optimize performance.&lt;/p&gt;

&lt;p&gt;The Four Types of Data Analytics&lt;br&gt;
Descriptive Analytics:&lt;/p&gt;

&lt;p&gt;Objective: To summarize and describe the main features of a dataset.&lt;br&gt;
Techniques: Use of statistical measures such as mean, median, mode, and standard deviation, along with data visualization tools like charts and graphs.&lt;br&gt;
Applications: Understanding historical performance, tracking metrics, and creating reports. For example, a company might use descriptive analytics to summarize quarterly sales data and identify trends.&lt;br&gt;
Diagnostic Analytics:&lt;/p&gt;

&lt;p&gt;Objective: To understand the reasons behind past outcomes or events.&lt;br&gt;
Techniques: Root cause analysis, correlation analysis, and regression analysis to identify relationships and causes.&lt;br&gt;
Applications: Investigating why a sales drop occurred, analyzing customer churn, or determining the factors contributing to operational inefficiencies.&lt;br&gt;
Predictive Analytics:&lt;/p&gt;

&lt;p&gt;Objective: To forecast future events based on historical data.&lt;br&gt;
Techniques: Use of statistical models and machine learning algorithms such as time series analysis, regression models, and classification algorithms.&lt;br&gt;
Applications: Predicting customer behavior, forecasting sales, or identifying potential risks. For instance, predictive analytics can help a retailer anticipate inventory needs based on seasonal trends.&lt;br&gt;
Prescriptive Analytics:&lt;/p&gt;

&lt;p&gt;Objective: To recommend actions based on data-driven insights.&lt;br&gt;
Techniques: Optimization algorithms, simulation models, and decision analysis.&lt;br&gt;
Applications: Recommending strategies for increasing sales, optimizing supply chain logistics, or suggesting marketing campaigns. For example, prescriptive analytics can advise a company on the best pricing strategy to maximize profit.&lt;br&gt;
Key Techniques and Tools&lt;br&gt;
Data Mining:&lt;/p&gt;

&lt;p&gt;Definition: The process of discovering patterns and relationships in large datasets using algorithms and statistical techniques.&lt;br&gt;
Techniques: Clustering, classification, association rule mining, and anomaly detection.&lt;br&gt;
Tools: R, Python, and specialized software like RapidMiner and KNIME.&lt;br&gt;
Statistical Analysis:&lt;/p&gt;

&lt;p&gt;Definition: Applying mathematical theories and formulas to analyze data and infer properties of a population based on sample data.&lt;br&gt;
Techniques: Hypothesis testing, ANOVA (Analysis of Variance), and regression analysis.&lt;br&gt;
Tools: SPSS, SAS, and R.&lt;br&gt;
Data Visualization:&lt;/p&gt;

&lt;p&gt;Definition: The graphical representation of data to make complex information more understandable and accessible.&lt;br&gt;
Techniques: Creating charts, graphs, heat maps, and dashboards.&lt;br&gt;
Tools: Tableau, Power BI, D3.js, and Excel.&lt;/p&gt;
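For instance, the descriptive measures mentioned above can be computed in a couple of lines of pandas; the quarterly sales figures here are invented for illustration:

```python
import pandas as pd

# Made-up quarterly sales figures summarised with descriptive statistics
sales = pd.Series([120, 135, 150, 95], index=["Q1", "Q2", "Q3", "Q4"])
print("mean:", sales.mean())       # 125.0
print("median:", sales.median())   # 127.5
print("std:", sales.std())
print(sales.describe())
```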

</description>
      <category>data</category>
      <category>analytics</category>
    </item>
    <item>
      <title>Feature Engineering Ultimate Guide.</title>
      <dc:creator>Stacy Gathu</dc:creator>
      <pubDate>Sat, 24 Aug 2024 16:03:01 +0000</pubDate>
      <link>https://dev.to/stacy_gathu_1197123761ae4/feature-engineering-ultimate-guide-c8</link>
      <guid>https://dev.to/stacy_gathu_1197123761ae4/feature-engineering-ultimate-guide-c8</guid>
      <description>&lt;p&gt;Feature engineering can be defined as the process of selecting, extracting and transforming raw data into features that are suitable for machine learning models. &lt;/p&gt;

&lt;p&gt;This can be achieved through a number of techniques:&lt;/p&gt;

&lt;p&gt;Domain Knowledge: Utilize knowledge from the field to create features that capture important aspects of the data. For example, in financial data, you might create features like moving averages or volatility.&lt;br&gt;
Mathematical Transformations: Apply mathematical operations to existing features, such as taking logarithms or creating polynomial features, to capture non-linear relationships.&lt;br&gt;
Feature Extraction:&lt;/p&gt;

&lt;p&gt;Dimensionality Reduction: Techniques like Principal Component Analysis (PCA) can reduce the number of features while retaining essential information. This helps in simplifying models and reducing overfitting.&lt;br&gt;
Text Feature Extraction: For text data, methods like Term Frequency-Inverse Document Frequency (TF-IDF) or embeddings like Word2Vec transform text into numerical features.&lt;br&gt;
&lt;strong&gt;Feature Selection:&lt;/strong&gt;&lt;/p&gt;
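&lt;p&gt;To make dimensionality reduction concrete, here is a minimal PCA sketch built directly on NumPy's SVD; in practice you would more likely reach for a library implementation such as scikit-learn's PCA, and the random data below is purely illustrative:&lt;/p&gt;

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project X onto its top principal components using the SVD."""
    X_centered = X - X.mean(axis=0)          # PCA requires centred data
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T  # scores in the reduced space

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # 100 samples, 5 features
X_reduced = pca_reduce(X, 2)         # keep the 2 strongest directions
```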

&lt;p&gt;Filter Methods: Use statistical tests or correlation coefficients to select features that have a strong relationship with the target variable.&lt;br&gt;
Wrapper Methods: Use algorithms to evaluate the performance of feature subsets and select the best combination of features.&lt;br&gt;
Embedded Methods: Use feature selection techniques integrated within the learning algorithm, such as regularization in linear models.&lt;br&gt;
&lt;strong&gt;Handling Missing Values:&lt;/strong&gt;&lt;/p&gt;
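&lt;p&gt;A filter method can be as simple as a correlation threshold. The columns and the 0.5 cut-off below are hypothetical, chosen only to show the idea:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200
target = rng.normal(size=n)
df = pd.DataFrame({
    "signal": target * 2 + rng.normal(scale=0.1, size=n),  # strongly related
    "noise": rng.normal(size=n),                           # unrelated
    "target": target,
})

# Filter method: keep features whose absolute correlation with the
# target exceeds the threshold
corr = df.drop(columns="target").corrwith(df["target"]).abs()
selected = corr[corr > 0.5].index.tolist()
```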

&lt;p&gt;Imputation: Fill missing values using statistical methods or machine learning models to ensure that the dataset remains complete and useful.&lt;br&gt;
Feature Engineering: Create indicators for missing values or use domain knowledge to handle them appropriately.&lt;br&gt;
&lt;strong&gt;Normalization and Scaling:&lt;/strong&gt;&lt;/p&gt;
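&lt;p&gt;Both ideas, missing-value indicators and imputation, can be sketched in a few lines of pandas (the toy columns below are made up):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [50, 60, np.nan, 55],
})

# Indicator features preserve the fact that a value was absent
for col in ["age", "income"]:
    df[f"{col}_missing"] = df[col].isna().astype(int)

# Median imputation keeps the dataset complete
df = df.fillna(df.median())
```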

&lt;p&gt;Standardization: Transform features to have zero mean and unit variance, which helps many algorithms perform better.&lt;br&gt;
Min-Max Scaling: Scale features to a fixed range, typically [0, 1], which is useful for algorithms sensitive to the scale of the input data.&lt;/p&gt;
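&lt;p&gt;Both scalings are one-liners with NumPy (the values are made up):&lt;/p&gt;

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])

# Standardization: zero mean, unit variance
z = (x - x.mean()) / x.std()

# Min-max scaling: map values into the [0, 1] range
x_scaled = (x - x.min()) / (x.max() - x.min())
```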

</description>
      <category>data</category>
    </item>
    <item>
      <title>Understanding Your Data: The Essentials of Exploratory Data Analysis</title>
      <dc:creator>Stacy Gathu</dc:creator>
      <pubDate>Sun, 11 Aug 2024 17:40:39 +0000</pubDate>
      <link>https://dev.to/stacy_gathu_1197123761ae4/understanding-your-data-the-essentials-of-exploratory-data-analysis-4fl9</link>
      <guid>https://dev.to/stacy_gathu_1197123761ae4/understanding-your-data-the-essentials-of-exploratory-data-analysis-4fl9</guid>
      <description>&lt;p&gt;Exploratory data analysis is the process by which we try to gain an understanding of the data we want to analyze. For instance, we might want to know the size of our data, the data types, the presence of any outliers or anomalies, the relationships that may exist between the variables in our data and so on.&lt;/p&gt;

&lt;p&gt;We do this using a number of techniques.&lt;/p&gt;

&lt;p&gt;Some basic exploration of our dataset could be finding out its statistical properties. This may include measures of central tendency such as the mean and median, as well as measures of dispersion such as the variance and the inter-quartile range. These come in handy as we try to identify outliers and anomalies and impute missing values in our data. They also inform our decision to transform some variables if the scales of the features we select are very different, especially if we are using models that are sensitive to feature scale, such as regularized regression or distance-based models.&lt;/p&gt;
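&lt;p&gt;For example, pandas computes these summary statistics in a single call, and the inter-quartile range gives a quick rule of thumb for flagging outliers (the heights below are made up):&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({"height": [150, 160, 165, 170, 210]})

summary = df["height"].describe()    # count, mean, std, quartiles, ...
iqr = summary["75%"] - summary["25%"]

# A common rule of thumb flags points beyond 1.5 * IQR as outliers
upper = summary["75%"] + 1.5 * iqr
outliers = df[df["height"] > upper]
```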

&lt;p&gt;As stated earlier, we are interested in finding out whether our data has any missing values or anomalies. This matters because we may want to use the data to train models to make predictions, and anomalies can heavily skew the data, leading to incorrect predictions. From this exploration we can decide whether to fill in missing values with a suitable replacement, drop some variables from the dataset, or identify the cause of the anomalies and refine our data collection methods.&lt;/p&gt;

&lt;p&gt;We can also use graphical representations such as bar graphs and heatmaps to show relationships between variables, such as correlation. This again helps with feature selection for our training data, and it makes properties like outliers and skewness easy to spot.&lt;/p&gt;
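&lt;p&gt;A minimal sketch of these checks in pandas, on a made-up dataset:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0, 5.0],
    "y": [2.1, 3.9, 6.2, 8.0, 9.9],
})

missing_per_column = df.isna().sum()   # how many values are absent per column
corr_matrix = df.corr()                # pairwise correlations between variables
```

The correlation matrix is exactly what a heatmap would visualize; pandas drops missing values pairwise when computing it.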

</description>
      <category>data</category>
      <category>exploration</category>
    </item>
    <item>
      <title>The Ultimate Guide to Data Analytics: Techniques and Tools</title>
      <dc:creator>Stacy Gathu</dc:creator>
      <pubDate>Sun, 04 Aug 2024 16:43:29 +0000</pubDate>
      <link>https://dev.to/stacy_gathu_1197123761ae4/the-ultimate-guide-to-data-analytics-techniques-and-tools-k2m</link>
      <guid>https://dev.to/stacy_gathu_1197123761ae4/the-ultimate-guide-to-data-analytics-techniques-and-tools-k2m</guid>
      <description>&lt;p&gt;The first time I heard the phrase data scientist will be the sexiest job of the 21st century, I was very intrigued. This is true as data is the new oil and data experts will be at the forefront of the discoveries of the century. One of these expert fields is the data analytics. So what exactly is data analytics?&lt;/p&gt;

&lt;p&gt;Data analytics can be described as the science of analyzing raw data to draw conclusions that help inform decisions. This guide gives an overview of the tools of the trade and the main techniques in data analytics.&lt;/p&gt;

&lt;p&gt;Tools include programming languages such as Python and R. These come with rich ecosystems of libraries that help wrangle and manipulate data, create visualizations, and develop models.&lt;/p&gt;

&lt;p&gt;Business intelligence tools such as Microsoft Power BI and Tableau let analysts explore data and build interactive dashboards and reports.&lt;/p&gt;

&lt;p&gt;Techniques include data collection, data cleaning, data wrangling, exploratory data analysis, feature selection, and data visualization.&lt;/p&gt;
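&lt;p&gt;As a tiny, hypothetical end-to-end sketch of a few of these techniques in Python with pandas (the sales data is made up):&lt;/p&gt;

```python
import pandas as pd

# Raw data with a duplicate row and a missing value
raw = pd.DataFrame({
    "region": ["north", "north", "south", "east"],
    "sales": [100.0, 100.0, None, 250.0],
})

clean = raw.drop_duplicates()   # data cleaning: remove the duplicate record
clean = clean.assign(sales=clean["sales"].fillna(clean["sales"].median()))

# Exploratory summary: total sales per region
by_region = clean.groupby("region")["sales"].sum()
```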

</description>
      <category>data</category>
      <category>analytics</category>
    </item>
  </channel>
</rss>
