<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mercy Jeruto</title>
    <description>The latest articles on DEV Community by Mercy Jeruto (@miley775).</description>
    <link>https://dev.to/miley775</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3052032%2Fa90cf1d5-0575-4df1-8f53-09310eb8440b.jpeg</url>
      <title>DEV Community: Mercy Jeruto</title>
      <link>https://dev.to/miley775</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/miley775"/>
    <language>en</language>
    <item>
      <title>Unsupervised Learning: Discovering Hidden Patterns in Data</title>
      <dc:creator>Mercy Jeruto</dc:creator>
      <pubDate>Sat, 26 Jul 2025 17:32:34 +0000</pubDate>
      <link>https://dev.to/miley775/unsupervised-learning-discovering-hidden-patterns-in-data-40al</link>
      <guid>https://dev.to/miley775/unsupervised-learning-discovering-hidden-patterns-in-data-40al</guid>
      <description>&lt;p&gt;In the world of machine learning, not all problems come with clear instructions. Sometimes, we don't know the answer in advance we just have raw data. This is where Unsupervised Learning comes in.&lt;br&gt;
While Supervised Learning teaches a model using labeled data (like predicting prices or classifying emails as spam), Unsupervised Learning lets algorithms explore and uncover structure in unlabeled datasets much like solving a puzzle without knowing the final picture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Unsupervised Learning?&lt;/strong&gt;&lt;br&gt;
Unsupervised Learning is a type of machine learning where the model is trained without labeled output. The goal is to find patterns, groupings, or meaningful insights from the data based purely on its internal structure.&lt;/p&gt;

&lt;p&gt;Think of unsupervised learning like walking into a library with no catalog system. Books are scattered around, and you have no idea what genres or categories exist. You start grouping them based on clues like titles, covers, or page numbers. Eventually, you might sort them into categories like fiction, history, or science, even though no one told you those categories existed.&lt;/p&gt;

&lt;p&gt;That’s exactly what an unsupervised learning algorithm does.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Main Techniques in Unsupervised Learning&lt;/strong&gt;&lt;br&gt;
&lt;u&gt;&lt;strong&gt;1.&lt;/strong&gt; &lt;strong&gt;Clustering&lt;/strong&gt;&lt;/u&gt;&lt;br&gt;
Clustering is about grouping similar data points together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A) &lt;em&gt;KMeans Clustering&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Idea: Divide data into K groups based on similarity.&lt;/p&gt;

&lt;p&gt;How it works: The algorithm randomly picks K "centroids", assigns each data point to the closest one, then recomputes each centroid as the mean of its assigned points, repeating until the assignments stabilize.&lt;/p&gt;
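&lt;p&gt;That loop can be sketched in a few lines of NumPy. This is an illustrative toy (the two blobs of random data and K=2 are made up for the demo), not production code:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: two well-separated blobs (made up for this demo)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

K = 2
centroids = X[[0, -1]].copy()  # step 1: pick K starting centroids
for _ in range(10):
    # step 2: assign every point to its nearest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # step 3: move each centroid to the mean of its assigned points
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])

print(centroids.round(1))
```

&lt;p&gt;After a few iterations the two centroids settle near the centers of the two blobs, which is exactly the "adjust until things stabilize" step.&lt;/p&gt;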

&lt;p&gt;&lt;em&gt;Real-world use:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Customer Segmentation:&lt;/em&gt; In retail or hospitality, we can use clustering to group customers based on age, income, and spending habits.&lt;/p&gt;

&lt;p&gt;Example: Here's a step-by-step visual demonstration of applying K-Means clustering to mall customer data:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;&lt;u&gt;Understanding the Data&lt;/u&gt;&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Typical mall customer data includes: Customer ID, Age, Annual Income (k$), Spending Score (1-100).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs41jpoxi6qmk61dg2vcx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs41jpoxi6qmk61dg2vcx.png" alt=" " width="558" height="135"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A simple ASCII-style diagram to help you visualize how KMeans clustering works:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F74r1qsqqctj5q8yjvga5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F74r1qsqqctj5q8yjvga5.png" alt=" " width="367" height="200"&gt;&lt;/a&gt;&lt;br&gt;
These * symbols represent data points like customers, documents, or patterns, but we don’t yet know how they’re grouped.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkk7p4157glyxad9ay6bv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkk7p4157glyxad9ay6bv.png" alt=" " width="386" height="190"&gt;&lt;/a&gt;&lt;br&gt;
C1, C2, C3 are randomly placed centroids, each representing a starting guess for a cluster center.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7d9zwgl6otqju3h2kmvb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7d9zwgl6otqju3h2kmvb.png" alt=" " width="482" height="293"&gt;&lt;/a&gt;&lt;br&gt;
Each point now belongs to the nearest centroid, forming three clusters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ilycmjmda5vmvt77i56.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ilycmjmda5vmvt77i56.png" alt=" " width="325" height="169"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;B)&lt;/strong&gt; &lt;strong&gt;Hierarchical Clustering&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Idea:&lt;/em&gt; Build a tree of clusters (like a family tree) by progressively merging or splitting groups.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Visual Aid:&lt;/em&gt; Dendrograms show the merging process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5di1ahftc0hr3r5amhxw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5di1ahftc0hr3r5amhxw.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Use case:&lt;/em&gt; Better for visualizing how clusters are formed, especially when we don’t know the number of clusters in advance.&lt;br&gt;
Below is Python code using the mall customer data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from scipy.cluster.hierarchy import dendrogram, linkage
linkage_matrix = linkage(df[['Age', 'Spending Score (1-100)']], method='ward')
dendrogram(linkage_matrix)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;u&gt;&lt;strong&gt;2.&lt;/strong&gt; &lt;strong&gt;Dimensionality Reduction&lt;/strong&gt;&lt;/u&gt;&lt;br&gt;
When working with machine learning models, datasets with too many features can cause issues like slow computation and overfitting. Dimensionality reduction helps to reduce the number of features while retaining key information. Techniques like &lt;strong&gt;principal component analysis (PCA), singular value decomposition (SVD)&lt;/strong&gt; and &lt;strong&gt;linear discriminant analysis (LDA)&lt;/strong&gt; convert data into a lower-dimensional space while preserving important details.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A) &lt;em&gt;Principal Component Analysis (PCA)&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Idea:&lt;/em&gt; Transform the data into fewer dimensions while preserving variance.&lt;/p&gt;

&lt;p&gt;How it helps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Improves speed and efficiency.&lt;/li&gt;
&lt;li&gt;Useful for visualizing high-dimensional data in 2D or 3D.&lt;/li&gt;
&lt;li&gt;Helps in noise reduction and pattern recognition.&lt;/li&gt;
&lt;/ol&gt;
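&lt;p&gt;A minimal sketch of PCA with scikit-learn; the 4-feature dataset below is synthetic (two of its features are near-copies of the other two), invented purely to show dimensionality reduction in action:&lt;/p&gt;

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 4 features, but two are near-duplicates of the other two
base = rng.normal(size=(100, 2))
X = np.hstack([base, base + rng.normal(scale=0.05, size=(100, 2))])

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)  # project 4 dimensions down to 2

print(X_2d.shape)
print(round(pca.explained_variance_ratio_.sum(), 3))  # close to 1.0
```

&lt;p&gt;The explained variance ratio summing to nearly 1.0 tells us the two components kept almost all of the information, which is the point of PCA.&lt;/p&gt;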

&lt;p&gt;&lt;u&gt;&lt;em&gt;Why It Matters&lt;/em&gt;&lt;/u&gt;&lt;br&gt;
&lt;em&gt;You don’t always have labels:&lt;/em&gt; In real-world business problems, labels (like churned customers or fraud cases) are expensive or missing.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Find hidden insights:&lt;/em&gt; Unsupervised models can uncover segments, anomalies, or structures you didn’t even know existed.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Better targeting:&lt;/em&gt; Marketers use clustering to tailor campaigns. Hotels can design loyalty offers based on cluster behavior. Security teams can identify unusual access patterns.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;&lt;strong&gt;Analogy:&lt;/strong&gt; Sorting Socks Blindfolded&lt;/u&gt;&lt;br&gt;
Imagine reaching into a laundry basket blindfolded. You don’t know the colors, but you try sorting socks by feel — thick vs. thin, long vs. short. Eventually, you form groups. That’s unsupervised learning — no labels, just internal patterns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

kmeans = KMeans(n_clusters=5)
df['Cluster'] = kmeans.fit_predict(df[['Annual Income (k$)', 'Spending Score (1-100)']])

# Plot clusters
plt.figure(figsize=(8, 5))
plt.scatter(df['Annual Income (k$)'], df['Spending Score (1-100)'], c=df['Cluster'], cmap='rainbow')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score')
plt.title('Customer Segmentation using KMeans')
plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives a powerful visualization — showing customer groups based on how they spend and what they earn.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/u&gt;&lt;br&gt;
Unsupervised learning may seem like data alchemy at first — turning unlabeled, messy data into gold. But with tools like KMeans, Hierarchical Clustering, and PCA, we can find meaning in the chaos. Whether you're segmenting customers, detecting fraud, or simplifying data — these techniques are vital for modern data-driven decision making.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Python for SQL: Beginner Level-Introduction to Database Connectivity</title>
      <dc:creator>Mercy Jeruto</dc:creator>
      <pubDate>Sat, 24 May 2025 07:26:27 +0000</pubDate>
      <link>https://dev.to/miley775/python-for-sql-beginner-level-introduction-to-database-connectivity-5932</link>
      <guid>https://dev.to/miley775/python-for-sql-beginner-level-introduction-to-database-connectivity-5932</guid>
      <description>&lt;p&gt;&lt;strong&gt;Outline&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Python is a widely used and flexible programming language applicable in numerous areas including web development, scientific computing, and data analysis. A key advantage of Python is its capability to connect with and interact with databases. In this article, we will provide an introduction to using Python for SQL and show how it can be utilized to connect to and manage databases.&lt;br&gt;
Although SQL is a robust language for interacting with databases, it does have its drawbacks. For instance, crafting intricate queries or conducting data analysis can be challenging when relying solely on SQL. In contrast, Python is a versatile programming language that excels in executing complex data analysis, machine learning, and web development activities. By integrating Python with SQL, you can leverage the strengths of both languages to carry out more sophisticated data analysis and database management tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python for SQL: Database Connectivity&lt;/strong&gt;&lt;br&gt;
To link Python with a database, you must utilize a Python library or module that offers a database driver. A database driver is a software element that establishes a connection between the Python application and the database management system.&lt;br&gt;
Numerous well-known Python libraries exist for connecting to databases, such as:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;PyMySQL:&lt;/em&gt;&lt;/strong&gt; A pure Python MySQL driver that allows you to connect to a MySQL database and perform SQL queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;psycopg2:&lt;/em&gt;&lt;/strong&gt; A PostgreSQL database adapter that provides access to the PostgreSQL database server.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;sqlite3:&lt;/em&gt;&lt;/strong&gt; A built-in Python library for working with SQLite databases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;SQLAlchemy:&lt;/em&gt;&lt;/strong&gt; A SQL toolkit and Object-Relational Mapping (ORM) library that provides a high-level interface to SQL databases.&lt;/li&gt;
&lt;/ol&gt;
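&lt;p&gt;All four libraries share the same DB-API pattern: connect, get a cursor, execute, fetch. The built-in sqlite3 module needs no server or installation, so here is a minimal sketch of that pattern (the table and row are made up for illustration):&lt;/p&gt;

```python
import sqlite3

# An in-memory database: nothing to install, nothing to clean up
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE books (author TEXT, name TEXT)")
cur.execute("INSERT INTO books VALUES (?, ?)", ("Enid", "Secret Seven"))
conn.commit()

cur.execute("SELECT author, name FROM books")
rows = cur.fetchall()
print(rows)  # [('Enid', 'Secret Seven')]
conn.close()
```

&lt;p&gt;Note that the placeholder style differs by driver: sqlite3 uses ?, while PyMySQL (used below) uses %s.&lt;/p&gt;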

&lt;p&gt;In this article, we will utilize the PyMySQL library to establish a connection with and manage a MySQL database. PyMySQL is a Python library designed for connecting to MySQL databases. MySQL, an open-source relational database management system, is commonly used for web applications. PyMySQL offers an easy-to-use interface for connecting to MySQL databases and executing SQL queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connecting to a MySQL Database&lt;/strong&gt;&lt;br&gt;
To connect to a MySQL database using Python, you first need to install the PyMySQL library. Run the following command to install it using the pip package manager:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install pymysql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you have installed the PyMySQL library, you can create a connection to a MySQL database using the connect() function. The connect() function takes several parameters, including the hostname, username, password, and database name. Here’s an example of how to create a connection object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pymysql 
connection = pymysql.connect(host='localhost', user='root', password='passCode', db='your_database')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, we are connecting to a MySQL database running on the local machine. We are using the root user and the password “passCode” to authenticate ourselves. Finally, we are connecting to a database called “your_database”.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creating a Cursor object&lt;/strong&gt;&lt;br&gt;
A cursor object is used to execute SQL queries against a database. A cursor object acts as a pointer to a specific location in the database, allowing you to retrieve, insert, update, or delete data. In PyMySQL library, creating a cursor object is an essential step in executing SQL queries.&lt;/p&gt;

&lt;p&gt;The database connection object is used to generate the cursor object. To create a cursor object, you need to call the cursor() method on the database connection object. Here is an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pymysql 
# Open database connection 
db = pymysql.connect("localhost","user","password","database_name" ) 

# Create a cursor object 
cursor = db.cursor() 

# Execute SQL query 
cursor.execute("SELECT * FROM table_name") 

# Fetch all rows 
rows = cursor.fetchall()

for row in results: 
print(row)
# Close database connection 
db.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, we first open a database connection using the PyMySQL library. The connection requires the host, user, password, and database name to connect to a MySQL database. Once the connection is established, we create a cursor object using the cursor() method.&lt;/p&gt;

&lt;p&gt;After the cursor object is created, we can execute an SQL query using the execute() method. In this example, we execute a SELECT statement that retrieves all rows from a specific table in the database.&lt;/p&gt;

&lt;p&gt;The fetchall() method is then called on the cursor object to retrieve all rows from the SELECT statement. The rows are stored in a variable named rows. We are iterating over the results and printing each row. Then, we close the database connection using the close() method.&lt;br&gt;
&lt;em&gt;It’s important to note that the cursor object does not retrieve any data until a query is executed. The execute() method is used to execute the SQL query, and the fetchall() method retrieves the data from the query.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;To add data to the database&lt;/strong&gt;&lt;br&gt;
In order to add data to a database with pymysql, it is necessary to first create a connection to the database and then execute SQL commands. Below is an example code snippet demonstrating how to insert data into a MySQL database using pymysql:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pymysql

# Connect to the database
connection = pymysql.connect(host='localhost',
                             user='username',
                             password='password',
                             db='database_name')

# Cursor object creation
cursor = connection.cursor()

# Define the SQL query
sql_query = "INSERT INTO books (author, name, mail, pages) VALUES (%s, %s, %s, %s)"  # PyMySQL uses %s for all types

# Data insertion
data = ("Enid", "Secret Seven", "enid@example.com", 21)


# Execute the query with the data
cursor.execute(sql_query, data)

# Commit the changes
connection.commit()

# Close the cursor and the connection
cursor.close()
connection.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SQL query we establish is an INSERT statement designed to add data to the books table. The values to be inserted are indicated by the query’s %s placeholders (PyMySQL uses %s for every value, regardless of its type).&lt;/p&gt;

&lt;p&gt;Subsequently, we define the data to be inserted as a tuple comprising four elements: the author, name, email, and page count. We execute the query with the data by utilizing the execute() method of the cursor object. Following this, we commit the changes through the commit() method of the connection object.&lt;/p&gt;
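&lt;p&gt;When several rows need to be inserted at once, the cursor also offers executemany(). Here is a runnable sketch of the same pattern using the built-in sqlite3 module, so no MySQL server is needed; with PyMySQL the placeholders would be %s instead of ?, and the second book row is invented for the demo:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE books (author TEXT, name TEXT, mail TEXT, pages INTEGER)")

# One tuple per row to insert
rows = [
    ("Enid", "Secret Seven", "enid@example.com", 21),
    ("Enid", "Famous Five", "enid@example.com", 25),
]
cur.executemany("INSERT INTO books VALUES (?, ?, ?, ?)", rows)
conn.commit()

cur.execute("SELECT COUNT(*) FROM books")
count = cur.fetchone()[0]
print(count)  # 2
conn.close()
```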

&lt;p&gt;It is important to note that if you wish to insert multiple rows of data simultaneously, you may opt for the executemany() method instead of execute(). The executemany() method accepts a list of tuples as its second argument, with each tuple representing a row of data to be inserted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dealing with Errors&lt;/strong&gt;&lt;br&gt;
Handling errors is an important part of writing reliable code in PyMySQL. Here’s an example of how to handle errors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pymysql

# Connect to the database
try:
    connection = pymysql.connect(host='localhost',
                                 user='username',
                                 password='password',
                                 db='database_name')
except pymysql.Error as e:
    print("Error connecting to database:", e)
    exit()

# Create a cursor object
try:
    cursor = connection.cursor()
except pymysql.Error as e:
    print("Error creating cursor:", e)
    exit()

# Define the SQL query
sql_query = "INSERT INTO books (author, name, mail, pages, weight) VALUES (%s, %s, %s, %s, %s)"

# Define the data to be inserted
data = ("Enid", "Secret Seven", "enid@example.com", 21, 2)

# Execute the query with the data
try:
    cursor.execute(sql_query, data)
except pymysql.Error as e:
    print("Error executing query:", e)
    exit()

# Commit the changes
try:
    connection.commit()
except pymysql.Error as e:
    print("Error committing changes:", e)
    exit()

# Close the cursor and the connection
try:
    cursor.close()
    connection.close()
except pymysql.Error as e:
    print("Error closing connection:", e)
    exit()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the example above, we use try-except blocks to handle errors and exceptions at different stages of the code. When we connect to the database, we use a try-except block to catch any errors that may occur. If an error occurs, we print an error message and exit the program.&lt;/p&gt;

&lt;p&gt;We do the same thing when we create a cursor object, execute the query, commit the changes, and close the cursor and connection.&lt;/p&gt;

&lt;p&gt;If an error occurs at any of these stages, we print an error message and exit the program.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Python can be used to connect to SQL databases and execute queries using libraries like PyMySQL, sqlite3, and SQLAlchemy.&lt;/li&gt;
&lt;li&gt;Using Python with SQL can allow for more powerful and flexible data analysis and manipulation, as well as easier automation of database tasks.&lt;/li&gt;
&lt;li&gt;Python code can be used to insert, update, and delete data in a database, as well as query and retrieve data.&lt;/li&gt;
&lt;li&gt;When using Python with SQL, it’s important to handle errors and exceptions gracefully to ensure reliability and prevent unexpected behavior.&lt;/li&gt;
&lt;li&gt;Python can also be used with other database technologies, such as NoSQL databases and object-relational mappers (ORMs), depending on the needs of your project.&lt;/li&gt;
&lt;li&gt;Learning how to use Python with SQL can be a valuable skill for data analysts, data scientists, and software developers who work with databases.&lt;/li&gt;
&lt;li&gt;There are various resources for mastering Python with SQL for data analysis, such as edX, Scaler, freeCodeCamp, and KDnuggets.&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>Falling behind? Data Analytics and Physical Security in Kenya</title>
      <dc:creator>Mercy Jeruto</dc:creator>
      <pubDate>Fri, 23 May 2025 17:07:55 +0000</pubDate>
      <link>https://dev.to/miley775/data-analytics-and-physical-security-in-kenya-3gof</link>
      <guid>https://dev.to/miley775/data-analytics-and-physical-security-in-kenya-3gof</guid>
      <description>&lt;p&gt;The incorporation of data analytics in the physical security sector has the capability to bring about a new level of efficiency and effectiveness in maintaining safety, safeguarding assets, and achieving cost reductions.&lt;br&gt;
From interconnected smart systems to access control, surveillance, and incident management, significant insights can be drawn that assist security teams in making data-driven decisions. &lt;br&gt;
The implementation includes the rollout of an incident management system (e.g. applications for dispatch, incident reporting, investigations, task management and guard tours), data collection and analysis, and data visualization. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqi4xlcxc4a0qz77gwl52.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqi4xlcxc4a0qz77gwl52.jpg" alt=" " width="800" height="463"&gt;&lt;/a&gt;&lt;br&gt;
A study was conducted by Farhad Tajali at the University of Southern California while pursuing a doctoral degree in Organizational Change and Leadership. The study focused on security professionals, with a lens on skill gaps and assets, exploring knowledge, motivation and organizational (KMO) influences to identify assets and needs impacting the utilization of data analytics in the field of physical security. The literature review surfaced scholarly sources supporting the hypothesis that security professionals do not effectively utilize data for process improvement, risk mitigation and decision-making.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results and findings
&lt;/h2&gt;

&lt;p&gt;The study surfaced three key findings impacting security professionals:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Slow adoption of data analytics (established through triangulation of quantitative and qualitative research data)&lt;/li&gt;
&lt;li&gt;Lack of motivation to engage in data collection and analysis&lt;/li&gt;
&lt;li&gt;Provisioning of resources&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When asked to identify barriers in utilization of data, security professionals reported knowledge and resources as the two main barriers in their profession. Additionally, 94% of security professionals who currently do not perform any form of data analysis reported they would utilize data analytics if they knew more about how to use them effectively.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7k747q1v97uwkqvpp0y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7k747q1v97uwkqvpp0y.png" alt=" " width="517" height="301"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;With this in mind:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Is the Kenyan security industry lagging in integrating data analytics into daily operations and translating quantitative data into practical insights?&lt;/li&gt;
&lt;li&gt;Are any security professionals using data analytics to inform data-driven decision-making?&lt;/li&gt;
&lt;li&gt;Are security professionals motivated by other sectors in their use of data analytics for data-driven decision-making?&lt;/li&gt;
&lt;li&gt;Are security companies in Kenya facilitating or hindering the implementation and use of data analytics?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;From my own research, several well-known security brands in Kenya are already integrating data analytics into physical security operations. (&lt;em&gt;This information was sourced from their websites&lt;/em&gt;)&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Opticom Kenya Limited&lt;/li&gt;
&lt;li&gt;Castor Vali Group&lt;/li&gt;
&lt;li&gt;Securex Agencies Limited&lt;/li&gt;
&lt;li&gt;WS Insight Limited&lt;/li&gt;
&lt;li&gt;GardaWorld Security&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Security professionals do not need to be data scientists, but they should strive to be data champions within their organizations. Security professionals should seek adequate training and resources in data analytics. Using data, security professionals can tell a better story of the state of security of their clients and within their organizations. &lt;br&gt;
Such intelligence can lead to proactive risk mitigation and effective program management, and puts the organization at an advantage when facing new threats.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>The Role of Data Science &amp; Analytics in different industry sectors</title>
      <dc:creator>Mercy Jeruto</dc:creator>
      <pubDate>Thu, 24 Apr 2025 10:15:14 +0000</pubDate>
      <link>https://dev.to/miley775/the-role-of-data-science-analytics-in-different-industry-sectors-4n5l</link>
      <guid>https://dev.to/miley775/the-role-of-data-science-analytics-in-different-industry-sectors-4n5l</guid>
      <description>&lt;p&gt;The importance of data science has immensely improved and affected every field, from health to finance, causing a shift towards data-driven insight and efficiency in organizational decisions.&lt;br&gt;
Every day, organizations' collect a huge amount of data, and data science makes a difference by converting the raw data into valuable insights.&lt;br&gt;
Data science and analytics is transforming industries by enabling data-driven decision-making, optimizing operations, and unlocking new opportunities. Here’s how they are reshaping various sectors today:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Healthcare&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Predictive Analytics:&lt;/strong&gt;&lt;/em&gt; Data analytics can track disease patterns and predict potential outbreaks. By analyzing factors such as population density, climate, and travel data, healthcare organizations can take proactive measures to control the spread of diseases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Hospital resource optimization:&lt;/em&gt;&lt;/strong&gt; By analyzing patient admission rates, bed occupancy, and staff availability, hospitals can predict peak demand periods and allocate resources accordingly, ensuring efficient patient care.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Early Disease Detection:&lt;/strong&gt;&lt;/em&gt; Sophisticated algorithms analyze patient data, such as medical histories, test results, and genetic information, to identify early signs of diseases like cancer, diabetes, and cardiovascular issues. These algorithms can predict disease risks, allowing healthcare providers to initiate preventative measures and interventions in a timely manner.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Repurposing Existing Drugs:&lt;/strong&gt;&lt;/em&gt; Through data analysis, researchers can identify alternative uses for existing drugs, bringing new treatments to market faster and at lower costs compared to developing entirely new drugs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Finance &amp;amp; Banking&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Marketing and Customer Analytics:&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
By analysing customer data, banks can develop targeted marketing campaigns, segment customer bases, and create personalised experiences.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Operational Efficiency:&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
Data science optimises internal processes, automates manual tasks, and streamlines workflows, reducing costs and improving productivity.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Regulatory Compliance:&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
Banks must comply with numerous regulations and standards.&lt;br&gt;
Data science helps ensure compliance by analysing vast amounts of data, identifying potential anomalies, and facilitating audits and reporting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Manufacturing &amp;amp; Supply Chain&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Forecasting:&lt;/strong&gt;&lt;/em&gt; With the ability to integrate more data with higher granularity, companies can utilize predictive and prescriptive analysis to improve the accuracy of demand forecasting.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Pricing:&lt;/strong&gt;&lt;/em&gt; Dynamic pricing allows companies to explore the customer demand curve more fully, react to market behavior, and drive market growth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Transportation &amp;amp; Logistics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Route Optimization:&lt;/em&gt;&lt;/strong&gt; Reduces fuel costs and delivery times (e.g., UPS, Uber).&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Demand Forecasting:&lt;/em&gt;&lt;/strong&gt; Predicts peak travel times for ride-sharing services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Agriculture (AgTech)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Weather Patterns:&lt;/em&gt;&lt;/strong&gt; Monitoring temperature, humidity, rainfall, and other climatic factors to predict how crops perform under different conditions.&lt;br&gt;
&lt;em&gt;&lt;strong&gt;Soil Health:&lt;/strong&gt;&lt;/em&gt; Collecting soil composition, moisture levels, and nutrient content data to determine the best crops to plant and the most effective fertilizers.&lt;br&gt;
&lt;em&gt;&lt;strong&gt;Crop Yields:&lt;/strong&gt;&lt;/em&gt; Analyzing historical and real-time data on crop performance to forecast yields and optimize harvesting schedules.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Market Trends:&lt;/em&gt;&lt;/strong&gt; Gathering data on commodity prices, demand fluctuations, and consumer preferences to make informed decisions about planting and selling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Telecommunications&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Network Optimization:&lt;/em&gt;&lt;/strong&gt; Telecom analytics tools offer an overview of the network’s performance, with important telecom metrics like network latency, packet loss, and MOS score. This enables telecom operators to detect bottlenecks, maximize network capacity, guarantee sufficient bandwidth, and prevent congestion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Customer Churn Prediction:&lt;/em&gt;&lt;/strong&gt; Data analytics can continually track drops in service quality, model customer behavior, and flag subscribers who are likely to leave, so operators can intervene before they cancel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Price Optimization:&lt;/em&gt;&lt;/strong&gt; A telecom company can use data analytics to obtain precise, actionable insights into consumer behavior and develop winning pricing plans. This involves studying customer responses to various pricing strategies, past purchases, and competitor pricing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Drivers of Transformation&lt;/strong&gt;&lt;br&gt;
✔ Big Data &amp;amp; Cloud Computing – Enables processing vast datasets at scale.&lt;br&gt;
✔ AI &amp;amp; Machine Learning – Automates insights and predictions.&lt;br&gt;
✔ IoT &amp;amp; Real-Time Analytics – Provides instant decision-making capabilities.&lt;br&gt;
✔ Automation &amp;amp; RPA – Streamlines repetitive tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenges&lt;/strong&gt;&lt;br&gt;
Data privacy &amp;amp; security concerns.&lt;br&gt;
Need for skilled data scientists.&lt;br&gt;
Bias in AI models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Data science and analytics are revolutionizing industries by making operations smarter, more efficient, and customer-centric. Companies that leverage these technologies gain a competitive edge, while those that don't risk falling behind.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>SQL Aggregate Functions for beginners</title>
      <dc:creator>Mercy Jeruto</dc:creator>
      <pubDate>Tue, 15 Apr 2025 10:04:58 +0000</pubDate>
      <link>https://dev.to/miley775/sql-aggregate-functions-for-beginners-3chf</link>
      <guid>https://dev.to/miley775/sql-aggregate-functions-for-beginners-3chf</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm4zbkevr19pggan2lmgg.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm4zbkevr19pggan2lmgg.jpg" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;&lt;strong&gt;SQL Aggregate Functions&lt;/strong&gt;&lt;/h2&gt;

&lt;p&gt;Data aggregation is the process of taking several rows of data and condensing them into a single result or summary. When dealing with large datasets, this is invaluable because it allows you to extract relevant insights without having to scrutinize each individual data point.&lt;/p&gt;

&lt;p&gt;So, what exactly are SQL aggregate functions? They are specialized functions that perform a calculation on a set of rows and return a single result. Unlike scalar functions, which operate on one row at a time, aggregate functions work on groups of rows. This allows you to efficiently compute statistics or generate summary information from a dataset.&lt;/p&gt;

&lt;p&gt;In this article, we will look at the importance of SQL aggregate functions and how to use them. We’ll explain them using real-world examples.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqt7oahpsp0pd6to5amo.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqt7oahpsp0pd6to5amo.jpg" alt=" " width="800" height="519"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's explore each of them below:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. COUNT()&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The COUNT() function counts the number of rows in a table or the number of non-null values in a column.&lt;/p&gt;

&lt;p&gt;Suppose you want to find out how many products are sold in your store; you can use the following query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT COUNT(*) as total_products
FROM products;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
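&lt;p&gt;Note that COUNT(*) counts every row, while COUNT(column_name) skips NULL values. One way to see the difference is with a small in-memory SQLite session in Python (the discount column here is invented for illustration):&lt;/p&gt;

```python
import sqlite3

# In-memory database with a made-up products table;
# two rows have no discount recorded (NULL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, discount REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("pen", 0.1), ("book", None), ("mug", 0.2), ("hat", None)],
)

total_rows = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
with_discount = conn.execute("SELECT COUNT(discount) FROM products").fetchone()[0]
print(total_rows, with_discount)  # 4 2 -- COUNT(discount) ignores the NULL rows
```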



&lt;p&gt;&lt;strong&gt;2. SUM()&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The SUM() function returns the total of a numeric column. It is typically used when you need the total of values such as sales revenue, quantities, or expenses. Imagine that you want to know your company's entire revenue from sales; you can find it by running the following query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT SUM(sales_amount) as total_revenue
FROM sales_data;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. AVG()&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When you need to calculate the average (mean) value of a numeric column, the AVG() function is your go-to. It is useful when looking for the average price, rating, units sold, and so on. Here is an example in a query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT AVG(Price)
FROM Products;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. MIN() and MAX()&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The MIN() function returns the smallest value within a column, which is especially useful for locating the lowest value in a dataset. On the other hand, the MAX() function returns the largest value within a column, making it useful for determining the highest value. Here is an example of each:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT MIN(price) 
FROM pieces;

SELECT MAX(price)
FROM pieces;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
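&lt;p&gt;If you want to try these without setting up a database server, a tiny in-memory SQLite session in Python lets you run MIN() and MAX() (together with AVG()) immediately. The pieces rows below are invented for illustration:&lt;/p&gt;

```python
import sqlite3

# A small, made-up pieces table for trying MIN(), MAX(), and AVG().
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pieces (name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO pieces VALUES (?, ?)",
    [("bolt", 2.0), ("gear", 8.0), ("axle", 5.0)],
)

lowest, highest, average = conn.execute(
    "SELECT MIN(price), MAX(price), AVG(price) FROM pieces"
).fetchone()
print(lowest, highest, average)  # 2.0 8.0 5.0
```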



&lt;p&gt;Other functions include:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fny2snol6r13rg3i51kbm.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fny2snol6r13rg3i51kbm.jpg" alt=" " width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's break down how ORDER BY, GROUP BY, HAVING, and LIMIT work in SQL with a concise example. Suppose you have a table named sales with the following columns:&lt;br&gt;
• id (integer)&lt;br&gt;
• product (varchar)&lt;br&gt;
• quantity (integer)&lt;br&gt;
• price (decimal)&lt;br&gt;
• sale_date (date)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SQL Query Explanation&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;GROUP BY: This clause groups rows that have the same values in specified columns into summary rows.&lt;/li&gt;
&lt;li&gt;HAVING: This clause filters groups based on a condition, similar to WHERE but for groups.&lt;/li&gt;
&lt;li&gt;ORDER BY: This clause sorts the result set by specified column(s).&lt;/li&gt;
&lt;li&gt;LIMIT: This clause restricts the number of rows returned.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Example Query&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT 
    product, 
    SUM(quantity) AS total_quantity, 
    SUM(price * quantity) AS total_sales
FROM 
    sales
GROUP BY 
    product
HAVING 
    SUM(quantity) &amp;gt; 100
ORDER BY 
    total_sales DESC
LIMIT 
    10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of the Query&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;SELECT: Choose the columns to display. Here, we select product, the sum of quantity as total_quantity, and the sum of price * quantity as total_sales.&lt;/li&gt;
&lt;li&gt;FROM: Specify the table to query, which is sales.&lt;/li&gt;
&lt;li&gt;GROUP BY: Group the results by product.&lt;/li&gt;
&lt;li&gt;HAVING: Filter groups where the total quantity is greater than 100.&lt;/li&gt;
&lt;li&gt;ORDER BY: Sort the results by total_sales in descending order.&lt;/li&gt;
&lt;li&gt;LIMIT: Return only the top 10 rows.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This query will give you the top 10 products with the highest total sales, but only for those products where the total quantity sold is greater than 100.&lt;/p&gt;
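&lt;p&gt;To see the whole pipeline run end to end, here is a sketch using Python's built-in sqlite3 module. The sales rows are invented for illustration: only laptop and mouse pass the HAVING filter, because cable sells fewer than 100 units.&lt;/p&gt;

```python
import sqlite3

# Invented sales rows for illustration; id and sale_date are left NULL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER, product TEXT, "
    "quantity INTEGER, price REAL, sale_date TEXT)"
)
conn.executemany(
    "INSERT INTO sales (product, quantity, price) VALUES (?, ?, ?)",
    [
        ("laptop", 80, 1000.0), ("laptop", 50, 1000.0),  # 130 units total
        ("mouse", 60, 20.0), ("mouse", 70, 20.0),        # 130 units total
        ("cable", 40, 5.0),                              # under 100: dropped by HAVING
    ],
)

rows = conn.execute("""
    SELECT product,
           SUM(quantity) AS total_quantity,
           SUM(price * quantity) AS total_sales
    FROM sales
    GROUP BY product
    HAVING SUM(quantity) > 100
    ORDER BY total_sales DESC
    LIMIT 10
""").fetchall()
print(rows)  # [('laptop', 130, 130000.0), ('mouse', 130, 2600.0)]
```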

&lt;p&gt;Feel free to adapt these examples to fit your specific needs! &lt;/p&gt;

&lt;p&gt;Remember, practice makes perfect.&lt;br&gt;
Good luck!&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
