<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: sumaya</title>
    <description>The latest articles on DEV Community by sumaya (@sumeya).</description>
    <link>https://dev.to/sumeya</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3255575%2F1e128d5e-d2c3-4cd8-bb13-19d7150cbac6.png</url>
      <title>DEV Community: sumaya</title>
      <link>https://dev.to/sumeya</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sumeya"/>
    <language>en</language>
    <item>
      <title>Unsupervised learning: Clustering</title>
      <dc:creator>sumaya</dc:creator>
      <pubDate>Mon, 01 Sep 2025 21:39:23 +0000</pubDate>
      <link>https://dev.to/sumeya/unsupervised-learning-clustering-3k61</link>
      <guid>https://dev.to/sumeya/unsupervised-learning-clustering-3k61</guid>
      <description>&lt;p&gt;Machine learning is divided into supervised learning and unsupervised learning. Unsupervised learning is where the dataset is explored and hidden patterns are discovered within datasets that do not contain predefined labels or outcomes. Instead of predicting known results, unsupervised learning attempts to explore the data structure and group similar data points together. One of the most widely used techniques in unsupervised learning is clustering, which organizes data into meaningful groups based on similarities. Clustering is crucial in areas such as marketing, healthcare, image analysis, and fraud detection, where large volumes of data need to be interpreted without prior labels.&lt;/p&gt;

&lt;p&gt;Clustering Models:&lt;br&gt;
K-Means Clustering&lt;br&gt;
K-Means partitions the data into a fixed number of clusters (k). Each data point is assigned to the nearest cluster center (centroid), and the centroids are updated iteratively until stability is reached. K-Means is efficient and simple, but it is sensitive to the initial choice of centroids and requires the user to predefine k.&lt;/p&gt;
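&lt;p&gt;The assign-and-update loop can be sketched from scratch in a few lines of NumPy. This is an illustrative sketch only: the data is synthetic, and the farthest-point initialisation is my own simplification to keep the demo deterministic (plain K-Means initialises at random, which is exactly what makes it sensitive to the starting centroids).&lt;/p&gt;

```python
import numpy as np

# Minimal K-Means from scratch (illustrative sketch, not production code).
def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Farthest-point initialisation: start from one random point, then
    # repeatedly add the point farthest from the current centroids.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d = np.linalg.norm(X[:, None] - np.array(centroids)[None], axis=2).min(axis=1)
        centroids.append(X[d.argmax()])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        # Step 1: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each centroid to the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # centroids stable: converged
        centroids = new_centroids
    return labels, centroids

# Two well-separated synthetic groups of 50 points each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(10, 0.5, (50, 2))])
labels, centroids = kmeans(X, k=2)
```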

&lt;p&gt;Hierarchical Clustering&lt;br&gt;
This method builds a tree-like structure (dendrogram) that shows how clusters are combined or divided. It can be agglomerative (starting with individual data points and merging them) or divisive (starting with one cluster and splitting it). Unlike K-Means, hierarchical clustering does not require specifying the number of clusters in advance but can become computationally expensive for large datasets.&lt;/p&gt;
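&lt;p&gt;A minimal agglomerative example, assuming SciPy is available (the two synthetic groups below are for illustration only):&lt;/p&gt;

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated synthetic groups of 20 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])

# linkage() builds the merge tree (the dendrogram) bottom-up; 'ward'
# merges the pair of clusters that least increases within-cluster variance.
Z = linkage(X, method="ward")

# The tree itself needs no cluster count; we only choose one when
# cutting it into flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
```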

&lt;p&gt;DBSCAN (Density-Based Spatial Clustering of Applications with Noise)&lt;br&gt;
DBSCAN groups together data points that are close to each other based on density and marks points in sparse regions as outliers. Unlike K-Means, it does not require specifying the number of clusters. It works well with irregularly shaped clusters and datasets containing noise.&lt;/p&gt;
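&lt;p&gt;A small sketch with scikit-learn, assuming it is installed (the two dense groups and the single far-away point are synthetic, for illustration only):&lt;/p&gt;

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one far-away point that should be flagged as noise.
rng = np.random.default_rng(0)
dense = np.vstack([rng.normal(0, 0.2, (30, 2)), rng.normal(5, 0.2, (30, 2))])
X = np.vstack([dense, [[50.0, 50.0]]])

# eps is the neighbourhood radius; min_samples is how many neighbours a
# point needs to count as a core point. No cluster count is specified.
db = DBSCAN(eps=1.0, min_samples=5).fit(X)
labels = db.labels_  # noise points get the label -1
```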

&lt;p&gt;Gaussian Mixture Models (GMMs)&lt;br&gt;
GMM assumes that data is generated from a mixture of several Gaussian distributions. It uses probability to assign points to clusters (soft clustering), which allows for uncertainty in cluster assignments. GMM is useful in complex data distributions but can be computationally intensive.&lt;/p&gt;
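&lt;p&gt;Soft assignment is the distinctive part, so the sketch below (scikit-learn, synthetic one-dimensional data) inspects the per-component probabilities rather than hard labels:&lt;/p&gt;

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Samples drawn from two Gaussians.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (100, 1)), rng.normal(4, 0.5, (100, 1))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Soft clustering: each point gets a probability per component,
# which carries the uncertainty of its assignment.
proba = gmm.predict_proba(X)
```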

&lt;p&gt;Applications of Clustering&lt;/p&gt;

&lt;p&gt;Clustering is widely applied across industries:&lt;/p&gt;

&lt;p&gt;Customer Segmentation: Companies use clustering to group customers based on purchasing behavior, allowing for targeted marketing and personalized recommendations.&lt;/p&gt;

&lt;p&gt;Fraud Detection: Unusual behavior in financial transactions can be identified as anomalies through clustering.&lt;/p&gt;

&lt;p&gt;Healthcare: Patient data can be clustered to identify disease patterns, predict risks, and personalize treatment plans.&lt;/p&gt;

&lt;p&gt;Insights and Challenges&lt;/p&gt;

&lt;p&gt;Clustering provides deep insights by revealing hidden structures in data. It enables organizations to make informed decisions, identify unusual patterns, and explore relationships that are not immediately obvious. However, clustering also presents challenges:&lt;/p&gt;

&lt;p&gt;Choosing the right number of clusters: Algorithms like K-Means require predefined cluster numbers, which may not always be obvious.&lt;/p&gt;

&lt;p&gt;Scalability: Some clustering methods struggle with very large or high-dimensional datasets.&lt;/p&gt;

&lt;p&gt;Sensitivity: Many algorithms are sensitive to feature scaling, noise, and initialization.&lt;/p&gt;

&lt;p&gt;Interpretability: Clusters may not always have clear real-world meaning, making insights harder to explain.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>SUPERVISED LEARNING IN CLASSIFICATION.</title>
      <dc:creator>sumaya</dc:creator>
      <pubDate>Mon, 25 Aug 2025 05:45:40 +0000</pubDate>
      <link>https://dev.to/sumeya/supervised-learning-in-classification-5127</link>
      <guid>https://dev.to/sumeya/supervised-learning-in-classification-5127</guid>
      <description>&lt;p&gt;Machine learning requires teaching computers to learn from experience and improve over time without being directly programmed for every task. It has different approaches depending on the type of problem. For example, in supervised learning, the model learns from data that already has answers, while in unsupervised learning, the model tries to find hidden patterns or groupings in data without given labels. &lt;/p&gt;

&lt;p&gt;Supervised learning is one of the most common types of machine learning. In this approach, a model is trained using data that already has labels. Labels are the correct answers we want the model to learn. For example, if we want a system to recognize whether an email is spam or not, we train it with many emails that are already marked as “spam” or “not spam.” By studying these examples, the model learns the patterns that connect the input data to the output labels.&lt;/p&gt;

&lt;p&gt;In supervised learning, the goal is to make predictions when given new and unseen data. There are two main types: classification and regression. In classification, the model predicts categories, such as “Pass” or “Fail,” or “Disease” and “No Disease.” In regression, the model predicts numbers, such as predicting the price of a house based on its features.&lt;/p&gt;

&lt;p&gt;Several algorithms are commonly used in supervised learning. Some of them include logistic regression, decision trees, random forests, support vector machines, and neural networks. Each has its strengths and is chosen depending on the type of problem and the data available.&lt;/p&gt;

&lt;p&gt;To check how well a supervised learning model performs, we use evaluation methods such as accuracy, precision, recall, F1-score, and confusion matrices. These metrics show whether the model is making reliable predictions or if it needs improvement.&lt;/p&gt;
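&lt;p&gt;The whole workflow of training on labelled examples, predicting on held-out data, and scoring with these metrics can be sketched with scikit-learn (the dataset below is synthetic, and the exact metric values depend on the data):&lt;/p&gt;

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

# Synthetic labelled data standing in for a real dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out a quarter of the examples to simulate new and unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows are true classes, columns predicted
```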


&lt;p&gt;From my perspective, I have learnt that supervised learning is important because it helps computers learn from examples and make predictions with high accuracy. By using labeled data, it can solve many real-world problems such as suggesting products, diagnosing diseases, or even predicting student performance. Its power lies in its ability to take past knowledge and apply it to new situations, making it one of the most practical and widely used areas of machine learning.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Essay on Type I and Type II Errors in a Medical Scenario</title>
      <dc:creator>sumaya</dc:creator>
      <pubDate>Sun, 17 Aug 2025 10:57:47 +0000</pubDate>
      <link>https://dev.to/sumeya/essay-on-type-i-and-type-ii-errors-in-a-medical-scenario-85b</link>
      <guid>https://dev.to/sumeya/essay-on-type-i-and-type-ii-errors-in-a-medical-scenario-85b</guid>
      <description>&lt;p&gt;In statistics and decision-making, the concepts of Type I and Type II errors play a crucial role in shaping how tests and experiments are designed. These errors, often referred to as false positives and false negatives, are not just abstract ideas but carry serious consequences when applied in real-world contexts. One of the most important fields where these errors must be carefully balanced is medicine. In medical testing and diagnosis, the trade-off between Type I and Type II errors often determines whether patients receive appropriate care, unnecessary treatment, or, in the worst cases, no treatment at all.&lt;/p&gt;

&lt;p&gt;A Type I error occurs when we conclude that a condition is present when it is not. In medicine, this means diagnosing a healthy patient as ill. For example, a cancer screening test may wrongly indicate that a person has cancer even though they are cancer-free. The immediate consequence of such an error is psychological stress, unnecessary follow-up tests, and possibly harmful treatments. Although the individual may eventually be cleared of the disease through further examinations, the burden of false alarms can be significant.&lt;/p&gt;

&lt;p&gt;On the other hand, a Type II error occurs when we fail to detect a condition that actually exists. In medicine, this is equivalent to telling a sick patient that they are healthy. Returning to the cancer screening example, this means the disease goes undetected, allowing it to progress unchecked. The consequences here are often far more severe: the patient misses early treatment opportunities, their condition may worsen, and in some cases, the chance of survival may drastically decrease.&lt;/p&gt;

&lt;p&gt;The trade-off between these two errors is inevitable, because lowering the probability of one often increases the probability of the other. In medical practice, the decision about which error to prioritize depends largely on the nature of the disease and the risks of treatment. For life-threatening illnesses where early detection is critical, such as cancer or HIV, reducing Type II errors becomes more important. Missing a diagnosis in these cases can be fatal, so healthcare systems are usually more willing to accept false positives in exchange for catching as many true cases as possible. In contrast, when treatments are particularly invasive, expensive, or harmful, reducing Type I errors takes precedence. Unnecessarily subjecting a healthy patient to aggressive chemotherapy, for example, may cause more harm than good.&lt;/p&gt;

&lt;p&gt;This balance is often expressed in the concepts of sensitivity and specificity in medical testing. Highly sensitive tests are designed to minimize false negatives, ensuring that nearly all patients with the disease are identified. These are commonly used in initial screenings. Highly specific tests, on the other hand, reduce false positives and are often employed as confirmatory follow-ups to prevent unnecessary treatments. Together, they create a layered testing process that balances both types of errors in a practical way.&lt;/p&gt;
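&lt;p&gt;Both quantities are simple ratios over the four possible outcomes of a test. With hypothetical screening counts (chosen only for illustration):&lt;/p&gt;

```python
# Four possible outcomes of a screening test (hypothetical counts).
tp, fn = 90, 10   # sick patients: detected vs. missed (Type II errors)
tn, fp = 940, 60  # healthy patients: cleared vs. false alarms (Type I errors)

sensitivity = tp / (tp + fn)  # share of sick patients the test catches
specificity = tn / (tn + fp)  # share of healthy patients the test clears

print(sensitivity)  # 0.9
print(specificity)  # 0.94
```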

&lt;p&gt;In conclusion, the tension between Type I and Type II errors cannot be eliminated but must be managed wisely, especially in medicine. While both errors carry consequences, missing a disease is generally seen as the more dangerous outcome, which is why screening programs often prioritize sensitivity over specificity. Ultimately, the trade-off reflects a fundamental reality: decisions in medicine, as in statistics, are about balancing risks. Recognizing and managing these risks carefully ensures that patients receive both accurate diagnoses and appropriate care.&lt;/p&gt;

</description>
      <category>statistics</category>
      <category>type1error</category>
      <category>type2error</category>
    </item>
    <item>
      <title>Relationships in Power BI.</title>
      <dc:creator>sumaya</dc:creator>
      <pubDate>Sun, 17 Aug 2025 09:59:18 +0000</pubDate>
      <link>https://dev.to/sumeya/relationships-in-power-bi-39f1</link>
      <guid>https://dev.to/sumeya/relationships-in-power-bi-39f1</guid>
      <description>&lt;p&gt;In today’s world of data analysis and business intelligence, Power BI has become one of the most powerful tools for transforming raw information into meaningful insights. A key concept that enables Power BI to work effectively is the idea of relationships. Relationships allow different tables in a data model to connect with one another, ensuring that information from multiple sources can be combined and analyzed as one coherent whole. Without relationships, a Power BI report would only display isolated fragments of data, making it difficult to uncover patterns and trends that span across tables.&lt;/p&gt;

&lt;p&gt;A relationship in Power BI is essentially a link between two tables, usually based on a shared column such as an ID or a key. For example, a sales table might contain a column called Customer ID, while a customer table also contains the same Customer ID. By connecting these two columns, the sales transactions can be matched with customer details, allowing a business analyst to answer questions such as, “Which customers contributed the most revenue?” or “How do sales differ across customer regions?” Relationships therefore serve as bridges that unify data from different perspectives.&lt;/p&gt;
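&lt;p&gt;The Customer ID link described above can be mimicked outside Power BI. Here is a small pandas sketch with hypothetical tables, where a merge on the shared column plays the role of the relationship:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical dimension table (customers) and fact table (sales).
customers = pd.DataFrame({
    "CustomerID": [1, 2],
    "Region": ["North", "South"],
})
sales = pd.DataFrame({
    "CustomerID": [1, 1, 2],   # one customer, many purchases (one-to-many)
    "Amount": [100, 250, 80],
})

# The shared CustomerID column acts as the relationship key.
joined = sales.merge(customers, on="CustomerID", how="left")
revenue_by_region = joined.groupby("Region")["Amount"].sum()
```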

&lt;p&gt;There are different types of relationships, and understanding them is essential for building accurate reports. The most common type is the one-to-many relationship, where one record in one table relates to multiple records in another. A practical example is one customer linked to many purchases. Less common but still important are one-to-one relationships, where each row in one table matches exactly one row in another, and many-to-many relationships, which occur when multiple rows in one table relate to multiple rows in another. The latter often requires additional modeling techniques, such as using a bridge table, to avoid confusion and errors.&lt;/p&gt;

&lt;p&gt;Another important aspect of relationships in Power BI is cross-filter direction. This refers to how filters applied in one table affect another table. By default, filters flow from a dimension table (like Customers or Products) to a fact table (like Sales). However, Power BI also allows bidirectional filters, where changes in one table can filter the other in return. While this can sometimes make reporting more flexible, it must be used carefully because it may cause ambiguity or performance issues in complex models.&lt;/p&gt;

&lt;p&gt;The value of relationships is most visible when working with multiple tables in a report. Imagine an organization that stores data in different systems: one table for sales, another for products, and another for customers. On their own, these tables provide limited insights. But when relationships are established, the analyst can create visuals that combine all the information, such as sales by product category, sales by region, or even customer profitability. Relationships make the data model function as a single, connected structure, turning scattered information into a reliable “single source of truth.”&lt;/p&gt;

&lt;p&gt;To build effective relationships, certain best practices should be followed. It is important to ensure that the key columns used for relationships are unique in at least one of the tables, so that the connections remain valid. Data types between the linked columns should match to avoid errors. Where possible, a star schema design—where a central fact table connects to multiple dimension tables—should be used. This design reduces complexity and improves both performance and clarity in the data model.&lt;/p&gt;

</description>
      <category>powerbi</category>
      <category>datascience</category>
      <category>datamodelling</category>
    </item>
    <item>
      <title>How Excel Is Used in Real-World Data Analysis.</title>
      <dc:creator>sumaya</dc:creator>
      <pubDate>Fri, 13 Jun 2025 21:08:14 +0000</pubDate>
      <link>https://dev.to/sumeya/how-excel-is-used-in-real-world-data-analysis-iic</link>
      <guid>https://dev.to/sumeya/how-excel-is-used-in-real-world-data-analysis-iic</guid>
      <description>&lt;p&gt;Excel is spreadsheet tool used for data entry, basic analysis and visualization. Excel is commonly used for quick data summaries , creating charts , and performing basic statistical operations .&lt;br&gt;
Excel also helps businesses analyze historical data to identify trends and forecast future outcomes ,it also used to create budgets, profit and loss statements, and perform variance analysis and it also helps marketers track key performance indicators(KPI).&lt;br&gt;
Since i started learning excel i have been able to learn about its features such as Pivot Table which is used for summarizing and analyzing large data sets for example a sales manager can use Pivot Tables to see total sales by region , or by product . Another feature is Conditional Formatting which is used in highlighting important data visually for example a teacher can use conditional formatting to highlight score difference in different colors. The other feature that i have learnt is VLOOKUP/XLOOKUP which is used to find and retrieve data from a table for example a HR manager can use XLOOKUP to find an employee's department or salary based on their ID number.&lt;br&gt;
As someone who is a beginner in learning excel i could say that my view on excel has profoundly changed . I always saw excel just as mere numbers in boxes but now i see it more as a dynamic tool for analysis and decision-making. Before learning excel i used to think that handling large dataset as overwhelming but now i have learnt to organize and structure data efficiently using tables , sorting and filtering. I have also learnt about excel tools such as Pivot Table and charts, and i can now identify patterns and trends within data. Excel has also taught me that data isn't about just numbers; its tells a story. Through learning excel analysis tools are easier for me to use when i am performing complex analyses such as regression and ANOVA without needing advanced statistical software .&lt;/p&gt;
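&lt;p&gt;The PivotTable idea (for instance, total sales by region) can also be reproduced in pandas. The table below is hypothetical, for illustration only:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical sales records, as they might appear on an Excel sheet.
sales = pd.DataFrame({
    "Region": ["East", "East", "West", "West"],
    "Product": ["A", "B", "A", "B"],
    "Sales": [100, 150, 200, 60],
})

# Total sales by region: what a PivotTable with Region in Rows and
# Sum of Sales in Values would show.
pivot = sales.pivot_table(index="Region", values="Sales", aggfunc="sum")
```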

</description>
    </item>
  </channel>
</rss>
