<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Anisha Bhandare</title>
    <description>The latest articles on DEV Community by Anisha Bhandare (@anisha_bhandare_89bb0637f).</description>
    <link>https://dev.to/anisha_bhandare_89bb0637f</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2498817%2Fd7b10f1b-fa76-48f8-8d62-4169df4cf4a5.png</url>
      <title>DEV Community: Anisha Bhandare</title>
      <link>https://dev.to/anisha_bhandare_89bb0637f</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/anisha_bhandare_89bb0637f"/>
    <language>en</language>
    <item>
      <title>Machine Learning for Newbies</title>
      <dc:creator>Anisha Bhandare</dc:creator>
      <pubDate>Mon, 06 Jan 2025 11:42:51 +0000</pubDate>
      <link>https://dev.to/anisha_bhandare_89bb0637f/machine-learning-for-newbies-2fgp</link>
      <guid>https://dev.to/anisha_bhandare_89bb0637f/machine-learning-for-newbies-2fgp</guid>
      <description>&lt;p&gt;A regular day for Gen-Z begins with cascade of social-media. LinkedIn notifications like “Your latest post got over ‘n’ impressions” or Instagram equivalent of “Z and other 5 people are following Y” are a staple. How do you think these statistics come forward? Obviously the developers at Microsoft or Meta cannot sit and perform statistical analysis for every user ; hence Machine Learning steps up . Be it personalized recommendations of Netflix shows , curated playlists of Spotify ,relevant results and auto-completing queries of search engines likes Google and Bing , dynamic prizing of e-commerce platforms like Amazon and price estimation and ETA predictions of Uber : all of them are the magic of Machine Learning. &lt;/p&gt;

&lt;h2&gt;
  
  
  What is Machine Learning?
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Machine Learning is the practice of teaching computers to imitate the ways humans learn, empowering them to make decisions without human intervention and to improve in accuracy and precision over time. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;While AI is a broad field consisting of a set of technologies that can be used to build machines able to mimic functions primarily associated with humans, ML is the subset of AI that allows machines to extract knowledge from data and learn from it without human intervention. &lt;/p&gt;

&lt;h2&gt;
  
  
  Components of Machine Learning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data is the protagonist in the play of Machine Learning. ML models learn from data, so the performance and accuracy of a model depend largely on the quality, quantity and reliability of the data fed to it. This data enables informed, evidence-backed decisions; tasks like classification and prediction are performed based on it. Ideally, the data should be diverse, complete, accurate and relevant. Data can be obtained from various sources: public datasets (e.g. Kaggle), sensors and IoT devices (cameras, microphones and temperature sensors) providing continuous data, user-generated content (reviews, social media posts), APIs and web scraping. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Features&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Feature engineering is like preparing ingredients before cooking a meal. For a machine learning model, the features are the ingredients it uses to make predictions or decisions. This process involves selecting, cleaning, and enhancing data to optimize the model's performance. Raw data is transformed into meaningful, structured features that the model can effectively process and utilize. &lt;br&gt;
Raw data often comes in a messy format that machine learning models can't directly use. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;In a weather prediction model, you might have temperature readings taken every minute—but what you really need are daily averages.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For text analysis, you might start with full paragraphs but need to break them down into individual words or simple word counts.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Through proper feature engineering, you transform this raw data into something the model can effectively understand and use.&lt;/p&gt;
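&lt;p&gt;As a toy illustration of the weather example above (the readings and the &lt;code&gt;daily_averages&lt;/code&gt; helper are invented for illustration), here is how minute-level temperatures might be collapsed into daily-average features:&lt;/p&gt;

```python
from collections import defaultdict
from statistics import mean

# Raw data: (timestamp, temperature) readings taken every minute.
# A tiny made-up sample; a real feed would have 1,440 readings per day.
readings = [
    ("2025-01-06 09:00", 18.2),
    ("2025-01-06 09:01", 18.4),
    ("2025-01-07 09:00", 21.0),
    ("2025-01-07 09:01", 20.6),
]

def daily_averages(readings):
    """Collapse per-minute readings into one average per calendar day."""
    by_day = defaultdict(list)
    for timestamp, temp in readings:
        day = timestamp.split(" ")[0]   # keep only the date part
        by_day[day].append(temp)
    return {day: round(mean(temps), 2) for day, temps in by_day.items()}

features = daily_averages(readings)
print(features)  # {'2025-01-06': 18.3, '2025-01-07': 20.8}
```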

&lt;p&gt;&lt;strong&gt;3. Model&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Learning Algorithm:&lt;/strong&gt; A learning algorithm enables computers to learn from data to make decisions and predictions without explicit programming for every scenario. It's similar to teaching a child to identify objects through examples rather than memorizing specific rules for each item.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hypothesis Space:&lt;/strong&gt; This represents all possible models or solutions that can be derived from the available data—essentially all the potential ways to structure the data to solve the problem.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How It Works:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Problem&lt;/strong&gt;: First, identify your specific problem (such as predicting house prices).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data&lt;/strong&gt;: Gather relevant data (such as number of rooms, house size, and location).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hypothesis Space&lt;/strong&gt;: Create various potential models or rules to predict the price using different data combinations. This space contains all possible solutions that could solve your problem.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;4. Objective Function&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Loss Function:&lt;/strong&gt; Measures how far off the model's predictions are from actual values, helping guide the learning process (e.g., Mean Squared Error for regression).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimization Algorithm:&lt;/strong&gt; Fine-tunes the model's parameters to reduce errors by minimizing the loss function (e.g., Gradient Descent).&lt;/li&gt;
&lt;/ul&gt;
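&lt;p&gt;A minimal sketch of how these two pieces interact, using plain-Python Gradient Descent on a toy linear model (the data, learning rate and iteration count are invented for illustration):&lt;/p&gt;

```python
# Toy dataset generated by the rule y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

def mse(w, b):
    """Loss function: Mean Squared Error of the model y = w*x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

w, b, lr = 0.0, 0.0, 0.05
for _ in range(2000):
    # Gradients of the MSE with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad_w   # Gradient Descent update: step against the gradient
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # converges toward w = 2, b = 1
```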

&lt;p&gt;&lt;strong&gt;5. Evaluation Metrics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Used to assess model performance, such as accuracy, precision, recall, F1-score, or mean absolute error.&lt;/p&gt;
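&lt;p&gt;For a concrete sense of these metrics, here is a hand computation on a toy set of binary predictions (the labels are invented for illustration):&lt;/p&gt;

```python
# 1 = positive class, 0 = negative class.
actual    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)

accuracy  = (tp + tn) / len(actual)  # overall fraction correct
precision = tp / (tp + fp)           # of predicted positives, how many were right
recall    = tp / (tp + fn)           # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, precision, recall, f1)  # each is 0.75 for this toy example
```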

&lt;h2&gt;
  
  
  Types of Machine Learning
&lt;/h2&gt;

&lt;p&gt;To understand the types of machine learning, let's first brush up on the types of datasets:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Labelled Data:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Labelled data is data that has been assigned a label or category indicating its correct classification, usually by human annotators. Using these labels, the model learns from labelled examples to make predictions on new, unseen data. &lt;/p&gt;

&lt;p&gt;Example: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A dataset of images with labels indicating whether each image contains a cat or a dog.&lt;/li&gt;
&lt;li&gt;An email dataset labelled as spam or not spam.&lt;/li&gt;
&lt;li&gt;A dataset of customer reviews labelled with sentiment (positive, negative, neutral).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Labeled data helps train models for classification, regression, and object detection tasks by predicting specific values for each data point. Though valuable, obtaining labeled data is expensive and time-consuming since it requires human annotators to manually assign labels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unlabeled Data:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As the name suggests, unlabeled data has no category associated with it, so the true classification of a data point remains unknown. The model must learn from the inherent structure of the data to uncover patterns or anomalies.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A dataset of customer transactions without any labels indicating fraudulent or non-fraudulent transactions.&lt;/li&gt;
&lt;li&gt;A collection of text documents without any labels indicating the topic or category of each document.&lt;/li&gt;
&lt;li&gt;An image dataset without any labels indicating the content or objects in each image.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Supervised Learning&lt;/strong&gt;&lt;br&gt;
The defining feature of Supervised Learning is its use of labelled data, which acts as a supervisor for the algorithm in classification or prediction tasks. With labelled data, the model's predictions and accuracy can be easily measured. &lt;/p&gt;

&lt;p&gt;Supervised learning addresses two main types of problems in data mining: classification and regression.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Classification&lt;/strong&gt; involves algorithms that sort data into specific categories—like distinguishing apples from oranges. In practical applications, these algorithms can filter spam emails from legitimate ones. Common classification algorithms include linear classifiers, support vector machines, decision trees, and random forests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regression&lt;/strong&gt; algorithms analyze relationships between dependent and independent variables. They excel at predicting numerical values—for instance, forecasting a company's sales revenue. Common approaches include linear regression, logistic regression, and polynomial regression.&lt;/li&gt;
&lt;/ul&gt;
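&lt;p&gt;To make the classification idea concrete, here is a minimal nearest-centroid classifier sketched in plain Python. The fruit features (weight in grams, diameter in cm) are invented for illustration; real projects would typically reach for a library such as scikit-learn.&lt;/p&gt;

```python
from math import dist

# Labelled training examples: (feature vector, label).
training = [
    ((150.0, 7.0), "apple"),
    ((170.0, 7.5), "apple"),
    ((130.0, 6.0), "orange"),
    ((140.0, 6.2), "orange"),
]

def fit(examples):
    """Learn one centroid (mean feature vector) per class label."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(v / counts[label] for v in acc)
            for label, acc in sums.items()}

def predict(centroids, features):
    """Assign the class whose centroid is closest (Euclidean distance)."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

centroids = fit(training)
print(predict(centroids, (160.0, 7.2)))  # classified as "apple"
```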

&lt;p&gt;Obtaining labelled data for Supervised Learning can be expensive and time-consuming, as it requires human annotators to assign labels to each data point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unsupervised Learning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Unsupervised Learning uses algorithms to analyze and cluster unlabeled datasets, discovering hidden patterns without human intervention (hence the term "unsupervised").&lt;br&gt;
Unsupervised learning models perform three main tasks: clustering, association, and dimensionality reduction:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Clustering&lt;/strong&gt; groups unlabeled data points based on their similarities or differences. A common example is K-means clustering, which organizes similar data points into groups, with K determining the number of clusters. This technique proves valuable for applications like market segmentation and image compression.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Association&lt;/strong&gt; discovers relationships between variables in a dataset using pattern-finding rules. This approach powers features like "Customers Who Bought This Item Also Bought" recommendations and market basket analysis, helping identify products frequently purchased together.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dimensionality reduction&lt;/strong&gt; simplifies complex datasets by reducing the number of features while maintaining essential information. This technique is particularly useful during data preprocessing—for instance, when autoencoders clean up visual data by removing noise to enhance image quality.&lt;/li&gt;
&lt;/ul&gt;
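&lt;p&gt;The K-means idea described above can be sketched in a few lines of plain Python. The 2-D points and the naive initialization are invented for illustration; library implementations (e.g. scikit-learn's KMeans) add smarter initialization and convergence checks.&lt;/p&gt;

```python
from math import dist
from statistics import mean

points = [(1.0, 1.0), (1.5, 2.0), (1.2, 0.8),   # one natural group
          (8.0, 8.0), (8.5, 9.0), (9.0, 8.2)]   # another

centroids = [points[0], points[3]]  # K = 2; naively seed with two points

for _ in range(10):  # a few iterations suffice for this toy data
    # Assignment step: each point joins its nearest centroid's cluster.
    clusters = [[], []]
    for p in points:
        nearest = min(range(2), key=lambda k: dist(p, centroids[k]))
        clusters[nearest].append(p)
    # Update step: move each centroid to the mean of its cluster.
    centroids = [(mean(x for x, _ in c), mean(y for _, y in c))
                 for c in clusters]

print(centroids)  # two centroids, near (1.23, 1.27) and (8.5, 8.4)
```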

&lt;p&gt;&lt;strong&gt;Difference between the two&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The primary difference between supervised and unsupervised learning lies in the use of labeled datasets. Simply put, supervised learning relies on labeled input and output data, while unsupervised learning works without labeled data.&lt;/p&gt;

&lt;p&gt;In supervised learning, the algorithm learns by iteratively analyzing the training dataset, making predictions, and adjusting its outputs to match the correct answers. While these models are typically more accurate than unsupervised ones, they require significant human effort upfront to label the data properly. For instance, a supervised learning model can predict your commute time based on factors like the time of day and weather conditions. However, it first needs to be trained to recognize that rainy weather increases travel time.&lt;/p&gt;

&lt;p&gt;Unsupervised learning, on the other hand, independently identifies patterns or structures in unlabeled data. While these models don’t require labeled inputs, they still need human intervention to validate their findings. For example, an unsupervised learning model might detect that online shoppers frequently buy certain products together. A data analyst would then confirm if it makes sense for a recommendation system to group baby clothes with items like diapers, medicines, and ketchup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choosing between the two&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Evaluate your input data:&lt;/strong&gt; Is it labeled or unlabeled data? Do you have experts that can support extra labeling?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Define your goals:&lt;/strong&gt; Do you have a recurring, well-defined problem to solve? Or will the algorithm need to predict new problems?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review your options for algorithms:&lt;/strong&gt; Are there algorithms with the same dimensionality that you need (number of features, attributes, or characteristics)? Can they support your data volume and structure?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Best of Both Worlds&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Struggling to choose between supervised and unsupervised learning? Semi-supervised learning offers a perfect balance by combining both labeled and unlabeled data in the training process. It’s especially helpful when extracting relevant features is challenging or when dealing with large datasets.&lt;/p&gt;

&lt;p&gt;This approach works wonders in fields like medical imaging, where even a small amount of labeled data can dramatically boost accuracy. For instance, a radiologist could label a handful of CT scans to identify tumors or diseases, enabling the machine to more reliably predict which patients might need closer medical attention.&lt;/p&gt;
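&lt;p&gt;One simple flavour of semi-supervised learning, self-training, can be sketched as follows. All the points and the confidence threshold are invented for illustration: a 1-nearest-neighbour rule learned from the scarce labelled data pseudo-labels the unlabelled points it is confident about.&lt;/p&gt;

```python
from math import dist

labelled = [((1.0, 1.0), "A"), ((9.0, 9.0), "B")]   # the scarce labelled data
unlabelled = [(1.2, 0.9), (8.8, 9.1), (5.0, 5.0)]   # plentiful, no labels

CONFIDENT = 2.0   # assumed distance threshold for a "confident" pseudo-label

for point in list(unlabelled):
    nearest_features, nearest_label = min(
        labelled, key=lambda item: dist(item[0], point))
    if not dist(nearest_features, point) > CONFIDENT:
        # Confident: adopt the pseudo-label and grow the labelled set.
        labelled.append((point, nearest_label))
        unlabelled.remove(point)

print(len(labelled), unlabelled)  # the ambiguous midpoint stays unlabelled
```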

</description>
      <category>machinelearning</category>
      <category>newbie</category>
    </item>
    <item>
      <title>Comparing Technologies for Object Identification and Classification: A Review of AI Approaches</title>
      <dc:creator>Anisha Bhandare</dc:creator>
      <pubDate>Fri, 27 Dec 2024 10:03:26 +0000</pubDate>
      <link>https://dev.to/anisha_bhandare_89bb0637f/comparing-technologies-for-object-identification-and-classification-a-review-of-ai-approaches-32ee</link>
      <guid>https://dev.to/anisha_bhandare_89bb0637f/comparing-technologies-for-object-identification-and-classification-a-review-of-ai-approaches-32ee</guid>
      <description>&lt;p&gt;&lt;strong&gt;Abstract&lt;/strong&gt;&lt;br&gt;
Object identification and classification are extremely important across industries, whether healthcare, manufacturing, or environmental science. Traditional methods are slow and require specialized expertise. With the emergence of AI, ML, and DL, however, object identification is becoming faster and much more accurate. This review surveys current technology for object identification, considering CNNs, ensemble learning, and other DL models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;br&gt;
Object identification has valuable applications in quality control, environmental monitoring, and medical diagnostics. Manual identification is time-consuming and prone to errors; AI has transformed the process by automating object recognition through image processing and deep learning. This paper compares the various technologies in terms of their methods, accuracy, and practical applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Machine Learning Approaches&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;ML has seen vast application in object detection and classification. These methodologies begin by extracting image features such as colors, textures, and shapes, then apply classification algorithms such as Support Vector Machines and Random Forests, achieving accuracy levels of 85 to 93 percent in most instances. Although ML models can produce sound outcomes, they rely heavily on hand-crafted feature extraction, which limits their ability to scale across different datasets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DL Approaches&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;DL models, particularly CNNs, are revolutionizing object identification. They eliminate the need for manual feature extraction and improve accuracy. Since CNNs can handle very large datasets and high-resolution images, they are well suited to even the most complex object recognition tasks.&lt;br&gt;
Deep CNNs, usually combined with global average pooling (GAP), have achieved accuracies of over 99%. Such models are particularly good for applications where high accuracy is critical, such as medical imaging and defect detection. However, the price is that they consume much more computational resources and require larger datasets to train.&lt;/p&gt;

&lt;p&gt;MobileNet is a lightweight CNN that balances performance and efficiency. It is designed to perform object identification in real-time and has achieved 98.3% accuracy, thus aligning well with mobile and cloud platforms. It is quite suitable for edge devices and on-the-go applications but needs regular internet access for cloud processing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ensemble Learning Techniques&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Ensemble learning combines multiple models to improve overall performance. It raises accuracy to about 97-99%, is less prone to overfitting, and compensates for the weaknesses of individual models. Because several different models are integrated, ensembles perform robustly across different datasets. Deploying and managing an ensemble model, however, is challenging, requiring more coordination and computing power.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Applications in Object Detection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;DL is not only used for object identification; it is also widely used for defect detection, medical imaging, and surveillance. CNNs and Deep Belief Networks (DBNs) have found great application in recognizing and classifying objects in different environments. Explainable AI (XAI) techniques are also being used to understand how such models make decisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Comparative Analysis&lt;/strong&gt;&lt;br&gt;
Traditional ML approaches such as SVMs and RFs typically yield 85-93% accuracy and work well for relatively simple tasks, but they demand labor-intensive feature engineering, which confines their applications. CNN-based DL approaches push accuracy to 95-99%; with fully automatic feature extraction they can handle large, diverse datasets, making them highly versatile but also computationally expensive. Ensemble learning offers a middle ground, combining several models for accuracies between 97 and 99%, though it introduces model-management challenges at deployment. For real-time use cases, MobileNet delivers very good performance at 98.3% accuracy; while lightweight enough for mobile and cloud-based systems, its dependence on internet access can be problematic in some settings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
DL, especially CNNs, has taken object identification and detection to a whole new level of precision and efficiency compared with traditional ML. Ensemble learning improves reliability, but issues such as data availability and computing power remain. Future research directions include lightweight DL models, larger datasets, and enhanced interpretability through explainable AI.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>cnn</category>
      <category>deeplearning</category>
      <category>xai</category>
    </item>
  </channel>
</rss>
