<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Philemon Kiplangat</title>
    <description>The latest articles on DEV Community by Philemon Kiplangat (@philemonkiplangat).</description>
    <link>https://dev.to/philemonkiplangat</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1173893%2Fdcb8dcba-7475-40cc-a23f-7d8966cabe10.png</url>
      <title>DEV Community: Philemon Kiplangat</title>
      <link>https://dev.to/philemonkiplangat</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/philemonkiplangat"/>
    <language>en</language>
    <item>
      <title>Unsupervised machine learning.</title>
      <dc:creator>Philemon Kiplangat</dc:creator>
      <pubDate>Fri, 22 Mar 2024 06:58:05 +0000</pubDate>
      <link>https://dev.to/philemonkiplangat/unsupervised-machine-learning-mp0</link>
      <guid>https://dev.to/philemonkiplangat/unsupervised-machine-learning-mp0</guid>
      <description>&lt;p&gt;Unsupervised learning is a branch of artificial intelligence and a type of machine learning that learns from unlabeled data without human supervision and allows for insights about the data and patterns without any explicit guidance or instruction. &lt;br&gt;
&lt;strong&gt;Main points to note&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Unsupervised learning allows the model to learn by discovering patterns and relationships in unlabeled data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Clustering algorithms group similar data points together based on their inherent characteristics.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Feature extraction captures essential information.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Label association assigns categories to the clusters based on the extracted patterns and characteristics.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A good way to illustrate this is to imagine a machine learning model trained on a large, unlabeled dataset of patients with different diseases. Your task is to use unsupervised learning to identify the different types of diabetes patients, be it type 1, type 2, or gestational diabetes, in the unseen data. The machine has no labels for the patients, so it cannot assign a specific diabetes type to anyone directly, but it can group the patients according to their similarities, patterns, and differences. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Types of Unsupervised Learning&lt;/strong&gt;&lt;br&gt;
Unsupervised learning is classified into two categories of algorithms: &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clustering&lt;/strong&gt;: A clustering problem is where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior.&lt;br&gt;
Clustering is a type of unsupervised learning used to group similar data points together. Clustering algorithms work by iteratively assigning points to clusters and updating each cluster's center, so that points end up close to their own cluster center and far from points in other clusters.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Exclusive (partitioning)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Agglomerative&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Overlapping&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Probabilistic&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Common clustering and dimensionality-reduction algorithms include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Hierarchical clustering&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;K-means clustering&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Principal Component Analysis (PCA)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Singular Value Decomposition (SVD)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Independent Component Analysis (ICA)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Gaussian Mixture Models (GMMs)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Density-Based Spatial Clustering of Applications with Noise (DBSCAN)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Association&lt;/strong&gt;: An association rule learning problem is where you want to discover rules that describe large portions of your data, such as "people who buy X also tend to buy Y".&lt;br&gt;
Association rule mining estimates the probability of items co-occurring in a collection; a classic example is finding out which products are frequently purchased together.&lt;/p&gt;
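To make the clustering idea concrete, here is a minimal pure-Python sketch of k-means on one-dimensional data (illustrative only, with made-up numbers; real projects would typically reach for scikit-learn's KMeans):

```python
# Minimal 1-D k-means sketch: assign each point to its nearest center,
# then recompute each center as the mean of the points assigned to it.
def kmeans_1d(points, centers, iterations=10):
    for _ in range(iterations):
        clusters = {i: [] for i in range(len(centers))}
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Empty clusters keep their previous center.
        centers = [sum(v) / len(v) if v else centers[i] for i, v in clusters.items()]
    return centers, clusters

points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]   # two obvious groups
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
print(centers)  # one center near 1.0, one near 9.5
```

The points separate into two groups purely from their similarity (distance), with no labels involved, which is exactly the unsupervised setting described above.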

&lt;p&gt;In conclusion, because unsupervised machine learning has no ground-truth labels, it lacks the direct feedback mechanism of supervised learning; model quality is instead assessed with internal measures such as silhouette score or within-cluster variance.&lt;/p&gt;
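The co-occurrence counting behind the association rules described earlier can be sketched in a few lines of Python (a toy illustration with invented shopping baskets, not a full Apriori implementation):

```python
from collections import Counter
from itertools import combinations

# Count how often each pair of items appears together in a basket.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support of a pair = fraction of baskets containing both items.
support = {pair: n / len(baskets) for pair, n in pair_counts.items()}
print(pair_counts)  # pairs with high counts are candidates for rules
```

Pairs with high support, such as bread and milk here, become candidates for rules like "people who buy bread also tend to buy milk".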

</description>
    </item>
    <item>
      <title>The Complete Guide to Time Series Models</title>
      <dc:creator>Philemon Kiplangat</dc:creator>
      <pubDate>Mon, 23 Oct 2023 13:34:55 +0000</pubDate>
      <link>https://dev.to/philemonkiplangat/the-complete-guide-to-time-series-models-3jeo</link>
      <guid>https://dev.to/philemonkiplangat/the-complete-guide-to-time-series-models-3jeo</guid>
      <description>&lt;p&gt;&lt;strong&gt;Time series&lt;/strong&gt;: this refers to the sequence of data points that occur in successive order over a period of time. A time series in investing records the movement of selected data points, such as the price of a security, over a set period of time, with data points recorded at regular intervals. There is no minimum or maximum time requirement, allowing the data to be obtained in a fashion that gives the information desired by the investor or analyst reviewing the activity.&lt;br&gt;
The main importance of the time series is for predictive analysis.&lt;br&gt;
Time series analysis can also be defined as a method of analyzing a collection of data points over a period of time. Instead of recording data points intermittently or randomly, time series analysts record data points at consistent intervals over a defined period of time.&lt;br&gt;
While time-series data is information gathered over time, various types of information describe how and when that information was gathered. For example:&lt;/p&gt;

&lt;p&gt;Time series data is a collection of observations on the values that a variable takes at various points in time.&lt;/p&gt;

&lt;p&gt;Cross-sectional data: data from one or more variables that were collected simultaneously.&lt;/p&gt;

&lt;p&gt;Pooled data: It is a combination of cross-sectional and time-series data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The importance of time series analysis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Time series analysis has a wide range of importance in different fields, such as sales, economics, and many more. However, the common point is the technique used to model the data over a period of time. The reasons for time series analysis are as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Features: Time series analysis can be used to track features like trend, seasonality, and variability.&lt;/li&gt;
&lt;li&gt;Prediction: Time series analysis can be used for prediction by studying the patterns in the available data.&lt;/li&gt;
&lt;li&gt;Inferences: You can predict values and draw inferences from the data using time series analysis.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Time Series Analysis Types&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Classification: It identifies categories in the data.&lt;/li&gt;
&lt;li&gt;Curve fitting: It plots the data on a curve to identify relationships between variables in the data.&lt;/li&gt;
&lt;li&gt;Descriptive analysis: It identifies patterns in the time-series data, such as trends, cycles, and seasonal variations.&lt;/li&gt;
&lt;li&gt;Explanative analysis: It attempts to understand the data and the cause-and-effect relationships within it.&lt;/li&gt;
&lt;li&gt;Segmentation: It splits the data into segments to reveal the source data's underlying properties.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Time series analysis examples&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Non-stationary data is defined as data that fluctuates over time or is affected by time, and it is evaluated using time series analysis. Because currency and sales fluctuate, industries such as finance, retail, and e-commerce regularly employ time series analysis. Stock market analysis is a great illustration of time series analysis in action, especially when combined with automated trading algorithms.&lt;/p&gt;

&lt;p&gt;Time series analysis can be used in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rainfall measurements&lt;/li&gt;
&lt;li&gt;Automated stock trading&lt;/li&gt;
&lt;li&gt;Industry forecasts&lt;/li&gt;
&lt;li&gt;Temperature readings&lt;/li&gt;
&lt;li&gt;Sales forecasting&lt;/li&gt;
&lt;/ul&gt;
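As a toy illustration of predicting from past observations (the sales figures below are invented; real work would typically use libraries such as pandas or statsmodels), a simple moving-average forecast looks like this:

```python
# Simple moving average: forecast the next value as the mean of the
# last `window` observations in the series.
def moving_average_forecast(series, window=3):
    recent = series[-window:]
    return sum(recent) / len(recent)

monthly_sales = [100, 102, 101, 105, 107, 110]  # hypothetical data
forecast = moving_average_forecast(monthly_sales, window=3)
print(forecast)  # mean of the last three observations
```

Smoothing over a window like this dampens short-term variability, which is why moving averages are a common first baseline in sales forecasting.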

</description>
    </item>
    <item>
      <title>Data Modeling.</title>
      <dc:creator>Philemon Kiplangat</dc:creator>
      <pubDate>Mon, 23 Oct 2023 08:34:48 +0000</pubDate>
      <link>https://dev.to/philemonkiplangat/data-modeling-2iio</link>
      <guid>https://dev.to/philemonkiplangat/data-modeling-2iio</guid>
      <description>&lt;p&gt;Over the years, many businesses have been cautious about decision making processes that affect them. This is important since decisions made by a business determines its success. The part of the decision is forecasting which can be made possible by studying the growth of the business. The oil for decision making has been data. It is through it where one can obtain insights about the business and the growth pattern. Decision making process is greatly characterized by the data modelling process.   &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Modeling&lt;/strong&gt; refers to the process of analyzing and defining all the data types your business collects and produces, as well as the relationships between those bits of data. This can be achieved using different tools in the tech field. A model can be represented using text, symbols, and diagrams, since it describes how the data is captured, stored, and used.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Modeling Process
&lt;/h2&gt;

&lt;p&gt;This refers to the process of creating a conceptual representation of data objects and their relationships to one another. Data modeling typically involves a defined sequence of steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Requirements gathering.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Conceptual design.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Logical design.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Implementation.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;During each step, data modelers work with stakeholders to understand the data requirements, define the entities and attributes, establish the relationships between the data objects, and create a model that accurately represents the data in a way the stakeholders can use.&lt;/p&gt;

&lt;h3&gt;
  
  
  Levels of abstraction.
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Conceptual level: Work with stakeholders to understand the data requirements, identify the entities and attributes, establish the links between data objects, and develop a high-level model that accurately represents the data in a usable format. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Logical level: Define the relationships and constraints between the data objects in more detail, often using data modeling notations such as ER diagrams or SQL.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Physical level: Define the specific details of how the data will be stored, including data types, indexes, and other technical details.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Data Modeling Examples
&lt;/h2&gt;

&lt;p&gt;The best way to picture a data model is to think of an architect's building plan. Just as an architectural plan guides all subsequent construction, a data model guides all the models built on top of it. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Entity Relationship Model&lt;/strong&gt;:This model is based on the notion of real-world entities and relationships among them. It creates an entity set, relationship set, general attributes, and constraints.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hierarchical Model&lt;/strong&gt;:This data model arranges the data in the form of a tree with one root, to which other data is connected. The hierarchy begins with the root and extends like a tree. This model effectively explains several real-time relationships with a single one-to-many relationship between two different kinds of data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Network Model&lt;/strong&gt;:This database model enables many-to-many relationships among the connected nodes. The data is arranged in a graph-like structure, and here ‘child’ nodes can have multiple ‘parent’ nodes. The parent nodes are known as owners, and the child nodes are called members.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Relational Model&lt;/strong&gt;:This popular data model example arranges the data into tables. The tables have columns and rows, each cataloging an attribute present in the entity. It makes relationships between data points easy to identify. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Object-Relational Model&lt;/strong&gt;:This model is a hybrid of an object-oriented database and a relational database. As a result, it combines the extensive functionality of the object-oriented paradigm with the simplicity of the relational data model.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
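The relational model above can be made concrete with Python's built-in sqlite3 module. The tables and rows here (customers and orders) are invented for illustration; they show a classic one-to-many relationship:

```python
import sqlite3

# Two related tables: each order row references a customer row,
# the one-to-many relationship typical of the relational model.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, "
    "item TEXT, FOREIGN KEY (customer_id) REFERENCES customers (id))"
)
conn.execute("INSERT INTO customers VALUES (1, 'Amina')")
conn.execute("INSERT INTO orders VALUES (1, 1, 'laptop'), (2, 1, 'mouse')")

# A join makes the relationship between the data points easy to see.
rows = conn.execute(
    "SELECT customers.name, orders.item FROM orders "
    "JOIN customers ON orders.customer_id = customers.id "
    "ORDER BY orders.id"
).fetchall()
print(rows)  # [('Amina', 'laptop'), ('Amina', 'mouse')]
```

The FOREIGN KEY constraint is the logical-level relationship made explicit, while the column types (INTEGER, TEXT) are physical-level details.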

&lt;h2&gt;
  
  
  Benefits Of Data Modeling.
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Shared understanding: Allows developers and stakeholders to understand the relationships between different objects, making analysis easier.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Improved data quality: Data modeling can help to identify errors and inconsistencies in the data, which can improve the overall quality of the data and prevent problems later on.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Improved collaboration: Data modeling helps to facilitate communication and collaboration among stakeholders, which can lead to more effective decision-making and better outcomes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Increased efficiency: Data modeling can help to streamline the development process by providing a clear and consistent representation of the data that can be used by developers, database administrators, and other stakeholders.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Exploratory Data Analysis</title>
      <dc:creator>Philemon Kiplangat</dc:creator>
      <pubDate>Sun, 08 Oct 2023 16:20:14 +0000</pubDate>
      <link>https://dev.to/philemonkiplangat/expolaratory-data-analysis-166j</link>
      <guid>https://dev.to/philemonkiplangat/expolaratory-data-analysis-166j</guid>
      <description>&lt;p&gt;Over many years, civilization has changed gradually, from the agrarian revolution of the early years to today's age, in which information is paramount; hence the phrase "knowledge is power". In the digital age, data is the new gold, but its true value rests in the insights it gives. However, raw data is frequently complicated and difficult to comprehend. This is where Exploratory Data Analysis (EDA) comes in.&lt;/p&gt;

&lt;p&gt;Therefore, &lt;strong&gt;Exploratory Data Analysis&lt;/strong&gt; can be defined as the process of examining data to understand the insights and patterns within it, and conveying that information through visualization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Importance of Exploratory Data Analysis&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Exploratory data analysis can help detect obvious errors, identify outliers in datasets, understand relationships, unearth important factors, find patterns within data, and provide new insights. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;EDA is the initial step in data analysis. Its primary goal is to summarize the main characteristics of a dataset, often employing statistical and graphical techniques. By understanding the structure and patterns within the data, researchers and analysts can make better decisions, identify relationships, and even formulate hypotheses for further analysis.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Exploratory data analysis techniques.
&lt;/h2&gt;

&lt;p&gt;Specific statistical functions and techniques you can perform with EDA tools include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Clustering and dimension reduction techniques, which help create graphical displays of high-dimensional data containing many variables.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Univariate visualization of each field in the raw dataset, with summary statistics.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Bivariate visualizations and summary statistics that allow you to assess the relationship between each variable in the dataset and the target variable you’re looking at.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multivariate visualizations, for mapping and understanding interactions between different fields in the data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;K-means Clustering is a clustering method in unsupervised learning where data points are assigned into K groups, i.e. the number of clusters, based on the distance from each group’s centroid. The data points closest to a particular centroid will be clustered under the same category. K-means Clustering is commonly used in market segmentation, pattern recognition, and image compression.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
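The univariate summary statistics mentioned above can be computed with Python's standard statistics module (the ages below are invented example data):

```python
import statistics

# Univariate summary of a single field: central tendency and spread.
ages = [23, 25, 25, 29, 31, 35, 35, 35, 40, 62]  # hypothetical column
summary = {
    "count": len(ages),
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "mode": statistics.mode(ages),
    "stdev": statistics.stdev(ages),
}
print(summary)
```

Comparing the mean and median is already informative: when the mean sits above the median, as here, the field is right-skewed (the value 62 pulls the mean up), which is exactly the kind of observation EDA is meant to surface.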

&lt;h2&gt;
  
  
  Exploratory Data Analysis Tools
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Python&lt;/strong&gt;-Python is an object-oriented, interpreted programming language with dynamic semantics. Its high-level, built-in data structures, together with dynamic typing and dynamic binding, make it particularly appealing for rapid application development as well as use as a scripting or glue language to connect existing components. Python and EDA can be used in tandem to find missing values in a data set, which is useful for determining how to handle missing values in machine learning.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;R&lt;/strong&gt;-The R Foundation for Statistical Computing supports an open-source programming language and free software environment for statistical computing and graphics. The R programming language is commonly used by statisticians in data science to create statistical observations and data analyses.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
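The missing-value check described for Python can be sketched without any third-party libraries (with pandas you would typically reach for df.isnull().sum(); the records below are invented):

```python
# Count missing (None) values per field across a list of records.
records = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 41, "income": None},
]

missing = {}
for record in records:
    for field, value in record.items():
        if value is None:
            missing[field] = missing.get(field, 0) + 1

print(missing)  # {'age': 1, 'income': 1}
```

Knowing how many values each field is missing is what determines the handling strategy later, e.g. dropping rows versus imputing values before machine learning.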

&lt;h2&gt;
  
  
  The Role of Data Visualization in EDA.
&lt;/h2&gt;

&lt;p&gt;As is often said, a picture is worth more than a thousand words, and data visualization is a powerful tool within the realm of EDA. Instead of drowning in a sea of numbers, visual representations provide a clear, concise, and intuitive way to grasp complex concepts. Here’s how data visualization enhances EDA:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Spotting Patterns and Trends&lt;/strong&gt;: Charts, graphs, and plots are excellent at showcasing trends over time or across different variables. For instance, line charts can demonstrate stock price fluctuations, while bar charts can compare sales figures among different products.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Identifying Outliers&lt;/strong&gt;:Outliers, or data points that depart dramatically from the norm, are easily identified in visualizations. Box plots and scatter plots are frequently used to detect these anomalies, which could be errors or interesting occurrences worth investigating.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Understanding Distributions&lt;/strong&gt;:Histograms and density charts aid in visualizing data distribution. Understanding the distribution shape (normal, skewed, bimodal) provides critical insights regarding the dataset's underlying nature.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Exploring Relationships&lt;/strong&gt;:Scatter plots are invaluable for displaying relationships between two variables. Positive, negative, or no correlation between variables becomes apparent, aiding in decision-making processes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Geospatial Analysis&lt;/strong&gt;:Maps and geospatial visualizations aid in the comprehension of data in a geographical context. This is particularly helpful for businesses, epidemiologists, and sociologists researching regional trends.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Common Data Visualization Techniques&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bar and pie charts&lt;/strong&gt; are excellent for illustrating categorical data or proportions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Line charts&lt;/strong&gt; are ideal for displaying trends over a continuous interval of time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scatter plots&lt;/strong&gt; are excellent for displaying relationships between two variables.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Histograms&lt;/strong&gt; can help you comprehend the distribution and frequency of numerical data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Box plots&lt;/strong&gt; are great for finding outliers and analyzing data dispersion.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Heat Maps&lt;/strong&gt;: These are useful for expressing data in matrix format, and they are frequently used in correlation matrices and geographic research.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
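As a quick, dependency-free illustration of what a histogram does (real analyses would typically use Matplotlib's plt.hist; the measurements below are made up), the binning behind one can be done by hand:

```python
# Bin numerical values into equal-width intervals and draw text bars.
values = [2, 3, 3, 4, 7, 8, 8, 8, 9, 12, 13, 18]  # made-up measurements
bin_width = 5
bins = {}
for v in values:
    start = (v // bin_width) * bin_width  # left edge of the bin
    bins[start] = bins.get(start, 0) + 1

for start in sorted(bins):
    label = f"{start:2d}-{start + bin_width - 1:2d}"
    print(label, "#" * bins[start])
```

Even this crude text chart reveals the distribution shape: most values fall in the 5-9 bin, with a thin right tail, the kind of skew a histogram is designed to expose.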

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;Data exploratory analysis is a crucial step in extracting meaningful insights from raw data. Data visualization serves as a bridge between raw numbers and understandable patterns. By leveraging various visualization techniques, analysts and researchers can unlock the potential of their datasets, enabling better decision-making, problem-solving, and innovation. In a world inundated with data, the ability to harness the power of EDA and data visualization is a skill that can set individuals and organizations apart, propelling them towards success in various fields.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Data Science for Beginners:</title>
      <dc:creator>Philemon Kiplangat</dc:creator>
      <pubDate>Sat, 30 Sep 2023 15:56:54 +0000</pubDate>
      <link>https://dev.to/philemonkiplangat/data-science-for-beginners-181p</link>
      <guid>https://dev.to/philemonkiplangat/data-science-for-beginners-181p</guid>
      <description>&lt;p&gt;What is Data Science?&lt;br&gt;
&lt;strong&gt;Data science&lt;/strong&gt; can be defined in many ways, especially given that data has been called the new oil of business. Simply put, data science is the study of data in order to derive valuable business insights. It is a multidisciplinary approach to data analysis that integrates principles and practices from mathematics, statistics, artificial intelligence, and computer engineering.&lt;br&gt;
There are different paths associated with data, namely data analysis, data engineering, and analytical engineering. Each is practiced differently, since each involves different technologies.&lt;br&gt;
&lt;strong&gt;Data Engineering&lt;/strong&gt; can be defined as the practice of designing and building systems for collecting, analyzing, and storing data at scale; the practitioner of the discipline is known as a &lt;strong&gt;data engineer&lt;/strong&gt;.&lt;br&gt;
&lt;strong&gt;Data Analysis&lt;/strong&gt; is a discipline that involves gathering raw data and transforming it into information that users may utilize to make decisions. The practitioner of this discipline is known as a &lt;strong&gt;data analyst&lt;/strong&gt;, whose work mainly entails reviewing data to identify key insights into a business's customers and ways the data can be used to solve problems. &lt;br&gt;
&lt;strong&gt;Analytical engineering&lt;/strong&gt; is a discipline that involves optimizing data models. Its practitioner is referred to as an &lt;strong&gt;analytical engineer&lt;/strong&gt;, whose task mainly involves modeling data to provide clean, accurate datasets so that different users within the company can work with them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Science for Beginners: 2023 - 2024 Complete Road-map.
&lt;/h3&gt;

&lt;p&gt;Data science has emerged as a dominant practice and a lively topic of debate. The key causes of this are the massive amounts of data generated and the importance of data analysis in driving decision-making. Data science employs a variety of approaches for data collection, preparation, storage, analytics, and interpretation. The advancement of tools, methods, computational speeds, and automation has further cemented data science's dominance.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Programming&lt;/strong&gt;-Programming is the process of writing the code that makes up programs. Data science relies on a few specific languages: Python, R, Java, and SQL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Math Fundamentals&lt;/strong&gt;-There are basic math fundamentals one is required to know for data science: statistics, linear algebra, differential calculus, and discrete math.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Analysis&lt;/strong&gt;-The main concepts in data analysis are feature engineering, data wrangling, and exploratory data analysis.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Machine Learning&lt;/strong&gt;-One should have a deep understanding of the following: classification, regression, reinforcement learning, deep learning, dimensionality reduction, and clustering.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Web Scraping&lt;/strong&gt;-Beautiful Soup, Scrapy, and urllib.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visualization&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;PowerBI&lt;/li&gt;
&lt;li&gt;Excel&lt;/li&gt;
&lt;li&gt;Matplotlib&lt;/li&gt;
&lt;li&gt;Tableau&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment&lt;/strong&gt;-This can be done using various cloud technologies such as Azure or AWS.&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
  </channel>
</rss>
