<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Peter Wainaina </title>
    <description>The latest articles on DEV Community by Peter Wainaina  (@wainainapeter).</description>
    <link>https://dev.to/wainainapeter</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F996046%2F575f2eb6-01b8-480f-b912-36f78c3ccbb7.jpeg</url>
      <title>DEV Community: Peter Wainaina </title>
      <link>https://dev.to/wainainapeter</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/wainainapeter"/>
    <language>en</language>
    <item>
      <title>Data Science for Beginners: 2023–2024 Complete Roadmap.</title>
      <dc:creator>Peter Wainaina </dc:creator>
      <pubDate>Sun, 19 Nov 2023 17:53:59 +0000</pubDate>
      <link>https://dev.to/wainainapeter/data-science-for-beginners-2023-2024-complete-roadmap-2h7i</link>
      <guid>https://dev.to/wainainapeter/data-science-for-beginners-2023-2024-complete-roadmap-2h7i</guid>
      <description>&lt;h2&gt;
  
  
  Data Science for Beginners: 2023–2024 Complete Roadmap.
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--B8ODJtFf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4096/1%2AFrfgJTUCj9IcH9pKz-JMxg.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--B8ODJtFf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4096/1%2AFrfgJTUCj9IcH9pKz-JMxg.jpeg" alt="cover picture showing galaxy" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Data Science?
&lt;/h2&gt;

&lt;p&gt;In very simple terms, Data Science is the study of data with the intention of extracting meaningful insights and using those insights to make data-informed decisions, mostly for businesses and organizations.&lt;/p&gt;

&lt;p&gt;For a more technical definition, I especially like how IBM defines Data Science:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;“Data science combines math and statistics, specialized programming, advanced analytics, artificial intelligence (AI), and machine learning with specific subject matter expertise to uncover actionable insights hidden in an organization’s data. These insights can be used to guide decision making and strategic planning.”&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3Mk0-RvY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AUMWQCmryxfwsF-m_1X3dBQ.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3Mk0-RvY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AUMWQCmryxfwsF-m_1X3dBQ.jpeg" alt="picture written: yeah that's right, I am a data scientist" width="300" height="227"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why learn Data Science?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;With the definition out of the way, why then do you need to learn Data Science?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data is the new oil.&lt;/strong&gt; You might have heard this somewhere, and it may sound like a stretch, but it is true. In the 21st century, data is the driving force behind industries, and organizations that have tapped into insights from their customer data (consensually, of course) are far ahead of the competition.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pool of opportunities.&lt;/strong&gt; It goes without saying that data is at the center of any industry you may think of. Some of the industries that have really embraced data science include Healthcare, Fintech and E-commerce.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pFQL1t2D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AFCxm5xQRVygyCJnoh_kBzw.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pFQL1t2D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AFCxm5xQRVygyCJnoh_kBzw.jpeg" alt="picture written: data science the job of the century" width="297" height="170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Lucrative career.&lt;/strong&gt; This shouldn’t be the main reason driving you into Data Science, but the fact is there’s pretty fair compensation in the industry. According to Glassdoor, the average salary for a Data Scientist is $117,345/yr.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use data to do good.&lt;/strong&gt; Adversities can be detected and avoided through the insights gained from predictive models. In healthcare, for example, there are models that predict the chance of a person developing serious complications like heart failure based on a few inputs. A person can act on those insights and change their lifestyle, avoiding the disease in the long run.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Roadmap.
&lt;/h2&gt;

&lt;p&gt;To break into the field of Data Science, there are some foundations that are a must-have. These include:&lt;/p&gt;

&lt;h3&gt;
  
  
  Mathematics
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Linear Algebra&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Probability and Statistics&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Calculus&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Programming
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Python —(&lt;a href="https://medium.com/@wainaina.pierre/python-101-introduction-to-python-for-data-science-5954cb4238da"&gt;“Introduction to Python for Data Science”&lt;/a&gt;)&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Programming Syntax in Python&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Functions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data Structures (lists, tuples, dictionaries)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Object Oriented Programming&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="2"&gt;
&lt;li&gt;R&lt;/li&gt;
&lt;/ol&gt;
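&lt;p&gt;To make the Python topics above concrete, here is a tiny, hypothetical sketch touching a function, the core data structures and a minimal class (all names and numbers are made up):&lt;/p&gt;

```python
# A function: reusable logic with a docstring.
def average(numbers):
    """Return the arithmetic mean of a list of numbers."""
    return sum(numbers) / len(numbers)

# Core data structures.
scores = [80, 90, 100]                       # list (mutable sequence)
point = (3, 4)                               # tuple (immutable sequence)
phone = {"brand": "Samsung", "price": 399}   # dictionary (key-value pairs)

# A minimal class: object-oriented programming.
class Phone:
    def __init__(self, brand, price):
        self.brand = brand
        self.price = price

    def discounted(self, rate):
        """Price after applying a discount rate between 0 and 1."""
        return self.price * (1 - rate)
```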

&lt;h3&gt;
  
  
  Data Manipulation
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;NumPy&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;pandas&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;dplyr (R)&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
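&lt;p&gt;A quick, hypothetical taste of what data manipulation with NumPy and pandas looks like (the data is made up):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# NumPy: fast, vectorized math over whole arrays.
prices = np.array([250, 399, 199])
mean_price = prices.mean()

# pandas: labeled, tabular data.
df = pd.DataFrame({"brand": ["Samsung", "Apple", "Nokia"], "price": prices})
cheap = df[df["price"].lt(300)]        # filter rows where price is below 300
by_price = df.sort_values("price")     # sort the table by a column
```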

&lt;h3&gt;
  
  
  Data Visualization
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Matplotlib&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Seaborn&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ggplot2 (R)&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
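&lt;p&gt;A minimal Matplotlib sketch of the kind of chart you will build early on (the salary figures are invented for illustration):&lt;/p&gt;

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: render to a file, no display needed
import matplotlib.pyplot as plt

years = [2019, 2020, 2021, 2022, 2023]
salaries = [95, 100, 108, 112, 117]  # hypothetical averages, in $1000s

fig, ax = plt.subplots()
ax.plot(years, salaries, marker="o")
ax.set_xlabel("Year")
ax.set_ylabel("Average salary ($1000s)")
ax.set_title("Data scientist salaries (illustrative)")
fig.savefig("salaries.png")
```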

&lt;h3&gt;
  
  
  Data Preprocessing and Exploration.
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Exploratory Data Analysis&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Feature Engineering&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data Cleaning&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Handling Missing Data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data Normalization&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
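&lt;p&gt;The preprocessing steps above can be sketched in a few lines of pandas. This toy example (made-up values) imputes missing ages, drops rows with a missing salary, and min-max normalizes the salary column:&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 40, 35],
    "salary": [50, 60, None, 80],
})

df["age"] = df["age"].fillna(df["age"].median())   # handle missing data: impute
df = df.dropna(subset=["salary"])                  # data cleaning: drop bad rows

# Data normalization: rescale salary to the [0, 1] range.
span = df["salary"].max() - df["salary"].min()
df["salary_norm"] = (df["salary"] - df["salary"].min()) / span
```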

&lt;h3&gt;
  
  
  Git and GitHub.
&lt;/h3&gt;

&lt;p&gt;As a data scientist, your work often involves collaborating with fellow data scientists on various projects. During these collaborations, you need to make updates to specific sections of the code. This is where Git and GitHub play a pivotal role in enhancing workflow efficiency.&lt;/p&gt;

&lt;p&gt;I have a detailed article on Git and GitHub for Data Scientists, &lt;a href="https://medium.com/@wainaina.pierre/comprehensive-guide-to-github-for-data-scientists-76e371f60226"&gt;“Comprehensive Guide to GitHub for Data Scientists.”&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  SQL.
&lt;/h3&gt;

&lt;p&gt;SQL is one of the most important tools that a data scientist should be well versed in. It gives the data scientist the ability to retrieve and filter data, manipulate it, aggregate and summarize it, and join tables.&lt;/p&gt;

&lt;p&gt;I have a detailed article on SQL, &lt;a href="https://medium.com/@wainaina.pierre/essential-sql-commands-that-are-a-must-know-for-a-data-scientist-b69a68a8d8ac"&gt;“Essential SQL commands that are a must know for a data scientist.”&lt;/a&gt;&lt;/p&gt;
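&lt;p&gt;As a hedged illustration of those abilities, here is a self-contained Python sketch using the standard-library sqlite3 module and made-up tables; it retrieves and filters rows, then joins and aggregates them:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                         amount REAL);
    INSERT INTO customers VALUES (1, 'Amina'), (2, 'Brian');
    INSERT INTO orders VALUES (1, 1, 120.0), (2, 1, 80.0), (3, 2, 50.0);
""")

# Retrieve and filter.
cur.execute("SELECT name FROM customers WHERE id = 1")
first_customer = cur.fetchone()[0]

# Join, aggregate and summarize.
cur.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name ORDER BY c.name
""")
totals = cur.fetchall()
```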

&lt;h3&gt;
  
  
  Machine Learning.
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Supervised Learning&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Regression&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Linear Regression&lt;/li&gt;
&lt;li&gt;Polynomial Regression&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Classification&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Logistic Regression&lt;/li&gt;
&lt;li&gt;Support Vector Machines&lt;/li&gt;
&lt;li&gt;Decision Trees&lt;/li&gt;
&lt;li&gt;K-Nearest Neighbors&lt;/li&gt;
&lt;li&gt;Random Forest&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Unsupervised Learning&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Clustering&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;K-Means Clustering&lt;/li&gt;
&lt;li&gt;Hierarchical Clustering&lt;/li&gt;
&lt;li&gt;DBSCAN&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dimensionality Reduction&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Principal Component Analysis (PCA)&lt;/li&gt;
&lt;li&gt;t-Distributed Stochastic Neighbor Embedding (t-SNE)&lt;/li&gt;
&lt;li&gt;Linear Discriminant Analysis (LDA)&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="3"&gt;
&lt;li&gt;&lt;p&gt;Reinforcement Learning&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Model Evaluation and Validation&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Cross-Validation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hyperparameter Tuning&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Model Selection&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="5"&gt;
&lt;li&gt;Python Libraries&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;scikit-learn&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;TensorFlow&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;PyTorch&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Keras&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
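&lt;p&gt;Tying several of the machine learning items above together, here is a minimal scikit-learn workflow on the classic built-in iris dataset: a train/test split, a supervised classifier, and cross-validation:&lt;/p&gt;

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Supervised learning: logistic regression for classification.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

# Model validation: 5-fold cross-validation on the full dataset.
cv_scores = cross_val_score(model, X, y, cv=5)
```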

&lt;h3&gt;
  
  
  Deep Learning.
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Neural Networks&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Perceptron&lt;/li&gt;
&lt;li&gt;Multi-Layer Perceptron&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Convolutional Neural Networks (CNNs)&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Image Classification&lt;/li&gt;
&lt;li&gt;Object Detection&lt;/li&gt;
&lt;li&gt;Image Segmentation&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Recurrent Neural Networks (RNNs)&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Sequence-to-Sequence Models&lt;/li&gt;
&lt;li&gt;Text Classification&lt;/li&gt;
&lt;li&gt;Sentiment Analysis&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU)&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Time Series Forecasting&lt;/li&gt;
&lt;li&gt;Language Modeling&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Generative Adversarial Networks (GANs)&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Image Synthesis&lt;/li&gt;
&lt;li&gt;Style Transfer&lt;/li&gt;
&lt;li&gt;Data Augmentation&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data Visualization and Reporting.
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Dashboarding Tools&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Tableau&lt;/li&gt;
&lt;li&gt;Power BI&lt;/li&gt;
&lt;li&gt;Dash (Python)&lt;/li&gt;
&lt;li&gt;Shiny (R)&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Storytelling with Data&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Must have Soft Skills.
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Be a Problem-Solver&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Effective Communication Skills&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Time Management&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Teamwork&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Keep Learning.
&lt;/h3&gt;

&lt;p&gt;As a Data Scientist, as in all other fields in tech, you will be a lifelong learner. There will always be emerging trends, frameworks and languages, and you have to stay up to date to be an effective Data Scientist.&lt;/p&gt;

&lt;p&gt;Some ways to stay up to date and keep learning continuously include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Taking online courses.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Working on projects. Datasets are readily available on platforms like Kaggle.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Solving online challenges on platforms like LeetCode and HackerRank.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reading Data Science books and research papers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reading informative articles and blogs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Networking through meetups, both online and in person.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BOCjb-9L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AIVPslkXavKMsA5v7UfxGZA.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BOCjb-9L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AIVPslkXavKMsA5v7UfxGZA.jpeg" alt="picture written: data scientists, data scientists everywhere" width="267" height="189"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>data</category>
      <category>datascience</category>
      <category>beginners</category>
      <category>python</category>
    </item>
    <item>
      <title>Web Scraping: Unleashing Insights from Online Data.</title>
      <dc:creator>Peter Wainaina </dc:creator>
      <pubDate>Sat, 18 Nov 2023 17:09:29 +0000</pubDate>
      <link>https://dev.to/wainainapeter/web-scraping-unleashing-insights-from-online-data-kon</link>
      <guid>https://dev.to/wainainapeter/web-scraping-unleashing-insights-from-online-data-kon</guid>
      <description>&lt;h2&gt;
  
  
  Web Scraping: Unleashing Insights from Online Data.
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jg2Lmz06--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/15468/0%2AC0KTXIdafa_d43Il" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jg2Lmz06--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/15468/0%2AC0KTXIdafa_d43Il" alt="Photo by [Nathan Dumlao](https://unsplash.com/@nate_dumlao?utm_source=medium&amp;amp;utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral)" width="800" height="534"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Some background info:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Collection&lt;/strong&gt; is a data process that involves collecting relevant data from various sources, to be used for analysis or for building models. One of the methods of data collection is &lt;strong&gt;Web Scraping&lt;/strong&gt;, which you will be able to do by the end of this tutorial.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Web Scraping?
&lt;/h2&gt;

&lt;p&gt;Web Scraping is a technique for collecting data from the internet: an automated process of extracting large amounts of unstructured data from websites.&lt;/p&gt;

&lt;h2&gt;
  
  
  When do you need Web Scraping?
&lt;/h2&gt;

&lt;p&gt;Ideally, you would use an API to fetch content from a target website, but not every website exposes one. Where an API is non-existent, you may have to turn to Web Scraping to get the content you want.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Some of the reasons for scraping web pages include:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Collecting data to be used for analysis or building Machine Learning Models&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Customer sentiment analysis through product reviews, for example on e-commerce websites.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Competitor analysis and product price comparison.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Take note that some websites explicitly forbid the use of automated web scraping tools, so always check your target website’s acceptable use policy to ensure you do not violate its terms of use.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to web-scrape.
&lt;/h2&gt;

&lt;p&gt;You are going to use Beautiful Soup (installed as beautifulsoup4), a Python library for parsing HTML that is widely used for web scraping.&lt;/p&gt;

&lt;p&gt;A web scraper uses the Hypertext Transfer Protocol (HTTP) to request data from a target website using the GET method.&lt;/p&gt;

&lt;p&gt;You will GET the HTML from your target URL using the requests library in Python, pass the returned content into Beautiful Soup, and then use selectors to query the specific data you want from the website.&lt;/p&gt;

&lt;p&gt;Prerequisites.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Have Python 3 installed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;An IDE to run your code, for example VS Code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Install the necessary libraries for building the web scraper: pandas, beautifulsoup4 and requests.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Have a copy of the URL to your target website. In my case, I will scrape data from backmarket.com.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Install the required libraries:
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PBuePCTt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2AmZeXpODoiuD03gPl" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PBuePCTt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2AmZeXpODoiuD03gPl" alt="code installing the required libraries" width="712" height="127"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Import the required libraries into your code:
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xdCDVb4Z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2AbI4M54AX6BMVUDgH" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xdCDVb4Z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2AbI4M54AX6BMVUDgH" alt="code importing the required libraries" width="717" height="110"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Initialize an empty list to store the data you will get from the target website:
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rRAcU_81--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2AzBVK77pXezF-YFYt" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rRAcU_81--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2AzBVK77pXezF-YFYt" alt="code to initialize an empty list to store the data you will get from the target website" width="717" height="50"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Inspect the target Website:
&lt;/h2&gt;

&lt;p&gt;Go to your target website and inspect it by either right-clicking and choosing Inspect, or using the shortcut Ctrl+Shift+I on Windows or Command+Option+I on Mac.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aA1qVaWu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3200/0%2AMw62vDNwhVasF9h2" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aA1qVaWu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3200/0%2AMw62vDNwhVasF9h2" alt="code to inspect target website" width="800" height="374"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Navigate to the specific HTML of a product card of your choice, preferably the first one on the page for easier navigation. As you expand the HTML, the corresponding product is overlaid with a blue highlight. Keep expanding until you reach the target data; in my case, I want the name of the phone, the price and the status.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gT2l4OXP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3200/0%2A18E4EK1rC-9lrg14" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gT2l4OXP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3200/0%2A18E4EK1rC-9lrg14" alt="navigating to the specific HTML" width="800" height="374"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is the expanded HTML of my product card. I will select the classes inside the div holding the name, storage and status, which is how the full code below queries the specific data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IKFUtVdw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2546/0%2AoVbid_I9duCVOShd" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IKFUtVdw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2546/0%2AoVbid_I9duCVOShd" alt="collapsed data of my product card" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is the complete code that will essentially send a GET request to the target URL, query specific data and store the data in a CSV file.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--khLOEdd4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3200/0%2A1OQAKc2WOX8ZjN_U" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--khLOEdd4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3200/0%2A1OQAKc2WOX8ZjN_U" alt="complete code that will essentially send a GET request to the target URL" width="800" height="263"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What specifically is the code doing?
&lt;/h2&gt;

&lt;p&gt;As you saw earlier, the first three lines of code are the necessary imports for the web scraper.&lt;/p&gt;

&lt;p&gt;You then define an empty list where the data retrieved from the target website will be stored.&lt;/p&gt;

&lt;p&gt;There is a for loop that iterates over all 13 pages of my target website. The page numbers were indicated at the bottom of the page in my case, so I just specified them as they were. This may not always be the case: sometimes you may be scraping hundreds of pages whose count is not explicitly stated.&lt;/p&gt;

&lt;p&gt;I am getting the name, price and status of the phone for each product in my target website, which is Samsung phones.&lt;/p&gt;

&lt;p&gt;I then created a list called info with three elements (Name, Price and Status) and appended it to the data list defined right after the imports.&lt;/p&gt;

&lt;p&gt;Finally, using pandas, I put the data into a DataFrame and wrote it out in CSV format; the file is saved in the directory where your code is located.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion.
&lt;/h2&gt;

&lt;p&gt;There you have it! You can now scrape a target website for specific data you may need, maybe for data analysis. In my case, I managed to scrape 892 products, did Exploratory Data Analysis and zeroed in on the specific phone I wanted. This saved me the time I would otherwise have spent scrolling through all the pages and looking over all 892 Samsung phones listed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iQZFO31y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2432/0%2AsVIoGEvXuboLjgaB" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iQZFO31y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2432/0%2AsVIoGEvXuboLjgaB" alt="code for EDA after web scraping" width="800" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>python</category>
      <category>datascience</category>
    </item>
    <item>
      <title>What is WebAssembly? and does it have the potential to replace JavaScript on the web?</title>
      <dc:creator>Peter Wainaina </dc:creator>
      <pubDate>Wed, 09 Aug 2023 07:53:53 +0000</pubDate>
      <link>https://dev.to/wainainapeter/what-is-webassembly-and-does-it-have-the-potential-to-replace-javascript-on-the-web-28b2</link>
      <guid>https://dev.to/wainainapeter/what-is-webassembly-and-does-it-have-the-potential-to-replace-javascript-on-the-web-28b2</guid>
      <description>&lt;p&gt;Since the inception of the web, JavaScript has been its de facto language and compile target. Every web developer, myself included, has to have a good understanding of JavaScript before advancing to its frameworks, whether for the front end or the back end.&lt;/p&gt;

&lt;h2&gt;
  
  
  Some advantages of using JavaScript are:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;JavaScript is easy to learn and understand&lt;/strong&gt;, making it accessible to both new and experienced developers. Its straightforward, user-friendly structure makes dynamic content easy to implement and saves web developers money.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;JavaScript is highly interoperable&lt;/strong&gt;, as it integrates seamlessly with other programming languages, making it a preferred choice for many developers when creating a range of applications. It can be included in any webpage or the script of another programming language.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;JavaScript is client-side&lt;/strong&gt;, therefore data validation can be done within the browser itself rather than being sent to the server, reducing the load on the server. In case of any discrepancies, only the selected area of the page needs to be updated by the browser, eliminating the need to reload the entire website.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Unlike other programming languages such as Java, &lt;strong&gt;JavaScript is an interpreted language&lt;/strong&gt;, so there is no separate compilation step before execution. As a client-side script, it also avoids wait times for server round trips: JavaScript can be hosted anywhere and always runs in a client environment, which reduces bandwidth usage and boosts execution speed.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Truth be told, JavaScript is a bit stressful to work with, especially on a large project. That’s where a language like TypeScript comes in: it is built on JavaScript but adds type checking, so errors are identified at compile time, saving you loads of time you would otherwise spend debugging your project, which is the worst situation to be in as a developer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Some other disadvantages of JavaScript include:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Although JavaScript may interpret quickly, the &lt;strong&gt;rendering of HTML through the JavaScript DOM can be slow&lt;/strong&gt;, leading to delayed rendering and debugging in HTML editors is not as efficient as other programming languages like C/C++.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For larger front-end projects, &lt;strong&gt;the configuration process can be tedious&lt;/strong&gt; due to the need to integrate multiple tools to create a suitable environment. This can directly impact the performance of the library.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Developing large applications in JavaScript can be challenging&lt;/strong&gt;, but using a TypeScript overlay can help alleviate this issue by identifying errors at compile time and significantly reducing the chance of shipping them to production.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;JavaScript can be interpreted differently by various browsers, making it challenging to write and read cross-browser code&lt;/strong&gt;. Additionally, errors in JavaScript can halt the rendering of an entire website and the continuous conversion of JavaScript also increases the time needed to run scripts and reduces their speed.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That was just some background information to get you up to speed.&lt;/p&gt;

&lt;p&gt;Now that we are on the same page, here’s the deal: everyone has that one language that they love programming with and sometimes wish they could use it in a more versatile manner, say use it to write code that will run on the web, if at all it’s not JavaScript.&lt;/p&gt;

&lt;p&gt;I bear good news: you can do exactly that with WebAssembly. This is exciting news! Well, at least it was for me when I first heard about it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6puAYcOt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ql7axbffmdbspq362wjw.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6puAYcOt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ql7axbffmdbspq362wjw.jpg" alt="lady jumping up in joy" width="720" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I am sure you’re intrigued and asking what is this so called &lt;a href="https://webassembly.org/"&gt;WebAssembly&lt;/a&gt;? WebAssembly is a binary instruction format that boosts the efficiency of web browser programs. It gives a programmer the power to create web applications &lt;strong&gt;in the language of their choice&lt;/strong&gt; and creates minimal file sizes that load and run more quickly. In comparison to JavaScript, it is a new low-level binary compile format that is more suitable as a compiler target.&lt;/p&gt;

&lt;p&gt;Great news, right? I believe you are now as elated as I was when I first heard about Web Assembly.&lt;/p&gt;

&lt;p&gt;As a developer, &lt;strong&gt;you can write in your preferred language&lt;/strong&gt;, say C, C++ or Rust, which is then compiled into WebAssembly bytecode. In case you were wondering: WebAssembly is not a programming language, so you won’t have to learn a new one. After you have written your code, the bytecode is executed on the client, usually in a web browser, where it is converted into native machine code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7CeBmVTA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/owq3alife42qk836rym9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7CeBmVTA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/owq3alife42qk836rym9.jpg" alt="picture showing the process of web assebly" width="720" height="286"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;WebAssembly has a load of perks, including its compatibility with contemporary browsers and support for a variety of languages, including C, C++, Go and Rust.&lt;/p&gt;

&lt;p&gt;WebAssembly is not meant to replace JavaScript, but rather to be used in collaboration with it. After all, JavaScript has been one of the core technologies of the web since the inception of the World Wide Web, and it powers client-side webpage behavior on more than 90% of websites.&lt;/p&gt;

&lt;p&gt;Developers have been adopting WebAssembly, especially for performance-intensive use cases like video editing, CAD applications, gaming (which is heavy on graphics) and music streaming. Below is a graph showing these uses.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GWs1we5L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/68oygq5sq005ruw8hnvr.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GWs1we5L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/68oygq5sq005ruw8hnvr.jpg" alt="graph showing various uses of web assembly" width="720" height="296"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As a UX Designer, I can’t fail to mention Figma as an example of a web service that has already embraced WebAssembly: it is written in C++, and it is evident how powerful and performant it is as a collaborative design tool on the web. WebAssembly currently supports a variety of languages such as C, C++, C#, Rust, Swift, Kotlin and Go, and support for more languages is being added, so don’t be sullen if your favorite language is not yet supported. Chances are it will be by the time you read this.&lt;/p&gt;

&lt;p&gt;The majority of WebAssembly scenarios involve writing code in a high-level language and converting it to WebAssembly. This can be accomplished in one of these three ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Through direct compilation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Through third-party tools.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Through a WebAssembly-based interpreter.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you are interested in web development but for some reason don’t like JavaScript, WebAssembly has got your back. You can still create awesome web applications using your favorite language, so long as it is supported by WebAssembly. Happy coding!&lt;/p&gt;

</description>
      <category>webassembly</category>
      <category>webdev</category>
      <category>javascript</category>
      <category>programming</category>
    </item>
    <item>
      <title>Comprehensive Guide to GitHub for Data Scientists.</title>
      <dc:creator>Peter Wainaina </dc:creator>
      <pubDate>Wed, 26 Apr 2023 09:00:00 +0000</pubDate>
      <link>https://dev.to/wainainapeter/comprehensive-guide-to-github-for-data-scientists-5bmb</link>
      <guid>https://dev.to/wainainapeter/comprehensive-guide-to-github-for-data-scientists-5bmb</guid>
      <description>&lt;p&gt;This article is an in-depth guide to Git and GitHub. You will get to know what exactly Git and GitHub are and how you can leverage them to make your data science projects easier to track. As a data scientist, you need to have a solid grasp of these tools.&lt;/p&gt;

&lt;p&gt;As a data scientist, you are going to collaborate with fellow data scientists on projects, and as you collaborate, there will be times when you have to update some part of the code. This is where Git &amp;amp; GitHub come in handy and help create a better workflow: whatever changes any collaborator makes can easily be made available to all the other collaborators, without anyone having to be in the same room, country or even time zone. And if you make a mistake, you can always roll back to a previous version.&lt;/p&gt;

&lt;p&gt;GitHub gives you the power to create a remote project and have all your team members work on different features in parallel, yet independently, and still have stable running code at the end of the day.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LSEIGfm3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jpdincj9ik4ac0ml9qei.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LSEIGfm3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jpdincj9ik4ac0ml9qei.png" alt="git image" width="348" height="145"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is the difference between Git &amp;amp; GitHub?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Git&lt;/strong&gt; is a distributed Version Control System (VCS) that lets you keep track of all the modifications you make to your code. Being distributed means that everyone collaborating on a project has a history of the changes on their local machine. This enables people to work on different features of the project without having to communicate with the server hosting the remote version, and you can easily merge any changes you make with the remote copy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt; is a version control platform built on top of Git. GitHub hosts the remote version of your project, from where everyone collaborating can access it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Terminologies that you should be familiar with as we start:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Repository&lt;/strong&gt; – This is sort of a "Database" for all the branches and commits of a particular project.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Branch&lt;/strong&gt; – It’s an alternative state or line of development for a repository.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Merge&lt;/strong&gt; – This is bringing together multiple branches into a single branch.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Clone&lt;/strong&gt; – This is creating a local copy of a remote repository on your machine.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Origin&lt;/strong&gt; – Refers to the remote repository from which the local clone was cloned.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Master/Main&lt;/strong&gt; – This is the root branch of your remote repository.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Stage&lt;/strong&gt; - Choosing the files that will be part of a new commit you intend to make.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Commit&lt;/strong&gt; - A saved snapshot of staged changes made to the file(s) in the repository.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;HEAD&lt;/strong&gt; – It’s the commit your local repository is currently on.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Push&lt;/strong&gt; – This is the act of sending your changes to the remote repository for everyone you may be collaborating with to see.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pull&lt;/strong&gt; – It’s the act of getting everybody else's changes (the changes that have been pushed) to your local repository.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pull Request&lt;/strong&gt; – This is a mechanism to review and approve the changes you have made before merging to the main/master branch in the remote repository.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Basic commands that you should be familiar with:
&lt;/h2&gt;

&lt;p&gt;git init - Create a new repository on your local computer.&lt;br&gt;
git init&lt;/p&gt;

&lt;p&gt;git clone - Start working on an existing remote repository.&lt;br&gt;
git clone &amp;lt;repository URL&amp;gt;&lt;/p&gt;

&lt;p&gt;git add - Choose file(s) to be saved (staging).&lt;br&gt;
git add &amp;lt;file name&amp;gt; (adding a single file)&lt;/p&gt;

&lt;p&gt;git add -A (adding everything at once)&lt;/p&gt;

&lt;p&gt;git status - Show which files you have changed.&lt;br&gt;
git status&lt;/p&gt;

&lt;p&gt;git commit - Save a snapshot (commit) of the chosen file(s).&lt;br&gt;
git commit -m “&amp;lt;commit message&amp;gt;”&lt;/p&gt;

&lt;p&gt;git push - Send your saved snapshots (commits) to the remote repository.&lt;br&gt;
git push origin &amp;lt;branch name&amp;gt;&lt;/p&gt;

&lt;p&gt;git pull - Pull recent commits made by others into your local computer.&lt;br&gt;
git pull origin &amp;lt;branch name&amp;gt;&lt;/p&gt;

&lt;p&gt;git branch - Create or delete branches.&lt;br&gt;
git branch &amp;lt;branch name&amp;gt;&lt;/p&gt;

&lt;p&gt;git checkout - Switch branches or undo changes made to local file(s).&lt;br&gt;
git checkout &amp;lt;branch name&amp;gt;&lt;/p&gt;

&lt;p&gt;git merge - Merge branches to form a single branch.&lt;br&gt;
git merge &amp;lt;branch name&amp;gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step-by-step procedure of how to Create and Clone a Repository.
&lt;/h2&gt;

&lt;p&gt;This walkthrough shows how to install Git on Windows and create a repository to which you will commit changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Create Account &amp;amp; Git Installations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Go to the Git website and install the latest version for your computer. Once you are done installing, launch Git Bash and then use the git --version command to check the version.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Initializing a new Repository&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Create a new folder/directory using the mkdir command and navigate to the created folder using the cd command. My local directory name will be “myproject1” for the sake of context.&lt;/p&gt;

&lt;p&gt;Use the git init command to initialize the directory. To check that all is well so far, go to the folder where “myproject1” has been created, create a file with the .txt extension, write something to it like ‘My first project is up and running’, and save the changes. After that, return to Git Bash and use git status to check the status of the folder.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Configuring Git&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The git config command allows you to set configuration values that control how Git looks and operates; Git uses these configurations to determine any non-default behavior you may want. With git config you can set global variables, for example the name and email of a user, and verify the variables using git config --list.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Commit Files in Git&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As it stands, the file that we created is untracked. The git add command copies a file from the working directory to the staging area; committing then keeps track of the changes you make. The git commit command performs a commit, and -m “message” attaches a message. It takes a snapshot of the staging area and assigns the commit a hash that identifies the snapshot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Viewing Logs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Logs will enable you to see the commit history and changes in a project when you have collaborated with different people on the same repository.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 6: Uploading to a Remote Repository using Git&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Create a new repository on GitHub and give it a name as well as a readme description.&lt;/p&gt;

&lt;p&gt;Add a file into a folder and use these commands below in the exact sequence shown:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;cd &amp;lt;folder name&amp;gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;git init&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;git remote add origin &amp;lt;repository URL&amp;gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;git remote -v&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;git add . (take note of the full stop/period)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;git commit -m “your message”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;git push origin master&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The file will be automatically added to the GitHub repository you just created.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 7: Adding Git Remote to Your Repository&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The git remote command can be used to share code with a remote repository, and any project can be downloaded from a remote server to your local computer. By convention, the connection pointing to the original remote repository is named “origin”.&lt;/p&gt;

&lt;p&gt;We use the command &lt;em&gt;git remote add origin &amp;lt;repository URL&amp;gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 8: Push using Git&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The git push command is used to upload local repository content and commits to a remote repository. After you have made the final modifications to your project, you perform a push operation so that the changes you have made can be shared with the remote team members you are collaborating with.&lt;/p&gt;

&lt;p&gt;The command is &lt;em&gt;git push origin master&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 9: Cloning a GitHub Repository&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cloning a repository creates a local copy of an existing GitHub repository on your machine, complete with every version of every file and folder in the project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 10: Branching and Merging&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Branching allows you to take the code from production and fix a bug or add a feature without modifying the already existing version. You work on a copy of the code in a branch, make and build your changes, test them, then merge them into the main branch.&lt;/p&gt;

&lt;p&gt;To create a new branch, use the command &lt;em&gt;git branch &amp;lt;name of branch&amp;gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1:&lt;/strong&gt; Create branch -&amp;gt; git branch “branch name”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2:&lt;/strong&gt; Checkout branch -&amp;gt; git checkout “branch name”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3:&lt;/strong&gt; Merge the new branch into the master branch -&amp;gt; git merge “branch name”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 11: Pull using Git&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Pull requests communicate the changes made in a branch of a repository. Once a pull request is opened, collaborators can discuss and review the potential changes, and further commits can be made before merging.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 12: Forking and Contributing to the world&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Forking is the process of contributing to or using someone else’s project: it creates a remote copy of the original repository under your own account. You get a copy on which you can make changes or improvements to the existing project, and you can propose them back to the original project through pull requests. You are basically making open-source contributions to someone else’s project.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open any public repository and click on the Fork button to fork the changes.&lt;/li&gt;
&lt;li&gt;You can keep the same name of the repository you want to fork and click on Create Fork.&lt;/li&gt;
&lt;li&gt;Once you fork, you will see a copy of the original repository in your account.&lt;/li&gt;
&lt;li&gt;Once you have made changes to the code, you need to push the changes back.&lt;/li&gt;
&lt;li&gt;Staging takes a snapshot of the changes, and the commit and push commands send them to your fork.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is how you make open-source contributions to a public repository.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;As a data scientist, you must have in-depth knowledge of version control tools like Git and GitHub to participate in maintaining and reviewing changes in collaborative and personal projects.&lt;/p&gt;

&lt;p&gt;The key takeaway from this article is the basic Git commands and the step-by-step procedure of creating and cloning a repository.&lt;/p&gt;

</description>
      <category>github</category>
      <category>git</category>
      <category>versioncontrol</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Getting started with sentiment analysis.</title>
      <dc:creator>Peter Wainaina </dc:creator>
      <pubDate>Wed, 26 Apr 2023 07:40:00 +0000</pubDate>
      <link>https://dev.to/wainainapeter/getting-started-with-sentiment-analysis-49jl</link>
      <guid>https://dev.to/wainainapeter/getting-started-with-sentiment-analysis-49jl</guid>
<description>&lt;p&gt;&lt;strong&gt;Sentiment analysis&lt;/strong&gt; is an approach to natural language processing (NLP) that studies the subjective information in an expression. Subjective information varies from person to person and includes the opinions, emotions, or attitudes towards a topic, person or entity that people tend to express in written form. These expressions can be classified as positive, negative, or neutral. Machine learning algorithms review this textual data and extract valuable information from it, and brands and businesses then make decisions based on the information extracted.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Here are a few advantages of Sentiment Analysis especially in Business:&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;It helps you understand your audience and their specific needs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can gather actionable data about your products based on critiques and suggestions given by customers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can get meaningful insights about your brand and the kind of emotion it invokes among the people.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can conduct a comprehensive competitive analysis and gauge your product against your competitors’.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Monitoring long-term brand health by tracking sentiments over long periods ensures that you have a positive relationship with your target customers.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It would be very expensive in terms of time and cost to have human beings read all customer reviews to determine whether the customers are happy or not with the business, service, or products. This necessitates the use of machine learning techniques such as sentiment analysis to achieve similar results at a large scale. For example, imagine a large company like Amazon going through all the reviews it receives about its products one by one; it would take ages and a lot of manpower to do so. A machine learning model is the better approach in such a scenario.&lt;/p&gt;

&lt;p&gt;In this article, you will practically learn how to go about sentiment analysis using Twitter sentiments. By the end of the article, you will have developed a Sentiment Analysis model to categorize a tweet as either Positive or Negative.&lt;/p&gt;

&lt;p&gt;The dataset being used can be downloaded from Kaggle.com using this link: &lt;a href="https://www.kaggle.com/datasets/kazanova/sentiment140"&gt;https://www.kaggle.com/datasets/kazanova/sentiment140&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This dataset contains 1,600,000 tweets extracted using the Twitter API and they have been annotated (0 = negative, 4 = positive) and can be used to detect sentiments.&lt;/p&gt;

&lt;p&gt;Take note that I have used Jupyter Notebooks.&lt;/p&gt;

&lt;p&gt;Make sure all relevant imports are present as shown in the code snippet below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pH_t_Jz7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zuz37hwriqt0v9dgiw2f.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pH_t_Jz7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zuz37hwriqt0v9dgiw2f.jpg" alt="Image description" width="800" height="338"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Load the dataset into your notebooks and plot the distribution of the tweets based on whether they are positive or negative as shown:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0uQlzhfM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2nbwdsvgy9201xaj1b3v.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0uQlzhfM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2nbwdsvgy9201xaj1b3v.jpg" alt="Image description" width="800" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You should expect the output shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MFYykkbG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/t5b7o7b8afkcjlb5fatd.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MFYykkbG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/t5b7o7b8afkcjlb5fatd.jpg" alt="Image description" width="678" height="433"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Perform Text Processing which is transforming text into a more digestible form so that machine learning algorithms can perform better.&lt;/p&gt;

&lt;p&gt;The Text Preprocessing steps that have been taken are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Converting each text into lowercase.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;@Usernames have been replaced with the word "USER". (eg: "@pierre_wainaina" to "USER")&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Characters that are neither numbers nor letters of the alphabet have been replaced with a space.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Replacing URLs: Links starting with "http" or "https" or "www" are replaced by "URL".&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Short words with fewer than two letters have been removed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Stopwords, which are words that do not add much meaning to a sentence, have been removed. (eg: "a", "she", "have")&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Words have been lemmatized. Lemmatization is the process of converting a word to its base form. (e.g: “worst” to “bad”)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Emojis have been replaced by using a pre-defined dictionary containing the emojis and their meaning. (eg: ":)" to "EMOJIsmile")&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;3 or more consecutive letters have been replaced by 2 letters. (eg: "Heyyyy" to "Heyy") &lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
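&lt;p&gt;The steps above (minus stopword removal, lemmatization and emoji mapping, which need the NLTK corpora and an emoji dictionary) can be sketched with plain regular expressions. The exact patterns here are my own approximation of the steps listed:&lt;/p&gt;

```python
import re

def preprocess(text):
    # Step 1: lowercase everything.
    text = text.lower()
    # Step 4: replace links starting with http(s) or www with "URL".
    text = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', ' URL ', text)
    # Step 2: replace @usernames with "USER".
    text = re.sub(r'@[^\s]+', ' USER ', text)
    # Step 3: characters that are neither letters nor digits become spaces.
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text)
    # Step 9: squeeze 3+ consecutive repeats down to 2 ("heyyyy" becomes "heyy").
    text = re.sub(r'(.)\1\1+', r'\1\1', text)
    # Step 5: drop words shorter than two characters.
    return ' '.join(w for w in text.split() if len(w) >= 2)
```

&lt;p&gt;For example, preprocess("@pierre_wainaina Heyyyy!! check www.example.com") returns "USER heyy check URL".&lt;/p&gt;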

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--y2GSUjX7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mwfv1z0gufj7vzy4xmlx.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--y2GSUjX7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mwfv1z0gufj7vzy4xmlx.jpg" alt="Image description" width="800" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EpjrcXID--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/elk4x5b73undzaip5hh6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EpjrcXID--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/elk4x5b73undzaip5hh6.jpg" alt="Image description" width="800" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0he1mA2l--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/emd6ac2bh8wwwz822fps.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0he1mA2l--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/emd6ac2bh8wwwz822fps.jpg" alt="Image description" width="800" height="240"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Analyzing the data
&lt;/h2&gt;

&lt;p&gt;Let's analyze the pre-processed data to get to understand it. Below is code for plotting Word Clouds for Positive and Negative tweets from the dataset and it will give a visual output of the words that occur most frequently.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HhHV3tWi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uwnaqv8eobhdj8cix6nf.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HhHV3tWi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uwnaqv8eobhdj8cix6nf.jpg" alt="Image description" width="800" height="303"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Below is the output of the word cloud for negative tweets:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dz2ewhW0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a3xic0qgblro52zamfvl.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dz2ewhW0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a3xic0qgblro52zamfvl.jpg" alt="Image description" width="800" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Splitting the Data
&lt;/h2&gt;

&lt;p&gt;We shall split the pre-processed data into 2 sets:&lt;/p&gt;

&lt;p&gt;Training Data: The dataset on which the model will be trained. It contains 95% of the data.&lt;/p&gt;

&lt;p&gt;Test Data: The dataset against which the model will be tested. It contains the remaining 5% of the data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--32I4bM0c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vd0ypobv5nvqas7cmimw.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--32I4bM0c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vd0ypobv5nvqas7cmimw.jpg" alt="Image description" width="800" height="153"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TF-IDF Vectoriser
&lt;/h2&gt;

&lt;p&gt;This is a tool that helps determine the significance of words when trying to comprehend a dataset. For example, if a dataset contains an essay about "My Car", the word "a" might appear more frequently than words such as "car", "engine", or "horsepower". Those rarer words, however, may carry far more important information than high-frequency words like "the" or "a".&lt;/p&gt;

&lt;p&gt;This is where the TF-IDF method comes into play, which assigns a weight to each word based on its relevance to the dataset.&lt;/p&gt;

&lt;p&gt;The TF-IDF Vectorizer transforms a set of unprocessed documents into a matrix of TF-IDF characteristics, and is typically trained only on the X_train dataset.&lt;/p&gt;

&lt;p&gt;As seen in the code below, the X_train and X_test datasets have been transformed into matrices of TF-IDF features using the TF-IDF Vectoriser. These datasets will be used to train and test the model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VL4W2_Wf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/90i25z4hk6mckq7wnqiv.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VL4W2_Wf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/90i25z4hk6mckq7wnqiv.jpg" alt="Image description" width="800" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating and Evaluating Models
&lt;/h2&gt;

&lt;p&gt;We will create 3 models for our sentiment analysis.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bernoulli Naive Bayes (BernoulliNB)&lt;/li&gt;
&lt;li&gt;Linear Support Vector Classification (LinearSVC)&lt;/li&gt;
&lt;li&gt;Logistic Regression (LR)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As seen in the very first output, our dataset is not skewed and therefore we choose accuracy as our evaluation metric. We are plotting the Confusion Matrix to get an understanding of how our model is performing on both classification types, either positive or negative as seen in the code below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ld3WFmO0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a8mpuxr4goaw7pqqw6jr.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ld3WFmO0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a8mpuxr4goaw7pqqw6jr.jpg" alt="Image description" width="800" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MxqHGR4u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1pzxdyfeuzcbb1izjs3a.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MxqHGR4u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1pzxdyfeuzcbb1izjs3a.jpg" alt="Image description" width="800" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can now test to see if our model can classify the tweets correctly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MqDWhXqJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/078zfnm82xw670cwr0y3.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MqDWhXqJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/078zfnm82xw670cwr0y3.jpg" alt="Image description" width="800" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The output should be as follows; our model works well, as it can classify tweets as either positive or negative.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CF9N9PIP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a8nl1wjjbi8zp7ijwpft.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CF9N9PIP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a8nl1wjjbi8zp7ijwpft.jpg" alt="Image description" width="800" height="93"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sentimentanalysis</category>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>data</category>
    </item>
    <item>
      <title>The Ultimate Guide to Exploratory Data Analysis</title>
      <dc:creator>Peter Wainaina </dc:creator>
      <pubDate>Wed, 26 Apr 2023 03:00:00 +0000</pubDate>
      <link>https://dev.to/wainainapeter/the-ultimate-guide-to-exploratory-data-analysis-3oph</link>
      <guid>https://dev.to/wainainapeter/the-ultimate-guide-to-exploratory-data-analysis-3oph</guid>
<description>&lt;p&gt;This is &lt;strong&gt;the ultimate guide&lt;/strong&gt; to Exploratory Data Analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exploratory data analysis&lt;/strong&gt; (EDA) is an approach to analyzing and summarizing datasets to identify patterns, trends, and relationships. It is very important in the Data Science Life Cycle because it helps you to get a better understanding of your data, identify any issues or problems with the data, and formulate hypotheses for further analysis.&lt;/p&gt;

&lt;p&gt;The Data Science Life Cycle is an iterative set of data science steps you take to deliver a project or analysis and maintain any data science product.&lt;/p&gt;

&lt;p&gt;Below is a flow chart illustrating the Cycle:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rlz8294h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yde91qqyucypdort46v4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rlz8294h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yde91qqyucypdort46v4.jpg" alt="flow chart illustrating the Data Science Life Cycle" width="340" height="148"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Exploratory Data Analysis is without a doubt one of the most important steps in extracting insights from data, and it takes place before the actual analysis or the building of machine learning models begins. For businesses, companies and stakeholders to harness the power of data, the “new oil”, they have to focus on this phase. They therefore need to hire data professionals skilled in exploratory data analysis concepts such as visualization, pattern recognition, and mapping.&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/QiqZliDXCCg"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;This article will give you a guideline on how to get these skills.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why is Exploratory Data Analysis important?
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Exploratory Data Analysis helps us to clean the dataset that we are working with.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It provides a better understanding of the variables in our dataset and the relationships between them.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It helps identify obvious errors and gives a better understanding of patterns present in the data and detects outliers or anomalous events.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Exploratory Data Analysis helps to select the best algorithms for building a machine learning model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It answers questions about standard deviations, categorical variables, and confidence intervals within the data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Once Exploratory Data Analysis is complete and insights are drawn, its features can then be used for more complex data analysis.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  There are four primary types of Exploratory Data Analysis:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Univariate non-graphical&lt;/strong&gt;. It is the simplest form of data analysis, where the data being analyzed has a single variable. This means that in this case you won’t have to deal with causes or relationships in the data set.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Univariate graphical&lt;/strong&gt;. Non-graphical techniques alone do not give the analyst a complete picture of the data. For comprehensive EDA, you therefore also need graphical methods, which include:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Stem-and-leaf plots, which show all data values and the shape of the distribution.&lt;/li&gt;
&lt;li&gt;Histograms, which are bar plots in which each bar represents the frequency (count) or proportion (count/total count) of cases for a range of values.&lt;/li&gt;
&lt;li&gt;Box plots, which graphically depict the five-number summary of minimum, first quartile, median, third quartile, and maximum.&lt;/li&gt;
&lt;/ul&gt;
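As a minimal sketch (using hypothetical simulated data and the NumPy/matplotlib libraries), two of these univariate graphics can be produced like so:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt

# Hypothetical single-variable data: 500 simulated heights in cm
rng = np.random.default_rng(42)
values = rng.normal(loc=170, scale=10, size=500)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(values, bins=20)   # frequency of cases per range of values
ax1.set_title("Histogram")
ax2.boxplot(values)         # five-number summary: min, Q1, median, Q3, max
ax2.set_title("Box plot")
fig.savefig("univariate_eda.png")
```

The saved figure shows the shape of the distribution (histogram) and its five-number summary with outliers (box plot).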

&lt;ol start="3"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multivariate non-graphical:&lt;/strong&gt; Multivariate data consists of more than one variable. Non-graphical multivariate Exploratory Data Analysis methods illustrate relationships between two or more variables using statistics or cross-tabulation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multivariate graphical:&lt;/strong&gt; This technique uses graphics to show relationships between two or more variables. The most common is a grouped bar plot, with each group representing one level of one variable and each bar within a group representing the levels of the other. Other widely used multivariate graphics include heat maps, bubble charts, run charts, and scatter plots.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
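Cross-tabulation, the classic non-graphical multivariate technique, is one line in pandas. A sketch with hypothetical survey data:

```python
import pandas as pd

# Hypothetical survey data with two categorical variables
df = pd.DataFrame({
    "gender":    ["F", "M", "F", "M", "F", "M", "F", "F"],
    "purchased": ["yes", "no", "yes", "yes", "no", "no", "yes", "no"],
})

# Cross-tabulation: a non-graphical view of how the two variables relate
ct = pd.crosstab(df["gender"], df["purchased"])
print(ct)
```

Each cell counts how often one level of the first variable co-occurs with one level of the second.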

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/JG8GRlMjp3c"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Now let’s have a look at the Exploratory Data Analysis Tools:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python:&lt;/strong&gt; An interpreted, object-oriented programming language with dynamic semantics. Its high-level, built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for rapid application development, as well as for use as a scripting or glue language to connect existing components. Python and EDA can be used together to identify missing values in a data set, which is important so you can decide how to handle missing values for machine learning.&lt;/p&gt;
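For example, a minimal sketch of using pandas to count missing values in a hypothetical dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with a few gaps
df = pd.DataFrame({
    "age":  [25, np.nan, 31, 40],
    "city": ["Nairobi", "Mombasa", None, "Kisumu"],
})

missing = df.isnull().sum()  # missing-value count per column
print(missing)
```

Knowing how many values are missing in each column is the first step in deciding how to handle them for machine learning.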

&lt;p&gt;&lt;strong&gt;R:&lt;/strong&gt; An open-source programming language and free software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians in data science in developing statistical observations and data analysis.&lt;/p&gt;

&lt;p&gt;Kindly take note that both Python and R are equally good for Exploratory Data Analysis, but each has its unique advantages over the other; which one is most adequate depends on your needs.&lt;/p&gt;

&lt;p&gt;I use Python because of its ease of use and readability; code written in Python is often easier to maintain and more robust than the equivalent in R. Python also has many rich libraries for Exploratory Data Analysis.&lt;/p&gt;

&lt;p&gt;R, on the other hand, is stronger than Python in both visualization and statistics. Since Exploratory Data Analysis is mostly performed through visualization, with a part of it focused on statistics, R is also a solid choice.&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/4lcwTGA7MZw"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  These are the steps of Exploratory Data Analysis (EDA)
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Data Collection.&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The required data can be collected from various sources through methods like surveys, social media, customer reviews and focus groups, or from secondary sources such as data that already exists in books and other records. Without collecting sufficient and relevant data, the rest of the Exploratory Data Analysis process cannot proceed.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;&lt;strong&gt;Identifying the Variables in the Dataset.&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This involves identifying the important variables in the dataset, how they affect the outcome, and their possible impact. It is a crucial step for the final result expected from any data analysis.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;&lt;strong&gt;Cleaning the Dataset.&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A dataset may contain null values and irrelevant information which needs to be removed so that data contains only those values that are relevant to the target variable.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;For missing values in a numerical column:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replace them with a constant value. This can be a good approach when the value is chosen in discussion with a domain expert for the data at hand.&lt;/li&gt;
&lt;li&gt;Replace them with the mean or median. This is a decent approach when the data size is small—but it does add bias.&lt;/li&gt;
&lt;li&gt;Replace them with values inferred from other columns.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Predicting Missing Values Using an Algorithm.&lt;/em&gt;&lt;br&gt;
Create a simple regression model to predict the missing values. If the input columns themselves contain missing values, handle that when building the predictive model, either by choosing only the features that have no missing values or by using only the rows that have no missing values in any cell.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Missing Values in a Categorical Column.&lt;/em&gt;&lt;br&gt;
You can take care of this by replacing the missing value with a constant value or the most popular category. This is a good approach when the data size is small but it has the disadvantage of adding bias.&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;&lt;strong&gt;Identifying the correlated Variables.&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is achieved by visualization in the form of a heatmap (correlogram). Finding the correlation between variables helps you know how one variable is related to another in the dataset. The correlation matrix gives a clear picture of how the different variables correlate and helps in understanding their relationships.&lt;/p&gt;
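As a minimal sketch (with hypothetical simulated data, using pandas for the correlation matrix and matplotlib for the heatmap):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt

# Hypothetical numeric dataset
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 65, 100).astype(float),
    "income": rng.normal(50_000, 10_000, 100),
})
df["spending"] = df["income"] * 0.3 + rng.normal(0, 1_000, 100)

corr = df.corr()  # correlation matrix: one value per pair of variables
print(corr.round(2))

# Draw the matrix as a heatmap (correlogram)
fig, ax = plt.subplots()
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr)))
ax.set_xticklabels(corr.columns, rotation=45)
ax.set_yticks(range(len(corr)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im)
fig.savefig("correlation_heatmap.png")
```

In this simulated data, spending is constructed from income, so the heatmap shows a strong income–spending correlation while age stays uncorrelated with both.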

&lt;ol start="5"&gt;
&lt;li&gt;&lt;strong&gt;Choosing the Right Statistical Methods to employ.&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Different statistical tools are used depending on the data, categorical or numerical, the size, the type of variables, and the purpose of analysis. Statistical formulae applied for numerical outputs give fair information, but graphical visuals are more appealing and easier for an observer to interpret. This should help when choosing the right statistical method.&lt;/p&gt;
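For numerical data, a single pandas call already answers most of the standard statistical questions. A sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical numerical data
df = pd.DataFrame({
    "age":    [19, 35, 28, 52],
    "income": [2500, 4000, 3200, 6100],
})

# describe() reports count, mean, standard deviation, min,
# quartiles and max for each numeric column in one call
summary = df.describe()
print(summary)
```

For categorical columns, `df.describe(include="object")` gives counts and the most frequent category instead.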

&lt;ol start="6"&gt;
&lt;li&gt;&lt;strong&gt;Visualizing and Analyzing the Results.&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After completing the analysis, visualize the findings so that they can be properly interpreted. The trends in the spread of the data and the correlations between variables give good insights for making suitable changes to the data parameters. The results you obtain will be specific to the domain of the data you are working on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Data is very important, and it has to be analyzed to gain useful insights that help in making data-informed decisions, insights which would otherwise not be available while the data is still in its raw form. Exploratory Data Analysis digs deep into the data and produces results that can be used to make important decisions.&lt;/p&gt;

&lt;p&gt;Overall, the goal of Exploratory Data Analysis is to gain a deep understanding of your data and to identify patterns that you can investigate further. In doing so, you can identify potential problems or biases and develop hypotheses about what might be causing these patterns.&lt;/p&gt;

&lt;p&gt;This article gave a detailed guide on exploratory data analysis, its importance, the tools that are used and the steps taken while conducting it.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>exploratorydataanalysis</category>
      <category>dataanalysis</category>
    </item>
    <item>
      <title>Python 101: Introduction to Python for Data Science</title>
      <dc:creator>Peter Wainaina </dc:creator>
      <pubDate>Tue, 25 Apr 2023 10:30:00 +0000</pubDate>
      <link>https://dev.to/wainainapeter/python-101-introduction-to-python-for-data-science-379p</link>
      <guid>https://dev.to/wainainapeter/python-101-introduction-to-python-for-data-science-379p</guid>
      <description>&lt;p&gt;Data is the new oil. Not literally, but this means that data is really valuable in this era that we are currently living in. Raw data in itself is not as valuable, but the information extracted from the raw data is very valuable. This extraction of valuable information is done using the discipline of data science.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Data Science&lt;/strong&gt;&lt;/em&gt; is the study of data to extract meaningful insights from it. IBM defined data science in this way:&lt;/p&gt;

&lt;p&gt;Data science combines math and statistics, specialized programming, advanced analytics, artificial intelligence (AI), and machine learning with specific subject matter expertise to uncover actionable insights hidden in an organization’s data. These insights can be used to guide decision making and strategic planning.&lt;/p&gt;

&lt;p&gt;One of the powerful tools that data scientists use is the Python programming language, which you are going to get an overview of in this article. There are other tools and languages like R, which is used to handle, store and analyze data and to do data analysis and statistical modeling; in simple terms, R is an environment for statistical analysis. There is also SAS (Statistical Analysis System), which is a tool for advanced analytics and complex statistical operations.&lt;/p&gt;

&lt;p&gt;The big question is: why Python? Below are a few reasons why data scientists prefer it:&lt;/p&gt;

&lt;p&gt;Python has powerful mathematical and statistical tools for data analysis and exploration. This is one of the primary reasons that data scientists prefer to use Python.&lt;/p&gt;

&lt;p&gt;Data scientists prefer Python for its ability to handle large data sets, and for how easily it incorporates machine learning and modeling thanks to its rich machine learning libraries.&lt;/p&gt;

&lt;p&gt;Python is easy to learn and use, due to its focus on simplicity and readability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Python Libraries for Data Analysis
&lt;/h2&gt;

&lt;p&gt;The following are 20 Python libraries that are essential for data analysis; you need to import them in order to work with them.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NumPy&lt;/li&gt;
&lt;li&gt;Pandas&lt;/li&gt;
&lt;li&gt;Matplotlib&lt;/li&gt;
&lt;li&gt;SciKit-Learn&lt;/li&gt;
&lt;li&gt;TensorFlow&lt;/li&gt;
&lt;li&gt;SciPy&lt;/li&gt;
&lt;li&gt;Keras&lt;/li&gt;
&lt;li&gt;PyTorch&lt;/li&gt;
&lt;li&gt;Scrapy&lt;/li&gt;
&lt;li&gt;BeautifulSoup&lt;/li&gt;
&lt;li&gt;LightGBM&lt;/li&gt;
&lt;li&gt;ELI5&lt;/li&gt;
&lt;li&gt;Theano&lt;/li&gt;
&lt;li&gt;NuPIC&lt;/li&gt;
&lt;li&gt;Ramp&lt;/li&gt;
&lt;li&gt;Pipenv&lt;/li&gt;
&lt;li&gt;Bob&lt;/li&gt;
&lt;li&gt;PyBrain&lt;/li&gt;
&lt;li&gt;Caffe2&lt;/li&gt;
&lt;li&gt;Chainer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s have a look at the first four libraries, along with Seaborn, as they are very important when beginning to learn data analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  NumPy
&lt;/h2&gt;

&lt;p&gt;NumPy is the most fundamental library for scientific computing with Python; it provides fast N-dimensional arrays and is widely used for matrix computations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pandas
&lt;/h2&gt;

&lt;p&gt;Pandas is used for data manipulation and analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Matplotlib
&lt;/h2&gt;

&lt;p&gt;Matplotlib is a powerful library for Data Visualization using histograms, pie charts, and bar graphs.&lt;/p&gt;

&lt;h2&gt;
  
  
  SciKit-Learn
&lt;/h2&gt;

&lt;p&gt;SciKit-Learn is a library that focuses on building machine learning models and provides a range of supervised and unsupervised Machine Learning Algorithms.&lt;/p&gt;

&lt;h2&gt;
  
  
  Seaborn
&lt;/h2&gt;

&lt;p&gt;Seaborn is a data visualization library based on matplotlib and it provides a high-level interface for drawing attractive and informative statistical graphics.&lt;/p&gt;

&lt;p&gt;The above libraries can be imported as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cjuXAi4e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dkgra69h8c88g7vziubx.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cjuXAi4e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dkgra69h8c88g7vziubx.jpg" alt="code for the imports" width="604" height="146"&gt;&lt;/a&gt;&lt;/p&gt;
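In text form, the conventional aliases for those imports look like this (the Seaborn and SciKit-Learn lines are commented so the snippet runs even where those libraries are not installed):

```python
# Conventional aliases for the core data-analysis imports
import numpy as np               # numerical arrays and matrix computations
import pandas as pd              # data manipulation and analysis
import matplotlib.pyplot as plt  # data visualization

# Seaborn and SciKit-Learn are conventionally imported like this
# (uncomment if the libraries are installed):
# import seaborn as sns
# from sklearn.linear_model import LinearRegression

print(np.__version__, pd.__version__)
```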

</description>
      <category>python</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Essential SQL commands that are a must know for a data scientist.</title>
      <dc:creator>Peter Wainaina </dc:creator>
      <pubDate>Tue, 25 Apr 2023 09:27:00 +0000</pubDate>
      <link>https://dev.to/wainainapeter/essential-sql-commands-that-are-a-must-know-for-a-data-scientist-oll</link>
      <guid>https://dev.to/wainainapeter/essential-sql-commands-that-are-a-must-know-for-a-data-scientist-oll</guid>
      <description>&lt;p&gt;SQL, which in full is &lt;strong&gt;Structured Query Language&lt;/strong&gt; is one of the most important tools that a data scientist should be well versed with.&lt;/p&gt;

&lt;p&gt;There are several variations of SQL, including &lt;em&gt;PostgreSQL, MySQL, and Microsoft SQL Server&lt;/em&gt;, all based on standard SQL. Here are some of the benefits data scientists enjoy when they have a good knowledge of SQL:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Data retrieval and filtering: SQL gives data scientists the power to retrieve and filter data from databases using powerful query language features. This makes it easy to extract specific data that is needed for analysis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data manipulation: data scientists can manipulate data by creating tables, adding and modifying data in the tables and deleting data from the tables and even entire databases if need be. This is useful when preparing data for analysis and also when performing data cleaning.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data aggregation and summary: SQL also provides powerful aggregation and summary functions that make it easy to calculate summary statistics like counts, averages, and sums. This is helpful to data scientists when they are analyzing large datasets and when performing exploratory data analysis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Joining data: Most of the time data will exist in multiple tables or data sources, and SQL allows data scientists to join these tables together to have a single view of the data. This is useful when analyzing data from multiple sources or when performing complex data analysis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Integration with other tools: Many data analysis tools like Power BI which is a visualization tool and programming languages like Python and R can interact with SQL databases and this is a plus for a data scientist.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As a summary of the above points, SQL is an important tool for data scientists because it provides a powerful and flexible way for them to work with large or small datasets. SQL helps you extract, manipulate, and analyze data more efficiently and effectively.&lt;/p&gt;

&lt;p&gt;You should be familiar with several SQL commands as a data scientist so that you can seamlessly work with datasets in SQL. Below are eight of the most important commands a data scientist should know and I have used PostgreSQL to visualize the first five of these commands to show how they are written:&lt;/p&gt;

&lt;p&gt;SELECT: Use this command to retrieve data from a database. It enables you to specify the columns you want to retrieve as well as any data filtering requirements.&lt;/p&gt;

&lt;p&gt;The command displayed below selects all values of the columns in the table called customer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_b71JdHQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6deechjtfnr4ccs6rpb9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_b71JdHQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6deechjtfnr4ccs6rpb9.jpg" alt="displayed below selects all values of the columns in the table called customer." width="800" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;WHERE: This command filters records according to specified conditions. For instance, the WHERE command can be used to return only the rows with a certain column value.&lt;/p&gt;

&lt;p&gt;The command displayed below will display the names of all customers whose age is below 20 years.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Wt2G3lvy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1uaabhfh6ypg9itfhz6u.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Wt2G3lvy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1uaabhfh6ypg9itfhz6u.jpg" alt="displayed below will display the names of all customers whose age is below 20 years." width="800" height="496"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;BETWEEN: This command is used to retrieve values within a specified range, inclusive of the endpoints.&lt;/p&gt;

&lt;p&gt;The command below has displayed the name, city and postal code of customers who are between the age of 20 and 40 years.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yL-WHB82--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mr1iiwsoicbsxb4bwdkq.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yL-WHB82--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mr1iiwsoicbsxb4bwdkq.jpg" alt="below has displayed the name, city and postal code of customers who are between the age of 20 and 40 years." width="800" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;LIMIT: This command restricts how many rows a query returns. For instance, you could retrieve only the first 10 records using the LIMIT command.&lt;/p&gt;

&lt;p&gt;The command below is similar to the one in number (3) above but this time instead of displaying all values(rows), there is a limit of only displaying 10 values.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zAlTt35n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wsj4vec4lb8sot7yp7cq.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zAlTt35n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wsj4vec4lb8sot7yp7cq.jpg" alt="instead of displaying all values(rows), there is a limit of only displaying 10 values." width="800" height="487"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ORDER BY: This command sorts the rows a query returns. You can select the column or columns by which to order the data, in ascending or descending order.&lt;/p&gt;

&lt;p&gt;The command below displays the name, age and postal code of customers who reside in the state of California in the order of their age.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fA3Hy_q6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6ro9lmxupefigd04lwp2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fA3Hy_q6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6ro9lmxupefigd04lwp2.jpg" alt="below displays the name, age and postal code of customers who reside in the state of California in the order of their age." width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;JOIN: This command combines rows from two or more tables using a common column. There are various kinds of JOIN, such as INNER JOIN, OUTER JOIN, and CROSS JOIN.&lt;/p&gt;

&lt;p&gt;GROUP BY: This command is used to arrange data into groups according to a particular column or collection of columns. To determine aggregate data, such as the average or sum of a particular column, you could use the GROUP BY command.&lt;/p&gt;

&lt;p&gt;HAVING: Use this command in conjunction with the GROUP BY command to filter groups based on aggregate values. For instance, you could use the HAVING command to return only the groups where a particular column's average exceeds a predetermined threshold.&lt;/p&gt;

&lt;p&gt;These are just a few of the SQL commands that are commonly used by data scientists. Depending on your specific needs and the structure of your database, you may also need to use other commands, such as INSERT, UPDATE, and DELETE.&lt;/p&gt;
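The commands above can be tried without installing a database server, using Python's built-in sqlite3 module on a hypothetical customer table (this is SQLite rather than PostgreSQL, but the syntax of these basics is essentially the same):

```python
import sqlite3

# A tiny in-memory database with a hypothetical customer table
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customer (name TEXT, age INTEGER, city TEXT)")
cur.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [("Alice", 19, "Nairobi"), ("Bob", 35, "Mombasa"),
     ("Carol", 28, "Nairobi"), ("Dan", 52, "Kisumu")],
)

# SELECT with WHERE: customers younger than 20
under20 = cur.execute("SELECT name FROM customer WHERE age < 20").fetchall()
print(under20)  # [('Alice',)]

# BETWEEN, ORDER BY and LIMIT combined
mid_age = cur.execute(
    "SELECT name, age FROM customer "
    "WHERE age BETWEEN 20 AND 40 ORDER BY age LIMIT 10"
).fetchall()
print(mid_age)  # [('Carol', 28), ('Bob', 35)]

# GROUP BY with HAVING: cities with more than one customer
groups = cur.execute(
    "SELECT city, COUNT(*) FROM customer GROUP BY city HAVING COUNT(*) > 1"
).fetchall()
print(groups)  # [('Nairobi', 2)]

conn.close()
```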

&lt;p&gt;This is just an overview of what SQL can do and many resources provide in-depth resources on the same. My recommendation would be a site like &lt;a href="https://www.w3schools.com/sql/"&gt;w3schools&lt;/a&gt; which has well-curated SQL resources. After learning here, you can go to &lt;a href="https://www.hackerrank.com/domains/sql"&gt;HackerRank&lt;/a&gt; and practice what you have learned with fun exercises.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>postgres</category>
      <category>database</category>
      <category>mysql</category>
    </item>
    <item>
      <title>7 Ways of looping through an array in JavaScript.</title>
      <dc:creator>Peter Wainaina </dc:creator>
      <pubDate>Wed, 28 Dec 2022 13:49:10 +0000</pubDate>
      <link>https://dev.to/wainainapeter/7-ways-of-looping-through-an-array-in-javascript-17ep</link>
      <guid>https://dev.to/wainainapeter/7-ways-of-looping-through-an-array-in-javascript-17ep</guid>
      <description>&lt;p&gt;Simply put, an array is an ordered collection of items. In lower-level languages these are items of the same data type stored in contiguous (consecutive) memory locations; JavaScript arrays are more flexible and can hold values of any type.&lt;/p&gt;

&lt;p&gt;More often than not, arrays are used in programming for a number of reasons: storing user information and storing identification attributes, to mention but a few.&lt;/p&gt;

&lt;p&gt;At some point a programmer needs to access this stored data, and below I have highlighted 7 ways to easily do so, each with a JavaScript code snippet:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;&lt;u&gt;Looping through an array using for loop.&lt;/u&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qIfvxmoF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9f3jprq8oab0rt72dnjg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qIfvxmoF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9f3jprq8oab0rt72dnjg.png" alt="Image description" width="800" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;&lt;strong&gt;&lt;u&gt;Looping through an array using while loop.&lt;/u&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--V0V05EMH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cgcdzft9erwx3wg7r8b3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--V0V05EMH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cgcdzft9erwx3wg7r8b3.png" alt="Image description" width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;&lt;strong&gt;&lt;u&gt;Looping through an array using do...while loop.&lt;/u&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FQj3MzI0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5hh6gnsx0p9coiakhow4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FQj3MzI0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5hh6gnsx0p9coiakhow4.png" alt="Image description" width="800" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;&lt;strong&gt;&lt;u&gt;Looping through an array using map() method.&lt;/u&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rN-Nq22f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lx7rdsekhwzho5hfk0y7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rN-Nq22f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lx7rdsekhwzho5hfk0y7.png" alt="Image description" width="800" height="294"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="5"&gt;
&lt;li&gt;&lt;strong&gt;&lt;u&gt;Looping through an array using for...of.&lt;/u&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jD8utQXr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5ir1tjamnwgzubmbjfnp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jD8utQXr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5ir1tjamnwgzubmbjfnp.png" alt="Image description" width="800" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="6"&gt;
&lt;li&gt;&lt;strong&gt;&lt;u&gt;Looping through an array using for...in.&lt;/u&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Cwg7YpOZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7xq7irqenb5rxr7qesps.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Cwg7YpOZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7xq7irqenb5rxr7qesps.png" alt="Image description" width="800" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="7"&gt;
&lt;li&gt;&lt;strong&gt;&lt;u&gt;Looping through an array using forEach() method.&lt;/u&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sSniXEDg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ys73ep9bkh7iobt1qcg2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sSniXEDg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ys73ep9bkh7iobt1qcg2.png" alt="Image description" width="800" height="348"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I hope the article was helpful. Happy coding!&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>javascript</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Setting up and configuring the TypeScript Compiler(simplified).</title>
      <dc:creator>Peter Wainaina </dc:creator>
      <pubDate>Sun, 25 Dec 2022 13:59:02 +0000</pubDate>
      <link>https://dev.to/wainainapeter/setting-up-and-configuring-the-typescript-compilersimplified-1ce6</link>
      <guid>https://dev.to/wainainapeter/setting-up-and-configuring-the-typescript-compilersimplified-1ce6</guid>
      <description>&lt;p&gt;TypeScript is essentially JavaScript with type checking. Web browsers don't 'recognize' TypeScript, which makes it necessary for TypeScript code to be compiled into JavaScript, a process called &lt;em&gt;&lt;strong&gt;transpilation&lt;/strong&gt;&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Setting up the TypeScript compiler.&lt;/u&gt;&lt;/strong&gt;&lt;br&gt;
Setup is simply a matter of creating a configuration file for the TypeScript compiler.&lt;br&gt;
In a terminal (in the same directory as your TypeScript project), run &lt;em&gt;tsc --init&lt;/em&gt; and voilà, a file named tsconfig.json appears in your project folder.&lt;br&gt;
&lt;u&gt;&lt;strong&gt;Configuring the TypeScript compiler.&lt;/strong&gt;&lt;/u&gt;&lt;br&gt;
The generated tsconfig.json contains quite a number of options, which may be a bit intimidating for a novice, as it was in my case, but fear not! You don't need to know what each and every one of them does, at least not for now. If you are curious anyway, each option has a comment next to it explaining what it does, which is pretty handy if you ask me.&lt;br&gt;
There are a few changes you need to make to the tsconfig.json file so that your TypeScript code compiles efficiently:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In the /&lt;em&gt;Module&lt;/em&gt;/ section, uncomment rootDir (Ctrl + /) and set its path to "./src". This means you have to create a folder named src, as indicated in the path, and move your TypeScript files into it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---7fDROj9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m4pux4h0en2z6qjio8fv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---7fDROj9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m4pux4h0en2z6qjio8fv.png" alt="Image description" width="650" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;In the /&lt;em&gt;Emit&lt;/em&gt;/ section, uncomment outDir (this specifies the directory that will contain the compiled JavaScript files) and set its path to "./dist" (the distributable folder). This is where the JavaScript files will be stored after being compiled by the TypeScript compiler.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AYrl6iNy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xj6tcdk2znl8k8ckl2td.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AYrl6iNy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xj6tcdk2znl8k8ckl2td.png" alt="Image description" width="604" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Still in the /&lt;em&gt;Emit&lt;/em&gt;/ section, uncomment "removeComments", which is just below outDir. This makes the TypeScript compiler strip all comments from the TypeScript code, keeping the generated JavaScript shorter.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JBGQu5ab--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2axh5ut20w5arvj9ojf4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JBGQu5ab--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2axh5ut20w5arvj9ojf4.png" alt="Image description" width="680" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;Still in the /&lt;em&gt;Emit&lt;/em&gt;/ section, uncomment "noEmitOnError". With this option, the TypeScript compiler will not generate any JavaScript if the TypeScript code contains errors. This is very helpful, as your compiled code will almost always be 'clean code'.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FHjMApKw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ktzsp2xzild4bskp7c6a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FHjMApKw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ktzsp2xzild4bskp7c6a.png" alt="Image description" width="664" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These few tweaks will get you started writing and compiling your TypeScript code. I hope this was helpful. Happy coding!&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>typescript</category>
    </item>
  </channel>
</rss>
