<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Emily</title>
    <description>The latest articles on DEV Community by Emily (@emilyngahu).</description>
    <link>https://dev.to/emilyngahu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1026151%2F81bbb622-2d2c-47fb-82e0-cf085ec7566e.png</url>
      <title>DEV Community: Emily</title>
      <link>https://dev.to/emilyngahu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/emilyngahu"/>
    <language>en</language>
    <item>
      <title>Introduction to Data Version Control</title>
      <dc:creator>Emily</dc:creator>
      <pubDate>Mon, 27 Mar 2023 15:35:24 +0000</pubDate>
      <link>https://dev.to/emilyngahu/introduction-to-data-version-control-2jjk</link>
      <guid>https://dev.to/emilyngahu/introduction-to-data-version-control-2jjk</guid>
<description>&lt;p&gt;To understand data version control, let's first get a general idea of what version control is. Imagine a company whose employees work remotely all over the continent. At some point these employees will need to work together on the same project, and the company faces the challenge of enabling collaboration among workers located in different places but working on the same project.&lt;/p&gt;

&lt;p&gt;Another issue is the number of versions needed to complete a project. Since a project is not completed in a single version, how will the employees update the project, or see the updated versions and where exactly the changes have been made? A version control system takes care of the collaboration between employees by storing the different versions.&lt;/p&gt;

&lt;p&gt;Version control is the practice of tracking and managing changes to software code. Version control systems are software tools that help software teams manage changes to source code over time. A version control system keeps track of all file modifications, so developers can review, compare, and undo changes made to a file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Examples of version control systems on the market:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Git (commonly hosted on GitHub) - the most widely used&lt;/li&gt;
&lt;li&gt;GitLab&lt;/li&gt;
&lt;li&gt;Perforce&lt;/li&gt;
&lt;li&gt;Beanstalk&lt;/li&gt;
&lt;li&gt;AWS CodeCommit&lt;/li&gt;
&lt;li&gt;Apache Subversion&lt;/li&gt;
&lt;li&gt;Mercurial, etc.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now that we have an idea of what version control is, let's narrow down to data version control.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Data Version Control?
&lt;/h2&gt;

&lt;p&gt;Similar to how version control systems manage changes to code files, &lt;em&gt;data version control is a system for managing changes to data files.&lt;/em&gt; Data scientists and machine learning engineers can work together on data projects, manage changes to data files, and replicate data-driven experiments using a data version control tool.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**Advantages of Data Version Control
 **
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;Data version control allows you to track changes to your data files over time, and keep a record of the exact data files used in each version of the project.&lt;/li&gt;
&lt;li&gt;Data version control allows multiple data scientists and machine learning engineers to work on the same project, share data files, and collaborate on experiments. DVC also provides tools for resolving conflicts when multiple people make changes to the same data file.&lt;/li&gt;
&lt;li&gt;Data version control provides a scalable way to manage large data sets, by allowing you to store data files in cloud storage systems. This makes it easier to work with large data sets without running into storage limitations on your local machine.&lt;/li&gt;
&lt;li&gt;Data version control allows you to reuse data files across multiple versions of the project, which can save time and reduce the amount of data processing required.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Git, together with GitHub, is the most widely used version control setup: it allows data scientists to work on the same project and manage their changes through branches, commits, and merges.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reasons why GitHub is widely used over other version control platforms:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;1. GitHub supports open-source projects and is free for them.&lt;/p&gt;

&lt;p&gt;2. GitHub has a large community of developers who share their code and contribute to open-source projects.&lt;/p&gt;

&lt;p&gt;3. GitHub hosts your code.&lt;/p&gt;

&lt;p&gt;4. GitHub makes it easy to collaborate with others on projects. You can easily share your code with other developers, and they can make contributions or suggest changes using pull requests.&lt;/p&gt;

&lt;p&gt;5. GitHub integrates with many other tools, such as CI/CD pipelines, code analysis tools, and project management tools.&lt;/p&gt;

&lt;p&gt;In this article, I will give an introduction to using Git and GitHub when working on a data science project.&lt;/p&gt;

&lt;p&gt;First, you must have downloaded and configured Git (using git config). You must also have created a GitHub account.&lt;/p&gt;

&lt;h2&gt;
  
  
  Steps to follow when pushing code to GitHub
&lt;/h2&gt;

&lt;p&gt;1. On GitHub, create a new repository (click 'New' on the repositories page) and name it according to the project you are working on.&lt;br&gt;
When creating a repository, you should add a short description of your project in the description box and a longer, detailed description in the README file attached to the repository.&lt;/p&gt;

&lt;p&gt;A repository is either public or private. A public repository is accessible to anyone on the internet, while a private repository is only accessible to you and the people you explicitly share access with.&lt;/p&gt;

&lt;p&gt;2. Clone your repository (using &lt;em&gt;git clone&lt;/em&gt; and a link to the repository) to your local machine. Open your Git Bash window and navigate to the directory where you want to store the repository. Use cd to change directory and ls to list all the items in a directory.&lt;/p&gt;

&lt;p&gt;3. Add your code to the repository by creating new files or modifying existing ones in the local copy of the repository.&lt;/p&gt;

&lt;p&gt;4. Stage the files you want to push to the repository by running &lt;em&gt;git add&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;5. Commit the changes using &lt;strong&gt;git commit -m 'commit message'&lt;/strong&gt;.&lt;br&gt;
Replace 'commit message' with a short message describing the changes you made.&lt;/p&gt;

&lt;p&gt;6. Push the changes to GitHub using the &lt;em&gt;git push&lt;/em&gt; command.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Steps to update your code on GitHub&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;1. Make changes to your local code using your preferred editor, e.g. Jupyter Notebook.&lt;/p&gt;

&lt;p&gt;2. Stage the changes by running &lt;em&gt;git add .&lt;/em&gt; (note the trailing period, which stages everything in the current directory).&lt;/p&gt;

&lt;p&gt;3. Commit the changes with &lt;strong&gt;git commit -m 'commit message'&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;4. Push the changes to GitHub using the &lt;em&gt;git push&lt;/em&gt; command.&lt;/p&gt;

&lt;p&gt;Confirm that the changes show up in your GitHub repository.&lt;/p&gt;

&lt;h2&gt;
  
  
  Steps to pull code from GitHub
&lt;/h2&gt;

&lt;p&gt;1.Open your git terminal and navigate to the directory where you want to clone the repository.&lt;/p&gt;

&lt;p&gt;2. Clone the repository using &lt;em&gt;git clone &amp;lt;repository-url&amp;gt;&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;3. Once the repository is cloned, use the &lt;em&gt;git pull&lt;/em&gt; command to fetch the latest changes from the remote repository and merge them into your local copy.&lt;/p&gt;

&lt;p&gt;After pulling the code and working on it, push the changes following the steps described above.&lt;/p&gt;

&lt;p&gt;Here is a git cheat sheet for easy navigation in Git: &lt;a href="https://education.github.com/git-cheat-sheet-education.pdf"&gt;https://education.github.com/git-cheat-sheet-education.pdf&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
This article leans towards Git and GitHub because they are the most commonly used systems; however, one can use any of the systems mentioned in the article. I would encourage readers to research Git and GitHub, as well as the other version control systems, further.&lt;/p&gt;

</description>
      <category>data</category>
      <category>versioncontrol</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Introduction to Sentiment Analysis and Implementation</title>
      <dc:creator>Emily</dc:creator>
      <pubDate>Tue, 21 Mar 2023 12:15:50 +0000</pubDate>
      <link>https://dev.to/emilyngahu/introduction-to-sentiment-analysis-and-implementation-18kp</link>
      <guid>https://dev.to/emilyngahu/introduction-to-sentiment-analysis-and-implementation-18kp</guid>
<description>&lt;p&gt;Sentiment analysis is a domain that tries to understand human emotions through software. If the sentiments are in written form, we can classify them as positive, negative, or neutral.&lt;br&gt;
It is often called opinion mining because we are trying to figure out the opinion or attitude of the customer with respect to a particular product and extract valuable information from it.&lt;/p&gt;

&lt;p&gt;Remember the last time you left a review for a product or a mobile app, or made a textual comment on Twitter or Instagram: the algorithms have most probably already reviewed your comment to extract valuable information.&lt;/p&gt;

&lt;p&gt;A customer plays a very big role in the market; the customer can either make or break your business. Businesses and companies make decisions based on the information extracted from textual data given by the customer or consumer. For example, suppose person A has a company that produces product X, but the product is not selling well in the market. A data scientist in the company will analyze the reviews of the product to find out why it is not selling well and to see the attitude of customers towards the product, so the company can improve on it.&lt;/p&gt;

&lt;p&gt;The information extracted through sentiment analysis can be used to determine market strategy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Applications of sentiment analysis&lt;/strong&gt;&lt;br&gt;
1. Review classification - to know the sentiment behind the many reviews from customers (classify the sentiments as positive, negative, or neutral).&lt;/p&gt;

&lt;p&gt;2. Product review mining - to know which features of the product customers love and/or hate, so as to improve the product.&lt;/p&gt;

&lt;p&gt;In this article we will go through sentiment analysis in Python using machine learning.&lt;/p&gt;

&lt;p&gt;Here is a link to a repository in my GitHub with a project that explains sentiment analysis: &lt;a href="https://github.com/Em-me/twitter-sentiment-analysis"&gt;https://github.com/Em-me/twitter-sentiment-analysis&lt;/a&gt;. You can download the data from &lt;a href="https://www.kaggle.com/datasets/kazanova/sentiment140"&gt;https://www.kaggle.com/datasets/kazanova/sentiment140&lt;/a&gt;&lt;br&gt;
and follow the steps in the GitHub repository.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Side notes for the project and explanation of some of the steps&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Checking for null values&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Checking for null values is an important step in machine learning as missing data can affect the accuracy of your model's predictions. There are several ways to check for null values in machine learning, including:&lt;/p&gt;

&lt;p&gt;Using the &lt;em&gt;isnull()&lt;/em&gt; function: This function returns a Boolean value indicating whether each value in the dataset is null or not. You can then use the sum() function to count the number of null values in each column.&lt;/p&gt;

&lt;p&gt;Using the info() function: This function provides information about the dataset, including the number of non-null values in each column. If the number of non-null values is less than the total number of rows in the dataset, then there are null values present.&lt;/p&gt;

&lt;p&gt;Using visualization tools: Visualizing the dataset can often help identify null values. For example, you can use a heatmap to visualize the null values in the dataset.&lt;/p&gt;

&lt;p&gt;Once you have identified the null values, you can choose to either remove the rows or columns with null values, or impute the null values with an appropriate value, such as the mean or median of the column. The choice will depend on the specifics of your dataset and the problem you are trying to solve.&lt;/p&gt;
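&lt;p&gt;As a sketch, the checks and fixes described above look like this in pandas (the dataframe here is a tiny made-up example, not the project's real Sentiment140 data):&lt;/p&gt;

```python
import pandas as pd
import numpy as np

# A tiny illustrative dataset (hypothetical; the real project uses tweets).
df = pd.DataFrame({
    "text": ["great product", None, "not worth it", "okay I guess"],
    "rating": [5, 3, np.nan, 4],
})

# isnull() flags each cell; sum() counts the nulls per column.
null_counts = df.isnull().sum()
print(null_counts)

# Option 1: drop the rows that contain any null value.
dropped = df.dropna()

# Option 2: impute a numeric column with its mean.
filled = df.copy()
filled["rating"] = filled["rating"].fillna(filled["rating"].mean())
```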

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The project assesses Twitter sentiments, so we have to drop the columns which are not associated with the sentiments (and remain with the text column).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data processing&lt;br&gt;
Data processing is an essential step in sentiment analysis, which involves analyzing the subjective information in text data.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Text cleaning: This step involves removing unnecessary elements from the text data such as special characters, punctuation, stop words, and numbers. Text cleaning also involves converting all the text to lowercase, removing any HTML tags, and reducing words to their root forms by stemming.&lt;/p&gt;

&lt;p&gt;Tokenization: Tokenization is the process of splitting the text into smaller chunks called tokens. Each token represents a single word or a group of words that convey a particular meaning.&lt;/p&gt;
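&lt;p&gt;A minimal sketch of cleaning and tokenization in plain Python (the stop-word list here is a tiny illustrative one; the project itself may use a library such as nltk):&lt;/p&gt;

```python
import re

# Tiny illustrative stop-word list (real projects use a fuller list).
STOP_WORDS = {"a", "an", "the", "is", "this", "of"}

def clean_text(text):
    """Lowercase the text and keep letters and spaces only."""
    text = text.lower()
    return re.sub(r"[^a-z\s]", " ", text)

def tokenize(text):
    """Split cleaned text into word tokens, dropping stop words."""
    tokens = clean_text(text).split()
    return [t for t in tokens if t not in STOP_WORDS]

tokens = tokenize("This is the BEST phone of 2023!!")
print(tokens)  # ['best', 'phone']
```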

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Calculating the polarity of the text data&lt;br&gt;
This involves determining the overall sentiment of a piece of text as positive, negative, or neutral.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Word cloud&lt;br&gt;&lt;br&gt;
A word cloud is a graphical representation of text data, where the size of each word is proportional to its frequency in the text. Word clouds are often used in sentiment analysis to visualize the most commonly used words in the text and to identify the overall sentiment of the text.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Bigram model &lt;br&gt;
A bigram model is a type of language model that analyzes the frequency of occurrence of pairs of words (bigrams) in a piece of text. In sentiment analysis, bigram models can be used to identify common phrases or expressions that are associated with positive or negative sentiments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Building the model &lt;br&gt;
Splitting the data into training and testing subsets.&lt;br&gt;
A typical train/test split would be to use 70% of the data for training and 30% of the data for testing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Testing/evaluating the model &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
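&lt;p&gt;The bigram idea above can be sketched in a few lines of plain Python, counting adjacent word pairs with the standard library (the token list is hypothetical):&lt;/p&gt;

```python
from collections import Counter

def bigrams(tokens):
    """Return the list of adjacent word pairs in a token list."""
    return list(zip(tokens, tokens[1:]))

tokens = "not good not good at all".split()
bigram_counts = Counter(bigrams(tokens))
print(bigram_counts.most_common(1))  # the phrase 'not good' appears twice
```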

&lt;p&gt;&lt;strong&gt;Metrics&lt;/strong&gt;&lt;br&gt;
In this section, I'll discuss common metrics used to evaluate models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Classification metrics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When performing classification predictions, there are four types of outcomes that can occur.&lt;/p&gt;

&lt;p&gt;True positives are when you predict an observation belongs to a class and it actually does belong to that class.&lt;/p&gt;

&lt;p&gt;True negatives are when you predict an observation does not belong to a class and it actually does not belong to that class.&lt;/p&gt;

&lt;p&gt;False positives occur when you predict an observation belongs to a class when in reality it does not.&lt;/p&gt;

&lt;p&gt;False negatives occur when you predict an observation does not belong to a class when in fact it does.&lt;/p&gt;

&lt;p&gt;These four outcomes are often plotted on a confusion matrix as shown in the project in the repository above.&lt;/p&gt;
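&lt;p&gt;As a small sketch, the four outcomes can be counted directly from hypothetical predicted and actual labels (1 = positive sentiment, 0 = negative):&lt;/p&gt;

```python
# Hypothetical labels, purely for illustration.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)

# Accuracy is the share of correct predictions.
accuracy = (tp + tn) / len(actual)
print(tp, tn, fp, fn, accuracy)  # 3 3 1 1 0.75
```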

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  **conclusion **
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In this article, we discussed using machine learning models to extract information from textual data. This knowledge may then be used to inform business choices, such as the direction of the company or even investment plans. Then, using sentiment analysis methods, we investigated the operation of these machine learning models and the information that might be obtained from such textual data.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Essential SQL Commands for Data Science</title>
      <dc:creator>Emily</dc:creator>
      <pubDate>Mon, 13 Mar 2023 15:37:48 +0000</pubDate>
      <link>https://dev.to/emilyngahu/essential-sql-commands-for-data-science-2111</link>
      <guid>https://dev.to/emilyngahu/essential-sql-commands-for-data-science-2111</guid>
<description>&lt;p&gt;As a data analyst, one uses loads of data in order to make informed decisions. Often, the data lives in an SQL database; follow the link below for an introduction to SQL (&lt;a href="https://dev.to/emme_42/introduction-to-sql-for-data-analysis-3fj7"&gt;https://dev.to/emme_42/introduction-to-sql-for-data-analysis-3fj7&lt;/a&gt;).&lt;br&gt;
Since data is often stored in an SQL database, one ought to understand the SQL query commands. This article will take you through the essential SQL commands for data science.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**Data definition** 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Data definition commands are used to create (define) data structures such as tables, indexes, and clusters, e.g.:&lt;br&gt;
• CREATE databases, tables&lt;br&gt;
• ALTER databases, tables&lt;br&gt;
• DROP tables&lt;/p&gt;

&lt;p&gt;Although most of the time the client gives you data in an already created database, it is essential to know how to create databases. Databases are created using the CREATE statement;&lt;br&gt;
for example, in MySQL:&lt;br&gt;
create database database_name;&lt;/p&gt;

&lt;p&gt;create table department (&lt;br&gt;
    list the columns and their data types );&lt;/p&gt;

&lt;p&gt;DROP TABLE - used to delete a certain table if it is not being used or not needed for analysis.&lt;/p&gt;
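&lt;p&gt;Here is a runnable sketch of CREATE TABLE and DROP TABLE using Python's built-in sqlite3 module (SQLite creates the database itself when you connect, so CREATE DATABASE is not needed; the department table is illustrative):&lt;/p&gt;

```python
import sqlite3

# An in-memory database, so nothing touches disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE TABLE: list the columns and their data types.
cur.execute("CREATE TABLE department (dept_id INTEGER, dept_name TEXT)")

# The table now appears in the schema catalogue.
tables = [row[0] for row in cur.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
print(tables)

# DROP TABLE removes it again.
cur.execute("DROP TABLE department")
tables_after = [row[0] for row in cur.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
print(tables_after)
```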

&lt;p&gt;&lt;strong&gt;Data Manipulation&lt;/strong&gt;&lt;br&gt;
The data manipulation language is used to access and update data; it is not concerned with how the data is represented (of course, the data manipulation language must be aware of how data is represented, and reflects this in the constructs it supports), i.e.:&lt;/p&gt;

&lt;p&gt;• SELECT - extracts data from databases - to get all the content from a specific table in the database:&lt;br&gt;
  &lt;em&gt;select *&lt;br&gt;
  from table_name;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;• UPDATE - updates data in a database:&lt;br&gt;
      &lt;em&gt;update table_name&lt;br&gt;
      set column1=value1;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;• DELETE - deletes data from tables:&lt;br&gt;
     &lt;em&gt;delete from table_name;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;• INSERT INTO - inserts data into tables:&lt;br&gt;
  &lt;em&gt;insert into table_name (&lt;br&gt;
   column1, column2)&lt;br&gt;
    values (value1, value2);&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;ALTER TABLE - used to add, delete, or modify columns in an existing table; also used to add and drop various constraints on an existing table.&lt;/p&gt;

&lt;p&gt;ALTER TABLE table_name&lt;br&gt;
   ADD column_name datatype;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sorting on some attribute / data retrieval with simple conditions&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;WHERE
This is used to retrieve specific entries that meet specific conditions.
For example, in a dataset of employees, if we want to know which employees earn 50,000 or more, use:
&lt;em&gt;select employee_salary
from employee
where employee_salary &amp;gt;= 50000;&lt;/em&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;2. ORDER BY&lt;br&gt;
Used to sort records. It sorts the records in ascending order by default; to sort the records in descending order, use the DESC keyword.&lt;br&gt;
  &lt;em&gt;select employee_salary&lt;br&gt;
   from employee&lt;br&gt;
   where employee_salary &amp;gt;= 50000&lt;br&gt;
   order by employee_salary;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;3. LIMIT&lt;br&gt;
 Used to return a limited number of entries:&lt;br&gt;
   &lt;em&gt;select employee_salary&lt;br&gt;
   from employee&lt;br&gt;
   where employee_salary &amp;gt;= 50000&lt;br&gt;
   order by employee_salary&lt;br&gt;
   limit 10;&lt;/em&gt;&lt;br&gt;
- returns only the first 10 entries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AGGREGATIONS&lt;/strong&gt;&lt;br&gt;
Used to get a summary of the dataset to get insights.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;GROUP BY
The GROUP BY statement groups rows that have the same values into summary rows.
Syntax:
&lt;em&gt;select sum(column_name)
from table_name
where (condition)
group by column_name;&lt;/em&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;2. COUNT&lt;br&gt;
 It returns the number of rows that match a specified criterion:&lt;br&gt;
   &lt;em&gt;select count(column_name)&lt;br&gt;
   from table_name&lt;br&gt;
   where condition;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;JOINS&lt;/strong&gt;&lt;br&gt;
This command is used to combine data from two or more tables in a database.&lt;br&gt;
 Examples:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Inner join&lt;br&gt;
It returns only the rows where there is a match between columns in both tables.&lt;br&gt;
SELECT column_name(s)&lt;br&gt;
FROM table1&lt;br&gt;
INNER JOIN table2&lt;br&gt;
ON table1.column_name = table2.column_name;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Left join&lt;br&gt;
It returns all the rows from the left table and the matching rows from the right table. If there is no match in the right table, the result will have null values.&lt;br&gt;
SELECT column_name(s)&lt;br&gt;
FROM table1&lt;br&gt;
LEFT JOIN table2&lt;br&gt;
ON table1.column_name = table2.column_name;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Right join&lt;br&gt;
It returns all the records from the right table, and the matching records from the left table. If there is no match in the left table, the result will have null values.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;SELECT column_name(s)&lt;br&gt;
   FROM table1&lt;br&gt;
   RIGHT JOIN table2&lt;br&gt;
   ON table1.column_name = table2.column_name;&lt;/p&gt;

&lt;p&gt;4. Outer join&lt;br&gt;
Used to return all the rows from one or both tables.&lt;/p&gt;

&lt;p&gt;SELECT column_name(s)&lt;br&gt;
   FROM table1&lt;br&gt;
   FULL OUTER JOIN table2&lt;br&gt;
   ON table1.column_name = table2.column_name&lt;br&gt;
   WHERE condition;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;AVG function&lt;br&gt;
It returns the average value of a numeric column.&lt;br&gt;
SELECT AVG(column_name)&lt;br&gt;
FROM table_name&lt;br&gt;
WHERE condition;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;HAVING clause&lt;br&gt;
The HAVING clause was added to SQL because the WHERE keyword cannot be used with aggregate functions.&lt;br&gt;
SELECT column_name(s)&lt;br&gt;
FROM table_name&lt;br&gt;
WHERE condition&lt;br&gt;
GROUP BY column_name(s)&lt;br&gt;
HAVING condition&lt;br&gt;
ORDER BY column_name(s);&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;SUM function&lt;br&gt;
It returns the total sum of a numeric column.&lt;br&gt;
SELECT SUM(column_name)&lt;br&gt;
FROM table_name&lt;br&gt;
WHERE condition;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
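&lt;p&gt;AVG, SUM and HAVING together on the same illustrative employee table:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employee (dept TEXT, salary INTEGER)")
cur.executemany("INSERT INTO employee VALUES (?, ?)",
                [("IT", 50000), ("IT", 70000), ("HR", 40000)])

# AVG over the whole table.
avg_all = cur.execute("SELECT AVG(salary) FROM employee").fetchone()[0]

# HAVING filters the grouped rows, where WHERE cannot use aggregates.
big_depts = cur.execute(
    "SELECT dept, SUM(salary) FROM employee "
    "GROUP BY dept HAVING SUM(salary) > 80000"
).fetchall()
print(avg_all, big_depts)  # only IT exceeds the 80000 payroll threshold
```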

&lt;h2&gt;
  
  
  CHANGING DATA TYPES
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;CAST&lt;br&gt;
It converts a value (of any type) into a specified datatype.&lt;br&gt;
CAST(expression AS datatype(length))&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ROUND&lt;br&gt;
It rounds a number to a specified number of decimal places.&lt;br&gt;
ROUND(number, decimals)&lt;br&gt;
(In SQL Server, an optional third argument makes it truncate instead of round.)&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
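&lt;p&gt;Both functions in a quick sqlite3 session:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CAST a text value to an integer, and ROUND to 2 decimal places.
casted = cur.execute("SELECT CAST('42' AS INTEGER)").fetchone()[0]
rounded = cur.execute("SELECT ROUND(3.14159, 2)").fetchone()[0]
print(casted, rounded)  # 42 3.14
```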

&lt;h2&gt;
  
  
  WINDOW FUNCTIONS
&lt;/h2&gt;

&lt;p&gt;A window function performs a calculation across a set of table rows that are somehow related to the current row.&lt;br&gt;
Here are some examples:&lt;/p&gt;

&lt;p&gt;1. ROW_NUMBER()&lt;br&gt;
This function assigns a unique sequential number to each row within a partition.&lt;br&gt;
    ROW_NUMBER() OVER (&lt;br&gt;
    [PARTITION BY expr1, expr2, ...]&lt;br&gt;
    ORDER BY expr1 [ASC | DESC], expr2, ...&lt;br&gt;
    )&lt;br&gt;
Window functions are a bit complex, so I urge you to research these commands further.&lt;br&gt;
These commands are used in all data analysis processes, so if you want to perfect your analysis, practice them using open-source databases.&lt;/p&gt;
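&lt;p&gt;As a sketch, ROW_NUMBER() numbering employees by salary within each department (window functions require SQLite 3.25 or newer, which ships with recent Python versions; the table is illustrative):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employee (dept TEXT, name TEXT, salary INTEGER)")
cur.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                [("IT", "Ann", 70000), ("IT", "Ben", 50000),
                 ("HR", "Cy", 40000)])

# Each department gets its own 1, 2, ... numbering, highest salary first.
rows = cur.execute(
    "SELECT dept, name, "
    "ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn "
    "FROM employee ORDER BY dept, rn"
).fetchall()
print(rows)
```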

</description>
      <category>datascience</category>
      <category>sql</category>
    </item>
    <item>
      <title>THE ULTIMATE GUIDE FOR EXPLORATORY DATA ANALYSIS</title>
      <dc:creator>Emily</dc:creator>
      <pubDate>Sat, 25 Feb 2023 20:28:17 +0000</pubDate>
      <link>https://dev.to/emilyngahu/the-ultimate-guide-for-exploratory-data-analysis-3fnn</link>
      <guid>https://dev.to/emilyngahu/the-ultimate-guide-for-exploratory-data-analysis-3fnn</guid>
<description>&lt;p&gt;Hi data enthusiast!&lt;br&gt;
Exploratory data analysis (EDA) is the first basic step performed on data by a data analyst or data scientist.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**What is exploratory data analysis?**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is basically a process used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods.&lt;/p&gt;

&lt;p&gt;It can help determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns and check assumptions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        **Importance of EDA**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;1. Identify patterns and relationships: EDA helps to identify patterns and relationships between different variables in the data. This can help to generate hypotheses and guide further analysis.&lt;br&gt;
2. Detect outliers and errors: EDA can help to identify outliers and errors in the data, which can then be corrected or removed before further analysis.&lt;br&gt;
3. Assess data quality: EDA can help to assess the quality of the data and determine if it is suitable for analysis. This includes checking for missing values, inconsistencies, and data formatting issues.&lt;br&gt;
4. Understand the data distribution: EDA can help to understand the distribution of the data and its characteristics such as mean, median, and standard deviation. This can help to identify potential biases in the data.&lt;br&gt;
5. Communicate insights: EDA can help to communicate insights and findings to others in a clear and concise manner. This can be especially important in interdisciplinary teams where people may have different levels of technical expertise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Types of EDA&lt;/strong&gt;&lt;br&gt;
1. Univariate - this analysis involves examining the distribution and characteristics of a single variable.&lt;br&gt;
2. Bivariate - this analysis involves examining the relationship between two variables.&lt;br&gt;
3. Multivariate - this analysis involves examining the relationships between more than two variables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Techniques for EDA&lt;/strong&gt;&lt;br&gt;
The most common techniques used for EDA are:&lt;br&gt;
1. Box plots&lt;br&gt;
2. Histograms&lt;br&gt;
3. Bar charts&lt;br&gt;
4. Line graphs&lt;br&gt;
5. Stem-and-leaf plots&lt;br&gt;
6. Pareto charts&lt;br&gt;
7. Heat maps&lt;br&gt;
8. Scatter plots&lt;/p&gt;

&lt;p&gt;Exploratory data analysis can be done using several tools, e.g. R and Python.&lt;br&gt;
In this guide we will focus on EDA in Python.&lt;/p&gt;

&lt;p&gt;Python is a popular programming language used for EDA due to its rich ecosystem of libraries and tools. Here are the basic steps for EDA in Python:&lt;/p&gt;

&lt;p&gt;Importing Libraries: The first step is to import the necessary libraries such as pandas, numpy, matplotlib, seaborn, etc.&lt;/p&gt;

&lt;p&gt;Loading Data: The next step is to load the data into a pandas dataframe.&lt;/p&gt;

&lt;p&gt;Data Exploration: Once the data is loaded, you can start exploring the data by using various pandas functions like head(), tail(), describe(), info() etc.&lt;/p&gt;

&lt;p&gt;Data Cleaning: This step involves identifying and handling missing values, removing duplicates, handling outliers, and converting data types if necessary.&lt;/p&gt;

&lt;p&gt;Data Visualization: Data visualization is a powerful tool for EDA, and Python offers several libraries like matplotlib, seaborn, and plotly for creating visualizations. You can create different types of plots like scatter plots, histograms, bar plots, etc.&lt;/p&gt;

&lt;p&gt;Correlation Analysis: Correlation analysis helps you identify relationships between variables. You can use pandas functions like corr() and heatmap from seaborn library for this purpose.&lt;/p&gt;

&lt;p&gt;Feature Engineering: Feature engineering involves creating new features from the existing ones to improve the model's performance. You can use pandas functions like apply(), map() and lambda functions to create new features.&lt;/p&gt;
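&lt;p&gt;The exploration steps above can be sketched in a few lines of pandas; the tiny dataframe here is a made-up stand-in, but the same calls work on any loaded CSV:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical wine-quality-style data, purely for illustration.
df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.0, 11.2],
    "quality": [5, 5, 6, 7],
})

print(df.head())              # first rows
print(df.shape)               # (rows, columns)
print(df.describe())          # summary statistics
print(df["quality"].unique()) # distinct values of the target variable
print(df.corr())              # pairwise correlations
```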

&lt;p&gt;Conclusion: Finally, you can draw conclusions and insights from your analysis and share your findings with others.&lt;br&gt;
Here is a little 'cheat sheet' to help you get started.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# import pandas, numpy, matplotlib, seaborn and the data
# .head() - first five observations
# .tail() - last five observations
# .shape - number of rows and columns
# .info() - columns and their corresponding data types
# .describe() - summary statistics
# .quality.unique() - insights from the dependent variable
# .corr() - find correlations
# annot=True - show correlations in grid cells
# boxplot - check minimum, quartiles, maximum
# check linearity - distribution graph
# pairplot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>watercooler</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Introduction to SQL for data analysis</title>
      <dc:creator>Emily</dc:creator>
      <pubDate>Sat, 18 Feb 2023 19:23:04 +0000</pubDate>
      <link>https://dev.to/emilyngahu/introduction-to-sql-for-data-analysis-3fj7</link>
      <guid>https://dev.to/emilyngahu/introduction-to-sql-for-data-analysis-3fj7</guid>
<description>&lt;p&gt;&lt;strong&gt;What is SQL?&lt;/strong&gt;&lt;br&gt;
SQL (Structured Query Language) is a programming language used for managing and analyzing relational databases. SQL stores data in a table format. First, we need to know what databases are, and define relational databases. A database is an organized collection of data stored electronically, for example on a hard drive. A relational database is a type of database that organizes data into one or more tables, with each table consisting of a set of rows and columns. These tables can be related to each other through the use of common columns or fields, which allows for efficient storage and retrieval of data.&lt;/p&gt;

&lt;p&gt;SQL is  widely used in data analysis as it allows users to extract and manipulate data from databases with ease. It is used in accessing, cleaning, and analyzing data that's stored in databases.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  **Advantages of SQL**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;It requires relatively little programming knowledge to get started.&lt;br&gt;
It is flexible - it can be used with other programming languages and on almost any device.&lt;br&gt;
It uses simple, English-like commands for complex procedures.&lt;br&gt;
It processes queries at high speed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**Disadvantages of SQL**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Some database interfaces can be complicated.&lt;br&gt;
Some SQL database systems are costly to license and operate.&lt;br&gt;
It can be insecure if not configured carefully.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Commands used in SQL&lt;/strong&gt;&lt;br&gt;
Here are some key commands that you'll need to know to get started with SQL for data analysis:&lt;br&gt;
1. CREATE - used to create databases and tables.&lt;br&gt;
2. SELECT - used to extract data from existing databases.&lt;br&gt;
3. INSERT - used to add tables or values to existing databases.&lt;br&gt;
4. UPDATE - used to change tables or values that already exist in the database.&lt;br&gt;
5. DROP - used to remove a table definition and all the data from database tables.&lt;br&gt;
6. DELETE - used to delete existing records from a table.&lt;/p&gt;

&lt;p&gt;There are many commands in SQL and it's sometimes difficult to memorize all of them but having an SQL cheat sheet is enough in most cases to get by and thrive when using the language for SQL data analysis.&lt;/p&gt;
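&lt;p&gt;To see all six commands end to end, here is a small sketch using Python's built-in sqlite3 module on a throwaway in-memory database (the product table is illustrative):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE product (name TEXT, price INTEGER)")     # CREATE
cur.execute("INSERT INTO product VALUES ('soap', 30)")             # INSERT
cur.execute("INSERT INTO product VALUES ('salt', 20)")
cur.execute("UPDATE product SET price = 25 WHERE name = 'salt'")   # UPDATE
cur.execute("DELETE FROM product WHERE name = 'soap'")             # DELETE
rows = cur.execute("SELECT * FROM product").fetchall()             # SELECT
print(rows)
cur.execute("DROP TABLE product")                                  # DROP
```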

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
The article gives an overview of SQL and the way it facilitates the analysis of data.&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>psychology</category>
      <category>healthydebate</category>
    </item>
  </channel>
</rss>
