<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Emily</title>
    <description>The latest articles on DEV Community by Emily (@emilyngahu).</description>
    <link>https://dev.to/emilyngahu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1026151%2F81bbb622-2d2c-47fb-82e0-cf085ec7566e.png</url>
      <title>DEV Community: Emily</title>
      <link>https://dev.to/emilyngahu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/emilyngahu"/>
    <language>en</language>
    <item>
      <title>Introduction to Data Version Control</title>
      <dc:creator>Emily</dc:creator>
      <pubDate>Mon, 27 Mar 2023 15:35:24 +0000</pubDate>
      <link>https://dev.to/emilyngahu/introduction-to-data-version-control-2jjk</link>
      <guid>https://dev.to/emilyngahu/introduction-to-data-version-control-2jjk</guid>
<description>&lt;p&gt;To understand data version control, let's first get a general idea of what version control is. Imagine a company whose employees work remotely all over the continent. At some point these employees will need to work together on the same project, and the company faces the challenge of enabling collaboration among workers located in different places but working on the same project.&lt;/p&gt;

&lt;p&gt;Another issue is the number of versions needed to complete a project. Since a project is not completed in a single version, how will the employees update the project, or see the updated versions and where exactly the changes have been made? A version control system takes care of the collaboration between employees by storing the different versions.&lt;/p&gt;

&lt;p&gt;Version control is the practice of tracking and managing changes to software code. Version control systems are software tools that help software teams manage changes to source code over time. A version control system keeps track of all file modifications, so developers can review, compare, and undo changes made to a file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Examples of version control systems on the market:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Git (commonly hosted on GitHub) - the most widely used&lt;/li&gt;
&lt;li&gt;GitLab&lt;/li&gt;
&lt;li&gt;Perforce&lt;/li&gt;
&lt;li&gt;Beanstalk&lt;/li&gt;
&lt;li&gt;AWS CodeCommit&lt;/li&gt;
&lt;li&gt;Apache Subversion&lt;/li&gt;
&lt;li&gt;Mercurial, etc.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now that we have an idea of what version control is, let's narrow down to data version control.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Data Version Control?
&lt;/h2&gt;

&lt;p&gt;Similar to how version control systems manage changes to code files, &lt;em&gt;data version control is a system for managing changes to data files.&lt;/em&gt; Data scientists and machine learning engineers can work together on data projects, manage changes to data files, and replicate data-driven experiments using a data version control tool.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**Advantages of Data Version Control
 **
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;Data version control allows you to track changes to your data files over time, and keep a record of the exact data files used in each version of the project.&lt;/li&gt;
&lt;li&gt;Data version control allows multiple data scientists and machine learning engineers to work on the same project, share data files, and collaborate on experiments. DVC also provides tools for resolving conflicts when multiple people make changes to the same data file.&lt;/li&gt;
&lt;li&gt;Data version control provides a scalable way to manage large data sets, by allowing you to store data files in cloud storage systems. This makes it easier to work with large data sets without running into storage limitations on your local machine.&lt;/li&gt;
&lt;li&gt;Data version control allows you to reuse data files across multiple versions of the project, which can save time and reduce the amount of data processing required.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Git, together with GitHub, is the most widely used version control setup: it allows data scientists to work on the same project and manage their changes through branches, commits, and merges.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reasons why GitHub is widely used over other version control platforms:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;1. GitHub supports open-source projects and is free for them.&lt;/p&gt;

&lt;p&gt;2. GitHub has a large community of developers who share their code and contribute to open-source projects.&lt;/p&gt;

&lt;p&gt;3. GitHub hosts your code.&lt;/p&gt;

&lt;p&gt;4. GitHub makes it easy to collaborate with others on projects. You can easily share your code with other developers, and they can make contributions or suggest changes using pull requests.&lt;/p&gt;

&lt;p&gt;5. GitHub integrates with many other tools, such as CI/CD pipelines, code analysis tools, and project management tools.&lt;/p&gt;

&lt;p&gt;In this article, I will give an introduction to using Git and GitHub when working on a data science project.&lt;/p&gt;

&lt;p&gt;First, you must have downloaded and configured Git (using git config). You must also have created a GitHub account.&lt;/p&gt;

&lt;h2&gt;
  
  
  Steps to follow when pushing code to GitHub
&lt;/h2&gt;

&lt;p&gt;1. On GitHub, create a new repository (click 'New' on the repositories page) and name it according to the project you are working on.&lt;br&gt;
When creating a repository, you should add a short description of your project in the description box and a longer, detailed description in the README file attached to the repository.&lt;/p&gt;

&lt;p&gt;A repository is either public or private. A public repository is accessible to anyone on the internet, while a private repository is only accessible to you and the people you explicitly share access with.&lt;/p&gt;

&lt;p&gt;2. Clone your repository (using &lt;em&gt;git clone&lt;/em&gt; and a link to the repository) to your local machine. Open your Git Bash window and navigate to the directory where you want to store the repository. Use cd to change directory and ls to list all the items in a directory.&lt;/p&gt;

&lt;p&gt;3. Add your code to the repository by creating new files or modifying existing ones in the local copy of the repository.&lt;/p&gt;

&lt;p&gt;4. Stage the files you want to push to the repository by running &lt;em&gt;git add&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;5. Commit the changes using &lt;strong&gt;git commit -m 'commit message'&lt;/strong&gt;.&lt;br&gt;
Replace 'commit message' with a short message describing the changes you made.&lt;/p&gt;

&lt;p&gt;6. Push the changes to GitHub using the &lt;em&gt;git push&lt;/em&gt; command.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Steps to update your code on GitHub&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;1. Make changes to your local code using your preferred editor, e.g. Jupyter Notebook.&lt;/p&gt;

&lt;p&gt;2. Stage the changes by running &lt;em&gt;git add .&lt;/em&gt; (note the trailing period, which stages everything in the current directory).&lt;/p&gt;

&lt;p&gt;3. Commit the changes with &lt;strong&gt;git commit -m 'commit message'&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;4. Push the changes to GitHub using the &lt;em&gt;git push&lt;/em&gt; command.&lt;/p&gt;

&lt;p&gt;Confirm that the changes show up in your GitHub repository.&lt;/p&gt;

&lt;h2&gt;
  
  
  Steps to pull code from GitHub
&lt;/h2&gt;

&lt;p&gt;1.Open your git terminal and navigate to the directory where you want to clone the repository.&lt;/p&gt;

&lt;p&gt;2. Clone the repository using &lt;em&gt;git clone &amp;lt;repository-url&amp;gt;&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;3. Once the repository is cloned, use the &lt;em&gt;git pull&lt;/em&gt; command to fetch the latest changes from the remote repository and merge them into your local copy.&lt;/p&gt;

&lt;p&gt;After pulling the code and working on it, push the changes following the steps described above.&lt;/p&gt;

&lt;p&gt;Here is a git cheat sheet for easy navigation in Git: &lt;a href="https://education.github.com/git-cheat-sheet-education.pdf"&gt;https://education.github.com/git-cheat-sheet-education.pdf&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
This article leans towards Git and GitHub because they are the most commonly used systems; however, one can use any of the systems mentioned in the article. I would encourage readers to research Git and GitHub, as well as the other version control systems, further.&lt;/p&gt;

</description>
      <category>data</category>
      <category>versioncontrol</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Introduction to Sentiment Analysis and Implementation</title>
      <dc:creator>Emily</dc:creator>
      <pubDate>Tue, 21 Mar 2023 12:15:50 +0000</pubDate>
      <link>https://dev.to/emilyngahu/introduction-to-sentiment-analysis-and-implementation-18kp</link>
      <guid>https://dev.to/emilyngahu/introduction-to-sentiment-analysis-and-implementation-18kp</guid>
<description>&lt;p&gt;Sentiment analysis is a domain that tries to understand human emotions through software. If the sentiments are in written form, we can classify them as positive, negative, or neutral.&lt;br&gt;
It is often called opinion mining because we are trying to figure out the opinion or attitude of the customer with respect to a particular product and extract valuable information from it.&lt;/p&gt;

&lt;p&gt;Remember the last time you left a review for a product or a mobile app, or made a textual comment on Twitter or Instagram: the algorithms have most probably already reviewed your comment to extract valuable information.&lt;/p&gt;

&lt;p&gt;A customer plays a very big role in the market; the customer can either make or break your business. Businesses and companies make decisions based on the information extracted from textual data given by the customer or consumer. For example, suppose person A has a company that produces product X, but the product is not selling well in the market. A data scientist in the company will analyze the reviews of the product to find out why it is not selling well and to see the attitude of customers towards the product, so the company can improve on it.&lt;/p&gt;

&lt;p&gt;The information extracted through sentiment analysis can be used to determine market strategy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Applications of sentiment analysis&lt;/strong&gt;&lt;br&gt;
1. Review classification - to know the sentiment behind the many reviews from customers (classify the sentiments as positive, negative, or neutral).&lt;/p&gt;

&lt;p&gt;2. Product review mining - to know which features of the product customers love and/or hate, so as to improve the product.&lt;/p&gt;

&lt;p&gt;In this article we will go through sentiment analysis in Python using machine learning.&lt;/p&gt;

&lt;p&gt;Here is a link to a repository in my GitHub with a project that explains sentiment analysis: &lt;a href="https://github.com/Em-me/twitter-sentiment-analysis"&gt;https://github.com/Em-me/twitter-sentiment-analysis&lt;/a&gt;. You can download the data from &lt;a href="https://www.kaggle.com/datasets/kazanova/sentiment140"&gt;https://www.kaggle.com/datasets/kazanova/sentiment140&lt;/a&gt;&lt;br&gt;
and follow the steps in the GitHub repository.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Side notes for the project and explanation of some of the steps&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Checking for null values&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Checking for null values is an important step in machine learning as missing data can affect the accuracy of your model's predictions. There are several ways to check for null values in machine learning, including:&lt;/p&gt;

&lt;p&gt;Using the &lt;em&gt;isnull()&lt;/em&gt; function: This function returns a Boolean value indicating whether each value in the dataset is null or not. You can then use the sum() function to count the number of null values in each column.&lt;/p&gt;

&lt;p&gt;Using the info() function: This function provides information about the dataset, including the number of non-null values in each column. If the number of non-null values is less than the total number of rows in the dataset, then there are null values present.&lt;/p&gt;

&lt;p&gt;Using visualization tools: Visualizing the dataset can often help identify null values. For example, you can use a heatmap to visualize the null values in the dataset.&lt;/p&gt;

&lt;p&gt;Once you have identified the null values, you can choose to either remove the rows or columns with null values, or impute the null values with an appropriate value, such as the mean or median of the column. The choice will depend on the specifics of your dataset and the problem you are trying to solve.&lt;/p&gt;
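&lt;p&gt;As a sketch, the checks and fixes described above look like this in pandas (the dataframe here is a tiny made-up example, not the project's real Sentiment140 data):&lt;/p&gt;

```python
import pandas as pd
import numpy as np

# A tiny illustrative dataset (hypothetical; the real project uses tweets).
df = pd.DataFrame({
    "text": ["great product", None, "not worth it", "okay I guess"],
    "rating": [5, 3, np.nan, 4],
})

# isnull() flags each cell; sum() counts the nulls per column.
null_counts = df.isnull().sum()
print(null_counts)

# Option 1: drop the rows that contain any null value.
dropped = df.dropna()

# Option 2: impute a numeric column with its mean.
filled = df.copy()
filled["rating"] = filled["rating"].fillna(filled["rating"].mean())
```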

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The project assesses Twitter sentiments, so we have to drop the columns which are not associated with the sentiments (and remain with the text column).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data processing&lt;br&gt;
Data processing is an essential step in sentiment analysis, which involves analyzing the subjective information in text data.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Text cleaning: This step involves removing unnecessary elements from the text data such as special characters, punctuation, stop words, and numbers. Text cleaning also involves converting all the text to lowercase, removing any HTML tags, and reducing words to their root forms by stemming.&lt;/p&gt;

&lt;p&gt;Tokenization: Tokenization is the process of splitting the text into smaller chunks called tokens. Each token represents a single word or a group of words that convey a particular meaning.&lt;/p&gt;
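&lt;p&gt;A minimal sketch of cleaning and tokenization in plain Python (the stop-word list here is a tiny illustrative one; the project itself may use a library such as nltk):&lt;/p&gt;

```python
import re

# Tiny illustrative stop-word list (real projects use a fuller list).
STOP_WORDS = {"a", "an", "the", "is", "this", "of"}

def clean_text(text):
    """Lowercase the text and keep letters and spaces only."""
    text = text.lower()
    return re.sub(r"[^a-z\s]", " ", text)

def tokenize(text):
    """Split cleaned text into word tokens, dropping stop words."""
    tokens = clean_text(text).split()
    return [t for t in tokens if t not in STOP_WORDS]

tokens = tokenize("This is the BEST phone of 2023!!")
print(tokens)  # ['best', 'phone']
```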

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Calculating the polarity of the text data&lt;br&gt;
This involves determining the overall sentiment of a piece of text as positive, negative, or neutral.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Word cloud&lt;br&gt;&lt;br&gt;
A word cloud is a graphical representation of text data, where the size of each word is proportional to its frequency in the text. Word clouds are often used in sentiment analysis to visualize the most commonly used words in the text and to identify the overall sentiment of the text.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Bigram model &lt;br&gt;
A bigram model is a type of language model that analyzes the frequency of occurrence of pairs of words (bigrams) in a piece of text. In sentiment analysis, bigram models can be used to identify common phrases or expressions that are associated with positive or negative sentiments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Building the model &lt;br&gt;
Splitting the data into training and testing subsets.&lt;br&gt;
A typical train/test split would be to use 70% of the data for training and 30% of the data for testing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Testing/evaluating the model &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
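&lt;p&gt;The bigram idea above can be sketched in a few lines of plain Python, counting adjacent word pairs with the standard library (the token list is hypothetical):&lt;/p&gt;

```python
from collections import Counter

def bigrams(tokens):
    """Return the list of adjacent word pairs in a token list."""
    return list(zip(tokens, tokens[1:]))

tokens = "not good not good at all".split()
bigram_counts = Counter(bigrams(tokens))
print(bigram_counts.most_common(1))  # the phrase 'not good' appears twice
```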

&lt;p&gt;&lt;strong&gt;Metrics&lt;/strong&gt;&lt;br&gt;
In this section, I'll discuss common metrics used to evaluate models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Classification metrics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When performing classification predictions, there are four types of outcomes that can occur.&lt;/p&gt;

&lt;p&gt;True positives are when you predict an observation belongs to a class and it actually does belong to that class.&lt;/p&gt;

&lt;p&gt;True negatives are when you predict an observation does not belong to a class and it actually does not belong to that class.&lt;/p&gt;

&lt;p&gt;False positives occur when you predict an observation belongs to a class when in reality it does not.&lt;/p&gt;

&lt;p&gt;False negatives occur when you predict an observation does not belong to a class when in fact it does.&lt;/p&gt;

&lt;p&gt;These four outcomes are often plotted on a confusion matrix as shown in the project in the repository above.&lt;/p&gt;
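&lt;p&gt;As a small sketch, the four outcomes can be counted directly from hypothetical predicted and actual labels (1 = positive sentiment, 0 = negative):&lt;/p&gt;

```python
# Hypothetical labels, purely for illustration.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)

# Accuracy is the share of correct predictions.
accuracy = (tp + tn) / len(actual)
print(tp, tn, fp, fn, accuracy)  # 3 3 1 1 0.75
```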

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  **conclusion **
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In this article, we discussed using machine learning models to extract information from textual data. This knowledge may then be used to inform business choices, such as the direction of the company or even investment plans. Then, using sentiment analysis methods, we investigated the operation of these machine learning models and the information that might be obtained from such textual data.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Essential SQL Commands for Data Science</title>
      <dc:creator>Emily</dc:creator>
      <pubDate>Mon, 13 Mar 2023 15:37:48 +0000</pubDate>
      <link>https://dev.to/emilyngahu/essential-sql-commands-for-data-science-2111</link>
      <guid>https://dev.to/emilyngahu/essential-sql-commands-for-data-science-2111</guid>
<description>&lt;p&gt;As a data analyst, one uses loads of data in order to make informed decisions. Often, the data lives in an SQL database; follow the link below for an introduction to SQL (&lt;a href="https://dev.to/emme_42/introduction-to-sql-for-data-analysis-3fj7"&gt;https://dev.to/emme_42/introduction-to-sql-for-data-analysis-3fj7&lt;/a&gt;).&lt;br&gt;
Since data is often stored in an SQL database, one ought to understand the SQL query commands. This article will take you through the essential SQL commands for data science.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**Data definition** 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Data definition commands are used to create (define) data structures such as tables, indexes, and clusters, e.g.:&lt;br&gt;
• CREATE databases, tables&lt;br&gt;
• ALTER databases, tables&lt;br&gt;
• DROP tables&lt;/p&gt;

&lt;p&gt;Although most of the time the client gives you data in an already created database, it is essential to know how to create databases. Databases are created using the CREATE statement;&lt;br&gt;
for example, in MySQL:&lt;br&gt;
create database database_name;&lt;/p&gt;

&lt;p&gt;create table department (&lt;br&gt;
    list the columns and their data types );&lt;/p&gt;

&lt;p&gt;DROP TABLE - used to delete a certain table if it is not being used or not needed for analysis.&lt;/p&gt;
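&lt;p&gt;Here is a runnable sketch of CREATE TABLE and DROP TABLE using Python's built-in sqlite3 module (SQLite creates the database itself when you connect, so CREATE DATABASE is not needed; the department table is illustrative):&lt;/p&gt;

```python
import sqlite3

# An in-memory database, so nothing touches disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE TABLE: list the columns and their data types.
cur.execute("CREATE TABLE department (dept_id INTEGER, dept_name TEXT)")

# The table now appears in the schema catalogue.
tables = [row[0] for row in cur.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
print(tables)

# DROP TABLE removes it again.
cur.execute("DROP TABLE department")
tables_after = [row[0] for row in cur.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
print(tables_after)
```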

&lt;p&gt;&lt;strong&gt;Data Manipulation&lt;/strong&gt;&lt;br&gt;
The data manipulation language is used to access and update data; it is not concerned with how the data is represented (of course, the data manipulation language must be aware of how data is represented, and reflects this in the constructs it supports), i.e.:&lt;/p&gt;

&lt;p&gt;• SELECT - extracts data from databases - to get all the content from a specific table in the database:&lt;br&gt;
  &lt;em&gt;select *&lt;br&gt;
  from table_name;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;• UPDATE - updates data in a database:&lt;br&gt;
      &lt;em&gt;update table_name&lt;br&gt;
      set column1=value1;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;• DELETE - deletes data from tables:&lt;br&gt;
     &lt;em&gt;delete from table_name;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;• INSERT INTO - inserts data into tables:&lt;br&gt;
  &lt;em&gt;insert into table_name (&lt;br&gt;
   column1, column2)&lt;br&gt;
    values (value1, value2);&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;ALTER TABLE - used to add, delete, or modify columns in an existing table; also used to add and drop various constraints on an existing table.&lt;/p&gt;

&lt;p&gt;ALTER TABLE table_name&lt;br&gt;
   ADD column_name datatype;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sorting on some attribute / data retrieval with simple conditions&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;WHERE
This is used to retrieve specific entries that meet specific conditions.
For example, in a dataset of employees, if we want to know which employees earn 50,000 or more, use:
&lt;em&gt;select employee_salary
from employee
where employee_salary &amp;gt;= 50000;&lt;/em&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;2. ORDER BY&lt;br&gt;
Used to sort records. It sorts the records in ascending order by default; to sort the records in descending order, use the DESC keyword.&lt;br&gt;
  &lt;em&gt;select employee_salary&lt;br&gt;
   from employee&lt;br&gt;
   where employee_salary &amp;gt;= 50000&lt;br&gt;
   order by employee_salary;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;3. LIMIT&lt;br&gt;
 Used to return a limited number of entries:&lt;br&gt;
   &lt;em&gt;select employee_salary&lt;br&gt;
   from employee&lt;br&gt;
   where employee_salary &amp;gt;= 50000&lt;br&gt;
   order by employee_salary&lt;br&gt;
   limit 10;&lt;/em&gt;&lt;br&gt;
- returns only the first 10 entries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AGGREGATIONS&lt;/strong&gt;&lt;br&gt;
Used to get a summary of the dataset to get insights.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;GROUP BY
The GROUP BY statement groups rows that have the same values into summary rows.
Syntax:
&lt;em&gt;select sum(column_name)
from table_name
where (condition)
group by column_name;&lt;/em&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;2. COUNT&lt;br&gt;
 It returns the number of rows that match a specified criterion:&lt;br&gt;
   &lt;em&gt;select count(column_name)&lt;br&gt;
   from table_name&lt;br&gt;
   where condition;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;JOINS&lt;/strong&gt;&lt;br&gt;
This command is used to combine data from two or more tables in a database.&lt;br&gt;
 Examples:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Inner join&lt;br&gt;
It returns only the rows where there is a match between columns in both tables.&lt;br&gt;
SELECT column_name(s)&lt;br&gt;
FROM table1&lt;br&gt;
INNER JOIN table2&lt;br&gt;
ON table1.column_name = table2.column_name;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Left join&lt;br&gt;
It returns all the rows from the left table and the matching rows from the right table. If there is no match in the right table, the result will have null values.&lt;br&gt;
SELECT column_name(s)&lt;br&gt;
FROM table1&lt;br&gt;
LEFT JOIN table2&lt;br&gt;
ON table1.column_name = table2.column_name;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Right join&lt;br&gt;
It returns all the records from the right table, and the matching records from the left table. If there is no match in the left table, the result will have null values.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;SELECT column_name(s)&lt;br&gt;
   FROM table1&lt;br&gt;
   RIGHT JOIN table2&lt;br&gt;
   ON table1.column_name = table2.column_name;&lt;/p&gt;

&lt;p&gt;4. Outer join&lt;br&gt;
Used to return all the rows from one or both tables.&lt;/p&gt;

&lt;p&gt;SELECT column_name(s)&lt;br&gt;
   FROM table1&lt;br&gt;
   FULL OUTER JOIN table2&lt;br&gt;
   ON table1.column_name = table2.column_name&lt;br&gt;
   WHERE condition;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;AVG function&lt;br&gt;
It returns the average value of a numeric column.&lt;br&gt;
SELECT AVG(column_name)&lt;br&gt;
FROM table_name&lt;br&gt;
WHERE condition;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;HAVING clause&lt;br&gt;
The HAVING clause was added to SQL because the WHERE keyword cannot be used with aggregate functions.&lt;br&gt;
SELECT column_name(s)&lt;br&gt;
FROM table_name&lt;br&gt;
WHERE condition&lt;br&gt;
GROUP BY column_name(s)&lt;br&gt;
HAVING condition&lt;br&gt;
ORDER BY column_name(s);&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;SUM function&lt;br&gt;
It returns the total sum of a numeric column.&lt;br&gt;
SELECT SUM(column_name)&lt;br&gt;
FROM table_name&lt;br&gt;
WHERE condition;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
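&lt;p&gt;AVG, SUM and HAVING together on the same illustrative employee table:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employee (dept TEXT, salary INTEGER)")
cur.executemany("INSERT INTO employee VALUES (?, ?)",
                [("IT", 50000), ("IT", 70000), ("HR", 40000)])

# AVG over the whole table.
avg_all = cur.execute("SELECT AVG(salary) FROM employee").fetchone()[0]

# HAVING filters the grouped rows, where WHERE cannot use aggregates.
big_depts = cur.execute(
    "SELECT dept, SUM(salary) FROM employee "
    "GROUP BY dept HAVING SUM(salary) > 80000"
).fetchall()
print(avg_all, big_depts)  # only IT exceeds the 80000 payroll threshold
```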

&lt;h2&gt;
  
  
  CHANGING DATA TYPES
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;CAST&lt;br&gt;
It converts a value (of any type) into a specified datatype.&lt;br&gt;
CAST(expression AS datatype(length))&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ROUND&lt;br&gt;
It rounds a number to a specified number of decimal places.&lt;br&gt;
ROUND(number, decimals)&lt;br&gt;
(In SQL Server, an optional third argument makes it truncate instead of round.)&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
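&lt;p&gt;Both functions in a quick sqlite3 session:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CAST a text value to an integer, and ROUND to 2 decimal places.
casted = cur.execute("SELECT CAST('42' AS INTEGER)").fetchone()[0]
rounded = cur.execute("SELECT ROUND(3.14159, 2)").fetchone()[0]
print(casted, rounded)  # 42 3.14
```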

&lt;h2&gt;
  
  
  WINDOW FUNCTIONS
&lt;/h2&gt;

&lt;p&gt;A window function performs a calculation across a set of table rows that are somehow related to the current row.&lt;br&gt;
Here are some examples:&lt;/p&gt;

&lt;p&gt;1. ROW_NUMBER()&lt;br&gt;
This function assigns a unique sequential number to each row within a partition.&lt;br&gt;
    ROW_NUMBER() OVER (&lt;br&gt;
    [PARTITION BY expr1, expr2, ...]&lt;br&gt;
    ORDER BY expr1 [ASC | DESC], expr2, ...&lt;br&gt;
    )&lt;br&gt;
Window functions are a bit complex, so I urge you to research these commands further.&lt;br&gt;
These commands are used in all data analysis processes, so if you want to perfect your analysis, practice them using open-source databases.&lt;/p&gt;
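&lt;p&gt;As a sketch, ROW_NUMBER() numbering employees by salary within each department (window functions require SQLite 3.25 or newer, which ships with recent Python versions; the table is illustrative):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employee (dept TEXT, name TEXT, salary INTEGER)")
cur.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                [("IT", "Ann", 70000), ("IT", "Ben", 50000),
                 ("HR", "Cy", 40000)])

# Each department gets its own 1, 2, ... numbering, highest salary first.
rows = cur.execute(
    "SELECT dept, name, "
    "ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn "
    "FROM employee ORDER BY dept, rn"
).fetchall()
print(rows)
```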

</description>
      <category>datascience</category>
      <category>sql</category>
    </item>
    <item>
      <title>THE ULTIMATE GUIDE FOR EXPLORATORY DATA ANALYSIS</title>
      <dc:creator>Emily</dc:creator>
      <pubDate>Sat, 25 Feb 2023 20:28:17 +0000</pubDate>
      <link>https://dev.to/emilyngahu/the-ultimate-guide-for-exploratory-data-analysis-3fnn</link>
      <guid>https://dev.to/emilyngahu/the-ultimate-guide-for-exploratory-data-analysis-3fnn</guid>
<description>&lt;p&gt;Hi data enthusiast!&lt;br&gt;
Exploratory data analysis (EDA) is the first basic step performed on data by a data analyst or data scientist.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**What is exploratory data analysis?**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is basically a process used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods.&lt;/p&gt;

&lt;p&gt;It can help determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns and check assumptions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        **Importance of EDA**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;1. Identify patterns and relationships: EDA helps to identify patterns and relationships between different variables in the data. This can help to generate hypotheses and guide further analysis.&lt;br&gt;
2. Detect outliers and errors: EDA can help to identify outliers and errors in the data, which can then be corrected or removed before further analysis.&lt;br&gt;
3. Assess data quality: EDA can help to assess the quality of the data and determine if it is suitable for analysis. This includes checking for missing values, inconsistencies, and data formatting issues.&lt;br&gt;
4. Understand the data distribution: EDA can help to understand the distribution of the data and its characteristics such as mean, median, and standard deviation. This can help to identify potential biases in the data.&lt;br&gt;
5. Communicate insights: EDA can help to communicate insights and findings to others in a clear and concise manner. This can be especially important in interdisciplinary teams where people may have different levels of technical expertise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Types of EDA&lt;/strong&gt;&lt;br&gt;
1. Univariate - this analysis involves examining the distribution and characteristics of a single variable.&lt;br&gt;
2. Bivariate - this analysis involves examining the relationship between two variables.&lt;br&gt;
3. Multivariate - this analysis involves examining the relationships between more than two variables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Techniques for EDA&lt;/strong&gt;&lt;br&gt;
The most common techniques used for EDA are:&lt;br&gt;
1. Box plots&lt;br&gt;
2. Histograms&lt;br&gt;
3. Bar charts&lt;br&gt;
4. Line graphs&lt;br&gt;
5. Stem-and-leaf plots&lt;br&gt;
6. Pareto charts&lt;br&gt;
7. Heat maps&lt;br&gt;
8. Scatter plots&lt;/p&gt;

&lt;p&gt;Exploratory data analysis can be done using several tools, e.g. R and Python.&lt;br&gt;
In this guide we will focus on EDA in Python.&lt;/p&gt;

&lt;p&gt;Python is a popular programming language used for EDA due to its rich ecosystem of libraries and tools. Here are the basic steps for EDA in Python:&lt;/p&gt;

&lt;p&gt;Importing Libraries: The first step is to import the necessary libraries such as pandas, numpy, matplotlib, seaborn, etc.&lt;/p&gt;

&lt;p&gt;Loading Data: The next step is to load the data into a pandas dataframe.&lt;/p&gt;

&lt;p&gt;Data Exploration: Once the data is loaded, you can start exploring the data by using various pandas functions like head(), tail(), describe(), info() etc.&lt;/p&gt;

&lt;p&gt;Data Cleaning: This step involves identifying and handling missing values, removing duplicates, handling outliers, and converting data types if necessary.&lt;/p&gt;

&lt;p&gt;Data Visualization: Data visualization is a powerful tool for EDA, and Python offers several libraries like matplotlib, seaborn, and plotly for creating visualizations. You can create different types of plots like scatter plots, histograms, bar plots, etc.&lt;/p&gt;

&lt;p&gt;Correlation Analysis: Correlation analysis helps you identify relationships between variables. You can use pandas functions like corr() and heatmap from seaborn library for this purpose.&lt;/p&gt;

&lt;p&gt;Feature Engineering: Feature engineering involves creating new features from the existing ones to improve the model's performance. You can use pandas functions like apply(), map() and lambda functions to create new features.&lt;/p&gt;
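&lt;p&gt;The exploration steps above can be sketched in a few lines of pandas; the tiny dataframe here is a made-up stand-in, but the same calls work on any loaded CSV:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical wine-quality-style data, purely for illustration.
df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.0, 11.2],
    "quality": [5, 5, 6, 7],
})

print(df.head())              # first rows
print(df.shape)               # (rows, columns)
print(df.describe())          # summary statistics
print(df["quality"].unique()) # distinct values of the target variable
print(df.corr())              # pairwise correlations
```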

&lt;p&gt;Conclusion: Finally, you can draw conclusions and insights from your analysis and share your findings with others.&lt;br&gt;
Here is a little 'cheat sheet' to help you get started.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# import pandas, numpy, matplotlib, seaborn and the data
# .head() - first five observations
# .tail() - last five observations
# .shape - number of rows and columns
# .info() - columns and their corresponding data types
# .describe() - summary statistics
# .quality.unique() - insights from the dependent variable
# .corr() - find correlations
# annot=True - show correlations in grid cells
# boxplot - check minimum, quartiles, maximum
# check linearity - distribution graph
# pairplot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>watercooler</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Introduction to SQL for data analysis</title>
      <dc:creator>Emily</dc:creator>
      <pubDate>Sat, 18 Feb 2023 19:23:04 +0000</pubDate>
      <link>https://dev.to/emilyngahu/introduction-to-sql-for-data-analysis-3fj7</link>
      <guid>https://dev.to/emilyngahu/introduction-to-sql-for-data-analysis-3fj7</guid>
<description>&lt;p&gt;&lt;strong&gt;What is SQL?&lt;/strong&gt;&lt;br&gt;
SQL (Structured Query Language) is a programming language used for managing and analyzing relational databases. SQL stores data in a table format. First, we need to know what databases are, and define relational databases. A database is an organized collection of data stored electronically, for example on a hard drive. A relational database is a type of database that organizes data into one or more tables, with each table consisting of a set of rows and columns. These tables can be related to each other through the use of common columns or fields, which allows for efficient storage and retrieval of data.&lt;/p&gt;

&lt;p&gt;SQL is  widely used in data analysis as it allows users to extract and manipulate data from databases with ease. It is used in accessing, cleaning, and analyzing data that's stored in databases.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  **Advantages of SQL**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;It requires relatively little programming knowledge to get started.&lt;br&gt;
It is flexible - it can be used with other programming languages and on almost any device.&lt;br&gt;
It uses simple, English-like commands for complex procedures.&lt;br&gt;
It processes queries at high speed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**Disadvantages of SQL**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Some database interfaces can be complicated.&lt;br&gt;
Some SQL database systems are costly to license and operate.&lt;br&gt;
It can be insecure if not configured carefully.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Commands used in SQL&lt;/strong&gt;&lt;br&gt;
Here are some key commands that you'll need to know to get started with SQL for data analysis:&lt;br&gt;
1. CREATE - used to create databases and tables.&lt;br&gt;
2. SELECT - used to extract data from existing databases.&lt;br&gt;
3. INSERT - used to add tables or values to existing databases.&lt;br&gt;
4. UPDATE - used to change tables or values that already exist in the database.&lt;br&gt;
5. DROP - used to remove a table definition and all the data from database tables.&lt;br&gt;
6. DELETE - used to delete existing records from a table.&lt;/p&gt;

&lt;p&gt;There are many commands in SQL and it's sometimes difficult to memorize all of them but having an SQL cheat sheet is enough in most cases to get by and thrive when using the language for SQL data analysis.&lt;/p&gt;
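&lt;p&gt;To see all six commands end to end, here is a small sketch using Python's built-in sqlite3 module on a throwaway in-memory database (the product table is illustrative):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE product (name TEXT, price INTEGER)")     # CREATE
cur.execute("INSERT INTO product VALUES ('soap', 30)")             # INSERT
cur.execute("INSERT INTO product VALUES ('salt', 20)")
cur.execute("UPDATE product SET price = 25 WHERE name = 'salt'")   # UPDATE
cur.execute("DELETE FROM product WHERE name = 'soap'")             # DELETE
rows = cur.execute("SELECT * FROM product").fetchall()             # SELECT
print(rows)
cur.execute("DROP TABLE product")                                  # DROP
```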

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
The article gives an overview of SQL and the way it facilitates the analysis of data.&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>psychology</category>
      <category>healthydebate</category>
    </item>
  </channel>
</rss>
