<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: justkmike</title>
    <description>The latest articles on DEV Community by justkmike (@justkmike).</description>
    <link>https://dev.to/justkmike</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1171289%2F23937460-a2c2-408b-affb-d5bd3ca9f39c.jpg</url>
      <title>DEV Community: justkmike</title>
      <link>https://dev.to/justkmike</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/justkmike"/>
    <language>en</language>
    <item>
      <title>Data Cleaning with Pandas</title>
      <dc:creator>justkmike</dc:creator>
      <pubDate>Fri, 20 Oct 2023 11:28:27 +0000</pubDate>
      <link>https://dev.to/justkmike/data-cleaning-with-pandas-4f6m</link>
      <guid>https://dev.to/justkmike/data-cleaning-with-pandas-4f6m</guid>
      <description>&lt;p&gt;In this guide, we'll explore various data-cleaning techniques using Python and the Pandas library. We'll also cover functions like &lt;code&gt;head()&lt;/code&gt;, &lt;code&gt;tail()&lt;/code&gt;, &lt;code&gt;info()&lt;/code&gt;, &lt;code&gt;describe()&lt;/code&gt;, &lt;code&gt;shape&lt;/code&gt;, and &lt;code&gt;size&lt;/code&gt;, and demonstrate how to remove empty cells, deal with wrong data formats, access data and remove duplicates.&lt;/p&gt;

&lt;h2&gt;
  
  
  DataFrame Basics
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;head()&lt;/code&gt; and &lt;code&gt;tail()&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;These functions display the first and last &lt;code&gt;n&lt;/code&gt; rows of a DataFrame, respectively.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Display the first 5 rows
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Display the last 5 rows
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tail&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;code&gt;info()&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;info()&lt;/code&gt; provides essential information about the DataFrame, including column data types, non-null counts, and memory usage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;code&gt;describe()&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;describe()&lt;/code&gt; returns summary statistics for the numeric columns of the DataFrame: count, mean, standard deviation, minimum, maximum, and the 25%, 50% (median), and 75% quartiles.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;code&gt;shape&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;shape&lt;/code&gt; returns the dimensions of the DataFrame as a tuple (number of rows, number of columns).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;code&gt;size&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;size&lt;/code&gt; returns the total number of elements in the DataFrame.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
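&lt;p&gt;To see these inspection tools together, here is a minimal sketch on a small made-up DataFrame. The column names and values are assumptions chosen only for illustration.&lt;/p&gt;

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({
    "name": ["Ann", "Ben", "Cat", "Dan", "Eve", "Fay"],
    "age": [25, 32, 47, 19, 51, 38],
})

first_two = df.head(2)   # first 2 rows instead of the default 5
last_two = df.tail(2)    # last 2 rows
rows, cols = df.shape    # (number of rows, number of columns)
total_cells = df.size    # rows * columns
summary = df.describe()  # statistics for the numeric column "age"

print(first_two)
print(last_two)
print(rows, cols, total_cells)
print(summary)
```

&lt;p&gt;Note that &lt;code&gt;shape&lt;/code&gt; and &lt;code&gt;size&lt;/code&gt; are attributes, so they are used without parentheses.&lt;/p&gt;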



&lt;h2&gt;
  
  
  Data Cleaning
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Removing Empty Cells
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;dropna()&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;dropna()&lt;/code&gt; removes rows that contain empty cells. By default it returns a new DataFrame and leaves the original unchanged. If you want to modify the existing DataFrame instead, pass &lt;code&gt;inplace=True&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create a new DataFrame with empty cells removed
&lt;/span&gt;&lt;span class="n"&gt;new_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Modify the existing DataFrame in-place
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;code&gt;&lt;a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html#pandas.DataFrame.fillna"&gt;fillna&lt;/a&gt;()&lt;/code&gt;
&lt;/h4&gt;

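&lt;p&gt;&lt;code&gt;fillna("Value to replace with")&lt;/code&gt; replaces empty cells with a specified value, passed as the &lt;code&gt;value&lt;/code&gt; argument. It also accepts additional parameters such as &lt;code&gt;axis&lt;/code&gt;, &lt;code&gt;limit&lt;/code&gt;, and &lt;code&gt;method&lt;/code&gt; (note that &lt;code&gt;method&lt;/code&gt; is deprecated in recent pandas releases in favor of &lt;code&gt;ffill()&lt;/code&gt; and &lt;code&gt;bfill()&lt;/code&gt;).&lt;br&gt;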
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Replace empty cells with a specific value
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Replacement Value"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
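&lt;p&gt;The difference between dropping and filling can be sketched on a tiny example. The DataFrame below is made up for illustration, and passing a dictionary to &lt;code&gt;fillna()&lt;/code&gt; to give each column its own replacement value is just one possible choice.&lt;/p&gt;

```python
import pandas as pd

# Hypothetical data with two missing cells
df = pd.DataFrame({
    "city": ["Nairobi", None, "Kisumu"],
    "temp": [24.5, 30.1, None],
})

# Keep only the rows that have no missing cells
dropped = df.dropna()

# Replace missing cells with a per-column value
filled = df.fillna({"city": "Unknown", "temp": 0.0})

print(dropped)
print(filled)
print(df.isna().sum())  # the original df still has one missing cell per column
```

&lt;p&gt;Neither call uses &lt;code&gt;inplace=True&lt;/code&gt; here, so &lt;code&gt;df&lt;/code&gt; itself is left untouched.&lt;/p&gt;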



&lt;h3&gt;
  
  
  Handling Wrong Data Formats
&lt;/h3&gt;

&lt;p&gt;Columns are sometimes read in with the wrong type, such as dates stored as plain strings. For example, to convert a column named "date" to datetime format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
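&lt;p&gt;Real columns often contain a few values that cannot be parsed. As a hedged sketch (the sample values are invented), &lt;code&gt;errors="coerce"&lt;/code&gt; turns such values into &lt;code&gt;NaT&lt;/code&gt; so they can be handled like any other missing data:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical column with one malformed entry
df = pd.DataFrame({"date": ["2023-10-18", "2023-10-20", "not a date"]})

# errors="coerce" turns values that cannot be parsed into NaT (missing)
df["date"] = pd.to_datetime(df["date"], errors="coerce")

print(df.dtypes)
print(df)

# Rows with an unparseable date can then be dropped like any other missing value
clean = df.dropna(subset=["date"])
print(clean)
```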



&lt;h3&gt;
  
  
  Removing Duplicates
&lt;/h3&gt;

&lt;p&gt;To identify and remove duplicate rows:&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;duplicated()&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;duplicated()&lt;/code&gt; returns a Boolean Series indicating whether each row is a repeat of an earlier row (True) or not (False). By default, the first occurrence of a row is not marked as a duplicate.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;duplicate_rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;duplicated&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;code&gt;drop_duplicates()&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;drop_duplicates()&lt;/code&gt; removes duplicate rows, keeping the first occurrence of each. By default it returns a new DataFrame; use the &lt;code&gt;inplace=True&lt;/code&gt; parameter to modify the existing DataFrame.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop_duplicates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
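&lt;p&gt;Here is a small sketch of both methods on made-up data. The &lt;code&gt;subset&lt;/code&gt; and &lt;code&gt;keep&lt;/code&gt; arguments in the last step are an optional variation, shown only to illustrate deduplicating on a single column.&lt;/p&gt;

```python
import pandas as pd

# Hypothetical data: row 2 repeats row 0 exactly
df = pd.DataFrame({
    "name": ["Mike", "John", "Mike", "Mike"],
    "age": [12, 23, 12, 40],
})

mask = df.duplicated()          # True only for repeats of an earlier row
print(mask.tolist())

deduped = df.drop_duplicates()  # returns a new DataFrame; df itself is unchanged
print(deduped)

# Variation: treat rows as duplicates based on the "name" column only
by_name = df.drop_duplicates(subset=["name"], keep="first")
print(by_name)
```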



&lt;h2&gt;
  
  
  Accessing Data in a DataFrame
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;at&lt;/code&gt; and &lt;code&gt;iat&lt;/code&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;at&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;at&lt;/code&gt; is used to get or set a specific element by row and column labels.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Get the value at row 2, column "name"
&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;at&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Assign a new value to the selected element
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;at&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Justkmike"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;code&gt;iat&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;iat&lt;/code&gt; is used to get or set a single element by zero-based integer row and column position.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Get the value at row 1, column 2
&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iat&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Update data at a specific index
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iat&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
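&lt;p&gt;The contrast between the two is easier to see when the row labels are not integers. The following sketch assumes a made-up DataFrame with the string labels "a", "b", and "c":&lt;/p&gt;

```python
import pandas as pd

# Hypothetical data with string row labels
df = pd.DataFrame(
    {"name": ["Mike", "John", "Jane"], "age": [12, 23, 31]},
    index=["a", "b", "c"],
)

by_label = df.at["b", "name"]   # row label "b", column label "name"
by_position = df.iat[1, 0]      # second row, first column (zero-based)

df.at["c", "age"] = 32          # set by label
df.iat[0, 1] = 13               # set by position

print(by_label, by_position)
print(df)
```

&lt;p&gt;Both lookups return the same cell here, reached once by label and once by position.&lt;/p&gt;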



&lt;h3&gt;
  
  
  &lt;code&gt;&lt;a href="https://www.statology.org/pandas-loc-vs-iloc/"&gt;loc&lt;/a&gt;&lt;/code&gt; and &lt;code&gt;iloc&lt;/code&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;loc&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;loc&lt;/code&gt; selects rows (and optionally columns) using index labels.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Select a row with the index label "12-23-23"
&lt;/span&gt;&lt;span class="n"&gt;selected_row&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"12-23-23"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;code&gt;iloc&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;iloc&lt;/code&gt; selects rows (and optionally columns) using zero-based integer positions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Select the first two rows and the first two columns
&lt;/span&gt;&lt;span class="n"&gt;selected_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
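&lt;p&gt;One practical difference is how slices behave: label slices with &lt;code&gt;loc&lt;/code&gt; include both endpoints, while position slices with &lt;code&gt;iloc&lt;/code&gt; exclude the end. A minimal sketch, assuming a made-up DataFrame indexed by date strings:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical data indexed by date strings
df = pd.DataFrame(
    {"sales": [100, 150, 90], "returns": [5, 2, 7]},
    index=["12-21-23", "12-22-23", "12-23-23"],
)

# One row by label, returned as a Series
row = df.loc["12-23-23"]

# Label slices include both ends
label_slice = df.loc["12-21-23":"12-22-23", "sales"]

# Position slices exclude the end, so this is rows 0-1 and columns 0-1
pos_slice = df.iloc[0:2, 0:2]

print(row)
print(label_slice)
print(pos_slice)
```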



&lt;p&gt;Remember that this is just the tip of the iceberg regarding Pandas. There are many more operations and functions available for data manipulation. If you encounter any issues or need further assistance, feel free to contact &lt;a href="mailto:mwkariuki2e@gmail.com"&gt;mwkariuki2e@gmail.com&lt;/a&gt;. Stay tuned for our next guide on data visualization. See you! 😊&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>pandas</category>
      <category>python</category>
    </item>
    <item>
      <title>Python Pandas: Introduction to pandas (part 2)</title>
      <dc:creator>justkmike</dc:creator>
      <pubDate>Wed, 18 Oct 2023 10:32:08 +0000</pubDate>
      <link>https://dev.to/justkmike/python-pandas-introduction-to-pandas-part-2-1gj8</link>
      <guid>https://dev.to/justkmike/python-pandas-introduction-to-pandas-part-2-1gj8</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/justkmike/introduction-to-pandaspython-pandas-library-for-data-science-404g"&gt;previous article&lt;/a&gt;, we looked at what DataFrames and Series are in pandas. We also explored how to create a pandas DataFrame (referred to as &lt;code&gt;df&lt;/code&gt; here) from a list and a dictionary. I hope you've also researched the key differences between a DataFrame and a Series in pandas. In this article, we will build on that knowledge.&lt;/p&gt;

&lt;p&gt;In data science, data comes from many sources, and each source may store it in a different format, such as comma-separated values (CSV), Excel sheets, SQL files, and more. It's up to you to know which tool to use for each. Now, let's dive into the functions pandas provides.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reading and Writing CSV Files
&lt;/h3&gt;

&lt;p&gt;Pandas provides &lt;code&gt;read_csv()&lt;/code&gt; and &lt;code&gt;to_csv()&lt;/code&gt; functions to work with CSV files.&lt;/p&gt;

&lt;h4&gt;
  
  
  Reading CSV Files
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;read_csv()&lt;/code&gt; is used to read data from CSV files. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;csvdf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"file_path.csv"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By default, &lt;code&gt;read_csv()&lt;/code&gt; assumes the first row in your file is the header row. If this is not the case, you can use the &lt;code&gt;header&lt;/code&gt; parameter to specify the zero-based row number to use as the header. For example, &lt;code&gt;header=3&lt;/code&gt; uses the fourth row:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;csvdf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"file_path.csv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that all rows above the specified header will be ignored when creating the DataFrame.&lt;/p&gt;
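&lt;p&gt;To try this without a file on disk, the sketch below feeds made-up CSV text to &lt;code&gt;read_csv()&lt;/code&gt; through &lt;code&gt;io.StringIO&lt;/code&gt;. The file contents are an assumption for illustration: two lines of notes followed by the real header.&lt;/p&gt;

```python
import io
import pandas as pd

# Hypothetical file contents: two lines of notes before the real header
raw = "exported by tool\nversion 2\nname,age\nMike,12\nJohn,23\n"

# header=2 means the third line (zero-based line 2) holds the column names
csvdf = pd.read_csv(io.StringIO(raw), header=2)

print(csvdf.columns.tolist())
print(csvdf)
```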

&lt;h5&gt;
  
  
  Reading CSV Files with Multiple Headers
&lt;/h5&gt;

&lt;p&gt;If your CSV file has multiple header rows, you can specify them using a list with the &lt;code&gt;header&lt;/code&gt; parameter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;csvdf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"file_path.csv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h5&gt;
  
  
  Reading CSV Files Without Headers
&lt;/h5&gt;

&lt;p&gt;If your CSV file doesn't include headers, but you have the column names separately, you can use the &lt;code&gt;names&lt;/code&gt; parameter to provide a list of column names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;csvdf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"file_path.csv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"first_name"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"last_name"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"gender"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can check the set headers using &lt;code&gt;csvdf.columns&lt;/code&gt;.&lt;/p&gt;
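&lt;p&gt;As a quick check, the following sketch reads header-less CSV text (made up for this example) and supplies the column names. Passing &lt;code&gt;header=None&lt;/code&gt; explicitly is optional when &lt;code&gt;names&lt;/code&gt; is given, but it makes the intent clear.&lt;/p&gt;

```python
import io
import pandas as pd

# Hypothetical file contents with no header line
raw = "Mike,Smith,M\nJane,Doe,F\n"

csvdf = pd.read_csv(
    io.StringIO(raw),
    header=None,  # the first line is data, not column names
    names=["first_name", "last_name", "gender"],
)

print(csvdf.columns.tolist())
print(len(csvdf))
```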

&lt;h5&gt;
  
  
  Adding a Prefix to Column Names
&lt;/h5&gt;

&lt;p&gt;If you have many columns and labeling them is cumbersome, you can give them a common prefix. Older pandas versions offered a &lt;code&gt;prefix&lt;/code&gt; parameter in &lt;code&gt;read_csv()&lt;/code&gt; for this, but it was deprecated in pandas 1.4 and removed in pandas 2.0. Instead, read the file with &lt;code&gt;header=None&lt;/code&gt; and call &lt;code&gt;add_prefix()&lt;/code&gt; on the result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Columns are numbered 0, 1, 2, ... and become Col_0, Col_1, Col_2, ...
&lt;/span&gt;&lt;span class="n"&gt;csvdf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"file_path.csv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;add_prefix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Col_"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Setting the Index Column
&lt;/h4&gt;

&lt;p&gt;By default, DataFrames are indexed from 0 to n-1. If you want to use a specific column as the index, you can do so with the &lt;code&gt;index_col&lt;/code&gt; parameter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;csvdf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"file_path.csv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index_col&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'company'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Alternate approach using column index
&lt;/span&gt;&lt;span class="n"&gt;csvdf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"file_path.csv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index_col&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If it's a multi-index DataFrame, use a list to indicate the index, similar to how we did for the headers.&lt;/p&gt;
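&lt;p&gt;The single-column and multi-index cases can be compared side by side. This is a sketch on made-up CSV text; the column names &lt;code&gt;company&lt;/code&gt;, &lt;code&gt;year&lt;/code&gt;, and &lt;code&gt;revenue&lt;/code&gt; are assumptions for illustration.&lt;/p&gt;

```python
import io
import pandas as pd

# Hypothetical file contents
raw = "company,year,revenue\nAcme,2022,10\nAcme,2023,12\nBolt,2023,7\n"

# One column as the index
single = pd.read_csv(io.StringIO(raw), index_col="company")

# Two columns as a multi-level index
multi = pd.read_csv(io.StringIO(raw), index_col=["company", "year"])

print(single.index.tolist())
print(list(multi.index.names))

# With a multi-level index, a row is addressed by a tuple of labels
print(multi.loc[("Acme", 2023), "revenue"])
```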

&lt;h4&gt;
  
  
  Reading CSV Files with Defined Columns and Rows
&lt;/h4&gt;

&lt;p&gt;If you don't need all the columns and rows in the provided file, you can filter them using the &lt;code&gt;usecols&lt;/code&gt; and &lt;code&gt;nrows&lt;/code&gt; parameters. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;csvdf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"file_path.csv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index_col&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;usecols&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'names'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'gender'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;nrows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



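&lt;p&gt;This reads only the 'names' and 'gender' columns and at most the first 10 data rows. Because &lt;code&gt;index_col=0&lt;/code&gt; is also set, the first of the selected columns (in file order) becomes the index rather than a regular column.&lt;/p&gt;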
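&lt;p&gt;Here is a small sketch of &lt;code&gt;usecols&lt;/code&gt; and &lt;code&gt;nrows&lt;/code&gt; on made-up CSV text. It leaves out &lt;code&gt;index_col&lt;/code&gt; so that both selected columns stay as regular columns.&lt;/p&gt;

```python
import io
import pandas as pd

# Hypothetical file contents with three columns and four data rows
raw = "names,gender,age\nMike,M,30\nJane,F,28\nJohn,M,41\nMary,F,35\n"

# Keep two of the three columns and only the first two data rows
csvdf = pd.read_csv(io.StringIO(raw), usecols=["names", "gender"], nrows=2)

print(csvdf.columns.tolist())
print(csvdf.shape)
```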

&lt;h4&gt;
  
  
  Skipping Rows
&lt;/h4&gt;

&lt;p&gt;If you need to skip specific rows, like those with even row numbers, you can use the &lt;code&gt;skiprows&lt;/code&gt; functionality:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;csvdf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"file_path.csv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index_col&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;skiprows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



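&lt;p&gt;This uses a lambda function to skip lines whose zero-based line number is even. Line numbers count from the top of the file, so line 0, which normally holds the header, is skipped as well. Use a condition such as &lt;code&gt;x != 0 and x % 2 == 0&lt;/code&gt; to keep it.&lt;/p&gt;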
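&lt;p&gt;The function passed to &lt;code&gt;skiprows&lt;/code&gt; receives zero-based line numbers that include the header line. The sketch below, on made-up CSV text, is a variant that keeps line 0 as the header and skips the remaining even-numbered lines:&lt;/p&gt;

```python
import io
import pandas as pd

# Hypothetical file contents: line 0 is the header, lines 1-4 are data
raw = "id,value\n1,a\n2,b\n3,c\n4,d\n"

# Keep the header (line 0) and skip the other even-numbered lines
csvdf = pd.read_csv(io.StringIO(raw), skiprows=lambda x: x != 0 and x % 2 == 0)

print(csvdf)
```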

&lt;h3&gt;
  
  
  Writing to CSV
&lt;/h3&gt;

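&lt;p&gt;To save a DataFrame to a CSV file, use the &lt;code&gt;to_csv()&lt;/code&gt; method. By default, it also writes the index to the file. You can disable this by setting the &lt;code&gt;index&lt;/code&gt; parameter to &lt;code&gt;False&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;csvdf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"file_name.csv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;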
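&lt;p&gt;To see what the &lt;code&gt;index&lt;/code&gt; parameter changes, the sketch below calls &lt;code&gt;to_csv()&lt;/code&gt; without a file path, in which case the CSV text is returned as a string. The DataFrame is made up for illustration.&lt;/p&gt;

```python
import io
import pandas as pd

# Hypothetical data
df = pd.DataFrame({"name": ["Mike", "John"], "age": [12, 23]})

with_index = df.to_csv()                # no path given, so the CSV text is returned
without_index = df.to_csv(index=False)  # same, but without the index column

print(with_index)
print(without_index)

# Round trip: reading the index-free text gives back the same table
restored = pd.read_csv(io.StringIO(without_index))
print(restored.equals(df))
```

&lt;p&gt;With the index included, the first line of the output starts with an empty column name, which then shows up as an extra unnamed column if the file is read back without &lt;code&gt;index_col&lt;/code&gt;.&lt;/p&gt;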

&lt;p&gt;Most of these options work the same way when writing to CSV as when reading from CSV, and related functions such as &lt;code&gt;read_json()&lt;/code&gt; and &lt;code&gt;read_excel()&lt;/code&gt; are similar (I will not be covering those, at least for now). Note that this is not an exhaustive guide to &lt;code&gt;read_csv&lt;/code&gt;, so I encourage you to explore further and practice to build a better understanding. You can also find free courses on platforms like &lt;a href="https://simpli-web.app.link/e/xtVZT3XFZDb"&gt;Simplilearn&lt;/a&gt; to boost your skills and resume.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>python</category>
      <category>pandas</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Introduction to Pandas:Python Pandas library for data science(Part 1)</title>
      <dc:creator>justkmike</dc:creator>
      <pubDate>Fri, 06 Oct 2023 07:23:30 +0000</pubDate>
      <link>https://dev.to/justkmike/introduction-to-pandaspython-pandas-library-for-data-science-404g</link>
      <guid>https://dev.to/justkmike/introduction-to-pandaspython-pandas-library-for-data-science-404g</guid>
      <description>&lt;p&gt;&lt;strong&gt;What is Pandas?&lt;/strong&gt;&lt;br&gt;
Pandas is a Python library designed for data manipulation and analysis. It simplifies various data-related tasks, making them more efficient and accessible. Whether you're working with datasets, performing data cleaning, exploration, or statistical analysis, Pandas provides the tools to help you achieve your goals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Use Pandas?&lt;/strong&gt;&lt;br&gt;
Pandas offers numerous advantages for data scientists and analysts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data Analysis: Pandas simplifies data analysis by providing powerful data structures and functions.&lt;/li&gt;
&lt;li&gt;Data Cleaning: It offers tools for cleaning and preprocessing data, such as handling missing values and outliers.&lt;/li&gt;
&lt;li&gt;Data Manipulation: Pandas allows you to reshape and transform data, making it suitable for your specific analysis needs.&lt;/li&gt;
&lt;li&gt;Readability: It enhances data readability through structured data frames and series.&lt;/li&gt;
&lt;li&gt;Simplified Workflow: Pandas streamlines data-related tasks, saving time and effort in data projects.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For more in-depth information, you can explore resources like W3Schools and GeeksforGeeks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to Install Pandas:&lt;/strong&gt;&lt;br&gt;
You can easily install Pandas using the Python package manager, pip. Open your command prompt or terminal and run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Usage:&lt;/strong&gt;&lt;br&gt;
Once Pandas is installed, you can import it into your Python script or notebook using the alias 'pd':&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can check the installed Pandas version with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__version__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pandas Series:&lt;/strong&gt;&lt;br&gt;
A Pandas Series is a one-dimensional data structure that can represent a single column of a DataFrame. It is homogeneous, meaning it contains elements of the same data type, and each element has a label (index). Here's an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;my_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;34&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;my_series&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;my_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;my_series&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pandas Labels:&lt;/strong&gt;&lt;br&gt;
By default, Pandas assigns labels indexed from 0 to n-1, where n is the length of the series. However, you can customize the index as you prefer. Here's an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The index needs one label per element, so 4 labels for the 4 values in my_list
&lt;/span&gt;&lt;span class="n"&gt;custom_index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'a'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'b'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'c'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'d'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;my_new_series&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;my_list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;custom_index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;my_new_series&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can access a series item using its label, like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;my_new_series&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'a'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key-Value Objects in Pandas Series:&lt;/strong&gt;&lt;br&gt;
If you have a dictionary with key-value pairs, you can transform it into a Pandas Series. The keys will become the labels for the series.&lt;/p&gt;
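
&lt;p&gt;A minimal sketch of this conversion, assuming a small made-up dictionary (the names &lt;code&gt;ages&lt;/code&gt; and &lt;code&gt;age_series&lt;/code&gt; are only illustrative):&lt;/p&gt;

```python
import pandas as pd

# Hypothetical example data: each key becomes an index label
ages = {"Mike": 12, "John": 23}
age_series = pd.Series(ages)

print(age_series)

# Items are then accessed by key, just like with a custom index
print(age_series["John"])
```

&lt;p&gt;The values keep the order in which the keys appear in the dictionary.&lt;/p&gt;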

&lt;p&gt;&lt;strong&gt;DataFrames:&lt;/strong&gt;&lt;br&gt;
Pandas DataFrames are two-dimensional tables with rows and columns. Each column can be thought of as a Pandas Series, and DataFrames are commonly used for structured data. Here's an example of creating a DataFrame from a dictionary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;my_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"Mike"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"John"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="s"&gt;"age"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;new_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;my_dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can access a specific row by its index label using &lt;code&gt;.loc[]&lt;/code&gt;. With the default index, the labels are the integers 0 to n-1. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Select the row labelled 0 (the first row under the default index)
&lt;/span&gt;&lt;span class="n"&gt;new_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To access multiple rows, pass a list of index labels:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;new_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also give the rows custom labels when creating the DataFrame by passing a list of labels to the &lt;code&gt;index&lt;/code&gt; argument.&lt;/p&gt;
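
&lt;p&gt;As a rough sketch, reusing the data from the earlier dictionary example (the labels &lt;code&gt;"first"&lt;/code&gt; and &lt;code&gt;"second"&lt;/code&gt; and the name &lt;code&gt;labelled_df&lt;/code&gt; are made up for illustration):&lt;/p&gt;

```python
import pandas as pd

# Same data as the earlier DataFrame example
my_dict = {"name": ["Mike", "John"], "age": [12, 23]}

# Hypothetical row labels passed through the index argument
labelled_df = pd.DataFrame(my_dict, index=["first", "second"])

# .loc now looks rows up by these labels instead of 0 and 1
print(labelled_df.loc["first"])
print(labelled_df.loc[["first", "second"]])
```

&lt;p&gt;The list of labels needs to be the same length as the number of rows.&lt;/p&gt;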

&lt;p&gt;When you need to load data from sources like CSV files, Excel files, or JSON files into a DataFrame, Pandas provides built-in functions like &lt;code&gt;pd.read_csv()&lt;/code&gt;, &lt;code&gt;pd.read_excel()&lt;/code&gt;, and &lt;code&gt;pd.read_json()&lt;/code&gt; to simplify the process.&lt;/p&gt;
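
&lt;p&gt;As a small preview, here is a sketch of &lt;code&gt;pd.read_csv()&lt;/code&gt;. To keep it self-contained it reads from an in-memory string rather than a real file; the CSV text and the name &lt;code&gt;csv_df&lt;/code&gt; are assumptions for illustration only:&lt;/p&gt;

```python
import io

import pandas as pd

# Made-up CSV text standing in for a file on disk
csv_text = "name,age\nMike,12\nJohn,23\n"

# read_csv accepts a file path or any file-like object
csv_df = pd.read_csv(io.StringIO(csv_text))

print(csv_df.shape)
print(csv_df.head())
```

&lt;p&gt;With a real file you would pass its path instead of the &lt;code&gt;StringIO&lt;/code&gt; object.&lt;/p&gt;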

&lt;p&gt;I will show you how to use these functions in the next article. In the meantime, feel free to do some research on your own. See you in the next one.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>ai</category>
      <category>python</category>
      <category>pandas</category>
    </item>
    <item>
      <title>Data Science! Data Science</title>
      <dc:creator>justkmike</dc:creator>
      <pubDate>Sat, 30 Sep 2023 09:41:22 +0000</pubDate>
      <link>https://dev.to/justkmike/data-science-data-science-2bck</link>
      <guid>https://dev.to/justkmike/data-science-data-science-2bck</guid>
      <description>&lt;p&gt;Before we dive into data science, what is data?&lt;br&gt;
&lt;strong&gt;Data&lt;/strong&gt; is a collection of raw facts and figures gathered for a specific task, e.g. census data, the number of cars that use a road per hour, or daily weather updates. Once collected, this data is of little use to anyone until it has been analyzed, and that brings us to data science.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Science Definition&lt;/strong&gt;&lt;br&gt;
Data science is the study of data to extract meaningful insights for business and to solve real-life problems. Easy, right? :)&lt;br&gt;
Data science is a multidisciplinary field; that is, it combines different disciplines including mathematics, statistics, probability, programming, and more.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Components of Data Science&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data collection&lt;/li&gt;
&lt;li&gt;Data cleaning &lt;/li&gt;
&lt;li&gt;Data Exploration&lt;/li&gt;
&lt;li&gt;Modeling&lt;/li&gt;
&lt;li&gt;Interpretation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Data Collection&lt;/strong&gt;&lt;br&gt;
Data collection is the first step in data analysis. It involves gathering data from existing databases or collecting it directly from the people or places being studied. For example, if I need to analyze how a school is performing, I can get the data from KNEC, or visit various schools, get the data from them, and do my "thing". &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data cleaning&lt;/strong&gt;&lt;br&gt;
You have the data now, but it is often messy: it may contain errors, outliers, and duplicates, which means it cannot be used in its raw form. Data cleaning involves removing the irregularities that may interfere with correct analysis and insights. &lt;/p&gt;
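
&lt;p&gt;To make this concrete, here is a minimal sketch of one possible cleaning step. It assumes Python with the Pandas library, which this article does not otherwise prescribe, and the records are made up for illustration:&lt;/p&gt;

```python
import pandas as pd

# Made-up records: one exact duplicate row and one missing score
raw = pd.DataFrame({
    "school": ["A", "B", "B", "C"],
    "mean_score": [61.5, 70.0, 70.0, None],
})

# Drop exact duplicate rows, then rows that still have missing values
cleaned = raw.drop_duplicates().dropna()

print(cleaned)
```

&lt;p&gt;Whether to drop or repair a problematic row depends on the analysis; dropping is just the simplest option to show.&lt;/p&gt;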

&lt;p&gt;&lt;strong&gt;Data Exploration&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;It's time to work with the data now that you've cleaned it. Data exploration means summarizing and visualizing data in order to better understand its properties. Methods like data visualization, descriptive statistics, and exploratory data analysis (EDA) make it easier to find patterns and relationships in the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Modeling&lt;/strong&gt; &lt;br&gt;
Modeling entails creating mathematical and statistical models to predict future outcomes or uncover hidden information. Common modeling techniques include regression analysis, clustering, neural networks, and other machine learning algorithms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Interpretation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data scientists interpret the outcomes of model training and prediction to obtain actionable insights. To fully understand the impact of the findings and make informed choices, this phase requires expertise in the relevant domain, e.g. transport, medicine, or climate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Science Applications&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Health care&lt;/li&gt;
&lt;li&gt;Transport&lt;/li&gt;
&lt;li&gt;Sports, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Data science is the process of extracting meaningful insights that help businesses make informed decisions as well as solve real-life problems.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>luxacademy</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
