<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Gabriela Trindade</title>
    <description>The latest articles on DEV Community by Gabriela Trindade (@gtrindadi).</description>
    <link>https://dev.to/gtrindadi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F249072%2F519624fb-3fbc-4230-851b-e529d70a03a5.jpg</url>
      <title>DEV Community: Gabriela Trindade</title>
      <link>https://dev.to/gtrindadi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gtrindadi"/>
    <language>en</language>
    <item>
      <title>Creating an Exploratory Data Analysis Report with Pandas-Profiling</title>
      <dc:creator>Gabriela Trindade</dc:creator>
      <pubDate>Mon, 11 Jan 2021 17:42:16 +0000</pubDate>
      <link>https://dev.to/gtrindadi/creating-an-exploratory-data-analysis-report-with-pandas-profiling-n4f</link>
      <guid>https://dev.to/gtrindadi/creating-an-exploratory-data-analysis-report-with-pandas-profiling-n4f</guid>
      <description>&lt;p&gt;Data Scientists and Analysts usually spend some time to get to know the data they are going to work on by doing exploratory analysis. It's one of the first steps in their journey before making further analysis and predictions. As Pythonists, while doing exploratory analysis with pandas, it's a must using methods such as &lt;code&gt;head&lt;/code&gt;, &lt;code&gt;describe&lt;/code&gt;, &lt;code&gt;info&lt;/code&gt;, &lt;code&gt;columns&lt;/code&gt;, &lt;code&gt;shape&lt;/code&gt;, &lt;code&gt;isnull&lt;/code&gt;, &lt;code&gt;value_counts&lt;/code&gt;, &lt;code&gt;unique&lt;/code&gt;, &lt;code&gt;duplicated&lt;/code&gt;, &lt;code&gt;corr&lt;/code&gt;, and so on. In addition to using some visualization libraries, such as seaborn or matplotlib, which is also primordial.&lt;/p&gt;

&lt;p&gt;What if with just a very few lines of code we were able to get insights that would require using all of the methods I mentioned before? What if it's a report with visualization built-in? Wow, that would save us a lot of time! And in fact, we can do that. Hopefully, pandas-profiling can provide us a report with exploratory insights.&lt;/p&gt;

&lt;p&gt;pandas-profiling is an &lt;a href="https://github.com/pandas-profiling/pandas-profiling" rel="noopener noreferrer"&gt;open-source Python library&lt;/a&gt; that allows us to quickly do exploratory analysis with just a few lines of code. Also, as I mentioned before, it's possible to use this library to generate an interactive report, with variables' distributions besides other insights commonly gotten in dataframes during exploratory analysis. This report can be saved in HTML format and easily shared with anyone. Awesome, right?!&lt;/p&gt;

&lt;p&gt;Now, let's see in practice how it works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installing pandas-profiling
&lt;/h2&gt;

&lt;p&gt;You can install it from the command line via pip.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;profiling&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;notebook&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Generating an Exploratory Data Analysis Report
&lt;/h2&gt;

&lt;p&gt;After installing it, go to your Jupyter Notebook and load the data you want to explore as a DataFrame object. As an example, we can use the Titanic dataset, but feel free to use the data you want. See the code below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://raw.githubusercontent.com/gabrielatrindade/ml-playground/master/projects/titanic/dataset/train.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;titanic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, let's import the &lt;code&gt;ProfileReport&lt;/code&gt; class to create the report for the dataframe.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pandas_profiling&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ProfileReport&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, we are able to create the report.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;profile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ProfileReport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;titanic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;explorative&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Titanic Exploratory Analysis&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;profile&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set the &lt;code&gt;explorative&lt;/code&gt; parameter as &lt;code&gt;True&lt;/code&gt; for a deeper exploration, and a &lt;code&gt;title&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We can see the report as output in the Jupyter Notebook. However, if you want to generate an HTML file to share the analysis with someone, it's also possible. Check the code below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;output_titanic_report.html&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The report is composed of a lot of information, below I will list most of them.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Overview:&lt;/strong&gt; we can see some general statistics of the data, information on the report and warnings, that show insights that can highly impact the analysis, such as a high number of null values in a variable, duplicated rows, and high correlation between variables.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Variables:&lt;/strong&gt; composed of descriptive and quantile statistics information for each variable. Also, it's possible to see the histogram and the common and extreme values of the variable, in the case of continuous variables, and pie chart and frequency of each value for categorical data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interactions:&lt;/strong&gt; allows us to see the relationship between two variables through the scatter plot visualization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Correlations:&lt;/strong&gt; shows the heatmap of Pearson, Spearman, Kendall, and Phik correlation matrix.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Missing values:&lt;/strong&gt; through a bar chart or matrix visualization it's possible to see the missing values for each variable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sample:&lt;/strong&gt; first 10 rows and last 10 rows are printed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Duplicate rows:&lt;/strong&gt; shows the duplicated rows.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the image below you can see what it looks like.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fpy2k69wryx7a0dvqxwwg.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fpy2k69wryx7a0dvqxwwg.gif" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Pandas-profiling limitation
&lt;/h2&gt;

&lt;p&gt;One limitation I could see of pandas-profiling is when it's applied to large datasets because, as the dataset size increases, the report generation time increases a lot.&lt;/p&gt;

&lt;p&gt;One way to solve this problem is to generate the report from a sample of the dataset. In that case, if you select a few rows, it's important to make sure that they are representative of all the data you have or you can also select the variables you want to explore.&lt;/p&gt;

&lt;p&gt;Another way to deal with this problem is to use the minimal mode (introduced in the 2.4 pandas profiling version). This will generate a simplified report, taking less time than the full one.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;profile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ProfileReport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;titanic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;minimal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Titanic Exploratory Analysis&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I have shown you how easily we can get an exploratory data analysis report using the pandas-profiling library. With a few lines of code, we can generate an interactive report and create an HTML file for it. The &lt;code&gt;ProfileReport&lt;/code&gt; class can save a lot of work in the phase of knowing the data and getting some insights into it.&lt;/p&gt;

&lt;p&gt;I hope it can be useful for you! See you next time!&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>analytics</category>
      <category>pandasprofiling</category>
    </item>
    <item>
      <title>pandas #4: Merging</title>
      <dc:creator>Gabriela Trindade</dc:creator>
      <pubDate>Mon, 16 Mar 2020 15:56:59 +0000</pubDate>
      <link>https://dev.to/gtrindadi/pandas-4-merging-8b8</link>
      <guid>https://dev.to/gtrindadi/pandas-4-merging-8b8</guid>
      <description>&lt;p&gt;Now that we learned about &lt;a href="https://dev.to/gtrindadi/pandas-2-reading-files-and-basic-dataframe-operations-38l2"&gt;reading files and basic DataFrame operations&lt;/a&gt; from the second post, and &lt;a href="https://dev.to/gtrindadi/pandas-3-aggregation-and-grouping-3i21"&gt;aggregation and grouping&lt;/a&gt; from the third one, it's time to learn about merging.&lt;/p&gt;

&lt;p&gt;But, &lt;strong&gt;what is &lt;code&gt;merge&lt;/code&gt;?&lt;/strong&gt; In simple words, &lt;code&gt;merge&lt;/code&gt; (a.k.a. "joining") is a database-style join operation by columns or indexes to join two dataframes. Let's understand when to use &lt;code&gt;merge&lt;/code&gt; and how it works.&lt;/p&gt;

&lt;h3&gt;
  
  
  When do we need to use &lt;code&gt;merge&lt;/code&gt; operation?
&lt;/h3&gt;

&lt;p&gt;As we know, in real-life data projects is common you get more than one table storing the data. Although you get multiple tables, they can match in some way and give us more details and insights about each observation of a table, and with &lt;code&gt;merge&lt;/code&gt; we can do that. But why doesn't all this information come in only one table? There are a few reasons why it's better to have multiples data tables, some of them are: it avoids redundancy of data, it saves some disk space, it makes it easier to manage the data, it allows your query to be faster because the table is smaller, etc.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does it works?
&lt;/h3&gt;

&lt;p&gt;pandas provides &lt;code&gt;merge&lt;/code&gt; operation that allows us to combine DataFrames. It's almost the same as SQL's join operations. So, to get all the information from different tables together in only one table we can use &lt;code&gt;merge&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;Here we're going to practice &lt;code&gt;merge&lt;/code&gt; operation with the last dataset we used in the last post, but with modifications - so you need to download this new version (&lt;code&gt;restaurant_orders_v2.csv&lt;/code&gt;) - and one more new dataset (&lt;code&gt;restaurant_products_price.csv&lt;/code&gt;). Then I strongly recommend you to check the section &lt;strong&gt;Before we start&lt;/strong&gt; to download the datasets we are going to work on.&lt;/p&gt;

&lt;p&gt;Like in the last post, this one will be more practical so I would appreciate if you code along with me. I try to do these posts independent from each other, here I'm considering that you'll create a new Jupyter Notebook, then I'm going to declare some variables again. &lt;/p&gt;

&lt;p&gt;If you have any questions please feel free to use the comments section below.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note [1]: I made a Jupyter Notebook available on &lt;a href="https://github.com/gabrielatrindade/blog-posts-pandas-series"&gt;my GitHub&lt;/a&gt; with the code used in this post.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Before we start
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;In order to work with pandas, we're going to import this library in the beginning.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Let's download the two datasets we are going to work on.&lt;br&gt;
&lt;em&gt;Note [2]: This first file is not the same as my previous post. Download this new file version.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; !wget https://raw.githubusercontent.com/gabrielatrindade/blog-posts-pandas-series/master/restaurant_orders_v2.csv
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--U0BSKvrR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/d118xaiezhrwkfa3fyif.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--U0BSKvrR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/d118xaiezhrwkfa3fyif.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; !wget https://raw.githubusercontent.com/gabrielatrindade/blog-posts-pandas-series/master/restaurant_products_price.csv
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DVVZmvSL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/wngjszieagoe68b7vee6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DVVZmvSL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/wngjszieagoe68b7vee6.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Now, let's turn them into dataframes and store each one in a variable.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt; &lt;span class="n"&gt;column_names_orders&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'order_number'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'order_date'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'item_name'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'quantity'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
 &lt;span class="n"&gt;orders_v2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'restaurant_orders_v2.csv'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;delimiter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;','&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;column_names_orders&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

 &lt;span class="n"&gt;column_names_prices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'item_name'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'product_price'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
 &lt;span class="n"&gt;products_price&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'restaurant_products_price.csv'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                              &lt;span class="n"&gt;delimiter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;','&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;column_names_prices&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;If you would like to know details about this process, you can go through the &lt;a href="https://dev.to/gtrindadi/pandas-2-reading-files-and-basic-dataframe-operations-38l2"&gt;second post&lt;/a&gt; in &lt;strong&gt;Reading a File&lt;/strong&gt; section.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Check these dataframes and try to understand the information that each one has.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt; &lt;span class="n"&gt;orders_v2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hBbs2SYL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/o55xrgxd20po5wvqw29m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hBbs2SYL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/o55xrgxd20po5wvqw29m.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;orders_v2&lt;/code&gt; dataframe is just a new version of the dataframe we used on the last post. The modification in this second version is about a huge number of values that were deleted to reduce our dataset and one column that was removed from the first dataframe version, so now we have a dataframe with 4 columns, which are: &lt;code&gt;order_number&lt;/code&gt;, &lt;code&gt;order_date&lt;/code&gt;, &lt;code&gt;item_name&lt;/code&gt;, and &lt;code&gt;quantity&lt;/code&gt;. I believe that the &lt;code&gt;orders_v2&lt;/code&gt; column names are self-explanatory, but if you have any questions about the dataset, let me know through the comments.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt; &lt;span class="n"&gt;products_price&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sA8ZMhfs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/rzd5wr9kf2lhy6i8g4c0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sA8ZMhfs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/rzd5wr9kf2lhy6i8g4c0.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see there are just two columns in this &lt;code&gt;products_price&lt;/code&gt; dataframe, and they represent the product name (&lt;code&gt;item_name&lt;/code&gt;) and the product price (&lt;code&gt;product_price&lt;/code&gt;), respectively.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Merging Operation
&lt;/h2&gt;

&lt;p&gt;Now that we discussed the motivation of &lt;code&gt;merge&lt;/code&gt; operation, let's going to deep into it, to know the different types and learn how to apply it on the datasets you downloaded.&lt;/p&gt;

&lt;h3&gt;
  
  
  Overview
&lt;/h3&gt;

&lt;p&gt;Basically, there are 4 types of merge: outer join, inner join, left join and right join. In the image below we can see the purpose of each one.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Vjf-F9QM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/kvucmsrwe2e0a82ve2x5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Vjf-F9QM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/kvucmsrwe2e0a82ve2x5.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This information you will pass in &lt;code&gt;how&lt;/code&gt; parameter.&lt;/p&gt;

&lt;p&gt;Another important parameter to know is &lt;code&gt;right_on&lt;/code&gt; and &lt;code&gt;left_on&lt;/code&gt;. They allocate the column from the right DataFrame and the one from left DataFrame, respectively, which will match the values. If you don't pass these parameters, pandas will merge through those columns that have the same name.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note [3]: In case you don't understand, no worries! It's going to be more clear with the code examples.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Inner: intersection of dataframes
&lt;/h3&gt;

&lt;p&gt;An inner join will get only the rows (observations) from the keys that have in both dataframes.&lt;/p&gt;

&lt;p&gt;By default &lt;code&gt;merge&lt;/code&gt; pandas operation consider inner join. Then you can omit this information and the result will be the same. And because we have columns in each dataframe with the same name, pandas will consider them as the keys and will merge through them. Then you can also omit this information.&lt;/p&gt;

&lt;p&gt;Let's merge our two dataframes and take just the purchases that the &lt;code&gt;item_name&lt;/code&gt; is registered in both tables. Check it below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="n"&gt;orders_v2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;products_price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8624KC7r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/2mfocj92i4h9mxk4kxvi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8624KC7r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/2mfocj92i4h9mxk4kxvi.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YCWds2f7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/vvpfnzmrjv1g4pkell5y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YCWds2f7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/vvpfnzmrjv1g4pkell5y.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="n"&gt;orders_v2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;products_price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'inner'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;right_on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'item_name'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;left_on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'item_name'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DtHlc95q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/btrbk0awzystyscqminv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DtHlc95q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/btrbk0awzystyscqminv.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nR-rqBVl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/kgbzakmkl3tby4ylqsks.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nR-rqBVl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/kgbzakmkl3tby4ylqsks.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Note [4]: I recommend you use &lt;code&gt;how&lt;/code&gt;, &lt;code&gt;right_on&lt;/code&gt;, and &lt;code&gt;left_on&lt;/code&gt; , in this way your code will be more explicit.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As you can see we have just 25 observations, remember that the inner operation will join observations that are present in the two datasets.&lt;/p&gt;
&lt;h3&gt;
  
  
  Outer: union of dataframes
&lt;/h3&gt;

&lt;p&gt;An outer join will get all the rows (observations) independently if the keys are present in just one dataframe, regardless of whether there is a match. In case the key is present in one dataframe the columns of another dataframe will be filled by &lt;code&gt;NaN&lt;/code&gt; values.&lt;/p&gt;

&lt;p&gt;Let's merge our two dataframes and take all the purchases independently if the &lt;code&gt;item_name&lt;/code&gt; is registered in both tables. Here you will see that we have more observations (36), but some of them will appear with &lt;code&gt;NaN&lt;/code&gt; values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="n"&gt;orders_v2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;products_price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'outer'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;right_on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'item_name'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;left_on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'item_name'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hVtx_vdq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/h43r9umjexjii8f78eso.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hVtx_vdq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/h43r9umjexjii8f78eso.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xADx3Bqb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/1emei45s0pho3p3mixe1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xADx3Bqb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/1emei45s0pho3p3mixe1.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Right: keys from right dataframe
&lt;/h3&gt;

&lt;p&gt;A right join will get just the rows (observations) from the keys of the right dataframe, regardless of whether there is a match. If the key has no information on the left dataframe, the rest of the columns will be filled by &lt;code&gt;NaN&lt;/code&gt; values.&lt;/p&gt;

&lt;p&gt;Let's merge our two dataframes and take all the information from the &lt;code&gt;products_price&lt;/code&gt; dataframe independently if the &lt;code&gt;item_name&lt;/code&gt; was bought in some order on the &lt;code&gt;orders_v2&lt;/code&gt; dataframe. In this case, you will see that some products will appear with their values but there are no orders to them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="n"&gt;orders_v2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;products_price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'right'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;right_on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'item_name'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;left_on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'item_name'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hBJvEeTz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/g6srciwwf2li0pf3qv7d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hBJvEeTz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/g6srciwwf2li0pf3qv7d.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SdCBelay--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/9ipto3q8cfj6cm162etu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SdCBelay--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/9ipto3q8cfj6cm162etu.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Left: keys from left dataframe
&lt;/h3&gt;

&lt;p&gt;A left join will get just the rows (observations) from the keys of the left dataframe, regardless of whether there is a match. If the key has no information on the right dataframe, the rest of the columns will be filled by &lt;code&gt;NaN&lt;/code&gt; values.&lt;/p&gt;

&lt;p&gt;Let's merge our two dataframes and take all the purchases from the &lt;code&gt;orders_v2&lt;/code&gt; dataframe independently if the &lt;code&gt;item_name&lt;/code&gt; is registered in the &lt;code&gt;products_price&lt;/code&gt; dataframe. In this case, we will have some orders with products that have no prices registered.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="n"&gt;orders_v2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;products_price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'left'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;right_on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'item_name'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;left_on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'item_name'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KqFaHWWL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/9lo6o0krz2ni5w2tilo4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KqFaHWWL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/9lo6o0krz2ni5w2tilo4.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vM2pKOAb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/g8kij9s2923a431mmcv2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vM2pKOAb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/g8kij9s2923a431mmcv2.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;merge&lt;/code&gt; vs &lt;code&gt;join&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;By default the &lt;code&gt;join&lt;/code&gt; method considers the index to join dataframes or a specific column from the dataframe that it's called on (left dataframe).  Then it means that the column from the left dataframe doesn't have to be an index, but for the right dataframe, the key must be its index. In general, I would say that the &lt;code&gt;join&lt;/code&gt; method is based on the index.&lt;/p&gt;

&lt;p&gt;Otherwise, by default, the &lt;code&gt;merge&lt;/code&gt; method will look for overlapping columns in which to merge on, unless we attribute &lt;code&gt;True&lt;/code&gt; value to &lt;code&gt;right_index&lt;/code&gt; or &lt;code&gt;left_index&lt;/code&gt; parameters. &lt;code&gt;merge&lt;/code&gt; allows controlling over merge keys through &lt;code&gt;right_on&lt;/code&gt; and &lt;code&gt;left_on&lt;/code&gt; parameters, as we saw in this post. &lt;code&gt;merge&lt;/code&gt; is useful when we don’t want to join on the index. Differently from &lt;code&gt;join&lt;/code&gt;, it will return a combined dataframe in which the original index will be destroyed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;This was the fourth post of my pandas series where I could show to you that &lt;code&gt;merge&lt;/code&gt; is a join operation that is very useful. We learned about some reasons to use it. &lt;/p&gt;

&lt;p&gt;Also, we could see how it works. Through some practical examples, we also saw the basic parameters of &lt;code&gt;merge&lt;/code&gt; operation and what are the default ones. &lt;/p&gt;

&lt;p&gt;One of these parameters was &lt;code&gt;how&lt;/code&gt; that allocates the merge type. We learned that there are 4 types of merge that allows us to get a set of our joining dataframes that makes more sense for us. The 4 merges are inner (by default), outer, right and left.&lt;/p&gt;

&lt;p&gt;Other parameters were &lt;code&gt;right_on&lt;/code&gt; and &lt;code&gt;left_on&lt;/code&gt; that are about the columns that will represent the keys in each dataframe.&lt;/p&gt;

&lt;p&gt;Finally, we learned a little bit about the difference between &lt;code&gt;merge&lt;/code&gt; and &lt;code&gt;join&lt;/code&gt; in pandas. And we saw that the biggest difference is that &lt;code&gt;join&lt;/code&gt; is based on indexes while &lt;code&gt;merge&lt;/code&gt; is based on columns.&lt;/p&gt;

&lt;p&gt;In the next post, we'll see more about Data Wrangling. See you there!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SfsYKs09--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/lupyez09eccjg79heq0d.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SfsYKs09--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/lupyez09eccjg79heq0d.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>pandas</category>
      <category>python</category>
      <category>tutorial</category>
      <category>datascience</category>
    </item>
    <item>
      <title>pandas #3: Aggregation and Grouping</title>
      <dc:creator>Gabriela Trindade</dc:creator>
      <pubDate>Sat, 14 Dec 2019 19:40:01 +0000</pubDate>
      <link>https://dev.to/gtrindadi/pandas-3-aggregation-and-grouping-3i21</link>
      <guid>https://dev.to/gtrindadi/pandas-3-aggregation-and-grouping-3i21</guid>
      <description>&lt;p&gt;Let's continue with the series and learn more about how to analyse data with pandas. From the &lt;a href="https://dev.to/gtrindadi/pandas-2-reading-files-and-basic-dataframe-operations-38l2"&gt;previous post&lt;/a&gt; we learned about reading files and basic DataFrame operations. Now we know how to store our data into a DataFrame object, so in this post we are going to do more with it.&lt;/p&gt;

&lt;p&gt;Like the last post, this one will be more practical so I would appreciate if you code along with me. I try to do these posts independent from each other, here I'm considering that you'll create a new Jupyter Notebook, then I'm going to declare some variables again. &lt;/p&gt;

&lt;p&gt;If you have any questions please feel free to use the comments section below.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note [1]: I made a Jupyter Notebook available on &lt;a href="https://github.com/gabrielatrindade/blog-posts-pandas-series"&gt;my GitHub&lt;/a&gt; with the code used in this post.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Before we start
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;This is a pandas tutorial, then it's always required that we import pandas library in the beginning.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Let's download the dataset we are going to work on.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note [2]: The file here is the same from my previous post, so if you have it already you can reuse it.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; !wget https://raw.githubusercontent.com/gabrielatrindade/blog-posts-pandas-series/master/restaurant_orders.csv
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NOAN9wPy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/9tpn67qd5s3i9ql6rlof.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NOAN9wPy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/9tpn67qd5s3i9ql6rlof.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Now, let's turn it into a dataframe and store it in a variable.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt; &lt;span class="n"&gt;column_names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'order_number'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'order_date'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="s"&gt;'item_name'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'quantity'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'product_price'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

 &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'restaurant_orders.csv'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delimiter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;','&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;column_names&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you would like to know details about this process, you can go through the previous post in &lt;strong&gt;Reading a File&lt;/strong&gt; section.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Aggregation functions
&lt;/h2&gt;

&lt;p&gt;First, we are going for more details about aggregations functions. But, what does it mean? and what is it for? Aggregate functions are used to apply functions in multiple rows resulting in one single value. By default, these functions are applied in each column (Series), so you can get a single value for each Series. Let me clarify this through our dataframe from the previous post. Do you remember the dataframe structure? Check it again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--J-3VEF-Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/skh5o8gijiybm9naom9v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--J-3VEF-Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/skh5o8gijiybm9naom9v.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And what if I wish to know how many records are in my data? Or how much I sold since I started to store my data? Or what about how many products did I sell? We can do this simply by summing all the numbers of a certain column (Series) or counting the number of lines in our dataframe, right?! But how to do that? Applying some aggregation functions we can get these numbers easily and get information like median, mean and so on.&lt;/p&gt;

&lt;h3&gt;
  
  
  sum()
&lt;/h3&gt;

&lt;p&gt;Imagine that we would like to know how many products we sold, like I said before. We can do this by summing all the numbers from the &lt;code&gt;quantity&lt;/code&gt; column, right?! Applying &lt;code&gt;sum&lt;/code&gt; function directly on the column we can get the answer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'quantity'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5ZW17wQC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/hvjk4et9kp584ksms0rx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5ZW17wQC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/hvjk4et9kp584ksms0rx.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note [3]: In the &lt;a href="https://dev.to/gtrindadi/pandas-2-reading-files-and-basic-dataframe-operations-38l2"&gt;second post&lt;/a&gt; of this pandas series we saw how to access a value in column with pandas. If you go through the &lt;a href="https://dev.to/gtrindadi/pandas-2-reading-files-and-basic-dataframe-operations-38l2"&gt;previous post&lt;/a&gt; (in Basic DataFrame operations &amp;gt;&amp;gt; Selecting specific rows and columns &amp;gt;&amp;gt; Columns) you can see that there are 3 ways to do that. You can compare the solution above with &lt;code&gt;orders.quantity.sum()&lt;/code&gt; or &lt;code&gt;orders[['quantity']].sum()&lt;/code&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note [4]: If you don't specify a column it will return the sum for each column of the dataframe.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the same way, we can get numbers like how much I sold. But in this case we need to multiple the columns &lt;code&gt;quantity&lt;/code&gt; and &lt;code&gt;product_price&lt;/code&gt; before applying &lt;code&gt;sum&lt;/code&gt; function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'quantity'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'product_price'&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JsbwqT3z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ckpnh4adoem55cg5y1th.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JsbwqT3z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ckpnh4adoem55cg5y1th.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note [5]: We use parenthesis to evaluate first the multiplication operation before doing the method call.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  count()
&lt;/h3&gt;

&lt;p&gt;Now, let's answer the question of how many records we have. To do this I need to count the number of lines I have in my dataframe. So, &lt;code&gt;count&lt;/code&gt; function is appropriate for this case. As we did before, in this case we just apply the function over the dataframe.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gS5vSLGf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/d69kvc2m6054aod5a609.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gS5vSLGf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/d69kvc2m6054aod5a609.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However it will return the counting of records for each column.&lt;/p&gt;

&lt;p&gt;As we can see each column has the same value. It happens because all columns (variables) were not empty. In case of &lt;code&gt;None&lt;/code&gt; values, it would not be counted.&lt;/p&gt;

&lt;p&gt;But if we want to get one unique value that represents the records quantity, we should choose a column which has no &lt;code&gt;None&lt;/code&gt; values and apply the &lt;code&gt;count&lt;/code&gt; function over it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'order_number'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pGdaUc0A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/2qozkcu4i0kmmke4lfip.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pGdaUc0A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/2qozkcu4i0kmmke4lfip.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  min() and max()
&lt;/h3&gt;

&lt;p&gt;And what about the lowest product price I have recorded in my orders? Or the highest one? In these cases we are applying &lt;code&gt;min&lt;/code&gt; and &lt;code&gt;max&lt;/code&gt; functions, respectively, on the right Series (&lt;code&gt;product_price&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'product_price'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NpIhjWBo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/suq8sehen6zwpkl9alz4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NpIhjWBo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/suq8sehen6zwpkl9alz4.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'product_price'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0tNUfP69--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/wl844gvgyukrxuax9bkh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0tNUfP69--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/wl844gvgyukrxuax9bkh.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  mean() and median()
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;mean&lt;/code&gt; and &lt;code&gt;median&lt;/code&gt; functions return us the column average and the column median, respectively. Below I briefly explain the difference between the two functions: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;mean()&lt;/code&gt; is the average of a set of numbers. To get this result just add all the numbers and then divide by the amount of elements you added. The mean is not a robust tool since it is largely influenced by outliers.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;median()&lt;/code&gt; is the middle value in an ascending ordered list of numbers. If the amount of elements in the list is even, i.e. has no middle number, so you add the two numbers in the middle and divide by 2 to get the median. The median is better suited for skewed distributions to derive at central tendency since it is much more robust and sensible.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, let's get the &lt;code&gt;mean()&lt;/code&gt; and &lt;code&gt;median()&lt;/code&gt; of &lt;code&gt;quantity&lt;/code&gt; column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'quantity'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--65N7gCM0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/teuneepw3u3z6s5ltikx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--65N7gCM0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/teuneepw3u3z6s5ltikx.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'quantity'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;median&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--m-MnvAU8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ypw7pr0u1vtrnnl82qg5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--m-MnvAU8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ypw7pr0u1vtrnnl82qg5.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are more aggregation functions, such as to calculate: standard deviation (&lt;code&gt;.std()&lt;/code&gt;), variance (&lt;code&gt;.var()&lt;/code&gt;), mean absolute deviation (&lt;code&gt;.mad()&lt;/code&gt;), standard error of the mean of groups (&lt;code&gt;.sem()&lt;/code&gt;),  descriptive statistics (&lt;code&gt;.describe()&lt;/code&gt;) and so on. &lt;/p&gt;

&lt;p&gt;I would like to comment that through the &lt;code&gt;describe&lt;/code&gt; function we can get a lot of information, like the ones already mentioned in this post, for example &lt;code&gt;count&lt;/code&gt;, &lt;code&gt;mean&lt;/code&gt;, &lt;code&gt;median&lt;/code&gt; and &lt;code&gt;std&lt;/code&gt;. But I'll show it in more details in a Data Wrangling post.&lt;/p&gt;

&lt;h2&gt;
  
  
  Grouping
&lt;/h2&gt;

&lt;p&gt;Segmentation is part of a Data Scientist work. It means to separate the dataset in groups of elements which has something in common, or that makes sense to be in a group. &lt;/p&gt;

&lt;p&gt;Let's take as an example our dataframe. In our dataframe we have records of products that were sold, right?! And it means that we have one or more records per order. So, what if we would like to know how many items were bought in each order? We should treat each order as a group and then apply &lt;code&gt;sum&lt;/code&gt; aggregation function. Let's do it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'order_number'&lt;/span&gt;&lt;span class="p"&gt;)[[&lt;/span&gt;&lt;span class="s"&gt;'quantity'&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--60lH1pWZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/erzleziog0bi8zek3eeo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--60lH1pWZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/erzleziog0bi8zek3eeo.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note [6]: If I use just one bracket it will return a Series object, instead of DataFrame object. You can see more about this the &lt;a href="https://dev.to/gtrindadi/pandas-2-reading-files-and-basic-dataframe-operations-38l2"&gt;previous post&lt;/a&gt;, in Basic DataFrame operations &amp;gt;&amp;gt; Selecting specific rows and columns &amp;gt;&amp;gt; Columns.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You can also put the Series you want to select in the end.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'order_number'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()[[&lt;/span&gt;&lt;span class="s"&gt;'quantity'&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EcTBVmmz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/avnrwmcrmkml5oa8nb12.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EcTBVmmz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/avnrwmcrmkml5oa8nb12.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It won't make any difference in our result.&lt;/p&gt;

&lt;p&gt;So... what about how many orders I have by day? It's another way that we can group our dataframe.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'order_date'&lt;/span&gt;&lt;span class="p"&gt;)[[&lt;/span&gt;&lt;span class="s"&gt;'order_number'&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--y6-_nowa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ptaook74b3v1i7bu7tdk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--y6-_nowa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ptaook74b3v1i7bu7tdk.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Simple, isn't it?&lt;/p&gt;

&lt;p&gt;Maybe you have noticed that a &lt;code&gt;groupby&lt;/code&gt; call will usually be followed by an aggregation function. If we just &lt;code&gt;groupby&lt;/code&gt; a dataframe without specifying the aggregation function, it will return the type &lt;code&gt;pandas.core.groupby.DataFrameGroupBy&lt;/code&gt; object.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'order_number'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RWWNvQBn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/od74swam206pn3x6zzjc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RWWNvQBn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/od74swam206pn3x6zzjc.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To practice more, let's get these values:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The order with the highest amount of different products.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'order_number'&lt;/span&gt;&lt;span class="p"&gt;)[[&lt;/span&gt;&lt;span class="s"&gt;'item_name'&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
       &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
       &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'item_name'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ascending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
       &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cIFdSZIc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/gxotqh9l96n86mo5amzx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cIFdSZIc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/gxotqh9l96n86mo5amzx.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;More details: &lt;code&gt;sort_values&lt;/code&gt; function is sorting the result by &lt;code&gt;item_name&lt;/code&gt; column, in descending order (&lt;code&gt;ascending=False&lt;/code&gt;), then we get just the first result (&lt;code&gt;head(1)&lt;/code&gt;).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The 7 days that had the lowest quantity of products sold.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'order_date'&lt;/span&gt;&lt;span class="p"&gt;)[[&lt;/span&gt;&lt;span class="s"&gt;'quantity'&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
       &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
       &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'quantity'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ascending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
       &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--c4Lgfr8d--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/m1kuqxyo9a9bevdwc1vd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--c4Lgfr8d--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/m1kuqxyo9a9bevdwc1vd.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;More details: Note that we changed to ascending order (&lt;code&gt;ascending=True&lt;/code&gt;) because we want the days that have the lowest quantity of products sold.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The 5 most expensive orders.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before getting these 3 next answers, let's create a new dataframe with a &lt;code&gt;subtotal&lt;/code&gt; column, which represents the &lt;code&gt;product_price&lt;/code&gt; multiplied by &lt;code&gt;quantity&lt;/code&gt;. To not damage our original dataframe, we can create one copy. &lt;/p&gt;

&lt;p&gt;If you want to know more details, check the &lt;a href="https://dev.to/gtrindadi/pandas-2-reading-files-and-basic-dataframe-operations-38l2"&gt;previous post&lt;/a&gt; (in Basic DataFrame operations &amp;gt;&amp;gt; Adding rows and columns &amp;gt;&amp;gt; Columns).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders_copy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;orders_copy&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'subtotal'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orders_copy&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'product_price'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
                           &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orders_copy&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'quantity'&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Now, get the 5 most expensive orders.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orders_copy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'order_number'&lt;/span&gt;&lt;span class="p"&gt;)[[&lt;/span&gt;&lt;span class="s"&gt;'subtotal'&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'subtotal'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ascending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eF5wfLTP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/689tfdphku67lgy9yafh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eF5wfLTP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/689tfdphku67lgy9yafh.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The 3 cheapest orders.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orders_copy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'order_number'&lt;/span&gt;&lt;span class="p"&gt;)[[&lt;/span&gt;&lt;span class="s"&gt;'subtotal'&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'subtotal'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ascending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--muL9Dubp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/s8okpf15w6svwbebo54t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--muL9Dubp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/s8okpf15w6svwbebo54t.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The 2 days that had the biggest income.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orders_copy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'order_date'&lt;/span&gt;&lt;span class="p"&gt;)[[&lt;/span&gt;&lt;span class="s"&gt;'subtotal'&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'subtotal'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ascending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WhtVbTzI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/4tmgd0mvmjjtgrmh6azc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WhtVbTzI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/4tmgd0mvmjjtgrmh6azc.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;This was the third post of pandas series. Here we learned about aggregation functions and grouping, and as you saw in some examples it's very useful and help us to discover new insights from our dataset.&lt;/p&gt;

&lt;p&gt;Aggregation functions are used to apply specific functions in multiple rows resulting in one single value. And grouping is a way to gather elements (rows) that make sense when they are together. It's very common that we use &lt;code&gt;groupby&lt;/code&gt; followed by an aggregation function.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IpOSkl0K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/54uosea53o0ezczxdpoi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IpOSkl0K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/54uosea53o0ezczxdpoi.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I hope you enjoyed it and you found it clear. If you have any question, please let me know in the comments below.&lt;/p&gt;

&lt;p&gt;In the next post we'll learn about merging.&lt;/p&gt;

</description>
      <category>python</category>
      <category>pandas</category>
      <category>tutorial</category>
      <category>beginners</category>
    </item>
    <item>
      <title>pandas #2: Reading files and basic DataFrame operations</title>
      <dc:creator>Gabriela Trindade</dc:creator>
      <pubDate>Fri, 25 Oct 2019 20:05:29 +0000</pubDate>
      <link>https://dev.to/gtrindadi/pandas-2-reading-files-and-basic-dataframe-operations-38l2</link>
      <guid>https://dev.to/gtrindadi/pandas-2-reading-files-and-basic-dataframe-operations-38l2</guid>
      <description>&lt;p&gt;The first step to start doing data analysis in Python is to import your data from a certain source (in my &lt;a href="https://dev.to/gtrindadi/getting-started-with-pandas-3i53"&gt;first post&lt;/a&gt; we saw that pandas is great for data analysis). In this post I'm going to show you how to load files into pandas data structure (dataframes) and then we'll check how we can print the whole dataframe or a sample of the data, filter specific values and select specific columns and rows, besides append and delete them. In the end we'll check the logic sequence of pandas operations.&lt;/p&gt;

&lt;p&gt;This second post is different from the first one. In the first one we learned the theoretical stuff. On the other hand, this post will be more practical so I would appreciate if you code along with me. If you have any questions please feel free to use the comments section below.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note [1]: I made a Jupyter Notebook available on &lt;a href="https://github.com/gabrielatrindade/blog-posts-pandas-series" rel="noopener noreferrer"&gt;my GitHub&lt;/a&gt; with the code used in this post.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Before we start
&lt;/h2&gt;

&lt;p&gt;This is a pandas tutorial, so the first step is to import it by typing the following in your Python runner:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You may have noticed that I didn't just import pandas but I also gave an alias to it. Then, everytime I need to use the pandas library I can refer to it as &lt;code&gt;pd&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reading a file
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is your file extension?&lt;/strong&gt; First of all you need to know what is the file type you'll work on. It's common to have data in .csv (Comma Separated Value) files. But keep in mind that you can work with other file types, like .xls, .xlsx, .txt, .json, .html, and so on.&lt;/p&gt;

&lt;p&gt;To analyse the data, the first step is to import it from the file into a dataframe. To do this is easy with pandas since it provides functions for each file type, as you can see &lt;a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#" rel="noopener noreferrer"&gt;here&lt;/a&gt;. But in this post we're working with .csv files because it is one of the most common or maybe the most common.&lt;/p&gt;

&lt;p&gt;The second important thing to know is &lt;strong&gt;where is your file&lt;/strong&gt;, i.e. you need to know what is the path directory of your file to allow pandas to find it.&lt;/p&gt;

&lt;p&gt;Knowing this, let's &lt;strong&gt;download our working file&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!wget https://raw.githubusercontent.com/gabrielatrindade/blog-posts-pandas-series/master/restaurant_orders.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, open your file and &lt;strong&gt;make observations&lt;/strong&gt; on it, like: Is there a head, i.e. the first line represents the column names? What is the column separator? This two questions are important to import the file into a dataframe. To do that use the &lt;code&gt;read_csv()&lt;/code&gt; function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;restaurant_orders.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delimiter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fpix3uyow8bq7rf2k9xqg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fpix3uyow8bq7rf2k9xqg.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see the file has no head and pandas will considerate the first line a head by default, thus let's add the column names to it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;column_names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;order_number&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;order_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;item_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;quantity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;product_price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;restaurant_orders.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delimiter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;column_names&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F8r8uam2pg2sdrs5d2q26.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F8r8uam2pg2sdrs5d2q26.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now let's store our dataframe into a variable because we'll work a lot with it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;restaurant_orders.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delimiter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;column_names&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, you can see the whole dataframe through the variable we just introduced.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ftnsh69a8u1vgfguh1tfe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ftnsh69a8u1vgfguh1tfe.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fy1ltv5tkwi3f51ew8a4c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fy1ltv5tkwi3f51ew8a4c.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note [2]: Did you notice how this file has so many rows (observations)? To be exact there are 74818 rows. Wow! And we are going to work with this dataset.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Before we move on, pay attention to our dataset and try to understand what it means. It represents orders from Indian takeaway restaurant in London, UK. Each row is a single product within the order, i.e. one order can have more than one row. There are 5 columns: &lt;code&gt;order_number&lt;/code&gt;, &lt;code&gt;order_date&lt;/code&gt;, &lt;code&gt;item_name&lt;/code&gt;, &lt;code&gt;quantity&lt;/code&gt;, and &lt;code&gt;product_price&lt;/code&gt;. I believe that the names are self explanatory, but if you have any questions about the dataset, let me know through the comments.&lt;/p&gt;

&lt;p&gt;You also can verify the type of &lt;code&gt;orders&lt;/code&gt; variable. As we know, it's a DataFrame object.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F1y8zzh0i2hnfwt4ifedw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F1y8zzh0i2hnfwt4ifedw.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Basic DataFrame operations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Printing samples
&lt;/h3&gt;

&lt;h4&gt;
  
  
  head and tail
&lt;/h4&gt;

&lt;p&gt;These functions will print the first and last rows, respectively. They are commonly used to see a sample of Series or DataFrame object. By default they will print five rows, but you can pass a number as an argument.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F588o9odk0adp26mxtd6m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F588o9odk0adp26mxtd6m.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tail&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F4thvg6xwqk9605qmmrg4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F4thvg6xwqk9605qmmrg4.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Sample
&lt;/h4&gt;

&lt;p&gt;It's one more function to have a sample of Series or DataFrame. However you need to pass the number as an argument and then it will sort the rows randomly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fvjtquz8beazlq2gibnc0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fvjtquz8beazlq2gibnc0.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Selecting specific rows and columns
&lt;/h3&gt;

&lt;p&gt;Other ways to get a subset of your dataframe is by selecting a specific set of rows and columns you want.&lt;/p&gt;

&lt;h4&gt;
  
  
  Rows
&lt;/h4&gt;

&lt;p&gt;To select rows you only need to pass the rows interval between brackets. &lt;/p&gt;

&lt;p&gt;It's important to highlight that pandas is based-indexing zero, i.e. the first element is zero which is included, and the last index is excluded.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F8faxuly1ngeujeopbtqp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F8faxuly1ngeujeopbtqp.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But if you don't want to specify the begin or the end of your subset, you can simply omit this information. It will consider the first index or the last index for omitted ones, respectively.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fiyc48npplp7yz6ph9oed.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fiyc48npplp7yz6ph9oed.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Columns
&lt;/h4&gt;

&lt;p&gt;To select columns, if you want to get a DataFrame object, use double brackets.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;item_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fpaxpvorws6thv7jmms1x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fpaxpvorws6thv7jmms1x.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But if your need is to have a Series object, there are two ways to access it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;item_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fmsxcxhwsn7pnwaq3yq0i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fmsxcxhwsn7pnwaq3yq0i.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;item_name&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F84aozz32ike8wi77ertr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F84aozz32ike8wi77ertr.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you'll see at the end of this post, you can chain operations and it will follow the linear logic. So you can do something like this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;item_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]][:&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F7ervtkmyqweop1op6sx4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F7ervtkmyqweop1op6sx4.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But I will explain it in more details in the end.&lt;/p&gt;

&lt;h4&gt;
  
  
  .iloc, .loc
&lt;/h4&gt;

&lt;p&gt;The selection of rows and columns also can be made by some attributes, such as &lt;code&gt;.iloc&lt;/code&gt; and &lt;code&gt;.loc&lt;/code&gt;. The general syntax for these attributes is &lt;code&gt;dataframe.attribute[&amp;lt;row selection&amp;gt;, &amp;lt;column selection&amp;gt;]&lt;/code&gt;. In these cases you can omit the &lt;code&gt;&amp;lt;column selection&amp;gt;&lt;/code&gt;, but you need to inform the row ones. I am going to show you some examples.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;.iloc&lt;/code&gt; is based on position.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fkykka9x1hm70dn0tr19j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fkykka9x1hm70dn0tr19j.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F6q0xa0pfwrxv5nh8w67z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F6q0xa0pfwrxv5nh8w67z.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F3ha9c2r5pyo6ej0t6irb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F3ha9c2r5pyo6ej0t6irb.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note [3]: Negative index will consider from the bottom to the top. Then, in this case it returns the last row. Ah, and it works in the same logic to columns as well.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note [4]: &lt;code&gt;.iloc&lt;/code&gt; returns a pandas Series when one row is selected, and a pandas DataFrame when multiple rows are selected or if any column in full is selected. If you want a DataFrame use double brackets.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;.loc&lt;/code&gt; is based on label. However our label for rows was not changed so by default it is the index.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;order_number&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;item_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fugkdttibth9shidzjumv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fugkdttibth9shidzjumv.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;order_number&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;item_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F9p2oj1obt2l7763wudem.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F9p2oj1obt2l7763wudem.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Filtering by specific values
&lt;/h3&gt;

&lt;p&gt;What if you need to select rows that contain specific values? Let's say we want to select only the order 16118 to see the items of it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_number&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;16118&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fqzjf1dvc82x28r3bnyf7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fqzjf1dvc82x28r3bnyf7.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this way we can see all the items listed for this order. But how it works? First pandas check the condition between brackets &lt;code&gt;orders.order_number == 16118&lt;/code&gt;. If you just compile this part, it returns a pandas Series object with Boolean value where each one is the result of the condition for all rows in the dataframe. Then, in the second part &lt;code&gt;orders[...]&lt;/code&gt; pandas prints just the correspondent rows for values that were True.&lt;/p&gt;

&lt;p&gt;It works like a 'where' clause in SQL.&lt;/p&gt;

&lt;p&gt;One more example is to filter by day.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2019-08-03&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;
       &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2019-08-02&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ft0czd9vbwpysdq6yhdle.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ft0czd9vbwpysdq6yhdle.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fcdzgstbbn3m9svfqtgkp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fcdzgstbbn3m9svfqtgkp.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note [5]: In these examples I'm using the two ways to access a column in a dataframe. As I explained at the beginning of this post, I can access column using brackets or dot notation.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As you probably noticed we can use some logical operators (&lt;code&gt;&amp;amp;&lt;/code&gt;, &lt;code&gt;|&lt;/code&gt; to represent 'and', 'or') as well as comparison operators (&lt;code&gt;==&lt;/code&gt;, &lt;code&gt;!=&lt;/code&gt;, &lt;code&gt;&amp;lt;&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;&lt;/code&gt;, &lt;code&gt;&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;=&lt;/code&gt;, &lt;code&gt;&amp;gt;=&lt;/code&gt;) in our conditions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Adding rows and columns
&lt;/h3&gt;

&lt;p&gt;In order to not compromise our dataframe, for this subsection and the next one, we'll create a copy of our dataframe and then work from it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders_copy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;orders_copy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fr2mcd33868o6rifdnvm7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fr2mcd33868o6rifdnvm7.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note [6]: To create a copy you need to use &lt;code&gt;copy()&lt;/code&gt; function. If you just assign &lt;code&gt;orders&lt;/code&gt; to &lt;code&gt;orders_copy&lt;/code&gt; you are creating a reference to it which means that changing &lt;code&gt;orders_copy&lt;/code&gt; will change &lt;code&gt;orders&lt;/code&gt; too.&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Rows
&lt;/h4&gt;

&lt;p&gt;To add new rows to a dataframe is easy. We can use the &lt;code&gt;append()&lt;/code&gt; function and as an argument we pass a Series list, or Dictionary to represent the row.&lt;/p&gt;

&lt;p&gt;Pay attention about the type of the values, they don't need to match with the data types of each dataframe column, but you don't want to mess up your dataframe and consequently your analysis. So be sure that you are adding the right values in the correct columns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders_copy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orders_copy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;123456&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2019-08-04&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Product Test&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.00&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
              &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;orders_copy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;ignore_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;orders_copy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tail&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fliv4jim8uj99q9yda19t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fliv4jim8uj99q9yda19t.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's also possible to do it by using the &lt;code&gt;.loc&lt;/code&gt; attribute directly. Although it's not considered a good practice since you should rely more on the object API instead of changing directly their internal state.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;12134567&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2019-08-04&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Product Test&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.70&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;orders_copy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orders_copy&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;

&lt;span class="n"&gt;orders_copy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tail&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fmunie02vaiau6iwi08oe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fmunie02vaiau6iwi08oe.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Columns
&lt;/h4&gt;

&lt;p&gt;There are some ways to add columns in a dataframe.&lt;/p&gt;

&lt;p&gt;You can do that using list, but in this case you need to assign for each row a value. Then, in our case we'd need to create a list with 74820 values. In general we do this by assigning a function or an expression to the column. But, if you want to assign the same values for all rows, you can do this just like the example below.&lt;/p&gt;

&lt;p&gt;Let's say that all items had a discount.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders_copy&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;discount_pct&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;

&lt;span class="n"&gt;orders_copy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Flwxdce3aoz4dmehn38yz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Flwxdce3aoz4dmehn38yz.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It doesn't look so useful, does it? But in our case we'll use it to give the discount for each product price.&lt;/p&gt;

&lt;p&gt;So let's see the interesting part. You can use an expression to fill the column. And here we're going to create the column that represents discount through price.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders_copy&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;discount_price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orders_copy&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;product_price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
                                 &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orders_copy&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;discount_pct&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;

&lt;span class="n"&gt;orders_copy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fprsm2w24f3k0z3387tkr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fprsm2w24f3k0z3387tkr.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So all the &lt;code&gt;discount_price&lt;/code&gt; column is representing the discounted value for each product (row).&lt;/p&gt;

&lt;p&gt;You can also add columns using &lt;code&gt;assign()&lt;/code&gt; function.&lt;/p&gt;

&lt;p&gt;Let's create the total discount taking into account the quantity of products.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders_copy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orders_copy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;discount_subtotal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;quantity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;discount_price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;

&lt;span class="n"&gt;orders_copy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fw4kv3sdk6bal82hxbqo9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fw4kv3sdk6bal82hxbqo9.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One more way to do add columns is through the &lt;code&gt;apply()&lt;/code&gt; function. In this case you'll create a column using a function that will be applied for each row. Like the examples above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders_copy&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;subtotal&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orders_copy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;quantity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
                 &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;product_price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;discount_price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])),&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;orders_copy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fxp2q2bnkvchzgtdxsvge.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fxp2q2bnkvchzgtdxsvge.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;axis=1&lt;/code&gt; means that we are working with column.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deleting rows and columns
&lt;/h3&gt;

&lt;p&gt;It's important to evaluate the decision of deleting something before doing it. You can do some filters to check if it's really what you want to delete. Here we are working with the dataframe copy, so we don't need to worry about that.&lt;/p&gt;

&lt;p&gt;To both, rows and columns, we use &lt;code&gt;drop()&lt;/code&gt; function.&lt;/p&gt;

&lt;h4&gt;
  
  
  Row
&lt;/h4&gt;

&lt;p&gt;Let's delete the row that we added. But before, let's check the row index.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders_copy&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;orders_copy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;item_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Product Test&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F919rgceh76qzzofzq8tx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F919rgceh76qzzofzq8tx.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we know that the index is 74818. Let's drop it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders_copy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orders_copy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orders_copy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;74818&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check if it was deleted.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders_copy&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;orders_copy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;74818&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fzqfj7d99srtxnr8if6xd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fzqfj7d99srtxnr8if6xd.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It doesn't exist anymore.&lt;/p&gt;

&lt;h4&gt;
  
  
  Column
&lt;/h4&gt;

&lt;p&gt;In the same way, let's delete the columns that were created.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders_copy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orders_copy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;discount_pct&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;discount_price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;discount_subtotal&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;subtotal&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;axis=1&lt;/code&gt; refers that it's a column, not a row.&lt;/p&gt;

&lt;p&gt;Check it is was deleted.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders_copy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F2m4nl9dgo9wlv19m67gj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F2m4nl9dgo9wlv19m67gj.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Linear logic
&lt;/h3&gt;

&lt;p&gt;The pandas logic is very linear (compared to SQL, for example), you can chain operations one after the other. The input of the latter function is the output of the previous one. Let's see some examples.&lt;/p&gt;

&lt;p&gt;Get the first 3 different products bought at '2019-08-03' . Consider that the dataset is sorted by the orders. Print only the product name.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2019-08-03&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;item_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F9rver5j4vhoh5sjsii7l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F9rver5j4vhoh5sjsii7l.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First we filtered by the orders at '2019-08-03', then we selected just the product name column and finally we printed the first 3 products.&lt;/p&gt;

&lt;p&gt;Get the 3 last different products which was buying. Print the product name and the quantity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;item_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;quantity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nf"&gt;tail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F38x5xipt6i26r4le2qms.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F38x5xipt6i26r4le2qms.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see we selected the right columns and then we applied the tail function to get the last 3. Similarly we would first apply the tail function and then select the columns. The output would be the same.&lt;/p&gt;

&lt;p&gt;And finally, let's print the five last orders that happened on '2019-08-03' and have a 'Plain Papadum' on it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2019-08-03&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;
       &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;item_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Plain Papadum&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)].&lt;/span&gt;&lt;span class="nf"&gt;tail&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fsbs7xhd2pyos1da2ayrr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fsbs7xhd2pyos1da2ayrr.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, in this case we used the &lt;code&gt;&amp;amp;&lt;/code&gt; operator to filter our dataset.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;In this post we learned how to create a dataframe by reading a file. We saw how it is easy with pandas. We saw the main information we need to know to read a file, such as file extension, how it is organized, and the file path directory.&lt;/p&gt;

&lt;p&gt;We also learned about dataframe basic operations. Such as print sample of our dataframe; select, append and delete rows and columns; and filter specific rows through conditions.&lt;/p&gt;

&lt;p&gt;Finally, I showed to you about how to do chained operations in only one line of code, explaining about the pandas linear logic.&lt;/p&gt;

&lt;p&gt;In the next post we'll see about aggregation and grouping. How we can apply some interesting methods, like for example &lt;code&gt;sum()&lt;/code&gt;, &lt;code&gt;mean()&lt;/code&gt;, &lt;code&gt;avg()&lt;/code&gt;, and grouping our dataset by a collection of elements in common. I'm looking forward to show you interesting things we can do with pandas.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Dataset original reference: &lt;a href="https://www.kaggle.com/henslersoftware/19560-indian-takeaway-orders/" rel="noopener noreferrer"&gt;Takeaway Food Orders&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>python</category>
      <category>pandas</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>pandas #1: Getting started with pandas</title>
      <dc:creator>Gabriela Trindade</dc:creator>
      <pubDate>Sun, 13 Oct 2019 18:44:54 +0000</pubDate>
      <link>https://dev.to/gtrindadi/getting-started-with-pandas-3i53</link>
      <guid>https://dev.to/gtrindadi/getting-started-with-pandas-3i53</guid>
      <description>&lt;h2&gt;
  
  
  What is pandas?
&lt;/h2&gt;

&lt;p&gt;Pandas is an open source library in Python which allows us to handle data with two-dimensional data tables. It is built on top of NumPy package, and they are frequently used together. Pandas library depends on NumPy array to implement some of its objects. &lt;/p&gt;

&lt;p&gt;Pandas was originally created by Wes McKinney. On the book "Python for Data Analysis", Wes McKinney says where the name "pandas" comes from:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The pandas name itself is derived from panel data, an econometrics term for multidimensional structured datasets, and a play on the phrase Python data analysis itself.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Pandas contains data structures and data manipulation tools meant to make data cleaning and analysis fast and easy. &lt;/p&gt;

&lt;p&gt;It is one of the most preferred and used tools in data wrangling. If you don't know what it means: data wrangling or munging is the process of transforming and mapping data from one "raw" data form into another format, making it more appropriate for analysis, for example.&lt;/p&gt;

&lt;p&gt;With pandas it is possible to load your data into a dataframe, which means that it aligns your data in a tabular fashion as rows and columns. Additionally you can select subset of data, merge dataframes, group by specific values, run functions, and much more. It's like a SQL but with Python.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why should I use pandas for analysis?
&lt;/h2&gt;

&lt;p&gt;When I started to learn Python and specifically pandas I was wondering why should I use it instead of plain NumPy. I can do several things with NumPy arrays as well, but I found that with pandas I can work with both tabular and heterogeneous data, and it is the biggest difference between pandas and NumPy. NumPy is for working with homogeneous numerical array data. If you intend to deal with data analysis you will have to work with different data types.&lt;/p&gt;

&lt;p&gt;Besides, pandas DataFrame object is easier to work when you compare it to working with lists or dictionaries through loops or lists comprehension.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installing
&lt;/h2&gt;

&lt;p&gt;Run the following commands depending on your operational system.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ubuntu
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nb"&gt;install &lt;/span&gt;python3-pandas
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  MacOS
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# install pip&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;easy_install pip
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--upgrade&lt;/span&gt; pip

&lt;span class="c"&gt;# install pandas with pip&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;pandas
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Trying it out
&lt;/h2&gt;

&lt;p&gt;Once you have pandas installed in your machine, in any Python runner you can try it out by running the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;It works&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note [1]: You can start Python by typing 'python3' on your terminal or you can use another environment to run Python, such as PyCharm and Jupyter Notebook.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note [2]: If you don't have python3, update your python! Python 2.x will be supported until 2020.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Pandas main data structure
&lt;/h2&gt;

&lt;p&gt;Pandas has two types of data structures: Series and DataFrame.&lt;/p&gt;

&lt;p&gt;As you read at the beginning of this post, pandas loads your data into a dataframe composed of rows and columns³, where rows have index and columns have names. Each column is a Series object, so we can conclude that a DataFrame is a collection of Series objects.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F1iukmtog4jvx0xum117a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F1iukmtog4jvx0xum117a.png" alt="pandas data structure"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note [3]: You might see in some articles on the internet that columns are called  'variables', and rows as 'observations'.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Series
&lt;/h3&gt;

&lt;p&gt;A Series is the data structure for a single column of a DataFrame, in other words it is a one-dimensional data structure. They are capable of holding any data type (integers, strings, float, Python objects, etc). Series has elements of a single type.&lt;/p&gt;

&lt;p&gt;The first element in a Series is assigned the index 0, being it a zero-based indexing. Therefore the last element has the index equals N-1, where N represents the total number of elements in the Series.&lt;/p&gt;

&lt;p&gt;The values of a Series are mutable, however the size of a Series is immutable and cannot be changed. So, to change the Series size, you need to change the DataFrame size too; in other words, if you want add a value in a Series you need to add values to all the Series in a DataFrame.&lt;/p&gt;

&lt;h3&gt;
  
  
  DataFrame
&lt;/h3&gt;

&lt;p&gt;DataFrame is a collection of Series, like a two-dimensional data-structure, where each Series can have its own type.&lt;/p&gt;

&lt;p&gt;As Series, the rows in a DataFrame works with zero-based indexing.&lt;/p&gt;

&lt;p&gt;DataFrame is size-mutable which means that elements can be appended or deleted from it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;In this post you learned that pandas is a library in Python which helps us to deal with two-dimensional data. It has a data structure to ease the data manipulation and data analysis, and that's why we can use pandas for analysis.&lt;/p&gt;

&lt;p&gt;We also learned how to install and try it out in your machine.&lt;/p&gt;

&lt;p&gt;And we finish it learning about its data structures, where pandas is composed by DataFrame that contains a collection of Series.&lt;/p&gt;

&lt;p&gt;In the next post I will show you how to read a file in pandas and some of the basics DataFrame operations.&lt;/p&gt;

</description>
      <category>python</category>
      <category>pandas</category>
      <category>beginners</category>
      <category>sql</category>
    </item>
  </channel>
</rss>
