DEV Community: Gabriela Trindade

Creating an Exploratory Data Analysis Report with Pandas-Profiling

Gabriela Trindade — Mon, 11 Jan 2021 17:42:16 +0000

Data scientists and analysts usually spend some time getting to know the data they are going to work with through exploratory analysis. It's one of the first steps in their journey before moving on to further analysis and predictions. For Python users doing exploratory analysis with pandas, methods and attributes such as head, describe, info, columns, shape, isnull, value_counts, unique, duplicated, corr, and so on are a must. Visualization libraries such as seaborn or matplotlib are essential as well.

What if, with just a few lines of code, we were able to get the insights that would otherwise require all of the methods I mentioned before? What if they came in a report with built-in visualizations? Wow, that would save us a lot of time! And in fact, we can do that. Fortunately, pandas-profiling can provide us with a report full of exploratory insights.

pandas-profiling is an open-source Python library that allows us to do exploratory analysis quickly with just a few lines of code. Also, as I mentioned before, this library can generate an interactive report with the variables' distributions, along with other insights commonly obtained from dataframes during exploratory analysis. This report can be saved in HTML format and easily shared with anyone. Awesome, right?!

Now, let's see in practice how it works.

Installing pandas-profiling

You can install it from the command line via pip.

pip install pandas-profiling[notebook]

Generating an Exploratory Data Analysis Report

After installing it, go to your Jupyter Notebook and load the data you want to explore as a DataFrame object. As an example, we can use the Titanic dataset, but feel free to use the data you want. See the code below.

import pandas as pd

url = 'https://raw.githubusercontent.com/gabrielatrindade/ml-playground/master/projects/titanic/dataset/train.csv'
titanic = pd.read_csv(url)

Then, let's import the ProfileReport class to create the report for the dataframe.

from pandas_profiling import ProfileReport

Now, we are able to create the report.

profile = ProfileReport(titanic, explorative=True,
                        title='Titanic Exploratory Analysis')

profile

Setting the explorative parameter to True enables a deeper exploration, and the title parameter sets the report title.

We can see the report as output in the Jupyter Notebook. However, if you want to generate an HTML file to share the analysis with someone, it's also possible. Check the code below.

profile.to_file('output_titanic_report.html')

The report contains a lot of information. Below I list most of its sections.

Overview: we can see some general statistics of the data, information on the report and warnings, that show insights that can highly impact the analysis, such as a high number of null values in a variable, duplicated rows, and high correlation between variables.
Variables: descriptive and quantile statistics for each variable. For continuous variables, it's also possible to see the histogram and the common and extreme values; for categorical data, a pie chart and the frequency of each value.
Interactions: allows us to see the relationship between two variables through the scatter plot visualization.
Correlations: shows the heatmap of Pearson, Spearman, Kendall, and Phik correlation matrix.
Missing values: through a bar chart or matrix visualization it's possible to see the missing values for each variable.
Sample: first 10 rows and last 10 rows are printed.
Duplicate rows: shows the duplicated rows.

In the image below you can see what it looks like.

Pandas-profiling limitation

One limitation I noticed in pandas-profiling shows up with large datasets: as the dataset size increases, the report generation time increases a lot.

One way to solve this problem is to generate the report from a sample of the dataset. If you select only some of the rows, it's important to make sure that they are representative of all the data you have. You can also select only the variables you want to explore.
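As a minimal sketch of the sampling idea, the snippet below builds a hypothetical stand-in for a large dataset (the column names and sizes are made up) and shows two ways to shrink it with plain pandas before profiling. Either resulting dataframe could then be passed to ProfileReport in the same way as titanic.

```python
import pandas as pd

# Hypothetical stand-in for a large dataset; column names are made up.
big = pd.DataFrame({
    'passenger_id': range(1, 1001),
    'fare': [10.0 + (i % 50) for i in range(1000)],
    'pclass': [1 + (i % 3) for i in range(1000)],
})

# Draw 100 random rows; random_state makes the draw reproducible.
rows_sample = big.sample(n=100, random_state=42)

# Alternatively, keep every row but only the variables of interest.
cols_subset = big[['fare', 'pclass']]

print(rows_sample.shape)   # (100, 3)
print(cols_subset.shape)   # (1000, 2)
```

A random sample is only one option. Whether it is representative enough depends on your data, for example on how rare the categories you care about are.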

Another way to deal with this problem is to use the minimal mode (introduced in pandas-profiling version 2.4). This generates a simplified report, which takes less time than the full one.

profile = ProfileReport(titanic, minimal=True,
                        title='Titanic Exploratory Analysis')

Conclusion

I have shown you how easily we can get an exploratory data analysis report using the pandas-profiling library. With a few lines of code, we can generate an interactive report and create an HTML file for it. The ProfileReport class can save a lot of work in the phase of knowing the data and getting some insights into it.

I hope it can be useful for you! See you next time!

pandas #4: Merging

Gabriela Trindade — Mon, 16 Mar 2020 15:56:59 +0000

Now that we've learned about reading files and basic DataFrame operations in the second post, and aggregation and grouping in the third one, it's time to learn about merging.

But what is merge? In simple words, merge (a.k.a. "joining") is a database-style join operation that combines two dataframes by columns or indexes. Let's understand when to use merge and how it works.

When do we need to use the `merge` operation?

As we know, in real-life data projects it is common to have more than one table storing the data. Although the data is spread over multiple tables, the tables can be matched in some way to give us more details and insights about each observation, and with merge we can do that. But why doesn't all this information come in a single table? There are a few reasons why it's better to have multiple data tables: it avoids redundant data, it saves disk space, it makes the data easier to manage, it makes queries faster because the tables are smaller, etc.

How does it work?

pandas provides the merge operation, which allows us to combine DataFrames. It works much like SQL's join operations. So, to bring the information from different tables together into a single table, we can use merge.

Here we're going to practice the merge operation with the dataset we used in the last post, but with modifications (so you need to download this new version, restaurant_orders_v2.csv), plus one new dataset (restaurant_products_price.csv). I strongly recommend checking the Before we start section to download the datasets we are going to work on.

Like the last post, this one will be more practical, so I would appreciate it if you code along with me. I try to keep these posts independent of each other, so here I'm assuming that you'll create a new Jupyter Notebook, which is why I'm going to declare some variables again.

If you have any questions please feel free to use the comments section below.

Note [1]: I made a Jupyter Notebook available on my GitHub with the code used in this post.

Before we start

In order to work with pandas, we're going to import this library in the beginning.
```
 import pandas as pd
```

Let's download the two datasets we are going to work on.
Note [2]: This first file is not the same as my previous post. Download this new file version.

 !wget https://raw.githubusercontent.com/gabrielatrindade/blog-posts-pandas-series/master/restaurant_orders_v2.csv

 !wget https://raw.githubusercontent.com/gabrielatrindade/blog-posts-pandas-series/master/restaurant_products_price.csv

Now, let's turn them into dataframes and store each one in a variable.

 column_names_orders = ['order_number', 'order_date', 'item_name', 'quantity']
 orders_v2 = pd.read_csv('restaurant_orders_v2.csv',
                         delimiter=',', names=column_names_orders)

 column_names_prices = ['item_name', 'product_price']
 products_price = pd.read_csv('restaurant_products_price.csv',
                              delimiter=',', names=column_names_prices)

If you would like to know more details about this process, see the Reading a file section of the second post.

Check these dataframes and try to understand the information that each one has.
```
 orders_v2.head()
```
The orders_v2 dataframe is just a new version of the dataframe we used in the last post. In this second version a large number of rows were deleted to reduce the dataset, and one column was removed, so now we have a dataframe with 4 columns: order_number, order_date, item_name, and quantity. I believe that the orders_v2 column names are self-explanatory, but if you have any questions about the dataset, let me know through the comments.
```
 products_price.head()
```
As you can see there are just two columns in this products_price dataframe, and they represent the product name (item_name) and the product price (product_price), respectively.

Merging Operation

Now that we have discussed the motivation for the merge operation, let's dive deeper into it to learn the different types and how to apply them to the datasets you downloaded.

Overview

Basically, there are 4 types of merge: outer join, inner join, left join and right join. In the image below we can see the purpose of each one.

You pass this choice through the how parameter.

Two other important parameters are right_on and left_on. They name the column from the right DataFrame and the column from the left DataFrame, respectively, whose values will be matched. If you don't pass these parameters, pandas merges on the columns that have the same name in both DataFrames.

Note [3]: In case you don't understand, no worries! It's going to be more clear with the code examples.
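To make left_on and right_on more concrete, here is a small sketch with toy data (not the restaurant files) in which the key column has a different name in each table. The dataframe names and the 'product' column are made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical): the key column is named differently in each table.
orders_demo = pd.DataFrame({
    'order_number': [1, 1, 2],
    'item_name': ['Naan', 'Rice', 'Naan'],
    'quantity': [2, 1, 3],
})
prices_demo = pd.DataFrame({
    'product': ['Naan', 'Rice'],
    'product_price': [2.5, 3.0],
})

# left_on names the key in the left dataframe, right_on the key in the right one.
merged = orders_demo.merge(prices_demo, how='inner',
                           left_on='item_name', right_on='product')
print(merged)
```

Because the two key columns have different names, both of them are kept in the result. In our restaurant datasets the key is called item_name in both tables, so it appears only once.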

Inner: intersection of dataframes

An inner join keeps only the rows (observations) whose keys are present in both dataframes.

By default, the pandas merge operation performs an inner join, so you can omit the how parameter and the result will be the same. And because both dataframes have a column with the same name, pandas uses it as the key and merges on it, so you can omit the key columns too.

Let's merge our two dataframes and keep just the purchases whose item_name is registered in both tables. The two calls below give the same result.

    orders_v2.merge(products_price)

    orders_v2.merge(products_price, how='inner',
                    right_on='item_name', left_on='item_name')

Note [4]: I recommend passing how, right_on, and left_on, so that your code is more explicit.

As you can see, we have just 25 observations. Remember that an inner join keeps only the observations whose key is present in both datasets.

Outer: union of dataframes

An outer join keeps all the rows (observations) from both dataframes, regardless of whether there is a match. When a key is present in only one dataframe, the columns coming from the other dataframe are filled with NaN values.

Let's merge our two dataframes and take all the purchases, regardless of whether the item_name is registered in both tables. Here you will see that we have more observations (36), but some of them appear with NaN values.

    orders_v2.merge(products_price, how='outer',
                    right_on='item_name', left_on='item_name')
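To see where the NaN values of an outer join come from, the sketch below uses toy data (made up, not the restaurant files) and the indicator parameter of merge, which adds a _merge column telling whether each row came from the left dataframe, the right one, or both. It uses the on parameter as a shorthand for when the key has the same name on both sides.

```python
import pandas as pd

# Toy data (hypothetical values).
orders_demo = pd.DataFrame({
    'order_number': [1, 2, 3],
    'item_name': ['Naan', 'Rice', 'Samosa'],   # 'Samosa' has no price
})
prices_demo = pd.DataFrame({
    'item_name': ['Naan', 'Rice', 'Korma'],    # 'Korma' was never ordered
    'product_price': [2.5, 3.0, 8.0],
})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
outer = orders_demo.merge(prices_demo, how='outer', on='item_name',
                          indicator=True)
print(outer)
print(outer['_merge'].value_counts())
```

Rows marked left_only have NaN in product_price, and rows marked right_only have NaN in order_number.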

Right: keys from right dataframe

A right join keeps all the rows (observations) whose keys are in the right dataframe, regardless of whether there is a match. If a key has no match in the left dataframe, the columns coming from the left dataframe are filled with NaN values.

Let's merge our two dataframes and take all the information from the products_price dataframe, regardless of whether the item_name was bought in any order in the orders_v2 dataframe. In this case, you will see that some products appear with their prices but have no orders.

    orders_v2.merge(products_price, how='right',
                    right_on='item_name', left_on='item_name')

Left: keys from left dataframe

A left join keeps all the rows (observations) whose keys are in the left dataframe, regardless of whether there is a match. If a key has no match in the right dataframe, the columns coming from the right dataframe are filled with NaN values.

Let's merge our two dataframes and take all the purchases from the orders_v2 dataframe, regardless of whether the item_name is registered in the products_price dataframe. In this case, we will have some orders with products that have no registered price.

    orders_v2.merge(products_price, how='left',
                    right_on='item_name', left_on='item_name')

`merge` vs `join`

By default, the join method matches on the index of both dataframes. Through its on parameter it can also match a specific column of the dataframe it's called on (the left dataframe) against the index of the other one. This means the key of the left dataframe doesn't have to be its index, but the key of the right dataframe must be. In general, I would say that the join method is index-based.

In contrast, the merge method by default looks for overlapping column names to merge on, unless we set the right_index or left_index parameters to True. merge gives control over the merge keys through the right_on and left_on parameters, as we saw in this post, and it is useful when we don't want to join on the index. Unlike join, when merging on columns it returns a combined dataframe in which the original index is discarded and replaced by a new default one.
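The difference can be sketched with toy data (hypothetical values, not the restaurant files). The same left join is written once with merge, matching on a column, and once with join, where the right dataframe first has to carry the key in its index.

```python
import pandas as pd

# Toy data (hypothetical values).
orders_demo = pd.DataFrame({
    'order_number': [1, 2, 3],
    'item_name': ['Naan', 'Rice', 'Naan'],
})
prices_demo = pd.DataFrame({
    'item_name': ['Naan', 'Rice'],
    'product_price': [2.5, 3.0],
})

# merge: match on a column that exists in both dataframes.
via_merge = orders_demo.merge(prices_demo, how='left', on='item_name')

# join: the key of the right dataframe must be its index.
via_join = orders_demo.join(prices_demo.set_index('item_name'),
                            on='item_name', how='left')

print(via_merge)
print(via_join)
```

Both calls attach the same prices to the same orders. Which one reads better usually depends on whether your keys already live in the index.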

Wrapping up

This was the fourth post of my pandas series, where I showed you that merge is a very useful join operation. We learned about some reasons to use it.

We also saw how it works. Through some practical examples, we saw the basic parameters of the merge operation and what their defaults are.

One of these parameters was how, which sets the merge type. We learned that there are 4 types of merge, which let us keep the set of rows from the joined dataframes that makes the most sense for us. The 4 types are inner (the default), outer, right and left.

The other parameters were right_on and left_on, which name the columns that act as keys in each dataframe.

Finally, we learned a little bit about the difference between merge and join in pandas. And we saw that the biggest difference is that join is based on indexes while merge is based on columns.

In the next post, we'll see more about Data Wrangling. See you there!

pandas #3: Aggregation and Grouping

Gabriela Trindade — Sat, 14 Dec 2019 19:40:01 +0000

Let's continue with the series and learn more about how to analyse data with pandas. In the previous post we learned about reading files and basic DataFrame operations. Now we know how to store our data in a DataFrame object, so in this post we are going to do more with it.

Like the last post, this one will be more practical, so I would appreciate it if you code along with me. I try to keep these posts independent of each other, so here I'm assuming that you'll create a new Jupyter Notebook, which is why I'm going to declare some variables again.

If you have any questions please feel free to use the comments section below.

Note [1]: I made a Jupyter Notebook available on my GitHub with the code used in this post.

Before we start

This is a pandas tutorial, so we always need to import the pandas library at the beginning.
```
 import pandas as pd
```
Let's download the dataset we are going to work on.

Note [2]: The file here is the same as in my previous post, so if you already have it you can reuse it.
```
 !wget https://raw.githubusercontent.com/gabrielatrindade/blog-posts-pandas-series/master/restaurant_orders.csv
```

Now, let's turn it into a dataframe and store it in a variable.

 column_names = ['order_number', 'order_date',
                 'item_name', 'quantity', 'product_price']

 orders = pd.read_csv('restaurant_orders.csv', delimiter=',',
                      names=column_names)

If you would like to know more details about this process, see the Reading a file section of the previous post.

Data Aggregation functions

First, let's look at aggregation functions in more detail. But what does that mean, and what are they for? Aggregation functions apply a function over multiple rows and return one single value. By default, they are applied to each column (Series), so you get a single value per Series. Let me clarify this using our dataframe from the previous post. Do you remember the dataframe structure? Check it again.

orders.head()

And what if I wish to know how many records are in my data? Or how much I have sold since I started storing my data? Or how many products I sold? We can answer these simply by summing all the numbers in a certain column (Series) or counting the number of rows in our dataframe, right?! But how do we do that? By applying aggregation functions, we can get these numbers easily, as well as information like the median, the mean and so on.

sum()

Imagine that we would like to know how many products we sold, like I said before. We can do this by summing all the numbers in the quantity column, right?! Applying the sum function directly to the column gives us the answer.

orders['quantity'].sum()

Note [3]: In the second post of this pandas series we saw how to access a column with pandas. If you go through the previous post (in Basic DataFrame operations >> Selecting specific rows and columns >> Columns) you can see that there are 3 ways to do that. You can compare the solution above with orders.quantity.sum() or orders[['quantity']].sum().

Note [4]: If you don't specify a column it will return the sum for each column of the dataframe.

In the same way, we can find out how much I sold. But in this case we need to multiply the quantity and product_price columns before applying the sum function.

(orders['quantity']*orders['product_price']).sum()

Note [5]: We use parentheses so that the multiplication is evaluated before the method call.

count()

Now, let's answer the question of how many records we have. To do this I need to count the number of rows in my dataframe, so the count function is appropriate. This time we apply the function to the whole dataframe.

orders.count()

However, it returns the count of records for each column.

As we can see, each column has the same value. This happens because no column has missing values. Missing values (None or NaN) would not be counted.

But if we want a single value that represents the number of records, we should choose a column that has no missing values and apply the count function to it.

orders['order_number'].count()
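A small sketch with made-up data shows why the choice of column matters: count() skips missing values, while len() counts every row regardless of its content.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): two of the four prices are missing.
demo = pd.DataFrame({
    'order_number': [1, 2, 3, 4],
    'product_price': [2.5, np.nan, 3.0, None],
})

print(demo.count())                    # non-missing values per column
print(demo['product_price'].count())   # 2
print(len(demo))                       # 4
```

If all you need is the number of rows, len(demo) (or demo.shape[0]) avoids having to pick a column without missing values.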

min() and max()

And what about the lowest product price I have recorded in my orders? Or the highest one? In these cases we are applying min and max functions, respectively, on the right Series (product_price).

orders['product_price'].min()

orders['product_price'].max()

mean() and median()

mean and median functions return us the column average and the column median, respectively. Below I briefly explain the difference between the two functions:

mean() is the average of a set of numbers. To get it, add all the numbers and then divide by how many numbers you added. The mean is not a robust measure since it is strongly influenced by outliers.
median() is the middle value in a list of numbers sorted in ascending order. If the list has an even number of elements, i.e. there is no single middle number, you add the two numbers in the middle and divide by 2 to get the median. The median is better suited to skewed distributions as a measure of central tendency since it is much less sensitive to outliers.

Now, let's get the mean() and median() of quantity column.

orders['quantity'].mean()

orders['quantity'].median()
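To illustrate the difference in robustness, here is a minimal sketch with made-up quantities. Replacing a single value with an outlier moves the mean a lot while the median stays put.

```python
import pandas as pd

# Toy quantities (hypothetical values).
quantities = pd.Series([1, 2, 2, 3, 2])
with_outlier = pd.Series([1, 2, 2, 3, 200])

print(quantities.mean(), quantities.median())       # 2.0 2.0
print(with_outlier.mean(), with_outlier.median())   # 41.6 2.0

# Even number of elements: the median is the average of the two middle values.
print(pd.Series([1, 2, 3, 4]).median())             # 2.5
```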

There are more aggregation functions, for example to calculate the standard deviation (.std()), variance (.var()), standard error of the mean (.sem()), and descriptive statistics (.describe()). Older pandas versions also offered the mean absolute deviation (.mad()), which was removed in pandas 2.0.

I would like to add that through the describe function we can get a lot of information at once, including values already mentioned in this post, such as count, mean, median (shown as the 50% row) and std. But I'll show it in more detail in a Data Wrangling post.

Grouping

Segmentation is part of a data scientist's work. It means separating the dataset into groups of elements that have something in common, or that make sense together.

Let's take as an example our dataframe. In our dataframe we have records of products that were sold, right?! And it means that we have one or more records per order. So, what if we would like to know how many items were bought in each order? We should treat each order as a group and then apply sum aggregation function. Let's do it.

orders.groupby('order_number')[['quantity']].sum()

Note [6]: If I use just one pair of brackets it will return a Series object instead of a DataFrame object. You can read more about this in the previous post, in Basic DataFrame operations >> Selecting specific rows and columns >> Columns.

You can also put the column selection at the end.

orders.groupby('order_number').sum()[['quantity']]

It won't make any difference in our result.
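A quick check on toy data (made-up values) confirms that both orderings give the same numbers. The sketch passes numeric_only=True in the second variant, which is my addition and not part of the original call, so that text columns are skipped explicitly regardless of the pandas version.

```python
import pandas as pd

# Toy data (hypothetical values).
demo = pd.DataFrame({
    'order_number': [1, 1, 2, 2, 2],
    'item_name': ['Naan', 'Rice', 'Naan', 'Korma', 'Rice'],
    'quantity': [2, 1, 3, 1, 1],
    'product_price': [2.5, 3.0, 2.5, 8.0, 3.0],
})

# Select the column first, then aggregate only that column.
select_first = demo.groupby('order_number')[['quantity']].sum()

# Aggregate every numeric column, then select the one we want.
select_last = demo.groupby('order_number').sum(numeric_only=True)[['quantity']]

print(select_first)
print(select_first.equals(select_last))
```

The result is the same, but selecting the column first means pandas only has to aggregate that one column, which can matter on a large dataframe.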

So... what about how many orders I have by day? It's another way that we can group our dataframe.

orders.groupby('order_date')[['order_number']].count()

Simple, isn't it?

Maybe you have noticed that a groupby call is usually followed by an aggregation function. If we just group a dataframe without specifying the aggregation function, it returns a DataFrameGroupBy object (pandas.core.groupby.generic.DataFrameGroupBy).

orders.groupby('order_number')

To practice more, let's get these values:

The order with the highest number of different products.

(orders.groupby('order_number')[['item_name']]
       .count()
       .sort_values('item_name', ascending=False)
       .head(1))

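More details: the sort_values function sorts the result by the item_name column in descending order (ascending=False), then we keep just the first result (head(1)). Note that count() counts the rows of each order, so this equals the number of different products only if no product appears twice in the same order; nunique() would count distinct products.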
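The difference between counting rows and counting distinct products can be seen in a small sketch with made-up data, where one product appears twice in the same order.

```python
import pandas as pd

# Toy data (hypothetical): 'Naan' appears twice in order 1.
demo = pd.DataFrame({
    'order_number': [1, 1, 1, 2, 2],
    'item_name': ['Naan', 'Naan', 'Rice', 'Korma', 'Rice'],
})

rows_per_order = demo.groupby('order_number')['item_name'].count()
distinct_per_order = demo.groupby('order_number')['item_name'].nunique()

print(rows_per_order.to_dict())      # {1: 3, 2: 2}
print(distinct_per_order.to_dict())  # {1: 2, 2: 2}
```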

The 7 days that had the lowest quantity of products sold.

(orders.groupby('order_date')[['quantity']]
       .sum()
       .sort_values('quantity', ascending=True)
       .head(7))

More details: Note that we changed to ascending order (ascending=True) because we want the days that have the lowest quantity of products sold.

The 5 most expensive orders.

Before getting the next 3 answers, let's create a new dataframe with a subtotal column, which represents the product_price multiplied by the quantity. To avoid modifying our original dataframe, we can create a copy.

If you want to know more details, check the previous post (in Basic DataFrame operations >> Adding rows and columns >> Columns).

orders_copy = orders.copy()

orders_copy['subtotal'] = (orders_copy['product_price'] *
                           (orders_copy['quantity']))

Now, get the 5 most expensive orders.

(orders_copy.groupby('order_number')[['subtotal']]
            .sum()
            .sort_values('subtotal', ascending=False)
            .head(5))

The 3 cheapest orders.

(orders_copy.groupby('order_number')[['subtotal']]
            .sum()
            .sort_values('subtotal', ascending=True)
            .head(3))

The 2 days that had the biggest income.

(orders_copy.groupby('order_date')[['subtotal']]
            .sum()
            .sort_values('subtotal', ascending=False)
            .head(2))
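As an alternative to chaining sort_values and head, pandas also offers nlargest and nsmallest. The sketch below uses made-up data to show that both approaches pick the same orders.

```python
import pandas as pd

# Toy data (hypothetical values).
demo = pd.DataFrame({
    'order_number': [1, 1, 2, 3, 3],
    'quantity': [2, 1, 3, 1, 4],
    'product_price': [2.5, 3.0, 2.5, 8.0, 1.0],
})
demo['subtotal'] = demo['product_price'] * demo['quantity']

totals = demo.groupby('order_number')[['subtotal']].sum()

# Two most expensive orders, written in two equivalent ways.
top_sorted = totals.sort_values('subtotal', ascending=False).head(2)
top_nlargest = totals.nlargest(2, 'subtotal')

# Cheapest order.
cheapest = totals.nsmallest(1, 'subtotal')

print(top_sorted)
print(top_nlargest)
print(cheapest)
```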

Wrapping up

This was the third post of the pandas series. Here we learned about aggregation functions and grouping, and as you saw in some examples, they are very useful and help us discover new insights from our dataset.

Aggregation functions are used to apply specific functions in multiple rows resulting in one single value. And grouping is a way to gather elements (rows) that make sense when they are together. It's very common that we use groupby followed by an aggregation function.

I hope you enjoyed it and you found it clear. If you have any question, please let me know in the comments below.

In the next post we'll learn about merging.

pandas #2: Reading files and basic DataFrame operations

Gabriela Trindade — Fri, 25 Oct 2019 20:05:29 +0000

The first step in doing data analysis in Python is to import your data from a certain source (in my first post we saw that pandas is great for data analysis). In this post I'm going to show you how to load files into a pandas data structure (a dataframe). Then we'll check how we can print the whole dataframe or a sample of the data, filter specific values, and select specific columns and rows, as well as add and delete them. At the end we'll check the logical sequence of pandas operations.

This second post is different from the first one. In the first one we learned the theoretical stuff. On the other hand, this post will be more practical so I would appreciate if you code along with me. If you have any questions please feel free to use the comments section below.

Note [1]: I made a Jupyter Notebook available on my GitHub with the code used in this post.

Before we start

This is a pandas tutorial, so the first step is to import it by typing the following in your Python runner:

import pandas as pd

You may have noticed that I didn't just import pandas but also gave it an alias. So, every time I need to use the pandas library I can refer to it as pd.

Reading a file

What is your file extension? First of all, you need to know the file type you'll work with. It's common to have data in .csv (comma-separated values) files. But keep in mind that you can work with other file types, like .xls, .xlsx, .txt, .json, .html, and so on.

To analyse the data, the first step is to import it from the file into a dataframe. This is easy with pandas since it provides a reader function for each file type, as you can see here. In this post we're working with .csv files because they are among the most common, if not the most common.

The second important thing to know is where your file is, i.e. you need to know its path so that pandas can find it.

Knowing this, let's download our working file.

!wget https://raw.githubusercontent.com/gabrielatrindade/blog-posts-pandas-series/master/restaurant_orders.csv

Now, open your file and make some observations about it, like: Is there a header, i.e. does the first line contain the column names? What is the column separator? These two questions are important when importing the file into a dataframe. To do that, use the read_csv() function:

pd.read_csv('restaurant_orders.csv', delimiter=',')

As you can see, the file has no header, and pandas treats the first line as the header by default, so let's pass the column names ourselves.

column_names = ['order_number', 'order_date', 'item_name',
                'quantity', 'product_price']

pd.read_csv('restaurant_orders.csv', delimiter=',', names=column_names)
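The effect of the names argument can be reproduced without the file. The sketch below reads two made-up rows, laid out like the restaurant file but not taken from it, from an in-memory string.

```python
import io
import pandas as pd

# Two made-up rows in the same layout as the file, with no header line.
raw = "16118,2019-08-03,Plain Papadum,2,0.8\n16118,2019-08-03,Naan,1,2.5\n"

column_names = ['order_number', 'order_date', 'item_name',
                'quantity', 'product_price']

# Without names, the first data row is consumed as the header.
without_names = pd.read_csv(io.StringIO(raw), delimiter=',')

# With names, every line is treated as data.
with_names = pd.read_csv(io.StringIO(raw), delimiter=',', names=column_names)

print(list(without_names.columns))
print(len(without_names))  # 1
print(len(with_names))     # 2
```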

Now let's store our dataframe into a variable because we'll work a lot with it.

orders = pd.read_csv('restaurant_orders.csv', delimiter=',',
                     names=column_names)

Now, you can see the whole dataframe through the variable we just introduced.

orders

Note [2]: Did you notice how this file has so many rows (observations)? To be exact there are 74818 rows. Wow! And we are going to work with this dataset.

Before we move on, pay attention to our dataset and try to understand what it means. It represents orders from an Indian takeaway restaurant in London, UK. Each row is a single product within an order, i.e. one order can have more than one row. There are 5 columns: order_number, order_date, item_name, quantity, and product_price. I believe that the names are self-explanatory, but if you have any questions about the dataset, let me know through the comments.

You also can verify the type of orders variable. As we know, it's a DataFrame object.

type(orders)

Basic DataFrame operations

Printing samples

head and tail

These functions will print the first and last rows, respectively. They are commonly used to see a sample of Series or DataFrame object. By default they will print five rows, but you can pass a number as an argument.

orders.head()

orders.tail()

Sample

It's one more function to get a sample of a Series or DataFrame. It picks rows at random; by default it returns a single row, so pass the number of rows you want as an argument.

orders.sample(5)

Selecting specific rows and columns

Another way to get a subset of your dataframe is to select the specific set of rows and columns you want.

Rows

To select rows you only need to pass the row interval between brackets.

It's important to highlight that pandas uses zero-based indexing, i.e. the first element has index zero, and in an interval the start index is included while the end index is excluded.

orders[0:7]

But if you don't want to specify the beginning or the end of your subset, you can simply omit it. pandas will use the first index or the last index, respectively, for the omitted one.

orders[:6]

Columns

To select columns, if you want to get a DataFrame object, use double brackets.

orders[['item_name']]

But if your need is to have a Series object, there are two ways to access it.

orders['item_name']

orders.item_name

As you'll see at the end of this post, you can chain operations and it will follow the linear logic. So you can do something like this.

orders[['item_name']][:6]

But I will explain it in more detail at the end.

.iloc, .loc

The selection of rows and columns can also be made with some indexers, such as .iloc and .loc. The general syntax is dataframe.attribute[<row selection>, <column selection>]. You can omit the <column selection>, but you must always provide the row selection (use : to select all rows). I am going to show you some examples.

.iloc is based on position.

orders.iloc[0:7]

orders.iloc[:, 2:4]

orders.iloc[-1]

Note [3]: A negative index counts from the bottom to the top, so in this case it returns the last row. And the same logic applies to columns as well.

Note [4]: .iloc returns a pandas Series when a single row (or a single column) is selected with an integer, and a pandas DataFrame when multiple rows or columns are selected. If you want a DataFrame for a single row, pass a list, i.e. use double brackets, as in orders.iloc[[-1]].

.loc is based on labels. Since we didn't change our row labels, they are the default integer index. Note that, unlike .iloc, a .loc slice includes the end label, so 0:7 below returns 8 rows.

orders.loc[0:7, ['order_number', 'item_name']]

orders.loc[[1, 3, 5], ['order_number', 'item_name']]
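The two indexers differ in how they treat the end of a slice and in what they return for a single row. A small sketch on made-up data (not the orders file) shows both points.

```python
import pandas as pd

# Toy data (hypothetical) with the default integer index 0..4.
demo = pd.DataFrame({
    'order_number': [10, 11, 12, 13, 14],
    'item_name': ['Naan', 'Rice', 'Korma', 'Samosa', 'Dal'],
})

by_position = demo.iloc[0:3]   # positions 0, 1, 2 (end excluded)
by_label = demo.loc[0:3]       # labels 0, 1, 2, 3 (end included)

print(len(by_position))  # 3
print(len(by_label))     # 4

last_as_series = demo.iloc[-1]     # a single integer returns a Series
last_as_frame = demo.iloc[[-1]]    # a list of positions returns a DataFrame
print(type(last_as_series).__name__, type(last_as_frame).__name__)
```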

Filtering by specific values

What if you need to select rows that contain specific values? Let's say we want to select only the order 16118 to see the items of it.

orders[orders.order_number == 16118]

In this way we can see all the items listed for this order. But how does it work? First, pandas evaluates the condition between brackets, orders.order_number == 16118. If you run just this part, it returns a pandas Series of Boolean values, one per row of the dataframe, each holding the result of the condition for that row. Then, in the second part, orders[...], pandas returns just the rows for which the value was True.

It works like a 'where' clause in SQL.

One more example is to filter by day.

orders[(orders.order_date == '2019-08-03') |
       (orders.order_date == '2019-08-02')]

Note [5]: In these examples I'm using dot notation to access a column in a dataframe. As I explained in the Columns subsection, I can access a column using either brackets or dot notation.

As you probably noticed, we can use logical operators (& and | to represent 'and' and 'or') as well as comparison operators (==, !=, <, >, <=, >=) in our conditions. Each condition must be wrapped in parentheses, because & and | are evaluated before the comparisons otherwise.
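The sketch below, on made-up data, shows the Boolean Series that a condition produces, how two conditions are combined with &, and isin() as a shorter alternative to chaining several == comparisons with |.

```python
import pandas as pd

# Toy data (hypothetical values).
demo = pd.DataFrame({
    'order_number': [1, 1, 2, 3],
    'order_date': ['2019-08-02', '2019-08-02', '2019-08-03', '2019-08-04'],
    'quantity': [2, 1, 3, 1],
})

# A condition on a column yields one True/False value per row.
mask = demo['order_date'] == '2019-08-02'
print(mask.tolist())   # [True, True, False, False]

# 'and': both conditions must hold; note the parentheses around each one.
both = demo[(demo['order_date'] == '2019-08-02') & (demo['quantity'] > 1)]

# isin() replaces a chain of == comparisons joined by |.
either = demo[demo['order_date'].isin(['2019-08-02', '2019-08-03'])]

print(len(both), len(either))   # 1 3
```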

Adding rows and columns

In order to not compromise our dataframe, for this subsection and the next one, we'll create a copy of our dataframe and then work from it.

orders_copy = orders.copy()

orders_copy.head()

Note [6]: To create a copy you need to use copy() function. If you just assign orders to orders_copy you are creating a reference to it which means that changing orders_copy will change orders too.
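The difference between a reference and a copy can be seen in a minimal sketch (the names alias and independent are made up for illustration).

```python
import pandas as pd

original = pd.DataFrame({'quantity': [1, 2, 3]})

alias = original                # same object, just another name
independent = original.copy()   # separate object with its own data

alias.loc[0, 'quantity'] = 100        # also changes original
independent.loc[1, 'quantity'] = 200  # does not change original

print(alias is original)                 # True
print(original['quantity'].tolist())     # [100, 2, 3]
print(independent['quantity'].tolist())  # [1, 200, 3]
```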

Rows

Adding new rows to a dataframe is easy. Older pandas versions offered an append() method for this, but it was deprecated in pandas 1.4 and removed in 2.0, so here we build a one-row DataFrame and attach it with pd.concat().

Pay attention to the types of the values. They don't need to match the data types of each dataframe column, but you don't want to mess up your dataframe and consequently your analysis. So be sure that you are adding the right values to the correct columns.

# Build a one-row dataframe with the same columns, then concatenate it.
new_row = pd.DataFrame(
    [[123456, '2019-08-04', 'Product Test', 4, 1.00]],
    columns=orders_copy.columns)
orders_copy = pd.concat([orders_copy, new_row], ignore_index=True)

orders_copy.tail()

It's also possible to add a row by assigning to the .loc attribute directly, although this is not considered good practice, since you should rely on the object's API rather than directly changing its internal state.

row = [12134567, '2019-08-04', 'Product Test', 4, 1.70]
orders_copy.loc[len(orders_copy)] = row

orders_copy.tail()

Columns

There are several ways to add columns to a dataframe.

You can do that using a list, but in that case you need to provide a value for each row, which here would mean creating a list with 74820 values. In general we do this by assigning a function or an expression to the column. But if you want to assign the same value to all rows, you can do it just like in the example below.

Let's say that all items had a discount.

orders_copy['discount_pct'] = 10

orders_copy.head()

It doesn't look so useful, does it? But in our case we'll use it to compute the discount for each product price.

So let's see the interesting part. You can use an expression to fill the column. Here we're going to create the column that holds the discount amount for each price.

orders_copy['discount_price'] = (orders_copy['product_price'] *
                                 (orders_copy['discount_pct'])) / 100

orders_copy.head()

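So the discount_price column holds the discount amount for each product (row).

You can also add columns using the assign() method. When you pass it a function, that function receives the whole dataframe and must return the values for the new column.

Let's create the total discount, taking into account the quantity of products.

# the lambda receives the whole dataframe, not a single row
orders_copy = orders_copy.assign(
    discount_subtotal=lambda df: (df['quantity'] * df['discount_price']))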

orders_copy.head()
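One detail of assign() that is easy to miss: it returns a new dataframe and leaves the original untouched, which is why the result has to be assigned back. The sketch below shows this on a made-up dataframe (`toy`); the helper name `discount_subtotal` is also just for illustration.

```python
import pandas as pd

# Hypothetical toy data, only to illustrate assign().
toy = pd.DataFrame({
    'quantity': [2, 3],
    'discount_price': [0.5, 1.0],
})

def discount_subtotal(df):
    # `df` is the whole dataframe that assign() was called on.
    return df['quantity'] * df['discount_price']

result = toy.assign(discount_subtotal=discount_subtotal)

print(result)
print('discount_subtotal' in toy.columns)   # the original has no new column
```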

One more way to add columns is through the apply() method. In this case you create the column using a function that is applied to each row.

orders_copy['subtotal'] = orders_copy.apply(
    lambda row: (row['quantity'] *
                 (row['product_price'] - row['discount_price'])), axis=1)

orders_copy.head()

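Here axis=1 means that the function is applied to each row (it runs across the columns). With the default, axis=0, it would be applied to each column instead.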
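For simple arithmetic like this, the same column can usually be computed with a plain column expression, without apply(). The sketch below compares both on a made-up dataframe (`toy`, invented values); which one to prefer is a judgment call, though the column expression tends to be faster on large dataframes.

```python
import pandas as pd

# Hypothetical toy data, only to compare the two approaches.
toy = pd.DataFrame({
    'quantity': [2, 1, 3],
    'product_price': [10.0, 4.0, 2.0],
    'discount_price': [1.0, 0.4, 0.2],
})

# Row by row: the lambda receives one row (a Series) at a time.
row_wise = toy.apply(
    lambda row: row['quantity'] *
                (row['product_price'] - row['discount_price']),
    axis=1)

# Whole columns at once: the expression operates on entire Series.
vectorized = toy['quantity'] * (toy['product_price'] - toy['discount_price'])

print(row_wise)
print(vectorized)
```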

Deleting rows and columns

It's important to think twice before deleting something. You can apply some filters first to check that it's really what you want to delete. Here we are working with the dataframe copy, so we don't need to worry about that.

For both rows and columns, we use the drop() method.

Row

Let's delete the row that we added. But first, let's check the row index.

orders_copy[orders_copy.item_name == 'Product Test']

Now we know that the index is 74818. Let's drop it.

orders_copy = orders_copy.drop(orders_copy.index[74818])

Check if it was deleted.

orders_copy[orders_copy.index == 74818]

It doesn't exist anymore.

Column

In the same way, let's delete the columns that were created.

orders_copy = orders_copy.drop(['discount_pct', 'discount_price',
                                'discount_subtotal', 'subtotal'], axis=1)

Here axis=1 indicates that we are dropping columns, not rows.

Check that they were deleted.

orders_copy.head()
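As an alternative to the axis argument, drop() also accepts the index and columns keywords, which some people find easier to read. A minimal sketch on a made-up dataframe (`toy`, invented values):

```python
import pandas as pd

# Hypothetical toy data, only to illustrate drop().
toy = pd.DataFrame({
    'item_name': ['Naan', 'Rice', 'Product Test'],
    'quantity': [1, 2, 4],
    'discount_pct': [10, 10, 10],
})

# Equivalent to toy.drop(['discount_pct'], axis=1).
without_column = toy.drop(columns=['discount_pct'])

# Equivalent to toy.drop(2): removes the row whose index label is 2.
without_row = toy.drop(index=2)

print(without_column)
print(without_row)
```

In both cases drop() returns a new dataframe, so `toy` itself still has all its rows and columns.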

Linear logic

The pandas logic is very linear (compared to SQL, for example): you can chain operations one after the other, and the input of each operation is the output of the previous one. Let's see some examples.

Get the first 3 products bought on '2019-08-03'. Consider that the dataset is sorted by order. Print only the product name.

orders[orders.order_date == '2019-08-03']['item_name'].head(3)

First we filtered the orders from '2019-08-03', then we selected just the product name column, and finally we printed the first 3 products.

Get the last 3 products that were bought. Print the product name and the quantity.

orders[['item_name','quantity']].tail(3)

As you can see, we selected the right columns and then applied the tail function to get the last 3. Alternatively, we could first apply the tail function and then select the columns. The output would be the same.
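That the two orders of operations give the same result can be checked with a short sketch on a made-up dataframe (`toy`, invented values):

```python
import pandas as pd

# Hypothetical toy data, only to compare the two chains.
toy = pd.DataFrame({
    'order_number': [1, 1, 2, 3, 3],
    'item_name': ['Naan', 'Rice', 'Korma', 'Papadum', 'Rice'],
    'quantity': [1, 2, 1, 4, 1],
})

columns_then_tail = toy[['item_name', 'quantity']].tail(3)
tail_then_columns = toy.tail(3)[['item_name', 'quantity']]

print(columns_then_tail.equals(tail_then_columns))
```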

And finally, let's print the last five orders that happened on '2019-08-03' and include a 'Plain Papadum'.

orders[(orders.order_date == '2019-08-03') &
       (orders.item_name == 'Plain Papadum')].tail()

As you can see, in this case we used the & operator to filter our dataset.

Wrapping up

In this post we learned how to create a dataframe by reading a file, and saw how easy it is with pandas. We saw the main information we need to know to read a file, such as the file extension, how the file is organized, and its path.

We also learned about basic dataframe operations, such as printing a sample of the dataframe; selecting, adding and deleting rows and columns; and filtering specific rows through conditions.

Finally, I showed you how to chain operations in a single line of code, which illustrates the pandas linear logic.

In the next post we'll look at aggregation and grouping: how we can apply some interesting methods, such as sum(), count() and mean(), and group our dataset by elements in common. I'm looking forward to showing you the interesting things we can do with pandas.

Dataset original reference: Takeaway Food Orders

pandas #1: Getting started with pandas

Gabriela Trindade — Sun, 13 Oct 2019 18:44:54 +0000

What is pandas?

Pandas is an open source Python library that lets us handle data in two-dimensional tables. It is built on top of the NumPy package, and the two are frequently used together: pandas relies on NumPy arrays to implement some of its objects.

Pandas was originally created by Wes McKinney. In the book "Python for Data Analysis", he explains where the name "pandas" comes from:

The pandas name itself is derived from panel data, an econometrics term for multidimensional structured datasets, and a play on the phrase Python data analysis itself.

Pandas contains data structures and data manipulation tools meant to make data cleaning and analysis fast and easy.

It is one of the most preferred and used tools in data wrangling. If you don't know what it means: data wrangling or munging is the process of transforming and mapping data from one "raw" data form into another format, making it more appropriate for analysis, for example.

With pandas it is possible to load your data into a dataframe, which means that it arranges your data in a tabular fashion, as rows and columns. Additionally you can select subsets of the data, merge dataframes, group by specific values, run functions, and much more. It's like SQL, but with Python.

Why should I use pandas for analysis?

When I started to learn Python, and specifically pandas, I wondered why I should use it instead of plain NumPy. I can do several things with NumPy arrays as well, but I found that with pandas I can work with tabular and heterogeneous data, and that is the biggest difference between the two. NumPy is meant for homogeneous numerical array data, whereas data analysis usually means working with different data types.

Besides, a pandas DataFrame is easier to work with than lists or dictionaries processed through loops or list comprehensions.
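A quick way to see the homogeneous versus heterogeneous difference is to put mixed values into each structure. This is only a sketch with invented values; the names `mixed` and `frame` are made up for the illustration.

```python
import numpy as np
import pandas as pd

# A NumPy array has a single dtype, so mixed values are coerced to one
# common type (here, every element becomes a string).
mixed = np.array([1, 'Naan', 2.5])

# A DataFrame keeps one dtype per column instead.
frame = pd.DataFrame({
    'quantity': [1, 2],
    'item_name': ['Naan', 'Rice'],
    'price': [2.5, 3.0],
})

print(mixed.dtype)
print(frame.dtypes)
```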

Installing

Run the following commands depending on your operating system.

Ubuntu

$ sudo apt-get install python3-pandas

MacOS

# install (or upgrade) pip
$ python3 -m ensurepip --upgrade
$ python3 -m pip install --upgrade pip

# install pandas with pip
$ python3 -m pip install pandas

Trying it out

Once you have pandas installed on your machine, you can try it out in any Python environment by running the following code:

import pandas as pd

df = pd.DataFrame({'result': ['It works']})
df

Note [1]: You can start Python by typing 'python3' on your terminal or you can use another environment to run Python, such as PyCharm and Jupyter Notebook.

Note [2]: If you don't have python3, update your python! Python 2.x will be supported until 2020.

Pandas main data structure

Pandas has two types of data structures: Series and DataFrame.

As you read at the beginning of this post, pandas loads your data into a dataframe composed of rows and columns³, where rows have an index and columns have names. Each column is a Series object, so we can conclude that a DataFrame is a collection of Series objects.

Note [3]: You might see in some articles on the internet that columns are called 'variables', and rows as 'observations'.

Series

A Series is the data structure for a single column of a DataFrame; in other words, it is a one-dimensional data structure. A Series can hold any data type (integers, strings, floats, Python objects, etc.), but all the elements of a given Series share a single data type (dtype).

By default, the first element in a Series is assigned the index 0, that is, indexing is zero-based. Therefore the last element has the index N-1, where N is the total number of elements in the Series.

The values of a Series are mutable. When a Series is a column of a DataFrame, however, its length is tied to the DataFrame's: to add a value to one column, you need to add a row, which adds a value to every column of the DataFrame.
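These properties can be inspected directly on a small Series. The sketch below uses made-up values and the hypothetical name `prices`.

```python
import pandas as pd

# Hypothetical values, only to inspect the basic Series properties.
prices = pd.Series([2.5, 3.0, 4.75])

print(prices.index.tolist())   # default labels start at 0
print(prices.dtype)            # one dtype for the whole Series

# The last element sits at position N-1.
last = prices.iloc[len(prices) - 1]

# Values are mutable: an existing element can be overwritten.
prices.iloc[0] = 2.0

print(last, prices.tolist())
```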

DataFrame

A DataFrame is a collection of Series, forming a two-dimensional data structure in which each Series can have its own type.

As with a Series, the rows of a DataFrame use zero-based indexing by default.

A DataFrame is size-mutable, which means that rows and columns can be added to it or deleted from it.
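A short sketch can show both ideas, the DataFrame as a collection of Series and its mutable size. The dataframe `frame` and its values are made up for the illustration.

```python
import pandas as pd

# Hypothetical toy data, only to illustrate the DataFrame structure.
frame = pd.DataFrame({
    'item_name': ['Naan', 'Rice'],
    'quantity': [1, 2],
    'price': [2.5, 3.0],
})

# Selecting one column gives back a Series.
column = frame['quantity']
print(type(column).__name__)

# Growing: add a column computed from two others.
frame['subtotal'] = frame['quantity'] * frame['price']

# Shrinking: remove the row with index label 0.
frame = frame.drop(index=0)

print(frame.shape)
```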

Wrapping up

In this post you learned that pandas is a Python library that helps us deal with two-dimensional data. Its data structures make data manipulation and analysis easier, and that's why we can use pandas for analysis.

We also learned how to install it and try it out on your machine.

And we finished by learning about its data structures: the DataFrame, which contains a collection of Series.

In the next post I will show you how to read a file in pandas and some of the basic DataFrame operations.
