If you are a data science enthusiast who wants to work in data analytics or machine learning and you are wondering where to start, the first thing to learn is how to read and manipulate a dataset. When you work on a data analytics or machine learning problem, you will most likely be given a set of data (probably an Excel sheet), or you might collect data from hardware, a survey or some other source. When I first started working in this field, I had a hard time keeping track of the most common and widely used dataset manipulation commands. In this article, I would like to share some of my most used commands from Python’s ‘Pandas’ library. The dataset I’ve used for the examples is taken from Kaggle (https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009). I ran my code in Google Colab, which you can use by visiting https://colab.research.google.com/notebooks/intro.ipynb#recent=true. You need to create a new notebook to write your code blocks.
First, upload your dataset to Google Colab. To do so, write:
from google.colab import files
uploaded = files.upload()
You will get a button to select the .csv file from your computer. Once the file is uploaded, check that its name is unchanged, because uploading the same file more than once in the same session changes the name of the uploaded copy.
Now that your file is uploaded, you need to read the dataset. You'll use the 'Pandas' library to read the .csv file, importing it under the alias 'pd'. CSV stands for Comma Separated Values; this format stores data as a table (like a spreadsheet), with rows and columns. We therefore need a two-dimensional data structure to hold the data from the .csv file. The most common two-dimensional data structure in Pandas is the DataFrame. Here we read the .csv file and store its contents in a dataframe named df.
import pandas as pd
df = pd.read_csv("winequality-red.csv")
df
This is what your data looks like. You can find the total number of rows and columns in the bottom left corner of your output. There is another way to find the dimensions of your dataset:
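The command for this step seems to be missing here; it is most likely the `shape` attribute. A minimal sketch, assuming a small stand-in dataframe with made-up values in place of the wine data (in the notebook, `df` is the dataframe you read above):

```python
import pandas as pd

# Stand-in for the wine dataframe; the values are made up for illustration.
df = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 11.2],
    "pH": [3.51, 3.20, 3.16],
    "quality": [5, 5, 6],
})

# shape is an attribute (no parentheses) holding (rows, columns)
n_rows, n_cols = df.shape
print(df.shape)
```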
The output is (1599, 12), where the numbers are the rows and columns respectively. Since there are several columns, you may want to know what type of data each one holds: integers, floating-point numbers or text. To check that, write:
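The original snippet is not shown here, but the `dtypes` attribute is the usual way to do this. A small sketch on a stand-in dataframe (made-up values, not the real wine data):

```python
import pandas as pd

# Stand-in for the wine dataframe; the values are made up for illustration.
df = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 11.2],
    "pH": [3.51, 3.20, 3.16],
    "quality": [5, 5, 6],
})

# dtypes lists the data type of every column
column_types = df.dtypes
print(column_types)
```

Here the first two columns hold floating-point numbers and `quality` holds integers.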
You can see statistical summaries (count, mean, standard deviation, minimum, maximum, and the 25th, 50th and 75th percentiles) for each column separately using the command:
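The command is not shown here; `describe()` produces exactly this kind of summary, so it is presumably what was meant. A sketch on a stand-in dataframe with made-up values:

```python
import pandas as pd

# Stand-in for the wine dataframe; the values are made up for illustration.
df = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 11.2],
    "pH": [3.51, 3.20, 3.16],
    "quality": [5, 5, 6],
})

# describe() returns one summary column per numeric column of df
summary = df.describe()
print(summary)
```

The rows of the result are labelled count, mean, std, min, 25%, 50%, 75% and max.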
You might have already noticed that not all the rows are shown in the output. Only the first and last rows are shown, and the middle ones are replaced by '...'. Viewing every row can be too much at times, and you might want to view only a few rows to check that your code is working. For example, to see only the first five rows of data:
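The snippet for this step is missing here; `head()` is the standard call and returns five rows by default. A sketch, assuming a hypothetical 20-row stand-in dataframe instead of the wine data:

```python
import pandas as pd

# Hypothetical stand-in for the wine dataframe: 20 rows of made-up values.
df = pd.DataFrame({
    "alcohol": [9.0 + 0.1 * i for i in range(20)],
    "quality": [5, 6] * 10,
})

# head() returns the first 5 rows when called without an argument
first_five = df.head()
print(first_five)
```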
Similarly, if you want to see only the last few rows of your dataset:
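The matching call is presumably `tail()`, which also returns five rows by default. A sketch on the same kind of hypothetical 20-row stand-in dataframe:

```python
import pandas as pd

# Hypothetical stand-in for the wine dataframe: 20 rows of made-up values.
df = pd.DataFrame({
    "alcohol": [9.0 + 0.1 * i for i in range(20)],
    "quality": [5, 6] * 10,
})

# tail() returns the last 5 rows when called without an argument
last_five = df.tail()
print(last_five)
```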
What if you want to see the first 8 rows?
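Judging by the mention of a colon in the explanation that follows, the original snippet was probably a slice such as `df[:8]`. A sketch on a hypothetical 20-row stand-in dataframe:

```python
import pandas as pd

# Hypothetical stand-in for the wine dataframe: 20 rows of made-up values.
df = pd.DataFrame({
    "alcohol": [9.0 + 0.1 * i for i in range(20)],
    "quality": [5, 6] * 10,
})

# Slice from the start up to (but not including) position 8
first_eight = df[:8]
print(first_eight)
```

`df.head(8)` gives the same rows.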
The number after the colon indicates how many rows you want to see, counting from the first row (in this case, rows 0 to 7). To see the last 8 rows, you need rows 1591 to 1598, since the index starts at 0 and the dataset has 1599 rows. To do so:
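The original snippet is not shown; for the 1599-row wine data a slice like `df[1591:1599]` would do it. The sketch below computes the start position from the length instead, on a hypothetical 20-row stand-in dataframe:

```python
import pandas as pd

# Hypothetical stand-in for the wine dataframe: 20 rows of made-up values.
df = pd.DataFrame({
    "alcohol": [9.0 + 0.1 * i for i in range(20)],
    "quality": [5, 6] * 10,
})

# Position of the first of the last 8 rows (1591 for a 1599-row dataset)
start = len(df) - 8
last_eight = df[start:len(df)]
print(last_eight)
```

If you don't want to work out the row numbers yourself, `df[-8:]` or `df.tail(8)` selects the same rows.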
If you want to see all the rows of the dataset at once instead of the '...', do this:
pd.set_option('display.max_rows', None)
df
This will allow you to see all the rows within a scrollable field.
You can also transpose your dataframe, that is, turn the rows into columns and the columns into rows. To do so, write:
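The transpose command itself is missing at this point; it is most likely the `T` attribute, which also appears further down. A sketch on a small stand-in dataframe with made-up values:

```python
import pandas as pd

# Stand-in for the wine dataframe; the values are made up for illustration.
df = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 11.2, 7.4],
    "pH": [3.51, 3.20, 3.16, 3.51],
    "quality": [5, 5, 6, 5],
})

# T swaps rows and columns: 4 rows x 3 columns becomes 3 rows x 4 columns
transposed = df.T
print(transposed)
```

The column names of `df` become the row labels of the transposed dataframe.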
Here, you cannot see all the columns and the middle ones are replaced with '...' again. To change it, you can write:
pd.set_option('display.max_columns', None)
df.T
Now that you know the basic display commands in Pandas, you are ready to dive into dataset manipulation techniques.