Pandas #1: Working with data in Python and its main structures

Python and R are the mainstream languages of Data Science, and one of the major strengths of both is a huge community whose members are willing to help each other. In the “Pythonic World” there is a basic stack of libraries that is considered mandatory for anyone who wants to manipulate data with speed, elegance and efficiency.

One of the most important is Pandas:

Pandas has been developed as an open source project since 2008 and for years has been saving working time for programmers and data analysts / scientists. But what makes Pandas so incredible? The heart of Pandas is its ability to manipulate data structures and to provide tools for complex data analysis. You can find instructions to install Pandas here (note: Pandas depends on NumPy).

Here is a useful overview if you want to begin working with Pandas:

Basic Structures

Pandas has two main objects: Series and DataFrame.

SERIES

A Series is a one-dimensional labeled array that can hold any data type you want (int, float, list, …). The labels are called indexes and are useful to explicitly locate any item and manipulate it. I’ll create a simple list and use it to create an example of a Pandas series:

import pandas as pd
s = pd.Series(["a", "b", "c", "d"])
print(s)

>>> 0  a
    1  b
    2  c
    3  d
    dtype: object
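
To make the “any data type” part more concrete, here is a small sketch of how Pandas picks the dtype from the values you pass. The variable names (ints, floats, mixed) are mine, just for illustration:

```python
import pandas as pd

ints = pd.Series([1, 2, 3])            # only integers
floats = pd.Series([1.5, 2.0, 3.25])   # only floats
mixed = pd.Series([1, "a", 2.5])       # mixed values fall back to the generic object dtype

print(ints.dtype)    # int64
print(floats.dtype)  # float64
print(mixed.dtype)   # object
```

A series of mixed values still works, but it loses the speed of a numeric dtype, so it is usually worth keeping one kind of data per series.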

As you can see, using the Series() constructor I created a series whose items are indexed from 0 to 3, but I know what you’re thinking: “Big deal..this is exactly what a traditional Python list does!”. Yes, but what if you want more intuitive names for the items? Well, you can change the indexes by passing a list of names to the series while creating it:

s = pd.Series(["a", "b", "c", "d"], index=["first", "second", "third", "fourth"])

or by assigning to the .index attribute:

s.index = ["first", "second", "third", "fourth"]
print(s)

>>> first  a
    second b
    third  c
    fourth d
    dtype: object
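
One detail worth knowing: the list of names must have exactly one label per item. A minimal sketch of what happens otherwise (the mismatch_rejected flag is only there for illustration):

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "d"])

try:
    # Only three labels for four values, so Pandas refuses the assignment
    s.index = ["first", "second", "third"]
except ValueError as err:
    print("rejected:", err)
    mismatch_rejected = True
else:
    mismatch_rejected = False

# The original labels are untouched after the failed assignment
print(list(s.index))  # [0, 1, 2, 3]
```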

Another way to create customized indexes is to pass a dictionary to Series() instead of a list. The keys become the indexes of the series:

s = pd.Series({"first": "a", "second": "b", "third": "c", "fourth": "d"})
print(s)

>>> first  a
    second b
    third  c
    fourth d
    dtype: object

And the data can be accessed as an attribute or, as in a dictionary, by key:

print(s.second)
>>> b

print(s['second'])
>>> b
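
Keep in mind that the attribute style only works when the label is a valid Python name that doesn’t clash with an existing Series attribute, so the bracket style is the safer habit. A short sketch of a few other label-based lookups that follow the same dictionary-like idea:

```python
import pandas as pd

s = pd.Series({"first": "a", "second": "b", "third": "c", "fourth": "d"})

print(s.loc["second"])        # b     (explicit label-based lookup)
print("second" in s)          # True  (membership is tested against the labels)
print("b" in s)               # False (values are not checked by "in")
print(s.get("fifth", "n/a"))  # n/a   (.get returns a default instead of raising KeyError)
```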

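Changing the indexes to names that are easy to remember is convenient, but that’s all it does. If you want to keep accessing the items by their original position, you can still do it with iloc (a plain s[0] relies on a positional fallback that recent Pandas versions deprecate when the labels are not integers):

print(s)
>>> first  a
    second b
    third  c
    fourth d
    dtype: object
print(s.iloc[0])
>>> a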
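
Position-based access also accepts negative positions and slices, just like a Python list. A quick sketch:

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "d"], index=["first", "second", "third", "fourth"])

print(s.iloc[0])          # a
print(s.iloc[-1])         # d  (last item)
print(list(s.iloc[1:3]))  # ['b', 'c']  (the end position is excluded)
```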

DATAFRAME

A DataFrame is a two-dimensional labeled data structure with columns and rows, like a spreadsheet or a table. Just like a series, a dataframe can hold any kind of data. An important thing to highlight is that every column in a dataframe is a Pandas series. So, a dataframe is an association of many series as columns! I’ll create a simple dataframe using the DataFrame() constructor.

import pandas as pd

df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
print(df)
>>>    0  1  2
    0  1  2  3
    1  4  5  6
    2  7  8  9
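
If you want to check for yourself that each column really is a series, a small sketch like this does the job (first_column is just a name I picked):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# With the default labels, 0 is the label of the first column
first_column = df[0]

print(type(first_column).__name__)  # Series
print(first_column.tolist())        # [1, 4, 7]
print(list(first_column.index))     # [0, 1, 2]  (the column shares the row labels)
```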

Note that the dataframe is created by passing a list of lists to the constructor, where each inner list becomes a row and, just like in a series, each row and column has its own index. We can rename them in the same way:

df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]],index=["first","second","third"], columns=["col1","col2","col3"])
print(df)

>>>       col1 col2 col3
    first    1    2    3
    second   4    5    6
    third    7    8    9

Accessing a column directly by its number only works if you don’t give names to the columns. If you try to do this with the named dataframe above, the interpreter will raise an error:

print(df[0])
>>> KeyError: 0

Using its name directly, you can access any column:

print(df["col2"])

>>> first     2
    second    5
    third     8
    Name: col2, dtype: int64

Well…and what if you want to access rows or even one specific value? There are two easy ways to access rows and columns: the attributes loc and iloc.

How does loc work?

With loc you pass the row and column coordinates, using their respective names, to access a value:

print(df.loc["third","col3"])

>>> 9
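
loc is not limited to single values. As a rough sketch (the names row and block are mine), you can also pull a whole row or a rectangular piece of the dataframe by label:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    index=["first", "second", "third"],
    columns=["col1", "col2", "col3"],
)

# A single row label returns that row as a Series indexed by the column names
row = df.loc["second"]
print(row.tolist())  # [4, 5, 6]

# A label slice includes BOTH endpoints, unlike a normal Python slice
block = df.loc["first":"second", ["col1", "col3"]]
print(block.values.tolist())  # [[1, 3], [4, 6]]
```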

Sometimes you have unnamed rows. In these cases the rows are automatically indexed with numbers and you can use those numbers inside loc.

df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=["col1", "col2", "col3"])
print(df)

>>>  col1 col2 col3
    0   1    2    3
    1   4    5    6
    2   7    8    9

print(df.loc[2,"col3"])

>>> 9
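
A word of caution: even when the row labels are numbers, loc still treats them as labels, not as positions. The sketch below uses a made-up index of 10, 20, 30 to make the difference visible:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    index=[10, 20, 30],  # hypothetical numeric labels, chosen for illustration
    columns=["col1", "col2", "col3"],
)

print(df.loc[20, "col3"])  # 6  (the row LABELED 20, which is the second row)

try:
    df.loc[0, "col3"]      # there is no row labeled 0
    label_zero_found = True
except KeyError:
    print("no row labeled 0")
    label_zero_found = False
```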

Previously I said that you couldn’t use numeric positions directly to access data in a named dataframe, but iloc helps us in this case!!

How does iloc work?

This attribute works the same way as loc, except that you pass integer positions instead of names:

df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=["col1", "col2", "col3"])
print(df)

>>>  col1 col2 col3
    0   1    2    3
    1   4    5    6
    2   7    8    9

print(df.iloc[0, 0])

>>> 1
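
Since iloc works with positions, it follows the usual Python rules for negative numbers and slices. A short sketch:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=["col1", "col2", "col3"])

print(df.iloc[-1, -1])           # 9  (last row, last column)
print(df.iloc[0:2, 1].tolist())  # [2, 5]  (rows 0 and 1 of the second column, end excluded)
print(df.iloc[:, 0].tolist())    # [1, 4, 7]  (every row of the first column)
```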

Alternative!!!

In addition to loc and iloc, there is another attribute called at. This attribute works like loc, but it accesses exactly one item, a single data position, while loc can access a range of rows and columns.

df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=["col1", "col2", "col3"])
print(df)

>>>  col1 col2 col3
    0   1    2    3
    1   4    5    6
    2   7    8    9

print(df.at[2,"col3"])

>>> 9
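
Because at points to a single cell, it is also a handy way to change one value. The sketch below shows that, together with iat, the position-based sibling of at:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=["col1", "col2", "col3"])

# Overwrite exactly one cell, addressed by row label and column name
df.at[2, "col3"] = 90
print(df.at[2, "col3"])  # 90

# iat does the same job with integer positions instead of labels
print(df.iat[0, 0])      # 1
```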