DEV Community

Victor Alando

Statistics with Python

We can calculate all of these statistics with Python using the NumPy package. We will use NumPy more later for manipulating arrays, but for now we only need a few of its statistical functions: mean, median, percentile, std, and var.

import numpy as np

Let's initialize the variable data to have the list of ages.

data = [15, 16, 18, 19, 22, 24, 29, 30, 34]

Now we can use the NumPy functions. For the mean, median, standard deviation, and variance functions, we just pass in the data list. For the percentile function, we pass the data list and the percentile (as a number between 0 and 100).
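
To make the percentile argument a little more concrete, here is a minimal sketch that recomputes a percentile by hand and compares it with np.percentile. The helper percentile_by_hand is a hypothetical name introduced only for this illustration, and the sketch assumes NumPy's default linear interpolation between neighbouring values.

```python
import numpy as np

data = [15, 16, 18, 19, 22, 24, 29, 30, 34]


def percentile_by_hand(values, q):
    """Hypothetical helper: percentile with linear interpolation, q in [0, 100]."""
    ordered = sorted(values)
    # Fractional position of the requested percentile in the sorted list
    position = (q / 100) * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    # Interpolate between the two neighbouring values
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


for q in (25, 50, 75):
    print(q, percentile_by_hand(data, q), np.percentile(data, q))
```

With these nine ages, the 50th percentile lands exactly on the middle value, which is why it matches the median.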

Make sure you have Anaconda Navigator installed. You can download it from https://www.anaconda.com/. After installation, select Jupyter Lab. The Jupyter Lab screen appears as pictured below.

# Age array
data = [15, 16, 18, 18, 19, 22, 24, 29, 30, 34]

# Import the NumPy library
import numpy as np

print("mean:", np.mean(data))
print("median:", np.median(data))
print("50th percentile (median):", np.percentile(data, 50))
print("25th Percentile:", np.percentile(data, 25))
print("75th percentile:", np.percentile(data, 75))
print("Standard Deviation:", np.std(data))
print("Variance:", np.var(data))

NumPy is a Python library that allows fast and easy mathematical operations to be performed on arrays.
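
As a small illustration of those array operations, the sketch below turns the age list from the script above into a NumPy array and recomputes the variance step by step. The variable names (ages, deviations, population_var, and so on) are our own. One detail worth knowing: np.var and np.std use the population formula by default (dividing by n); passing ddof=1 gives the sample version (dividing by n - 1).

```python
import numpy as np

# Ages from the script above, stored as a NumPy array
ages = np.array([15, 16, 18, 18, 19, 22, 24, 29, 30, 34])

# Arithmetic applies to every element at once, no loop needed
deviations = ages - ages.mean()

# Population variance: the mean of the squared deviations
population_var = (deviations ** 2).mean()
population_std = np.sqrt(population_var)

# Sample versions divide by n - 1 instead of n
sample_var = np.var(ages, ddof=1)
sample_std = np.std(ages, ddof=1)

print("variance by hand:", population_var)
print("np.var:", np.var(ages))
print("population std:", population_std)
print("sample std (ddof=1):", sample_std)
```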

Reading Data with Pandas

What is Pandas?

This course is in Python, one of the most commonly used languages for Machine Learning.

One of the reasons it is so popular is that there are numerous helpful Python modules for working with data. The first one we will introduce is called Pandas.

Pandas is a Python module that helps us read and manipulate data. What's cool about pandas is that you can take in data and view it as a table that's human readable, but it can also be interpreted numerically so that you can do lots of computations with it.

We call the table of data a DataFrame.
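
Here is a minimal sketch of a DataFrame, built by hand from four passenger rows of the Titanic sample shown later in this post (only some of the columns are included). The variable name passengers is our own. It shows both sides described above: the data prints as a readable table, and its columns can be used directly in calculations.

```python
import pandas as pd

# Four passengers, one dictionary key per column
passengers = pd.DataFrame({
    "Survived": [0, 1, 1, 1],
    "Pclass": [3, 1, 3, 1],
    "Sex": ["male", "female", "female", "female"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})

# Human-readable table
print(passengers)

# Numeric computation on a single column
print("Average age:", passengers["Age"].mean())

# (rows, columns)
print("Shape:", passengers.shape)
```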

Python will satisfy all of our Machine Learning needs. We'll use the Pandas module for data manipulation.

Reading in Your Data

We need to start by importing Pandas. It's standard practice to nickname it pd so that it's faster to type later on.

import pandas as pd

We'll be working with a dataset of Titanic passengers. For each passenger, we'll have some data about them as well as whether or not they survived the sinking.

Our data is stored as a CSV (Comma Separated Values) file. The beginning of the Titanic.csv file is shown below. The first line is the header, and each subsequent line is the data for a single passenger.

Survived, Pclass, Sex, Age, Siblings/Spouses, Parents/Children, Fare
0, 3, male, 22.0, 1, 0, 7.25
1, 1, female, 38.0, 1, 0, 71.2833
1, 3, female, 26.0, 0, 0, 7.925
1, 1, female, 35.0, 1, 0, 53.1

We're going to pull the data into pandas so we can view it as a DataFrame.

The read_csv function takes a file in CSV format and converts it into a Pandas DataFrame.

df = pd.read_csv("Titanic.csv")

The object df is now our Pandas DataFrame with the Titanic dataset. Now we can use the head method to look at the data.

print(df.head())

Run this code to see the results

import pandas as pd
df = pd.read_csv("Titanic.csv")
print(df.head())

Generally, data is stored in CSV (Comma Separated Values) files, which we can easily read with the pandas read_csv function. The head method returns the first 5 rows.
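
If you do not have Titanic.csv on disk yet, the sketch below is one way to try read_csv and head anyway. It assumes we paste the sample rows from this post into a string and wrap it in io.StringIO, which read_csv accepts in place of a file path. The names csv_text and sample_df are our own.

```python
import io

import pandas as pd

# The sample rows from this post, as plain CSV text
csv_text = """Survived,Pclass,Sex,Age,Siblings/Spouses,Parents/Children,Fare
0,3,male,22.0,1,0,7.25
1,1,female,38.0,1,0,71.2833
1,3,female,26.0,0,0,7.925
1,1,female,35.0,1,0,53.1
"""

# read_csv accepts a file-like object as well as a file name
sample_df = pd.read_csv(io.StringIO(csv_text))

# head takes an optional number of rows; the default is 5
print(sample_df.head(2))

# Each column gets a data type inferred from its values
print(sample_df.dtypes)
```

Passing a number to head is handy when even five rows are more than you want to look at.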

Summarize the Data

Usually our data is much too big for us to display all of it. Looking at the first few rows is the first step to understanding our data, but then we want to look at some summary statistics.

In pandas, we can use the describe method. It returns a table of statistics about the columns.

print(df.describe())

We add a line in the code below to force pandas to display all 6 columns. Without that line, the output is abbreviated.

import pandas as pd

# Show up to 6 columns so the describe() output is not truncated
pd.options.display.max_columns = 6
df = pd.read_csv("Titanic.csv")
print(df.describe())
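
To see what describe computes without needing the full file, here is a minimal sketch on a hand-built table of four passengers taken from the sample rows earlier in this post. The names sample_df and summary are our own. By default describe covers only the numeric columns, so the text column Sex is left out, and the std row uses the sample formula (dividing by n - 1), unlike np.std's default.

```python
import pandas as pd

sample_df = pd.DataFrame({
    "Survived": [0, 1, 1, 1],
    "Sex": ["male", "female", "female", "female"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})

# describe returns a new DataFrame of summary statistics
summary = sample_df.describe()

# option_context changes the display setting only inside the with block
with pd.option_context("display.max_columns", None):
    print(summary)

# Individual statistics can be looked up by row and column label
print("Mean age:", summary.loc["mean", "Age"])
```

Using option_context instead of assigning to pd.options keeps the setting change local to one print call, which can be convenient in a longer notebook.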
