Elvis Mburu

Python 101 - Python for Data Science

Python is a high-level, general purpose programming language.
Python is dynamically typed.
The language is object-oriented and supports functional programming too.
Guido van Rossum began developing Python in the late 1980s and first released it in 1991.

Since there are many programming languages, let's look at why Python may be the best fit for you:

Advantages of Python

  • Simplicity: Python's syntax is simple to read and hence simple to understand.
  • Free and open source: Python is developed and improved by a diverse and vibrant community.
  • Interpreted language: Python code is run by an interpreter rather than compiled ahead of time into a standalone executable. When an error occurs at runtime, execution stops and the error is reported.
  • Extensive library: Python has an extensive library of packages and methods, so you rarely need to code common functions from scratch.
  • Dynamically typed: You do not declare a variable's type; it is determined at runtime from the value assigned.
  • Portability: Code developed on one machine runs on another machine, including those with a different architecture.
  • Large, supportive and vibrant community.

Applications of Python
Python is now used across many fields and domains.

Here are a few of them:

  • Web applications
  • Automation
  • Artificial Intelligence
  • Statistics
  • Data Analysis
  • Machine Learning
  • Desktop Applications
  • Back-end Development

Deep Dive Into Python

We'll focus on Python for Data Science.
But first, we'll build our Python muscles by understanding the basics.

Outline

  • Introduction to variables
  • Data types in Python
  • Operators in Python
  • Data Structures
  • Control Flows
  • Functions
  • Packages
  • Data Science

Setting up Coding Environment
You can use the following tools:

  1. Jupyter notebooks: install Python and then Anaconda, which is available for Windows, Mac OS and Linux OS.
  2. Google Colab: an online environment for running Python code.

1. Introduction to variables

What are variables? You might ask.
A variable is a name that refers to a value, and the value it refers to can change.

Remember this from your O and A levels:
let x be 12, or even y = mx + c.
In this case, x and y are variables that refer to (represent) something else.

A useful property of variables is that they can be reused and can refer to a different value each time.
Example:

x = 2 
x = 4
x = 8
# or, in the case of y = mx + c, where c is a constant (here c = 3)
y = 43 + 3
y = 45 + 3

In Python, variables work much like they do in Mathematics.
They are used to refer to various values.

Example

x = 56
y = 45.34
hello = "Hello, world"

There are various rules governing variable naming.
Here are a few:

  • Variable names cannot be the same as Python keywords
  • Variable names can only contain letters, digits or underscores
  • Variable names can only start with a letter or an underscore
  • Variable names cannot contain spaces
  • Variable names are case-sensitive, thus myName and MyName are regarded as different variable names
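
The naming rules above can be checked from Python itself. The following is a minimal sketch; the helper name is_valid_variable_name and the sample names are made up for illustration. It relies on two standard tools: the string method isidentifier() and the keyword module.

```python
import keyword

def is_valid_variable_name(name):
    """Return True if `name` can be used as a Python variable name."""
    # isidentifier() checks the character rules: letters, digits and
    # underscores only, no leading digit, no spaces.
    # keyword.iskeyword() rejects reserved words such as `for`.
    return name.isidentifier() and not keyword.iskeyword(name)

# Hypothetical sample names, chosen to exercise each rule
candidates = ["my_name", "_total", "2fast", "my name", "for", "MyName"]
results = {name: is_valid_variable_name(name) for name in candidates}

for name, ok in results.items():
    print(f"{name!r}: {ok}")
```

Here "2fast" fails because it starts with a digit, "my name" because it contains a space, and "for" because it is a keyword.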

Here's a link to the official guide

2. Data Types

A data type is a classification that specifies what kind of value a variable holds.
Here are a few of the data types supported in Python:

  • String: A sequence of characters (letters, digits or symbols) that is always treated as text
  • Boolean: True or False values
  • Integer: Numeric values that do not have fractions/decimals
  • Float: Numeric values that have fractions
    Example in code

    num1 = 1 # Integer
    num2 = 2.0 # Float
    bool1 = True # Boolean True
    bool2 = False # Boolean False
    myStr = "Hello, world" # String

In the above code example you may have noticed something new that we have not talked about: the # character.
This character is used to denote a comment.
What is a comment?
A comment is an explanation/annotation in the source code of a computer program.
Comments are added to make the code easier to understand and are ignored by the interpreter, hence not executed.
A comment in Python starts at the # character and runs to the end of that line.
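
To see data types in action, here is a small sketch using the built-in type() and isinstance() functions. The variable names are illustrative only. It also shows what "dynamically typed" means in practice: the same name can refer to values of different types over time.

```python
value = 1                  # starts out referring to an integer
print(type(value))         # <class 'int'>

value = 2.0                # the same name now refers to a float
print(type(value))         # <class 'float'>

value = "Hello, world"     # ...and now to a string
print(type(value))         # <class 'str'>

flag = True
print(type(flag))          # <class 'bool'>

# isinstance() answers "is this value of that type?" with a Boolean
is_text = isinstance(value, str)
print(is_text)             # True
```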

3. Operators in Python

We'll look at two groups of operators in Python:

  • Arithmetic Operators
  • Conditional Operators

a. Arithmetic Operators
They perform basic mathematical operations.
Here's a simple list:

  • + Addition: x + y
  • - Subtraction: x - y
  • * Multiplication: x * y
  • / Division: x / y
  • % Modulus: x % y
  • ** Exponentiation: x ** y
  • // Floor Division: x // y
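
As a quick illustration of the arithmetic operators listed above, the sketch below applies each one to the same pair of numbers. The values 7 and 2 are arbitrary.

```python
x = 7
y = 2

total = x + y         # 9
difference = x - y    # 5
product = x * y       # 14
quotient = x / y      # 3.5 (division always returns a float)
remainder = x % y     # 1   (what is left after dividing 7 by 2)
power = x ** y        # 49  (7 raised to the power 2)
floored = x // y      # 3   (the quotient rounded down to a whole number)

print(total, difference, product, quotient, remainder, power, floored)
```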

b. Conditional Operators
They are used in conditional statements and evaluate to True or False. They comprise the logical operators (and, or, not) and the comparison operators.
Examples:

  • and Logical AND: x and y is true if both operands are true
  • or Logical OR: x or y is true if at least one of the operands is true
  • not Logical NOT: not x is true if x is false, and vice versa
  • > Greater than: x > y is True if the left operand is greater than the right one
  • < Less than: x < y is True if the left operand is less than the right one
  • >= Greater than or equal to: x >= y
  • <= Less than or equal to: x <= y
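
Here is a short sketch of these operators at work. The variables age and has_id are invented for the example. The == (equal to) operator is not in the list above; it is included here as an addition because it is used so often.

```python
age = 20
has_id = True

is_adult = age >= 18                   # True
is_teen = age > 12 and age < 20        # False, because 20 < 20 is False
can_enter = is_adult and has_id        # True, both operands are True
needs_check = not has_id or age < 18   # False, neither side is True
is_exactly_twenty = age == 20          # True

print(is_adult, is_teen, can_enter, needs_check, is_exactly_twenty)
```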

4. Data Structures

Data structures are ways of organizing data so that it can be accessed efficiently, depending on the situation.

Here's a list of some of the main data structures in Python.

  • Lists
  • Dictionaries
  • Sets
  • Tuples

a. Lists
A list is a data structure used to hold multiple items in one variable. Lists are created using [] brackets.
Example:
fruits = [] # Here we create an empty list
names = ['John', 'Doe'] # Here we create a list containing two items

Lists are ordered and their items can be accessed by what we call indexing.
In Python the first index is always 0.
So in order to access an item in a list we use:
list_name[index]
for example:

fruits = ['apple', 'mango', 'melon', 'orange'] # a list containing 4 items
fruits[0] # accessing the first item 'apple' from the list
fruits[1] # accessing the second item 'mango' from the list

Some list methods and manipulation

Slicing
Refers to retrieving items from a specified portion of a list.
Examples:

fruits = ['apple', 'mango', 'melon', 'orange']
fruits[:] # retrieving every item in the list
fruits[0:2] # retrieving items from index 0 up to, but not including, index 2: ['apple', 'mango']
fruits[-1] # negative indexing, retrieving the last item 'orange'

len()
The function returns the length of the list.
Example:

fruits = ['apple', 'mango', 'melon', 'orange']
print(len(fruits)) # prints 4 which is the number of elements in the list fruits

type()
Returns the data type of an object.
Example:
print(type(fruits)) # prints <class 'list'>

Lists are mutable, which means that they can be modified.
Thus:

  • you can add items to a list
  • you can remove an item from a list
  • you can change the list items

Examples:

fruits = ['apple', 'mango', 'melon', 'orange']
fruits.append('guava') # adding 'guava' at the end of the list
fruits.insert(1, 'passion') # inserting 'passion' at index 1 of the list
fruits.pop() # removing the last item in the list ('guava')
fruits.remove('apple') # removing 'apple' from the list
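
Changing an existing list item is done by assigning to its index. The sketch below, with an arbitrary replacement value 'kiwi', shows that alongside a few of the methods already covered, with the resulting list noted in comments.

```python
fruits = ['apple', 'mango', 'melon', 'orange']

fruits[1] = 'kiwi'          # replace the item at index 1
# fruits is now ['apple', 'kiwi', 'melon', 'orange']

fruits.append('guava')      # add 'guava' to the end
removed = fruits.pop()      # remove and return the last item ('guava')
fruits.remove('apple')      # remove the first occurrence of 'apple'

print(fruits)               # ['kiwi', 'melon', 'orange']
print(removed)              # guava
```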

b. Tuples
Tuples are used to store multiple items in a single variable.
Tuples are immutable: once created, their items cannot be changed, added or removed.
They are written with () brackets.

Example
thisTuple = ('apple', 'banana', 'berry') # creating a tuple named 'thisTuple' with three items

  • Tuples are ordered
  • Tuples are immutable
  • Tuples allow duplicates
  • Tuples can contain different data types
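
The four properties above can be checked with a short sketch. The tuple contents are arbitrary. Trying to assign to a tuple item raises a TypeError, which is caught here so the script keeps running.

```python
this_tuple = ('apple', 'banana', 'berry', 'apple')   # duplicates are allowed

first = this_tuple[0]                      # 'apple' (ordered, so indexing works)
apple_count = this_tuple.count('apple')    # 2

was_rejected = False
try:
    this_tuple[0] = 'mango'                # tuples cannot be modified in place
except TypeError as error:
    was_rejected = True
    print("Cannot modify a tuple:", error)

mixed = (1, 'two', 3.0, True)              # different data types in one tuple
print(len(mixed))                          # 4
```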

type()
Returns the data type, which for a tuple is tuple.

mytuple = (1, 2, 3, 4)
print(type(mytuple)) # prints <class 'tuple'>

c. Set

A set is a collection that is unordered and unindexed.
Its items cannot be changed in place, but you can add and remove items.
Sets do not allow duplicate members.

names = {'one', 'two', 'three'}
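
A brief sketch of how sets behave, with made-up values. Note that while the items inside a set cannot be edited in place, the set itself accepts additions and removals through add() and discard().

```python
names = {'one', 'two', 'three', 'two'}   # the duplicate 'two' is dropped
count = len(names)                        # 3

names.add('four')                         # add a new member
names.discard('one')                      # remove a member

has_two = 'two' in names                  # True (membership test)

# Sets have no fixed order, so sort before printing for a stable result
print(sorted(names))                      # ['four', 'three', 'two']
```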

d. Dictionary

It's a data structure that consists of key-value pairs.
It's ordered (it keeps insertion order as of Python 3.7), mutable and doesn't allow duplicate keys.

Dictionaries are written with curly brackets and have keys and values.

Example:

myDict = {
    'brand': 'Ford',
    'model': 'Mustang',
    'year': 1964
} # creating a dictionary with three key-value pairs
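
Once a dictionary exists, values are read and written through their keys. The sketch below reuses the car example; the 'colour' and 'owner' keys are invented for illustration.

```python
my_dict = {'brand': 'Ford', 'model': 'Mustang', 'year': 1964}

model = my_dict['model']          # look up a value by its key: 'Mustang'
my_dict['year'] = 2020            # keys are unique, so this overwrites 1964
my_dict['colour'] = 'red'         # assigning to a new key adds a pair
missing = my_dict.get('owner')    # get() returns None instead of raising KeyError

for key, value in my_dict.items():   # iterates in insertion order
    print(key, value)
```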

5. Control Flows

a. if Statements
It is a conditional statement that is used to determine whether a block of code will be executed or not.

If the condition evaluates to True, the code block inside the if statement is executed; otherwise it is skipped.

Example of if-statement

age = 20
if age > 18:
    print("You are an adult")

What if you want to execute another block of code if age is not greater than 18?
We make use of the else statement.

age = 20
if age > 18:
    print("You are an adult")
else:
    print("You are still a minor")

What if you want to test many conditions?
We'll make use of the elif statement.

age = 20
if age < 18:
    print("You are a minor")
elif age <= 35: # reached only when age is 18 or more
    print("You are an adult")
else:
    print("You are a senior adult")

We can even use if statements inside other if statements.
They are called nested if statements.

Example:

age = 20
if age > 18:
    if age < 35:
        print("You are a youth")

b. for Statements

The for statement iterates over the items of any sequence, in the order that they appear in the sequence.

words = ['cat', 'window', 'defenestrate']
for word in words:
    print(word, len(word)) # prints each word and its length

c. while Statement

It is used for repeated execution as long as an expression is true.
Example:

number = 5
x = 0
while x < number:
    print(x)
    x += 1 # Python has no ++ operator

The range() Function

It generates arithmetic progressions, that is, sequences of evenly spaced integers.

for i in range(5):
    print(i)
  • This generates 5 numbers, 0 through 4 (remember Python starts counting from 0).
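
range() also accepts a start, a stop and a step. The following sketch, with arbitrary values, wraps each range in list() so the generated numbers are visible, and shows range(len(...)) used to loop over list indexes.

```python
evens = list(range(2, 10, 2))        # start at 2, stop before 10, step 2
print(evens)                         # [2, 4, 6, 8]

countdown = list(range(5, 0, -1))    # a negative step counts downwards
print(countdown)                     # [5, 4, 3, 2, 1]

fruits = ['apple', 'mango', 'melon']
for i in range(len(fruits)):         # indexes 0, 1, 2
    print(i, fruits[i])
```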

The break and continue Statements

The break statement breaks out of the innermost enclosing for or while loop.

for n in range(2, 10):
    for x in range(2, n):
        if n % x == 0:
            print(n, 'equals', x, '*', n//x)
            break
    else:
        # the inner loop finished without hitting break, so no factor was found
        print(n, 'is a prime number')

The continue statement continues with the next iteration of the loop.

for num in range(2, 10):
    if num % 2 == 0:
        print("Found an even number", num)
        continue
    print("Found an odd number", num)

pass Statements

The pass statement does nothing.
It is often used when a statement is required syntactically but the program requires no action.

Example

while True:
    pass # busy-wait until interrupted, e.g. with Ctrl+C

6. Functions

A function is a block of code which only runs when it is called.
A function can return data

There are four types of Python functions:

  • Built-in functions - they are functions embedded in the Python interpreter and are ready for use.
    You have certainly come across some by now, for example:

    • len() - finds the length of a list, tuple etc.
    • print() - displays its arguments as text
    • type() - returns the data type of an object
  • Recursive functions - functions that call themselves

  • Lambda functions - anonymous functions, that is, functions defined without a name

  • User-defined functions - functions defined by the user to do a specific task

Example of user defined functions

def greetings(): # defining the function
    print("Hello All")

greetings() # calling the function
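
To round out the function types listed earlier, here is a minimal sketch of a user-defined function that takes parameters and returns data, a recursive function, and a lambda function. The names add, factorial and square are chosen for illustration.

```python
def add(a, b):
    """User-defined function with parameters and a return value."""
    return a + b

def factorial(n):
    """Recursive function: calls itself until it reaches the base case."""
    if n <= 1:
        return 1
    return n * factorial(n - 1)

# Lambda function: a small anonymous function, here bound to a name
square = lambda x: x * x

print(add(2, 3))       # 5
print(factorial(5))    # 120
print(square(4))       # 16
```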

7. Packages
A package is a collection of multiple Python files.
In practice it is a directory of Python scripts, where each script performs a specific function.

For Data Science, the commonly used packages are:

  • NumPy: Used for working with arrays
  • Matplotlib: Used for Data Visualization
  • Scikit-learn: For Machine Learning algorithms

The Python files in a package are known as modules. This approach helps achieve modularization.

Importing packages

Packages can contain sub-packages, which also have modules.
To load any package or module, we use the keyword import followed by the module name or package name.
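
The import keyword comes in a few common forms. The sketch below uses only modules from Python's standard library (math and urllib) so it runs without installing anything; the URL is a placeholder.

```python
import math                    # import a whole module
from math import sqrt          # import a single name from a module
import urllib.parse as up      # urllib is a package; parse is a module inside it

area = math.pi * 2 ** 2        # names are accessed through the module
root = sqrt(16)                # 4.0, usable directly because it was imported by name
host = up.urlparse('https://example.com/data').netloc   # 'example.com'

print(area, root, host)
```

The "as" form gives the imported module a shorter alias, which is the same pattern as "import numpy as np".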

i. numpy
numpy - Numerical Python
It's a core library for scientific computing.
It provides a high-performance multi-dimensional array object and tools for working with these objects.

  • numpy vs. python list
    NumPy is much faster than a purely Python-based approach for numerical work on large arrays

  • creating numpy array from a python list

    import numpy as np # importing numpy package and giving it a np alias
    marks = [78, 47, 98, 43, 58] # creating a list
    marks_np = np.array(marks) # converting the list into a NumPy array
    print(type(marks_np)) # prints <class 'numpy.ndarray'>

  • ndarray attributes

    • ndim: number of dimensions of the array
    • shape: shape of the array (n_rows, n_cols)
    • dtype: data type of the elements stored in the array
    • size: the total number of elements in the array
    • strides: number of bytes to step in each dimension when moving through the array in memory (bytes_per_row, bytes_per_column)

Example:

print('dimension ', marks_np.ndim)
print('shape ', marks_np.shape)
print('size ', marks_np.size)
print('dtype ', marks_np.dtype)
print('strides ', marks_np.strides)
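
One practical difference between a NumPy array and a Python list is that arithmetic and comparisons apply to every element at once. The sketch below assumes NumPy is installed; the grid array is made up to show the attributes on a 2-D array.

```python
import numpy as np

marks = [78, 47, 98, 43, 58]
marks_np = np.array(marks)

# Arithmetic and comparisons apply to every element, with no explicit loop
scaled = marks_np / 100          # each mark divided by 100
passed = marks_np >= 50          # one Boolean per mark

# The same with a plain list needs a loop or a list comprehension
scaled_list = [m / 100 for m in marks]

# A 2-D array reports two dimensions and a (rows, columns) shape
grid = np.array([[1, 2, 3], [4, 5, 6]])
print(grid.ndim, grid.shape, grid.size)   # 2 (2, 3) 6
print(passed)
```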

Some key functions defined for NumPy arrays

  1. zeros(shape=(n,m)) : creates a zero-array with the shape (n rows, m columns)

    x = np.zeros(shape=(3,5), dtype ="int32")
    print(x)

  2. arange(start=i, stop=j, step=s) : creates a 1-D array whose first value is i (inclusive) and whose values stop before j (exclusive), with a step of s between consecutive values

    x = np.arange(start=100, stop=1000, step=100, dtype="int32")
    print(x)

  3. linspace(start=i, stop=j, num=n) : creates a 1-D array whose first value is i inclusive, last value is j inclusive and contains n values in total

    x_lin = np.linspace(start=10, stop=50, num=30)
    print(x_lin)

  4. full(shape=(n,m), fill_value=f) : creates an array with the shape (n rows, m columns), where all positions have the value f.

    x_ful = np.full(shape=(5,6), fill_value=3)
    print(x_ful)

ii. Pandas
The name pandas is derived from "panel data" and is also a play on "Python data analysis".
It is an open-source Python library.
It is used by data scientists/analysts to:

  • read
  • write
  • manipulate
  • analyze the data

Why Pandas?

  • It helps you explore and manipulate data in an efficient manner
  • It helps you analyze large volumes of data with ease

Why is Pandas popular?

  • Easy to read and learn
  • Fast and powerful
  • Integrates well with other visualization libraries

importing pandas

import pandas
import pandas as pd # creating an alias for pandas

Pandas Series
A series is:

  • a 1-D labelled array
  • can hold data of any type
  • similar to a table's column

A Series can have:

  • Integers
  • Strings
  • Both numbers and strings

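A Series that holds strings or mixed values has the data type object; a Series of integers or floats has a numeric data type such as int64 or float64.
By default, Series are indexed starting from 0.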

Creating a Series

import pandas as pd
numbers = [1, 3, 5, 5, 7, 9, 13, 56]
pd.Series(numbers) # A series from a list

country = {'Kenya': 'Nairobi', 'Tanzania': 'Dodoma', 'Uganda': 'Kampala'}
pd.Series(country) # Creating a series from a dictionary, the dict keys will be the index for the Series

Pandas DataFrame
A DataFrame is:

  • a 2-D table
  • made up of a collection of Series
  • Structured with labeled axes (rows and columns)

You create a DataFrame with the pd.DataFrame() constructor.

import pandas as pd
items = {'item_id': [1, 2, 3, 4, 5], 'item_name': ['chocolate', 'floor', 'sugar', 'ice cream', 'soap'], "item_price": [356.00, 200.00, 150.00, 55.00, 187.00]}
data = pd.DataFrame(items) # data refers to the DataFrame, not the dictionary
print(data)

The DataFrame has 3 columns each containing 5 entries.

Some pandas functions and methods

  • .head() shows the first entries in a DataFrame. The number of rows to show can be passed as an argument (the default is 5).
  • .tail() shows the last entries in a DataFrame. The number of rows to show can be passed as an argument.
  • .describe() gives summary statistics for each numeric column in the DataFrame
  • .shape gives the number of rows and columns in the DataFrame
  • .info() gives a summary of the DataFrame, including each column's data type and count of non-null values

    data.shape
    data.head(5)
    data.tail(9)
    data.info()
    data.describe()

You can access a column by using its name as an index into the DataFrame.

print(data['item_name']) # This outputs the entries in the 'item_name' column
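
Building on column access, the sketch below shows that a single column is a Series, and uses a Boolean condition on a column to keep only some rows. It assumes pandas is installed; the small items table and the 180 threshold are made up for the example.

```python
import pandas as pd

# A small, made-up table for illustration
items = {
    'item_name': ['chocolate', 'sugar', 'soap'],
    'item_price': [356.00, 150.00, 187.00],
}
df = pd.DataFrame(items)

prices = df['item_price']               # a single column is a Series
pricey = df[df['item_price'] > 180]     # keep rows where the condition holds
average = df['item_price'].mean()       # 231.0

print(pricey)
print(average)
```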




8. Data Science

Data Science is a field that combines math and statistics, specialized programming, advanced analytics, artificial intelligence and machine learning with specific subject matter expertise to uncover actionable insights hidden in an organization's data.

Steps involved in data science process:

  • Business understanding/analysis
  • Data Exploration and Preparation
  • Data Transformation and Representation
  • Data visualization
  • Data Modelling, Training, Validation and Deployment

Some of the Python Libraries used for Data Science:

  • NumPy
  • Pandas
  • SciPy
  • Matplotlib

Since Data Science is a team sport, it calls for environments that allow collaboration, such as sharing code.
Such environments are:

  • Jupyter notebooks
  • GitHub

We'll deep dive into Data Science in the next article.
