DEV Community

Biruk Bizuayehu


Python 101: Introduction to Python for Data Science

PYTHON

Python is a high-level, object-oriented programming language. High-level means the language hides machine details such as memory management, so code reads closer to human language than to hardware instructions. Object-oriented means it supports modeling real-world entities as objects and defining how those objects interact.

Python is simple and easy to learn

Python is one of the best programming languages to learn if you're just getting started. Its simple syntax, English-like keywords, and clean layout make it easy for beginners and experienced programmers alike to follow what each line of code does.

Applications of Python

1 Data Science.
2 Machine Learning.
3 Web Development.
4 Computer Vision and Image Processing.
5 Game Development, and more.

Python Basics

Variables

Variables are containers for storing data values.

Example:

x = str(3)    # x will be '3'
y = int(3)    # y will be 3
z = float(3)  # z will be 3.0
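As a small sketch of how these conversions behave, the built-in type() function can report which type each variable holds. The variable names are the same illustrative ones used above, and the rebinding at the end is an added illustration.

```python
# Convert the same value to three different types
x = str(3)    # the string '3'
y = int(3)    # the integer 3
z = float(3)  # the float 3.0

# type() reports which type each variable currently holds
print(type(x).__name__, type(y).__name__, type(z).__name__)

# A variable is just a name: it can be rebound to a value of another type
x = 42
print(type(x).__name__)
```

Because variables are not tied to one type, the same name can hold a string at one point and an integer later.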

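Data Structures

Lists, sets, tuples, and dictionaries are the fundamental data structures in Python. Each one behaves differently: a list is ordered and can be changed, a tuple is ordered but cannot be changed after creation, a set is an unordered collection of unique items, and a dictionary maps keys to values.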

List

fruits = ['apple', 'mango', 'banana']

Set

fruits = {"apple", "banana", "mango"}

Tuple

fruits = ('apple', 'mango', 'banana')

Dictionary

fruits = {'fruit1': 'apple', 'fruit2': 'mango', 'fruit3': 'banana'}

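The differences between the four structures are easier to see when each one is used. The following is a minimal sketch; the variable names (fruits_list, fruits_set, and so on) are illustrative and not part of the examples above.

```python
# List: ordered and mutable, so items can be added later
fruits_list = ['apple', 'mango', 'banana']
fruits_list.append('orange')

# Set: duplicates are dropped automatically
fruits_set = {'apple', 'banana', 'mango', 'apple'}

# Tuple: immutable, so item assignment raises a TypeError
fruits_tuple = ('apple', 'mango', 'banana')
try:
    fruits_tuple[0] = 'kiwi'
except TypeError as err:
    tuple_error = str(err)

# Dictionary: values are looked up by key
fruits_dict = {'fruit1': 'apple', 'fruit2': 'mango', 'fruit3': 'banana'}
second = fruits_dict['fruit2']

print(fruits_list)
print(len(fruits_set))
print(tuple_error)
print(second)
```

A list fits data that will grow or change, a tuple fits fixed records, a set fits membership checks and de-duplication, and a dictionary fits lookups by name.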

To learn more about Python, see the documentation on the official Python website.

Python official documentation

Python for Data Science

If you want to perform data analysis, you need to import specific libraries. Some examples include:

NumPy - A powerful library that helps you create n-dimensional arrays.

Pandas - Used for structured data operations.

SciPy - Provides scientific computing routines, such as linear algebra and Fourier transforms.

Matplotlib - Primarily used for visualization purposes.

Scikit-learn - Used for many machine learning tasks, such as classification, regression, and clustering.

Seaborn - Provides a high-level interface for drawing attractive and informative statistical graphics.

Let's look at some basic concepts in NumPy and pandas.

NumPy

NumPy is a Python library for high-performance numerical computing. It provides an array-oriented interface to vector and matrix operations, making it easy to perform complex mathematical operations on large data sets.
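To make "array-oriented" concrete, here is a short sketch of vector and matrix operations done without explicit loops. It assumes NumPy is already installed, and the prices and quantities arrays hold made-up values used only for illustration.

```python
import numpy as np

# Hypothetical values, used only for illustration
prices = np.array([10.0, 20.0, 30.0])
quantities = np.array([1, 2, 3])

# Elementwise multiplication: one result per pair of elements
totals = prices * quantities

# Dot product of the two vectors: sum of the elementwise products
grand_total = prices @ quantities

# Matrix operations on a 2x3 array
matrix = np.array([[1, 2, 3], [4, 5, 6]])
col_sums = matrix.sum(axis=0)   # sum down each column
product = matrix @ matrix.T     # (2x3) times (3x2) gives a 2x2 matrix

print(totals)
print(grand_total)
print(col_sums)
print(product)
```

Each operation applies to the whole array at once, which is usually both shorter and faster than looping over the elements in plain Python.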

To work with NumPy, first install the library. If you use pip, you can install NumPy with:

pip install numpy

Let's look at some basic operations in NumPy.

# import numpy

import numpy as np

# create a 1D array

arr1 = np.array([10, 20, 30])
print(arr1)  # output: [10 20 30]

# create a 2D array

arr2 = np.array([[1, 2, 3], [4, 5, 6]])
print(arr2)
# output:
# [[1 2 3]
#  [4 5 6]]

# to get an individual element from a NumPy array, pass its index

arr1 = np.array([10, 20, 30])
print(arr1[0])  # output: 10

# Note: indexes start from zero, and Python slice syntax can select ranges of elements.

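The slicing mentioned in the note can be sketched as follows. The arrays here are illustrative; arr1 is extended to five elements so the slices have something to select.

```python
import numpy as np

arr1 = np.array([10, 20, 30, 40, 50])

first_two = arr1[:2]      # elements at index 0 and 1
last = arr1[-1]           # negative indexes count from the end
every_other = arr1[::2]   # start:stop:step with a step of 2

arr2 = np.array([[1, 2, 3], [4, 5, 6]])

second_row = arr2[1]      # whole row at index 1
middle_col = arr2[:, 1]   # every row, column at index 1
element = arr2[1, 2]      # row 1, column 2

print(first_two, last, every_other)
print(second_row, middle_col, element)
```

For 2D arrays, the part before the comma selects rows and the part after it selects columns.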

To learn more about NumPy, see the documentation on the official NumPy website.

NumPy official documentation

Pandas

Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

To work with pandas, first install the library. If you use pip, you can install pandas with:

pip install pandas

Let's look at some basic operations in pandas.

# import pandas

import pandas as pd

# load a CSV file into a DataFrame

df = pd.read_csv('data.csv')

# return the first 5 rows of the DataFrame

df.head()

   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S


# return the last 5 rows of the DataFrame

df.tail()

     PassengerId  Survived  Pclass  Name                                      Sex     Age   SibSp  Parch  Ticket      Fare   Cabin  Embarked
886  887          0         2       Montvila, Rev. Juozas                     male    27.0  0      0      211536      13.00  NaN    S
887  888          1         1       Graham, Miss. Margaret Edith              female  19.0  0      0      112053      30.00  B42    S
888  889          0         3       Johnston, Miss. Catherine Helen "Carrie"  female  NaN   1      2      W./C. 6607  23.45  NaN    S
889  890          1         1       Behr, Mr. Karl Howell                     male    26.0  0      0      111369      30.00  C148   C
890  891          0         3       Dooley, Mr. Patrick                       male    32.0  0      0      370376      7.75   NaN    Q

# print a quick overview of the dataset: column names, non-null counts, and data types

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

# return a statistical summary of the numerical columns in the dataset

df.describe()

       PassengerId  Survived    Pclass      Age         SibSp       Parch       Fare
count  891.000000   891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean   446.000000   0.383838    2.308642    29.699118   0.523008    0.381594    32.204208
std    257.353842   0.486592    0.836071    14.526497   1.102743    0.806057    49.693429
min    1.000000     0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%    223.500000   0.000000    2.000000    20.125000   0.000000    0.000000    7.910400
50%    446.000000   0.000000    3.000000    28.000000   0.000000    0.000000    14.454200
75%    668.500000   1.000000    3.000000    38.000000   1.000000    0.000000    31.000000
max    891.000000   1.000000    3.000000    80.000000   8.000000    6.000000    512.329200

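Beyond these overview methods, the usual next steps are selecting columns, filtering rows, and checking for missing values. The sketch below builds a tiny DataFrame in memory so it runs without data.csv. The rows are hypothetical and only reuse a few of the column names shown above.

```python
import pandas as pd

# Hypothetical rows that reuse some column names from the output above;
# this is not the real data.csv
df = pd.DataFrame({
    'Name': ['Passenger A', 'Passenger B', 'Passenger C', 'Passenger D'],
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, None, 35.0],
    'Survived': [0, 1, 1, 0],
})

ages = df['Age']                        # select one column as a Series
adults = df[df['Age'] > 30]             # keep rows that match a condition
missing_ages = df['Age'].isna().sum()   # count missing values in a column
mean_age = df['Age'].mean()             # missing values are skipped by default

# Average of the Survived column within each group of Sex
survival_by_sex = df.groupby('Sex')['Survived'].mean()

print(adults)
print(missing_ages, mean_age)
print(survival_by_sex)
```

The same calls should work on a DataFrame loaded with pd.read_csv, provided the file contains columns with these names.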

To learn more about pandas, see the documentation on the official pandas website.

pandas official documentation

Thank you for taking the time to read my article and giving me honest feedback.
