Haavish Sachdeva

Posted on Jan 21

"The Power of Python: Essential Skill for Data Science"

#python #pandas #numpy #matplotlib

Introduction

Python has taken the world by storm, becoming one of the most popular programming languages. If you are a beginner taking your first step into coding, it is a good place to start.
In this blog, we will explore what makes Python special, why it is so widely used, and how you can start mastering it.

Why Python?

Readability
Versatility
Large Community
Extensive Libraries

Getting Started with Python

If you are new to Python, here is a simple example to get you started:

name = input("Enter your name: ")
print(f"Hello, {name}! Welcome to Python programming.")

Data Types of Python

1.List
2.Tuple
3.Set
4.Dictionary

List
A list is a built-in data structure that allows you to store multiple items in a single variable. Lists are mutable and allow duplicate values.

Key Features of a List
1.Ordered
2.Mutable
3.Allow Duplicates
4.Can store different data types

# declaring an empty list
lists = []

# defining a list of numbers
list_Numbers = [1, 2, 3, 4, 5]

# defining a list of strings
list_string = ["Bale", "Kane", "Sterling"]

# defining a list with mixed data types
max_list = [9, 4, "Anne"]

Open your Python shell:

>>> # print the first element
>>> lists = [1, 2, 3, 4, 5]
>>> print(lists[0])
1
>>>
>>> # print the first three elements
>>> print(lists[0:3])
[1, 2, 3]
>>>
>>> # print the last element in the list
>>> print(lists[-1])
5
>>>
>>> # sort the list in descending order
>>> lists.sort(reverse=True)
>>> print(lists)
[5, 4, 3, 2, 1]
>>>

Because a list is a mutable data type, we can add elements with the append(), extend(), and insert() methods. To delete elements, we can use the del statement or the pop() and remove() methods.
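The short sketch below illustrates the add and delete operations on a list mentioned above. The list name and values are made up for illustration.

```python
# Start with a small list of numbers (illustrative values)
numbers = [1, 2, 3]

numbers.append(4)        # add one item at the end -> [1, 2, 3, 4]
numbers.extend([5, 6])   # add every item from another iterable -> [1, 2, 3, 4, 5, 6]
numbers.insert(0, 0)     # insert the value 0 at index 0 -> [0, 1, 2, 3, 4, 5, 6]

del numbers[0]           # delete the item at index 0 -> [1, 2, 3, 4, 5, 6]
last = numbers.pop()     # remove and return the last item (6)
numbers.remove(3)        # remove the first occurrence of the value 3

print(numbers)  # [1, 2, 4, 5]
print(last)     # 6
```

Note that del is a statement that works by index, while remove() searches for a value and pop() returns the item it removes.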

Tuples
A tuple is a data type for storing an immutable, ordered sequence of elements. Once a tuple is created, elements can be neither added nor removed. It stores multiple values in a single collection.

Key Features of a tuple
1.Ordered
2.Allow Duplicates
3.Indexed
4.Can store different data types

>>> # declaring an empty tuple
>>> empty_tuple = ()
>>> print(empty_tuple)
()
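As a minimal sketch of the features listed above, the example below indexes into a tuple and shows that trying to change an element raises an error. The tuple contents are hypothetical.

```python
# A tuple holding mixed data types (hypothetical values)
point = (3, 4, "north")

print(point[0])    # 3
print(point[-1])   # north
print(len(point))  # 3

# Tuples are immutable: item assignment raises TypeError
try:
    point[0] = 10
    assignment_failed = False
except TypeError:
    assignment_failed = True

print(assignment_failed)  # True

# A single-element tuple needs a trailing comma
single = (5,)
print(type(single).__name__)  # tuple
```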

Set
A set is a built-in data structure in Python used to store multiple unique values in an unordered, mutable collection. It lets us quickly remove duplicates from a list.

Key Features of a set
1.Mutable
2.Can store different data types

my_set = {10, "python" , 3.14 , True}
print(my_set)
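The sketch below shows one common use of a set: removing duplicates from a list. The names in the list are illustrative only.

```python
# A list with repeated values (illustrative data)
names = ["Bale", "Kane", "Bale", "Sterling", "Kane"]

# Converting to a set keeps only one copy of each value
unique_names = set(names)
print(len(unique_names))     # 3

# Sets are unordered, so sort them when a predictable order is needed
print(sorted(unique_names))  # ['Bale', 'Kane', 'Sterling']

# Sets are mutable: items can be added and removed
unique_names.add("Anne")
unique_names.discard("Kane")
print("Anne" in unique_names)  # True
```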

Dictionaries
A dictionary is a mutable data structure that stores data as pairs of keys and values. Since Python 3.7, dictionaries preserve the order in which items were inserted. It is one of the most efficient ways to store and retrieve data using unique keys.

Key Features of a Dictionary
1.Ordered
2.Mutable
3.Indexed by keys
4.Can store different data types

>>> Dict1 = {1: 'Hello', 2: 'To', 3: 'You'}
>>> print(Dict1)
{1: 'Hello', 2: 'To', 3: 'You'}
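Here is a brief sketch of the most common dictionary operations: looking up, adding, updating, and deleting by key. The dictionary name and its contents are hypothetical.

```python
# Keys are player names, values are goal counts (hypothetical data)
player_goals = {"Bale": 3, "Kane": 5}

print(player_goals["Kane"])   # look up a value by key -> 5

player_goals["Sterling"] = 2  # add a new key-value pair
player_goals["Bale"] += 1     # update an existing value

# get() returns a default instead of raising KeyError for a missing key
missing = player_goals.get("Anne", 0)
print(missing)                # 0

del player_goals["Kane"]      # delete a pair by key

# Iterate over keys and values in insertion order
for name, goals in player_goals.items():
    print(name, goals)
```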

What is data structure?

A data structure is a way of organizing, managing, and storing data so that it can be accessed and modified efficiently. It defines how data is arranged in memory and which operations can be performed on it.

Key Features of a Data Structure
1.Efficient data organization
2.Optimized performance
3.Memory Management
4.Used in Algorithms

User Defined Data structure
Stacks
A stack is a LIFO (last in, first out) data structure found in many programming languages. It is named after a physical stack of items, where the last item placed on top is the first one taken off.
The operations carried out on a stack are push and pop.
Push inserts an element onto the top of the stack.
Pop removes the element at the top of the stack.
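One simple way to sketch a stack in Python is with a plain list, using append() as push and pop() as pop. This is only one possible approach, and the values are illustrative.

```python
# Use a list as a stack: the end of the list is the top of the stack
stack = []

# Push three items
stack.append("a")
stack.append("b")
stack.append("c")

# Pop returns the most recently pushed item (last in, first out)
top = stack.pop()
print(top)    # c
print(stack)  # ['a', 'b']
```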

Queue
A queue is also a linear data structure. It stores items in first in, first out (FIFO) order, similar to a real-world queue.

The operations performed on a queue are:

Enqueue:- Adds an item to the rear of the queue.
Dequeue:- Removes an item from the front of the queue.
Front:- Returns the first item in the queue.
Rear:- Returns the last item in the queue.
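The four operations above can be sketched with collections.deque from the standard library, which supports fast appends and pops at both ends. The task names are hypothetical.

```python
from collections import deque

queue = deque()

# Enqueue: add items at the rear
queue.append("task1")
queue.append("task2")
queue.append("task3")

# Front and rear: peek without removing anything
front = queue[0]
rear = queue[-1]
print(front, rear)       # task1 task3

# Dequeue: remove the item at the front (first in, first out)
first = queue.popleft()
print(first)             # task1
print(list(queue))       # ['task2', 'task3']
```

A deque is used here instead of a list because removing from the front of a list requires shifting every remaining element.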

Why Data Structures Matter in Data Science
In data science, handling data efficiently is just as important as analyzing it. The choice of data structure can significantly affect the performance of your programs and the accuracy of your analysis.

1.Efficient Handling of Large Datasets
Data scientists often work with datasets containing millions of records. Using the right data structure ensures that data can be stored, accessed, and processed quickly.
*Example: Using a pandas DataFrame instead of a list allows fast data filtering, grouping, and aggregation.

2.Faster Searching and Sorting
Certain data structures are optimized for specific operations.
*Example:

Dictionaries provide average constant-time lookup for key-value pairs, making searches extremely fast.
Heaps allow quick access to the largest or smallest element, useful in ranking or recommendation systems.
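As a rough sketch of the heap example, Python's standard heapq module can pull the largest or smallest values out of a collection. The scores below are made-up data.

```python
import heapq

# Hypothetical ranking scores
scores = [72, 95, 60, 88, 79]

# The two highest scores, largest first
top_two = heapq.nlargest(2, scores)
print(top_two)   # [95, 88]

# heapify turns a list into a min-heap in place;
# heappop then always removes the smallest remaining item
heap = list(scores)
heapq.heapify(heap)
smallest = heapq.heappop(heap)
print(smallest)  # 60
```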

3.Memory Optimization
Efficient data structures help save memory while storing data, which is critical when working with big data.
*Example: Using a set instead of a list to store unique items avoids storing duplicates, which can reduce memory usage when the data contains many repeated values.

4.Improved Algorithm Performance
Many algorithms rely on the underlying data structure for optimal performance. Choosing the right structure can reduce computation time from minutes to seconds.
*Example: Using a queue for streaming data allows smooth processing of tasks in the order they arrive.

Python Libraries for Data Science

A Python library is a collection of pre-written code that helps you perform specific tasks without having to write everything from scratch.
Think of it like a toolbox — instead of building your own hammer and screwdriver every time, you just open the toolbox (library) and use the tools (functions and modules) that are already there.

Important Python Libraries

NumPy
Pandas
Matplotlib

NumPy

INTRODUCTION
NumPy (Numerical Python) is one of the most important Python libraries used for numerical computing. It allows us to work with large datasets efficiently by providing powerful multi-dimensional arrays and a wide range of mathematical functions. Because of its speed and flexibility, it is widely used in fields where heavy calculations and data processing are required.

Key Features of NumPy

NumPy stands out from normal Python lists because it is built specifically for fast numerical calculations and large data handling. Some of its most powerful features are:

N-Dimensional Arrays (ndarray): NumPy uses special N-dimensional array objects that store data of the same type. This makes data processing much faster and more memory-efficient than normal Python lists.
High Performance: NumPy arrays store data in contiguous memory blocks, which allows mathematical operations to run much faster than on Python lists.
Broadcasting: NumPy automatically matches the shape of different arrays so that you can perform element-wise operations without writing extra code or creating new arrays manually.
Vectorization: Most operations in NumPy can be applied directly to whole arrays at once. This avoids writing long nested loops in Python, making code simpler, cleaner, and faster.
Built-in Linear Algebra Support: NumPy has ready-to-use functions for matrix multiplication, matrix decomposition, determinant calculation and other advanced linear algebra operations.
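To make vectorization and broadcasting concrete, here is a minimal sketch that assumes NumPy is already installed. The array names and values are illustrative.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
row = np.array([10, 20, 30])

# Vectorization: the multiplication is applied to every element at once
doubled = matrix * 2
print(doubled.tolist())  # [[2, 4, 6], [8, 10, 12]]

# Broadcasting: the 1-D row (shape (3,)) is stretched across
# both rows of the 2-D matrix (shape (2, 3))
shifted = matrix + row
print(shifted.tolist())  # [[11, 22, 33], [14, 25, 36]]

# The same result written with explicit Python loops, for comparison
looped = [[value + offset for value, offset in zip(r, row.tolist())]
          for r in matrix.tolist()]
print(looped == shifted.tolist())  # True
```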

Installing NumPy in Python


pip install numpy

Once installed, import the library with the alias np


import numpy as np

Creating NumPy Arrays
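Using ndarray: The NumPy array object is called ndarray. Arrays can be created from existing data with the array() function, or with helper functions such as zeros(), ones(), and arange(), as in the example below.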


import numpy as np

a1_zeros = np.zeros((3, 3))
a2_ones = np.ones((2, 2))
a3_range = np.arange(0, 10, 2)

print(a1_zeros)
print(a2_ones)
print(a3_range)

Output
[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]
[[1. 1.]
 [1. 1.]]
[0 2 4 6 8]

NumPy Array Indexing
Knowing the basics of NumPy array indexing is important for analyzing and manipulating the array object.

Basic Indexing: Basic indexing in NumPy allows you to access elements of an array using indices.


import numpy as np

# Create a 1D array
arr1d = np.array([10, 20, 30, 40, 50])

# Single element access
print("Single element access:", arr1d[2])  

# Negative indexing
print("Negative indexing:", arr1d[-1])  

# Create a 2D array
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Multidimensional array access
print("Multidimensional array access:", arr2d[1, 0])

Output
Single element access: 30
Negative indexing: 50
Multidimensional array access: 4

Slicing: Just like lists in Python, NumPy arrays can be sliced. As arrays can be multidimensional, you need to specify a slice for each dimension of the array.


import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])
# rows from index 1 up to (but not including) 4; this array only has row 1 in that range
print("Range of Elements:", arr[1:4])

#all rows, second column
print("Multidimensional Slicing:", arr[:, 1])

Output
Range of Elements: [[4 5 6]]
Multidimensional Slicing: [2 5]

Advanced Indexing: Advanced Indexing in NumPy provides more flexible ways to access and manipulate array elements.


import numpy as np
arr = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])

# Integer array indexing 
indices = np.array([1, 3, 5])
print ("Integer array indexing:", arr[indices])

# boolean array indexing 
cond = arr > 0
print ("\nElements greater than 0:\n", arr[cond])

Output

Integer array indexing: [20 40 60]

Elements greater than 0:
 [ 10  20  30  40  50  60  70  80  90 100]

NumPy Basic Operations
Element-wise operations in NumPy allow you to perform mathematical operations on each element of an array individually, without the need for explicit loops.

Element-wise Operations: We can perform arithmetic operations like addition, subtraction, multiplication, and division directly on NumPy arrays.


import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

# Addition
add = x + y
print("Addition:", add)

# Subtraction
subtract = x - y
print("Subtraction:", subtract)

# Multiplication
multiply = x * y
print("Multiplication:", multiply)

# Division
divide = x / y
print("Division:", divide)

Output
Addition: [5 7 9]
Subtraction: [-3 -3 -3]
Multiplication: [ 4 10 18]
Division: [0.25 0.4  0.5 ]

Unary Operation: These operations are applied to each individual element in the array, without the need for multiple arrays (as in binary operations).


import numpy as np

# Example array with both positive and negative values
arr = np.array([-3, -1, 0, 1, 3])

# Applying a unary operation: absolute value
result = np.absolute(arr)
print("Absolute value:", result)

Output
Absolute value: [3 1 0 1 3]

Binary Operators: NumPy binary operations apply to arrays element-wise and create a new array. We can use all the basic arithmetic operators such as +, -, *, and /. With the +=, -=, and *= operators, the existing array is modified in place.


import numpy as np

# Two example arrays
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

# Applying a binary operation: addition
result = np.add(arr1, arr2)

print("Array 1:", arr1)
print("Array 2:", arr2)
print("Addition Result:", result)

Output
Array 1: [1 2 3]
Array 2: [4 5 6]
Addition Result: [5 7 9]
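The difference between creating a new array and modifying an existing one can be seen in the small sketch below, which assumes NumPy is installed. The variable names are illustrative.

```python
import numpy as np

arr = np.array([1, 2, 3])
alias = arr          # a second name for the same array object

# The + operator builds a new array and leaves arr unchanged
result = arr + 10
print(result is arr)   # False
print(arr.tolist())    # [1, 2, 3]

# The += operator modifies the existing array in place,
# so the change is visible through every name that refers to it
arr += 10
print(alias.tolist())  # [11, 12, 13]
```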

NumPy ufuncs
NumPy provides familiar mathematical functions such as sin, cos, exp, etc. These functions also operate elementwise on an array, producing an array as output.


import numpy as np

# sine of each angle in the array
a = np.array([0, np.pi/2, np.pi])
print ("Sine values of array elements:", np.sin(a))

# exponential values
a = np.array([0, 1, 2, 3])
print ("Exponent of array elements:", np.exp(a))

# square root of array values
print ("Square root of array elements:", np.sqrt(a))

Output
Sine values of array elements: [  0.00000000e+00   1.00000000e+00   1.22464680e-16]
Exponent of array elements: [  1.           2.71828183   7.3890561   20.08553692]
Square root of array elements: [ 0.          1.          1.41421356  1.73205081]

NumPy Sorting Arrays
We can use the np.sort() function to sort NumPy arrays.


import numpy as np

# field names and dtypes for a structured array
dtypes = [('name', 'S10'), ('grad_year', int), ('cgpa', float)]

# Values to be put in array
values = [('Hrithik', 2009, 8.5), ('Ajay', 2008, 8.7), 
           ('Pankaj', 2008, 7.9), ('Aakash', 2009, 9.0)]

# Creating array
arr = np.array(values, dtype = dtypes)
print ("\nArray sorted by names:\n",
            np.sort(arr, order = 'name'))

print ("Array sorted by graduation year and then cgpa:\n",
                np.sort(arr, order = ['grad_year', 'cgpa']))

Output
Array sorted by names:
 [(b'Aakash', 2009, 9. ) (b'Ajay', 2008, 8.7) (b'Hrithik', 2009, 8.5)
 (b'Pankaj', 2008, 7.9)]
Array sorted by graduation year and then cgpa:
 [(b'Pankaj', 2008, 7.9) (b'Ajay',...

Pandas
Pandas is an open-source Python library used for data manipulation and analysis. It provides data structures and functions for performing efficient operations on data, and it is well suited to tabular data such as spreadsheets or SQL tables. It is built on top of NumPy and works well with other important data science libraries, which makes data easier to manipulate and analyze.

The various tasks that we can do using Pandas:

Data Cleaning, Merging and Joining: Clean and combine data from multiple sources, handling inconsistencies and duplicates.
Handling Missing Data: Manage missing values (NaN) in both floating and non-floating point data.
Column Insertion and Deletion: Easily add, remove or modify columns in a DataFrame.
Group By Operations: Use "split-apply-combine" to group and analyze data.
Data Visualization: Create visualizations with Matplotlib and Seaborn, integrated with Pandas.

Installing Pandas
The first step in working with Pandas is to check whether it is installed on the system. If it is not, install it using the pip command.


pip install pandas

Importing Pandas

After Pandas has been installed, we need to import the library. It is imported using:


import pandas as pd

Data Structures in Pandas Library

Pandas Series

A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). The axis labels are collectively called the index.

A Pandas Series can be created by loading data from existing storage, such as a SQL database, a CSV file, or an Excel file. It can also be created from lists, dictionaries, scalar values, etc.

Example: Creating a series using the Pandas Library.
import pandas as pd
import numpy as np

ser = pd.Series()
print("Pandas Series: ", ser)

data = np.array(['g', 'e', 'e', 'k', 's'])

ser = pd.Series(data)
print("Pandas Series:\n", ser)

Pandas DataFrame

A Pandas DataFrame is a two-dimensional data structure with labeled axes (rows and columns). It can be created by loading data from existing storage, such as a SQL database, a CSV file, or an Excel file. It can also be created from lists, dictionaries, a list of dictionaries, etc.

Example: Creating a DataFrame Using the Pandas Library


import pandas as pd 

df = pd.DataFrame() 
print(df)

lst = ['Geeks', 'For', 'Geeks', 'is', 'portal', 'for', 'Geeks'] 

df = pd.DataFrame(lst) 
print(df)
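Building on the DataFrame example, the sketch below creates a DataFrame from a dictionary and then applies two of the tasks listed earlier: handling missing data and a group-by operation. It assumes Pandas is installed, and the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical data: one total_bill value is missing (None becomes NaN)
data = {
    "day": ["Thur", "Fri", "Sat", "Sun", "Sat"],
    "total_bill": [170.0, 120.0, 250.0, None, 240.0],
}
df = pd.DataFrame(data)

# Handling missing data: count the NaN values, then fill them with 0.0
missing_count = int(df["total_bill"].isna().sum())
print(missing_count)   # 1
df["total_bill"] = df["total_bill"].fillna(0.0)

# Group by: split rows by day, then sum the bills within each group
totals = df.groupby("day")["total_bill"].sum()
print(totals["Sat"])   # 490.0
```

Filling missing values with 0.0 is only one option. Depending on the analysis, dropping the rows or filling with a mean may be more appropriate.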

With Pandas, we have a flexible tool to handle data efficiently whether we're cleaning, analyzing or visualizing it for our next project.

Matplotlib

Matplotlib is one of the most popular Python libraries for creating beautiful and informative data visualizations. Built on top of NumPy, it makes it easy to work with large datasets and display them using different types of charts—like line graphs, bar charts, scatter plots, and more. Whether you want static, animated, or interactive visuals, Matplotlib provides all the tools you need to bring your data to life.

Visualizing Data with Pyplot in Matplotlib

Pyplot is a powerful module within Matplotlib that makes creating visualizations quick and easy. It provides a simple, user-friendly interface to draw different types of plots such as line graphs, bar charts, and histograms with just a few lines of code. With Pyplot, you can turn raw data into clear and attractive visuals, making it easier to understand patterns and insights. Let’s look at some simple examples to see how Pyplot works in action.

Line Chart

A line chart is one of the most basic plots and can be created using the plot() function. It represents the relationship between two variables, X and Y, plotted on different axes.

Syntax:

matplotlib.pyplot.plot(x, y)

Parameters: x, y: Coordinates of the data points.

Example: This code plots a simple line chart with labeled axes and a title using Matplotlib.


import matplotlib.pyplot as plt

x = [10, 20, 30, 40]
y = [20, 25, 35, 55]

plt.plot(x, y)
plt.title("Line Chart")
plt.ylabel('Y-Axis')
plt.xlabel('X-Axis')
plt.show()

Bar Chart

A bar chart is a simple and effective way to visualize categorical data. It uses rectangular bars to represent values, where the length or height of each bar corresponds to the value it represents. Bar charts can be drawn either vertically or horizontally, making it easy to compare different categories side by side.

Syntax:


matplotlib.pyplot.bar(x, height)

Parameters:

x: Categories or positions on the x-axis

height: Heights of the bars (values on the y-axis)

Example:
The following example shows a basic bar chart that represents total bills on different days. The x-axis displays the days, while the y-axis shows the total bill amount for each day.


import matplotlib.pyplot as plt

x = ['Thur', 'Fri', 'Sat', 'Sun']
y = [170, 120, 250, 190]

plt.bar(x, y)
plt.title("Bar Chart")
plt.xlabel("Day")
plt.ylabel("Total Bill")
plt.show()

Histogram

A histogram is a great way to visualize the distribution of numerical data. It groups data values into intervals called bins and shows how frequently each range of values occurs. The hist() function in Matplotlib makes it easy to create histograms and understand the overall spread of your data.

Syntax:


matplotlib.pyplot.hist(x, bins=None)

Parameters:

x: The input data

bins: The number of bins (intervals) to divide the data into

Example:
The following example creates a histogram to display how often different total bill amounts appear in a dataset. It uses 10 bins and includes axis labels and a title to make the chart clear and informative.


import matplotlib.pyplot as plt

x = [7, 8, 9, 10, 10, 12, 12, 12, 13, 14, 14, 15, 16, 16, 17, 18, 18, 19, 20, 20,
     21, 22, 23, 24, 25, 25, 26, 28, 30, 32, 35, 36, 38, 40, 42, 44, 48, 50]

plt.hist(x, bins=10, color='steelblue')
plt.title("Histogram")
plt.xlabel("Total Bill")
plt.ylabel("Frequency")
plt.show()

Scatter Plot

Scatter plots are used to observe relationships between variables. The scatter() function in Matplotlib draws a scatter plot.

Syntax:


matplotlib.pyplot.scatter(x, y)

Parameters: x, y: Coordinates of the points.

Example: This code creates a scatter plot to visualize the relationship between days and total bill amounts using scatter().


import matplotlib.pyplot as plt

x = ['Thur', 'Fri', 'Sat', 'Sun', 'Thur', 'Fri', 'Sat', 'Sun']
y = [170, 120, 250, 190, 160, 130, 240, 200]

plt.scatter(x, y)
plt.title("Scatter Plot")
plt.xlabel("Day")
plt.ylabel("Total Bill")
plt.show()

Pie Chart

A pie chart is a circular chart used to show data as proportions or percentages. It is created using the pie() function, where each slice (wedge) represents a part of the whole.

Syntax:


matplotlib.pyplot.pie(x, labels=None, autopct=None)

Parameters:

x: The data values that determine the size of each pie slice.

labels: The names or categories for each slice.

autopct: A format string used to display percentages on the chart (for example, '%1.1f%%' shows values with one decimal place).

Example:
The example below creates a simple pie chart that shows the distribution of different car brands. Each slice of the pie represents the share of cars for a particular brand in the dataset, making it easy to compare their proportions at a glance.


import matplotlib.pyplot as plt

cars = ['AUDI', 'BMW', 'FORD', 'TESLA', 'JAGUAR']
data = [23, 10, 35, 15, 12]

plt.pie(data, labels=cars)
plt.title("Pie Chart")
plt.show()
