Data science is the practice of deriving useful, practical insights from data. Tools like Python, R, Jupyter, Spyder, etc. simply make it possible for us to do that. It is also crucial to realize that the fundamentals of data science stay the same regardless of the tools used (https://datascienceparichay.com/python-for-data-science/introduction/).
Python is one of the best-known and most widely used programming languages thanks to its versatility, and data science is one of its main applications. It includes a number of libraries that make core data analysis tasks simple.
It has a simple syntax and structure that make it easy to learn for both novice and experienced users. In this article, we'll discuss why Python is so popular in data science, give a quick overview of Python basics, look at some Python libraries, and show Python for data science in action.
Overview
Python is a high-level programming language that was first released in 1991. It is an interpreted language, meaning that no compilation step is needed before running a Python program. Its popularity has created a sizable community, which makes it easier to find information and help.
Because it contains so many helpful libraries and tools for manipulating, analyzing, and visualizing data, Python is your go-to programming language for data science. NumPy, Pandas, Dask, Seaborn, and Matplotlib are some of the most widely used libraries for data science applications.
Installation
Installing Anaconda is highly recommended: Anaconda installs not only Python but also other crucial tools such as Jupyter, Spyder, and RStudio.
The installation procedure is straightforward. Follow the steps below:
Navigate to Anaconda’s Individual Edition page and download the Anaconda installer that matches your system (https://www.anaconda.com/).
Run the downloaded installer to install Anaconda. During the installation you can choose how Anaconda is set up; use the default configuration if you're unsure, or consult Anaconda's installation documentation for further information.
After a successful installation, the Anaconda Navigator will be available. To quickly test the installation, launch a Jupyter Notebook and run a simple Python command such as print("Hey Fellas").
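For instance, running a first cell like the one below confirms that everything works; the sys.version line is optional and simply shows which Python version Anaconda installed:

import sys

print(sys.version)   # the Python version that Anaconda installed
print("Hey Fellas")  # a simple sanity check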
A Jupyter Notebook can be opened by launching it from the Anaconda Navigator. A Jupyter notebook is an open-source, web-based application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text. Jupyter notebooks are used for data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.
The Jupyter notebook combines the interactive capabilities of IPython with the versatile document format of the notebook. The notebook can run code in a wide range of programming languages, including Python, R, Julia, and Scala, and it can also be used to produce rich, interactive visualizations (https://jupyter.org/try-jupyter/retro/notebooks/?path=notebooks/Intro.ipynb).
That's all there is to it. After completing the steps above, you will have all the tools required to run Python, whether directly from the command prompt or through an application like Jupyter Notebook.
Syntax Basics
Python was designed to be easy to read and write.
The following are some Python syntax examples:
Variables in Python
A variable in Python is a name that refers to a value or an object.
In a program, variables are used to store and manage data.
In Python, you create a variable simply by pairing a name with a value using the = (assignment) operator.
Examples include:
yob = 12
print(yob)
In this example, 'yob' is our variable and '12' is the assigned value.
Variables in Python are dynamically typed: the type of a variable is determined at runtime from the value assigned to it. This is in contrast to statically typed languages, where a variable's type must be declared before it is used.
Variable names in Python can contain letters, digits, and underscores, but they cannot start with a digit.
By convention, variable names are written in lowercase, with words separated by underscores (https://www.pythontutorial.net/python-basics/python-variables/).
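For example (the names here are purely illustrative):

# Dynamic typing: the same name can be rebound to a value of a different type
answer = 42              # answer is an int
answer = "forty-two"     # now answer is a str
print(type(answer))      # <class 'str'>

# Conventional snake_case variable names
year_of_birth = 2001
first_name = "Yankho"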
Understanding Data types
Python includes a number of data types, each with its own set of operations and methods.
Here are some of the most common Python data types:
Numbers: Python can handle integers, floating-point numbers, and complex numbers.
Integers are represented by the int class, floating-point numbers by the float class, and complex numbers by the complex class.
Strings: Strings are used to represent text data. They are enclosed in single quotes (') or double quotes (") and are represented by the str class.
Booleans: Booleans represent truth values. They are represented by the bool class and can only be True or False.
Lists: Lists are used to organize a collection of items. The list class represents them, and they can contain items of any data type.
Tuples: Tuples are similar to lists, but they are immutable: their contents cannot be changed after creation. The tuple class is used to represent them.
Dictionaries: A dictionary is a collection of key-value pairs. Dictionaries are created with curly braces and are represented by the dict class.
Sets: Sets are used to store values that are unique. They are created with curly braces and are represented by the set class.
Here are a few Python examples of how to use these data types:
# Numbers
x = 5          # integer
y = 3.14       # float
z = 2 + 3j     # complex number

# Strings
s = 'Hey, There!'
t = "Python is amazing"

# Booleans
a = True
b = False

# Lists
my_list = [1, 2, 3, "four", 5.0, 6]

# Tuples
my_tuple = (1, 2, 3, "four", 5.0, "six")

# Dictionaries
my_dict = {"name": "Yankho", "age": 22, "city": "Blantyre"}

# Sets
my_set = {2, 4, 6, 8, 10}
Functions in Python
In Python, a function is a block of code that performs a specific task or set of tasks. Functions are used to make code reusable, modular, and easier to read and maintain.
To define a function in Python, you use the def keyword, followed by the function name and parentheses, and a colon. The function body is indented beneath the function header.
Here’s a simple example:
def say_hello():
    print("Hello, world!")
In this example, we’ve defined a function called say_hello() that simply prints the string “Hello, world!” when it is called.
To call a function in Python, you simply write the function name followed by parentheses.
For example:
say_hello()
This will call the say_hello() function, and the output will be “Hello, world!”.
Functions can also take parameters, which are values that you pass into the function (https://www.pythontutorial.net/python-basics/python-functions/).
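For example, a function can take parameters (and optionally give them default values) and return a value; greet here is just an illustrative name:

def greet(name, greeting="Hello"):
    # name is required, greeting has a default value
    return f"{greeting}, {name}!"

print(greet("Yankho"))          # Hello, Yankho!
print(greet("Yankho", "Hi"))    # Hi, Yankho!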
Control Statements in Python
In Python, control statements are used to control the flow of the program. They allow you to perform different actions depending on conditions or iterate over data structures.
Here are the three main types of control statements in Python:
Conditional statements (if, else, and elif)
- Conditional statements are used to execute different code depending on certain conditions. The basic syntax of an if statement is:
if condition:
    # code to be executed if condition is True
An if statement can be followed by an optional else statement, which is executed if the condition is False:
if condition:
    # code to be executed if condition is True
else:
    # code to be executed if condition is False
If you have multiple conditions to check, you can use the elif statement:
if condition1:
    # code to be executed if condition1 is True
elif condition2:
    # code to be executed if condition2 is True
else:
    # code to be executed if both condition1 and condition2 are False
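Here is a concrete example (the score thresholds are arbitrary and only for illustration):

score = 72

if score >= 80:
    print("Distinction")
elif score >= 50:
    print("Pass")        # this branch runs, since 72 is between 50 and 79
else:
    print("Fail")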
Loops (for and while)
- Loops are used to execute a block of code repeatedly. The for loop is used to iterate over a sequence (such as a list or a string), while the while loop is used to repeat a block of code as long as a certain condition is True.
The basic syntax of a for loop is:
for element in sequence:
    # code to be executed for each element in sequence
The basic syntax of a while loop is:
while condition:
    # code to be executed as long as condition is True
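For example:

# A for loop iterating over a list
fruits = ["apple", "banana", "cherry"]
for fruit in fruits:
    print(fruit)

# A while loop counting down from 3
count = 3
while count > 0:
    print(count)
    count -= 1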
Control statements (break, continue, and pass)
- Control statements are used to change the normal flow of a loop or a conditional statement. The break statement is used to exit a loop, the continue statement is used to skip the current iteration and move on to the next one, and the pass statement is used as a placeholder when you don’t want to execute any code.
Here’s an example that combines these control statements:
for i in range(1, 10):
    if i % 2 == 0:
        continue  # skip even numbers
    if i == 7:
        break  # exit the loop when i is 7
    if i == 3:
        pass  # do nothing special when i is 3
    print(i)
This code will print the numbers 1, 3, and 5. It skips even numbers with the continue statement, exits the loop when i reaches 7 with the break statement, and does nothing special when i is 3 (pass is only a placeholder, so 3 is still printed).
Python Libraries for Data Science
Data science projects benefit greatly from the many libraries that Python has to offer. Here are some well-known Python data science libraries, with usage examples:
- NumPy: NumPy is a fundamental library for scientific computing in Python. It provides powerful array manipulation capabilities, mathematical functions, and an assortment of routines for fast operations on arrays, including mathematical and logical operations, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation, and much more (https://numpy.org/doc/stable/user/whatisnumpy.html).
The library offers NumPy arrays, which resemble lists but can be up to 50 times faster than Python lists; a rough timing comparison follows the example below.
import numpy as np

# Creating an array
arr = np.array([1, 2, 3, 4, 5])

# Mathematical operations on arrays
print(np.sin(arr))
print(np.exp(arr))
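To get a feel for the speed difference on your own machine, you can time a simple element-wise operation on a list and on a NumPy array. This is only an illustrative comparison; the exact speedup depends on the operation, the data size, and the hardware:

import timeit

import numpy as np

py_list = list(range(1_000_000))
np_arr = np.arange(1_000_000)

# Square every element: a Python list comprehension vs a vectorized NumPy operation
list_time = timeit.timeit(lambda: [x * x for x in py_list], number=10)
numpy_time = timeit.timeit(lambda: np_arr * np_arr, number=10)

print(f"list comprehension: {list_time:.3f} s")
print(f"numpy vectorized:   {numpy_time:.3f} s")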
- Pandas: Pandas is a library for analyzing and manipulating data. It offers data structures that make processing and analyzing tabular data efficient. Its flexible DataFrame object can read data from numerous well-known formats, including Excel, SQL, CSV, and more, and it provides highly helpful tools for reshaping your data and performing various types of analytics on it (https://pandas.pydata.org/docs/user_guide/index.html).
Consider the example below:
import pandas as pd

# Reading a CSV file
data = pd.read_csv('data.csv')

# Grouping and aggregating data
grouped = data.groupby('category')
averages = grouped.mean()
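If you don't have a data.csv file handy, the same idea works on a DataFrame built in memory; the column names below are made up for illustration:

import pandas as pd

# A small DataFrame instead of a CSV file
data = pd.DataFrame({
    "category": ["a", "a", "b", "b", "b"],
    "value": [10, 20, 30, 40, 50],
})

# Group by category and compute the mean of the numeric column
averages = data.groupby("category").mean()
print(averages)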
- Matplotlib: Matplotlib is a library for creating visualizations in Python. It provides a wide range of plotting tools for visualizing data in various formats (https://matplotlib.org/stable/index.html).
See the example below:
import matplotlib.pyplot as plt

# Creating a scatter plot
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
plt.scatter(x, y)

# Adding labels and a title
plt.xlabel('X values')
plt.ylabel('Y values')
plt.title('Scatter plot of X and Y')

# Displaying the plot
plt.show()
- Scikit-learn: Scikit-learn is a machine learning library for Python. It provides a wide range of algorithms and tools for machine learning tasks such as classification, regression, and clustering.
See the example below:
from sklearn.linear_model import LinearRegression

# Creating a linear regression model
model = LinearRegression()

# Fitting the model to data
X = [[1, 2], [3, 4], [5, 6]]
y = [3, 7, 11]
model.fit(X, y)

# Making predictions with the model
print(model.predict([[7, 8]]))
- Seaborn: Seaborn is a library for creating statistical visualizations in Python. It provides a wide range of tools for creating advanced statistical plots (https://seaborn.pydata.org/).
See the example below:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Creating a heatmap of the correlations in data.csv
data = pd.read_csv('data.csv')
corr = data.corr()
sns.heatmap(corr)

# Adding a title and displaying the plot
plt.title('Correlation heatmap of variables in data.csv')
plt.show()
- Dask: Dask is a library for parallel computing in Python. It provides tools for handling large datasets that do not fit into memory by partitioning them across multiple processors or machines. This easy transition from a single machine to a moderate cluster lets users start simple and scale up when necessary.
Dask is convenient on a laptop: it installs trivially with conda or pip and extends the size of convenient datasets from “fits in memory” to “fits on disk” (https://docs.dask.org/en/stable/index.html).
Example:
import dask.dataframe as dd

# Reading a CSV file lazily
df = dd.read_csv('large_file.csv')

# Computing the mean of a column (compute() triggers the actual work)
mean = df['column'].mean().compute()
- Pyforest: Pyforest is a lazy-import library for data science. It automatically imports commonly used data science libraries the first time they are used in a script, so you don't have to import them manually. Pyforest works as follows:
- You may utilize all of your libraries as usual.
- Pyforest will import a library if it isn't already and add an import statement to the first Jupyter cell.
- A library won't be imported if it isn't being used.
- Your notebooks remain reproducible and shareable without you having to worry about imports.
After setting up pyforest and its Jupyter extension, you can use your preferred Python data science tools as usual without having to write import statements (https://pypi.org/project/pyforest/).
For example, if you want to read a CSV with pandas:
import pyforest

# No need to explicitly import pandas
df = pd.read_csv('data.csv')

# No need to explicitly import matplotlib
plt.plot([1, 2, 3], [4, 5, 6])
Note that while Pyforest can make your code more concise, it can also make it less clear where your functions are coming from, which can be a downside in larger codebases.
Conclusion
Python is a powerful and versatile programming language for data science. It has become increasingly popular due to its user-friendly syntax and the extensive range of libraries available for data analysis, manipulation, and visualization.
The Jupyter Notebook environment is an essential tool for data scientists, as it allows for efficient documentation, visualization, and communication of code and analysis. Moreover, the popular Python libraries such as NumPy, Pandas, Matplotlib, and Scikit-learn are indispensable for a range of data science tasks.
With Python, data scientists can explore, manipulate, and visualize data in a variety of formats, and can create predictive models and make data-driven decisions. Python’s popularity has led to a growing community of developers and data scientists, who share best practices, libraries, and techniques.
In summary, Python is a powerful and versatile tool for data science, and it is a necessary skill for anyone in the field. With continued practice and experience, data scientists can leverage Python’s capabilities to analyze and draw insights from large and complex data sets, and make data-driven decisions that are crucial in today’s data-driven world.