Python Definition
Python is an interpreted, high-level, general-purpose programming language. It was first released in 1991 by Guido van Rossum and has since become one of the most popular programming languages in the world. Its syntax is easy to learn and use, which has helped make it popular.
Advantages of Python for Data Analysis
Several programming languages and tools are popular for data analysis, including R, Stata and SAS.
Python compares well with these for several reasons:
- Ease of use: Its syntax makes it easy to learn, write, and maintain code, even for beginners.
- Range of libraries: Python has a large number of libraries that provide a range of functionalities for data analysis, such as NumPy, Pandas, and Matplotlib.
- Open-source: Python is open-source, which means that it is freely available and can be used and modified by anyone.
Installing Python
The Python installer can be downloaded from the official website, python.org, for different operating systems. I will mainly use Windows for this article.
After installing Python, you have to choose an IDE (Integrated Development Environment), which is a software application that provides a comprehensive environment for software development.
Common IDEs for data science are:
- PyCharm
- Spyder
- Visual Studio Code
- Jupyter Notebooks
- IDLE (comes with the Python installation).
First Code
Once your coding environment is set up, you are ready to code. Data science in Python relies on several basic libraries that simplify your work. They can be installed with pip, Python's package manager, by running the following commands in your command prompt (cmd):
- NumPy
pip install numpy
- Pandas
pip install pandas
- Matplotlib
pip install matplotlib
- Scikit-Learn
pip install scikit-learn
These libraries make it easier to work with tabular data (Pandas), arrays (NumPy), visualizations (Matplotlib) and machine learning (Scikit-Learn). As you advance, you will come across more libraries that come in handy in your data science projects.
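To confirm that the installs above worked, one option is to ask Python whether each library can be found. The following is a minimal sketch using only the standard library; the helper name check_installed is hypothetical and chosen for illustration. Note that scikit-learn is imported as sklearn, which differs from its pip package name.

```python
from importlib.util import find_spec

# Import names for the four libraries above; scikit-learn is imported
# as "sklearn", not by its pip package name.
LIBRARIES = ["numpy", "pandas", "matplotlib", "sklearn"]


def check_installed(names):
    """Return a dict mapping each import name to True if Python can find it."""
    return {name: find_spec(name) is not None for name in names}


status = check_installed(LIBRARIES)
for name, installed in status.items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```

Any library reported as missing can be installed with the matching pip command from the list above.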
Using Python Libraries for Data Science
The installed libraries are not usable until they, or modules in them, are imported. This is done with the import statement, either as import library or as from library import module. Example:
#`as` gives a library a shorter alias, which simplifies our code
#pandas library
import pandas as pd
#numpy library
import numpy as np
#matplotlib library
import matplotlib.pyplot as plt
#scikit learn library
from sklearn.pipeline import make_pipeline
As you may have noted, from is used to import a specific function, class or module from a library, depending on the project you are working on.
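The difference between the two import forms is easiest to see side by side. The sketch below uses two standard-library modules (statistics and math) instead of the data science libraries, so that it runs even before anything is installed; the sample values are made up for illustration.

```python
# Import a whole module under a short alias, then use alias.name
import statistics as st

# Import one function directly, then call it without a prefix
from math import sqrt

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample data

mean_value = st.mean(values)  # accessed through the alias
root = sqrt(16)               # called by its bare name

print(mean_value, root)
```

The same pattern applies to pd, np and plt in the example above: the alias goes before the dot, while names imported with from are used directly.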
Python Syntax
Operators in Python for Data Science
- Arithmetic operators: used for performing arithmetic operations such as addition (+), subtraction (-), multiplication (*), division (/), and modulus (%).
- Comparison operators: used for comparing two values and returning a Boolean value (True or False). They include equal to (==), not equal to (!=), greater than (>), less than (<), greater than or equal to (>=) and less than or equal to (<=).
- Logical operators: used for combining Boolean values and returning a Boolean result. These include logical AND (and), logical OR (or) and logical NOT (not).
- Assignment operators: used for assigning a value to a variable and performing an operation on the variable at the same time. These include:
a = 5
a += 3 # equivalent to a = a + 3
a -= 2 # equivalent to a = a - 2
a *= 4 # equivalent to a = a * 4
a /= 2 # equivalent to a = a / 2
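The arithmetic, comparison and logical operators listed above can be tried out in the same way. This is a small sketch with arbitrary values for x and y:

```python
x, y = 7, 2  # arbitrary example values

# Arithmetic operators
total = x + y        # addition
difference = x - y   # subtraction
product = x * y      # multiplication
quotient = x / y     # true division always returns a float
remainder = x % y    # modulus: remainder of x divided by y

# Comparison operators return a Boolean
is_equal = x == y
is_greater = x > y

# Logical operators combine Booleans
both = (x > 5) and (y > 5)
either = (x > 5) or (y > 5)
negated = not (x > 5)

print(total, difference, product, quotient, remainder)
print(is_equal, is_greater, both, either, negated)
```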
Python Data Structures
Python comes with built-in data structures that enable data scientists to store and manipulate data sets. They are the foundation that makes it easy to integrate with the data science libraries.
The most common data structures are:
- Lists:
A list is a collection of ordered elements, which can be of any data type. Example:
mylist = [1,2,3,4]
- Tuples:
A tuple is a collection of ordered elements, similar to a list. However, tuples are immutable, which means that once a tuple is created, its elements cannot be modified. Example:
mytuple = (1, 2, 3, 4, 5)
- Dictionaries:
A dictionary is a collection of key-value pairs, where each key is associated with a value. Dictionaries are mutable, which means that you can add, remove, or modify key-value pairs in a dictionary. Since Python 3.7 they also preserve the order in which keys were inserted. Example:
mydict = {"a": 2, "b": 3, "c": 4}
These are the most commonly used data structures but others include sets and arrays.
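The difference between mutable and immutable structures can be seen by trying to change each one. The sketch below reuses the example variables from this section and adds a set; the flag name tuple_changed is only for illustration.

```python
# Lists are mutable: elements can be changed or appended
mylist = [1, 2, 3, 4]
mylist[0] = 10
mylist.append(5)

# Tuples are immutable: item assignment raises TypeError
mytuple = (1, 2, 3, 4, 5)
try:
    mytuple[0] = 10
    tuple_changed = True
except TypeError:
    tuple_changed = False

# Dictionaries map keys to values and can be updated in place
mydict = {"a": 2, "b": 3, "c": 4}
mydict["d"] = 5    # add a new key-value pair
mydict["a"] = 20   # modify an existing value
del mydict["b"]    # remove a pair

# Sets keep only unique elements
myset = set([1, 2, 2, 3, 3, 3])

print(mylist, tuple_changed, mydict, myset)
```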
Conclusion
Data science involves many kinds of projects, from data collection to machine learning. The kind of project will dictate the libraries to use and the code to write. The most common data sources are APIs, Excel files (flat files), structured databases (SQL), unstructured databases (MongoDB) and, sometimes, a mix of these.
Python offers easy integration with these data sources, e.g. the pymongo library for MongoDB databases, sqlite3 for SQLite databases and pandas for flat files (Excel, CSV, etc.).
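Examples:
- Pymongo
from pymongo import MongoClient
client = MongoClient(host="localhost", port=27017)
- SQLite (in a Jupyter notebook, with a SQL magic extension such as ipython-sql installed)
import sqlite3
%load_ext sql
%sql sqlite:///path
- Pandas
import pandas as pd
df = pd.read_csv(filepath)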
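Outside a notebook, the standard-library sqlite3 module can also be used directly, without any magic commands. The following is a minimal sketch that uses an in-memory database; the sales table and its rows are hypothetical and exist only for illustration.

```python
import sqlite3

# ":memory:" creates a temporary database that lives only in RAM
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# Hypothetical table and rows, used only for illustration
cursor.execute("CREATE TABLE sales (region TEXT, amount REAL)")
cursor.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north", 120.0), ("south", 80.5), ("north", 30.0)],
)
connection.commit()

# Aggregate with SQL and fetch the result as a list of tuples
cursor.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
)
totals = cursor.fetchall()
connection.close()

print(totals)
```

To work with a database file on disk, pass its path to sqlite3.connect in place of ":memory:".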
The kind of data will also determine which libraries to install and the type of code to write.
It's always advisable to come up with a clear plan for how to handle your project, to avoid choosing the wrong methods or libraries.
Data science with Python is fun and, with dedication, easy to learn.
Thank you, and feel free to reach out for any clarification.