DEV Community

Erik Marsja

Essential Python Libraries for Data Science, Machine Learning, and Statistics

In this post I list some of the most valuable libraries for anyone who intends to use Python for data science, machine learning, and statistics. These libraries are extensive and are developed by a large number of experts around the world. Together, they make Python a very powerful tool for data analysis.

I strongly recommend that you install and use Anaconda, the scientific Python distribution. It comes with a large number of Python libraries preinstalled. In the example below we will use one very handy library: Pandas.

How to import and use Pandas

The convention is to import pandas as pd, which gives easy access to its functions and classes. For instance, we can write pd.read_csv('datafile.csv') to load a CSV file into a dataframe object.
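
Here is a minimal sketch of that convention. Since 'datafile.csv' is only a placeholder name, the example reads the CSV data from an in-memory string instead of a real file; the column names and values are made up for illustration.

```python
import io

import pandas as pd

# Made-up CSV content standing in for 'datafile.csv'
csv_text = "name,age\nAda,36\nGrace,45\n"

# pd.read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# Show the first rows of the dataframe
print(df.head())
```

With a real file on disk, the call would simply be pd.read_csv('datafile.csv').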

Essential Libraries in Python

In this section I will list some of the most essential Python libraries when it comes to data science.

NumPy (Numerical Python)

NumPy is an extensive library for storing and computing with numerical data. It provides data structures, algorithms, and other tools for handling numerical data in Python. In particular, operations on NumPy arrays are more efficient than the equivalent operations on Python's built-in lists, which is what makes NumPy fast. NumPy also contains functions for loading data into Python and for exporting data from Python.
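
A small sketch of these ideas, using made-up values: an array is created from a list, squared element-wise without a loop, and then written to and read back from a text file. The file name values.txt is just an example, and a temporary directory is used so nothing is left behind.

```python
import os
import tempfile

import numpy as np

# Create a NumPy array from an ordinary Python list
values = [1.0, 2.0, 3.0, 4.0]
arr = np.array(values)

# Element-wise arithmetic, no explicit loop needed
squared = arr ** 2

# Built-in reduction over the whole array
total = arr.sum()

# Export the array to a text file and load it back again
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "values.txt")
    np.savetxt(path, squared)
    loaded = np.loadtxt(path)

print(squared, total, loaded)
```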

If you are migrating from MATLAB, for instance, you will like NumPy (see here)

http://numpy.org

Pandas

Pandas is the most powerful library for data manipulation. It contains a wide range of functions for importing and exporting data, as well as for indexing and manipulating it. The library is indispensable for those who use Python for data science. Pandas also includes sophisticated data structures. The two most used are the dataframe (a table of columns and rows) and the series (a 1-dimensional array).

Pandas is extremely effective for reshaping, merging, splitting, aggregating, and selecting (subsetting) data. In fact, the vast majority of the code in a data science project usually consists of data wrangling, that is, the steps required to prepare data so that analyses can be performed. Having a coherent library for all data wrangling is, of course, advantageous.

Unlike the statistical programming environment known as R, Python has no built-in dataframe type. Dataframes are central to basically all data analysis: a dataframe is a table of columns and rows. Here's a very nice Pandas Dataframe tutorial I wrote, aimed at the beginner.
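
To give a feel for these operations, here is a short sketch on two small made-up tables. The column names (participant, condition, rt, group) are assumptions chosen only for illustration.

```python
import pandas as pd

# Made-up response times for three participants in two conditions
scores = pd.DataFrame({
    "participant": [1, 1, 2, 2, 3, 3],
    "condition": ["a", "b", "a", "b", "a", "b"],
    "rt": [0.52, 0.61, 0.48, 0.66, 0.55, 0.70],
})

# Made-up group membership for the same participants
groups = pd.DataFrame({
    "participant": [1, 2, 3],
    "group": ["control", "treatment", "control"],
})

# Merge the two tables on their shared column
merged = scores.merge(groups, on="participant")

# Select (subset) only the rows from the control group
control = merged[merged["group"] == "control"]

# Aggregate: mean response time per condition
mean_rt = merged.groupby("condition")["rt"].mean()

# Reshape from long to wide format: one row per participant
wide = merged.pivot(index="participant", columns="condition", values="rt")

print(mean_rt)
print(wide)
```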

http://pandas.pydata.org

matplotlib

Matplotlib is used to visualize data. Although matplotlib is quite easy to use and gives you a lot of control over your plots, I would recommend using Seaborn.
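
For comparison, a basic matplotlib plot might look like the sketch below. The data (a sine wave) and the output file name sine_wave.png are made up for the example.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up data: one period of a sine wave
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

# Create a figure with a single set of axes and draw the line
fig, ax = plt.subplots()
ax.plot(x, y, label="sin(x)")

# Labels, title, and legend are all set explicitly
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A sine wave")
ax.legend()

# Save the figure to an image file
fig.savefig("sine_wave.png")
```

Every element of the plot is set by hand here, which is where the control comes from.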

http://matplotlib.org

Seaborn

Seaborn is a Python package for data visualization that is based on matplotlib. This package gives us a high-level interface for drawing beautiful and informative visualizations. It's possible to draw bar plots, histograms, scatter plots, and many other nice plots.

Here's an example that uses NumPy to generate some data and Seaborn to plot it:

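import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Generate 1000 normally distributed values (mean 0.0, standard deviation 0.2)
dat = np.random.normal(0.0, 0.2, 1000)

# Create a histogram with a density curve using seaborn;
# histplot replaces distplot, which is deprecated since seaborn 0.11
sns.histplot(dat, kde=True)

# Display the figure when running as a script
plt.show()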

Histogram created with Seaborn

SciPy (Scientific Python)

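SciPy builds on NumPy and includes features for advanced calculations, such as optimization, numerical integration, interpolation, linear algebra, signal processing, and statistics.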
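
As one example of such a calculation, the sketch below runs an independent-samples t-test with scipy.stats. The two groups are simulated with made-up parameters, so the exact numbers carry no meaning.

```python
import numpy as np
from scipy import stats

# Simulate two made-up groups with slightly different means
rng = np.random.default_rng(42)
group_a = rng.normal(loc=0.0, scale=1.0, size=50)
group_b = rng.normal(loc=0.5, scale=1.0, size=50)

# Independent-samples t-test (assumes equal variances by default)
result = stats.ttest_ind(group_a, group_b)
t_stat = result.statistic
p_value = result.pvalue

print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```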

http://scipy.org

scikit-learn

scikit-learn is a huge library of tools for data analysis and machine learning. In scikit-learn there are classification models (e.g., Support Vector Machines, random forests), regression analysis (e.g., linear regression, ridge regression, lasso), cluster analysis (e.g., k-means clustering), data reduction methods (e.g., Principal Component Analysis, feature selection), model tuning and selection (with features like grid search and cross-validation), and pre-processing of data, among many other things.

http://scikit-learn.org/
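
To show how a few of these pieces fit together, here is a minimal sketch that combines pre-processing, a Support Vector Machine classifier, grid search, and cross-validation. The iris dataset that ships with scikit-learn, the parameter values, and the train/test split are my own choices for the example, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Load a small example dataset bundled with scikit-learn
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data for the final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Pre-processing (standardization) followed by an SVM classifier
pipeline = make_pipeline(StandardScaler(), SVC())

# Grid search over the regularization strength C with 5-fold cross-validation
param_grid = {"svc__C": [0.1, 1, 10]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

# Best setting found and accuracy on the held-out data
test_accuracy = search.score(X_test, y_test)
print(search.best_params_)
print(test_accuracy)
```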
