Data Science toolset summary from 2021

#datascience #machinelearning

The year 2021 is about to end, so let us look back at the tools Data Professionals have used throughout the year. I use the term Data Professionals for all the jobs associated with data, such as Data Scientists, Data Analysts, and Data Engineers. To become a better Data Professional we need knowledge of many domains, but the most important skills are knowledge of databases and SQL, languages like Python, R, Julia, and JavaScript, experience with data visualization tools like Tableau and Power BI, and knowledge of big data and cloud technologies.

In this post, I list the tools and technologies that Data Professionals used extensively throughout the year; expertise in them can make you stand out in the industry. The list is based on a survey conducted by Kaggle (the biggest community of Data Scientists). I use the term toolset because it is a comprehensive list of tools from different domains.

IDE

The most common languages used for Data Science are Python, R, JavaScript, MATLAB, and Julia, along with SQL. These languages are used for data analysis and visualization, building machine learning algorithms, implementing data pipelines, and various other Data Science tasks. The most important tools we need are IDEs (Integrated Development Environments), where we write code, compile or interpret it, and run it to view the output. Here is a list of the most common IDEs and editors that Data Professionals use for development.

Jupyter Notebook - Jupyter Notebook is a web-based interactive computational environment for creating Jupyter notebook documents. It supports several languages like Python (IPython), Julia, R etc. and is largely used for data analysis, data visualization and further interactive, exploratory computing.
Visual Studio Code - Visual Studio Code (VS Code) is a source-code editor made by Microsoft for Windows, Linux and macOS. Features include support for debugging, syntax highlighting, intelligent code completion, snippets, code refactoring, and embedded Git. It can be used for writing code in many languages and, thanks to its wide variety of features, is also one of the most popular editors among software engineers.
JupyterLab - JupyterLab is the next-generation user interface for Project Jupyter, including notebooks. It has a modular structure, where you can open several notebooks or files (e.g. HTML, text, Markdown) as tabs in the same window. It offers more of an IDE-like experience.
PyCharm - PyCharm is an IDE used specifically for the Python language. It is developed by the Czech company JetBrains. It provides code analysis, a graphical debugger, an integrated unit tester, integration with version control systems, and supports web development with Django as well as data science with Anaconda. PyCharm is cross-platform, with Windows, macOS and Linux versions.
RStudio - RStudio is an IDE for R, a programming language for statistical computing, data science, and data visualization. It is available in two formats: RStudio Desktop is a regular desktop application while RStudio Server runs on a remote server and allows accessing RStudio using a web browser.
Spyder - Spyder is an open-source cross-platform IDE for scientific programming in the Python language. Spyder integrates with a number of prominent packages in the scientific Python stack, including NumPy, SciPy, Matplotlib, pandas, IPython, SymPy and Cython, as well as other open-source software.
Notepad++ - Notepad++ is a text and source code editor for use with Microsoft Windows. It supports tabbed editing, which allows working with multiple open files in a single window.
Sublime Text - Sublime Text is a commercial source code editor. It natively supports many programming languages and markup languages. Users can expand its functionality with plugins, typically community-built and maintained under free-software licenses. To facilitate plugins, Sublime Text features a Python API.
Vim or Emacs - Vim is a free and open-source, screen-based text editor program for Unix. Emacs or EMACS (Editor MACroS) is a family of text editors that are characterized by their extensibility. The manual for the most widely used variant, GNU Emacs, describes it as "the extensible, customizable, self-documenting, real-time display editor". Both are widely used on Unix and Linux systems and are among the oldest text editors still in use.
MATLAB - MATLAB is a proprietary multi-paradigm programming language and numeric computing environment developed by MathWorks. MATLAB allows matrix manipulations, plotting of functions and data, implementation of algorithms, creation of user interfaces, and interfacing with programs written in other languages.

Algorithms

Machine Learning is an integral part of Data Science, and most of us are fascinated by what it does in our day-to-day life: self-driving cars, robots and AI assistants that talk in almost human language, detection of diseases like cancer, facial recognition, etc. All of this is possible only because of data and the ML algorithms that work on it. The ML algorithms most widely used by Data Scientists are listed below. The list ranges from basic algorithms like regression and decision trees to high-profile Deep Learning architectures like Transformers, GANs, and RNNs.

Linear and Logistic Regression - These are the most basic algorithms of the ML ecosystem, and almost every data scientist learns them first. Linear regression is essentially a curve-fitting algorithm used to determine trends and predict the value of a dependent variable from independent variables. Logistic regression is used for classification tasks and estimates the probability that an input belongs to a class.
Decision Trees or Random Forests - A decision tree is another popular ML algorithm, in which a sequence of decisions on the input features leads to a predicted outcome. It can be used for both classification and regression tasks. A random forest is an ensemble technique that combines many different decision trees to give the output. For classification tasks, the output of the random forest is the class selected by most trees. For regression tasks, the mean prediction of the individual trees is returned.
Gradient Boosting Machines - Gradient boosting is a machine learning technique used in regression and classification tasks, among others. It gives a prediction model in the form of an ensemble of weak prediction models, which are typically decision trees.
Convolutional Neural Networks - A convolutional neural network (CNN) is a class of artificial neural network, most commonly applied to analyze visual imagery. They have applications in image and video recognition, recommender systems, image classification, image segmentation, medical image analysis, natural language processing, brain-computer interfaces, and financial time series. CNNs are regularized versions of multilayer perceptrons.
Bayesian Approaches - Bayesian inference is a method of statistical inference in which Bayes' theorem is used to update the probability for a hypothesis as more evidence or information becomes available.
Dense Neural Networks - This is a class of neural networks built from fully connected (dense) layers, which means each neuron in a dense layer receives input from all neurons of the previous layer. The dense layer is one of the most commonly used layers in neural network models.
Recurrent Neural Networks - Recurrent Neural Networks (RNNs) are a type of neural network where the output from the previous step is fed as input to the current step. An RNN keeps an internal memory of earlier inputs, which makes it well suited to machine learning problems that involve sequential data. Variants of the RNN architecture include bidirectional recurrent neural networks (BRNNs), long short-term memory (LSTM), and gated recurrent units (GRUs).
Transformer Networks - A transformer is a deep learning model that adopts the mechanism of attention, differentially weighting the significance of each part of the input data. It is used primarily in the field of natural language processing (NLP) and in computer vision (CV). Like recurrent neural networks (RNNs), transformers are designed to handle sequential input data, such as natural language, for tasks such as translation and text summarization. Some famous Transformer architectures are BERT and GPT.
Generative Adversarial Network - Generative Adversarial Networks, or GANs for short, are an approach to generative modeling using deep learning methods, such as convolutional neural networks. Generative modeling is an unsupervised learning task in machine learning that involves automatically discovering and learning the regularities or patterns in input data in such a way that the model can be used to generate or output new examples that plausibly could have been drawn from the original dataset. The GAN model architecture involves two sub-models: a generator model for generating new examples and a discriminator model for classifying whether generated examples are real, from the domain, or fake, generated by the generator model.
Evolutionary Approaches - Evolutionary algorithms are a heuristic-based approach to solving problems that cannot be easily solved in polynomial time, such as classically NP-Hard problems, and anything else that would take far too long to exhaustively process. Genetic algorithms are the most common type of evolutionary algorithm. They are used, among other things, to optimize neural networks and other ML models.
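
To make the first entry in the list above more concrete, here is a minimal sketch in plain Python. It fits a straight line with ordinary least squares (the curve-fitting idea behind linear regression) and shows the logistic (sigmoid) function that logistic regression uses to turn a linear score into a class probability. The function names and the toy data are assumptions made up for illustration; in practice you would normally use one of the frameworks from the next section.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sigmoid(score):
    """Logistic function: maps any real-valued score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical toy data that lies exactly on the line y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept   # predict y for an unseen x = 5
probability = sigmoid(0.0)             # a score of 0 sits on the decision boundary

print(slope, intercept, prediction, probability)
```

With this toy data the fitted line recovers the slope 2 and intercept 1 that generated it, and a score of 0 maps to a probability of 0.5.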

Machine Learning Frameworks

There are many frameworks, written in many languages but mostly Python, that implement the ML algorithms discussed above. These frameworks make the life of Data Scientists much easier, since a simple Python function call is often enough to run even complex ML algorithms without getting into the nitty-gritty of them. Some of the most prominent ML frameworks are listed below.

Scikit-learn - It is one of the most widely used frameworks for Python based Data science tasks. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
Link - https://scikit-learn.org/
TensorFlow - It is mainly used for training ML models based on neural networks and Deep Learning. TensorFlow was developed by the Google Brain team for internal Google use. It can be used in a wide variety of programming languages, most notably Python, as well as JavaScript, C++, and Java. This flexibility lends itself to a range of applications in many different sectors.
Link - https://www.tensorflow.org/
XGBoost - XGBoost is an open-source software library which provides a regularizing gradient boosting framework for C++, Java, Python, R, Julia, Perl, and Scala. It implements machine learning algorithms under the Gradient Boosting framework. It provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way. The same code runs on major distributed environments (Hadoop, SGE, MPI) and can solve problems beyond billions of examples.
Link - https://xgboost.readthedocs.io/
Keras - Keras is an open-source software library that provides a Python interface for artificial neural networks. Keras acts as an interface for the TensorFlow library.
Link - https://keras.io/
PyTorch - PyTorch is an open source machine learning library based on the Torch library, used for applications such as computer vision and natural language processing, primarily developed by Facebook's AI Research lab. It is free and open-source software released under the Modified BSD license.
Link - https://pytorch.org/
LightGBM - LightGBM, short for Light Gradient Boosting Machine, is a free and open source distributed gradient boosting framework for machine learning originally developed by Microsoft. It is based on decision tree algorithms and used for ranking, classification and other machine learning tasks.
Link - https://lightgbm.readthedocs.io/
CatBoost - CatBoost is an open-source software library developed by Yandex. It provides a gradient boosting framework that handles categorical features using a permutation-driven alternative to the classical algorithm.
Link - https://catboost.ai/
Hugging Face - Hugging Face maintains Transformers, an open-source library for building transformer-based language models. It is used in the field of Natural Language Processing and provides implementations of large language models like BERT and GPT.
Link - https://huggingface.co/
Prophet - It is a time-series forecasting library built by Facebook. Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.
Link - https://github.com/facebook/prophet
Caret - The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process for creating predictive models. The package contains tools for: data splitting, pre-processing, feature selection, model tuning using resampling, variable importance estimation as well as other functionality.
Link - https://topepo.github.io/caret/
PyTorch Lightning - PyTorch Lightning is an open-source Python library that provides a high-level interface for PyTorch, a popular deep learning framework. It is a lightweight and high-performance framework that organizes PyTorch code to decouple the research from the engineering, making deep learning experiments easier to read and reproduce. It is designed to create scalable deep learning models that can easily run on distributed hardware while keeping the models hardware agnostic.
Link - https://www.pytorchlightning.ai/
Fast.ai - fast.ai maintains fastai (without a period), an open-source library for deep learning that sits atop PyTorch.
Link - https://www.fast.ai/
Tidymodels - The tidymodels framework is a collection of packages for modeling and machine learning using tidyverse principles. It is built for the R language.
Link - https://www.tidymodels.org/
H2O-3 - H2O is an open source, in-memory, distributed, fast, and scalable machine learning and predictive analytics platform that allows you to build machine learning models on big data and provides easy productionalization of those models in an enterprise environment.
Link - https://docs.h2o.ai/h2o/latest-stable/h2o-docs/welcome.html
MXNet - Apache MXNet is an open-source deep learning software framework, used to train, and deploy deep neural networks.
Link - https://mxnet.apache.org/versions/1.8.0/
JAX - JAX is NumPy on the CPU, GPU, and TPU, with great automatic differentiation for high-performance machine learning research. JAX is Autograd and XLA, brought together for high-performance machine learning research. What’s new is that JAX uses XLA to compile and run your NumPy programs on GPUs and TPUs.
Link - https://jax.readthedocs.io/en/latest/notebooks/quickstart.html
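
As an illustration of how little code these frameworks need, here is a minimal sketch that trains a random forest with scikit-learn. It assumes scikit-learn is installed; the choice of the built-in Iris dataset, the train/test split, and the parameter values are mine for illustration and not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Small built-in dataset: 150 flowers, 4 numeric features, 3 classes
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data to check the model on unseen examples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# An ensemble of 100 decision trees; random_state makes the run repeatable
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```

Swapping in another scikit-learn estimator usually only changes the line that creates the model, because estimators share the same fit and predict methods.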

Cloud Data Storage Products

The most important aspect of Data Science is data; without it nothing is possible, and we need resources to store it. With the advent of cloud technologies it has become easier to store data and manage it smoothly. The list below has popular cloud data storage products from tech giants like Google, Amazon, and Microsoft.

Google Cloud Filestore - https://cloud.google.com/filestore
Amazon Elastic Filesystem - https://aws.amazon.com/efs/
Microsoft Azure Disk Storage - https://azure.microsoft.com/en-in/services/storage/disks/
Microsoft Azure Data Lake Storage - https://azure.microsoft.com/en-in/services/storage/data-lake-storage/
Google Cloud Storage - https://cloud.google.com/storage
Amazon Simple Storage Service - https://aws.amazon.com/s3/

Enterprise Machine Learning Tools

These are the tools used by large business organizations.

Amazon SageMaker - Amazon SageMaker is a cloud machine-learning platform that was launched in November 2017. SageMaker enables developers to create, train, and deploy machine-learning models in the cloud. SageMaker also enables developers to deploy ML models on embedded systems and edge devices.
Link - https://aws.amazon.com/sagemaker/
Databricks - Databricks is an enterprise software company founded by the creators of Apache Spark. The company has also created Delta Lake, MLflow and Koalas, open source projects that span data engineering, data science and machine learning.
Link - https://databricks.com/
Azure Machine Learning Studio - Azure Machine Learning studio is a web portal in Azure Machine Learning that contains low-code and no-code options for project authoring and asset management.
Link - https://studio.azureml.net/
Google Cloud Vertex AI - Vertex AI brings together the Google Cloud services for building ML under one, unified UI and API. In Vertex AI, you can now easily train and compare models using AutoML or custom code training and all your models are stored in one central model repository. These models can now be deployed to the same endpoints on Vertex AI.
Link - https://cloud.google.com/vertex-ai
DataRobot - DataRobot, the Boston-based Data Science company, enables business analysts to build predictive analytics with no knowledge of Machine Learning or programming. It uses automated ML to build and deploy accurate predictive models in a short span of time.
Link - https://www.datarobot.com/
RapidMiner - RapidMiner is a data science software platform developed by the company of the same name that provides an integrated environment for data preparation, machine learning, deep learning, text mining, and predictive analytics.
Link - https://rapidminer.com/
Alteryx - Alteryx empowers analysts to prep, blend, and analyze data faster with hundreds of no-code, low-code analytic building blocks that enable highly configurable and repeatable workflows.
Link - https://www.alteryx.com/
Dataiku - Dataiku enables teams to create and deliver data and advanced analytics using the latest techniques at scale. The software Dataiku Data Science Studio (DSS) supports predictive modelling to build business applications.
Link - https://www.dataiku.com/

Database Products

Databases are very important for Data Science. This list includes SQL and NoSQL databases along with big data database products. These are widely used databases.

MySQL - https://www.mysql.com/
PostgreSQL - https://www.postgresql.org/
Microsoft SQL Server - https://www.microsoft.com/en-in/sql-server/sql-server-downloads
MongoDB - https://www.mongodb.com/
Google Cloud BigQuery - https://cloud.google.com/bigquery
Oracle Database - https://www.oracle.com/database/
Microsoft Azure SQL Database - https://azure.microsoft.com/en-in/products/azure-sql/database/
Amazon Redshift - https://aws.amazon.com/redshift/
Snowflake - https://www.snowflake.com/
Google Cloud SQL - https://cloud.google.com/sql
Amazon DynamoDB - https://aws.amazon.com/dynamodb/
Microsoft Azure Cosmos DB - https://docs.microsoft.com/en-us/azure/cosmos-db/
Google Cloud Bigtable - https://cloud.google.com/bigtable
IBM Db2 - https://www.ibm.com/in-en/products/db2-database
Google Cloud Firestore - https://firebase.google.com/docs/firestore
Amazon Aurora - https://aws.amazon.com/rds/aurora
Google Cloud Spanner - https://cloud.google.com/spanner
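
Since SQL is the common thread across most of these products, here is a minimal sketch of a typical aggregation query run from Python. It uses SQLite, which is not in the list above and is chosen only because it ships with Python's standard library and needs no server. The table name, columns, and rows are hypothetical.

```python
import sqlite3

# In-memory database, so nothing is written to disk
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")

# Hypothetical rows, inserted with placeholders instead of string formatting
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 30.0)],
)

# Total sales per region, sorted by region name
rows = conn.execute(
    "SELECT region, SUM(amount) AS total "
    "FROM sales GROUP BY region ORDER BY region"
).fetchall()
conn.close()

print(rows)
```

The same GROUP BY query should carry over to the other SQL databases in the list with little or no change; the connection code is the part that differs per product.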

Machine Learning Experiment Tools

The list below shows tools used for machine learning explainability, which help us better understand ML algorithms, like TensorBoard. It also contains tools for MLOps like Weights & Biases, ClearML, Neptune.ai, etc. They are used to measure the performance of models, keep logs, optimize and automate ML pipelines, and tune hyperparameters.

TensorBoard - https://www.tensorflow.org/tensorboard
MLflow - https://mlflow.org/
Weights & Biases - https://wandb.ai/site
Neptune.ai - https://neptune.ai/
ClearML - https://clear.ml/
Guild.ai - https://guild.ai/
Polyaxon - https://polyaxon.com/
Comet.ml - https://www.comet.ml/site/
Domino Model Monitor - https://www.dominodatalab.com/product/domino-model-monitor
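
To show what "keeping logs" of experiments means at its simplest, here is a hand-rolled sketch using only the Python standard library. It is not the API of any tool listed above; the functions log_run and best_run, the file name, and the parameter and metric values are all hypothetical placeholders, not real results.

```python
import json
import os
import tempfile


def log_run(path, params, metrics):
    """Append one experiment run (its parameters and metrics) as a JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"params": params, "metrics": metrics}) + "\n")


def best_run(path, metric):
    """Return the logged run with the highest value of the given metric."""
    with open(path, encoding="utf-8") as f:
        runs = [json.loads(line) for line in f if line.strip()]
    return max(runs, key=lambda run: run["metrics"][metric])


# Log two made-up runs to a temporary file, then look up the better one
log_path = os.path.join(tempfile.mkdtemp(), "runs.jsonl")
log_run(log_path, {"learning_rate": 0.1, "depth": 3}, {"accuracy": 0.81})
log_run(log_path, {"learning_rate": 0.01, "depth": 5}, {"accuracy": 0.86})

best = best_run(log_path, "accuracy")
print(best["params"])
```

The dedicated tools add the parts that are tedious to build yourself on top of this idea, such as dashboards, comparison of runs, and storage of model artifacts.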

Automated Machine Learning Frameworks

Automated machine learning (AutoML) is the process of applying machine learning (ML) models to real-world problems using automation. More specifically, it automates the selection, composition and parameterization of machine learning models. These frameworks help in implementing AutoML. The different steps in traditional ML are data pre-processing, feature engineering, feature extraction, feature selection, algorithm selection, and hyperparameter optimization. AutoML helps automate this entire pipeline. AutoML dramatically simplifies these steps for non-experts.
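
The hyperparameter optimization step can be sketched in miniature with plain Python. The example below tries several values of k for a tiny k-nearest-neighbours classifier, scores each with leave-one-out validation, and keeps the best one. This is a toy illustration of automated selection only, not how any particular AutoML framework works; the function names and the one-dimensional data are assumptions made up for the example.

```python
from collections import Counter


def knn_predict(train, query, k):
    """Predict a label for query by majority vote among its k nearest points."""
    nearest = sorted(train, key=lambda item: abs(item[0] - query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def loo_accuracy(data, k):
    """Leave-one-out accuracy: predict each point from all the other points."""
    correct = 0
    for i, (x, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_predict(rest, x, k) == label:
            correct += 1
    return correct / len(data)


def select_k(data, candidates):
    """Score every candidate k and return the best one with all the scores."""
    scores = {k: loo_accuracy(data, k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Hypothetical one-dimensional data: class "a" on the left, class "b" on the right
data = [(1.0, "a"), (2.0, "a"), (3.0, "a"), (4.0, "a"),
        (6.0, "b"), (7.0, "b"), (8.0, "b"), (9.0, "b")]

best_k, scores = select_k(data, [1, 3, 5, 7])
print(best_k, scores)
```

An AutoML framework applies the same try, score, and keep-the-best loop across whole pipelines (pre-processing, features, algorithms, and hyperparameters) and typically uses smarter search strategies than trying every candidate.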