As a practicing data scientist, you’ll need to become familiar with machine learning concepts. Data science and machine learning overlap, but in the most basic terms you could say:
- Data science is used to gain insights and understanding of data.
- Machine learning is used to produce predictions based on data.
That said, the boundary between them is not a distinct one, and most data science practitioners need to be capable of switching back and forth between the two domains.
Knowing where to start your data science journey can be overwhelming, so with the help of Microsoft Senior AI Engineer, Samia Khalid, this post breaks down what you need to know to get your career started in data science. If you’re looking for something more in-depth and hands-on, you can visit Samia’s interactive data science course, Grokking Data Science.
Here’s what will be covered today:
- Essential Python libraries for a data scientist
- A quick mention on Jupyter notebooks: a must-have for data scientists
- Machine learning 101 for data scientists
- Machine learning project checklist
Essential Python libraries for a data scientist
NumPy
NumPy (Numerical Python) is a powerful and extensively used library for storing and computing on numerical data. It provides data structures, algorithms, and other useful utilities; for example, it contains basic linear algebra functions, Fourier transforms, and advanced random number capabilities. It can also be used to load data into Python and export it back out.
Here are some NumPy basics you should start to become familiar with:
NumPy basics
- Creating NumPy arrays and array attributes
- Array indexing and slicing
- Reshaping and concatenation
NumPy arithmetic and statistics basics
- Computations and aggregations
- Comparison and boolean masks
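To make the basics above concrete, here is a minimal, hedged sketch of these operations (the values are invented for illustration):

```python
import numpy as np

# Creating NumPy arrays and array attributes
a = np.arange(12)             # 0, 1, ..., 11
print(a.shape, a.dtype)       # (12,) and an integer dtype

# Array indexing and slicing
print(a[3], a[2:5], a[::-1])  # one element, a slice, the reversed array

# Reshaping and concatenation
m = a.reshape(3, 4)                        # 3 rows, 4 columns
stacked = np.concatenate([m, m], axis=0)   # shape (6, 4)

# Computations and aggregations
print(m.sum(), m.mean(axis=0), m.max(axis=1))

# Comparison and boolean masks
mask = m > 5       # boolean array with the same shape as m
print(m[mask])     # only the elements greater than 5
```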
Pandas
Pandas is a library that you can’t avoid when working with Python on a data science project. It’s a powerful tool for data wrangling, a process required to prepare your data so that it can actually be consumed for analysis and model building.
Pandas contains a large variety of functions for data import, export, indexing, and data manipulation. It also provides handy data structures like DataFrames (tables of rows and columns) and Series (1-dimensional arrays), along with efficient methods for handling them. For example, it allows you to reshape, merge, split, and aggregate data.
Here are some basic Pandas concepts you should start to become familiar with:
Pandas core components
- The Series Object
- The DataFrame Object
Pandas DataFrame Operations
- Read, view, and extract information
- Selection, slicing, and filtering
- Grouping and sorting
- Dealing with missing and duplicates
- Pivot tables and functions
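Here is a short, hedged sketch of these operations on a toy DataFrame (the column names and values are invented for illustration):

```python
import numpy as np
import pandas as pd

# The two core components: a Series (1-dimensional) and a DataFrame (2-dimensional)
ages = pd.Series([25, 32, np.nan, 41], name="age")
df = pd.DataFrame({
    "city":  ["Oslo", "Oslo", "Lima", "Lima"],
    "age":   [25, 32, np.nan, 41],
    "spend": [120.0, 80.0, 95.0, 60.0],
})

# Read, view, and extract information
print(df.head())
print(df.describe())

# Selection, slicing, and filtering
over_30_in_lima = df[(df["city"] == "Lima") & (df["age"] > 30)]

# Grouping and sorting
avg_spend = df.groupby("city")["spend"].mean().sort_values()

# Dealing with missing values and duplicates
clean = df.dropna().drop_duplicates()

# Pivot tables
pivot = df.pivot_table(values="spend", index="city", aggfunc="mean")
```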
Scikit-learn
Scikit-learn is an easy-to-use library for machine learning. It comes with a variety of efficient tools for machine learning and statistical modeling, such as:
- Classification models (e.g., Support Vector Machines, Random Forests, Decision Trees)
- Regression Analysis (e.g., Linear Regression, Ridge Regression, Logistic Regression)
- Clustering methods (e.g., k-means)
- Data reduction methods (e.g., Principal Component Analysis, feature selection)
- Model tuning and selection with features like grid search and cross-validation
Scikit-learn also allows for pre-processing of data.
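As a hedged illustration of the typical workflow, here is a minimal sketch that combines pre-processing, a classifier, and cross-validation on scikit-learn’s built-in iris dataset (any of the models above could be swapped in):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pre-processing and model chained in one pipeline
model = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=42))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))

# Cross-validation gives a more robust performance estimate
print("cv accuracy:", cross_val_score(model, X_train, y_train, cv=5).mean())
```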
Matplotlib and Seaborn
Matplotlib is widely used for data visualization tasks, like plotting histograms, line plots, and heat maps.
Seaborn is another great library for creating attractive, information-rich graphics. Its goal is to make data exploration and understanding easier, and it does this very well. Seaborn is built on top of Matplotlib, which it uses under the hood.
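A minimal sketch of the relationship: the same synthetic data plotted with plain Matplotlib and then with Seaborn’s higher-level interface:

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

data = np.random.default_rng(0).normal(size=500)  # synthetic sample data

# Plain Matplotlib: a basic histogram
plt.hist(data, bins=30)
plt.title("Matplotlib histogram")
plt.show()

# Seaborn builds on Matplotlib, adding statistical extras like a density curve
sns.histplot(data, bins=30, kde=True)
plt.title("Seaborn histogram with KDE")
plt.show()
```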
A quick mention on Jupyter notebooks: A must-have for data scientists
The Jupyter Notebook is an incredibly powerful and sleek tool for developing and presenting data science projects. It can integrate code and its output into a single document, combining visualizations, narrative text, mathematical equations, and other rich media. It’s simply awesome.
Jupyter Notebook can handle many other languages, like R, as well. Its intuitive workflows, ease of use, and zero cost have made it THE tool at the heart of any data science project.
Machine learning 101 for data scientists
Takeaways from this section:
- To introduce the fundamental concepts of machine learning.
- To learn about several of the most important machine learning algorithms and develop an intuition into how they work and when and where they are applicable.
- To understand the necessary steps of a machine learning project and how to apply them via a real end-to-end example.
Main components of machine learning
There are three basic components you need to train your machine learning systems:
Data
Data can be collected both manually and automatically. For example, users’ personal details like age and gender, all their clicks, and their purchase history are valuable data for an online store. Do you recall reCAPTCHA, which forces you to “Select all the street signs”? That’s an example of some free manual labor! Data is not always images; it could be tables of data with many variables (features), text, sensor recordings, sound samples, etc., depending on the problem at hand.
Features
Features are often also called variables or parameters. These are essentially the factors for a machine to look at: the properties of the “object” in question, e.g., a user’s age, a stock price, the area of a rental property, the number of words in a sentence, petal length, or the size of cells. Choosing meaningful features is very important, but it takes practice and thought to figure out which features to use, as they are not always as obvious as in these examples.
Algorithms
Machine learning is based on general purpose algorithms. For example, one kind of algorithm is classification. Classification allows you to put data into different groups. The interesting thing is that the same classification algorithm used to recognize handwritten numbers could also be used to classify emails into spam and not-spam without changing a line of code! How is this possible?
Although the algorithm is the same, it’s fed different input data, so it comes up with different classification logic. However, this is not meant to imply that one algorithm can be used to solve all kinds of problems! The choice of the algorithm is important in determining the quality of the final machine learning model. Acquiring as much data as possible is a very important first step in getting started with machine learning systems.
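To illustrate, here is a minimal, hedged sketch: the exact same classification code learns two different tasks purely because it is fed different data. Both datasets ship with scikit-learn; tumor classification stands in here for the spam example, since no spam dataset is bundled:

```python
from sklearn.datasets import load_breast_cancer, load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def evaluate(X, y):
    # Identical classification logic, whatever the dataset
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    return cross_val_score(clf, X, y, cv=5).mean()

# Task 1: recognizing handwritten digits
X, y = load_digits(return_X_y=True)
print("digits accuracy:", evaluate(X, y))

# Task 2: classifying tumors, same code, not a line changed
X, y = load_breast_cancer(return_X_y=True)
print("tumors accuracy:", evaluate(X, y))
```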
Machine learning algorithms: categorization and the most used algorithms
Machine Learning algorithms can be broadly categorized into the following four groups:
- Supervised Learning
- Unsupervised Learning
- Semi-supervised Learning
- Reinforcement Learning
Supervised learning
In Supervised Learning, the training data provided as input to the algorithm includes the final solutions, called labels or classes, because the algorithm learns by “looking” at examples with correct answers. In other words, the algorithm has a supervisor or teacher who provides it with all the answers first, like whether the picture contains a cat or not, and the machine uses these examples to learn one by one.
Another typical task, of a different type, would be to predict a target numeric value, like a housing price, from a set of features like size, location, and number of bedrooms. To train the system, you again need to provide many correct examples of known housing prices, including both their features and their labels.
While categorizing emails or identifying whether a picture shows a cat or a dog are supervised learning tasks of the classification type, predicting housing prices is known as regression.
What’s the difference?
In regression, the output is a continuous value, such as a housing price. In classification, the output is a label like “spam” or “not spam” rather than a decimal number; the output only takes discrete values, e.g., 1 for “spam” and 0 for “not spam”. Basically, the type of algorithm you choose (classification or regression) depends on the type of output you want.
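A minimal sketch of the contrast, using tiny invented datasets (all the numbers below are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: size in square meters -> price, a continuous output
sizes = np.array([[50], [70], [90], [120], [150]])
prices = np.array([150_000, 200_000, 260_000, 330_000, 400_000])
reg = LinearRegression().fit(sizes, prices)
print(reg.predict([[100]]))   # a continuous number, roughly 280,000

# Classification: count of suspicious words -> a discrete 0/1 label
counts = np.array([[0], [1], [5], [8], [12]])
labels = np.array([0, 0, 1, 1, 1])   # 1 = spam, 0 = not spam
clf = LogisticRegression().fit(counts, labels)
print(clf.predict([[6]]))     # a label: array([1]), i.e., spam
```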
Most used Supervised Learning Algorithms: Linear Regression, Logistic Regression, Support Vector Machines, Decision Trees, Random Forests, K-Nearest Neighbors, Artificial Neural Networks.
Check out this article on ML algorithms to learn more.
Unsupervised Learning
In Unsupervised Learning, the data has no labels; the goal of the algorithm is to find relationships in the data. The system needs to learn without a teacher, discovering hidden patterns in the data on its own. Segmenting data into groups this way is an example of what is known as clustering: classification with no predefined classes, based on features that are not known in advance.
Most used Unsupervised Learning Algorithms: K-Means, visualization and dimensionality reduction methods such as Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE), and association rule learning.
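As a hedged sketch, here is K-Means discovering groups in unlabeled synthetic data; the algorithm is never shown any labels:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# 300 unlabeled points scattered around 3 hidden centers
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# K-Means infers the grouping from the data alone
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(kmeans.cluster_centers_)   # the 3 centers it discovered
print(kmeans.labels_[:10])       # the cluster assigned to each point
```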
Semi-supervised Learning
Semi-supervised learning deals with partially labeled training data, usually a lot of unlabeled data with some labeled data. Most semi-supervised learning algorithms are a combination of unsupervised and supervised algorithms.
Reinforcement Learning
Reinforcement Learning is a special and more advanced category where the learning system needs to learn to make specific decisions. The learning system observes the environment to which it is exposed, selects and performs actions, and gets rewards or penalties in return. Its goal is to choose actions that maximize the reward over time. So, by trial and error, and based on past experience, the system learns the best strategy, called a policy, on its own.
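The core loop (act, observe a reward, update, favor what worked) can be sketched with a toy multi-armed bandit; the payout probabilities here are invented, and real reinforcement learning problems are far richer than this:

```python
import numpy as np

rng = np.random.default_rng(0)
true_payout = [0.3, 0.5, 0.8]   # hidden reward probability of each action
q = np.zeros(3)                 # the system's estimated value per action
counts = np.zeros(3)

for step in range(1000):
    # Epsilon-greedy policy: mostly exploit the best estimate, sometimes explore
    action = rng.integers(3) if rng.random() < 0.1 else int(np.argmax(q))
    reward = float(rng.random() < true_payout[action])   # environment feedback
    counts[action] += 1
    q[action] += (reward - q[action]) / counts[action]   # running-average update

print(q)             # estimates converge toward the true payouts
print(np.argmax(q))  # the learned best action: 2
```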
Machine learning project checklist
You have been hired as a new Data Scientist and you have an exciting project to work on! How should you go about it?
Here are some best practices and a checklist that you should consider adopting when working on an end-to-end ML project.
Frame the problem and look at the big picture: Understand the problem, both formally and informally. Figure out the right questions to ask and how to frame them. Understand the assumptions based on domain knowledge.
Get the data: Do NOT forget about data privacy and compliance here; they are of paramount importance! Ask questions and engage with stakeholders, if needed.
Explore the data and extract insights: Summarize the data. Find the types of the variables, map out the underlying data structure, find correlations among variables, identify the most important variables, and check for missing values and mistakes in the data. Visualize the data to take a broad look at patterns, trends, anomalies, and outliers. Use data summarization and data visualization techniques to understand the story the data is telling you.
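A hedged sketch of this exploration step in pandas (the columns and values stand in for your real dataset, which you would typically load with pd.read_csv):

```python
import pandas as pd

# Stand-in for a real dataset loaded from disk or a database
df = pd.DataFrame({
    "price": [320, 280, None, 410, 390],
    "rooms": [3, 2, 2, 4, 4],
    "area":  [72, 55, 50, 98, 95],
})

df.info()                 # variable types and non-null counts
print(df.describe())      # summary statistics per numeric column
print(df.isnull().sum())  # missing values per column
print(df.corr())          # correlations among variables
```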
Start simple: Begin with a very simple model, like linear or logistic regression, with a minimal set of prominent features (directly observed and reported features). This will give you good familiarity with the problem at hand and set the right direction for the next steps.
Do more feature engineering: Prepare the data to extract more intricate patterns, and combine or modify existing features to create new ones.
Explore many different models and shortlist the best ones based on comparative evaluation, e.g., compare RMSE or ROC-AUC scores for different models.
Fine-tune the parameters of your models and consider combining them for the best results.
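One common way to do this fine-tuning in scikit-learn is a grid search over hyperparameters, sketched here on synthetic data (the grid values are illustrative, not recommendations):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data from earlier steps
X_train, y_train = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)

param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 10],
}

# Tries every combination with 5-fold cross-validation, scored by RMSE
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)
```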
Present your solution to the stakeholders in a simple, engaging, and visually appealing manner. Use the art of story-telling, it works wonders! Remember to tailor your presentation based on the technical level of your target audience.
If the scope of your project is more than just extracting and presenting insights from data, proceed with launching, monitoring, and maintaining your system.
Of course, this checklist is just a reference for getting started. Once you start working on real projects, adapt and improvise, and at the end of each project reflect on the takeaways; learning from mistakes is essential!
Where to go from here?
We’ve covered some of the basics of machine learning for data scientists, but there is still a lot more to learn and explore if you really want to get your career started, and you don’t have to go through it alone.
Industry expert and Microsoft Senior AI Engineer, Samia Khalid, has compiled her learnings into a comprehensive course, Grokking Data Science. Samia has laid out everything you’ll need in one place to get started and thrive in a data science career.
Thoughts? Questions? Comments?
Further readings and resources
Course Track: Become a Machine Learning Engineer
Article: How to ace your next ML interview
Article: How to deploy machine learning models with Azure Machine Learning
Article: How and why to become a machine learning engineer
Article: The practical approach to machine learning for software engineers
Article: My experience working with ML at Google and Microsoft