
Percival Villalva for Apify

Originally published at blog.apify.com

Python and machine learning

Learn how Python and machine learning intersect to solve complex problems that defeat traditional programming methods. Find out about Pandas, TensorFlow, Scikit-learn, and how they can transform data.

What is machine learning?

Machine learning, a subset of artificial intelligence (AI), is a rapidly evolving field with numerous practical applications in various domains. Recently, the popularity and impact of AI, exemplified by advancements like ChatGPT, have boosted interest in the field and its potential to enhance our daily lives. But what exactly is machine learning and when would we want to use it? And how does Python fit in with machine learning?

To answer these questions, let's consider an example to understand its significance. Imagine you're tasked with developing a program to analyze an image and determine whether it contains a cat, a dog, or another animal. To accomplish such a broad task, traditional programming techniques would quickly lead to overwhelming and time-consuming complexity. Devising multiple rules to detect curves, edges, and colors in the image would be prone to flaws. For example, black-and-white photos would require rule revisions, and unanticipated angles of cats or dogs would make any rules we create ineffective. In other words, attempting to solve this problem through traditional programming methods would prove excessively complicated or even impossible.

And this is where machine learning comes into play. It offers a technique for us to address such problems effectively. Instead of relying on explicit programming rules, we can construct a model or an engine and provide it with an abundance of data. For instance, to solve our dogs and cats problem, we could supply thousands or even tens of thousands of pictures of cats and dogs to a model that would then analyze this input data and learn its patterns autonomously.

Now, suppose we present the model with a new, unseen picture of a cat and ask whether it depicts a cat, a dog, or a horse. The model, based on its learned patterns, will provide a response accompanied by a certain level of confidence. The more data we feed into the model, the better its accuracy becomes, especially if the data is relevant and of high quality.

Although this example is simplistic, machine learning has extensive applications, including self-driving cars, robotics, natural language processing, image recognition, and forecasting, such as predicting stock market trends or weather patterns.

How Python and machine learning come together

That all sounds great, but what can we use to build those models? While there is no single best programming language for machine learning, Python has emerged as the de facto language for machine learning due to its simplicity, flexibility, and vibrant ecosystem of libraries and tools.

In this article, we will explore the best Python libraries for developing machine-learning models, such as Pandas, TensorFlow, Scikit-learn, and more, to understand their role in the various stages of the machine-learning process.

5 steps in developing a machine learning model with Python

Developing a machine learning model involves several essential steps that collectively form a pipeline from data preparation to model deployment. Understanding these steps is crucial for building effective and accurate machine-learning models. Let's take a quick look at each step and what popular Python libraries we could use to fulfill the requirements of each step:

1. Data preparation and exploration

Data preparation and exploration lay the foundation for any successful machine-learning project. This step involves tasks such as data cleaning, handling missing values, feature scaling, and data visualization. Properly preparing and exploring the data can help identify patterns, outliers, and relationships that will influence the model's performance.

To accomplish this step, we can leverage libraries such as:

  • Pandas: In the context of machine learning, Pandas is a crucial tool for handling and analyzing structured data. By leveraging its powerful data structures, such as DataFrames, we can efficiently manipulate and transform datasets. To that end, Pandas provides an extensive range of functions for data cleaning, handling missing values, and performing descriptive statistics. These capabilities are crucial in the data preparation phase of machine learning, enabling us to preprocess the data, remove outliers, impute missing values, and extract meaningful insights.

  • Matplotlib: As a widely used plotting library, Matplotlib offers a versatile set of visualization techniques, including line plots, scatter plots, and histograms. These visualizations are invaluable in the data exploration phase, helping practitioners identify patterns, trends, and anomalies in the dataset. By visualizing the data, machine learning practitioners can make informed decisions about feature engineering, data preprocessing, and model selection.

❗ The code examples provided in this article are for demonstration and educational purposes only and should not be considered production-ready.

To get an idea of how we would go about this step, let's consider a situation where we use Pandas to explore and visualize data retrieved from a CSV file:

import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
data = pd.read_csv('sample_data.csv')

# Explore the data
print(data.head())      # Display the first few rows
print(data.describe())  # Get a statistical summary
data.info()             # Print information about the columns

# Handle missing values
data = data.fillna(0)  # Replace missing values with 0

# Visualize the data
data['age'].plot.hist()  # Plot a histogram of the age column
data.plot.scatter(x='income', y='purchase')  # Create a scatter plot of income vs. purchase
plt.show()  # Display the plots when running as a script


To obtain high-quality datasets for machine learning, there are several options available. One approach is to download existing datasets from machine learning communities like Kaggle, where you can find a wide range of datasets for free.

Alternatively, if you require a dataset tailored to your specific project, web scraping can be an effective solution. Web scraping platforms like Apify offer access to numerous pre-built scrapers in Apify Store, allowing you to extract data from data-rich websites such as Google Maps, YouTube, and Meta's Threads. Additionally, for those interested in flexing their web scraping skills, building and deploying custom scrapers is an option.
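
As a quick illustration of the scraping route, here is a minimal sketch using Apify's Python client; the Actor ID, input fields, and API token are placeholders, since each scraper defines its own input schema:

from apify_client import ApifyClient
import pandas as pd

client = ApifyClient('YOUR_APIFY_TOKEN')  # Personal API token from your Apify account

# Run a scraper (Actor) and wait for it to finish; the run_input depends on the chosen Actor
run = client.actor('username/my-scraper').call(run_input={'startUrls': [{'url': 'https://example.com'}]})

# Fetch the scraped items from the run's default dataset and load them into Pandas
items = list(client.dataset(run['defaultDatasetId']).iterate_items())
df = pd.DataFrame(items)
print(df.head())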

2. Feature engineering and selection

Feature engineering involves transforming raw data into meaningful features that capture the underlying patterns and relationships. This step often requires domain expertise and creativity. Feature selection aims to identify the most relevant features for the model, reducing complexity and improving efficiency.

To assist with feature engineering and selection, we can utilize libraries such as:

  • Scikit-learn: Scikit-learn offers a wide range of feature extraction and transformation techniques. It helps us handle different data types, encode categorical variables for numerical representation, scale numerical features, generate new informative features, and perform feature selection to improve model performance. In short, Scikit-learn streamlines feature engineering, making data preprocessing and transformation easier, resulting in more effective machine learning models.

  • Featuretools: Featuretools is a library designed for automated feature engineering in machine learning. It enables us to create new features by combining existing ones, making it easier to capture complex relationships and patterns in the data.

To illustrate this step, let's consider a text classification task where we want to classify news articles into different categories. We can use Scikit-learn to preprocess the text data, convert it into numerical features, and select the most important features using the TF-IDF (Term Frequency-Inverse Document Frequency) method.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# 'text_data' is a list of article texts and 'labels' their categories (assumed to be defined)

# Convert the raw text into TF-IDF features
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(text_data)

# Keep the 1,000 features most associated with the labels (chi-squared test)
selector = SelectKBest(chi2, k=1000)
X_selected = selector.fit_transform(X, labels)

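The example above handles text data. For tabular datasets like the one from step 1, the same Scikit-learn toolbox covers the feature scaling and categorical encoding mentioned earlier; here is a minimal sketch with hypothetical column names:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Scale numerical columns and one-hot encode categorical ones in a single step
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'income']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['country', 'device'])
])

X_processed = preprocessor.fit_transform(data)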

3. Model building and training

Model building involves selecting an appropriate algorithm or model architecture to solve the problem at hand. Python offers a wide range of algorithms and models, each suited for different types of problems. Once the model is chosen, it needs to be trained on labeled data to learn the patterns and make accurate predictions.

To build and train machine learning models, we can rely on libraries such as:

  • Scikit-learn: Scikit-learn not only helps with step 2 (feature engineering and selection) but also offers a consistent API that facilitates the training process, with functions for model fitting, hyperparameter tuning, and model serialization.

  • TensorFlow: TensorFlow is a popular deep-learning framework that allows us to build and train neural networks for various tasks. It offers a wide range of pre-built neural network architectures and supports custom model creation. TensorFlow provides efficient computation on GPUs and TPUs, enabling faster training for large-scale models.

To illustrate this, let's take a look at how we would implement this step in a real project using Scikit-learn and TensorFlow.

Let's take a classification problem as an example. We can use logistic regression from Scikit-learn to train a model on labeled data and make predictions on new, unseen data.

from sklearn.linear_model import LogisticRegression

# X_train, y_train, and X_test are assumed to come from a prior train/test split

# Train the model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions on unseen data
predictions = model.predict(X_test)

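The Scikit-learn bullet above also mentions model serialization. A common way to persist a fitted Scikit-learn model is with joblib, sketched here with an arbitrary file name:

from joblib import dump, load

# Save the fitted model to disk...
dump(model, 'logistic_model.joblib')

# ...and load it back later, e.g. in the deployment step
model = load('logistic_model.joblib')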

Now let's take it a step further and see how we can use TensorFlow not only to build and train a model but also to make predictions and, finally, deploy it.

For example, imagine we are building a handwritten digit recognition system. The neural network architecture defined in the code below could be trained on a dataset of handwritten digit images along with their corresponding labels. Once trained, the model can make predictions on new, unseen digit images, accurately classifying them into their respective digits (0 to 9).

Then, the trained model can be saved and deployed in a production environment, where it can be integrated into a larger application or used as an API to provide digit recognition functionality to end users.

import tensorflow as tf

# Creating a simple neural network
# (assumes each 28x28 digit image has been flattened into a 784-dimensional vector)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax')  # One output per digit class (0-9)
])

# Compiling the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Training the model
model.fit(x_train, y_train, epochs=10, batch_size=32, validation_data=(x_val, y_val))

# Making predictions
predictions = model.predict(x_test)

# Saving the model so it can be deployed and served later
model.save('model.h5')

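Later, the saved model can be loaded back, for example inside the API we build in step 5. A minimal sketch, where x_new stands for new, preprocessed and flattened digit images:

import tensorflow as tf

# Restore the saved model and classify new digit images
restored_model = tf.keras.models.load_model('model.h5')
predictions = restored_model.predict(x_new)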

4. Model evaluation and validation

After training the model, it is essential to assess its performance and validate its ability to generalize well on unseen data. Evaluation metrics such as accuracy, precision, recall, and F1 score provide insights into the model's effectiveness. Validation techniques like cross-validation help estimate how well the model will perform in the real world.

Before we get to the libraries we use for model evaluation and validation, let's understand exactly what the metrics and techniques mentioned above measure and why they are important for building reliable machine-learning models.

Evaluation metrics

  • Accuracy: Measures the proportion of correctly classified instances out of the total number of instances. It is calculated as the number of correct predictions divided by the total number of predictions and provides a general measure of how well the model performs overall. For example, in email spam detection, accuracy measures the percentage of emails correctly classified as spam or non-spam.

  • Precision: The proportion of correctly predicted positive instances out of all instances predicted as positive. It represents the model's ability to avoid false positive errors, indicating how precise the positive predictions are. Precision is important in scenarios where false positives are costly. For instance, in medical diagnosis, high precision is crucial to accurately identify patients with a specific disease and avoid unnecessary treatments or interventions.

  • Recall: Also known as sensitivity or the true positive rate, recall measures the proportion of correctly predicted positive instances out of all actual positive instances. It captures the model's ability to find all positive instances, avoiding false negatives. Recall is particularly important when the cost of false negatives is high. For example, in fraud detection, recall is essential to identify as many fraudulent transactions as possible, even if it means a higher number of false positives.

  • F1 score: The harmonic mean of precision and recall. It provides a balanced measure of the model's performance, considering both precision and recall simultaneously. The F1 score is useful when there is an uneven class distribution or when precision and recall are equally important. For example, in information retrieval systems, the F1 score is commonly used to evaluate search algorithms, where both precision and recall are crucial for accurate and comprehensive results.
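
To make these definitions concrete, here is a small worked example with hypothetical prediction counts from a binary classifier:

# Hypothetical counts: 90 true positives, 10 false positives, 30 false negatives, 870 true negatives
tp, fp, fn, tn = 90, 10, 30, 870

accuracy = (tp + tn) / (tp + fp + fn + tn)          # 0.96
precision = tp / (tp + fp)                          # 0.90
recall = tp / (tp + fn)                             # 0.75
f1 = 2 * precision * recall / (precision + recall)  # ~0.82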

Validation techniques (cross-validation)

Cross-validation helps assess a model's generalization performance and mitigate the risk of overfitting. It plays a crucial role in machine learning for the following reasons:

  • Performance estimation: Cross-validation provides a more reliable estimate of how well a model will perform on unseen data by evaluating it on multiple validation sets. This helps determine if the model has learned meaningful patterns or is simply memorizing the training data.

  • Hyperparameter tuning: Cross-validation aids in selecting the best set of hyperparameters for a model. By comparing performance across different parameter configurations, it helps identify the optimal combination that maximizes performance on unseen data.

  • Model selection: Cross-validation allows for a fair comparison between different models or algorithms. By evaluating their performance on multiple validation sets, it assists in choosing the most suitable model for the given problem, considering accuracy, precision, recall, or specific requirements.

  • Data leakage prevention: Cross-validation mitigates data leakage by creating separate validation sets that are not used during model training. This ensures a fair evaluation and avoids unintentional over-optimization based on the test set.

In real-life applications, cross-validation is particularly valuable in tasks such as credit risk assessment, where accurate predictions on unseen data are essential for decision-making.

In summary, cross-validation is essential for the development of robust models that generalize well to new instances and provides confidence in their performance outside the training data.
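
To see how cross-validation supports hyperparameter tuning in practice, here is a minimal sketch using Scikit-learn's GridSearchCV; it assumes the X_train/y_train split from earlier and a binary classification task:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Compare a few regularization strengths using 5-fold cross-validation
param_grid = {'C': [0.01, 0.1, 1, 10]}
grid_search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5, scoring='f1')
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)  # Hyperparameter combination with the best cross-validated score
print(grid_search.best_score_)   # Mean cross-validated F1 score of that configuration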

To evaluate and validate machine learning models, we can utilize libraries such as:

  • Scikit-learn: Scikit-learn offers a wide range of evaluation metrics for classification, regression, and clustering tasks. It provides functions for calculating accuracy, precision, recall, F1 score, and more. Scikit-learn also includes techniques for cross-validation, which allow for robust performance estimation.

  • Yellowbrick: Yellowbrick is a visualization library that integrates with Scikit-learn and provides visual tools for model evaluation and diagnostics. It offers visualizations for classification reports, learning curves, confusion matrices, and feature importances, aiding in the analysis of model performance.
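
As a quick illustration of Yellowbrick, the sketch below visualizes a classification report for the model trained in step 3; it assumes the yellowbrick package is installed and the train/test splits from earlier exist:

from yellowbrick.classifier import ClassificationReport

# Visualize per-class precision, recall, and F1 for the trained classifier
visualizer = ClassificationReport(model)
visualizer.fit(X_train, y_train)  # Fit the underlying model
visualizer.score(X_test, y_test)  # Compute the metrics on the test set
visualizer.show()                 # Render the report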

So let's take a look at how we can use some of Scikit-learn's evaluation metrics and validation techniques. Remember our earlier classification model? We can use Scikit-learn to evaluate its performance by calculating accuracy, precision, recall, and F1 score, and, while we're at it, use cross-validation to estimate how it will perform on unseen data.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import cross_val_score

# 'y_true' holds the ground-truth labels and 'y_pred' the model's predictions (assumed to be defined)

# Evaluate the model
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# Perform 5-fold cross-validation and report the average score
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())


5. Model deployment and monitoring

Once a satisfactory model is obtained, the exciting part begins: deploying it to a production environment for real-world usage. This step involves integrating the model into an application or system and ensuring its performance is continuously monitored and optimized over time.

To deploy and monitor machine learning models, we can rely on libraries such as:

  • Flask: Flask is a lightweight web framework that allows us to build APIs for serving machine learning models. It provides a simple and scalable way to expose our models as web services, enabling seamless integration into applications or systems.

  • TensorBoard: TensorBoard is a powerful visualization tool that comes bundled with TensorFlow. It helps monitor and analyze the performance of deep learning models by providing interactive visualizations of metrics, model architectures, and training progress.

  • Prometheus and Grafana: Prometheus is a monitoring and alerting toolkit, while Grafana is a visualization tool. Together, they offer a robust solution for monitoring the performance and health of machine learning models in real time, providing valuable insights and enabling proactive optimization.

The choice of deployment and monitoring tools for machine learning models depends on the project and the libraries you are comfortable with. For example, if you are building TensorFlow models, using TensorBoard to monitor them is a natural choice.
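
For instance, to monitor the digit-recognition model from step 3 during training, we could attach TensorFlow's TensorBoard callback. A minimal sketch (the log directory name is arbitrary):

import tensorflow as tf

# Write training and validation metrics to the 'logs' directory
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir='logs')

model.fit(x_train, y_train, epochs=10, validation_data=(x_val, y_val),
          callbacks=[tensorboard_callback])

# Then inspect the run in the browser by starting TensorBoard from a terminal: tensorboard --logdir logs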

But we're not restricted to a single library either. To deploy and monitor machine learning models, we can combine several of them. For instance, we can use Flask to create an API that serves the model's predictions, Prometheus for monitoring and alerting, and Grafana to visualize performance metrics. Together, they provide a robust solution for deploying and monitoring machine learning models.

from flask import Flask, request, jsonify
from prometheus_flask_exporter import PrometheusMetrics

app = Flask(__name__)
metrics = PrometheusMetrics(app)  # Exposes request metrics at /metrics for Prometheus to scrape

# 'model' is assumed to be a trained model loaded elsewhere (e.g. with joblib or tf.keras.models.load_model)

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()  # Expects a JSON array of feature vectors
    predictions = model.predict(data)
    return jsonify(predictions.tolist())  # Convert the NumPy array into a JSON-serializable list

if __name__ == '__main__':
    app.run()

# Monitor the model using Prometheus and Grafana...

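Once the Flask app is running locally (on port 5000 by default), we can send it a test request; the feature vector below is purely hypothetical and depends on what the model expects:

import requests

# Send one hypothetical feature vector to the prediction endpoint
response = requests.post('http://localhost:5000/predict', json=[[5.1, 3.5, 1.4, 0.2]])
print(response.json())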

What's next in machine learning and Python?

In this article, we have explored the world of machine learning with Python and discussed some of the best libraries available for developing machine learning models. Python's simplicity, flexibility, and extensive library ecosystem make it an ideal choice for both beginners and experienced developers venturing into the field of machine learning.

As you embark on your machine-learning journey with Python, we encourage you to explore these libraries further. Dive into their documentation, experiment with different algorithms and techniques, and leverage the vast online resources and communities available to you.

Remember, machine learning is a rapidly evolving field, and staying up to date with the latest advancements and techniques is crucial. If you're interested in continuing, why not try training your own language model to create a personalized ChatGPT using LangChain, OpenAI, Pinecone, and Apify?
