Jawad Sadiq

Posted on May 26, 2024

Managing Machine Learning Projects

#machinelearning #ai #projectmanagement #scrumforml

Intro

There is a lot of technical advice and training available online about the engineering of AI, ML, and data science projects, but very little practical advice about how to plan, manage, and execute them efficiently. This article suggests a set of tools and processes that ML teams can use.

The Problem

Gartner claims that 85% of AI projects end up failing, citing various reasons including, but not limited to:

Inability to ship an ML-enabled product to production
Shipping products that customers don’t use
Deploying defective products that customers don’t trust
Inability to evolve and improve models in production quickly enough

While almost all software engineering projects can be complex, AI and ML projects are particularly challenging because of their inherent uncertainty: they are fundamentally based on hypotheses that can fail.

In traditional software engineering, projects typically involve implementing well-defined requirements where inputs lead to predictable outputs. The success of such projects hinges primarily on the correctness of the code and adherence to specifications, both of which can be managed through "good" planning and execution.

In contrast, ML projects start with a hypothesis about patterns in the data. For example, an ML project might hypothesize that certain features can predict an outcome, such as customer churn or product recommendations. This hypothesis is tested through the development and training of a model. However, there is no guarantee that the chosen features, the model architecture, or the available data will validate the hypothesis. The model may not perform as expected, leading to outcomes that can be suboptimal or outright failures. Hence, despite the most thorough planning, unforeseen issues can still arise, echoing Murphy's Law: "Whatever can go wrong, will go wrong" (Edward A. Murphy Jr.).

The motivation of this article is not to avoid problems, but to find them as soon as possible: failing fast in order to find the right solution faster.

Problem Discovery

The first step for an ML project, like every other project, is problem discovery and definition.

Discovery is a set of activities that helps us better understand the problem, the opportunity, and potential solutions. It provides a structure for navigating uncertainty through rapid, time-boxed, iterative activities that involve various stakeholders and customers.

As eloquently articulated in Lean Enterprise (O’Reilly), the process of creating a shared vision always starts with clearly defining the problem, because having a clear problem statement helps the team focus on what is important and ignore distractions.

For ML projects, it is particularly important to understand the problem, not only to see what ML/AI-based solution is needed but, more importantly, to see why it is needed. Why can't the problem be solved with a traditional approach?

The problem with emerging technologies is that when they are at the "peak of inflated expectations" on the Gartner Hype Cycle, everyone expects them to do everything. We are now living in a time when AI, especially generative AI, is at that peak, as shown in the diagram above. Therefore, a detailed requirements analysis is needed to figure out the best solution for the given problem.

Some tools that can help Product Owners in Machine Learning teams at this phase of the project are as follows:

Double Diamond design process

Effective Machine Learning Teams (O’Reilly) suggests the Double Diamond design process as a tool that is useful not only in the context of ML but in every problem-solving scenario.

There are four phases to the process:

Discover: Understand the problem rather than merely assuming it. This involves speaking to and spending time with people (e.g., customers and users) who are affected by the problem.
Define: The insight gathered from the Discovery phase can help to define the challenge or problem in a different way.
Develop/Design: Generate different answers to the clearly defined problem, seeking inspiration from elsewhere and co-designing with a range of different people.
Deliver: Test out different solutions at a small scale, rejecting those that will not work and improving the ones that will.

The general principle of divergent and then convergent thinking in first the problem and then the solution space is applicable in almost any problem-solving scenario, and you might also find yourself using this model to run meetings, write code, or plan a dinner party!

Another tool suggested in the same book for ML product discovery is the Data Product Canvas, which provides a decent framework for connecting the dots between data collection, ML efforts, and value creation.

Both these tools, the double diamond design process and the data product canvas, produce artifacts that inform and guide the team during execution.

Fail Fast, Pivot Fast Execution

Once we understand the problem to be solved, we shift our focus to delivery, which has its own unique challenges. Businesses and customers often lack clear expectations or understanding of what an ML product can achieve because:

It’s difficult to predict how well an ML system will perform with the data available.
During product ideation:
- we might conceive ideas that are technically infeasible, or
- we may be unaware of which features are feasible until we conduct experiments and develop functional prototypes.

It is therefore essential to adopt a strategy of failing fast and pivoting fast:

Rapid Prototyping: Start with simple models and prototypes to quickly test hypotheses. Use these early results to guide further development rather than investing heavily in complex solutions upfront.
Frequent Testing and Validation: Regularly test models on validation data to catch issues early. Implement automated testing for model performance, ensuring that every change is evaluated rigorously.
Small Iterations: Break down the project into smaller, manageable tasks or sprints. Each iteration should deliver something that can be tested and evaluated, providing frequent feedback loops.
Flexible Roadmap: Maintain a flexible project roadmap that allows for changes based on new insights or data. Being rigid can hinder the ability to pivot when something isn’t working.
Early User Feedback: If applicable, get early feedback from end-users or stakeholders. Their insights can reveal practical issues and guide adjustments that improve the project’s relevance and effectiveness.
Automated Monitoring and Alerting: Implement monitoring tools to track model performance in production. Automated alerts for performance degradation can help in quickly identifying when a pivot or retraining is needed.
Post-Mortem Analysis: After each iteration or sprint, conduct a thorough post-mortem analysis to understand what went wrong and why. Use these insights to inform future pivots and improvements.
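
As a minimal sketch of the rapid prototyping and automated testing practices above, the snippet below compares a candidate model's predictions against a trivial majority-class baseline and only passes if the candidate beats it by a margin. The function names, the toy churn labels, and the 0.05 margin are illustrative assumptions, not part of any particular library or of the process described here.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def majority_baseline(y_train, n):
    """Predict the most common training label for each of n examples."""
    most_common_label = Counter(y_train).most_common(1)[0][0]
    return [most_common_label] * n


def passes_gate(y_true, candidate_pred, y_train, min_gain=0.05):
    """Return True only if the candidate beats the baseline by min_gain."""
    baseline_pred = majority_baseline(y_train, len(y_true))
    gain = accuracy(y_true, candidate_pred) - accuracy(y_true, baseline_pred)
    return gain >= min_gain


# Toy churn labels (1 = churned), purely for illustration.
y_train = [0, 0, 0, 1, 0, 1, 0, 0]
y_valid = [0, 1, 0, 0, 1, 0]
candidate = [0, 1, 0, 0, 0, 0]  # hypothetical model output

print(passes_gate(y_valid, candidate, y_train))
```

A check like this could run on every change, so that a hypothesis which cannot beat a trivial baseline is caught in the first sprint rather than after weeks of tuning.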

Sound familiar? What else do we, as engineers, know that inherently deals with complexity and requires small iterations, early user feedback, and a flexible, adjustable roadmap? Scrum. So can Scrum be used for AI/ML projects? Yes:

Scrum FTW

Here’s a step-by-step guide to implementing Scrum in ML projects:

Step 1: Form the Scrum Team

Product Owner (PO): Responsible for defining the features and requirements of the ML project, managing the product backlog, and ensuring that the team delivers value.
Scrum Master (SM): Facilitates Scrum processes, removes impediments, and ensures that the team adheres to Scrum principles.
Development Team: Comprises data scientists, ML engineers, software developers, and possibly domain experts. The team is cross-functional and collaborative.

Step 2: Define the Product Backlog

The product backlog for an ML project includes all the tasks and features needed to achieve the project goals. This might include:

Data collection and preprocessing tasks
Model selection and training experiments
Feature engineering tasks
Model evaluation and validation
Deployment and monitoring tasks
Documentation and reporting tasks

Although the Product Owner is responsible for creating the product backlog items using the discovery tools mentioned above, the Scrum team can help refine them by making sure the following questions are answered:

Have we documented the motivations behind our data gathering strategies?
Is our approach to data gathering aligned with the project goals and requirements?
Are there any gaps or issues in the current data collection process?
How will we systematically build the data pipeline infrastructure to support the project's later stages?
Will we have a pipeline that handles data ingestion, transformation, and access for the modeling team?
Can we think of bottlenecks in our data pipeline that would need to be addressed later?
How are we going to establish model repositories and versioning infrastructure for all project artifacts?
Are the repositories commissioned and ready for use?
Is the team using these repositories consistently to track versions of models, datasets, and code?
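
To make the versioning questions above more concrete, here is a minimal sketch of how a team might fingerprint the dataset, code, and hyperparameters behind one training run. The `fingerprint` and `make_manifest` helpers and the in-memory sample artifacts are hypothetical; a real project would more likely rely on a dedicated versioning tool than on hand-rolled hashes.

```python
import hashlib
import json


def fingerprint(data: bytes) -> str:
    """Return a short, stable SHA-256 digest for the given bytes."""
    return hashlib.sha256(data).hexdigest()[:12]


def make_manifest(dataset: bytes, code: bytes, params: dict) -> dict:
    """Bundle versions of the dataset, code, and hyperparameters for one run."""
    # Sorting keys makes the parameter hash independent of dict ordering.
    params_bytes = json.dumps(params, sort_keys=True).encode("utf-8")
    return {
        "dataset_version": fingerprint(dataset),
        "code_version": fingerprint(code),
        "params_version": fingerprint(params_bytes),
    }


# Hypothetical in-memory artifacts; a real project would read files instead.
manifest = make_manifest(
    dataset=b"user_id,item_id,rating\n1,10,5\n",
    code=b"def train(): ...",
    params={"learning_rate": 0.01, "epochs": 5},
)
print(json.dumps(manifest, indent=2))
```

Storing such a manifest next to every trained model is one way to answer, later on, which data and code produced a given result.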

The Scrum team can also review the following tools that can help with the operational side of ML engineering:

Step 3: Plan the Sprint

Sprints in ML projects can last between two and four weeks. The Scrum team decides what it wants to achieve in each sprint, whether that is only curating data and building the right data pipeline for the project or running a training job. During Sprint Planning, the team selects items from the product backlog to work on during the sprint. These items should be broken down into smaller, manageable tasks (the sprint backlog).

Step 4: Execute the Sprint

During the sprint, the team works on the selected tasks. Key Scrum ceremonies include:

Daily Stand-ups: Short daily meetings where team members discuss what they did yesterday, what they plan to do today, and any blockers they face. This helps maintain transparency and address issues promptly.
Sprint Reviews: At the end of each sprint, the team demonstrates the completed work to stakeholders. This could involve presenting a trained model, showcasing new features, or sharing performance metrics.
Sprint Retrospectives: After the sprint review, the team reflects on what went well, what didn’t, and how processes can be improved for the next sprint.

Step 5: Manage and Prioritize the Backlog

The Product Owner continuously refines the product backlog, prioritizing tasks based on feedback from sprint reviews, changes in project requirements, and new insights. This might involve:

Adding new data sources
Adjusting model requirements
Incorporating feedback from stakeholders or users

Step 6: Iterative Development and Validation

ML projects benefit from iterative cycles of development, testing, and validation. During each sprint, the team can focus on specific aspects:

Early Sprints: Data collection, cleaning, and exploratory data analysis (EDA).
Middle Sprints: Model prototyping, training, and initial validation. Experiment with different algorithms and hyperparameters.
Later Sprints: Model tuning, extensive validation, and deployment.

Step 7: Adopt a Fail Fast, Pivot Fast Approach

Incorporate the following practices to align with the fail fast, pivot fast methodology:

Rapid Prototyping: Start with simple models to quickly test hypotheses and gather preliminary results.
Continuous Feedback: Regularly evaluate model performance using validation data and user feedback.
Flexible Roadmaps: Be prepared to pivot based on new data, feedback, or changes in project direction. Update the backlog and sprint goals accordingly.

Example (generated with ChatGPT): Implementing Scrum for an ML-Based Recommendation System

Sprint 1: Data Collection and Exploration

-   Tasks: Collect user interaction data, clean the dataset, perform EDA.
-   Deliverable: Cleaned dataset, initial insights from EDA.

Sprint 2: Basic Model Development

-   Tasks: Implement a simple collaborative filtering model, evaluate its performance.
-   Deliverable: Baseline model, performance metrics.
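
As a rough illustration of what a Sprint 2 baseline could look like, the sketch below implements a very small item-based collaborative filter using cosine similarity in plain Python. The `ratings` data, user names, and function names are made up for this example, and a production system would typically use a dedicated library and far more data.

```python
from math import sqrt

# Hypothetical user -> {item: rating} interactions.
ratings = {
    "ana": {"book_a": 5, "book_b": 4},
    "ben": {"book_a": 4, "book_b": 5, "book_c": 4},
    "cho": {"book_b": 2, "book_c": 5},
}


def item_vectors(ratings):
    """Invert the data to item -> {user: rating}."""
    items = {}
    for user, user_ratings in ratings.items():
        for item, rating in user_ratings.items():
            items.setdefault(item, {})[user] = rating
    return items


def cosine(a, b):
    """Cosine similarity between two sparse rating vectors."""
    shared = set(a) & set(b)
    dot = sum(a[u] * b[u] for u in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(user, ratings, top_n=1):
    """Rank unseen items by similarity-weighted ratings of the user's items."""
    items = item_vectors(ratings)
    seen = ratings[user]
    scores = {}
    for candidate in items:
        if candidate in seen:
            continue
        scores[candidate] = sum(
            cosine(items[candidate], items[liked]) * rating
            for liked, rating in seen.items()
        )
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n]


print(recommend("ana", ratings))
```

The point of such a baseline is not quality but speed: it gives the team a first set of metrics to beat in the following sprint.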

Sprint 3: Model Improvement and Validation

-   Tasks: Experiment with different algorithms (e.g., content-based filtering), validate models with cross-validation.
-   Deliverable: Improved model, comparative performance metrics.
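
The cross-validation mentioned in Sprint 3 can be sketched as follows. This is a simplified, unshuffled k-fold split written from scratch for illustration; `k_fold_indices`, `cross_validate`, and the placeholder `dummy_score` are assumed names, and real data would usually be shuffled or stratified first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [
        n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


def cross_validate(score_fn, n_samples, k=5):
    """Average a user-supplied score over k folds."""
    scores = [
        score_fn(train, valid) for train, valid in k_fold_indices(n_samples, k)
    ]
    return sum(scores) / len(scores)


# Placeholder scorer: a real one would train on `train` and score on `valid`.
def dummy_score(train, valid):
    return len(valid) / (len(train) + len(valid))


print(cross_validate(dummy_score, n_samples=10, k=5))
```

Running every candidate algorithm through the same folds is what makes the "comparative performance metrics" deliverable a fair comparison.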

Sprint 4: Deployment and Monitoring

-   Tasks: Deploy the model, set up monitoring and feedback loops.
-   Deliverable: Deployed model, monitoring dashboard.
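
For the monitoring task in Sprint 4, one simple approach is to track accuracy over a rolling window of recent predictions and raise an alert when it falls below a threshold. The `PerformanceMonitor` class, the window size, and the threshold below are assumptions chosen for illustration, and the approach presumes that ground-truth labels eventually become available in production.

```python
from collections import deque


class PerformanceMonitor:
    """Track a rolling accuracy and flag drops below a threshold."""

    def __init__(self, window_size=100, min_accuracy=0.8):
        self.outcomes = deque(maxlen=window_size)
        self.min_accuracy = min_accuracy

    def record(self, prediction, actual):
        """Store whether the latest prediction was correct."""
        self.outcomes.append(prediction == actual)

    def rolling_accuracy(self):
        """Accuracy over the current window, or None if it is empty."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self):
        """True when the window is full and accuracy is below the threshold."""
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.rolling_accuracy() < self.min_accuracy


monitor = PerformanceMonitor(window_size=4, min_accuracy=0.75)
for prediction, actual in [(1, 1), (0, 0), (1, 0), (0, 1)]:
    monitor.record(prediction, actual)

if monitor.should_alert():
    print("Rolling accuracy dropped; consider retraining or pivoting.")
```

Waiting for a full window before alerting avoids noisy alarms from the first few predictions after deployment.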

Sprint 5: Refinement and Iteration

-   Tasks: Incorporate user feedback, fine-tune the model, address any performance issues.
-   Deliverable: Refined model, updated metrics based on real-world usage.

By adopting Scrum, ML projects can benefit from structured, iterative development processes that enable continuous improvement, flexibility, and the ability to rapidly respond to new insights and changes.

More Tools

Hypothesis Canvas

Another canvas that helps us systematically articulate and test our ideas in rapid cycles, and keep track of learnings over time, is the Hypothesis Canvas.

The C4 Software Architecture Model

The C4 model is a framework for visualizing the architecture of software systems at four levels of detail: Context, Container, Component, and Code.

For the engineers in the ML teams, applying the C4 model can help clarify the architecture and design of ML systems, ensuring that all stakeholders have a clear understanding of how the system is structured and how its components interact. For example, here's a concise summary of applying the C4 model to ML engineering:

Level 1: Context Diagram

Identify External Entities: Users (e.g., data scientists, end-users) and external systems (e.g., data sources, APIs).
Define ML System: Specify the system’s boundaries and main purpose (e.g., recommendation engine).

Level 2: Container Diagram

Identify Containers: ML service, data ingestion, model training, data storage, user interface.
Define Interactions: Describe data flows and API calls between containers.

Level 3: Component Diagram

Decompose Containers: Break down into components (e.g., data preprocessing, model training).
Define Interactions: Specify interactions within each container.

Level 4: Code Diagram (optional)

Detail Component Implementation: Show classes and methods for key components.

Where to go from here

In this article I have shared some tools and processes that may assist ML teams. I would further recommend two books that they can read:

Effective Machine Learning Teams (O’Reilly)
Managing Machine Learning Projects (Manning)

Hope this article is useful. Please feel free to share your thoughts in comments below.
