Jesse Williams for KitOps

Originally published at jozu.com

Tools to ease collaboration between data scientists and application developers

As a CTO or an Engineering Manager, you will often face problems with your data scientists and application developers not being on the same page. Their unique expertise and methodologies frequently make working together difficult, and traditional tools can only do so much to bridge this gap.

These two groups have different mindsets and workflows that can lead to notable slowdowns, especially when passing models between them. Data scientists prioritize experimentation, while developers focus on creating clean, maintainable code. This mismatch and the use of incompatible tools can lead to disjointed development processes and breakdowns in communication.

This article discusses the problems arising from the varying approaches and toolsets data scientists and developers use. It emphasizes the importance of improving their collaboration and introduces a tool to help them work together more efficiently.

To understand how to foster effective collaboration, let's first examine the key differences between the two groups and the requirements that determine which tools each uses and prioritizes.

Differences between Data Scientists and Application Developers

Your team may encounter problems when data scientists and app developers interface due to their:

  • Engineering Approaches
  • Workflows
  • Toolsets

| | Data scientists | Application developers |
| --- | --- | --- |
| Engineering approaches | A research-oriented approach that prioritizes results and innovation over code quality and best practices | An engineering approach that's keen on clean code, efficiency, maintainability, and stability |
| Workflows | Flexible, iterative, trial-and-error | Structured, linear, or Agile/DevOps |
| Toolsets | Interactive environments like Jupyter Notebooks | Integrated development environments (IDEs) such as Visual Studio and IntelliJ |

Engineering approaches
Data scientists apply a research approach to developing solutions primarily using statistical and machine learning techniques. Their expertise lies in analyzing and interpreting complex datasets, extracting valuable insights, and building predictive models. This causes them to lean towards experimentation and exploration, favoring innovation over strict adherence to code quality or software development best practices.

On the other hand, developers take a software engineering approach, focusing on designing, developing, and maintaining applications tailored to specific user needs. This causes them to prioritize writing clean, efficient, and maintainable code when building applications.

Based on these different approaches, you can see how your team often has opposing priorities. Data scientists prioritize results over clean code, detailed documentation, or rigorous testing, while developers are meticulous and organized.

Workflows
Data scientists embrace a flexible and iterative approach throughout their model development. They employ a trial-and-error process of combining data variations and machine learning algorithms to uncover insights from data and produce the most suitable model. As a result, they don't employ the standard scripting, testing, and debugging practices in their development.

Developers follow a more structured and linear workflow to develop applications. They design and develop software based on strict requirements, then test to ensure quality and standard. They also emphasize stability and functionality by adhering to more structured methodologies like Agile or DevOps.

Because of this, the handoff of models from the experimental stage to production often becomes a significant bottleneck, leading to miscommunication, delays, and frustration for everyone involved. This mismatch in workflow leaves you wondering how to bridge this gap and streamline your machine-learning pipeline.

Toolsets
Data scientists require toolsets that support active experimentation, rapid prototyping, and model development. This requirement causes them to work in an interactive environment like Jupyter Notebooks. However, the Notebooks' flexible development structure makes them impractical for sharing code or validating the assets associated with a model.

Developer tools are designed for software development and integration. Therefore, they rely on robust IDEs like Visual Studio or IntelliJ, which offer advanced coding, debugging, and project management features.

The disparities between each group's toolset requirements prevent their tools from integrating and interoperating seamlessly. Because each group's traditional tools are designed around its own workflow and engineering approach, neither group can use the other's tools efficiently. Let's highlight the friction data scientists and developers encounter when collaborating with these traditional tools.

Version Control Systems (VCS):
Your developers depend on version control systems to enable them to track, manage, and share code changes they make in their IDEs. VCS, like Git, allows multiple developers to work concurrently while tracking different versions of their modifications. It makes it easier to resolve code discrepancies and enables rollbacks to previous versions.

However, it is impractical for data scientists to use Git to manage and share different versions of their models and model artifacts with developers. Models are serialized as large binary files, and Git can't show the differences between versions of binary files, making changes difficult to track. Git also struggles with the sheer size of the data and model files your data scientists require.
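To see the problem concretely, here's a quick sketch (file names are hypothetical) of what happens when a retrained model is committed to Git:

```bash
# Hypothetical repo tracking a serialized model next to the source code.
git init churn-demo && cd churn-demo
cp ~/models/churn_model.pkl .            # a large binary artifact
git add churn_model.pkl && git commit -m "add v1 model"

# Retrain, overwrite the file, and ask Git what changed:
cp ~/models/churn_model_v2.pkl churn_model.pkl
git diff churn_model.pkl
# => "Binary files a/churn_model.pkl and b/churn_model.pkl differ"
# Git can't show what changed, and every retrain stores a full new
# copy of the blob, bloating the repository by the model's size.
```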

Specialized ML platforms:
Specialized ML platforms offer comprehensive tools and services tailored to enable data scientists to develop models. These platforms often provide them with a pre-configured development environment (cloud-based notebooks), eliminating the hassle of manual setup and configuration. They enable data scientists to share and manage their experiments, models, and model artifact versions from one place.

Developers can deploy the data scientists' models from the platform, but the platform doesn't let them manage model versions and model assets in a way that suits their workflows.

Containers:
Containers like Docker shine at creating consistent environments for code to run in, making deployment more manageable and reliable. This makes Docker perfect for developers to package and deploy their applications quickly. Containers are also suitable for packaging and deploying models, but not the model assets that help debug or modify those models. As a result, developers can struggle to trace model artifacts after deployment.
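A minimal sketch of why this happens (image name and files are illustrative): a typical model-serving image ships only the runtime artifacts, leaving the model's lineage behind:

```bash
# Illustrative Dockerfile for a model-serving container.
cat > Dockerfile <<'EOF'
FROM python:3.11-slim
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY serve.py model.onnx ./
# The serialized model ships, but the training data, notebooks, and
# tuning configs that explain how it was built are left behind.
CMD ["python", "serve.py"]
EOF

docker build -t registry.example.com/team/churn-api:1.0 .
docker push registry.example.com/team/churn-api:1.0
```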

The more time your teams spend wrestling with this incompatibility, the less time they spend building solutions for your customers. That's why seamless collaboration between data scientists and developers is crucial.

Streamline collaboration with a unified tool

Siloed information makes it difficult for data scientists and app developers to have a holistic view of the project's progress and identify potential issues early on. For example, model handoff using separate tools often requires manual transfer and translation between different systems. This leaves room for model tampering and the loss of valuable insights about model lineage.

A unified tool mitigates the pitfalls of this error-prone process and ensures consistency throughout the model development and deployment phases. Unified MLOps tools, such as KitOps, Kubeflow, and MetaFlow, provide different ways for your team to work on the same platform without manual intervention.

KitOps is an open source MLOps tool that provides data scientists and developers with a shared package through ModelKits, where they can access the same information, track development progress in real-time, and collaborate more effectively. They can use this package during model creation, development, and deployment. The package makes it easier for them to integrate their workflows while using their existing traditional tools.

These ModelKits are compatible with the standard development tools they already use, ensuring they can work together, from experimentation to application deployment. Data scientists and developers can efficiently package and version ML models, datasets, code, model assets, and requirements.

ModelKits also eliminate common obstacles that slow development or introduce risk, such as version control issues, model reliability, workflow disparities, etc.
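As a rough sketch of what that looks like in practice (project layout, names, and registry address are illustrative; check the KitOps docs for the exact Kitfile schema your version expects), a data scientist describes the project in a Kitfile and packages it with the Kit CLI:

```bash
# Kitfile: a YAML manifest describing everything that belongs together.
cat > Kitfile <<'EOF'
manifestVersion: "1.0"
package:
  name: churn-model
  version: 1.0.0
  authors: ["data-science-team"]
model:
  name: churn-classifier
  path: ./models/churn.onnx
  description: Gradient-boosted churn classifier
datasets:
  - name: training-data
    path: ./data/train.csv
code:
  - path: ./notebooks
    description: Exploration and training notebooks
EOF

# Package the project as a ModelKit and publish it for the app team.
kit pack . -t jozu.ml/acme/churn-model:v1.0.0
kit push jozu.ml/acme/churn-model:v1.0.0
```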

[Image: ModelKit overview | Source: ModelKit]

What to expect from an ideal tool

Some of the key features that make an excellent collaborative tool for any ML organization are:

Reliable tagging system
A reliable tool should ensure that your team's models and model assets remain authentic, untampered, and traceable, mitigating the risk of unauthorized modifications throughout the software development lifecycle. This allows developers to confidently deploy models developed by data scientists, knowing they are working with the exact, verified artifacts.

ModelKits are OCI artifacts, and KitOps' tagging system establishes lineage across your ModelKit versions, creating visibility into the origin and evolution of your ModelKit artifacts (i.e., models and model assets). The system uses immutable, content-addressable storage: a new ModelKit version is stored only when its content changes, so packaging ModelKits with the same contents under different tags results in just one stored ModelKit.

This tagging system eliminates the chance of your data scientists producing duplicate ModelKits with the same model and model assets. It also solves the model provenance challenge for developers when they retrieve and deploy specific models and model assets.
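In practice (repository names are illustrative, and exact command output varies by version), that immutability is easy to observe with the Kit CLI:

```bash
# Tag the same content twice; both tags resolve to one stored ModelKit.
kit pack . -t jozu.ml/acme/churn-model:v1.0.0
kit tag jozu.ml/acme/churn-model:v1.0.0 jozu.ml/acme/churn-model:latest

# Both tags point at the same content-addressed digest...
kit list
# ...and a developer can inspect exactly what a tag refers to
# before deploying it.
kit inspect jozu.ml/acme/churn-model:v1.0.0
```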

Synchronized versioning
KitOps enables your data scientists to version their models and model assets (i.e., data, code, configurations, etc.) together when they package their ModelKit. This ensures that a ModelKit's models and model assets are always consistent, so your developers can confidently retrieve and deploy a specific model and its assets without worrying about a mix-up.

Unified development package
ModelKits can store data, code, and model as a single unit—a unified development package. With a unified package like ModelKit, your data scientists and developers don't need to operate and communicate in silos. They can develop, integrate, test, and deploy machine learning models from the ModelKit.

Your data scientists can share this unified package containing their models and associated artifacts with developers, who can then seamlessly integrate them into applications and be sure of the ML model's functionalities, requirements, or dependencies.
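For example (references and flags are illustrative; see `kit unpack --help` for the options your KitOps version supports), a developer can pull the published ModelKit and extract only the pieces the application needs:

```bash
# Fetch the exact ModelKit version the data scientists published.
kit pull jozu.ml/acme/churn-model:v1.0.0

# Unpack just the model into the application tree; similar filters
# pull out the code or datasets when they're needed for debugging.
kit unpack jozu.ml/acme/churn-model:v1.0.0 --model -d ./app/models
```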

Standard package format
An ideal tool should adhere to OCI standard formats to ensure compatibility with existing DevOps/MLOps pipelines. KitOps packages ModelKits following the OCI standard, making them compatible with any OCI-compliant container registry. Your data scientists can therefore package their machine learning models, data, dependencies, and model assets and store them in the team's existing container registries. Your developers can then easily pull these standardized packages and confidently integrate them into their applications.

The OCI standard packaging format enables your data scientists and developers to continue using existing infrastructure. It promotes a unified workflow where both teams can work independently yet seamlessly integrate their contributions without compatibility or reproducibility issues.
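Because a ModelKit is a standard OCI artifact, the sketch below (registry and names are illustrative) pushes one to the same registry that already hosts the team's container images:

```bash
# Reuse the registry you already run -- GitHub's here, but Docker Hub,
# ACR, ECR, or a private registry work the same way.
kit login ghcr.io
kit pack . -t ghcr.io/acme/churn-model:v1.0.0
kit push ghcr.io/acme/churn-model:v1.0.0
```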

CLI interface
CLI tools and workflows are must-haves for the rapid development and fine-grained control that both data science and app development teams require. KitOps provides a command-line interface, the Kit CLI, for creating, managing, and deploying ModelKits. This lowers the technical barrier to using KitOps, making it easy for your data scientists and developers to adopt.

As a CLI-based tool, your data scientists and developers can leverage its CLI commands, remote access, and troubleshooting functionalities to automate tasks, streamline workflows, and enable efficient interaction with data, models, and code.
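As one hypothetical example, a CI deploy step can pin a ModelKit version and script the whole handoff (reference and paths are illustrative):

```bash
#!/usr/bin/env bash
# Sketch of a CI step: fetch a pinned ModelKit and stage the model
# for the application build.
set -euo pipefail

MODELKIT="jozu.ml/acme/churn-model:v1.0.0"   # pinned, immutable version

kit pull "$MODELKIT"
kit unpack "$MODELKIT" --model -d ./build/model
# The rest of the pipeline (tests, image build, deploy) now runs
# against a known, reproducible model artifact.
```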

How do various collaborative AI/ML tools compare?

Compared to other collaborative tools like Git and Docker, KitOps offers an effective way to package, version, and share AI project assets beyond your data science team. This makes it easier to manage and distribute AI models and related resources. Here are some ways KitOps compares to these other tools:

Git vs. KitOps

Git excels at handling software projects with numerous small files but struggles with the large binary objects crucial for AI/ML, like serialized models and datasets. Storing model development code in ModelKits keeps it synchronized with the Jupyter notebooks, serialized models, and datasets it belongs with throughout development.

| Criteria | KitOps | Git |
| --- | --- | --- |
| Primary purpose | Versioned packaging for AI/ML projects | Version control for source code |
| Content | Models, datasets, code, metadata, artifacts, and configurations | Source code, text files, and configurations |
| Integration | Works with existing registries, supports OCI | Integrates with CI/CD tools, GitHub, GitLab |
| Target users | Data scientists, DevOps, software development teams | Anyone who works with code |
| Versioning | Built-in versioning for AI assets | Version control for files |
| Security | Immutable packages with provenance tracking | Branch protection, commit signing |
| Ease of use | Simple commands for packing and unpacking | Git commands (add, commit, push, pull) |
| Collaboration | Packages all AI assets for team use | Branching, merging, and pull requests |
| Compatibility | Compatible with various MLOps and DevOps tools | Works with various coding and CI/CD environments |

Docker vs. KitOps

Docker provides a consistent package for running and deploying models. KitOps excels at creating a unified package for both model development and deployment, and it ensures smooth integration into existing deployment workflows.

| Criteria | KitOps | Docker |
| --- | --- | --- |
| Primary purpose | Versioned packaging for AI/ML projects | Containerization and deployment of applications |
| Content | Models, datasets, code, metadata | Application binaries, libraries, configurations |
| Standards | Uses OCI, JSON, YAML, TAR | Uses OCI, Dockerfile |
| Target users | Data scientists, DevOps, application teams | Developers, DevOps |
| Versioning | Built-in versioning for AI assets | Supports versioning through image tags |
| Security | Immutable packages with provenance tracking | Image signing and vulnerability scanning |
| Ease of use | Simple commands for packing and unpacking | Familiar Docker CLI commands |
| Compatibility | Compatible with various MLOps and DevOps tools | Broad support for various platforms and tools |

Jupyter Notebook vs. KitOps

Jupyter notebooks are great for developing models. However, they struggle with state management and versioning. To address this, you can add notebooks to a ModelKit for effective versioning and sharing. This lets your data scientists continue using notebooks while allowing software development teams to access and use the corresponding models efficiently.

| Criteria | KitOps | Jupyter Notebook |
| --- | --- | --- |
| Primary purpose | Versioned packaging for AI/ML projects | Interactive development environment for data scientists |
| Versioning | Built-in versioning for AI assets | Limited version control (Git can be used) |
| Security | Immutable packages with provenance tracking | Limited built-in security features |
| Standards | Uses OCI, JSON, YAML, TAR | Uses JSON for notebook storage |
| Collaboration | Packages all AI assets for team use | Collaborative features through JupyterHub |

Effective collaboration between data scientists and software developers is critical to successful ML software development, but differences in tools, mindsets, and workflows often stand in the way. While tools like Git, Docker, and Jupyter Notebooks offer some collaboration features, they have limitations in managing AI/ML assets. KitOps bridges this gap by providing a unified platform for packaging, versioning, and deploying AI models, making collaboration smoother.

If you have any questions about making collaboration smoother for your team with KitOps, start a conversation with us on Discord. Integrating KitOps will allow your team to maintain productivity, streamline deployments, and ensure consistent version control across the project life cycle.

Top comments (1)

duncan_true (Dun)

This article offers great insights! How would KitOps handle a scenario where multiple data scientists are simultaneously working on different versions of the same model?