freiberg-roman

Posted on Jan 14

Don't Know How to Structure Your Data Science Project? Try the Extended Clean Architecture Method

#architecture #datascience #ai #machinelearning

This project includes a PyTorch template. You just need to understand the basic concept behind it to start using it.

Data Researchers aim to solve problems and contribute meaningfully to science.

Nevertheless, most research projects - sometimes even those submitted to top conferences - lack any real structure. After the hacked-together prototype is done, the project often gets abandoned. Two months later, the original author might not even remember how to navigate their own codebase.

This is bad news when the project turns out to be useful and needs continuation.

The architectures commonly used in commerce do not meet the needs of modern research scientists.

Projects need to be submitted in an extremely short amount of time. Opting for microservices or client-server architectures does not make sense, as radical code changes might happen just three days before submission. Long-term planning becomes impossible. What is needed instead are guiding principles that yield the most maintainable code for minimal overhead.

Fortunately, a template structure that suits most of the everyday needs of a data scientist is simple to build.

One just needs a few principles and a bit of discipline. Once learned, placing a class in the right file becomes second nature. Furthermore, these principles are meant to save you time in the long run and shouldn't be seen as additional overhead in pursuit of perfection.

This writing is highly opinionated. Use it as an idea marketplace on how it could be done, not as a definitive must.

Introduction to Clean Architecture

The concept is straightforward: in the lifespan of a project, let modules less likely to change not depend on those more prone to alteration.

Bob Martin's original Clean Architecture model delineates an application into four distinct layers. Dependencies are allowed in only one direction, towards the core. Consider the coupling of core business logic to a database as a prime example. Without a properly built adapter, you essentially vendor-lock your application by design. While the specific layers Martin discusses may not hold the same importance for data scientists, the overarching principles certainly do.

For a more in-depth understanding, I recommend checking out Bob Martin's original blog post.

For our purposes, we will adapt this model to include four layers, plus an additional core layer (which we will discuss later). Layer four forms the outer shell, encompassing data sources and data source adapters. Layer three involves adapter interfaces and gateways, while layers two and one focus on our core method. Layer zero concerns the language itself.
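The rule that dependencies may only point towards the core can be checked mechanically. The following is a minimal sketch, not part of the template: the package names in LAYER_OF and the functions imported_packages and violations are hypothetical, chosen only to mirror the layer numbering described above.

```python
import ast
from typing import List, Set

# Hypothetical mapping from top-level package name to layer number.
# Layer 0 is the language core (including utils); layer 4 is the outer shell.
LAYER_OF = {
    "utils": 0,
    "networks": 1,
    "method": 2,
    "interfaces": 3,
    "datasources": 4,
}


def imported_packages(source: str) -> Set[str]:
    """Return the top-level package names imported by a piece of source code."""
    names: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])
    return names


def violations(package: str, source: str) -> List[str]:
    """List imports that point outward, i.e. to a higher-numbered layer."""
    own_layer = LAYER_OF[package]
    return sorted(
        name
        for name in imported_packages(source)
        if name in LAYER_OF and LAYER_OF[name] > own_layer
    )


# The method (layer two) importing a data source (layer four) is flagged;
# torch is not in the map, so it is treated as part of the language.
print(violations("method", "from datasources.s3 import S3Dataset\nimport torch"))
# A data source importing from the method points inward and is fine.
print(violations("datasources", "from method.data import CircleCoordinates"))
```

The first call reports the outward import, the second reports nothing. A check like this could run in a test suite, but whether it is worth the effort depends on the size of the project.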

Below, you can see the original structure proposed by Bob Martin, which resembles what we need closely.

How do we achieve independence between logic and data source?

The answer lies in dependency inversion. This approach might take some getting used to, but it is quite straightforward once you get the hang of it. Essentially, you use an interface to break the dependence. Let us examine a minimal example of database dependency:

class BillingInformation:
    def __init__(self, db: VendorxyzDB):
        self.info = db.query("some SQL statement")
        ...

If BillingInformation is a part of your core logic, tying it directly to a specific database sets you up for trouble should that database become obsolete.

Now, let us apply dependency inversion to the same example:

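from abc import ABC, abstractmethod

# Interface owned by the core logic
class BillingInfoQuery(ABC):
    @abstractmethod
    def query_billing_info(self):
        ...

# Vendor-specific adapter; it is the only class that knows about the database
class VendorxyzDBAdapter(BillingInfoQuery, ...):
    def query_billing_info(self):
        return self.db.query("some SQL statement")

# Core logic depends only on the interface
class BillingInformation:
    def __init__(self, db: BillingInfoQuery):
        self.info = db.query_billing_info()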

Yes, this results in more code. However, consider a scenario where the data source changes. The only component that needs reimplementing is the Adapter, leaving your core logic untouched. In this setup, BillingInformation and BillingInfoQuery would be in either layer one or two, while VendorxyzDBAdapter belongs in layer four, where our database resides.
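To make the payoff concrete, here is a small self-contained sketch in which two interchangeable adapters feed the same core class. InMemoryAdapter, CsvTextAdapter, the total method, and the "amount" field are hypothetical and only serve the illustration; they are not taken from the template.

```python
import csv
import io
from abc import ABC, abstractmethod
from typing import Dict, List


class BillingInfoQuery(ABC):
    """Interface owned by the core logic (layer one or two)."""

    @abstractmethod
    def query_billing_info(self) -> List[Dict[str, float]]:
        ...


class BillingInformation:
    """Core logic: depends only on the interface, never on a vendor."""

    def __init__(self, db: BillingInfoQuery):
        self.info = db.query_billing_info()

    def total(self) -> float:
        return sum(row["amount"] for row in self.info)


class InMemoryAdapter(BillingInfoQuery):
    """Hypothetical layer-four adapter backed by a plain list."""

    def __init__(self, rows: List[Dict[str, float]]):
        self.rows = rows

    def query_billing_info(self) -> List[Dict[str, float]]:
        return list(self.rows)


class CsvTextAdapter(BillingInfoQuery):
    """Hypothetical second adapter that reads an 'amount' column from CSV text."""

    def __init__(self, text: str):
        self.text = text

    def query_billing_info(self) -> List[Dict[str, float]]:
        reader = csv.DictReader(io.StringIO(self.text))
        return [{"amount": float(row["amount"])} for row in reader]


# Swapping the data source changes one constructor argument and nothing else.
from_memory = BillingInformation(InMemoryAdapter([{"amount": 10.0}, {"amount": 5.5}]))
from_csv = BillingInformation(CsvTextAdapter("amount\n10.0\n5.5\n"))
print(from_memory.total(), from_csv.total())  # 15.5 15.5
```

The in-memory variant is also handy in unit tests, since the core logic can be exercised without any real database.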

Next, we will tackle a commonly overlooked issue in ML frameworks.

ML Frameworks and Dependencies

ML frameworks are external dependencies, so they belong in layer four, don't they?

This is obviously impractical. If you had to call an adapter for every torch.Tensor call, you would end up in an unmaintainable mess. Personally, I would argue that frameworks such as PyTorch, TensorFlow, NumPy, etc. are part of the language definition. That is, their operations can be seen as a core part of the programming language, just like those of Python or C++, and can therefore be used throughout the project without dependency inversion.

To be fair, a conventional software architect would not bother mentioning a specific language such as Java in a software architecture.

They reason beyond such concepts, but for a researcher, the syntax of most programming languages is simply not expressive enough. Researchers need operations such as matrix multiplication or tensor products. Think of R or MATLAB, where these concepts are built in by default. This also means you could write your own language extensions for common operations as utility functions and use them throughout your project without violating any principles.

It is obvious, but choosing a language extension means locking yourself into a framework.

But be careful with ML libraries.

Most frameworks like PyTorch also provide higher-level concepts such as DataLoaders and complete built-in modules. Using them without architectural principles could result in the same mess we are trying to avoid in the first place. Think it through once and the work will pay off multiple times later on.

We will walk through common components you might encounter in your project.

Common Modules in an ML-Research Project

You can find a template for PyTorch here.
It showcases a small example that adheres to the principles we have discussed. I will be using the code from this template as a running example to demonstrate these principles in action.

Let us assume you are set to implement an experimental method and train a network eventually. Typically, you would need a data source (like AWS S3, an external server, a folder, or a Replay Buffer), several network modules, a training loop, helper functions, and a way to manage configuration. As we have discussed, our helper functions, such as data type converters between libraries and common operations in our utils, are integral to our language, placing them in layer zero.

Data Sources, a typical layer four component, require an Adapter for integration. PyTorch offers the Dataset class precisely for this purpose. Here’s an example:

from dataclasses import dataclass
from typing import List

import torch
from torch.utils.data import Dataset

# CircleCoordinates comes from the method's data interface (import omitted)

@dataclass
class SourceCoordinates(CircleCoordinates):
    coords: torch.Tensor

class RandomCoordinatesDS(Dataset):
    ...
    def __getitem__(self, _) -> SourceCoordinates:
        # Retrieves an item from the dataset
        # ...
        return SourceCoordinates(coords)

    @classmethod
    def collate_fn(cls, batch: List[SourceCoordinates]):
        # Function for batching items, passed to the DataLoader
        stacked_coords = torch.stack([s.coords for s in batch])
        return SourceCoordinates(stacked_coords)

    ...

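In this example, CircleCoordinates is a dataclass that comes from our method's data interface. The custom collate_fn is needed because torch's default batching does not know how to combine custom element types; it is passed to the DataLoader. For the full picture, this is the outline of the method: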

@dataclass
class CircleCoordinates:
    coords: torch.Tensor

class CircleMethod(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        # Network definitions go here
        ...

    def forward(self, circle_data: CircleCoordinates) -> ...:
        # Processing
        ...
        return out

CircleMethod encapsulates the project's method, including high-level interfaces for our pipeline, like computing bounding boxes for images, generating images, classifications, etc. The dependency inversion principle allows us to define our data layout without being constrained by the source. The dataset interface empowers us to load from any source and transform it into the format our network requires.
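As an illustration of loading from a different source, the sketch below wraps an in-memory list of points in a second dataset adapter. PointListDS is a hypothetical name, and for brevity the adapter returns CircleCoordinates directly instead of a source-specific subclass; the method's data layout stays untouched.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

import torch
from torch.utils.data import DataLoader, Dataset


@dataclass
class CircleCoordinates:
    """Data layout defined by the method (layer two)."""
    coords: torch.Tensor


class PointListDS(Dataset):
    """Hypothetical layer-four adapter around a list of (x, y) pairs."""

    def __init__(self, points: Sequence[Tuple[float, float]]):
        self.points = list(points)

    def __len__(self) -> int:
        return len(self.points)

    def __getitem__(self, idx: int) -> CircleCoordinates:
        return CircleCoordinates(torch.tensor(self.points[idx], dtype=torch.float32))

    @staticmethod
    def collate_fn(batch: List[CircleCoordinates]) -> CircleCoordinates:
        # Stack single items of shape (2,) into one batch of shape (B, 2).
        return CircleCoordinates(torch.stack([item.coords for item in batch]))


dataset = PointListDS([(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)])
loader = DataLoader(dataset, batch_size=2, collate_fn=PointListDS.collate_fn)

# Three points with batch_size=2 give one full batch and one remainder batch.
batch_shapes = [tuple(batch.coords.shape) for batch in loader]
print(batch_shapes)  # [(2, 2), (1, 2)]
```

Anything that can be turned into this dataclass, whether a folder, a bucket, or a replay buffer, could be wrapped the same way.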

In general, your method should reside in layer two, and the networks and procedures used within this method in layer one.

Consequently, our training loop, which also serves as our command-line interface in layer four, now resembles a basic PyTorch tutorial script.

dataset = RandomCoordinatesDS(...)
dataloader = DataLoader(dataset, ...)
method = CircleMethod(...)
optimizer = AdamW(...)

for epoch in range(cfg.train.num_epochs):
    for data in dataloader:

        target = method.get_target(data)
        out = method(data)
        loss = method.compute_loss(out, target)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Logging and model saving operations go here
        ...

This example is somewhat simplified, but it does not take much to extend this principle to any method. A more flexible toy example can be found in the PyTorch-Starter project.
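For readers who want something that runs end to end, here is a toy version of the loop. The calls get_target and compute_loss appear in the loop above, but their bodies are not shown in this article, so the implementations below are assumptions, as are the task (predicting a point's distance from the origin), the network size, and the optimizer settings.

```python
from dataclasses import dataclass

import torch
from torch import nn
from torch.optim import AdamW


@dataclass
class CircleCoordinates:
    coords: torch.Tensor  # shape (B, 2)


class CircleMethod(nn.Module):
    """Toy method (assumed task): predict each point's distance from the origin."""

    def __init__(self, hidden_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1)
        )

    def forward(self, data: CircleCoordinates) -> torch.Tensor:
        return self.net(data.coords).squeeze(-1)

    def get_target(self, data: CircleCoordinates) -> torch.Tensor:
        # Ground truth is computable from the input in this toy setting.
        return data.coords.norm(dim=-1)

    def compute_loss(self, out: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        return nn.functional.mse_loss(out, target)


torch.manual_seed(0)
method = CircleMethod()
optimizer = AdamW(method.parameters(), lr=1e-2)
data = CircleCoordinates(torch.randn(64, 2))  # one fixed batch for brevity

losses = []
for step in range(200):
    target = method.get_target(data)
    out = method(data)
    loss = method.compute_loss(out, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

The loop itself knows nothing about circles or networks. Everything specific to the method lives behind the three calls on method.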

Configurations are a necessary evil. From a software architecture standpoint, they are part of the UI and reside in layer four. However, they often define hyperparameters crucial to the core logic, which makes the core depend on the outer layer. Ideally, we would avoid this dependency, and thankfully, frameworks like Hydra help to decouple it with their hierarchical configuration structure.

It mostly works seamlessly. However, if it does not, the need for quick prototyping might require some flexibility. It remains a topic where some artistic license could be warranted.
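One way to keep the direction of dependency intact is to let the method declare the hyperparameters it needs and have the outer layer fill them in. The sketch below uses only the standard library; CircleMethodConfig, method_config_from, and all field names are hypothetical. Hydra's structured configs build on dataclasses in a similar spirit, but this example does not use Hydra's API.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict


@dataclass(frozen=True)
class CircleMethodConfig:
    """Hyperparameters declared by the method itself (layer two)."""
    hidden_dim: int = 64
    num_layers: int = 2
    learning_rate: float = 1e-3


def method_config_from(raw: Dict[str, Any]) -> CircleMethodConfig:
    """Layer-four helper: turn a raw config mapping into the method's config."""
    known = {f.name for f in fields(CircleMethodConfig)}
    unknown = set(raw) - known
    if unknown:
        # Fail early on typos instead of silently ignoring an option.
        raise KeyError(f"Unknown method options: {sorted(unknown)}")
    return CircleMethodConfig(**raw)


# A raw, hierarchical config as it might arrive from a file or the command line.
raw_cfg = {"method": {"hidden_dim": 128}, "train": {"num_epochs": 10}}
method_cfg = method_config_from(raw_cfg["method"])
print(method_cfg.hidden_dim, method_cfg.num_layers)  # 128 2
```

The method only ever sees CircleMethodConfig, so the configuration framework can be replaced without touching the core logic.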

If you think your project cannot be structured using these principles, I would love to hear from you. I am always on the lookout for counterexamples and for ways one could work with them.

Closing Thoughts

Sometimes, implementing radical changes quickly is necessary, and strictly following coding guidelines might only add overhead.

In such cases, it is advisable to focus on the implementation first and tidy up later. Everything presented in this discussion is intended as a toolkit to enhance efficiency in the long term. However, these tools should not overshadow the primary goal if the success of the project is at risk.

I invite you to try these methods and judge their effectiveness based on your own experiences. Adaptation and modification are key. Reflect on how these principles can be integrated into your current or upcoming projects, and make decisions that best suit your needs. Ultimately, the value of these guidelines is realized through their practical application and adaptability to your unique scenarios.

This article references specific tools like PyTorch and Hydra primarily for illustrative purposes.

It is crucial to understand that the core principles of structured project management and clean architecture are not confined to these tools alone. They are broadly applicable across various technologies and frameworks in research and development. The fundamental goal is to promote a mindset of structured yet adaptable project development, applicable irrespective of the specific technologies employed.
