<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: David Herron</title>
    <description>The latest articles on DEV Community by David Herron (@robogeek).</description>
    <link>https://dev.to/robogeek</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F162191%2F2885729a-5c2a-4dde-91ae-bfb506c3c50b.jpeg</url>
      <title>DEV Community: David Herron</title>
      <link>https://dev.to/robogeek</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/robogeek"/>
    <language>en</language>
    <item>
      <title>Reflinks vs symlinks vs hard links, and how they can help machine learning projects</title>
      <dc:creator>David Herron</dc:creator>
      <pubDate>Wed, 14 Aug 2019 20:52:17 +0000</pubDate>
      <link>https://dev.to/robogeek/reflinks-vs-symlinks-vs-hard-links-and-how-they-can-help-machine-learning-projects-1cj4</link>
      <guid>https://dev.to/robogeek/reflinks-vs-symlinks-vs-hard-links-and-how-they-can-help-machine-learning-projects-1cj4</guid>
      <description>&lt;p&gt;Hard links and symbolic links have been available since time immemorial, and we use them all the time without even thinking about it.  When setting up new experiments in machine learning projects, they can help us rearrange data files quickly and efficiently.  However, with traditional links, we run the risk of polluting the data files with erroneous edits.  In this blog post we’ll go over the details of using links, some cool new stuff in modern file systems (reflinks), and an example of how DVC (Data Version Control, &lt;a href="https://dvc.org/" rel="noopener noreferrer"&gt;https://dvc.org/&lt;/a&gt;) leverages this.&lt;/p&gt;

&lt;p&gt;As I am studying machine learning, I’m wishing for a tool that would let us inspect ML projects the way we do regular software engineering projects.  That is, to retrieve the state of the project at any given time, create branches or tags (with Git) based on an earlier state of a project, handle collaboration with colleagues, and so on.  What makes ML projects different is the tremendous amount of data, thousands of images, audio, or video files, the trained models, and how difficult it is to manage those files with regular tools like Git.  In my earlier articles I went over why &lt;a href="https://towardsdatascience.com/why-git-and-git-lfs-is-not-enough-to-solve-the-machine-learning-reproducibility-crisis-f733b49e96e8" rel="noopener noreferrer"&gt;Git by itself is insufficient&lt;/a&gt;, and why Git-LFS is not a solution for machine learning projects, as well as some &lt;a href="https://dev.to/robogeek/principled-machine-learning-4eho"&gt;principles that seem to be useful for tools to manage ML projects&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;DVC has proved to be very good at managing ML project datasets and workflow.  It works hand-in-hand with Git, and can show you the state of the datasets corresponding to any Git commit.  Simply by checking out a commit, DVC can rearrange the data files to exactly match what was present at the time of that commit.   &lt;/p&gt;

&lt;p&gt;&lt;em&gt;The speed is rather magical, considering that potentially many gigabytes of data are being rearranged nearly instantaneously.  So I was wondering: How does DVC pull off this trick?&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  The trick to rearranging gigabytes of training data
&lt;/h1&gt;

&lt;p&gt;It turns out that DVC’s secret to rearranging data and model files as quickly as Git is to link files rather than copy them.  Git, of course, copies files into place when it checks out a commit, but Git typically deals with relatively small text files, as opposed to the large binary blobs used in ML projects.  Linking a file, as DVC does, is incredibly fast, making it possible to rearrange any number of files in the blink of an eye, while avoiding copying and thus saving disk space.&lt;/p&gt;

&lt;p&gt;Using file linking techniques is nothing new to the field, actually.  Some data science teams use symlinks to save space and avoid copying large datasets.  But symlinks are not the only sort of link which can be used.  We will start with a strategy of copying files into place, move on to hard links and symbolic links, and end with a new type of link, the reflink, which implements &lt;a href="https://en.wikipedia.org/wiki/Copy-on-write" rel="noopener noreferrer"&gt;Copy On Write capabilities&lt;/a&gt; in the file system.  We will use DVC as an example of how tools can use different linking strategies.&lt;/p&gt;

&lt;h1&gt;
  
  
  Test setup
&lt;/h1&gt;

&lt;p&gt;Since we’ll be testing different linking strategies, we need a sample workspace.  The workspace was set up on my laptop, a MacBook Pro whose main drive is formatted with the &lt;a href="https://en.wikipedia.org/wiki/Apple_File_System" rel="noopener noreferrer"&gt;APFS&lt;/a&gt; file system.  Further tests were done on Linux on a drive formatted with &lt;a href="https://en.wikipedia.org/wiki/XFS" rel="noopener noreferrer"&gt;XFS&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The data used is two “stub-articles” dumps of the Wikipedia website, retrieved on two different days.  Each is about 38 GB of XML, giving us enough data to be similar to an ML project.  We then set up a Git/DVC workspace where one can switch between these two files by checking out different Git commits.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ls -hl wikidatawiki-20190401-stub-articles.xml
-rw-r--r--  1 david  staff    35G Jul 20 21:35 wikidatawiki-20190401-stub-articles.xml
$ time cp wikidatawiki-20190401-stub-articles.xml wikidatawiki-stub-articles.xml

real    14m16.918s
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As a baseline measure we’ll note that copying these files into the workspace took about 15 minutes apiece.  Obviously an ML researcher would not have a pleasant life if it took 15 minutes to switch between commits in the repository.&lt;/p&gt;

&lt;p&gt;Instead, we’ll be exploring another technique DVC and some other tools utilize: linking.  There are two types of links all modern OSs support, hard links and symbolic links.  A new type of link, the reflink (copy-on-write), is starting to be available in newer releases of Mac OS X and Linux (where it requires a file system that supports it).  We’ll use each of them in turn and see how well they work.&lt;/p&gt;

&lt;p&gt;DVC discusses &lt;a href="https://dvc.org/doc/user-guide/large-dataset-optimization" rel="noopener noreferrer"&gt;the four strategies&lt;/a&gt; corresponding to the three link types (the 4th strategy is to just copy files) in its documentation.  The strategy used depends on the file system capabilities, and whether the "&lt;code&gt;dvc config cache.type&lt;/code&gt;" command has changed the configuration.&lt;/p&gt;

&lt;p&gt;DVC defaults to using reflinks and, if they are not available, falls back to file copying.  It avoids using symlinks and hardlinks because of the risk of accidental cache or repository corruption.  We’ll see all this in the coming sections.&lt;/p&gt;

&lt;h1&gt;
  
  
  Versioned datasets using File Copying
&lt;/h1&gt;

&lt;p&gt;The basic (or naive) strategy of copying files into place when checking out a Git tag is the equivalent of these commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ rm data/wikidatawiki-stub-articles.xml
$ cp .dvc/cache/40/58c95964df74395df6e9e8e1aa6056 data/wikidatawiki-stub-articles.xml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will run on any filesystem, but it will take a long time to copy the file, and it will consume twice the disk space.  &lt;/p&gt;

&lt;p&gt;What’s with the strange file name?  It’s the filename within the DVC cache, the hex digits being the MD5 checksum.  Files in the DVC cache are indexed by this checksum, allowing there to be multiple versions of the same file.  The &lt;a href="https://dvc.org/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory" rel="noopener noreferrer"&gt;DVC documentation&lt;/a&gt; contains more details about the DVC cache implementation.&lt;/p&gt;
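To make the mapping concrete, here is a small sketch of such content addressing (the cache layout follows the DVC documentation linked above; the sample file name is made up, and GNU md5sum is assumed):

```shell
# Compute the MD5 of a sample file, then build the cache-style path:
# the first two hex digits of the checksum become a directory name.
printf 'hello' > sample.dat
hash=$(md5sum sample.dat | awk '{print $1}')
prefix=$(echo "$hash" | cut -c1-2)
rest=$(echo "$hash" | cut -c3-)
echo ".dvc/cache/$prefix/$rest"
# → .dvc/cache/5d/41402abc4b2a76b9719d911017c592
```

Because the path is derived purely from the file contents, two identical files land in the same cache slot, and any edit produces a brand-new slot.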

&lt;p&gt;In practice this is how it works in DVC:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F1hox1cyly1dsnqxfp2tv.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F1hox1cyly1dsnqxfp2tv.gif" alt="File copy scenario"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To set up the workspace we created two Git tags, one for each file we downloaded.  The &lt;a href="https://dvc.org/doc/get-started/example-versioning" rel="noopener noreferrer"&gt;DVC example on versioned datasets&lt;/a&gt; (&lt;a href="https://techsparx.com/software-development/ai/dvc/versioning-example.html" rel="noopener noreferrer"&gt;or this alternate tutorial&lt;/a&gt;) should give you an idea of what’s involved with setting up the workspace.   To test different file copying/linking modes we first change the DVC configuration then check out the given Git commit.  Running "&lt;code&gt;dvc checkout&lt;/code&gt;" causes the corresponding data file to be inserted into the directory.  &lt;/p&gt;
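In outline, one test run looks something like this (the tag name is hypothetical; the commands follow the DVC documentation linked above):

```shell
# choose a linking strategy, then materialize the data for a commit
dvc config cache.type copy   # or: reflink, hardlink, symlink
git checkout v1-dump         # hypothetical tag marking the first dataset
time dvc checkout            # links/copies the matching data file into data/
```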

&lt;p&gt;Oh boy, that sure took a long time.  The Git portion of this is very fast, but DVC is not.  This is as expected, since we told DVC to perform a file copy, and we already knew from the cp command that copying the file takes about 15 minutes.&lt;/p&gt;

&lt;p&gt;As for disk space, obviously there are now two copies of the data file.  There is the copy in the DVC cache directory, and the other that was copied into the workspace.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimgs.xkcd.com%2Fcomics%2Fporn_folder.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimgs.xkcd.com%2Fcomics%2Fporn_folder.png" alt="XKCD"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Source: &lt;a href="https://www.xkcd.com/981/" rel="noopener noreferrer"&gt;https://www.xkcd.com/981/&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Versioned datasets using Hard Links and Symlinks
&lt;/h1&gt;

&lt;p&gt;Clearly copying files around to handle dataset versioning is slow and an inefficient use of disk space.  Two options that have existed since time immemorial in Unix-like environments are hard links and symbolic links.  While Windows historically did not support file links, its "&lt;code&gt;mklink&lt;/code&gt;" command can now create both styles of links as well.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fnt9y47pghodhzyixcgb4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fnt9y47pghodhzyixcgb4.png" alt="Hard Links and Symbolic Links"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hard links are a byproduct of the Unix model for file systems.  What we think of as the filename is really just an entry in a directory file.  The directory file entry contains the file name, and the “inode number” which is simply an index into the inode table.  Inode table entries are data structures containing file attributes, and a pointer to the actual data.  A hard link is simply two directory entries with the same inode number.  In effect, it is the exact same file appearing at two locations in the filesystem.  Hard links can only be made within a given mounted volume.&lt;/p&gt;

&lt;p&gt;A symbolic link is a special file whose attributes contain a pathname specifying the target of the link.  Because it contains a pathname, a symbolic link can point to any file in the filesystem, even across mounted volumes or across network file systems. &lt;/p&gt;
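To make the difference concrete, here is a quick sketch you can run anywhere (the file names are made up):

```shell
printf 'dataset' > original.dat
ln original.dat hard.dat      # hard link: a second directory entry, same inode
ln -s original.dat soft.dat   # symlink: a tiny file that stores a pathname
ls -i original.dat hard.dat   # the same inode number appears for both names
readlink soft.dat             # prints the stored target: original.dat
```

Deleting `original.dat` would leave `hard.dat` fully intact (the inode survives while any directory entry references it), whereas `soft.dat` would become a dangling link.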

&lt;p&gt;The equivalent commands in this case are:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;rm &lt;/span&gt;data/wikidatawiki-stub-articles.xml
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;ln&lt;/span&gt; .dvc/cache/40/58c95964df74395df6e9e8e1aa6056 data/wikidatawiki-stub-articles.xml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is for a hard link.  For a symbolic link use "&lt;code&gt;ln -s&lt;/code&gt;".  &lt;/p&gt;

&lt;p&gt;Then to perform the hard link scenario:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F2fyvuusadfhhzjiqgwy6.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F2fyvuusadfhhzjiqgwy6.gif" alt="The hard link scenario"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The symbolic link scenario is the same, but setting &lt;code&gt;cache.type&lt;/code&gt; to &lt;code&gt;symlink&lt;/code&gt;.  The timing is similar for both cases.&lt;/p&gt;

&lt;p&gt;Two seconds (or less) is sure a lot faster than the 15 minutes or so it took to copy the files.  It happens so fast we use the word “instantaneous”.  File links are that much faster than copying files around.  This is a big win.&lt;/p&gt;

&lt;p&gt;As for disk space consumption, consider this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ls -l data/
total 8
lrwxr-xr-x  1 david  staff   70 Jul 21 18:43 wikidatawiki-stub-articles.xml -&amp;gt; /Users/david/dvc/linktest/.dvc/cache/2c/82d0130fb32a17d58e2b5a884cd3ce
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The link takes up a negligible amount of disk space.  But there is a wrinkle to consider.&lt;/p&gt;

&lt;p&gt;Ok, looks great, right? Fast, no extra space consumed … But, let’s think about what would happen if you were to edit &lt;code&gt;data/wikidatawiki-stub-articles.xml&lt;/code&gt; in the workspace.  Because that file is a link to a file in the DVC cache, the file in the cache would be changed, polluting the cache.  You’ll need to take extra measures to avoid that problem.  The &lt;a href="https://dvc.org/doc/user-guide/update-tracked-file" rel="noopener noreferrer"&gt;DVC documentation&lt;/a&gt; has instructions on avoiding the problem when using DVC.  It means always remembering to use a specific process for editing a data file, which, while not a deal breaker, is less than convenient.  The better option, though, is to use reflinks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fx94lpqmuswdaizdq05vn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fx94lpqmuswdaizdq05vn.png" alt="Styles of links"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Versioned Datasets using Reflinks
&lt;/h1&gt;

&lt;p&gt;Hard links and symbolic links have been in the Unix/Linux ecosystem for a long time.  I first used symbolic links in 1984 on 4.2BSD, and hard links date back even further.  Both hard links and symbolic links can be used to do what DVC does, namely quickly rearranging data files in a working directory.  But surely in the last 35+ years there has been an advancement or two in file systems?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fhda2xvpq1068vyjwfx4c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fhda2xvpq1068vyjwfx4c.png" alt="Reflinks"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Indeed there has, and the Mac OS X “&lt;code&gt;clonefile&lt;/code&gt;” and Linux “&lt;code&gt;reflink&lt;/code&gt;” features are examples.  &lt;/p&gt;

&lt;p&gt;Copy On Write links, a.k.a. reflinks, offer a way to quickly link a file into the workspace while avoiding any risk of polluting the cache.  The hard link and symbolic link approaches are big wins because of their speed, but they run the risk of polluting the cache.  With reflinks, the copy-on-write behavior means that if someone were to modify the data file, the copy in the cache would not be polluted.  That means we’d have the same performance advantage as traditional links, with the added advantage of data safety.&lt;/p&gt;

&lt;p&gt;Maybe, like me, you don’t know what a reflink is.  A reflink duplicates a file on disk such that the “copy” is a “clone”, similar to a hard link.  Unlike a hard link, where two directory entries refer to the same inode entry, with reflinks there are two inode entries, and it is the data blocks that are shared.  The clone happens as quickly as a hard link, but there is an important difference.  Any write to the cloned file causes new data blocks to be allocated to hold that data.  The cloned file appears changed, and the original file is unmodified.  The clone is perfectly suitable for duplicating a dataset, allowing modifications without polluting the original dataset.  &lt;/p&gt;
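A quick way to see copy-on-write behavior with GNU cp (here `--reflink=auto` silently falls back to a plain copy on file systems without reflink support, so the observable result is the same either way):

```shell
printf 'original dataset' > cache.dat
cp --reflink=auto cache.dat work.dat   # clone: data blocks shared where supported
printf ' plus edits' >> work.dat       # the write allocates fresh blocks
cat cache.dat                          # → original dataset (unchanged)
```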

&lt;p&gt;Like with hard links, reflinks only work within a given mounted volume. &lt;/p&gt;

&lt;p&gt;Reflinks are easily available on Mac OS X, and with a little work are available on Linux.  This feature is supported only on certain file systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linux

&lt;ul&gt;
&lt;li&gt;BTRFS&lt;/li&gt;
&lt;li&gt;XFS&lt;/li&gt;
&lt;li&gt;OCFS2&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Mac OS X

&lt;ul&gt;
&lt;li&gt;APFS&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;APFS is supported out of the box on macOS, and Apple strongly suggests we use it.  For Linux, XFS is the easiest to set up, as shown in this tutorial.&lt;/p&gt;

&lt;p&gt;For APFS the equivalent commands are:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ rm data/wikidatawiki-stub-articles.xml
$ cp -c .dvc/cache/40/58c95964df74395df6e9e8e1aa6056 data/wikidatawiki-stub-articles.xml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the &lt;code&gt;-c&lt;/code&gt; option, the macOS &lt;code&gt;cp&lt;/code&gt; command uses the &lt;code&gt;clonefile(2)&lt;/code&gt; system call.  The &lt;code&gt;clonefile&lt;/code&gt; function sets up a reflink clone of the named file.  On Linux, the &lt;code&gt;cp&lt;/code&gt; command uses the &lt;code&gt;--reflink&lt;/code&gt; option instead.&lt;/p&gt;

&lt;p&gt;Then to run the test:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fl3hwn2uer6cubc1rlbra.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fl3hwn2uer6cubc1rlbra.gif" alt="The reflink scenario"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The performance is, as expected, similar to the hard links and symbolic links strategies.  What we learn is that reflinks are about as fast as hard links and symlinks, and disk space consumption is again negligible.&lt;/p&gt;

&lt;p&gt;The cool thing about reflinks is that even though the files are connected, you can edit the workspace file without modifying the file in the cache.  The changed data blocks are copied under the hood.&lt;/p&gt;

&lt;p&gt;On Linux the same scenario runs with similar performance.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F9duojm0vyjqm5vqnasm3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F9duojm0vyjqm5vqnasm3.png" alt="Results"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We’ve learned something about how to efficiently manage a large dataset, as is typical in machine learning projects.  If we need to revisit any development stage in such projects, we’ll want a system for efficiently rearranging large datasets to match each stage.&lt;/p&gt;

&lt;p&gt;We’ve seen it is possible to keep a list of files that were present at any Git commit.  With that list we can link or copy those files into the working directory.  That is exactly how DVC manages data files in a project.  Using links, rather than file copying, lets us quickly and efficiently switch between revisions of the project. &lt;/p&gt;

&lt;p&gt;Reflinks are an interesting new feature for file systems, and they are perfect for this scenario.  Reflinks are as fast to create as traditional hard links and symbolic links, letting us quickly duplicate a file, or a whole directory structure, while consuming negligible extra space.  And, since reflinks keep modifications in the linked file, they give us many more possibilities than traditional links.  In this article we examined using reflinks in machine learning projects, but they are used in other sorts of applications.  For example, some database systems utilize them to manage data on disk more efficiently.  Now that you’ve learned about reflinks, how will you go about using them?&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>linux</category>
      <category>programming</category>
      <category>git</category>
    </item>
    <item>
      <title>Principled Machine Learning: Practices and Tools for Efficient Collaboration</title>
      <dc:creator>David Herron</dc:creator>
      <pubDate>Thu, 20 Jun 2019 15:36:30 +0000</pubDate>
      <link>https://dev.to/robogeek/principled-machine-learning-4eho</link>
      <guid>https://dev.to/robogeek/principled-machine-learning-4eho</guid>
      <description>&lt;p&gt;Machine learning projects are often harder than they should be.  We’re dealing with data and software, and it should be a simple matter of running the code, iterating through some algorithm tweaks, and after a while we have a perfectly trained AI model.  But fast forward three months: the training data might have been changed or deleted, and your memory of which training script does what might have grown vague.  Have you created a disconnect between the trained model and the process to create the model?  How do you share work with colleagues for collaboration or replicating your results?&lt;/p&gt;

&lt;p&gt;As is true for software projects in general, what’s needed is better management of code versions and project assets.  One might need to revisit the state of the project as it was at any stage in the past.  We do this (review old commits) in software engineering all the time.  Shouldn’t a machine learning project occasionally do the same?  It’s even more than that.  What about the equivalent of a Pull Request, or other sorts of team management practices routinely used in other fields?&lt;/p&gt;

&lt;p&gt;Myself, I am just beginning my journey to learn about Machine Learning tools.  Among the learning materials, I watch tutorial videos, and the instructors sometimes talk about problems that remind me of a period early in my software engineering career.  In 1993-4, for example, I was the lead engineer of a team developing an e-mail user agent.  We did not have any kind of Source Code Management (SCM) system.  Every day I consulted all other team members to see what changes they had made that day.  The only tool I had was to run a diff between their source tree and the master source tree (using &lt;code&gt;diff -c | less&lt;/code&gt;), then manually apply the changes.  Later, team members manually updated their source tree from the master source tree.  That was a mess until we found an early SCM system (CVS).  That one tool made the project run much more smoothly.&lt;/p&gt;

&lt;p&gt;As I learn the tools used in machine learning and data science projects, the stories feel similar to this. Even today ML researchers sometimes store experiments (data, code, etc) in parallel directory structures to facilitate diffing, just like I did in 1993.&lt;/p&gt;

&lt;h1&gt;
  
  
  Principles
&lt;/h1&gt;

&lt;p&gt;Let’s start with a brief overview of some principles that might be useful to improve the state of software management tools for machine learning projects.&lt;/p&gt;

&lt;p&gt;In any machine learning project the scientist will run many experiments to develop the best trained model for the target scenario.  Experiments contain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Code and Configuration&lt;/em&gt;: The software used in the experiment, along with configuration parameters&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Dataset&lt;/em&gt;: Any input data used - this can easily be many gigabytes in size such as projects to recognize content of audio, image or video files&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Outputs&lt;/em&gt;: The trained ML model and any other outputs from the experiment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A machine learning project is just running software.  But often there are difficulties in sharing files with colleagues or reproducing the results.  Getting repeatable results that can be shared with colleagues, and where you can go back in time to evaluate earlier stages of the project, requires more comprehensive management tools.&lt;/p&gt;

&lt;p&gt;The solution needs to encompass ideas like these (abstracted from a talk by Patrick Ball titled &lt;a href="https://www.youtube.com/watch?v=ZSunU9GQdcI" rel="noopener noreferrer"&gt;Principled Data Processing&lt;/a&gt;):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Transparency&lt;/strong&gt;: Inspecting every aspect of an ML project.

&lt;ul&gt;
&lt;li&gt;What code, configuration and data files are used&lt;/li&gt;
&lt;li&gt;What processing steps are used in the project, and the order of the steps&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Auditability&lt;/strong&gt;: Inspecting intermediate results of a pipeline

&lt;ul&gt;
&lt;li&gt;Looking at not just the final result, but also any intermediate results&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Reproducibility&lt;/strong&gt;: Ability to re-execute precisely the project at any stage of its development, and the ability for co-workers to re-execute precisely the project

&lt;ul&gt;
&lt;li&gt;Recording the processing steps such that they’re automatically rerunnable by anyone&lt;/li&gt;
&lt;li&gt;Recording the state of the project as the project progresses.  “State” means code, configuration, and datasets&lt;/li&gt;
&lt;li&gt;Ability to recreate the exact datasets available at any time in the project history is crucial for Auditability to be useful&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Scalability&lt;/strong&gt;: Ability to support multiple co-workers working on a project, and the ability to work on multiple projects simultaneously&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fzn5k90qi8xn23n26gyrb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fzn5k90qi8xn23n26gyrb.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What makes ML projects different from regular software engineering?
&lt;/h2&gt;

&lt;p&gt;Are you concluding that if ML projects are so similar to software engineering, we should just use regular software engineering tools in machine learning projects? Not so fast!&lt;/p&gt;

&lt;p&gt;There are many tools used in regular software engineering projects that could be useful to ML researchers.  The code and experiment configuration can be easily managed in a regular source code management system like Git, and techniques like pull requests can be used to manage updates to those files.  CI/CD (Jenkins, etc) systems can even be useful in automating project runs.&lt;/p&gt;

&lt;p&gt;But ML projects have differences preventing regular software developer tools from serving every need.  Here’s a few things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Metrics-Driven development versus Feature-Driven development&lt;/strong&gt;:  In regular software engineering “whether to release” decisions are based on whether the team has reached feature milestones.  By contrast, ML researchers look at an entirely different measurement - the predictive value of the generated machine learning model.  The researcher will iteratively generate dozens (or more) models, measuring the accuracy of each.  The project is guided by metrics achieved in each experiment, since the goal is to find the most accurate model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ML models require huge resources to train&lt;/strong&gt;:  Where a regular software project compiles files together into a software product, an ML project instead trains a “model” that describes an AI algorithm.  In most cases compiling a software product takes a few minutes, which is so cheap many teams follow a continuous integration strategy.  Training an ML model takes so long that it’s desirable to avoid doing so unless necessary.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enormous datasets and trained models&lt;/strong&gt;: A generalization of the previous point is that machine learning development phases almost always require enormous datasets that are used in training the ML model, plus trained models can be enormous.  Normal source code management tools (Git et al) do not handle large files very well, and add-ons like Git-LFS are not suitable for ML projects.  (See &lt;a href="https://towardsdatascience.com/why-git-and-git-lfs-is-not-enough-to-solve-the-machine-learning-reproducibility-crisis-f733b49e96e8" rel="noopener noreferrer"&gt;my previous article&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pipelines&lt;/strong&gt;: ML projects are a series of steps such as downloading data, preparing data, separating data into training/validation sets, training a model, and validating the model.  Many use the word “pipeline”, and it is useful to structure an ML project with discrete commands for each step versus cramming everything into one program.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Special purpose hardware&lt;/strong&gt;:  Software organizations can host their software infrastructure on any kind of server equipment.  If they desire a cloud deployment, they can rent ordinary VPSs from their favorite cloud computing provider.  ML researchers have huge computation needs.  High-powered GPUs not only speed up video editing; they make ML algorithms fly, slashing the time required to train ML models.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What if that intermediate result was generated three months ago and things have changed such that you don’t remember how the software had been run at that time?  What if the dataset has been overwritten or changed?  A system supporting transparency, auditability and reproducibility for an ML project must account for all these things.&lt;/p&gt;

&lt;p&gt;Now that we have a list of principles, let’s look at some open source tools in this context.&lt;/p&gt;

&lt;p&gt;There are a large number of tools that might be suitable for data science and machine learning practitioners.  In the following sections we’re specifically discussing two tools (&lt;a href="https://mlflow.org" rel="noopener noreferrer"&gt;MLFlow&lt;/a&gt; and &lt;a href="https://dvc.org" rel="noopener noreferrer"&gt;DVC&lt;/a&gt;) while also talking about general principles.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F5qw1fz7mk4guq9lp27z1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F5qw1fz7mk4guq9lp27z1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Principled data and models storage for ML projects
&lt;/h1&gt;

&lt;p&gt;One side of this discussion boils down to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tracking which data files were used for every round of training machine learning models.&lt;/li&gt;
&lt;li&gt;Tracking resulting trained models and evaluation metrics&lt;/li&gt;
&lt;li&gt;Simple method to share data files with colleagues via any form of file sharing system.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A data tracking system is required to transparently audit results, or to reproduce them.  A data sharing system is required to scale the project team to multiple colleagues.&lt;/p&gt;

&lt;p&gt;It may already be obvious, but it is impractical to use Git or other SCM (Source Code Management system) to store the data files used in a machine learning project.  It would be attractively simple if the SCM storing the code and configuration files could also store the data files. Git-LFS is not a good solution either. My earlier article, &lt;a href="https://towardsdatascience.com/why-git-and-git-lfs-is-not-enough-to-solve-the-machine-learning-reproducibility-crisis-f733b49e96e8" rel="noopener noreferrer"&gt;Why Git and Git-LFS is not enough to solve the Machine Learning Reproducibility crisis&lt;/a&gt;, went into some detail about the reasoning.&lt;/p&gt;

&lt;p&gt;Some libraries provide an API to simplify dealing with files on remote storage, and manage uploading files to or from remote storage.  While this can be useful for shared access to a remote dataset, it does not help with the problem described here.  First, it is a form of embedded configuration since the file names are baked into the software.  Any program where configuration settings are embedded in the source code is more difficult to reuse in other circumstances.   Second, it does not correlate which data file was used for each version of the scripts.&lt;/p&gt;

&lt;p&gt;Consider the example code for MLFlow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mlflow.pytorch.load_model("runs:/&amp;lt;mlflow_run_id&amp;gt;/run-relative/path/to/model")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This supports several alternative file access “schemes” including cloud storage systems like S3.  The example here loads a file, in this case a trained model, from the “run” area.  An MLFlow “run” is generated each time you execute “a piece of data science code”.  You configure a location where “run” data is stored, and a “run ID” is generated for each run; that ID is used to index into the data storage area.&lt;/p&gt;
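&lt;p&gt;To make that indexing concrete, here is a minimal sketch of how a “runs:/” URI could be resolved against a local artifact store.  This is purely illustrative: the helper name, the store layout, and the &lt;code&gt;artifacts&lt;/code&gt; subdirectory are assumptions, not MLFlow’s actual implementation.&lt;/p&gt;

```python
import os

def resolve_runs_uri(uri: str, store_root: str) -> str:
    """Resolve a 'runs:/<run_id>/<relative/path>' URI to a filesystem
    path under a local artifact store.  Illustrative only: the real
    MLFlow tracking store also involves experiment IDs and metadata."""
    prefix = "runs:/"
    if not uri.startswith(prefix):
        raise ValueError("expected a runs:/ URI")
    run_id, _, rel_path = uri[len(prefix):].partition("/")
    return os.path.join(store_root, run_id, "artifacts", rel_path)

path = resolve_runs_uri(
    "runs:/abc123/run-relative/path/to/model", "/tmp/mlruns")
# path: /tmp/mlruns/abc123/artifacts/run-relative/path/to/model
```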

&lt;p&gt;This looks to be useful as it will automatically associate the data with commits to the SCM repository storing code and configuration files.  Additionally, as the MLFlow API is available for several languages, you’re not limited to Python.&lt;/p&gt;

&lt;p&gt;DVC has a different approach.  Instead of integrating a file API into your ML scripts, your scripts simply input and output files using normal file-system APIs.  For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model = torch.load(‘path/to/model.pkl’)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ideally this pathname would be passed in from the command line.  The point is that nothing special is required of the code because DVC provides its value outside the context of the code used in training or validating models.&lt;/p&gt;
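&lt;p&gt;For instance, a training script that works well with DVC simply accepts its file paths as command-line arguments.  A minimal sketch, in which the script shape and argument names are hypothetical:&lt;/p&gt;

```python
import argparse

def parse_args(argv=None):
    """Parse the file paths this training step reads and writes.
    Keeping paths on the command line (rather than hard-coded)
    lets DVC wire the script into a pipeline unchanged."""
    p = argparse.ArgumentParser(description="train step (illustrative)")
    p.add_argument("model_in", help="e.g. path/to/model.pkl")
    p.add_argument("model_out", help="where to write the trained model")
    return p.parse_args(argv)

args = parse_args(["path/to/model.pkl", "out/model.pkl"])
# The real script would then use ordinary file-system APIs, e.g.
#   model = torch.load(args.model_in)
```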

&lt;p&gt;DVC makes this transparent because the data file versioning is paired with Git.  A file or directory is taken under DVC control with the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ dvc add path/to/model.pkl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The data is stored in a natural place, in your working directory.  Navigating through the results of various runs is a simple matter of navigating through your Git history.  Viewing a particular result is as simple as running &lt;code&gt;git checkout&lt;/code&gt;, and DVC will be invoked to ensure the correct data files are linked into the workspace. &lt;/p&gt;

&lt;p&gt;A “DVC file” is created to track each file or directory, and is inserted into the workspace by DVC.  These files serve two purposes: one is tracking data and model files, the other is recording the workflow commands, which we’ll go over in the next section.&lt;/p&gt;

&lt;p&gt;These DVC files record MD5 checksums of the files or directories being tracked.  They are committed to the Git workspace, and therefore the DVC files record the checksum of each file in the workspace for each Git commit.  Behind the scenes, DVC uses what’s called a “DVC cache directory” to store multiple instances of each file.  The instances are indexed by the checksum, and are linked into the workspace using reflinks or symlinks.  When DVC responds to the &lt;code&gt;git checkout&lt;/code&gt; operation, it is able to quickly rearrange linked files in the workspace based on the checksums recorded in the DVC files.&lt;/p&gt;
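&lt;p&gt;The core of that mechanism can be sketched in a few lines: a file is stored in the cache under a path derived from its checksum, and the workspace copy becomes a link into that store.  This is a deliberate simplification of DVC’s actual cache layout and link handling:&lt;/p&gt;

```python
import hashlib, os, shutil, tempfile

def add_to_cache(path, cache_dir):
    """Move `path` into a checksum-indexed cache and link it back
    into the workspace.  A simplified sketch of DVC's cache; the
    real layout and edge-case handling differ."""
    with open(path, "rb") as f:
        md5 = hashlib.md5(f.read()).hexdigest()
    entry = os.path.join(cache_dir, md5[:2], md5[2:])
    os.makedirs(os.path.dirname(entry), exist_ok=True)
    if not os.path.exists(entry):
        shutil.move(path, entry)
    else:
        os.remove(path)               # identical content already cached
    try:
        os.link(entry, path)          # hard link back into the workspace
    except OSError:
        shutil.copy2(entry, path)     # fall back to a plain copy
    return md5

workspace = tempfile.mkdtemp()
data = os.path.join(workspace, "model.pkl")
with open(data, "w") as f:
    f.write("weights")
checksum = add_to_cache(data, os.path.join(workspace, ".cache"))
```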

&lt;p&gt;DVC supports a remote cache directory that is used to share data and models with others.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ dvc remote add remote1 ssh://user@host.name/path/to/dir
$ dvc push
$ dvc pull
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A DVC remote is a pool of storage through which data can be shared.  It supports many storage services, including S3, HTTP, and FTP.  Creating one is very simple.  The &lt;code&gt;dvc push&lt;/code&gt; and &lt;code&gt;dvc pull&lt;/code&gt; commands are purposely similar to the &lt;code&gt;git push&lt;/code&gt; and &lt;code&gt;git pull&lt;/code&gt; commands.  Where &lt;code&gt;dvc push&lt;/code&gt; sends data to a remote DVC cache, &lt;code&gt;dvc pull&lt;/code&gt; retrieves data from one.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fv0nn6qwdpcytugfu5wdu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fv0nn6qwdpcytugfu5wdu.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Principled workflow descriptions for ML projects
&lt;/h1&gt;

&lt;p&gt;Another side of the discussion is about how to best describe the workflow, or pipeline, used in the ML project.  Do we pile the whole thing into one program?  Or do we use multiple tools?&lt;/p&gt;

&lt;p&gt;The greatest flexibility comes from implementing the workflow as a pipeline, or a directed acyclic graph, of reusable commands that take configuration options as command-line arguments.  This is purposely similar to The Unix Philosophy of small well-defined tools, with narrow scope, that work well together, where behavior is tailored by command-line options or environment variables, and that can be mixed and matched as needed.  There is a long collective history behind this philosophy.&lt;/p&gt;

&lt;p&gt;By contrast many of the ML frameworks take a different approach in which a single program is written to drive the workflow used by the specific project.  The single program might start with the step of splitting data into training and validation sets, then proceed through training a model and running validation of the model.   This gives us limited chance to reuse code in other projects.&lt;/p&gt;

&lt;p&gt;Structuring an ML project as a pipeline offers several benefits.  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Managing complexity&lt;/strong&gt;: Implementing the steps as separate commands improves transparency, and lets you focus on one step at a time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimized execution&lt;/strong&gt;: The ability to skip steps that do not need to be rerun because their files have not changed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reusability&lt;/strong&gt;: The possibility of using the same tool between multiple projects.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: Different tools can be independently developed by different team members.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In MLFlow the framework has you write a “driver program”.  That program contains whatever logic is required, such as processing and generating a machine learning model.  Behind the scenes the MLFlow API sends requests to an MLFlow server, which then spawns the specified commands.&lt;/p&gt;

&lt;p&gt;The MLFlow example for &lt;a href="https://github.com/mlflow/mlflow/blob/master/examples/multistep_workflow/main.py" rel="noopener noreferrer"&gt;a multi-step workflow makes this clear&lt;/a&gt;.  Namely:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
load_raw_data_run = _get_or_run("load_raw_data", {}, git_commit)
ratings_csv_uri = os.path.join(load_raw_data_run.info.artifact_uri,
                    "ratings-csv-dir")
etl_data_run = _get_or_run("etl_data",
                   {"ratings_csv": ratings_csv_uri,
                    "max_row_limit": max_row_limit},
                    git_commit)
…
als_run = _get_or_run("als", 
                  {"ratings_data": ratings_parquet_uri,
                   "max_iter": str(als_max_iter)},
                   git_commit)
…
_get_or_run("train_keras", keras_params, git_commit, use_cache=False)
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;_get_or_run&lt;/code&gt; function is a simple wrapper around &lt;code&gt;mlflow.run&lt;/code&gt;.  The first argument to each is an &lt;code&gt;entrypoint&lt;/code&gt; defined in the &lt;code&gt;MLproject&lt;/code&gt; file.  An entry point contains environment settings, the command to run, and options to pass to that command.  For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;etl_data:
    parameters:
      ratings_csv: path
      max_row_limit: {type: int, default: 100000}
    command: "python etl_data.py --ratings-csv {ratings_csv} --max-row-limit {max_row_limit}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
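&lt;p&gt;Conceptually, the &lt;code&gt;command&lt;/code&gt; template in an entry point is filled in from the supplied parameters.  The sketch below is a hedged illustration of that substitution, not MLFlow’s internals:&lt;/p&gt;

```python
entry_point = {
    "parameters": {"ratings_csv": "path",
                   "max_row_limit": {"type": "int", "default": 100000}},
    "command": ("python etl_data.py --ratings-csv {ratings_csv}"
                " --max-row-limit {max_row_limit}"),
}

def render_command(entry, **params):
    """Substitute caller-supplied parameters into the command template."""
    return entry["command"].format(**params)

cmd = render_command(entry_point,
                     ratings_csv="ratings-csv-dir", max_row_limit=5000)
# cmd: "python etl_data.py --ratings-csv ratings-csv-dir --max-row-limit 5000"
```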



&lt;p&gt;At first blush this appears to be very good.  But here are a few questions to ponder:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What if your workflow must be more complex than a straight line?  You can pass &lt;code&gt;False&lt;/code&gt; for the &lt;code&gt;synchronous&lt;/code&gt; parameter to &lt;code&gt;mlflow.run&lt;/code&gt;, then wait on the returned &lt;code&gt;SubmittedRun&lt;/code&gt; object to learn when the task finished.  In other words, it is possible to build a process management system on top of the MLFlow API.&lt;/li&gt;
&lt;li&gt;Why is a server required?  Why not just run the commands at a command line?  Requiring that a server be configured makes setup of a MLFlow project more complex.&lt;/li&gt;
&lt;li&gt;How do you avoid running a task that does not need to execute?  In many ML projects, it takes days to train a model.  That resource cost should only be spent if needed, such as changed data, changed parameters or changed algorithms.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DVC has an approach that works with regular command-line tools, but does not require setting up a server nor writing a driver program.  DVC supports defining a workflow as a directed acyclic graph (DAG) using the set of DVC files mentioned earlier.&lt;/p&gt;

&lt;p&gt;We mentioned DVC files earlier as associated with files added to the workspace.  DVC files also describe commands to execute, such as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ dvc run -d matrix-train.p -d train_model.py \
          -o model.p \
          python train_model.py matrix-train.p 20180226 model.p
$ dvc run -d parsingxml.R -d Posts.xml \
          -o Posts.csv \
          Rscript parsingxml.R Posts.xml Posts.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;dvc run&lt;/code&gt; command creates a DVC file describing a command to execute.  The &lt;code&gt;-d&lt;/code&gt; option declares a dependency on a file; DVC tracks that file’s checksum to detect changes.  The &lt;code&gt;-o&lt;/code&gt; option declares an output of the command.  Outputs of one command can of course be used as inputs to another command.  By looking at dependencies and outputs, DVC can calculate the execution order for the commands.&lt;/p&gt;

&lt;p&gt;All outputs, including trained models, are automatically tracked in the DVC cache just like any other data file in the workspace.&lt;/p&gt;

&lt;p&gt;Because it computes checksums, DVC can detect changed files.  When the user requests DVC to re-execute the pipeline it only executes stages where there are changes.  DVC can skip over your three-day model training task if none of its input files changed.&lt;/p&gt;
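&lt;p&gt;The change-detection idea is conceptually simple.  Here is an illustrative sketch (not DVC’s code) of deciding which stages need to run by comparing each dependency’s current checksum to the one recorded last time:&lt;/p&gt;

```python
def stages_to_run(stages, recorded, current):
    """Return the names of stages whose dependencies changed.
    `stages` maps stage name -> list of dependency paths;
    `recorded`/`current` map path -> checksum.  (DVC also reruns
    downstream stages that consume a changed stage's outputs,
    which this sketch omits.)"""
    return [name for name, deps in stages.items()
            if any(recorded.get(d) != current.get(d) for d in deps)]

stages = {"prepare": ["data.xml", "prepare.py"],
          "train":   ["matrix-train.p", "train_model.py"]}
recorded = {"data.xml": "aaa", "prepare.py": "bbb",
            "matrix-train.p": "ccc", "train_model.py": "ddd"}
current = dict(recorded, **{"data.xml": "zzz"})  # only data.xml changed

print(stages_to_run(stages, recorded, current))  # ['prepare']
```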

&lt;p&gt;Everything executes at a regular command line; there is no server to set up.  If you want this to execute in a cloud computing environment, or on a server with attached GPUs, simply deploy the code and data to that server and run the DVC commands on that server’s command line.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;We’ve come a long way with this exploration of some principles for improved machine learning practices.  The ML field, as many recognize, needs better management tools so that ML teams can work more efficiently and reliably.&lt;/p&gt;

&lt;p&gt;The ability to reproduce results means others can evaluate what you’ve done, or collaborate on further development.  Reproducibility has many prerequisites including the ability to examine every part of a system, and the ability to precisely rerun the software and input data.  &lt;/p&gt;

&lt;p&gt;Some of the tools used in machine learning projects have nice user interfaces, such as Jupyter Notebook.  These kinds of tools have their place in machine learning work.  However, GUI tools do not fit well with the principles discussed in this article.  Command-line tools are well suited for processing tasks running in the background, and can easily satisfy all the principles we outline, while typical GUIs interfere with most of those principles.&lt;/p&gt;

&lt;p&gt;As we’ve seen in this article some tools and practices can be borrowed from regular software engineering.  However, the needs of machine learning projects dictate tools that better fit the purpose.  A few worthy tools include &lt;a href="https://mlflow.org/" rel="noopener noreferrer"&gt;MLFlow&lt;/a&gt;, &lt;a href="https://dvc.org/" rel="noopener noreferrer"&gt;DVC&lt;/a&gt;, &lt;a href="https://mitdbg.github.io/modeldb/" rel="noopener noreferrer"&gt;ModelDb&lt;/a&gt; and even &lt;a href="https://git-lfs.github.com/" rel="noopener noreferrer"&gt;Git-LFS&lt;/a&gt; (despite what we said earlier about it).&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>ai</category>
      <category>python</category>
    </item>
    <item>
      <title>Why Git and Git-LFS is not enough to solve the Machine Learning Reproducibility crisis</title>
      <dc:creator>David Herron</dc:creator>
      <pubDate>Fri, 14 Jun 2019 22:34:54 +0000</pubDate>
      <link>https://dev.to/robogeek/why-git-and-git-lfs-is-not-enough-to-solve-the-machine-learning-reproducibility-crisis-3cnm</link>
      <guid>https://dev.to/robogeek/why-git-and-git-lfs-is-not-enough-to-solve-the-machine-learning-reproducibility-crisis-3cnm</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fuvn7wagwq46dlgqve1kh.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fuvn7wagwq46dlgqve1kh.jpeg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some claim the machine learning field is in a crisis due to software tooling that's insufficient to ensure repeatable processes. The crisis is about difficulty in reproducing results such as machine learning models. The crisis could be solved with better software tools for machine learning practitioners.&lt;/p&gt;

&lt;p&gt;The reproducibility issue is so important that the annual NeurIPS conference plans to make this a major topic of discussion at NeurIPS 2019. The "Call for Papers" announcement has more information &lt;a href="https://medium.com/@NeurIPSConf/call-for-papers-689294418f43" rel="noopener noreferrer"&gt;https://medium.com/@NeurIPSConf/call-for-papers-689294418f43&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The so-called crisis is because of the difficulty in replicating the work of co-workers or fellow scientists,&lt;/strong&gt; threatening their ability to build on each other's work or to share it with clients or to deploy production services. Since machine learning, and other forms of artificial intelligence software, are so widely used across both academic and corporate research, replicability or reproducibility is a critical problem.&lt;/p&gt;

&lt;p&gt;We might think this can be solved with typical software engineering tools, since machine learning development is similar to regular software engineering. In both cases we generate some sort of compiled software asset for execution on computer hardware hoping to get accurate results. Why can't we tap into the rich tradition of software tools, and best practices for software quality, to build repeatable processes for machine learning teams?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unfortunately traditional software engineering tools do not fit well with the needs of machine learning researchers.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A key issue is the training data. Often this is a large amount of data, such as images, videos, or texts, that is fed into machine learning tools to train an ML model.&lt;/strong&gt; Often the training data is not under any kind of source control mechanism, if only because systems like Git do not deal well with large data files, and source control management systems designed to generate deltas for text files do not deal well with changes to large binary files. Any experienced software engineer will tell you that a team without source control will be in a state of barely managed chaos. Changes won't always be recorded and team members might forget what was done.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;At the end of the day that means a model trained against the training data cannot be replicated because the training data set will have changed in unknown-able ways.&lt;/strong&gt; If there is no software system to remember the state of the data set on any given day, then what mechanism is there to remember what happened when?&lt;/p&gt;

&lt;h1&gt;
  
  
  Git-LFS is your solution, right?
&lt;/h1&gt;

&lt;p&gt;The first response might be to simply use Git-LFS (Git Large File Storage) because it, as the name implies, deals with large files while building on Git. The pitch is that Git-LFS "replaces large files such as audio samples, videos, datasets, and graphics with text pointers inside Git, while storing the file contents on a remote server like GitHub.com or GitHub Enterprise." One can just imagine a harried machine learning team saying "sounds great, let's go for it". It handles multi-gigabyte files, speeds up checkout from remote repositories, and uses the same comfortable workflow. That sure ticks a lot of boxes, doesn't it?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not so fast,&lt;/strong&gt; didn't your manager instruct you to evaluate carefully before jumping in with both feet? Another life lesson to recall is to look both ways before crossing the street.&lt;/p&gt;

&lt;p&gt;The first thing your evaluation should turn up is that Git-LFS requires an LFS server, and that server is not available through every Git hosting service. The big three (Github, Gitlab and Atlassian) all support Git-LFS, but maybe you have a DIY bone in your body. Instead of using a 3rd party Git hosting service, you might prefer to host your own Git service. Gogs, for example, is a competent Git service you can easily run on your own hardware, but it does not have built-in support for Git-LFS.&lt;/p&gt;

&lt;p&gt;Depending on your data needs, this next point could be a killer: &lt;em&gt;Git LFS lets you store files up to 2 GB in size.&lt;/em&gt; That is a Github limitation rather than a Git-LFS limitation; however, all Git-LFS implementations seem to come with various limitations. Gitlab and Atlassian both have their own lists of Git-LFS limitations. Consider this 2 GB limit from Github: one of the use cases in the Git-LFS pitch is storing video files, but isn't it common for videos to be way beyond 2 GB in size? Therefore Git-LFS on Github is probably unsuitable for machine learning datasets.&lt;/p&gt;

&lt;p&gt;It's not just the 2GB file size limit, but Github places such a tight limit on the free tier of Git-LFS use that one must purchase a data plan covering both data and bandwidth usage.&lt;/p&gt;

&lt;p&gt;An issue related to bandwidth is that when using a hosted Git-LFS solution, your training data is stored in a remote server and must be downloaded over the Internet. The time to download training data is a serious user experience problem.&lt;/p&gt;

&lt;p&gt;Another issue is the ease of placing data files on a cloud storage system (AWS, GCP, etc), as is often required when running cloud-based AI software. This is not supported, since the main Git-LFS offerings from the big three Git services store your LFS files on their servers. There is a DIY Git-LFS server that stores files on AWS S3, at &lt;a href="https://github.com/meltingice/git-lfs-s3" rel="noopener noreferrer"&gt;https://github.com/meltingice/git-lfs-s3&lt;/a&gt;, but setting up a custom Git-LFS server of course requires additional work. And what if you need the files to be on GCP instead of AWS infrastructure? Is there a Git-LFS server that stores data on the cloud storage platform of your choice? Is there a Git-LFS server that utilizes a simple SSH server? In other words, Git-LFS limits your choices of where the data is stored.&lt;/p&gt;

&lt;h1&gt;
  
  
  Does using Git-LFS solve the so-called Machine Learning Reproducibility Crisis?
&lt;/h1&gt;

&lt;p&gt;With Git-LFS your team has better control over the data, because it is now version controlled. Does that mean the problem is solved?&lt;/p&gt;

&lt;p&gt;Earlier we said the "&lt;em&gt;key issue is the training data&lt;/em&gt;", but that was a lie. Sort of. Yes keeping the data under version control is a big improvement. But is the lack of version control of the data files the entire problem? No.&lt;/p&gt;

&lt;p&gt;What determines the results of training a model or other activities? The determining factors include the following, and perhaps more:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Training data - the image database or whatever data source is used in training the model&lt;/li&gt;
&lt;li&gt;The scripts used in training the model&lt;/li&gt;
&lt;li&gt;The libraries used by the training scripts&lt;/li&gt;
&lt;li&gt;The scripts used in processing data&lt;/li&gt;
&lt;li&gt;The libraries or other tools used in processing data&lt;/li&gt;
&lt;li&gt;The operating system and CPU/GPU hardware&lt;/li&gt;
&lt;li&gt;Production system code&lt;/li&gt;
&lt;li&gt;Libraries used by production system code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Obviously the result of training a model depends on a variety of conditions. Since there are so many variables to this, it is hard to be precise, but the general problem is a lack of what's now called Configuration Management. Software engineers have come to recognize the importance of being able to specify the precise system configuration used in deploying systems.&lt;/p&gt;
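&lt;p&gt;Even a basic configuration manifest recorded alongside each training run goes a long way toward Configuration Management.  A minimal sketch using only the Python standard library; the field names are arbitrary:&lt;/p&gt;

```python
import json, platform, sys

def config_manifest(extra=None):
    """Record the software environment a training run executed in.
    Real configuration management would also pin library versions
    (e.g. from pip freeze), dataset checksums, and hardware details."""
    manifest = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
    }
    manifest.update(extra or {})
    return manifest

m = config_manifest({"training_script": "train_model.py"})
print(json.dumps(m, indent=2))
```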

&lt;h1&gt;
  
  
  Solutions to machine learning reproducibility
&lt;/h1&gt;

&lt;p&gt;Humans are an inventive lot, and there are many possible solutions to this "crisis".&lt;/p&gt;

&lt;p&gt;Environments like R Studio or Jupyter Notebook offer a kind of interactive Markdown document which can be configured to execute data science or machine learning workflows. This is useful for documenting machine learning work, and specifying which scripts and libraries are used. But these systems do not offer a solution to managing data sets.&lt;/p&gt;

&lt;p&gt;Likewise, Makefiles and similar workflow scripting tools offer a method to repeatedly execute a series of commands. The executed commands are determined through file-system time stamps. These tools offer no solution for data management.&lt;/p&gt;

&lt;p&gt;At the other end of the scale are companies like Domino Data Labs or C3 IoT offering hosted platforms for data science and machine learning. Both package together an offering built upon a wide swath of data science tools. In some cases, like C3 IoT, users are coding in a proprietary language and storing their data in a proprietary data store. It can be enticing to use a one-stop-shopping service, but will it offer the needed flexibility?&lt;/p&gt;

&lt;p&gt;In the rest of this article we'll discuss DVC. It was designed to closely match Git functionality, to leverage the familiarity most of us have with Git, but with features making it work well for both workflow and data management in the machine learning context.&lt;/p&gt;

&lt;p&gt;DVC (&lt;a href="https://dvc.org" rel="noopener noreferrer"&gt;https://dvc.org&lt;/a&gt;) takes on and solves a larger slice of the machine learning reproducibility problem than does Git-LFS or several other potential solutions. It does this by managing the code (scripts and programs), alongside large data files, in a hybrid between DVC and a source code management (SCM) system like Git. In addition DVC manages the workflow required for processing files used in machine learning experiments. The data files and commands-to-execute are described in DVC files which we'll learn about in the following sections. Finally, with DVC it is easy to store data on many storage systems from the local disk, to an SSH server, or to cloud systems (S3, GCP, etc). Data managed by DVC can be easily shared with others using this storage system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F800%2F0%2AGCgyqbXvSmbm-Njq" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F800%2F0%2AGCgyqbXvSmbm-Njq"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;DVC uses a command structure similar to Git's. As we see here, just like git push and git pull are used for sharing code and configuration with collaborators, dvc push and dvc pull are used for sharing data. All this is covered in more detail in the coming sections, or if you want to skip right to learning about DVC see the tutorial at &lt;a href="https://dvc.org/doc/tutorial" rel="noopener noreferrer"&gt;https://dvc.org/doc/tutorial&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  DVC remembers precisely which files were used at what point of time
&lt;/h1&gt;

&lt;p&gt;At the core of DVC is a data store (the DVC cache) optimized for storing and versioning large files. The team chooses which files to store in the SCM (like Git) and which to store in DVC. Files managed by DVC are stored such that DVC can maintain multiple versions of each file, and to use file-system links to quickly change which version of each file is being used.&lt;/p&gt;

&lt;p&gt;Conceptually the SCM (like Git) and DVC both have repositories holding multiple versions of each file. One can check out "version N" and the corresponding files will appear in the working directory, then later check out "version N+1" and the files will change around to match.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F800%2F0%2AT5hWTf7Zp-9qVjby" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F800%2F0%2AT5hWTf7Zp-9qVjby"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the DVC side, this is handled in the DVC cache. Files stored in the cache are indexed by a checksum (MD5 hash) of the content. As the individual files managed by DVC change, their checksum will of course change, and corresponding cache entries are created. The cache holds all instances of each file.&lt;/p&gt;

&lt;p&gt;For efficiency, DVC uses several linking methods (depending on file system support) to insert files into the workspace without copying. This way DVC can quickly update the working directory when requested.&lt;/p&gt;
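&lt;p&gt;The idea of “several linking methods” can be sketched as a fallback chain.  The ordering below (hard link, then symlink, then plain copy) is illustrative; reflinks are omitted because they require file-system-specific calls:&lt;/p&gt;

```python
import os, shutil, tempfile

def link_into_workspace(cache_entry, dest):
    """Place a cached file at `dest` without copying when possible:
    try a hard link, then a symlink, then fall back to a copy."""
    for method in (os.link, os.symlink):
        try:
            method(cache_entry, dest)
            return method.__name__
        except OSError:
            continue
    shutil.copy2(cache_entry, dest)
    return "copy"

tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "cached")
with open(src, "w") as f:
    f.write("data")
used = link_into_workspace(src, os.path.join(tmp, "workspace.bin"))
```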

&lt;p&gt;DVC uses what are called "DVC files" to describe both the data files and the workflow steps. Each workspace will have multiple DVC files, with each describing one or more data files with the corresponding checksum, and each describing a command to execute in the workflow.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;cmd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python src/prepare.py data/data.xml&lt;/span&gt;
&lt;span class="na"&gt;deps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;md5&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;b4801c88a83f3bf5024c19a942993a48&lt;/span&gt;
  &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;src/prepare.py&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;md5&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;a304afb96060aad90176268345e10355&lt;/span&gt;
  &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data/data.xml&lt;/span&gt;
&lt;span class="na"&gt;md5&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;c3a73109be6c186b9d72e714bcedaddb&lt;/span&gt;
&lt;span class="na"&gt;outs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cache&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;md5&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;6836f797f3924fb46fcfd6b9f6aa6416.dir&lt;/span&gt;
  &lt;span class="na"&gt;metric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
  &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data/prepared&lt;/span&gt;
&lt;span class="na"&gt;wdir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This example DVC file comes from the DVC Getting Started example (&lt;a href="https://github.com/iterative/example-get-started" rel="noopener noreferrer"&gt;https://github.com/iterative/example-get-started&lt;/a&gt;) and shows the initial step of a workflow. We'll talk more about workflows in the next section. For now, note that this command has two dependencies, &lt;code&gt;src/prepare.py&lt;/code&gt; and &lt;code&gt;data/data.xml&lt;/code&gt;, and an output data directory named &lt;code&gt;data/prepared&lt;/code&gt;. Everything has an MD5 hash, and as these files change their MD5 hashes change, and a new instance of each changed data file is stored in the DVC cache.&lt;/p&gt;

&lt;p&gt;DVC files are checked into the SCM-managed (Git) repository. As commits are made to the SCM repository, each DVC file is updated (if appropriate) with the new checksum of each file. Therefore with DVC one can recreate exactly the data set present at each commit, and the team can exactly recreate each development step of the project.&lt;/p&gt;

&lt;p&gt;DVC files are roughly similar to the "pointer" files used in Git-LFS.&lt;/p&gt;

&lt;p&gt;The DVC team recommends using a separate SCM tag or branch for each experiment. Accessing the data files, code, and configuration appropriate to that experiment is then as simple as switching branches. The SCM will update the code and configuration files, and DVC will update the data files, automatically.&lt;/p&gt;

&lt;p&gt;This means there is no more scratching your head trying to remember which data files were used for what experiment. DVC tracks all that for you.&lt;/p&gt;
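
&lt;p&gt;As a minimal sketch, assuming an experiment lives on a branch named &lt;code&gt;my-experiment&lt;/code&gt; (a hypothetical name), switching both code and data looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git checkout my-experiment
$ dvc checkout
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The "git checkout" updates the code, configuration, and DVC files; "dvc checkout" then links the matching data file instances from the DVC cache into the workspace.&lt;/p&gt;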

&lt;h1&gt;
  
  
  DVC remembers the exact sequence of commands used at each point in time
&lt;/h1&gt;

&lt;p&gt;The DVC files record not only the files used in a particular execution stage, but also the command that is executed in that stage.&lt;/p&gt;

&lt;p&gt;Reproducing a machine learning result requires not only the precise same data files, but also the same processing steps and the same code/configuration. Consider a typical step in creating a model: preparing sample data for use in later steps. You might have a Python script, &lt;code&gt;prepare.py&lt;/code&gt;, to perform that preparation, and input data in an XML file named &lt;code&gt;data/data.xml&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ dvc run -d data/data.xml -d code/prepare.py \
            -o data/prepared \
            python code/prepare.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is how we use DVC to record that processing step. The DVC "run" command creates a DVC file based on the command-line options.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;-d&lt;/code&gt; option defines dependencies, in this case an input file in XML format and a Python script. The &lt;code&gt;-o&lt;/code&gt; option records output files, in this case an output data directory. Finally, the executed command is a Python script. Hence we have input data, code and configuration, and output data, all dutifully recorded in the resulting DVC file, which corresponds to the DVC file shown in the previous section.&lt;/p&gt;

&lt;p&gt;If &lt;code&gt;prepare.py&lt;/code&gt; is changed from one commit to the next, the SCM will automatically track the change. Likewise any change to &lt;code&gt;data.xml&lt;/code&gt; results in a new instance in the DVC cache, which DVC will automatically track. DVC also tracks the output data directory, storing a new instance if its contents change.&lt;/p&gt;

&lt;p&gt;A DVC file can also simply refer to a file, like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;md5&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;99775a801a1553aae41358eafc2759a9&lt;/span&gt;
&lt;span class="na"&gt;outs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cache&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;md5&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ce68b98d82545628782c66192c96f2d2&lt;/span&gt;
  &lt;span class="na"&gt;metric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
  &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data/Posts.xml.zip&lt;/span&gt;
  &lt;span class="na"&gt;persist&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="na"&gt;wdir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;..&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This results from the "&lt;code&gt;dvc add file&lt;/code&gt;" command, which is used when you simply have a data file that is not the result of another command. For example, &lt;a href="https://dvc.org/doc/tutorial/define-ml-pipeline" rel="noopener noreferrer"&gt;https://dvc.org/doc/tutorial/define-ml-pipeline&lt;/a&gt; shows the following commands, which produce the immediately preceding DVC file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ wget -P data https://dvc.org/s3/so/100K/Posts.xml.zip
$ dvc add data/Posts.xml.zip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The file &lt;code&gt;Posts.xml.zip&lt;/code&gt; is then the data source for a sequence of steps shown in the tutorial that derive information from this data.&lt;/p&gt;

&lt;p&gt;Take a step back and recognize that these are individual steps in a larger workflow, or what DVC calls a pipeline. With "dvc add" and "dvc run" you can string together several stages, each created with a "dvc run" command and each described by a DVC file. For a complete working example, see &lt;a href="https://github.com/iterative/example-get-started" rel="noopener noreferrer"&gt;https://github.com/iterative/example-get-started&lt;/a&gt; and &lt;a href="https://dvc.org/doc/tutorial" rel="noopener noreferrer"&gt;https://dvc.org/doc/tutorial&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This means each working directory will have several DVC files, one for each stage in the project's pipeline. DVC scans the DVC files to build a Directed Acyclic Graph (DAG) of the commands required to reproduce the output(s) of the pipeline. Each stage is like a mini-Makefile in that DVC executes the command only if the dependencies have changed. Unlike Make, however, DVC does not rely on file-system timestamps; it checks whether the file content has changed, by comparing the checksum recorded in the DVC file against the current state of the file.&lt;/p&gt;
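
&lt;p&gt;Rather than rerunning each stage by hand, the whole pipeline can be reproduced with a single command. DVC walks the DAG and re-executes only the stages whose dependencies have actually changed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ dvc repro
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If nothing has changed, "dvc repro" does nothing; if one dependency changed, only the affected stages and their downstream stages are re-executed.&lt;/p&gt;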

&lt;p&gt;The bottom line: there is no more scratching your head trying to remember which version of which script was used for each experiment. DVC tracks all of that for you.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F800%2F0%2ACe5cFf326yK54vN1" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F800%2F0%2ACe5cFf326yK54vN1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  DVC makes it easy to share data and code between team members
&lt;/h1&gt;

&lt;p&gt;A machine learning researcher is probably working with colleagues and needs to share data, code, and configuration. Or the researcher may need to deploy data to remote systems, for example to run software on a cloud computing platform (AWS, GCP, etc.), which often means uploading data to the corresponding cloud storage service (S3, GCS, etc.).&lt;/p&gt;

&lt;p&gt;The code and configuration side of a DVC workspace is stored in the SCM (like Git). Using normal SCM commands (like "git clone") one can easily share it with colleagues. But how about sharing the data with colleagues?&lt;/p&gt;

&lt;p&gt;DVC has the concept of remote storage. A DVC workspace can push data to, or pull data from, remote storage. The remote storage pool can live on any of the cloud storage platforms (S3, GCS, etc.) as well as on an SSH server.&lt;/p&gt;

&lt;p&gt;Therefore, to share code, configuration, and data with a colleague, you first define a remote storage pool. The configuration file holding remote storage definitions is tracked by the SCM. Next you push the SCM repository to a shared server, which carries the DVC configuration file with it. When your colleague clones the repository, they can immediately pull the data from remote storage.&lt;/p&gt;
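
&lt;p&gt;As a sketch, assuming an S3 bucket named &lt;code&gt;mybucket&lt;/code&gt; and a remote named &lt;code&gt;myremote&lt;/code&gt; (both hypothetical names), the sharing side of the workflow looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ dvc remote add -d myremote s3://mybucket/dvc-storage
$ git add .dvc/config
$ git commit -m "configure DVC remote storage"
$ dvc push
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;After your colleague clones the repository, a single "&lt;code&gt;dvc pull&lt;/code&gt;" downloads the corresponding data files from the remote into their local cache and workspace.&lt;/p&gt;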

&lt;p&gt;This means your colleagues no longer have to scratch their head wondering how to run your code. They can easily replicate the exact steps, and the exact data, used to produce the results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F800%2F0%2A4FC6JgUem-yjSJ6O" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F800%2F0%2A4FC6JgUem-yjSJ6O"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;The key to repeatable results is good practice: proper versioning not only of the data but also of the code and configuration files, and automation of the processing steps. Successful projects sometimes require collaboration with colleagues, which is made easier through cloud storage systems. Some jobs require AI software running on cloud computing platforms, with data files stored on cloud storage platforms.&lt;/p&gt;

&lt;p&gt;With DVC a machine learning research team can ensure their data, configuration and code are in sync with each other. It is an easy-to-use system which efficiently manages shared data repositories alongside an SCM system (like Git) to store the configuration and code.&lt;/p&gt;

&lt;h1&gt;
  
  
  Resources
&lt;/h1&gt;

&lt;p&gt;Back in 2014 Jason Brownlee wrote a checklist he claimed would encourage reproducible machine learning results, by default: &lt;a href="https://machinelearningmastery.com/reproducible-machine-learning-results-by-default/" rel="noopener noreferrer"&gt;https://machinelearningmastery.com/reproducible-machine-learning-results-by-default/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A Practical Taxonomy of Reproducibility for Machine Learning Research, a research paper by staff of Kaggle and the University of Washington: &lt;a href="http://www.rctatman.com/files/2018-7-14-MLReproducability.pdf" rel="noopener noreferrer"&gt;http://www.rctatman.com/files/2018-7-14-MLReproducability.pdf&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A researcher at McGill University, Joelle Pineau, has another checklist for machine learning reproducibility: &lt;a href="https://www.cs.mcgill.ca/%7Ejpineau/ReproducibilityChecklist.pdf" rel="noopener noreferrer"&gt;https://www.cs.mcgill.ca/~jpineau/ReproducibilityChecklist.pdf&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;She made a presentation at the NeurIPS 2018 conference: &lt;a href="https://videoken.com/embed/jH0AgVcwIBc" rel="noopener noreferrer"&gt;https://videoken.com/embed/jH0AgVcwIBc&lt;/a&gt; (start at about 6 minutes)&lt;/p&gt;

&lt;p&gt;The Twelve-Factor App is a take on the reproducibility and reliability of web services: &lt;a href="https://12factor.net/" rel="noopener noreferrer"&gt;https://12factor.net/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A survey of scientists by the journal Nature found that over 50% of those surveyed agree there is a crisis in reproducing results: &lt;a href="https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970" rel="noopener noreferrer"&gt;https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970&lt;/a&gt;&lt;/p&gt;

</description>
      <category>git</category>
      <category>machinelearning</category>
      <category>ai</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
