<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Martin Daniel</title>
    <description>The latest articles on DEV Community by Martin Daniel (@martintali).</description>
    <link>https://dev.to/martintali</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F469533%2F2a743099-5573-49d2-83e9-6baa3bd61b0d.jpeg</url>
      <title>DEV Community: Martin Daniel</title>
      <link>https://dev.to/martintali</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/martintali"/>
    <language>en</language>
    <item>
      <title>Why Git is not enough for data science</title>
      <dc:creator>Martin Daniel</dc:creator>
      <pubDate>Wed, 05 May 2021 08:06:08 +0000</pubDate>
      <link>https://dev.to/martintali/why-git-is-not-enough-for-data-science-42gg</link>
      <guid>https://dev.to/martintali/why-git-is-not-enough-for-data-science-42gg</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; Git is used in almost every software development project to track code and file changes. Because it can track every change, Git has also seen a tremendous increase in adoption for data science projects. In this post, we discuss:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Benefits of Git for data science&lt;/li&gt;
&lt;li&gt;The gaps and limitations of Git&lt;/li&gt;
&lt;li&gt;Best practices for using Git for data science projects&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you are already familiar with Git, jump to the section “Is Git important to learn for data science?”&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Git and how does it work?
&lt;/h2&gt;

&lt;p&gt;“Git is a free and open-source distributed version control system designed to handle everything from small to very large projects with speed and efficiency.” - Git&lt;/p&gt;

&lt;p&gt;As the description states, Git is a version control system. It helps record, track, and save any change made to source code and to quickly and easily recover any previous state.&lt;br&gt;
Git uses a distributed version control model. This means that there can be many copies (or forks/remotes in the GitHub world) of the repository. When working locally, Git is the program that you will use to keep track of changes to your repository.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com"&gt;GitHub.com&lt;/a&gt; is a location on the internet that acts as a remote location for your repository. GitHub provides a backup of your work that can be retrieved if your local copy is lost (e.g., if your computer falls off a pier). GitHub also allows you to share your work and collaborate with others on a project.&lt;/p&gt;

&lt;p&gt;Similar tools to GitHub are &lt;a href="https://about.gitlab.com/"&gt;GitLab&lt;/a&gt; and &lt;a href="https://bitbucket.org/"&gt;Bitbucket&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mRPBp0IZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://git-scm.com/book/en/v2/book/01-introduction/images/distributed.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mRPBp0IZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://git-scm.com/book/en/v2/book/01-introduction/images/distributed.png" alt="DVCS"&gt;&lt;/a&gt;&lt;br&gt;
Source: Pro Git by Scott Chacon and Ben Straub.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Git and GitHub, GitLab, or Bitbucket help you work better
&lt;/h3&gt;

&lt;p&gt;There are several practical ways Git helps a development project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keep track of changes to your code locally using Git.&lt;/li&gt;
&lt;li&gt;Synchronize code between different versions (i.e. your own versions or others’ versions).&lt;/li&gt;
&lt;li&gt;Test changes to code without losing the original.&lt;/li&gt;
&lt;li&gt;Revert to an older version of code, if needed.&lt;/li&gt;
&lt;li&gt;Back up your files on the cloud (GitHub/GitLab/Bitbucket).&lt;/li&gt;
&lt;li&gt;Share your files on GitHub/GitLab/Bitbucket and collaborate with others.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By now it should be clear why Git is a powerful tool: it records, tracks, and stores any change made to almost any file in a project. It also helps a team work well together, which is one of the main reasons it is so widely used in software development projects.&lt;/p&gt;

&lt;p&gt;These benefits also apply to data science projects, where Git manages the code that supports the work, but they do not translate one-to-one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is Git important to learn for data science?
&lt;/h2&gt;

&lt;p&gt;If you want to join a data science project as a collaborator, you will face some challenges. For example, you will need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Review all ongoing research and repositories on a specific topic and pick the most promising one (if you are working on an open source project).&lt;/li&gt;
&lt;li&gt;Understand the current state of that project and how it has evolved over time.&lt;/li&gt;
&lt;li&gt;Identify which directions are promising and still worth exploring. Reviewing ideas and approaches that were tried and abandoned is also important here, since you don’t want to repeat work that someone already tried unsuccessfully. These failed approaches are usually undocumented and forgotten, which is a huge challenge.&lt;/li&gt;
&lt;li&gt;Collect all the pieces of the project (data, code, etc.), which might be spread out over multiple platforms and are sometimes not completely accessible.&lt;/li&gt;
&lt;li&gt;Contribute your results back to the project once you’ve made some improvement or explored a new direction, even though there is no easy way to do so.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To summarize, there are multiple challenges that today’s tools don’t handle gracefully.&lt;/p&gt;

&lt;p&gt;Now let’s see how Git can help us fill in some of the missing pieces.&lt;/p&gt;

&lt;p&gt;With Git’s ability to track every change made to our files, we can show all the directions taken, when they were taken, and by whom! You can read the entire Git history as a story, understand what was done at every step (commit), and get some documentation along the way (commit messages). You can also share it with other collaborators using one of the previously mentioned platforms.&lt;/p&gt;

&lt;p&gt;By using a more traditional software development workflow, you begin to treat your models more like an application and less as a script, which makes it easier to manage and leads to higher quality outcomes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Still, Git has limitations
&lt;/h3&gt;

&lt;p&gt;Although there are significant benefits from using version control tools like Git, they come with a high overhead cost.&lt;br&gt;
The overhead comes from the need to ensure every change goes through the "commit" process, which most often means using the command line and terminal. Since the terminal is unfamiliar to most analysts (and even data scientists), you don't just need to learn Git, you also need to learn the terminal! This is not quick, and having your efficiency suffer while struggling to remember what command to write is a huge turn-off. If that sounds like you, check out the blog post &lt;a href="https://dagshub.com/blog/effective-linux-bash-data-scientists/"&gt;Effective Linux &amp;amp; Bash for Data Scientist&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Also, Git can't do all the heavy lifting on its own. What do I mean by this?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Git can't support experiment tracking. Here is a nice post comparing some existing tools. &lt;a href="https://dagshub.com/blog/how-to-compare-ml-experiment-tracking-tools-to-fit-your-data-science-workflow/"&gt;ML experiment tracking tools that fit your data science workflow&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Git isn't built to track big files (datasets and models). You can find more information about this in the post &lt;a href="https://dagshub.com/blog/data-version-control-tools/"&gt;Comparing Data Version Control tools&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Best practices for structuring a Data Science project using Git
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Dpp0C34i--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.unsplash.com/photo-1489533119213-66a5cd877091%3Fixid%3DMnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8%26ixlib%3Drb-1.2.1%26auto%3Dformat%26fit%3Dcrop%26w%3D1951%26q%3D80" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Dpp0C34i--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.unsplash.com/photo-1489533119213-66a5cd877091%3Fixid%3DMnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8%26ixlib%3Drb-1.2.1%26auto%3Dformat%26fit%3Dcrop%26w%3D1951%26q%3D80" alt="Begin"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With all that said, I propose a way to integrate the Git mindset into your DS project. It’s composed of a few components: experiment tracking, version control, and treating data as source code.&lt;/p&gt;

&lt;h4&gt;
  
  
  Experiment tracking
&lt;/h4&gt;

&lt;p&gt;You can implement experiment tracking in one of two ways: with a dedicated tool or with Git. You can find more information in &lt;a href="https://dagshub.com/blog/how-to-compare-ml-experiment-tracking-tools-to-fit-your-data-science-workflow/"&gt;ML experiment tracking tools that fit your data science workflow&lt;/a&gt;.&lt;/p&gt;

&lt;h5&gt;
  
  
  External tracking
&lt;/h5&gt;

&lt;p&gt;With this approach, you log all your experiment information to an external system.&lt;/p&gt;

&lt;p&gt;This approach has some advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A lot of excellent tools have been developed.&lt;/li&gt;
&lt;li&gt;It’s an intuitive way to work. There is no need to stop and create a new commit in a Git project before taking a new direction.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But it also comes with disadvantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There is no clear connection between the code and the experiment results.&lt;/li&gt;
&lt;li&gt;It’s hard to review.&lt;/li&gt;
&lt;li&gt;Reproducibility. It’s not easy to reproduce what led to an experiment result.&lt;/li&gt;
&lt;/ul&gt;

&lt;h5&gt;
  
  
  Version Control with Git Tracking
&lt;/h5&gt;

&lt;p&gt;You treat each experiment as a Git commit. This means that any change to the project creates a new version, since code, data, and parameters are all part of the source code.&lt;/p&gt;

&lt;p&gt;Some advantages are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reproducing is easy: just do a git checkout and you have the code, parameters, data, and models.&lt;/li&gt;
&lt;li&gt;You get all the context related to an experiment.&lt;/li&gt;
&lt;li&gt;Collaboration. As mentioned throughout this article, Git in combination with one of the hosting platforms lets you parallelize work.&lt;/li&gt;
&lt;li&gt;If combined with data versioning tools, you can also accept data contributions.&lt;/li&gt;
&lt;li&gt;GitOps, CI/CD. It is easier to integrate with the existing Git ecosystem for CI/CD and PRs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It also has some disadvantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It can get messy: a lot of experiments means a lot of commits.&lt;/li&gt;
&lt;li&gt;It requires a change of mindset to start treating every new direction as a commit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of course, you can do a mix of both ideas.&lt;/p&gt;

&lt;h4&gt;
  
  
  Data as source code
&lt;/h4&gt;

&lt;p&gt;As I mentioned before, Git was developed to track changes in text files, not large binary files, so tracking a project's dataset directly in Git is not a good option. Instead, I recommend one of two options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;For a dataset that doesn't change, you can upload it to a server and access it through a URL.&lt;/li&gt;
&lt;li&gt;For a dataset that might change, you should consider versioning it with a dedicated tool. You can find a great comparison here: &lt;a href="https://dagshub.com/blog/data-version-control-tools/"&gt;Comparing Data Version Control tools.&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You can also find more information about why it is a good idea to version your dataset in this blog post: &lt;a href="https://dagshub.com/blog/datasets-should-behave-like-git-repositories/"&gt;Datasets should behave like Git repositories&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Implementing these suggested practices for Git offers several benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All your project files, data, and models are consolidated in one place.&lt;/li&gt;
&lt;li&gt;Review tools make it easier to contribute to an ongoing project and to check those contributions.&lt;/li&gt;
&lt;li&gt;It is easier to reproduce and reuse work from previous projects.&lt;/li&gt;
&lt;li&gt;CI/CD: if you are happy with the contributions that were made, you can have an automatic way to merge them, take the code and the data, test them, and ship them to production.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Day by day, Git is being used in more data science projects. I hope this article has given you a better understanding of its strengths and limitations and of how you can use it with your colleagues. Good luck!&lt;/p&gt;

</description>
      <category>git</category>
      <category>datascience</category>
    </item>
    <item>
      <title>DAGsHub Storage: configure a DVC remote without a DevOps degree</title>
      <dc:creator>Martin Daniel</dc:creator>
      <pubDate>Sun, 17 Jan 2021 14:01:58 +0000</pubDate>
      <link>https://dev.to/martintali/dagshub-storage-configure-a-dvc-remote-without-a-devops-degree-ddi</link>
      <guid>https://dev.to/martintali/dagshub-storage-configure-a-dvc-remote-without-a-devops-degree-ddi</guid>
      <description>&lt;p&gt;DVC is a great tool; it lets you track and share your data, models, and experiments. It also supports pipelines to version control the steps in a typical ML workflow. To share your data and models, you will need to configure a DVC remote (such as S3, GCloud, GDrive, etc.), but doing so can be a hassle and take a tremendous amount of time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--b-YZVohH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.unsplash.com/photo-1494059980473-813e73ee784b%3Fixlib%3Drb-1.2.1%26q%3D85%26fm%3Djpg%26crop%3Dentropy%26cs%3Dsrgb" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--b-YZVohH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.unsplash.com/photo-1494059980473-813e73ee784b%3Fixlib%3Drb-1.2.1%26q%3D85%26fm%3Djpg%26crop%3Dentropy%26cs%3Dsrgb" alt="https://images.unsplash.com/photo-1494059980473-813e73ee784b?ixlib=rb-1.2.1&amp;amp;q=85&amp;amp;fm=jpg&amp;amp;crop=entropy&amp;amp;cs=srgb"&gt;&lt;/a&gt;&lt;/p&gt;
Too many things to order... Photo by &lt;a href="https://unsplash.com/@sloppyperfectionist"&gt;Hans-Peter Gauster&lt;/a&gt; on &lt;a href="https://unsplash.com/"&gt;Unsplash&lt;/a&gt;




&lt;p&gt;In this post, I'll show you that this configuration doesn't have to be so difficult; it should be smooth and easy. To solve this issue, we created &lt;strong&gt;DAGsHub Storage&lt;/strong&gt;, a DVC remote that is super easy to configure: no credit cards, no complex permissions to grant, no cloud setup. Just five commands and you are ready to go!&lt;/p&gt;

&lt;p&gt;To start, you will need to have a project on DAGsHub. There are two ways to do this: either &lt;a href="https://dagshub.com/repo/create"&gt;create one from scratch&lt;/a&gt; or &lt;a href="https://dagshub.com/repo/connect"&gt;connect an existing project&lt;/a&gt; from any other platform (we support GitHub, GitLab, and Bitbucket).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;If you need, we have a &lt;a href="https://dagshub.com/docs/experiment-tutorial/overview/"&gt;tutorial&lt;/a&gt; on how to start a new project on our platform.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In order to continue with this tutorial you will need to &lt;a href="https://dvc.org/doc/install"&gt;install DVC&lt;/a&gt; first.&lt;/p&gt;

&lt;p&gt;After DVC is installed, in a &lt;strong&gt;Git project&lt;/strong&gt;, initialize it by running&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dvc init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command will create &lt;code&gt;.dvc/.gitignore&lt;/code&gt;, &lt;code&gt;.dvc/config&lt;/code&gt;, &lt;code&gt;.dvc/plots&lt;/code&gt;, and &lt;code&gt;.dvcignore&lt;/code&gt;. These entries can be committed with&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"Initialize DVC"&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the purpose of this tutorial I've created a new project with the following structure&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;data
├── processed
│   ├── test_text.txt
│   └── train_text.txt
└── raw
    └── test_full.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To start tracking our data, either a file or a directory, we use &lt;code&gt;dvc add&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dvc add data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Here is where DVC does its magic.&lt;/em&gt; It stores metadata about the added entry in a &lt;code&gt;.dvc&lt;/code&gt; file: a small text file that describes how to access the original entry but does not contain the entry itself. The command also adds the entry to the &lt;code&gt;.gitignore&lt;/code&gt; file, so we won't commit it by accident.&lt;/p&gt;
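
&lt;p&gt;As a rough illustration of that ignore entry, and assuming DVC writes it to the &lt;code&gt;.gitignore&lt;/code&gt; in the same directory as the tracked path, the file in our project root would now contain a line like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The leading slash anchors the pattern to this directory, so only the top-level &lt;code&gt;data&lt;/code&gt; folder is ignored.&lt;/p&gt;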

&lt;p&gt;In our case DVC created a file called &lt;code&gt;data.dvc&lt;/code&gt;, which will look like this&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;outs:
- md5: 61b3e1a6439d6770be4d210b758f6cbd.dir
  size: 0
  nfiles: 3
  path: data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the file that will be versioned by Git.&lt;/p&gt;

&lt;p&gt;Following this step we are ready to commit the &lt;code&gt;.dvc&lt;/code&gt; file as we would do with any source code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git add data.dvc .gitignore
git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"Add data"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Storing the data remotely
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9mzpLD6b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.unsplash.com/photo-1565889673174-ee7391b93d23%3Fixlib%3Drb-1.2.1%26q%3D85%26fm%3Djpg%26crop%3Dentropy%26cs%3Dsrgb" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9mzpLD6b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.unsplash.com/photo-1565889673174-ee7391b93d23%3Fixlib%3Drb-1.2.1%26q%3D85%26fm%3Djpg%26crop%3Dentropy%26cs%3Dsrgb" alt="https://images.unsplash.com/photo-1565889673174-ee7391b93d23?ixlib=rb-1.2.1&amp;amp;q=85&amp;amp;fm=jpg&amp;amp;crop=entropy&amp;amp;cs=srgb"&gt;&lt;/a&gt;&lt;/p&gt;
Configuring a bucket shouldn't be so hard! Photo by &lt;a href="https://unsplash.com/@jdjohnston"&gt;Jessica Johnston&lt;/a&gt; on &lt;a href="https://unsplash.com"&gt;Unsplash&lt;/a&gt;



&lt;p&gt;Excellent! We are now tracking the versions of our data, and now we have to figure out where to store the data itself.&lt;/p&gt;

&lt;p&gt;As I mentioned before, I will show you how to effortlessly configure a DVC remote. With five simple commands, you will be pushing your data and models alongside your code. For comparison, I'll also show you the traditional way to set up remotes, so you can see how much time DAGsHub Storage saves.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to do it without a DevOps degree
&lt;/h3&gt;

&lt;p&gt;At DAGsHub, we automatically create a DVC remote with every project on the platform, so you can push your data and models to it just as you push your code to a Git remote. This is where the simplicity starts showing! To push or pull data from this URL, we will use our existing DAGsHub credentials (via HTTPS basic authentication). This means we don't need to configure any IAM, provide access tokens for a bucket, or set up anything else related to a cloud provider.&lt;/p&gt;

&lt;p&gt;Public repositories will have publicly readable data, same as the code. If you want to share data with or receive data from a collaborator, add them as a project collaborator. If your repository is private, only maintainers will be able to pull or push data to it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cT-lWrva--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/5bbby5aphzgbuq9mk368.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cT-lWrva--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/5bbby5aphzgbuq9mk368.png" alt="Collab"&gt;&lt;/a&gt;&lt;/p&gt;
Basically, if you can clone the code, you can pull the data!



&lt;p&gt;Let's get our hands dirty!&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We need to add DAGsHub as our DVC remote
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dvc remote add origin &lt;span class="nt"&gt;--local&lt;/span&gt; https://dagshub.com/&amp;lt;username&amp;gt;/&amp;lt;repo_name&amp;gt;.dvc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Next we need to tell DVC how to ask for our credentials
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dvc remote modify origin &lt;span class="nt"&gt;--local&lt;/span&gt; auth basic
dvc remote modify origin &lt;span class="nt"&gt;--local&lt;/span&gt; user &amp;lt;username&amp;gt;
dvc remote modify origin &lt;span class="nt"&gt;--local&lt;/span&gt; ask_password &lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
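
&lt;p&gt;For reference, here is roughly what the remote URL and these three settings look like once written to disk. This is only a sketch: it assumes DVC's INI-style config layout and that the &lt;code&gt;--local&lt;/code&gt; flag writes to &lt;code&gt;.dvc/config.local&lt;/code&gt;, a file that is kept out of Git. The placeholders are the same ones used in the commands above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;['remote "origin"']
    url = https://dagshub.com/&amp;lt;username&amp;gt;/&amp;lt;repo_name&amp;gt;.dvc
    auth = basic
    user = &amp;lt;username&amp;gt;
    ask_password = true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because this file is local to your machine, your username never ends up in the repository, and each collaborator would run the same commands with their own username.&lt;/p&gt;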



&lt;ol start="3"&gt;
&lt;li&gt;And finally, push the data to the new remote
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Make sure you are using DVC 1.10 or greater for the next command&lt;/span&gt;
dvc push &lt;span class="nt"&gt;-r&lt;/span&gt; origin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that's it! With just 5 commands, you configured your DVC remote effortlessly. We never opened a cloud provider webpage, handled complicated IAM, or provided credit card information.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mm-Q2fRO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/a3j1qn0t1wzjlypa9qep.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mm-Q2fRO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/a3j1qn0t1wzjlypa9qep.jpg" alt="Easy peasy lemon squeezy"&gt;&lt;/a&gt;&lt;/p&gt;
Easy peasy lemon squeezy



&lt;p&gt;&lt;strong&gt;&lt;em&gt;If you need more information about DAGsHub Storage, you can read our &lt;a href="https://dagshub.com/docs/reference/onboard_storage/"&gt;Feature Reference&lt;/a&gt;&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How to do it WITH a DevOps degree – A Comparison
&lt;/h3&gt;

&lt;p&gt;Before we dig into this section, note that DAGsHub currently supports &lt;a href="https://aws.amazon.com/s3/"&gt;AWS S3&lt;/a&gt; and &lt;a href="https://cloud.google.com/products/storage/"&gt;GCS&lt;/a&gt; in addition to DAGsHub Storage.&lt;/p&gt;

&lt;p&gt;For the sake of this comparison, let's see how to do it for Amazon S3.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Sign up for AWS as your cloud provider. &lt;strong&gt;This involves taking out your credit card&lt;/strong&gt; (if you already have an account, you can skip this step).&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html"&gt;Set up a bucket&lt;/a&gt; to store your data&lt;/li&gt;
&lt;li&gt;Install the &lt;a href="https://aws.amazon.com/cli/"&gt;AWS CLI tool&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Log in to AWS using the CLI tool&lt;/li&gt;
&lt;li&gt;If the user who is going to use the bucket is not an admin, &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users_create.html"&gt;create an IAM user&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/user-guide/add-bucket-policy.html"&gt;Assign it the correct permissions&lt;/a&gt; to use the bucket. Replace &lt;code&gt;&amp;lt;IAM-user-ARN&amp;gt;&lt;/code&gt; with the ARN of your user (e.g. &lt;code&gt;arn:aws:iam::7777777:user/dags-lover&lt;/code&gt;) and &lt;code&gt;&amp;lt;bucket-name&amp;gt;&lt;/code&gt; with the name of your bucket.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"Version"&lt;/span&gt;: &lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;,
    &lt;span class="s2"&gt;"Statement"&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;
        &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="s2"&gt;"Effect"&lt;/span&gt;: &lt;span class="s2"&gt;"Allow"&lt;/span&gt;,
            &lt;span class="s2"&gt;"Principal"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="s2"&gt;"AWS"&lt;/span&gt;: &lt;span class="s2"&gt;"&amp;lt;IAM-user-ARN&amp;gt;"&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;,
            &lt;span class="s2"&gt;"Action"&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;
                &lt;span class="s2"&gt;"s3:GetObject"&lt;/span&gt;,
                &lt;span class="s2"&gt;"s3:PutObject"&lt;/span&gt;,
                &lt;span class="s2"&gt;"s3:ListBucket"&lt;/span&gt;
            &lt;span class="o"&gt;]&lt;/span&gt;,
            &lt;span class="s2"&gt;"Resource"&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;
                &lt;span class="s2"&gt;"arn:aws:s3:::&amp;lt;bucket-name&amp;gt;/*"&lt;/span&gt;,
                &lt;span class="s2"&gt;"arn:aws:s3:::&amp;lt;bucket-name&amp;gt;"&lt;/span&gt;
            &lt;span class="o"&gt;]&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A lot of things, right? All these steps are error-prone even for the most experienced users, so if you are doing this for the first time, expect to miss something.&lt;/p&gt;

&lt;p&gt;It doesn't end there. If you want to integrate DAGsHub, you will need to add a Storage Key to your project settings so we will be able to list, show, and diff your files on our file viewer.&lt;/p&gt;

&lt;p&gt;You will find this settings page at &lt;a href="https://dagshub.com/"&gt;https://dagshub.com/&lt;/a&gt;&amp;lt;username&amp;gt;/&amp;lt;repo_name&amp;gt;/settings/storage/keys&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LbnM8n3H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ry5fa97m5wh4psep1ht1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LbnM8n3H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ry5fa97m5wh4psep1ht1.png" alt="Storage keys"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you enter your bucket URL, you will receive all the instructions to add the storage key.&lt;/p&gt;

&lt;p&gt;Keep up! We haven't finished yet! Now you will need to install the S3 package for DVC&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"dvc[s3]"&lt;/span&gt;
&lt;span class="c"&gt;# Or, if you are using Poetry&lt;/span&gt;
poetry add dvc &lt;span class="nt"&gt;--extras&lt;/span&gt; &lt;span class="s2"&gt;"s3"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Following this, we will need to add the bucket as our remote&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dvc remote add s3-remote s3://your-bucket/storage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
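
&lt;p&gt;Unlike the DAGsHub Storage commands, this one has no &lt;code&gt;--local&lt;/code&gt; flag, so the setting should land in the shared &lt;code&gt;.dvc/config&lt;/code&gt; file that Git tracks. A sketch of the resulting entry, assuming DVC's INI-style config layout:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;['remote "s3-remote"']
    url = s3://your-bucket/storage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;No credentials are stored in this entry; DVC is expected to pick them up from the AWS CLI login in the earlier step.&lt;/p&gt;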



&lt;p&gt;And finally, we push our data&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dvc push &lt;span class="nt"&gt;-r&lt;/span&gt; s3-remote
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7xw2Jj_F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.unsplash.com/photo-1495427513693-3f40da04b3fd%3Fixlib%3Drb-1.2.1%26q%3D85%26fm%3Djpg%26crop%3Dentropy%26cs%3Dsrgb" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7xw2Jj_F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.unsplash.com/photo-1495427513693-3f40da04b3fd%3Fixlib%3Drb-1.2.1%26q%3D85%26fm%3Djpg%26crop%3Dentropy%26cs%3Dsrgb" alt="https://images.unsplash.com/photo-1495427513693-3f40da04b3fd?ixlib=rb-1.2.1&amp;amp;q=85&amp;amp;fm=jpg&amp;amp;crop=entropy&amp;amp;cs=srgb"&gt;&lt;/a&gt;&lt;/p&gt;
DAGsHub storage to the rescue!  Photo by &lt;a href="https://unsplash.com/@nikkotations"&gt;Nikko Macaspac&lt;/a&gt; on &lt;a href="https://unsplash.com"&gt;Unsplash&lt;/a&gt;



&lt;h2&gt;
  
  
  Learn more
&lt;/h2&gt;

&lt;p&gt;I hope this helped you understand how to set up a DVC remote (an easy way and a hard way). For more information about DAGsHub, check out our &lt;a href="https://dagshub.com"&gt;website&lt;/a&gt;, &lt;a href="https://dagshub.com/docs/"&gt;documentation&lt;/a&gt;, or join our &lt;a href="https://discord.com/invite/9gU36Y6"&gt;Discord community&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>dvc</category>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
