Peter Wainaina

Posted on • Originally published at wainainapierre.hashnode.dev

Comprehensive Guide to GitHub for Data Scientists.

This article is an in-depth guide to Git and GitHub. You will get to know what exactly Git and GitHub are and how you can leverage them to make your data science projects easier to track. As a data scientist, you need to have a solid grasp of these tools.

As a data scientist, you will collaborate with other data scientists on projects, and there will be times when someone has to update part of the code. This is where Git and GitHub come in handy: they create a better workflow in which any collaborator can make their changes available to everyone else without having to be in the same room, country or even time zone. And if you make a mistake, you can always roll back to a previous version.

GitHub lets you create a remote project and have all your team members work on different features in parallel, yet independently, and still end the day with stable, running code.

What is the difference between Git & GitHub?

**Git** is a distributed Version Control System (VCS) that lets you keep track of all the modifications you make to your code. Being distributed means that everyone collaborating on a project has the full history of changes on their local machine. This lets people work on different features without having to communicate with the server hosting the remote version of the project, and merge their changes with the remote copy when they are ready.

GitHub is a hosting platform for Git repositories. It uses Git at its core and hosts the remote version of your project, where everyone collaborating can access it.

Terminologies that you should be familiar with as we start:

  1. Repository – This is sort of a "Database" for all the branches and commits of a particular project.

  2. Branch – It’s an alternative state or line of development for a repository.

  3. Merge – This is bringing together multiple branches into a single branch.

  4. Clone – This is creating a local copy of a remote repository on your machine.

  5. Origin – Refers to the remote repository from which the local clone was cloned.

  6. Master/Main – This is the default branch of a repository, the one all other branches usually start from and merge back into.

  7. Stage – Choosing the files that will be part of a new commit you intend to make.

  8. Commit – A saved snapshot of staged changes made to the file(s) in the repository.

  9. HEAD – It’s the commit your local repository is currently on.

  10. Push – This is the act of sending your changes to the remote repository for everyone you may be collaborating with to see.

  11. Pull – It’s the act of getting everybody else's pushed changes into your local repository.

  12. Pull Request – This is a mechanism to review and approve the changes you have made before merging to the main/master branch in the remote repository.

Basic commands that you should be familiar with:

git init - Create a new repository on your local computer.

git clone - Copy an existing remote repository to your computer so you can start working on it.
git clone <repository-url>

git add - Choose file(s) to be saved (staging).
git add <file> (adding a single file)
git add -A (adding everything at once)

git status - Show which files you have changed.
git status

git commit - Save a snapshot (commit) of the chosen file(s).
git commit -m "<message>"

git push - Send your saved snapshots (commits) to the remote repository.
git push origin <branch>

git pull - Pull recent commits made by others into your local repository.
git pull origin <branch>

git branch - List, create or delete branches.
git branch <branch-name>

git checkout - Switch branches or undo changes made to local file(s).
git checkout <branch-name>

git merge - Merge another branch into the current one.
git merge <branch-name> -m "<message>"

Step-by-step procedure of how to Create and Clone a Repository.

This walkthrough shows how to install Git on Windows and create a repository to which you will commit changes.

Step 1: Create an Account and Install Git

Create a GitHub account if you do not already have one. Then go to the Git website (git-scm.com) and install the latest version for your operating system. Once the installation is done, launch Git Bash and run git --version to check the installed version.

Step 2: Initializing a new Repository

Create a new folder/directory using the mkdir command and navigate into it using the cd command. For this walkthrough, my local directory is named “myproject1”.

Use the git init command to initialize the directory as a repository. To check that all is well so far, create a file with the .txt extension inside “myproject1”, write something in it such as ‘My first project is up and running’, and save it. Then go back to Git Bash and run git status to check the status of the folder. The new file is listed as untracked.
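The steps above can be sketched as a short terminal session. This is a minimal illustration rather than the exact session from the walkthrough: it runs in a temporary directory, and the file name notes.txt is an assumption chosen for the example.

```shell
set -eu
# Work in a throwaway directory so nothing else on your machine is touched.
workdir="$(mktemp -d)"
cd "$workdir"

mkdir myproject1          # create the project folder
cd myproject1             # move into it
git init -q               # turn the folder into a Git repository

# Create the text file described above (the file name is hypothetical).
echo "My first project is up and running" > notes.txt

# --short prints one line per file; "??" marks an untracked file.
git status --short
```

Plain git status prints the same information in a longer, more descriptive form.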

Step 3: Configuring Git

The git config command lets you set configuration values that control how Git looks and operates. Git uses these values wherever you want something other than its default behavior. With git config --global you can set values that apply to all your repositories, for example the user name and email recorded in your commits, and you can verify them with git config --list.
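As a rough sketch, setting and checking the user name and email might look like the following. The name and email are placeholders, and the example deliberately uses --local so it only affects a scratch repository; swapping in --global would apply the same values to every repository on the machine.

```shell
set -eu
cd "$(mktemp -d)"
git init -q

# Set the identity for this repository only (placeholder values).
# Use --global instead of --local to apply it to all your repositories.
git config --local user.name "Jane Doe"
git config --local user.email "jane@example.com"

# List every setting Git can see, filtered to the user.* keys.
git config --list | grep '^user\.'

# Read back a single value.
git config user.name
```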

Step 4: Commit Files in Git

At this point, the file we created is untracked. The git add command copies a file from the working directory to the staging area. The git commit command then takes a snapshot of the staging area and records it as a commit, and the -m “message” option attaches a message describing the change. Git identifies each commit by a unique hash. Committing regularly is what keeps track of the changes you make.
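A minimal sketch of staging and committing a file is shown below. The file name, commit message and identity are assumptions made for the example, and it runs in a temporary repository.

```shell
set -eu
cd "$(mktemp -d)"
git init -q
git config user.name "Demo User"           # placeholder identity for the demo
git config user.email "demo@example.com"

echo "My first project is up and running" > notes.txt

git add notes.txt                          # stage the file
git status --short                         # "A  notes.txt": staged, not yet committed
git commit -q -m "Add project notes"       # snapshot the staging area

git rev-parse HEAD                         # the full hash that identifies the new commit
git status --short                         # prints nothing: the working tree is clean
```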

Step 5: Viewing Logs

The git log command shows the commit history of a project: who changed what, and when. This is especially useful when you have collaborated with different people on the same repository.
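One way to try this out is to build a tiny history and inspect it. The sketch below assumes a scratch repository with two illustrative commits; the file name and messages are made up for the example.

```shell
set -eu
cd "$(mktemp -d)"
git init -q
git config user.name "Demo User"           # placeholder identity for the demo
git config user.email "demo@example.com"

# Make two small commits so there is some history to look at.
echo "first line" > notes.txt
git add notes.txt
git commit -q -m "Add notes"
echo "second line" >> notes.txt
git commit -q -am "Extend notes"

git log --oneline          # one line per commit, newest first
git log -1 --stat          # the latest commit plus the files it changed
```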

Step 6: Uploading to a Remote Repository using Git

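Create a new repository on GitHub and give it a name and a description. Leave it empty (no README), otherwise the push in step 7 will be rejected because the remote already contains a commit that your local repository does not have.

Add a file to a local folder and run the commands below in the exact sequence shown:

  1. cd <path-to-your-folder>

  2. git init

  3. git remote add origin <repository-url>

  4. git remote -v

  5. git add . (take note of the full stop/period)

  6. git commit -m "your message"

  7. git push origin master

The file will then appear in the GitHub repository you just created. If your local branch is named main rather than master, push main instead. GitHub names the default branch of new repositories main.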
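The sequence can be rehearsed offline before touching GitHub. In this sketch a local bare repository stands in for the empty GitHub repository, which is an assumption made so the example runs without a network connection; with GitHub you would pass the repository URL instead of a local path.

```shell
set -eu
base="$(mktemp -d)"

# Stand-in for the empty repository you would create on GitHub (assumption).
git init -q --bare "$base/remote.git"

mkdir "$base/myproject1"
cd "$base/myproject1"
git init -q
git config user.name "Demo User"           # placeholder identity for the demo
git config user.email "demo@example.com"

echo "My first project is up and running" > notes.txt
git remote add origin "$base/remote.git"   # on GitHub this would be the repository URL
git remote -v                              # confirm where origin points
git add .
git commit -q -m "your message"
git branch -M master                       # make sure the branch is named master
git push -q origin master

# The remote now holds the same commit as the local branch.
git ls-remote origin master
```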

Step 7: Adding Git Remote to Your Repository

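The git remote command manages the connections between your local repository and remote repositories, which is what lets you share code. When you clone a project from a remote server to your local computer, Git automatically creates a connection named “origin” that points back to the repository you cloned from. For a repository you created locally with git init, you add that connection yourself.

We use the command git remote add origin <repository-url>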
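A small sketch of adding and inspecting a remote follows. The URL is hypothetical, and nothing is contacted over the network because adding a remote only records the address.

```shell
set -eu
cd "$(mktemp -d)"
git init -q

git remote                                  # prints nothing: no remotes yet

# Hypothetical URL; substitute the address of your own repository.
git remote add origin "https://github.com/your-username/myproject1.git"

git remote -v                               # lists origin for both fetch and push
git remote get-url origin                   # prints just the URL
```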

Step 8: Push using Git

The git push command uploads your local commits to a remote repository. After you have made and committed your modifications, you push so that the changes are shared with the team members you are collaborating with.

The command is git push origin master
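To see how a push reaches a collaborator, here is a rough round trip. It assumes a local bare repository standing in for GitHub and a second directory standing in for a teammate's machine; the directory and file names are illustrative only.

```shell
set -eu
base="$(mktemp -d)"
git init -q --bare "$base/remote.git"       # stand-in for the GitHub repository

# Your copy: commit and push.
git init -q "$base/yours"
cd "$base/yours"
git config user.name "Demo User"            # placeholder identity for the demo
git config user.email "demo@example.com"
echo "v1" > analysis.txt
git add analysis.txt
git commit -q -m "Add analysis"
git branch -M master
git remote add origin "$base/remote.git"
git push -q origin master

# A teammate's copy: start empty, then pull your commit.
git init -q "$base/teammate"
cd "$base/teammate"
git remote add origin "$base/remote.git"
git pull -q origin master
cat analysis.txt                            # prints: v1
```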

Step 9: Cloning a GitHub Repository

Cloning a repository copies it from GitHub to your local machine. The clone contains every version of every file and folder in the project, not just the latest one. The command is git clone <repository-url>.
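As a sketch, the example below clones a small local repository that stands in for one hosted on GitHub (an assumption made to keep the example offline). With GitHub you would pass the repository URL to git clone instead of a path.

```shell
set -eu
base="$(mktemp -d)"

# Build a small source repository to stand in for one hosted on GitHub.
git init -q "$base/source"
cd "$base/source"
git config user.name "Demo User"            # placeholder identity for the demo
git config user.email "demo@example.com"
echo "hello" > README.md
git add README.md
git commit -q -m "Initial commit"

# Clone it into a new folder named copy.
cd "$base"
git clone -q "$base/source" copy
cd copy

ls                       # README.md
git log --oneline        # the history came along with the files
git remote -v            # origin points back at the source
```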

Step 10: Branching and Merging

Branching lets you take the production code and fix a bug or add a feature without modifying the existing version. On a branch you work with a copy of the code: you make changes, build and test them, and then merge them into the main branch.

To create a new branch, use the command git branch <name of branch>. The full cycle is:

Step 1: Create branch -> git branch <branch-name>

Step 2: Check out branch -> git checkout <branch-name>

Step 3: Merge new branch into master branch -> git checkout master, then git merge <branch-name>
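The three steps can be sketched end to end as follows. The branch name feature-x and the file model.txt are assumptions made for the example. Note that the merge is run from master, the branch that receives the changes.

```shell
set -eu
cd "$(mktemp -d)"
git init -q
git config user.name "Demo User"     # placeholder identity for the demo
git config user.email "demo@example.com"

echo "base" > model.txt
git add model.txt
git commit -q -m "Initial commit"
git branch -M master                 # name the starting branch master

git branch feature-x                 # 1. create the branch
git checkout -q feature-x            # 2. switch to it
echo "new feature" >> model.txt
git commit -q -am "Add feature"

git checkout -q master               # go back to the branch that receives the change
git merge -q feature-x               # 3. merge feature-x into master (a fast-forward here)

git log --oneline                    # master now contains "Add feature"
git branch -d feature-x              # optional: delete the merged branch
```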

Step 11: Pull Requests

A pull request tells collaborators about the changes you have pushed to a branch of a repository. Once a pull request is opened, you can discuss and review the proposed changes with collaborators and push follow-up commits in response to their feedback before the branch is merged.

Step 12: Forking and Contributing to the world

Forking is how you contribute to or build on someone else’s project: it creates a copy of the original repository under your own GitHub account. You can make changes or improvements on that copy and then propose them to the original project through pull requests, which its maintainers can merge. This is the basis of open-source contribution.

  • Open any public repository and click on the Fork button to fork it.
  • You can keep the same repository name and click on Create Fork.
  • Once you fork, you will see a copy of the original repository in your account.
  • Clone your fork, make your changes in the code, then commit and push them back to your fork.
  • Committing takes the snapshot of the changes and pushing uploads it. From the fork you can then open a pull request against the original repository.

This is how you contribute to open-source projects and public repositories.
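The Git side of this workflow can be rehearsed locally. In the sketch below, two bare repositories stand in for the original project and for your fork on GitHub; that setup, the remote name upstream and the branch name fix-typo are all assumptions made for illustration. Opening the pull request itself happens on GitHub and is not shown.

```shell
set -eu
base="$(mktemp -d)"

# Stand-in for the original public project.
git init -q "$base/seed"
cd "$base/seed"
git config user.name "Maintainer"              # placeholder identity
git config user.email "maintainer@example.com"
echo "original" > README.md
git add README.md
git commit -q -m "Initial commit"
git branch -M master
git clone -q --bare "$base/seed" "$base/upstream.git"

# Forking on GitHub gives you your own server-side copy.
git clone -q --bare "$base/upstream.git" "$base/fork.git"

# Clone your fork, make a change on a branch, and push it to the fork.
git clone -q "$base/fork.git" "$base/work"
cd "$base/work"
git config user.name "Contributor"             # placeholder identity
git config user.email "contributor@example.com"
git remote add upstream "$base/upstream.git"   # lets you fetch the original later
git checkout -q -b fix-typo
echo "improved" >> README.md
git commit -q -am "Improve README"
git push -q origin fix-typo

git ls-remote "$base/fork.git" fix-typo        # the branch now exists on the fork
git ls-remote "$base/upstream.git" fix-typo    # prints nothing: upstream is unchanged
```

The original project stays untouched until its maintainers accept the pull request, which is what makes forking safe for both sides.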

Conclusion

As a data scientist, you need solid knowledge of version control tools like Git and GitHub to maintain and review changes in collaborative and personal projects.

The key takeaways from this article are the basic Git commands and the step-by-step procedure for creating and cloning a repository.
