Unit 01 Lab 1: Git
Part 1: Overview
- Know what git is
- Demonstrate use of git
- Understand how to clone, update and reset a git repository
What is git?
Git is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency.
What is a git repository?
A repository is simply a place where the history of your work is stored. It often lives in a .git subdirectory of your working copy - a copy of the most recent state of the files you’re working on.
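To make the .git subdirectory concrete, here is a minimal sketch that creates a throwaway repository in a temporary directory and asks Git where the history is stored. The variable names demo_dir and git_dir are made up for illustration and are not part of the lab.

```shell
# Create a scratch directory so nothing in your home directory is touched.
# (demo_dir is a hypothetical name used only in this sketch.)
demo_dir="$(mktemp -d)"
cd "$demo_dir"

# Turn the empty directory into a Git repository.
git init -q

# Ask Git where the repository (the history) lives, relative to the working copy.
git_dir="$(git rev-parse --git-dir)"
echo "$git_dir"

# List everything, including hidden entries, to see the .git subdirectory
# sitting next to your working files.
ls -a
```

Everything outside .git is your working copy; everything inside it is the stored history.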
Git in IST 718
We will use Git mostly to give everyone easy access to the correct versions of the datasets on any machine. The repository used in this lab is a public repository, stored on the GitHub website [https://github.com]. Anyone with the repository’s URL can access it.
Before You Begin
Before you start this activity:
- Make sure your Hadoop-Client is running as specified in the previous lab.
Part 2: Walk-Through
Cloning an existing Git repository
You can get a Git project using two main approaches. The first clones an existing Git repository from another server. The second takes an existing project or directory and imports it into Git.
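This lab only uses the first approach. For comparison, the second approach (importing an existing directory) might look like the sketch below. The project directory, the file sample.csv, and the identity values are placeholders invented for this example.

```shell
# A scratch directory standing in for an existing project (hypothetical name).
project_dir="$(mktemp -d)/my_project"
mkdir -p "$project_dir"
echo "id,value" > "$project_dir/sample.csv"

cd "$project_dir"

# Approach 2: import the existing directory into Git.
git init -q
git add sample.csv

# The -c options set an identity for this one command only (placeholder values).
git -c user.name="Lab User" -c user.email="lab@example.com" \
    commit -q -m "Import existing project"

# Count the commits reachable from HEAD; a fresh import has exactly one.
commit_count="$(git rev-list --count HEAD)"
echo "commits: $commit_count"
```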
From your Hadoop-Client:
- Launch a browser and go to https://github.com/mafudge/datasets
- Click on the Clone or Download button as shown above. Copy the link given there.
- Launch a terminal window. To make sure you are in your home directory, type:
$ cd ~
- To clone a repository we use the git clone command. Enter the following into the terminal window:
$ git clone https://github.com/mafudge/datasets.git
The above command clones the repository into a new datasets folder, copying all the datasets stored on the GitHub server to your Hadoop-Client.
- Let’s verify you have cloned successfully. First move into the datasets directory:
$ cd datasets
- List the contents of the directory:
$ ls -l
You should see folders similar to this screenshot:
NOTE: the folders may vary from the screenshot as new datasets are always being added.
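Running git clone a second time into a folder that already exists and is not empty fails, so a small guard can be handy if you repeat the lab. The helper below is a sketch, not part of the lab; the function name clone_if_missing is hypothetical.

```shell
# clone_if_missing URL DIR
# Clone URL into DIR unless DIR already holds a Git repository.
# (Hypothetical helper written for this sketch.)
clone_if_missing() {
    url="$1"
    dir="$2"
    if [ -d "$dir/.git" ]; then
        # A .git subdirectory means a clone is already in place.
        echo "Already cloned: $dir"
    else
        git clone -q "$url" "$dir"
    fi
}

# Usage in this lab (needs network access, so it is left commented out):
# clone_if_missing https://github.com/mafudge/datasets.git "$HOME/datasets"
```

The check looks for the .git subdirectory rather than just the folder, so an unrelated empty folder named datasets would still be cloned into.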
Updating the local Git repository (Pull)
One of the common things you’ll need to do is update your local datasets repository with the changes from the central repository on the GitHub website. To do this, we execute a git pull.
From your Hadoop-Client:
- Launch a terminal window.
- Change into the datasets directory in your home directory. Type:
$ cd ~/datasets
- Pull the changes from GitHub into your local repository. Type:
$ git pull
If there are changes upstream, they will be reflected locally. If everything is current, you’ll see a message saying the repository is already up to date.
NOTE: Get in the habit of pulling the repository before you begin each lab. This will ensure you have the proper datasets before you begin work.
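To see what a pull does without touching GitHub, the sketch below builds two throwaway repositories in a temporary directory. The names upstream and local, the file names, and the identity values are all invented for this example; upstream stands in for the GitHub repository and local for ~/datasets.

```shell
work="$(mktemp -d)"

# "upstream" stands in for the central repository on GitHub.
git init -q "$work/upstream"
cd "$work/upstream"
echo "first" > one.txt
git add one.txt
git -c user.name="Lab User" -c user.email="lab@example.com" \
    commit -q -m "Add one.txt"

# "local" plays the role of ~/datasets on your machine.
git clone -q "$work/upstream" "$work/local"

# Someone adds a new dataset upstream after you cloned.
echo "second" > two.txt
git add two.txt
git -c user.name="Lab User" -c user.email="lab@example.com" \
    commit -q -m "Add two.txt"

# Pull brings the new file into the local copy.
cd "$work/local"
git pull -q
ls
```

After the pull, the listing in local includes two.txt even though it was created only in upstream.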
Help! My Pull Doesn’t work!?!
If you make changes to the local datasets repository that conflict with the remote repository on GitHub, Git will refuse to pull down the changes. This is by design: when you’re working on code, you typically don’t want other collaborators to overwrite your work. Since you’re not contributing changes to the datasets repository, we always want to overwrite our local copy with the remote.
In this case we issue this command:
$ git reset --hard
to undo any local file changes we’ve made in the repository. After the reset we can issue a git pull again.
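The failure and the fix can be reproduced offline with two throwaway repositories. This is a sketch under assumptions: upstream stands in for GitHub, local for ~/datasets, and data.csv and the identity values are placeholders.

```shell
work="$(mktemp -d)"

# Local stand-in for the GitHub repository.
git init -q "$work/upstream"
cd "$work/upstream"
echo "original" > data.csv
git add data.csv
git -c user.name="Lab User" -c user.email="lab@example.com" \
    commit -q -m "Add data.csv"

# Stand-in for ~/datasets.
git clone -q "$work/upstream" "$work/local"

# Upstream changes the file...
echo "updated upstream" > data.csv
git -c user.name="Lab User" -c user.email="lab@example.com" \
    commit -q -am "Update data.csv"

# ...and so do we, locally, without committing.
cd "$work/local"
echo "edited locally" > data.csv

# A plain pull refuses to overwrite the uncommitted local edit.
if ! git pull -q 2>/dev/null; then
    echo "pull refused because of local changes"
fi

# Discard local edits to tracked files, then pull again.
git reset -q --hard
git pull -q
cat data.csv
```

Note that git reset --hard only restores tracked files. Files you created yourself that were never added to the repository are left in place.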
More about Git
- To know more about git hosting visit https://git.wiki.kernel.org/index.php/GitHosting
- To know more about git, here is a resource: http://www.dataschool.io/git-quick-reference-for-beginners/
- What is the command to clone a repository named
- How would you update a remote repository with the changes made to your project locally?
- How do you revert local changes so you can pull from a remote?