Unit 01 Lab 1: Git

Part 1: Overview

Learning outcomes

What is git?

Git is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency.

What is a git repository?

A repository is the place where the history of your work is stored. It usually lives in a .git subdirectory of your working copy, which is a copy of the most recent state of the files you’re working on.

Git in IST 718

In this course we use Git mainly to distribute the datasets, so that everyone has the correct versions on any machine. The repository used in this lab is public and hosted on GitHub [https://github.com]. Anyone with the repository’s URL can access it.

Before You Begin

Before you start this activity:

Part 2: Walk-Through

Cloning an existing Git repository

You can get a Git project using two main approaches. The first clones an existing Git repository from another server. The second takes an existing project or directory and imports it into Git.
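
The walk-through below uses the first approach. For comparison, here is a minimal sketch of the second approach, importing an existing directory into Git. The temporary directory, the file name sample.csv, and the identity values are hypothetical and exist only for illustration; they are not part of the lab.

```shell
# Hypothetical project directory containing one existing file.
project=$(mktemp -d)
echo "id,value" > "$project/sample.csv"

cd "$project"
git init --quiet                 # creates the .git subdirectory that holds the history
git add sample.csv               # stage the existing file
# The -c options set a throwaway identity for this one commit only.
git -c user.name="Demo User" -c user.email="demo@example.com" \
    commit --quiet -m "Import existing project"

ls -a                            # .git now sits alongside sample.csv
git log --oneline                # a single commit: the imported state
```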

From your Hadoop-Client:

  1. Launch a browser. Go to https://github.com/mafudge/datasets
    [Screenshot: GitHub]
  2. Click on the Clone or Download button as shown above. Copy the link given there.
  3. Launch a terminal window. Make sure you are in your home directory, type:
    $ cd ~
    then type:
    $ pwd
    to verify it is /home/ischool
  4. To clone a repository we use the git clone command. Enter the following into the terminal window:
    $ git clone https://github.com/mafudge/datasets.git
    This command clones the repository into a new datasets folder in the current directory. All the datasets stored on GitHub are now copied onto your Hadoop-Client.
  5. Let’s verify you have cloned successfully. First move into the datasets folder, type:
    $ cd datasets
  6. List the contents of the directory:
    $ ls -l
    You should see folders similar to this screenshot:
    [Screenshot: datasets folders]
    NOTE: The folders may vary from the screenshot, as new datasets are always being added.
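
If you repeat the steps above on a machine that already has the datasets, git clone will refuse to write into the existing, non-empty folder. The sketch below is one way to guard against that. The function name clone_if_missing and its two arguments are hypothetical helpers, not part of Git or of the lab.

```shell
# Clone a repository only if the destination is not already a Git working copy.
# $1 is the repository URL (or path), $2 is the destination folder.
clone_if_missing() {
    url="$1"
    dest="$2"
    if [ -d "$dest/.git" ]; then
        echo "Already cloned: $dest"
    else
        git clone --quiet "$url" "$dest" && echo "Cloned into: $dest"
    fi
}
```

For this lab, the call would look like:

    $ clone_if_missing https://github.com/mafudge/datasets.git ~/datasets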

Updating the local Git repository (Pull)

One of the most common tasks is updating your local datasets repository with changes from the central repository on GitHub. To do this, we run git pull.

From your Hadoop-Client:

  1. Launch a terminal window.
  2. Change into the datasets directory in your home directory, type:
    $ cd ~/datasets
  3. Pull the changes from GitHub into your local repository, type:
    $ git pull
    If there are changes upstream, they will be reflected locally. If everything is current, you’ll see the message Already up-to-date. (newer versions of Git print Already up to date.)

NOTE: Get in the habit of pulling the repository before you begin each lab. This will ensure you have the proper datasets before you begin work.
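
To see what a pull does without touching your real datasets folder, here is a small self-contained sketch. It assumes a throwaway local directory named upstream standing in for the central repository on GitHub; the directory names, file names, and identity values are all hypothetical.

```shell
demo=$(mktemp -d)

# A stand-in for the central repository (on GitHub in the lab, a local folder here).
git init --quiet "$demo/upstream"
cd "$demo/upstream"
echo "id,value" > first.csv
git add first.csv
git -c user.name="Demo User" -c user.email="demo@example.com" \
    commit --quiet -m "Add first.csv"

# Your local copy, cloned before the next change is published.
git clone --quiet "$demo/upstream" "$demo/local"

# A new dataset is added upstream after you cloned.
echo "id,value" > second.csv
git add second.csv
git -c user.name="Demo User" -c user.email="demo@example.com" \
    commit --quiet -m "Add second.csv"

# Pulling brings the new file into the local copy.
cd "$demo/local"
git pull --quiet
ls                               # second.csv is now present next to first.csv
```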

Help! My Pull Doesn’t work!?!

If you change files in the local datasets repository in a way that conflicts with the remote repository on GitHub, Git will refuse to pull down changes. This is by design: when you’re working on code, you typically don’t want collaborators’ changes to overwrite your work. Since you’re not contributing changes to the datasets repository, we always want the remote version to replace our local copy.

In this case, issue this command:
$ git reset --hard
to discard any uncommitted changes to tracked files in the repository. After the reset, issue a git pull.
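
The sketch below shows the effect of the reset in a throwaway repository, so you can try it without risking real work. The directory, file names, and identity values are hypothetical. Note that the reset restores tracked files only; files Git has never been told about are left alone.

```shell
demo=$(mktemp -d)
git init --quiet "$demo/repo"
cd "$demo/repo"
echo "original" > data.csv
git add data.csv
git -c user.name="Demo User" -c user.email="demo@example.com" \
    commit --quiet -m "Add data.csv"

echo "accidental edit" > data.csv    # change a tracked file
echo "scratch" > notes.txt           # create an untracked file
git status --short                   # lists the modified and the untracked file

git reset --hard --quiet             # discard uncommitted changes to tracked files
cat data.csv                         # back to the committed contents
ls                                   # notes.txt is still there: untracked files are not touched
```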

More about Git

Test Yourself

  1. What is the command to clone a repository named me?
  2. How would you update a remote repository with the changes made to your project locally?
  3. How do you revert local changes so you can pull from a remote?