Unit 01 Lab 1: Just Enough Linux

Part 1: Overview

To be a productive Hadoop user you must have basic knowledge of the Linux command line. The command line is the primary method of interacting with your computer using text-only commands. The goal of this lab is to provide you with just enough instruction to be a functional user of the Linux operating system. We’re not trying to turn you into a sysadmin, only give you knowledge of the commands essential to managing your environment effectively.

Learning Outcomes

Upon completing this activity you will be able to:


To complete this lab you will need:

Before you Begin

Before you start this lab make sure to:

The Command Prompt

The Linux command prompt is the text to the left of the cursor where you start typing in the terminal window. In this case, it’s ischool@dsappliance:~$. The command prompt provides us with three important pieces of information:

NOTE: My convention is to place a $ in front of the command I want you to type. Do not type the $ as part of the command. It is simply shorthand for the complete Linux command prompt. Any command which begins with a $ should run correctly from the command line.

Command Line Tips

Understanding Paths

Paths are simple instructions for how to get to a place within the Linux file system. For example, for a file named salary.txt in the ischool home directory on the hadoop-client VM, the absolute path would be /home/ischool/salary.txt.

You read paths left to right, going from the general to the more specific place in the file system until you reach the file or folder at the end of the path. For example, the path /home/ischool/salary.txt is read like this:

  1. Start at the base of the Linux file system / (also known as the root),
  2. Then go into the home directory,
  3. Then from there into the ischool directory,
  4. And in that folder there is a file salary.txt.

The example above depicts an absolute file path because the path starts at the root of the Linux file system. Likewise, if you were to give someone “absolute” directions to your house, you would start by telling them to be on earth, then go to your continent, then go to your country, then go to your region, etc.

A relative file path specifies the path to a file, taking into account your current working directory within the file system. For example, if you were to give someone “relative” directions to your house, you would give them directions from their current location (the relative path from where they are to where you are). In our salary.txt example, if you were in the /home/ folder, then the relative path would be ischool/salary.txt.
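You can check the difference between the two kinds of path for yourself. The following is only a sketch and not one of the lab steps: it builds a throwaway directory with mktemp, so the home and ischool folders inside it are stand-ins for the real /home/ischool, and the file contents are made up. It then reads the same file once through an absolute path and once through a relative path.

```shell
# Build a scratch tree that mimics home/ischool/salary.txt (stand-in names).
base=$(mktemp -d)
mkdir -p "$base/home/ischool"
echo "salary data" > "$base/home/ischool/salary.txt"

# Absolute path: begins at the root (/), so it works from any working directory.
abs_out=$(cat "$base/home/ischool/salary.txt")

# Relative path: begins at the current working directory.
cd "$base/home"
rel_out=$(cat ischool/salary.txt)

echo "absolute: $abs_out"
echo "relative: $rel_out"
```

Both lines print the same contents. The relative path only works because the working directory was changed to the scratch home folder first.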

Path Abbreviations

There are three path abbreviations every command line aficionado should know. These make it easier for you to get around the file system.

  1. ~ Path to the current user’s home directory. For example ~/databases/ is a path to the databases folder in the current user’s home directory.
  2. .. Path to the parent folder, or “go up” one folder. For example, let’s assume you have two directories in your home directory, ~/databases/ and ~/games. If you are currently in the games folder and need to reference the databases folder, the path would be ../databases (go up out of games and then down into databases).
  3. . Path to the current folder. This is the same as the output from the pwd command.
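Here is a small sketch of the three abbreviations in action. It is not part of the lab steps, and it assumes a scratch directory created with mktemp in place of your real home directory, so only the ~ line refers to your actual home folder.

```shell
# Scratch directory with the two example folders (hypothetical layout).
scratch=$(mktemp -d)
mkdir "$scratch/databases" "$scratch/games"

cd "$scratch/games"
here=$(pwd)            # where we are now

cd .                   # . is the current folder, so nothing changes
same=$(pwd)

cd ../databases        # up out of games, then down into databases
sibling=$(pwd)

home_dir=$(echo ~)     # ~ expands to the current user's home directory

echo "$same"
echo "$sibling"
echo "$home_dir"
```

The first line of output ends in games, the second ends in databases, and the third is your home directory.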

Part 2: Walk-Through

NOTE: While you can follow this lab from any Linux command prompt, the output seen here is dependent upon running these commands on the hadoop-client or Data Science Appliance virtual machine.

Getting Your Bearings

In this section we will cover the following commands: clear, date, hostname, pwd, whoami

Let’s try these commands:

  1. At the command prompt, type $ date to display the current system date and time. You should see something like this: Thu May 19 16:54 EDT 2016. The actual time will vary, of course.
  2. $ whoami displays the username of the currently logged-on user. This information is conveniently displayed on the command prompt, but the command is useful when you encounter a system without it on the prompt. This should be ischool.
  3. $ hostname displays the host name of the computer. Likewise, the output of this command should match what you see in the Linux prompt. This should be dsappliance.
  4. $ pwd displays the current working directory, or the place on the file system where the command prompt resides currently. This should be the absolute path /home/ischool.
  5. $ clear clears the output window and returns the cursor to the top of the screen. Merely cosmetic!
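Three of these commands report the same facts the prompt shows you. As an illustration only (not a lab step), the sketch below assembles a prompt-like string from their output. The fallback to uname -n is my addition for systems where the hostname command is not installed, and the result spells out the full working directory rather than using the ~ abbreviation.

```shell
user=$(whoami)
host=$(hostname 2>/dev/null || uname -n)   # uname -n is a fallback, not used in the lab
dir=$(pwd)

# Resembles ischool@dsappliance:~$ but with the directory written out in full.
prompt="$user@$host:$dir\$"
echo "$prompt"
```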

Directory Management

In this section we will cover the following commands: cd, ls, mkdir, pwd, rmdir

Let’s try these commands:

  1. First, let’s type: $ mkdir unit01-lab2 to create a directory for this lab.

  2. Use the $ ls command to list the contents of the working directory. Therein you should see your new unit01-lab2 directory.

  3. Let’s change the working directory to the new folder. Type: $ cd unit01-lab2
  4. Check yourself. Type: $ pwd to output the current directory. It should be the absolute path: /home/ischool/unit01-lab2
  5. Next, use the mkdir command to make two folders, games and databases. Type: $ ls to verify the two directories are there.
  6. Oops. I meant datasets, not databases. Let’s remove the folder. Type: $ rmdir databases then use: $ ls to verify there is one folder, then create the new datasets folder.
  7. Check yourself again with: $ ls You should see two folders: datasets and games
  8. Let’s move into the games folder using the cd command. Check yourself with: $ pwd It should report this path: /home/ischool/unit01-lab2/games
  9. Going up. Let’s go back up to the unit01-lab2 folder, using the parent folder path abbreviation. Type: $ cd ..
  10. Going home. Let’s go back to our home folder using the path abbreviation. Type: $ cd ~
  11. Use pwd to verify you’re in the /home/ischool folder before continuing.
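If you would like to replay the directory steps above in one go, here is a sketch that runs them inside a scratch directory created with mktemp (an assumption of this example, so that your real home folder is left untouched). The commands are written without the leading $ so they can be pasted as a script.

```shell
# Work in a scratch directory instead of /home/ischool.
cd "$(mktemp -d)"
start=$(pwd)

mkdir unit01-lab2
cd unit01-lab2
mkdir games databases      # mkdir accepts several names at once
rmdir databases            # rmdir only removes empty directories
mkdir datasets
listing=$(ls)              # the two folders that remain

cd games
cd ..                      # .. takes us back up to unit01-lab2
after_up=$(pwd)

echo "$listing"
echo "$after_up"
```

The listing shows datasets and games, and the final working directory is the unit01-lab2 folder inside the scratch directory.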

File Management

In this section, we will cover the following commands: cat, cp, nano, mv, rm, touch

Try these commands:

  1. First, move into the unit01-lab2/games folder. Type: $ cd unit01-lab2/games Verify you’re in the correct working directory with pwd; it should be: /home/ischool/unit01-lab2/games
  2. Let’s create an empty file: $ touch highscores.csv Verify the file is there with ls -l and notice it’s an empty file with 0 bytes.
  3. Let’s edit the file: $ nano highscores.csv This will open the file in the nano editor. Add the following 4 lines to the file, exactly as shown:
    Contents of highscores.csv
    When you’re finished editing the file, press Ctrl+x to exit, then type Y to save and press Enter.
  4. Let’s make sure you did the previous step correctly by seeing what’s in the file. Type: $ cat highscores.csv to display the contents of the file. You should see the same 4 lines as in the screenshot.
  5. Time to back up this important file by making a copy. Type: $ cp highscores.csv backup-highscores.csv Use ls -l to verify there are now two files in the working directory.
  6. That’s a horrible name. Let’s rename the file. Type: $ mv backup-highscores.csv highscores.bak Again, list the contents of the folder to verify. You should have two files at this point: highscores.csv and highscores.bak
  7. Let’s copy the high scores to our datasets folder. If you recall the datasets folder is up one level from the current folder, so type: $ cp highscores.csv ../datasets/ You can verify the file is there without changing the working directory by typing: ls ../datasets/ it should be there!
  8. Next use the cd command to move into the datasets folder. Hint: the command should look similar to the ls command you just typed. Use pwd to verify you are in the /home/ischool/unit01-lab2/datasets folder.
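  9. Finally, since we no longer need the games folder or its contents, let’s delete it with: $ rm -ir ../games The -r option tells rm to remove the folder along with everything inside it, and the -i option makes it ask for confirmation before deleting each of the two files and the games folder itself.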
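The file steps above can also be replayed as a short script. This is only a sketch under a few assumptions: it works in a scratch directory from mktemp, it fills the file with echo instead of nano, and the rows it writes are placeholder data invented for the example, not the lab's actual high scores. It also drops the -i option from rm so that it can run without prompting.

```shell
# Scratch copy of the lab layout: games and datasets side by side.
cd "$(mktemp -d)"
mkdir games datasets
cd games

touch highscores.csv                       # creates an empty, 0-byte file
echo "pong,21" > highscores.csv            # placeholder rows, not the lab's data
echo "galaga,900" >> highscores.csv

cp highscores.csv backup-highscores.csv    # make a copy
mv backup-highscores.csv highscores.bak    # rename the copy
cp highscores.csv ../datasets/             # copy into the sibling folder
files=$(ls)

cd ../datasets
rm -r ../games                             # the lab adds -i to confirm each deletion
remaining=$(ls ..)

echo "$files"
echo "$remaining"
```

After the script runs, only the datasets folder remains, and it still holds its own copy of highscores.csv.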

File Manipulations

In this section, we will cover the following commands: cut, echo, > (output redirect), >> (output append), | (pipe), sort, wc

Try these commands:

  1. Assuming you left off in the unit01-lab2/datasets folder, let’s count the lines in highscores.csv. Type: $ wc -l highscores.csv There should be 4 lines.
  2. Use cat to display the contents of highscores.csv. You should see the scores in this order:
    Contents of highscores.csv
  3. Let’s sort the games alphabetically. Type: $ sort highscores.csv You should see output like this:
    Contents of highscores.csv sorted
  4. Hey, that’s useful! If only there were a way to save the screen output to a file. Lucky for you, there is! Type: $ sort highscores.csv > highscores-sorted.csv The > operator redirects the output of the sort command to the file highscores-sorted.csv. Verify it worked with the cat command.
  5. I forgot to add a high score, but don’t fret, there’s a way to add it from the command line. First, type: $ echo "defender,12000" It should output defender,12000 to the console. Now we need to append that to the end of highscores.csv, and we use the >> (output append) operator to do this. Type the following, and make sure you use two > characters or you will overwrite the file: $ echo "defender,12000" >> highscores.csv
  6. Now when you $ cat highscores.csv the file has the new game and score appended to the end, like this:
    Contents of highscores.csv with defender appended
  7. Just the names, please. Next we will use the cut command to retrieve only the names of the games from the highscores.csv file. Since the character that separates the name of the game from its high score is a comma (,), that will be our delimiter. Type: $ cut -f1 -d, highscores.csv and you will see only the names of the games as output.
  8. Similarly, if you just want the scores, you can choose the 2nd column instead of the first one, like this: $ cut -f2 -d, highscores.csv
  9. We can combine commands to achieve a desired result. For example we can pull out the names of the games then sort them. To combine commands we use the pipe | operator. Type: $ cut -f1 -d, highscores.csv | sort which lists just the names of the games sorted alphabetically. The pipe takes the output of the cut command (just the names of the games) and uses it as input into the sort command, eliminating the need for us to use redirection to a temporary file.
  10. We can even redirect the final output of a combined set of commands to a file. For instance, type: $ cut -f1 -d, highscores.csv | sort > games.txt
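To see redirection, appending, and pipes working together, here is a sketch you can run anywhere. It assumes a scratch directory from mktemp, and the first four rows are placeholder data invented for the example; only the defender,12000 row comes from the lab.

```shell
cd "$(mktemp -d)"

# Placeholder scores (invented); > creates the file, >> adds to the end.
echo "zaxxon,500" > highscores.csv
echo "galaga,900" >> highscores.csv
echo "pong,21" >> highscores.csv
echo "asteroids,700" >> highscores.csv

echo "defender,12000" >> highscores.csv        # append the forgotten score

line_count=$(cat highscores.csv | wc -l)       # pipe the file into wc

# cut keeps field 1 (the name), then sort orders the names.
names=$(cut -f1 -d, highscores.csv | sort)

# The same pipeline, with the final output redirected to a file.
cut -f1 -d, highscores.csv | sort > games.txt

echo "$line_count"
echo "$names"
```

The file now has 5 lines, and games.txt holds the five game names in alphabetical order. If you had typed a single > when appending defender, the file would contain only that one row.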

More File Manipulations

In this section, we will cover the following commands: curl, grep, head, less, tail

Try these commands.

  1. There’s a data set of orders from Chipotle, a popular Mexican eatery, here: https://raw.githubusercontent.com/TheUpshot/chipotle/master/orders.tsv. A common thing we Data Scientists need to do is download data sets from the Internet for analysis. The curl command can do this for us. Type: $ curl -L https://raw.githubusercontent.com/TheUpshot/chipotle/master/orders.tsv Watch the orders whizz by in the console!
  2. What if we’d like to save this output to a file? We can use an output redirect for that, of course. Type: $ curl -L https://raw.githubusercontent.com/TheUpshot/chipotle/master/orders.tsv > orders.tsv The output will no longer display to the console but will instead be written to the file orders.tsv
  3. But when we cat the file, it scrolls by too fast for us to read. You can paginate the file by typing: $ less orders.tsv Now you can use the arrow keys to scroll up and down through the file, or the space bar to skip through a screen’s worth at a time. When you are finished viewing the file, press q to quit.
  4. Sometimes we want to see the first few lines of the file. To see the first 15 lines, type: $ head -n15 orders.tsv
  5. Likewise to see the last 10 lines of the file, we type: $ tail -n10 orders.tsv
  6. Maybe you just want to see the orders for “Steak Burrito”. To do that, we type: $ grep "Steak Burrito" orders.tsv
  7. Still too many to view on the screen, have less help you out: $ grep "Steak Burrito" orders.tsv | less Scroll up and down with the arrow keys, and press q to quit.
  8. Finally, let’s get serious… how about a distinct list of everything ordered? We will cut out the item name, then sort it, then filter out the duplicates with uniq then page through the output with less, type: $ cut -f3 orders.tsv | sort | uniq | less and watch the magic unfold!
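If you want to try these commands without downloading anything, the sketch below runs them against a tiny stand-in file. Everything in that file is invented for the example, including the header names; the only thing it borrows from the lab is that the fields are tab-separated with the item name in field 3. It uses printf, which the lab does not cover, simply because it is a convenient way to write tab characters.

```shell
cd "$(mktemp -d)"

# Tiny stand-in for orders.tsv (made-up rows, tab-separated, item name in field 3).
printf 'Order\tQty\tItem\n' > sample.tsv
printf '1\t1\tSteak Burrito\n' >> sample.tsv
printf '2\t2\tChicken Bowl\n' >> sample.tsv
printf '3\t1\tSteak Burrito\n' >> sample.tsv
printf '4\t1\tChips\n' >> sample.tsv

first_two=$(head -n2 sample.tsv)                  # header plus the first order
last_one=$(tail -n1 sample.tsv)                   # the final order
steak=$(grep "Steak Burrito" sample.tsv | wc -l)  # count the matching lines

# Distinct values of field 3: sort first, then let uniq drop the repeats.
items=$(cut -f3 sample.tsv | sort | uniq)

echo "$first_two"
echo "$last_one"
echo "$steak"
echo "$items"
```

The sort step matters because uniq only removes duplicate lines that sit next to each other. Notice also that the header row is treated like any other line, so it shows up in the distinct list.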

Security and Accessing Other Systems

In this section, we will cover the following commands: exit, su, sudo, ssh

Try these commands:

  1. Access denied. Try to access this file: $ cat /etc/shadow It will report “Permission Denied” because you do not have access.
  2. Let’s assume another identity. We will log on as root, the super-user, and try again. Type: $ su to log in as the root user (the password is the same as for the user ischool). Upon success, your prompt should now display root instead of ischool. Try this command again: $ cat /etc/shadow It should work this time.
  3. Type: exit to logout as root and return to the ischool prompt.
  4. Another way to obtain system access is to use the sudo command. This command elevates the current user to root permissions, instead of switching the user to root. It’s a safer way to share administrative access since you don’t have to know the root password. Let’s try it, type: $ sudo cat /etc/shadow The command will ask you to re-enter your ischool password before it elevates you.
  5. There are occasions where you’ll need to access the hadoop-cluster command line. The easiest method to accomplish this from the hadoop-client is with ssh, which allows you to connect to the command prompt remotely. Try this: $ ssh root@hadoop-cluster The first time you connect, you will have to answer yes to accept the RSA key. Then you’ll have to enter the root password (which is the same as the ischool password). Finally, you should see the remote command prompt, which says [root@sandbox ~]
    NOTE: We could have typed $ ssh root@sandbox to access hadoop-cluster, too; sandbox is the official host name, and hadoop-cluster is an alias.
  6. Type exit to close the connection and return to hadoop-client

Test Yourself

  1. What is the Linux command to display the currently logged-on user?
  2. Suppose you’re in the home directory. What command will create a directory under unit01-lab2 called testing?
  3. Assuming you’re in the home directory, which command will change into the newly created testing directory?
  4. Which Linux command will copy the orders.tsv file in the unit01-lab2/datasets folder into the unit01-lab2/testing folder?
  5. Write a command to list the last 25 lines in a file called news.txt
  6. Which Linux command allows you to find text within a file?
  7. Can you use the rmdir command on a directory which contains files? If not, which command should you use instead?
  8. What operator do we use to take the output of a command and send it to a new file?

Part 3: On Your Own

In this part, let’s explore the Chipotle orders we downloaded in the previous section.

  1. Make a folder in datasets called Chipotle. What command did you use?
  2. Write a Linux command to count the number of lines in the orders.tsv file.
  3. How many orders are in the file? What command helped you figure this out?
  4. Which is the more popular item, “Chicken Burrito” or “Steak Burrito” (HINT: count lines)? Write down each Linux command.
  5. Make a new file in the Chipotle folder called chicken-burritos.tsv which contains only the “Chicken Burrito” lines from orders.tsv
  6. Using chicken-burritos.tsv, tell me which beans, “Black Beans” or “Pinto Beans”, are more popular on chicken burritos. Write down the Linux commands you used.
  7. Write a Linux command to back up the file chicken-burritos.tsv to chicken-burritos.bak in the same folder.
  8. Write a Linux command to remove the Chipotle folder and its contents.

Appendix: Linux Command Reference

General Format For Commands

<command> -<options> <arguments>

NOTE: For most commands you can get help by typing <command> --help or man <command>

Commands A-Z







date - Displays the current system date and time.

echo <text> - Displays the text on the console.









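nano <filename> - Opens the file in the nano text editor.

> (output redirect) - Sends the output of a command to a file, overwriting the file if it already exists.

>> (output append) - Appends the output of a command to the end of a file.

| (pipe) - Uses the output of one command as the input to the next command.


rm -i - Removes files, asking for confirmation before each deletion.

rmdir <dirname> - Removes an empty directory.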