# Part 1: Overview

To be a productive Hadoop user you must have basic knowledge of the Linux command line. The command line is the primary method of interacting with your computer using text-only commands. The goal of this lab is to provide you with just enough instruction to be a functional user of the Linux operating system. We’re not trying to turn you into a sysadmin, only give you knowledge of the commands essential to managing your environment effectively.

## Learning Outcomes

Upon completing this activity you will be able to:

• Demonstrate use of the command line.
• Understand the basic syntax of Linux commands.
• Understand file system concepts like directories (folders) and paths.
• Navigate a file system and manage folders and files through copying, moving, and deleting.
• Work with the contents of files, that is, the data stored in them.

## Requirements

To complete this lab you will need:

## Before you Begin

Before you start this lab make sure to:

### The Command Prompt

The Linux command prompt is the part to the left of the cursor where you start typing in the terminal window. In this case, it’s ischool@dsappliance:~$. The command prompt provides us with three important pieces of information:

• The first part of the prompt, before the @ symbol, is the current user. It says ischool because we are logged on as that user.
• The second part of the prompt, after the @ but before the : symbol, is the host name, or the name of the computer you are logged into. It says dsappliance because that is the friendly name of the hadoop-client virtual machine: The Data Science Appliance.
• The third part of the prompt, after the : but before the $ symbol, is the current file system path. It lets you know the current working folder on the file system. It says ~, which represents the home directory for the user ischool.

NOTE: My convention is to place a $ in front of the command I want you to type. Do not type the $ as part of the command. It is simply shorthand for the complete Linux command prompt. Any command which begins with a $ should run correctly from the command line.

## Command Line Tips

• After typing the first few letters of a file or path, press the Tab key to auto-complete the name.
• Use the up and down arrow keys to navigate a history of previously typed commands.
• If you’ve typed a command incorrectly and it does not return you to the prompt, press CTRL+C to cancel it. (CTRL+Z only suspends the command and leaves it stopped in the background.)

### Understanding Paths

Paths are simple instructions for how to get to a place within the Linux file system. For example, for a file named salary.txt in the ischool home directory on the hadoop-client VM, the absolute path would be /home/ischool/salary.txt. You read paths left to right, going from a general to a more specific place in the file system until you reach the file or folder at the end of the path. For example, the path /home/ischool/salary.txt is read like this:

1. Start at the base of the Linux file system / (also known as the root),
2. Then go into the home directory,
3. Then from there into the ischool directory,
4. And in that folder there is a file salary.txt.

The example above depicts an absolute file path because the path starts at the root of the Linux file system. Likewise, if you were to give someone “absolute” directions to your house, you would start by telling them to be on earth, then go to your continent, then go to your country, then go to your region, etc.

A relative file path specifies the path to a file, taking into account your current working directory within the file system. For example, if you were to give someone “relative” directions to your house, you would give them directions from their current location (the relative path from where they are to where you are).
In our salary.txt example, if you were in the /home/ folder, then the relative path would be ischool/salary.txt.

### Path Abbreviations

There are three path abbreviations every command line aficionado should know. These make it easier for you to get around the file system.

1. ~ Path to the current user’s home directory. For example, ~/databases/ is a path to the databases folder in the current user’s home directory.
2. .. Path to the parent folder, or “go up” one folder. For example, let’s assume you have two directories in your home directory, ~/databases/ and ~/games. If you are currently in the games folder and need to reference the databases folder, the path would be ../databases (go up out of games and then down into databases).
3. . Path to the current folder. This is the same folder that the pwd command reports.

# Part 2: Walk-Through

NOTE: While you can follow this lab from any Linux command prompt, the output seen here is dependent upon running these commands on the hadoop-client or Data Science Appliance virtual machine.

## Getting Your Bearings

In this section we will cover the following commands: clear, date, hostname, pwd, whoami

Let’s try these commands:

1. At the command prompt, type $ date to display the current system date and time. You should see something like this: Thu May 19 16:54 EDT 2016. Actual time will vary of course.
2. $ whoami displays the username of the current logged on user. This information is conveniently displayed on the command prompt, but it’s useful when you encounter a system without it on the prompt. This should be ischool.
3. $ hostname displays the host name of the computer. Likewise, the output of this command should match what you see in the Linux prompt. This should be dsappliance.
4. $ pwd displays the current working directory, or the place on the file system where the command prompt resides currently. This should be the absolute path /home/ischool.
5. $ clear clears the output window and returns the cursor to the top of the screen. Merely cosmetic!
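The commands in this section report the same pieces of information that the command prompt shows. As a minimal sketch (the variable names are illustrative and not part of the lab), the following rebuilds a prompt-like line from their output:

```shell
# Capture the output of each command in a variable (names are illustrative).
current_user=$(whoami)      # the part of the prompt before the @
host_name=$(hostname)       # the part between @ and :
working_dir=$(pwd)          # the part between : and $, written out in full

# Assemble a line that resembles the prompt. pwd always prints the absolute
# path, so the home directory is not abbreviated to ~ here.
prompt_line="${current_user}@${host_name}:${working_dir}\$"
echo "$prompt_line"

# date prints the current system date and time.
date
```

On the Data Science Appliance, run from the home directory, the first line of output should resemble ischool@dsappliance:/home/ischool$.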

## Directory Management

In this section we will cover the following commands: cd, ls, mkdir, pwd, rmdir

Let’s try these commands:

1. First, let’s create a directory for this lab. Type: $ mkdir unit01-lab2
2. Use the $ ls command to list the contents of the working directory. You should see your new unit01-lab2 directory.
3. Let’s change the working directory to the new folder. Type: $ cd unit01-lab2
4. Check yourself. Type: $ pwd to output the current directory. It should be the absolute path: /home/ischool/unit01-lab2
5. Next, use the mkdir command to make two folders, games and databases. Type: $ ls to verify the two directories are there.
6. Oops. I meant datasets, not databases. Let’s remove the folder. Type: $ rmdir databases then use: $ ls to verify there is one folder, then create the new datasets folder.
7. Check yourself again with $ ls. You should see two folders: datasets and games
8. Let’s move into the games folder using the cd command. Check yourself with: $ pwd It should report this path: /home/ischool/unit01-lab2/games
9. Going up. Let’s go back up to the unit01-lab2 folder using the parent folder path abbreviation. Type: $ cd ..
10. Going home. Let’s go back to our home folder using the path abbreviation. Type: $ cd ~
11. Use pwd to verify you’re in the /home/ischool folder before continuing.

## File Management

In this section, we will cover the following commands: cat, cp, nano, mv, rm, touch

Try these commands:

1. First move into the unit01-lab2/games folder. Type: $ cd unit01-lab2/games Verify you’re in the correct working directory with pwd. It should be: /home/ischool/unit01-lab2/games
2. Let’s create an empty file. Type: $ touch highscores.csv Verify the file is there with ls -l and notice it’s an empty file with 0 bytes.
3. Let’s edit the file. Type: $ nano highscores.csv This will open the file in the nano editor. Add the following 4 lines to the file, exactly as shown:

When you’re finished editing the file, press Ctrl+x to exit, then type Y to save and press Enter.
4. Let’s make sure you did the previous step correctly by seeing what’s in the file. Type: $ cat highscores.csv to display the contents of the file. You should see the same 4 lines as in the screenshot.
5. Time to back up this important file by making a copy. Type: $ cp highscores.csv backup-highscores.csv Use ls -l to verify there are now two files in the working directory.
6. That’s a horrible name. Let’s rename the file. Type: $ mv backup-highscores.csv highscores.bak Again list the contents of the folder to verify. You should have two files at this point: highscores.csv and highscores.bak
7. Let’s copy the high scores to our datasets folder. If you recall, the datasets folder is up one level from the current folder, so type: $ cp highscores.csv ../datasets/ You can verify the file is there without changing the working directory by typing: $ ls ../datasets/ It should be there!
8. Next use the cd command to move into the datasets folder. Hint: the command should look similar to the ls command you just typed. Use pwd to verify you are in the /home/ischool/unit01-lab2/datasets folder.
9. Finally, since we no longer need the games folder or its contents, let’s delete it. Type: $ rm -ir ../games This will prompt you to confirm deleting the two files and the games folder.
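The file management steps above can be replayed in one pass. This is only a sketch, under a few assumptions that are not part of the lab: it works in a throwaway directory created with mktemp -d (lab_dir is an illustrative name) so nothing in your home directory is touched, it skips the nano step so highscores.csv stays empty, and it uses rm -r rather than rm -ir so that it runs without prompting.

```shell
# Work in a scratch directory so the real unit01-lab2 folder is left alone.
lab_dir=$(mktemp -d)
mkdir "$lab_dir/games" "$lab_dir/datasets"
cd "$lab_dir/games"

touch highscores.csv                        # create an empty file
cp highscores.csv backup-highscores.csv     # make a copy
mv backup-highscores.csv highscores.bak     # rename the copy
cp highscores.csv ../datasets/              # copy up one level, then into datasets

ls -l                                       # lists highscores.bak and highscores.csv
ls ../datasets/                             # lists highscores.csv

cd ../datasets                              # same relative path as the ls above
rm -r ../games                              # delete games and everything in it
ls ..                                       # only datasets remains
```

When working with real files, prefer rm -ir as in the step above, since it asks before each deletion.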

## File Manipulations

In this section, we will cover the following commands: cut, echo, > (output redirect), >> (output append), | (pipe), sort, wc

Try these commands:

1. Assuming you left off in the unit01-lab2/datasets folder, let’s count the lines in highscores.csv. Type: $ wc -l highscores.csv There should be 4 lines.
2. Use cat to display the contents of highscores.csv. You should see the scores in this order:
3. Let’s sort the games alphabetically. Type: $ sort highscores.csv You should see output like this:
4. Hey, that’s useful! If only there was a way to save the screen output to a file. Lucky for you there is! Type: $ sort highscores.csv > highscores-sorted.csv The > operator redirects the output of the sort command to the file highscores-sorted.csv. Verify it worked with the cat command.
5. I forgot to add a high score, but don’t fret, there’s a way to add it from the command line. First, type: $ echo "defender,12000" It should output defender,12000 to the console. Now we need to append that to the end of highscores.csv. We use the >> (output append) operator to do this. Type this, and make sure it’s two >> or you will overwrite the file: $ echo "defender,12000" >> highscores.csv
6. Now when you $ cat highscores.csv the file has the new game and score appended to the end, like this:
7. Just the names, please. Next we will use the cut command to retrieve only the names of the games from the highscores.csv file. Since the character that separates the name of the game from its high score is a comma (,), that will be our delimiter. Type: $ cut -f1 -d, highscores.csv and you will see only the names of the games as output.
8. Similarly, if you just want the scores, you can choose the 2nd column instead of the first one, like this: $ cut -f2 -d, highscores.csv
9. We can combine commands to achieve a desired result. For example, we can pull out the names of the games and then sort them. To combine commands we use the pipe | operator. Type: $ cut -f1 -d, highscores.csv | sort which lists just the names of the games sorted alphabetically. The pipe takes the output of the cut command (just the names of the games) and uses it as input into the sort command, eliminating the need for us to use redirection to a temporary file.
10. We can even redirect the final output of a combined set of commands to a file. For instance, type: $ cut -f1 -d, highscores.csv | sort > games.txt
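The steps in this section can be tried end to end as follows. This is a sketch, not the lab’s own data: the four starting rows are made up for illustration (only the defender,12000 row comes from the steps above), and the scratch directory and the lab_dir name are assumptions so that your real highscores.csv is not changed.

```shell
# Work in a scratch directory (lab_dir is an illustrative name).
lab_dir=$(mktemp -d)
cd "$lab_dir"

# Hypothetical starting data: four made-up game,score rows.
echo "pacman,3000"    >  highscores.csv    # > creates (or overwrites) the file
echo "galaga,5000"    >> highscores.csv    # >> appends to it
echo "asteroids,2000" >> highscores.csv
echo "centipede,4000" >> highscores.csv

sort highscores.csv > highscores-sorted.csv      # save the sorted output to a file
echo "defender,12000" >> highscores.csv          # append a fifth row

wc -l highscores.csv                             # now reports 5 lines
cut -f1 -d, highscores.csv | sort > games.txt    # names only, sorted, saved to a file
cat games.txt
```

With this sample data, games.txt lists asteroids, centipede, defender, galaga, and pacman, one per line. Note that highscores-sorted.csv still has only four rows, because it was written before the fifth row was appended.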

## More File Manipulations

In this section, we will cover the following commands: curl, grep, head, less, tail

Try these commands.

1. There’s a data set of orders from Chipotle, a popular Mexican eatery, here: https://raw.githubusercontent.com/TheUpshot/chipotle/master/orders.tsv. A common thing we Data Scientists need to do is download data sets from the Internet for analysis. The curl command can do this for us. Type: $ curl -L https://raw.githubusercontent.com/TheUpshot/chipotle/master/orders.tsv Watch the orders whizz by in the console!
2. What if we’d like to save this output to a file? We can use an output redirect for that, of course. Type: $ curl -L https://raw.githubusercontent.com/TheUpshot/chipotle/master/orders.tsv > orders.tsv The output will no longer display to the console but will instead be written to the file orders.tsv
3. But when we cat the file it scrolls by too fast for us to read. You can paginate the file by typing: $ less orders.tsv Now you can use the arrow keys to scroll up and down through the file, or the space bar to skip through a screen’s worth at a time. When you are finished viewing the file, press q to quit.
4. Sometimes we want to see the first few lines of the file. To see the first 15 lines, type: $ head -n15 orders.tsv
5. Likewise, to see the last 10 lines of the file, type: $ tail -n10 orders.tsv
6. Maybe you just want to see the orders for “Steak Burrito”. To do that, type: $ grep "Steak Burrito" orders.tsv
7. Still too many to view on the screen? Have less help you out: $ grep "Steak Burrito" orders.tsv | less Scroll up and down with the arrow keys, and press q to quit.
8. Finally, let’s get serious… how about a distinct list of everything ordered? We will cut out the item name, then sort it, then filter out the duplicates with uniq, then page through the output with less. Type: $ cut -f3 orders.tsv | sort | uniq | less and watch the magic unfold!
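The same commands can be tried without downloading anything. The sketch below builds a tiny tab-separated file whose rows are invented for illustration (they are not taken from the real orders.tsv, and there is no header row). It only assumes, as the walk-through does, that the item name is the third column. printf is used to write the tab characters and is not otherwise covered in this lab; lab_dir and sample-orders.tsv are illustrative names.

```shell
lab_dir=$(mktemp -d)
cd "$lab_dir"

# Hypothetical data: five invented orders (order number, quantity, item name).
printf '1\t1\tSteak Burrito\n'  >  sample-orders.tsv
printf '2\t2\tChips\n'          >> sample-orders.tsv
printf '3\t1\tChicken Bowl\n'   >> sample-orders.tsv
printf '4\t1\tSteak Burrito\n'  >> sample-orders.tsv
printf '5\t3\tChips\n'          >> sample-orders.tsv

head -n2 sample-orders.tsv                # the first two orders
tail -n1 sample-orders.tsv                # the last order
grep "Steak Burrito" sample-orders.tsv    # only the lines that match

# Distinct item names: cut out the third column, sort it, then drop duplicates.
# uniq only removes duplicates that sit next to each other, so sort comes first.
cut -f3 sample-orders.tsv | sort | uniq > items.txt
cat items.txt
```

With this sample data, items.txt contains three lines: Chicken Bowl, Chips, and Steak Burrito.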

## Security and Accessing Other Systems

In this section, we will cover the following commands: exit, su, sudo, ssh

Try these commands:

1. Access denied. Try to access this file: $ cat /etc/shadow It will report “Permission Denied” because you do not have access.
2. Let’s assume another identity. We will log on as root, the super-user, and try again. Type: $ su to log in as the root user (the password is the same as for the user ischool). Upon success, your prompt should now display root instead of ischool. Try this command again: $ cat /etc/shadow It should work this time.
3. Type: $ exit to log out as root and return to the ischool prompt.
4. Another way to obtain system access is to use the sudo command. This command elevates the current user to root permissions, instead of switching the user to root. It’s a safer way to share administrative access since you don’t have to know the root password. Let’s try it. Type: $ sudo cat /etc/shadow The command will ask you to re-enter your ischool password before it elevates you.
5. There are occasions where you’ll need to access the hadoop-cluster command line. The easiest way to do this from the hadoop-client is with ssh, which allows you to connect to the command prompt remotely. Try this: $ ssh root@hadoop-cluster The first time you connect you will have to answer yes to accept the RSA key. Then you’ll have to enter the root password (which is the same as the ischool password). Finally you should see the remote command prompt, which says [root@sandbox ~] NOTE: We could have typed $ ssh root@sandbox to access hadoop-cluster, too. sandbox is the official host name, and hadoop-cluster is an alias.
6. Type: $ exit to close the connection and return to hadoop-client.

## Test Yourself

1. What is the Linux command to display the current logged on user?
2. Suppose you’re in the home directory. What command will create a directory under unit01-lab2 called testing?
3. Assuming you’re in the home directory, which command will change into the newly created testing directory?
4. Which Linux command will copy the orders.tsv file in the unit01-lab2/datasets folder into the unit01-lab2/testing folder?
5. Write a command to list the last 25 lines in a file called news.txt.
6. Which Linux command allows you to find text within a file?
7. Can you use the rmdir command on a directory which contains files? If not, which command should you use instead?
8. To take the output of a command and send it to a new file, what do we use?

# Part 3: On Your Own

In this part, let’s explore the Chipotle orders we downloaded in the previous section.

## Exercises

1. Make a folder in datasets called Chipotle. What command did you use?
2. Write a Linux command to count the number of lines in the orders.tsv file.
3. How many orders are in the file? What command helped you figure this out?
4. Which is the more popular item, “Chicken Burrito” or “Steak Burrito” (HINT: count lines)? Write down each Linux command.
5. Make a new file in the Chipotle folder called chicken-burritos.tsv which contains only the “Chicken Burrito” lines from orders.tsv.
6. Using chicken-burritos.tsv, tell me which beans, “Black Beans” or “Pinto Beans”, are more popular on chicken burritos. Write down the Linux commands you used.
7. Write a Linux command to back up the file chicken-burritos.tsv to chicken-burritos.bak in the same folder.
8. Write a Linux command to remove the Chipotle folder and its contents.

# Appendix: Linux Command Reference

## General Format For Commands

<command> -<options> <arguments>

• <command> is the action we want the computer to take (delete a file, create a directory, etc…)
• <options> (or “flags”) modify the behavior of the command (prompt before deleting?)
• <arguments> are the things we want the command to act on (which directory should be created?)

NOTE: For most commands you can get help by typing <command> --help or man <command>

## Commands A-Z

cat

• cat <filename> prints (concatenates) the entire file

cd

• cd <path> changes directory to the path you specify, which can be a relative path or an absolute path
• cd .. moves you “up” one directory (to the parent directory)
• cd ~ moves you to your “home” directory
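As a short sketch of how absolute paths, relative paths, and the .. and . abbreviations behave with cd: the scratch directory and the variable names (lab_dir, abs_result, rel_result, parent_result) are assumptions made for this illustration, while the games and datasets folder names mirror the walk-through.

```shell
lab_dir=$(mktemp -d)                  # scratch directory, illustrative only
mkdir "$lab_dir/games" "$lab_dir/datasets"

cd "$lab_dir/games"                   # absolute path: starts from the root /
abs_result=$(pwd)

cd ../datasets                        # relative path: up out of games, down into datasets
rel_result=$(pwd)

cd ..                                 # up one level, to the parent directory
parent_result=$(pwd)

cd .                                  # . is the current directory, so nothing changes

echo "$abs_result"
echo "$rel_result"
echo "$parent_result"
```

The three lines printed end in /games, /datasets, and the scratch directory itself, in that order.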

clear

• clear removes all output from your console

cp

• cp <filename> <new path> copies a file from its current location to <new path>, leaving the original file unchanged
• cp <filename> <new filename> copies a file without changing its location

curl

• curl -L <url> downloads the file at <url>, outputting to the console.
• To output to a file use redirection: curl -L <url> > <filename>

cut

• cut -f1,2 <filename> cuts a tab-delimited file into columns and returns the first two fields
• cut -f1,2 -d, <filename> indicates that the file is delimited by commas

date - Displays the current system date and time.

echo <text>

• Prints <text> to the console. Used frequently with other commands as part of a pipe or redirection.

find

• find <path> -name <name> will recursively search the specified path (and its subdirectories) and find files and directories with a given <name>
• Use . for the <path> to refer to the working directory.
• For the <name>, you can search for an exact match, or use wildcard characters to search for a partial match. Put a name that contains wildcards in quotation marks so the shell passes it to find unchanged:
• * specifies any number of any characters, such as find . -name "*.py" or find . -name "*data*.*"
• ? specifies one character, such as find . -name "??_*.*"
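Quoting the name matters when it contains wildcards: without the quotation marks, the shell may expand * itself before find ever sees it. A minimal sketch, using made-up file names in a scratch directory (every name here is illustrative):

```shell
lab_dir=$(mktemp -d)
mkdir "$lab_dir/scripts"
touch "$lab_dir/load_data.py" "$lab_dir/notes.txt" "$lab_dir/scripts/clean_data.py"
cd "$lab_dir"

# Find every .py file under the working directory, including subdirectories.
# find does not promise any particular order, so the output is sorted here.
find . -name "*.py" | sort > py-files.txt
cat py-files.txt

# ? matches exactly one character: this matches notes.txt but not the .py files.
find . -name "note?.txt"
```

With these files, py-files.txt lists ./load_data.py and ./scripts/clean_data.py, and the second search prints ./notes.txt.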

grep

• grep <pattern> <filename> searches a file for a regular expression pattern and outputs the matching lines
• The pattern should be in quotation marks to allow for multiple words.
• The pattern is case-sensitive by default, but you can use the -i option to ignore case.
• You can use wildcards in the filename to search multiple files, but it only searches the working directory (not subdirectories).
• grep -r <pattern> <path> does a recursive search (all folders under the current path) for matches within files
• Use . for the <path> to refer to the working directory.
• grep <pattern> with no filename does not search the filesystem. It waits for text typed at the keyboard (standard input), which can look as though the command has hung.
• Hit Ctrl + c if you want to cancel a search.
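To see the difference between a plain search, a case-insensitive search, and a recursive search, here is a small sketch. The files and their contents are hypothetical and live in a scratch directory (lab_dir, today.txt, and archive/old.txt are illustrative names).

```shell
lab_dir=$(mktemp -d)
mkdir "$lab_dir/archive"
cd "$lab_dir"

# Hypothetical files to search.
echo "Steak Burrito" >  today.txt
echo "steak tacos"   >> today.txt
echo "Chicken Bowl"  >> today.txt
echo "Steak Salad"   >  archive/old.txt

grep "Steak" today.txt       # case-sensitive: prints only "Steak Burrito"
grep -i "steak" today.txt    # -i ignores case: also prints "steak tacos"
grep -r "Salad" .            # -r searches subdirectories: finds archive/old.txt
```

The recursive search prefixes each matching line with the path of the file it came from, here ./archive/old.txt.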

head

• head <filename> prints the head (the first 10 lines) of the file
• head -n20 <filename> prints the first 20 lines of the file
• This is useful for previewing the contents of a large file without opening it.

hostname

• hostname displays the name of the host. This is typically displayed in the command prompt along with the current user, but in the event it’s not, the command comes in handy.

less

• less <filename> allows you to page through the file
• Hit the spacebar to go down a page, use the arrow keys to scroll up and down, and hit q to exit.

ls

• lists files and folders in your working directory
• ls -a lists all files, including hidden files
• ls -l lists the files in a long format with extra information (permissions, size, last modified date, etc.)
• ls <path> lists files in a specific directory (without changing your working directory)

mkdir

• mkdir <dirname> makes a new directory called <dirname>

mv

• mv <filename> <new path> moves a file from its current location to <new path>
• mv <filename> <new filename> renames a file without changing its location

nano <filename>

• Allows you to edit the file <filename>
• If the file does not exist, it will be created.
• Use the arrow keys to navigate the editor.
• To exit the nano editor press Ctrl+x then answer Yes or No to save the file.

> (output redirect)

• <command> > <filename> takes the output of <command> and saves it in <filename> instead of showing it to the console
• This will overwrite the file if it already exists.

>> (output append)

• <command> >> <filename> takes the output of <command> and adds it to the end of <filename>
• This will create the file if it does not yet exist.
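The difference between > and >> is easiest to see side by side. A minimal sketch in a scratch directory (lab_dir and log.txt are illustrative names):

```shell
lab_dir=$(mktemp -d)
cd "$lab_dir"

echo "first"  >  log.txt    # creates log.txt containing one line
echo "second" >> log.txt    # appends: the file now has two lines
cat log.txt

echo "third"  >  log.txt    # a single > overwrites: only "third" is left
cat log.txt
```

The first cat prints first and second; the second cat prints only third, because the earlier contents were replaced.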

| (pipe)

• <command 1> | <command 2> pipes the output of <command 1> as input into <command 2>, and then the results of <command 2> are printed to the console

pwd

• prints working directory (displays the path of the directory (folder) you are in currently)

rm -i

• rm <filename> removes (deletes) a file permanently
• rm -i <filename> removes files in interactive mode, in which you are prompted to confirm that you really want to delete the file. It’s best to always use rm -i.
• rm -ir <dirname> removes a directory and recursively deletes all of its contents

rmdir <dirname>

• removes a directory (folder) <dirname>, but it must be empty first.
• to remove the directory along with the files in it, use rm -ir <dirname>

sort

• sort <filename> sorts the lines of a file alphabetically, comparing each line from its first character onward

ssh

• The ssh utility allows you to log on remotely to another computer. You’ll use this command to access the command prompt of the Hadoop-Cluster from the Hadoop-Client virtual machine.
• ssh <user>@<host> will remote shell into <host> as user <user>.
• For example, to secure shell into Hadoop-Cluster from the Hadoop-Client, type: ssh root@hadoop-cluster

su

• su <user> lets you assume the identity of another user.
• This is commonly used to assume the super user’s identity, root.
• Add - or -l to get a logon environment for that user, e.g. su -l root
• NOTE: Unless you’re root, you will need to enter the password of the user to assume their identity.

sudo

• Placing sudo in front of a command allows you to execute that command as the root user.
• Unlike su, with sudo you enter your own password to elevate as admin, not the root password.
• Example: Type cat /etc/shadow you get “permission denied”. Type sudo cat /etc/shadow and enter your password, then you should see the contents of the shadow file!

tail

• tail <filename> prints the tail (the last 10 lines) of the file
• tail -n20 <filename> prints the last 20 lines of the file (the -n option works the same way as it does for head)

tar

• Extracts files contained in the Tar GZip archive format.
• tar -xzf <file.tar.gz> extracts the contents of a tar.gz archive.
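To try extraction you first need an archive. The sketch below creates one and then extracts it; the -czf (create) and -C (choose the extraction directory) options are not covered in this lab and are shown here only so the example is self-contained. All file and directory names are illustrative.

```shell
lab_dir=$(mktemp -d)
cd "$lab_dir"
mkdir data extracted
echo "defender,12000" > data/highscores.csv

tar -czf data.tar.gz data            # c = create, z = gzip, f = archive file name
tar -xzf data.tar.gz -C extracted    # x = extract, -C = directory to extract into
ls extracted/data                    # lists highscores.csv
```

Without -C, tar -xzf extracts into the current working directory.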

touch

• touch <filename> creates an empty file called <filename>
• This is useful for creating empty files to be edited at a later time.

unzip

• Extracts files contained in a ZIP format archive
• unzip <file.zip> extracts the contents of a .zip archive.

wc

• wc <filename> returns the count of lines, words, and bytes in a file
• wc -l <filename> counts lines
• wc -w <filename> counts words (anything delimited by whitespace), and
• wc -c <filename> only counts bytes (the same as the number of characters in plain ASCII text)
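A small worked example of the three counts. The sample text is made up for illustration, and lab_dir and sample.txt are illustrative names.

```shell
lab_dir=$(mktemp -d)
cd "$lab_dir"

# Two lines, three words, fourteen bytes (each newline counts as one byte).
printf 'one two\nthree\n' > sample.txt

wc -l sample.txt    # 2 lines
wc -w sample.txt    # 3 words
wc -c sample.txt    # 14 bytes, which for plain ASCII text is also 14 characters

# Piping the file into wc prints the number without the file name.
line_count=$(cat sample.txt | wc -l)
echo "$line_count"
```

When wc is given a file name it prints the name after the count; when it reads from a pipe there is no name to print, which is handy when you only want the number.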

whoami

• whoami displays the current user. This is typically displayed in the command prompt along with the hostname, but in the event it is not, the command comes in handy.

wget

• wget <url> downloads the contents of the url to a file.
• It’s a simpler command than curl, but lacks the flexibility.