Unit 06 Lab 1: HDFS
Part 1: Overview
In this lab you will learn about the Hadoop Distributed File System, or HDFS. You’ll discover how it works, best practices, and more.
Upon completing this lab you will be able to:
- Verify HDFS is running and check the health of the HDFS file system from the command line and from Ambari
- Get data in and out of HDFS using the HDFS commands.
- Understand the conventions and default behaviors of working with HDFS data.
- Understand strategies for streaming data into HDFS.
To complete this lab you will need:
- Minidoop setup or Hortonworks Sandbox. In Minidoop, the Hadoop client is the Data Science Appliance (the hadoop-client virtual machine). In the Hortonworks Sandbox, your computer is the Hadoop client.
- A clone of the mafudge/datasets repository from GitHub: https://github.com/mafudge/datasets. This should be placed in your home directory on your Hadoop client.
Before you Begin
Please complete the following prior to starting this lab:
- Log in to your Hadoop client and open a Linux command prompt.
- Check to make sure HDFS is up and running, and passes a service check. Consult the Ambari lab for details.
- Make sure your datasets repository is up to date by issuing a `git pull`.
So many Commands, so little time!
NOTE This Walk-Through attempts to familiarize you with the most frequently used HDFS commands. There are far too many to cover otherwise. Also, we won’t focus on command syntax or structure, but instead cover what each command does, how it is used, and why. A more comprehensive list of file system commands and options can be found on the Hadoop project website, which I encourage you to reference while completing the lab.
Part 2: Walk-Through
Getting Your Bearings
Let’s start as we usually do by getting our bearings. From the shell of your Hadoop client, try these:
- Type `$ hdfs dfs` and observe the output. Notice this command is the same as `$ hadoop fs`. The general syntax of this command is `$ hdfs dfs <command>`, where `<command>` is one of the many commands listed in the output.
- Let’s see what files are on HDFS: `$ hdfs dfs -ls`. It looks similar to the Linux `ls -l` command, except it displays the files on the Hadoop file system, not the local file system. You might be wondering where we are in the Hadoop file system. The root? Nope. The answer is the `ischool` home directory. Why? The current Linux user is `ischool`.
- The absolute path to the `ischool` home directory on HDFS is `/user/ischool`; notice this is different from the Linux file system default of `/home/ischool`. Thus this command is the same as the previous. Type: `$ hdfs dfs -ls /user/ischool`. The output should be the same as `$ hdfs dfs -ls`.
- Let’s see what’s in the root of the HDFS file system. Type `$ hdfs dfs -ls /` and you’ll see output similar to this:
This is the folder structure from the Hortonworks installation. Other Hadoop distributions will vary, of course.
NOTE: For the Hadoop file system, there is no `cd` command. That is because there is no shell or command prompt for HDFS. There is only executing Hadoop commands through the Linux command prompt (or any other Hadoop client, for that matter).
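Because there is no shell state, every command carries its own path, and relative paths resolve against your HDFS home directory. A quick sketch, assuming the `ischool` user from this lab:

```shell
# There is no "cd" in HDFS; every command takes an explicit path.
# Relative paths resolve against your HDFS home: /user/<username>.
hdfs dfs -ls                  # lists /user/ischool (your HDFS home)
hdfs dfs -ls /user/ischool    # same listing, via the absolute path
hdfs dfs -ls /                # the HDFS root
```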
Working With Files in HDFS
For the next two sections of this walk-through we will use the `grades` data set in the `datasets` folder on your Hadoop client.
- Let’s begin by creating a directory in HDFS for this lab. All of our work will go in this directory. Type: `$ hdfs dfs -mkdir unit06lab1`
- We use the `-put` command to copy files into HDFS. Let’s load the `fall2015` grades into our `unit06lab1` folder on HDFS: `$ hdfs dfs -put datasets/grades/fall2015.tsv unit06lab1/fall2015.tsv`
- Let’s check to make sure we did that correctly by listing the files in `unit06lab1`: `$ hdfs dfs -ls unit06lab1`. You should see this:
- What happens if we repeat the same command? Will it overwrite the file on HDFS? Let’s try it out: `$ hdfs dfs -put datasets/grades/fall2015.tsv unit06lab1/fall2015.tsv`. You’ll notice it does not work and we get an error: `put: ‘fall2015.tsv’: File exists`. HDFS will not overwrite files by design.
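If you genuinely want to replace an existing file, `-put` accepts a `-f` (force) flag that overwrites the destination; a minimal sketch:

```shell
# -f forces -put to overwrite an existing HDFS file.
# Without it, the command fails with "File exists".
hdfs dfs -put -f datasets/grades/fall2015.tsv unit06lab1/fall2015.tsv
```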
- Suppose we want to view what’s in the file on HDFS? We use the `-cat` command for this. Type: `$ hdfs dfs -cat unit06lab1/fall2015.tsv` to view the contents of the file on HDFS. There should be 5 lines in the file; the output should look like this:
- Let’s make a copy of the file on HDFS. Type: `$ hdfs dfs -cp unit06lab1/fall2015.tsv unit06lab1/grades.tsv`, then do an `-ls unit06lab1` to verify there are now two files, like this:
- Now let’s delete the `fall2015.tsv` file on HDFS. Type: `$ hdfs dfs -rm unit06lab1/fall2015.tsv`. An `-ls` will now reveal a single file.
- How do we download the file from HDFS back to the client? That’s the job of the `-get` command: `$ hdfs dfs -get unit06lab1/grades.tsv grades.tsv`. You will now notice the file `grades.tsv` is local. Verify with `$ cat grades.tsv`.
- Before we move on to the next step, let’s remove the `grades.tsv` file from HDFS: `$ hdfs dfs -rm unit06lab1/grades.tsv`
- Not really deleting, is it? This is the second time we’ve deleted something, and you’ve probably noticed that when we delete a file from HDFS it is really moving the file to the Trash folder. You can always get a listing of files in the `.Trash` folder to see what’s there. Type: `$ hdfs dfs -ls .Trash/Current/user/ischool`. You’ll see the two files we deleted are here.
NOTE: HDFS Trash tips: Files remain in the Trash folder for 24 hours. We can use the `-cp` or `-mv` commands to copy or move files out of the trash. We can also add the `-skipTrash` option to the `-rm` command to delete a file immediately.
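Putting those tips together, a sketch of the common Trash operations (paths assume this lab’s `ischool` user):

```shell
# Restore a deleted file by moving it out of the Trash:
hdfs dfs -mv .Trash/Current/user/ischool/unit06lab1/grades.tsv unit06lab1/grades.tsv

# Delete immediately, bypassing the Trash:
hdfs dfs -rm -skipTrash unit06lab1/grades.tsv

# Empty your Trash right away instead of waiting for expiry:
hdfs dfs -expunge
```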
HDFS Best Practices
A common practice for HDFS storage is to place files of a common data set into the same folder. Thus the `fall2015.tsv`, `spring2016.tsv`, and `fall2016.tsv` files should all be in the same `grades` folder, as they represent the same thing. We will use this practice frequently in future labs.
- Let’s start by making a folder for the data set:
$ hdfs dfs -mkdir unit06lab1/grades
- Let’s upload all the files into the grades folder: `$ hdfs dfs -put datasets/grades/* unit06lab1/grades`. The asterisk `*` represents all the files in the source folder. There should be 3 files in the folder.
- Verify the files are there with an `-ls`: `$ hdfs dfs -ls unit06lab1/grades`. You should see the three files:
- We know we can `-cat` one file from HDFS, but now that all the grades are in a folder, we can view the contents of the folder as a single file: `$ hdfs dfs -cat unit06lab1/grades/*`. For output, you should see all grades for Fall 2015, Spring 2016, and Fall 2016.
- There’s also an HDFS command to `-get` the files and return them as a single file, in this case `-getmerge`: `$ hdfs dfs -getmerge unit06lab1/grades/* allgrades.tsv`
Since a MapReduce job creates one file per reducer, this command is useful for exporting that output from HDFS.
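For example, a job’s output folder typically holds one `part-r-NNNNN` file per reducer, and `-getmerge` concatenates them into one local file. The folder name below is hypothetical:

```shell
# wordcount-output is a hypothetical MapReduce output folder
# containing part-r-00000, part-r-00001, ...
hdfs dfs -getmerge wordcount-output/ wordcount-results.txt
```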
- You can verify the merged file, which is now on the local file system, is the contents of all the files on HDFS: the output of `$ cat allgrades.tsv` should be the same as the output of `$ hdfs dfs -cat unit06lab1/grades/*`.
HDFS Block Storage
We learned through class lecture that HDFS splits files into blocks and those blocks are then distributed to data nodes managed by HDFS. Let’s explore how this works:
- From your home directory, verify the data file `sr20160401.csv` is present in your `datasets/nyc311` folder: `$ ls -l datasets/nyc311`. Notice the size of the file is 3.9 MB.
- As is customary, let’s make an HDFS directory for this data set:
$ hdfs dfs -mkdir nyc311
- Let’s try to split this file into 500-byte blocks as we upload it to HDFS. Type:
$ hdfs dfs -D dfs.blocksize=500 -put datasets/nyc311/sr20160401.csv nyc311/
You’ll get an error: `Specified block size is less than configured minimum value (dfs.namenode.fs-limits.min-block-size): 500 < 1048576`. This error tells us the block size of 500 is lower than the configured minimum for HDFS, which is currently set to 1048576.
- Okay. Let’s try a larger block size, like 1200500. Try this: `$ hdfs dfs -D dfs.blocksize=1200500 -put datasets/nyc311/sr20160401.csv nyc311/`. And we get another error: `Invalid values: dfs.bytes-per-checksum (=512) must divide block size (=1200500)`. HDFS is telling us that the block size we specify must be at least 1048576 and a multiple of 512.
- Let’s do it one last time with a block size of 1048576. Type:
$ hdfs dfs -D dfs.blocksize=1048576 -put datasets/nyc311/sr20160401.csv nyc311/
- Hooray! The command worked. Verify the file is in the `nyc311` folder on HDFS: `$ hdfs dfs -ls nyc311/`
- Let’s take a look at the blocks that make up this file. Type: `$ hdfs fsck nyc311/sr20160401.csv`. You should see 4 blocks. All the blocks are under-replicated. This makes sense since we only have a single node in our Hadoop cluster and a minimum of 3 nodes are required for replication. The blocks are implemented as files on the actual data node.
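To see exactly where each block lives, `fsck` accepts extra flags that print per-file block and datanode detail:

```shell
# -files     : report on each file checked
# -blocks    : print the block IDs and sizes
# -locations : show which datanodes hold each replica
hdfs fsck nyc311/sr20160401.csv -files -blocks -locations
```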
HDFS Health Check
It’s useful at times to get a health check report of the HDFS file system. To accomplish this, type: `$ hdfs dfsadmin -report`. The same information can be acquired through the Namenode UI in Ambari at: http://sandbox:50070/
Questions
- Why are there no `cd` commands for HDFS?
- Explain what happens to a file in HDFS when you delete it.
- What is the convention for storing files in HDFS, specifically which files belong in which folders?
- What is the minimum block size in HDFS? Block sizes must be a multiple of which number?
- What is the default replication factor in HDFS? Why are all blocks on your Hadoop cluster under-replicated?
Part 3: On Your Own
- Upload all the reddit news files from the datasets repository to a `redditnews` folder on HDFS.
- Upload the `clickstream` dataset to HDFS. Create a `clickstream` folder and then two folders inside it, one of which is `logs`. Upload all the `.log` files to the `logs` folder, and the `ip_lookup.csv` file to the other folder.
- The `datasets/customers` folder has two files in it. Examine the contents of each file and then devise a strategy for how these files should be uploaded to HDFS. Remember HDFS best practices as you decide how to process the files.
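As a starting point for the exercises above, note that `-mkdir` accepts a `-p` flag that creates nested folders in one step, which is handy for the clickstream layout (the paths below follow the folder names used in this lab):

```shell
# -p creates missing parent folders, like Linux mkdir -p
hdfs dfs -mkdir -p clickstream/logs

# Upload all the log files into the nested folder
hdfs dfs -put datasets/clickstream/*.log clickstream/logs/
```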