Unit 01 Lab 1: Minidoop
Part 1: Overview
The purpose of this lab is to give you an overview of the Hadoop environment we will use in this course, called Minidoop. Minidoop is the ultimate Big Data playground. It is an isolated single-user environment, making it suitable for learning Hadoop through exploration and experimentation. Minidoop is a full-featured Hadoop install on a single virtual machine, reducing the complexity inherent in multi-node setups and eliminating resource contention among other users. You have full administrative rights in Minidoop, so you can truly make it your own.
Here’s a quick video which explains the Minidoop environment in greater detail:
Minidoop is a network consisting of two virtual machines isolated on their own NAT:
- The Hadoop virtual machine is based on Hortonworks Sandbox. Sandbox is how Hortonworks, a Hadoop vendor, gets you to try Hadoop. Their VM has been tweaked to work in our environment and talk to the other VM. It’s a single-node cluster, so it will not be able to handle large data sets, but it’s the perfect environment for learning and experimenting without the added complexity of managing multiple nodes.
- The other virtual machine is a version of the iSchool’s Data Science Appliance configured to connect to the single-node Hadoop cluster. It has the Hadoop client tools, Revolution Analytics R Open, and Continuum Analytics Python Anaconda installed, which is adequate tooling for doing data science.
Upon completing this activity you will be able to:
- Understand the capabilities and architecture of Minidoop.
- Log in to and log out of the virtual machines.
- Troubleshoot network connectivity in the Minidoop environment.
To complete this activity you will need:
- The Minidoop environment. Minidoop comes in two flavors. One version runs remotely, where the virtual machines run in the iSchool’s vSphere environment. The other version runs locally, where the virtual machines are on your computer. Your instructor should make clear which version of Minidoop you will use in the course, and provide specific instructions for how to connect to and operate it.
Before You Begin
Before you start this activity make sure to:
- Power on the virtual machines in your Minidoop environment.
Part 2: Walk-Through
Let’s start this activity by walking you through some basic use cases for Minidoop.
How to Login
Logging in to the Hadoop Client
You will spend the majority of your time using the Minidoop Hadoop client virtual machine, hadoop-client. Switch to the console of this VM, where you should see the Ubuntu Linux logon prompt. Log on as user ischool with password SU2orange!. This should get you to the Ubuntu Linux desktop.
NOTE: The Minidoop client is configured to login automatically with this account at startup, but you will need to re-enter the password after a period of inactivity.
SU2orange! is also the root password on both the hadoop-client and hadoop-cluster VMs. The root account is the Linux account with the highest level of access to the system and should only be used when you need to manipulate system settings.
Let’s open up a terminal window from the hadoop-client by clicking on the terminal icon in the toolbar.
You should see the following terminal window on your desktop:
Tangent: About the Linux Command Line Prompt
The Linux command line prompt, also known simply as the command prompt or console, is the part to the left of the cursor where you start typing in the terminal window. In this case, it’s ischool@dsappliance:~$. The command prompt provides you with three important pieces of information:
- The first part of the prompt, before the @ symbol, is the current user. It says ischool because we are logged on as that user.
- The second part of the prompt, after the @ but before the : symbol, is the host name, or the name of the computer you are logged into. It says dsappliance because that is the friendly name of the hadoop-client virtual machine: The Data Science Appliance.
- The third part of the prompt, after the : but before the $ symbol, is the current file system path. It lets you know the current working folder on the filesystem. It says ~, which represents the home directory for the user.
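The three parts of the prompt can also be pulled apart mechanically. The following is only an illustrative sketch: parse_prompt is a hypothetical helper name (not part of Minidoop), and it assumes the user@host:path$ layout described above.

```shell
# Hypothetical helper: split a prompt of the form user@host:path$ into its parts.
# The body runs in a subshell ( ... ) so its variables do not leak into your session.
parse_prompt() (
  prompt="$1"
  user="${prompt%%@*}"   # everything before the first @
  rest="${prompt#*@}"    # everything after the first @
  host="${rest%%:*}"     # everything before the first :
  path="${rest#*:}"      # everything after the first :
  path="${path%\$}"      # drop the trailing $
  printf 'user=%s host=%s path=%s\n' "$user" "$host" "$path"
)

parse_prompt 'ischool@dsappliance:~$'
# prints: user=ischool host=dsappliance path=~
```

On a live system, the standard commands whoami, hostname, and pwd report the same three pieces of information directly.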
Logging in to the Hadoop cluster
There are few reasons to log on directly to the Hadoop Cluster (hadoop-cluster) virtual machine. On the rare occasion when you need to log on, here’s the procedure for logging on through the virtual machine console. (A second procedure for logging on remotely from the hadoop-client will be discussed in a future activity.)
Switch to the console of the hadoop-cluster VM, where you will see the Hortonworks Sandbox screen:
Press ALT+F5 (or on a Mac CTRL+ALT+F5) to open the logon prompt.
From the Sandbox logon prompt, log on as user root with password SU2orange!.
NOTE: Ignore the on-screen instructions which tell you to logon as root / hadoop!
After you log on successfully, you will see the following Linux command prompt: [root@sandbox ~]#. This prompt is structured a little differently from the previous one on our Hadoop client, but the same principles apply. The current user is root, the host name is sandbox, and the current working folder is ~.
How to Logout
Logging out of the console or a terminal window is easy. Simply type exit from the Linux command prompt. Practice logging in and out of both the hadoop-client and hadoop-cluster until you feel comfortable with the process.
Troubleshooting the Minidoop network
Most of the time the Minidoop setup runs flawlessly. On the rare occasion that something isn’t right, it’s good to know how to troubleshoot basic network connectivity with your setup.
After you power on the virtual machines, Minidoop should be ready to use. In rare circumstances, the hadoop-cluster might not get the correct TCP/IP address due to the timing of when the virtual machines start.
To verify the Minidoop network is working properly:
- Open a Linux command prompt on either the hadoop-client or the hadoop-cluster.
- Type the following command to ping the cluster 4 times:
$ ping -c 4 hadoop-cluster
You should get 4 replies from sandbox (the host name) on TCP/IP address 192.168.10.11.
- Type the following command to ping the hadoop-client virtual machine 4 times:
$ ping -c 4 hadoop-client
You should get 4 replies from dsappliance (the host name) on the hadoop-client’s TCP/IP address.
NOTE: You don’t type the $. It represents the command prompt itself.
You know your Minidoop setup is working properly when the ping statistics report 4 packets sent, 4 received, and 0% packet loss. Anything else indicates a problem.
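If you would rather not read the ping statistics by eye, the check can be scripted. This is a minimal sketch, not part of the Minidoop setup: packet_loss and check_host are hypothetical helper names, and the sketch assumes ping prints a summary line containing the text "N% packet loss", as the Linux ping command does.

```shell
# Hypothetical helper: read ping output on stdin and print the packet-loss percentage.
packet_loss() {
  sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Hypothetical helper: ping a host 4 times and report whether any packets were lost.
check_host() {
  loss=$(ping -c 4 "$1" 2>/dev/null | packet_loss)
  if [ "$loss" = "0" ]; then
    echo "$1: OK (0% packet loss)"
  else
    echo "$1: problem (packet loss: ${loss:-unknown}%)"
  fi
}

# Usage, from either virtual machine:
#   check_host hadoop-cluster
#   check_host hadoop-client
```

The parsing lives in its own function so it can be tried on saved ping output without touching the network.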
Correcting Network Issues
To fix the network when it is not working:
- Log on to hadoop-cluster as root.
- From the command prompt, type the following to view the network setup:
$ ifconfig
The screenshot below shows a correct setup. If inet addr: says anything other than 192.168.10.11, then you must do the following:
- ONLY DO THIS STEP IF THE IP ADDRESS ON YOUR Hadoop Cluster IS NOT 192.168.10.11. Delete the persistent-net.rules file, which stores the network configuration for this virtual machine. Type:
$ rm /etc/udev/rules.d/70-persistent-net.rules
This will delete the file. Afterward, you will need to reboot the Hadoop Cluster. Type:
$ reboot
- After the Hadoop Cluster returns to a logon page, you can try the ping commands in the troubleshooting section once more.
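Because the delete-and-reboot step should only run when the address is wrong, it can help to test the address first. This is a sketch under stated assumptions: inet_addrs and has_expected_ip are hypothetical helper names, and the sketch assumes ifconfig prints addresses in the "inet addr:" form mentioned above. It only reports the result and does not delete anything.

```shell
# The address the hadoop-cluster is expected to have, per the steps above.
EXPECTED_IP="192.168.10.11"

# Hypothetical helper: print every "inet addr:" value found in ifconfig output on stdin.
inet_addrs() {
  sed -n 's/.*inet addr:\([0-9.]*\).*/\1/p'
}

# Hypothetical helper: succeed (exit status 0) if the expected address is present.
has_expected_ip() {
  inet_addrs | grep -qxF "$EXPECTED_IP"
}

# Usage, on hadoop-cluster as root:
#   if ifconfig | has_expected_ip; then
#     echo "Address is correct; no fix needed."
#   else
#     echo "Address is wrong; follow the steps above."
#   fi
```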
Opening Multiple Command Prompts
At some point, you might need to open multiple command prompts from your hadoop-client. To open another terminal window from an existing command prompt, press CTRL+Shift+n.
- How many Virtual Machines are part of the Minidoop setup? (Don’t include the router).
- Which Linux account has the highest level of access to the system?
- For the following command prompt, identify the name of the login user, computer name, and current working directory:
- What command do you type to logout?
- What is the IP Address of the Hadoop-Client virtual machine?
- What is the IP Address of the Hadoop-Cluster virtual machine?
Part 3: On Your Own
Now that you have a basic understanding of Minidoop, it is time to put your newfound skills into practice.
Restart all virtual machines in your Minidoop setup.
Attempt the following steps, answering the questions where appropriate:
- Open a terminal window on hadoop-client. What are the host name and current user?
- Log on to the hadoop-cluster using the account with the highest level of access. What are the host name and current user?
- Test your network connectivity. Report the packet loss for both hadoop-cluster and hadoop-client. Fix your network if required. How do you know it is working?
- Logout of both command prompts.