Unit 07 Lab 2: Hive and HCatalog
Part 1: Overview
About Hive and HCatalog
In this lab you will learn how to use Hive which provides SQL-Like behavior over Hadoop.
Upon completing this lab you will be able to: - Use Hive to create tables from data in HDFS - Write Hive queries and views - Leverage HCatalog to access Hive tables from Pig - Understand the benefits of the ORC file format. - Use Apache Zeppelin to visualize output
To complete this lab you will need:
- Minidoop setup or Hortonworks Sandbox. In Minidoop, the Hadoop client is the Data Science Appliance (the hadoop-client virtual machine). In Hortonworks sandbox, the linux console on sandbox is your client.
- A clone of the mafudge/datasets repository on Github: https://github.com/mafudge/datasets this should be placed in your home directory on your Hadoop client, or in the
/rootfolder on Hortonworks sandbox.
Before You Begin
Before you start this lab you should:
- Make sure HDFS, YARN, MapReduce2 and Hive are operating properly on your Hadoop Cluster. Use Ambari to check the services and troubleshoot any issues you may have.
- Create a lab folder in HDFS for your files:
$ hdfs dfs -mkdir unit07lab2
Part 2: Walk-Though
Hive and HCatalog - What Are they?
Hive brings the relational model and SQL to Hadoop. With hive, we can define a schema to our HDFS data then query it with the
SELECT as if it were a table. The SQL Query is translated to a YARN job then executed on our Hadoop cluster.
If you’re familiar with relational databases, then you know about schemas and table catalogs. Tables schemas are stored in the Hive metastore. This is where HCatalog comes in. It’s a service which exposes the Hive metastore to other Hadoop applications. This allows us to define a single schema for our Hadoop data and avoids having to create that schema separately in other tools like Pig and Spark.
Let’s walk through a simple example using the
grades data set in our
- First let’s make a folder on HDFS:
$ hdfs dfs -mkdir unit07lab2/grades
- Then upload the just the fall grades dataset to the folder:
$ hdfs dfs -put datasets/grades/fall*.tsv unit07lab2/grades
NOTE: We’ll upload spring after we create the table in hive.
- And make sure 2 files are present:
$ hdfs dfs -ls unit07lab2/grades
- If we take a look at one of the files:
$ hdfs dfs -cat unit07lab2/grades/fall2015.tsv
You’ll notice the data are tab-separated into 5 columns:
year, semester, course, credits, and grade:
NOTE: We will need to make note of this before we can create our hive table.
Our First Hive Table
- Now it’s time to start the hive client and create a table from this data:
After some log output you will see the hive prompt:
- From the hive prompt, let’s create a database for this example:
hive> create database gradesdb;
It should be noted this does not execute a MapReduce job It merely adds to the hive metastore. You should see
OKbefore you return to the
- If you type:
hive> show databases;
You will get a list of databases in the metastore. You should see the
gradesdbdatabase we just created.
- To set this database to the active database, type:
hive> use gradesdb;
We’ve now set
gradesdbto be the current database.
- This might not be apparent to you, so let’s modify the prompt to display the current working database:
hive> set hive.cli.print.current.db=true;
And now you’ll see something a bit more useful!
- It’s time to make a hive table. There’s a few ways to do this but the most common is to create an external table, where the data already exists on HDFS and we’re merely providing a schema around it. If you recall from the earlier step, our data is tab-separated into columns year, semester, course, credits and grade. Type:
hive> CREATE EXTERNAL TABLE IF NOT EXISTS grades (
year int, semester string, course string, credits int, grade string
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
If you’ve vaguely familiar with SQL this command should make some sense. The table consists of 5 columns, It is
EXTERNALmeaning the data points to a
LOCATIONin HDFS, a
TEXTFILEdelimited by tabs
- Now that we have a table named
gradesdefined for our HDFS data at
/user/ischool/unit07lab2/gradeswe can write queries like:
hive> SELECT * FROM grades;
Which then outputs the following:
- And more complex queries, which invoke YARN MapReduce jobs using the Tez execution engine, like this one to count all of the ‘A’ grades (there should be 4):
hive> SELECT count(*) FROM grades WHERE grade='A';
- If you need to get a list of tables in your database type:
hive> SHOW TABLES;
Obviously there’s just one table in this database:
- You can view the schema of the
hive> DESCRIBE grades;
Which give you a list of column names and data types:
- To exit hive, and return to the linux comand prompt, type:
As we explained earlier HCatalog is an interface into the Hive metastore. It allows other Hadoop applications to access the hive tables. Let’s demonstrate with Pig.
Start Pig, but tell it to use HCatalog:
$ pig -useHCatalog
- From the
grunt>shell let’s load the
gradestable into a relation:
grunt> grades = LOAD 'gradesdb.grades' USING org.apache.hive.hcatalog.pig.HCatLoader();
If you type the command correctly you will see
Connected to metastore.Also notice how we had to put the database in front of the table name e.g.
- Since we’ve connected to HCatalog, the
gradesrelation should have a schema:
grunt> DUMP grades;
Notice HCatalog automatically translates to Pig’s data types for us.
- Furthermore, we can view the grades similar as we did in hive:
grunt> DUMP grades;
- To demonstrate use of the schema, let’s make a simple transformation of only the grades for the IST course prefix:
grunt> istgrades = FILTER grades BY STARTSWITH(course,'IST');
DUMPthe data to verify it works:
grunt> DUMP istgrades;
- Finally exit Pig:
Adding Data to a Hive Table
- If you can write data into a folder in HDFS, you can add data to a hive table. It’s time to add our missing spring semester of grades to HDFS:
$ hdfs dfs -put datasets/grades/spring*.tsv unit07lab2/grades
- Now let’s start hive:
- And query the
gradestable from the
hive> SELECT * FROM gradesdb.grades;
You’ll now see the spring grades now in the output:
NOTE: we queried the table without the
USEcommand by placing the database name in front of the table like this
Views are named SQL
SELECT queries. They’re quite common in Relational Database Management Systems, and supported in Hive.
- First switch to
hive> set hive.cli.print.current.db=true;
hive> USE gradesdb;
- For example, here’s a query to get the good grades:
hive (gradesdb)> SELECT * FROM grades WHERE grade in ('A','A-');
All the grades returned should be an ‘A’ or ‘A-‘.
- We can encapsulate this query logic into a view:
hive (gradesdb)> CREATE VIEW IF NOT EXISTS v_goodgrades AS
SELECT * FROM gradesdb.grades WHERE grade in ('A','A-');
- This does not execute the
SELECTstatement, but instead adds a the view to the catalog:
hive gradesdb)> SHOW TABLES;
You will now see
v_goodgrades. It is a convention to put
v_in front of view names so that you can identify them as views in the schema.
- We can now treat our view as if it were a table:
hive (gradesdb)> SELECT course, grade FROM v_goodgrades;
Which produces this output:
Make Table Queries and ORC files
One way to improve hive queries is to store data in ORC (Optimized Row Columnar) Format. ORC is a compact, efficient format for ad-hoc querying. It is quite common to take source data which is commonly queried in Hadoop and output it to a new table in this format for end users to query.
- Let’s create the new table:
hive (gradesdb)> CREATE TABLE IF NOT EXISTS gradesorc STORED AS ORC AS SELECT * FROM grades;
this executes a YARN job which copies the output of the
SELECTstatement into a new table called
You’ll notice this is not an
EXTERNALtable. This table is stored in the Hive metastore and is only accessible through HCatalog. Data are copied into the path seen in the screen shot. This is also known as an
- We can still query this table:
hive (gradesdb)> SELECT * FROM gradesorc;
- Exit Hive to free up YARN memory:
hive (gradesdb)> exit;
NOTE: It should be noted that when data is added to the
grades table this will not be reflected in the
gradeorc table automatically. We will need to
DROP the ORC table and re-create it from the other table.
Apache Zeppelin is a web based notebook application, similar to the Jupyter application we used earlier. It provides a user interface for visualizing output from a variety of Hadoop and non-Hadoop applications. We will use it to query our
gradesorc Hive table and produce visualizations.
- Logon to Ambari as the ischool user.
- From the menu bar, click on the View Menu then select Zeppelin to launch the application.
- You should now see the Zeppelin home page.
- Click on Create new note
From the dialog, type Grades for the note name and click Create Note
- Back at the Zeppelin home page,
Click on the Grades note to open it. You should see the notebook page:
- From Zeppelin we can access a variety of applications, but in this instance we want to use Hive so we will specify that. Also we’ll issue a hive command to query our table, type:
SELECT * FROM gradesdb.gradesorc
It should look like this:
- To execute the code in this paragraph and send it to Hive for processing click the Play Button to the right:
- This will send the command to Hive for processing and return the output to your browser:
- Let’s create a histogram of grades. Click on the Column Chart icon. Zeppelin defaults to a chart:
- Click Settings to configure the chart. You will now see a designer where you can manipulate the data displayed in the chart.
- Remove All existing fields from the chart.
- Drag and drop gradesorc.grade to the Keys text box.
- Drag and drop gradesorc.course to the Values box.
- Click on gradesorc.course SUM and change it to gradesorc.course COUNT
- The chart designer should look like this:
- Click Settings to close and reveal your chart:
This lab only scratches the surface of what’s possible with Zeppelin. Some other interesting notes:
- The notebook saves as you type.
- When new data arrives just press the play button to re-query it.
- You can write another query and display a different chart in a new paragraph.
- Each paragraph can use a different interpreter. You can run a Hive program in one paragraph and a Spark program in another.
- You’re encouraged to look at some of the other notebook examples included in the installation. They can be found under the Notebook menu.
IMPORTANT: Zeppelin consumes quite a bit of YARN memory, and leaves sessions open. You might need to kill running YARN jobs to reclaim available memory.
- Explain the purpose of HCatalog.
- What is the Hive command to view all tables in the current database?
- Describe the difference between and external table and an internal hive table.
- What is the difference between a hive table and a hive view?
- What is the ORC file format? List one advantage and one disadvantage of this format?
Part 3 On Your Own
In this part you will create a Zeppelin notebook from the two files in
datasets/customers which are
surveys.csv and create visualizations from Hive data.
- For more inforation on Hive commands, consult: https://cwiki.apache.org/confluence/display/Hive/LanguageManual
- For more information on Zeppelin, consult:
For each of the following steps provide the list of commands required to complete the steps.
customers.csvto the folder
- Create a Hive database called
unit07lab2create an EXTERNAL tables
surveysthat database. Provide a screenshot to prove the tables exist and have the describe the correct schema. Here’s the code:
CREATE EXTERNAL TABLE IF NOT EXISTS customers( first_name string, last_name string, email string, gender string, last_ip string, city string, state string, total_orders int, total_purchased double, months_customer int ) ROW FORMAT DELIMITED FIELDS TERMINATED BY "," LOCATION "/user/ischool/unit07lab2/customers" tblproperties ("skip.header.line.count"="1"); CREATE EXTERNAL TABLE IF NOT EXISTS surveys( email string, twitter_username string, marital_status string, household_income double, own_your_home string, education string, fav_department string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY "," LOCATION "/user/ischool/unit07lab2/surveys" tblproperties ("skip.header.line.count"="1");
skip.header.line.countdoes what you suspect: skips the first row, which contains the field names.
- Use the
SELECTstatement to figure out which column to use to
JOINthese two tables together. Write a
SELECT statementto list all columns from and rows from both tables into a single output. HINT: You must create a table aliases to join these two tables as they share similar column names.
- create a table called
customer_surveysfrom the combined query output. Store the table in the Hive Metastore in ORC format.
- Exit Hive. Create a new Zeppelin note called
Customer SurveysProduce a pie chart to break out customers by education. Provide a screenshot.
- Produce a bar chart of favorite departments grouped by gender.
- Which three departments do men prefer the most? what About women?
- How many customers surveyed have a 4 year degree?
- In the state of ‘VA’ how many customers own a home?
- Which state has the most male customers?