Unit 07 Lab 2: Hive and HCatalog

Part 1: Overview

About Hive and HCatalog

In this lab you will learn how to use Hive, which provides SQL-like behavior over Hadoop.

Learning Outcomes

Upon completing this lab you will be able to:

- Use Hive to create tables from data in HDFS
- Write Hive queries and views
- Leverage HCatalog to access Hive tables from Pig
- Understand the benefits of the ORC file format
- Use Apache Zeppelin to visualize output

Requirements

To complete this lab you will need:

Before You Begin

Before you start this lab you should:

- Make sure HDFS, YARN, MapReduce2 and Hive are operating properly on your Hadoop cluster. Use Ambari to check the services and troubleshoot any issues you may have.
- Create a lab folder in HDFS for your files:
$ hdfs dfs -mkdir unit07lab2

Part 2: Walk-Through

Hive and HCatalog - What Are they?

Hive brings the relational model and SQL to Hadoop. With Hive, we can define a schema over our HDFS data and then query it with SELECT statements as if it were a table. The SQL query is translated to a YARN job and executed on our Hadoop cluster.

If you’re familiar with relational databases, then you know about schemas and table catalogs. Table schemas are stored in the Hive metastore. This is where HCatalog comes in. It’s a service which exposes the Hive metastore to other Hadoop applications. This allows us to define a single schema for our Hadoop data and avoids having to create that schema separately in other tools like Pig and Spark.

Let’s walk through a simple example using the grades data set in our datasets folder.

Setup

  1. First let’s make a folder on HDFS:
    $ hdfs dfs -mkdir unit07lab2/grades
  2. Then upload just the fall grades dataset to the folder:
    $ hdfs dfs -put datasets/grades/fall*.tsv unit07lab2/grades
    NOTE: We’ll upload spring after we create the table in hive.
  3. And make sure 2 files are present:
    $ hdfs dfs -ls unit07lab2/grades
  4. If we take a look at one of the files:
    $ hdfs dfs -cat unit07lab2/grades/fall2015.tsv
    You’ll notice the data are tab-separated into 5 columns:
    year, semester, course, credits, and grade:
    fall2015 grades output
    NOTE: Make note of this layout; we will need it to create our Hive table.

Our First Hive Table

  1. Now it’s time to start the hive client and create a table from this data:
    $ hive
    After some log output you will see the hive prompt:
    hive>
  2. From the hive prompt, let’s create a database for this example:
    hive> create database gradesdb;
    Note that this does not execute a MapReduce job; it merely adds an entry to the Hive metastore. You should see OK before you return to the hive> prompt.
  3. If you type:
    hive> show databases;
    You will get a list of databases in the metastore. You should see the gradesdb database we just created.
  4. To set this database to the active database, type:
    hive> use gradesdb;
    We’ve now set gradesdb to be the current database.
  5. This might not be apparent to you, so let’s modify the prompt to display the current working database:
    hive> set hive.cli.print.current.db=true;
    And now you’ll see something a bit more useful!
    hive (gradesdb)>
  6. It’s time to make a Hive table. There are a few ways to do this, but the most common is to create an external table, where the data already exists on HDFS and we’re merely providing a schema around it. If you recall from the earlier step, our data is tab-separated into columns year, semester, course, credits and grade. Type:
    hive> CREATE EXTERNAL TABLE IF NOT EXISTS grades (
    year int, semester string, course string, credits int, grade string
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION '/user/ischool/unit07lab2/grades';
    If you’re even vaguely familiar with SQL, this command should make some sense. The table consists of 5 columns; it is EXTERNAL, meaning the data points to a LOCATION in HDFS, stored as a TEXTFILE with fields terminated by tabs '\t'.
  7. Now that we have a table named grades defined for our HDFS data at /user/ischool/unit07lab2/grades we can write queries like:
    hive> SELECT * FROM grades;
    Which then outputs the following:
    select all from grades
  8. And more complex queries, which invoke YARN jobs using the Tez execution engine, like this one to count all of the ‘A’ grades (there should be 4):
    hive> SELECT count(*) FROM grades WHERE grade='A';
    select runs yarn job
  9. If you need to get a list of tables in your database type:
    hive> SHOW TABLES;
    Obviously there’s just one table in this database: grades
  10. You can view the schema of the grades table with:
    hive> DESCRIBE grades;
    Which gives you a list of column names and data types:
    describe grades
  11. To exit Hive and return to the Linux command prompt, type:
    hive> quit;
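Before moving on, it is worth inspecting the table metadata Hive stores for us. The sketch below uses DESCRIBE FORMATTED; the exact output layout varies by Hive version, but it should confirm the table is EXTERNAL and points at our HDFS folder rather than Hive’s warehouse directory:

```sql
-- Show extended metadata for the grades table. In the output, the
-- Location row should read /user/ischool/unit07lab2/grades and the
-- Table Type row should read EXTERNAL_TABLE (layout varies by version).
DESCRIBE FORMATTED grades;
```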

Understanding HCatalog

  1. As we explained earlier, HCatalog is an interface into the Hive metastore. It allows other Hadoop applications to access Hive tables. Let’s demonstrate with Pig.

  2. Start Pig, but tell it to use HCatalog:
    $ pig -useHCatalog

  3. From the grunt> shell let’s load the grades table into a relation:
    grunt> grades = LOAD 'gradesdb.grades' USING org.apache.hive.hcatalog.pig.HCatLoader();
    If you typed the command correctly you will see Connected to metastore. Also notice how we had to prefix the table name with the database name, e.g. gradesdb.grades.
  4. Since we’ve connected to HCatalog, the grades relation should have a schema:
    grunt> DESCRIBE grades;
    pig describe grades
    Notice HCatalog automatically translates to Pig’s data types for us.
  5. Furthermore, we can view the grades much as we did in Hive:
    grunt> DUMP grades;
    pig dump grades
  6. To demonstrate use of the schema, let’s make a simple transformation of only the grades for the IST course prefix:
    grunt> istgrades = FILTER grades BY STARTSWITH(course,'IST');
    Then DUMP the data to verify it works:
    grunt> DUMP istgrades;
    pig dump ist grades
  7. Finally exit Pig:
    grunt> quit

Adding Data to a Hive Table

  1. If you can write data into a folder in HDFS, you can add data to a hive table. It’s time to add our missing spring semester of grades to HDFS:
    $ hdfs dfs -put datasets/grades/spring*.tsv unit07lab2/grades
  2. Now let’s start hive:
    $ hive
  3. And query the grades table from the gradesdb database:
    hive> SELECT * FROM gradesdb.grades;
    You’ll now see the spring grades in the output:
    spring grades in output
    NOTE: We queried the table without the USE command by placing the database name in front of the table name, like this: gradesdb.grades
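Fully qualified names work in any statement, not just SELECT *. As a quick sketch (assuming both fall and spring files have been uploaded as above), a GROUP BY query verifies the new data arrived without switching databases:

```sql
-- Count rows per semester; after the spring upload you should see
-- one output row for each semester value present in the data files.
SELECT semester, count(*) FROM gradesdb.grades GROUP BY semester;
```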

Views

Views are named SQL SELECT queries. They’re quite common in Relational Database Management Systems, and supported in Hive.

  1. First enable the current-database prompt and switch to gradesdb:
    hive> set hive.cli.print.current.db=true;
    hive> USE gradesdb;
  2. For example, here’s a query to get the good grades:
    hive (gradesdb)> SELECT * FROM grades WHERE grade in ('A','A-');
    All the grades returned should be an ‘A’ or ‘A-‘. hive_query01
  3. We can encapsulate this query logic into a view:
    hive (gradesdb)> CREATE VIEW IF NOT EXISTS v_goodgrades AS
    SELECT * FROM gradesdb.grades WHERE grade in ('A','A-');
  4. This does not execute the SELECT statement, but instead adds the view to the catalog:
    hive (gradesdb)> SHOW TABLES;
    You will now see grades and v_goodgrades. It is a convention to put v_ in front of view names so that you can identify them as views in the schema.
  5. We can now treat our view as if it were a table:
    hive (gradesdb)> SELECT course, grade FROM v_goodgrades;
    Which produces this output:
    selet v_goodgrades
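Views can encapsulate more than simple filters. As a hypothetical example (the view name v_grade_counts is ours for illustration, not part of the lab), a view can wrap an aggregate query too:

```sql
-- Hypothetical view: count how many times each grade appears.
CREATE VIEW IF NOT EXISTS v_grade_counts AS
SELECT grade, count(*) AS grade_count
FROM gradesdb.grades
GROUP BY grade;

-- Query it like a table; this runs the underlying aggregate query.
SELECT * FROM v_grade_counts;
```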

Make Table Queries and ORC files

One way to improve hive queries is to store data in ORC (Optimized Row Columnar) Format. ORC is a compact, efficient format for ad-hoc querying. It is quite common to take source data which is commonly queried in Hadoop and output it to a new table in this format for end users to query.

  1. Let’s create the new table:
    hive (gradesdb)> CREATE TABLE IF NOT EXISTS gradesorc STORED AS ORC AS SELECT * FROM grades;
    This executes a YARN job which copies the output of the SELECT statement into a new table called gradesorc.
    create gradeorc
    You’ll notice this is not an EXTERNAL table. Its data is copied into Hive’s warehouse path, seen in the screenshot, and is managed by Hive, so it should only be accessed through Hive or HCatalog. This is also known as an INTERNAL (managed) table.
  2. We can still query this table:
    hive (gradesdb)> SELECT * FROM gradesorc;
  3. Exit Hive to free up YARN memory:
    hive (gradesdb)> exit;

NOTE: When data is added to the grades table, it will not be reflected in the gradesorc table automatically. We will need to DROP the ORC table and re-create it from the grades table.
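A sketch of that refresh, assuming no other tables or views depend on gradesorc:

```sql
-- Rebuild the ORC copy after new files land in the grades folder.
DROP TABLE IF EXISTS gradesorc;
CREATE TABLE IF NOT EXISTS gradesorc STORED AS ORC AS
SELECT * FROM gradesdb.grades;
```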

Zeppelin

Apache Zeppelin is a web based notebook application, similar to the Jupyter application we used earlier. It provides a user interface for visualizing output from a variety of Hadoop and non-Hadoop applications. We will use it to query our gradesorc Hive table and produce visualizations.

  1. Logon to Ambari as the ischool user.
  2. From the menu bar, click on the Views menu, then select Zeppelin to launch the application.
  3. You should now see the Zeppelin home page.
    Zeppelin home Page
  4. Click on Create new note
    create new note
    From the dialog, type Grades for the note name and click Create Note
  5. Back at the Zeppelin home page,
    Zeppelin home Page
    Click on the Grades note to open it. You should see the notebook page:
    grades notebook page
  6. From Zeppelin we can access a variety of applications, but in this instance we want to use Hive so we will specify that. Also we’ll issue a hive command to query our table, type:
    %hive
    SELECT * FROM gradesdb.gradesorc
    It should look like this:
    select in Zeppelin
  7. To execute the code in this paragraph and send it to Hive for processing click the Play Button to the right:
    select in Zeppelin
  8. This will send the command to Hive for processing and return the output to your browser: select output in zeppelin
  9. Let’s create a histogram of grades. Click on the Column Chart column chart icon. Zeppelin defaults to a chart:
    chart default
  10. Click Settings to configure the chart. You will now see a designer where you can manipulate the data displayed in the chart.
    chart designer
  11. Remove all existing fields from the chart.
  12. Drag and drop gradesorc.grade to the Keys text box.
  13. Drag and drop gradesorc.course to the Values box.
  14. Click on gradesorc.course SUM and change it to gradesorc.course COUNT
  15. The chart designer should look like this:
    final chart designer
  16. Click Settings to close and reveal your chart:
    final chart

This lab only scratches the surface of what’s possible with Zeppelin. Some other interesting notes:

IMPORTANT: Zeppelin consumes quite a bit of YARN memory, and leaves sessions open. You might need to kill running YARN jobs to reclaim available memory.

Test Yourself

  1. Explain the purpose of HCatalog.
  2. What is the Hive command to view all tables in the current database?
  3. Describe the difference between an external table and an internal Hive table.
  4. What is the difference between a hive table and a hive view?
  5. What is the ORC file format? List one advantage and one disadvantage of this format.

Part 3: On Your Own

Instructions

In this part you will create a Zeppelin notebook from the two files in datasets/customers which are customers.csv and surveys.csv and create visualizations from Hive data.

Tips:

For each of the following steps provide the list of commands required to complete the steps.

Steps

  1. Upload customers.csv to the folder unit07lab2/customers/ and surveys.csv to unit07lab2/surveys/ on HDFS.
  2. Create a Hive database called unit07lab2, then create EXTERNAL tables customers and surveys in that database. Provide a screenshot to prove the tables exist and have the correct schema. Here’s the code:
CREATE EXTERNAL TABLE IF NOT EXISTS customers(
    first_name string,
    last_name string,
    email string,
    gender string,
    last_ip string,
    city string,
    state string,
    total_orders int,
    total_purchased double,
    months_customer int
  )
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ","
  LOCATION "/user/ischool/unit07lab2/customers"
  tblproperties ("skip.header.line.count"="1");

CREATE EXTERNAL TABLE IF NOT EXISTS surveys(
    email string,
    twitter_username string,
    marital_status string,
    household_income double,
    own_your_home string,
    education string,
    fav_department string
  )
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ","
  LOCATION "/user/ischool/unit07lab2/surveys"
  tblproperties ("skip.header.line.count"="1");

The skip.header.line.count property does what you suspect: it skips the first row, which contains the field names.

  3. Use the SELECT statement to figure out which column to use to JOIN these two tables together. Write a SELECT statement to list all columns and rows from both tables in a single output. HINT: You must create table aliases to join these two tables, as they share similar column names.
  4. Create a table called customer_surveys from the combined query output. Store the table in the Hive metastore in ORC format.
  5. Exit Hive. Create a new Zeppelin note called Customer Surveys. Produce a pie chart to break out customers by education. Provide a screenshot.
  6. Produce a bar chart of favorite departments grouped by gender.
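For the JOIN step, the alias syntax looks roughly like the sketch below. Here join_column is a placeholder for the column you identify in the data, not its real name:

```sql
-- Hypothetical join skeleton: c and s are table aliases, which let you
-- disambiguate the column names the two tables share.
SELECT c.*, s.*
FROM customers c
JOIN surveys s ON (c.join_column = s.join_column);
```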

Questions

  1. Which three departments do men prefer the most? What about women?
  2. How many customers surveyed have a 4 year degree?
  3. In the state of ‘VA’ how many customers own a home?
  4. Which state has the most male customers?