Unit 02 Lab 2: Pandas

Part 1: Overview

About Title

In this lab you will take your knowledge of Python 3 and learn how to use the Pandas and MatPlotLib libraries.

Learning Outcomes

Upon completing this lab you will be able to: - Understand the Pandas and MatPlotLib libraries - Manipulate data with Pandas - Plot data with MatPlotLib


To complete this lab you will need:

NOTE: We can’t teach you everything about Pandas and MatPlotLib, that’s why we recommend the following resources:

Before You Begin

Before you start this lab you should:

From your Hadoop Client:

  1. Open a linux command prompt.
  2. Start jupyter notebook:
    $ jupyter-notebook
  3. Open a browser and go to http://localhost:8888
    You should see the Jupyter Application in your browser.

Part 2: Walk-Though

Pandas: The Basics

Pandas is an open-source library for data analysis in Python. Pandas makes it simple to structure and manipulate data.

Getting Started

import pandas as pd
import numpy as np
from pandas import DataFrame, Series
days = Series(['Mon','Tue','Wed','Thu','Fri'])

When you execute this you will see this output:
series output

sales = DataFrame( {'Day' : ['Mon','Tue','Wed','Thu','Fri'], 'Sold': [45,55,60,70,65]})

Execute this cell to view this output:
pandas dataframe

Understanding DataFrames

Adding Columns to the DataFrame

It’s common practice add your own Series to an existing DataFrame. In this example we will calculate the Percentage Of Total for the Sold column as a Series then and create a new column PoT in our sales DataFrame.

Reading from Files and Plotting

In this final part of the Walk-Through, we will load data from the Superhero Movie Dataset perform some simple manipulations and plot the data as a means of exploring the features in this dataset.

import pandas as pd
import numpy as np
from pandas import DataFrame, Series
%matplotlib inline
sh_raw = pd.read_csv('datasets/superhero/superhero-movie-dataset-1978-2012.csv',

When you execute it, you should see this output:
superhero dataset raw
NOTE: There’s a scroll bar at the bottom of the output so you can view all the columns.

# Normalize the scores
imdb_normalized = sh.IMDB / 10   # IMDB scores out of 10
sh.insert(9,'IMDBNormalized', imdb_normalized)
rt_normalized = sh.RT/100        # RT scores out of 100
sh.insert(9, 'RTNormalized', rt_normalized)

Plots, Distrubtions and Correlations

Advanced Plotting

plt = sh.plot.scatter(x='RTNormalized',y='EstdOpeningAttendance', marker='x', color='Red', label='Rotten Tomatoes')
plt = sh.plot.scatter(x='IMDBNormalized',y='EstdOpeningAttendance', marker='o', ax=plt, color='Blue', label='IMDB')
plt.set_xlabel("Normalized Score [0-1]")
plt.set_ylabel("Estd. Opening Weekend Attendance")

Wnen you execute this code you should see the following plot:
two series plot
For each Series, we’ve included a different marker=, color=, and label= so the viewer can differentiate between IMDB and Rotten Tomatoes data points. Neato!

plt = sh.plot.scatter(x = 'Year', y='AvgTicketPriceThatYear')
z = np.polyfit(x=sh.Year,y=sh.AvgTicketPriceThatYear, deg=5)
p = np.poly1d(z)
trendline = pd.DataFrame(data=p(sh.Year), index=sh.Year)
trendline.plot.line(ax=plt, color='Green')

When you execute the cell you will see our data and the trend line:
plot with trendline
NOTE: The argument deg=5 represents a 5th degree polynomial. For a straight line, change this to deg=1 and execute again.

Test Yourself

  1. Define a Pandas Series.
  2. Define a Pandas DataFrame, use the term Series in your definition.
  3. In Pandas, how would you only show 'DC' comic movies from the sh DataFrame?
  4. In Pandas, how would you only show the Year, Title and OpeningWeekendBoxOffice columns from DataFrame sh?
  5. In Pandas, how would you only show the Year and Title of only 'Marvel' movies from DataFrame sh?
  6. Write a command to plot a line() AvgTicketPriceThatYear with Year on the x axis. Make the line Black.

Part 3 On Your Own


Open a new Notebook Called Unit02Lab2Part3 Write code in each cell to answer each question.


Write each program in its own cell.

  1. Load the 'datasets/exam-scores/exam-scores.csv' into a pandas DataFrame called scores. Note since this data set has headers, you can omit the header=None and names= arguments.
  2. Write a program to determine if there is a correlation between a Student’s score and the time it took them to complete the examination. Is there such a correlation?
  3. Filter the scores for exam version 'D' only. what does the correlation look like now? Is it different? How so?
  4. Let’s the relationship between students who Made their own study guide and their exam score. Create a variable study containing the columns 'Made Own Study Guide', 'Student Score' and removes the rows with '?' in the Made Own Study Guide column. The output should look like this:
    study guide
  5. Next we need to convert the Made Own Study Guide to a numeric value. To do this we will create a new series and add it to the DataFrame. Hint: use a list comprehension to evaluate 'Y' or 'N' and convert them to 1 or 0 respectively. When you’re done your study DataFrame should look like this:
    study guide value
  6. What is the correlation between the 'StudyGuideValue' and Exam Scores? Plot a scatter with 'StudyGuideValue' on the x axis, and include a screen shot.