Data Analysis with Pandas

By Michael Fudge

What is data analysis?

Data Analysis is the process of systematically applying statistical and/or logical techniques to describe and illustrate, condense and recap, and evaluate data [1].

The goals of data analysis are to:

  1. discover useful information,
  2. provide insights,
  3. suggest conclusions and
  4. support decision-making.

Some examples

Most data analysis problems start with a fundamental question (but they don't always have to). Examples:

  1. Do students who study in groups perform better on examinations than those who study alone?
  2. What role (if any at all) does weather play in consumer shopping habits?
  3. What types of passengers were most likely to survive the Titanic?
  4. Among American Football teams who "go for it" on 4th downs, what is their win percentage?
  5. Are we stocking products in warehouses closest to the customers who purchase them?

Data analysis is "information storytelling"

Don't think of data analysis as crunching numbers and churning out graphs.

I like to think of data analysis as "information storytelling." Unlike slapping a chart on a PowerPoint slide, it's a full disclosure of the process:

  1. helping the reader understand your methodology (how you acquired and prepared the data)
  2. sharing your complete analysis (including things that didn't work)
  3. providing a narrative as to what the results mean, and most importantly
  4. providing an honest and accurate analysis.

Can't I just do this in Excel?

You could do this in Excel, but Python and Jupyter Notebook offer several advantages:

  • it's easier to automate and update the process later on, since it's code
  • you can intermix code with analysis in a Jupyter notebook
  • it can integrate with a variety of services and systems (because it's code!)

Data analysis is a skill every information professional should have, and this is a primary reason we teach programming in the iSchool. When you can code, your data analysis is that much better!

What is Pandas?

Pandas is a Python library for data analysis. The homepage of Pandas is http://pandas.pydata.org/.

Pandas takes the pain and suffering out of data analysis by doing a lot of the work for you. Pandas is so elegant at times, people feel it's "indistinguishable from magic." In this lesson we will try our best to explain what Pandas is doing "behind the curtain" and expose the magic behind Pandas.

Table of Contents:

  1. Importing Pandas
  2. Pandas Data Structures
  3. Selecting rows and columns in a DataFrame
  4. Manipulating a DataFrame
  5. Real-world Pandas

1. Importing pandas

Before you can use pandas you must import it into Python. When we import Pandas we usually alias it as pd so that we don't have to type pandas in our code. For example, instead of typing pandas.DataFrame() we can save ourselves a few keystrokes with pd.DataFrame().

In [2]:
import pandas as pd

# this turns off warning messages
import warnings
warnings.filterwarnings('ignore')

2. Pandas Data Structures

The key to understanding how Pandas works is to know about its underlying data structures: the Series, the DataFrame, and the Index.

Series

A series is the most basic structure in Pandas. It is simply a named list of values. Every series has a data type, expressed as a numpy data type. (Pandas uses another module called numpy for its implementation.)

The following example creates a series called grades. We use Python's named-arguments feature to set the data, name and dtype arguments:

In [3]:
grades = pd.Series(data = [100,80,100,90,80,70], name = "Grades", dtype='int')
grades
Out[3]:
0    100
1     80
2    100
3     90
4     80
5     70
Name: Grades, dtype: int32

NOTE: We don't have to set the dtype argument when we create a Series; typically the type will be inferred from the data types of the values in the list.

For example, here we create the same series, but instead we let Python infer the type from the list.

In [4]:
grades = pd.Series(data = [100,80,100,90,80,70], name = "Grades") # we left off dtype='int', assumes int anyways thanks to the list
grades
Out[4]:
0    100
1     80
2    100
3     90
4     80
5     70
Name: Grades, dtype: int64

Index

What's with those numbers to the left of the grades? That's the index. An index is an ordered list of values. It's how Pandas accesses specific values in a Series (or as we will see in a bit... a DataFrame)

The index of the series works a lot like the index in a list or a string. For example, this code prints the first and last grades in the series.

In [5]:
print("first grade:", grades[0])
print("last grade:", grades[5])
first grade: 100
last grade: 70

As you would expect, a Series is iterable, meaning we can use slice notation and loop over the values if we want:

In [6]:
for grade in grades:
    print(grade, end =" ")
100 80 100 90 80 70 
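Here's a quick sketch of that slice notation in action, using the same grades data as above:

```python
import pandas as pd

grades = pd.Series(data=[100, 80, 100, 90, 80, 70], name="Grades")

# slice notation works just like it does on a list
middle = grades[1:4]          # the 2nd through 4th grades
print(list(middle))           # [80, 100, 90]
```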

Good News! You don't need to iterate!

The beauty of Pandas is that for most operations, you won't even need a loop! For example, we can calculate each of the following without a loop:

In [7]:
print("Highest grade:", grades.max())
print("Average grade:", grades.mean())
print("lowest grade:", grades.min())
print("Sum of grades:", grades.sum())
print("Count of grades", grades.count())
Highest grade: 100
Average grade: 86.66666666666667
lowest grade: 70
Sum of grades: 520
Count of grades 6

Where did all these functions come from? They're methods (object functions) associated with the Series object. If you dir(grades) you can see them!
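One of those methods, describe(), bundles several of these statistics into a single call; here's a minimal sketch using the same grades data:

```python
import pandas as pd

grades = pd.Series(data=[100, 80, 100, 90, 80, 70], name="Grades")

# describe() returns a Series of summary statistics in one shot
stats = grades.describe()
print(stats['count'], stats['mean'], stats['min'], stats['max'])
```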

If you want to learn more, here's the official documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.html

DataFrame

The DataFrame is simply a dictionary of series. You can think of it as a table of data that you can access and manipulate in a variety of ways. I like to think of the DataFrame as a programmable spreadsheet. It has rows and columns which can be accessed and manipulated with Python.

DataFrame is the most common Pandas data structure. As you'll see, it's expressive and versatile, making it an essential tool in data analysis.

To make a DataFrame we must create a Python dict of our series. We use the series name as the key and the series itself as the value.

This example creates a DataFrame from two series of Student names and Grade-Point-Averages (GPA's).

In [8]:
names = pd.Series( data = ['Allen','Bob','Chris','Dave','Ed','Frank','Gus'])
gpas = pd.Series( data = [4.0, 3.0, 3.4, 2.8, 2.5, 3.8, 3.0])
years = pd.Series( data = ['So', 'Fr', 'Fr', 'Jr', 'Sr', 'Sr', 'Fr'])
series_dict = { 'Name':  names, 'GPA': gpas, 'Year' : years }  # dict of Series, keys are the series names
students = pd.DataFrame( series_dict )
students
Out[8]:
GPA Name Year
0 4.0 Allen So
1 3.0 Bob Fr
2 3.4 Chris Fr
3 2.8 Dave Jr
4 2.5 Ed Sr
5 3.8 Frank Sr
6 3.0 Gus Fr

What? Pandas didn't use the same column order as the dict? We entered Name, GPA, Year and the DataFrame is GPA, Name, Year. (Older versions of pandas sorted the columns alphabetically; newer versions preserve the dict's insertion order.) No sweat! This can be fixed by including the columns named argument when we create the DataFrame:

In [9]:
students = pd.DataFrame( series_dict, columns = ['Name', 'GPA', 'Year'] )
students
Out[9]:
Name GPA Year
0 Allen 4.0 So
1 Bob 3.0 Fr
2 Chris 3.4 Fr
3 Dave 2.8 Jr
4 Ed 2.5 Sr
5 Frank 3.8 Sr
6 Gus 3.0 Fr

Iterating over a DataFrame

What happens when we iterate over a DataFrame? Let's try it!

In [10]:
for column in students:
    print(column)
Name
GPA
Year

It iterates over the columns! Why?!?!? If you recall, a DataFrame is a dict of Series, and this is what happens when you loop over a dict.
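You can see the same behavior with a plain Python dict; looping yields the keys:

```python
# a plain dict loops over its keys, just like a DataFrame loops over column names
series_dict = {'Name': ['Allen', 'Bob'], 'GPA': [4.0, 3.0], 'Year': ['So', 'Fr']}
for key in series_dict:
    print(key)   # Name, GPA, Year
```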

So how do you loop through the rows? Use the DataFrame's to_records() method, of course.

This example prints out the students on the Dean's list (GPA of 3.4 or better)

In [11]:
for student in students.to_records():
    if student['GPA'] >=3.4:
        print("%s with a GPA of %.3f is on the Dean's list." % (student['Name'], student['GPA']) )
Allen with a GPA of 4.000 is on the Dean's list.
Chris with a GPA of 3.400 is on the Dean's list.
Frank with a GPA of 3.800 is on the Dean's list.
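As an aside, pandas DataFrames also offer an iterrows() method, which yields (index, row) pairs; here's a sketch producing the same dean's list, with the student data assumed from above:

```python
import pandas as pd

students = pd.DataFrame({
    'Name': ['Allen', 'Bob', 'Chris', 'Dave', 'Ed', 'Frank', 'Gus'],
    'GPA':  [4.0, 3.0, 3.4, 2.8, 2.5, 3.8, 3.0],
})

# iterrows() yields (index, Series) pairs, one per row
deans = []
for index, row in students.iterrows():
    if row['GPA'] >= 3.4:
        deans.append(row['Name'])
print(deans)   # ['Allen', 'Chris', 'Frank']
```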

Of course if you've been paying attention, you might suspect that we:

do not need to iterate over the DataFrame to do this!

Pandas allows us to be expressive in our data selections. For example, here's a data frame of only the student names and GPA's on the dean's list.

In [12]:
deans_list = students[ ['Name', 'GPA'] ][ students.GPA >= 3.4 ]
deans_list
Out[12]:
Name GPA
0 Allen 4.0
2 Chris 3.4
5 Frank 3.8

3. Selecting rows and columns in a DataFrame

What you just witnessed sure looks like magic, but it's not. Once you understand how Pandas works, it becomes easy to employ approaches like the one you saw above to your data analysis.

Let's dive deeper.

Selecting Columns

To select a single column we access its key from the DataFrame dictionary. What we get back is a Series.

For example this retrieves the GPA's:

In [13]:
students['GPA']  # Series
Out[13]:
0    4.0
1    3.0
2    3.4
3    2.8
4    2.5
5    3.8
6    3.0
Name: GPA, dtype: float64

Oh? You wanted that as a DataFrame not a series, correct? Then you need to retrieve a list of columns instead:

In [14]:
students[ ['GPA'] ]  # DataFrame because ['GPA'] is a list
Out[14]:
GPA
0 4.0
1 3.0
2 3.4
3 2.8
4 2.5
5 3.8
6 3.0
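If you're ever unsure which structure you got back, type() settles it; a quick sketch:

```python
import pandas as pd

students = pd.DataFrame({'Name': ['Allen', 'Bob'], 'GPA': [4.0, 3.0]})

print(type(students['GPA']))     # a single key returns a Series
print(type(students[['GPA']]))   # a list of keys returns a DataFrame
```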

Likewise, you can use this notation to select more than one column, as in this example which selects Name and Year:

In [15]:
students[ ['Name', 'Year'] ]
Out[15]:
Name Year
0 Allen So
1 Bob Fr
2 Chris Fr
3 Dave Jr
4 Ed Sr
5 Frank Sr
6 Gus Fr

Selecting Rows

We've seen how to select columns from the DataFrame so let's move on to rows.

Typically DataFrames have far more rows than can fit on your screen. When getting a sense of what your data has to offer, you can use three DataFrame methods.

head(n) will return the first n rows. For example, here are the first 3 rows:

In [16]:
students.head(3)
Out[16]:
Name GPA Year
0 Allen 4.0 So
1 Bob 3.0 Fr
2 Chris 3.4 Fr

Likewise we can get the last 2 rows in the DataFrame using the tail(n) method:

In [17]:
students.tail(2)
Out[17]:
Name GPA Year
5 Frank 3.8 Sr
6 Gus 3.0 Fr

How about a random sample of the DataFrame? The sample(n) method will retrieve n rows at random from the DataFrame. Let's get 3 rows at random:

In [18]:
students.sample(3)
Out[18]:
Name GPA Year
0 Allen 4.0 So
5 Frank 3.8 Sr
4 Ed 2.5 Sr

Selecting rows with boolean indexing

A boolean index is a Series of boolean values. For example, this boolean index describes rows in the DataFrame where the student's Year is freshman ('Fr'):

In [19]:
students['Year'] == 'Fr' 
Out[19]:
0    False
1     True
2     True
3    False
4    False
5    False
6     True
Name: Year, dtype: bool

At a glance this might not seem useful. But when we apply a boolean index to a DataFrame, it selects only the rows which are True in the boolean index. For example, this selects only the freshmen ('Fr'):

In [20]:
students[ students['Year'] == 'Fr' ]
Out[20]:
Name GPA Year
1 Bob 3.0 Fr
2 Chris 3.4 Fr
6 Gus 3.0 Fr

I'll admit the syntax is a little awkward, but you get accustomed to it over time. It is a very expressive way to filter rows in a DataFrame.

Pandas has special operators for logical and (&), or (|), and not (~).

For example, this retrieves students with a GPA between 2.5 and 3.0. Notice the parentheses. These are required.

In [21]:
students[ (students.GPA >=2.5) & (students.GPA <= 3.0)]
Out[21]:
Name GPA Year
1 Bob 3.0 Fr
3 Dave 2.8 Jr
4 Ed 2.5 Sr
6 Gus 3.0 Fr
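The or (|) and not (~) operators work the same way; here's a sketch using the same student data, first selecting seniors or dean's-list GPAs, then everyone who is not a freshman (parentheses still required):

```python
import pandas as pd

students = pd.DataFrame({
    'Name': ['Allen', 'Bob', 'Chris', 'Dave', 'Ed', 'Frank', 'Gus'],
    'GPA':  [4.0, 3.0, 3.4, 2.8, 2.5, 3.8, 3.0],
    'Year': ['So', 'Fr', 'Fr', 'Jr', 'Sr', 'Sr', 'Fr'],
})

# or: seniors OR a GPA of 3.4 or better
either = students[(students.Year == 'Sr') | (students.GPA >= 3.4)]
print(either['Name'].tolist())   # ['Allen', 'Chris', 'Ed', 'Frank']

# not: everyone who is NOT a freshman
not_fr = students[~(students.Year == 'Fr')]
print(not_fr['Name'].tolist())   # ['Allen', 'Dave', 'Ed', 'Frank']
```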

And as we've seen you can combine row selections using boolean indexing with column selections.

In [22]:
#             columns            boolean index
students[ ['Name', 'GPA'] ][ students.GPA >= 3.4 ]
Out[22]:
Name GPA
0 Allen 4.0
2 Chris 3.4
5 Frank 3.8

Even more useful, we can assign these results to a variable, and since that variable is a DataFrame, the same rules apply!

In [23]:
deans_list = students[ ['Name', 'GPA'] ][ students.GPA >= 3.4 ]
deans_list[['Name']].head(2)
Out[23]:
Name
0 Allen
2 Chris

4. Manipulating a DataFrame

It's easy to manipulate the data in a DataFrame. This is typically done as part of the data preparation steps of data analysis. Sometimes we need to create new columns for a better analysis or clean up existing data. This is sometimes referred to as feature engineering.

For example, this code adds a column called 'Deans List'. The column is initially set to 'No' for all students:

In [24]:
students['Deans List'] = 'No'
students
Out[24]:
Name GPA Year Deans List
0 Allen 4.0 So No
1 Bob 3.0 Fr No
2 Chris 3.4 Fr No
3 Dave 2.8 Jr No
4 Ed 2.5 Sr No
5 Frank 3.8 Sr No
6 Gus 3.0 Fr No

Next, we set 'Deans List' to 'Yes' for any student who has a GPA of 3.4 or better:

In [25]:
students['Deans List'][ students['GPA'] >= 3.4 ] = 'Yes'
students
Out[25]:
Name GPA Year Deans List
0 Allen 4.0 So Yes
1 Bob 3.0 Fr No
2 Chris 3.4 Fr Yes
3 Dave 2.8 Jr No
4 Ed 2.5 Sr No
5 Frank 3.8 Sr Yes
6 Gus 3.0 Fr No

NOTE: If you get a warning that states A value is trying to be set on a copy of a slice from a DataFrame, it's because chained indexing like this may modify a temporary copy of the data instead of the original DataFrame.
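A common way to avoid that warning is the DataFrame's .loc indexer, which selects the rows and the column in a single operation; here's a sketch of the same update using a few of the students:

```python
import pandas as pd

students = pd.DataFrame({
    'Name': ['Allen', 'Bob', 'Chris'],
    'GPA':  [4.0, 3.0, 3.4],
})
students['Deans List'] = 'No'

# .loc[row_selector, column] = value assigns without chained indexing
students.loc[students['GPA'] >= 3.4, 'Deans List'] = 'Yes'
print(students['Deans List'].tolist())   # ['Yes', 'No', 'Yes']
```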

You can also use this approach to clean up data. For example, Ed should be Eddie.

In [26]:
students['Name'][students['Name'] == 'Ed'] = 'Eddie'
students
Out[26]:
Name GPA Year Deans List
0 Allen 4.0 So Yes
1 Bob 3.0 Fr No
2 Chris 3.4 Fr Yes
3 Dave 2.8 Jr No
4 Eddie 2.5 Sr No
5 Frank 3.8 Sr Yes
6 Gus 3.0 Fr No

Here's an example of how to create a new calculated column from an existing one. Here we create a column 'Pts From Deans List' holding the number of points each student's GPA is from the dean's list cutoff:

In [27]:
students['Pts From Deans List'] = students['GPA'] - 3.4
students
Out[27]:
Name GPA Year Deans List Pts From Deans List
0 Allen 4.0 So Yes 0.6
1 Bob 3.0 Fr No -0.4
2 Chris 3.4 Fr Yes 0.0
3 Dave 2.8 Jr No -0.6
4 Eddie 2.5 Sr No -0.9
5 Frank 3.8 Sr Yes 0.4
6 Gus 3.0 Fr No -0.4

5. Real-world Pandas

We conclude our Pandas tour with some examples of how easily Pandas handles larger data sets. The true power of Pandas is not in creating your own Series and DataFrames but in loading them from files.

Reading comma-separated values (csv) into a DataFrame

For example, this code reads a CSV (comma-separated values) file stored on GitHub and loads it into the customers DataFrame:

In [28]:
customers = pd.read_csv('https://raw.githubusercontent.com/mafudge/datasets/master/customers/customers.csv')
customers.head()
Out[28]:
First Last Email Gender Last IP Address City State Total Orders Total Purchased Months Customer
0 Al Fresco afresco@dayrep.com M 74.111.18.161 Syracuse NY 1 45 1
1 Abby Kuss akuss@rhyta.com F 23.80.125.101 Phoenix AZ 1 25 2
2 Arial Photo aphoto@dayrep.com F 24.0.14.56 Newark NJ 1 680 1
3 Bette Alott balott@rhyta.com F 56.216.127.219 Raleigh NC 6 560 18
4 Barb Barion bbarion@superrito.com F 38.68.15.223 Dallas TX 4 1590 1

Here's an example of loading weather data for Syracuse, NY from 1997 to 2015, also found on GitHub:

In [29]:
weather = pd.read_csv('https://raw.githubusercontent.com/mafudge/datasets/master/weather/syracuse-ny.csv')
weather.head()
Out[29]:
EST Max TemperatureF Mean TemperatureF Min TemperatureF Max Dew PointF MeanDew PointF Min DewpointF Max Humidity Mean Humidity Min Humidity ... Max VisibilityMiles Mean VisibilityMiles Min VisibilityMiles Max Wind SpeedMPH Mean Wind SpeedMPH Max Gust SpeedMPH PrecipitationIn CloudCover Events WindDirDegrees
0 1997-1-1 27 12.0 -2 22 4 -8 92 74 59 ... 10 9 1 14 5 NaN 0.05 6 Snow 89
1 1997-1-2 34 28.0 23 33 29 21 100 96 88 ... 9 2 0 8 4 NaN 0.08 8 Fog-Rain-Snow 82
2 1997-1-3 44 40.0 36 44 38 34 100 96 89 ... 10 4 0 15 6 NaN 0.09 8 Fog-Rain 273
3 1997-1-4 48 40.0 34 44 36 33 96 90 83 ... 10 10 8 13 4 NaN 0.00 8 Rain 80
4 1997-1-5 55 46.0 37 50 43 29 89 81 73 ... 10 10 10 21 11 30.0 0.13 8 Rain 199

5 rows × 23 columns

There are too many columns in this dataset to display on screen. This is where the DataFrame.columns property comes in handy:

In [30]:
weather.columns
Out[30]:
Index(['EST', 'Max TemperatureF', 'Mean TemperatureF', 'Min TemperatureF',
       'Max Dew PointF', 'MeanDew PointF', 'Min DewpointF', 'Max Humidity',
       'Mean Humidity', 'Min Humidity', 'Max Sea Level PressureIn',
       'Mean Sea Level PressureIn', 'Min Sea Level PressureIn',
       'Max VisibilityMiles', 'Mean VisibilityMiles', 'Min VisibilityMiles',
       'Max Wind SpeedMPH', 'Mean Wind SpeedMPH', 'Max Gust SpeedMPH',
       'PrecipitationIn', 'CloudCover', 'Events', 'WindDirDegrees'],
      dtype='object')

This example displays the min, mean and max temperatures for July 4, 2015. Notice how we compare the 'EST' column to the date string to retrieve the row in question.

In [31]:
weather[ weather['EST'] == '2015-7-4'][['EST','Min TemperatureF', 'Mean TemperatureF', 'Max TemperatureF']]
Out[31]:
EST Min TemperatureF Mean TemperatureF Max TemperatureF
6664 2015-7-4 60 69.0 78

Reading HTML tables into a DataFrame

You can even read HTML tables off a website with Pandas and import them into a DataFrame!

For example, this retrieves a table of US cities by population from Wikipedia: https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population

In [32]:
data = pd.read_html('https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population')
data[3].head()
Out[32]:
0 1 2 3 4 5 6 7 8
0 2015 rank City State[5] 2015 estimate 2010 Census Change 2014 land area 2010 population density Location
1 1 New York[6] New York 8550405 8175133 7000459040849855290♠+4.59% 7002302600000000000♠302.6 sq mi 783.8 km2 7004270120000000000♠27,012 per sq mi 10,430 km−2 40°39′51″N 73°56′19″W / 40.6643°N 73.9385°W...
2 2 Los Angeles California 3971883 3792621 7000472659936228800♠+4.73% 7002468700000000000♠468.7 sq mi 1,213.9 km2 7003809200000000000♠8,092 per sq mi 3,124 km−2 34°01′10″N 118°24′39″W / 34.0194°N 118.4108°...
3 3 Chicago Illinois 2720546 2695598 6999925508922324480♠+0.93% 7002227600000000000♠227.6 sq mi 589.6 km2 7004118420000000000♠11,842 per sq mi 4,572 km−2 41°50′15″N 87°40′54″W / 41.8376°N 87.6818°W...
4 4 Houston[7] Texas 2296224 2100263 7000933030768051430♠+9.33% 7002599600000000000♠599.6 sq mi 1,552.9 km2 7003350100000000000♠3,501 per sq mi 1,352 km−2 29°46′50″N 95°23′11″W / 29.7805°N 95.3863°W...

The read_html() method returns a list of DataFrames, since a webpage may contain more than one table. data[3] represents the 4th table on the webpage, which is the one we wanted in this case.

Also the data needs to be cleaned up a bit. We will discuss how to do this in our In-Class Coding Lab.

To demonstrate how versatile the read_html() method can be, let's read the schedule of classes off the iSchool website: https://ischool.syr.edu/classes

Remember that read_html() returns a list of DataFrames, so we must select the appropriate index containing the data we want. In this case the schedule is in data[0].

In [40]:
data = pd.read_html('https://ischool.syr.edu/classes/', attrs = {'id': 'classSchedule'} )
schedule = data[0]
schedule.head()
Out[40]:
0 1 2 3 4 5 6 7 8
0 GET300 M001 41064 3.0 Enterprise Data Analysis Michael A Leonardo 5:15pm - 8:05pm M Hinds Hall 010
1 GET302 M001 37049 3.0 Global Financial Sys Arch Frank Jr Marullo 5:15pm - 8:00pm M Hinds Hall 021
2 GET365 M001 36973 1.5 Business Value of IT Timothy D. Stedman 5:00pm - 7:50pm Tu Hinds Hall 018
3 GET400 M002 37081 3.0 Global Consulting Challenges Jason Dedrick 10:35am - 1:10pm F Hinds Hall 018
4 GET433 M001 37075 3.0 Multi-tier App. Development P Douglas Taber 8:00am - 9:20am TuTh Hinds Hall 013

Adding columns to a DataFrame

Unfortunately, the schedule DataFrame is missing column names.

But, Pandas allows us to add these at any time. Very Cool!

In [41]:
schedule.columns = ['Course','Section','ClassNo','Credits','Title','Instructor','Time','Days','Room']
schedule.head()
Out[41]:
Course Section ClassNo Credits Title Instructor Time Days Room
0 GET300 M001 41064 3.0 Enterprise Data Analysis Michael A Leonardo 5:15pm - 8:05pm M Hinds Hall 010
1 GET302 M001 37049 3.0 Global Financial Sys Arch Frank Jr Marullo 5:15pm - 8:00pm M Hinds Hall 021
2 GET365 M001 36973 1.5 Business Value of IT Timothy D. Stedman 5:00pm - 7:50pm Tu Hinds Hall 018
3 GET400 M002 37081 3.0 Global Consulting Challenges Jason Dedrick 10:35am - 1:10pm F Hinds Hall 018
4 GET433 M001 37075 3.0 Multi-tier App. Development P Douglas Taber 8:00am - 9:20am TuTh Hinds Hall 013

And now we can do some analysis, like finding classes on Wednesdays:

In [42]:
schedule[ schedule['Days'] =='W'] 
Out[42]:
Course Section ClassNo Credits Title Instructor Time Days Room
38 IST323 M001 36983 3.0 Intro to Information Security Christopher Croad 2:15pm - 5:05pm W Hinds Hall 027 Hinds Hall 117
43 IST335 M003 36978 3.0 Intro/Info Based Organizations Michael A D'Eredita 5:15pm - 8:05pm W Hall of Languages 115

Reading JSON data into a DataFrame

This next example demonstrates how to load JSON data into a DataFrame. If the JSON data is structured like a list of dicts, it's quite simple. The example would work the same if we first read the JSON data from a file.

We need orient='records' because our JSON data is a list of records.

In [34]:
json_data = '[{"GPA":3.0,"Name":"Bob"},{"GPA":3.7,"Name":"Sue"},{"GPA":2.4,"Name":"Tom"}]'

tweets = pd.read_json(json_data, orient='records')
tweets.head()
Out[34]:
GPA Name
0 3.0 Bob
1 3.7 Sue
2 2.4 Tom
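The reverse direction works too: to_json() with orient='records' serializes a DataFrame back into the same shape of JSON string; a quick sketch:

```python
import pandas as pd

df = pd.DataFrame({'GPA': [3.0, 3.7], 'Name': ['Bob', 'Sue']})

# orient='records' writes one JSON object per row
json_text = df.to_json(orient='records')
print(json_text)
```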

Converting a list of dictionaries to a pandas DataFrame (and back)

In our final example we demonstrate how to convert a list of dicts in Python into a pandas DataFrame, and then how to convert it back to a list of dicts.

This type of scenario is useful when you call an API and would like to load the data into pandas, or when you want to take a pandas DataFrame and process it further outside of pandas.

In [35]:
students_list_of_dict = [
    { 'Name' : 'Bob', 'GPA' : 3.0 },
    { 'Name' : 'Sue', 'GPA' : 3.7 },
    { 'Name' : 'Tom', 'GPA' : 2.4 }
]

students_df = pd.DataFrame(students_list_of_dict)
students_df
Out[35]:
GPA Name
0 3.0 Bob
1 3.7 Sue
2 2.4 Tom

Now let's convert the students_df back into a list of dicts. Choosing the named argument orient='records' produces a list of dicts.

In [36]:
students = students_df.to_dict(orient='records')
students
Out[36]:
[{'GPA': 3.0, 'Name': 'Bob'},
 {'GPA': 3.7, 'Name': 'Sue'},
 {'GPA': 2.4, 'Name': 'Tom'}]