By Michael Fudge

**Data Analysis** is the process of systematically applying statistical and/or logical techniques to describe and illustrate, condense and recap, and evaluate data [1].

The goals of data analysis are to:

- discover useful information,
- provide insights,
- suggest conclusions and
- support decision-making.

Most data analysis problems start with a fundamental question (but they don't always have to). Examples:

- Do students who study in groups perform better on examinations that those who study alone?
- What role (if any at all) does weather play in consumer shopping habits?
- What types of passengers were most likely to survive the Titantic?
- Among American Football teams who "go for it" on 4th downs, what is their win percentage?
- Are we stocking products in warehouses closest to the customers who purchase them?

Don't think of data analysis as crunching numbers and churning out graphs.

I like to think of data analysis as "*information storytelling*." Unlike slapping a chart on a powerpoint slide, it's a full disclosure of the process:

- helping the reader understand your methodology (how you acquired and prepared the data)
- sharing your complete analysis (including things that didn't work)
- provding a narrative as to what the results mean, and most importantly
- providing an honest and accurate analysis.

You **could** do this in Excel, but Python and Jupyter Notebook offer several advantages:

- its easier to automate and update the process later on, since its code
- You can intermix code with analysis in the Jupyter notebook
- it can integrate with a variety of services and systems (because its code!).

Data analysis is a skill every information professional should have, and this is a primary reason we teach programming in the iSchool. When you can code it makes your data analysis that much better!

Pandas is a Python library for data analysis. The homepage of Pandas is http://pandas.pydata.org/.

Pandas takes the pain and suffering out of data analysis by doing a lot of the work for you. Pandas is so elegant at times, people feel it's "indistinguishable from magic." In this lesson we will try to our best to explain what Pandas is doing "behind the curtain" and expose the magic behind Pandas.

- Importing Pandas
- Pandas Data Structures
- Selecting rows and columns in a DataFrame
- Manipulating a DataFrame
- Real-world Pandas

Before you can use pandas you must import it into Python. When we import Pandas we usually alias it as `pd`

so that we don't have to type `pandas`

in our code. For example instead of typing `pandas.DataFrame()`

we can save ourselves a few keystrokes with `pd.DataFrame()`

In [2]:

```
import pandas as pd
# this turns off warning messages
import warnings
warnings.filterwarnings('ignore')
```

The key to understanding how Pandas works is to know about its underlying data structures. the series, the data frame, and the index.

A **series** is the most basic structure in Pandas. It is simply a named-list of values. Every series has a data type, expressed as a `numpy`

data type. (Pandas uses another module called `numpy`

for is implementation.)

The following example creates a series called `grades`

we use Python's **named arguments** feature to set the `data`

, `name`

and `dtype`

arguments

In [3]:

```
grades = pd.Series(data = [100,80,100,90,80,70], name = "Grades", dtype='int')
grades
```

Out[3]:

**NOTE:** We don't have to set the `dtype`

argument when we create a Series, typically the type will be *inferred* by the data types of the values in the list.

For example, here we create the same series, but instead we let Python infer the type from the list.

In [4]:

```
grades = pd.Series(data = [100,80,100,90,80,70], name = "Grades") # we left off dtype='int', assumes int anyways thanks to the list
grades
```

Out[4]:

What's with those numbers to the left of the grades? That's the **index**. An index is an ordered list of values. It's how Pandas accesses specific values in a `Series`

(or as we will see in a bit... a `DataFrame`

)

The index of the series works a lot like the index in a list or a string. For example, this code prints the first and last grades in the series.

In [5]:

```
print("first grade:", grades[0])
print("last grade:", grades[5])
```

`Series`

is **iterable** meaning we can use slice notiation and if we want loop over the values:

In [6]:

```
for grade in grades:
print(grade, end =" ")
```

The beauty of Pandas is that for most operations, you won't even need a loop! For example, we can calculate each of the following without a loop:

In [7]:

```
print("Highest grade:", grades.max())
print("Average grade:", grades.mean())
print("lowest grade:", grades.min())
print("Sum of grades:", grades.sum())
print("Count of grades", grades.count())
```

**Where did all these functions come from?** They're methods (object functions) associated with the `Series`

object. If you `dir(grades)`

you can see them!

If you want to learn more, here's the official documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.html

The **DataFrame** is simply a dictionary of series. You can think of it as a table of data that you can access and manipulate in a variety of ways. I like to think of the `DataFrame`

as a programmable spreadsheet. It has rows and columns which can be accessed and manipulated with Python.

`DataFrame`

is the most common Pandas data structure. As you'll see its expressive and versitile, making it an essential tool in data analysis.

To make a `DataFrame`

we must create a Python `dict`

of our series. We use the series name as the key and the series itself as the value.

This example creates a `DataFrame`

from two series of Student names and Grade-Point-Averages (GPA's).

In [8]:

```
names = pd.Series( data = ['Allen','Bob','Chris','Dave','Ed','Frank','Gus'])
gpas = pd.Series( data = [4.0, 3.0, 3.4, 2.8, 2.5, 3.8, 3.0])
years = pd.Series( data = ['So', 'Fr', 'Fr', 'Jr', 'Sr', 'Sr', 'Fr'])
series_dict = { 'Name': names, 'GPA': gpas, 'Year' : years } # dict of Series, keys are the series names
students = pd.DataFrame( series_dict )
students
```

Out[8]:

`Name, GPA, Year`

and the DataFrame is `GPA, Name, Year`

. No sweat! This can be fixed by including the `columns`

named argument when we create the `DataFrame`

:

In [9]:

```
students = pd.DataFrame( series_dict, columns = ['Name', 'GPA', 'Year'] )
students
```

Out[9]:

What happens when we iterate over a `DataFrame`

? Let's try it!

In [10]:

```
for column in students:
print(column)
```

It iterates over the *columns*! Why?!?!? If you recall, a `DataFrame`

is a `dict`

of `Series`

, this is what happens when you loop over a `dict`

.

So how do you loop through the rows? Use the `DataFrame`

's `to_records()`

method, of course.

This example prints out the students on the Dean's list (GPA of 3.4 or better)

In [11]:

```
for student in students.to_records():
if student['GPA'] >=3.4:
print("%s with a GPA of %.3f is on the Dean's list." % (student['Name'], student['GPA']) )
```

Of course if you've been paying attention, you might suspect that we:

**do not need to iterate over the DataFrame to do this!**

Pandas allows us to be expressive in our data selections. For example, here's a data frame of only the student names and GPA's on the dean's list.

In [12]:

```
deans_list = students[ ['Name', 'GPA'] ][ students.GPA >= 3.4 ]
deans_list
```

Out[12]:

What you just witnessed sure looks like magic, but it's not. Once you understand how Pandas works, it becomes easy to employ approaches like the one you saw above to your data analysis.

Let's dive deeper.

To select a single column we access its key from the `DataFrame`

dictionary. What we get back is a `Series`

.

For example this retrieves the GPA's:

In [13]:

```
students['GPA'] # Series
```

Out[13]:

`DataFrame`

not a series, correct? Then you need to retrieve a list of columns instead:

In [14]:

```
students[ ['GPA'] ] # DataFrame because ['GPA'] is a list
```

Out[14]:

`Name`

and `Year`

:

In [15]:

```
students[ ['Name', 'Year'] ]
```

Out[15]:

We've seen how to select columns from the `DataFrame`

so let's move on to rows.

Typically DataFrames have far more rows than can fit on your screen. When getting a sense of what your data had to offer you can use three `DataFrame`

methods.

`head(n)`

will return the first **n** rows. For example, here's the first 3 rows:

In [16]:

```
students.head(3)
```

Out[16]:

Likewise we can get the last 2 rows in the `DataFrame`

using the `tail(n)`

method:

In [17]:

```
students.tail(2)
```

Out[17]:

`DataFrame`

? The `sample(n)`

method will retrieve **n** rows at random from the data frame. Let's get 3 rows at random:

In [18]:

```
students.sample(3)
```

Out[18]:

A boolean index is a `Series`

of boolean values. For example. This boolean index describes rows in the `DataFrame`

where the student `Year`

is a freshman `'Fr'`

:

In [19]:

```
students['Year'] == 'Fr'
```

Out[19]:

`DataFrame`

it only selects the rows which are `True`

in the boolean index. For example, this is only the freshman `'Fr'`

.

In [20]:

```
students[ students['Year'] == 'Fr' ]
```

Out[20]:

I'll admit the syntax is a little awkward, but you get accustomed to it over time. It is a very expressive way to filter rows in a `DataFrame`

.

Pandas has special operators for logical and (&), or (|), and not(~).

For example, this retrieves students with a GPA between 2.5 and 3.0. Notice the parenthesis. These are required.

In [21]:

```
students[ (students.GPA >=2.5) & (students.GPA <= 3.0)]
```

Out[21]:

And as we've seen you can combine row selections using boolean indexing with column selections.

In [22]:

```
# columns boolean index
students[ ['Name', 'GPA'] ][ students.GPA >= 3.4 ]
```

Out[22]:

`DataFrame`

the same rules apply!

In [23]:

```
deans_list = students[ ['Name', 'GPA'] ][ students.GPA >= 3.4 ]
deans_list[['Name']].head(2)
```

Out[23]:

It's easy to manipulate the data in a `DataFrame`

. This typically done as part of the data preparation steps of data analysis. Sometimes we need to create new columns for a better analysis or clean up existing data. Sometimes this is referred to as **feature engineering**.

For example, this code add's a column called `'Deans List'`

The column is initially set to `'No'`

for all students:

In [24]:

```
students['Deans List'] = 'No'
students
```

Out[24]:

Next, we set `'Deans list'`

to `'Yes'`

for any student who has a GPA of 3.4 or better:

In [25]:

```
students['Deans List'][ students['GPA'] >= 3.4 ] = 'Yes'
students
```

Out[25]:

**NOTE:** If you get a warning that states: `A value is trying to be set on a copy of a slice from a DataFrame`

You get this warning because you manipulated an existing dataframe, instead of a copy.

You can also use this approach to clean up data. For example `Ed`

should be `Eddie`

.

In [26]:

```
students['Name'][students['Name'] == 'Ed'] = 'Eddie'
students
```

Out[26]:

`'Pts From Deans List'`

to calculate the number of points the student's GPA is from the `'Deans List'`

In [27]:

```
students['Pts From Deans List'] = students['GPA'] - 3.4
students
```

Out[27]:

We conclude our Pandas tour with some examples of how easily Pandas handles larger data sets. The true power of Pandas is not in creating your own `Series`

and `DataFramaes`

but loading them from files.

For example this code reads a CSV (comma-separated values) file stored on GitHub and loads it into the `customers DataFrame`

:

In [28]:

```
customers = pd.read_csv('https://raw.githubusercontent.com/mafudge/datasets/master/customers/customers.csv')
customers.head()
```

Out[28]:

In [29]:

```
weather = pd.read_csv('https://raw.githubusercontent.com/mafudge/datasets/master/weather/syracuse-ny.csv')
weather.head()
```

Out[29]:

`DataFrame.columns`

property comes in handy:

In [30]:

```
weather.columns
```

Out[30]:

`.str.startswith()`

to retrieve the rows in question.

In [31]:

```
weather[ weather['EST'] == '2015-7-4'][['EST','Min TemperatureF', 'Mean TemperatureF', 'Max TemperatureF']]
```

Out[31]:

You can even read HTML tables off a website with Pandas and import them into a `DataFrame`

!

For example this retieves a table of US cities by population, from **Wikipedia** https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population

In [32]:

```
data = pd.read_html('https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population')
data[3].head()
```

Out[32]:

The `read_html()`

method returns a list of `DataFrames`

. The assumption here is there is more than one table on the webpage. `data[3]`

represents the 4th table on the webpage, which is the one we wanted in this case.

Also the data needs to be cleaned up a bit. We will discuss how to do this in our In-Class Coding Lab.

To demonstrate how versatile the `read_html()`

method can be, let's read the schedule of classes off the iSchool website: https://ischool.syr.edu/classes

Remember that `read_html()`

returns a list of `DataFrames`

so we must select the appropriate index containing the data we want. In this case the schedule is in `data[0]`

In [40]:

```
data = pd.read_html('https://ischool.syr.edu/classes/', attrs = {'id': 'classSchedule'} )
schedule = data[0]
schedule.head()
```

Out[40]:

Unfortunately, the `schedule`

DataFrame is missing column names.

But, Pandas allows us to add these at any time. Very Cool!

In [41]:

```
schedule.columns = ['Course','Section','ClassNo','Credits','Title','Instructor','Time','Days','Room']
schedule.head()
```

Out[41]:

And now we can do some analysis, like finding classes on wednesdays:

In [42]:

```
schedule[ schedule['Days'] =='W']
```

Out[42]:

This next example demonstrates how to load json data into a `DataFrame`

. If the JSON data is structured like a list of dict, then it's quite simple. The example would work the same as if we read the json data from a file first.

We need `orient='records'`

because our JSON data is a list of values.

In [34]:

```
json_data = '[{"GPA":3.0,"Name":"Bob"},{"GPA":3.7,"Name":"Sue"},{"GPA":2.4,"Name":"Tom"}]'
tweets = pd.read_json(json_data, orient='records')
tweets.head()
```

Out[34]:

In our final example we demonstrate how to convert a list of dict in python into a pandas `DataFrame`

we will then demonstrate how to convert it back to a list of dict.

This type of scenairio is useful when you call an API and would like to load the data into pandas, or take the pandas `DataFrame`

and process it further in outside of pandas.

In [35]:

```
students_list_of_dict = [
{ 'Name' : 'Bob', 'GPA' : 3.0 },
{ 'Name' : 'Sue', 'GPA' : 3.7 },
{ 'Name' : 'Tom', 'GPA' : 2.4 }
]
students_df = pd.DataFrame(students_list_of_dict)
students_df
```

Out[35]:

`students_df`

back into a list of dict. Choosing the named argument `orient ='record'`

produces a list of dict.

In [36]:

```
students = students_df.to_dict(orient='records')
students
```

Out[36]: