Denison CS181/DA210 Homework

Before you turn this problem in, make sure everything runs as expected. To check, restart the kernel and then run all cells (in the menubar, select Kernel$\rightarrow$Restart And Run All).

Make sure you fill in every place that says "YOUR CODE HERE" or "YOUR ANSWER HERE".


In [ ]:
import os
import os.path
import pandas as pd

datadir = "publicdata"

Q1 This question deals with the tuberculosis dataset. At no point should you rearrange the order of the rows.

  1. Read table6.csv into a dataframe df1.
  2. Combine 'century' and 'yearDigits' into one column, 'year' (whose values are strings), then drop the two old columns. Use copy() to avoid modifying the original data frame. Store the result as df1a.
  3. Starting from df1a, split the column 'rate' into two new columns 'cases' (the number before the slash) and 'population' (the number after). After you're done, drop 'rate'. Store the result as df1b.
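
As a generic illustration of the two string operations described above (not a solution to this question), the sketch below concatenates two string columns and then splits a slash-separated column. The toy frame and the column names `prefix`, `suffix`, `ratio`, `label`, `numer`, and `denom` are hypothetical placeholders and are not taken from table6.csv.

```python
import pandas as pd

# Hypothetical toy data; these column names are placeholders.
toy = pd.DataFrame({
    "prefix": ["19", "20"],
    "suffix": ["99", "00"],
    "ratio": ["3/10", "7/20"],
})

# Work on a copy so the original frame is left untouched.
combined = toy.copy()
combined["label"] = combined["prefix"] + combined["suffix"]  # string concatenation
combined = combined.drop(columns=["prefix", "suffix"])

# Split each "a/b" string at the slash into two string-valued columns.
split = combined.copy()
split[["numer", "denom"]] = split["ratio"].str.split("/", expand=True)
split = split.drop(columns=["ratio"])

print(split)
```

Note that `+` concatenates only when both columns hold strings. If a CSV reader has parsed such columns as integers, leading zeros (as in "00") are lost, so it may be necessary to read the columns as strings in the first place, for example with the `dtype` parameter of `pd.read_csv`.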
In [ ]:
# Solution cell

# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
# Testing cell

assert df1.shape == (6,4)
assert df1a.shape == (6,3)
assert df1b.shape == (6,4)
assert df1.iloc[2,3] == '37737/172006362'
assert df1a.iloc[3,2] == '2000'
assert df1b.iloc[4,3] == '1272915272'
assert df1b.iloc[0,2] == "745"

Q2 Read us_rent_income.csv into a dataframe (with "GEOID" as the index), then transform as needed to make it tidy. Store the result as df_rent.
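
One common reason a table is not tidy is that a column holds variable names rather than values. The sketch below shows how `DataFrame.pivot` can spread such a column into separate columns, one row per index value. The toy data and the names `region`, `measure`, `value`, and `error` are hypothetical and are not meant to describe us_rent_income.csv.

```python
import pandas as pd

# Hypothetical long-format data: each (region, measure) pair has its own row.
long_df = pd.DataFrame({
    "region": ["A", "A", "B", "B"],
    "measure": ["height", "width", "height", "width"],
    "value": [1.0, 2.0, 3.0, 4.0],
    "error": [0.1, 0.2, 0.3, 0.4],
}).set_index("region")

# Move the names stored in "measure" into the column index.
# With no index argument, pivot keeps the existing index ("region").
wide_df = long_df.pivot(columns="measure", values=["value", "error"])

print(wide_df)
```

The result has one row per region and a two-level column index that pairs each of `value` and `error` with each measure name.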

In [ ]:
# Solution cell

# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
# Testing cell

assert(df_rent.shape == (52,4))
assert(df_rent.iloc[0,0] == 24476.0)
assert(df_rent.iloc[0,1] == 747.0)
assert(df_rent.iloc[0,2] == 136.0)
assert(df_rent.iloc[0,3] == 3.0)
assert(df_rent.iloc[20,0] == 37147.0)
assert(df_rent.iloc[31,1] == 809.0) 

Q3 Consider the data on religions and income, gathered by Pew Research Center and hosted at this link:

https://github.com/chendaniely/pandas_for_everyone/blob/master/data/pew.csv

The data is also available as "pew.csv" in the data folder. In the code cell that follows, read the data into a DataFrame assigned to df. In the subsequent markdown cell, answer the question: Is this data in tidy form? Explain your answer.
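
As a reminder of the general pattern for reading a CSV file from a folder, here is a minimal sketch. It writes a tiny, made-up CSV into a temporary directory so that it runs anywhere. The names `demo_dir`, `demo.csv`, and `demo`, and the file contents, are hypothetical. In this notebook, the folder would presumably be the `datadir` variable defined at the top.

```python
import os.path
import tempfile

import pandas as pd

# Hypothetical stand-in for a data folder: a temporary directory with a tiny CSV.
with tempfile.TemporaryDirectory() as demo_dir:
    # Join folder and filename in an OS-independent way.
    demo_path = os.path.join(demo_dir, "demo.csv")
    with open(demo_path, "w") as f:
        f.write("group,low,high\nx,1,2\ny,3,4\n")

    # Read the file into a DataFrame while the temporary folder still exists.
    demo = pd.read_csv(demo_path)

print(demo.shape)
print(demo.columns.tolist())
```

Inspecting the shape and the column names right after reading is a quick way to start judging whether each column is a variable and each row an observation.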

In [ ]:
# Solution cell

# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

Q4 Explore the data from the previous exercise, then list its independent variable(s) and dependent variable(s). Note: this data came from a survey that counted individuals by religion and income category.

YOUR ANSWER HERE

Q5 Transform df as needed to make it tidy. Store the result as df_rel.
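
When column headers are themselves values of a variable, `DataFrame.melt` is one way to gather them into rows. The sketch below uses made-up data. The names `group`, `low`, `high`, `level`, and `count` are hypothetical and do not come from pew.csv.

```python
import pandas as pd

# Hypothetical wide data: the headers "low" and "high" are values, not variables.
wide = pd.DataFrame({
    "group": ["x", "y"],
    "low": [1, 3],
    "high": [2, 4],
})

# Keep "group" as an identifier and stack the remaining columns:
# old headers go into "level", old cell values into "count".
tidy = wide.melt(id_vars="group", var_name="level", value_name="count")

print(tidy)
```

`melt` stacks the value columns one after another, so every row for `low` appears before any row for `high`, and the result has one row per (group, level) pair.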

In [ ]:
# Solution cell

# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
# Testing cell

assert(df_rel.shape == (180,3))
assert(df_rel.iloc[0,0] == "Agnostic")
assert(df_rel.iloc[0,1] == "<$10k")
assert(df_rel.iloc[0,2] == 27)
assert(df_rel.iloc[41,0] == "Evangelical Prot")
assert(df_rel.iloc[89,1] == "$40-50k")
assert(df_rel.iloc[104,2] == 14)