Before you turn this problem in, make sure everything runs as expected. To do so, restart the kernel and then run all cells (in the menubar, select Kernel$\rightarrow$Restart And Run All).
Make sure you fill in any place that says "YOUR CODE HERE" or "YOUR ANSWER HERE".
import os
import os.path
import pandas as pd
datadir = "publicdata"
Q1 This question deals with the tuberculosis dataset. At no point should you rearrange the order of the rows.
Read table6.csv into a dataframe df1.
Work from df1.copy() to avoid modifying the original data frame. Store the result as df1a.
In df1a, split the column 'rate' into two new columns 'cases' (the number before the slash) and 'population' (the number after). After you're done, drop 'rate'. Store the result as df1b.
# Solution cell
# YOUR CODE HERE
raise NotImplementedError()
# Testing cell
assert df1.shape == (6,4)
assert df1a.shape == (6,3)
assert df1b.shape == (6,4)
assert df1.iloc[2,3] == '37737/172006362'
assert df1a.iloc[3,2] == '2000'
assert df1b.iloc[4,3] == '1272915272'
assert df1b.iloc[0,2] == "745"
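As a hint for the column-splitting step in Q1, here is a minimal sketch of one way to split a slash-separated string column into two columns. The frame `toy` and the names `ratio`, `part`, and `whole` are hypothetical and are not taken from table6.csv; other approaches work too.

```python
import pandas as pd

# Toy data with hypothetical names and values (not from table6.csv)
toy = pd.DataFrame({"name": ["a", "b"], "ratio": ["3/10", "7/20"]})

# Work on a copy so the original frame stays untouched
result = toy.copy()

# str.split with expand=True returns a DataFrame with one column per piece
result[["part", "whole"]] = result["ratio"].str.split("/", expand=True)

# Drop the combined column once it has been split
result = result.drop(columns="ratio")
```

Note that the pieces remain strings after the split, which is consistent with the string comparisons in the testing cell above.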
Q2 Read us_rent_income.csv into a dataframe (with "GEOID" as the index), then transform as needed to make it tidy. Store the result as df_rent.
# Solution cell
# YOUR CODE HERE
raise NotImplementedError()
# Testing cell
assert(df_rent.shape == (52,4))
assert(df_rent.iloc[0,0] == 24476.0)
assert(df_rent.iloc[0,1] == 747.0)
assert(df_rent.iloc[0,2] == 136.0)
assert(df_rent.iloc[0,3] == 3.0)
assert(df_rent.iloc[20,0] == 37147.0)
assert(df_rent.iloc[31,1] == 809.0)
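As a hint for Q2, the sketch below shows one possible way to reshape a frame in which a single column holds variable names into a frame with one column per variable. The frame `toy` and the column names `region`, `measure`, `amount`, and `margin` are hypothetical and do not come from us_rent_income.csv.

```python
import pandas as pd

# Toy long-format data: the "measure" column holds variable names
# (all names and values here are hypothetical)
toy = pd.DataFrame({
    "region": ["r1", "r1", "r2", "r2"],
    "measure": ["income", "rent", "income", "rent"],
    "amount": [100.0, 10.0, 200.0, 20.0],
    "margin": [5.0, 1.0, 6.0, 2.0],
})

# pivot spreads each distinct "measure" into its own column;
# passing a list for values yields two-level column labels
wide = toy.pivot(index="region", columns="measure",
                 values=["amount", "margin"])
```

Each row of `wide` now describes one region, and a cell can be looked up by a two-part label such as `wide.loc["r1", ("amount", "rent")]`.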
Q3 Consider the data on religions and income, gathered by Pew Research Center and hosted at this link:
https://github.com/chendaniely/pandas_for_everyone/blob/master/data/pew.csv
The data is also available as "pew.csv" in the data folder. In the code cell that follows, read the data into a DataFrame assigned to df. In the subsequent markdown cell, answer the question: Is this data in tidy data form? Explain your answer.
# Solution cell
# YOUR CODE HERE
raise NotImplementedError()
YOUR ANSWER HERE
Q4 Explore the data from the previous exercise, then list its independent variable(s) and dependent variable(s). Note: this data came from a survey that counted individuals by religion and income category.
YOUR ANSWER HERE
Q5 Transform the data from Q3 as needed to make it tidy. Store the result as df_rel.
# Solution cell
# YOUR CODE HERE
raise NotImplementedError()
# Testing cell
assert(df_rel.shape == (180,3))
assert(df_rel.iloc[0,0] == "Agnostic")
assert(df_rel.iloc[0,1] == "<$10k")
assert(df_rel.iloc[0,2] == 27)
assert(df_rel.iloc[41,0] == "Evangelical Prot")
assert(df_rel.iloc[89,1] == "$40-50k")
assert(df_rel.iloc[104,2] == 14)
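As a hint for Q5, the sketch below shows one possible way to turn column headers that are really values into a single column. The frame `wide` and the names `group`, `low`, `high`, `bracket`, and `count` are hypothetical and are not taken from pew.csv.

```python
import pandas as pd

# Toy wide-format data: "low" and "high" are values of one variable
# that ended up as column headers (hypothetical names and counts)
wide = pd.DataFrame({
    "group": ["g1", "g2"],
    "low": [1, 2],
    "high": [3, 4],
})

# melt keeps id_vars as-is and stacks the remaining columns into
# one column of former headers and one column of their values
long = wide.melt(id_vars="group", var_name="bracket", value_name="count")
```

The result has one row per (group, bracket) pair. melt stacks the original columns one after another, so all `low` rows come before all `high` rows.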