Before you turn this problem in, make sure everything runs as expected. To check, restart the kernel and then run all cells (in the menubar, select Kernel$\rightarrow$Restart And Run All).
Make sure you fill in any place that says "YOUR CODE HERE" or "YOUR ANSWER HERE".
import os
import os.path
import pandas as pd
datadir = "publicdata"
Q1 This question deals with the tuberculosis dataset. At no point should you rearrange the order of the rows.
Read table6.csv into a dataframe df1.
[…] .copy() to avoid modifying the original data frame. Store the result as df1a.
In df1a, split the column 'rate' into two new columns 'cases' (the number before the slash) and 'population' (the number after). After you're done, drop 'rate'. Store the result as df1b.
# Solution cell
# YOUR CODE HERE
raise NotImplementedError()
# Testing cell
assert df1.shape == (6,4)
assert df1a.shape == (6,3)
assert df1b.shape == (6,4)
assert df1.iloc[2,3] == '37737/172006362'
assert df1a.iloc[3,2] == '2000'
assert df1b.iloc[4,3] == '1272915272'
assert df1b.iloc[0,2] == "745"
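The column split described in Q1 can be sketched on toy data. This is only an illustration under assumed names: the frame `toy` and the columns 'team', 'ratio', 'wins' and 'games' are hypothetical and are not part of the assignment's data.

```python
import pandas as pd

# Toy data (hypothetical); this is not the assignment's table6.csv.
toy = pd.DataFrame({
    "team": ["A", "B", "C"],
    "ratio": ["3/10", "7/20", "12/48"],
})

# Work on a copy so the original frame is left untouched.
toy_split = toy.copy()

# str.split with expand=True returns one column per piece.
# The pieces are still strings, not numbers.
toy_split[["wins", "games"]] = toy_split["ratio"].str.split("/", expand=True)

# Drop the combined column now that its parts have their own columns.
toy_split = toy_split.drop(columns="ratio")
```

Because the pieces remain strings, a comparison such as `toy_split.loc[0, "wins"] == "3"` holds, while a comparison against the integer 3 does not.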
Q2 Read us_rent_income.csv into a dataframe (with "GEOID" as the index), then transform as needed to make it tidy. Store the result as df_rent.
# Solution cell
# YOUR CODE HERE
raise NotImplementedError()
# Testing cell
assert(df_rent.shape == (52,4))
assert(df_rent.iloc[0,0] == 24476.0)
assert(df_rent.iloc[0,1] == 747.0)
assert(df_rent.iloc[0,2] == 136.0)
assert(df_rent.iloc[0,3] == 3.0)
assert(df_rent.iloc[20,0] == 37147.0)
assert(df_rent.iloc[31,1] == 809.0)
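When one column holds variable names and another holds their values, as can happen in data like Q2's, a pivot is one possible way to reshape it. The sketch below uses made-up CSV text read from memory; the names 'id', 'city', 'measure' and 'value' are assumptions for illustration only.

```python
import io
import pandas as pd

# Toy CSV text standing in for a file on disk; all names and values are made up.
csv_text = """id,city,measure,value
1,Avon,height,10
1,Avon,width,4
2,Bray,height,12
2,Bray,width,5
"""

# index_col makes the 'id' column the row index while reading.
long_df = pd.read_csv(io.StringIO(csv_text), index_col="id")

# pivot spreads each distinct 'measure' into its own column,
# leaving one row per id (the existing index is reused).
wide_df = long_df.pivot(columns="measure", values="value")
```

Passing a list to `values` would spread several value columns at once, at the cost of producing two-level column labels.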
Q3 Consider the data on religions and income, gathered by Pew Research Center and hosted at this link:
https://github.com/chendaniely/pandas_for_everyone/blob/master/data/pew.csv
The data is also available as "pew.csv" in the data folder. In the code cell that follows, read the data into a DataFrame assigned to df. In the subsequent markdown cell, answer the question: Is this data in tidy data form? Explain your answer.
# Solution cell
# YOUR CODE HERE
raise NotImplementedError()
YOUR ANSWER HERE
Q4 Explore the data from the previous exercise, then list its independent variable(s) and dependent variable(s). Note: this data came from a survey that counted individuals by religion and income category.
YOUR ANSWER HERE
Q5 Transform the data from Q3 as needed to make it tidy. Store the result as df_rel.
# Solution cell
# YOUR CODE HERE
raise NotImplementedError()
# Testing cell
assert(df_rel.shape == (180,3))
assert(df_rel.iloc[0,0] == "Agnostic")
assert(df_rel.iloc[0,1] == "<$10k")
assert(df_rel.iloc[0,2] == 27)
assert(df_rel.iloc[41,0] == "Evangelical Prot")
assert(df_rel.iloc[89,1] == "$40-50k")
assert(df_rel.iloc[104,2] == 14)
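When column headers are themselves values of a variable, melting is one common way to move them into rows. A minimal sketch on toy data follows; the frame `wide` and the names 'shop', 'size' and 'count' are hypothetical and unrelated to pew.csv.

```python
import pandas as pd

# Toy wide table (hypothetical): the headers 'small' and 'large' are values of a 'size' variable.
wide = pd.DataFrame({
    "shop": ["North", "South"],
    "small": [5, 2],
    "large": [1, 7],
})

# melt keeps 'shop' as an identifier and stacks the remaining columns
# into (size, count) pairs, one row per original cell.
tidy = wide.melt(id_vars="shop", var_name="size", value_name="count")
```

The melted rows come out one source column at a time: every 'small' row first, then every 'large' row, each in the original row order.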