Denison CS181/DA210 Homework

Before you turn this problem in, make sure everything runs as expected. This is a combination of restarting the kernel and then running all cells (in the menubar, select Kernel$\rightarrow$Restart And Run All).

Make sure you fill in any place that says YOUR CODE HERE or "YOUR ANSWER HERE".


In [ ]:
import os
import os.path
import pandas as pd

datadir = "publicdata"

Q1 In the data directory, you will find two csv files, educationTop.csv and educationBottom.csv, both based on data hosted by www.census.gov. Both have the same columns, and each row is a U.S. metropolitan area, with data on population, education, and unemployment. The first csv file contains metropolitan areas starting with A-K, and the second starting with L-Z. Read both into pandas data frames, using the column GEO.id2 as an index. Be careful, as these files have two lines that might be considered header lines. The first one is more descriptive, and the second one is more concise and give succint variable names. So when we read from the csv, we want to skip the first line and then use the second line as the header line. The pandas read_csv constructor can do this for you, but you have to discover and use the correct arguments.

Concatenate these two data frames along the row dimension (with the top one on top), and call the result educationDF.

In [ ]:
# Solution Cell

# YOUR CODE HERE
raise NotImplementedError()
educationDF.head()
In [ ]:
# Testing Cell

assert True

Q2 In the data directory, you will find two csv files, educationLeft.csv and educationRight.csv, both based on data hosted by www.census.gov. Both have the same rows, and each row is a U.S. metropolitan area, with data on population, education, and unemployment. The first has information on individuals without a college degree, and the second has information on individuals with a college degree. Read both into pandas data frames. The GEO.display-label field should serve as the row index. Read these files into data frames and then concatenate these two data frames along the column dimension, and call the result educationDF2.

The same cautions on the CSV having an informational row of headers followed by the header variable names applies to these two CSV files as well.

In [ ]:
# Solution Cell

# YOUR CODE HERE
raise NotImplementedError()
educationDF2.head()
In [ ]:
# Testing Cell

assert True

Q3 In the data directory, you will find two csv files, educationLeftJ.csv and educationRightJ.csv, both based on data hosted by www.census.gov. In both, rows represent U.S. metropolitan area, with data on population, education, and unemployment. However, they do not have exactly the same set of rows, and the columns are totally different except for the index column Geography. In these CSV files, there is not an "extra" header line. The row label index should come from Geography unless you are instructed otherwise.

  1. Read both into pandas data frames, with names educationLeftJ and educationRightJ.
  2. Make a copy of educationLeftJ called educationLeftOriginal.
  3. Starting with educationLeftJ, do a left join to bring in the data from educationRightJ, storing your answer as dfJ.
  4. Starting with educationLeftOriginal, do an inner join to bring in the data from educationRightJ, storing your answer as dfJ2.
  5. Now read the original csv files in as eduLeft and eduRight with no meaningful index. Then, starting from eduLeft, do an inner merge along the column Geography, storing your answer as dfJ3.
In [ ]:
# Solution Cell

# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
# Testing Cell

assert True

Q4 Explain the difference in the number of rows and columns in dfJ, dfJ2, and dfJ3. Be specific.

YOUR ANSWER HERE