Before you turn this problem in, make sure everything runs as expected. This is a combination of restarting the kernel and then running all cells (in the menubar, select Kernel$\rightarrow$Restart And Run All).
Make sure you fill in any place that says YOUR CODE HERE
or "YOUR ANSWER HERE".
import os
import os.path
import pandas as pd
datadir = "publicdata"
Q1 In the data directory, you will find two csv files, educationTop.csv
and educationBottom.csv
, both based on data hosted by www.census.gov
. Both have the same columns, and each row is a U.S. metropolitan area, with data on population, education, and unemployment. The first csv
file contains metropolitan areas starting with A-K, and the second starting with L-Z. Read both into pandas
data frames, using the column GEO.id2
as an index. Be careful, as these files have two lines that might be considered header lines. The first one is more descriptive, and the second one is more concise and give succint variable names. So when we read from the csv, we want to skip the first line and then use the second line as the header line. The pandas
read_csv
constructor can do this for you, but you have to discover and use the correct arguments.
Concatenate these two data frames along the row dimension (with the top one on top), and call the result educationDF
.
# Solution Cell
# YOUR CODE HERE
raise NotImplementedError()
educationDF.head()
# Testing Cell
assert True
Q2 In the data directory, you will find two csv files, educationLeft.csv
and educationRight.csv
, both based on data hosted by www.census.gov
. Both have the same rows, and each row is a U.S. metropolitan area, with data on population, education, and unemployment. The first has information on individuals without a college degree, and the second has information on individuals with a college degree. Read both into pandas
data frames. The GEO.display-label
field should serve as the row index. Read these files into data frames and then concatenate these two data frames along the column dimension, and call the result educationDF2
.
The same cautions on the CSV having an informational row of headers followed by the header variable names applies to these two CSV files as well.
# Solution Cell
# YOUR CODE HERE
raise NotImplementedError()
educationDF2.head()
# Testing Cell
assert True
Q3 In the data directory, you will find two csv files, educationLeftJ.csv
and educationRightJ.csv
, both based on data hosted by www.census.gov
. In both, rows represent U.S. metropolitan area, with data on population, education, and unemployment. However, they do not have exactly the same set of rows, and the columns are totally different except for the index column Geography
. In these CSV files, there is not an "extra" header line. The row label index should come from Geography
unless you are instructed otherwise.
pandas
data frames, with names educationLeftJ
and educationRightJ
.educationLeftJ
called educationLeftOriginal
.educationLeftJ
, do a left join to bring in the data from educationRightJ
, storing your answer as dfJ
.educationLeftOriginal
, do an inner join to bring in the data from educationRightJ
, storing your answer as dfJ2
.eduLeft
and eduRight
with no meaningful index. Then, starting from eduLeft
, do an inner merge along the column Geography
, storing your answer as dfJ3
.# Solution Cell
# YOUR CODE HERE
raise NotImplementedError()
# Testing Cell
assert True
Q4 Explain the difference in the number of rows and columns in dfJ
, dfJ2
, and dfJ3
. Be specific.
YOUR ANSWER HERE