Before you turn this problem in, make sure everything runs as expected: restart the kernel and then run all cells (in the menubar, select Kernel$\rightarrow$Restart And Run All).
Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE".
import os
import os.path
import pandas as pd
datadir = "publicdata"
Q1 Read CSV file `topnames.csv` in `datadir` into a data frame named `topnames0`, with no index. Using individual operations on the `count` column, find the mean, the median, and the max count, assigning to `mean_counts`, `median_counts`, and `max_counts`.
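As a minimal sketch of the kind of operations involved, the following uses a small in-memory CSV rather than `topnames.csv`. The data, the column `name`, and the `toy_*` variable names are all hypothetical and chosen only for illustration.

```python
import io
import pandas as pd

# Hypothetical toy data standing in for a CSV file on disk
csv_text = "name,count\nAnn,10\nBob,20\nCy,30\nDee,100\n"

# With no index_col argument, read_csv assigns a default integer index
toy = pd.read_csv(io.StringIO(csv_text))

# Bracket notation is used because toy.count refers to the DataFrame.count method
toy_mean = toy["count"].mean()
toy_median = toy["count"].median()
toy_max = toy["count"].max()

print("mean: ", toy_mean, ", median: ", toy_median, ", max: ", toy_max, sep="")
```

Each call operates on a single `Series` and returns one scalar.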
# YOUR CODE HERE
raise NotImplementedError()
print("mean: ", mean_counts, ", median: ", median_counts, ", max: ", max_counts, sep="")
# Testing Cell
assert True
Q2 Using the `agg` method on the `Series` holding the column of counts, perform the same calculation of mean, median, and max in a single step, and assign to `agg_values`. Note in a comment in the code solution cell the data type of the result. This invocation may not have an exact counterpart in the book, so you may have to look up the documentation for using `agg` on a `Series`.
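One way this might look on toy data is sketched below, passing a list of function names to `Series.agg`. The series `toy_counts` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy series; not the counts from topnames.csv
toy_counts = pd.Series([10, 20, 30, 100], name="count")

# A list of function names asks for several aggregates in one call
toy_agg = toy_counts.agg(["mean", "median", "max"])

print(toy_agg)
print(type(toy_agg))  # inspect the type of the result
```

Individual aggregates can then be looked up by function name, for example `toy_agg["median"]`.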
# YOUR CODE HERE
raise NotImplementedError()
agg_values
# Testing Cell
assert True
Q3 Create a subset of `topnames0` restricted to `Female` entries between 1960 and 1969 inclusive. Assign this to `female_subset`. Then use the `agg` function, in one step, to determine the mean and median count and the number of unique names, assigning to `female_aggvalues`. In a comment in the code solution cell, indicate the data type of the result.
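The sketch below illustrates the two steps on a hypothetical toy frame: a boolean mask for the row filter, then `DataFrame.agg` with a dictionary mapping each column to a list of functions. The column name `name` is an assumption about the layout of the data, and all values are made up.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumed for illustration
toy = pd.DataFrame({
    "year": [1959, 1960, 1960, 1965, 1969, 1970],
    "sex": ["Female", "Female", "Male", "Female", "Female", "Female"],
    "name": ["Mary", "Mary", "David", "Lisa", "Lisa", "Jennifer"],
    "count": [100, 90, 85, 60, 50, 40],
})

# Series.between is inclusive at both ends by default
mask = (toy["sex"] == "Female") & toy["year"].between(1960, 1969)
toy_subset = toy[mask]

# Different columns can receive different lists of aggregation functions
toy_result = toy_subset.agg({"count": ["mean", "median"], "name": ["nunique"]})

print(toy_result)
print(type(toy_result))  # inspect the type of the result
```

Because the two columns are aggregated with different functions, some cells of the result are missing values; those can be seen in the printed output.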
# YOUR CODE HERE
raise NotImplementedError()
female_aggvalues
# Testing Cell
assert True
Q4 The constraints for selecting the rows from the last problem are based on `sex` and `year`. We often use these independent variables to set an index for a data set. Then, when we want to filter rows, our operations that use row label/index values for filtering are different.
Start by creating dataframe `topnames` with its index drawn from the columns `year` and `sex`. Then, with a goal of the same use of the `agg` function from Q3, use `xs` to take a cross section of `topnames` to get the Female entries and then use `loc` to restrict the years to 1960 through 1969 inclusive, yielding a data frame, `female_subset`. Finally, use `agg` on this data frame to, in one step, determine the mean and median count and the number of unique names, assigning to `female_aggvalues`.
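A minimal sketch of the index-based approach on hypothetical toy data follows. It assumes the year labels are sorted after the cross section, which label slicing with missing endpoints requires; the column `name` and all values are illustrative.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumed for illustration
toy0 = pd.DataFrame({
    "year": [1959, 1960, 1960, 1965, 1969, 1970],
    "sex": ["Female", "Female", "Male", "Female", "Female", "Female"],
    "name": ["Mary", "Mary", "David", "Lisa", "Lisa", "Jennifer"],
    "count": [100, 90, 85, 60, 50, 40],
})

# Two-level row index built from the independent variables
toy = toy0.set_index(["year", "sex"])

# xs selects one value of the "sex" level and drops that level,
# leaving the year as the only row label
toy_female = toy.xs("Female", level="sex")

# Label-based slicing with loc includes both endpoints
toy_subset = toy_female.loc[1960:1969]

toy_result = toy_subset.agg({"count": ["mean", "median"], "name": ["nunique"]})
print(toy_result)
```

Unlike positional slicing, the `loc` slice `1960:1969` keeps the row labeled 1969.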
# YOUR CODE HERE
raise NotImplementedError()
female_aggvalues
# Testing Cell
assert True
Q5 Read CSV file `indicators2016.csv` in `datadir` into a data frame named `indicators0`, with no index. Write code to add a new column `popSize` to `indicators0` which takes value 'high' if `pop > 300`, 'low' if `pop < 50`, and 'medium' otherwise.
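One possible way to derive such a column is sketched here with a small helper function applied element by element. The frame `toy`, the `code` column, and the helper `size_label` are hypothetical; other approaches (for example `numpy.select`) would work as well.

```python
import pandas as pd

# Hypothetical toy data; not the contents of indicators2016.csv
toy = pd.DataFrame({
    "code": ["AAA", "BBB", "CCC", "DDD"],
    "pop": [1200.0, 40.0, 50.0, 300.0],
})

def size_label(pop):
    """Map a population value to 'high', 'low', or 'medium' (hypothetical helper)."""
    if pop > 300:
        return "high"
    if pop < 50:
        return "low"
    # Boundary values 50 and 300 land here; so would a missing value,
    # since comparisons with NaN are False
    return "medium"

# Bracket notation is needed: toy.pop is the DataFrame.pop method
toy["popSize"] = toy["pop"].apply(size_label)
print(toy)
```

The boundary rows (50 and 300) show that both comparisons are strict.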
# YOUR CODE HERE
raise NotImplementedError()
indicators0.head()
# Testing Cell
assert True
Q6 Building on the question above, use `groupby` to partition `indicators0` by this new column `popSize`, assigning to variable `groupby_pop`. Note in a comment the data type of `groupby_pop`.
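The following sketch shows a `groupby` on a hypothetical toy frame and a few ways to inspect the resulting object. The data and the name `toy_groups` are illustrative only.

```python
import pandas as pd

# Hypothetical toy data with a categorical column to partition by
toy = pd.DataFrame({
    "code": ["AAA", "BBB", "CCC", "DDD"],
    "pop": [1200.0, 40.0, 50.0, 300.0],
    "popSize": ["high", "low", "medium", "medium"],
})

toy_groups = toy.groupby("popSize")

print(type(toy_groups))  # inspect the type of the grouping object
print(len(toy_groups))   # number of distinct groups

# Iterating yields (group label, sub-frame) pairs
for label, frame in toy_groups:
    print(label, len(frame))
```

No aggregation happens at this point; the object only records how the rows are partitioned.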
# YOUR CODE HERE
raise NotImplementedError()
print(len(groupby_pop))
# Testing Cell
assert True
Q7 Building on the question above, aggregate this groupby partitioning, determining the number of non-missing elements for each of the columns by partition. Assign to `partition_counts` and also include a comment giving the data type of the result, and the row labels of the result.
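As an illustration, the `count` aggregation on a grouped toy frame counts only non-missing entries per column and group. The data below is hypothetical, with missing values placed deliberately.

```python
import pandas as pd

nan = float("nan")

# Hypothetical toy data with some missing entries
toy = pd.DataFrame({
    "popSize": ["high", "high", "low", "medium"],
    "gdp": [1.0, nan, 2.0, 3.0],
    "life": [70.0, 71.0, nan, 72.0],
})

# count() skips missing values, so columns can differ within a group
toy_counts = toy.groupby("popSize").count()

print(toy_counts)
print(type(toy_counts))   # inspect the type of the result
print(toy_counts.index)   # inspect the row labels
```

In this toy frame the `high` group has two rows but only one non-missing `gdp` value.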
# YOUR CODE HERE
raise NotImplementedError()
partition_counts
# Testing Cell
assert True
Q8 In similar fashion, determine the mean, by partition, of pop, gdp, and cell, and the max of life, assigning to `partition_aggvalues`. Use the `round()` method of DataFrames to round the numeric values to 2 decimal places.
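A sketch of per-column aggregation on hypothetical toy data is shown below: a dictionary maps each column to a single function name, and `round(2)` is applied to the aggregated frame. The toy frame omits the `cell` column for brevity.

```python
import pandas as pd

# Hypothetical toy data; values chosen so rounding is visible
toy = pd.DataFrame({
    "popSize": ["high", "high", "low", "low"],
    "pop": [1000.0, 1301.0, 10.0, 20.5],
    "gdp": [10.0, 20.125, 1.0, 2.0],
    "life": [75.123, 80.456, 60.0, 61.0],
})

# One function per column: mean for pop and gdp, max for life
toy_result = (
    toy.groupby("popSize")
       .agg({"pop": "mean", "gdp": "mean", "life": "max"})
       .round(2)
)

print(toy_result)
```

With one function per column, the result keeps the original column names.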
# YOUR CODE HERE
raise NotImplementedError()
partition_aggvalues
# Testing Cell
assert True
Q9 In similar fashion, determine the mean, min, and max of `gdp` and `life`, again assigning to `partition_aggvalues`. Explain in the markdown cell after the code cell how the columns of this result differ from the last two questions, and why.
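For comparison, the sketch below applies a list of functions to several columns of a grouped toy frame and prints the column index so its structure can be inspected. The data is hypothetical.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "popSize": ["high", "high", "low", "low"],
    "gdp": [10.0, 20.0, 1.0, 2.0],
    "life": [75.0, 80.0, 60.0, 61.0],
})

# A list of functions is applied to every selected column
toy_result = toy.groupby("popSize")[["gdp", "life"]].agg(["mean", "min", "max"])

print(toy_result)
print(toy_result.columns)  # inspect how the columns are labeled

# A single aggregate is addressed by a (column, function) pair
print(toy_result.loc["low", ("gdp", "mean")])
```

Selecting `toy_result["gdp"]` returns the sub-frame of all aggregates computed for that one column.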
# YOUR CODE HERE
raise NotImplementedError()
partition_aggvalues
# Testing Cell
assert True
YOUR ANSWER HERE