Denison CS181/DA210 Homework

Before you turn this problem in, make sure everything runs as expected. To check, restart the kernel and then run all cells (in the menubar, select Kernel$\rightarrow$Restart And Run All).

Make sure you fill in any place that says "YOUR CODE HERE" or "YOUR ANSWER HERE".


In [ ]:
import os
import os.path
import pandas as pd

datadir = "publicdata"

Q1 Read CSV file topnames.csv in datadir into a data frame named topnames0, with no index. Using individual operations on the count column, find the mean, the median, and the max count, assigning to mean_counts, median_counts, and max_counts.
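As a rough illustration of the pattern (not the solution), the sketch below reads a small made-up CSV and applies one Series method at a time. The names toy, item, and qty are hypothetical, and the text is read from an in-memory string rather than from a file in datadir.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file; the real file's columns will differ.
csv_text = "item,qty\na,3\nb,5\nc,10\n"

# With no index_col argument, pandas assigns a default integer index.
toy = pd.read_csv(io.StringIO(csv_text))

# One operation per statistic, each applied to a single column (a Series).
toy_mean = toy["qty"].mean()
toy_median = toy["qty"].median()
toy_max = toy["qty"].max()

print(toy_mean, toy_median, toy_max)
```

For a file on disk, os.path.join(datadir, filename) is one common way to build the path passed to read_csv.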

In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
print("mean: ", mean_counts, ", median: ", median_counts, ", max: ", max_counts, sep="")
In [ ]:
# Testing Cell

assert True

Q2 Using the agg method on the Series for the count column, compute the same mean, median, and max in a single step, and assign the result to agg_values. In a comment in the code solution cell, note the data type of the result. This invocation may not have an exact counterpart in the book, so you may have to look up the documentation for using agg on a Series.
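A minimal sketch of calling agg on a Series with a list of function names, using a made-up Series (the name qty and the chosen functions are only for illustration). Printing the type of the result is a quick way to check what kind of object comes back.

```python
import pandas as pd

# Hypothetical toy Series, not the homework data.
s = pd.Series([3, 5, 10], name="qty")

# Passing a list of function names applies all of them in one call.
result = s.agg(["min", "max", "sum"])

print(type(result))
print(result)
```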

In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
agg_values
In [ ]:
# Testing Cell

assert True

Q3 Create a subset of topnames0 restricted to Female entries between 1960 and 1969 inclusive. Assign this to female_subset. Then use the agg function, in one step, to determine the mean and median count and the number of unique names, assigning to female_aggvalues. In a comment in the code solution cell, indicate the data type of the result.
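The following sketch shows one way to combine a boolean filter with a dictionary-style agg, on a made-up frame. The column names (year, kind, item, qty) are hypothetical, and wrapping every dictionary value in a list is a choice made here so that the shape of the result is predictable.

```python
import pandas as pd

# Hypothetical toy frame, not the homework data.
toy = pd.DataFrame({
    "year": [2000, 2001, 2002, 2001, 2003],
    "kind": ["A", "A", "A", "B", "A"],
    "item": ["x", "y", "x", "z", "w"],
    "qty": [4, 6, 8, 1, 9],
})

# Combine conditions with &; Series.between is inclusive at both ends by default.
mask = (toy["kind"] == "A") & toy["year"].between(2000, 2002)
subset = toy[mask]

# Map each column to the aggregation functions wanted for it.
summary = subset.agg({"qty": ["mean", "median"], "item": ["nunique"]})

print(type(summary))
print(summary)
```

Cells where a function was not requested for a column show up as NaN in the printed result.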

In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
female_aggvalues
In [ ]:
# Testing Cell

assert True

Q4 The constraints for selecting the rows in the last problem are based on sex and year. We often use such independent variables to set the index of a data set. Once they form the index, filtering rows uses operations on row labels (index values) rather than on column values.

Start by creating a data frame topnames with its index drawn from the columns year and sex. Then, with the goal of repeating the use of the agg function from Q3, use xs to take a cross section of topnames to get the Female entries, and then use loc to select the same years as in Q3, giving a data frame, female_subset. Finally, use agg on this data frame to determine, in one step, the mean and median count and the number of unique names, assigning to female_aggvalues.
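The index-based steps described above can be sketched on a made-up frame as follows. The column names are hypothetical, and the call to sort_index is an extra step assumed here so that label slicing behaves predictably.

```python
import pandas as pd

# Hypothetical toy frame, not the homework data.
toy = pd.DataFrame({
    "year": [2000, 2001, 2002, 2001, 2003],
    "kind": ["A", "A", "A", "B", "A"],
    "item": ["x", "y", "x", "z", "w"],
    "qty": [4, 6, 8, 1, 9],
})

# Build a two-level row index from two columns, then sort it.
indexed = toy.set_index(["year", "kind"]).sort_index()

# xs takes a cross section on one level and drops that level from the index.
only_a = indexed.xs("A", level="kind")

# loc slices by row label; label slices include both endpoints.
subset = only_a.loc[2000:2002]

print(subset)
```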

In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
female_aggvalues
In [ ]:
# Testing Cell

assert True

Q5 Read CSV file indicators2016.csv in datadir into a data frame named indicators0, with no index. Write code to add a new column popSize to indicators0 which takes value 'high' if pop > 300, 'low' if pop < 50, and 'medium' otherwise.
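One possible way to derive a three-valued label column from numeric thresholds is numpy.select, sketched below on made-up data. The column names, thresholds, and labels are hypothetical, and other approaches (for example apply with a small function) work as well.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the homework data.
toy = pd.DataFrame({
    "name": ["p", "q", "r", "s", "t"],
    "score": [5, 50, 95, 20, 80],
})

# Conditions are checked in order; rows matching none get the default label.
conditions = [toy["score"] > 75, toy["score"] < 25]
labels = ["top", "bottom"]
toy["band"] = np.select(conditions, labels, default="middle")

print(toy)
```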

In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
indicators0.head()
In [ ]:
# Testing Cell

assert True

Q6 Building on the question above, use groupby to partition indicators0 by this new column popSize, assigning to variable groupby_pop. Note in a comment the data type of groupby_pop.
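As a small sketch of what a groupby partitioning offers before any aggregation, the example below groups a made-up frame by a label column and inspects the result. The names toy, band, and score are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame, not the homework data.
toy = pd.DataFrame({
    "band": ["top", "bottom", "middle", "top", "bottom"],
    "score": [95, 5, 50, 80, 20],
})

grouped = toy.groupby("band")

n_groups = len(grouped)              # number of distinct partitions
labels = sorted(grouped.groups)      # the key that identifies each partition
top_rows = grouped.get_group("top")  # the rows belonging to one partition

print(type(grouped))
print(n_groups, labels)
```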

In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
print(len(groupby_pop))
In [ ]:
# Testing Cell

assert True

Q7 Building on the question above, aggregate this groupby partitioning, determining the number of non-missing elements for each of the columns by partition. Assign to partition_counts and also include a comment giving the data type of the result, and the row labels of the result.
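Counting non-missing values per partition can be sketched with the count aggregation, shown here on a made-up frame that contains some NaN entries. The column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing values, not the homework data.
toy = pd.DataFrame({
    "band": ["top", "top", "bottom", "bottom", "bottom"],
    "u": [1.0, np.nan, 3.0, 4.0, np.nan],
    "v": [np.nan, np.nan, 1.0, 2.0, 3.0],
})

# count() skips NaN, so each cell is the number of non-missing values
# for that column within that group.
counts = toy.groupby("band").count()

print(counts)
print(counts.index)
```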

In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
partition_counts
In [ ]:
# Testing Cell

assert True

Q8 In similar fashion, determine the mean, by partition, of pop, gdp, and cell, and the max of life, assigning to partition_aggvalues. Use the round() method of DataFrames to round the numeric values to 2 decimal places.
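A minimal sketch of a per-column aggregation on groups, followed by rounding, using made-up data. The dictionary maps each (hypothetical) column name to a single function name.

```python
import pandas as pd

# Hypothetical toy frame, not the homework data.
toy = pd.DataFrame({
    "band": ["a", "a", "b", "b", "b"],
    "u": [1.0, 2.0, 1.0, 1.0, 2.0],
    "v": [10.0, 12.5, 7.0, 9.25, 8.0],
})

# One function per column: the mean of u and the max of v, within each group.
summary = toy.groupby("band").agg({"u": "mean", "v": "max"})

# DataFrame.round returns a new frame rounded to the given number of decimals.
summary = summary.round(2)

print(summary)
```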

In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
partition_aggvalues
In [ ]:
# Testing Cell

assert True

Q9 In similar fashion, determine the mean, min, and max of gdp and life, again assigning to partition_aggvalues. Explain in the markdown cell after the code cell how the columns of this result differ from the last two questions, and why.
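When several functions are applied to several columns at once, it helps to inspect the column labels of the result directly. The sketch below does so on made-up data with hypothetical column names and only two functions.

```python
import pandas as pd

# Hypothetical toy frame, not the homework data.
toy = pd.DataFrame({
    "band": ["a", "a", "b", "b", "b"],
    "u": [1.0, 2.0, 1.0, 1.0, 2.0],
    "v": [10.0, 12.5, 7.0, 9.25, 8.0],
})

# Apply the same list of functions to each selected column.
summary = toy.groupby("band")[["u", "v"]].agg(["min", "max"])

# Inspect the column labels and pick out one aggregated column.
print(summary.columns)
u_max = summary[("u", "max")]
print(u_max)
```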

In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
partition_aggvalues
In [ ]:
# Testing Cell

assert True

YOUR ANSWER HERE