Before you turn this problem in, make sure everything runs as expected. First, restart the kernel (in the menubar, select Kernel$\rightarrow$Restart) and then run all cells (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says "YOUR CODE HERE" or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [ ]:
NAME = ""

In [ ]:
import os
import os.path
import pandas as pd

datadir = "publicdata"
In [ ]:
data = {'animal': ['cat','cat','snake','dog',
'age': [2.5, 3, 0.5, 7, 5, 2, 4.5, 4, 7, 3],
'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'priority': ['yes','yes','no','yes','no', 
             'no','no','yes','no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 
          'f', 'g', 'h', 'i', 'j']

df = pd.DataFrame(data, index=labels)

Q: Create a column vector agevisit (a Series) which, for each row, is the age of the animal divided by the number of visits for the animal.
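
As an illustration of the technique only, the sketch below divides one column by another on a small, hypothetical data frame (`toy`, `distance`, `trips`, and `rate` are made-up names, not part of the assignment). Dividing two columns works element by element and yields a Series that keeps the row labels.

```python
import pandas as pd

# Hypothetical toy frame, unrelated to the assignment data
toy = pd.DataFrame({'distance': [10.0, 9.0, 4.0],
                    'trips': [2, 3, 4]},
                   index=['x', 'y', 'z'])

# Column / column is computed row by row and returns a Series
rate = toy['distance'] / toy['trips']

print(type(rate).__name__)  # Series
print(rate['x'])            # 5.0
```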

In [ ]:
raise NotImplementedError()

Q: Create a dataframe 'df2' with only the rows where the age is greater than 3.
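
One common way to select rows by a condition is a boolean mask. The sketch below uses a hypothetical frame (`toy`, `item`, `price` are assumed names) rather than the assignment data.

```python
import pandas as pd

# Hypothetical toy frame, unrelated to the assignment data
toy = pd.DataFrame({'item': ['pen', 'ink', 'pad', 'cap'],
                    'price': [1.5, 4.0, 3.0, 6.5]})

# The comparison yields one boolean per row
mask = toy['price'] > 3

# Indexing with the mask keeps only the rows where it is True
expensive = toy[mask]

print(expensive.shape)        # (2, 2)
print(list(expensive.index))  # [1, 3]
```

Note that the row with price exactly 3.0 is excluded, since the comparison is strict.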

In [ ]:
raise NotImplementedError()

Q: From the original data frame, project the animal and visits columns from those rows where priority is yes. Assign the result to df3.
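
Row selection and column projection can be combined in a single `loc` call: a boolean mask picks the rows and a list of labels picks the columns. The sketch below assumes a hypothetical frame (`toy`, `in_stock`, and `stocked` are made-up names).

```python
import pandas as pd

# Hypothetical toy frame, unrelated to the assignment data
toy = pd.DataFrame({'item': ['pen', 'ink', 'pad', 'cap'],
                    'price': [1.5, 4.0, 3.0, 6.5],
                    'in_stock': ['yes', 'no', 'yes', 'no']})

# Rows: where in_stock equals 'yes'; columns: item and price only
stocked = toy.loc[toy['in_stock'] == 'yes', ['item', 'price']]

print(stocked.shape)          # (2, 2)
print(list(stocked.columns))  # ['item', 'price']
```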

In [ ]:
raise NotImplementedError()

Q: The code below defines a List of Lists (plus a list of header/column names). The data is based on data from R for Data Science by Garrett Grolemund and Hadley Wickham. Sometimes "missing data" is indicated by a sentinel value, and in this case, the integer -1 is used for this purpose. (Note that this is a poor practice and we will learn better ways in the future.)
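
To see one reason a sentinel value is risky, consider this small sketch with made-up numbers (`counts` is a hypothetical Series, not the assignment data). If the sentinel is not filtered out first, it is silently treated as a real measurement.

```python
import pandas as pd

# Hypothetical values; -1 stands for "missing"
counts = pd.Series([4, -1, 6])

# The sentinel is averaged in as if it were a real count
naive_mean = counts.mean()

# Filtering the sentinel out first gives the mean of the real values
valid = counts[counts != -1]
clean_mean = valid.mean()

print(naive_mean)  # 3.0
print(clean_mean)  # 5.0
```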

In the single cell that follows, perform the following sequence of operations, making sure to assign the intermediate results to the variables as specified. Operations are not cumulative. Each of the operations performed should start with the original data frame. Note that most operations create a new data frame as part of their operation, but if you need to explicitly create a copy, you can use the copy() method.
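
The difference between a second name for the same data frame and an explicit copy can be sketched as follows (`toy`, `alias`, and `snapshot` are hypothetical names used only for illustration).

```python
import pandas as pd

# Hypothetical toy frame, unrelated to the assignment data
toy = pd.DataFrame({'n': [1, 2, 3]})

alias = toy            # a second name for the same object
snapshot = toy.copy()  # an independent copy of the data

# Changing the copy leaves the original untouched
snapshot.loc[0, 'n'] = 99

# Changing through the alias changes the original
alias.loc[1, 'n'] = 50

print(toy['n'].tolist())       # [1, 50, 3]
print(snapshot['n'].tolist())  # [99, 2, 3]
```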

  1. Use pandas to read this List of Lists into a dataframe called df, with column labels given by colnames.
  2. Extract the shape of the data into variables called nrows and ncols.
  3. Extract the list of cases into a variable called L (hint: be very careful about type).
  4. Define a list of booleans K where K[i] is False exactly when cases[i] represents missing data.
  5. Create a new data frame df2, via slicing and iloc, corresponding to the rows that are about Brazil and the columns for year and cases. Your data frame should have two rows and two columns.
  6. Define a new dataframe df3 consisting of only the rows of df without missing data.
  7. Use a projection to define a new dataframe df4 consisting of only the columns without missing data. Use loc.
  8. Create a new data frame df5 whose row names are of the form X_year where X is the first letter of the country name, e.g., the first row name should be A_1999 and the last C_2001.
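
The kinds of operations listed above can be sketched on a separate, made-up table. Everything in this block (`rows`, `header`, `toy`, the city data, and all result names) is hypothetical and chosen only for illustration; it is one possible approach under those assumptions, not the expected solution.

```python
import pandas as pd

# Hypothetical toy data; -1 marks a missing value
rows = [['Oslo',  '2020', 12, -1],
        ['Oslo',  '2021', -1, 40],
        ['Paris', '2020', 30, 55],
        ['Paris', '2021', 31, 60]]
header = ['city', 'year', 'sales', 'staff']

# Build a data frame from a list of lists plus column labels
toy = pd.DataFrame(rows, columns=header)

# shape is a (rows, columns) tuple
n_rows, n_cols = toy.shape

# tolist() gives a plain Python list rather than a Series
sales = toy['sales'].tolist()
present = [v != -1 for v in sales]   # False where the sentinel appears

# Positional slicing: rows 2..3, columns 1..2 (end positions excluded)
paris = toy.iloc[2:4, 1:3]

# Rows in which no column holds the sentinel
complete_rows = toy[(toy != -1).all(axis=1)]

# Columns in which no row holds the sentinel, selected by label with loc
keep = (toy != -1).all(axis=0)
complete_cols = toy.loc[:, keep]

# New row labels from the first letter of city plus the year;
# work on a copy so that toy itself is unchanged
relabeled = toy.copy()
relabeled.index = [c[0] + '_' + y for c, y in zip(toy['city'], toy['year'])]

print(present)                # [True, False, True, True]
print(complete_rows.shape)    # (2, 4)
print(list(relabeled.index))  # ['O_2020', 'O_2021', 'P_2020', 'P_2021']
```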
In [ ]:
# Provided data initialization

data = [['Afghanistan',  '1999',    745,  19987071, 0],
        ['Afghanistan',  '2000',   2666,  20595360, 1],
        ['Afghanistan',  '2001',   -1,  31527618, 0],
        [     'Brazil',  '1999',  37737, 172006362, 2],
        [     'Brazil',  '2000',  80488, 174504898, 0],
        [      'China',  '1999', 212258, 972915272, 1],
        [      'China',  '2000', 213766, 980428583, 2],
        [      'China',  '2001', 215626,      -1, 1]]

colnames = ['country', 'year', 'cases', 'population', 'category']
In [ ]:
raise NotImplementedError()
In [ ]:
# Testing cell

assert df.loc[1,'year'] == '2000'
assert df.loc[7,'cases'] == 215626.0
assert df.loc[2,'category'] == 0
assert df.loc[3,'country'] == 'Brazil'
assert list(df.columns) == ['country', 'year', 'cases', 'population', 'category']
assert nrows == 8
assert ncols == 5
assert len(L) == 8
assert 2666.0 in L
assert type(L) == list
assert type(K) == list
assert K[2] == False
assert K[6] == True
assert df2.loc[3,'year'] == '1999'
assert df2.shape == (2,2)
assert df2.loc[4,'cases'] == 80488.0
assert df3.shape == (6,5)
assert df3.loc[1,'year'] == '2000'
assert df3.loc[6,'cases'] == 213766.0
assert df3.loc[3,'country'] == 'Brazil'
assert df4.shape == (8,3)
assert df4.loc[1,'year'] == '2000'
assert df4.loc[6,'category'] == 2
assert df4.loc[3,'country'] == 'Brazil'
assert df5.shape == (8,5)
assert list(df5.index) == ['A_1999', 'A_2000', 'A_2001', 'B_1999', 'B_2000', 'C_1999', 'C_2000', 'C_2001']