Before you turn this problem in, make sure everything runs as expected. First, restart the kernel (in the menubar, select Kernel$\rightarrow$Restart) and then run all cells (in the menubar, select Cell$\rightarrow$Run All).
Make sure you fill in any place that says "YOUR CODE HERE" or "YOUR ANSWER HERE", as well as your name and collaborators below:
NAME = ""
COLLABORATORS = ""
import os
import os.path
import pandas as pd
datadir = "publicdata"
data = {'animal': ['cat', 'cat', 'snake', 'dog',
                   'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, 7, 5, 2, 4.5, 4, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no',
                     'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e',
          'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=labels)
df
Q: Create a column vector agevisit (a Series) which, for each row, is the age of the animal divided by the number of visits for the animal.
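As a hint, here is a minimal sketch of element-wise column arithmetic on a small, made-up DataFrame. The names toy, distance, hours, and speed are hypothetical and are not part of the exercise data.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the exercise's DataFrame.
toy = pd.DataFrame({"distance": [10.0, 6.0, 9.0],
                    "hours": [2, 3, 3]},
                   index=["x", "y", "z"])

# Dividing one column by another works row by row, aligned on the index.
# The result is a Series that carries the same row labels.
speed = toy["distance"] / toy["hours"]
print(speed)
```

The same pattern applies to any pair of numeric columns that share an index.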
# YOUR CODE HERE
raise NotImplementedError()
agevisit
Q: Create a dataframe 'df2' with only the rows where the age is greater than 3.
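As a hint, row filtering is usually done with a boolean mask. The sketch below uses a made-up DataFrame; toy, item, price, and expensive are hypothetical names chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the exercise's DataFrame.
toy = pd.DataFrame({"item": ["pen", "book", "lamp", "desk"],
                    "price": [1.5, 12.0, 30.0, 150.0]})

# A comparison on a column yields a boolean Series (one True/False per row).
mask = toy["price"] > 10

# Indexing with that mask keeps only the rows where the mask is True.
expensive = toy[mask]
print(expensive)
```

Note that the surviving rows keep their original index labels.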
# YOUR CODE HERE
raise NotImplementedError()
df2
Q: From the original data frame, project the animal and visits columns from those rows where priority is yes. Assign the result to df3.
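As a hint, loc accepts a row selector and a column selector together, so filtering and projection can happen in one step. This is a sketch on made-up data; toy, in_stock, and selected are hypothetical names.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the exercise's DataFrame.
toy = pd.DataFrame({"item": ["pen", "book", "lamp", "desk"],
                    "price": [1.5, 12.0, 30.0, 150.0],
                    "in_stock": ["yes", "no", "yes", "no"]})

# First argument: boolean mask choosing rows.
# Second argument: list of column labels to keep (the projection).
selected = toy.loc[toy["in_stock"] == "yes", ["item", "price"]]
print(selected)
```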
# YOUR CODE HERE
raise NotImplementedError()
df3
Q: The code below defines a List of Lists (plus a header/column names list). The data is based on data from R for Data Science by Garrett Grolemund and Hadley Wickham. Sometimes "missing data" is indicated by a sentinel value, and in this case, the integer -1 is used for this purpose. (Note that this is a poor practice and we will learn better ways in the future.)
In the single cell that follows, perform the following sequence of operations, making sure to assign the intermediate results to the variables as specified. Operations are not cumulative: each operation should start with the original data frame. Note that most operations create a new data frame as part of their operation, but if you need to explicitly create a copy, you can use the copy() method.
- Use pandas to read this List of Lists into a dataframe called df, with column labels given by colnames. Then:
- Assign the shape of the data into variables called nrows and ncols.
- Extract the column cases, as a list, into a variable called L (hint: be very careful about type).
- Create a list K where K[i] is False exactly when cases[i] represents missing data.
- Create a data frame df2, via slicing and iloc, corresponding to the rows that are about Brazil and the columns for year and cases. Your data frame should have two rows and two columns.
- Create a data frame df3 consisting of only the rows of df without missing data.
- Create a data frame df4 consisting of only the columns without missing data. Use loc.
- Create a data frame df5 whose row names are of the form X_year where X is the first letter of the country name, e.g., the first row name should be A_1999 and the last C_2001.
# Provided data initialization
data = [['Afghanistan', '1999', 745, 19987071, 0],
        ['Afghanistan', '2000', 2666, 20595360, 1],
        ['Afghanistan', '2001', -1, 31527618, 0],
        ['Brazil', '1999', 37737, 172006362, 2],
        ['Brazil', '2000', 80488, 174504898, 0],
        ['China', '1999', 212258, 972915272, 1],
        ['China', '2000', 213766, 980428583, 2],
        ['China', '2001', 215626, -1, 1]]
colnames = ['country', 'year', 'cases', 'population', 'category']
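As a hint for the first few operations, the sketch below builds a DataFrame from a small, made-up List of Lists, reads its shape, converts a column to a plain Python list, and makes an explicit copy. All names here (rows, names, toy, reading, score, present, backup) are hypothetical, and -1 is assumed to be the sentinel only to mirror the idea in the question.

```python
import pandas as pd

# Hypothetical toy data; -1 stands in for a missing value.
rows = [["north", "2020", 14, 3],
        ["north", "2021", -1, 5],
        ["south", "2020", 9, -1]]
names = ["region", "year", "reading", "score"]

# A List of Lists becomes a DataFrame with one row per inner list.
toy = pd.DataFrame(rows, columns=names)

# shape is a (row count, column count) tuple and can be unpacked directly.
n_rows, n_cols = toy.shape

# tolist() returns a built-in list; toy["reading"] by itself is a Series.
readings = toy["reading"].tolist()

# A boolean list that is False exactly where the sentinel appears.
present = (toy["reading"] != -1).tolist()

# copy() gives an independent DataFrame; changing it leaves toy untouched.
backup = toy.copy()
backup.at[0, "reading"] = 0
print(n_rows, n_cols, readings, present, toy.at[0, "reading"])
```

The distinction between a Series and a list matters whenever a check uses type(...) == list.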
# YOUR CODE HERE
raise NotImplementedError()
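As a further hint for the later operations, here is a sketch on made-up data of dropping rows or columns that contain a sentinel, slicing by position with iloc, and relabeling the index. The names toy, missing, complete_rows, complete_cols, block, and relabeled are hypothetical, and treating -1 as the sentinel is an assumption of this example.

```python
import pandas as pd

# Hypothetical toy data; -1 stands in for a missing value.
toy = pd.DataFrame([["north", "2020", 14, 3],
                    ["north", "2021", -1, 5],
                    ["south", "2020", 9, -1]],
                   columns=["region", "year", "reading", "score"])

# Element-wise comparison: a DataFrame of booleans, True where the sentinel is.
missing = toy == -1

# Keep rows in which no cell is flagged (any(axis=1) checks across each row).
complete_rows = toy[~missing.any(axis=1)]

# Keep columns in which no cell is flagged (any(axis=0) checks down each column).
complete_cols = toy.loc[:, ~missing.any(axis=0)]

# iloc slices by integer position; the end of each slice is excluded.
block = toy.iloc[1:3, 0:2]

# Build new row labels from two columns, applied to a copy of the data.
relabeled = toy.copy()
relabeled.index = [f"{region[0].upper()}_{year}"
                   for region, year in zip(toy["region"], toy["year"])]
print(list(relabeled.index))
```

Working on a copy before assigning to index keeps the original frame available for the other operations.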
# Testing cell
assert df.at[1,'year'] == '2000'
assert df.at[7,'cases'] == 215626.0
assert df.at[2,'category'] == 0
assert df.at[3,'country'] == 'Brazil'
assert list(df.columns) == ['country', 'year', 'cases', 'population', 'category']
assert nrows == 8
assert ncols == 5
assert len(L) == 8
assert 2666.0 in L
assert type(L) == list
assert type(K) == list
assert K[2] == False
assert K[6] == True
assert df2.at[3,'year'] == '1999'
assert df2.shape == (2,2)
assert df2.at[4,'cases'] == 80488.0
assert df3.shape == (6,5)
assert df3.at[1,'year'] == '2000'
assert df3.at[6,'cases'] == 213766.0
assert df3.at[3,'country'] == 'Brazil'
assert df4.shape == (8,3)
assert df4.at[1,'year'] == '2000'
assert df4.at[6,'category'] == 2
assert df4.at[3,'country'] == 'Brazil'
assert df5.shape == (8,5)
assert list(df5.index) == ['A_1999', 'A_2000', 'A_2001', 'B_1999', 'B_2000', 'C_1999', 'C_2000', 'C_2001']