Before you turn this problem in, make sure everything runs as expected. First, restart the kernel (in the menubar, select Kernel$\rightarrow$Restart) and then run all cells (in the menubar, select Cell$\rightarrow$Run All).
Make sure you fill in any place that says YOUR CODE HERE or "YOUR ANSWER HERE", as well as your name and collaborators below:
NAME = ""
COLLABORATORS = ""
import os
import os.path
import pandas as pd
datadir = "publicdata"
path = os.path.join(datadir, "topnames.csv")
tn = pd.read_csv(path)
tn.set_index(['year', 'sex'], inplace=True)
tn.sort_index(level='year', ascending=False, inplace=True)
tn.reset_index(inplace=True)
tn.head(10)
tn2 = tn.set_index(['year', 'sex'])
tn2.head(10)
Depiction of CSV: topnames.csv
year,sex,name,count
2018,Male,Liam,19837
2018,Female,Emma,18688
2017,Male,Liam,18798
2017,Female,Emma,19800
2016,Male,Noah,19117
2016,Female,Emma,19496
...
Q Fill in the body of the for loop below so that we accumulate data from the rows in "topnames.csv" into the parallel lists initialized for you as yearL, etc.
path = os.path.join(datadir, "topnames.csv")
yearL = []
sexL = []
nameL =[]
countL = []
with open(path, "r") as f:
headers = f.readline().strip().split(',')
print(headers)
for line in f:
# YOUR CODE HERE
raise NotImplementedError()
print(yearL[:6])
print(sexL[:6])
print(nameL[:6])
print(countL[:6])
print()
print(yearL[-1], sexL[-1], nameL[-1], countL[-1])
year, sex $\rightarrow$ name, count
Dictionary maps keys to values. In this representation, the keys are the names of the columns. The values being mapped to are lists. Each entry in the dictionary represents exactly one column of the data set. Elements in the column list represent the column value for the successive rows of the data set. When a representation makes it easy to work with entire columns, but working with rows is more cumbersome, it is a column-centric representation.
topnames = {'year': [2018, 2018, 2017, 2017, 2016, 2016],
'sex': ['Male', 'Female', 'Male',
'Female', 'Male', 'Female'],
'name': ['Liam', 'Emma', 'Liam', 'Emma',
'Noah', 'Emma'],
'count': [19837, 18688, 18798, 19800, 19117, 19496]}
Use file
tn10.csvto get 10 entry example.
Q Write code to create DoL representation of data in tn10.csv, naming the DoL tn10.
path = os.path.join(datadir, "tn10.csv")
tn10 = {'year': [],
'sex': [],
'name': [],
'count': []}
# YOUR CODE HERE
raise NotImplementedError()
# Column operation
print(tn10['year'])
# Row operation
row_index = 7
print(tn10['year'][row_index], tn10['sex'][row_index],
tn10['name'][row_index], tn10['count'][row_index])
# Importance of "correspondence"
del tn10['year'][4]
row_index = 7
print(tn10['year'][row_index], tn10['sex'][row_index],
tn10['name'][row_index], tn10['count'][row_index])
Q (HW) Assuming a CSV that has the same format of $x$ rows of data, with one header line of year,sex,name,count and data lines with those same four fields, write a function
readTopNamesDoL(path)
that reads the file and creates a DoL representation and returns that dictionary from the function.
# YOUR CODE HERE
raise NotImplementedError()
tn10 = readTopNamesDoL(os.path.join(datadir, "tn10.csv"))
print(tn10)
Q Write an expression that assigns to rowsDoL the number of rows in tn10 based on the data structure itself.
# YOUR CODE HERE
raise NotImplementedError()
print(rowsDoL)
Q Write an expression that assigns to colsDoL the number of columns in tn10 based on the data structure itself.
# YOUR CODE HERE
raise NotImplementedError()
colsDoL
Q (HW) Write a function
filterTopNamesDoL(tnDoL, threshold)
to create a filter copy of a topnames DoL tnDoL (with columns year, sex, name, count) so that only rows with a count value greater than or equal to threshold are present in the newly created DoL. Your function should return the new and filtered DoL.
Q (HW) Write a function
addCatColumnDoL(tnDoL, threshold1, threshold2)
that adds a categorical column to a DoL representation in parameter tnDoL with the new column named category whose values are the strings "small" when count is below threshold1, is "medium" when count is greater than or equal to threshold1 and less than threshold2, and large when count is greater than or equal to threshold2.
Q (HW) Write a function
dropColumnDoL(DoL, columnname)
that drops the column specified by columnname from the dictionary of lists representation given in DoL. This should be done "in place". If columnname does not refer to one of the columns in DoL, the function should simply return.
Outer structure is a list that represents the full data of the data set. Inside this list, we have a sequence of lists where each one of these "inner" lists represents exactly one row of the data, and the elements within the row represent the set of column fields of the row. When a representation makes it easy to work with rows, but working with columns is more cumbersome, it is a row-centric representation.
Depiction of CSV: topnames.csv
year,sex,name,count
2018,Male,Liam,19837
2018,Female,Emma,18688
2017,Male,Liam,18798
2017,Female,Emma,19800
2016,Male,Noah,19117
2016,Female,Emma,19496
...
topnames = [[2018, 'Male', 'Liam', 19837],
[2018, 'Female', 'Emma', 18688],
[2017, 'Male', 'Liam', 18798],
[2017, 'Female', 'Emma', 19800],
[2016, 'Male', 'Noah', 19117],
[2016, 'Female', 'Emma', 19496]]
columns = ['year', 'sex', 'name', 'count']
Use file
tn10.csvto get 10 entry example.
Q Write code to create LoL representation of data in tn10.csv, name the LoL tn10data and the columns tn10columns.
path = os.path.join(datadir, "tn10.csv")
tn10data = []
# YOUR CODE HERE
raise NotImplementedError()
tn10data
Q Write an expression that assigns to rowsLoL the number of rows in tn10data based on the data structure itself.
# YOUR CODE HERE
raise NotImplementedError()
print(rowsLoL)
Q Write an expression that assigns to colsLoL the number of columns in tn10data based on the data structure itself.
# YOUR CODE HERE
raise NotImplementedError()
colsLoL
# Row operation
row_index = 7
print(tn10data[row_index])
# Deleting a row
del tn10data[4]
tn10data
# Row operation
row_index = 7
print(tn10data[row_index])
Q (HW) Assuming a CSV that has the same format of $x$ rows of data, with one header lineand data lines with the same four fields, write a function
readTopNamesLoL(path)
that reads the file and creates a LoL representation and returns both the list of column names and the list of lists structure from the function.
# YOUR CODE HERE
raise NotImplementedError()
tn10columns, tn10data = readTopNamesLoL(os.path.join(datadir, "tn10.csv"))
print(tn10columns)
print(tn10data)
Q (HW) Write a function
filterTopNamesLoL(tnLoL, threshold)
to filter a topnames LoL tnLoL (with columns year, sex, name, count) so that only rows with a count value greater than or equal to threshold are present in the newly created LoL. Note that, for this function (and different from filterTopNamesLoL, you are creating a new LoL with the filtered data, and not modifying tnLoL in place. Your function should return the new LoL.
Q (HW) Write a function
addCatColumnLoL(tnLoL, threshold1, threshold2)
that adds a categorical column to a LoL representation in parameter tnDoL with the new column named category whose values are the strings "small" when count is below threshold1, is "medium" when count is greater than or equal to threshold1 and less than threshold2, and large when count is greater than or equal to threshold2. The function should perform its modifications in place.
Q (HW) Write a function
dropColumnLoL(LoL, columns, columnname)
that drops the column specified by columnname from the list of lists representation given in LoL. This should be done "in place" and should not assume the topnames columns, but rather use the list of columns specified in columns to determine which column to drop.