Before you turn this problem in, make sure everything runs as expected. First, restart the kernel (in the menubar, select Kernel$\rightarrow$Restart) and then run all cells (in the menubar, select Cell$\rightarrow$Run All).
Make sure you fill in any place that says YOUR CODE HERE
or "YOUR ANSWER HERE", as well as your name and collaborators below:
NAME = ""
COLLABORATORS = ""
import os
import os.path
import pandas as pd
datadir = "publicdata"
path = os.path.join(datadir, "topnames.csv")
tn = pd.read_csv(path)
tn.set_index(['year', 'sex'], inplace=True)
tn.sort_index(level='year', ascending=False, inplace=True)
tn.reset_index(inplace=True)
tn.head(10)
tn2 = tn.set_index(['year', 'sex'])
tn2.head(10)
Depiction of CSV: topnames.csv
year,sex,name,count
2018,Male,Liam,19837
2018,Female,Emma,18688
2017,Male,Liam,18798
2017,Female,Emma,19800
2016,Male,Noah,19117
2016,Female,Emma,19496
...
Q Fill in the body of the for
loop below so that we accumulate data from the rows in "topnames.csv"
into the parallel lists initialized for you as yearL
, etc.
path = os.path.join(datadir, "topnames.csv")
yearL = []
sexL = []
nameL =[]
countL = []
with open(path, "r") as f:
headers = f.readline().strip().split(',')
print(headers)
for line in f:
# YOUR CODE HERE
raise NotImplementedError()
print(yearL[:6])
print(sexL[:6])
print(nameL[:6])
print(countL[:6])
print()
print(yearL[-1], sexL[-1], nameL[-1], countL[-1])
year
, sex
$\rightarrow$ name
, count
Dictionary maps keys to values. In this representation, the keys are the names of the columns. The values being mapped to are lists. Each entry in the dictionary represents exactly one column of the data set. Elements in the column list represent the column value for the successive rows of the data set. When a representation makes it easy to work with entire columns, but working with rows is more cumbersome, it is a column-centric representation.
topnames = {'year': [2018, 2018, 2017, 2017, 2016, 2016],
'sex': ['Male', 'Female', 'Male',
'Female', 'Male', 'Female'],
'name': ['Liam', 'Emma', 'Liam', 'Emma',
'Noah', 'Emma'],
'count': [19837, 18688, 18798, 19800, 19117, 19496]}
Use file
tn10.csv
to get 10 entry example.
Q Write code to create DoL representation of data in tn10.csv
, naming the DoL tn10
.
path = os.path.join(datadir, "tn10.csv")
tn10 = {'year': [],
'sex': [],
'name': [],
'count': []}
# YOUR CODE HERE
raise NotImplementedError()
# Column operation
print(tn10['year'])
# Row operation
row_index = 7
print(tn10['year'][row_index], tn10['sex'][row_index],
tn10['name'][row_index], tn10['count'][row_index])
# Importance of "correspondence"
del tn10['year'][4]
row_index = 7
print(tn10['year'][row_index], tn10['sex'][row_index],
tn10['name'][row_index], tn10['count'][row_index])
Q (HW) Assuming a CSV that has the same format of $x$ rows of data, with one header line of year,sex,name,count
and data lines with those same four fields, write a function
readTopNamesDoL(path)
that reads the file and creates a DoL representation and returns that dictionary from the function.
# YOUR CODE HERE
raise NotImplementedError()
tn10 = readTopNamesDoL(os.path.join(datadir, "tn10.csv"))
print(tn10)
Q Write an expression that assigns to rowsDoL
the number of rows in tn10
based on the data structure itself.
# YOUR CODE HERE
raise NotImplementedError()
print(rowsDoL)
Q Write an expression that assigns to colsDoL
the number of columns in tn10
based on the data structure itself.
# YOUR CODE HERE
raise NotImplementedError()
colsDoL
Q (HW) Write a function
filterTopNamesDoL(tnDoL, threshold)
to create a filter copy of a topnames DoL tnDoL
(with columns year
, sex
, name
, count
) so that only rows with a count value greater than or equal to threshold
are present in the newly created DoL. Your function should return the new and filtered DoL.
Q (HW) Write a function
addCatColumnDoL(tnDoL, threshold1, threshold2)
that adds a categorical column to a DoL representation in parameter tnDoL
with the new column named category
whose values are the strings "small"
when count is below threshold1
, is "medium"
when count is greater than or equal to threshold1
and less than threshold2
, and large
when count is greater than or equal to threshold2
.
Q (HW) Write a function
dropColumnDoL(DoL, columnname)
that drops the column specified by columnname
from the dictionary of lists representation given in DoL
. This should be done "in place". If columnname
does not refer to one of the columns in DoL
, the function should simply return.
Outer structure is a list that represents the full data of the data set. Inside this list, we have a sequence of lists where each one of these "inner" lists represents exactly one row of the data, and the elements within the row represent the set of column fields of the row. When a representation makes it easy to work with rows, but working with columns is more cumbersome, it is a row-centric representation.
Depiction of CSV: topnames.csv
year,sex,name,count
2018,Male,Liam,19837
2018,Female,Emma,18688
2017,Male,Liam,18798
2017,Female,Emma,19800
2016,Male,Noah,19117
2016,Female,Emma,19496
...
topnames = [[2018, 'Male', 'Liam', 19837],
[2018, 'Female', 'Emma', 18688],
[2017, 'Male', 'Liam', 18798],
[2017, 'Female', 'Emma', 19800],
[2016, 'Male', 'Noah', 19117],
[2016, 'Female', 'Emma', 19496]]
columns = ['year', 'sex', 'name', 'count']
Use file
tn10.csv
to get 10 entry example.
Q Write code to create LoL representation of data in tn10.csv
, name the LoL tn10data
and the columns tn10columns
.
path = os.path.join(datadir, "tn10.csv")
tn10data = []
# YOUR CODE HERE
raise NotImplementedError()
tn10data
Q Write an expression that assigns to rowsLoL
the number of rows in tn10data
based on the data structure itself.
# YOUR CODE HERE
raise NotImplementedError()
print(rowsLoL)
Q Write an expression that assigns to colsLoL
the number of columns in tn10data
based on the data structure itself.
# YOUR CODE HERE
raise NotImplementedError()
colsLoL
# Row operation
row_index = 7
print(tn10data[row_index])
# Deleting a row
del tn10data[4]
tn10data
# Row operation
row_index = 7
print(tn10data[row_index])
Q (HW) Assuming a CSV that has the same format of $x$ rows of data, with one header lineand data lines with the same four fields, write a function
readTopNamesLoL(path)
that reads the file and creates a LoL representation and returns both the list of column names and the list of lists structure from the function.
# YOUR CODE HERE
raise NotImplementedError()
tn10columns, tn10data = readTopNamesLoL(os.path.join(datadir, "tn10.csv"))
print(tn10columns)
print(tn10data)
Q (HW) Write a function
filterTopNamesLoL(tnLoL, threshold)
to filter a topnames LoL tnLoL
(with columns year
, sex
, name
, count
) so that only rows with a count value greater than or equal to threshold
are present in the newly created LoL. Note that, for this function (and different from filterTopNamesLoL
, you are creating a new LoL with the filtered data, and not modifying tnLoL
in place. Your function should return the new LoL.
Q (HW) Write a function
addCatColumnLoL(tnLoL, threshold1, threshold2)
that adds a categorical column to a LoL representation in parameter tnDoL
with the new column named category
whose values are the strings "small"
when count is below threshold1
, is "medium"
when count is greater than or equal to threshold1
and less than threshold2
, and large
when count is greater than or equal to threshold2
. The function should perform its modifications in place.
Q (HW) Write a function
dropColumnLoL(LoL, columns, columnname)
that drops the column specified by columnname
from the list of lists representation given in LoL
. This should be done "in place" and should not assume the topnames columns, but rather use the list of columns specified in columns
to determine which column to drop.