Before you turn this problem in, make sure everything runs as expected. First, restart the kernel (in the menubar, select Kernel$\rightarrow$Restart) and then run all cells (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says YOUR CODE HERE or "YOUR ANSWER HERE", as well as your name and collaborators below:

NAME = ""
COLLABORATORS = ""

Intro Data Systems: Section 3.4¶

Setup and preliminaries¶

import os
import os.path
import pandas as pd

datadir = "publicdata"

path = os.path.join(datadir, "topnames.csv")
tn = pd.read_csv(path)
tn.set_index(['year', 'sex'], inplace=True)
tn.sort_index(level='year', ascending=False, inplace=True)
tn.reset_index(inplace=True)

tn.head(10)

tn2 = tn.set_index(['year', 'sex'])
tn2.head(10)

Depiction of CSV: topnames.csv

year,sex,name,count
2018,Male,Liam,19837
2018,Female,Emma,18688
2017,Male,Liam,18798
2017,Female,Emma,19800
2016,Male,Noah,19117
2016,Female,Emma,19496
      ...

Q Fill in the body of the for loop below so that we accumulate data from the rows in "topnames.csv" into the parallel lists initialized for you as yearL, etc.

path = os.path.join(datadir, "topnames.csv")

yearL = []
sexL = []
nameL =[]
countL = []

with open(path, "r") as f:
    headers = f.readline().strip().split(',')
    print(headers)
    for line in f:
        # YOUR CODE HERE
        raise NotImplementedError()

print(yearL[:6])
print(sexL[:6])
print(nameL[:6])
print(countL[:6])

print()
print(yearL[-1], sexL[-1], nameL[-1], countL[-1])

year, sex $\rightarrow$ name, count

Dictionary of Lists (DoL)¶

Dictionary maps keys to values. In this representation, the keys are the names of the columns. The values being mapped to are lists. Each entry in the dictionary represents exactly one column of the data set. Elements in the column list represent the column value for the successive rows of the data set. When a representation makes it easy to work with entire columns, but working with rows is more cumbersome, it is a column-centric representation.

Hard-coded Value Initialization of DoL¶

topnames = {'year': [2018, 2018, 2017, 2017, 2016, 2016],
          'sex': ['Male', 'Female', 'Male',
                    'Female', 'Male', 'Female'],
          'name': ['Liam', 'Emma', 'Liam', 'Emma',
                     'Noah', 'Emma'],
          'count': [19837, 18688, 18798, 19800, 19117, 19496]}

CSV Value Initialization of DoL¶

Use file tn10.csv to get 10 entry example.

Q Write code to create DoL representation of data in tn10.csv, naming the DoL tn10.

path = os.path.join(datadir, "tn10.csv")

tn10 = {'year': [],
        'sex': [],
        'name': [],
        'count': []}

# YOUR CODE HERE
raise NotImplementedError()

# Column operation

print(tn10['year'])

# Row operation

row_index = 7
print(tn10['year'][row_index], tn10['sex'][row_index],
    tn10['name'][row_index], tn10['count'][row_index])

# Importance of "correspondence"

del tn10['year'][4]

row_index = 7
print(tn10['year'][row_index], tn10['sex'][row_index],
    tn10['name'][row_index], tn10['count'][row_index])

Q (HW) Assuming a CSV that has the same format of $x$ rows of data, with one header line of year,sex,name,count and data lines with those same four fields, write a function

readTopNamesDoL(path)

that reads the file and creates a DoL representation and returns that dictionary from the function.

# YOUR CODE HERE
raise NotImplementedError()
tn10 = readTopNamesDoL(os.path.join(datadir, "tn10.csv"))
print(tn10)

Q Write an expression that assigns to rowsDoL the number of rows in tn10 based on the data structure itself.

# YOUR CODE HERE
raise NotImplementedError()
print(rowsDoL)

Q Write an expression that assigns to colsDoL the number of columns in tn10 based on the data structure itself.

# YOUR CODE HERE
raise NotImplementedError()
colsDoL

Q (HW) Write a function

filterTopNamesDoL(tnDoL, threshold)

to create a filter copy of a topnames DoL tnDoL (with columns year, sex, name, count) so that only rows with a count value greater than or equal to threshold are present in the newly created DoL. Your function should return the new and filtered DoL.

Q (HW) Write a function

addCatColumnDoL(tnDoL, threshold1, threshold2)

that adds a categorical column to a DoL representation in parameter tnDoL with the new column named category whose values are the strings "small" when count is below threshold1, is "medium" when count is greater than or equal to threshold1 and less than threshold2, and large when count is greater than or equal to threshold2.

Q (HW) Write a function

dropColumnDoL(DoL, columnname)

that drops the column specified by columnname from the dictionary of lists representation given in DoL. This should be done "in place". If columnname does not refer to one of the columns in DoL, the function should simply return.

List of Lists (LoL)¶

Outer structure is a list that represents the full data of the data set. Inside this list, we have a sequence of lists where each one of these "inner" lists represents exactly one row of the data, and the elements within the row represent the set of column fields of the row. When a representation makes it easy to work with rows, but working with columns is more cumbersome, it is a row-centric representation.

Depiction of CSV: topnames.csv

year,sex,name,count
2018,Male,Liam,19837
2018,Female,Emma,18688
2017,Male,Liam,18798
2017,Female,Emma,19800
2016,Male,Noah,19117
2016,Female,Emma,19496
      ...

Hard-coded Value Initialization of LoL¶

topnames = [[2018, 'Male', 'Liam', 19837],
            [2018, 'Female', 'Emma', 18688],
            [2017, 'Male', 'Liam', 18798],
            [2017, 'Female', 'Emma', 19800],
            [2016, 'Male', 'Noah', 19117],
            [2016, 'Female', 'Emma', 19496]]
columns = ['year', 'sex', 'name', 'count']

CSV Value Initialization of LoL¶

Use file tn10.csv to get 10 entry example.

Q Write code to create LoL representation of data in tn10.csv, name the LoL tn10data and the columns tn10columns.

path = os.path.join(datadir, "tn10.csv")

tn10data = []

# YOUR CODE HERE
raise NotImplementedError()
tn10data

Q Write an expression that assigns to rowsLoL the number of rows in tn10data based on the data structure itself.

# YOUR CODE HERE
raise NotImplementedError()
print(rowsLoL)

Q Write an expression that assigns to colsLoL the number of columns in tn10data based on the data structure itself.

# YOUR CODE HERE
raise NotImplementedError()
colsLoL

# Row operation

row_index = 7
print(tn10data[row_index])

# Deleting a row

del tn10data[4]
tn10data

# Row operation

row_index = 7
print(tn10data[row_index])

Q (HW) Assuming a CSV that has the same format of $x$ rows of data, with one header lineand data lines with the same four fields, write a function

readTopNamesLoL(path)

that reads the file and creates a LoL representation and returns both the list of column names and the list of lists structure from the function.

# YOUR CODE HERE
raise NotImplementedError()
tn10columns, tn10data = readTopNamesLoL(os.path.join(datadir, "tn10.csv"))
print(tn10columns)
print(tn10data)

Q (HW) Write a function

filterTopNamesLoL(tnLoL, threshold)

to filter a topnames LoL tnLoL (with columns year, sex, name, count) so that only rows with a count value greater than or equal to threshold are present in the newly created LoL. Note that, for this function (and different from filterTopNamesLoL, you are creating a new LoL with the filtered data, and not modifying tnLoL in place. Your function should return the new LoL.

Q (HW) Write a function

addCatColumnLoL(tnLoL, threshold1, threshold2)

that adds a categorical column to a LoL representation in parameter tnDoL with the new column named category whose values are the strings "small" when count is below threshold1, is "medium" when count is greater than or equal to threshold1 and less than threshold2, and large when count is greater than or equal to threshold2. The function should perform its modifications in place.

Q (HW) Write a function

dropColumnLoL(LoL, columns, columnname)

that drops the column specified by columnname from the list of lists representation given in LoL. This should be done "in place" and should not assume the topnames columns, but rather use the list of columns specified in columns to determine which column to drop.