{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\\rightarrow$Run All).\n", "\n", "Make sure you fill in any place that says `YOUR CODE HERE` or \"YOUR ANSWER HERE\", as well as your name and collaborators below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "NAME = \"\"\n", "COLLABORATORS = \"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Intro Data Systems: Section 3.4\n", "\n", "## Setup and preliminaries" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import os.path\n", "import pandas as pd\n", "\n", "datadir = \"publicdata\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "path = os.path.join(datadir, \"topnames.csv\")\n", "tn = pd.read_csv(path)\n", "tn.set_index(['year', 'sex'], inplace=True)\n", "tn.sort_index(level='year', ascending=False, inplace=True)\n", "tn.reset_index(inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tn.head(10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tn2 = tn.set_index(['year', 'sex'])\n", "tn2.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Depiction of CSV: topnames.csv**\n", "\n", "```\n", "year,sex,name,count\n", "2018,Male,Liam,19837\n", "2018,Female,Emma,18688\n", "2017,Male,Liam,18798\n", "2017,Female,Emma,19800\n", "2016,Male,Noah,19117\n", "2016,Female,Emma,19496\n", " ...\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q** Fill in the body of the `for` loop below so that we accumulate data from the rows in `\"topnames.csv\"` into the parallel lists initialized for you as `yearL`, etc." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "f45b0828d83afe4c81f7152582004f46", "grade": true, "grade_id": "cell-a75b9393a4c1b4cb", "locked": false, "points": 0, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "path = os.path.join(datadir, \"topnames.csv\")\n", "\n", "yearL = []\n", "sexL = []\n", "nameL =[]\n", "countL = []\n", "\n", "with open(path, \"r\") as f:\n", " headers = f.readline().strip().split(',')\n", " print(headers)\n", " for line in f:\n", " # YOUR CODE HERE\n", " raise NotImplementedError()\n", "\n", "print(yearL[:6])\n", "print(sexL[:6])\n", "print(nameL[:6])\n", "print(countL[:6])\n", "\n", "print()\n", "print(yearL[-1], sexL[-1], nameL[-1], countL[-1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`year`, `sex` $\\rightarrow$ `name`, `count` " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dictionary of Lists (DoL)\n", "\n", "> Dictionary maps keys to values. In this representation, the keys are the **names of the columns**. The values being mapped to are **lists**. Each entry in the dictionary represents **exactly one column** of the data set. Elements in the column list represent the column value for the successive rows of the data set. When a representation makes it easy to work with entire columns, but working with rows is more cumbersome, it is a **column-centric** representation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Hard-coded Value Initialization of DoL" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "topnames = {'year': [2018, 2018, 2017, 2017, 2016, 2016],\n", " 'sex': ['Male', 'Female', 'Male',\n", " 'Female', 'Male', 'Female'],\n", " 'name': ['Liam', 'Emma', 'Liam', 'Emma',\n", " 'Noah', 'Emma'],\n", " 'count': [19837, 18688, 18798, 19800, 19117, 19496]}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### CSV Value Initialization of DoL\n", "\n", "> Use file `tn10.csv` to get 10 entry example." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q** Write code to create DoL representation of data in `tn10.csv`, naming the DoL `tn10`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "f72dfc6eb9aa052def9c7bfc89de5510", "grade": true, "grade_id": "cell-7c6e356ac6321a3e", "locked": false, "points": 0, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "path = os.path.join(datadir, \"tn10.csv\")\n", "\n", "tn10 = {'year': [],\n", " 'sex': [],\n", " 'name': [],\n", " 'count': []}\n", "\n", "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Column operation\n", "\n", "print(tn10['year'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Row operation\n", "\n", "row_index = 7\n", "print(tn10['year'][row_index], tn10['sex'][row_index],\n", " tn10['name'][row_index], tn10['count'][row_index])\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Importance of \"correspondence\"\n", "\n", "del tn10['year'][4]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "row_index = 7\n", "print(tn10['year'][row_index], tn10['sex'][row_index],\n", " tn10['name'][row_index], tn10['count'][row_index])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q (HW)** Assuming a CSV that has the same format of $x$ rows of data, with one header line of `year,sex,name,count` and data lines with those same four fields, write a function\n", "\n", " readTopNamesDoL(path)\n", "\n", "that reads the file and creates a DoL representation and returns that dictionary from the function." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "38389bd791aa17ba0276fb4671ef531c", "grade": true, "grade_id": "cell-c349c04ca9e41ea6", "locked": false, "points": 0, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "tn10 = readTopNamesDoL(os.path.join(datadir, \"tn10.csv\"))\n", "print(tn10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q** Write an expression that assigns to `rowsDoL` the number of rows in `tn10` based on the data structure itself." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "84d72877c67861c7d4507edbabcecce0", "grade": true, "grade_id": "cell-31d7a47c603fa646", "locked": false, "points": 0, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "print(rowsDoL)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q** Write an expression that assigns to `colsDoL` the number of columns in `tn10` based on the data structure itself." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "8d3dea8b314494599e6983a880d39035", "grade": true, "grade_id": "cell-35370add2b5eee46", "locked": false, "points": 0, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "colsDoL" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q (HW)** Write a function \n", "\n", " filterTopNamesDoL(tnDoL, threshold)\n", " \n", "to create a **filter** copy of a topnames DoL `tnDoL` (with columns `year`, `sex`, `name`, `count`) so that only rows with a count value greater than or equal to `threshold` are present in the newly created DoL. Your function should return the new and filtered DoL." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q (HW)** Write a function \n", "\n", " addCatColumnDoL(tnDoL, threshold1, threshold2)\n", " \n", "that adds a categorical column to a DoL representation in parameter `tnDoL` with the new column named `category` whose values are the strings `\"small\"` when count is below `threshold1`, is `\"medium\"` when count is greater than or equal to `threshold1` and less than `threshold2`, and `large` when count is greater than or equal to `threshold2`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q (HW)** Write a function\n", "\n", " dropColumnDoL(DoL, columnname)\n", " \n", "that drops the column specified by `columnname` from the dictionary of lists representation given in `DoL`. This should be done \"in place\". If `columnname` does not refer to one of the columns in `DoL`, the function should simply return." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## List of Lists (LoL)\n", "\n", "> Outer structure is a **list** that represents the full data of the data set. Inside this list, we have a sequence of lists where each one of these \"inner\" lists represents exactly one row of the data, and the elements within the row represent the set of column fields of the row. When a representation makes it easy to work with rows, but working with columns is more cumbersome, it is a **row-centric** representation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Depiction of CSV: topnames.csv**\n", "\n", "```\n", "year,sex,name,count\n", "2018,Male,Liam,19837\n", "2018,Female,Emma,18688\n", "2017,Male,Liam,18798\n", "2017,Female,Emma,19800\n", "2016,Male,Noah,19117\n", "2016,Female,Emma,19496\n", " ...\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Hard-coded Value Initialization of LoL" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "topnames = [[2018, 'Male', 'Liam', 19837],\n", " [2018, 'Female', 'Emma', 18688],\n", " [2017, 'Male', 'Liam', 18798],\n", " [2017, 'Female', 'Emma', 19800],\n", " [2016, 'Male', 'Noah', 19117],\n", " [2016, 'Female', 'Emma', 19496]]\n", "columns = ['year', 'sex', 'name', 'count']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### CSV Value Initialization of LoL\n", "\n", "> Use file `tn10.csv` to get 10 entry example." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q** Write code to create LoL representation of data in `tn10.csv`, name the LoL `tn10data` and the columns `tn10columns`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "369ec0e8ca958b4a62c8e57753f99e26", "grade": true, "grade_id": "cell-9fc3d0a7e84122e8", "locked": false, "points": 0, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "path = os.path.join(datadir, \"tn10.csv\")\n", "\n", "tn10data = []\n", "\n", "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "tn10data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q** Write an expression that assigns to `rowsLoL` the number of rows in `tn10data` based on the data structure itself." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "58a6f6b2b8ef48ff5b804247c2135802", "grade": true, "grade_id": "cell-97be42d39b25667b", "locked": false, "points": 0, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "print(rowsLoL)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q** Write an expression that assigns to `colsLoL` the number of columns in `tn10data` based on the data structure itself." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "97119f621f6684e940e2465286c38e50", "grade": true, "grade_id": "cell-4b45597e10f5ae4a", "locked": false, "points": 0, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "colsLoL" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Row operation\n", "\n", "row_index = 7\n", "print(tn10data[row_index])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Deleting a row\n", "\n", "del tn10data[4]\n", "tn10data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Row operation\n", "\n", "row_index = 7\n", "print(tn10data[row_index])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q (HW)** Assuming a CSV that has the same format of $x$ rows of data, with one header lineand data lines with the same four fields, write a function\n", "\n", " readTopNamesLoL(path)\n", "\n", "that reads the file and creates a LoL representation and returns both the list of column names and the list of lists structure from the function." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "be802bb3b449b4f2bbe979c01f3880f4", "grade": true, "grade_id": "cell-6d8d05fa784528c6", "locked": false, "points": 0, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "tn10columns, tn10data = readTopNamesLoL(os.path.join(datadir, \"tn10.csv\"))\n", "print(tn10columns)\n", "print(tn10data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q (HW)** Write a function \n", "\n", " filterTopNamesLoL(tnLoL, threshold)\n", " \n", "to **filter** a topnames LoL `tnLoL` (with columns `year`, `sex`, `name`, `count`) so that only rows with a count value greater than or equal to `threshold` are present in the newly created LoL. Note that, for this function (and different from `filterTopNamesLoL`, you are creating a **new** LoL with the filtered data, and **not** modifying `tnLoL` in place. Your function should return the new LoL." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q (HW)** Write a function \n", "\n", " addCatColumnLoL(tnLoL, threshold1, threshold2)\n", " \n", "that adds a categorical column to a LoL representation in parameter `tnDoL` with the new column named `category` whose values are the strings `\"small\"` when count is below `threshold1`, is `\"medium\"` when count is greater than or equal to `threshold1` and less than `threshold2`, and `large` when count is greater than or equal to `threshold2`. The function should perform its modifications in place." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q (HW)** Write a function\n", "\n", " dropColumnLoL(LoL, columns, columnname)\n", " \n", "that drops the column specified by `columnname` from the list of lists representation given in `LoL`. This should be done \"in place\" and should **not** assume the topnames columns, but rather use the list of columns specified in `columns` to determine which column to drop." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }