{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Denison CS181/DA210 Homework\n", "\n", "Before you turn this problem in, make sure everything runs as expected. This is a combination of **restarting the kernel** and then **running all cells** (in the menubar, select Kernel$\\rightarrow$Restart And Run All).\n", "\n", "Make sure you fill in any place that says `YOUR CODE HERE` or \"YOUR ANSWER HERE\"." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import os.path\n", "import pandas as pd\n", "\n", "datadir = \"publicdata\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q1** In the data directory, you will find two csv files, `educationTop.csv` and `educationBottom.csv`, both based on data hosted by `www.census.gov`. Both have the same columns, and each row is a U.S. metropolitan area, with data on population, education, and unemployment. The first `csv` file contains metropolitan areas starting with A-K, and the second starting with L-Z. Read both into `pandas` data frames, using the column `GEO.id2` as an index. Be careful, as these files have **two** lines that might be considered header lines. The first one is more descriptive, and the second one is more concise and give succint variable names. So when we read from the csv, we want to **skip** the first line and then use the second line as the header line. The `pandas` `read_csv` constructor can do this for you, but you have to discover and use the correct arguments.\n", "\n", "Concatenate these two data frames along the row dimension (with the top one on top), and call the result `educationDF`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "54c620e0e9583ba4720b710b9ff5468c", "grade": false, "grade_id": "cell-b25fab8c12339430", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# Solution Cell\n", "\n", "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "educationDF.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "b7fb19819cf48177537217d4a05d5bb2", "grade": true, "grade_id": "cell-8f07ee547e7c1ff8", "locked": true, "points": 3, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "# Testing Cell\n", "\n", "assert True" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q2** In the data directory, you will find two csv files, `educationLeft.csv` and `educationRight.csv`, both based on data hosted by `www.census.gov`. Both have the same rows, and each row is a U.S. metropolitan area, with data on population, education, and unemployment. The first has information on individuals without a college degree, and the second has information on individuals with a college degree. Read both into `pandas` data frames. The `GEO.display-label` field should serve as the row index. Read these files into data frames and then concatenate these two data frames along the column dimension, and call the result `educationDF2`.\n", "\n", "The same cautions on the CSV having an informational row of headers followed by the header variable names applies to these two CSV files as well." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "30da97feded2c3f2e453fdd16c25a33e", "grade": false, "grade_id": "cell-698c9ada24032095", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# Solution Cell\n", "\n", "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "educationDF2.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "ff7cdd95653586519e9373e4f9f27783", "grade": true, "grade_id": "cell-f413a9fabbcd0427", "locked": true, "points": 3, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "# Testing Cell\n", "\n", "assert True" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q3** In the data directory, you will find two csv files, `educationLeftJ.csv` and `educationRightJ.csv`, both based on data hosted by `www.census.gov`. In both, rows represent U.S. metropolitan area, with data on population, education, and unemployment. However, they do not have exactly the same set of rows, and the columns are totally different except for the index column `Geography`. In these CSV files, there is not an \"extra\" header line. The row label index should come from `Geography` unless you are instructed otherwise.\n", "\n", "0. Read both into `pandas` data frames, with names `educationLeftJ` and `educationRightJ`.\n", "1. Make a copy of `educationLeftJ` called `educationLeftOriginal`.\n", "2. Starting with `educationLeftJ`, do a left join to bring in the data from `educationRightJ`, storing your answer as `dfJ`.\n", "3. Starting with `educationLeftOriginal`, do an inner join to bring in the data from `educationRightJ`, storing your answer as `dfJ2`.\n", "4. Now read the original csv files in as `eduLeft` and `eduRight` with no meaningful index. Then, starting from `eduLeft`, do an inner merge along the column `Geography`, storing your answer as `dfJ3`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "ee9fa84634819e01573d6da8e698d71f", "grade": false, "grade_id": "cell-f3cb3f384a53b7b2", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# Solution Cell\n", "\n", "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "b11df030d202429c9e2dbca3561f92ee", "grade": true, "grade_id": "cell-a1e03d05287e5c02", "locked": true, "points": 4, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "# Testing Cell\n", "\n", "assert True" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q4** Explain the difference in the number of rows and columns in `dfJ`, `dfJ2`, and `dfJ3`. Be specific." ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "nbgrader": { "cell_type": "markdown", "checksum": "6c0dc4e981e60063f4b53a186643b0c9", "grade": true, "grade_id": "cell-60bd9e8dd53dd5f0", "locked": false, "points": 2, "schema_version": 3, "solution": true, "task": false } }, "source": [ "YOUR ANSWER HERE" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }