{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Denison CS181/DA210 Homework\n", "\n", "Before you turn this problem in, make sure everything runs as expected. This is a combination of **restarting the kernel** and then **running all cells** (in the menubar, select Kernel$\\rightarrow$Restart And Run All).\n", "\n", "Make sure you fill in any place that says `YOUR CODE HERE` or \"YOUR ANSWER HERE\"." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import os.path\n", "import pandas as pd\n", "\n", "datadir = \"publicdata\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q1** Read CSV file `topnames.csv` in `datadir` into a data frame named `topnames0`, with no index. Using individual operations on the `count` column, find the mean, the median, and the max count, assigning to `mean_counts`, `median_counts`, and `max_counts`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "56a21f4d74ca19c2fd0b3be41c1d6a0d", "grade": false, "grade_id": "cell-273768f03f8357e7", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "print(\"mean: \", mean_counts, \", median: \", median_counts, \", max: \", max_counts, sep=\"\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "33717e77cbc90703a46c471991939d68", "grade": true, "grade_id": "cell-057e3c90c18b71de", "locked": true, "points": 2, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "# Testing Cell\n", "\n", "assert True" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q2** Using the `agg` method on the `Series` of the column vector of counts, perform the same calculation of mean, median, and max in a single step, and assign to `agg_values`. Note in a comment in the code solution cell the **data type** of the result. Note that this invocation may not have an exact correspondent in the book, so you may have to look up documentation of using `agg` on a `Series`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "a13203da17c6ea29f0c767ea0ec98983", "grade": false, "grade_id": "cell-b34127acc546301e", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "agg_values" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "0318b1296664a2fde935df60396140c4", "grade": true, "grade_id": "cell-69c4bfbbc5c3e7c6", "locked": true, "points": 2, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "# Testing Cell\n", "\n", "assert True" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q3** Create a subset of `topnames0` restricted to `Female` entries between 1960 and 1969 inclusive. ssign this to `female_subset`. Then use the `agg` function, in one step, to determine the mean and median count and the number of unique names, assigning to `female_aggvalues`. In a comment in the markdown solution cell, indicate the data type of the result." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "e99a40d1c35bb027073f74d32e7f679d", "grade": false, "grade_id": "cell-66dbb0e2aa2de096", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "female_aggvalues" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "9c023894c1ade68ec83ff46c4684e4b4", "grade": true, "grade_id": "cell-a39ace3d52b87ed8", "locked": true, "points": 2, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "# Testing Cell\n", "\n", "assert True" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q4** The constraints for selecting the rows from the last problem are based on `sex` and `year`. We often use these independent variables to set an index for a data set. Then, when we want to filter rows, our operations that use row label/`index` values for filtering are different.\n", "\n", "Start by creating dataframe `topnames` with its index drawn from the columns `year` and `sex`. Then,\n", "with a goal of the same use of the `agg` function from **Q3**, use `xs` to take a cross section of `topnames` to get the Female entries and then use `loc` to get a data frame, `female_subset`. Finally, use `agg` on this data frame to, in one step, determine the mean and median count and the number of unique names, assigning to female_aggvalues." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "e4ebd7ea5c540803aee99c39ae158f75", "grade": false, "grade_id": "cell-26dd6084fbc01496", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "female_aggvalues" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "2a55f1a050739815b8fe278f3edc2f5b", "grade": true, "grade_id": "cell-28e7b414f2cfbfdb", "locked": true, "points": 2, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "# Testing Cell\n", "\n", "assert True" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q5** Read CSV file `indicators2016.csv` in `datadir` into a data frame named `indicators0`, with no index. Write code to add a new column `popSize` to `indicators0` which takes value 'high' if `pop > 300`, 'low' if `pop < 50`, and 'medium' otherwise. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "9fd6fb8998593a87d41d8b90bd19914e", "grade": false, "grade_id": "cell-12f1961184fc392c", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "indicators0.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "2fb7c51b66f77ef77a3eb9f3036de4a6", "grade": true, "grade_id": "cell-2b5d0d4dc0d329f3", "locked": true, "points": 2, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "# Testing Cell\n", "\n", "assert True" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q6** Building on the question above, use `groupby` to partition `indicators0` by this new column `popSize`, assigning to variable `groupby_pop`. Note in a comment the data type of `groupby_pop`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "41cf473f64d9393b8af048383d7b5ef6", "grade": false, "grade_id": "cell-023cf542221ea589", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "print(len(groupby_pop))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "80be9380098501633e920a6db849ed57", "grade": true, "grade_id": "cell-b3a522c4ffed3e54", "locked": true, "points": 2, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "# Testing Cell\n", "\n", "assert True" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q7** Building on the question above, aggregate this groupby partitioning, determining the **number of non-missing elements** for each of the columns by partitiion. Assign to `partition_counts` and also include a comment giving the data type of the result, and the row labels of the result." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "4f2f284bd522c5d9cf53bee23d16cd06", "grade": false, "grade_id": "cell-0c97c6a6e4ad592d", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "partition_counts" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "718874848940afc4004b1b565ede7c6f", "grade": true, "grade_id": "cell-66f8badba21e1cb5", "locked": true, "points": 2, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "# Testing Cell\n", "\n", "assert True" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q8** In similar fashion, determine the mean, by partition, of pop, gdp, and cell, and the max of life, assigning to `partition_aggvalues`. Use the `round()` method of DataFrames to round the numeric values to 2 decimal places." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "6fb0e630743d95db2c1a3a47fb2914c7", "grade": false, "grade_id": "cell-a188147fe195416e", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "partition_aggvalues" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "96f4b68141020b6b367eddb098be4aa2", "grade": true, "grade_id": "cell-527de16856555a86", "locked": true, "points": 2, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "# Testing Cell\n", "\n", "assert True" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q8** In similar fashion, determine the mean, min, and max of `gdp` and `life`, again assiging to `partition_aggvalues`. Explain in the markdown cell after the code cell how the columns of this result differ from the last two questions, and **why**." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "d9af44e0deb5ebdec044311a5d94f65f", "grade": false, "grade_id": "cell-4a555edd9837afc7", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "partition_aggvalues" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "3fcfff28bd3f04d2c80b1c0d5e0309fd", "grade": true, "grade_id": "cell-0eac3eb8a2736913", "locked": true, "points": 2, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "# Testing Cell\n", "\n", "assert True" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "nbgrader": { "cell_type": "markdown", "checksum": "ec32bc897cca88b7d7d32e7845f2d6ad", "grade": true, "grade_id": "cell-e305e2a6bcb52188", "locked": false, "points": 2, "schema_version": 3, "solution": true, "task": false } }, "source": [ "YOUR ANSWER HERE" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }