{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\\rightarrow$Run All).\n", "\n", "Make sure you fill in any place that says `YOUR CODE HERE` or \"YOUR ANSWER HERE\", as well as your name and collaborators below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "NAME = \"\"\n", "COLLABORATORS = \"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import os.path\n", "import pandas as pd\n", "\n", "datadir = \"publicdata\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = {'animal': ['cat','cat','snake','dog',\n", " 'dog','cat','snake','cat','dog','dog'],\n", "'age': [2.5, 3, 0.5, 7, 5, 2, 4.5, 4, 7, 3],\n", "'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],\n", "'priority': ['yes','yes','no','yes','no', \n", " 'no','no','yes','no', 'no']}\n", "\n", "labels = ['a', 'b', 'c', 'd', 'e', \n", " 'f', 'g', 'h', 'i', 'j']\n", "\n", "df = pd.DataFrame(data, index=labels)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q:** Create a column vector `agevisit` (a Series) which, for each row, is the age of the animal divided by the number of visits for the animal." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "c068a965a79657e36aff0cb6c2381711", "grade": true, "grade_id": "cell-faa8f45899a67949", "locked": false, "points": 1, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "agevisit" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q:** Create a dataframe `df2` with only the rows where the `age` is greater than 3." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "5aaefd12ddcc50a93905a30b2117058d", "grade": true, "grade_id": "cell-bc3eb33cf76dcfbd", "locked": false, "points": 1, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "df2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q:** From the original data frame, project the `animal` and `visits` columns from those rows where `priority` is `yes`. Assign the result to `df3`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "fca6f013556284f4816397b5efb3a50f", "grade": true, "grade_id": "cell-9ebe96b34aa148ec", "locked": false, "points": 2, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "df3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q:** The code below defines a List of Lists (plus a header/column names list). The data is based on data from [**R for Data Science**](http://r4ds.had.co.nz/) by Garrett Grolemund and Hadley Wickham. Sometimes \"missing data\" is indicated by a sentinel value, and in this case, the integer -1 is used for this purpose. 
(Note that this is a poor practice and we will learn better ways in the future.)\n", "\n", "\n", "In the single cell that follows, perform the following sequence of operations, making sure to assign the intermediate results to the variables as specified. Operations are **not** cumulative. Each of the operations performed should start with the original data frame. Note that most operations create a new data frame as part of their operation, but if you need to explicitly create a copy, you can use the `copy()` method.\n", "\n", "0. Use `pandas` to read this List of Lists into a dataframe called `df`, with column labels given by `colnames`. \n", "\n", "Then: \n", "\n", "1. Extract the `shape` of the data into variables called `nrows` and `ncols`.\n", "2. Extract the *list* of `cases` into a variable called `L` (hint: be very careful about type).\n", "3. Define a list of booleans `K` where `K[i]` is `False` exactly when `cases[i]` represents missing data. \n", "4. Create a new data frame `df2`, via slicing and `iloc`, corresponding to the rows that are about `Brazil` and the columns for `year` and `cases`. Your data frame should have two rows and two columns.\n", "5. Define a new dataframe `df3` consisting of only the rows of `df` without missing data.\n", "6. Use a projection to define a new dataframe `df4` consisting of only the columns without missing data. Use `loc`.\n", "7. Create a new data frame `df5` whose row names are of the form `X_year` where `X` is the first letter of the country name, e.g., the first row name should be `A_1999` and the last `C_2001`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "946f52a4b075609124f499c56fdf1951", "grade": false, "grade_id": "cell-b7f4480f1f399631", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "# Provided data initialization\n", "\n", "data = [['Afghanistan', '1999', 745, 19987071, 0],\n", " ['Afghanistan', '2000', 2666, 20595360, 1],\n", " ['Afghanistan', '2001', -1, 31527618, 0],\n", " [ 'Brazil', '1999', 37737, 172006362, 2],\n", " [ 'Brazil', '2000', 80488, 174504898, 0],\n", " [ 'China', '1999', 212258, 972915272, 1],\n", " [ 'China', '2000', 213766, 980428583, 2],\n", " [ 'China', '2001', 215626, -1, 1]]\n", "\n", "colnames = ['country', 'year', 'cases', 'population', 'category']" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "f4b1cfbc493977a4a79f8044ca62de65", "grade": false, "grade_id": "cell-911d93b25195445c", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "d67b34cf0dd3ee4fe06d785c02f57c2d", "grade": true, "grade_id": "cell-bd1ab02718f34831", "locked": true, "points": 4, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "# Testing cell\n", "\n", "assert df.at[1,'year'] == '2000'\n", "assert df.at[7,'cases'] == 215626.0\n", "assert df.at[2,'category'] == 0\n", "assert df.at[3,'country'] == 'Brazil'\n", "assert list(df.columns) == ['country', 'year', 'cases', 'population', 'category']\n", "assert nrows == 8\n", "assert ncols == 5\n", "assert len(L) == 8\n", "assert 2666.0 in L\n", "assert type(L) == list\n", "assert type(K) == 
list\n", "assert K[2] == False\n", "assert K[6] == True\n", "assert df2.at[3,'year'] == '1999'\n", "assert df2.shape == (2,2)\n", "assert df2.at[4,'cases'] == 80488.0\n", "assert df3.shape == (6,5)\n", "assert df3.at[1,'year'] == '2000'\n", "assert df3.at[6,'cases'] == 213766.0\n", "assert df3.at[3,'country'] == 'Brazil'\n", "assert df4.shape == (8,3)\n", "assert df4.at[1,'year'] == '2000'\n", "assert df4.at[6,'category'] == 2\n", "assert df4.at[3,'country'] == 'Brazil'\n", "assert df5.shape == (8,5)\n", "assert list(df5.index) == ['A_1999', 'A_2000', 'A_2001', 'B_1999', 'B_2000', 'C_1999', 'C_2000', 'C_2001']\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }