{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\\rightarrow$Run All).\n", "\n", "Make sure you fill in any place that says `YOUR CODE HERE` or \"YOUR ANSWER HERE\", as well as your name and collaborators below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "NAME = \"\"\n", "COLLABORATORS = \"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# JSON Practicum" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import os.path\n", "import json\n", "import pandas as pd\n", "\n", "datadir = \"publicdata\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Function | Description\n", "------------------------|----------------------------------------------------\n", "`json.load(file)` | Read and return the JSON-formatted data structure from the file object `file`.\n", "`json.dump(data, file)` | Write the data structure `data` to the file object `file` in JSON text format.\n", "`json.loads(s)` | Using the JSON-formatted string given by `s`, interpret and construct and return the corresponding data structure.\n", "`json.dumps(data)` | Translate the data structure `data` into a JSON-formatted string and return the string.\n", "\n", "Note that, when we `dump()` or `dumps()`, we are **converting** from an in-memory data structure that consists of dictionaries, lists, strings, and numbers (that one can \"do math\" on), into what is fundamentally a **text string**, either referenced by a string variable, or that is now the text contents of a text file in the file system, which could be opened with an editor independently of anything Python'esque.\n", "\n", "When we `load()` or `loads()`, we are going the other direction, and are converting from a string or from the contents of a text file in the file system, and are building an in-memory data structure that we can then traverse and compute with." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Writing to JSON\n", "\n", "Typical steps when we want to **create** a file with the JSON text representation of a data structure from our Python program (to be able to send to another scientist, much like we might build a two-D structure and then want to send a CSV file).\n", "\n", "1. Create the data structure, making sure it is limited to dictionaries, lists, strings, integers, and floating point numbers in a single data structure. Call this `D` as a Python variable referencing the structure.\n", "2. Using a file system `open`, create and open a file for **writing** specifying the path and name for the desired file. We can call the file object `F`.\n", "3. From the `json` module (imported above as its \"natural\" name, `json`), invoke the `dump()` function, passing it `D` and `F`.\n", "4. Close `F`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q1:** Use the four step process given above to write to a JSON-formatted file. 
In particular, create a Python data structure consisting of a list-of-dictionaries representation of the following table, where student names are strings, GPAs are floating point numbers, and (class) years are integers.\n", "\n", "student | gpa | year\n", "--------|-----|-------\n", "Jane | 3.75 | 3\n", "Bill | 2.85 | 2\n", "Fred | 3.5 | 3\n", "Mary | 3.25 | 1\n", "\n", "Call the in-memory data structure `LoD`. For this exercise, the desired destination file should be in the current directory and will be named `students.json`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "490753c841b8d174d9149c230ebc1636", "grade": false, "grade_id": "cell-86c98fb8736ed7b3", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "path = os.path.join(\".\", \"students.json\")\n", "if os.path.isfile(path):\n", "    os.remove(path)\n", "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "f01f37565d002d95b3f06587429a25e8", "grade": true, "grade_id": "cell-b3992585ad29a7cd", "locked": true, "points": 2, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert os.path.isfile(path)\n", "with open(path, \"r\") as F2:\n", "    LoD2 = json.load(F2)\n", "assert isinstance(LoD2, list)\n", "assert len(LoD2) == 4\n", "assert isinstance(LoD2[0], dict)\n", "assert len(LoD2[0]) == 3\n", "assert 'student' in LoD2[0]\n", "assert 'gpa' in LoD2[0]\n", "assert 'year' in LoD2[0]\n", "assert isinstance(LoD2[0]['gpa'], float)\n", "assert isinstance(LoD2[0]['year'], int)\n", "assert LoD2[0]['gpa'] == 3.75\n", "assert LoD2[0]['year'] == 3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q2** Double-click on `students.json` in the current (practicum) directory. Use the arrows to show and hide portions of the tree structure in the text file. Now right-click on `students.json` and choose to open it with the `Editor` to see the underlying text. Compare the places where there are strings. Are the strings the same (do they use the same delimiters) as what you used to create an initializer for your in-memory data structure? Why or why not? Do they have to be?" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "nbgrader": { "cell_type": "markdown", "checksum": "346a622616536c4b52cbc0a6d5c70dbc", "grade": true, "grade_id": "cell-62873513f3db7389", "locked": false, "points": 2, "schema_version": 3, "solution": true, "task": false } }, "source": [ "YOUR ANSWER HERE" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Clean up cell\n", "\n", "path = os.path.join(\".\", \"students.json\")\n", "if os.path.isfile(path):\n", "    os.remove(path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading from JSON\n", "\n", "In the data directory is a text file named `eu_covid.json`. Right-click it and select `Editor` to open it with a simple text editor. Be patient, as it is a **large** file. Then use the code cell below to print the first 30 lines of the file; a brief, ungraded aside on `dumps()` and `loads()` appears first." 
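] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Aside (ungraded, not part of the practicum questions):* the next cell is a minimal, non-authoritative sketch of the string-based conversion functions `dumps()` and `loads()` from the table above, run on a small made-up structure (the names `example_data`, `example_text`, and `example_roundtrip` are purely illustrative). The file-based `dump()` and `load()` behave the same way, except that they write to and read from a file object instead of returning or accepting a string." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Ungraded illustration: a small made-up structure limited to JSON-friendly types\n", "example_data = {\"students\": [{\"student\": \"Ada\", \"gpa\": 4.0, \"year\": 4}]}\n", "\n", "# dumps(): convert the in-memory structure into a JSON-formatted text string\n", "example_text = json.dumps(example_data)\n", "print(example_text)\n", "print(type(example_text))  # it is just a str\n", "\n", "# loads(): go the other direction, rebuilding an in-memory structure\n", "example_roundtrip = json.loads(example_text)\n", "print(example_roundtrip == example_data)  # True: the round trip preserves the data"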
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "path = os.path.join(datadir, \"eu_covid.json\")\n", "with open(path, 'r') as covid_file:\n", " for i in range(30):\n", " line = covid_file.readline()\n", " print(line, end='')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q3** As mentioned in class, to be able to process tree-structured data, you first must **understand** the structure before you attempt to process it. The next couple of sub-questions begin that process based on **inspection** of the text file, and before we convert it into an in-memory data structure.\n", "\n", "**A:** What is the JSON data type for the structure at the root of the tree?\n", "\n", "**B:** What is the JSON data type for the **value** that the top-level child **maps to**.\n", "\n", "**C.** Within the data type you answered for **B**, what is the JSON data type for the elements at this next level of the tree?" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "nbgrader": { "cell_type": "markdown", "checksum": "52fe15a72818b289a6ff88da40e07cfd", "grade": true, "grade_id": "cell-b12276de6bbde4dc", "locked": false, "points": 3, "schema_version": 3, "solution": true, "task": false } }, "source": [ "YOUR ANSWER HERE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q4** Using the technique shown in the textbook section 2.4.2, create a variable for the path to the `eu_covid.json` JSON file and then `open()` and `load()` from the text file into an in-memory data structure in Python referred to by variable `covid_data`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "c6ec3038eec456fa7df32213b3bc831b", "grade": true, "grade_id": "cell-f267174d30854f0e", "locked": false, "points": 2, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q5** If **you** were writing the testing `assert` statements that made sure that `covid_data` contained the in-memory data structure and that your answers to **A**, **B**, and **C** of **Q3** were correct, you would have three lines that checked \n", "\n", "1. the Python data type of the top level, \n", "2. the Python data type of the child mapping from the root top level to its value\n", "3. the Python data type of an element (probably the first) in the elements within the structure asserted by the previous step.\n", "\n", "All three of these would take the form of:\n", "\n", "`assert isinstance(, )`\n", "\n", "Note that your answers in **Q3** were in terms of the JSON data types, and these asserts are in terms of Python data types.\n", "\n", "Make these three assertions in the cell that follows." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "f17ad7a8220daab3535b71a4946e5eeb", "grade": true, "grade_id": "cell-b9b0a8fc3f7b8ff4", "locked": false, "points": 3, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q6** Hopefully, you have determined that the top level of the in-memory data structure is a dictionary, that there is only one child mapping, with maps the string `\"records\"` to a list, and that the elements of this list are each a dictionary. One last sanity check you should perform before writing code to traverse the structure:\n", "\n", "- because the file is large, it is difficult to be sure that there is **only** one child mapping in the top level dictionary. So in the cell that follows, just print the set of dictionary **keys** found in the top level dictionary (converted to a list). If there is only one element in the list, then we are ready to start building a tabular structure from this tree." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "6d501f25b173617c4c6de3a1d82f4666", "grade": true, "grade_id": "cell-5c13676d1f927f2e", "locked": false, "points": 1, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q7** If our goal is one or more `pandas` DataFrame tables of **tidy** data, there are a number of possible solution paths we could pursue. For now, we are going to assume we want **all** of the innermost dictionary fields as the columns in a `pandas` DataFrame result and will perform normalization after we get this single table. Given the innermost structure, representing the **rows** of our desired DataFrame is a **dictionary**, one path would be for us to build a **List of Dictionaries** data structure by starting with an empty list and then appending a **copy** of each of the dictionaries within the `covid_data['records']` list.\n", "\n", "In the cell that follows, write this loop and build a structure named `LoD`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "82387c5c26287f9e8220159327c21897", "grade": false, "grade_id": "cell-76959c470d139c00", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "7852702ff7cc7dbb85dbbdcfefa7f0cc", "grade": true, "grade_id": "cell-abde9c76b50503ba", "locked": true, "points": 2, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert isinstance(LoD, list)\n", "assert len(LoD) == 44136\n", "assert isinstance(LoD[0], dict)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q8** With the successfully created `LoD`, create a pandas DataFrame named `covid_df`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "a401aedbdb8a49dedd4d238a2e3b68de", "grade": true, "grade_id": "cell-5645b511e37e6fd0", "locked": false, "points": 1, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q9** Find the number of unique countries, assigning to `ncountry`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "228346a55272962781533629f876d7dd", "grade": true, "grade_id": "cell-e292daf935471ed3", "locked": false, "points": 1, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "print(ncountry)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q10** Use a GroupBy to group by `\"countryterritoryCode\"` and then aggregate, computing the max and mean for the columns `cases` and `deaths`. Name the resultant dataframe `aggs`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "64ca3abdb67f57e7844e0799d6697e86", "grade": true, "grade_id": "cell-aa5f6ab036596702", "locked": false, "points": 2, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "aggs.head()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }