Before you turn this problem in, make sure everything runs as expected. First, restart the kernel (in the menubar, select Kernel$\rightarrow$Restart) and then run all cells (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says YOUR CODE HERE or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [ ]:
NAME = ""
COLLABORATORS = ""

JSON Practicum

In [ ]:
import os
import os.path
import json
import pandas as pd

datadir = "publicdata"
Function Description
json.load(file) Read and return the JSON-formatted data structure from the file object file.
json.dump(data, file) Write the data structure data to the file object file in JSON text format.
json.loads(s) Using the JSON-formatted string given by s, interpret and construct and return the corresponding data structure.
json.dumps(data) Translate the data structure data into a JSON-formatted string and return the string.

Note that, when we dump() or dumps(), we are converting from an in-memory data structure that consists of dictionaries, lists, strings, and numbers (that one can "do math" on), into what is fundamentally a text string, either referenced by a string variable, or that is now the text contents of a text file in the file system, which could be opened with an editor independently of anything Python'esque.

When we load() or loads(), we are going the other direction, and are converting from a string or from the contents of a text file in the file system, and are building an in-memory data structure that we can then traverse and compute with.

Writing to JSON

Typical steps when we want to create a file with the JSON text representation of a data structure from our Python program (to be able to send to another scientist, much like we might build a two-D structure and then want to send a CSV file).

  1. Create the data structure, making sure it is limited to dictionaries, lists, strings, integers, and floating point numbers in a single data structure. Call this D as a Python variable referencing the structure.
  2. Using a file system open, create and open a file for writing specifying the path and name for the desired file. We can call the file object F.
  3. From the json module (imported above as its "natural" name, json), invoke the dump() function, passing it D and F.
  4. Close F.

Q1: Use the four step process given above to write to a JSON-formatted file. In particular, create a Python data structure consisting of a list of dictionaries representation of the following table, where student names are strings, gpa's are floating point numbers, and (class) years are integers.

student gpa year
Jane 3.75 3
Bill 2.85 2
Fred 3.5 3
Mary 3.25 1

Call the in-memory data structure LoD. For this exercise, the desired destination file should be in the current directory and will be named students.json.

In [ ]:
path = os.path.join(".", "students.json")
if os.path.isfile(path):
    os.remove(path)
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
assert os.path.isfile(path)
with open(path, "r") as F2:
    LoD2 = json.load(F2)
assert isinstance(LoD2, list)
assert len(LoD2) == 4
assert isinstance(LoD2[0], dict)
assert len(LoD2[0]) == 3
assert 'student' in LoD2[0]
assert 'gpa' in LoD2[0]
assert 'year' in LoD2[0]
assert isinstance(LoD[0]['gpa'], float)
assert isinstance(LoD[0]['year'], int)
assert LoD[0]['gpa'] == 3.75
assert LoD[0]['year'] == 3

Q2 Double click on students.json in the current (practicum) directory. Use the arrows to show and hide portions of the tree structure in the text file. Now right-click on students.json and select to open with Editor to see the underlying text. Compare the places where there are strings. Are the strings the same (use the same delimiters) as what you used to create an initializer for your in-memory data structure? Why or why not? Do they have to be?

YOUR ANSWER HERE

In [ ]:
# Clean up cell

path = os.path.join(".", "students.json")
if os.path.isfile(path):
    os.remove(path)

Reading from JSON

In the data directory is a text file named eu_covid.json. Right click and select Editor to open with a simple text editor. Be patient, as it is a large file. Then use the following cell to print the first 30 lines of the file.

In [ ]:
path = os.path.join(datadir, "eu_covid.json")
with open(path, 'r') as covid_file:
    for i in range(30):
        line = covid_file.readline()
        print(line, end='')

Q3 As mentioned in class, to be able to process tree-structured data, you first must understand the structure before you attempt to process it. The next couple of sub-questions begin that process based on inspection of the text file, and before we convert it into an in-memory data structure.

A: What is the JSON data type for the structure at the root of the tree?

B: What is the JSON data type for the value that the top-level child maps to.

C. Within the data type you answered for B, what is the JSON data type for the elements at this next level of the tree?

YOUR ANSWER HERE

Q4 Using the technique shown in the textbook section 2.4.2, create a variable for the path to the eu_covid.json JSON file and then open() and load() from the text file into an in-memory data structure in Python referred to by variable covid_data.

In [ ]:
# YOUR CODE HERE
raise NotImplementedError()

Q5 If you were writing the testing assert statements that made sure that covid_data contained the in-memory data structure and that your answers to A, B, and C of Q3 were correct, you would have three lines that checked

  1. the Python data type of the top level,
  2. the Python data type of the child mapping from the root top level to its value
  3. the Python data type of an element (probably the first) in the elements within the structure asserted by the previous step.

All three of these would take the form of:

assert isinstance(<expression>, <expected data type>)

Note that your answers in Q3 were in terms of the JSON data types, and these asserts are in terms of Python data types.

Make these three assertions in the cell that follows.

In [ ]:
# YOUR CODE HERE
raise NotImplementedError()

Q6 Hopefully, you have determined that the top level of the in-memory data structure is a dictionary, that there is only one child mapping, with maps the string "records" to a list, and that the elements of this list are each a dictionary. One last sanity check you should perform before writing code to traverse the structure:

  • because the file is large, it is difficult to be sure that there is only one child mapping in the top level dictionary. So in the cell that follows, just print the set of dictionary keys found in the top level dictionary (converted to a list). If there is only one element in the list, then we are ready to start building a tabular structure from this tree.
In [ ]:
# YOUR CODE HERE
raise NotImplementedError()

Q7 If our goal is one or more pandas DataFrame tables of tidy data, there are a number of possible solution paths we could pursue. For now, we are going to assume we want all of the innermost dictionary fields as the columns in a pandas DataFrame result and will perform normalization after we get this single table. Given the innermost structure, representing the rows of our desired DataFrame is a dictionary, one path would be for us to build a List of Dictionaries data structure by starting with an empty list and then appending a copy of each of the dictionaries within the covid_data['records'] list.

In the cell that follows, write this loop and build a structure named LoD.

In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
assert isinstance(LoD, list)
assert len(LoD) == 44136
assert isinstance(LoD[0], dict)

Q8 With the successfully created LoD, create a pandas DataFrame named covid_df

In [ ]:
# YOUR CODE HERE
raise NotImplementedError()

Q9 Find the number of unique countries, assigning to ncountry

In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
print(ncountry)

Q10 Use a GroupBy to group by "countryterritoryCode" and then aggregate, computing the max and mean for the columns cases and deaths. Name the resultant dataframe aggs.

In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
aggs.head()