Before you turn this problem in, make sure everything runs as expected. First, restart the kernel (in the menubar, select Kernel$\rightarrow$Restart) and then run all cells (in the menubar, select Cell$\rightarrow$Run All).
Make sure you fill in any place that says YOUR CODE HERE
or "YOUR ANSWER HERE", as well as your name and collaborators below:
NAME = ""
COLLABORATORS = ""
import os
import os.path
import json
import pandas as pd
datadir = "publicdata"
Function | Description |
---|---|
json.load(file) |
Read and return the JSON-formatted data structure from the file object file . |
json.dump(data, file) |
Write the data structure data to the file object file in JSON text format. |
json.loads(s) |
Using the JSON-formatted string given by s , interpret and construct and return the corresponding data structure. |
json.dumps(data) |
Translate the data structure data into a JSON-formatted string and return the string. |
Note that, when we dump()
or dumps()
, we are converting from an in-memory data structure that consists of dictionaries, lists, strings, and numbers (that one can "do math" on), into what is fundamentally a text string, either referenced by a string variable, or that is now the text contents of a text file in the file system, which could be opened with an editor independently of anything Python'esque.
When we load()
or loads()
, we are going the other direction, and are converting from a string or from the contents of a text file in the file system, and are building an in-memory data structure that we can then traverse and compute with.
Typical steps when we want to create a file with the JSON text representation of a data structure from our Python program (to be able to send to another scientist, much like we might build a two-D structure and then want to send a CSV file).
D
as a Python variable referencing the structure.open
, create and open a file for writing specifying the path and name for the desired file. We can call the file object F
.json
module (imported above as its "natural" name, json
), invoke the dump()
function, passing it D
and F
.F
.Q1: Use the four step process given above to write to a JSON-formatted file. In particular, create a Python data structure consisting of a list of dictionaries representation of the following table, where student names are strings, gpa's are floating point numbers, and (class) years are integers.
student | gpa | year |
---|---|---|
Jane | 3.75 | 3 |
Bill | 2.85 | 2 |
Fred | 3.5 | 3 |
Mary | 3.25 | 1 |
Call the in-memory data structure LoD
. For this exercise, the desired destination file should be in the current directory and will be named students.json
.
path = os.path.join(".", "students.json")
if os.path.isfile(path):
os.remove(path)
# YOUR CODE HERE
raise NotImplementedError()
assert os.path.isfile(path)
with open(path, "r") as F2:
LoD2 = json.load(F2)
assert isinstance(LoD2, list)
assert len(LoD2) == 4
assert isinstance(LoD2[0], dict)
assert len(LoD2[0]) == 3
assert 'student' in LoD2[0]
assert 'gpa' in LoD2[0]
assert 'year' in LoD2[0]
assert isinstance(LoD[0]['gpa'], float)
assert isinstance(LoD[0]['year'], int)
assert LoD[0]['gpa'] == 3.75
assert LoD[0]['year'] == 3
Q2 Double click on students.json
in the current (practicum) directory. Use the arrows to show and hide portions of the tree structure in the text file. Now right-click on students.json
and select to open with Editor
to see the underlying text. Compare the places where there are strings. Are the strings the same (use the same delimiters) as what you used to create an initializer for your in-memory data structure? Why or why not? Do they have to be?
YOUR ANSWER HERE
# Clean up cell
path = os.path.join(".", "students.json")
if os.path.isfile(path):
os.remove(path)
In the data directory is a text file named eu_covid.json
. Right click and select Editor
to open with a simple text editor. Be patient, as it is a large file. Then use the following cell to print the first 30 lines of the file.
path = os.path.join(datadir, "eu_covid.json")
with open(path, 'r') as covid_file:
for i in range(30):
line = covid_file.readline()
print(line, end='')
Q3 As mentioned in class, to be able to process tree-structured data, you first must understand the structure before you attempt to process it. The next couple of sub-questions begin that process based on inspection of the text file, and before we convert it into an in-memory data structure.
A: What is the JSON data type for the structure at the root of the tree?
B: What is the JSON data type for the value that the top-level child maps to.
C. Within the data type you answered for B, what is the JSON data type for the elements at this next level of the tree?
YOUR ANSWER HERE
Q4 Using the technique shown in the textbook section 2.4.2, create a variable for the path to the eu_covid.json
JSON file and then open()
and load()
from the text file into an in-memory data structure in Python referred to by variable covid_data
.
# YOUR CODE HERE
raise NotImplementedError()
Q5 If you were writing the testing assert
statements that made sure that covid_data
contained the in-memory data structure and that your answers to A, B, and C of Q3 were correct, you would have three lines that checked
All three of these would take the form of:
assert isinstance(<expression>, <expected data type>)
Note that your answers in Q3 were in terms of the JSON data types, and these asserts are in terms of Python data types.
Make these three assertions in the cell that follows.
# YOUR CODE HERE
raise NotImplementedError()
Q6 Hopefully, you have determined that the top level of the in-memory data structure is a dictionary, that there is only one child mapping, with maps the string "records"
to a list, and that the elements of this list are each a dictionary. One last sanity check you should perform before writing code to traverse the structure:
# YOUR CODE HERE
raise NotImplementedError()
Q7 If our goal is one or more pandas
DataFrame tables of tidy data, there are a number of possible solution paths we could pursue. For now, we are going to assume we want all of the innermost dictionary fields as the columns in a pandas
DataFrame result and will perform normalization after we get this single table. Given the innermost structure, representing the rows of our desired DataFrame is a dictionary, one path would be for us to build a List of Dictionaries data structure by starting with an empty list and then appending a copy of each of the dictionaries within the covid_data['records']
list.
In the cell that follows, write this loop and build a structure named LoD
.
# YOUR CODE HERE
raise NotImplementedError()
assert isinstance(LoD, list)
assert len(LoD) == 44136
assert isinstance(LoD[0], dict)
Q8 With the successfully created LoD
, create a pandas DataFrame named covid_df
# YOUR CODE HERE
raise NotImplementedError()
Q9 Find the number of unique countries, assigning to ncountry
# YOUR CODE HERE
raise NotImplementedError()
print(ncountry)
Q10 Use a GroupBy to group by "countryterritoryCode"
and then aggregate, computing the max and mean for the columns cases
and deaths
. Name the resultant dataframe aggs
.
# YOUR CODE HERE
raise NotImplementedError()
aggs.head()