Denison CS181/DA210 Homework

Before you turn this problem in, make sure everything runs as expected: restart the kernel and then run all cells (in the menubar, select Kernel → Restart And Run All).

Make sure you fill in any place that says "YOUR CODE HERE" or "YOUR ANSWER HERE".


In [ ]:
import io
from lxml import etree
import json
import sys
import os.path
import pandas as pd

datadir = "publicdata"

Q1 Consider the following table of subjects data:

subject  name                department
CS       Computer Science    MATH
MATH     Mathematics         MATH
ENGL     English Literature  ENGL

Using a text editor, create a file named subjects.xml in the current directory that holds a legal XML representation of this data. Then write a Python code sequence that reads and parses the file and, using the technique from this section, prints the entire tree. In the penultimate step, before you print, assign the decoded string version of the tree to a Python variable named subjects_str.
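As a rough illustration of this workflow (not the required answer), the sketch below uses a made-up pets table instead of the subjects data. The element names (pets, pet, name, species), the file name pets_example.xml, and the choice of one child element per column are all assumptions; attributes would be an equally legal encoding. The import falls back to the standard library's ElementTree only so that the sketch also runs where lxml is not installed.

```python
import os
import tempfile

try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree

# Hypothetical two-row table: one <pet> element per row, one child per column.
PETS_XML = """<pets>
  <pet><name>Rex</name><species>dog</species></pet>
  <pet><name>Tom</name><species>cat</species></pet>
</pets>
"""

# Write the XML to a scratch file so it can be read back like a hand-edited file.
workdir = tempfile.mkdtemp()
pets_path = os.path.join(workdir, "pets_example.xml")
with open(pets_path, "w", encoding="utf-8") as f:
    f.write(PETS_XML)

tree = etree.parse(pets_path)   # read and parse the file
root = tree.getroot()           # Element at the root of the tree

# tostring() returns bytes by default; decode() turns them into a Python str.
pets_str = etree.tostring(root).decode("utf-8")
print(pets_str)
```

With lxml, etree.tostring also accepts pretty_print=True, which may give more readable output when the parser has been told to drop blank text.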

In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
# Testing Cell

path = os.path.join(".", "subjects.xml")
assert os.path.isfile(path)
assert isinstance(subjects_str, str)
assert 75 < len(subjects_str)

Q2 Now consider the courses table below. Using a text editor, create courses.xml so that it contains an XML tree representing this table:

subject  coursenum  title
CS       110        Computing with Digital Media
CS       372        Operating Systems
MATH     210        Proof Techniques
ENGL     213        Early British Literature

Once the file is created, write a Python code sequence to read and parse the file, and then, using the technique from this section, print the entire tree. Assign the string version of the courses tree to the variable courses_str.

In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
# Testing Cell

path = os.path.join(".", "courses.xml")
assert os.path.isfile(path)
assert isinstance(courses_str, str)
assert 75 < len(courses_str)

Q3 Suppose you wanted a tree that contained both of the above tables. Write a file named school.xml in the current directory that combines both of the above tables into a single tree.

As before, once the file is created, write a Python code sequence to read and parse the file, and then print the entire tree. In order to not depend on the correctness of the prior two questions, this problem will be graded manually, so you do not need any particular variable names.
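One possible way to picture such a composition, using made-up library data rather than the school tables: each table keeps its own container element, and a single new root wraps both. The element names (library, authors, books, and so on) are assumptions for illustration, and the XML is parsed from a string here only to keep the sketch short. The import falls back to the standard library's ElementTree where lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree

# Hypothetical data: two tables, each under its own element, wrapped in one root.
LIBRARY_XML = """<library>
  <authors>
    <author><id>A1</id><name>Ada</name></author>
  </authors>
  <books>
    <book><author>A1</author><title>Notes</title></book>
    <book><author>A1</author><title>Letters</title></book>
  </books>
</library>"""

root = etree.fromstring(LIBRARY_XML)

# Each child of the root is one table; its own children are that table's rows.
table_sizes = {child.tag: len(child) for child in root}
print(table_sizes)  # {'authors': 1, 'books': 2}
```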

In [ ]:
# YOUR CODE HERE
raise NotImplementedError()

Q4 Write a function:

getLocalXML(filename, datadir=".", parser=None)

that performs the common steps of creating a path from the given filename and datadir and parsing the XML file with the passed parser, if any, and then returns the Element at the root of the tree. If parser is not passed, the standard XMLParser should be used.

If the file is not found, or if the parse is unsuccessful (due to XML not being "well formed"), the function should return None. Remember that if a parse is unsuccessful, the etree module raises an exception. That means that you should have a try block, and indented within that block, the parse() invocation should occur. The try block is followed by an except Exception as e: line, and within that, your return None. If no exception is raised, code execution will proceed beyond the try/except block, and that is where you would return the root of the parsed tree.
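The try/except shape described above can be sketched on a simpler, hypothetical helper that parses XML from a string instead of a file. The name parse_text_or_none is an assumption made for this illustration and is not part of the assignment; the import falls back to the standard library's ElementTree where lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree

def parse_text_or_none(xml_text):
    """Return the root Element of xml_text, or None if it is not well formed."""
    try:
        # The parse happens inside the try block, since it may raise.
        root = etree.fromstring(xml_text)
    except Exception as e:
        # e describes what went wrong; this sketch only signals failure.
        return None
    # Reached only when no exception was raised.
    return root

good = parse_text_or_none("<a><b/></a>")
bad = parse_text_or_none("<a><b></a>")  # <b> is never closed
print(good.tag, bad)  # a None
```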

In [ ]:
# Solution cell
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
# Testing cell
myparser = etree.XMLParser(remove_blank_text=True)
wroot = getLocalXML("widombooks.xml", datadir, myparser)
assert len(wroot) == 8
bad = getLocalXML("foo.xml", datadir, myparser)
assert bad is None
bad2 = getLocalXML("bad.xml", datadir)
assert bad2 is None