Denison CS181/DA210 Homework

Before you turn this problem in, make sure everything runs as expected. To check, restart the kernel and then run all cells (in the menubar, select Kernel$\rightarrow$Restart And Run All).

Make sure you fill in every place that says "YOUR CODE HERE" or "YOUR ANSWER HERE".


In the questions that follow, we are looking for declarative XPath solutions to the problems, not procedural solutions. You will only get half credit for procedural solutions.

Please begin by importing whatever modules you need, reading in and parsing the relevant datasets, and familiarizing yourself with them.
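
As a warm-up, here is a minimal sketch of parsing XML with lxml and running one declarative XPath query. The inline document and its element names (library, item, name, cost) are made up for illustration and are not the structure of the assignment files, so inspect the real data before writing your queries.

```python
from lxml import etree

# Hypothetical inline data; the assignment files may be structured differently.
SAMPLE = b"""<library>
  <item code="a1"><name>First</name><cost>3.50</cost></item>
  <item code="a2"><name>Second</name><cost>12.00</cost></item>
</library>"""

# For a file on disk you would typically use etree.parse(path).getroot() instead.
root = etree.fromstring(SAMPLE)

# text() selects the text nodes of each <name>; lxml returns them as string-like objects.
names = root.xpath("/library/item/name/text()")
print(root.tag, names)
```

The query itself does the selecting, so no Python loop over elements is needed. That is what "declarative" means in the questions below.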

Q1: Using the provided bookstore.xml file, create a Python list called "books" containing the titles of all books. Your list books should be a list of strings.

In [ ]:
# Solution cell
books = []
# YOUR CODE HERE
raise NotImplementedError()
print(books)
type(books[0])
In [ ]:
assert len(books) > 0 and type(books[0]) is etree._ElementUnicodeResult
assert 'Lover Birds' in books and 'Splish Splash' in books
assert len(books)==12

Q2: Create a list named less containing the ids of the books that cost less than $6. Note that id is an attribute.
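
The general pattern of filtering elements with a numeric predicate and then stepping to an attribute can be sketched as follows. The data and names (item, cost, code) are hypothetical, and the sketch assumes the numeric value is stored in a child element, which may not match the real file.

```python
from lxml import etree

# Hypothetical data: each <item> has a code attribute and a <cost> child.
SAMPLE = b"""<library>
  <item code="a1"><name>First</name><cost>3.50</cost></item>
  <item code="a2"><name>Second</name><cost>12.00</cost></item>
  <item code="a3"><name>Third</name><cost>4.25</cost></item>
</library>"""
root = etree.fromstring(SAMPLE)

# The predicate [cost < 5] compares the child's text as a number;
# /@code then selects the attribute value of each matching item.
cheap_codes = root.xpath("//item[cost < 5]/@code")
print(cheap_codes)
```

XPath 1.0 converts the element's text to a number for the comparison, so no Python-side conversion is needed inside the query.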

In [ ]:
# Solution cell
less = []
# YOUR CODE HERE
raise NotImplementedError()
less
In [ ]:
assert len(less) > 0 and type(less[0]) is etree._ElementUnicodeResult
assert 'bk104' in less
assert 'bk101' not in less
assert len(less)==7

Q3: Create a list called "eva" containing the titles of the books written by Eva Corets. Your list eva should be a list of strings.

In [ ]:
# Solution cell
eva = []
# YOUR CODE HERE
raise NotImplementedError()
eva
In [ ]:
assert len(eva) > 0 and type(eva[0]) is etree._ElementUnicodeResult
assert len(eva)==3
assert 'Maeve Ascendant' in eva
assert 'Paradox Lost' not in eva

Q4: Find the average book price for all books in this file that are not fantasy, assigning it to the variable avgprice. Hints: First, use a single XPath query to get a list of the price strings (text). Then use a list comprehension to convert the strings to float values. Finally, compute the average from the sum of the values and the length of the list.
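
The three steps in the hints can be sketched on made-up data as follows. The names (item, kind, cost) are hypothetical, and the sketch assumes the category is a child element; in the real file it may be an attribute, in which case the predicate would test an attribute instead.

```python
from lxml import etree

# Hypothetical data; <kind> plays the role of a category to exclude.
SAMPLE = b"""<library>
  <item><kind>toy</kind><cost>3.50</cost></item>
  <item><kind>tool</kind><cost>12.00</cost></item>
  <item><kind>tool</kind><cost>4.00</cost></item>
</library>"""
root = etree.fromstring(SAMPLE)

# Step 1: one XPath query returning the cost text of every non-toy item.
cost_strings = root.xpath("//item[not(kind = 'toy')]/cost/text()")

# Step 2: convert the strings to floats with a list comprehension.
costs = [float(s) for s in cost_strings]

# Step 3: average = sum of the values divided by how many there are.
mean_cost = sum(costs) / len(costs)
print(mean_cost)
```

Writing the exclusion as not(kind = 'toy') rather than kind != 'toy' also keeps items that have no kind child at all, which may or may not be what a given dataset calls for.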

In [ ]:
# Solution cell
avgprice = 0
# YOUR CODE HERE
raise NotImplementedError()
avgprice
In [ ]:
assert(avgprice > 23.82)
assert(avgprice < 24)

Q5: Create a list called lessFantasy containing the titles of the books that cost under $40 and are not in the fantasy genre.
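
Two conditions can be combined inside one predicate with and, as in this sketch. The data is hypothetical, and here the category is assumed to be an attribute (kind) to show the attribute form of the test; check which form the real file uses.

```python
from lxml import etree

# Hypothetical data: kind is an attribute, cost is a child element.
SAMPLE = b"""<library>
  <item kind="toy"><name>First</name><cost>3.50</cost></item>
  <item kind="tool"><name>Second</name><cost>12.00</cost></item>
  <item kind="tool"><name>Third</name><cost>4.25</cost></item>
</library>"""
root = etree.fromstring(SAMPLE)

# Both conditions must hold: cost below 10 AND kind is not 'toy'.
picked = root.xpath("//item[cost < 10 and not(@kind = 'toy')]/name/text()")
print(picked)
```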

In [ ]:
# Solution cell
lessFantasy = []
# YOUR CODE HERE
raise NotImplementedError()
lessFantasy
In [ ]:
assert len(lessFantasy)==6
assert 'Paradox Lost' in lessFantasy
assert 'Maeve Ascendant' not in lessFantasy

Q6: Using countries.xml, generate a list of the names of all the countries in the file, assigning it to a variable countries; then assign the number of countries to the variable countrycount. When you read in and parse the file, please name the root element croot.

In [ ]:
# Solution cell
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
assert(countrycount == 231)
assert('Uruguay' in countries)
assert type(croot) is etree._Element

Q7: Write a function findPop(root,country) that finds the population of a given country in the dataset countries.xml. Use an XPath expression and a format string. Return your answer as an integer.
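
One way to build an XPath query from a function argument with a format string is sketched below. The function name find_residents, the data, and the attribute names are all hypothetical; in particular, the sketch assumes the number is stored as an attribute, which may not hold for countries.xml.

```python
from lxml import etree

# Hypothetical data; not the structure of countries.xml.
SAMPLE = b"""<regions>
  <region name="North" residents="1200"/>
  <region name="South" residents="3400"/>
</regions>"""
root = etree.fromstring(SAMPLE)

def find_residents(root, region_name):
    """Return the residents count of the named region as an int."""
    # The format string drops the argument into the predicate's string literal.
    query = "//region[@name='{}']/@residents".format(region_name)
    matches = root.xpath(query)
    # xpath() returns a list; take the first match and convert it.
    return int(matches[0])

print(find_residents(root, "South"))
```

A format string breaks if the name itself contains an apostrophe. lxml also accepts XPath variables, e.g. root.xpath("//region[@name=$n]/@residents", n=region_name), which avoids that problem, but the question asks for the format-string form.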

In [ ]:
# Solution cell
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
# Testing cell
assert findPop(croot,'Cuba') == 10951334
assert findPop(croot,'Uruguay') == 3238952

Q8: Study the countries data carefully. Then use the position() function to create a node set consisting of, for countries in positions 5-55 inclusive, the population of the second city listed, if there are at least two cities listed. For example, nothing is in the node set for Aruba (no cities listed) or Armenia (only Yerevan listed), but Cordoba is in the node set thanks to Argentina. Your answer should use a single XPath expression. Please store the results in a list secondPops of integers.
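
The idea of restricting one step by position() and then picking the second child can be illustrated on made-up data. The names (region, town, size) are hypothetical and the positions are chosen only for this example.

```python
from lxml import etree

# Hypothetical data: regions with zero, one, or several towns.
SAMPLE = b"""<regions>
  <region name="R1"><town size="10"/><town size="20"/></region>
  <region name="R2"/>
  <region name="R3"><town size="30"/></region>
  <region name="R4"><town size="40"/><town size="50"/><town size="60"/></region>
  <region name="R5"><town size="70"/><town size="80"/></region>
</regions>"""
root = etree.fromstring(SAMPLE)

# Regions in positions 2 through 5, then the second town of each, then its size.
# Regions with fewer than two towns simply contribute nothing to the node set.
sizes = root.xpath(
    "/regions/region[position() >= 2 and position() <= 5]/town[2]/@size"
)
second_sizes = [int(s) for s in sizes]
print(second_sizes)
```

Here town[2] is shorthand for town[position() = 2]. R1 is excluded by the position range, R2 and R3 have no second town, so only R4 and R5 contribute.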

In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
assert len(secondPops) == 6
assert secondPops[0] == 1111811

Q9: With reference to the topnames dataset, please find all years where there was a count (either gender) that was strictly larger than 50,000. Please navigate to the appropriate attribute, rather than returning a list of elements. Store the result in a list named nodeset.

In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
assert nodeset[0] == '1915'
assert len(nodeset) == 78

Q10: With reference to the topnames dataset, please find all years where the top female name had a count that was strictly larger than 50,000. Please navigate to the appropriate attribute, rather than returning a list of elements. Store the result in a list named nodeset.
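
Q9 and Q10 differ in whether the condition may be met by any child or only by one particular child. The sketch below shows both shapes on made-up data; the names (season, score, team, points, year) are hypothetical and the topnames file may well store its values differently. Note that numeric thresholds in XPath are written without a thousands separator.

```python
from lxml import etree

# Hypothetical data: each season has one score per team.
SAMPLE = b"""<seasons>
  <season year="2001"><score team="home" points="5"/><score team="away" points="9"/></season>
  <season year="2002"><score team="home" points="8"/><score team="away" points="2"/></season>
  <season year="2003"><score team="home" points="1"/><score team="away" points="3"/></season>
</seasons>"""
root = etree.fromstring(SAMPLE)

# "Any child" form: true if at least one score in the season exceeds 6.
any_high = root.xpath("//season[score/@points > 6]/@year")

# "Specific child" form: only the home score is tested.
home_high = root.xpath("//season[score[@team = 'home']/@points > 6]/@year")

print(any_high, home_high)
```

Both queries end in /@year, so they return attribute values rather than elements.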

In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
assert nodeset[0] == '1915'
assert len(nodeset) == 68