Before you turn this problem in, make sure everything runs as expected. This is a combination of restarting the kernel and then running all cells (in the menubar, select Kernel$\rightarrow$Restart And Run All).
Make sure you fill in any place that says "YOUR CODE HERE" or "YOUR ANSWER HERE".
import os
import io
import sys
import importlib
import pandas as pd
from lxml import etree
import requests
from IPython.display import Image
htmlparser = etree.HTMLParser()
if os.path.isdir(os.path.join("../../..", "modules")):
    module_dir = os.path.join("../../..", "modules")
else:
    module_dir = os.path.join("../..", "modules")
module_path = os.path.abspath(module_dir)
if module_path not in sys.path:
    sys.path.append(module_path)
import util
importlib.reload(util)
Consider the web page at http://datasystems.denison.edu/topnames.html, shown in its rendered form below:
Image("figs/topnames.jpg", width=400)
Student Action: Go to the above web page using Chrome and navigate to View->Developer->Inspect Elements.

Find where in the HTML the "topnames" label for the table exists (not the tab, but in the content of the page proper). Traverse "up" to the first `div` ancestor, and draw the subtree starting at that point, down to the point where you have included the label text, but only covering the first full subtree of the `div` (containing the label).

Find the `table` node within the overall tree, and then draw the subtree rooted at `table`. You need only include the first two data-carrying rows. You may use abbreviations of your own choosing to make this less onerous.
Q1 For submission, upload with your notebook a picture of your two drawings in answer to the above.
# Reading from the web into an XML Element, using custom parser
url = util.buildURL("/topnames.html", "datasystems.denison.edu")
response = requests.get(url)
assert response.status_code == 200
tree = etree.parse(io.BytesIO(response.content), htmlparser)
root = tree.getroot()
Q2 Create in `xs` an XPath expression that (uniquely) finds the `div` that is the common ancestor of both the topnames label and the `table` containing the data. Assign the node itself (not the list containing the node) to the variable `divroot`, and use `print_xml` with an `nlines` argument to show the first 20 lines of the tree.
xs = ""
# YOUR CODE HERE
raise NotImplementedError()
Q3 Either using the `root` of the HTML tree, or from the `divroot` found above, assign to `xs` an XPath expression to get a list of the tables in the HTML, then execute the XPath and assign to the Python variable `table` the first such table. Then use `print_xml` to make sure your understanding of the table tree matches your hand-drawn tree from earlier. If you chose to start from `root`, would your expression be guaranteed to get the table you are interested in? Why or why not? Comment in the following markdown cell.
xs = ""
# YOUR CODE HERE
raise NotImplementedError()
util.print_xml(table, nlines=30)
Interpretation Solution Cell
xs = ""
# YOUR CODE HERE
raise NotImplementedError()
print(col_names)
Note that if the HTML document had multiple tables, then the XPath expression above would match header cells from all `th` elements anywhere in the document, including ones in totally separate tables. In such a case, we may have to make further assumptions about the structure of the HTML to uniquely find the desired table.
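To make that concern concrete, here is a small sketch contrasting an absolute `//th` search with a search scoped to one particular table. The two-table document and its `id` attributes are made up for illustration, not taken from the topnames page:

```python
import io
from lxml import etree

# Toy document with two unrelated tables (illustrative only)
html = b"""
<html><body>
  <table id="first">
    <tr><th>year</th><th>sex</th></tr>
  </table>
  <table id="second">
    <tr><th>state</th><th>population</th></tr>
  </table>
</body></html>
"""

htmlparser = etree.HTMLParser()
root = etree.parse(io.BytesIO(html), htmlparser).getroot()

# Absolute search: matches th elements in *every* table in the document
all_headers = root.xpath("//th/text()")

# Relative search: the leading "." scopes the search to one table node
second = root.xpath("//table[@id='second']")[0]
scoped_headers = second.xpath(".//th/text()")

print(all_headers)     # headers from both tables
print(scoped_headers)  # headers from the second table only
```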
In the following, we will assume that the variable `table` refers to the correct table, and focus on extracting the data and transforming it into a data frame.
A method from the textbook solves this problem by using XPath to obtain the text of all the `td` nodes under the `tbody`, and then accumulating those values, row by row, into a list of lists. In the following cells, we ask you to reproduce that method. Use your textbook if you need help.
Q5 In the following cell, use XPath on `table` to obtain a single list of the text of all the `td`-tagged nodes. Assign the result to `tdlist`.
# YOUR CODE HERE
raise NotImplementedError()
print(len(tdlist))
print("td list prefix:", tdlist[0:10])
Q6 Create an empty list, `LoL`, and a counter variable. Then write a for-loop that iterates over `tdlist`: if the counter is zero, create a new empty row list. Add the current td item to the row list, and then, conditionally, increment the counter if there are more fields to be accumulated into the row list, or reset the counter to zero if the row is complete. When a row is complete, append it to `LoL`.

Upon completion, `LoL` should be a list of row lists with string versions of the data for each row.
LoL = []
# YOUR CODE HERE
raise NotImplementedError()
LoL[:10]
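The counter pattern described in Q6 is not specific to this table. As a sketch, here it is applied to a made-up flat list of six values grouped into rows of three; the names `flat` and `rows` are illustrative, not from the problem:

```python
# Toy flat list: two logical rows of three fields each (illustrative data)
flat = ["2018", "Female", "Emma", "2018", "Male", "Liam"]
fields_per_row = 3

rows = []        # accumulates completed row lists
counter = 0      # position within the current row
for item in flat:
    if counter == 0:
        row = []             # start a new row
    row.append(item)         # add the current item to the row
    if counter < fields_per_row - 1:
        counter += 1         # more fields remain for this row
    else:
        counter = 0          # row is complete; reset and save it
        rows.append(row)

print(rows)
# → [['2018', 'Female', 'Emma'], ['2018', 'Male', 'Liam']]
```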
Q7 Finally, use our standard techniques for turning the list of lists into a `pandas` data frame, setting the index to the combination of year and sex, and assigning the result to variable `df`.
# Turning the LoL into a dataframe
# YOUR CODE HERE
raise NotImplementedError()
df.head(10)
A perhaps simpler solution uses the regularity of the columns in a table (be it in HTML or other regular table form). Within each `tr`, the position of each of the `td` elements for the four fields in this table is always the same, regardless of row. So at position 1 within every row we always have the year, at position 2 we always have the sex, and so forth.
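As a sketch of this positional idea, an XPath predicate like `td[1]` selects, within every row, the `td` that is the first `td` child of its `tr`. The two-row table below is made up for illustration, not taken from the topnames page:

```python
import io
from lxml import etree

# Toy table fragment (illustrative data only)
html = b"""
<table>
  <tr><td>2018</td><td>Female</td><td>Emma</td><td>18688</td></tr>
  <tr><td>2018</td><td>Male</td><td>Liam</td><td>19837</td></tr>
</table>
"""
table = etree.parse(io.BytesIO(html), etree.HTMLParser()).find(".//table")

# td[1] means "the td in position 1 of its tr", applied to every row
first_column = table.xpath(".//tr/td[1]/text()")
print(first_column)
# → ['2018', '2018']
```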
Q8 Assign to `xs` an XPath expression that retrieves, relative to `table`, the text property for all `td` elements under a `tr` where the `td` is the position-1 child of the `tr`. Execute the XPath on `table` and assign the result to variable `year_vector`.
xs = ""
# YOUR CODE HERE
raise NotImplementedError()
print(len(year_vector))
print("year vector prefix:", year_vector[:10])
We could do this four times, with different values for the position, creating four vectors and then constructing the dictionary. Instead, we want to use Python string formatting to dynamically create an XPath that retrieves the `td` at a position given by a variable, and then traverses to the text attribute.
Q9 Assign to `xs_template` a Python string using `{}` in place of the position from your solution above, so that the testing code shows its use by obtaining the four data columns from the table.
xs_template = ""
# YOUR CODE HERE
raise NotImplementedError()
years = table.xpath(xs_template.format(1))
print(years[:8])
sexes = table.xpath(xs_template.format(2))
print(sexes[:8])
names = table.xpath(xs_template.format(3))
print(names[:8])
counts = table.xpath(xs_template.format(4))
print(counts[:8])
Q10 Starting with an empty dictionary, `DoL`, we want to create a for-loop that, at each iteration, allows us to define a `position` and a `columnname`. The position should be used to format an appropriate XPath to get the set of column values at that position (like we did without a loop in the last question). The XPath should be executed, and the resultant list assigned to a dictionary entry whose key is the column name. Fill in the above-described loop.
# YOUR CODE HERE
raise NotImplementedError()
Consider the web page at https://ww2.energy.ca.gov/almanac/transportation_data/gasoline/margins/index_cms.php
We can infer from a PHP page, often signaled by a `php` extension, that the page is a dynamic one that, on an HTTP request, will respond by dynamically generating the HTML content. PHP is a scripting language that lets a server execute code instead of serving up static content.
On the given page, navigate toward the bottom, where you will find a Select Year drop-down and a button labeled Get different year. Pick the year you were born (or 1999 if you were born earlier than 1999), and then click the Get different year button.
The two GUI elements of the drop-down and the submission button constitute what, in HTML, is called a form (albeit one of the simplest forms one might imagine).
The way that a PHP or other dynamic resource path at a web server obtains the information for processing is via the client making a POST request. The request carries, in its body, the information the server needs from the client to do the dynamic generation ... otherwise, the page would simply be static.
In this case, by making a year choice and clicking the submission button (Get different year), the web browser makes a POST request and formats the body with the form data. We need our web scraping client applications to be able to perform the same way.
Q11 Go to the above web page using Chrome, follow the instructions, and answer any questions in the Markdown cell that follows.
1. In Developer Tools, select the Network tab.
2. Click the Get different year button.
3. In the Name subwindow, find the entry for the resulting request; it should be labeled index_cms.php.
4. Select index_cms.php and note the sub-tabs: Headers, Preview, Response, etc.
5. In the Headers sub-tab, make sure the Request Headers and the Form Data elements under Headers are expanded and the others are collapsed.
6. Click view source next to the Request Headers element:
   - What does the Content-length header line tell you?
   - What is the Content-type header?
7. For the Form Data element:
   - Click view source, then view parsed to get back to the key-value view.
   - Also try view URL encoded and view decoded; what is the difference? Which of these two is used in the view source view?

YOUR ANSWER HERE
Q12 In the following cell, after my provision of the necessary URL elements as Python variables, create a URL and then issue a GET request. Process the result into an XML tree. By the conclusion of your cell, you should assign to `root` the root of the HTML tree.
protocol = "https"
location = "ww2.energy.ca.gov"
resourcepath = "/almanac/transportation_data/gasoline/margins/index_cms.php"
# YOUR CODE HERE
raise NotImplementedError()
util.print_xml(root, nlines=10)
If we use Developer Tools (or View Source), we can find the form that contains the year drop-down and the submit button labeled Get different year. We want to examine that `form` subtree within our HTML.
Q13 Assign to `xs` an XPath string that finds the `form` node where the `action` attribute is set to the page's PHP of `index_cms.php`. Execute the XPath and set `form` to the first element of the result.
xs = ""
# YOUR CODE HERE
raise NotImplementedError()
util.print_xml(form)
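Once a form node is in hand, its drop-down's key and possible values can be read off programmatically. The sketch below uses toy HTML with an assumed `select`/`option` structure like the one described in the text, not the live page:

```python
import io
from lxml import etree

# Toy form subtree; the action, names, and year values are illustrative
html = b"""
<form action="index_cms.php" method="post">
  <select name="year">
    <option value="1999">1999</option>
    <option value="2000">2000</option>
  </select>
  <input type="submit" name="newYear" value="Get different year"/>
</form>
"""
form = etree.parse(io.BytesIO(html), etree.HTMLParser()).find(".//form")

# The select's name attribute gives the form-data key; the option
# value attributes give the possible values for that key
key = form.xpath(".//select/@name")[0]
values = form.xpath(".//select/option/@value")
print(key, values)
```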
Note, in the result, that the `method` attribute is `"post"`. That means that, when the embedded form is "filled out" and the user submits the form, an HTTP POST is the result:

- The `action` attribute of the form determines the resource path, relative to the current location, for the URI/resource path needed in the HTTP POST.
- The drop-down is realized by a `select` node whose children are `option` nodes, and whose values are the possible years. The key for this field is called `year`, as given in the `select` node. The value will be one of the year values.
- The `input` node determines the submission of the form. In this case, when the user clicks the "Get different year" button, the form will be submitted and, in addition to the key=value items from the form items, the `name` of the `input` attribute, `newYear`, will be mapped to the `value` of "Get different year".

We use an HTTP POST to convey information from the client to the server. The information conveyed is in the $\textit{body}$ of the request. So, in contrast to most earlier examples, we need to change two things in using the `requests` module to make this request:

1. issue a POST rather than a GET, and
2. include the form's key-value data in the body of the request.
For (1), the `requests` module has a `post` top-level function. For (2), we construct a dictionary with the desired mappings and pass it to `post()` using the named parameter `data`. The `requests` module is very flexible in how it interprets an argument provided through `data`: if it is a string, it simply puts the encoded bytes of the string in the body; if it is a dictionary, it interprets it and generates a URL-encoded version, as we will see below.
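One way to see this URL-encoding without sending anything over the network is to build a `requests.Request` and prepare it, then inspect the resulting body. The URL and the payload values below are illustrative, not the actual request for this exercise:

```python
import requests

# Toy form data mirroring the key=value structure a browser would send;
# the specific values here are illustrative
payload = {"year": "1999", "newYear": "Get different year"}

# Build (but do not send) a POST request, so we can inspect how the
# requests module URL-encodes a dictionary passed via data=
prepared = requests.Request(
    "POST", "https://example.com/index_cms.php", data=payload
).prepare()

print(prepared.body)                     # URL-encoded key=value pairs
print(prepared.headers["Content-Type"])  # set automatically for a dict body
```

Note that spaces become `+` in the encoded body, matching the "view URL encoded" form seen in the Chrome Developer Tools.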
Q14 Using the same URL as the GET above, make a POST request to this site with the body of the POST request specified in a dictionary that maps `year` to the desired year of data, and the `newYear` key mapped to the string `"Get different year"`. This emulates the request observed from Chrome.

Assign to `response` the result of the request, and be sure to check for a successful status code.
# YOUR CODE HERE
raise NotImplementedError()
Q15 In the following code cells, we obtain the HTTP request through the response object, and then print the value of the POST request body. In the markdown cell following, write a couple of sentences explaining exactly why the format is what it is, and how the request headers relate to the body.
request = response.request
print(request.headers)
print()
print(request.body)
YOUR ANSWER HERE
In the result, there is a table per week. Since this practicum and homework are long enough, I will not ask you to extract data from that set of tables. But extracting data from multiple tables should be well within your skills at this point, and I encourage you to try your hand at it here.