Denison CS181/DA210 Homework

Before you turn this problem in, make sure everything runs as expected. This is a combination of restarting the kernel and then running all cells (in the menubar, select Kernel$\rightarrow$Restart And Run All).

Make sure you fill in any place that says YOUR CODE HERE or "YOUR ANSWER HERE".


Web Scraping Practicum and Homework

In [ ]:
import os
import io
import sys
import importlib
import pandas as pd
from lxml import etree
import requests
from IPython.display import Image

htmlparser =  etree.HTMLParser()

if os.path.isdir(os.path.join("../../..", "modules")):
    module_dir = os.path.join("../../..", "modules")
else:
    module_dir = os.path.join("../..", "modules")

module_path = os.path.abspath(module_dir)
if not module_path in sys.path:
    sys.path.append(module_path)

import util
importlib.reload(util)

Topnames Table with GET

Discovery

Consider the web page at http://datasystems.denison.edu/topnames.html, shown in its rendered form below:

In [ ]:
Image("figs/topnames.jpg", width=400)

Student Action Go to the above web page using Chrome, navigate to View->Developer->Inspect Elements.

  • Find where in the HTML that the "topnames" label for the table exists (not the tab, but in the content of the page proper). Traverse "up" to the first div ancestor, and draw the subtree starting at that point, down to the point where you have included the label text, but only covering the first full subtree of the div (containing the label).

  • Find the table node within the overall tree, and then draw the subtee rooted at table. You need only include the first two data-carrying rows of data. You can also use abbreviations of your own choosing to make this less onerous.

Q1 For submission, upload, with your notebook, a picture of your two drawings in answer to the above.

Code Setup

In [ ]:
# Reading from the web into an XML Element, using custom parser
url = util.buildURL("/topnames.html", "datasystems.denison.edu")
response = requests.get(url)
assert response.status_code == 200

tree = etree.parse(io.BytesIO(response.content), htmlparser)
root = tree.getroot()

Q2 Create in xs an XPath expression that (uniquely) finds the div that is the common ancestor of both the topnames label and the table containing the data. Assign the node itself (not the list containing the node) to the variable divroot and use the print_xml with an nlines argument to show the first 20 lines of the tree.

In [ ]:
xs = ""
# YOUR CODE HERE
raise NotImplementedError()

Q3 Either using the root of the html tree, or from the divroot found above, assign to xs an XPath expression to get a list of the tables in the HTML, then execute the XPath and assign to the Python variable table the first such table. Then use print_xml to make sure your understanding of the table tree matches your hand-drawn tree from earlier. If you chose to start from root, would your expression be guaranteed to get the table you are interested in? Why or why not? Comment in the following markdown cell.

In [ ]:
xs = ""
# YOUR CODE HERE
raise NotImplementedError()
util.print_xml(table, nlines=30)

Interpretation Solution Cell

Programmatic Extraction

List of Lists

Q4 Assign to xs an XPath expression that retrieves the names of the columns of the table, assigning to col_names. Write your expression as an absolute one, working from the root.

In [ ]:
xs = ""
# YOUR CODE HERE
raise NotImplementedError()
print(col_names)

Note that if the HTML document had multiple tables, then the XPath expression above would match header cells from all th elements anywhere in the document, including ones in totally separate tables. In such a case, we may have to make further assumptions about the structure of the HTML to uniquely find the desired table.

In the following, we will assume that the variable table refers to the correct table, and focus on extracting the data and transforming it into a data frame.

A method from the textbook solves this problem by the following steps:

  1. Use XPath to retrieve a single list of the text property of all td nodes under the tbody.
  2. Using the knowledge that there are four fields per row, iterate over this single list and, by putting sequential sets of four elements into a row list, create a LoL representation of the data.
  3. Build the dataframe based on the LoL and the column names.

In the following cells, we ask you to reproduce that method. Use your textbook if you need help.

Q5 In the following cell, use XPath on table to obtain a single list of the text of all the td-tagged nodes. Assign the result to tdlist.

In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
print(len(tdlist))
print("td list prefix:", tdlist[0:10])

Q6 Create an empty list, LoL, and a counter variable. Then write a for-loop that iterates over tdlist and, if the counter is zero, create a new empty row list. Add the current td item to the list, and then, conditionally, increments the counter if there are more fields to be accumulated into the row list, and resets the counter to zero if a row is complete. When a row is complete, the row should be appended to the LoL.

Upon completion, LoL should be a list of row lists with string versions of the data for each row.

In [ ]:
LoL = []
# YOUR CODE HERE
raise NotImplementedError()
LoL[:10]

Q7 Finally, use our standard techniques for turning the list of lists into a pandas data frame, setting the index to the combination of year and sex, and assigning the result to variable df.

In [ ]:
# Turning the LoL into a dataframe
# YOUR CODE HERE
raise NotImplementedError()
df.head(10)

Dictionary of Column Lists

A perhaps simpler solution involves using the regularity of the columns in a table (be it in HTML or other regular table form). Within each tr, the position of each of the td elements for the four fields in this table is always the same, regardless of row. So at position 1 within all the rows, we always have the year, at position 2, we always have the sex, and so forth.

Q8 Assign to xs an XPath expression that retrieves, relative to table, the text property for all td elements under a tr where the td is the position 1 child of the tr. Execute the xpath on table and assign the result to variable year_vector.

In [ ]:
xs = ""
# YOUR CODE HERE
raise NotImplementedError()
print(len(year_vector))
print("year vector prefix:", year_vector[:10])

We could do this four times, with different values for the position, creating four vectors and then constructing the dictionary. Instead, we want to use Python string formatting to dynamically create an xpath that retrieves the td at a position given by a variable, and then traverses to the text attribute.

Q9 Assign to xs_template a Python string using {} in place of the position from your solution above, so that the testing code shows its use by obtaining the four data columns from the table.

In [ ]:
xs_template = ""
# YOUR CODE HERE
raise NotImplementedError()
years = table.xpath(xs_template.format(1))
print(years[:8])
sexes = table.xpath(xs_template.format(2))
print(sexes[:8])
names = table.xpath(xs_template.format(3))
print(names[:8])
counts = table.xpath(xs_template.format(4))
print(counts[:8])

Q10 Starting with an empty dictionary, DoL, we want create a for loop that, at each iteration, allows us to define a position and a columnname. The position should be used to format an appropriate xpath to get the set of column values at that position (like we did without a loop in the last question). The xpath should be executed, and the resultant list assigned to a dictionary entry whose key is the column name. Fill in the above-described loop.

In [ ]:
# YOUR CODE HERE
raise NotImplementedError()

POST Example

Discovery

Consider the web page at https://ww2.energy.ca.gov/almanac/transportation_data/gasoline/margins/index_cms.php

We can infer from a PHP page, often with a php extension, that the page is a dynamic one, that, on an HTTP request, will respond to the request by dynamically generating the HTML content. PHP is a scripting language that lets a server execute code instead of serving up static content.

On the given page, navigate toward the bottom of the page where you will find a Select Year drop down and a button labeled Get different year, and pick the year you were born, or 1999 if you were born earlier than 1999, and then click the Get different year button.

The two GUI elements of the drop-down and the submission button consistute what, in HTML, is called a form (albeit, one of the simplest forms one might imagine).

The way that a PHP or other dynamic resource path at a web server obtains the information for processing is via the client making a POST request. The request includes information in the body of the request that passes, from client to server, the information needed to do the dynamic generation. ... otherwise, the page would simply be static.

In this case, by making a year choice and clicking the submission button (Get different year), the web browser makes a POST request and formats the body with the form data. We need our web scraping client applications to be able to perform the same way.

Q11 Go to the above web page using Chrome, follow the instructions, and answer any questions in the Markdown cell that follows.

  • navigate to View->Developer->Developer Tools
  • In the Tools window, select the Network tab
  • In the window showing the web page itself, navigate to near the bottom and repeat your earlier action of selecting a year and then clicking the Select different year botton.
    • This action should result in a set of entries appearing in the Network window. The first of these in the Name subwindow, should be labeled index_cms.php
  • Click on that first entry, index_cms.php
    • The lower window will subdivide, and you shoud see sub-tabs with names Headers, Preview, Response, etc.
  • Click on the Headers sub-tab
  • Make sure the Request Headers and the Form Data elements under Headers are expanded and the others are collapsed.
  • Click the view source next to the Request Headers element:
    • Find the HTTP request line. Do you recognize the syntax?
    • What does the Content-length header line tell you?
    • How about the Content-type header?
    • Are these headers formed by the client or by the server?
  • Now examine the Form Data element:
    • what are the key-value pairs?
    • Now click on the view source
    • Where, in an HTTP request, would this information be placed?
    • Click back on the view parsed to get back to the key-value view
    • Toggle back and form between view URL encoded and view decoded; what is the difference? Which of these two is used in the view source view?

YOUR ANSWER HERE

Q12 In the following cell, after my provision of the necessary url elements as Python variables, create a url and then issue a GET request. Process the result into an XML tree. By the conclusion of your cell, you should assign to root the root of the HTML tree.

In [ ]:
protocol = "https"
location = "ww2.energy.ca.gov"
resourcepath = "/almanac/transportation_data/gasoline/margins/index_cms.php"
# YOUR CODE HERE
raise NotImplementedError()
util.print_xml(root, nlines=10)

If we use Developer Tools (or View Source) we can find the form that contains the year drop-down and the submit button labeled Get different year. We want to examine that form subtree within our HTML.

Q13 Assign to xs an Xpath string that finds the form node where the action attribute is set to the page's PHP of index_cms.php. Execute the xpath and set form to the first element of the result.

In [ ]:
xs = ""
# YOUR CODE HERE
raise NotImplementedError()
util.print_xml(form)

Observations and Conclusions

  1. A GET to this resource path results in an HTML page with multiple (weekly) tables, each of which has data of interest.
  2. The page has a form element, whose method attribute is "post". That means that, when the embedded form is "filled out" and the user submits the form, an HTTP POST is the result:
    • The action attribute of the form determines the resource path, relative to the current location, for the URI/resource path needed in the HTTP POST
    • The "form", in this case, just consists of a dropdown list, whose entries are given by the sequence of option nodes, and whose values are the possible years. The key for this field is called year, as given in the select node. The value will be one of the year values.
    • The input node determines the submission of the form. In this case, when the user clicks the "Get different year", the form will be submitted and, in addition to the key=value items from the form items, the name of the input attribute, newYear will be mapped to the value of "Get different year".

Emulating an Interactive Form-Based POST

We use an HTTP POST to convey information from the client to the server. The information conveyed is in the $\textit{body}$ of the request. So, in contrast to most earlier examples, we need to change two things in using the requests module to make this request:

  1. We must invoke a POST request instead of a GET request.
  2. The request must include a body that consists of key-value pairs.

For (1), the requests module has a post top level function. For (2), we construct a dictionary with the desired mappings. We pass that to the post() using named parameter data. The requests module is very flexible in how it interprets an argument provided through data. If it is a string, it simply puts the encoded bytes of the string in the body. If it is a dictionary, it interprets it and generates a URL-encoded version, as we will see below:

Q14 Using the same URL as the GET above, make a POST request to this site with the body of the post request specified in a dictionary that maps year to the desired year of data, and the newYear key mapped to the string "Get different year". This emulates the request observed from Chrome.

Assign to response the result of the request, and be sure to check for a successful status code.

In [ ]:
# YOUR CODE HERE
raise NotImplementedError()

Q15 In the following code cells, we obtain the HTTP request through the response, and then print the value of the POST request body. In the markdown cell following, write a couple sentences explaining exactly why the format is what it is, and how the request headers relate to the body.

In [ ]:
request = response.request
print(request.headers)
print()
print(request.body)

YOUR ANSWER HERE

Processing the Data in the HTML Tree

In the result, there is a table per week. Since this practicum and homework are long enough, I will not ask you to extract data from that set of tables. But extracting data from multiple tables should be well within your skills at this point, and I encourage you to try your hand at it here.