Denison CS181/DA210 Homework

Before you turn this problem in, make sure everything runs as expected: restart the kernel and then run all cells (in the menu bar, select Kernel → Restart And Run All).

Make sure you fill in any place that says "YOUR CODE HERE" or "YOUR ANSWER HERE".


Client Data Acquisition

This assignment focuses on obtaining and then using data requested over the network, in CSV and XML formats. It requires the use of StringIO and BytesIO.

In [ ]:
import os
import os.path
import sys
import importlib
import io
import pandas as pd
from lxml import etree

if os.path.isdir(os.path.join("../../..", "modules")):
    module_dir = os.path.join("../../..", "modules")
else:
    module_dir = os.path.join("../..", "modules")

module_path = os.path.abspath(module_dir)
if module_path not in sys.path:
    sys.path.append(module_path)

import util
importlib.reload(util)

Q1 The purpose of io.StringIO() is to create a file-like object from any string in a Python program. The resulting object acts just like a file object returned by open().
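
As a rough illustration of the file-like behavior described above, the sketch below wraps a short string in io.StringIO and reads from it the way one would read from an opened file. The sample text and the variable names (text, buffer, first, rest) are made up for this example and are not part of the exercise.

```python
import io

# Hypothetical sample text, not the string used in the exercise.
text = "alpha\nbeta\ngamma\n"

buffer = io.StringIO(text)          # file-like object backed by the string

first = buffer.readline()           # reads up to and including the first "\n"
rest = [line for line in buffer]    # iteration resumes where readline() stopped

print(repr(first))
print(rest)

buffer.seek(0)                      # rewind, exactly as with a real file
whole = buffer.read()               # read everything from the start
```

Note that readline() and iteration both keep the trailing newline on each line, just as they do for a file opened with open().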

Consider the following single Python string, s, composed over multiple continued lines:

s = "Twilight and evening bell,\n" \
    "And after that the dark!\n" \
    "And may there be no sadness of farewell,\n" \
    "When I embark;\n"

First, write some code to deal with s as a string:

  • determine the length of s, assign to len_s
  • find the integer start and end indices (inclusive) of the substring "dark" within s, and assign to dark_start/dark_end
  • create string s2 by replacing "embark" with "disembark"

Now create a file-like object from s and perform a first readline(), assigning the result to the variable line1. Then write a for loop that uses the file-like object as an iterator to accumulate the remaining lines into a list called lines. Omit any trailing newline from line1 and from each string in lines.

In [ ]:
s = "Twilight and evening bell,\n" \
        "And after that the dark!\n" \
        "And may there be no sadness of farewell,\n" \
        "When I embark;\n"

# YOUR CODE HERE
raise NotImplementedError()
print(len_s)
print("start:", dark_start, "end:", dark_end, "substring:", s[dark_start:dark_end+1])
print("length s2:", len(s2))
print(line1)
print(lines)
In [ ]:
assert True

The next set of exercises involves a file at resource path /data/mystery3.dat on host datasystems.denison.edu. You can assume the file is textual, and is a tab-separated data collection where each line consists of:

male_name <tab> male_count <tab> female_name <tab> female_count

for the top 10 name applications of each sex to the US Social Security Administration for the year 2015.

Q2 Suppose the encoding of the file is unknown, but will be from one of the following:

  • 'UTF-8'
  • 'UTF-16BE'
  • 'UTF-16LE'
  • 'cp037'
  • 'latin_1'

Write code to:

  • acquire the file from the web server
  • ensure the status_code is 200
  • assign to content_type the value of the Content-Type header line of the response
  • determine the correct encoding and assign to real_encoding
  • set the .encoding attribute of the response to real_encoding
  • assign to tsv_body the string text for the body of the response.
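
One possible way to narrow down an unknown encoding is to decode the raw bytes under each candidate and keep the one whose result contains text you expect to see. The sketch below works on a locally built byte string rather than a real response (with requests, the raw bytes would come from the .content attribute). The helper guess_encoding, the sample payload, and the marker string are all assumptions made for illustration.

```python
# Hypothetical payload: bytes a server might send for a tiny TSV file.
# Encoding it with cp037 is an assumption made only for this demonstration.
payload = "Ada\t42\n".encode("cp037")

candidates = ["UTF-8", "UTF-16BE", "UTF-16LE", "cp037", "latin_1"]

def guess_encoding(raw, encodings, marker):
    """Return the first encoding under which raw decodes and contains marker."""
    for enc in encodings:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue            # these bytes are not valid in this encoding
        if marker in text:      # decoding alone is not enough; check the content
            return enc
    return None                 # no candidate produced the expected text

found = guess_encoding(payload, candidates, "Ada\t")
print(found)
```

The content check matters because some encodings, latin_1 among them, decode any byte sequence without raising an error, so a successful decode does not by itself show that the encoding is right.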
In [ ]:
import requests
# YOUR CODE HERE
raise NotImplementedError()
print(tsv_body)
In [ ]:
assert True

Q3 In this question, you will start with a string and create a Dictionary of Lists representation of the data contained in the string. We suggest using the result of the previous problem, tsv_body, as the starting point. To start independently, you can use the following string literal assignment to get to the same starting point:

tsv_body = "Noah\t19635\tEmma\t20455\n" \
           "Liam\t18374\tOlivia\t19691\n" \
           "Mason\t16627\tSophia\t17417\n" \
           "Jacob\t15949\tAva\t16378\n" \
           "William\t15909\tIsabella\t15617\n" \
           "Ethan\t15077\tMia\t14905\n" \
           "James\t14824\tAbigail\t12401\n" \
           "Alexander\t14547\tEmily\t11786\n" \
           "Michael\t14431\tCharlotte\t11398\n" \
           "Benjamin\t13700\tHarper\t10295\n"

Construct a file-like object from tsv_body and then use file object operations to create a dictionary of lists representation of the tab-separated data, assigning it to DoL. Note that there is no header line in the data, so use the column names malename, malecount, femalename, and femalecount.
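
The general pattern of turning headerless tab-separated text into a dictionary of lists might look like the sketch below. It uses a small made-up data set; the names body, columns, and table, and the choice to convert the second column to integers, are assumptions for this example only.

```python
import io

# Hypothetical two-column data with no header line.
body = "red\t3\ngreen\t5\nblue\t8\n"
columns = ["color", "count"]            # column names chosen for illustration

table = {name: [] for name in columns}  # one empty list per column

for line in io.StringIO(body):          # iterate line by line, as with a file
    fields = line.rstrip("\n").split("\t")
    for name, value in zip(columns, fields):
        table[name].append(value)

# Every field arrives as a string; convert the numeric column explicitly.
table["count"] = [int(v) for v in table["count"]]
print(table)
```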

In [ ]:
tsv_body = "Noah\t19635\tEmma\t20455\n" \
               "Liam\t18374\tOlivia\t19691\n" \
               "Mason\t16627\tSophia\t17417\n" \
               "Jacob\t15949\tAva\t16378\n" \
               "William\t15909\tIsabella\t15617\n" \
               "Ethan\t15077\tMia\t14905\n" \
               "James\t14824\tAbigail\t12401\n" \
               "Alexander\t14547\tEmily\t11786\n" \
               "Michael\t14431\tCharlotte\t11398\n" \
               "Benjamin\t13700\tHarper\t10295\n"
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
assert len(DoL['malename']) == 10
assert DoL['malename'][0] == 'Noah'
assert DoL['femalename'][9] == 'Harper'
In [ ]:
assert True

Q4 Use pandas to obtain a data frame named df by passing a file-like object based on tsv_body to read_csv(). Make sure you have reasonable column names.

Be careful to call read_csv so that the separators are tabs, not commas.
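
For reference, a call to read_csv() on in-memory, headerless, tab-separated text could be shaped like the sketch below. The sample data and the names body and frame are hypothetical. The parameters sep, header, and names are standard read_csv() arguments.

```python
import io
import pandas as pd

# Hypothetical headerless, tab-separated text.
body = "red\t3\ngreen\t5\nblue\t8\n"

frame = pd.read_csv(
    io.StringIO(body),          # any file-like object works in place of a path
    sep="\t",                   # fields are separated by tabs, not commas
    header=None,                # the first line is data, not column names
    names=["color", "count"],   # supply the column names explicitly
)
print(frame)
```

Without header=None, pandas would treat the first data line as the column names and the frame would come out one row short.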

In [ ]:
tsv_body = "Noah\t19635\tEmma\t20455\n" \
               "Liam\t18374\tOlivia\t19691\n" \
               "Mason\t16627\tSophia\t17417\n" \
               "Jacob\t15949\tAva\t16378\n" \
               "William\t15909\tIsabella\t15617\n" \
               "Ethan\t15077\tMia\t14905\n" \
               "James\t14824\tAbigail\t12401\n" \
               "Alexander\t14547\tEmily\t11786\n" \
               "Michael\t14431\tCharlotte\t11398\n" \
               "Benjamin\t13700\tHarper\t10295\n"
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
assert len(df) == 10
assert isinstance(df.columns[0],str)
assert df.iloc[0,0] == 'Noah'
assert df.iloc[0,1] == 19635
assert df.iloc[9,2] == 'Harper'
assert df.iloc[9,3] == 10295

In many of the following exercises, we will show a curl incantation that obtains XML-formatted text data from the Internet. Your task will be to translate the incantation into the equivalent requests module programming steps, and to obtain the parsed XML-based ElementTree structure from the result, assigning to variable root the root of the result. You must always check the status code returned from the request before further processing. In some cases, we will ask for a specific method from among those demonstrated in the textbook section.

Q5 Using any method, get the XML data from school0.xml:

curl -s -o school0.xml \
     https://datasystems.denison.edu/data/school0.xml
In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
util.print_xml(root, nlines=20)
In [ ]:
assert True

Q6 Using the bytes data in .content, a file-like-object, and etree.parse(), get the XML data from school0.xml.

curl -s -o school0.xml \
     https://datasystems.denison.edu/data/school0.xml
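
The parsing half of this approach can be tried without a network connection. The sketch below uses a hypothetical bytes literal in place of the .content of a response. As a convenience for this example only, it falls back to the standard library's xml.etree.ElementTree, which offers parse() and getroot() under the same names, when lxml is not installed.

```python
import io

try:
    from lxml import etree
except ImportError:                       # fallback so the sketch still runs
    import xml.etree.ElementTree as etree

# Hypothetical XML bytes standing in for the .content of a response.
data = (b'<?xml version="1.0" encoding="UTF-8"?>\n'
        b'<school><course num="101"/><course num="102"/></school>')

tree = etree.parse(io.BytesIO(data))      # parse() accepts a binary file-like object
root = tree.getroot()                     # the top-level Element of the tree

print(root.tag, len(root))
```

Because the parser receives bytes, it can read the encoding from the XML declaration itself, so no manual decoding step is needed here.
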
In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
util.print_xml(root, nlines=20)
In [ ]:
assert True

Q7 Write a function

getXMLdata(resource, location, protocol='http')

that makes a request to location for resource with the specified protocol, then uses the bytes data in the .content of the response, with a file-like-object, and etree.parse(), to get the XML data. On success, return the root of the tree. On failure of either the request or the parse of the data, return None.
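
The "return None on a failed parse" part of this specification can be illustrated in isolation. The helper parse_or_none below is hypothetical, covers only the parsing step (not the request or its status check), and again falls back to xml.etree.ElementTree if lxml is unavailable. Both libraries expose a ParseError exception under the etree name.

```python
import io

try:
    from lxml import etree
except ImportError:                       # fallback so the sketch still runs
    import xml.etree.ElementTree as etree

def parse_or_none(data):
    """Parse XML bytes and return the root Element, or None if malformed."""
    try:
        return etree.parse(io.BytesIO(data)).getroot()
    except etree.ParseError:              # raised for XML that is not well formed
        return None

good = parse_or_none(b"<a><b/></a>")
bad = parse_or_none(b"<a><b></a>")        # mismatched tags, so parsing fails
print(good.tag, bad)
```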

In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
root = getXMLdata("/data/school0.xml", "datasystems.denison.edu", "https")
util.print_xml(root, nlines=15)
In [ ]:
assert True

Q8 The school0_32.xml resource is encoded with utf-32be. Set the .encoding attribute of the response, access the .text string body, and parse it with fromstring(). Remember that fromstring() expects the string to start from an Element, not from the header line, so you will need to skip the header to get the string to pass.

curl -s -o school0_32.xml \
     https://datasystems.denison.edu/data/school0_32.xml
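
The header-skipping step can be sketched offline as follows. The XML text, the variable names, and the way the declaration is located (searching for the closing "?>") are assumptions for this example. The raw bytes stand in for response.content, and the decode() call mirrors what .text provides once .encoding has been set. The import falls back to xml.etree.ElementTree if lxml is missing.

```python
try:
    from lxml import etree
except ImportError:                       # fallback so the sketch still runs
    import xml.etree.ElementTree as etree

# Hypothetical document; raw stands in for the bytes of a response body.
declared = ('<?xml version="1.0" encoding="utf-32be"?>\n'
            '<school><name>Example</name></school>')
raw = declared.encode("utf-32be")

text = raw.decode("utf-32be")             # same result as .text with .encoding set

body = text
if body.lstrip().startswith("<?xml"):     # drop the declaration if one is present
    body = body[body.index("?>") + 2:]

root = etree.fromstring(body.strip())     # string now begins at the root element
print(root.tag, root[0].text)
```
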
In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
util.print_xml(root, nlines=20)
In [ ]:
assert True