Before you turn this problem in, make sure everything runs as expected. To check, restart the kernel and then run all cells (in the menubar, select Kernel$\rightarrow$Restart And Run All).
Make sure you fill in any place that says YOUR CODE HERE
or "YOUR ANSWER HERE".
Focus on obtaining and then using data requested over the network, in CSV and XML format. Requires use of StringIO and BytesIO.
import os
import os.path
import sys
import importlib
import io
import pandas as pd
from lxml import etree
if os.path.isdir(os.path.join("../../..", "modules")):
    module_dir = os.path.join("../../..", "modules")
else:
    module_dir = os.path.join("../..", "modules")
module_path = os.path.abspath(module_dir)
if module_path not in sys.path:
    sys.path.append(module_path)
import util
importlib.reload(util)
Q1 The purpose of io.StringIO()
is to create a file-like object from any string in a Python program. The object created "acts" just like a file object obtained from an open()
file would.
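As a minimal sketch of that claim (the string `text` and the variable names are illustrative, not part of the exercise), a `StringIO` object supports the same `readline()`, `read()`, and `seek()` calls as a file opened in text mode:

```python
import io

# Illustrative string; any Python str works.
text = "first line\nsecond line\n"

# Wrap the string in a file-like object.
f = io.StringIO(text)

first = f.readline()   # reads up to and including the first newline
rest = f.read()        # reads everything after the current position
f.seek(0)              # rewind to the start, as with a real file
everything = f.read()

print(repr(first))
print(repr(rest))
print(everything == text)
```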
Consider the following single Python string, s
, composed over multiple continued lines:
s = "Twilight and evening bell,\n" \
"And after that the dark!\n" \
"And may there be no sadness of farewell,\n" \
"When I embark;\n"
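First, write some code to deal with s as a string:
- Find the length of s, and assign to len_s
- Find the start and end indices of "dark" within s, and assign to dark_start/dark_end (the print statement in the code cell uses s[dark_start:dark_end+1], so dark_end is the index of the last character)
- Create s2 by replacing "embark" with "disembark"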
Now, create a file-like object from s
, and perform a first readline()
, assigning to variable line1
and then write a for
loop to use the file-like object as an iterator to accumulate into a list called lines
a list of the remaining lines. For each of the strings in line1
and lines
, make sure you omit any trailing newline.
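The readline-then-iterate pattern can be sketched on a different, made-up string. All names here (`text`, `head`, `tail`, and so on) are hypothetical, and `str.find()` and `str.replace()` are only one possible way to approach the string steps:

```python
import io

text = "alpha beta\ngamma delta\nepsilon\n"

# Plain string operations: length, substring search, replacement.
n_chars = len(text)
start = text.find("gamma")        # index of the first character of the match
end = start + len("gamma") - 1    # index of the last character (inclusive)
text2 = text.replace("delta", "DELTA")

# File-like access: one explicit readline(), then iterate over the rest.
f = io.StringIO(text)
head = f.readline().rstrip("\n")
tail = []
for line in f:
    tail.append(line.rstrip("\n"))

print(n_chars, start, end)
print(head)
print(tail)
```

Because the file-like object keeps track of its position, the `for` loop picks up at the second line rather than starting over.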
s = "Twilight and evening bell,\n" \
"And after that the dark!\n" \
"And may there be no sadness of farewell,\n" \
"When I embark;\n"
# YOUR CODE HERE
raise NotImplementedError()
print(len_s)
print("start:", dark_start, "end:", dark_end, "substring:", s[dark_start:dark_end+1])
print("length s2:", len(s2))
print(line1)
print(lines)
assert True
The next set of exercises involves a file at resource path /data/mystery3.dat on host datasystems.denison.edu. You can assume the file is textual, and is a tab-separated data collection where each line consists of:
male_name <tab> male_count <tab> female_name <tab> female_count
for the top 10 name applications of each sex to the US Social Security Administration for the year 2015.
Q2 Suppose the encoding of the file is unknown, but will be from one of the following:
Write code to:
- assign to content_type the value of the Content-Type header line of the response
- determine the actual encoding and assign it to real_encoding
- set the .encoding attribute of the response to real_encoding
- assign to tsv_body the string text for the body of the response.
import requests
# YOUR CODE HERE
raise NotImplementedError()
print(tsv_body)
assert True
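One possible way to settle on an encoding for Q2, sketched under assumptions: try each candidate in turn and keep the first that decodes the raw bytes without error. The helper name `first_decodable` and the two-item candidate list are hypothetical. With `requests`, the raw bytes would come from the response's `.content` attribute and the header value from `response.headers['Content-Type']`.

```python
def first_decodable(raw, candidates):
    """Return the first encoding in candidates that decodes raw without error, else None."""
    for enc in candidates:
        try:
            raw.decode(enc)
        except UnicodeDecodeError:
            continue
        return enc
    return None

# Stand-in for response.content: one TSV line encoded as UTF-16 (with a byte order mark).
raw = "Noah\t19635\tEmma\t20455\n".encode("utf-16")

# Hypothetical candidate list; order matters, since some codecs accept almost any bytes.
real_encoding = first_decodable(raw, ["utf-8", "utf-16"])
tsv_line = raw.decode(real_encoding)
print(real_encoding)
print(repr(tsv_line))
```

A clean decode does not prove the encoding is right, so printing the decoded text and checking that it looks like the expected names and counts is still worthwhile.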
Q3 In this question, you will start with a string and create a Dictionary of Lists representation of the data contained in the string. It is suggested to use the result of the previous problem, tsv_body, as the starting point. But to start independently, you can use the following string literal assignment to get to the same starting point:
tsv_body = "Noah\t19635\tEmma\t20455\n" \
"Liam\t18374\tOlivia\t19691\n" \
"Mason\t16627\tSophia\t17417\n" \
"Jacob\t15949\tAva\t16378\n" \
"William\t15909\tIsabella\t15617\n" \
"Ethan\t15077\tMia\t14905\n" \
"James\t14824\tAbigail\t12401\n" \
"Alexander\t14547\tEmily\t11786\n" \
"Michael\t14431\tCharlotte\t11398\n" \
"Benjamin\t13700\tHarper\t10295\n"
Construct a file-like object from tsv_body and then use file object operations to create a dictionary of lists representation of the tab-separated data, assigning the result to DoL. Note that there is no header line in the data, so you can name the columns malename, malecount, femalename, femalecount.
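A minimal sketch of the dictionary-of-lists construction, using a two-line made-up sample and hypothetical column names (`name`, `count`) rather than the exercise data:

```python
import io

sample = "Ada\t3\nGrace\t5\n"
columns = ["name", "count"]

# Start with one empty list per column.
dol = {col: [] for col in columns}

# Iterate over the file-like object line by line, splitting on tabs.
for line in io.StringIO(sample):
    fields = line.rstrip("\n").split("\t")
    for col, value in zip(columns, fields):
        dol[col].append(value)

# Counts arrive as strings; convert them if numeric values are needed.
dol["count"] = [int(v) for v in dol["count"]]
print(dol)
```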
tsv_body = "Noah\t19635\tEmma\t20455\n" \
"Liam\t18374\tOlivia\t19691\n" \
"Mason\t16627\tSophia\t17417\n" \
"Jacob\t15949\tAva\t16378\n" \
"William\t15909\tIsabella\t15617\n" \
"Ethan\t15077\tMia\t14905\n" \
"James\t14824\tAbigail\t12401\n" \
"Alexander\t14547\tEmily\t11786\n" \
"Michael\t14431\tCharlotte\t11398\n" \
"Benjamin\t13700\tHarper\t10295\n"
# YOUR CODE HERE
raise NotImplementedError()
assert len(DoL['malename']) == 10
assert DoL['malename'][0] == 'Noah'
assert DoL['femalename'][9] == 'Harper'
assert True
Q4 Use pandas to obtain a data frame named df by passing a file-like object based on tsv_body to read_csv(). Make sure you have reasonable column names; since the data has no header line, the first line must be read as data, not as column names.
Be careful to call read_csv() so that the separators are tabs, not commas.
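A sketch of the same idea with pandas, again on made-up sample data. `sep`, `header`, and `names` are standard `read_csv()` parameters; the column names and the variable `frame` are illustrative:

```python
import io
import pandas as pd

sample = "Ada\t3\nGrace\t5\n"

# header=None tells pandas the first line is data, not column names;
# names= supplies the column labels instead.
frame = pd.read_csv(io.StringIO(sample), sep="\t", header=None,
                    names=["name", "count"])
print(frame)
```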
tsv_body = "Noah\t19635\tEmma\t20455\n" \
"Liam\t18374\tOlivia\t19691\n" \
"Mason\t16627\tSophia\t17417\n" \
"Jacob\t15949\tAva\t16378\n" \
"William\t15909\tIsabella\t15617\n" \
"Ethan\t15077\tMia\t14905\n" \
"James\t14824\tAbigail\t12401\n" \
"Alexander\t14547\tEmily\t11786\n" \
"Michael\t14431\tCharlotte\t11398\n" \
"Benjamin\t13700\tHarper\t10295\n"
# YOUR CODE HERE
raise NotImplementedError()
assert len(df) == 10
assert isinstance(df.columns[0],str)
assert df.iloc[0,0] == 'Noah'
assert df.iloc[0,1] == 19635
assert df.iloc[9,2] == 'Harper'
assert df.iloc[9,3] == 10295
In many of the following exercises, we will show a curl
incantation that obtains XML-formatted text data from the Internet. Your task will be to translate the incantation into the equivalent requests
module programming steps, and to obtain the parsed XML-based ElementTree
structure from the result, assigning to variable root
the root of the result. You must always check the status code returned from the request before further processing. In some cases, we will ask for a specific method from among those demonstrated in the textbook section.
Q5 Using any method, get the XML data from school0.xml
:
curl -s -o school0.xml \
https://datasystems.denison.edu/data/school0.xml
# YOUR CODE HERE
raise NotImplementedError()
util.print_xml(root, nlines=20)
assert True
Q6 Using the bytes data in .content
, a file-like-object, and etree.parse()
, get the XML data from school0.xml
.
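The bytes-based path can be sketched without a network request by standing in a bytes literal for `response.content`. The XML below is invented for illustration and is not the contents of `school0.xml`; `sample_root` is a hypothetical name:

```python
import io
from lxml import etree

# Stand-in for response.content: XML as bytes, including its declaration.
xml_bytes = (b'<?xml version="1.0" encoding="UTF-8"?>\n'
             b'<school><course id="1">Data Systems</course></school>')

# etree.parse() accepts a file-like object and returns an ElementTree;
# getroot() yields the root Element.
tree = etree.parse(io.BytesIO(xml_bytes))
sample_root = tree.getroot()
print(sample_root.tag, len(sample_root))
```

Parsing from bytes lets the parser read the encoding from the XML declaration itself, so no manual decoding step is needed.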
curl -s -o school0.xml \
https://datasystems.denison.edu/data/school0.xml
# YOUR CODE HERE
raise NotImplementedError()
util.print_xml(root, nlines=20)
assert True
Q7 Write a function
getXMLdata(resource, location, protocol='http')
that makes a request to location
for resource
with the specified protocol, then uses the bytes data in the .content
of the response, with a file-like-object, and etree.parse()
, to get the XML data. On success, return the root of the tree. On failure of either the request or the parse of the data, return None
.
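The "return None on failure" requirement has two halves, the request and the parse. As a sketch of the parse half only (the helper name `parse_or_none` is hypothetical), `etree.XMLSyntaxError` is the exception lxml raises for malformed input:

```python
import io
from lxml import etree

def parse_or_none(content):
    """Parse XML bytes; return the root Element, or None if parsing fails."""
    try:
        return etree.parse(io.BytesIO(content)).getroot()
    except etree.XMLSyntaxError:
        return None

print(parse_or_none(b"<a><b/></a>").tag)   # well-formed input
print(parse_or_none(b"<a><b></a>"))        # mismatched tags
```

For the request half, connection problems in `requests` surface as subclasses of `requests.exceptions.RequestException`, and a `status_code` other than 200 signals an unsuccessful response.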
# YOUR CODE HERE
raise NotImplementedError()
root = getXMLdata("/data/school0.xml", "datasystems.denison.edu", "https")
util.print_xml(root, nlines=15)
assert True
Q8 The school0_32.xml resource is encoded with utf-32be. Set the .encoding attribute of the response, access the .text string body, and parse it with fromstring(). Remember that fromstring() expects the string to start at the root Element, not at the XML declaration (header) line, so you will need to skip the header to get the string to pass.
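To illustrate why the header has to be skipped: lxml's `fromstring()` typically rejects a `str` that begins with an XML declaration carrying an `encoding` attribute. The sketch below uses an invented XML string in place of the response's `.text`, and slicing just past the end of the declaration is only one possible way to skip it:

```python
from lxml import etree

# Stand-in for response.text after .encoding has been set correctly.
xml_text = ('<?xml version="1.0" encoding="UTF-32BE"?>\n'
            '<school><course>Data Systems</course></school>')

try:
    etree.fromstring(xml_text)
    rejected = False
except ValueError:
    # str input with an encoding declaration was refused.
    rejected = True

# Skip past the end of the declaration ("?>") and parse the remainder.
body = xml_text[xml_text.index("?>") + 2:]
sample_root = etree.fromstring(body)
print(rejected, sample_root.tag)
```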
curl -s -o school0_32.xml \
https://datasystems.denison.edu/data/school0_32.xml
# YOUR CODE HERE
raise NotImplementedError()
util.print_xml(root, nlines=20)
assert True