{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Denison CS181/DA210 Homework\n", "\n", "Before you turn this problem in, make sure everything runs as expected. This is a combination of **restarting the kernel** and then **running all cells** (in the menubar, select Kernel$\\rightarrow$Restart And Run All).\n", "\n", "Make sure you fill in any place that says `YOUR CODE HERE` or \"YOUR ANSWER HERE\"." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Client Data Acquisition\n", "\n", "> Focus on obtaining and then using data requested over the network, and in CSV and XML format. Requisite use of `StringIO` and `BytesIO`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import os.path\n", "import sys\n", "import importlib\n", "import io\n", "import pandas as pd\n", "from lxml import etree\n", "\n", "if os.path.isdir(os.path.join(\"../../..\", \"modules\")):\n", " module_dir = os.path.join(\"../../..\", \"modules\")\n", "else:\n", " module_dir = os.path.join(\"../..\", \"modules\")\n", "\n", "module_path = os.path.abspath(module_dir)\n", "if not module_path in sys.path:\n", " sys.path.append(module_path)\n", "\n", "import util\n", "importlib.reload(util)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q1** The purpose of `io.StringIO()` is to create a file-like object from *any* string in a Python program. 
The object created \"acts\" just like a file object returned by `open()` would.\n", "\n", "Consider the following single Python string, `s`, composed over multiple continued lines:\n", "\n", "    s = \"Twilight and evening bell,\\n\" \\\n", "        \"And after that the dark!\\n\" \\\n", "        \"And may there be no sadness of farewell,\\n\" \\\n", "        \"When I embark;\\n\"\n", "\n", "First, write some code to deal with `s` as a string:\n", "\n", "- determine the length of `s` and assign it to `len_s`\n", "- find the integer start and end indices (inclusive) of the substring `\"dark\"` within `s`, and assign them to `dark_start` and `dark_end`\n", "- create string `s2` by replacing `\"embark\"` with `\"disembark\"`\n", "\n", "Now, create a file-like object from `s` and perform a first `readline()`, assigning the result to the variable `line1`. Then write a `for` loop that uses the file-like object as an iterator to accumulate the remaining lines into a list called `lines`. For each of the strings in `line1` and `lines`, make sure you omit any trailing newline."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "8c5aa5f46db2ef30705092d581601525", "grade": false, "grade_id": "cell-7af55ab11001c5ef", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "s = \"Twilight and evening bell,\\n\" \\\n", " \"And after that the dark!\\n\" \\\n", " \"And may there be no sadness of farewell,\\n\" \\\n", " \"When I embark;\\n\"\n", "\n", "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "print(len_s)\n", "print(\"start:\", dark_start, \"end:\", dark_end, \"substring:\", s[dark_start:dark_end+1])\n", "print(\"length s2:\", len(s2))\n", "print(line1)\n", "print(lines)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "f2fd93d0e89b10bade86cff95b35d739", "grade": true, "grade_id": "cell-125cbee27c447392", "locked": true, "points": 2, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert True" ] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "392f00d9a5031517ba860eafb22b47c8", "grade": false, "grade_id": "cell-c7302157b1568dc5", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "The next set of exercises involve a file at resource path `/data/mystery3.dat` on host `datasystems.denison.edu`. You can assume the file is textual, and is a tab-separated data collection where each line consists of:\n", "\n", " male_name male_count female_name female_count\n", " \n", "for the top 10 name applications of each sex to the US Social Security Administration for the year 2015." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q2** Suppose the encoding of the file is unknown, but is known to be one of the following:\n", "\n", "- 'UTF-8'\n", "- 'UTF-16BE'\n", "- 'UTF-16LE'\n", "- 'cp037'\n", "- 'latin_1'\n", "\n", "Write code to:\n", "\n", "- acquire the file from the web server\n", "- ensure the `status_code` is 200\n", "- assign to `content_type` the *value* of the `Content-Type` header line of the response\n", "- determine the *correct* encoding and assign to `real_encoding`\n", "- set the `.encoding` attribute of the response to `real_encoding`\n", "- assign to `tsv_body` the string text for the body of the response." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "e7d6d070bf8146c863977bccc4e77164", "grade": false, "grade_id": "cell-4be125be19293b31", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "import requests\n", "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "print(tsv_body)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "7756a6774a7f3aa0c995f73492125c80", "grade": true, "grade_id": "cell-c7b0f0b8c66a829d", "locked": true, "points": 2, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert True" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q3** In this question, you will start with a *string* and create a *Dictionary of Lists* representation of the data contained in the string. It is suggested that you use the result of the previous problem, `tsv_body`, as the starting point. But to start independently, you can use the following string literal assignment to get to the same starting point:\n", "\n", "    tsv_body = \"Noah\\t19635\\tEmma\\t20455\\n\" \\\n", "               \"Liam\\t18374\\tOlivia\\t19691\\n\" \\\n", "               \"Mason\\t16627\\tSophia\\t17417\\n\" \\\n", "               \"Jacob\\t15949\\tAva\\t16378\\n\" \\\n", "               \"William\\t15909\\tIsabella\\t15617\\n\" \\\n", "               \"Ethan\\t15077\\tMia\\t14905\\n\" \\\n", "               \"James\\t14824\\tAbigail\\t12401\\n\" \\\n", "               \"Alexander\\t14547\\tEmily\\t11786\\n\" \\\n", "               \"Michael\\t14431\\tCharlotte\\t11398\\n\" \\\n", "               \"Benjamin\\t13700\\tHarper\\t10295\\n\"\n", "\n", "Construct a file-like object from `tsv_body` and then use file object operations to create a dictionary of lists representation of the tab-separated data, assigning the result to `DoL`. Note that there is no header line in the data, so you can name the columns `malename`, `malecount`, `femalename`, and `femalecount`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "66fc42b4d5380f9596a691c954c9ea78", "grade": false, "grade_id": "cell-8e6f74a53dcbf0e9", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "tsv_body = \"Noah\\t19635\\tEmma\\t20455\\n\" \\\n", "           \"Liam\\t18374\\tOlivia\\t19691\\n\" \\\n", "           \"Mason\\t16627\\tSophia\\t17417\\n\" \\\n", "           \"Jacob\\t15949\\tAva\\t16378\\n\" \\\n", "           \"William\\t15909\\tIsabella\\t15617\\n\" \\\n", "           \"Ethan\\t15077\\tMia\\t14905\\n\" \\\n", "           \"James\\t14824\\tAbigail\\t12401\\n\" \\\n", "           \"Alexander\\t14547\\tEmily\\t11786\\n\" \\\n", "           \"Michael\\t14431\\tCharlotte\\t11398\\n\" \\\n", "           \"Benjamin\\t13700\\tHarper\\t10295\\n\"\n", "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "13f8f57060ea122c8cd63e6f7f1c0947", "grade": true, "grade_id": 
"cell-52311628c4f73476", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert len(DoL['malename']) == 10\n", "assert DoL['malename'][0] == 'Noah'\n", "assert DoL['femalename'][9] == 'Harper'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "943f19a53a35c60ed1538e9be35ad463", "grade": true, "grade_id": "cell-6378c904ae819202", "locked": true, "points": 1, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert True" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q4** Use `pandas` to obtain a data frame named `df` by passing a file-like object based on `tsv_body` to `read_csv()`. Make sure you have reasonable column names.\n", "\n", "Be careful to call `read_csv()` so that the separators are tabs, not commas, and remember that the data has no header line." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "574ee630f368e1f38872a03a99add78f", "grade": false, "grade_id": "cell-3c3d6bac2b935136", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "tsv_body = \"Noah\\t19635\\tEmma\\t20455\\n\" \\\n", "           \"Liam\\t18374\\tOlivia\\t19691\\n\" \\\n", "           \"Mason\\t16627\\tSophia\\t17417\\n\" \\\n", "           \"Jacob\\t15949\\tAva\\t16378\\n\" \\\n", "           \"William\\t15909\\tIsabella\\t15617\\n\" \\\n", "           \"Ethan\\t15077\\tMia\\t14905\\n\" \\\n", "           \"James\\t14824\\tAbigail\\t12401\\n\" \\\n", "           \"Alexander\\t14547\\tEmily\\t11786\\n\" \\\n", "           \"Michael\\t14431\\tCharlotte\\t11398\\n\" \\\n", "           \"Benjamin\\t13700\\tHarper\\t10295\\n\"\n", "# YOUR CODE HERE\n", "raise NotImplementedError()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": 
"07e1288cd58965ac7ec68dde34a371cc", "grade": true, "grade_id": "cell-f5595d0d045d8f6e", "locked": true, "points": 2, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert len(df) == 10\n", "assert isinstance(df.columns[0],str)\n", "assert df.iloc[0,0] == 'Noah'\n", "assert df.iloc[0,1] == 19635\n", "assert df.iloc[9,2] == 'Harper'\n", "assert df.iloc[9,3] == 10295" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In many of the following exercises, we will show a `curl` incantation that obtains XML-formatted text data from the Internet. Your task will be to translate the incantation into the equivalent `requests` module programming steps, and to obtain the *parsed* XML-based `ElementTree` structure from the result, assigning to variable `root` the root of the result. You must **always** check the status code returned from the request before further processing. In some cases, we will ask for a specific method from among those demonstrated in the textbook section." 
] }, { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "markdown", "checksum": "4d8118d32da061d0df31343bd082ae22", "grade": false, "grade_id": "cell-e1cb47c5d7b2dfd2", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "source": [ "**Q5** Using any method, get the XML data from `school0.xml`:\n", "\n", "    curl -s -o school0.xml \\\n", "        https://datasystems.denison.edu/data/school0.xml" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "e4a719f7c0ceaf02fc938315df4db08a", "grade": false, "grade_id": "cell-44990515d3a7ff46", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "util.print_xml(root, nlines=20)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "97e0f884b236ae05f326890168a52c03", "grade": true, "grade_id": "cell-b38607585e9356f3", "locked": true, "points": 2, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert True" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q6** Using the bytes data in `.content`, a *file-like object*, and `etree.parse()`, get the XML data from `school0.xml`.\n", "\n", "    curl -s -o school0.xml \\\n", "        https://datasystems.denison.edu/data/school0.xml" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "c734952d46144b0c3f276bb85ec0d790", "grade": false, "grade_id": "cell-2a101731e1d15e91", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "util.print_xml(root, nlines=20)" ] }, { "cell_type": "code", "execution_count": null, 
"metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "aa4d35e43e422346b808e78f463c6035", "grade": true, "grade_id": "cell-7dd6de64c3c283a6", "locked": true, "points": 2, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert True" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q7** Write a function\n", "\n", "    getXMLdata(resource, location, protocol='http')\n", "\n", "that makes a request to `location` for `resource` using the specified protocol, then uses the bytes data in the `.content` of the response, wrapped in a *file-like object*, together with `etree.parse()` to get the XML data. On success, return the root of the tree. If either the request or the parsing of the data fails, return `None`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "22be6d60f0a80fd60e39993760b695f6", "grade": false, "grade_id": "cell-8156b2e8688663b6", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "root = getXMLdata(\"/data/school0.xml\", \"datasystems.denison.edu\", \"https\")\n", "util.print_xml(root, nlines=15)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "5dd91c6278e5fca1a1a9052df28f1f5f", "grade": true, "grade_id": "cell-cb7277b1e202c4e4", "locked": true, "points": 2, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert True" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q8** The `school0_32.xml` resource is encoded with `utf-32be`. Use the method of setting the `.encoding` attribute of the response, accessing the `.text` string body, and parsing it with `fromstring()`. 
Remember that `fromstring()` expects the string to start at the root element, not at the XML header line, so you will need to skip the header line to get the string to pass.\n", "\n", "    curl -s -o school0_32.xml \\\n", "        https://datasystems.denison.edu/data/school0_32.xml" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "nbgrader": { "cell_type": "code", "checksum": "76eab7ff4869bbf98366d008c9c56534", "grade": false, "grade_id": "cell-493b2d0830c02eae", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "# YOUR CODE HERE\n", "raise NotImplementedError()\n", "util.print_xml(root, nlines=20)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": false, "editable": false, "nbgrader": { "cell_type": "code", "checksum": "12225fd6c98ff51f3c5b196f12931668", "grade": true, "grade_id": "cell-2ff200546e2864d1", "locked": true, "points": 2, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "assert True" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }