Denison CS181/DA210 Homework

Before you turn this problem in, make sure everything runs as expected. This is a combination of restarting the kernel and then running all cells (in the menubar, select Kernel$\rightarrow$Restart And Run All).

Make sure you fill in any place that says YOUR CODE HERE or "YOUR ANSWER HERE".


SQL Single Table Exercises

In [ ]:
import pandas as pd
import os
import os.path
import json
import sys
import importlib

module_dir = "../../modules"
module_path = os.path.abspath(module_dir)
if not module_path in sys.path:
    sys.path.append(module_path)

import dbutil
importlib.reload(dbutil)

%load_ext sql

Instructions

Set User Credentials

Edit creds.json to reflect your mysql user and password

This must be done prior to executing the following cell

In general, you will be able to choose whether you are using the remote MySQL database or the SQLite database(s) by setting the dbsource variable to "mysql" or "sqlite" respectively. The function dbutil.db_cstring function computes a connection string for the chosen dbsource using the information in the creds.json file. If the last argument to this function is present, the generated connection string uses that datbase as superceding the name of the database in creds.json.

In [ ]:
dbsource = "mysql"
db = "book"
cstring = dbutil.db_cstring(dbsource, "creds.json", ".", db)
In [ ]:
print("Connection string:", cstring)

Establish Connection from Client to Server

The cell below uses the Python variable constructed earlier to actually establish a connection with the server.

If you cannot get a connection established, switch over to MySQL Workbench and try from that different client.

In [ ]:
%sql $cstring

In the following cells, your only action is to, as usual, cut out the two lines, and to put a valid SQL statement as the value of string variable query. In each case, when you execute the cell, the query will be sent to the database management system, a result obtained, and the result converted into a pandas data frame, whose prefix is shown.

Q1 Using the table countries, give the SQL query you would use to obtain a table all country names. There should be no onter columns projected.

In [ ]:
# Solution cell

query = """
"""
# YOUR CODE HERE
raise NotImplementedError()
resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()
In [ ]:
# Testing cell
assert len(resultdf) == 217
assert 'Aruba' in list(resultdf['country'])

Q2 Project the year, code, population, and number of cell-phones from indicators. In a comment in the solution cell, describe a derived column and additional processing that could be applied to obtain useful result.

In [ ]:
# Solution cell

query = """
"""
# YOUR CODE HERE
raise NotImplementedError()
resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()
In [ ]:
# Testing cell
assert len(resultdf) == 12862
assert resultdf.shape == (12862,4)
assert 'pop' in resultdf.columns
assert 'cell' in resultdf.columns

Q3 Using the SQL table countries, project all columns and produce a table of rows ordered by landmass, from smallest to largest. What kind of aggregation might be applied to this result? Answer this question as a comment within your solution cell.

In [ ]:
# Solution cell

query = """
"""
# YOUR CODE HERE
raise NotImplementedError()
resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()
In [ ]:
# Testing cell
assert len(resultdf) == 217
assert resultdf.iloc[0,0] == 'CUW'
assert resultdf.iloc[-1,0] == 'RUS'

Q4 Use an SQL query to answer: what country and in what year was the greatest number of cell phones? Project year, code, and cell. An answer that yields multiple rows, but allowing easy answer to the question will get most of the points, but the ideal answer will only have a single row.

In [ ]:
# Solution cell

query = """
"""
# YOUR CODE HERE
raise NotImplementedError()
resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()
In [ ]:
# Testing cell
assert resultdf.loc[0,'cell'] == 1469.88
assert resultdf.loc[0,'year'] == 2017
assert resultdf.loc[0,'code'] == 'CHN'
In [ ]:
# Secondary testing cell
assert True

Q5 Use SQL to find the rows with the top 20 GDP values in indicators. You may project all the columns.

In [ ]:
# Solution cell

query = """
"""
# YOUR CODE HERE
raise NotImplementedError()
resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()
In [ ]:
assert len(resultdf) == 20
assert resultdf.loc[0, 'year'] == 2018
assert resultdf.loc[0, 'code'] == 'USA'
assert resultdf.loc[0, 'gdp'] == 20494.1

Q6 In reference to the table indicators, write a query to find all unique years that appear in the table.

In [ ]:
# Solution cell

query = """
"""
# YOUR CODE HERE
raise NotImplementedError()
resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()
In [ ]:
# Testing cell
assert len(resultdf) == 59

Q7 In reference to the table indicators, write a query to find all rows with no missing data for gdp.

In [ ]:
# Solution cell

query = """
"""
# YOUR CODE HERE
raise NotImplementedError()
resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()
In [ ]:
# Testing cell
assert len(resultdf) == 9660

Q8 In reference to the table indicators, write a query to find all rows with no missing data for any of the numeric fields.

In [ ]:
# Solution cell

query = """
"""
# YOUR CODE HERE
raise NotImplementedError()
resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()
In [ ]:
# Testing cell
assert len(resultdf) == 7117

Q9 In reference to the table indicators, write a query to find all rows where exports is higher than imports. Select all fields for such rows.

In [ ]:
# Solution cell

query = """
"""
# YOUR CODE HERE
raise NotImplementedError()
resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()
In [ ]:
# Testing cell
assert len(resultdf) == 2492
assert resultdf.shape == (2492,8)

Q10 In reference to the table indicators, write a query to find the minimum non-zero number for cell that appears. Since we haven't learned about MIN you'll have to use ORDER BY to arrange it so that the first row in your result has that minimum value.

In [ ]:
# Solution cell

query = """
"""
# YOUR CODE HERE
raise NotImplementedError()
resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()
In [ ]:
# Testing cell
assert resultdf.loc[0,'cell'] == 0.01

Q11 Use a subquery to select the top ten entries in indicators with the highest population, then select the three from that group of 10 that have the lowest GDP. If you need to, review section 11.2.3 on subqueries.

In [ ]:
# Solution cell

query = """
"""
# YOUR CODE HERE
raise NotImplementedError()
resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()
In [ ]:
# Testing cell
assert len(resultdf) == 3
assert resultdf.loc[0,'gdp'] == 2652.55
assert resultdf.loc[1,'pop'] == 1352.62