Denison CS181/DA210 Homework

Before you turn this problem in, make sure everything runs as expected. To check, restart the kernel and then run all cells (in the menubar, select Kernel$\rightarrow$Restart And Run All).

Make sure you fill in any place that says "YOUR CODE HERE" or "YOUR ANSWER HERE".


SQL Group By Exercises

In [ ]:
import pandas as pd
import os
import os.path
import json
import sys
import importlib

module_dir = "../../modules"
module_path = os.path.abspath(module_dir)
if module_path not in sys.path:
    sys.path.append(module_path)

import dbutil
importlib.reload(dbutil)

%load_ext sql

Instructions

Set User Credentials

Edit creds.json to reflect your MySQL user and password.

This must be done prior to executing the following cell.

In general, you can choose between the remote MySQL database and the SQLite database(s) by setting the dbsource variable to "mysql" or "sqlite" respectively. The function dbutil.db_cstring computes a connection string for the chosen dbsource using the information in the creds.json file. If the last argument to this function is present, the generated connection string uses that database, superseding the name of the database in creds.json.

In [ ]:
dbsource = "sqlite"
db = "book"
cstring = dbutil.db_cstring(dbsource, "creds.json", ".", db)
In [ ]:
print("Connection string:", cstring)
In [ ]:
%sql $cstring

In each of the following cells, your only action is, as usual, to delete the two placeholder lines (# YOUR CODE HERE and raise NotImplementedError()) and to assign a valid SQL statement to the string variable query. When you execute the cell, the query is sent to the database management system, and the result is converted into a pandas data frame, whose first rows are displayed.
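The query-to-data-frame round trip described above can be tried outside the notebook. The sketch below is only an illustration: it swaps the %sql magic for pandas.read_sql_query on an in-memory SQLite database, and the pets table and its rows are made up.

```python
import sqlite3
import pandas as pd

# Toy in-memory database; the table and its rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (name TEXT, species TEXT)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?)",
    [("Rex", "dog"), ("Tom", "cat"), ("Fido", "dog")],
)

# The SQL statement is stored in a string variable, as in the solution cells.
demo_query = """
SELECT name, species
FROM pets
WHERE species = 'dog'
ORDER BY name
"""

# Send the query to the database and collect the result as a data frame.
demo_df = pd.read_sql_query(demo_query, conn)
conn.close()

print(demo_df)
```

In the notebook itself, the %sql magic and resultset.DataFrame() play the role that pandas.read_sql_query plays here.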

Q1 Using the SQL table countries, use a select query to answer the question: how many countries are there in each region? Alias your new column as new.
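As a reminder of the pattern this question relies on, here is a minimal sketch of counting rows per group with an aliased column. It runs on a made-up books table in an in-memory SQLite database, not on the course data, and the names books, genre, and n are assumptions chosen for the illustration.

```python
import sqlite3
import pandas as pd

# Hypothetical toy table: one row per book, with its genre.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, genre TEXT)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?)",
    [
        ("Dune", "scifi"),
        ("Emma", "romance"),
        ("Neuromancer", "scifi"),
        ("Persuasion", "romance"),
        ("Solaris", "scifi"),
    ],
)

# COUNT(*) is computed once per distinct genre; AS names the new column.
count_query = """
SELECT genre, COUNT(*) AS n
FROM books
GROUP BY genre
ORDER BY genre
"""

count_df = pd.read_sql_query(count_query, conn)
conn.close()

print(count_df)
```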

In [ ]:
# Solution cell

query = """
"""
# YOUR CODE HERE
raise NotImplementedError()
resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf
In [ ]:
# Testing cell

assert len(resultdf) == 7
assert 58 in list(resultdf['new'])

Q2 Use the indicators database to find the total world population in each year (as the sum of country populations). Use the alias total_pop for your new column. Sort the result by year in ascending order.

In [ ]:
# Solution cell

query = """
"""
# YOUR CODE HERE
raise NotImplementedError()
resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()
In [ ]:
# Testing cell
assert len(resultdf) == 59
assert resultdf.loc[58,'total_pop'] > 7500
assert resultdf.loc[58,'total_pop'] < 7600

Q3 Treating your query above as a subquery (without the ORDER BY), find the minimum for total_pop over all years. Use the alias m for the new column.
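The idea of aggregating over an already aggregated result can be sketched on toy data. The example below is an illustration under assumptions: the sales table, its columns, and the aliases yearly_total, biggest, and t are all hypothetical. The inner query sums amounts per year, and the outer query takes a single aggregate over those sums.

```python
import sqlite3
import pandas as pd

# Hypothetical toy table: sales amounts per year and store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (year INTEGER, store TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        (2020, "A", 10),
        (2020, "B", 15),
        (2021, "A", 30),
        (2021, "B", 5),
        (2022, "A", 8),
    ],
)

# Inner query: one row per year holding that year's total.
# Outer query: one aggregate computed over those per-year totals.
# The subquery alias (t) is optional in SQLite but required by MySQL.
nested_query = """
SELECT MAX(yearly_total) AS biggest
FROM (
    SELECT year, SUM(amount) AS yearly_total
    FROM sales
    GROUP BY year
) AS t
"""

nested_df = pd.read_sql_query(nested_query, conn)
conn.close()

print(nested_df)
```

The per-year totals here are 25, 35, and 8, so the outer query returns a single row.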

In [ ]:
# Solution cell

query = """
"""
# YOUR CODE HERE
raise NotImplementedError()
resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()
In [ ]:
# Testing cell
assert len(resultdf) == 1
assert resultdf.loc[0,'m'] > 3014
assert resultdf.loc[0,'m'] < 3015

Q4 Not all countries are growing, so the largest population a country ever had might be in a previous year. For each country code in indicators, find the max population that country ever had. Alias your new column as max_pop. You should have one row per country. Don't change the order (that is, your records should still be ordered by code).

In [ ]:
# Solution cell

query = """
"""
# YOUR CODE HERE
raise NotImplementedError()
resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()
In [ ]:
# Testing cell
assert len(resultdf) == 218
assert resultdf.loc[0,'max_pop'] == 0.11
assert resultdf.loc[1,'max_pop'] == 37.17

Q5 With reference to the above, find all records where the max population is less than 1 (remember, this is measured in millions of people). Use a HAVING clause. Keep the original ordering of the data (alphabetically, by code).
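A HAVING clause filters groups after aggregation, whereas WHERE filters individual rows before grouping. The sketch below shows the pattern on a made-up scores table; the table, its columns, and the alias best are assumptions for the illustration only.

```python
import sqlite3
import pandas as pd

# Hypothetical toy table: several scores per player.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (player TEXT, points INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("ann", 40), ("ann", 45), ("bob", 60), ("bob", 20), ("cy", 10)],
)

# GROUP BY builds one group per player; HAVING then keeps only the groups
# whose aggregate satisfies the condition.
having_query = """
SELECT player, MAX(points) AS best
FROM scores
GROUP BY player
HAVING MAX(points) < 50
ORDER BY player
"""

having_df = pd.read_sql_query(having_query, conn)
conn.close()

print(having_df)
```

Here bob is dropped because his best score is 60, even though one of his individual rows is below 50. That is the difference from a WHERE filter on points.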

In [ ]:
# Solution cell

query = """
"""
# YOUR CODE HERE
raise NotImplementedError()
resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()
In [ ]:
# Testing cell
assert len(resultdf) == 58
assert resultdf.loc[0,'max_pop'] == 0.11
assert resultdf.loc[1,'max_pop'] == 0.08

Q6 Use the indicators database to find the total world population in each year (as the sum of country populations), then return the rows where the total population is greater than 6000 (measured in millions of people). Use the alias total_pop for your new column.

In [ ]:
# Solution cell

query = """
"""
# YOUR CODE HERE
raise NotImplementedError()
resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()
In [ ]:
# Testing cell
assert len(resultdf) == 20
assert resultdf.loc[0,'total_pop'] < 6014
assert resultdf.loc[1,'total_pop'] > 6090