Before you turn this problem in, make sure everything runs as expected. To do so, restart the kernel and then run all cells (in the menubar, select Kernel$\rightarrow$Restart And Run All).
Make sure you fill in any place that says "YOUR CODE HERE" or "YOUR ANSWER HERE".
import pandas as pd
import os
import os.path
import json
import sys
import importlib
module_dir = "../../modules"
module_path = os.path.abspath(module_dir)
# Make the shared course modules importable
if module_path not in sys.path:
    sys.path.append(module_path)
import dbutil
importlib.reload(dbutil)
%load_ext sql
Edit creds.json to reflect your MySQL user and password. This must be done prior to executing the following cell.
In general, you can choose between the remote MySQL database and the SQLite database(s) by setting the dbsource variable to "mysql" or "sqlite", respectively. The function dbutil.db_cstring computes a connection string for the chosen dbsource using the information in the creds.json file. If the last argument to this function is present, the generated connection string uses that database, superseding the name of the database in creds.json.
dbsource = "sqlite"
db = "book"
cstring = dbutil.db_cstring(dbsource, "creds.json", ".", db)
print("Connection string:", cstring)
%sql $cstring
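To make the idea of a connection string more concrete, here is a minimal sketch of what such a helper might do. It is not the implementation of dbutil.db_cstring: the function name sketch_cstring, the credential keys (user, password, host), the ".db" file suffix, and the pymysql driver are all assumptions chosen for illustration. Only the general URL shapes (sqlite:///path and mysql+driver://user:password@host/database) follow the SQLAlchemy convention.

```python
from urllib.parse import quote_plus


def sketch_cstring(dbsource, creds, dbdir, db):
    """Hypothetical stand-in for a connection-string helper.

    creds is assumed to be a dict with 'user', 'password', and 'host' keys;
    the real creds.json layout may differ.
    """
    if dbsource == "sqlite":
        # Assumption: the SQLite file is named <db>.db inside dbdir
        return f"sqlite:///{dbdir}/{db}.db"
    if dbsource == "mysql":
        # Escape the password so characters such as '@' do not break the URL
        password = quote_plus(creds["password"])
        return f"mysql+pymysql://{creds['user']}:{password}@{creds['host']}/{db}"
    raise ValueError(f"unknown dbsource: {dbsource!r}")


sqlite_url = sketch_cstring("sqlite", {}, ".", "book")
mysql_url = sketch_cstring(
    "mysql", {"user": "u", "password": "p@ss", "host": "h"}, ".", "book"
)
print(sqlite_url)
print(mysql_url)
```

In the sketch, the last argument always names the database, which mirrors the overriding behavior described above.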
In the following cells, your only action is, as usual, to remove the two placeholder lines (# YOUR CODE HERE and raise NotImplementedError()) and to put a valid SQL statement as the value of the string variable query. In each case, when you execute the cell, the query is sent to the database management system, a result is obtained, and the result is converted into a pandas data frame, whose first rows are displayed.
Q1 Using the SQL table countries, use a select query to answer the question: how many countries are there in each region? Alias your new column as new.
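As a reminder of how grouping and counting fit together, here is a minimal sketch on a made-up table. The pets table, its columns, and the alias n are hypothetical and unrelated to the book database. The sketch uses Python's built-in sqlite3 module with an in-memory database instead of the %sql magic, so it runs on its own.

```python
import sqlite3

# Hypothetical toy data: one row per pet
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (name TEXT, species TEXT)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?)",
    [("Rex", "dog"), ("Tom", "cat"), ("Fido", "dog"), ("Kit", "cat"), ("Bo", "dog")],
)

# GROUP BY forms one group per species; COUNT(*) counts the rows in each group
query = """
SELECT species, COUNT(*) AS n
FROM pets
GROUP BY species
ORDER BY species
"""
counts = conn.execute(query).fetchall()
print(counts)  # [('cat', 2), ('dog', 3)]
conn.close()
```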
# Solution cell
query = """
"""
# YOUR CODE HERE
raise NotImplementedError()
resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf
# Testing cell
assert len(resultdf) == 7
assert 58 in list(resultdf['new'])
Q2 Use the indicators table to find the total world population in each year (as the sum of country populations). Use the alias total_pop for your new column. Sort the result in ascending order of year.
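The pattern of summing within groups and then sorting can be sketched on a made-up table. The sales table, its columns, and the alias total are hypothetical, and the example uses the standard-library sqlite3 module so that it is self-contained.

```python
import sqlite3

# Hypothetical toy data: one row per (shop, year)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (shop TEXT, year INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("a", 2021, 10), ("b", 2021, 5), ("a", 2020, 7), ("b", 2020, 3), ("a", 2022, 2)],
)

# SUM adds up amount within each year; ORDER BY sorts the grouped rows
query = """
SELECT year, SUM(amount) AS total
FROM sales
GROUP BY year
ORDER BY year ASC
"""
totals = conn.execute(query).fetchall()
print(totals)  # [(2020, 10), (2021, 15), (2022, 2)]
conn.close()
```

Without the ORDER BY clause, SQL makes no promise about the order in which the grouped rows come back.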
# Solution cell
query = """
"""
# YOUR CODE HERE
raise NotImplementedError()
resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()
# Testing cell
assert len(resultdf) == 59
assert resultdf.loc[58,'total_pop'] > 7500
assert resultdf.loc[58,'total_pop'] < 7600
Q3 Treating your query above as a subquery (without the ORDER BY), find the minimum for total_pop over all years. Use the alias m for the new column.
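A subquery in the FROM clause behaves like a temporary table that the outer query can aggregate again. The sketch below shows the shape of such a query on a made-up sales table; the table, its columns, and the names yearly, total, and smallest are hypothetical. The subquery is given an alias (AS yearly) because MySQL requires one for every derived table, while SQLite does not.

```python
import sqlite3

# Hypothetical toy data: one row per (shop, year)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (shop TEXT, year INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("a", 2021, 10), ("b", 2021, 5), ("a", 2020, 7), ("b", 2020, 3), ("a", 2022, 2)],
)

# Inner query: one total per year. Outer query: minimum over those totals.
query = """
SELECT MIN(total) AS smallest
FROM (
    SELECT year, SUM(amount) AS total
    FROM sales
    GROUP BY year
) AS yearly
"""
smallest = conn.execute(query).fetchall()
print(smallest)  # [(2,)]
conn.close()
```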
# Solution cell
query = """
"""
# YOUR CODE HERE
raise NotImplementedError()
resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()
# Testing cell
assert len(resultdf) == 1
assert resultdf.loc[0,'m'] > 3014
assert resultdf.loc[0,'m'] < 3015
Q4 Not all countries are growing, so the largest population a country ever had might be in a previous year. For each country code in indicators, find the maximum population that country ever had. Alias your new column as max_pop. You should have one row per country. Don't change the order (that is, your records should still be ordered by code).
# Solution cell
query = """
"""
# YOUR CODE HERE
raise NotImplementedError()
resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()
# Testing cell
assert len(resultdf) == 218
assert resultdf.loc[0,'max_pop'] == 0.11
assert resultdf.loc[1,'max_pop'] == 37.17
Q5 With reference to the above, find all records where the maximum population is less than 1 (remember, this is measured in millions of people). Use a HAVING clause. Keep the original ordering of the data (alphabetically, by code).
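The difference between WHERE and HAVING is that WHERE filters individual rows before grouping, whereas HAVING filters whole groups after the aggregates have been computed. Here is a minimal sketch on a made-up sales table; the table, its columns, the alias best, and the threshold 8 are hypothetical. The aggregate is repeated in the HAVING clause rather than referenced by its alias, since not every database system accepts an alias there.

```python
import sqlite3

# Hypothetical toy data: one row per (shop, year)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (shop TEXT, year INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("a", 2020, 7), ("a", 2021, 10), ("a", 2022, 2),
        ("b", 2020, 3), ("b", 2021, 5),
        ("c", 2021, 4),
    ],
)

# Keep only the shops whose best (maximum) amount is below 8
query = """
SELECT shop, MAX(amount) AS best
FROM sales
GROUP BY shop
HAVING MAX(amount) < 8
ORDER BY shop
"""
small_shops = conn.execute(query).fetchall()
print(small_shops)  # [('b', 5), ('c', 4)]
conn.close()
```

Shop a is dropped because its maximum is 10, even though some of its individual rows are below the threshold.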
# Solution cell
query = """
"""
# YOUR CODE HERE
raise NotImplementedError()
resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()
# Testing cell
assert len(resultdf) == 58
assert resultdf.loc[0,'max_pop'] == 0.11
assert resultdf.loc[1,'max_pop'] == 0.08
Q6 Use the indicators table to find the total world population in each year (as the sum of country populations), then return the rows where the total population is greater than 6000 (measured in millions of people). Use the alias total_pop for your new column.
# Solution cell
query = """
"""
# YOUR CODE HERE
raise NotImplementedError()
resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()
# Testing cell
assert len(resultdf) == 20
assert resultdf.loc[0,'total_pop'] < 6014
assert resultdf.loc[1,'total_pop'] > 6090