Before you turn this problem in, make sure everything runs as expected. This is a combination of restarting the kernel and then running all cells (in the menubar, select Kernel$\rightarrow$Restart And Run All).
Make sure you fill in any place that says YOUR CODE HERE
or "YOUR ANSWER HERE".
import pandas as pd
import os
import os.path
import json
import sys
import importlib
module_dir = "../../modules"
module_path = os.path.abspath(module_dir)
if not module_path in sys.path:
sys.path.append(module_path)
import dbutil
importlib.reload(dbutil)
%load_ext sql
Edit
creds.json
to reflect your mysql user and passwordThis must be done prior to executing the following cell
In general, you will be able to choose whether you are using the remote MySQL database or the SQLite database(s) by setting the dbsource
variable to "mysql"
or "sqlite"
respectively. The function dbutil.db_cstring
function computes a connection string for the chosen dbsource
using the information in the creds.json
file. If the last argument to this function is present, the generated connection string uses that datbase as superceding the name of the database in creds.json
.
dbsource = "mysql"
db = "book"
cstring = dbutil.db_cstring(dbsource, "creds.json", ".", db)
print("Connection string:", cstring)
The cell below uses the Python variable constructed earlier to actually establish a connection with the server.
If you cannot get a connection established, switch over to MySQL Workbench and try from that different client.
%sql $cstring
In the following cells, your only action is to, as usual, cut out the two lines, and to put a valid SQL statement as the value of string variable query
. In each case, when you execute the cell, the query will be sent to the database management system, a result obtained, and the result converted into a pandas
data frame, whose prefix is shown.
Q1 Using the table countries
, give the SQL query you would use to obtain a table all country names. There should be no onter columns projected.
# Solution cell
query = """
"""
# YOUR CODE HERE
raise NotImplementedError()
resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()
# Testing cell
assert len(resultdf) == 217
assert 'Aruba' in list(resultdf['country'])
Q2 Project the year, code, population, and number of cell-phones from indicators
. In a comment in the solution cell, describe a derived column and additional processing that could be applied to obtain useful result.
# Solution cell
query = """
"""
# YOUR CODE HERE
raise NotImplementedError()
resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()
# Testing cell
assert len(resultdf) == 12862
assert resultdf.shape == (12862,4)
assert 'pop' in resultdf.columns
assert 'cell' in resultdf.columns
Q3 Using the SQL table countries
, project all columns and produce a table of rows ordered by landmass, from smallest to largest. What kind of aggregation might be applied to this result? Answer this question as a comment within your solution cell.
# Solution cell
query = """
"""
# YOUR CODE HERE
raise NotImplementedError()
resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()
# Testing cell
assert len(resultdf) == 217
assert resultdf.iloc[0,0] == 'CUW'
assert resultdf.iloc[-1,0] == 'RUS'
Q4 Use an SQL query to answer: what country and in what year was the greatest number of cell phones? Project year, code, and cell. An answer that yields multiple rows, but allowing easy answer to the question will get most of the points, but the ideal answer will only have a single row.
# Solution cell
query = """
"""
# YOUR CODE HERE
raise NotImplementedError()
resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()
# Testing cell
assert resultdf.loc[0,'cell'] == 1469.88
assert resultdf.loc[0,'year'] == 2017
assert resultdf.loc[0,'code'] == 'CHN'
# Secondary testing cell
assert True
Q5 Use SQL to find the rows with the top 20 GDP values in indicators
. You may project all the columns.
# Solution cell
query = """
"""
# YOUR CODE HERE
raise NotImplementedError()
resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()
assert len(resultdf) == 20
assert resultdf.loc[0, 'year'] == 2018
assert resultdf.loc[0, 'code'] == 'USA'
assert resultdf.loc[0, 'gdp'] == 20494.1
Q6 In reference to the table indicators
, write a query to find all unique years that appear in the table.
# Solution cell
query = """
"""
# YOUR CODE HERE
raise NotImplementedError()
resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()
# Testing cell
assert len(resultdf) == 59
Q7 In reference to the table indicators
, write a query to find all rows with no missing data for gdp
.
# Solution cell
query = """
"""
# YOUR CODE HERE
raise NotImplementedError()
resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()
# Testing cell
assert len(resultdf) == 9660
Q8 In reference to the table indicators
, write a query to find all rows with no missing data for any of the numeric fields.
# Solution cell
query = """
"""
# YOUR CODE HERE
raise NotImplementedError()
resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()
# Testing cell
assert len(resultdf) == 7117
Q9 In reference to the table indicators
, write a query to find all rows where exports is higher than imports. Select all fields for such rows.
# Solution cell
query = """
"""
# YOUR CODE HERE
raise NotImplementedError()
resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()
# Testing cell
assert len(resultdf) == 2492
assert resultdf.shape == (2492,8)
Q10 In reference to the table indicators
, write a query to find the minimum non-zero number for cell
that appears. Since we haven't learned about MIN
you'll have to use ORDER BY
to arrange it so that the first row in your result has that minimum value.
# Solution cell
query = """
"""
# YOUR CODE HERE
raise NotImplementedError()
resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()
# Testing cell
assert resultdf.loc[0,'cell'] == 0.01
Q11 Use a subquery to select the top ten entries in indicators
with the highest population, then select the three from that group of 10 that have the lowest GDP. If you need to, review section 11.2.3 on subqueries.
# Solution cell
query = """
"""
# YOUR CODE HERE
raise NotImplementedError()
resultset = %sql $query
resultdf = resultset.DataFrame()
resultdf.head()
# Testing cell
assert len(resultdf) == 3
assert resultdf.loc[0,'gdp'] == 2652.55
assert resultdf.loc[1,'pop'] == 1352.62