Before you turn this problem in, make sure everything runs as expected. This is a combination of restarting the kernel and then running all cells (in the menubar, select Kernel$\rightarrow$Restart And Run All).
Make sure you fill in any place that says YOUR CODE HERE
or "YOUR ANSWER HERE".
No SQL Magics should be used in solving any of the following problems
import pandas as pd
import os
import os.path
import json
import sys
import importlib
import sqlalchemy as sa
if os.path.isdir(os.path.join("../../..", "modules")):
module_dir = os.path.join("../../..", "modules")
else:
module_dir = os.path.join("../..", "modules")
module_path = os.path.abspath(module_dir)
if not module_path in sys.path:
sys.path.append(module_path)
import dbutil
importlib.reload(dbutil)
dbfiledir = "../../dbfiles"
if not os.path.isdir(dbfiledir):
dbfiledir = "../" + dbfiledir
assert os.path.isdir(dbfiledir)
Q1 Write a function
sqlite_cstring(database, dbdir=".", driver="pysqlite", protocol="sqlite")
that generates a connection string suitable for use with a local SQLite database. Note how we order the parameters and give default values for the database directory, the driver, and the protocol, so that invocation of the function can omit the later arguments if the default values have the correct (common) values. So, for instance, an invocation sqlite_cstring("book", "../../dbfiles")
yields the string "sqlite+pysqlite:///../../dbfiles/book.db"
.
If the passed value for argument driver
is None
, then both the +
and the driver should be omitted. So the invocation sqlite_cstring("book", "../../dbfiles", driver=None)
yields the string "sqlite:///../../dbfiles/book.db"
You should be able to use the steps we followed in our inclass on database programming to help write this function.
# YOUR CODE HERE
raise NotImplementedError()
print(sqlite_cstring("book", "../../dbfiles"))
print(sqlite_cstring("book", "../../dbfiles", driver=None))
# Testing Cell
assert sqlite_cstring("book", "../../dbfiles") == "sqlite+pysqlite:///../../dbfiles/book.db"
assert sqlite_cstring("book", "../../dbfiles", driver=None) == "sqlite:///../../dbfiles/book.db"
Q2 In this question, we again have the goal of creating a function that gives us a connection string. But this time, we combine the formatting for a remote database covered in class with the convenience of getting many of the pieces of information from a JSON file. Write a function:
mysql_creds_cstring(database, credsdir=".", credsfile="creds.json")
that generates a connection string suitable for use with a remote MySQL database. The majority of the arguments for the connection string will come from the credsfile
file. In that JSON-formatted file, there will be a top-level dictionary that maps from string "mysql"
to a dictionary. This subordinate dictionary will have keys of "user"
, "pass"
, "server"
, and "scheme"
(which has both the protocol and the lower-level driver). The database part of the connection string will come from the database
argument, not from the creds file. We will omit the port, and let it default to the default port for mysql protocol connections.
If the creds file is not found in the credsdir
, or if any of the required keys are missing, the function should return None
.
# YOUR CODE HERE
raise NotImplementedError()
print(mysql_creds_cstring("book"))
# Testing Cell
assert mysql_creds_cstring("book") == "mysql+mysqlconnector://studen_j1:studen_j1@hadoop.mathsci.denison.edu/book"
The following cell assumes correct operation for the functions you defined in Q1 and/or Q2. As in the past, you can change the value of dbsource
depending on whether you want to use local or remote database.
If your answers to Q1 and/or Q2 are not generating the correct values for connection strings, you should be able to uncomment the last line in this next cell to get the instructor-provided utility function to generate a connection string.
dbsource = "sqlite"
db = "school"
if dbsource == "sqlite":
cstring = sqlite_cstring(db, dbfiledir)
elif dbsource == "mysql":
cstring = mysql_creds_cstring(db)
# cstring = dbutil.db_cstring(dbsource, "creds.json", ".", db)
print("Connection string:", cstring)
Q3 Write a function
db_connect(cstring)
that performs the two sqlalchemy
steps of creating a database engine based on a connection string, cstring
, and then establishes a connection. The function should return both the engine and the connection object. Note that the sqlalchemy
import has already been done for you in the first code cell of this notebook.
# YOUR CODE HERE
raise NotImplementedError()
# Testing Cell
engine, connection = db_connect(cstring)
assert isinstance(engine, sa.engine.base.Engine)
assert isinstance(connection, sa.engine.base.Connection)
Q4 Write a function
db_shutdown(engine, connection)
that, under the protection of a try
/except
block, attempts to close the given connection and then deletes the given engine. The function has no return value.
# YOUR CODE HERE
raise NotImplementedError()
# Testing Cell
## Assumes Q3 testing cell has been executed and that, on entry,
## engine and connection are defined global variables.
db_shutdown(engine, connection)
assert connection.closed
# Calling db_shutdwon a second time should **not** result in an exception
# if properly implemented
db_shutdown(engine, connection)
We now use the functions defined above, along with the value of cstring
to establish a connection to be used for subsequent operations. If prior questions have not been correctly answered, you may substitute other code to create a connection named connection
in place of the following cell.
dbsource = "sqlite"
db = "book"
if dbsource == "sqlite":
cstring = sqlite_cstring(db, dbfiledir)
elif dbsource == "mysql":
cstring = mysql_creds_cstring(db)
# cstring = dbutil.db_cstring(dbsource, "creds.json", ".", db)
print("Connection string:", cstring)
engine, connection = db_connect(cstring)
As part of good programming developement, we should develop a single function for each SQL query we want to make. Such functions should be made general and use function parameters for parts of the query where associating Python variables as part of the query makes sense.
This will be the approach taken in the remaining exercises in this homework.
Q5 Write a function
indicators0_table(conn)
that, given a connection, conn
, obtains all rows and all columns of the indicators0
table, returning the result as a pandas
data frame.
# YOUR CODE HERE
raise NotImplementedError()
resultdf = indicators0_table(connection)
resultdf.head()
# Testing Cell
resultdf = indicators0_table(connection)
assert isinstance(resultdf, pd.core.frame.DataFrame)
assert resultdf.shape == (5, 5)
assert 'code' in resultdf.columns
Q6 Write a function
indicators_pop(conn, pop_threshold)
that obtains all the columns of the indicators
table over the connection, conn
, but limits the rows to those whose pop
column is greater than or equal to pop_threshold
.
# YOUR CODE HERE
raise NotImplementedError()
resultdf = indicators_pop(connection, 1000)
resultdf.head()
# Testing Cell
resultdf = indicators_pop(connection, 1000)
assert isinstance(resultdf, pd.core.frame.DataFrame)
assert resultdf.shape == (59, 8)
assert 'code' in resultdf.columns
Q7 Write a function
topnames_names(conn, name_list)
that queries the topnames
table and returns the rows in which the name
field of the table matches one of the names in name_list
. Hint: On the SQL side, this is a form of a set inclusion predicate. This means the need for "translating" name_list
into the equivalent SQL string.
# YOUR CODE HERE
raise NotImplementedError()
resultdf = topnames_names(connection, ['Mary', 'John'])
resultdf.head()
# Testing Cell
resultdf = topnames_names(connection, ['Mary', 'John'])
assert isinstance(resultdf, pd.core.frame.DataFrame)
assert resultdf.shape == (120, 4)