Partition Testing
All software teams face the common challenge of adding code to a source code base without breaking the code which already exists. Software engineers have traditionally solved this problem by executing unit tests which are run before new code is deployed to production. The unit tests verify that each “unit” of code executes properly.
Our source code is used to transform tables in SQL - the output of our code is a collection of tables in SQL. In our case, the goal of our unit tests was to validate the data in our SQL tables, rather than to validate the functionality of our Python code. Specifically, we used the concept of unit testing to validate the following attributes of our output tables in SQL:
- Table exists
- Table has at least one row of data
- SQL query evaluates to an empty set
- SQL query evaluates to a non-empty set
Table Exists
The output data model, in our case, was a number of different SQL tables. To validate that all these tables exist, we queried the INFORMATION_SCHEMA
of our SQL database. Speficially, we wrote a function which took a list of tables as an input. If all tables were represented, the function returns True
, otherwise, the function returns False
.
SELECT TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE'
Table has data
In a similar manner, we can query the SQL database to return a list of tables that have at least 1 row of data. We constructed a function which took a list of tables as an input. The function returns True
if all the tables have at least 1 row of data, and False
otherwise.
SELECT T.NAME AS TABLE_NAME,
SUM(P.rows) AS NumRecords
FROM sys.tables T
INNER JOIN sys.indexes I ON T.OBJECT_ID = I.OBJECT_ID
INNER JOIN sys.partitions P ON I.OBJECT_ID = P.OBJECT_ID AND I.index_id = P.index_id
WHERE I.index_id <= 1
GROUP BY T.NAME
HAVING SUM(P.rows) > 0
ORDER BY SUM(P.rows) DESC
Custom SQL Queries
We borrowed the concept of assertions from Python unit tests to construct SQL assertions, which compare the result of a SQL query to an expected result. The most basic construction of a SQL assertion is a query that evaluates to an emtpy table. Suppose that the column AmountPaid is always positive. The below query should return an empty table.
-- expect empty
SELECT TOP 10 *
FROM database.dbo.table
WHERE AmountPaid < 0
We could also construct a query which we expect to return a non-empty table. For example, suppose we want to verify that a client is represented within a table. The below query should return a table with 10 rows.
-- expect full
SELECT TOP 10 *
FROM database.dbo.table
WHERE Client = 'Mercedes'
Ideally, these queries would be exected by a Python function so that it can be automatically executed. In order to distinguish between the two types of queries outlined above, we add frontmatter to our SQL queries. The Python function responsible for executing the custom SQL queries can read the frontmatter and identify which type of assertion should be applied.
In the above queries, the first line is a comment which is not executed by the SQL interpreter. We can use Python to parse the comment and identify the expectation value of the SQL query. In our implementation, we use regex to parse the frontmatter.
with open(file, 'r') as f:
test = os.path.split(file)[1]
test_type = re.search(r'(?<=expect )\\w+', file).group(0)
if(test_type =='full'):
assertFull(file)
if(test_type =='empty'):
assertEmpty(file)
Advanced Custom Queries
Any test case can be reduced to an assertion. For example, if we wanted to ensure that two tables had the same number of records, we could write a query to compare the COUNT
of records in two tables.
-- expect 1
SELECT
CASE WHEN (SELECT COUNT(*) FROM TableA) = (SELECT COUNT(*) FROM TableB)
THEN 1
ELSE 0
END
The idea of embedding metadata in comments opens a world of possibilities - you could build tests which are expected to return certain values, or a certain number of rows. We have found that automating unit tests has decreased the downtime of our production datasets by 10X and increased the number of pull requests per week by a similar increment.