All software teams face the common challenge of adding code to a source code base without breaking the code which already exists. Software engineers have traditionally solved this problem by executing unit tests which are run before new code is deployed to production. The unit tests verify that each “unit” of code executes properly.
Our source code is used to transform tables in SQL - the output of our code is a collection of tables in SQL. In our case, the goal of our unit tests was to validate the data in our SQL tables, rather than to validate the functionality of our Python code. Specifically, we used the concept of unit testing to validate the following attributes of our output tables in SQL:
- Table exists
- Table has at least one row of data
- SQL query evaluates to an empty set
- SQL query evaluates to a non-empty set
The output data model, in our case, was a number of different SQL tables. To validate that all these tables exist, we queried the
INFORMATION_SCHEMA of our SQL database. Speficially, we wrote a function which took a list of tables as an input. If all tables were represented, the function returns
True, otherwise, the function returns
SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_TYPE = 'BASE TABLE'
Table has data
In a similar manner, we can query the SQL database to return a list of tables that have at least 1 row of data. We constructed a function which took a list of tables as an input. The function returns
True if all the tables have at least 1 row of data, and
SELECT T.NAME AS TABLE_NAME, SUM(P.rows) AS NumRecords FROM sys.tables T INNER JOIN sys.indexes I ON T.OBJECT_ID = I.OBJECT_ID INNER JOIN sys.partitions P ON I.OBJECT_ID = P.OBJECT_ID AND I.index_id = P.index_id WHERE I.index_id <= 1 GROUP BY T.NAME HAVING SUM(P.rows) > 0 ORDER BY SUM(P.rows) DESC
Custom SQL Queries
We borrowed the concept of assertions from Python unit tests to construct SQL assertions, which compare the result of a SQL query to an expected result. The most basic construction of a SQL assertion is a query that evaluates to an emtpy table. Suppose that the column AmountPaid is always positive. The below query should return an empty table.
-- expect empty SELECT TOP 10 * FROM database.dbo.table WHERE AmountPaid < 0
We could also construct a query which we expect to return a non-empty table. For example, suppose we want to verify that a client is represented within a table. The below query should return a table with 10 rows.
-- expect full SELECT TOP 10 * FROM database.dbo.table WHERE Client = 'Mercedes'
Ideally, these queries would be exected by a Python function so that it can be automatically executed. In order to distinguish between the two types of queries outlined above, we add frontmatter to our SQL queries. The Python function responsible for executing the custom SQL queries can read the frontmatter and identify which type of assertion should be applied.
In the above queries, the first line is a comment which is not executed by the SQL interpreter. We can use Python to parse the comment and identify the expectation value of the SQL query. In our implementation, we use regex to parse the frontmatter.
with open(file, 'r') as f: test = os.path.split(file) test_type = re.search(r'(?<=expect )\\w+', file).group(0) if(test_type =='full'): assertFull(file) if(test_type =='empty'): assertEmpty(file)
Advanced Custom Queries
Any test case can be reduced to an assertion. For example, if we wanted to ensure that two tables had the same number of records, we could write a query to compare the
COUNT of records in two tables.
-- expect 1 SELECT CASE WHEN (SELECT COUNT(*) FROM TableA) = (SELECT COUNT(*) FROM TableB) THEN 1 ELSE 0 END
The idea of embedding metadata in comments opens a world of possibilities - you could build tests which are expected to return certain values, or a certain number of rows. We have found that automating unit tests has decreased the downtime of our production datasets by 10X and increased the number of pull requests per week by a similar increment.