Getting Started
Glossary
A
DataSource
represents a way to retrieve data from database. Typically, this corresponds to a table in the database. Yet, it could also be a more elaborate object. See the section on ‘AlternativeDataSource
s’ for more detail.A
Constraint
captures a concrete expectation between either twoDataSource
s or a singleDataSource
and a reference value.A
Requirement
captures allConstraint
s between two givenDataSource
s or allConstraint
s within a singleDataSource
. If aRequirement
refers links to twoDataSource
s, it is aBetweenRequirement
. If aRequirement
merely refers to a singleDataSource
, it is aWithinRequirement
.Conceptually, a ‘specification’ captures all
Requirement
s against a database. In practice that means it is usually a separate python file which:gathers all relevant
Requirement
sturns these
Requirement
s’Constraint
s into individual testscan be ‘tested’ by pytest
Creating a specification
In order to get going, you might want to use the following snippet in a new python file. This file will represent a specification.
import pytest
import sqlalchemy as sa
from datajudge.pytest_integration import collect_data_tests
@pytest.fixture(scope="module")
def datajudge_engine():
# TODO: Adapt connection string to database at hand.
return sa.create_engine("your_connection_string")
# TODO: Insert Requirement objects to list.
requirements = []
test_constraints = collect_data_tests(requirements)
This file will eventually lead as an input to pytest. More on that in the section ‘Testing a specification’.
In case you haven’t worked with sqlalchemy engines before, you might need to install drivers to connect to your database. You might want to install snowflake-sqlalchemy when using Snowflake, pyscopg when using Postgres and platform-specific drivers (Windows, Linux, macOS) when using MSSQL.
Specifying Constraints
In order to discover possible Constraint
s, please investigate the _add_*_constraint
methods
for BetweenRequirement
and WithinRequirement
respectively.
These methods are meant to be mostly self-documenting through the usage of expressive parameters.
Note that most Constraint
s will allow for at least one Condition
. A Condition
can be thought of as a conditional event in probability theory or a filter/clause in a database
query. Please consult the doc string of Condition
for greater detail. For examples, please
see tests/unit/test_condition.py
.
Many Constraint
s have optional columns
parameters. If no argument is given, all
available columns will be used.
Defining limitations of change
BetweenRequirement
s allow for Constraint
s expressing the limitation of a loss or gain. For example, the NRowsMinGain
Constraint
expresses by how much the number of rows must at least grow from the first DataSource
to the second. In the example of NRowsMinGain
,
this growth limitation is expressed relative to the number of rows of the first DataSource
.
Generally, such relative limitations can be defined in two ways:
manually, based on domain knowledge (e.g. ‘at least 5% growth’)
automatically, based on date ranges
The former would translate to
#rows_table_2 > (1 + min_relative_gain) * #rows_table_1
while the latter would translate to
date_growth := (max_date_table_2 - min_date_table_2) / (max_date_table_1 - min_date_table_1)
#rows_table_2 > (1 + date_growth) * #rows_table_1
In the latter case a date column must be passed during the instantiation of the BetweenRequirement
. Moreover, the date_range_*
must be passed
in the respective add_*_constraint
method. When using date ranges as an indicator of change, the constant_max_*
argument can safely be ignored. Additionally,
an additional buffer to the date growth can be added with help of the date_range_gain_deviation
parameter:
date_growth := (max_date_table_2 - min_date_table_2) / (max_date_table_1 - min_date_table_1)
#rows_table_2 > (1 + date_growth + date_range_gain_deviation) + * #rows_table_1
This example revolving around NRowsMinGain
generalizes to many Constraint
s concerned with growth, gain, loss or shrinkage limitations.
Testing a specification
In order to test whether the Constraint
s expressed in a specification hold true, you can simply run
pytest your_specification.py
This will produce results directly in your terminal. If you prefer to additionally generate a report, you can run
pytest your_specification.py --html=your_report.html
As the testing relies on pytest, all of pytest’s features can be used. More on this in the article on testing.
Test information
When calling a Constraint
’s test
method, a TestResult
is returned. The latter comes with a
logging_message
field. This field comprises information about the test failure, the constraint at hand
as well as the underlying database queries.
Depending on the use case at hand, it might make sense to rely on this information for logging or data investigation purposes. Again, more on this in the article on testing.
Assertion Message Styling
Constraints can use styling to increase the readability of their assertion messages. The styling can be set independently of the platform and converted to e.g. ANSI color codes for command line output or CSS color tags for HTML reports. The styling tags describe use cases and not concrete colors, so formatters can use arbitrary color palettes, and these are not fixed by the constraint.
The following table lists all the supported codes, along with their descriptions and examples of how they can be used:
Code |
Description |
Example |
---|---|---|
numMatch |
Indicates the part of a number that matches the expected value. |
[numMatch]3.141[/numMatch] |
numDiff |
Indicates the part of a number that differs. |
[numDiff]6[/numDiff] |
Alternative DataSources
A Requirement
is instantiated with either one or two fixed DataSource
s.
While the most typical example of a DataSource
would be a table in a database, datajudge
allows
for other DataSource
s as well. These are often derived from primitive tables of a database.
|
explanation |
|
|
---|---|---|---|
|
represents a table in a database |
|
|
|
represents the result of a |
|
|
|
represents the result of a sql query expressed via a string |
|
|
Typically, a user does not need to instantiate a corresponding DataSource
themselves. Rather, this is taken care
of by using the appropriate constructor for WithinRequirement
or BetweenRequirement
.
Note that in principle, several tables can be combined to make up for a single DataSource
. Yet, most of
the time when trying to compare two tables, it is more convenient to create a BetweenRequirement
and use
the from_tables
constructor.
Column capitalization
Different database management systems handle the capitalization of entities, such as column names, differently. For the time being:
Mssql:
datajudge
expects column name capitalization as is seen in database, either lowercase or uppercase.Postgres:
datajudge
expects lowercase column names.Snowflake:
datajudge
will lowercase independently of the capitalization provided.
The Snowflake behavior is due to an upstream bug in snowflake-sqlalchemy.
This behavior is subject to change.