Example: Exploration
While datajudge seeks to tackle the use case of expressing and evaluating tests against data, its fairly generic inner workings allow for using it in a rather explorative workflow as well.
Let’s first clarify terminology by exemplifying both scenarios. A person wishing to test data might ask the question
Has the number of rows not grown too much from version 1 of the table to version 2 of the table?
whereas a person wishing to explore the data might ask the question
By how much has the number of rows grown from version 1 to version 2 of the table?
Put differently, a test typically revolves around a binary outcome while an exploration usually doesn’t.
In the following we will attempt to illustrate possible usages of datajudge for exploration by looking at three simple examples.
These examples rely on some insight about how most datajudge Constraint
s work under
the hood. Importantly, Constraint
s typically come with
a
retrieve
method: this method fetches relevant data from database, given aDataReference
a
get_factual_value
method: this is typically a wrapper aroundretrieve
for the firstDataReference
of the givenRequirement
/Constraint
a
get_target_value
method: this is either a wrapper aroundretrieve
for the secondDataReference
in the case of aBetweenRequirement
or an echoing of theConstraint
s key reference value in the case of aWithinRequirement
Moreover, as is the case when using datajudge for testing purposes, these approaches rely on a sqlalchemy engine. The latter is the gateway to the database at hand.
Example 1: Comparing numbers of rows
Assume we have two tables in the same database called table1
and table2
. Now we
would like to compare their numbers of rows. Naturally, we would like to retrieve
the respective numbers of rows before we can compare them. For this purpose we create
a BetweenTableRequirement
referring to both tables and add a NRowsEquality
Constraint
onto it.
import sqlalchemy as sa
from datajudge import BetweenRequirement
engine = sa.create_engine(your_connection_string)
req = BetweenRequirement.from_tables(
db_name,
schema_name,
"table1",
db_name,
schema_name,
"table2",
)
req.add_n_rows_equality_constraint()
n_rows1 = req[0].get_factual_value(engine)
n_rows2 = req[0].get_target_value(engine)
Note that here, we access the first (and only) Constraint
that has been added to the
BetweenRequirement
by writing req[0]
. Requirements
are are sequences of
Constraint
s, after all.
Once the numbers of rows are retrieved, we can compare them as we wish. For instance, we
could compute the absolute and relative growth (or loss) of numbers of rows from
table1
to table2
:
absolute_change = abs(n_rows2 - n_rows1)
relative_change = (absolute_change) / n_rows1 if n_rows1 != 0 else None
Importantly, many datajudge staples, such as Condition
s can be used, too. We shall see
this in our next example.
Example 2: Investigating unique values
In this example we will suppose that there is a table called table
consisting of
several columns. Two of its columns are supposed to be called col_int
and
col_varchar
. We are now interested in the unique values in these two columns combined.
Put differently, we are wondering:
Which unique pairs of values in
col_int
andcol_varchar
have we encountered?
To add to the mix, we will moreover only be interested in tuples in which col_int
has a
value of larger than 10.
As before, we will start off by creating a Requirement
. Since we are only dealing with
a single table this time, we will create a WithinRequirement
.
import sqlalchemy as sa
from datajudge import WithinRequirement, Condition
engine = sa.create_engine(your_connection_string)
req = requirements.WithinRequirement.from_table(
db_name,
schema_name,
"table",
)
condition = Condition(raw_string="col_int >= 10")
req.add_uniques_equality_constraint(
columns=["col_int", "col_varchar"],
uniques=[], # This is really just a placeholder.
condition=condition,
)
uniques = req[0].get_factual_value(engine)
If one was to investigate this uniques
variable further, one could, e.g. see the
following:
([(10, 'hi10'), (11, 'hi11'), (12, 'hi12'), (13, 'hi13'), (14, 'hi14'), (15, 'hi15'), (16, 'hi16'), (17, 'hi17'), (18, 'hi18'), (19, 'hi19')], [1, 100, 12, 1, 7, 8, 1, 1, 1337, 1])
This becomes easier to parse when inspecting the underlying retrieve
method of the
UniquesEquality
Constraint
: the first value of the tuple corresponds to the list
of unique pairs in columns col_int
and col_varchar
. The second value of the tuple
are the respective counts thereof.
Moreoever, one could manually customize the underlying SQL query. In order to do so, one
can use the fact that retrieve
methods typically return an actual result or value
as well as the sqlalchemy selections that led to said result or value. We can use these
selections and compile them to a standard, textual SQL query:
values, selections = req[0].retrieve(engine, constraint.ref)
print(str(selections[0].compile(engine, compile_kwargs={"literal_binds": True}))
In the case from above, this would return the following query:
SELECT
anon_1.col_int,
anon_1.col_varchar,
count(*) AS count_1
FROM
(SELECT
tempdb.dbo.table.col_int AS col_int,
tempdb.dbo.table.col_varchar AS col_varchar
FROM
tempdb.dbo.table WITH (NOLOCK)
WHERE col_int >= 10) AS anon_1
GROUP BY anon_1.col_int, anon_1.col_varchar
Example 3: Comparing column structure
While we often care about value tuples of given columns, i.e. rows, it can also provide meaningful insights to compare the column structure of two tables. In particular, we might want to compare whether columns of one table are a subset or superset of another table. Moreover, for columns present in both tables, we’d like to learn about their respective types.
In order to illustrate such an example, we will again assume that there are two tables
called table1
and table2
, irrespective of prior examples.
We can now create a BetweenRequirement
for these two tables and use the
ColumnSubset
Constraint
. As before, we will rely on the get_factual_value
method to retrieve the values of interest for the first table passed to the
BetweenRequirement
and the get_target_value
method for the second table passed
to the BetweenRequirement
.
import sqlalchemy as sa
from datajudge import BetweenRequirement
engine = sa.create_engine(your_connection_string)
req = BetweenRequirement.from_tables(
db_name,
schema_name,
"table1",
db_name,
schema_name,
"table2",
)
req.add_column_subset_constraint()
columns1 = req[0].get_factual_value(engine)
columns2 = req[0].get_target_value(engine)
print(f"Columns present in both: {set(columns1) & set(columns2)}")
print(f"Columns present in only table1: {set(columns1) - set(columns2)}")
print(f"Columns present in only table2: {set(columns2) - set(columns1)}")
This could, for instance result in the following printout:
Columns present in both: {'col_varchar', 'col_int'}
Columns present in only table1: set()
Columns present in only table2: {'col_date'}
Now, we can investigate the types of the columns present in both tables:
for column in set(columns1) & set(columns2):
req.add_column_type_constraint(column1=column, column2=column)
type1 = req[0].get_factual_value(engine)
type2 = req[0].get_target_value(engine)
print(f"Column '{column}' has type '{type1}' in table1 and type '{type2}' in table2.")
Depending on the underlying database management system and data, the output of this could for instance be:
Column 'col_varchar' has type 'varchar' in table1 and type 'varchar' in table2.
Column 'col_int' has type 'integer' in table1 and type 'integer' in table2.