datajudge package
Submodules
Module contents
datajudge allows you to assess whether data from a database complies with reference information.
- class datajudge.BetweenRequirement(data_source, data_source2, date_column=None, date_column2=None)
Bases: Requirement
Methods

- add_column_subset_constraint([name]) – Columns of first table are subset of second table.
- add_column_superset_constraint([name]) – Columns of first table are superset of columns of second table.
- add_column_type_constraint(column1, column2) – Check that the columns have the same type.
- add_date_max_constraint(column1, column2[, ...]) – Compare date max of first table to date max of second table.
- add_date_min_constraint(column1, column2[, ...]) – Ensure date min of first table is greater or equal date min of second table.
- add_ks_2sample_constraint(column1, column2) – Apply the two-sample Kolmogorov-Smirnov test to the distributions of the two given columns.
- add_max_null_fraction_constraint(column1, ...) – Assert that the fraction of NULL values of one is at most that of the other.
- add_n_rows_max_gain_constraint(...) – #rows from first table <= #rows from second table * (1 + max_growth).
- add_n_rows_max_loss_constraint(...) – #rows from first table >= #rows from second table * (1 - max_loss).
- add_n_rows_min_gain_constraint(...) – #rows from first table >= #rows from second table * (1 + min_growth).
- add_n_uniques_max_gain_constraint(columns1, ...) – #uniques of first table <= #uniques of second table * (1 + max_growth).
- add_n_uniques_max_loss_constraint(columns1, ...) – #uniques in first table <= #uniques in second table * (1 - max_loss).
- add_numeric_percentile_constraint(column1, ...) – Assert that the percentage-th percentile is approximately equal.
- add_row_equality_constraint(columns1, ...[, ...]) – At most max_missing_fraction of rows in T1 and T2 are absent in either.
- add_row_matching_equality_constraint(...[, ...]) – Match tables in matching_columns, compare for equality in comparison_columns.
- add_row_subset_constraint(columns1, ...[, ...]) – At most max_missing_fraction of rows in T1 are not in T2.
- add_row_superset_constraint(columns1, ...[, ...]) – At most max_missing_fraction of rows in T2 are not in T1.
- add_uniques_equality_constraint(columns1, ...) – Check if the data's unique values in given columns are equal.
- add_uniques_subset_constraint(columns1, columns2) – Check if the given columns' unique values are contained in the reference data.
- add_uniques_superset_constraint(columns1, ...) – Check if unique values of columns are contained in the reference data.
- append(value) – Append value to the end of the sequence.
- clear()
- count(value)
- extend(values) – Extend sequence by appending elements from the iterable.
- from_expressions(expression1, expression2, ...) – Create a BetweenTableRequirement based on sqlalchemy expressions.
- from_raw_queries(query1, query2, name1, name2) – Create a BetweenRequirement based on raw query strings.
- index(value[, start[, stop]]) – Raises ValueError if the value is not present.
- insert(index, value) – Insert value before index.
- pop([index]) – Raise IndexError if list is empty or index is out of range.
- remove(value) – Remove first occurrence of value.
- reverse() – Reverse in place.
- add_n_rows_equality_constraint
- add_n_uniques_equality_constraint
- add_numeric_max_constraint
- add_numeric_mean_constraint
- add_numeric_min_constraint
- add_varchar_max_length_constraint
- add_varchar_min_length_constraint
- from_tables
- get_date_growth_rate
- get_deviation_getter
- test
- Parameters:
  - data_source (DataSource) –
  - data_source2 (DataSource) –
  - date_column (Optional[str]) –
  - date_column2 (Optional[str]) –
- add_column_subset_constraint(name=None)
Columns of first table are subset of second table.
- Parameters:
  - name (str) –
- add_column_superset_constraint(name=None)
Columns of first table are superset of columns of second table.
- Parameters:
  - name (str) –
- add_column_type_constraint(column1, column2, name=None)
Check that the columns have the same type.
- Parameters:
  - column1 (str) –
  - column2 (str) –
  - name (str) –
- add_date_max_constraint(column1, column2, use_upper_bound_reference=True, column_type='date', condition1=None, condition2=None, name=None)
Compare date max of first table to date max of second table.
The used columns of both tables need to be of the same type.
For more information on column_type values, see add_column_type_constraint.
If use_upper_bound_reference, the max of the first table has to be smaller or equal to the max of the second table. If not use_upper_bound_reference, the max of the first table has to be greater or equal to the max of the second table.
- add_date_min_constraint(column1, column2, use_lower_bound_reference=True, column_type='date', condition1=None, condition2=None, name=None)
Ensure date min of first table is greater or equal date min of second table.
The used columns of both tables need to be of the same type.
For more information on column_type values, see add_column_type_constraint.
If use_lower_bound_reference, the min of the first table has to be greater or equal to the min of the second table. If not use_lower_bound_reference, the min of the first table has to be smaller or equal to the min of the second table.
- add_ks_2sample_constraint(column1, column2, condition1=None, condition2=None, name=None, significance_level=0.05)
Apply the so-called two-sample Kolmogorov-Smirnov test to the distributions of the two given columns. The constraint is fulfilled when the resulting p-value of the test is higher than the significance level (default is 0.05, i.e., 5%). The significance_level must be a value between 0.0 and 1.0.
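The statistic underlying this test can be sketched in plain Python. This is an illustration of the two-sample KS statistic itself, not of datajudge's database-side implementation; the derivation of the p-value from the statistic is omitted:

```python
def ks_2sample_statistic(sample1, sample2):
    """Two-sample Kolmogorov-Smirnov statistic D: the maximum absolute
    difference between the two empirical distribution functions."""
    n1, n2 = len(sample1), len(sample2)
    d = 0.0
    # Evaluate both empirical CDFs at every observed value.
    for x in set(sample1) | set(sample2):
        cdf1 = sum(1 for v in sample1 if v <= x) / n1
        cdf2 = sum(1 for v in sample2 if v <= x) / n2
        d = max(d, abs(cdf1 - cdf2))
    return d
```

The constraint then compares the p-value derived from D (and the sample sizes) against significance_level.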
- add_max_null_fraction_constraint(column1, column2, max_relative_deviation, condition1=None, condition2=None, name=None)
Assert that the fraction of NULL values of one is at most that of the other.
Given that column2's underlying data has a fraction q of NULL values, the max_relative_deviation parameter allows column1's underlying data to have a fraction (1 + max_relative_deviation) * q of NULL values.
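A pure-Python sketch of this predicate (illustrative only; datajudge evaluates it in SQL):

```python
def null_fraction(values):
    """Fraction of NULL (here: None) entries among a column's values."""
    return sum(v is None for v in values) / len(values)

def max_null_fraction_ok(values1, values2, max_relative_deviation):
    """column1 may have at most a (1 + max_relative_deviation) * q fraction
    of NULLs, where q is column2's NULL fraction."""
    q = null_fraction(values2)
    return null_fraction(values1) <= (1 + max_relative_deviation) * q
```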
- add_n_rows_equality_constraint(condition1=None, condition2=None, name=None)
- add_n_rows_max_gain_constraint(constant_max_relative_gain=None, date_range_gain_deviation=None, condition1=None, condition2=None, name=None)
#rows from first table <= #rows from second table * (1 + max_growth).
See readme for more information on max_growth.
- add_n_rows_max_loss_constraint(constant_max_relative_loss=None, date_range_loss_deviation=None, condition1=None, condition2=None, name=None)
#rows from first table >= #rows from second table * (1 - max_loss).
See readme for more information on max_loss.
- add_n_rows_min_gain_constraint(constant_min_relative_gain=None, date_range_gain_deviation=None, condition1=None, condition2=None, name=None)
#rows from first table >= #rows from second table * (1 + min_growth).
See readme for more information on min_growth.
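The three row-count constraints above boil down to simple inequalities. A sketch, where n1 and n2 are the row counts of the first and second table and the last argument stands in for the respective constant_* parameter:

```python
def n_rows_max_gain_ok(n1, n2, max_growth):
    # Row count of the first table may exceed the second's by at most max_growth.
    return n1 <= n2 * (1 + max_growth)

def n_rows_max_loss_ok(n1, n2, max_loss):
    # Row count of the first table may fall short of the second's by at most max_loss.
    return n1 >= n2 * (1 - max_loss)

def n_rows_min_gain_ok(n1, n2, min_growth):
    # Row count of the first table must exceed the second's by at least min_growth.
    return n1 >= n2 * (1 + min_growth)
```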
- add_n_uniques_equality_constraint(columns1, columns2, condition1=None, condition2=None, name=None)
- add_n_uniques_max_gain_constraint(columns1, columns2, constant_max_relative_gain=None, date_range_gain_deviation=None, condition1=None, condition2=None, name=None)
#uniques of first table <= #uniques of second table * (1 + max_growth).
#uniques in first table are defined based on columns1, #uniques in second table are defined based on columns2.
See readme for more information on max_growth.
- add_n_uniques_max_loss_constraint(columns1, columns2, constant_max_relative_loss=None, date_range_loss_deviation=None, condition1=None, condition2=None, name=None)
#uniques in first table <= #uniques in second table * (1 - max_loss).
#uniques in first table are defined based on columns1, #uniques in second table are defined based on columns2.
See readme for more information on max_loss.
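A sketch of the gain variant of these predicates, where rows are tuples taken from columns1 and columns2 respectively (the loss variant is analogous, with the inequality as stated in the docstring above):

```python
def n_uniques_max_gain_ok(rows1, rows2, max_growth):
    """#uniques of first table <= #uniques of second table * (1 + max_growth)."""
    n_uniques1 = len(set(rows1))  # unique combinations over columns1
    n_uniques2 = len(set(rows2))  # unique combinations over columns2
    return n_uniques1 <= n_uniques2 * (1 + max_growth)
```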
- add_numeric_max_constraint(column1, column2, condition1=None, condition2=None, name=None)
- add_numeric_mean_constraint(column1, column2, max_absolute_deviation, condition1=None, condition2=None, name=None)
- add_numeric_min_constraint(column1, column2, condition1=None, condition2=None, name=None)
- add_numeric_percentile_constraint(column1, column2, percentage, max_absolute_deviation=None, max_relative_deviation=None, condition1=None, condition2=None, name=None)
Assert that the percentage-th percentile is approximately equal.
The percentile is defined as the value present in column1/column2 for which percentage % of the values in column1/column2 are less or equal. NULL values are ignored.
Hence, if percentage is less than the inverse of the number of non-NULL rows, None is received as the percentage-th percentile.
percentage is expected to be provided in percent. The median, for example, would correspond to percentage=50.
At least one of max_absolute_deviation and max_relative_deviation must be provided.
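The percentile definition above can be restated as a small pure-Python function (illustrative; datajudge computes this in SQL):

```python
def documented_percentile(values, percentage):
    """Smallest non-NULL value v such that `percentage` % of the non-NULL
    values are less than or equal to v; None if `percentage` is smaller
    than the inverse of the number of non-NULL rows."""
    non_null = sorted(v for v in values if v is not None)
    required = percentage / 100 * len(non_null)  # rows that must be <= v
    if required < 1:
        return None
    for i, v in enumerate(non_null, start=1):
        if i >= required:
            return v
```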
- add_row_equality_constraint(columns1, columns2, max_missing_fraction, condition1=None, condition2=None, name=None)
At most max_missing_fraction of rows in T1 and T2 are absent in either.
In other words, \(\frac{|T1 - T2| + |T2 - T1|}{|T1 \cup T2|} \leq\) max_missing_fraction. Rows from T1 are indexed in columns1, rows from T2 are indexed in columns2.
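With both row sets materialized in memory, the documented fraction reads as follows (sets stand in for the indexed rows; datajudge evaluates this in SQL):

```python
def row_equality_ok(rows1, rows2, max_missing_fraction):
    """(|T1 - T2| + |T2 - T1|) / |T1 union T2| <= max_missing_fraction."""
    t1, t2 = set(rows1), set(rows2)
    missing = len(t1 - t2) + len(t2 - t1)
    return missing / len(t1 | t2) <= max_missing_fraction
```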
- add_row_matching_equality_constraint(matching_columns1, matching_columns2, comparison_columns1, comparison_columns2, max_missing_fraction, condition1=None, condition2=None, name=None)
Match tables in matching_columns, compare for equality in comparison_columns.
This constraint is similar in nature to the RowEquality constraint. Just as the latter, this constraint divides the cardinality of an intersection by the cardinality of a union. The difference lies in how the sets are created. While RowEquality considers all rows of both tables, indexed in columns, RowMatchingEquality considers only rows in both tables having values in matching_columns present in both tables. At most max_missing_fraction of such rows can be missing in the intersection.
Alternatively, this can be thought of as counting mismatches in comparison_columns after performing an inner join on matching_columns.
- add_row_subset_constraint(columns1, columns2, constant_max_missing_fraction, date_range_loss_fraction=None, condition1=None, condition2=None, name=None)
At most max_missing_fraction of rows in T1 are not in T2.
In other words, \(\frac{|T1-T2|}{|T1|} \leq\) max_missing_fraction. Rows from T1 are indexed in columns1, rows from T2 are indexed in columns2.
In particular, the operation |T1-T2| relies on a sql EXCEPT statement. In contrast to EXCEPT ALL, this should lead to a set subtraction instead of a multiset subtraction. In other words, duplicates in T1 are treated as single occurrences.
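The set (rather than multiset) semantics can be mirrored in Python with ordinary sets (an illustration of the predicate, not of the generated SQL):

```python
def row_subset_ok(rows1, rows2, max_missing_fraction):
    """|T1 - T2| / |T1| <= max_missing_fraction, with EXCEPT-style set
    semantics: duplicates in T1 count as a single occurrence."""
    t1, t2 = set(rows1), set(rows2)
    return len(t1 - t2) / len(t1) <= max_missing_fraction
```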
- add_row_superset_constraint(columns1, columns2, constant_max_missing_fraction, date_range_loss_fraction=None, condition1=None, condition2=None, name=None)
At most max_missing_fraction of rows in T2 are not in T1.
In other words, \(\frac{|T2-T1|}{|T2|} \leq\) max_missing_fraction. Rows from T1 are indexed in columns1, rows from T2 are indexed in columns2.
- add_uniques_equality_constraint(columns1, columns2, map_func=None, reduce_func=None, condition1=None, condition2=None, name=None)
Check if the data’s unique values in given columns are equal.
The UniquesEquality constraint asserts that the values contained in a column of a DataSource's columns are strictly the ones of another DataSource's columns.
See the Uniques class for further parameter details on map_func and reduce_func.
- add_uniques_subset_constraint(columns1, columns2, max_relative_violations=0, map_func=None, reduce_func=None, condition1=None, condition2=None, name=None)
Check if the given columns' unique values are contained in the reference data.
The UniquesSubset constraint asserts that the values contained in a given column of a DataSource are part of the unique values of given columns of another DataSource.
Null values in the column are ignored. To assert their non-existence, use the NullAbsence constraint via the add_null_absence_constraint helper method of WithinRequirement.
max_relative_violations indicates what fraction of rows of the given table may have values not included in the reference set of unique values. Please note that UniquesSubset and UniquesSuperset are not symmetrical in this regard.
See Uniques for further details on map_func and reduce_func.
- add_uniques_superset_constraint(columns1, columns2, max_relative_violations=0, map_func=None, reduce_func=None, condition1=None, condition2=None, name=None)
Check if unique values of columns are contained in the reference data.
The UniquesSuperset constraint asserts that the reference set of expected values, derived from the unique values in given columns of the reference DataSource, is contained in given columns of a DataSource.
Null values in the column are ignored. To assert their non-existence, use the NullAbsence constraint via the add_null_absence_constraint helper method of WithinRequirement.
max_relative_violations indicates what fraction of unique values of the given DataSource are not represented in the reference set of unique values. Please note that UniquesSubset and UniquesSuperset are not symmetrical in this regard.
One use of this constraint is to test for consistency in columns with expected categorical values.
See Uniques for further details on map_func and reduce_func.
- add_varchar_max_length_constraint(column1, column2, condition1=None, condition2=None, name=None)
- add_varchar_min_length_constraint(column1, column2, condition1=None, condition2=None, name=None)
- classmethod from_expressions(expression1, expression2, name1, name2, date_column=None, date_column2=None)
Create a BetweenTableRequirement based on sqlalchemy expressions.
Any sqlalchemy object implementing the alias method can be passed as an argument for the expression1 and expression2 parameters. This could, e.g., be a sqlalchemy.Table object or the result of a sqlalchemy.select invocation.
name1 and name2 will be used to represent the expressions in error messages, respectively.
- Parameters:
  - name1 (str) –
  - name2 (str) –
  - date_column (Optional[str]) –
  - date_column2 (Optional[str]) –
- classmethod from_raw_queries(query1, query2, name1, name2, columns1=None, columns2=None, date_column=None, date_column2=None)
Create a BetweenRequirement based on raw query strings.
The query1 and query2 parameters can be passed any query string returning rows, e.g. "SELECT * FROM myschema.mytable LIMIT 1337" or "SELECT id, name FROM table1 UNION SELECT id, name FROM table2".
name1 and name2 will be used to represent the queries in error messages, respectively.
If constraints rely on specific columns, these should be provided here via columns1 and columns2 respectively.
- Parameters:
  - query1 (str) –
  - query2 (str) –
  - name1 (str) –
  - name2 (str) –
  - columns1 (List[str]) –
  - columns2 (List[str]) –
  - date_column (Optional[str]) –
  - date_column2 (Optional[str]) –
- classmethod from_tables(db_name1, schema_name1, table_name1, db_name2, schema_name2, table_name2, date_column=None, date_column2=None)
- Parameters:
  - db_name1 (str) –
  - schema_name1 (str) –
  - table_name1 (str) –
  - db_name2 (str) –
  - schema_name2 (str) –
  - table_name2 (str) –
  - date_column (Optional[str]) –
  - date_column2 (Optional[str]) –
- get_date_growth_rate(engine)
- Return type: float
- get_deviation_getter(fix_value, deviation)
- Parameters:
  - fix_value (Optional[float]) –
  - deviation (Optional[float]) –
- class datajudge.Condition(raw_string=None, conditions=None, reduction_operator=None)
Bases: object
Condition allows for further narrowing down of a DataSource in a Constraint.
A Condition can be thought of as a filter, the content of a sql ‘where’ clause or a condition as known from probability theory.
While a DataSource is expressed more generally, one might be interested in testing properties of a specific part of said DataSource in light of a particular constraint. Hence, using Conditions allows for the reuse of a DataSource, in lieu of creating a new custom DataSource with the Condition implicitly built in.
A Condition can either be 'atomic', i.e. not further reducible to sub-conditions, or 'composite', i.e. combining multiple sub-conditions. In the former case, it can be instantiated with help of the raw_string parameter, e.g. "col1 > 0". In the latter case, it can be instantiated with help of the conditions and reduction_operator parameters. reduction_operator allows for two values: "and" (logical conjunction) and "or" (logical disjunction). Note that composition of Conditions supports arbitrary degrees of nesting.
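The nesting semantics can be illustrated with a small stand-in class (a simplified sketch; the exact SQL string datajudge renders may differ):

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class CondSketch:
    """Illustrative stand-in for datajudge.Condition."""
    raw_string: Optional[str] = None
    conditions: Optional[Sequence["CondSketch"]] = None
    reduction_operator: Optional[str] = None

    def render(self) -> str:
        if self.raw_string is not None:
            return self.raw_string  # atomic condition
        # Composite: join parenthesized sub-conditions with 'and' / 'or'.
        return f" {self.reduction_operator} ".join(
            f"({c.render()})" for c in self.conditions
        )

# Arbitrary nesting: (col1 > 0) and ((col2 < 5) or (col3 = 'x'))
inner = CondSketch(
    conditions=[CondSketch(raw_string="col2 < 5"), CondSketch(raw_string="col3 = 'x'")],
    reduction_operator="or",
)
outer = CondSketch(
    conditions=[CondSketch(raw_string="col1 > 0"), inner],
    reduction_operator="and",
)
```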
- Attributes:
- conditions
- raw_string
- reduction_operator
Methods
snowflake_str
- Parameters:
  - raw_string (str | None) –
  - conditions (Optional[Sequence[Condition]]) –
  - reduction_operator (str | None) –
- raw_string: str | None = None
- reduction_operator: str | None = None
- snowflake_str()
- class datajudge.Constraint(ref, *, ref2=None, ref_value=None, name=None)
Bases: ABC
Express a DataReference constraint against either another DataReference or a reference value.
Constraints against other DataReferences are typically referred to as 'between' constraints. Please use the ref2 argument to instantiate such a constraint. Constraints against a fixed reference value are typically referred to as 'within' constraints. Please use the ref_value argument to instantiate such a constraint.
A constraint typically relies on the comparison of factual and target values. The former represent the key quantity of interest as seen in the database, the latter the key quantity of interest as expected a priori. Such a comparison is meant to be carried out in the test method.
In order to obtain such values, the retrieve method defines a mapping from DataReference, be it the DataReference of primary interest, ref, or a baseline DataReference, ref2, to value. If ref_value is already provided, usually no further mapping needs to be taken care of.
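The factual/target flow described above can be sketched with a toy 'within' constraint. This is purely illustrative; the real class operates on DataReferences and Engines and returns TestResult objects:

```python
from typing import Any, Optional, Tuple

class MaxConstraintSketch:
    """Toy 'within' constraint: the factual maximum must not exceed ref_value."""

    def __init__(self, data, ref_value):
        self.data = data            # stands in for the DataReference `ref`
        self.ref_value = ref_value  # fixed target of a 'within' constraint

    def retrieve(self) -> Any:
        return max(self.data)       # factual value, 'as seen in the database'

    def compare(self, factual: Any, target: Any) -> Tuple[bool, Optional[str]]:
        if factual <= target:
            return True, None
        return False, f"{factual} exceeds {target}"

    def test(self) -> bool:
        ok, _failure_message = self.compare(self.retrieve(), self.ref_value)
        return ok
```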
- Attributes:
- condition_string
- target_prefix
Methods
retrieve
(engine, ref)Retrieve the value of interest for a DataReference from database.
compare
get_description
get_factual_value
get_target_value
test
- Parameters:
  - ref (DataReference) –
  - ref_value (Any) –
  - name (str) –
- compare(value_factual, value_target)
- Parameters:
  - value_factual (Any) –
  - value_target (Any) –
- Return type: Tuple[bool, Optional[str]]
- property condition_string: str
- get_description()
- Return type: str
- get_factual_value(engine)
- Parameters:
  - engine (Engine) –
- Return type: Any
- get_target_value(engine)
- Parameters:
  - engine (Engine) –
- Return type: Any
- retrieve(engine, ref)
Retrieve the value of interest for a DataReference from database.
- Parameters:
  - engine (Engine) –
  - ref (DataReference) –
- Return type: Tuple[Any, Optional[List[Select]]]
- property target_prefix: str
- test(engine)
- Parameters:
  - engine (Engine) –
- Return type: TestResult
- class datajudge.Requirement
Bases: ABC, MutableSequence
Methods

- append(value) – Append value to the end of the sequence.
- clear()
- count(value)
- extend(values) – Extend sequence by appending elements from the iterable.
- index(value[, start[, stop]]) – Raises ValueError if the value is not present.
- insert(index, value) – Insert value before index.
- pop([index]) – Raise IndexError if list is empty or index is out of range.
- remove(value) – Remove first occurrence of value.
- reverse() – Reverse in place.
- test
- insert(index, value)
S.insert(index, value) – insert value before index
- Parameters:
  - index (int) –
  - value (Constraint) –
- Return type: None
- test(engine)
- Return type: List[TestResult]
- class datajudge.WithinRequirement(data_source)
Bases: Requirement
Methods

- add_categorical_bound_constraint(columns, ...) – Check if the distribution of unique values in columns falls within the specified minimum and maximum bounds.
- add_column_type_constraint(column, column_type) – Check if a column type matches the expected column_type.
- add_date_between_constraint(column, ...[, ...]) – Use string format: lower_bound="'20121230'".
- add_date_max_constraint(column, max_value[, ...]) – Compare the maximum date in the column against max_value.
- add_date_min_constraint(column, min_value[, ...]) – Ensure all dates are greater than or equal to min_value.
- add_date_no_gap_constraint(start_column, ...) – Express that date range rows have no gap in-between them.
- add_date_no_overlap_2d_constraint(...[, ...]) – Express that several date range rows do not overlap in two date dimensions.
- add_date_no_overlap_constraint(start_column, ...) – Constraint expressing that several date range rows may not overlap.
- add_functional_dependency_constraint(...[, ...]) – Expresses a functional dependency, a constraint where the value_columns are uniquely determined by the key_columns.
- add_groupby_aggregation_constraint(columns, ...) – Check whether array aggregate corresponds to an integer range.
- add_max_null_fraction_constraint(column, ...) – Assert that column has less than a certain fraction of NULL values.
- add_numeric_between_constraint(column, ...) – Assert that the column's values lie between lower_bound and upper_bound.
- add_numeric_max_constraint(column, max_value) – All values in column are less or equal max_value.
- add_numeric_mean_constraint(column, ...[, ...]) – Assert the mean of the column deviates at most max_deviation from mean_value.
- add_numeric_min_constraint(column, min_value) – All values in column are greater or equal min_value.
- add_numeric_no_gap_constraint(start_column, ...) – Express that numeric interval rows have no gaps larger than some max value in-between them.
- add_numeric_no_overlap_constraint(...[, ...]) – Constraint expressing that several numeric interval rows may not overlap.
- add_numeric_percentile_constraint(column, ...) – Assert that the percentage-th percentile is approximately expected_percentile.
- add_primary_key_definition_constraint(...[, ...]) – Check that the primary key constraints in the database are exactly equal to the given column names.
- add_uniqueness_constraint([columns, ...]) – Columns should uniquely identify row.
- add_uniques_equality_constraint(columns, uniques) – Check if the data's unique values are equal to a given set of values.
- add_uniques_subset_constraint(columns, uniques) – Check if the data's unique values are contained in a given set of values.
- add_uniques_superset_constraint(columns, uniques) – Check if unique values of columns are contained in the reference data.
- add_varchar_regex_constraint(column, regex) – Assesses whether the values in a column match a given regular expression pattern.
- add_varchar_regex_constraint_db(column, regex) – Assesses whether the values in a column match a given regular expression pattern.
- append(value) – Append value to the end of the sequence.
- clear()
- count(value)
- extend(values) – Extend sequence by appending elements from the iterable.
- from_expression(expression, name) – Create a WithinRequirement based on a sqlalchemy expression.
- from_raw_query(query, name[, columns]) – Create a WithinRequirement based on a raw query string.
- index(value[, start[, stop]]) – Raises ValueError if the value is not present.
- insert(index, value) – Insert value before index.
- pop([index]) – Raise IndexError if list is empty or index is out of range.
- remove(value) – Remove first occurrence of value.
- reverse() – Reverse in place.
- add_column_existence_constraint
- add_n_rows_equality_constraint
- add_n_rows_max_constraint
- add_n_rows_min_constraint
- add_n_uniques_equality_constraint
- add_null_absence_constraint
- add_varchar_max_length_constraint
- add_varchar_min_length_constraint
- from_table
- test
- Parameters:
  - data_source (DataSource) –
- add_categorical_bound_constraint(columns, distribution, default_bounds=(0, 0), max_relative_violations=0, condition=None, name=None)
Check if the distribution of unique values in columns falls within the specified minimum and maximum bounds.
The CategoricalBoundConstraint is added to ensure the distribution of unique values in the specified columns of a DataSource falls within the given minimum and maximum bounds defined in the distribution parameter.
- Parameters:
  - columns (List[str]) – A list of column names from the DataSource to apply the constraint on.
  - distribution (Dict[T, Tuple[float, float]]) – A dictionary where keys represent unique values and the corresponding tuple values represent the minimum and maximum allowed proportions of the respective unique value in the columns.
  - default_bounds (Tuple[float, float], optional, default=(0, 0)) – A tuple specifying the minimum and maximum allowed proportions for all elements not mentioned in the distribution. By default, it's set to (0, 0), which means all elements not present in distribution will cause a constraint failure.
  - max_relative_violations (float, optional, default=0) – A tolerance threshold (0 to 1) for the proportion of elements in the data that can violate the bound constraints without triggering the constraint violation.
  - condition (Condition, optional) – An optional parameter to specify a Condition object to filter the data before applying the constraint.
  - name (str, optional) – An optional parameter to provide a custom name for the constraint.
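A pure-Python sketch of this check, under one reading of "proportion of elements that can violate the bounds" (rows belonging to a violating category count as violating; datajudge's exact accounting may differ):

```python
from collections import Counter

def categorical_bounds_ok(values, distribution, default_bounds=(0.0, 0.0),
                          max_relative_violations=0.0):
    """Each unique value's proportion must lie within its (min, max) bounds;
    values absent from `distribution` fall back to `default_bounds`."""
    n = len(values)
    violating = 0
    for value, count in Counter(values).items():
        lower, upper = distribution.get(value, default_bounds)
        if not lower <= count / n <= upper:
            violating += count
    return violating / n <= max_relative_violations
```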
- add_column_existence_constraint(columns, name=None)
- Parameters:
  - columns (List[str]) –
  - name (str) –
- add_column_type_constraint(column, column_type, name=None)
Check if a column type matches the expected column_type.
The column_type can be provided as a string (backend-specific type name), a backend-specific SQLAlchemy type, or a SQLAlchemy’s generic type.
If SQLAlchemy’s generic types are used, the check is performed using isinstance, which means that the actual type can also be a subclass of the target type. For more information on SQLAlchemy’s generic types, see https://docs.sqlalchemy.org/en/20/core/type_basics.html
- Parameters:
  - column (str) – The name of the column to which the constraint will be applied.
  - column_type (Union[str, TypeEngine]) – The expected type of the column. This can be a string, a backend-specific SQLAlchemy type, or a generic SQLAlchemy type.
  - name (Optional[str]) – An optional name for the constraint. If not provided, a name will be generated automatically.
- add_date_between_constraint(column, lower_bound, upper_bound, min_fraction, condition=None, name=None)
Use string format: lower_bound="'20121230'".
- Parameters:
  - column (str) –
  - lower_bound (str) –
  - upper_bound (str) –
  - min_fraction (float) –
  - condition (Condition) –
  - name (str) –
- add_date_max_constraint(column, max_value, use_upper_bound_reference=True, column_type='date', condition=None, name=None)
Compare the maximum date in the column against max_value.
Use string format: max_value="'20121230'".
For more information on column_type values, see add_column_type_constraint.
If use_upper_bound_reference, the maximum date of the column has to be smaller or equal to max_value. If not use_upper_bound_reference, the maximum date of the column has to be greater or equal to max_value.
- Parameters:
  - column (str) –
  - max_value (str) –
  - use_upper_bound_reference (bool) –
  - column_type (Union[str, TypeEngine]) –
  - condition (Condition) –
  - name (str) –
- add_date_min_constraint(column, min_value, use_lower_bound_reference=True, column_type='date', condition=None, name=None)
Ensure all dates are greater than or equal to min_value.
Use string format: min_value="'20121230'".
For more information on column_type values, see add_column_type_constraint.
If use_lower_bound_reference, the minimum date of the column has to be greater or equal to min_value. If not use_lower_bound_reference, the minimum date of the column has to be smaller or equal to min_value.
- Parameters:
  - column (str) –
  - min_value (str) –
  - use_lower_bound_reference (bool) –
  - column_type (Union[str, TypeEngine]) –
  - condition (Condition) –
  - name (str) –
- add_date_no_gap_constraint(start_column, end_column, key_columns=None, end_included=True, max_relative_n_violations=0, condition=None, name=None)
Express that date range rows have no gap in-between them.
The table under inspection must consist of at least one but up to many key columns, identifying an entity. Additionally, a start_column and an end_column, indicating start and end dates, should be provided.
Neither of those columns should contain NULL values. Also, it should hold that for a given row, the value of end_column is strictly greater than the value of start_column.
Note that the value of start_column is expected to be included in each date range. By default, the value of end_column is expected to be included as well; this can however be changed by setting end_included to False.
A 'key' is a fixed set of values in key_columns and represents an entity of interest. A priori, a key is not a primary key, i.e., a key can have and often has several rows. Thereby, a key will often come with several date ranges.
If key_columns is None or [], all columns of the table will be considered as composing the key.
In order to express a tolerance for some violations of this gap property, use the max_relative_n_violations parameter. The latter expresses for what fraction of all key values, at least one gap may exist.
For illustrative examples of this constraint, please refer to its test cases.
- Parameters:
  - start_column (str) –
  - end_column (str) –
  - key_columns (Optional[List[str]]) –
  - end_included (bool) –
  - max_relative_n_violations (float) –
  - condition (Condition) –
  - name (str) –
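The per-key gap check can be sketched for in-memory date ranges (illustrative only; this assumes day granularity, whereas datajudge performs the check on the database side):

```python
from datetime import date, timedelta

def has_gap(ranges, end_included=True):
    """Given (start, end) date pairs for one key, report whether any day
    between the earliest start and the latest end is uncovered."""
    one_day = timedelta(days=1)
    covered_until = None  # latest day covered so far (inclusive)
    for start, end in sorted(ranges):
        last = end if end_included else end - one_day
        if covered_until is not None and start > covered_until + one_day:
            return True  # uncovered day(s) before this range starts
        if covered_until is None or last > covered_until:
            covered_until = last
    return False
```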
- add_date_no_overlap_2d_constraint(start_column1, end_column1, start_column2, end_column2, key_columns=None, end_included=True, max_relative_n_violations=0, condition=None, name=None)
Express that several date range rows do not overlap in two date dimensions.
The table under inspection must consist of at least one but up to many key columns, identifying an entity. Per date dimension, a start_column and an end_column should be provided.
For a given row in this table, start_column1 and end_column1 indicate a date range. Moreover, for that same row, start_column2 and end_column2 indicate a date range. These date ranges are expected to represent different date 'dimensions'. Example: A row indicates a forecasted value used in production. start_column1 and end_column1 represent the timespan that was forecasted, e.g. the weather from next Saturday to next Sunday. start_column2 and end_column2 might indicate the timespan when this forecast was used, e.g. from the previous Monday to Wednesday.
Neither of those columns should contain NULL values. Also it should hold that for a given row, the value of end_column is strictly greater than the value of start_column.
Note that the values of start_column1 and start_column2 are expected to be included in each date range. By default, the values of end_column1 and end_column2 are expected to be included as well; this can however be changed by setting end_included to False.
A 'key' is a fixed set of values in key_columns and represents an entity of interest. A priori, a key is not a primary key, i.e., a key can have and often has several rows. Thereby, a key will often come with several date ranges.
Often, you might want the date ranges for a given key not to overlap.
If key_columns is None or [], all columns of the table will be considered as composing the key.
In order to express a tolerance for some violations of this non-overlapping property, use the max_relative_n_violations parameter. The latter expresses for what fraction of all key values, at least one overlap may exist.
For illustrative examples of this constraint, please refer to its test cases.
- Parameters:
  - start_column1 (str) –
  - end_column1 (str) –
  - start_column2 (str) –
  - end_column2 (str) –
  - key_columns (Optional[List[str]]) –
  - end_included (bool) –
  - max_relative_n_violations (float) –
  - condition (Condition) –
  - name (str) –
- add_date_no_overlap_constraint(start_column, end_column, key_columns=None, end_included=True, max_relative_n_violations=0, condition=None, name=None)
Constraint expressing that several date range rows may not overlap.
The DataSource under inspection must consist of at least one but up to many key_columns, identifying an entity, a start_column and an end_column.
For a given row in this DataSource, start_column and end_column indicate a date range. Neither of those columns should contain NULL values. Also, it should hold that for a given row, the value of end_column is strictly greater than the value of start_column.
Note that the value of start_column is expected to be included in each date range. By default, the value of end_column is expected to be included as well - this can however be changed by setting end_included to False.
A ‘key’ is a fixed set of values in key_columns and represents an entity of interest. A priori, a key is not a primary key, i.e., a key can have and often has several rows. Thereby, a key will often come with several date ranges.
Often, you might want the date ranges for a given key not to overlap.
If key_columns is None or [], all columns of the table will be considered as composing the key.
In order to express a tolerance for some violations of this non-overlapping property, use the max_relative_n_violations parameter. The latter expresses for what fraction of all key values, at least one overlap may exist.
For illustrative examples of this constraint, please refer to its test cases.
- Parameters:
start_column (str) –
end_column (str) –
key_columns (Optional[List[str]]) –
end_included (bool) –
max_relative_n_violations (float) –
condition (Condition) –
name (str) –
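The non-overlap property this constraint checks can be sketched in plain Python. This is an illustration of the checked logic only, not datajudge's implementation (which runs in the database); the function name and sample data are hypothetical:

```python
from datetime import date

def keys_with_overlap(rows, end_included=True):
    """Group (key, start, end) rows by key and report keys whose
    date ranges overlap. Illustrative sketch of the checked property."""
    by_key = {}
    for key, start, end in rows:
        by_key.setdefault(key, []).append((start, end))
    violating = set()
    for key, ranges in by_key.items():
        ranges.sort()
        for (s1, e1), (s2, e2) in zip(ranges, ranges[1:]):
            # With end_included=True, a range ending on day X overlaps
            # a range starting on day X; otherwise they merely touch.
            if s2 < e1 or (end_included and s2 == e1):
                violating.add(key)
    return violating

rows = [
    ("a", date(2021, 1, 1), date(2021, 1, 5)),
    ("a", date(2021, 1, 5), date(2021, 1, 9)),   # shares endpoint with previous
    ("b", date(2021, 1, 1), date(2021, 1, 2)),
    ("b", date(2021, 1, 3), date(2021, 1, 4)),
]
```

With end_included=True, key "a" violates the constraint because its second range starts on the day the first one ends; with end_included=False the two ranges merely touch.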
- add_functional_dependency_constraint(key_columns, value_columns, condition=None, name=None)
Expresses a functional dependency, a constraint where the value_columns are uniquely determined by the key_columns. This means that for each unique combination of values in the key_columns, there is exactly one corresponding combination of values in the value_columns.
The add_unique_constraint constraint is a special case of this constraint, where the key_columns are a primary key and all other columns are included as value_columns. This constraint allows for a more general definition of functional dependencies, where the key_columns are not necessarily a primary key.
For more information on functional dependencies, see https://en.wikipedia.org/wiki/Functional_dependency.
- Parameters:
key_columns (List[str]) –
value_columns (List[str]) –
condition (Condition) –
name (str) –
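The asserted property can be sketched in plain Python: each distinct key tuple must map to exactly one value tuple. This is a conceptual illustration under hypothetical example data, not datajudge's implementation:

```python
def fd_violations(rows, key_idx, value_idx):
    """Return key tuples that map to more than one value tuple,
    i.e. violations of the functional dependency key -> value."""
    seen = {}
    for row in rows:
        key = tuple(row[i] for i in key_idx)
        value = tuple(row[i] for i in value_idx)
        seen.setdefault(key, set()).add(value)
    return {k for k, values in seen.items() if len(values) > 1}

# Columns: (customer_id, country, city) -- hypothetical example data.
rows = [
    (1, "DE", "Berlin"),
    (1, "DE", "Berlin"),   # exact duplicate rows do not violate the dependency
    (2, "DE", "Hamburg"),
    (2, "DE", "Munich"),   # customer 2 maps to two cities: violation
]
```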
- add_groupby_aggregation_constraint(columns, aggregation_column, start_value, tolerance=0, condition=None, name=None)
Check whether array aggregate corresponds to an integer range.
The DataSource is grouped by columns. SQL’s array_agg function is then applied to the aggregation_column.
Since we expect aggregation_column to be a numeric column, this leads to a multiset of aggregated values. These values should correspond to the integers ranging from start_value to the cardinality of the multiset.
In order to allow for slight deviations from this pattern, tolerance expresses the fraction of all grouped-by rows which may be incomplete ranges.
- Parameters:
columns (Sequence[str]) –
aggregation_column (str) –
start_value (int) –
tolerance (float) –
condition (Condition) –
name (str) –
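The expected pattern can be sketched as follows. This is a pure-Python illustration of the checked property; the actual constraint applies array_agg in the database:

```python
def incomplete_range_fraction(groups, start_value=1):
    """Fraction of groups whose aggregated values do not form the
    consecutive integer range start_value .. start_value + len - 1."""
    bad = 0
    for values in groups:
        expected = set(range(start_value, start_value + len(values)))
        # Duplicates shrink the value set below the expected cardinality,
        # so they also count as incomplete ranges here.
        if set(values) != expected:
            bad += 1
    return bad / len(groups)

groups = [
    [1, 2, 3],      # complete range
    [2, 1, 3, 4],   # order does not matter
    [1, 3, 4],      # 2 is missing: incomplete
    [1, 2, 2],      # duplicates break the range
]
```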
- add_max_null_fraction_constraint(column, max_null_fraction, condition=None, name=None)
Assert that column has less than a certain fraction of NULL values.
max_null_fraction is expected to lie within [0, 1].
- Parameters:
column (str) –
max_null_fraction (float) –
condition (Condition) –
name (str) –
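The checked quantity is simply the share of NULL entries. A minimal pure-Python sketch, using None to mirror NULL (illustrative only, not datajudge's implementation):

```python
def null_fraction(values):
    """Fraction of None entries, mirroring NULL values in a column."""
    return sum(v is None for v in values) / len(values)

def satisfies_max_null_fraction(values, max_null_fraction):
    # The constraint passes when the observed fraction stays within bounds.
    return null_fraction(values) <= max_null_fraction
```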
- add_n_rows_equality_constraint(n_rows, condition=None, name=None)
- Parameters:
n_rows (int) –
condition (Condition) –
name (str) –
- add_n_rows_max_constraint(n_rows_max, condition=None, name=None)
- Parameters:
n_rows_max (int) –
condition (Condition) –
name (str) –
- add_n_rows_min_constraint(n_rows_min, condition=None, name=None)
- Parameters:
n_rows_min (int) –
condition (Condition) –
name (str) –
- add_n_uniques_equality_constraint(columns, n_uniques, condition=None, name=None)
- Parameters:
columns (Optional[List[str]]) –
n_uniques (int) –
condition (Condition) –
name (str) –
- add_null_absence_constraint(column, condition=None, name=None)
- Parameters:
column (str) –
condition (Condition) –
name (str) –
- add_numeric_between_constraint(column, lower_bound, upper_bound, min_fraction, condition=None, name=None)
Assert that the column’s values lie between lower_bound and upper_bound.
Note that both bounds are inclusive.
Unless specified otherwise via the usage of a condition, NULL values will be considered in the denominator of min_fraction. NULL values will never be considered to lie in the interval [lower_bound, upper_bound].
- Parameters:
column (str) –
lower_bound (float) –
upper_bound (float) –
min_fraction (float) –
condition (Condition) –
name (str) –
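The NULL handling described above can be sketched in plain Python, with None standing in for NULL (an illustration of the semantics, not datajudge's implementation):

```python
def fraction_between(values, lower_bound, upper_bound):
    """Fraction of values inside [lower_bound, upper_bound].
    None mirrors NULL: it is counted in the denominator but is
    never considered to lie in the interval."""
    in_range = sum(
        v is not None and lower_bound <= v <= upper_bound for v in values
    )
    return in_range / len(values)
```

The constraint would then require fraction_between(...) >= min_fraction.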
- add_numeric_max_constraint(column, max_value, condition=None, name=None)
All values in column are less than or equal to max_value.
- Parameters:
column (str) –
max_value (float) –
condition (Condition) –
name (str) –
- add_numeric_mean_constraint(column, mean_value, max_absolute_deviation, condition=None, name=None)
Assert that the mean of the column deviates at most max_absolute_deviation from mean_value.
- Parameters:
column (str) –
mean_value (float) –
max_absolute_deviation (float) –
condition (Condition) –
name (str) –
- add_numeric_min_constraint(column, min_value, condition=None)
All values in column are greater than or equal to min_value.
- Parameters:
column (str) –
min_value (float) –
condition (Condition) –
- add_numeric_no_gap_constraint(start_column, end_column, key_columns=None, legitimate_gap_size=0, max_relative_n_violations=0, condition=None, name=None)
Express that numeric interval rows have no gaps larger than some max value in-between them. The table under inspection must consist of at least one but up to many key columns, identifying an entity. Additionally, a start_column and an end_column, indicating interval start and end values, should be provided.
Neither of those columns should contain NULL values. Also, it should hold that for a given row, the value of end_column is strictly greater than the value of start_column.
legitimate_gap_size is the maximum tolerated gap size between two intervals.
A ‘key’ is a fixed set of values in key_columns and represents an entity of interest. A priori, a key is not a primary key, i.e., a key can have and often has several rows. Thereby, a key will often come with several intervals.
If key_columns is None or [], all columns of the table will be considered as composing the key.
In order to express a tolerance for some violations of this gap property, use the max_relative_n_violations parameter. The latter expresses for what fraction of all key values, at least one gap may exist.
For illustrative examples of this constraint, please refer to its test cases.
- Parameters:
start_column (str) –
end_column (str) –
key_columns (Optional[List[str]]) –
legitimate_gap_size (float) –
max_relative_n_violations (float) –
condition (Condition) –
name (str) –
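The gap property can be sketched in plain Python. This is an illustration of the checked logic under hypothetical sample data, not datajudge's implementation:

```python
def keys_with_gaps(rows, legitimate_gap_size=0):
    """Group (key, start, end) intervals by key and report keys with a
    gap larger than legitimate_gap_size between consecutive intervals."""
    by_key = {}
    for key, start, end in rows:
        by_key.setdefault(key, []).append((start, end))
    violating = set()
    for key, intervals in by_key.items():
        intervals.sort()
        for (_, e1), (s2, _) in zip(intervals, intervals[1:]):
            # A gap is the distance between one interval's end and the
            # next interval's start, after sorting by start value.
            if s2 - e1 > legitimate_gap_size:
                violating.add(key)
    return violating

rows = [
    ("a", 0, 10), ("a", 10, 20),   # contiguous intervals
    ("b", 0, 10), ("b", 15, 20),   # gap of size 5
]
```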
- add_numeric_no_overlap_constraint(start_column, end_column, key_columns=None, end_included=True, max_relative_n_violations=0, condition=None, name=None)
Constraint expressing that several numeric interval rows may not overlap.
The DataSource under inspection must consist of at least one but up to many key_columns, identifying an entity, a start_column and an end_column.
For a given row in this DataSource, start_column and end_column indicate a numeric interval. Neither of those columns should contain NULL values. Also, it should hold that for a given row, the value of end_column is strictly greater than the value of start_column.
Note that the value of start_column is expected to be included in each interval. By default, the value of end_column is expected to be included as well - this can however be changed by setting end_included to False.
A ‘key’ is a fixed set of values in key_columns and represents an entity of interest. A priori, a key is not a primary key, i.e., a key can have and often has several rows. Thereby, a key will often come with several intervals.
Often, you might want the intervals for a given key not to overlap.
If key_columns is None or [], all columns of the table will be considered as composing the key.
In order to express a tolerance for some violations of this non-overlapping property, use the max_relative_n_violations parameter. The latter expresses for what fraction of all key values, at least one overlap may exist.
For illustrative examples of this constraint, please refer to its test cases.
- Parameters:
start_column (str) –
end_column (str) –
key_columns (Optional[List[str]]) –
end_included (bool) –
max_relative_n_violations (float) –
condition (Condition) –
name (str) –
- add_numeric_percentile_constraint(column, percentage, expected_percentile, max_absolute_deviation=None, max_relative_deviation=None, condition=None, name=None)
Assert that the percentage-th percentile is approximately expected_percentile.
The percentile is defined as the value present in column for which percentage % of the values in column are less than or equal. NULL values are ignored.
Hence, if percentage is less than the inverse of the number of non-NULL rows, None is received as the percentage-th percentile.
percentage is expected to be provided in percent. The median, for example, would correspond to percentage=50.
At least one of max_absolute_deviation and max_relative_deviation must be provided.
- Parameters:
column (str) –
percentage (float) –
expected_percentile (float) –
max_absolute_deviation (Optional[float]) –
max_relative_deviation (Optional[float]) –
condition (Condition) –
name (str) –
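The percentile definition above (the smallest value for which percentage % of the non-NULL values are less than or equal) can be sketched in plain Python. This is a simplified illustration that ignores the edge case around very small percentages described above; it is not datajudge's implementation:

```python
def percentile(values, percentage):
    """Smallest non-None value v such that at least percentage % of
    the non-None values are <= v; None if no value qualifies."""
    data = sorted(v for v in values if v is not None)
    for v in data:
        if sum(x <= v for x in data) / len(data) * 100 >= percentage:
            return v
    return None
```

For example, on [1, 2, 3, 4] the value 2 is the 50th percentile, since exactly half of the values are less than or equal to it.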
- add_primary_key_definition_constraint(primary_keys, name=None)
Check that the primary key constraints in the database are exactly equal to the given column names.
Note that this doesn’t actually check that the primary key values are unique across the table.
- Parameters:
primary_keys (List[str]) –
name (str) –
- add_uniqueness_constraint(columns=None, max_duplicate_fraction=0, condition=None, max_absolute_n_duplicates=0, infer_pk_columns=False, name=None)
Columns should uniquely identify a row.
Given a set of columns, this constraint asserts that the columns satisfy the uniqueness condition of a primary key, i.e. that tuples drawn from said columns are unique. The constraint has a tolerance for inconsistencies, expressed via max_duplicate_fraction. The latter suggests that the number of unique tuples from said columns is greater than or equal to (1 - max_duplicate_fraction) times the number of rows.
If infer_pk_columns is True, columns will be retrieved from the primary keys. When columns=None and infer_pk_columns=False, the fallback is validating that all rows in a table are unique.
- Parameters:
columns (List[str]) –
max_duplicate_fraction (float) –
condition (Condition) –
max_absolute_n_duplicates (int) –
infer_pk_columns (bool) –
name (str) –
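The tolerance rule above can be sketched in plain Python (illustrative only; column indices and sample data are hypothetical):

```python
def satisfies_uniqueness(rows, columns_idx, max_duplicate_fraction=0.0):
    """Check that the number of unique tuples drawn from the given
    columns is >= (1 - max_duplicate_fraction) * number of rows."""
    tuples = [tuple(row[i] for i in columns_idx) for row in rows]
    n_unique = len(set(tuples))
    return n_unique >= (1 - max_duplicate_fraction) * len(tuples)

rows = [(1, "a"), (1, "a"), (2, "b"), (3, "c")]  # one duplicate tuple
```

With four rows and three unique tuples, the strict check fails, while a max_duplicate_fraction of 0.25 tolerates the single duplicate.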
- add_uniques_equality_constraint(columns, uniques, map_func=None, reduce_func=None, condition=None, name=None)
Check if the data’s unique values are equal to a given set of values.
The UniquesEquality constraint asserts that the values contained in a column of a DataSource are exactly the ones of a reference set of expected values, specified via the uniques parameter.
See the Uniques class for further parameter details on map_func and reduce_func.
- Parameters:
columns (List[str]) –
uniques (Collection[T]) –
map_func (Callable[[T], T]) –
reduce_func (Callable[[Collection], Collection]) –
condition (Condition) –
name (str) –
- add_uniques_subset_constraint(columns, uniques, max_relative_violations=0, map_func=None, reduce_func=None, condition=None, name=None)
Check if the data’s unique values are contained in a given set of values.
The UniquesSubset constraint asserts that the values contained in a column of a DataSource are part of a reference set of expected values, specified via uniques.
Null values in the column are ignored. To assert their non-existence, use the NullAbsence constraint via the add_null_absence_constraint helper method of WithinRequirement.
max_relative_violations indicates what fraction of rows of the given table may have values not included in the reference set of unique values. Please note that UniquesSubset and UniquesSuperset are not symmetrical in this regard.
See Uniques for further details on map_func and reduce_func.
- Parameters:
columns (List[str]) –
uniques (Collection[T]) –
max_relative_violations (float) –
map_func (Callable[[T], T]) –
reduce_func (Callable[[Collection], Collection]) –
condition (Condition) –
name (str) –
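The asymmetry between the subset and superset checks can be sketched in plain Python: the subset check counts violating rows, while the superset check counts missing reference values. This is an illustration of the two denominators, not datajudge's implementation:

```python
def subset_violation_fraction(values, uniques):
    """UniquesSubset-style check: fraction of non-None *rows* whose
    value lies outside the reference set."""
    data = [v for v in values if v is not None]
    reference = set(uniques)
    return sum(v not in reference for v in data) / len(data)

def superset_violation_fraction(values, uniques):
    """UniquesSuperset-style check: fraction of *reference values*
    not present among the column's values."""
    present = {v for v in values if v is not None}
    reference = set(uniques)
    return sum(u not in reference or u not in present for u in reference) / len(reference)

values = ["a", "a", "b", "x", None]
```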
- add_uniques_superset_constraint(columns, uniques, max_relative_violations=0, map_func=None, reduce_func=None, condition=None, name=None)
Check if unique values of columns are contained in the reference data.
The UniquesSuperset constraint asserts that a reference set of expected values, specified via uniques, is contained in the given columns of a DataSource.
Null values in the column are ignored. To assert their non-existence, use the NullAbsence constraint via the add_null_absence_constraint helper method of WithinRequirement.
max_relative_violations indicates what fraction of unique values of the given DataSource may not be represented in the reference set of unique values. Please note that UniquesSubset and UniquesSuperset are not symmetrical in this regard.
One use of this constraint is to test for consistency in columns with expected categorical values.
See Uniques for further details on map_func and reduce_func.
- Parameters:
columns (List[str]) –
uniques (Collection[T]) –
max_relative_violations (float) –
map_func (Callable[[T], T]) –
reduce_func (Callable[[Collection], Collection]) –
condition (Condition) –
name (str) –
- add_varchar_max_length_constraint(column, max_length, condition=None, name=None)
- Parameters:
column (str) –
max_length (int) –
condition (Condition) –
name (str) –
- add_varchar_min_length_constraint(column, min_length, condition=None, name=None)
- Parameters:
column (str) –
min_length (int) –
condition (Condition) –
name (str) –
- add_varchar_regex_constraint(column, regex, condition=None, name=None, allow_none=False, relative_tolerance=0.0, aggregated=True, n_counterexamples=5)
Assesses whether the values in a column match a given regular expression pattern.
The option allow_none can be used in cases where the column is defined as nullable and contains null values.
How the tolerance factor is calculated can be controlled with the aggregated flag. When True, the tolerance is calculated using unique values. If not, the tolerance is calculated using all instances of the data.
n_counterexamples defines how many counterexamples are displayed in an assertion text. If all counterexamples are meant to be shown, provide -1 as an argument.
When using this method, the regex matching will take place in memory. If you would instead like the matching to take place in the database, which is typically faster and substantially more memory-efficient, please consider using add_varchar_regex_constraint_db.
- Parameters:
column (str) –
regex (str) –
condition (Condition) –
name (str) –
allow_none (bool) –
relative_tolerance (float) –
aggregated (bool) –
n_counterexamples (int) –
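The effect of the aggregated flag on the tolerance denominator can be sketched in plain Python. This illustrates the counting logic only; datajudge's exact matching semantics may differ:

```python
import re

def regex_violation_fraction(values, pattern, aggregated=True, allow_none=False):
    """Fraction of values not fully matching the pattern. With
    aggregated=True the fraction is computed over unique values,
    otherwise over all rows."""
    if allow_none:
        values = [v for v in values if v is not None]
    data = set(values) if aggregated else list(values)
    compiled = re.compile(pattern)
    return sum(compiled.fullmatch(v) is None for v in data) / len(data)

values = ["a1", "a1", "b2", "oops"]
```

Over unique values, one of three values ("oops") violates the pattern; over all rows, one of four does. The constraint would compare this fraction against relative_tolerance.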
- add_varchar_regex_constraint_db(column, regex, condition=None, name=None, relative_tolerance=0.0, aggregated=True, n_counterexamples=5)
Assesses whether the values in a column match a given regular expression pattern.
How the tolerance factor is calculated can be controlled with the aggregated flag. When True, the tolerance is calculated using unique values. If not, the tolerance is calculated using all instances of the data.
n_counterexamples defines how many counterexamples are displayed in an assertion text. If all counterexamples are meant to be shown, provide -1 as an argument.
When using this method, the regex matching will take place in the database, which is only supported for Postgres, SQLite and Snowflake. Note that for Snowflake, this feature is only available when using sqlalchemy-snowflake >= 1.4.0. As an alternative, add_varchar_regex_constraint performs the regex matching in memory. This is typically slower and more expensive in terms of memory, but available on all supported database management systems.
- Parameters:
column (str) –
regex (str) –
condition (Condition) –
name (str) –
relative_tolerance (float) –
aggregated (bool) –
n_counterexamples (int) –
- classmethod from_expression(expression, name)
Create a WithinRequirement based on a sqlalchemy expression.
Any sqlalchemy object implementing the alias method can be passed as an argument for the expression parameter. This could, e.g., be a sqlalchemy.Table object or the result of a sqlalchemy.select call.
The name will be used to represent this expression in error messages.
- Parameters:
expression (FromClause) –
name (str) –
- classmethod from_raw_query(query, name, columns=None)
Create a WithinRequirement based on a raw query string.
The query parameter can be passed any query string returning rows, e.g. "SELECT * FROM myschema.mytable LIMIT 1337" or "SELECT id, name FROM table1 UNION SELECT id, name FROM table2".
The name will be used to represent this query in error messages.
If constraints rely on specific columns, these should be provided here via columns, e.g. ["id", "name"].
- Parameters:
query (str) –
name (str) –
columns (List[str]) –
- classmethod from_table(db_name, schema_name, table_name)
- Parameters:
db_name (str) –
schema_name (str) –
table_name (str) –