datajudge package

Submodules

Module contents

datajudge allows you to assess whether data from a database complies with reference information.

class datajudge.BetweenRequirement(data_source, data_source2, date_column=None, date_column2=None)

Bases: Requirement

Methods

add_column_subset_constraint([name])

Columns of first table are subset of second table.

add_column_superset_constraint([name])

Columns of first table are superset of columns of second table.

add_column_type_constraint(column1, column2)

Check that the columns have the same type.

add_date_max_constraint(column1, column2[, ...])

Compare date max of first table to date max of second table.

add_date_min_constraint(column1, column2[, ...])

Ensure the date min of the first table is greater than or equal to the date min of the second table.

add_ks_2sample_constraint(column1, column2)

Apply the so-called two-sample Kolmogorov-Smirnov test to the distributions of the two given columns.

add_max_null_fraction_constraint(column1, ...)

Assert that the fraction of NULL values of one is at most that of the other.

add_n_rows_max_gain_constraint([...])

#rows from first table <= #rows from second table * (1 + max_growth).

add_n_rows_max_loss_constraint([...])

#rows from first table >= #rows from second table * (1 - max_loss).

add_n_rows_min_gain_constraint([...])

#rows from first table >= #rows from second table * (1 + min_growth).

add_n_uniques_max_gain_constraint(columns1, ...)

#uniques of first table <= #uniques of second table * (1 + max_growth).

add_n_uniques_max_loss_constraint(columns1, ...)

#uniques in first table >= #uniques in second table * (1 - max_loss).

add_numeric_percentile_constraint(column1, ...)

Assert that the percentage-th percentile is approximately equal.

add_row_equality_constraint(columns1, ...[, ...])

At most max_missing_fraction of rows in T1 and T2 are absent in either.

add_row_matching_equality_constraint(...[, ...])

Match tables in matching_columns, compare for equality in comparison_columns.

add_row_subset_constraint(columns1, ...[, ...])

At most max_missing_fraction of rows in T1 are not in T2.

add_row_superset_constraint(columns1, ...[, ...])

At most max_missing_fraction of rows in T2 are not in T1.

add_uniques_equality_constraint(columns1, ...)

Check if the data's unique values in given columns are equal.

add_uniques_subset_constraint(columns1, columns2)

Check if the given columns' unique values are contained in the reference data.

add_uniques_superset_constraint(columns1, ...)

Check if unique values of columns are contained in the reference data.

append(value)

S.append(value) -- append value to the end of the sequence

clear()

count(value)

extend(values)

S.extend(iterable) -- extend sequence by appending elements from the iterable

from_expressions(expression1, expression2, ...)

Create a BetweenRequirement based on sqlalchemy expressions.

from_raw_queries(query1, query2, name1, name2)

Create a BetweenRequirement based on raw query strings.

index(value, [start, [stop]])

Raises ValueError if the value is not present.

insert(index, value)

S.insert(index, value) -- insert value before index

pop([index])

Raise IndexError if list is empty or index is out of range.

remove(value)

S.remove(value) -- remove first occurrence of value.

reverse()

S.reverse() -- reverse IN PLACE

add_n_rows_equality_constraint

add_n_uniques_equality_constraint

add_numeric_max_constraint

add_numeric_mean_constraint

add_numeric_min_constraint

add_varchar_max_length_constraint

add_varchar_min_length_constraint

from_tables

get_date_growth_rate

get_deviation_getter

test

Parameters:
  • data_source (DataSource) –

  • data_source2 (DataSource) –

  • date_column (Optional[str]) –

  • date_column2 (Optional[str]) –

add_column_subset_constraint(name=None)

Columns of first table are subset of second table.

Parameters:

name (str) –

add_column_superset_constraint(name=None)

Columns of first table are superset of columns of second table.

Parameters:

name (str) –

add_column_type_constraint(column1, column2, name=None)

Check that the columns have the same type.

Parameters:
  • column1 (str) –

  • column2 (str) –

  • name (str) –

add_date_max_constraint(column1, column2, use_upper_bound_reference=True, column_type='date', condition1=None, condition2=None, name=None)

Compare date max of first table to date max of second table.

The used columns of both tables need to be of the same type.

For more information on column_type values, see add_column_type_constraint.

If use_upper_bound_reference, the max of the first table has to be smaller or equal to the max of the second table. If not use_upper_bound_reference, the max of the first table has to be greater or equal to the max of the second table.

Parameters:
  • column1 (str) –

  • column2 (str) –

  • use_upper_bound_reference (bool) –

  • column_type (Union[str, TypeEngine]) –

  • condition1 (Condition) –

  • condition2 (Condition) –

  • name (str) –

add_date_min_constraint(column1, column2, use_lower_bound_reference=True, column_type='date', condition1=None, condition2=None, name=None)

Ensure the date min of the first table is greater than or equal to the date min of the second table.

The used columns of both tables need to be of the same type.

For more information on column_type values, see add_column_type_constraint.

If use_lower_bound_reference, the min of the first table has to be greater or equal to the min of the second table. If not use_lower_bound_reference, the min of the first table has to be smaller or equal to the min of the second table.

Parameters:
  • column1 (str) –

  • column2 (str) –

  • use_lower_bound_reference (bool) –

  • column_type (Union[str, TypeEngine]) –

  • condition1 (Condition) –

  • condition2 (Condition) –

  • name (str) –

add_ks_2sample_constraint(column1, column2, condition1=None, condition2=None, name=None, significance_level=0.05)

Apply the so-called two-sample Kolmogorov-Smirnov test to the distributions of the two given columns. The constraint is fulfilled when the resulting p-value of the test is higher than the significance level (default is 0.05, i.e., 5%). The significance_level must be a value between 0.0 and 1.0.

Parameters:
  • column1 (str) –

  • column2 (str) –

  • condition1 (Condition) –

  • condition2 (Condition) –

  • name (str) –

  • significance_level (float) –

add_max_null_fraction_constraint(column1, column2, max_relative_deviation, condition1=None, condition2=None, name=None)

Assert that the fraction of NULL values of one is at most that of the other.

Given that column2’s underlying data has a fraction q of NULL values, the max_relative_deviation parameter allows column1’s underlying data to have a fraction (1 + max_relative_deviation) * q of NULL values.

Parameters:
  • column1 (str) –

  • column2 (str) –

  • max_relative_deviation (float) –

  • condition1 (Condition) –

  • condition2 (Condition) –

  • name (str) –

add_n_rows_equality_constraint(condition1=None, condition2=None, name=None)
Parameters:
  • condition1 (Condition) –

  • condition2 (Condition) –

  • name (str) –

add_n_rows_max_gain_constraint(constant_max_relative_gain=None, date_range_gain_deviation=None, condition1=None, condition2=None, name=None)

#rows from first table <= #rows from second table * (1 + max_growth).

See readme for more information on max_growth.

Parameters:
  • constant_max_relative_gain (Optional[float]) –

  • date_range_gain_deviation (Optional[float]) –

  • condition1 (Condition) –

  • condition2 (Condition) –

  • name (str) –

add_n_rows_max_loss_constraint(constant_max_relative_loss=None, date_range_loss_deviation=None, condition1=None, condition2=None, name=None)

#rows from first table >= #rows from second table * (1 - max_loss).

See readme for more information on max_loss.

Parameters:
  • constant_max_relative_loss (Optional[float]) –

  • date_range_loss_deviation (Optional[float]) –

  • condition1 (Condition) –

  • condition2 (Condition) –

  • name (str) –

add_n_rows_min_gain_constraint(constant_min_relative_gain=None, date_range_gain_deviation=None, condition1=None, condition2=None, name=None)

#rows from first table >= #rows from second table * (1 + min_growth).

See readme for more information on min_growth.

Parameters:
  • constant_min_relative_gain (Optional[float]) –

  • date_range_gain_deviation (Optional[float]) –

  • condition1 (Condition) –

  • condition2 (Condition) –

  • name (str) –

add_n_uniques_equality_constraint(columns1, columns2, condition1=None, condition2=None, name=None)
Parameters:
  • columns1 (Optional[List[str]]) –

  • columns2 (Optional[List[str]]) –

  • condition1 (Condition) –

  • condition2 (Condition) –

  • name (str) –

add_n_uniques_max_gain_constraint(columns1, columns2, constant_max_relative_gain=None, date_range_gain_deviation=None, condition1=None, condition2=None, name=None)

#uniques of first table <= #uniques of second table * (1 + max_growth).

#uniques in first table are defined based on columns1, #uniques in second table are defined based on columns2.

See readme for more information on max_growth.

Parameters:
  • columns1 (Optional[List[str]]) –

  • columns2 (Optional[List[str]]) –

  • constant_max_relative_gain (Optional[float]) –

  • date_range_gain_deviation (Optional[float]) –

  • condition1 (Condition) –

  • condition2 (Condition) –

  • name (str) –

add_n_uniques_max_loss_constraint(columns1, columns2, constant_max_relative_loss=None, date_range_loss_deviation=None, condition1=None, condition2=None, name=None)

#uniques in first table >= #uniques in second table * (1 - max_loss).

#uniques in first table are defined based on columns1, #uniques in second table are defined based on columns2.

See readme for more information on max_loss.

Parameters:
  • columns1 (Optional[List[str]]) –

  • columns2 (Optional[List[str]]) –

  • constant_max_relative_loss (Optional[float]) –

  • date_range_loss_deviation (Optional[float]) –

  • condition1 (Condition) –

  • condition2 (Condition) –

  • name (str) –

add_numeric_max_constraint(column1, column2, condition1=None, condition2=None, name=None)
Parameters:
  • column1 (str) –

  • column2 (str) –

  • condition1 (Condition) –

  • condition2 (Condition) –

  • name (str) –

add_numeric_mean_constraint(column1, column2, max_absolute_deviation, condition1=None, condition2=None, name=None)
Parameters:
  • column1 (str) –

  • column2 (str) –

  • max_absolute_deviation (float) –

  • condition1 (Condition) –

  • condition2 (Condition) –

  • name (str) –

add_numeric_min_constraint(column1, column2, condition1=None, condition2=None, name=None)
Parameters:
  • column1 (str) –

  • column2 (str) –

  • condition1 (Condition) –

  • condition2 (Condition) –

  • name (str) –

add_numeric_percentile_constraint(column1, column2, percentage, max_absolute_deviation=None, max_relative_deviation=None, condition1=None, condition2=None, name=None)

Assert that the percentage-th percentile is approximately equal.

The percentile is defined as the value present in column1 / column2 for which percentage % of the values in column1 / column2 are less or equal. NULL values are ignored.

Hence, if percentage is less than the inverse of the number of non-NULL rows, the percentage-th percentile is None.

percentage is expected to be provided in percent. The median, for example, would correspond to percentage=50.

At least one of max_absolute_deviation and max_relative_deviation must be provided.

Parameters:
  • column1 (str) –

  • column2 (str) –

  • percentage (float) –

  • max_absolute_deviation (Optional[float]) –

  • max_relative_deviation (Optional[float]) –

  • condition1 (Optional[Condition]) –

  • condition2 (Optional[Condition]) –

  • name (str) –

add_row_equality_constraint(columns1, columns2, max_missing_fraction, condition1=None, condition2=None, name=None)

At most max_missing_fraction of rows in T1 and T2 are absent in either.

In other words, \(\frac{|T1 - T2| + |T2 - T1|}{|T1 \cup T2|} \leq\) max_missing_fraction. Rows from T1 are indexed in columns1, rows from T2 are indexed in columns2.

Parameters:
  • columns1 (Optional[List[str]]) –

  • columns2 (Optional[List[str]]) –

  • max_missing_fraction (float) –

  • condition1 (Condition) –

  • condition2 (Condition) –

  • name (str) –

add_row_matching_equality_constraint(matching_columns1, matching_columns2, comparison_columns1, comparison_columns2, max_missing_fraction, condition1=None, condition2=None, name=None)

Match tables in matching_columns, compare for equality in comparison_columns.

This constraint is similar in nature to the RowEquality constraint. Just like the latter, it divides the cardinality of an intersection by the cardinality of a union. The difference lies in how the sets are created. While RowEquality considers all rows of both tables, indexed in columns, RowMatchingEquality considers only rows in both tables having values in matching_columns present in both tables. At most max_missing_fraction of such rows may be missing in the intersection.

Alternatively, this can be thought of as counting mismatches in comparison_columns after performing an inner join on matching_columns.

Parameters:
  • matching_columns1 (List[str]) –

  • matching_columns2 (List[str]) –

  • comparison_columns1 (List[str]) –

  • comparison_columns2 (List[str]) –

  • max_missing_fraction (float) –

  • condition1 (Condition) –

  • condition2 (Condition) –

  • name (str) –

add_row_subset_constraint(columns1, columns2, constant_max_missing_fraction, date_range_loss_fraction=None, condition1=None, condition2=None, name=None)

At most max_missing_fraction of rows in T1 are not in T2.

In other words, \(\frac{|T1-T2|}{|T1|} \leq\) max_missing_fraction. Rows from T1 are indexed in columns1, rows from T2 are indexed in columns2.

In particular, the operation |T1-T2| relies on a SQL EXCEPT statement. In contrast to EXCEPT ALL, this should lead to a set subtraction instead of a multiset subtraction. In other words, duplicates in T1 are treated as single occurrences.

Parameters:
  • columns1 (Optional[List[str]]) –

  • columns2 (Optional[List[str]]) –

  • constant_max_missing_fraction (Optional[float]) –

  • date_range_loss_fraction (Optional[float]) –

  • condition1 (Condition) –

  • condition2 (Condition) –

  • name (str) –

add_row_superset_constraint(columns1, columns2, constant_max_missing_fraction, date_range_loss_fraction=None, condition1=None, condition2=None, name=None)

At most max_missing_fraction of rows in T2 are not in T1.

In other words, \(\frac{|T2-T1|}{|T2|} \leq\) max_missing_fraction. Rows from T1 are indexed in columns1, rows from T2 are indexed in columns2.

Parameters:
  • columns1 (Optional[List[str]]) –

  • columns2 (Optional[List[str]]) –

  • constant_max_missing_fraction (float) –

  • date_range_loss_fraction (Optional[float]) –

  • condition1 (Condition) –

  • condition2 (Condition) –

  • name (str) –

add_uniques_equality_constraint(columns1, columns2, map_func=None, reduce_func=None, condition1=None, condition2=None, name=None)

Check if the data’s unique values in given columns are equal.

The UniquesEquality constraint asserts that the unique values contained in the given columns of one DataSource are exactly the ones in the given columns of another DataSource.

See the Uniques class for further parameter details on map_func and reduce_func.

Parameters:
  • columns1 (List[str]) –

  • columns2 (List[str]) –

  • map_func (Callable[[TypeVar(T)], TypeVar(T)]) –

  • reduce_func (Callable[[Collection], Collection]) –

  • condition1 (Condition) –

  • condition2 (Condition) –

  • name (str) –

add_uniques_subset_constraint(columns1, columns2, max_relative_violations=0, map_func=None, reduce_func=None, condition1=None, condition2=None, name=None)

Check if the given columns' unique values are contained in the reference data.

The UniquesSubset constraint asserts that the unique values contained in the given columns of a DataSource are part of the unique values of the given columns of another DataSource.

NULL values in the columns are ignored. To assert their absence, use the NullAbsence constraint via the add_null_absence_constraint helper method of WithinRequirement.

max_relative_violations indicates what fraction of rows of the given table may have values not included in the reference set of unique values. Please note that UniquesSubset and UniquesSuperset are not symmetrical in this regard.

See Uniques for further details on map_func and reduce_func.

Parameters:
  • columns1 (List[str]) –

  • columns2 (List[str]) –

  • max_relative_violations (float) –

  • map_func (Callable[[TypeVar(T)], TypeVar(T)]) –

  • reduce_func (Callable[[Collection], Collection]) –

  • condition1 (Condition) –

  • condition2 (Condition) –

  • name (str) –

add_uniques_superset_constraint(columns1, columns2, max_relative_violations=0, map_func=None, reduce_func=None, condition1=None, condition2=None, name=None)

Check if unique values of columns are contained in the reference data.

The UniquesSuperset constraint asserts that the reference set of expected values, derived from the unique values in the given columns of the reference DataSource, is contained in the given columns of a DataSource.

NULL values in the columns are ignored. To assert their absence, use the NullAbsence constraint via the add_null_absence_constraint helper method of WithinRequirement.

max_relative_violations indicates what fraction of unique values of the given DataSource are not represented in the reference set of unique values. Please note that UniquesSubset and UniquesSuperset are not symmetrical in this regard.

One use of this constraint is to test for consistency in columns with expected categorical values.

See Uniques for further details on map_func and reduce_func.

Parameters:
  • columns1 (List[str]) –

  • columns2 (List[str]) –

  • max_relative_violations (float) –

  • map_func (Callable[[TypeVar(T)], TypeVar(T)]) –

  • reduce_func (Callable[[Collection], Collection]) –

  • condition1 (Condition) –

  • condition2 (Condition) –

  • name (str) –

add_varchar_max_length_constraint(column1, column2, condition1=None, condition2=None, name=None)
Parameters:
  • column1 (str) –

  • column2 (str) –

  • condition1 (Condition) –

  • condition2 (Condition) –

  • name (str) –

add_varchar_min_length_constraint(column1, column2, condition1=None, condition2=None, name=None)
Parameters:
  • column1 (str) –

  • column2 (str) –

  • condition1 (Condition) –

  • condition2 (Condition) –

  • name (str) –

classmethod from_expressions(expression1, expression2, name1, name2, date_column=None, date_column2=None)

Create a BetweenRequirement based on sqlalchemy expressions.

Any sqlalchemy object implementing the alias method can be passed as an argument for the expression1 and expression2 parameters. This could, e.g., be a sqlalchemy.Table object or the result of a sqlalchemy.select invocation.

name1 and name2 will be used to represent the expressions in error messages, respectively.

Parameters:
  • name1 (str) –

  • name2 (str) –

  • date_column (Optional[str]) –

  • date_column2 (Optional[str]) –

classmethod from_raw_queries(query1, query2, name1, name2, columns1=None, columns2=None, date_column=None, date_column2=None)

Create a BetweenRequirement based on raw query strings.

The query1 and query2 parameters can be passed any query string returning rows, e.g. "SELECT * FROM myschema.mytable LIMIT 1337" or "SELECT id, name FROM table1 UNION SELECT id, name FROM table2".

name1 and name2 will be used to represent the queries in error messages, respectively.

If constraints rely on specific columns, these should be provided here via columns1 and columns2 respectively.

Parameters:
  • query1 (str) –

  • query2 (str) –

  • name1 (str) –

  • name2 (str) –

  • columns1 (List[str]) –

  • columns2 (List[str]) –

  • date_column (Optional[str]) –

  • date_column2 (Optional[str]) –

classmethod from_tables(db_name1, schema_name1, table_name1, db_name2, schema_name2, table_name2, date_column=None, date_column2=None)
Parameters:
  • db_name1 (str) –

  • schema_name1 (str) –

  • table_name1 (str) –

  • db_name2 (str) –

  • schema_name2 (str) –

  • table_name2 (str) –

  • date_column (Optional[str]) –

  • date_column2 (Optional[str]) –

get_date_growth_rate(engine)
Return type:

float

get_deviation_getter(fix_value, deviation)
Parameters:
  • fix_value (Optional[float]) –

  • deviation (Optional[float]) –

class datajudge.Condition(raw_string=None, conditions=None, reduction_operator=None)

Bases: object

Condition allows for further narrowing down of a DataSource in a Constraint.

A Condition can be thought of as a filter, the content of a SQL WHERE clause, or a condition as known from probability theory.

While a DataSource is expressed more generally, one might be interested in testing properties of a specific part of said DataSource in light of a particular constraint. Hence, using Conditions allows for the reuse of a DataSource, in lieu of creating a new custom DataSource with the Condition implicitly built in.

A Condition can either be ‘atomic’, i.e. not further reducible to sub-conditions, or ‘composite’, i.e. combining multiple sub-conditions. In the former case, it can be instantiated with help of the raw_string parameter, e.g. “col1 > 0”. In the latter case, it can be instantiated with help of the conditions and reduction_operator parameters. reduction_operator allows for two values: “and” (logical conjunction) and “or” (logical disjunction). Note that composition of Conditions supports arbitrary degrees of nesting.

Attributes:
conditions
raw_string
reduction_operator

Methods

snowflake_str

Parameters:
  • raw_string (str | None) –

  • conditions (Optional[Sequence[Condition]]) –

  • reduction_operator (str | None) –

conditions: Optional[Sequence[Condition]] = None
raw_string: str | None = None
reduction_operator: str | None = None
snowflake_str()
class datajudge.Constraint(ref, *, ref2=None, ref_value=None, name=None)

Bases: ABC

Express a DataReference constraint against either another DataReference or a reference value.

Constraints against other DataReferences are typically referred to as ‘between’ constraints. Please use the ref2 argument to instantiate such a constraint. Constraints against a fixed reference value are typically referred to as ‘within’ constraints. Please use the ref_value argument to instantiate such a constraint.

A constraint typically relies on the comparison of factual and target values. The former represent the key quantity of interest as seen in the database, the latter the key quantity of interest as expected a priori. Such a comparison is meant to be carried out in the test method.

In order to obtain such values, the retrieve method defines a mapping from DataReference, be it the DataReference of primary interest, ref, or a baseline DataReference, ref2, to value. If ref_value is already provided, usually no further mapping needs to be taken care of.

Attributes:
condition_string
target_prefix

Methods

retrieve(engine, ref)

Retrieve the value of interest for a DataReference from database.

compare

get_description

get_factual_value

get_target_value

test

Parameters:
  • ref (DataReference) –

  • ref2 (Optional[DataReference]) –

  • ref_value (Any) –

  • name (str) –

compare(value_factual, value_target)
Parameters:
  • value_factual (Any) –

  • value_target (Any) –

Return type:

Tuple[bool, Optional[str]]

property condition_string: str
get_description()
Return type:

str

get_factual_value(engine)
Parameters:

engine (Engine) –

Return type:

Any

get_target_value(engine)
Parameters:

engine (Engine) –

Return type:

Any

retrieve(engine, ref)

Retrieve the value of interest for a DataReference from database.

Parameters:
  • engine (Engine) –

  • ref (DataReference) –

Return type:

Tuple[Any, Optional[List[Select]]]

property target_prefix: str
test(engine)
Parameters:

engine (Engine) –

Return type:

TestResult

class datajudge.Requirement

Bases: ABC, MutableSequence

Methods

append(value)

S.append(value) -- append value to the end of the sequence

clear()

count(value)

extend(values)

S.extend(iterable) -- extend sequence by appending elements from the iterable

index(value, [start, [stop]])

Raises ValueError if the value is not present.

insert(index, value)

S.insert(index, value) -- insert value before index

pop([index])

Raise IndexError if list is empty or index is out of range.

remove(value)

S.remove(value) -- remove first occurrence of value.

reverse()

S.reverse() -- reverse IN PLACE

test

insert(index, value)

S.insert(index, value) -- insert value before index
Return type:

None

test(engine)
Return type:

List[TestResult]

class datajudge.WithinRequirement(data_source)

Bases: Requirement

Methods

add_categorical_bound_constraint(columns, ...)

Check if the distribution of unique values in columns falls within the specified minimum and maximum bounds.

add_column_type_constraint(column, column_type)

Check if a column type matches the expected column_type.

add_date_between_constraint(column, ...[, ...])

Use string format: lower_bound="'20121230'".

add_date_max_constraint(column, max_value[, ...])

Ensure all dates are less than or equal to max_value.

add_date_min_constraint(column, min_value[, ...])

Ensure all dates are greater than or equal to min_value.

add_date_no_gap_constraint(start_column, ...)

Express that date range rows have no gap in-between them.

add_date_no_overlap_2d_constraint(...[, ...])

Express that several date range rows do not overlap in two date dimensions.

add_date_no_overlap_constraint(start_column, ...)

Constraint expressing that several date range rows may not overlap.

add_functional_dependency_constraint(...[, ...])

Expresses a functional dependency, a constraint where the value_columns are uniquely determined by the key_columns.

add_groupby_aggregation_constraint(columns, ...)

Check whether array aggregate corresponds to an integer range.

add_max_null_fraction_constraint(column, ...)

Assert that column has less than a certain fraction of NULL values.

add_numeric_between_constraint(column, ...)

Assert that the column's values lie between lower_bound and upper_bound.

add_numeric_max_constraint(column, max_value)

All values in column are less or equal max_value.

add_numeric_mean_constraint(column, ...[, ...])

Assert the mean of the column deviates at most max_deviation from mean_value.

add_numeric_min_constraint(column, min_value)

All values in column are greater or equal min_value.

add_numeric_no_gap_constraint(start_column, ...)

Express that numeric interval rows have no gaps larger than some max value in-between them.

add_numeric_no_overlap_constraint(...[, ...])

Constraint expressing that several numeric interval rows may not overlap.

add_numeric_percentile_constraint(column, ...)

Assert that the percentage-th percentile is approximately expected_percentile.

add_primary_key_definition_constraint(...[, ...])

Check that the primary key constraints in the database are exactly equal to the given column names.

add_uniqueness_constraint([columns, ...])

Columns should uniquely identify row.

add_uniques_equality_constraint(columns, uniques)

Check if the data's unique values are equal to a given set of values.

add_uniques_subset_constraint(columns, uniques)

Check if the data's unique values are contained in a given set of values.

add_uniques_superset_constraint(columns, uniques)

Check if unique values of columns are contained in the reference data.

add_varchar_regex_constraint(column, regex)

Assesses whether the values in a column match a given regular expression pattern.

add_varchar_regex_constraint_db(column, regex)

Assesses whether the values in a column match a given regular expression pattern.

append(value)

S.append(value) -- append value to the end of the sequence

clear()

count(value)

extend(values)

S.extend(iterable) -- extend sequence by appending elements from the iterable

from_expression(expression, name)

Create a WithinRequirement based on a sqlalchemy expression.

from_raw_query(query, name[, columns])

Create a WithinRequirement based on a raw query string.

index(value, [start, [stop]])

Raises ValueError if the value is not present.

insert(index, value)

S.insert(index, value) -- insert value before index

pop([index])

Raise IndexError if list is empty or index is out of range.

remove(value)

S.remove(value) -- remove first occurrence of value.

reverse()

S.reverse() -- reverse IN PLACE

add_column_existence_constraint

add_n_rows_equality_constraint

add_n_rows_max_constraint

add_n_rows_min_constraint

add_n_uniques_equality_constraint

add_null_absence_constraint

add_varchar_max_length_constraint

add_varchar_min_length_constraint

from_table

test

Parameters:

data_source (DataSource) –

add_categorical_bound_constraint(columns, distribution, default_bounds=(0, 0), max_relative_violations=0, condition=None, name=None)

Check if the distribution of unique values in columns falls within the specified minimum and maximum bounds.

The CategoricalBoundConstraint is added to ensure the distribution of unique values in the specified columns of a DataSource falls within the given minimum and maximum bounds defined in the distribution parameter.

Parameters:
columns : List[str]

A list of column names from the DataSource to apply the constraint on.

distribution : Dict[T, Tuple[float, float]]

A dictionary where keys represent unique values and the corresponding tuple values represent the minimum and maximum allowed proportions of the respective unique value in the columns.

default_bounds : Tuple[float, float], optional, default=(0, 0)

A tuple specifying the minimum and maximum allowed proportions for all elements not mentioned in the distribution. By default, it’s set to (0, 0), which means all elements not present in distribution will cause a constraint failure.

max_relative_violations : float, optional, default=0

A tolerance threshold (0 to 1) for the proportion of elements in the data that can violate the bound constraints without triggering the constraint violation.

condition : Condition, optional

An optional parameter to specify a Condition object to filter the data before applying the constraint.

name : str, optional

An optional parameter to provide a custom name for the constraint.

Parameters:
  • columns (List[str]) –

  • distribution (Dict[TypeVar(T), Tuple[float, float]]) –

  • default_bounds (Tuple[float, float]) –

  • max_relative_violations (float) –

  • condition (Condition) –

  • name (str) –

add_column_existence_constraint(columns, name=None)
Parameters:
  • columns (List[str]) –

  • name (str) –

add_column_type_constraint(column, column_type, name=None)

Check if a column type matches the expected column_type.

The column_type can be provided as a string (backend-specific type name), a backend-specific SQLAlchemy type, or a SQLAlchemy’s generic type.

If SQLAlchemy’s generic types are used, the check is performed using isinstance, which means that the actual type can also be a subclass of the target type. For more information on SQLAlchemy’s generic types, see https://docs.sqlalchemy.org/en/20/core/type_basics.html

Parameters:
column : str

The name of the column to which the constraint will be applied.

column_type : Union[str, sa.types.TypeEngine]

The expected type of the column. This can be a string, a backend-specific SQLAlchemy type, or a generic SQLAlchemy type.

name : Optional[str]

An optional name for the constraint. If not provided, a name will be generated automatically.

Parameters:
  • column (str) –

  • column_type (Union[str, TypeEngine]) –

  • name (str) –

add_date_between_constraint(column, lower_bound, upper_bound, min_fraction, condition=None, name=None)

Use string format: lower_bound="'20121230'".

Parameters:
  • column (str) –

  • lower_bound (str) –

  • upper_bound (str) –

  • min_fraction (float) –

  • condition (Condition) –

  • name (str) –

add_date_max_constraint(column, max_value, use_upper_bound_reference=True, column_type='date', condition=None, name=None)

Ensure the maximum date in column compares to max_value as specified by use_upper_bound_reference.

Use string format: max_value="'20121230'".

For more information on column_type values, see add_column_type_constraint.

If use_upper_bound_reference, the max of the first table has to be less than or equal to max_value. If not use_upper_bound_reference, the max of the first table has to be greater than or equal to max_value.

Parameters:
  • column (str) –

  • max_value (str) –

  • use_upper_bound_reference (bool) –

  • column_type (Union[str, TypeEngine]) –

  • condition (Condition) –

  • name (str) –

add_date_min_constraint(column, min_value, use_lower_bound_reference=True, column_type='date', condition=None, name=None)

Ensure the minimum date in column compares to min_value as specified by use_lower_bound_reference.

Use string format: min_value="'20121230'".

For more information on column_type values, see add_column_type_constraint.

If use_lower_bound_reference, the min of the first table has to be greater than or equal to min_value. If not use_lower_bound_reference, the min of the first table has to be less than or equal to min_value.

Parameters:
  • column (str) –

  • min_value (str) –

  • use_lower_bound_reference (bool) –

  • column_type (Union[str, TypeEngine]) –

  • condition (Condition) –

  • name (str) –

add_date_no_gap_constraint(start_column, end_column, key_columns=None, end_included=True, max_relative_n_violations=0, condition=None, name=None)

Express that date range rows have no gap in-between them.

The table under inspection must consist of at least one but up to many key columns, identifying an entity. Additionally, a start_column and an end_column, indicating start and end dates, should be provided.

Neither of those columns should contain NULL values. Also, it should hold that for a given row, the value of end_column is strictly greater than the value of start_column.

Note that the value of start_column is expected to be included in each date range. By default, the value of end_column is expected to be included as well - this can however be changed by setting end_included to False.

A ‘key’ is a fixed set of values in key_columns and represents an entity of interest. A priori, a key is not a primary key, i.e., a key can have and often has several rows. Thereby, a key will often come with several date ranges.

If key_columns is None or [], all columns of the table will be considered as composing the key.

In order to express a tolerance for some violations of this gap property, use the max_relative_n_violations parameter. The latter expresses for what fraction of all key_values, at least one gap may exist.

For illustrative examples of this constraint, please refer to its test cases.

Parameters:
  • start_column (str) –

  • end_column (str) –

  • key_columns (Optional[List[str]]) –

  • end_included (bool) –

  • max_relative_n_violations (float) –

  • condition (Condition) –

  • name (str) –

add_date_no_overlap_2d_constraint(start_column1, end_column1, start_column2, end_column2, key_columns=None, end_included=True, max_relative_n_violations=0, condition=None, name=None)

Express that several date range rows do not overlap in two date dimensions.

The table under inspection must consist of at least one but up to many key columns, identifying an entity. Per date dimension, a start_column and an end_column should be provided.

For a given row in this table, start_column1 and end_column1 indicate a date range. Moreover, for that same row, start_column2 and end_column2 indicate a second date range. These date ranges are expected to represent different date 'dimensions'. Example: A row indicates a forecasted value used in production. start_column1 and end_column1 represent the timespan that was forecasted, e.g. the weather from next Saturday to next Sunday. start_column2 and end_column2 might indicate the timespan when this forecast was used, e.g. from the previous Monday to Wednesday.

None of those columns should contain NULL values. Also, it should hold that for a given row, the value of each end column is strictly greater than the value of its corresponding start column.

Note that the values of start_column1 and start_column2 are expected to be included in each date range. By default, the values of end_column1 and end_column2 are expected to be included as well - this can however be changed by setting end_included to False.

A ‘key’ is a fixed set of values in key_columns and represents an entity of interest. A priori, a key is not a primary key, i.e., a key can have and often has several rows. Thereby, a key will often come with several date ranges.

Often, you might want the date ranges for a given key not to overlap.

If key_columns is None or [], all columns of the table will be considered as composing the key.

In order to express a tolerance for some violations of this non-overlapping property, use the max_relative_n_violations parameter. The latter expresses for what fraction of all key_values, at least one overlap may exist.

For illustrative examples of this constraint, please refer to its test cases.

Parameters:
  • start_column1 (str) –

  • end_column1 (str) –

  • start_column2 (str) –

  • end_column2 (str) –

  • key_columns (Optional[List[str]]) –

  • end_included (bool) –

  • max_relative_n_violations (float) –

  • condition (Condition) –

  • name (str) –

add_date_no_overlap_constraint(start_column, end_column, key_columns=None, end_included=True, max_relative_n_violations=0, condition=None, name=None)

Constraint expressing that several date range rows may not overlap.

The DataSource under inspection must consist of at least one but up to many key_columns, identifying an entity, a start_column and an end_column.

For a given row in this DataSource, start_column and end_column indicate a date range. Neither of those columns should contain NULL values. Also, it should hold that for a given row, the value of end_column is strictly greater than the value of start_column.

Note that the value of start_column is expected to be included in each date range. By default, the value of end_column is expected to be included as well - this can however be changed by setting end_included to False.

A ‘key’ is a fixed set of values in key_columns and represents an entity of interest. A priori, a key is not a primary key, i.e., a key can have and often has several rows. Thereby, a key will often come with several date ranges.

Often, you might want the date ranges for a given key not to overlap.

If key_columns is None or [], all columns of the table will be considered as composing the key.

In order to express a tolerance for some violations of this non-overlapping property, use the max_relative_n_violations parameter. The latter expresses for what fraction of all key values, at least one overlap may exist.

For illustrative examples of this constraint, please refer to its test cases.

Parameters:
  • start_column (str) –

  • end_column (str) –

  • key_columns (Optional[List[str]]) –

  • end_included (bool) –

  • max_relative_n_violations (float) –

  • condition (Condition) –

  • name (str) –
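
The overlap check can be sketched in plain Python. This is a simplified model of the constraint's semantics under the assumptions above (NULL-free columns, end after start), not datajudge's implementation:

```python
from datetime import date

def fraction_of_keys_with_overlap(rows, end_included=True):
    """rows: (key, start, end) triples. Returns the fraction of keys whose
    date ranges overlap at least once."""
    by_key = {}
    for key, start, end in rows:
        by_key.setdefault(key, []).append((start, end))
    violating = 0
    for intervals in by_key.values():
        intervals.sort()
        for (_, e1), (s2, _) in zip(intervals, intervals[1:]):
            # With end_included, a range starting exactly at the previous
            # range's end already counts as an overlap.
            overlap = (s2 <= e1) if end_included else (s2 < e1)
            if overlap:
                violating += 1
                break
    return violating / len(by_key)

rows = [
    ("a", date(2021, 1, 1), date(2021, 1, 10)),
    ("a", date(2021, 1, 10), date(2021, 1, 20)),  # starts at previous end
    ("b", date(2021, 1, 1), date(2021, 1, 5)),
    ("b", date(2021, 1, 6), date(2021, 1, 9)),
]
print(fraction_of_keys_with_overlap(rows))                      # 0.5
print(fraction_of_keys_with_overlap(rows, end_included=False))  # 0.0
```

With max_relative_n_violations=0.5, the example data above would pass.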

add_functional_dependency_constraint(key_columns, value_columns, condition=None, name=None)

Expresses a functional dependency, a constraint where the value_columns are uniquely determined by the key_columns. This means that for each unique combination of values in the key_columns, there is exactly one corresponding combination of values in the value_columns.

The add_uniqueness_constraint constraint is a special case of this constraint, where the key_columns form a primary key and all other columns are included in value_columns. This constraint allows for a more general definition of functional dependencies, in which the key_columns are not necessarily a primary key.

For more information on functional dependencies, see https://en.wikipedia.org/wiki/Functional_dependency.

Parameters:
  • key_columns (List[str]) –

  • value_columns (List[str]) –

  • condition (Condition) –

  • name (str) –
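
The dependency can be checked with a plain-Python sketch (a hypothetical helper, not datajudge's implementation): group rows by the key columns and flag any key mapping to more than one value combination.

```python
def functional_dependency_violations(rows, key_idx, value_idx):
    # Map each key tuple to the set of value tuples observed for it.
    seen = {}
    for row in rows:
        key = tuple(row[i] for i in key_idx)
        value = tuple(row[i] for i in value_idx)
        seen.setdefault(key, set()).add(value)
    # A key with more than one value combination violates key -> value.
    return {k: v for k, v in seen.items() if len(v) > 1}

rows = [
    ("DE", "Berlin", 2020),
    ("DE", "Berlin", 2021),  # third column is not constrained here
    ("FR", "Paris", 2020),
    ("FR", "Lyon", 2020),    # violation: FR maps to two "capitals"
]
print(sorted(functional_dependency_violations(rows, [0], [1])))  # [('FR',)]
```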

add_groupby_aggregation_constraint(columns, aggregation_column, start_value, tolerance=0, condition=None, name=None)

Check whether array aggregate corresponds to an integer range.

The DataSource is grouped by columns. SQL's array_agg function is then applied to the aggregation_column.

Since we expect aggregation_column to be a numeric column, this leads to a multiset of aggregated values. These values should correspond to the integers ranging from start_value to the cardinality of the multiset.

In order to allow for slight deviations from this pattern, tolerance expresses the fraction of groups whose aggregated values may form an incomplete range.

Parameters:
  • columns (Sequence[str]) –

  • aggregation_column (str) –

  • start_value (int) –

  • tolerance (float) –

  • condition (Condition) –

  • name (str) –
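
The expected pattern can be sketched in plain Python. This is a simplified model of the check (no SQL), with helper names chosen for illustration:

```python
def fraction_of_incomplete_ranges(rows, start_value=1):
    # rows: (group_key, value) pairs; group and collect the values.
    groups = {}
    for key, value in rows:
        groups.setdefault(key, []).append(value)
    # A group passes when its values are exactly
    # start_value .. start_value + n - 1.
    bad = sum(
        1
        for values in groups.values()
        if sorted(values) != list(range(start_value, start_value + len(values)))
    )
    return bad / len(groups)

rows = [("a", 1), ("a", 2), ("a", 3), ("b", 1), ("b", 3)]
print(fraction_of_incomplete_ranges(rows))  # 0.5: group "b" is missing 2
```

A tolerance of 0.5 would accept this example data.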

add_max_null_fraction_constraint(column, max_null_fraction, condition=None, name=None)

Assert that column has at most a certain fraction of NULL values.

max_null_fraction is expected to lie within [0, 1].

Parameters:
  • column (str) –

  • max_null_fraction (float) –

  • condition (Condition) –

  • name (str) –

add_n_rows_equality_constraint(n_rows, condition=None, name=None)
Parameters:
  • n_rows (int) –

  • condition (Condition) –

  • name (str) –

add_n_rows_max_constraint(n_rows_max, condition=None, name=None)
Parameters:
  • n_rows_max (int) –

  • condition (Condition) –

  • name (str) –

add_n_rows_min_constraint(n_rows_min, condition=None, name=None)
Parameters:
  • n_rows_min (int) –

  • condition (Condition) –

  • name (str) –

add_n_uniques_equality_constraint(columns, n_uniques, condition=None, name=None)
Parameters:
  • columns (Optional[List[str]]) –

  • n_uniques (int) –

  • condition (Condition) –

  • name (str) –

add_null_absence_constraint(column, condition=None, name=None)
Parameters:
  • column (str) –

  • condition (Condition) –

  • name (str) –

add_numeric_between_constraint(column, lower_bound, upper_bound, min_fraction, condition=None, name=None)

Assert that the column’s values lie between lower_bound and upper_bound.

Note that both bounds are inclusive.

Unless specified otherwise via the usage of a condition, NULL values will be considered in the denominator of min_fraction. NULL values will never be considered to lie in the interval [lower_bound, upper_bound].

Parameters:
  • column (str) –

  • lower_bound (float) –

  • upper_bound (float) –

  • min_fraction (float) –

  • condition (Condition) –

  • name (str) –
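
The NULL handling described above can be sketched in plain Python, with None standing in for NULL (a simplified model, not datajudge's implementation):

```python
def fraction_between(values, lower_bound, upper_bound):
    # NULLs count in the denominator but can never lie within the bounds.
    inside = sum(
        1 for v in values if v is not None and lower_bound <= v <= upper_bound
    )
    return inside / len(values)

values = [0.5, 1.0, 2.0, None, 5.0]
print(fraction_between(values, 1.0, 2.0))  # 0.4: bounds inclusive, None excluded
```

The constraint passes when this fraction is at least min_fraction.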

add_numeric_max_constraint(column, max_value, condition=None, name=None)

Assert that all values in column are less than or equal to max_value.

Parameters:
  • column (str) –

  • max_value (float) –

  • condition (Condition) –

  • name (str) –

add_numeric_mean_constraint(column, mean_value, max_absolute_deviation, condition=None, name=None)

Assert that the mean of the column deviates at most max_absolute_deviation from mean_value.

Parameters:
  • column (str) –

  • mean_value (float) –

  • max_absolute_deviation (float) –

  • condition (Condition) –

  • name (str) –

add_numeric_min_constraint(column, min_value, condition=None)

Assert that all values in column are greater than or equal to min_value.

Parameters:
  • column (str) –

  • min_value (float) –

  • condition (Condition) –

add_numeric_no_gap_constraint(start_column, end_column, key_columns=None, legitimate_gap_size=0, max_relative_n_violations=0, condition=None, name=None)

Express that numeric interval rows have no gaps larger than some maximal value in-between them.

The table under inspection must consist of at least one but up to many key columns, identifying an entity. Additionally, a start_column and an end_column, indicating interval start and end values, should be provided.

Neither of those columns should contain NULL values. Also, it should hold that for a given row, the value of end_column is strictly greater than the value of start_column.

legitimate_gap_size is the maximum tolerated gap size between two intervals.

A ‘key’ is a fixed set of values in key_columns and represents an entity of interest. A priori, a key is not a primary key, i.e., a key can have and often has several rows. Thereby, a key will often come with several intervals.

If key_columns is None or [], all columns of the table will be considered as composing the key.

In order to express a tolerance for some violations of this gap property, use the max_relative_n_violations parameter. The latter expresses for what fraction of all key_values, at least one gap may exist.

For illustrative examples of this constraint, please refer to its test cases.

Parameters:
  • start_column (str) –

  • end_column (str) –

  • key_columns (Optional[List[str]]) –

  • legitimate_gap_size (float) –

  • max_relative_n_violations (float) –

  • condition (Condition) –

  • name (str) –
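
A plain-Python sketch of the gap check (simplified; not datajudge's implementation):

```python
def fraction_of_keys_with_gap(rows, legitimate_gap_size=0):
    # rows: (key, start, end) triples; group the intervals per key.
    by_key = {}
    for key, start, end in rows:
        by_key.setdefault(key, []).append((start, end))
    violating = 0
    for intervals in by_key.values():
        intervals.sort()
        # A key violates the constraint as soon as two consecutive intervals
        # leave a gap larger than legitimate_gap_size.
        if any(
            s2 - e1 > legitimate_gap_size
            for (_, e1), (s2, _) in zip(intervals, intervals[1:])
        ):
            violating += 1
    return violating / len(by_key)

rows = [("a", 0.0, 1.0), ("a", 1.5, 2.0), ("b", 0.0, 1.0), ("b", 1.0, 2.0)]
print(fraction_of_keys_with_gap(rows))                           # 0.5
print(fraction_of_keys_with_gap(rows, legitimate_gap_size=0.5))  # 0.0
```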

add_numeric_no_overlap_constraint(start_column, end_column, key_columns=None, end_included=True, max_relative_n_violations=0, condition=None, name=None)

Constraint expressing that several numeric interval rows may not overlap.

The DataSource under inspection must consist of at least one but up to many key_columns, identifying an entity, a start_column and an end_column.

For a given row in this DataSource, start_column and end_column indicate a numeric interval. Neither of those columns should contain NULL values. Also, it should hold that for a given row, the value of end_column is strictly greater than the value of start_column.

Note that the value of start_column is expected to be included in each interval. By default, the value of end_column is expected to be included as well - this can however be changed by setting end_included to False.

A ‘key’ is a fixed set of values in key_columns and represents an entity of interest. A priori, a key is not a primary key, i.e., a key can have and often has several rows. Thereby, a key will often come with several intervals.

Often, you might want the intervals for a given key not to overlap.

If key_columns is None or [], all columns of the table will be considered as composing the key.

In order to express a tolerance for some violations of this non-overlapping property, use the max_relative_n_violations parameter. The latter expresses for what fraction of all key values, at least one overlap may exist.

For illustrative examples of this constraint, please refer to its test cases.

Parameters:
  • start_column (str) –

  • end_column (str) –

  • key_columns (Optional[List[str]]) –

  • end_included (bool) –

  • max_relative_n_violations (float) –

  • condition (Condition) –

  • name (str) –

add_numeric_percentile_constraint(column, percentage, expected_percentile, max_absolute_deviation=None, max_relative_deviation=None, condition=None, name=None)

Assert that the percentage-th percentile is approximately expected_percentile.

The percentile is defined as the value present in column for which percentage % of the values in column are less than or equal. NULL values are ignored.

Hence, if percentage is less than the inverse of the number of non-NULL rows, None is received as the percentage-th percentile.

percentage is expected to be provided in percent. The median, for example, would correspond to percentage=50.

At least one of max_absolute_deviation and max_relative_deviation must be provided.

Parameters:
  • column (str) –

  • percentage (float) –

  • expected_percentile (float) –

  • max_absolute_deviation (Optional[float]) –

  • max_relative_deviation (Optional[float]) –

  • condition (Condition) –

  • name (str) –
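
One plausible reading of this definition, sketched in plain Python (edge cases, e.g. very small percentage values, may be handled differently by datajudge):

```python
def percentile(values, percentage):
    # Smallest non-NULL value v such that at least percentage % of the
    # non-NULL values are less than or equal to v.
    non_null = sorted(v for v in values if v is not None)
    n = len(non_null)
    for i, v in enumerate(non_null, start=1):
        if 100.0 * i / n >= percentage:
            return v
    return None

values = [3, 1, None, 2, 4]
print(percentile(values, 50))  # 2: half of the non-NULL values are <= 2
```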

add_primary_key_definition_constraint(primary_keys, name=None)

Check that the primary key constraints in the database are exactly equal to the given column names.

Note that this doesn’t actually check that the primary key values are unique across the table.

Parameters:
  • primary_keys (List[str]) –

  • name (str) –

add_uniqueness_constraint(columns=None, max_duplicate_fraction=0, condition=None, max_absolute_n_duplicates=0, infer_pk_columns=False, name=None)

Columns should uniquely identify each row.

Given a set of columns, this constraint asserts that the columns satisfy the uniqueness requirement of a primary key, i.e., that tuples drawn from these columns are unique. The constraint has a tolerance for inconsistencies, expressed via max_duplicate_fraction. The latter requires that the number of unique tuples from said columns is greater than or equal to (1 - max_duplicate_fraction) times the number of rows.

If infer_pk_columns is True, columns will be retrieved from the primary keys. When columns=None and infer_pk_columns=False, the fallback is validating that all rows in a table are unique.

Parameters:
  • columns (List[str]) –

  • max_duplicate_fraction (float) –

  • condition (Condition) –

  • max_absolute_n_duplicates (int) –

  • infer_pk_columns (bool) –

  • name (str) –
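
The tolerance rule can be sketched in plain Python (simplified; it ignores condition and max_absolute_n_duplicates):

```python
def satisfies_uniqueness(rows, max_duplicate_fraction=0.0):
    # The number of distinct tuples must reach
    # (1 - max_duplicate_fraction) times the number of rows.
    return len(set(rows)) >= (1 - max_duplicate_fraction) * len(rows)

rows = [("a", 1), ("a", 1), ("b", 2), ("c", 3)]  # one duplicate among 4 rows
print(satisfies_uniqueness(rows))                               # False
print(satisfies_uniqueness(rows, max_duplicate_fraction=0.25))  # True
```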

add_uniques_equality_constraint(columns, uniques, map_func=None, reduce_func=None, condition=None, name=None)

Check if the data’s unique values are equal to a given set of values.

The UniquesEquality constraint asserts that the values contained in a column of a DataSource are exactly the ones of a reference set of expected values, specified via the uniques parameter.

See the Uniques class for further parameter details on map_func and reduce_func.

Parameters:
  • columns (List[str]) –

  • uniques (Collection[TypeVar(T)]) –

  • map_func (Callable[[TypeVar(T)], TypeVar(T)]) –

  • reduce_func (Callable[[Collection], Collection]) –

  • condition (Condition) –

  • name (str) –

add_uniques_subset_constraint(columns, uniques, max_relative_violations=0, map_func=None, reduce_func=None, condition=None, name=None)

Check if the data’s unique values are contained in a given set of values.

The UniquesSubset constraint asserts that the values contained in a column of a DataSource are part of a reference set of expected values, specified via uniques.

NULL values in the column are ignored. To assert their absence, use the NullAbsence constraint via the add_null_absence_constraint helper method of WithinRequirement.

max_relative_violations indicates what fraction of rows of the given table may have values not included in the reference set of unique values. Please note that UniquesSubset and UniquesSuperset are not symmetrical in this regard.

See Uniques for further details on map_func and reduce_func.

Parameters:
  • columns (List[str]) –

  • uniques (Collection[TypeVar(T)]) –

  • max_relative_violations (float) –

  • map_func (Callable[[TypeVar(T)], TypeVar(T)]) –

  • reduce_func (Callable[[Collection], Collection]) –

  • condition (Condition) –

  • name (str) –

add_uniques_superset_constraint(columns, uniques, max_relative_violations=0, map_func=None, reduce_func=None, condition=None, name=None)

Check if unique values of columns are contained in the reference data.

The UniquesSuperset constraint asserts that the reference set of expected values, specified via uniques, is contained in the given columns of a DataSource.

NULL values in the column are ignored. To assert their absence, use the NullAbsence constraint via the add_null_absence_constraint helper method of WithinRequirement.

max_relative_violations indicates what fraction of unique values of the given DataSource are not represented in the reference set of unique values. Please note that UniquesSubset and UniquesSuperset are not symmetrical in this regard.

One use of this constraint is to test for consistency in columns with expected categorical values.

See Uniques for further details on map_func and reduce_func.

Parameters:
  • columns (List[str]) –

  • uniques (Collection[TypeVar(T)]) –

  • max_relative_violations (float) –

  • map_func (Callable[[TypeVar(T)], TypeVar(T)]) –

  • reduce_func (Callable[[Collection], Collection]) –

  • condition (Condition) –

  • name (str) –
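
The asymmetry between UniquesSubset and UniquesSuperset can be sketched in plain Python (a simplified reading; consult the Uniques class for the exact normalization):

```python
def subset_violation_fraction(column_values, uniques):
    # UniquesSubset: fraction of ROWS whose value lies outside the reference set.
    values = [v for v in column_values if v is not None]  # NULLs are ignored
    allowed = set(uniques)
    return sum(1 for v in values if v not in allowed) / len(values)

def superset_violation_fraction(column_values, uniques):
    # UniquesSuperset: fraction of expected unique values absent from the data.
    present = {v for v in column_values if v is not None}
    return sum(1 for u in uniques if u not in present) / len(uniques)

values = ["red", "red", "red", "blue", None]
print(subset_violation_fraction(values, ["red", "green"]))    # 0.25: 1 of 4 rows
print(superset_violation_fraction(values, ["red", "green"]))  # 0.5: "green" missing
```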

add_varchar_max_length_constraint(column, max_length, condition=None, name=None)
Parameters:
  • column (str) –

  • max_length (int) –

  • condition (Condition) –

  • name (str) –

add_varchar_min_length_constraint(column, min_length, condition=None, name=None)
Parameters:
  • column (str) –

  • min_length (int) –

  • condition (Condition) –

  • name (str) –

add_varchar_regex_constraint(column, regex, condition=None, name=None, allow_none=False, relative_tolerance=0.0, aggregated=True, n_counterexamples=5)

Assesses whether the values in a column match a given regular expression pattern.

The option allow_none can be used in cases where the column is defined as nullable and contains null values.

How the tolerance factor is calculated can be controlled with the aggregated flag. When True, the tolerance is calculated using unique values. If not, the tolerance is calculated using all the instances of the data.

n_counterexamples defines how many counterexamples are displayed in an assertion text. If all counterexamples are meant to be shown, provide -1 as an argument.

When using this method, the regex matching takes place in memory. If you would instead like the matching to take place in the database, which is typically faster and substantially more memory-efficient, consider using add_varchar_regex_constraint_db.

Parameters:
  • column (str) –

  • regex (str) –

  • condition (Condition) –

  • name (str) –

  • allow_none (bool) –

  • relative_tolerance (float) –

  • aggregated (bool) –

  • n_counterexamples (int) –
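
The effect of aggregated can be sketched in plain Python (simplified; not datajudge's implementation):

```python
import re

def regex_violation_fraction(values, pattern, aggregated=True):
    # aggregated=True computes the violation fraction over unique values,
    # otherwise over every row.
    compiled = re.compile(pattern)
    pool = set(values) if aggregated else list(values)
    return sum(1 for v in pool if not compiled.match(v)) / len(pool)

values = ["a1", "a1", "a1", "b?"]
print(regex_violation_fraction(values, r"[a-z]\d$"))                    # 0.5
print(regex_violation_fraction(values, r"[a-z]\d$", aggregated=False))  # 0.25
```

The constraint passes when this fraction does not exceed relative_tolerance.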

add_varchar_regex_constraint_db(column, regex, condition=None, name=None, relative_tolerance=0.0, aggregated=True, n_counterexamples=5)

Assesses whether the values in a column match a given regular expression pattern.

How the tolerance factor is calculated can be controlled with the aggregated flag. When True, the tolerance is calculated using unique values. If not, the tolerance is calculated using all the instances of the data.

n_counterexamples defines how many counterexamples are displayed in an assertion text. If all counterexamples are meant to be shown, provide -1 as an argument.

When using this method, the regex matching takes place in the database, which is only supported for Postgres, SQLite, and Snowflake. Note that for Snowflake, this feature requires sqlalchemy-snowflake >= 1.4.0. As an alternative, add_varchar_regex_constraint performs the regex matching in memory. This is typically slower and more memory-intensive, but available on all supported database management systems.

Parameters:
  • column (str) –

  • regex (str) –

  • condition (Condition) –

  • name (str) –

  • relative_tolerance (float) –

  • aggregated (bool) –

  • n_counterexamples (int) –

classmethod from_expression(expression, name)

Create a WithinRequirement based on a sqlalchemy expression.

Any sqlalchemy object implementing the alias method can be passed as an argument for the expression parameter. This could, e.g., be a sqlalchemy.Table object or the result of a sqlalchemy.select call.

The name will be used to represent this expression in error messages.

Parameters:
  • expression (FromClause) –

  • name (str) –

classmethod from_raw_query(query, name, columns=None)

Create a WithinRequirement based on a raw query string.

The query parameter can be passed any query string returning rows, e.g. "SELECT * FROM myschema.mytable LIMIT 1337" or "SELECT id, name FROM table1 UNION SELECT id, name FROM table2".

The name will be used to represent this query in error messages.

If constraints rely on specific columns, these should be provided here via columns, e.g. ["id", "name"].

Parameters:
  • query (str) –

  • name (str) –

  • columns (List[str]) –

classmethod from_table(db_name, schema_name, table_name)
Parameters:
  • db_name (str) –

  • schema_name (str) –

  • table_name (str) –