We discovered the Python package pydantic through FastAPI, which we use for serving machine learning models. Pydantic is a Python package for data parsing and validation, based on type hints. We use pydantic because it is fast, does a lot of the dirty work for us, provides clear error messages and makes it easy to write readable code.

Two of our main use cases for pydantic are:

- *Validation of settings and input data.* We often read settings from a configuration file, which we use as inputs to our functions. We often end up doing quite a bit of input validation to ensure the settings are valid for further processing. To avoid starting our functions with a long set of validations and assertions, we use pydantic to validate the input.
- *Sharing data requirements between machine learning applications.* Our ML models have certain requirements to the data we use for training and prediction, for example that no data is missing. These requirements are used when we train our models, when we run online predictions and when we validate newly added training data. We use pydantic to specify the requirements we have and ensure that the same requirements are used everywhere, avoiding duplication of error-prone code across different applications.

This post will focus on the first use case, validation of settings and input data. A later post will cover the second use case.

## Validation of settings and input data

In some cases, we read settings from a configuration file, such as a toml file, to be parsed as nested dictionaries. We use the settings as inputs to different functions. We often end up doing quite a bit of input validation to ensure the settings parsed from file are valid for further processing. A concrete example is settings for machine learning models, where we use `toml` files for defining model parameters, features and training details for the models.

This is quite similar to how FastAPI uses pydantic for input validation: the input to the API call is json, which in Python translates to a dictionary, and input validation is done using pydantic.

In this post we will go through input validation for a function interpolating a time series to a higher frequency. If we want to do interpolation, we set the interpolation factor, i.e., the factor of upsampling, the interpolation method, and an option to interpolate on the integral. Our interpolation function is just a wrapper around pandas interpolation methods, including the validation of input and some data wrangling. The input validation code started out looking a bit like this:

```python
from typing import Dict


def validate_input_settings(params_in: Dict) -> Dict:
    params_validated = {}
    for key, value in params_in.items():
        if key == "interpolation_factor":
            if not int(value) == value:
                raise ValueError(f"{key} has a non-int value")
            if not int(value) >= 2:
                raise ValueError(f"{key}: {value} should be >= 2")
            value = int(value)
        elif key == "interpolation_method":
            allowed_set = {"repeat", "distribute", "linear", "cubic", "akima"}
            if value not in allowed_set:
                raise ValueError(f"{key} should be one of {allowed_set}, got {value}")
        elif key == "interpolate_on_integral":
            if not isinstance(value, bool):
                raise ValueError(f"{key} should be bool, got {value}")
        else:
            raise ValueError(f"{key} not a recognized key")
        params_validated[key] = value
    return params_validated
```

This is heavily nested, which in itself makes it hard to read, and perhaps you find that the validation rules aren’t crystal clear at first glance. We use SonarQube for static code quality analysis, and this piece of code results in a code smell, complaining that the code is too complex. In fact, this already has a *cognitive complexity*
of 18 as SonarQube counts, above the default threshold of 15. Cognitive complexity is a measure of how difficult it is to read code, and increments for each break in linear flow, such as an if statement or a for loop. Nested breaks of the flow are incremented again.
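As a rough illustration of how the count grows with nesting, consider the hypothetical function below. The increments in the comments follow our reading of SonarQube's rules, and the function itself is made up for this example.

```python
from typing import List


def count_large_positive(values: List[int], limit: int) -> int:
    """Hypothetical example: count the positive values above a limit."""
    count = 0
    for value in values:        # +1: a loop breaks the linear flow
        if value > 0:           # +2: an if, plus one for being nested one level
            if value > limit:   # +3: an if, plus two for being nested two levels
                count += 1
    return count                # cognitive complexity: 1 + 2 + 3 = 6


print(count_large_positive([1, 5, -3, 10], limit=4))
```

Counting `validate_input_settings` the same way gives 1 for the loop, 2 for the first `if`, 1 for each `elif` and `else`, and 3 for each of the four innermost `if` statements, which adds up to 18.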

Let’s summarize what we check for in `validate_input_settings`:

- `interpolation_factor` is an integer
- `interpolation_factor` is greater than or equal to 2
- `interpolation_method` is in a set of allowed values
- `interpolate_on_integral` is boolean
- The keys in our settings dictionary are among the three mentioned above

In addition to the code above, we have a few more checks:

- if an `interpolation_factor` is given, but no `interpolation_method`, use the default method `linear`
- if an `interpolation_factor` is given, but not `interpolate_on_integral`, set the default option `False`
- check for the invalid combination `interpolate_on_integral = False` and `interpolation_method = "distribute"`

At the end of another three `if` statements inside the `for` loop, we end up at a cognitive complexity of 24.
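The three extra checks could look roughly like the sketch below. It is not our original code: the function name is hypothetical, and the checks are written as a separate function for readability rather than inside the `for` loop.

```python
from typing import Any, Dict


def apply_defaults_and_check_combination(params: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical sketch of the three additional checks described above."""
    checked = dict(params)  # copy, so the caller's dictionary is not mutated
    if "interpolation_factor" in checked:
        # Defaults only apply when an interpolation factor is given.
        checked.setdefault("interpolation_method", "linear")
        checked.setdefault("interpolate_on_integral", False)
    if (
        checked.get("interpolate_on_integral") is False
        and checked.get("interpolation_method") == "distribute"
    ):
        raise ValueError(
            "interpolation_method 'distribute' cannot be combined with "
            "interpolate_on_integral=False"
        )
    return checked


print(apply_defaults_and_check_combination({"interpolation_factor": 2}))
```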

## Pydantic to the rescue

We might consider using a pydantic model for the input validation.

### Minimal start

We can start out with the simplest form of a pydantic model, with field types:

```python
from pydantic import BaseModel


class InterpolationSetting(BaseModel):
    interpolation_factor: int
    interpolation_method: str
    interpolate_on_integral: bool
```

Pydantic models are simply classes inheriting from the `BaseModel` class. We can create an instance of the new class as:

```python
InterpolationSetting(
    interpolation_factor=2,
    interpolation_method="linear",
    interpolate_on_integral=True,
)
```

This automatically does two of the checks we had implemented:

- `interpolation_factor` is an `int`
- `interpolate_on_integral` is a `bool`
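Besides checking types, pydantic converts compatible input to the declared type where it can. The small sketch below repeats the model so that it runs on its own, and passes the factor as a numeric string.

```python
from pydantic import BaseModel


class InterpolationSetting(BaseModel):
    interpolation_factor: int
    interpolation_method: str
    interpolate_on_integral: bool


setting = InterpolationSetting(
    interpolation_factor="3",  # a numeric string rather than an int
    interpolation_method="linear",
    interpolate_on_integral=True,
)

# The value stored on the model is an int, not the original string.
print(type(setting.interpolation_factor))
```

This is convenient when values arrive as text, but it also means that the model may hold a different type than the one we passed in.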

In the original script, the fields are in fact optional, i.e., it is possible to provide no interpolation settings, in which case we do not do interpolation. We will set the fields to optional later, and then implement the additional necessary checks.

We can verify the checks we have enforced now by supplying non-valid input:

```python
from pydantic import ValidationError

try:
    InterpolationSetting(
        interpolation_factor="text",
        interpolation_method="linear",
        interpolate_on_integral=True,
    )
except ValidationError as e:
    print(e)
```

which outputs:

```
1 validation error for InterpolationSetting
interpolation_factor
  value is not a valid integer (type=type_error.integer)
```

Pydantic raises a `ValidationError` when the validation of the model fails, stating which *field*, i.e. attribute, raised the error and why. In this case `interpolation_factor` raised a type error because the value `"text"` is not a valid integer. The validation is performed on instantiation of an `InterpolationSetting` object.
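If we want to handle the errors in code rather than print them, the exception also exposes them as a list of dictionaries through `errors()`. The sketch below repeats the minimal model so that it is self-contained.

```python
from pydantic import BaseModel, ValidationError


class InterpolationSetting(BaseModel):
    interpolation_factor: int
    interpolation_method: str
    interpolate_on_integral: bool


errors = []
try:
    InterpolationSetting(
        interpolation_factor="text",
        interpolation_method="linear",
        interpolate_on_integral=True,
    )
except ValidationError as e:
    errors = e.errors()

for error in errors:
    # "loc" is a tuple naming the failing field, "msg" the human-readable reason.
    print(error["loc"], error["msg"])
```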

### Validation of single fields and combinations of fields

Our original code also had some additional requirements:

- `interpolation_factor` should be greater than or equal to two.
- `interpolation_method` must be chosen from a set of valid methods.
- We do not allow the combination of `interpolate_on_integral=False` and `interpolation_method="distribute"`

The first restriction can be implemented using pydantic types. Pydantic provides many different types; we will use a constrained type for this requirement, namely `conint`, a constrained integer type providing automatic restrictions such as lower limits.

The remaining two restrictions can be implemented as *validators*. We decorate our validation functions with the `validator` decorator. The input argument to the validator decorator is the name of the attribute(s) to perform the validation for.

All validators are run automatically when we instantiate an object of the `InterpolationSetting` class, as for the type checking.

Our validation functions are class methods, and the first argument is the class, not an instance of the class. The second argument is the value to validate, and can be named as we wish. We implement two validators, `method_is_valid` and `valid_combination_of_method_and_on_integral`:

```python
from typing import Dict

from pydantic import BaseModel, conint, validator, root_validator


class InterpolationSetting(BaseModel):
    interpolation_factor: conint(gt=1)
    interpolation_method: str
    interpolate_on_integral: bool

    @validator("interpolation_method")
    def method_is_valid(cls, method: str) -> str:
        allowed_set = {"repeat", "distribute", "linear", "cubic", "akima"}
        if method not in allowed_set:
            raise ValueError(f"must be in {allowed_set}, got '{method}'")
        return method

    @root_validator()
    def valid_combination_of_method_and_on_integral(cls, values: Dict) -> Dict:
        on_integral = values.get("interpolate_on_integral")
        method = values.get("interpolation_method")
        if on_integral is False and method == "distribute":
            raise ValueError(
                f"Invalid combination of interpolation_method "
                f"{method} and interpolate_on_integral {on_integral}"
            )
        return values
```

There are a few things to note here:

- Validators should **return a validated value**. The validators are run sequentially, and populate the fields of the data model if they are valid.
- Validators should **only raise** `ValueError`, `TypeError` or `AssertionError`. Pydantic will catch these errors to populate the `ValidationError` and raise one exception regardless of the number of errors found in validation. You can read more about *error handling* in the docs.
- When we **validate a field against another**, we can use the `root_validator`, which runs validation on the entire model. Root validators are a little different: they have access to the `values` argument, which is a dictionary containing *all fields that have already been validated*. When the root validator runs, the `interpolation_method` may have failed to validate, in which case it will not be added to the `values` dictionary. Here, we handle that by using `values.get("interpolation_method")`, which returns `None` if the key is not in `values`. The docs contain more information on root validators and field ordering, which is important to consider when we are using the `values` dictionary.
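The last point is easy to check with a small experiment. The sketch below uses a hypothetical model `Demo` whose root validator only records which keys it receives. It assumes the pydantic 1.x API used throughout this post, and the import falls back to `pydantic.v1`, where pydantic 2.x ships that API.

```python
from typing import Dict, List

try:
    # pydantic 2.x ships the 1.x API used in this post under pydantic.v1
    from pydantic.v1 import BaseModel, ValidationError, root_validator, validator
except ImportError:
    from pydantic import BaseModel, ValidationError, root_validator, validator

seen_keys: List[str] = []  # records which fields reached the root validator


class Demo(BaseModel):
    """Hypothetical model, only for inspecting the values dictionary."""

    interpolation_method: str
    interpolate_on_integral: bool

    @validator("interpolation_method")
    def method_is_valid(cls, method: str) -> str:
        if method not in {"linear", "distribute"}:
            raise ValueError(f"got '{method}'")
        return method

    @root_validator()
    def record_validated_fields(cls, values: Dict) -> Dict:
        seen_keys.extend(sorted(values))
        return values


try:
    Demo(interpolation_method="typo", interpolate_on_integral=False)
except ValidationError:
    pass

# Only the field that passed validation was handed to the root validator.
print(seen_keys)
```

Because `interpolation_method` failed its own validator, it never appears in `values`, which is why indexing with `values["interpolation_method"]` inside a root validator would raise a `KeyError`.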

Again, we can verify by choosing input parameters to trigger the errors:

```python
from pydantic import ValidationError

try:
    InterpolationSetting(
        interpolation_factor=1,
        interpolation_method="distribute",
        interpolate_on_integral=False,
    )
except ValidationError as e:
    print(e)
```

which outputs:

```
2 validation errors for InterpolationSetting
interpolation_factor
  ensure this value is greater than 1 (type=value_error.number.not_gt; limit_value=1)
__root__
  Invalid combination of interpolation_method distribute and interpolate_on_integral False (type=value_error)
```

As we see, pydantic raises a single `ValidationError` regardless of the number of `ValueError`s raised in our model.

### Implementing dynamic defaults

We also had some default values if certain parameters were not given:

- If an `interpolation_factor` is given, set the default value `linear` for `interpolation_method` if none is given.
- If an `interpolation_factor` is given, set the default value `False` for `interpolate_on_integral` if none is given.

In this case, we have dynamic defaults dependent on other fields.

This can also be achieved with root validators, by returning a conditional value. As this means validating one field against another, we must take care to ensure our code runs whether or not the two fields have passed validation and been added to the `values` dictionary. We will now also use `Optional` types, because we will handle the cases where not all values are provided. We add the new validators `set_method_given_interpolation_factor` and `set_on_integral_given_interpolation_factor`:

```python
from typing import Dict, Optional

from pydantic import BaseModel, conint, validator, root_validator


class InterpolationSetting(BaseModel):
    # gt=1 keeps the original requirement that the factor is at least 2
    interpolation_factor: Optional[conint(gt=1)]
    interpolation_method: Optional[str]
    interpolate_on_integral: Optional[bool]

    @validator("interpolation_method")
    def method_is_valid(cls, method: Optional[str]) -> Optional[str]:
        allowed_set = {"repeat", "distribute", "linear", "cubic", "akima"}
        if method is not None and method not in allowed_set:
            raise ValueError(f"must be in {allowed_set}, got '{method}'")
        return method

    @root_validator()
    def valid_combination_of_method_and_on_integral(cls, values: Dict) -> Dict:
        on_integral = values.get("interpolate_on_integral")
        method = values.get("interpolation_method")
        if on_integral is False and method == "distribute":
            raise ValueError(
                f"Invalid combination of interpolation_method "
                f"{method} and interpolate_on_integral {on_integral}"
            )
        return values

    @root_validator()
    def set_method_given_interpolation_factor(cls, values: Dict) -> Dict:
        factor = values.get("interpolation_factor")
        method = values.get("interpolation_method")
        if method is None and factor is not None:
            values["interpolation_method"] = "linear"
        return values

    @root_validator()
    def set_on_integral_given_interpolation_factor(cls, values: Dict) -> Dict:
        on_integral = values.get("interpolate_on_integral")
        factor = values.get("interpolation_factor")
        if on_integral is None and factor is not None:
            values["interpolate_on_integral"] = False
        return values
```

We can verify that the default values are set only when `interpolation_factor` is provided. Running `InterpolationSetting(interpolation_factor=3)` returns:

```
InterpolationSetting(interpolation_factor=3, interpolation_method='linear', interpolate_on_integral=False)
```

whereas supplying no input parameters, `InterpolationSetting()`, returns a data model with all parameters set to `None`:

```
InterpolationSetting(interpolation_factor=None, interpolation_method=None, interpolate_on_integral=None)
```

**Note**: If we have static defaults, we can simply set them for the fields:

```python
class InterpolationSetting(BaseModel):
    interpolation_factor: Optional[int] = 42
```

### Final safeguard against typos

Finally, we had one more check in our previous script: that no unknown keys were provided. If we provide unknown keys to our data model now, nothing really happens. For example, `InterpolationSetting(hello="world")` outputs:

```
InterpolationSetting(interpolation_factor=None, interpolation_method=None, interpolate_on_integral=None)
```

Often, an unknown field name is the result of a typo in the `toml` file. Therefore we want to raise an error to alert the user. We do this using the model config, which controls the behaviour of the model. The `extra` attribute of the config determines what we do with extra fields. The default is `ignore`, which we can see in the example above: the field is ignored and not added to the model, as it would be with the option `allow`. We can use the `forbid` option to raise an exception when extra fields are supplied.

```python
from typing import Dict, Optional

from pydantic import BaseModel, conint, validator, root_validator


class InterpolationSetting(BaseModel):
    # gt=1 keeps the original requirement that the factor is at least 2
    interpolation_factor: Optional[conint(gt=1)]
    interpolation_method: Optional[str]
    interpolate_on_integral: Optional[bool]

    class Config:
        extra = "forbid"

    @validator("interpolation_method")
    def method_is_valid(cls, method: Optional[str]) -> Optional[str]:
        allowed_set = {"repeat", "distribute", "linear", "cubic", "akima"}
        if method is not None and method not in allowed_set:
            raise ValueError(f"must be in {allowed_set}, got '{method}'")
        return method

    @root_validator()
    def valid_combination_of_method_and_on_integral(cls, values: Dict) -> Dict:
        on_integral = values.get("interpolate_on_integral")
        method = values.get("interpolation_method")
        if on_integral is False and method == "distribute":
            raise ValueError(
                f"Invalid combination of interpolation_method "
                f"{method} and interpolate_on_integral {on_integral}"
            )
        return values

    @root_validator()
    def set_method_given_interpolation_factor(cls, values: Dict) -> Dict:
        factor = values.get("interpolation_factor")
        method = values.get("interpolation_method")
        if method is None and factor is not None:
            values["interpolation_method"] = "linear"
        return values

    @root_validator()
    def set_on_integral_given_interpolation_factor(cls, values: Dict) -> Dict:
        on_integral = values.get("interpolate_on_integral")
        factor = values.get("interpolation_factor")
        if on_integral is None and factor is not None:
            # Set the default on the missing field, not on the factor itself.
            values["interpolate_on_integral"] = False
        return values
```

If we try again with an unknown key, we now get a `ValidationError`:

```python
from pydantic import ValidationError

try:
    InterpolationSetting(hello=True)
except ValidationError as e:
    print(e)
```

This raises a validation error for the unknown field:

```
1 validation error for InterpolationSetting
hello
  extra fields not permitted (type=value_error.extra)
```
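For contrast, the sketch below compares the `allow` and `ignore` options on two hypothetical single-field models. It assumes the pydantic 1.x style `Config` class used in this post, and the import falls back to `pydantic.v1`, where pydantic 2.x ships that API.

```python
try:
    # pydantic 2.x ships the 1.x API used in this post under pydantic.v1
    from pydantic.v1 import BaseModel
except ImportError:
    from pydantic import BaseModel


class AllowExtra(BaseModel):
    """Hypothetical model that keeps unknown fields."""

    interpolation_factor: int = 2

    class Config:
        extra = "allow"


class IgnoreExtra(BaseModel):
    """Hypothetical model that silently drops unknown fields (the default)."""

    interpolation_factor: int = 2

    class Config:
        extra = "ignore"


allowed = AllowExtra(hello="world")
ignored = IgnoreExtra(hello="world")

print(allowed.dict())  # the unknown key is kept on the model
print(ignored.dict())  # the unknown key is dropped
```

Neither option alerts the user to a misspelled key, which is why we use `forbid` for settings files.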

## Adapting our existing code is easy

Now we have implemented all our checks, and can go on to adapt our existing code to use the new data model. In our original implementation, we would do something like

```python
params_in = toml.load(path_to_settings_file)
params_validated = validate_input_settings(params_in)
interpolate_result(params_validated)
```

We can replace the call to `validate_input_settings` with instantiation of the pydantic model, unpacking the dictionary into keyword arguments: `params_validated = InterpolationSetting(**params_in)`. Each pydantic data model has a `.dict()` method that returns the parameters as a dictionary, so we can use it in the input argument to `interpolate_result` directly: `interpolate_result(params_validated.dict())`. Another option is to refactor `interpolate_result` to use the attributes of the `InterpolationSetting` objects, such as `params_validated.interpolation_method`, instead of the values of a dictionary.
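Put together, the adapted flow could look like the sketch below. The body of `interpolate_result` is a hypothetical stand-in, a plain dictionary replaces the result of `toml.load`, and the model is the minimal version from earlier so that the example runs on its own. The import falls back to `pydantic.v1`, where pydantic 2.x ships the 1.x API used in this post.

```python
from typing import Dict

try:
    # pydantic 2.x ships the 1.x API used in this post under pydantic.v1
    from pydantic.v1 import BaseModel
except ImportError:
    from pydantic import BaseModel


class InterpolationSetting(BaseModel):
    interpolation_factor: int
    interpolation_method: str
    interpolate_on_integral: bool


def interpolate_result(params: Dict) -> str:
    """Hypothetical stand-in for our interpolation function."""
    return f"{params['interpolation_method']} x{params['interpolation_factor']}"


# Stands in for toml.load(path_to_settings_file)
params_in = {
    "interpolation_factor": 2,
    "interpolation_method": "linear",
    "interpolate_on_integral": True,
}

# Keys of the dictionary are unpacked into keyword arguments of the model.
params_validated = InterpolationSetting(**params_in)
result = interpolate_result(params_validated.dict())
print(result)
```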

## Conclusion

In the end, we can replace one 43 line method (for the full functionality) and cognitive complexity of 24 with one 40 line class containing six methods, each with cognitive complexity less than 4. The pydantic data models will not necessarily be shorter than the custom validation code they replace, and since there are a few quirks and concepts to pay attention to, they are not necessarily easier to read at the first try.

However, as we use the library for validation in our APIs, we are getting familiar with it, and the models are becoming easier for us to understand.

Some of the benefits of using `pydantic` for this are:

- Type checking (and in fact also some type conversion), which we previously did ourselves, is now done automatically for us, saving us the work of repeating lines of error-prone code in many different functions.
- Each validator has a name which, if we put a little thought into it, makes it very clear what we are trying to achieve. In our previous example, the purpose of each nested condition had to be deduced from the many if clauses and error messages. This should be a lot clearer now, especially if we use `pydantic` across different projects.
- If speed is important, pydantic’s benchmarks show that it is fast compared to similar libraries.

Hopefully this will help you determine whether or not you should consider using pydantic models in your projects. In a later post, we will show how we use pydantic data models to share metadata between machine learning applications, and to share data requirements between the applications.
