Hi, I’m Patrick, a Senior Data Engineer at Statnett. I’m happy to present some of our work that has proven useful recently: automatic validation of data quality.
We have created the Data Quality Validation Tool (DQVT), which helps us define the content of our datasets by testing it against a set of expectations on a regular basis. It is built on top of some cool open-source packages: Great Expectations, streamlit, FastAPI and D-Tale.
In this post, I will explain what DQVT actually does, and why we built it the way we did. But first, let me just mention why Statnett takes data quality so seriously.
Monitor your data assets
History has showed us that cascading blackouts of the power grid can result from a single failure, often caused by extreme weather conditions or a defective component. Statnett and other transmission system operators (TSOs) learn continuously from these failures, adapt to them and prepare against them in case these physical assets fail again. This is probably also true in your job as well. Everyone experiences failures, but not everyone is prepared.
Data quality is important in the same way. Not very long ago, data could be mere logs, archived in case you might need to dig into it once in a while. Today, complex automated flows of information are crucial in our decision processes. Just like defective physical assets, we need to understand that, at some point, unexpected data may break data pipelines, possibly with huge consequences. Statnett operates critical infrastructure for an entire country, and in this context, high-quality data isn’t just gold, it is necessary.
Always know what to expect from your data
The motto of Great Expectations hints at a basic, but beautiful principle. You prepare against data surprises by testing and documenting your data. And when data surprises do arise, you want to get notified quickly, and trigger a plan B, such as switching to an alternative data pipeline.
By analyzing your data, you can often figure out what kind of values (formats, ranges, rules etc.) you are supposed to get in the usual conditions, and how this might change over time. This data knowledge allows you to test periodically that you always get what you expected.
So, a great principle, and a great package. How did we make this work at Statnett?
Understanding what DQVT is
Like many organisations, Statnett uses lots of different data sources, some well known (Oracle/PostgreSQL databases, Kafka Streams, …) and others more domain-specific (IBM Big SQL instance, e-terra data platform, …). Needless to say, a concequence of this diversity is the abundance of data quality horror stories.
In order to understand our issues with data and improve the quality of our datasets, we wanted a dedicated tool able to
- profile and document the content of datasets stored in different data sources
- check the data periodically
- identify mismatch between the data and what we expect from it and
- help us include data quality checks in our data pipelines
So we built the Data Quality Validation Tool (DQVT).
It is not a data catalog. Rather, it aims at documenting what the content of a dataset is expected to look like. DQVT helps us define tests on the data, called expectations, which are turned into documentation (thanks to Great Expectations). DQVT validates these expectations on a regular basis and reports any mismatch between the data and its documentation. Finally, DQVT computes scores on data quality metrics defined through our internal data standard.
By filling these roles, DQVT takes us towards better data quality, and consequently also more reliable and more performant software systems.
The story of DQVT
Faced with several high-profile digitalization projects, Statnett recently ramped up its data quality initiatives. At the time, Python Bytes presented Great Expectations on episode #115 (we highly recommend this podcast if you are a Pythonista🐍).
We tested Great Expectations and became fans pretty quickly, for several reasons:
- the simplicity of use: a command line interface providing guidance, supporting various types of SQL databases, Pandas and Spark.
- a beautiful concept in line with development best practices (documentation-as-code). In the words of Great Expectations, tests are docs and docs are tests.
- an extremely detailed user documentation
- and an active and inclusive open source community (Slack, Discuss)
We were interested to see if this tool could help us monitor data quality on our own infrastructure at Statnett, which includes two particularly important platforms. We use the GitLab devops platform to host our code and provide continuous integration and deployment pipelines, and we use OpenShift as our on-premises Platform-as-a-Service to run and orchestrate our Docker containers, managed by Kubernetes.
The time came to build a proof-of-concept, and we started lean: a limited amount of time and resources to reduce technology risks. The main goals and scope were revolved aroupnd a handful of features and requirements:
The goal of our first demo was to document the content of our datasets, not what the columns and fields of a table are (that is the job of a data catalog), but what was expected from the values in these fields. We were also keen on having this documentation human-readable and to be kept automatically up-to-date. Finally, we wanted to get notified when data expectations were not met, inticating either problems in the data, or that our expectations needed adjustments.
At the time, we weren’t sure how we would deploy validations on a schedule, or whether Great Expectations would be able to fetch data from our Big Data Lake (an IBM Big SQL), which is a high performance massively parallel processing (MPP) SQL engine for Hadoop. Failing in any of these integrations would have ended the experiment.
Despite having to do a small hack to connect to our Big Data Lake, we were able to have our data quality validations run periodically on OpenShift in less than a month! 🎉
At the end of the Python Bytes episode, host Brian Okken wonders how data engineers might include the Great Expectations tool in their data pipelines. I will be back soon to show you how to do just that! I’m creating a tutorial that details the individual steps and technologies we use in DQVT, but the structure of DQVT is quite simple, so you would likely be able to reproduce it on your own infrastructure.
And if you have some experience of your own or are just curious to learn more, you’re more than welcome to leave a comment!