Cimsparql: Loading power system data into pandas dataframes in Python

In 2019, we started working on a model that should be able to handle intra-zonal constraints in the upcoming balancing market. That methodology has been presented in a previous post in January 2022. In this post, we will focus on an open source Python library called cimsparql that we have developed to support this model. For the model to be able to perform any analysis, it needs data that describe the state of the power system. At Statnett, these data are available as CIM (Common Information Model) profiles. The data is made available through a triple store (GraphDB/Blazegraph/rdf4j), using a resource description framework which is a standard model for data interchange.

The information about the power system available in these CIM profiles can be used for different purposes, and what information should be extracted depends on the requirement of your model. In the previously presented post, a DC optimal power flow model is used. Thus we need data on generation, demand and transmission lines. The purpose of the cimsparql package is to extract this information from the triple store, through a set of predefined sparql queries, and loading them into Python as pandas dataframes. Cimsparql will also make sure that columns in the dataframes have correct types, either string, float or integer, as defined by the CIM standard.

Cimsparql uses the SPARQLWrapper library to remotely execute sparql queries, and extends it with extra functionality, assuming the data conform to the CIM standard. Even though the package is an important part of the balancing market model, it is open source available from github and can be installed using pip.

~/pip install cimsparql

Once the library is installed, it must be configured to query a triple store using the ServiceConfig class in cimsparql.graphdb. The example below assumes you have a graphdb server with a CIM model in a repository called “micro_t1_nl”. This test case, available at the cimsparql repository on github, is used to test the development of the predefined queries.

  >>> service_cfg = ServiceConfig(repo="micro_t1_nl")
  >>> model = get_cim_model(service_cfg)

If you need to provide other configurations such as server, username and password, this can be done with the same ServiceConfig class.

Once the model is configured, the data can be loaded into a pandas dataframe using the predefined queries. In the example below, topological node information is extracted from the triple store.

>>> bus = model.bus_data()
>>> print(bus.to_string())
                                           busname      un
795a117d-7caf-4fc2-a8d9-dc8f4cf2344a  NL_Busbar__4  220.00
6bdc33de-d027-49b7-b98f-3b3d87716615   N1230822413   15.75
81b0e447-181e-4aec-8921-f1dd7813bebc   N1230992195  400.00
afddd60d-f7e6-419a-a5c2-be28d29beaf9   NL-Busbar_2  220.00
97d7d14a-7294-458f-a8d7-024700a08717    NL_TR_BUS2   15.75

Here the values in the nominal voltage column has been converted to float values as defined by the CIM standard, while node and bus names are strings.

All the predefined queries can be executed using the cimsparql.model.CimModel class. Examples are the already shown bus_data as well as loads, synchronous_machines, ac_lines and coordinates. The latter extracts coordinates of all equipment in the model from the CIM Geographical Location profile. Cimsparql orders the rows in the dataframe such that it is straightforward to use with plotly’s map functionality. The example below was made in a Jupyter notebook.

df = model.coordinates()
lines = df.query("rdf_type == ''")
stations = df.query("rdf_type == ''")
center_x, center_y = df["x"].mean(), df["y"].mean()

fig = px.line_mapbox(lines, lon="x", lat="y", color="mrid", height=1000)
fig2 = px.scatter_mapbox(stations, lon="x", lat="y", color="mrid", size=[1]*len(stations))
fig.update_geos(countrycolor="black", showcountries=True, showlakes=True, showrivers=True, fitbounds="locations")

all_fig = go.Figure( +, layout = fig.layout)
AC line segments and substations included in the model

The main goal of cimsparql is to read data for the purpose of running power flow analysis using sparql queries to read data from triple store into pandas dataframes in Python. Currently the package is used internally at Statnett, where we also have some data which is yet not covered by the CIM standard. Thus some of the queries contains a namespace which will probably only be used by Statnett. However, this should not pose any problem for the use of this package elsewhere, as these namespaces or columns have been made optional. So any query towards a data set that does not contain these, will just produce a column for the given namespace with NaN values.

The package can also be uses in cases where the predefined queries does not produce data for a specific purpose. In this case, the user can provide their own queries as a string argument to the get_table_and_convert method. The example below list out the numbers of ac line segments for each voltage level in your data.

>>> query='''
PREFIX cim: <>
PREFIX rdf: <>
select ?un (count(?mrid) as ?n) where { 
?mrid rdf:type cim:ACLineSegment;
   cim:ConductingEquipment.BaseVoltage/cim:BaseVoltage.nominalVoltage ?un.
} group by ?un'''
>>> df = model.get_table_and_convert(query)

So to summarize, the main contribution of cimsparl is a set of predefined queries for the purpose of running power flow simulations and type conversion of data that follows the CIM standard.

Automatic data quality validations with Great Expectations: An Introduction to DQVT

Hi, I’m Patrick, a Senior Data Engineer at Statnett. I’m happy to present some of our work that has proven useful recently: automatic validation of data quality.

We have created the Data Quality Validation Tool (DQVT), which helps us define the content of our datasets by testing it against a set of expectations on a regular basis. It is built on top of some cool open-source packages: Great Expectations, streamlit, FastAPI and D-Tale.

In this post, I will explain what DQVT actually does, and why we built it the way we did. But first, let me just mention why Statnett takes data quality so seriously.

Monitor your data assets

History has showed us that cascading blackouts of the power grid can result from a single failure, often caused by extreme weather conditions or a defective component. Statnett and other transmission system operators (TSOs) learn continuously from these failures, adapt to them and prepare against them in case these physical assets fail again. This is probably also true in your job as well. Everyone experiences failures, but not everyone is prepared.

Data quality is important in the same way. Not very long ago, data could be mere logs, archived in case you might need to dig into it once in a while. Today, complex automated flows of information are crucial in our decision processes. Just like defective physical assets, we need to understand that, at some point, unexpected data may break data pipelines, possibly with huge consequences. Statnett operates critical infrastructure for an entire country, and in this context, high-quality data isn’t just gold, it is necessary.

Always know what to expect from your data

The motto of Great Expectations hints at a basic, but beautiful principle. You prepare against data surprises by testing and documenting your data. And when data surprises do arise, you want to get notified quickly, and trigger a plan B, such as switching to an alternative data pipeline.

By analyzing your data, you can often figure out what kind of values (formats, ranges, rules etc.) you are supposed to get in the usual conditions, and how this might change over time. This data knowledge allows you to test periodically that you always get what you expected.

So, a great principle, and a great package. How did we make this work at Statnett?

Understanding what DQVT is

Like many organisations, Statnett uses lots of different data sources, some well known (Oracle/PostgreSQL databases, Kafka Streams, …) and others more domain-specific (IBM Big SQL instance, e-terra data platform, …). Needless to say, a concequence of this diversity is the abundance of data quality horror stories.

In order to understand our issues with data and improve the quality of our datasets, we wanted a dedicated tool able to

  1. profile and document the content of datasets stored in different data sources
  2. check the data periodically
  3. identify mismatch between the data and what we expect from it and
  4. help us include data quality checks in our data pipelines

So we built the Data Quality Validation Tool (DQVT).

It is not a data catalog. Rather, it aims at documenting what the content of a dataset is expected to look like. DQVT helps us define tests on the data, called expectations, which are turned into documentation (thanks to Great Expectations). DQVT validates these expectations on a regular basis and reports any mismatch between the data and its documentation. Finally, DQVT computes scores on data quality metrics defined through our internal data standard.

By filling these roles, DQVT takes us towards better data quality, and consequently also more reliable and more performant software systems.

The story of DQVT

Faced with several high-profile digitalization projects, Statnett recently ramped up its data quality initiatives. At the time, Python Bytes presented Great Expectations on episode #115 (we highly recommend this podcast if you are a Pythonista🐍).

We tested Great Expectations and became fans pretty quickly, for several reasons:

  • the simplicity of use: a command line interface providing guidance, supporting various types of SQL databases, Pandas and Spark.
  • a beautiful concept in line with development best practices (documentation-as-code). In the words of Great Expectations, tests are docs and docs are tests.
  • an extremely detailed user documentation
  • and an active and inclusive open source community (Slack, Discuss)

We were interested to see if this tool could help us monitor data quality on our own infrastructure at Statnett, which includes two particularly important platforms. We use the GitLab devops platform to host our code and provide continuous integration and deployment pipelines, and we use OpenShift as our on-premises Platform-as-a-Service to run and orchestrate our Docker containers, managed by Kubernetes.

The time came to build a proof-of-concept, and we started lean: a limited amount of time and resources to reduce technology risks. The main goals and scope were revolved aroupnd a handful of features and requirements:

The goal of our first demo was to document the content of our datasets, not what the columns and fields of a table are (that is the job of a data catalog), but what was expected from the values in these fields. We were also keen on having this documentation human-readable and to be kept automatically up-to-date. Finally, we wanted to get notified when data expectations were not met, inticating either problems in the data, or that our expectations needed adjustments.

At the time, we weren’t sure how we would deploy validations on a schedule, or whether Great Expectations would be able to fetch data from our Big Data Lake (an IBM Big SQL), which is a high performance massively parallel processing (MPP) SQL engine for Hadoop. Failing in any of these integrations would have ended the experiment.

Despite having to do a small hack to connect to our Big Data Lake, we were able to have our data quality validations run periodically on OpenShift in less than a month! 🎉

What’s next?

At the end of the Python Bytes episode, host Brian Okken wonders how data engineers might include the Great Expectations tool in their data pipelines. I will be back soon to show you how to do just that! I’m creating a tutorial that details the individual steps and technologies we use in DQVT, but the structure of DQVT is quite simple, so you would likely be able to reproduce it on your own infrastructure.

And if you have some experience of your own or are just curious to learn more, you’re more than welcome to leave a comment!

%d bloggers like this: