The green transition: Why the availability of power system data matters

The green transition has overwhelmed the power grid connection process for new customers. Swift and cost-effective connection relies on the expertise of power system analysts. To achieve this, Statnett believes we must enable analysts to adopt a data-driven approach akin to that of data scientists.

In 2019, Statnett therefore established the FRIDA program with a cross-disciplinary product team whose objective was to empower analysts by making it much easier to find and use power system data for analysis. In this post we’ll present how the team accomplished this task. The first part focuses on the current industry challenge faced by Statnett and network operators around the world. The second part presents the tools and processes that were created to tackle it.

Overwhelming demand for grid capacity poses challenges for the power system

Statnett holds three key roles in the power system: grid owner, system operator, and transmission system network planner. We currently face challenges in:

  • Grid Connection Requests: An increasing number of requests from power producers and industrial projects, such as battery factories and data centers, demand timely responses, proper guidance for optimal project planning, and accelerated grid connection processes.
  • High Power System Utilization: During certain times of the year or when components are disconnected, the power system operates at maximum capacity. Consequently, optimizing incident management, maintenance planning, and disconnections is becoming increasingly vital.

Overall, Statnett’s decisions and services significantly impact both our customers’ value creation and the security of supply. Even minor improvements can yield considerable value.

Power system analysts require access to comprehensive amounts of data

To achieve these improvements, Statnett’s power system engineers need easy access to reliable information from various areas of the organization. They analyze vast quantities of data, including:

  • Graph model: The Common Information Model (CIM) is a comprehensive graph model that details all relevant components within the power system. It outlines connections, electrical attributes, and metadata, such as ownership. However, its complexity, with millions of triplets and hundreds of data classes, requires a level of IT knowledge that most analysts don’t have, which prevents efficient data exploration.
  • Time series: Long-term data, especially extreme values, is crucial to identify stress scenarios in the system. Given the significant influence of weather on the Norwegian power system, analysts typically require a decade’s worth of data to build confidence. For most equipment Statnett receives measurements from sensors and estimates from a network solver about every ten seconds.
  • Events: Information on faults, maintenance, and special regulations affecting system utilization is vital, along with other market data and the cost of non-delivered energy that influences bottleneck-related costs.

Traditionally, this information was only accessible through source systems or data warehouses using reports. Analysts spent 30-50% of their time locating and extracting data and ensuring its quality. To get customers connected, the analysts should spend their time finding solutions, not data.

The biggest challenge is just the sheer volume of projects. There are only so many power engineers out there who can do the sophisticated studies we need to do to ensure the system stays reliable, and everyone else is trying to hire them, too

Ken Seiler, who leads system planning at PJM Interconnection. Source: Wind and Solar Energy Projects Risk Overwhelming America’s Antiquated Electrical Grids – The New York Times

The Power SDK makes power system data instantly available in Python

As a team we worked closely with the power system engineers to identify the areas and user stories that would be most useful to focus on. We mostly focused on creating a web app to meet their needs for finding components and studying observations related to them. This is further explained in the movie below. The movie is in Norwegian.

In the remaining part of this post, we will focus on what we did to meet our power system engineers’ wish to have power system data instantly available in Python. The Statnett Power SDK is a Python package built on top of Statnett’s data platform that makes it easier to retrieve data from multiple sources and analyse it to make better decisions.

The Power SDK was initially developed in cooperation with Cognite. Cognite Data Fusion is a part of Statnett’s data platform and the primary source of information in the Power SDK. You may find the original source code here. The Power SDK has since been developed further at Statnett. The Power SDK aims to address three common issues that made the initial user threshold too high:

  1. End users needed to connect to different databases and understand the data structure and format, which typically differed from database to database.
  2. Related data could be identified through code, but this required knowledge of graph queries and graph traversals, which most analysts lacked. For instance, finding all components that belong to the same price area.
  3. The standard SDK provided by the platform used unfamiliar terms and lacked functions for typical use cases within the power analysis domain.

Below we have provided examples of how the Power SDK addresses these issues. Let’s say that an analyst needs to know the power produced in price area NO5. With the Power SDK, she can query for this information in terms that match how she mentally navigates the power system.

To get all substations in a price area you may write:

substations = client.substations.list(bidding_area="Elspot NO5")

You may also specify the substations you want by name, voltage level, grid type and whether to include historic substations as well. You may even define areas based on the power lines entering the area. The area object will contain all the substations and power lines within the area.

From the area or list of substations you may find all connected components, such as generating units or power transformers:

generating_units = substations.generating_units()

From these components you may retrieve all time series to the generators. Here I retrieve all time series of ThreePhaseActivePower related to the generators:


To get events related to maintenance instead:


Once you have the time series or events, a user with basic Python skills can easily do powerful analysis. The example below shows how to find the total production in a power area over time. The area object referred to in the example could be defined by (with fictitious substation names in this example):

area = client.power_area(substations=['Substation1', 'Substation2']).expand_area(level=2)

or by specifying which power lines define the electrical area.

The code below finds all relevant time series associated with the generators in the power area defined earlier, summarizes these and plots hourly values over the past year.
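
The aggregation step described above can be sketched with plain pandas. This is a minimal illustration on synthetic data, not the Power SDK itself: the generator names, the random values and the 10-minute sampling are all assumptions, and in practice the per-generator series would come from the SDK.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for per-generator active power (MW).
# The generator names and values are hypothetical.
index = pd.date_range("2023-01-01", periods=6 * 24 * 7, freq="10min")
rng = np.random.default_rng(seed=0)
production = pd.DataFrame(
    rng.uniform(0, 100, size=(len(index), 3)),
    index=index,
    columns=["gen_a", "gen_b", "gen_c"],
)

# Sum across generators to get the total production in the area,
# then average within each hour to get hourly values.
total = production.sum(axis=1)
hourly_total = total.resample("1h").mean()

# hourly_total.plot() would draw the curve (requires matplotlib).
```

Summing before resampling keeps the calculation at the original resolution, so extreme values can still be inspected in `total` if the hourly mean hides them.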

The analysts need guidance and support to embrace new tools

Getting busy analysts to change how they work is challenging unless you provide easy to use tools with reliable access to support. Otherwise, the barrier to trying new ways of doing things can be too significant, especially in a busy work environment. Nevertheless, we believe that the workload does not have to be large to make a change.

To provide support, we have established a common chat for users to facilitate discussions with the team and other users. Both the team and expert users can quickly provide feedback, making it easier for users to explore new approaches. Moreover, we have set up JupyterHub, enabling users without Python experience to get started faster. This also allows departments to use notebooks as lightweight dashboards, facilitating the sharing of insights.

As a result, the adoption of our Power SDK has increased dramatically, and the service has also reduced the cost and time to market for integrating other data sources for analysts. For instance, the Entsoe-py package has become popular and reduced the load on our internal data platform team.

Furthermore, helping users who encounter challenges provides us with a better understanding of their needs and pain points, facilitating the development of our products. Through our dialogue with users, we can capture the value of what we do and share our knowledge with others. For example, feedback from a user states:

Previously, I would have spent a week extracting data from the data warehouse to investigate load and flow into and through the Oslo area. Now I can create good assumptions within 1 hour.

Power System Analyst in Statnett’s Regional Grid Planning department

The feedback we receive from users is useful for sharing knowledge and gaining support from management to continue our development efforts.

The analysts have used the tools to improve their methods

We are thrilled to see that analysts have a keen interest in using statistical methods to make better decisions. With our experience and expertise, we can increase the scope of data analysis and data science in Statnett, empowering analysts to generate valuable insights.

The Power SDK can also load weather data from Statnett’s substations. This figure shows the relationship between the peak hourly load per day and the average outdoor temperature. The figure was made by an analyst working in our long-term planning department and is used to assess the expected peak load in a region in Norway.

These assessments may have great economic and environmental impact as well: As the figure shows, the load is increasing at a lower rate as it gets colder. Compared to a linear trend, which was common historically, this leads to a lower expected peak demand and hence lower demand for grid capacity.
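
To make the comparison with a linear trend concrete, here is a small sketch on synthetic data. The load model, the temperature range and the design temperature of -25 °C are all made up for illustration, and a quadratic fit is just one simple way to capture a flattening curve, not necessarily what the analyst used.

```python
import numpy as np
import pandas as pd

# Synthetic hourly data for 60 days; all values are illustrative only.
rng = np.random.default_rng(seed=1)
hours = pd.date_range("2023-01-01", periods=24 * 60, freq="h")
daily_base = rng.uniform(-20, 10, size=60)  # one base temperature per day
temperature = pd.Series(
    np.repeat(daily_base, 24) + rng.normal(0, 1, size=len(hours)), index=hours
)
# Load rises as it gets colder, but at a decreasing rate.
load = 500 - 12 * temperature - 0.2 * temperature**2 + rng.normal(0, 5, len(hours))

# Peak hourly load per day against average daily temperature.
daily = pd.DataFrame(
    {
        "peak_load": load.resample("1D").max(),
        "mean_temp": temperature.resample("1D").mean(),
    }
)

# Fit a linear and a quadratic trend, then extrapolate both
# to a hypothetical cold design temperature.
linear = np.polyfit(daily["mean_temp"], daily["peak_load"], deg=1)
quadratic = np.polyfit(daily["mean_temp"], daily["peak_load"], deg=2)
design_temp = -25.0
peak_linear = np.polyval(linear, design_temp)
peak_quadratic = np.polyval(quadratic, design_temp)
```

On data shaped like this, the linear extrapolation gives a higher expected peak than the quadratic one, which is the effect described above.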

If you want more information or have ideas for cooperation, please don’t hesitate to contact us.

Cimsparql: Loading power system data into pandas dataframes in Python

In 2019, we started working on a model to handle intra-zonal constraints in the upcoming balancing market. That methodology was presented in a previous post in January 2022. In this post, we will focus on an open source Python library called cimsparql that we have developed to support this model. For the model to be able to perform any analysis, it needs data that describe the state of the power system. At Statnett, these data are available as CIM (Common Information Model) profiles. The data is made available through a triple store (GraphDB/Blazegraph/rdf4j) using the Resource Description Framework (RDF), which is a standard model for data interchange.

The information about the power system available in these CIM profiles can be used for different purposes, and what information should be extracted depends on the requirements of your model. The previously presented post uses a DC optimal power flow model, so we need data on generation, demand and transmission lines. The purpose of the cimsparql package is to extract this information from the triple store through a set of predefined SPARQL queries and load it into Python as pandas dataframes. Cimsparql also makes sure that the columns in the dataframes have the correct types, either string, float or integer, as defined by the CIM standard.

Cimsparql uses the SPARQLWrapper library to remotely execute SPARQL queries, and extends it with extra functionality, assuming the data conform to the CIM standard. Even though the package is an important part of the balancing market model, it is open source, available on GitHub, and can be installed using pip.

pip install cimsparql

Once the library is installed, it must be configured to query a triple store using the ServiceConfig class in cimsparql.graphdb. The example below assumes you have a GraphDB server with a CIM model in a repository called “micro_t1_nl”. This test case, available in the cimsparql repository on GitHub, is used to test the predefined queries during development.

  >>> from cimsparql.graphdb import ServiceConfig
  >>> from cimsparql.model import get_cim_model
  >>> service_cfg = ServiceConfig(repo="micro_t1_nl")
  >>> model = get_cim_model(service_cfg)

If you need to provide other configurations such as server, username and password, this can be done with the same ServiceConfig class.

Once the model is configured, the data can be loaded into a pandas dataframe using the predefined queries. In the example below, topological node information is extracted from the triple store.

>>> bus = model.bus_data()
>>> print(bus.to_string())
                                           busname      un
795a117d-7caf-4fc2-a8d9-dc8f4cf2344a  NL_Busbar__4  220.00
6bdc33de-d027-49b7-b98f-3b3d87716615   N1230822413   15.75
81b0e447-181e-4aec-8921-f1dd7813bebc   N1230992195  400.00
afddd60d-f7e6-419a-a5c2-be28d29beaf9   NL-Busbar_2  220.00
97d7d14a-7294-458f-a8d7-024700a08717    NL_TR_BUS2   15.75

Here the values in the nominal voltage column have been converted to float values as defined by the CIM standard, while node and bus names are strings.
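
To illustrate the kind of type conversion involved, here is a minimal sketch that turns a response in the SPARQL 1.1 JSON results format into a typed dataframe. It is not cimsparql's implementation: the `bindings_to_frame` helper is hypothetical, the response is hand-written to mimic two rows of the bus query above, and the types are taken from the datatype annotations of the literals as a simple stand-in for the CIM-based conversion.

```python
import pandas as pd

XSD = "http://www.w3.org/2001/XMLSchema#"

# Hand-written response in the SPARQL 1.1 JSON results format.
response = {
    "head": {"vars": ["busname", "un"]},
    "results": {
        "bindings": [
            {
                "busname": {"type": "literal", "value": "NL_Busbar__4"},
                "un": {"type": "literal", "datatype": XSD + "float", "value": "220.00"},
            },
            {
                "busname": {"type": "literal", "value": "NL_TR_BUS2"},
                "un": {"type": "literal", "datatype": XSD + "float", "value": "15.75"},
            },
        ]
    },
}

# Map XSD datatypes to Python converters; untyped literals stay strings.
CONVERTERS = {XSD + "float": float, XSD + "double": float, XSD + "integer": int}


def bindings_to_frame(result: dict) -> pd.DataFrame:
    """Convert SPARQL JSON bindings to a DataFrame with typed columns."""
    rows = []
    for binding in result["results"]["bindings"]:
        row = {}
        for var, cell in binding.items():
            convert = CONVERTERS.get(cell.get("datatype"), str)
            row[var] = convert(cell["value"])
        rows.append(row)
    return pd.DataFrame(rows, columns=result["head"]["vars"])


bus = bindings_to_frame(response)
```

Without such a step every value would arrive as a string, and numeric operations on the voltage column would fail or silently misbehave.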

All the predefined queries can be executed using the cimsparql.model.CimModel class. Examples are the already shown bus_data as well as loads, synchronous_machines, ac_lines and coordinates. The latter extracts coordinates of all equipment in the model from the CIM Geographical Location profile. Cimsparql orders the rows in the dataframe such that it is straightforward to use with plotly’s map functionality. The example below was made in a Jupyter notebook.

import plotly.express as px
import plotly.graph_objects as go

df = model.coordinates()
# Select line segments and stations by rdf_type (the type URIs are not shown here).
lines = df.query("rdf_type == ''")
stations = df.query("rdf_type == ''")
center_x, center_y = df["x"].mean(), df["y"].mean()

fig = px.line_mapbox(lines, lon="x", lat="y", color="mrid", height=1000)
fig2 = px.scatter_mapbox(stations, lon="x", lat="y", color="mrid", size=[1]*len(stations))
fig.update_geos(countrycolor="black", showcountries=True, showlakes=True, showrivers=True, fitbounds="locations")

# Combine the line and station traces in a single figure.
all_fig = go.Figure(data=fig.data + fig2.data, layout=fig.layout)

AC line segments and substations included in the model

The main goal of cimsparql is to use SPARQL queries to read data from a triple store into pandas dataframes in Python for the purpose of running power flow analysis. Currently the package is used internally at Statnett, where we also have some data that is not yet covered by the CIM standard. Thus some of the queries contain a namespace which will probably only be used by Statnett. However, this should not pose any problem for the use of this package elsewhere, as these namespaces and columns have been made optional. Any query against a data set that does not contain them will just produce a column of NaN values for the given namespace.

The package can also be used in cases where the predefined queries do not produce the data needed for a specific purpose. In this case, users can provide their own queries as a string argument to the get_table_and_convert method. The example below lists the number of AC line segments for each voltage level in your data.

>>> query='''
PREFIX cim: <>
PREFIX rdf: <>
select ?un (count(?mrid) as ?n) where { 
?mrid rdf:type cim:ACLineSegment;
   cim:ConductingEquipment.BaseVoltage/cim:BaseVoltage.nominalVoltage ?un.
} group by ?un'''
>>> df = model.get_table_and_convert(query)

So to summarize, the main contribution of cimsparql is a set of predefined queries for the purpose of running power flow simulations, together with type conversion of data that follows the CIM standard.

Is bid filtering effective against network congestion?

Earlier this year, I wrote an introduction to the bid filtering problem, and explained how my team at Statnett are trying to solve it. The system we’ve built at Statnett combines data from various sources in its attempt to make the right call. But how well is it doing its job? Or, more precisely, what is the effect on network congestion of applying our bid filtering system in its current form?

Kyoto. Photo: Belle Co

Without calling it a definitive answer, a paper I wrote for the CIGRE Symposium contains research results that provide new insight. The symposium was in Kyoto, but a diverse list of reasons (including a strict midwife) forced me to leave the cherry blossom to my imagination and test my charming Japanese phrases from a meeting room in Trondheim.

A quick recap

European countries are moving toward a new, more integrated way of balancing their power systems. In a country with highly distributed electricity generation, we want to automatically identify power reserves that should not be used in a given situation due to their location in the grid. If you would like to learn the details about the approach, you are likely to enjoy reading the paper. Here is the micro-version:

To identify bids in problematic locations, we need a detailed network model. We try to predict the future situation in the power grid, and then we apply a nodal market model which gives us the optimal plan for balancing activations in that specific situation. But since we don’t really know how much is going to flow into or out of the country, we optimize many times with different assumptions on cross-border flows. Each of the exchange scenarios tells its own story about which bids should, and shouldn’t, be activated. The scenarios don’t always agree, but in the aggregate they let us form a consensus result, determining which bids will be made unavailable for selection in the balancing market.
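
As a toy illustration of forming a consensus across scenarios, consider the sketch below. The post does not spell out the actual rule, so the simple majority vote, the 0.5 threshold and the made-up scenario results are all assumptions.

```python
import numpy as np

# Rows: exchange scenarios, columns: bids. True means that the scenario's
# optimal solution found the bid problematic. The values are made up.
flagged = np.array(
    [
        [True, False, False, True],
        [True, False, True, True],
        [True, False, False, False],
    ]
)

# Hypothetical rule: make a bid unavailable if more than half
# of the scenarios flag it.
threshold = 0.5
share_flagged = flagged.mean(axis=0)
unavailable = share_flagged > threshold
```

Here the first and last bids are flagged in a majority of the scenarios and would be filtered, while the third bid, flagged in only one of the three scenarios, stays available.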

An unfair competition

Today, human operators at Statnett select power reserves for activation when necessary to balance the system, always mindful of their locations in the grid and potential bottlenecks. Their decisions on which balancing bids to activate – and not activate – often build on years of operational experience and an abundance of real-time data.

Before discussing whether our machine can beat the human operators, it’s important to keep in mind that the bid filtering system will take part in a different context: the new balancing market, where everyday balancing will take place without the involvement of human operators. This will change the rules of the balancing game completely. While human operators constantly make a flow of integrated last-minute decisions, the new automatic processes are distinct in their separation of concerns and must often act much earlier to respect strict timelines.

Setting up simulations

The quantitative results in our paper come from simulating one day in the Norwegian power grid, using our detailed, custom-built Python model together with recorded data. The balancing actions, and the way they are selected, differ between the simulations.

The first simulation is Historical operation. Here, we simply replay the historical balancing decisions of the human operators.

The second simulation is Bid filtering. Here, we replace the historical human decisions with balancing actions selected by a zonal market mechanism that doesn’t see the internal network constraints or respect the laws of physics. The balancing decisions will often be different from the human ones in order to save some money. But before the market selects any bids, some of them are removed from the list by our bid filtering machine in order to prevent network congestion. We try not to cheat: the bid filtering takes place using data and forecasts available 30 minutes before the balancing actions take effect.

The third simulation is No filtering. Here we try to establish the impact on congestion of moving from today’s manual, but flexible operation to zonal, market-based balancing. This simulation is a parallel run of the market-based selection, but without pre-filtering any bids, and it provides a second, possibly more relevant benchmark.

Example from 09:30 on August 25, 2021. Red cells are balancing bids made unavailable in the bid filtering simulation. As a result, the market-based balancing will not select exactly the same bids in the Bid filtering scenario (black dots) and the No filtering scenario (white dots).

Power flow analyses

The interesting part of the simulation is when we inject the balancing decisions into the historical system state and calculate all power flows in the network. Comparing these flows to the operational limits reveals which balancing approaches are doing a better job at avoiding overloads in the network.

Example from 09:30 on August 25, 2021 showing reliability limits. Reliability limits in Norway restrict the flow on a combination of transmission lines, so-called Power Transfer Corridors (PTCs). These 13 PTC constraints are violated in one or more of the simulations.

The overloads are similar between the simulations, but they are not the same. To better understand the big picture, we created a congestion index that summarizes the resulting overload situation in a single value. The number doesn’t have any physical interpretation, but gives a relative indication of how severe the overload situation is.
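
The exact definition of the index is given in the paper rather than here. Purely as an illustration of how many overloads can be condensed into a single number, the sketch below sums the relative overload across all constraints. The `congestion_index` function, the formula and the numbers are assumptions, not the definition used in the paper.

```python
import numpy as np


def congestion_index(flows_mw: np.ndarray, limits_mw: np.ndarray) -> float:
    """Sum of relative overloads across all constraints (0 if none are violated)."""
    relative_overload = np.maximum(np.abs(flows_mw) - limits_mw, 0.0) / limits_mw
    return float(relative_overload.sum())


# Made-up flows and limits for three PTC constraints.
flows = np.array([950.0, 1200.0, 300.0])
limits = np.array([1000.0, 1000.0, 250.0])
index = congestion_index(flows, limits)
```

The first constraint is within its limit and contributes nothing, while the other two are each 20 % over theirs. An index of this kind is only meaningful when comparing simulations over the same set of constraints.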

Congestion index for reliability limits in the Norwegian system from August 25, 2021

When we run the simulation for 24 historical hours, we see that with market-based balancing, there would be overloads throughout the day. When we apply bid filtering and remove the bids expected to be problematic, overloads are reduced in 9 of the 24 hours, and we’re able to avoid the most serious problems in the afternoon.

No matter the balancing mechanism, the congestion index virtually never touches zero. Even the human operators with all their extra information and experience run into many of the same congestion problems. This shows that balancing activations play a role in the amount of congestion, but they are just one part of the story, along with several other factors.

With that in mind, if you’re going to let a zonal market mechanism decide your balancing decisions, it seems that bid filtering can have a clear, positive effect in reducing network overloads.

What do you think? Do you read the results differently? Don’t be afraid to get in touch. My team and I are always happy to discuss.


Automatic data quality validations with Great Expectations: An Introduction to DQVT

Hi, I’m Patrick, a Senior Data Engineer at Statnett. I’m happy to present some of our work that has proven useful recently: automatic validation of data quality.

We have created the Data Quality Validation Tool (DQVT), which helps us define the content of our datasets by testing it against a set of expectations on a regular basis. It is built on top of some cool open-source packages: Great Expectations, streamlit, FastAPI and D-Tale.

In this post, I will explain what DQVT actually does, and why we built it the way we did. But first, let me just mention why Statnett takes data quality so seriously.

Monitor your data assets

History has shown us that cascading blackouts of the power grid can result from a single failure, often caused by extreme weather conditions or a defective component. Statnett and other transmission system operators (TSOs) learn continuously from these failures, adapt to them and prepare in case these physical assets fail again. This is probably true in your job as well. Everyone experiences failures, but not everyone is prepared.

Data quality is important in the same way. Not very long ago, data could be mere logs, archived in case you might need to dig into it once in a while. Today, complex automated flows of information are crucial in our decision processes. Just like defective physical assets, we need to understand that, at some point, unexpected data may break data pipelines, possibly with huge consequences. Statnett operates critical infrastructure for an entire country, and in this context, high-quality data isn’t just gold, it is necessary.

Always know what to expect from your data

The motto of Great Expectations hints at a basic, but beautiful principle. You prepare against data surprises by testing and documenting your data. And when data surprises do arise, you want to get notified quickly, and trigger a plan B, such as switching to an alternative data pipeline.

By analyzing your data, you can often figure out what kind of values (formats, ranges, rules etc.) you are supposed to get in the usual conditions, and how this might change over time. This data knowledge allows you to test periodically that you always get what you expected.
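
The principle can be sketched in a few lines of plain pandas. This is deliberately not the Great Expectations API, just an illustration of the idea: the column names, the value range and the sample data are made up.

```python
import pandas as pd

# Made-up measurements: the second row has an implausible voltage
# and the third row is missing its station name.
data = pd.DataFrame(
    {
        "station": ["A", "B", None],
        "voltage_kv": [420.0, -5.0, 132.0],
    }
)

# Each expectation returns a boolean Series: True where the row conforms.
expectations = {
    "station is not null": lambda df: df["station"].notna(),
    "voltage_kv between 0 and 500": lambda df: df["voltage_kv"].between(0, 500),
}

# Validate and count the unexpected rows per expectation.
report = {name: int((~check(data)).sum()) for name, check in expectations.items()}
failed = {name: count for name, count in report.items() if count > 0}
```

Because each expectation has a readable name, the same dictionary can serve both as a test suite and as documentation of the dataset. A non-empty `failed` is the signal to notify someone or switch to a plan B.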

So, a great principle, and a great package. How did we make this work at Statnett?

Understanding what DQVT is

Like many organisations, Statnett uses lots of different data sources, some well known (Oracle/PostgreSQL databases, Kafka Streams, …) and others more domain-specific (IBM Big SQL instance, e-terra data platform, …). Needless to say, a consequence of this diversity is an abundance of data quality horror stories.

In order to understand our issues with data and improve the quality of our datasets, we wanted a dedicated tool able to

  1. profile and document the content of datasets stored in different data sources
  2. check the data periodically
  3. identify mismatch between the data and what we expect from it and
  4. help us include data quality checks in our data pipelines

So we built the Data Quality Validation Tool (DQVT).

It is not a data catalog. Rather, it aims at documenting what the content of a dataset is expected to look like. DQVT helps us define tests on the data, called expectations, which are turned into documentation (thanks to Great Expectations). DQVT validates these expectations on a regular basis and reports any mismatch between the data and its documentation. Finally, DQVT computes scores on data quality metrics defined through our internal data standard.

By filling these roles, DQVT takes us towards better data quality, and consequently also more reliable and more performant software systems.

The story of DQVT

Faced with several high-profile digitalization projects, Statnett recently ramped up its data quality initiatives. At the time, Python Bytes presented Great Expectations on episode #115 (we highly recommend this podcast if you are a Pythonista🐍).

We tested Great Expectations and became fans pretty quickly, for several reasons:

  • the simplicity of use: a command line interface providing guidance, supporting various types of SQL databases, Pandas and Spark.
  • a beautiful concept in line with development best practices (documentation-as-code). In the words of Great Expectations, tests are docs and docs are tests.
  • an extremely detailed user documentation
  • and an active and inclusive open source community (Slack, Discuss)

We were interested to see if this tool could help us monitor data quality on our own infrastructure at Statnett, which includes two particularly important platforms. We use the GitLab devops platform to host our code and provide continuous integration and deployment pipelines, and we use OpenShift as our on-premises Platform-as-a-Service to run and orchestrate our Docker containers, managed by Kubernetes.

The time came to build a proof-of-concept, and we started lean, with a limited amount of time and resources to reduce technology risks. The main goals and scope revolved around a handful of features and requirements.

The goal of our first demo was to document the content of our datasets: not what the columns and fields of a table are (that is the job of a data catalog), but what was expected from the values in these fields. We were also keen on making this documentation human-readable and keeping it up to date automatically. Finally, we wanted to get notified when data expectations were not met, indicating either problems in the data, or that our expectations needed adjustments.

At the time, we weren’t sure how we would deploy validations on a schedule, or whether Great Expectations would be able to fetch data from our Big Data Lake (an IBM Big SQL), which is a high-performance massively parallel processing (MPP) SQL engine for Hadoop. Failing in either of these integrations would have ended the experiment.

Despite having to do a small hack to connect to our Big Data Lake, we were able to have our data quality validations run periodically on OpenShift in less than a month! 🎉

What’s next?

At the end of the Python Bytes episode, host Brian Okken wonders how data engineers might include the Great Expectations tool in their data pipelines. I will be back soon to show you how to do just that! I’m creating a tutorial that details the individual steps and technologies we use in DQVT, but the structure of DQVT is quite simple, so you would likely be able to reproduce it on your own infrastructure.

And if you have some experience of your own or are just curious to learn more, you’re more than welcome to leave a comment!
