Retrofitting the Transmission Grid with Low-cost Sensors

In Statnett, we collect large amounts of sensing data from our transmission grid. This includes both electric parameters such as power and current, and parameters more directly related to the individual components, such as temperatures, gas concentrations and so on.

Nevertheless, the state and behaviour of many of our assets are to a large extent not monitored and to some extent unobservable outside of regular maintenance and inspection rounds. We believe that more data can give us a better estimate of the health of each component, allowing for better utilization of the grid, more targeted maintenance, and reduced risk of component failure.

About a year ago, we therefore acquired a Pilot Kit from Disruptive Technologies, packed with a selection of miniature sensors that are simple to deploy and well-suited for retrofitting on existing infrastructure. We set up a small project in collaboration with Statnett R&D, where we set about testing the capabilities of this technology, and its potential value for Statnett.

Since then we’ve experimented with deploying these tiny IOT-enabled devices on a number of things, ranging from coffee machines to 420 kV power transformers.

To gauge the value and utility of these sensors in our transmission grid, we had to determine what to measure, how to measure, and how to gather and analyze the data.

This blog post summerizes what we’ve learnt so far, and evaluates some of the main use cases we’ve identified for instrumenting the transmission grid. The process is by no means finished, but we will describe the steps we have taken so far.

Small, Low-cost IoT-Sensors

Disruptive Technologies is a fairly young company whose main product is a range of low-cost, IoT-enabled sensors. Their lineup includes temperature sensors, touch sensors, proximity sensors, and more. In addition to sensors, you need one or more cloud connectors per area you are looking to instrument. The sensors transmit their signals to a cloud connector, which in turn streams the data to the cloud through mobile broadband or the local area network. Data from the sensors are encrypted, so a sensor can safely transmit through any nearby cloud connector without prior pairing or configuration.

The devices and the accompanying technology have some characteristics that make them interesting for power grid instrumentation, in particular for retrofitting sensors to already existing infrastructure:

  • Cost: The low price of sensors makes experimental or redundant installation a low-risk endeavour.
  • Size: Each sensor is about the size of a coin, including battery and the wireless communication layer.
  • Simplicity: Each sensor will automatically connect to any nearby cloud connector, so installation and configuration amounts to simply sticking the sensor onto the surface you want to monitor, and plugging the cloud connector into a power outlet.
  • Battery life: expected lifetime for the integrated battery is 15 years with 100 sensor readings per day.
  • Security: Data is transmitted with full end-to-end encryption from sensor to Disruptive’s cloud service.
  • Open API: Data are readily available for download or streaming to an analytics platform via a REST API.

Finding Stuff to Measure

Disruptive develops several sensor types, including temperature, proximity and touch. So far we have chosen to focus primarily on the temperature sensors, as this is the area where we see the most potential value for the transmission grid. We have considered use cases in asset health monitoring, where temperature is a well-established indicator of weaknesses or incipient failure. Depending on the component being monitored, unusual heat patterns may indicate poor electrical connections, improper arc quenching in switchgear, damaged bushings, and a number of other failure modes.

Asset management and thermography experts in Statnett helped us compile an initial list of components where we expect temperature measurements to be useful:

  • Transformers and transformer components. At higher voltage levels, transformers typically have built-in sensors for oil temperature, and modern transformers also tend to monitor winding and hotspot temperatures. Measurement on sub-components such as bushings and fans may however prove to be very valuable.
  • Ciruit breakers. Ageing GIS facilities are of particular importance both due their importance in the grid, and to the risk of environmental consequences in case of SF6 leakage. Other switchgear may also be of interest, since intermittent heat development during breaker operation will most likely not be uncovered by traditional thermography.
  • Disconnectors. Disconnectors (isolator switches) come in a number of flavors, and we often see heat development in joints and connection points. However, we know from thermography that hotspots are often very local, and it may be hard to predict in advance where on the disconnector the sensor should be placed.
  • Voltage and current transformers. Thermographic imaging has shown heat development in several of our instrument transformers. Continuous monitoring of temperature would enable us to better track this development and understand the relationship between power load, air temperature and transformer heating.
  • Capacitor banks. Thermography often reveals heat development at one or more capacitor in capacitor banks. However, it would require a very large number of sensors required to fully monitor all potential weak spots of a capacitor bank.

A typical use cases for the proximity sensors in the power system is open door or window alarms. Transmission level substations are typically equipped with alarms and video surveillance, but it might be relevant for other types of equipment in the field, or at lower voltage levels.

The touch sensors may for instance be used to confirm operator presence at regular inspection or maintenance intervals. Timestamping and georeferencing as part of an integrated inspection reporting application is a more likely approach for us, so we have not pursued this further.

Deployment on Transformer

Our first pilot deployment (not counting the coffee machine) was on three transformers and one reactor in a 420 kV substation. The sensors were deployed in winter when all components were energized, so we could only access the lower part of the main transformer and reactor bodies. This was acceptable, since the primary intention of the deployment was to gain experience with the process and hopefully avoid a few pitfalls in the future.

Moreover, the built-in temperature sensors in these components gave us a chance to compare readings from Disruptive sensors with the “true” inside temperature, giving us an impression of both the reliability of readings from Disruptive sensors and the ability to estimate oil temperature based on measurements on the outside of the transformer housing. We also experimented with different types of insulation covering the sensor, in order to gauge the effect of air temperature variations on sensor readings.

Deployment in Indoor Substation

Following the initial placement on the transformer bodies, we instrumented an indoor GIS facility, where we deployed sensors on both circuit breakers and disconnectors; plus one additional sensor to measure ambient temperature in the room. Since the facility is indoors and all energized components are fully insulated, this deployment was fairly straightforward. Our main challenge was that the cloud connector had a hard time connecting finding a cellular signal, but with a bit of fiddling we eventually found a few locations in the room with sufficient signal strength.

Concrete buildings can make it hard to find a good signal for mobile broadband. In this case we raised the cloud connector to the ceiling in search of a signal.

Deployment on Air Insulated Breakers

Finally, we took advantage of a planned disconnection of one of the busbars at a 300 kV facilty to instrument all poles of an outdoor SF6 circuit breaker. As mentioned above, disconnectors and instrument transformers were other instrument transformers. However, due to the layout of the substation, these were still energized so the circuit breakers were the only components we could gain access to.

Apart from monitoring the breakers, this deployment enabled us to test how the sensors reacted to being placed directly on uninsulated high-voltage equipment, and to check for any negative side-effects such as corona discharges.

Developing a Microservice for Data Ingestion

The sensors from Disruptive Technologies work by transmitting data from the sensor, via one or more cloud connectors, to Disruptive’s cloud software solution. The data are encrypted end-to-end, so the sensors may use any reachable cloud connector to transmit data.

As a precautionary measure, we opted to maintain an in-house mapping between sensor device ID and placement in the grid. This way, there is nothing outside Statnett’s systems to identify where the sensor data are measured.

Disruptive provides various REST APIs for streaming and downloading data from their cloud solution. For internal technical reasons, we chose to use a “pull” architecture, where we download new sensory readings every minute and pass them on to our internal data platform. We therefore developed a microservice that:

  1. Pulls data from Disruptive’s web service at regular intervals.
  2. Enriches the data with information about which component the sensor is placed on and how it is positioned.
  3. Produces each sensor reading as a message to our internal Kafka cluster.

From Kafka, the data are consumed and stored in a TimescaleDB database. Finally, we display and analyze the data using a combination of Grafana and custom-built dashboards.

The microservice runs on our internal Openshift Container Platform (PaaS).

We wrote a custom adapter that ingests data from Disruptive’s web services and produces messages on our internal Kafka cluster. From here, the data flows to Timescale and dashboards. The data are anonymous on Disruptive’s web service, so the adapter also adds contextual information to the messages.

The Value of Data

Do these newly acquired data help us take better decisions in the operation of the grid, and hence operate the grid in a smarter, safer, and more cost-effective way? This is really the litmus test for the value of retrofitting sensors and gathering more data about the components in the grid.

In this pilot project, we consulted field personnel and component experts regularly for advice on where and how to place sensors. However, it was the FRIDA project, a large cross-disciplinary digitalization project at Statnett, that really enabled and inspired the relevant switchgear expert to analyze the data we had collected in more detail.

Once he looked at the data, he discovered heat generation in one of the breakers, with temperatures that significantly exceeded what would be expected under normal operation. A thermographic imaging inspection was immediately ordered, which confirmed the readings from the Disruptive sensors.

The temperature sensors made us aware of high temperatures in one of the breakers, indicating a possible incipient failure of a critical component. As a result, the control centre changed the operating pattern on the substation and planned for mainentance of the unhealthy component. The figure shows the temperature on each phase of the breaker, with busbar A in the top panel and busbar B in the bottom one. The room temperature is shown as a dotted orange line.

Based on the available data, the breaker and thermography experts concluded that the breaker, altough apparently operating normally, shows signs of weakness and possibly incipient failure. This in turn lead to new parts being ordered and maintenance work planned for the near future.

While waiting for the necessary maintenance work to be performed, the operation of the substation has been adapted to reduce the stress on the weakened equipment. Until maintenance is performed, the limits for maximum amount of power flowing through the switchgear are now updated on a regular (and frequent) basis, based on the latest temperature readings from our sensors. Having access to live component state monitoring has also made the control centre able to make other changes to the operating pattern on the substation.

The control centre now continuously monitors the heat development in the substation, using the sensors from this pilot project. The new data has thus not only helped discovered an incipient failure in a critical component, it also allows us to keep operating the substation in a safe and controlled way while we are waiting for an opportunity to repair or replace the troublesome components.

Lessons Learned in the Field

Under optimal conditions, the wireless range, i.e. the maximum distance between sensors and cloud connectors, is 1000 meters with line of sight. Indoor, signals are reliably transmitted over ranges of 20+ meters in normal mode when the conditions are favorable. A weak signal will make the sensor transmit in “Boost mode”, and this quickly drains the battery.

High-voltage power transformer are huge oil-filled steel constructions, often surrounded by thick concrete blast walls. We quickly learned that these are not the best conditions for low-power wireless signal transfer. When attaching the sensors to the main body of the transformer, we observed that the communication distance was reduced to less than 10 meters and required line-of-sight between the sensors and the cloud connector.

One reason for the short transmission distance is that the metal body of the transformer absorbs most of the RF signal from the sensor. Disruptive therefore adviced us to use a bracket so that the sensors could be mounted at a 90 degree angle to the surface when mounting the sensors on large metal bodies. We used Lego blocks for this purpose. Disruptive have since developed a number of range extenders that are arguably better-looking.

Lego bracket vs. surface range extender. If you roll your own, keep in mind where on the sensor the actual temperature measurement is made, and rotate it accordingly. On the sensors in our Pilot Kit, this was the upper right corner of the device (not in the corner where the small orange dot is located).

Although we did experience an improvement in signal transmission range when mounting the sensors at an angle to the transformer body, sending signals around the corner of the transformer still turned out to be very challenging.

Disruptive have yet to develop a ruggedized cloud connector. The need to position the cloud connector such that line of sight was maintained, limited our ability to place the cloud connector inside existing shelters such as fuse cabinets. We therefore developed a weather-proof housing for outdoor use, so we could position the cloud connectors for optimal transmission conditions.

A weather-proof housing allowed us to position the cloud connectors for optimal transmission conditions.

All sensors have their device ID printed on the device itself. However, the text is tiny and the ID is long and complex. We opted to put a short label on each sensor using a magic marker in order to simplify the deployment process in the field. This simplistic approach was satisfactory for our pilot project, and required minimal support system development, but obviously does not scale to larger deployments.

As mentioned above, we have deployed a number of sensors directly on energized, unisolated, 300 kV components. We were quite curious to see how the sensors would cope with being mounted directly on high-voltage equipment. So far, the measurements from the sensors seem to be unaffected by the voltage. However, we have lost about 25 % of these sensors in less than a year due to high battery drainage. We suspect that this may be related to the environment in which they operate, but it may also be bad luck or related to the fact that our sensors are part of a pre-production pilot kit.

Finally, sticking coin-sized sensors onto metal surfaces is easy in the summer. It’s not always equally easy on rainy days or in winter, with cold fingers and icy or snow-covered components.

Other Things we Learned Along the Way

So far, our impression is the the data quality is good. The sensor readings are precise, and the sensors are mostly reliable. The snapshot from Grafana gives an impression of the data quality: as can be seen, there is very good correspondence between the temperature readings from the three different phases on busbar A after switching, when it has been disconnected.

However, both the cloud connectors and the SaaS-based architecture are weak spots where redundancy is limited. If a cloud connector fails, we risk loosing data from all the sensors communicating with that connector. This can to some extent be alleviated by using more cloud connectors. The SaaS architecture is more challenging: downtime on Disruptive’s servers sometimes affects the entire data flow from all their sensors.

Deployment is super easy, but an automated link to our ERP system would ease installation further, and significantly reduce the risk of human error when mapping sensors to assets.

The sensors are very small. This is cool, but if they were 10x larger with 10x stronger wireless signal, this would probably be better for many of our use cases.

Finally, communication can be challenging when dealing with large metal and concrete constructions. This goes for both the communication between sensor and cloud connector, and for the link between the cloud connector and the outside world. This is in most cases solvable, but may require some additional effort during installation.

Next Step: Monitor All Ageing GIS Facilities?

This pilot project has demonstrated that increased component monitoring can have a high value in the transmission grid.

Our main focus has been on temperature monitoring to assess component health, but the project has spurred quite a lot of enthusiasm in Statnett, and a number of other application areas have been suggested to us during the project.

One of the prime candidates for further rollout is to increase the monitoring of GIS substations, either by adding sensors to other parts of the facility, by selecting other substations for instrumentation, or both.

A further deployment of sensors must be aligned with other ongoing activities at Statnett, such as rollout of improved wireless communication at the substations. Nonetheless, we have learned that there are many valuable use cases for retrofitting sensors to the transmission grid, and we expect to take advantage of this kind of technology in the years to come.

How we created our own data science academy

If your goal is a more data driven organization, a group of six people in the Data Science department cannot do the task alone. In this post we describe how we developed a data science academy where a group of colleagues attended three months of training in data science.

How it all started 

A few years ago, we established a Data Science department in Statnett. Being the transmission system operator of the Norwegian power system, Statnett operates approximately 11,000 km of power lines and cables, with about 150 substations. Every single day we gather vast amounts of data from our assets in the power system, and our goal is to enable more data-driven decisions.  

We have made progress working together with core departments in Statnett to develop tools and machine learning models that solve advanced challenges. But we quickly realized that to become a more data-driven organization we needed to develop new skills throughout the company.

Luckily, we had a lot of colleagues who wanted to learn. Our Python classes were full, people attended our lectures and tested their skills on code kata’s.  

But that wasn’t enough. A 2-3-hour course might be inspiring, but to truly be able to use data for daily analytical tasks our colleagues asked for more training. An idea was born.  

Developing relevant data science skills

We could have chosen another strategy – to concentrate the data skills around the Data Science department. Issues around data quality and securing the right interpretation of large data sets would perhaps be more easily handled that way, but we truly believed in a decentralized system. By installing the right skills and tools in the core business we could get more value and scale faster.  

We set an ambitious goal. A group of employees in Statnett would attend a full time Data Science academy for three months. They would have to leave their regular tasks behind and be fully commited to the classes.  

The HR department supported the whole process, including the selection of employees. The application was open for everyone, but they needed support from their leaders. This was an important aspect, as we wanted the department of the attendee to commit to the program as well. There is no use in learning new skills if you are expected to go back to the old habits after the program!  

From Python to data wrangling

The content of the course was developed by us, and tightly linked to Statnett’s core business. That way we ensured a more interesting academy for our experienced colleagues. It quickly became clear to them how they could use data science tools in their daily work routine. 

The academy was a mix of coding sessions, lectures followed by workshops, self-studies and group work. We focused on a group of important skills: 

  1. Basic programming, especially focusing on Python  
  2. Learning about and getting access to data sources, both inside Statnett and external
  3. Data visualization and communicating results of data analysis
  4. Data prep/data wrangling – the process of transforming raw data and gaining control over missing data

The classes were taught by employees in the Data Science department, in addition to shorter workshops and lectures by colleagues from across the company.

What we learned

Our goal was to utilize Statnett’s own data in all case studies and tasks. But at the beginning of the course we realized that some of the attendees needed more basic coding and data visualization skills, and we adjusted the progression. We focused more on the basic skills and waited till the end with the more advanced use cases from the organization. 

This is one of the reasons why we believe it’s fundamental to build data skills through an internally developed program. We were able to evaluate consecutively and adjusted the program to ensure the right outcome. 

Participants from different parts of the company combine their specific domain knowledge with their new data science skills to solve use cases related to their day-to-day work.

We still haven’t decided what the next academy will be like, but most certainly we will start with the basic skills and build on that. A possible way forward is to admit more employees from the departments that have started using the data science tools to the next academy, to further build on what’s working.

Long term effects of learning new skills

Our attendees are now able to use data science tools to solve their day to day tasks more effectively, so they can focus their creativity on the more demanding tasks. Our goal is to ensure that employees across Statnett are inspired to learn more about the possibilities in coding and data visualization.

A relevant example is the vast amount of Excel workbooks in use. They may have started as an effective way to solve certain tasks, but many have grown way past their reasonable size. These tasks can be more effectively solved through Python tools where you update data across data sets all at once, always ensuring that every user knows which version is the latest. The hope is that these quite simple examples will inspire more people across the organization to learn. 

Our first group of attendees are devoted to sharing their knowledge. In addition to initiating projects related to their new skillset, our ambition is that they will also involve other colleagues interested in learning. The course administrator will still be available as a resource for our alumni, helping them putting their newly acquired skills into use in their daily work.

Better collaboration between IT and core business

A side effect of the academy is that data skills make you a better product owner. When employees in core business know more about the possibilities in programming, data visualization and machine learning, they can work better together with the IT department in further developing the products and services. In fact, this is an important aspect of digital transformation. Companies all over are looking to utilize their data assets in new ways. Some of this is best solved through large scale IT projects, but a lot of the business value is gained through simple data tools for continuous improvement. 

What happens to the Data Science department when our colleagues can solve their own problems using data and coding? Well, our ambitions are to further build our capacity as a leader in our field. We just started working on a research and development project with reinforcement learning, and we will keep working on better methods and smarter use of data.  

How to recruit data scientists and build a data science department from scratch

Our strategy was to build on talent from the business and hire fast learners when we started building a data science department at Statnett two years ago. Now, we have operated for a year with great achievements as well as setbacks and learning points. Read about our story in this post.

About two years ago I was tasked with building a data science team at Statnett, a government owned utility. Interesting data science tasks abounded, but most data science work was done by consultants or external parties as R&D projects. This approach has its merits, but there were two main reasons why I advocated building an in-house data science team:

  • Domain specific knowledge as well as knowing the ins and outs of the company is important to achieve speed and efficiency. An in-house data science team can do a proof of concept in the time it takes to hire an external team. And an external team often needs a lot of time to get to know the data sources and domain before they start delivering. It is also a burden for the few people who teach new consultants the same stuff time and again, they can become a bottleneck.
  • Data science is about making the company more data driven, it is an integral part of developing our core business. If we are to succeed, we must change how people work. I believe this is easier to achieve if you are intimately familiar with the business and are in it for the long haul.

But how to build a thriving data science department from scratch? Attracting top talent can be tough, especially so when you lack a strong professional environment where your new hires can learn and develop their skills.

Build on talent from the business

Drew Conway’s Venn Diagram The primary colors of data: hacking skills, math and stats knowledge, and substantive expertise

Prior to establishing the data science team, I lead a team of power system analysts. These analysts work in the department for power system planning, which has strong traditions for working data driven. Back in 2008 we adopted Python to automate our work with simple scripts; we didn’t use version control or write tests. Our use of Python grew steadily and eventually we hired a software development consultant to aid in the development and help us establish better software development practices.

This was long before I was aware that there was anything called data science and machine learning. However, the skillset we developed – a mix of coding skills, domain expertise and math – is just what Drew Conway described as the “primary colors” of data science in his now famous Venn diagram from 2010.

Even the Economist commented on the rise of Python in a 2018 article.

A background as an analyst with python skills proved to be a very good starting point for doing data science. The first two data scientists in Statnett came from the power system analysis department. I guess we were lucky to have started coding python back in 2008, as this coincided perfectly with the rise of python as the programming language for doing data science.

Still, most large companies have some sort of business analyst function, and this is a good place to find candidates for data science jobs. Even if they are unfamiliar with (advanced) coding, you get people that already know the business and should know basic math and statistics as well.

Hire for aptitude rather than skill

Having two competent analysts to kickstart the data science department also vastly improved our recruitment position. In my experience it is important for potential hires to know that they will have competent colleagues and a good work environment for learning. Therefore, Øystein, one of the “analyst-turned-data scientists”, attended all interviews and helped develop our recruitment strategy.

Basically, our strategy was to hire and develop fast learners, albeit with some background in at least two of the “bubbles” from the Venn diagram. Preferably the candidates should have complementary backgrounds. E.g. some strong coders, some more on the statistics side and some with a background in power systems.

We deliberately didn’t set any hard skill requirements for applicants. Requiring experience in machine learning or power systems would have narrowed the field down too much. Also, the technical side of data science is in rapid development. What is in vogue today, will be outdated in a year. So, it makes more sense to hire for aptitude rather than look for certifications and long experience with some particular technology. This emphasis on learning is also more attractive for the innovative talent that you need in a data science position.


Use interview-cases with scoring to avoid bias

During the first four months we conducted about 60 interviews with potential candidates. I had good experience using tests and cases in previous interview situations. First off, this forces you to be very specific about what you are looking for in a candidate and how to evaluate this. Secondly, it helps to avoid bias if you can develop an objective scoring system and stick to it. Thirdly, a good selection of tailored cases (not standard examples from the internet) serves to show off some interesting tasks that are present in the company. We used the ensuing discussions to talk about how we approached the cases in our company, thus advertising our own competence and work environment.

Such a process is not without flaws. Testing and cases can put some people off, and you are at risk of evaluating how candidates deal with a stressful test situation rather than how they will perform at work. Also, expect to spend a lot of time recruiting. I still think the benefits clearly outweigh the downsides. (Btw, maybe an aspiring data scientist should expect a data drive recruitment process).

We ended up with a set of tasks and cases covering:

  • Basic statistics/logic
  • Evaluate a data science project proposal
  • How to communicate a message with a figure
  • Statistical data visualization
  • Coding skills test (in language of choice)
  • Standardized aptitude tests

We did two rounds of interviews and spent about 4 hours in total per candidate for the full test set. In addition, some of the cases were prepared beforehand by the candidate. The cases separated well and in retrospect there was little need for adjustments. In any case, it would have been problematic to change anything once we started: How you explain the tasks, what hints you give etc. should be consistent from candidate to candidate.

Focus on a few tasks at a time

The first year of operation has necessarily been one of learning. Everyone has had to learn the ropes of our new infrastructure. We have recently started using a Hadoop stack for our streaming data and data at rest. And we have been working hard to get our CI/CD pipelines working and learning how to deploy containerized data science application on Open Shift.

Even though the team has spent a lot of time learning the technology and domain, this was as expected. I would say the major learning point from the first year is one of prioritization: We took on too many tasks at once and we underestimated the effort of getting applications into production.

As I stated initially, interesting data science tasks abound, and it is hard to say no to a difficult problem if you are a data scientist. When you couple this with a propensity to underestimate the effort needed to get applications into production (especially on new infrastructure for the first time), you have a recipe for unfinished work and inefficient task switching.

So, we have made adjustments along the way to focus on fewer tasks and taking the time needed to complete projects and get them in production. However, this is a constant battle; there are always new proposals/needs/problems from the business side that are hard to say no to. At times, having a team of competent data scientists feels like a double-edged sword because there are so many interesting problems left unsolved. Setting clear goals and priorities, as simple as that sounds, is probably where I can improve the most going forward.

Vernerunde Viklandet 2018

Get strong business involvement

I would also stress that it is paramount to have strong presence from the business side to get your prioritizations right and to solve the right problem. This holds for all digital transformation, when in my experience, the goal often has to be adjusted along the way. The most rewarding form of work is when our deliveries are continuously asked for and the results critiqued and questioned.

On the other hand, one of the goals with the data science department was to challenge the business side to adopt new data driven practices. This might entail doing proof of concepts that haven’t been asked for by the business side to prove that another way of operating is viable.

It can be difficult to strike the right balance between the two forms of work. In retrospect we spent too much time initiating our own pilot projects and proof of concepts in the beginning. While they provide ample learning opportunities – which was important during our first year – it is hard to get such work implemented and it is a real risk that you solve the wrong problem. A rough estimate therefore is that about 80 % of our work should be initiated by and led by the business units, with the representative from the data science team being an active participator, always looking for better ways to solve the issues at hand.

Scaling data science competency

From the outset, we knew that there would be more work than a central data science department could take on. Also, to become a more data-driven organization, we needed to develop new skills and a new mindset throughout the company, not only in one central department.

Initially we started holding introductory Python classes and data science discussions. We had a lot of colleagues who were eager to learn, but after a while it became clear that this approach had its limits. The lectures and classes were inspiring, but for the majority it wasn’t enough to change they way they work.

That’s why we developed a data science academy: Ten employees leave their work behind for three months and spend all their time learning about data science. The lectures and cases are tailored to be relevant for Statnett and held mainly by data scientists. The hope is that the attendees learn enough to spread data science competence in their own department (with help from the us). In such a way we hope to spread data science competency across the company.

What have we achieved?

At times I have felt that progress was slow. Challenges with data quality have been ubiquitous. Learning unfamiliar technology by trial and error has brought development to a standstill. Very often, projects that we think are close to production ready have weeks left of work. This has coined the term “ten minutes left” meaning an unspecified amount of work, often weeks or months left.

But looking back, the accomplishments become clearer. There is a saying that “people overestimate what can be done in the short term, and underestimate what can be done in the long term” which applies here as well.

Some of the work that we have done builds on ideas from our time working on reliability analysis in power systems and is covered here on this blog:

The largest part of our capacity during the last year has been spent on building forecasts for system operation:

We have also spent a considerable effort on communication, both to teach data science to new audiences at Statnett, exchange knowledge and to promote all the interesting work that happens at Statnett. This blog, scientific papers, our data science academy and talks at meetups and conferences are examples of this.

Developing and deploying machine learning models for online load forecasts

Load forecasts are important input to operating the power system, especially as renewable energy sources become a bigger part of our power mix and new electrical interconnectors link the Norwegian power system to neighbouring systems. That is why we have been developing machine learning models to make predictions for power consumption. This post describes the process for developing models as well as how we have deployed the models to provide online predictions.

Load forecasting, also referred to as consumption prognosis, combines techniques from machine learning with knowledge of the power system. It’s a task that fits like a glove for the Data Science team at Statnett and one we have been working on for the last months.

We started with a three week long Proof of Concept to see if we could develop good load forecasting models and evaluate the potential of improving the models. In this period, we only worked on offline predictions, and evaluated our results against past power consumption.

The results from the PoC were good, so we were able to continue the work. The next step was to find a way to go from offline predictions to making online predictions based on near real time data from our streaming platform. We also needed to do quite a bit of refactoring and testing of the code and to make it production ready.

After a few months of work, we have managed to build an API to provide online predictions from our scikit learn and Tensorflow models, and deploy our API as a container application on OpenShift.

Good forecasts are necessary to operate a green power system

Balancing power consumption and power production is one of the main responsibilities of the National Control Centre in Statnett. A good load forecast is essential to obtain this balance.

The balancing services in the Nordic countries consist of automatic Frequency Restoration Reserves (aFRR) and manual Frequency Restoration Reserves (mFRR). These are activated to contain and restore the frequency. Until the Frequency Restoration Reserves have been activated, the inertia (rotating mass) response and system damping reduce the frequency drop below unacceptable levels.

The growth in renewable energy sources alters the generation mix; in periods a large part of the generators will produce power without simultaneously providing inertia. Furthermore, import on HVDC-lines do not contribute with inertia. Therefore, situations with large imports to the Nordic system can result in insufficient amount of inertia in the system. A power system with low inertia will be more sensitive to disturbances, i.e. have smaller margins for keeping frequency stability.

That is why Statnett is developing a new balancing concept called Modernized ACE (Area Control Error) together with the Swedish TSO Svenska kraftnät. Online prediction of power consumption is an integral part of this concept and is the reason we are working on this problem.

Can we beat the existing forecast?

Statnett already has load forecasts. The forecasts used today are proprietary software, so they are truly black box algorithms. If a model performs poorly in a particular time period, or in a specific area, confidence in the model can be hard to restore. In addition, there is a need for predictions for longer horizons than the current software is providing.

We wanted to see if we could measure up to the current software, so we did a three week stunt, hacking away and trying to create the best possible models in the limited time frame. We focused on exploring different features and models as well as evaluating the results to compare with the current algorithm.

At the end of this stunt, we had developed two models: one simple model and one complex black box model. We had collected five years of historical data for training and testing, and used the final year for testing of the model performance.

Base features from open data sources

The current models provide predictions for the next 48 hours for each of the five bidding zones in Norway. Our strategy is to train one model for each of the zones. All the data sources we have used so far are openly available:

  • weather observations can be downloaded from the Norwegian Meteorological Institutes’s Frost API. We retrieve the parameters wind speed and air temperature from the observations endpoint, for specific weather stations that we have chosen for each bidding zone.
  • weather forecasts are available from the Norwegian Meteorological Institute’s thredds data server. There are several different forecasts, available as grid data, along with further explanation of the different forecasts and the post processing of parameters. We use the post processed unperturbed member (the control run) of the MetCoOp Ensemble Prediction System (MEPS), the meps_det_pp_2_5km-files. From the datasets we  extract wind speed and air temperature at 2 m height for the closest points to the weather stations we have chosen for each bidding zone.
  • historical hourly consumption per bidding area is available from Nord Pool

The five bidding zones in Norway

Among the features of the models were:

  • weather observations and forecasts of temperature and wind speed from different stations in the bidding zones
  • weather parameters derived from observations and forecasts, such as windchill
  • historical consumption and derived features such as linear trend, cosine yearly trend, square of consumption and so on
  • calendar features, such as whether or not the day is a holiday, a weekday, the hour of the day and so on

Prediction horizon

As mentioned, we need to provide load forecasts for the next 48 hours, to replace the current software. We have experimented with two different strategies:

  • letting one model predict the entire load time series
  • train one model for each hour in the prediction horizon, that is one model for the next hour, one model for two hours from now and so on.

Our experience so far is that one model per time step performs better, but of course it takes more time to train, and is more complex to handle. This explodes if the frequency of the time series is increased or the horizon is increased: perhaps this will be too complex for a week ahead load forecast, resulting in 168 models for each bidding zone.

Choosing a simple model for correlated features

We did some initial experiments using simple linear regression, but quickly experienced the problem of bloated weights: The model had one feature with a weight of about one billion, and another with a weight of about minus one billion. This resulted in predictions in the right order of magnitude, but with potential to explode if one feature varied more than usual.

Ridge regression is a form of regularization of a linear regression model, which is designed to mitigate this problem. For correlated features, the unbiased estimates for least squares linear regression have a high variance. This can lead to bloated weights making the prediction unstable.

In ridge regression, we penalize large weights, thereby introducing a bias in the estimates, but hopefully reducing the variance sufficiently to produce reliable forecasts. In linear least squares we find the weights minimizing the error. In ridge regression, we penalize the error function for having large weights. That is we add a term in proportion to the sum square of the weights x1 and x2 to the error. This extra penalty can be illustrated with circles in the plane, the farther you are from the origin, the bigger the penalty.

In ridge regression, the idea is to minimize the error, but penalize large weight values. Larger weights correspond to larger circles in the plane.

For our Ridge regression model, we used the Ridge class from the Python library scikit learn.

Choosing a neural network that doesn’t forget

Traditional neural networks are unable to remember information from earlier time steps, but recurrent neural networks (RNNs) are designed to be able to remember from earlier time steps. Unlike the simpler feed-forward networks where information can only propagate in one direction, RNNs introduce loops to pick up the information from earlier time steps. An analogy for this is being able to see a series of images as a video, as opposed to considering each image separately. The loops of the RNN is unrolled to form the figure below, showing the structure of the repeating modules:

An RNN contains a single network layer, represented by the blue box, where information from the previous module is incorporated.

In practice, RNNs are only able to learn dependencies from a short time period. Long short-term memory networks, LSTMs are a type of RNNs that use the same concept of loops as RNNs, but the value lies in how the repeating module is structured.

One important distinction from RNNs is that the LSTM has a cell state, that flows through the module only affected by linear operations. Therefore, the LSTM network doesn’t “forget” the long term dependencies.

The repeating layer in an LSTM network, where the state propagates through the module, shown by the grey line. The blue boxes are network layers and blue circles are pointwise operations.

For our LSTM model we used the Keras LSTM layer.

For a thorough explanation of LSTM, check out this popular blog post by Chris Olah: Understanding LSTMs.


To evaluate the models, we used mean absolute percentage error, MAPE, which is a common metric used in load forecasting. Our proof of concept models outperformed the current software when we used one year of test data, reducing the MAPE by 28% on average over all bidding zones in the forecasts.

The average MAPE over all bidding zones was reduced from 5.1 % to 4.0 % when forecasting one day ahead (labeled D-1) and from 6.2 % to 4.2 % when forecasting two days ahead (labeled D-2).

We thought this was quite good for such a short PoC, especially considering that the test data was an entire year. However, we see that for the bidding zone NO1 (Østlandet), where the power consumption is largest, we need to improve our models.

How to evaluate fairly when using weather forecasts in features

We trained the models using only weather observations, but when we evaluated performance, we used weather forecasts for time steps after the prediction time. In the online prediction situation, observations are available for the past, but predictions must be used for the future. To compare our models with the current model, we must simulate the same conditions.

We have also considered training our model using forecasts, but we only have access to about a year of forecasts, therefore we have chosen to train on observations. Another possible side effect of training on forecasts, is the possibility of learning the strengths and weaknesses of a specific forecasting model from The Norwegian Meteorological Institute. This is not desirable, as we would have to re-learn the patterns when a new model is deployed. Also, the model version (from the Meteorological Institute) is not available as of now, so we do not know when the model needs re-training.

Our most recent load forecast

By now we have made some adjustments to the model, and our forecasts for NO1 from the LSTM model and Ridge regression model are shown below. Results are stored in a PostgreSQL database and we have built a Grafana dashboard visualizing results.

The plot shows our newest load forecasts for the area NO1 (Østlandet) as of 20.12 17:00. For the hours in the past, the forecast from the hour before is shown. The shaded areas are the minimum and maximum values the previous forecasts have been, the bold green line is the actual consumption. The dashed line shows the forecast from 24 hours ago.

How to deploy real time ML models to OpenShift

The next step in our development process was to create a framework for delivering online predictions and deploy it. This also included refactoring the code from the PoC. We worked together with a development team to sketch out an architecture for fetching our data from different sources to delivering a prediction. The most important steps so far have been:

  • Developing and deploying the API early
  • Setting up a deployment pipeline that automatically deploys our application
  • Managing version control for the ML models

Develop and deploy the API early

“API first” development (or at least API early, as in our case) is often referenced as an addition to the twelve-factor app concepts, as it helps both client and server side define the direction of development. A process with contracts and requirements sounds like a waterfall process, but our experience from this project is that in reality, it supports an agile process as the requirements are adapted during development. Three of the benefits were:

  • API developers and API users can work in parallel. Starting with the API has let the client and server side continue work on both sides, as opposed to the client side waiting for the server side to implement the API. In reality, we altered the requirements as we developed the API and the client side implemented their use of it. This has been a great benefit, as we have uncovered problems, and we have been able to adjust the code early in the development phase.
  • Real time examples to end users. Starting early with the API allows us to demonstrate the models with real time examples to the operators who will be using the solution, before the end of model development. The operators will be able to start building experience and understanding of the forecasts, and give feedback early.
  • Brutally honest status reports. Since the results of the current models are available at any time, this provides a brutally honest status report of the quality of the models for anyone who wants to check. With the models running live on the test platform, we have some time to discover peculiarities of different models that we didn’t explore in the machine learning model development phase, or that occur in online prediction and not necessarily in the offline model development.

Separate concerns for the ML application and integration components

Our machine learning application provides an API that only performs predictions when a client sends a request with input data. Another option could be to have the application itself fetch the data and stream the results back to our streaming platform, Kafka.

We chose not to include fetching and publishing data inside our application in order to separate concerns of the different components in the architecture, and hopefully this will be robust to changes in the data sources.

The load forecasts will be extended to include Sweden as well. The specifics of how and where the models for Sweden will run are not decided, but our current architecture does not require the Swedish data input. Output is written back to Kafka topics on the same streaming platform.

The many components in our load forecast system architecture. The machine learning application only delivers predictions when a request is made.

The tasks we rely on are:

  • Fetching data from the different data sources we use. These data are published to our Kafka streaming platform.
  • Combine relevant data and make a request for a forecast when new data is available. A java application is scheduled each hour, when new data is available.
  • Make the predictions when requested. This is where the ML magic happens, and this is the application we are developing.
  • Publish the prediction response for real time applications. The java application that requests forecasts also publishes the results to Kafka. We imagine this is where the end-user applications will subscribe to data.
  • Store the prediction response for post-analysis and monitoring.

How to create an API for you ML models

Our machine learning application provides an API where users can supply the base features each model uses and receive the load forecast for the input data they provided. We created the API using the Python library Flask, a minimal web application framework.

Each algorithm runs as a separate instance of the application, on the same code base. Each of these algorithm instances has one endpoint for prediction. The endpoint takes the bidding zone as an input parameter and routes the request to the correct model.

Two tips for developing APIs

  • Modularize the app and use blueprint with Flask to prevent chaos. Flask makes it easy to quickly create a small application, but the downside is that it is also easy to create a messy application. After first creating an application that worked, we refactored the code to modularize. Blueprints are templates that allow us to create re-usable components.
  • Test your API using mocking techniques. As we refactored the code and tried writing good unit tests for the API, we used the mock library to mock out objects and functions so that we can separate which functionality the different tests test and run our tests faster.

Where should derived features be calculated?

We have chosen to calculated derived features (i.e. rolling averages of temperature, lagged temperature, calendar features and so on) within the model, based on the base features as input data. This could have been outside of the models, for example by the java application that integrates the ML and streaming, or by the Kafka producers. There are two main reasons why we have chosen to keep these calculations inside our ML API:

  • The feature data is mostly model specific, and lagged data causes data duplication. This is not very useful for anyone outside of the specific model, and therefore it doesn’t make sense to create separate Kafka topics for these data or calculate them outside of the application.
  • Since we developed the API at an early stage, we had (and still have) some feature engineering to do, and we did not want to be dependent on someone else to provide all the different features that we wanted to explore before we could deploy a new model.

Class hierarchy connects general operators to specific feature definitions

The calculations of derived features are modularized in a feature operator module to ensure that the same operations are done in the same way everywhere. The link that applies these feature operators to the base features is implemented using class methods in a class hierarchy.


We have created a class hierarchy as illustrated in the figure above. The top class, ForecastModels, contains class methods for applying the feature operators to the input data in order to obtain the feature data frame. Which base features and which feature operators we want to be applied to the features are specified in the child class.

The child classes at the very bottom, LstmModel and RidgeRegression, inherit from their respective Keras and scikit learn classes as well, providing the functionality for the prediction and training.

As Tensorflow models have a few specific requirements, such as normalising the features, this is implemented as a class method for the class TensorFlowModels. In addition, the LSTM models require persisting the state for the next prediction. This is handled by class methods in the LstmModel class.

Establish automatic deployment of the application

Our deployment pipeline consists of two steps, which run each time we push to our version control system:

  • run tests
  • deploy the application to the test environment if all tests pass

We use GitLab for version control, and have set up our pipeline using the integrated continuous integration/continuous deployment pipelines. GitLab also integrates nicely with the container platform we use, OpenShift, allowing for automatic deployment of our models through a POST request to the OpenShift API, which is part of our pipeline.


Automatic deployment has been crucial in the collaboration with the developers who make the integrations between the data sources and our machine learning framework, as it lets us test different solutions easily and get results of improvements immediately. Of course, we also get immediate feedback when we make mistakes, and then we can hopefully fix it while we remember what the breaking change was.

How we deploy our application on OpenShift

Luckily, we had colleagues who were a few steps ahead of us and had just deployed a Flask application to OpenShift. We used the source to image-build strategy, which builds a Docker image from source code, given a git repository. The Python source to image image builder is even compatible with the environment management package we have used, Pipenv, and automatically starts the container running the script This allowed us to deploy our application without any adaptions to our code base other that setting an environment variable to enable Pipenv. The source-to-image strategy was perfect for the simple use case, but we are now going towards a Docker build strategy because we need more options in the build process.

For the LSTM model, we also needed to find a way to store the state, independently of deployments of the application. As a first step, we have chosen to store the states as flat files in a persistent storage, which is stored independently of new deployments of our application.

How do we get from test to production?

For deployment from the production to the test environment, we will set up a manual step after verifying that everything is working as expected in the test environment. The plan as of now is to publish the image running in the test environment in OpenShift to Artifactory. Artifactory is a repository manager we use for hosting internally developed Python packages and Docker images that our OpenShift deployments can subscribe to. The applications running in the OpenShift production environment can then be deployed from the updated image.

Flask famously warns us not to use it in production because it doesn’t scale. In our setting, one Flask application will receive one request per hour per bidding zone, so the problems should not be too big. However, if need be, we plan to use the Twisted web server for our Flask application. We have used Twisted previously when deploying a Dash web application, which is also based on Flask, and thought it worked well without too much effort.

Future improvements

Choosing the appropriate evaluation metric for models

In order to create a forecast that helps the balancing operators as much as possible, we need to optimize our models using a metric that indicates their need. For example, what is most important: Predicting the level of the top load in the day, or predicting the time the top load occurs on? Is it more important to predict the outliers correctly or should we focus on the overall level? This must be decided in close cooperation with the operators, illustrating how different models perform in the general and in the special case.

Understanding model behavior

Understanding model behavior is important both in developing models and to build trust in a model. We are looking into the SHAP framework, which combines Shapley values and LIME to explain model predictions. The framework provides plots that we think will be very useful for both balancing operators and data scientists.

Including uncertainty in predictions

There is of course uncertainty in the weather forecasts and the Norwegian Meteorological Institute provides ensemble forecasts. These 10 forecasts are obtained by using the same model, but perturbing the initial state, thereby illustrating uncertainty in the forecast. For certain applications of the load forecast, it could be helpful to get an indication of the uncertainty in the load forecast, for example by using these ensemble forecast to create different predictions, and illustrating the spread.

How we quantify power system reliability

An important aspect of digital transformation for a utility company is to make data-driven decisions about risk. In electric power systems, the hardest part in almost any decision is to quantify the reliability of supply. That is why we have been working on methodology, data quality and simulation tools with this in mind for several years at Statnett. This post describes our simulation tool for long-term probabilistic reliability analysis called MONSTER.

The need for reliable power system services is ever increasing in our system as  technology advances. At the same time, the focus on carefully analysed and socio economic profitable investments has perhaps never been more essential. In light of this, several initiatives focus on having probabilistic criteria as the frame for doing reliability analysis. This post describes our Monte Carlo simulation tool for probabilistic power system reliability analysis that we have dubbed MONSTER.

Model overview

holisticOur simulation methodology is described in this IEEE PMAPS paper from 2018: A holistic simulation tool for long-term probabilistic power system reliability analysis (you need IEEE Xplore membership to access the article)

Here I’ll repeat the basics, ingnore most of the maths and expand on some details that have been left out in the paper.

Our tool consist of two main parts: A Monte Carlo simulation tool and a Contingency analysis module. The main idea in the simulation part is to use hourly times series of failure probability for each component.

We do Monte Carlo simulations of the power system where we draw component failure times and durations, and then let the Contingency analysis module calculate the consequence of each of these failures. The Contingency evaluation module is essentially a power system simulation using optimal loadflow to find the appropriate remedial actions, and the resulting “lost load” (if any).

The end result of the whole simulation is typically expressed as expected loss of load, expected socio-economic costs or distributions of the same parameters. Such results can, for instance, be used to make investment decisions for new transmission capacity.

Time series manager

The main principle of the Time series manager is to first use failure observations to calculate individual failure rates for each component by using a Bayesian updating scheme, and then to distribute the failure rates in time by using historical weather data.  We have described the procedure in detail in a previous blog post.

We first calculate individual, annual failure rates using observed failure data in the power system. Then for each component we calculate an hourly time series for the probability of failure. The core step in this approach is to calculate the failure probabilities based on reanalysis weather data such that the resulting expected number of failures is consistent with the long term annual failure rate calculated in the Bayesian step. This is about spreading the failure rate in time such that we get one probability of failure at each time step. In this way, components that experience the same adverse weather will have elevated probabilities of failure at the same time. This phenomenon is called failure bunching and it’s effect is illustrated in the figure below. In fact we will get a consistent and realistic geographical and temporal development of failure probabilities throughout the simulation period.

Failure bunching dramatically increases the probability of multiple independent failures happening at the same time.

In our model we divide each fault into eight different categories: Either the fault is  temporary or Permanent, and each of the failures can be due to either wind, lightning, snow/icing or unrelated to weather. Thus, for each component we get eight historical time series to be used in the simulations.

outages temporary failures

Biases in the fault statistics

There are two data sources for this part of the simulation. Observed failure data from the national fault statistics database FASIT, and approximately 30 years of reanalysis weather data calculated by Kjeller Vindteknikk.  The reanalysis dataset is nice to work with from a data science perspective. As it is simulated data, there are no missing values, no outliers, uniform data quality etc.

The fault statistics, however, has lots of free text columns, missing values and inconsistent reporting (done by different people across the nation over two decades). So a lot of data prep is necessary to go from reported disturbances to statistics relevant for simulation. Also, there are some biases in the data to be aware of:

  • There is a bias towards too long repair/outage-times. This happens because most failures don’t lead to any critical consequences, therefore we spend more time than strictly necessary repairing the component or otherwise restoring normal operation. For our simulation we want the repair time we would have needed in case of a critical failure. We have remedied this to some extent by eliminating some of the more extreme outliers in our data.
  • There is a bias towards reporting too low failure rates; only failures that lead to an immediate disturbance in the grid count as failures in FASIT. Failures that lead to controlled disconnections are not counted as failures in this database, but are  relevant for our simulations if the disconnection is forced (as opposed to postponable).

The two biases counteract each other, in addition this kind of epistemic uncertainty is possible to mitigate by doing sensitivity analysis.

Improving the input-data: Including asset health

Currently, we don’t use component specific failure rates. For example, all circuit breakers have the same failure rate regardless of age, brand, operating history etc. For long term simulations (like new power line investment decisions) this is a fair assumption as components are regularly maintained and swapped out as time passes by.

The accuracy of the simulations, especially on short to medium term, can be greatly improved by including some sort of component specific failure rate. Indeed, this is an important use-case in and by itself for risk based asset maintenance.

Monte Carlo simulation

A Monte Carlo simulation consist of a large number of random possible/alternative realizations of the world, typically on the order of 10000 or more. In our tool, we do Monte Carlo simulations for the time period where we have weather data, denoted by k years, where k~=30. For each simulation, we first draw the number N of failures in this period for each component and each failure type. We do this by using the negative binomial distribution. Then we distribute these failures in time by drawing with replacement from the k∗8760 hours in the simulation period such that the probability of drawing any particular hour is in proportion to the time series of probabilities calculated earlier. As the time resolution for the probabilities is hourly, we also draw a uniform random value of the failure start time within an hour. Then finally, for each of these failures we draw random duration of outages:

for each simulation do
for each component and failure type in study do
draw the number N of failures
draw N failures with replacement
for each failure do
draw random uniform start time within hour
draw duration of failure

Convergence of the most and least frequent single- and double failure rates for the first 1000 simulations.

Outage manager

In addition to drawing random errors, we also have to incorporate maintenance planning. This could for example be done either as an explicit manual plan or through an optimization routine that distributes the services according to needs and probabilities of failure.

How detailed the maintenance plan is modeled, typically depends on the specific area of study. In some areas, the concern is primarily of a failure occurring during a few peak load hours when margins are very tight. At such times it is safe to assume that there is no scheduled maintenance – and we disregard it entirely in the model. In other areas, the critical period might be when certain lines are out for maintenance, and so it becomes important to model maintenance realistically.

For each simulation, we step through points in time where something happens and collect which components are disconnected. In this part we also add points in time where the outage pass some predefined duration, called outage phase in our terminology. We do this to be able to apply different remedial measures depending on how much time has passed since the failure occurred. The figure below illustrates the basic principles.

Collecting components and phases to contingency list. Dashed line indicate phase 1 and solid line phase 2. In this example the two outages contribute to five different contingencies. In phase 1, only automatic remedial measures (like system protection) can be applied. In phase 2, manual measures, like activation of tertiary reserves can be applied.

One important assumption is that each contingency (like the five contingencies in the figure above) can be evaluated independently. This makes the contingency evaluation, which is the computationally heavy part, embarrassingly parallel. Finally, we eliminate all redundant contingencies. We end up with a pool of unique contingencies that we send to the Contingency Analysis module for load flow calculations in order to find interrupted power.

Contingency analysis

Our task in the Contingency analysis module is to mimic the system operator’s response to failures. This is done in four steps:

  1. Fast AC-screening to filter out contingencies that don’t result in any thermal overloads. If overloads, proceed to next step.
  2. Solve the initial AC load-flow, if convergence move to next step. In case of divergence (due to voltage collapse) attempt to emulate the effect of voltage collapse with a heuristic, and report the consequence.
  3. Apply remedial measures with DC-optimal load-flow, report all measures applied and go to next step.
  4. Shed load iteratively using AC load-flow until no thermal overloads remain. Report lost load.

In step 3 above, there are many different remedial measures that can be applied:

  • Reconfiguration of the topology of the system
  • Reschedule production from one generator to another
  • Reduce or increase production in isolation
  • Change the load at given buses

For the time being, we do not consider voltage or stability problems in our model apart from the rudimentary emulation of full voltage collapse. However, it is possible to add system protection to shed load for certain contingencies (which is how voltage stability problems are handled in real life).

As a result from the contingency analysis, we get the cost of applying each measure and the interrupted power associated with each contingency.

Load flow models and OPF algorithm

Currently we use Siemens PTI PSS/E to hold our model files and to solve the AC load flow in steps 1,2 and 4 above. For the DC optimal load-flow we express the problem with power transfer distribution factors (PTDF) and solve it as a mixed integer linear program using a commercial solver.

Cost evaluation – post-processing

Finally, having established the consequence of each failure, we calculate the socio-economic costs associated with each interruption. This is done by following the Norwegian regulation. The general principle is described in this 2008 paper, but the interruption costs have been updated since then. For each failure, we know where in the grid there are disconnections, for how long they last, and assuming we know the type of users connected to the buses in our load-flow model, the reliability indices for different buses will be easy to calculate. Due to the Monte Carlo approach, we will get a distribution of each of the reliability indices.

Result example: Failure rate and lost load averaged over alle simulation for some contingencies (black dots).

A long and winding development history

MONSTER has been in development for several years at Statnett after we did an initial study and found that no commercial tools where available that fit the bil. At first we developed two separate tools: one Monte Carlo simulation tool in Matlab, used to find the probability of multiple contingencies, and an Python AC contingency evaluation module built on top of PSS/E, used to do contingency analysis with remedial measures. Both tools were built to solve specific tasks in the power system analysis department. However, combining the tools was a cumbersome process that involved large spreadsheets.

At some point, we decided to do a complete rewrite of the code. The rewriting process has thought us much of what we know today about modern coding practices (like 12-factor apps), and it has helped build the data science team at Statnett. Now, the backend is written in python, the front end is a web-app built around node.js and vue, and we use a SQL database to persist settings and results.

A screenshot from the application where the user defines substation configuration.

Prediction of rare events and the challenges it brings: Wind failures on power lines

Summer internship blog 2018: Comparing different weather data, evaluating their corresponding probabilistic predictions and trying to develop a better forecast model.

This summer we have had two summer interns working as part of the data science team at Statnett: Norunn Ahdell Wankel from Industrial Mathematics at NTNU and Vegard Solberg from Environmental Physics and Renewable Energy at NMBU. We asked them to look at our failure prediction models and the underlying data sources with a critical eye. They have written this blog post to document their work. At the end of the article we offer some reflections. Enjoy!

Statnett already has its own models used to forecast probability of failure due to wind or lightning on the transmission lines. If you want to know more about the lightning model and how it is used to forecast probability of failure we recommend you to have a look at some previous blog posts from April on the details of the model and how it is used. Both of the models are, however, based on another climate data source (reanalysis data computed by Kjeller vindteknikk) than what is used as input in realtime forecasting (weather forecast from Norwegian Meteorological Institute – Met). With this in mind, together with the fact that Statnett’s models have not yet been properly evaluated or challenged by other models, we aim to answer the following questions:

  1. How similar is the weather data from Kjeller and Met?
  2. How good is the model currently in use, developed by the Data Science team?
  3. Can we beat today’s model?

Short overlap period between the weather data from Met and Kjeller

We must have an overlap period of data from Met and Kjeller to be able to compare the differences in weather. Unfortunately, we only had four full months of overlap.  This period corresponds to February 1 – May 30, 2017, for which we have hourly data for each tower along 238 transmission lines. There were only failures due to wind and none due to lightning in the overlap period. Therefore we have chosen to only focus on the wind model and hence the wind speed for the rest of this blog post.

Kjeller reports stronger wind

Kjeller gives a wind speed that on average is 0.76 m/s higher than Met in the overlap period. However, this is not the case for all lines. We get a feeling of the variation between lines by averaging the difference (Met minus Kjeller) per tower within a line and sort the difference in descending order. This is shown in the second column of the table below, from which it is clear that there is variety, and not a “trend”. The mean differences ranges from +1.55 to -2.74 m/s. Also noting that only 58 out of the 238 lines in total have a positive mean difference, the trend, if speaking of one, is that Kjeller records a higher wind speed on average for most lines.

Table that in several ways summarizes the difference between wind speeds reported by Kjeller and Met.

The third and fourth columns display the number of hours in which at least one tower along a line had a wind speed above 15 m/s (i.e that the max speed over the whole line exceeded 15 m/s in that hour). Why exactly 15 m/s? Because Statnett’s model for wind failures uses this threshold, meaning that a wind speed lower than this does not affect the wind exposure and therefore not the probability of failure. We see quite a huge difference between Met and Kjeller. For 179 of the 238 lines Kjeller reported a larger number of hours than Met. To visualize the difference better we can plot the difference in number of hours the max wind speed is above 15 m/s between Met and Kjeller as a histogram, see figure below.

On a majority of the transmission lines, there are more hours with wind speed above 15 m/s in the Kjeller dataset.

Clearly the histogram also shows how Kjeller reports more hours than Met for most lines. However only very few of them have an absolute difference of more than 200 hours.

Larger variance in Kjeller data

We found that Kjeller has a greater average variance within a line than Met. By finding the standard deviation at a line at each specific time point, and thereafter averaging over all times, we get the values reported in the columns “Mean std per hour”. In 199 cases Kjeller has a larger standard deviation than MET, indicating that Kjeller reports more different values along a line than Met.

Kjeller also has a higher variance throughout the whole overlap period. We calculated the standard deviation over all hours for each tower, and then found the mean standard deviation per tower along a line. The numbers are found in the columns “mean std per tower”. They tell us something about how the wind speeds for a tower tend to vary during the whole time period. It turns out that Kjeller has a higher standard deviation in 182 of the cases.

Is there any geographical trend?

Having summarized  wind speeds for each line we were curious to see if there could be a geographical trend. We looked at the mean differences when grouping the transmission lines based on which region they belong to, hence either N, S or M (north, south, mid). This is illustrated in the plot below.

Plot of the mean difference in wind speed per tower for all lines, grouped by region. Difference is taken as Met-Kjeller. Each transmission line belongs to one region, either N (north), S (south) or M (mid). On average the wind speed is higher in the Kjeller dataset.

As seen from the plot “N” is the region that stands a bit out from the other, where Kjeller is mainly lower than Met. The spread in values is however large for all regions. Maybe another grouping of the data would have revealed stronger trends. One could for instance have investigated if there is any trend tower-wise based on latitude and longitude coordinates for each tower. What would also have been interesting to do is to somehow plot the results on a map. By adding the grid points for both Met and Kjeller one could more clearly have seen if the location of these grid points could explain some of the differences in weather.

Keep in mind that…

The wind speed computed by Kjeller is from a height of 18 metres above the ground, while it is 10 metres for Met. We are not sure how much difference there is due to this fact only, but one probably could have found ways to adjust for this height difference. The resolution for Met is 2.5 km while 1 km for Kjeller, meaning that the grid points for which we have weather are located closer to each other in the Kjeller data. Hence it might not be so unexpected that Kjeller reports more different wind speeds along a line, and therefore has a higher standard deviation within a line. As already seen the variation within a line is for most lines greater for Kjeller than Met. Note however that even though the resolution of the grids differ, one might think that the trend over time for each line still could have been the same for Met and Kjeller. If for instance Met had the same trend in wind speed over time for each tower along a line as Kjeller, only that the values were shifted by a constant term, the standard deviation per tower should have been the same. However, this is at least not the case overall as the mean standard deviation per tower differs quite a bit for several lines, as seen in the previous table.

Using Met and Kjeller data as input to the model

We have seen that there are differences between the weather data from Met and Kjeller. The question is whether this difference in fact has any impact on the forecasting skill of the model. To check this we used Statnett’s model for wind failures to predict the probability of failure throughout the overlap period using both Kjeller and Met data. In addition, we ran the model after adding the mean line difference, the difference then in terms of Kjeller minus Met, to each individual line in the Met data.

To check whether the model predicts well we also defined a baseline prediction that simply predicts the longterm failure rate for all samples.

Aggregated probabilities to make data less unbalanced

In the overlap period there were only 6 failures due to wind in total, among 4 different lines. This makes our dataset extremely unbalanced. A way to make it less unbalanced is to aggregate the hourly probabilities to a daily probability, p_{d}. Assuming independent events, this can be done using the following formula

p_{d} = 1 - \prod_{i=1}^{24} (1 - p_{i})

where p_{i} is the probability of failure in hour i for a given line.

How do we evaluate the predictions?

To evaluate the forecasts produced by the model on Kjeller and Met data, we must use some type of score. There are many options, but we have focused on two widely used scores: The Brier score and the log loss score. The Brier score is an evaluation metric used for evaluating a set of predictions, i.e probabilities of belonging to a certain class. In our case we have a binary classification problem (failure, non-failure) and corresponding predictions for the probability of failure. The Brier score for N predictions in total is formulated as

BS = \frac{1}{N}\sum_{i=1}^{N}(\hat{y}_{i} - y_{i})^{2}, \;\; \hat{y}_{i}\in [0,1], \;y_{i} \in \{0,1\},

where \hat{y}_{i} is the predicted probability of failure and {y}_{i} is 1 if it is a failure and 0 otherwise. The Brier score can be interpreted as the mean squared error of the forecasts. Naturally, we wish to minimize this error, so a small Brier score is desired. Note that the score ranges from 0 to 1, where 0 is the best.

The contribution of a sample’s predicted probability on the evaluation score for the logistic loss and Brier score.

Log loss is another evaluation metric we have considered for our problem and it is defined as follows:

LL = -\frac{1}{N}\sum_{i=1}^{N}\left(y_{i}log(\hat{y}_{i}) + (1-y_{i}) log(1-\hat{y}_{i})\right) \;\; , \hat{y}_{i}\in [0,1], \;y_{i} \in \{0,1\}.

Log loss, also known as logistic loss or cross-entropy loss, is also a measure of the error in the prediction, so small numbers are good here as well. Note that the score ranges from 0 to in theory +\infty (in Python’s sklearn library this upper limit is however approximately 34.5 based on the logarithm of machine precision), where 0 is the best. The figure above shows how each prediction contributes to the final Brier score and logistic loss for the case of failure/non-failure for all values of the prediction in the range [0,1]. Both metrics have in common that their contribution tends to zero the closer the predicted probability is to the true class. However, they differ in how much they penalize bad predictions. We use scikit-learn’s logistic loss metric and Brier score.

Also note that with only 6 failures, it is difficult to be very conclusive when evaluating models. Ideally the models should be evaluated over a much longer period.

Kjeller data performs better, but baseline is the best

Below is a figure of how the model performs when making predictions, e.g. using weather forecasts (denoted Met), compared to the baseline forecast. Both hourly and daily values are shown. To be clear, a relative score under 1 beats the same model using hindcast data (e.g. Kjeller data).

First we note that the baseline prediction model beats all others, except for the daily Kjeller prediction using logistic loss. The difference in Brier score for forecasts with Met seems negligible, only 0.96 % and 2.9 % better for hourly and daily scores, respectively. For log loss the difference is notable, with Met 0.19 % and 41 % worse for hourly and daily scores, respectively.

The naive baseline model preforms best on most metrics. A prediction beats the model using hindcast data from Kjeller if the score is below 1.

The Brier score for the forecasts where the mean line difference is added to Met is very similar to Kjeller as well. For log loss the score is worse than Kjeller for all cases, however slightly better than Met for daily forecasts when adding the mean difference for each line.

Logistic loss might be more appropriate for us

A relevant question is whether the Brier score and logistic loss are appropriate evaluation metrics for our problem. It is not obvious that small desired changes in the predicted probabilities are reflected in the scores. As mentioned earlier, we have a very large class imbalance in our data, which means that the predicted probabilites are generally very low. The maximum hourly forecasted probabiliy is 4%.  We wish that our evaluation metrics are able to notice increased predicted probabilities on the few actual failures. For instance, how much better would a set of predictions be if the probabilities on the actual failures were doubled. Would the metrics notice this?

To check this we manipulate the probabilities on the actual failures for the set of predicitons based on the Met data, while keeping the rest of the predicted probabilites as they initially were. First, we set all the probabilities on the actual failures equal to 0 and scored this with the Brier score and logistic loss. We do the same several times, writing over the probabilities on the actual failures from 0% to 6% and calculate the scores for each overwrite. This is the range of probabilities in which our model predicts, and we therefore expect that the metrics should decrease as we move from 0% to 6%.

This figure shows how slowly the brier score decreases (gets better) as the probability on actual failures are overwritten.

The results are shown in the figure above, where the scores are scaled by the score of the actual predictions (the inital ones before overwriting any predicitons). We see that the score for the logistic loss decreases rapidly, compared to the Brier score. For instance, when the probability is manipulated to 2 % for all failures, the resulting decrease is 4 % for the Brier score and over 70% for the logistic loss.

Logistic loss is in some sense more sensitive to change in predicted probabilities, at least for low probabilities. This is a desirable quality for us, because our model does in general give very low probabilities. We want a metric to reflect when the predictions are tuned in the right direction.

Brier score can be fooled by a dummy classifier

We decided to check what would happen if we shuffled all of the predicted probabilities in a random order, and then scored it with the Brier score and logistic loss. Intuitively, one would expect the scores to get worse, but that is not the case for the Brier score. This is weird because the predicted probabilities no longer have any relation with the weather for that line. The only common factor is the distribution of the actual predictions and the shuffled predictions. However we double our score when we use the logistic loss, meaning that it gets much worse.

This shows that a dummy classifier that generates probabilities from a certain distribution would be equally good as a trained classifier in terms of the Brier score – at least on such an unbalanced data set. Logistic loss guards us against this. For this one it is not sufficient to produce probabilities from a given probability distribution and assign them randomly.

Evaluation score of shuffled hourly and daily probabilities scaled by the unshuffled score.

The share of the failures in our data set is of order 10^{-5} , or 10 ppm. As a result, almost all of the shuffled probabilities line up with a non-failure, as before, and therefore contribute the same to the score as before the shuffle, for both evaluation metrics. The main difference is the shuffled probabilities that line up with the actual failures. Let us say that the model gives a probability of 0.1 % for an actual failure. That sample’s contribution to the Brier score is (1-0.001)^{2} = 0.9801. Almost 97% of the predicted probabilities are 0% for the hourly predictions, so the failure will most likely be lined up with a 0% prediction after the shuffle. That sample’s contribution to the brier score will now be (1-0)^{2} = 1, almost exactly the same. On the other hand, the sample’s contribution to the logistic loss will be -log(0.001) = 6.9 before the shuffle and -log(0)=34.5 after the shuffle. This is because logistic loss has a much steeper curve for low probabilities.

Using scikit-learn to create own model

But enough about Statnett’s model, it is time to create our own model. For this purpose we decided to use scikit-learn’s library. Scikit-learn is very easy to use once you got your data in the correct format, and it has handy functions for splitting the data set into a training and a test set, tuning of hyperparamaters and cross validation. As always, a lot of the job is preprocessing the data. Unfortunately, we only have access to 16 months old data from Met. To be able to compare our model with Statnett’s we would also like to test it on the same overlap period as theirs, namely February to May 2017. Therefore we also needed to train the model on the weather data from Kjeller, which of course might not be ideal when keeping the weather differences between Met and Kjeller in mind. Kjeller data until the end of January 2017 was used for training.

For the training set we have weather data for all towers on a line from 1998 until February 2017, and could have used the avaliable data for a specific tower to be a sample/row in our data matrix. This would however result in a gigantic data matrix with over 5 billion rows. Most of scikit-learn’s machine learning algorithms require that all the training data must fit in memory. This would not be feasible with a matrix of that size. We decided to pull out the minimum, maximum and mean of some of the weather parameters for each line. In addition, we have extracted the corresponding month, region and number of towers. This gave us a matrix of around 37 million rows, a size that fits into our memory.

Logistic regression beats Statnett’s current model

We managed to beat Statnett’s current model. We scored our logistic regression model and Statnett’s model on the Met data from the overlap period. Our score divided by the current model’s score is shown below. Our model is slightly better using the Brier score as the metric, and much better in terms of logistic loss. A reason for why we get a lower logistic loss score is that the current model predicts zero probability of failure on two actual failure times. As earlier mentioned, this is heavily penalized in the logistic loss setting.

Although we beat the current model using these two metrics, it is not sure that ours is superior. If we assume that failures are independent events and use the predicted probabilities in a Monte Carlo simulation, we find that both models overestimate the number of failures. There were 6 wind failures in the test set, but the probabilities generated by the current model suggests 7.4 failures, while ours indicates 8.1. In other words, our model overestimates the number of failures despite the fact that it is better in terms of logistic loss and Brier score.

Performance of logistic regression classifier relative to current model’s score. Less than 1 is better.

Logistic regression finds seaonal trend in wind failures

We can look at the coefficients from the logistic regression (the features are standardized) to see how the different features impact the probability of failure. A positive coefficient means that a large value of that feature increases the probability of failure. The coefficients of the logistic regression are visualized below. We can see that high values of wind speed and temperature increase the classifier’s predicted probability of failure. We also see that the winter months are given positive coefficients, while the summer months are given negative coefficients. Wind failures are more frequent during the winter, so the model manages to identify this pattern.

Importance of the variables in our logistic classifier. Positive importance means that a large variable value increases the predicted probability of failure.

All in all

Did we manage to answer the questions proposed at the very beginning of this blog post? Yes, at least to some extent.

1. How similar is the weather data from Kjeller and Met?

For the weather data comparison we see that there is a difference in the wind speeds reported by Kjeller and Met, where Kjeller tends to report stronger wind than Met. However this is not the case for all lines and there is no “quick fix”. It is of course not ideal to have a short overlap period to use for comparison since we are not even covering all seasons of the year.

2. How good is the model currently in use, developed by the Data Science team?

The current forecast model does not beat a baseline that predicts the long term failure rate – at least for the admittedly very short evaluation period we had data for. In addition, the model performs better on Kjeller data than Met data. This is no surprise considering that the model is trained on Kjeller data.

3. Can we beat today’s model?

Our logistic regression model performs marginally better using the Brier score and a lot better using the logisitic loss. Our model does however overestimate the number of failures in the overlap period.

Reflections from the data science team

The work highlights several challenges with the current data and methodology that we will follow up in further work. The discussion about log-loss vs. Brier scores also highlights a more general difficulty in evaluating forecasting models of very rare events. Some thoughts we have had:

  • There might be erroneously labeled training and test data since there is really no way of knowing the real cause of an intermittent failure. The data is labeled by an analyst who relies on the weather forecast and personal experience. We have seen some instances where a failure has been labeled as a wind failure even though the wind speed is low, while at the same time the lightning index is very high. Such errors make it complicated to use (advanced) machine learning models, while the simpler model we currently employ can help in identifying likely misclassifications in the data.
  • How should we handle forecasts that are close misses in the time domain? E.g. the weather arrives some hours earlier or later than forecasted. Standard loss functions like log-loss or Brier score will penalize such a forecast heavily compared to a “flat” forecast, while a human interpreter might be quite satisfied with the forecast.
  • We should repeat the exercise here when we get longer overlapping data series and more failures to get better comparisons of the models. Six failures is really on the low side to draw clear conclusions.
  • A logistic regression seems like a likely candidate if we are to implement machine learning, but we would also like to explore poisson-regression. When the weather is really severe we often se more failures happening on the same line in a short period of time. This information is discarded in logistic regression while poisson regression is perfectly suited for this kind of data.


Simulating power markets with competitive self-play

In this post we present an unpublished article from 2009 that uses self-play to evolve bidding strategies in congested power markets with nodal pricing.

Reinforcement learning using self-play is currently one of the most fascinating areas of research within artificial intelligence. Papers like Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm makes for inspiring reading, and I’m sure we’ll see a host of real world applications in the future. These current trends in reinforcement learning came to mind as an old piece of unpublished research I wrote surfaced some weeks ago. In the article I used a genetic algorithm (GA) – all the rage back in 2009 – to learn the weights in a neural network which provide the strategy for competitive generator companies in a simulated power market. Using this setup I could demonstrate well known Nash equilibria (NE) in common market situations, as well as simulate competition in intractable market setups.

I wrote the article while I worked as a scientific researcher at Sintef Energy AS back in 2009 and the work was financed by Sintef. Also, the topic is not that interesting to pursue as a TSO, however the methods proved interesting enough to write a post about it here. Thanks to Sintef for letting me publish it here!

The entire article draft can be downloaded here

The market participants

The paper starts off with a description of market participants (generators). The generators must bid their capacity into the power exchange. Each generator can behave strategically by adding a markup to the marginal cost, thereby attempting to achieve a higher profit, but at the risk of losing market shares. The markup is the output of the neural network which we will come to.


For convenience we use a linear marginal cost function with a scalar markup.

Nodal pricing and market clearing

The power exchange collects all the market participant’s bids (this includes demand). It then has to find the intersection of the aggregate supply and demand curves while taking transmission constraints into account. In other words, the power exchange must find the generation by all generators that minimizes the system cost. With linear bids and a linearized version of the power flow equations, this is a quadratic optimization problem.

The problem formulation implies that each node in the power system can have a unique power price (if there are constraints in the network).  The nodal power prices are inferred from the Lagrangian multipliers of the equality constraints; the Lagrangians are the costs of producing one additional MW of energy at a node. This setup roughly reflects several real world power markets.

To solve this quadratic optimization problem we can employ what is called a DC optimal power flow. The Python package pypower handles all this for us. A limited amount of coding should let you add a variable markup to each generators marginal cost function.

After the market is cleared, each participant gets paid the price at the node that it resides, even if it was consumed at a node with a different price. Profit from trade between differently priced nodes, congestion rent, devolves to the system operator.

Evolving strategies using a genetic algorithm

The genetic algorithm (GA) is used to evolve strategies for the market participants. Each market participant has its own pool of n strategies that are evolved using the GA. The aim of each participant is to maximize its profits.

At each generation g of the simulation the participants meet several times, these are the iterations. At each iteration, each participant randomly selects a strategy from its pool of strategies. This strategy is evaluated to yield a bid B_u that is submitted to the market clearing algorithm. The market clearing returns participants profits that are stored in memory. After the desired number of iterations have passed, the fitness of a strategy, s, for unit, u, at generation, g, F_{u,s,g}, is calculated as the sum of profits it received divided by the number of times it was played. Thereafter, selection, crossover, and mutation are applied to produce the new generation of strategies and a new generation is started.

In pseudocode:

  1. Initialize and load power system structure, g = 0
  2. g = g + 1, i = 0
  3. i = i + 1
  4. For each participant u, select a random strategy s from the pool
  5. Find the bid of each participant B_u, given the random strategy
  6. Evaluate the market clearing algorithm given the bids of all players and store the achieved profits
  7. If i < numIterations goto line (3), else:
  8. Calculate the fitness of each strategy F_{u,s,g} as the average of all profits received this generation
  9. Perform selection, crossover and mutation of all strategies
  10. If g < numGenerations goto line (2)
  11. Simulation ends

The number of iterations is selected such that all strategies are played on average ten times each generation. In this paper, the standard number of strategies for each participant was n = 50. In other words, there were 500 iterations per generation.

Representing the strategy as a neural network

A strategy is the rule by which the participants determine their bid, in this case the markup. In mathematical terms, a strategy is a function that takes a set of inputs and returns a decision (markup). This function has parameters that determine the nature of the relationship between inputs and output; it is these parameters that the GA work upon.

Inspired by research on the Iterated Prisoner’s Dilemma (IPD), the inputs to the strategy are taken to be the markup and the price attained when the strategy was last played. The loose resemblance to an IPD, is that the markup is one’s own level of cooperation and the price corresponds to the cooperative level of the competitor(s). Using these two inputs, the strategy essentially has a memory of one round and is able to evolve tit-for-tat strategies and similar, i.e. if you bid high last turn I also continue to bid high this turn. The amount of information taken as input could easily be extended to allow for more sophisticated strategies, maybe this could be an area for recurrent neural networks.

A convenient way of creating a parametrized function, that can represent a strategy as described above, is a multilayer feed-forward neural network. It can approximate any function arbitrarily well, given sufficient neurons. In the article I used a small neural network with one hidden layer. The weights and biases in the neural network were then evolved using the genetic algorithm. A modern approach would maybe use deep q-learning or a similar approach instead.

Perfect competition and co-operative equilibria

The model is primarily used to set up market situations that model classical game theory situations which have been analyzed extensively in the literature. The idea is to show that self-play converges to well-known Nash-equilibria (NE).

The first part of the article shows some rather obvious results:

  • In a market with a lot of small generators and a surplus of capacity, the markup converges to zero
  • In a duopoly where each generator can serve exactly half the load, the markup converges to the highest allowed markup
  • In a duopoly where one generator alone can serve the load (Bertrand competition), the markup converges to zero

Conceptually, Bertrand competition bears some resemblance to the Prisoners Dilemma (IPD): If both generators co-operate by bidding high they get a good profit, however, it is always in the interest of the other to defect and bid marginally lower, thus mutual defection is ensured. In the IPD game it is known that players can evolve strategies that maintain a mutually cooperative level even though the equilibrium for both players is to defect if the game is played only once. In the normal IPD, there are only two choices, either full cooperation or full defection. A common extension of the game is to include several discrete levels of cooperation.

Market price in Bertrand competition for an initially random population (A), defecting population (B), and co-operative population (C). In all simulations, the marginal cost is 100, and the markup is eventually reduced to zero.

In a paper on the IPD from 2002, the authors noted that the chance of mutual cooperation dwindled as the possible choices in the IPD increased. Considering that each player in the Bertrand game can select any real positive number as her price (which can be thought of as the levels of cooperation), it is simply very unlikely to remain at a mutually cooperative level. This effect can explain why mutual cooperation is likely if the choice of each player is limited to only a few discrete price levels, but highly unlikely if the price can be selected from a continuum.

Other papers on agent based modelling limit the agents to select from a discrete set of choices. If the market situation resembles Bertrand competition, the above discussion might lead to think that mutual co-operation can occur, albeit with a small probability if the number of choices is large. However, it turns out that the similarity of the Bertrand and Prisoners Dilemma game disappears if the number of choices is limited. In fact, the outcome of simulated Bertrand competition with discrete levels of cooperation is quite arbitrary and depends on the exact markups the players can choose from.

Competition with capacity constraints

With capacity constraints, one generator alone can not serve the entire market like in Bertrand competition. The participant that bids higher than its opponent still captures some residual demand. A market player can thus follow two strategies:

  1. Try to undercut the other player to get a large market share and risk low prices, or
  2. set the markup high to make sure the market clearing price remains high and accept that the other player gets the larger market share.

In this case there exists a price where a market player is indifferent as to which direction it should diverge as long as the opponent sticks to its price. The price of indifference occurs when undercutting the opponent gives the exact same profit as bidding higher than the opponent and capturing the residual demand.

The price of indifference depends on the exact market setup. In the simulation below, the price of indifference was 160. The bid of player 1 stays high (to maximize profits) while the bid of player 2 stays below the threshold of 160. Due to random mutation in the genetic algorithm, the values don’t stabilize entirely. Who ends up as the high bidder, and thus worse of, is entirely random.

The two players undercut one another until, at the threshold price, player 1 starts increasing its markup.

Duopoly in a congested two-bus network

So far, the underlying electrical network has been kept uncongested to enable analysis of some well known economic games. In this section, the simplest possible constrained network will be analysed theoretically and simulated with the developed model. The network is depicted in the figure below. It consists of two buses with a generator and load each; the buses are connected by a transmission line capable of transferring K MW in either direction.

Two bus network with constrained transmission capacity

Moreover, the transmission capacity K is less than the demand at either of the buses and the generators are capable of serving the load at their own bus and at the same time utilize the entire transmission capacity.

Accordingly, the generators face a strategic choice:

  1. Bid high to raise the price at its bus and serve the load that the opponent cannot capture because of the limited transmission capacity; or,
  2. Bid low and serve all the load it possibly can cover.

At fi rst, this might seem like the Bertrand game with capacity constraints. However, there is one important difference: since the generators are paid the nodal price (and not a common system price), the low bidder can profitably raise its price towards that of the high bidder. The consequence is that there is no NE in pure strategies, resulting in a cyclical variation in the price over the course of a simulation.

Simulated markups in the two-bus network for 4 different transmission capacity levels K

Competition in a meshed and congested grid

This section presents results from an analysis of a small benchmark power system presented in a paper on agent based modelling. There are 3 strategically behaving generators in the system and all load is assumed to to be completely inelastic. All transmission lines are assumed to have sufficient transmission capacity with the exception of the line from bus 2 to bus 5 which is limited to 100 MW.

Power system diagram. The numbers close to the lines are reactances in per unit.

Generator data

The authors in the original paper analytically computed the NE and simulated the system using adaptive agents with a discrete set og markups. Two symmetrical NE were found; generator 3 should always bid its highest markup while either generator 1 or 2 bids with the highest markup and the other bids its marginal cost. The agent-based approach did discover the NE, but alternated between the two equilibria in a cyclical fashion. Here, the markups can be any real number on the interval [0; 30], otherwise the simulation setup is identical to that of the original work.

I find the same NE, however no cyclical behavior occurs. As soon as the agents have locked in on one NE the game stays in that equilibrium for the rest of the simulation. To discover several equilibria, the model must be rerun with random initial conditions.

Simulated markups for the three generators and corresponding consumer benefit, producer benefit and congestion rent on the right

The figure above shows simulated simulated markups on the left and the corresponding consumer benefit (about 2500), producer bene fit (15000) and congestion rent on the right. If all producers had bid according to their marginal cost, we would have achieved the socially optimal solution, which has producers benefit at 1056, and consumers benefit at 17322. Accordingly there has been a large transfer of benefit from consumers to producers due to market power.

If we double the capacity on the line between bus 2 and 5, market power is mostly eradicated. In addition, the total economic benefit to society increases slightly.

Ideas for further work

Even with the simple setup used in this work it is possible to evolve interesting strategies. However, the field of reinforcement learning has evolved a lot since 2009. One idea for further development would be to re-implement the ideas presented here using modern frameworks for reinforcement learning like open Ai gym and Tensorforce. Such a setup could possibly be used to evolve cooperative strategies with longer memory as well as simulate time-varying market situations (like in the real world).


Comparing javascript libraries for plotting

After some research on available javascript libraries for plotting we decided that plotly fits our needs the best. We also provide a small vue-wrapper for plotly.

Plotly.js renders fast and has a large selection of visualizations

We frequently plot quite large amounts of data and want to be able to interact with the data without long wait times. I use the term large data, and not big data, since we are talking about a typical dataset size on the order of some 100k datapoints. This is not really big data, but it is still large enough that many plotting packages have trouble with rendering, zooming, panning etc. Ideally we would like render times below about 500 ms.

To find the ideal plotting package we set up a small test with some promising plotting packages: Echarts, Highcharts, Plotly and Bokeh. We plotted around 300k datapoints from a timeseries and noted the render times in ms. The chart below displays the render times in milliseconds from five repetitions.Echarts and bokeh are canvas based, and render much slower than the other packages which are SVG based.

Additionally, we find our selves wanting a large selection of visualization types. Classic bar chart, line charts etc. just is not enough. Incidentally, plotly also offers a wide range of charts. Plotly is also available as a python and R package so we can build experience in one plotting package independently of programming language.

Vue-wrapper for plotly

One of our main front-ends uses the vue.js framework. To integrate plotly cleanly we created a vue-wrapper for plotly. Check it out on github:

A vue wrapper for plotly.js chart library
30 forks.
118 stars.
33 open issues.

Recent commits:



Setting up a forecast service for weather dependent failures on power lines in one week and ten minutes

Combining Tableau, Python, Splunk and open data from to deliver a realtime forecast of the probability of failure due to wind and lightning on overhead lines.

Our last post was about estimating the probability of failure for overhead lines based on past weather and failure statistics. The results are currently used as part of our long term planning process, by using a Monte Carlo tool to simulate failures in the Norwegian transmission system.

In this post you can read about how we re-used this model for a different use case: using the current weather forecast to predict probability of failure per overhead line. This information is very relevant for our system operators when preparing for severe weather events. We use data from the Norwegian Meteorological Institute, Tableau for visualization, Python for our backend and Splunk for monitoring. This post explains how.

Overview of the forecast service

The core delivery of the mathematical model described in our previous blog post are fragility curves per overhead line. To recall from the previous post, those fragility curves are constructed based on historical failures on overhead line combined with historical weather data for that line. The resulting fragility curves give us the relation between the weather event and the expected probability of failure. In our current model we have two fragility curves per line: one that describes the expected failure due to wind vs wind exposure, and one that models the expected failure due to lightning vs lightning exposure.

Now, when we have this information per overhead line, we can combine this with weather forecast data to forecast expected failure rates for all overhead lines. Our service downloads the latest weather forecast when it becomes available, updates the expected failure probabilities, and exports the results to file. We designed a Tableau dashboard to visualize the results in an interactive and intuitive manner.

The resulting setup is displayed below:

flow diagram service

In the remainder, we cover how we created and deployed this service.

Data input: fragility curves

See our previous blog post for an explanation of the creation of the fragility curves.

Data input: open weather data

The Norwegian Meteorological Institute ( maintains a THREDDS server where anyone can download forecast data for Norway. THREDDS is an open source project which aims to simplify the discovery and use of scientific data. Datasets can be served through OPeNDAP, OGC’s WMS and WCS, HTTP, and other remote data access protocols. In this case we will use OPeNDAP so we don’t have to download the huge forecast files , only the variables we are interested in.

import logging
import numpy as np
import pandas as pd
import pyproj
import xarray as xr
from scipy.spatial import cKDTree
logger = logging.getLogger(__name__)
url = ';
# using the powerful xarray library to connect to the opendap server
ds = xr.open_dataset(url)
lat = ds['latitude'][:]
lon = ds['longitude'][:]
time_values = ds['time'][:].values
dt_time = pd.to_datetime(time_values)
wind_speed = ds['wind_speed'].load()
except Exception, e:
logger.error('Error in reading wind speed data', exc_info=True)
# in reality geo_object_list is list of power tower objects with coordinates
geo_object_list = []
def geo_object_coordinates():
return None
# collecting x,y coordinates for power towers
line_cartesian_positions = [
(x, y) for geo_object in geo_object_list for
(x, y) in geo_object_coordinates(geo_object)
target_points = map(lambda x_: [x_[0], x_[1]], line_cartesian_positions)
projection = pyproj.Proj(
"+proj=utm +zone=33 +north +ellps=WGS84 +datum=WGS84 +units=m +no_defs"
# setting up model tree
lon2d, lat2d = np.meshgrid(lon, lat, sparse=True)
utm_grid = projection(np.ravel(lon2d), np.ravel(lat2d))
model_grid = list(list(zip(*utm_grid)))
# finding nearest point in model grid for all power towers
distance, index = cKDTree(model_grid).query(target_points)
y, x = np.unravel_index(index, lon.shape)
wind_speed_array = wind_speed.isel(x=xr.DataArray(x), y=xr.DataArray(y))[:, 0, :].values
# setting up pointers to geo_object_list
start_ids = [0]
start_ids.extend([len(geo_object) for geo_object in geo_object_list])
start_ids = np.cumsum(start_ids)
# creating dictionary of wind speed pandas dataframes
# each dataframe consists of as many column as there is power towers for the line
wind_speed_dict = {}
for i, geo_object in enumerate(geo_object_list):
wind_speed_dict[geo_object.object_id] = pd.DataFrame(
wind_speed_array[:, start_ids[i]:start_ids[i + 1]],

The first thing to notice is that we select the latest deterministic forecast and load this using xarray. An alternative here would have been to download the complete ensamble forecast to get a sense of the uncertainty in the predictions as well. Next, we select wind_speed as the variable to load. Then we use the scipy.spatial.cKDTree package to find the nearest grid point to the ~30 000 power line towers and load the corresponding wind speed.

Forecast service: service creation and monitoring

We have developed both the probability model and the forecast service in Python. The forecasts are updated four times a day. To download the latest forecast as soon as it becomes available we continuously run a Windows process using Pythons win32service package. Only a few lines of code were necessary to do so:

class AppServerSvc (win32serviceutil.ServiceFramework):
_svc_name_ = "TestService"
_svc_display_name_ = "Test Service"
def __init__(self, args):
win32serviceutil.ServiceFramework.__init__(self, args)
self.hWaitStop = win32event.CreateEvent(None, 0, 0, None)
def SvcStop(self):
def SvcDoRun(self):
(self._svc_name_, ''))
def main(self):
import download_service64Bit
gw = execnet.makegateway("popen//python={}".format(r'C:\Python27_64Bit\python.exe'))
channel = gw.remote_exec(download_service64Bit)
sleeptime = 1
wait = win32event.WAIT_OBJECT_0
while win32event.WaitForSingleObject(self.hWaitStop, sleeptime * 1000) != wait:
sleeptime = channel.receive()
except BaseException as e:
# error handling – omitted in snippet

The actual logic of obtaining the fragility curves, downloading new forecasts, and combining this into the desired time series is included in the download_service64Bit module.

To ensure the service behaves as desired, we monitor the process in two ways. Firstly, we use Splunk to monitor the server’s windows event log to pick up any unexpected crashes of the service. The team will receive an email within 15 minutes of an unexpected event. Secondly, we will develop more advanced analytics of our log using Splunk. This is made very easy when using Python’s excellent logging module (a great introduction on why and how you should use this module can be found here).

Publishing results to Tableau

We use Tableau as a quick and easy way to create interactive and visually attractive dashboards. An example screenshot can be seen below:


In the top left corner, the information about the system state is shown. The percentage is the total probability of at least one line failing in the coming forecast period of three days. The dashboard shows the situation just before a major storm in January this year, leading to a very high failure probability. There actually occurred seven failures due to severe wind two days after the forecast. The bar plot shows which part of the estimated failure probability is due to wind versus lightning.

Overview of main system state

The main graph shows the development of failure over time for the whole power system. When no specific line is selected, this graph shows the development of the failure probability in the whole system over time:timechart

When one line is selected, it shows the details for this line: wind speed or lightning intensity, and estimated failure probability over time.


The heat map at the bottom of the screen shows the probability of failure per overhead line per hour.


The list of lines is ordered, listing the lines that are most likely to fail on top. This provides the system operator with a good indication of which lines to focus on. The heat map shows how the failure probability develops over time, showing that not all lines are exposed maximally at the same time. The wind/lightning emoji before the name of the line indicates whether that line is exposed to high wind or lightning.

Going from idea to deployment in one week and ten minutes

This weeks goal was to go from idea to deployment in one week by setting aside all other tasks. However, as Friday evening came closer, more and more small bugs kept popping up both in the data visualization as well as in the back-end.

Friday afternoon debugging

In the end, we got the service at least 80% complete. To get the last things in place we might need just 10 minutes more…

Estimating the probability of failure for overhead lines

In Norway, about 90 percent of all temporary failures on overhead lines are due to weather. In this post, we present a method to model the probability of failures on overhead lines due to lightning.

Welcome to the blog for Data Science in Statnett, the Norwegian electricity transmission system operator. We use data science to extract knowledge from the vast amounts of data gathered about the power system and suggest new data-driven approaches to improve power system operation, planning and maintenance. In this blog, we write about our work. Today’s topic is a model for estimating the probability of failure of overhead lines.

Knowing the probability of failure is central to reliability management

For an electricity transmission system operator like Statnett, balancing power system reliability against investment and operational costs is at the very heart of our operation. However, a more data-driven approach can improve on the traditional methods for power system reliability management. In the words of the recently completed research project Garpur:

Historically in Europe, network reliability management has been relying on the so-called “N-1” criterion: in case of fault of one relevant element (e.g. one transmission system element, one significant generation element or one significant distribution network element), the elements remaining in operation must be capable of accommodating the new operational situation without violating the network’s operational security limits.

Today, the increasing uncertainty of generation due to intermittent energy sources, combined with the opportunities provided e.g. by demand-side management and energy storage, call for imagining new reliability criteria with a better balance between reliability and costs.

In such a framework, knowledge about failure probabilities becomes central to power system reliability management, and thus the whole planning and operation of the power system. When predicting the probability of failure, weather conditions play an important part; In Norway, about 90 percent of all temporary failures on overhead lines are due to weather, the three main weather parameters influencing the failure rate being wind, lightning and icing. In this post, we present a method to model the probability of failures on overhead lines due to lightning. The full procedure is documented in a paper to PMAPS 2018. In an upcoming post we will demonstrate how this knowledge can be used to predict failures using weather forecast data from

Data sources: failure statistics and weather data

Statnett gathers failure statistics and publishes them annually in our failure statistics. These failures are classified according to the cause of the failure. For this work, we considered 102 different high voltage overhead lines. For these there have been 329 failures due to lightning in the period 1998 – 2014.

We have used renanalysis weather data computed by Kjeller Vindteknikk. These reanalysis data have been calculated in a period from january 1979 until march 2017 and they consist of hourly historical time series for lightning indices on a 4 km by 4 km grid. The important property with respect to the proposed methods, is that the finely meshed reanalysis data allows us to use the geographical position of the power line towers and line segments to extract lightning data from the reanalysis data set. Thus it is possible to evaluate the historical lightning exposure of the transmission lines.

Lightning indices

The first step is to look at the data. Lightning is sudden discharge in the atmosphere caused by electrostatic imbalances. These discharges occur between clouds, internally inside clouds or between ground and clouds. There is no atmospheric variable directly associated with lightning. Instead, meteorologists have developed regression indices that measure the probability of lightning. Two of these indices are linked to the probability of failure of an overhead line. The K-index and the Total Totals index. Both of these indices can be calculated from the reanalysis data.

Figure 1 Rank and K index for lightning failures

Figure 1 shows how lightning failures are associated with high and rare values of the K and Total Totals indices, computed from the reanalysis data set. For each time of failure, the highest value of the K and Total Totals index over the geographical span of the transmission line have been calculated, and then these numbers are ranked among all historical values of the indices for this line. This illustrates how different lines fail at different levels of the index values, but maybe even more important: The link between high index values and lightning failures is very strong. Considering all the lines, 87 percent of the failures classified as “lightning” occur within 10 percent of the time. This is promising…

Figure 2: TT index versus K index shows seasonality trend

In Norway, lightning typically occurs during the summer in the afternoon as cumulonimbus clouds accumulate during the afternoon. But there is a significant number of failures due to thunderstorms during the rest of the year as well, winter months included. To see how the indices, K and T T , behave for different seasons, the values of these two indices are plotted at the time of each failure in Figure 3. From the figure it is obvious, though the data is sparse, that there is relevant information in the Total Totals index that has to be incorporated into the probability model of lightning dependent failures. The K index has a strong connection with lightning failures in the summer months, whereas the Totals Totals index seems to be more important during winter months.

Method in brief

The method is a two-step procedure: First, a long-term failure rate is calculated based on Bayesian inference, taking into account observed failures. This step ensures that lines having observed relatively more failures and thus being more error prone will get a relatively higher failure rate. Second, the long-term annual failure rates calculated in the previous step are distributed into hourly probabilities. This is done by modelling the probabilities as a functional dependency on relevant meteorological parameters and assuring that the probabilities are consistent with the failure rates from step 1.

Bayesian update

From the failure statistics we can calculate a prior failure rate \lambda due to lightning simply by summing the number of failures per year and dividing by the total length of the overhead lines. We then arrive at a failure rate per 100 km per year. This is our prior estimate of the failure rate for all lines.

When we observe a particular line, the failures arrive in what is termed a Poisson process. When we assume that the failure rate is exponentially distributed, we arrive at a convenient expression for the posterior failure rate \lambda^B:

\lambda^B = \frac{1 + \sum{y_i}}{\frac{1}{\lambda} + n}

Where n is the number of years with observations, \lambda is the prior failure rate and y_i is the number of observed failures in the particular year.

Figure 3: The prior and the posterior distribution. Dashed vertical lines are the

Distributing the long term failure rates over time

We now have the long-term failure rate for lightning, but have to establish a connection between the K-index, the Totals Totals index and the failure probability. The goal is to end up with hourly failure probabilities we can use in monte-carlo simulations of power system reliability.

The dataset is heavily imbalanced. There are very few failures (positives), and the method has to account for this so we don’t end up predicting a 0 % probability all the time. Read a good explanation of learning from imbalanced datasets in this kdnuggets blog.

Many approaches could be envisioned for this step, including several variants of machine learning. However, for now we have settled on an approach using fragility curves which is also robust for this type of skewed/biased dataset.

A transmission line can be considered as a series system of many line segments between towers. We assume that the segment with the worst weather exposure is representable for the transmission line as a whole.

We then define the lightning exposure at time t:

w^t = \alpha_K \max(0, K^t_{max} - K_{\text{thres}})^2 + \alpha_{TT} \max(0, TT^t_{max} - TT_{\text{thres}})^2

Where \alpha_K, \alpha_{TT} are scale parameters, K_{max}^t is the maximum K index along the line at time t, TT_{max}^t is the maximum Total Totals index at time t along the line. K_{\text{thres}}, TT_{\text{thres}} are threshold values for the lightning indices below which the indices has no impact on the probability.

Each line then has an probability of failure at time t given by:

p_L^t = F(w^t; \sigma_L, \mu_L)

where F(\cdot) is the cumulative log normal function.

To find the standard deviation and expected value that describe the log normal function, we minimize the following equation to ensure that the expected number of failures equals the posterior failure rate:

\mu_L, \sigma_L = \underset{\mu, \sigma}{\text{argmin}} \: g(p^t_L; \mu, \sigma)


g(p^t_L; \mu, \sigma) = \left(\lambda^B - \frac{1}{k}\sum_{t=0}^T p^t_L\right)^2

If you want to delve deeper into the maths behind the method we will present a paper at PMAPS 2018.

Fitting the model to data

In this section simulation results are presented where the models have been applied to the Norwegian high voltage grid. In particular 99 transmission lines in Norway have been considered, divided into 13 lines at 132 kV, 2 lines at 220 kV, 60 lines at 300 kV and 24 lines at 420 kV. Except for the 132 and 220 kV lines, which are situated in Finnmark, the rest of the lines are distributed evenly across Norway.

The threshold parameters K_{\text{thres}} and TT_{\text{thres}} have been set empirically to K_{\text{thres}} = 20.0 and TT_{\text{thres}} = 45.0. The two scale parameters \alpha_K and \alpha_{TT} have been set by heuristics to \alpha_K = 0.88 and \alpha_{TT} = 0.12, to reflect the different weights of the seasonal components.


The probability models presented above are being used by Statnett as part of a Monte Carlo tool to simulate failures in the Norwegian transmission system for long term planning studies. Together with a similar approach for wind dependent probabilities, we use this framework as the basic input to these Monte Carlo simulation models. In this respect, the most important part of the simulations is to have a coherent data set when it comes to weather, such that failures that occur due to bad weather appear logically and consistently in space and time.

Figure 4: Seasonal differences in K index and TT index for simulated results

Figure 4 shows how the probability model captures the different values of the K index and the Total Totals index as the time of the simulated failures varies over the year. This figure should be compared with figure 2. The data in Figure 4 is one out of 500 samples from a Monte Carlo simulation, done in the time period from 1998 to 2014.

The next figures show a zoomed in view of some of the actual failures, each figure showing how actual failures occur at time of elevated values of historical probabilities.



%d bloggers like this: