How we quantify power system reliability

An important aspect of digital transformation for a utility company is to make data-driven decisions about risk. In electric power systems, the hardest part in almost any decision is to quantify the reliability of supply. That is why we have been working on methodology, data quality and simulation tools with this in mind for several years at Statnett. This post describes our simulation tool for long-term probabilistic reliability analysis called MONSTER.

The need for reliable power system services is ever increasing in our system as  technology advances. At the same time, the focus on carefully analysed and socio economic profitable investments has perhaps never been more essential. In light of this, several initiatives focus on having probabilistic criteria as the frame for doing reliability analysis. This post describes our Monte Carlo simulation tool for probabilistic power system reliability analysis that we have dubbed MONSTER.

Model overview

holisticOur simulation methodology is described in this IEEE PMAPS paper from 2018: A holistic simulation tool for long-term probabilistic power system reliability analysis (you need IEEE Xplore membership to access the article)

Here I’ll repeat the basics, ingnore most of the maths and expand on some details that have been left out in the paper.

Our tool consist of two main parts: A Monte Carlo simulation tool and a Contingency analysis module. The main idea in the simulation part is to use hourly times series of failure probability for each component.

We do Monte Carlo simulations of the power system where we draw component failure times and durations, and then let the Contingency analysis module calculate the consequence of each of these failures. The Contingency evaluation module is essentially a power system simulation using optimal loadflow to find the appropriate remedial actions, and the resulting “lost load” (if any).

The end result of the whole simulation is typically expressed as expected loss of load, expected socio-economic costs or distributions of the same parameters. Such results can, for instance, be used to make investment decisions for new transmission capacity.

Time series manager

The main principle of the Time series manager is to first use failure observations to calculate individual failure rates for each component by using a Bayesian updating scheme, and then to distribute the failure rates in time by using historical weather data.  We have described the procedure in detail in a previous blog post.

We first calculate individual, annual failure rates using observed failure data in the power system. Then for each component we calculate an hourly time series for the probability of failure. The core step in this approach is to calculate the failure probabilities based on reanalysis weather data such that the resulting expected number of failures is consistent with the long term annual failure rate calculated in the Bayesian step. This is about spreading the failure rate in time such that we get one probability of failure at each time step. In this way, components that experience the same adverse weather will have elevated probabilities of failure at the same time. This phenomenon is called failure bunching and it’s effect is illustrated in the figure below. In fact we will get a consistent and realistic geographical and temporal development of failure probabilities throughout the simulation period.

failurebunching
Failure bunching dramatically increases the probability of multiple independent failures happening at the same time.

In our model we divide each fault into eight different categories: Either the fault is  temporary or Permanent, and each of the failures can be due to either wind, lightning, snow/icing or unrelated to weather. Thus, for each component we get eight historical time series to be used in the simulations.

outages temporary failures

Biases in the fault statistics

There are two data sources for this part of the simulation. Observed failure data from the national fault statistics database FASIT, and approximately 30 years of reanalysis weather data calculated by Kjeller Vindteknikk.  The reanalysis dataset is nice to work with from a data science perspective. As it is simulated data, there are no missing values, no outliers, uniform data quality etc.

The fault statistics, however, has lots of free text columns, missing values and inconsistent reporting (done by different people across the nation over two decades). So a lot of data prep is necessary to go from reported disturbances to statistics relevant for simulation. Also, there are some biases in the data to be aware of:

  • There is a bias towards too long repair/outage-times. This happens because most failures don’t lead to any critical consequences, therefore we spend more time than strictly necessary repairing the component or otherwise restoring normal operation. For our simulation we want the repair time we would have needed in case of a critical failure. We have remedied this to some extent by eliminating some of the more extreme outliers in our data.
  • There is a bias towards reporting too low failure rates; only failures that lead to an immediate disturbance in the grid count as failures in FASIT. Failures that lead to controlled disconnections are not counted as failures in this database, but are  relevant for our simulations if the disconnection is forced (as opposed to postponable).

The two biases counteract each other, in addition this kind of epistemic uncertainty is possible to mitigate by doing sensitivity analysis.

Improving the input-data: Including asset health

Currently, we don’t use component specific failure rates. For example, all circuit breakers have the same failure rate regardless of age, brand, operating history etc. For long term simulations (like new power line investment decisions) this is a fair assumption as components are regularly maintained and swapped out as time passes by.

The accuracy of the simulations, especially on short to medium term, can be greatly improved by including some sort of component specific failure rate. Indeed, this is an important use-case in and by itself for risk based asset maintenance.

Monte Carlo simulation

A Monte Carlo simulation consist of a large number of random possible/alternative realizations of the world, typically on the order of 10000 or more. In our tool, we do Monte Carlo simulations for the time period where we have weather data, denoted by k years, where k~=30. For each simulation, we first draw the number N of failures in this period for each component and each failure type. We do this by using the negative binomial distribution. Then we distribute these failures in time by drawing with replacement from the k∗8760 hours in the simulation period such that the probability of drawing any particular hour is in proportion to the time series of probabilities calculated earlier. As the time resolution for the probabilities is hourly, we also draw a uniform random value of the failure start time within an hour. Then finally, for each of these failures we draw random duration of outages:

for each simulation do
for each component and failure type in study do
draw the number N of failures
draw N failures with replacement
for each failure do
draw random uniform start time within hour
draw duration of failure
feilrater
Convergence of the most and least frequent single- and double failure rates for the first 1000 simulations.

Outage manager

In addition to drawing random errors, we also have to incorporate maintenance planning. This could for example be done either as an explicit manual plan or through an optimization routine that distributes the services according to needs and probabilities of failure.

How detailed the maintenance plan is modeled, typically depends on the specific area of study. In some areas, the concern is primarily of a failure occurring during a few peak load hours when margins are very tight. At such times it is safe to assume that there is no scheduled maintenance – and we disregard it entirely in the model. In other areas, the critical period might be when certain lines are out for maintenance, and so it becomes important to model maintenance realistically.

For each simulation, we step through points in time where something happens and collect which components are disconnected. In this part we also add points in time where the outage pass some predefined duration, called outage phase in our terminology. We do this to be able to apply different remedial measures depending on how much time has passed since the failure occurred. The figure below illustrates the basic principles.

outagephases
Collecting components and phases to contingency list. Dashed line indicate phase 1 and solid line phase 2. In this example the two outages contribute to five different contingencies. In phase 1, only automatic remedial measures (like system protection) can be applied. In phase 2, manual measures, like activation of tertiary reserves can be applied.

One important assumption is that each contingency (like the five contingencies in the figure above) can be evaluated independently. This makes the contingency evaluation, which is the computationally heavy part, embarrassingly parallel. Finally, we eliminate all redundant contingencies. We end up with a pool of unique contingencies that we send to the Contingency Analysis module for load flow calculations in order to find interrupted power.

Contingency analysis

Our task in the Contingency analysis module is to mimic the system operator’s response to failures. This is done in four steps:

  1. Fast AC-screening to filter out contingencies that don’t result in any thermal overloads. If overloads, proceed to next step.
  2. Solve the initial AC load-flow, if convergence move to next step. In case of divergence (due to voltage collapse) attempt to emulate the effect of voltage collapse with a heuristic, and report the consequence.
  3. Apply remedial measures with DC-optimal load-flow, report all measures applied and go to next step.
  4. Shed load iteratively using AC load-flow until no thermal overloads remain. Report lost load.

In step 3 above, there are many different remedial measures that can be applied:

  • Reconfiguration of the topology of the system
  • Reschedule production from one generator to another
  • Reduce or increase production in isolation
  • Change the load at given buses

For the time being, we do not consider voltage or stability problems in our model apart from the rudimentary emulation of full voltage collapse. However, it is possible to add system protection to shed load for certain contingencies (which is how voltage stability problems are handled in real life).

As a result from the contingency analysis, we get the cost of applying each measure and the interrupted power associated with each contingency.

Load flow models and OPF algorithm

Currently we use Siemens PTI PSS/E to hold our model files and to solve the AC load flow in steps 1,2 and 4 above. For the DC optimal load-flow we express the problem with power transfer distribution factors (PTDF) and solve it as a mixed integer linear program using a commercial solver.

Cost evaluation – post-processing

Finally, having established the consequence of each failure, we calculate the socio-economic costs associated with each interruption. This is done by following the Norwegian regulation. The general principle is described in this 2008 paper, but the interruption costs have been updated since then. For each failure, we know where in the grid there are disconnections, for how long they last, and assuming we know the type of users connected to the buses in our load-flow model, the reliability indices for different buses will be easy to calculate. Due to the Monte Carlo approach, we will get a distribution of each of the reliability indices.

risk_matrix_mwh
Result example: Failure rate and lost load averaged over alle simulation for some contingencies (black dots).

A long and winding development history

MONSTER has been in development for several years at Statnett after we did an initial study and found that no commercial tools where available that fit the bil. At first we developed two separate tools: one Monte Carlo simulation tool in Matlab, used to find the probability of multiple contingencies, and an Python AC contingency evaluation module built on top of PSS/E, used to do contingency analysis with remedial measures. Both tools were built to solve specific tasks in the power system analysis department. However, combining the tools was a cumbersome process that involved large spreadsheets.

At some point, we decided to do a complete rewrite of the code. The rewriting process has thought us much of what we know today about modern coding practices (like 12-factor apps), and it has helped build the data science team at Statnett. Now, the backend is written in python, the front end is a web-app built around node.js and vue, and we use a SQL database to persist settings and results.

skjermbilde-stasjon.jpg
A screenshot from the application where the user defines substation configuration.