Reinforcement learning using self-play is currently one of the most fascinating areas of research within artificial intelligence. Papers like Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm make for inspiring reading, and I’m sure we’ll see a host of real-world applications in the future. These current trends in reinforcement learning came to mind when an old piece of unpublished research I wrote surfaced some weeks ago. In the article I used a genetic algorithm (GA) – all the rage back in 2009 – to learn the weights of a neural network that provides the bidding strategy for competing generator companies in a simulated power market. With this setup I could demonstrate well-known Nash equilibria (NE) in common market situations, as well as simulate competition in analytically intractable market setups.
I wrote the article while working as a research scientist at Sintef Energy AS back in 2009, and the work was financed by Sintef. The topic is not that relevant to pursue as a TSO, but the methods proved interesting enough to write a post about here. Thanks to Sintef for letting me publish it!
The entire article draft can be downloaded here
The market participants
The paper starts off with a description of the market participants (generators). The generators must bid their capacity into the power exchange. Each generator can behave strategically by adding a markup to its marginal cost, thereby attempting to achieve a higher profit, but at the risk of losing market share. The markup is the output of the neural network, which we will come to later.
For convenience we use a linear marginal cost function with a scalar markup.
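To make this concrete, one possible form (the notation and the additive markup are my own illustration, not necessarily the exact formulation in the article) is

$$
MC_u(q) = a_u + b_u\,q, \qquad \text{bid}_u(q) = a_u + m_u + b_u\,q,
$$

where $m_u \ge 0$ is the scalar markup chosen by generator $u$'s strategy.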
Nodal pricing and market clearing
The power exchange collects all the market participants’ bids (this includes demand). It then has to find the intersection of the aggregate supply and demand curves while taking transmission constraints into account. In other words, the power exchange must find the dispatch of all generators that minimizes the system cost. With linear bids and a linearized version of the power flow equations, this is a quadratic optimization problem.
The problem formulation implies that each node in the power system can have a unique power price (if there are binding constraints in the network). The nodal power prices are given by the Lagrange multipliers of the nodal balance constraints: each multiplier is the cost of supplying one additional MW of energy at that node. This setup roughly reflects several real-world power markets.
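A sketch of the resulting clearing problem (my notation and simplifications, not the article's exact formulation):

$$
\begin{aligned}
\min_{q}\;& \sum_u \Big( (a_u + m_u)\,q_u + \tfrac{1}{2} b_u\,q_u^2 \Big) \\
\text{s.t.}\;& \sum_{u \in n} q_u - d_n = \text{net injection at node } n \quad [\lambda_n] \quad \text{for every node } n,\\
& |f_\ell| \le \bar{f}_\ell \ \text{for every line } \ell, \qquad 0 \le q_u \le \bar{q}_u .
\end{aligned}
$$

Under the DC approximation the line flows $f_\ell$ are linear functions of the nodal injections, so the problem remains a quadratic program, and the dual variable $\lambda_n$ of each nodal balance is that node's price.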
To solve this quadratic optimization problem we can employ what is called a DC optimal power flow. The Python package pypower handles all this for us. A limited amount of coding should let you add a variable markup to each generator’s marginal cost function.
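As an illustration of how this could be wired up (this is not the article’s code; the 9-bus test case, the markup vector and the cost-column bookkeeping are my own assumptions):

```python
# Minimal sketch: clear a small market with pypower's DC optimal power flow and read
# out the nodal prices. A scalar markup is added to each generator's linear cost term.
import numpy as np
from pypower.api import case9, ppoption, rundcopf
from pypower.idx_bus import LAM_P   # column holding the nodal price (dual of the balance)
from pypower.idx_cost import COST   # first column of the polynomial cost coefficients
from pypower.idx_gen import PG      # column holding the cleared generation

def clear_market(markups):
    ppc = case9()                                 # standard 9-bus test case with 3 generators
    # case9 uses quadratic costs [c2, c1, c0]; the linear term c1 is the intercept of the
    # marginal cost, so the scalar markup is added there (my assumption).
    ppc["gencost"][:, COST + 1] += markups
    ppopt = ppoption(VERBOSE=0, OUT_ALL=0)
    result = rundcopf(ppc, ppopt)
    prices = result["bus"][:, LAM_P]              # nodal prices
    dispatch = result["gen"][:, PG]               # cleared generation per unit
    return prices, dispatch

prices, dispatch = clear_market(np.array([5.0, 0.0, 2.0]))
print(prices, dispatch)
```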
After the market is cleared, each participant is paid the price at the node where it resides, even if its power was consumed at a node with a different price. The profit from trade between differently priced nodes, the congestion rent, accrues to the system operator.
Evolving strategies using a genetic algorithm
The genetic algorithm (GA) is used to evolve strategies for the market participants. Each market participant has its own pool of n strategies that are evolved using the GA. The aim of each participant is to maximize its profits.
At each generation g of the simulation the participants meet several times; these are the iterations. At each iteration, each participant randomly selects a strategy from its pool of strategies. This strategy is evaluated to yield a bid that is submitted to the market clearing algorithm. The market clearing returns the participants’ profits, which are stored in memory. After the desired number of iterations have passed, the fitness f(u, s, g) of strategy s for unit u at generation g is calculated as the sum of profits it received divided by the number of times it was played. Thereafter, selection, crossover, and mutation are applied to produce the new generation of strategies, and a new generation is started.
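In symbols (notation mine): if strategy s of unit u was played $N_{u,s,g}$ times during generation g and earned profit $\pi_i$ in play i, its fitness is

$$
f_{u,s,g} = \frac{1}{N_{u,s,g}} \sum_{i=1}^{N_{u,s,g}} \pi_i .
$$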
In pseudocode:
1. Initialize and load the power system structure, set g = 0
2. g = g + 1, i = 0
3. i = i + 1
4. For each participant u, select a random strategy s from its pool
5. Find the bid of each participant, given the selected strategy
6. Evaluate the market clearing algorithm given the bids of all players and store the achieved profits
7. If i < numIterations, go to step 3; else:
8. Calculate the fitness f(u, s, g) of each strategy as the average of all profits it received this generation
9. Perform selection, crossover and mutation of all strategies
10. If g < numGenerations, go to step 2
11. Simulation ends
The number of iterations is selected such that each strategy is played on average ten times per generation. In the article, the standard number of strategies for each participant was n = 50, so there were 500 iterations per generation.
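To make the loop concrete, here is a compact, self-contained sketch (my own toy stand-in, not the article’s code): a single-node, pay-as-cleared market with three identical units replaces the DC-OPF clearing, and a clipped affine rule stands in for the neural-network strategy described in the next section.

```python
import numpy as np

rng = np.random.default_rng(0)
num_units, pool_size, plays_per_strategy, num_generations = 3, 50, 10, 20
num_iterations = pool_size * plays_per_strategy            # 50 strategies * 10 plays = 500
marginal_cost, capacity, demand, max_markup = 10.0, 60.0, 100.0, 30.0

pools = [rng.normal(size=(pool_size, 3)) for _ in range(num_units)]   # one strategy pool per unit
memory = [np.zeros((pool_size, 2)) for _ in range(num_units)]         # [last markup, last price]

def strategy_markup(params, last_markup, last_price):
    """Toy stand-in for a strategy: a clipped affine rule on the two inputs."""
    raw = params[0] + params[1] * last_markup + params[2] * last_price
    return float(np.clip(raw, 0.0, max_markup))

def clear_market(offers):
    """Toy single-node clearing: cheapest offers dispatched first, marginal offer sets the price."""
    order, dispatch, remaining = np.argsort(offers), np.zeros(num_units), demand
    price = offers[order[-1]]
    for u in order:
        dispatch[u] = min(capacity, remaining)
        remaining -= dispatch[u]
        if remaining <= 0:
            price = offers[u]
            break
    return (price - marginal_cost) * dispatch, np.full(num_units, price)

for g in range(num_generations):
    profit_sum = [np.zeros(pool_size) for _ in range(num_units)]
    times_played = [np.zeros(pool_size) for _ in range(num_units)]
    for i in range(num_iterations):
        chosen = [int(rng.integers(pool_size)) for _ in range(num_units)]
        markups = [strategy_markup(pools[u][s], *memory[u][s]) for u, s in enumerate(chosen)]
        profits, prices = clear_market(marginal_cost + np.array(markups))
        for u, s in enumerate(chosen):
            profit_sum[u][s] += profits[u]
            times_played[u][s] += 1
            memory[u][s] = (markups[u], prices[u])
    fitness = [profit_sum[u] / np.maximum(times_played[u], 1) for u in range(num_units)]
    # selection, crossover and mutation of each pool, guided by fitness, would go here
```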
Representing the strategy as a neural network
A strategy is the rule by which a participant determines its bid, in this case the markup. In mathematical terms, a strategy is a function that takes a set of inputs and returns a decision (the markup). This function has parameters that determine the relationship between inputs and output; it is these parameters that the GA works on.
Inspired by research on the Iterated Prisoner’s Dilemma (IPD), the inputs to the strategy are taken to be the markup and the price attained the last time the strategy was played. The loose resemblance to an IPD is that the markup is one’s own level of cooperation, while the price corresponds to the cooperative level of the competitor(s). With these two inputs the strategy has a memory of one round and is able to evolve tit-for-tat-like behaviour: if you bid high last turn, I continue to bid high this turn. The amount of information taken as input could easily be extended to allow for more sophisticated strategies; perhaps this could be an area for recurrent neural networks.
A convenient way of creating a parametrized function that can represent a strategy as described above is a multilayer feed-forward neural network, which can approximate any function arbitrarily well given sufficient neurons. In the article I used a small neural network with one hidden layer. The weights and biases of the network were then evolved using the genetic algorithm. A modern approach would perhaps use deep Q-learning or something similar instead.
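A minimal sketch of such a strategy (my own illustration; the hidden-layer size, the squashing function and the scaling to a maximum markup are assumptions, not the article’s exact network):

```python
# One-hidden-layer network whose weights come from a flat parameter vector,
# so the GA can treat the whole strategy as a single genome.
import numpy as np

N_HIDDEN = 4                                         # assumed hidden-layer size
N_PARAMS = 2 * N_HIDDEN + N_HIDDEN + N_HIDDEN + 1    # weights and biases of both layers

def markup_from_strategy(genome, last_markup, last_price, max_markup=30.0):
    """Map the two inputs (markup and price when the strategy was last played) to a new markup."""
    w1 = genome[: 2 * N_HIDDEN].reshape(N_HIDDEN, 2)
    b1 = genome[2 * N_HIDDEN : 3 * N_HIDDEN]
    w2 = genome[3 * N_HIDDEN : 4 * N_HIDDEN]
    b2 = genome[-1]
    hidden = np.tanh(w1 @ np.array([last_markup, last_price]) + b1)
    out = 1.0 / (1.0 + np.exp(-(w2 @ hidden + b2)))  # squash to (0, 1)
    return max_markup * out                           # scale to the allowed markup range

genome = np.random.default_rng(1).normal(size=N_PARAMS)
print(markup_from_strategy(genome, last_markup=5.0, last_price=42.0))
```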
Perfect competition and co-operative equilibria
The model is primarily used to set up market situations that correspond to classical game-theoretic settings which have been analyzed extensively in the literature. The idea is to show that self-play converges to well-known Nash equilibria (NE).
The first part of the article shows some rather obvious results:
- In a market with a lot of small generators and a surplus of capacity, the markup converges to zero
- In a duopoly where each generator can serve exactly half the load, the markup converges to the highest allowed markup
- In a duopoly where one generator alone can serve the load (Bertrand competition), the markup converges to zero
Conceptually, Bertrand competition bears some resemblance to the Prisoner’s Dilemma: if both generators cooperate by bidding high they both earn a good profit; however, it is always in the interest of the other to defect and bid marginally lower, so mutual defection is ensured. In the iterated version of the game (the IPD) it is known that players can evolve strategies that maintain a mutually cooperative level, even though the equilibrium is for both players to defect if the game is played only once. In the normal IPD there are only two choices, full cooperation or full defection. A common extension of the game is to include several discrete levels of cooperation.

In a paper on the IPD from 2002, the authors noted that the chance of mutual cooperation dwindled as the number of possible choices in the IPD increased. Considering that each player in the Bertrand game can select any positive real number as her price (which can be thought of as a continuum of cooperation levels), it is simply very unlikely to remain at a mutually cooperative level. This can explain why mutual cooperation is likely if the choice of each player is limited to a few discrete price levels, but highly unlikely if the price can be selected from a continuum.
Other papers on agent-based modelling limit the agents to a discrete set of choices. If the market situation resembles Bertrand competition, the above discussion might lead one to think that mutual cooperation can occur, albeit with a small probability if the number of choices is large. However, it turns out that the similarity between the Bertrand game and the Prisoner’s Dilemma disappears if the number of choices is limited. In fact, the outcome of simulated Bertrand competition with discrete levels of cooperation is quite arbitrary and depends on the exact markups the players can choose from.
Competition with capacity constraints
With capacity constraints, one generator alone cannot serve the entire market as in Bertrand competition. The participant that bids higher than its opponent still captures some residual demand. A market player can thus follow one of two strategies:
- Try to undercut the other player to get a large market share and risk low prices, or
- Set the markup high to make sure the market clearing price remains high, and accept that the other player gets the larger market share.
In this case there exists a price at which a market player is indifferent as to which direction it should deviate, as long as the opponent sticks to its price. The price of indifference occurs when undercutting the opponent gives exactly the same profit as bidding higher than the opponent and capturing the residual demand.
The price of indifference depends on the exact market setup. In the simulation below, the price of indifference was 160. The bid of player 1 stays high (to maximize profits) while the bid of player 2 stays below the threshold of 160. Due to random mutation in the genetic algorithm, the values never stabilize entirely. Who ends up as the high bidder, and thus worse off, is entirely random.

Duopoly in a congested two-bus network
So far, the underlying electrical network has been kept uncongested to enable analysis of some well known economic games. In this section, the simplest possible constrained network will be analysed theoretically and simulated with the developed model. The network is depicted in the figure below. It consists of two buses with a generator and load each; the buses are connected by a transmission line capable of transferring K MW in either direction.

Moreover, the transmission capacity K is less than the demand at either bus, and each generator is capable of serving the load at its own bus while at the same time utilizing the entire transmission capacity.
Accordingly, the generators face a strategic choice:
- Bid high to raise the price at its bus and serve the load that the opponent cannot capture because of the limited transmission capacity; or,
- Bid low and serve all the load it can possibly cover.
At first, this might seem like the Bertrand game with capacity constraints. However, there is one important difference: since the generators are paid the nodal price (and not a common system price), the low bidder can profitably raise its price towards that of the high bidder. The consequence is that there is no NE in pure strategies, resulting in a cyclical variation in the price over the course of a simulation.

Competition in a meshed and congested grid
This section presents results from an analysis of a small benchmark power system presented in a paper on agent-based modelling. There are 3 strategically behaving generators in the system, and all load is assumed to be completely inelastic. All transmission lines are assumed to have sufficient capacity, with the exception of the line from bus 2 to bus 5, which is limited to 100 MW.


The authors of the original paper analytically computed the NE and simulated the system using adaptive agents with a discrete set of markups. Two symmetrical NE were found: generator 3 should always bid its highest markup, while either generator 1 or 2 bids the highest markup and the other bids its marginal cost. The agent-based approach did discover the NE, but alternated between the two equilibria in a cyclical fashion. Here, the markups can be any real number on the interval [0, 30]; otherwise the simulation setup is identical to that of the original work.
I find the same NE; however, no cyclical behavior occurs. As soon as the agents have locked in on one NE, the game stays in that equilibrium for the rest of the simulation. To discover several equilibria, the model must be rerun with random initial conditions.

The figure above shows the simulated markups on the left and the corresponding consumer benefit (about 2500), producer benefit (15000) and congestion rent on the right. If all producers had bid according to their marginal cost, we would have obtained the socially optimal solution, with a producer benefit of 1056 and a consumer benefit of 17322. Accordingly, there has been a large transfer of benefit from consumers to producers due to market power.
If we double the capacity on the line between bus 2 and 5, market power is mostly eradicated. In addition, the total economic benefit to society increases slightly.
Ideas for further work
Even with the simple setup used in this work it is possible to evolve interesting strategies. However, the field of reinforcement learning has evolved a lot since 2009. One idea for further development would be to re-implement the ideas presented here using modern reinforcement learning frameworks such as OpenAI Gym and Tensorforce. Such a setup could possibly be used to evolve cooperative strategies with longer memory, as well as to simulate time-varying market situations (like in the real world).
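As a rough illustration of what such a re-implementation could start from (entirely my own sketch; the observation and action spaces, the bounds and the clearing stub are placeholders, not anything from the article), the market could be wrapped as a Gym environment:

```python
import numpy as np
import gym
from gym import spaces

class PowerMarketEnv(gym.Env):
    """One strategic generator bids a markup; the rest of the market is part of the environment."""

    def __init__(self, max_markup=30.0, episode_length=200):
        super().__init__()
        self.action_space = spaces.Box(low=0.0, high=max_markup, shape=(1,), dtype=np.float32)
        # observation: own last markup and the last nodal price at the generator's bus
        self.observation_space = spaces.Box(low=0.0, high=np.inf, shape=(2,), dtype=np.float32)
        self.episode_length = episode_length

    def reset(self):
        self.t = 0
        self.state = np.zeros(2, dtype=np.float32)
        return self.state

    def step(self, action):
        markup = float(action[0])
        # clear_market would run the market clearing (e.g. a DC OPF) with this markup and
        # the competitors' bids, returning the agent's profit and its nodal price.
        profit, price = self.clear_market(markup)
        self.state = np.array([markup, price], dtype=np.float32)
        self.t += 1
        done = self.t >= self.episode_length
        return self.state, profit, done, {}

    def clear_market(self, markup):
        raise NotImplementedError  # placeholder; see the pypower sketch earlier in the post
```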