How can I improve the evaluation of one position without making the evaluation of other positions worse?

by mercury0114   Last Updated December 28, 2017 11:19 AM

I am trying to teach a computer to play the m,n,k-game using reinforcement learning. Here is a high-level summary of the algorithm:

for (int i = 0; i < NUMBER_OF_ITERATIONS; i++) { // repeat 1000000 times
    Position position = startingPosition(); // empty board
    while (!gameEnded(position)) {
        boolean makeBestMove = selectTrueWithProbability(0.9);
        Position nextPosition = makeBestMove ? makeBestMove(position) 
                                             : makeRandomMove(position);

        double currentEvaluation = neuralNetwork.evaluate(position);
        double nextEvaluation = gameEnded(nextPosition) ?
            gameResult(nextPosition) // 1.0 for win, 0.0 for draw, -1.0 for loss
            : neuralNetwork.evaluate(nextPosition);

        if (makeBestMove) {
            // Learn only from greedy moves: nudge the current evaluation toward the next one
            neuralNetwork.adjustWeights(nextEvaluation - currentEvaluation, position);
        }
        position = nextPosition;
    }
}
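For reference, the exploration helper used in the loop could look roughly like this. It is only a sketch: the class name Exploration and the shared Random instance are my assumptions for illustration, while selectTrueWithProbability is the name from the loop above.

```java
import java.util.Random;

class Exploration {
    private static final Random RANDOM = new Random();

    /** Returns true with the given probability (0.0 = never, 1.0 = always). */
    static boolean selectTrueWithProbability(double probability) {
        // nextDouble() is uniform in [0.0, 1.0), so the comparison succeeds
        // with the requested probability.
        return RANDOM.nextDouble() < probability;
    }

    public static void main(String[] args) {
        int greedy = 0;
        for (int i = 0; i < 10_000; i++) {
            if (selectTrueWithProbability(0.9)) greedy++;
        }
        // Expect roughly 9000 greedy moves out of 10000.
        System.out.println("greedy moves: " + greedy + " / 10000");
    }
}
```

With probability 0.9 the loop plays the move it currently thinks is best, and the remaining 10% of moves are random exploration.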

neuralNetwork.evaluate(position) feeds the board configuration into the neural network and returns an output in the range [-1.0, 1.0]. My goal is to train neuralNetwork to return:

  • A value close to 1.0 if the position is a win for white (the player who starts the game);
  • A value close to 0.0 if the position is a draw;
  • A value close to -1.0 if the position is a loss for white.

I am training the neural network using backpropagation:

neuralNetwork.adjustWeights(nextEvaluation - currentEvaluation, position)

computes the gradient of the network output with respect to each weight for the given position, and adds (nextEvaluation - currentEvaluation) multiplied by that gradient to every weight.
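To make the update concrete, here is a minimal sketch of that rule on a much simpler model than my actual network. Everything in it is an assumption for illustration: a single layer with a tanh output, a learning rate of 0.1, and a board encoded as +1 for white, -1 for black and 0 for empty. Only the names evaluate and adjustWeights come from the code above.

```java
/** Hypothetical one-layer evaluator: output = tanh(w . x), which lies in [-1, 1]. */
class TdUpdateSketch {
    final double[] weights;
    final double learningRate; // assumed step size, not taken from the question

    TdUpdateSketch(int inputs, double learningRate) {
        this.weights = new double[inputs]; // all weights start at zero
        this.learningRate = learningRate;
    }

    /** Evaluates an encoded board (assumed encoding: +1 white, -1 black, 0 empty). */
    double evaluate(double[] board) {
        double sum = 0.0;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * board[i];
        }
        return Math.tanh(sum);
    }

    /** Adds learningRate * tdError * d(output)/d(weight) to every weight. */
    void adjustWeights(double tdError, double[] board) {
        double out = evaluate(board);
        double dTanh = 1.0 - out * out; // derivative of tanh at the current pre-activation
        for (int i = 0; i < weights.length; i++) {
            weights[i] += learningRate * tdError * dTanh * board[i];
        }
    }

    public static void main(String[] args) {
        TdUpdateSketch net = new TdUpdateSketch(9, 0.1);
        double[] position = {1, -1, 0, 0, 1, 0, -1, 0, 0};
        double current = net.evaluate(position);
        double next = -1.0; // pretend the next position is a loss for white
        net.adjustWeights(next - current, position);
        // The evaluation of this position should have moved toward -1.0.
        System.out.println(current + " -> " + net.evaluate(position));
    }
}
```

Note that the step changes weights that are shared by all positions, so an update made for one position also shifts the evaluation of every other position that uses the same inputs.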

The problem is that the computer never learns to play the game well. To debug my algorithm I considered the simple 3,3,3 case (also known as the tic-tac-toe game) and here's what happened:

Iteration nr. 22019
Position: // This is losing for white
CurrentEvaluation: -0.9188801413584552
AdjustedEvaluation -0.9284801453564562 // So the algorithm improves the evaluation

Iteration nr. 40346
Position: // The same position is revisited
Current: 0.884793365231797
Adjusted 0.875083650365247 // The current iteration improves the evaluation,
                           // but for some reason many iterations
                           // in between made the evaluation worse.

I suspect the problem is that white won many times between iterations 22019 and 40346. As a result, the weights were adjusted to evaluate every winning position close to 1.0, and the neural network started to give a score close to 1.0 to all positions.
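One way I could test this suspicion is to evaluate a fixed set of probe positions every few thousand iterations and count how many of them score near 1.0. The sketch below is hypothetical: the class SaturationCheck, the method fractionAbove, the 0.8 threshold and the board encoding (+1 white, -1 black, 0 empty) are all assumptions, and the lambda stands in for neuralNetwork.evaluate.

```java
import java.util.function.ToDoubleFunction;

class SaturationCheck {
    /** Fraction of probe boards whose evaluation exceeds the threshold. */
    static double fractionAbove(ToDoubleFunction<double[]> evaluate,
                                double[][] probes, double threshold) {
        int count = 0;
        for (double[] probe : probes) {
            if (evaluate.applyAsDouble(probe) > threshold) {
                count++;
            }
        }
        return (double) count / probes.length;
    }

    public static void main(String[] args) {
        double[][] probes = {
            {1, 1, 1, -1, -1, 0, 0, 0, 0},  // white has three in a row
            {-1, -1, -1, 1, 1, 0, 1, 0, 0}, // black has three in a row
            {0, 0, 0, 0, 0, 0, 0, 0, 0},    // empty board
        };
        // Stand-in for a network that scores every position at about 1.0.
        ToDoubleFunction<double[]> collapsed = board -> 0.95;
        System.out.println(fractionAbove(collapsed, probes, 0.8));
    }
}
```

If the fraction climbs toward 1.0 even though the probe set contains positions that are lost for white, that would support the idea that the network is saturating instead of telling positions apart.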

How can I solve this problem?
