Halfway through this series of lectures there is a theoretical explanation of Markov decision processes and reinforcement learning. To see how a reinforcement learning algorithm might work in practice, I made an AI to play a card game following some simple steps. The game in question is a form of good old Higher or Lower with betting. The bot is dealt one of 52 cards by the dealer and must then decide whether to `play` or `pass`. If it chooses `pass`, the round ends there and the bot incurs a small loss. If it chooses `play`, the dealer is dealt a card and the two face off. The bot remembers the rewards and punishments it receives and uses them to drive its decisions in subsequent trials. In other words, choosing `play` on cards that make it win gives it positive reinforcement, while choosing `play` on cards that make it lose gives it negative reinforcement.
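The round logic described above can be sketched in a few lines of Python. The payout values and the `play_round` function here are my own illustration, not the actual implementation:

```python
import random

# Hypothetical payouts, chosen for illustration only
WIN, LOSS, PASS_COST = 10, -10, -2

def play_round(bot_card, action):
    """Resolve one round: 'pass' costs a little, 'play' faces the dealer."""
    if action == "pass":
        return PASS_COST
    # Dealer draws from the same 52 ranks; a tie counts as a loss here
    dealer_card = random.randint(1, 52)
    return WIN if bot_card > dealer_card else LOSS
```

The bot only ever sees the return value, never `dealer_card`, which mirrors the limited view of the environment described below.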

(If you don't want to read, you can go straight to my homepage to play around with the algorithm. The code can be found on my GitHub.)

## Keep Calm And Act Randomly

Before the game starts, our bot faces a bit of a hurdle. It doesn't actually know much of the stuff explained above. It wants to maximize its profits, but it doesn't know the rules of the game. In fact, it doesn't even know that it's playing--what humans would call--a *game* at all. From the moment the first card is dealt to the bot, all it sees is that it is in a state (*s*) and that it can take an action (*a*) in the hopes of maximizing its return at a subsequent state (*s'*). It has no awareness of the dealer. The bot doesn't even register the dealer's card after it chooses `play`. All the bot knows is the rank of the card it was dealt and the reward following the action it took. So the bot's only concern is to figure out which card ranks are, on average, better to `play` with and which are better to `pass` with. It accomplishes this by starting with random actions for a predetermined number of trials, let's say 100. Acting completely randomly at first ensures that the bot maximally explores the game tree. After these first 100 trials, it starts to slowly apply what it has learned by acting less randomly, but it still maintains some probability of taking a random action--an exploration rate (*epsilon*). So even though the bot learned some things during its first 100 trials, it will still *sometimes* act randomly afterwards to ensure that it keeps learning and doesn't get stuck in an inaccurate model. But the more trials have passed, the more probable it is that the model is accurate, so the bot also slowly decreases the exploration rate (*epsilon*) over time by a learning rate (*alpha*).
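As a sketch, the epsilon-greedy selection with a decaying exploration rate might look like this. The function name, the multiplicative decay, and the parameter defaults are my assumptions, not the post's actual code:

```python
import random

def choose_action(avg_returns, trial, min_explore=100, epsilon=0.5, alpha=0.99):
    """Epsilon-greedy choice with a decaying exploration rate.

    avg_returns: dict mapping action -> average observed return for this card rank.
    """
    # Decay epsilon only once the forced-exploration phase is over
    if trial >= min_explore:
        epsilon *= alpha ** (trial - min_explore)
    # Explore: forced during the first min_explore trials, then with probability epsilon
    if trial < min_explore or random.random() < epsilon:
        return random.choice(list(avg_returns))
    # Exploit: pick the action with the best average return; ties are broken randomly
    best = max(avg_returns.values())
    return random.choice([a for a in avg_returns if avg_returns[a] == best])
```

With a multiplicative decay like this, epsilon shrinks toward zero but never quite reaches it, so a sliver of exploration always remains.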

What we can expect from all this is that the bot will, on average, perform poorly in the first 100 trials--when it's completely unfamiliar with the environment it's in--but then gradually get better results as it goes through more trials. Eventually, it should be able to attain reasonably consistent results using what it has learned.

## A Trial Broken Down

A game is initialized with a maximum number of trials, a minimum number of exploration trials, an exploration rate, a learning rate, and payouts for winning, losing and passing. Here is what one trial roughly looks like:

- start of trial
- the bot gets dealt a card
- if the current trial is less than the minimum number of exploration trials:
    - the bot chooses an action based on a randomly generated number
- if the current trial is greater than the minimum number of exploration trials:
    - the bot generates a random number
    - if this number falls within the exploration rate:
        - the bot chooses an action based on another randomly generated number
    - if it falls outside the exploration rate:
        - the bot uses an optimal action-value function to determine its best move. Put more simply, it looks at how each action performed on average in the past and chooses the action that did best
        - if both actions have the same average return, the bot chooses an action based on a randomly generated number
- if the chosen action is `pass`:
    - the bot gets a (small) punishment
- if the chosen action is `play`:
    - the dealer deals himself a card, evaluates the winner and gives the bot a reward for winning or a punishment for losing
- the bot remembers the reward or punishment it received for this card rank
- end of trial
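The steps above can be sketched as a single training loop. All names and default values here are my own, not the actual implementation:

```python
import random

def run_trials(n_trials=1000, min_explore=100, epsilon=0.5, alpha=0.99,
               win=10, loss=-10, pass_cost=-2, n_ranks=52):
    # totals[rank][action] / counts[rank][action] gives the average return
    totals = {r: {"play": 0.0, "pass": 0.0} for r in range(1, n_ranks + 1)}
    counts = {r: {"play": 0, "pass": 0} for r in range(1, n_ranks + 1)}

    for trial in range(n_trials):
        card = random.randint(1, n_ranks)
        # Explore: always during the first min_explore trials, then with probability epsilon
        if trial < min_explore or random.random() < epsilon:
            action = random.choice(["play", "pass"])
        else:
            # Exploit: average past return per action, ties broken randomly
            avg = {a: totals[card][a] / counts[card][a] if counts[card][a] else 0.0
                   for a in ("play", "pass")}
            best = max(avg.values())
            action = random.choice([a for a in avg if avg[a] == best])
        # Resolve the round
        if action == "pass":
            reward = pass_cost
        else:
            dealer = random.randint(1, n_ranks)
            reward = win if card > dealer else loss
        # The bot remembers the outcome for this card rank
        totals[card][action] += reward
        counts[card][action] += 1
        # Decay the exploration rate after the forced-exploration phase
        if trial >= min_explore:
            epsilon *= alpha
    return totals, counts
```

Running this and plotting the per-trial rewards should show the rough pattern described earlier: noisy losses during exploration, then steadier returns.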

## Some Limitations

A core feature of this algorithm is that it treats each card rank discretely, which is not great if efficiency is desired. For a game played with 52 card ranks this will not be a problem, but it will be if we increase the number of ranks to 10000. Imagine the bot has been dealt card rank 5 a couple of times and the average return for that card is -10 for the `play` action and -2 for the `pass` action. At the same time, it has not yet been dealt card rank 6 at all, so the average return for that card is 0 for both `play` and `pass`. The algorithm will now take a random action when it finally does get dealt card rank 6. This is unfortunate, because we as humans of course understand that card rank 6 will probably perform similarly to card rank 5, which makes `pass` our preferred action. So a possible improvement to this algorithm is to give it the option of peeking into the average returns of a card's closest neighbors--in the event that the algorithm gets dealt an unknown card or is still unbiased after a certain number of trials. In this example, the bot could peek into the returns for cards 5 and 7 for inspiration. But since there might still be some slight differences between cards 5, 6 and 7, we could also attach a probability to this peeking option to force the bot to *sometimes* still explore the unknown branch.
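A minimal sketch of that peeking idea, assuming the per-rank averages and visit counts are kept in dicts (all names here are hypothetical, and for brevity I peek only at the single nearest seen neighbor rather than both sides):

```python
import random

def choose_with_peeking(card, avg_returns, counts, peek_prob=0.8):
    """For an unseen rank, peek at the nearest seen neighbor's averages
    with probability peek_prob; otherwise fall back to a random action."""
    if any(counts[card].values()):
        # Rank already seen: exploit its own averages, ties broken randomly
        best = max(avg_returns[card].values())
        return random.choice([a for a in avg_returns[card]
                              if avg_returns[card][a] == best])
    if random.random() < peek_prob:
        # Borrow the best action from the closest rank we have data for
        seen = [r for r in counts if any(counts[r].values())]
        if seen:
            neighbor = min(seen, key=lambda r: abs(r - card))
            best = max(avg_returns[neighbor].values())
            return random.choice([a for a in avg_returns[neighbor]
                                  if avg_returns[neighbor][a] == best])
    # No data to borrow, or we chose to keep exploring: act randomly
    return random.choice(["play", "pass"])
```

Keeping `peek_prob` below 1 preserves the occasional random action on unknown ranks, so the bot can still discover genuine differences between neighboring cards.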

Another consequence of this discrete state space is that the bot will not only sometimes act randomly when it doesn't need to, but it might also simply choose the wrong action altogether. Let's say the average returns for card ranks 20 and 21 are +10 and +14, but due to the dealer getting lucky and winning when the bot chose `play` on card rank 22, the bot thinks that the average return for this card is -13. If the bot is at a point where it no longer acts randomly, then it will stay stuck thinking that 22 is a card it should `pass` on, even though in reality it has a higher expected return than card rank 21.

Luckily, the chosen game for this demonstration is so simple that none of these limitations form a large enough hurdle to warrant addressing. But these (among other) issues can certainly be accounted for in future algorithms.

## See It In Action

Now go see it in action! You can use this algorithm to train your own bot with various parameters, then test it to see how it performs in real time.

If you are familiar with this stuff and see that I'm misapplying something here, do let me know. I'm always trying to learn :D
