
Debojyoti Chakraborty
Understanding Reinforcement Learning: Notes on Lecture 1 of Stanford's CS234 Course

  • Reinforcement learning studies intelligent agents.
  • An intelligent agent learns to make good sequential decisions.
  • An agent needs intelligence to make good decisions.

Examples:

  • Atari: learning to play games directly from raw pixels.
  • Video game playing in general.
  • Robotics: grasping and handling clothes.
  • Educational games that amplify human intelligence.
  • NLP and vision problems framed as a similar kind of optimization process.

Key aspects:

  • Optimization: find good decisions, or at least a good strategy.
  • Delayed consequences: there is no immediate signal telling you whether a decision is good right now; a decision may only pay off much later.
  • Exploration: the agent learns about the world by trying things. The data is censored — it only observes the reward for the decision it actually made.
  • Policy: a mapping from past experiences to actions. Pre-programming it is not practical because of the huge search space.
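The exploration/censored-data point can be made concrete with a toy multi-armed bandit: the agent only observes the reward of the arm it actually pulls, never the rewards it would have gotten elsewhere. Everything below (arm count, reward means, epsilon) is an invented illustration, not from the lecture.

```python
import random

# Hypothetical 3-armed bandit; the true mean rewards are unknown to the agent.
TRUE_MEANS = [0.2, 0.5, 0.8]

def pull(arm):
    """Reward is only observed for the arm actually pulled (censored data)."""
    return 1.0 if random.random() < TRUE_MEANS[arm] else 0.0

def epsilon_greedy(steps=5000, epsilon=0.1, seed=0):
    random.seed(seed)
    counts = [0, 0, 0]
    estimates = [0.0, 0.0, 0.0]
    for _ in range(steps):
        if random.random() < epsilon:              # explore: random arm
            arm = random.randrange(3)
        else:                                      # exploit: best estimate so far
            arm = estimates.index(max(estimates))
        r = pull(arm)
        counts[arm] += 1
        estimates[arm] += (r - estimates[arm]) / counts[arm]  # running mean
    return estimates

print(epsilon_greedy())
```

Without the epsilon-exploration branch, the agent could lock onto a mediocre arm forever, since it never sees data about arms it does not pull.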

A good question: why not pre-program a policy?

  • The search space is enormous, so the code base would be enormous too.
  • In Atari, the agent must decide what to do next from the space of all possible images, which requires some form of generalization.

How RL compares to neighbouring fields (o = optimization, g = generalization, e = exploration, d = delayed consequences):

  • AI planning: o, g, d — no exploration needed, because the world model is given (e.g. in Go, the rules are known).
  • Supervised learning: o, g — the experience is already given as a labeled dataset.
  • Unsupervised learning: o, g — data without labels.
  • RL: o, g, e, d — all four.
  • Imitation learning: o, g, d — learning from someone else's experience.

    It assumes the input comes from demonstrations of a good policy.

Imitation learning reduces RL to supervised learning. RL itself, by contrast, must explore the world and use its own experience to guide decisions.
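A minimal sketch of that reduction (behavioral cloning): the demonstrations become a supervised (state, action) dataset, and the "learner" here is simply a lookup table. The demo data and action names are made up for illustration.

```python
# Invented demonstrations from a (presumed) good policy: (state, action) pairs.
demos = [(0, "right"), (1, "right"), (2, "jump"), (3, "right")]

def fit_policy(demos):
    # Simplest possible supervised "learner": memorize the demonstrated
    # action for each state. A real learner would generalize across states.
    return {state: action for state, action in demos}

policy = fit_policy(demos)
print(policy[2])  # the cloned policy imitates the demonstrator
```

The catch the notes mention: this only works if the demonstrations really do come from a good policy — the agent never explores beyond them.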

Goal by the end of the class: sequential decision making under uncertainty.

  • It is an interactive, closed-loop process: the agent takes an action, receives an observation and a reward, and aims to maximize the expected future reward.

Since the process is stochastic, the agent needs strategic behaviour to earn a high reward, balancing immediate and long-term reward. It may have to make long sequences of decisions during which it receives no reward at all. Conversely, if an easy option maximizes the reward, the agent can simply take it.
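The closed loop above can be sketched as a simple interaction loop. The environment and policy below are toy placeholders of my own, not the lecture's: moving "right" eventually earns reward, so a good policy must act for several steps before any reward appears.

```python
def run_episode(env_step, policy, horizon=10):
    """Closed-loop interaction: observe, act, receive reward, repeat."""
    total_reward = 0.0
    obs = 0                        # initial observation
    for t in range(horizon):
        action = policy(obs)       # agent decides from what it has seen
        obs, reward = env_step(obs, action)
        total_reward += reward     # the agent's objective: maximize this sum
    return total_reward

# Toy environment: moving right (+1) earns reward only once obs passes 5,
# so the agent collects nothing for the first several steps.
def env_step(obs, action):
    new_obs = obs + action
    return new_obs, 1.0 if new_obs > 5 else 0.0

print(run_episode(env_step, policy=lambda obs: 1))  # always move right
```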

The reward function is therefore a critical design choice; designing it well is a sub-discipline of machine teaching.


  • Example reward pattern: +, (0), +, −, −. A reward signal needs at least two distinct levels to carry information.

Important ingredients for sequential decision making: the history, the state space, the world state, and discrete time steps.

The agent's state is usually a small subset of the real world state.

Markov assumption: the state is the current observation, s(t) — the future is independent of the past given the present.
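One way to see the Markov assumption in code: sampling the next state uses only the current state, never the history. The weather-style transition table below is an invented example, not from the lecture.

```python
import random

# Invented two-state Markov chain: transition probabilities depend only on
# the current state, so no history argument is ever needed.
P = {
    "sunny": [("sunny", 0.8), ("rainy", 0.2)],
    "rainy": [("sunny", 0.4), ("rainy", 0.6)],
}

def next_state(state):
    # Inverse-CDF sampling over the current state's row of P.
    r, acc = random.random(), 0.0
    for s, p in P[state]:
        acc += p
        if r < acc:
            return s
    return s  # guard against floating-point rounding

random.seed(0)
print([next_state("sunny") for _ in range(5)])
```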



The whole history is always a Markov state (though usually an impractically large one).


Bandits: actions have no influence on the next observation.

MDPs and POMDPs: actions influence future observations.

Types of sequential decision processes:



RL algorithm components:


Policy: a mapping from states to actions. It can be stochastic or deterministic.

Value function: the expected discounted sum of future rewards, weighted by a discount factor gamma.

Model: the agent's representation of the world's dynamics and rewards — e.g. the Mars Rover example is a stochastic Markov model with a reward attached to each state.
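The discounted sum a value function estimates can be computed directly for a concrete reward sequence; the rewards and gamma below are arbitrary example values.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**t * r_t, computed by folding from the end:
    G_t = r_t + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# 1 + 0.5*0 + 0.25*1 = 1.25
print(discounted_return([1.0, 0.0, 1.0], gamma=0.5))
```

Gamma < 1 makes rewards far in the future count for less, which is how the agent trades off immediate against long-term reward.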

RL agents:

Model-based: the agent maintains a model of the world.

Model-free: the agent maintains an explicit policy and/or value function, but no model.

Key challenges:


The finite horizon setting refers to the time span of system operation during which you care about the defined performance measures. If you want to control the system so the performance measures are met for a finite time, say T, the problem is a finite horizon problem; if you care about optimality over the whole time span, i.e. until t = ∞, it is an infinite horizon problem.

The problem of deriving a control u(t), t ∈ [0, T], for the system such that the performance index is minimised is a finite horizon problem.

The problem of deriving a control u(t), t ∈ [0, ∞), for the system such that the performance index is minimised is an infinite horizon problem.
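The system and performance index referred to above can be written generically in standard optimal-control notation (a sketch in my own notation, since the original equations were images that did not carry over):

```latex
% Generic system dynamics and performance index
\dot{x}(t) = f\big(x(t), u(t)\big), \qquad
J = \int_{0}^{T} L\big(x(t), u(t)\big)\, dt .
% Finite horizon: minimize J over u(t), t \in [0, T], with T finite.
% Infinite horizon: the same problem with T \to \infty.
```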

Evaluation and control: evaluation estimates how good a given policy is; control searches for the best policy.

