Debojyoti Chakraborty
Understanding Reinforcement Learning: Notes on the First Lecture of Stanford's CS234 Course

1. Intelligent agent
2. Learn to make good sequential decisions
3. Optimality
4. Utility
5. An agent needs to be intelligent to make good decisions

Atari: the agent learns to play the game directly from raw pixels.


Example applications:

- Video game playing
- Robotics: grasping clothes
- Educational games to amplify human intelligence

NLP and vision problems can also be framed as this kind of optimization process.


Key aspects:

- Optimization: find a good decision, or at least a good strategy.
- Delayed consequences: we may not know whether a decision is good immediately; it may only pay off later.
- Exploration: the agent has to explore and try things in order to learn. The data is censored: it only sees a reward for the decision it actually made (see the bandit sketch after this list).
- Policy: a mapping from past experience to actions. Pre-programming it is not practical because the search space is very large.
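A minimal sketch of why the data is censored, assuming a hypothetical 3-armed bandit with made-up reward probabilities: an epsilon-greedy agent only ever observes the reward of the arm it pulled, never the counterfactual rewards of the other arms.

```python
import random

# Hypothetical 3-armed bandit: true mean rewards are made up for illustration.
TRUE_MEANS = [0.2, 0.5, 0.8]

def pull(arm):
    """Return a noisy reward for the chosen arm only (censored feedback)."""
    return 1.0 if random.random() < TRUE_MEANS[arm] else 0.0

counts = [0, 0, 0]        # how often each arm was pulled
values = [0.0, 0.0, 0.0]  # running average reward per arm
epsilon = 0.1

for t in range(1000):
    # Explore with probability epsilon, otherwise exploit the best estimate.
    if random.random() < epsilon:
        arm = random.randrange(3)
    else:
        arm = max(range(3), key=lambda a: values[a])

    reward = pull(arm)  # we only learn about this arm, not the others
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]

print("estimated arm values:", values)
```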

Good question: why not pre-program a policy?

- The search space is huge.
- It would need an enormous code base.
- In Atari, the agent has to decide what to do next from the space of raw images, so it needs some form of generalisation.

How RL relates to other fields (o = optimization, g = generalisation, e = exploration, d = delayed consequences):

- AI planning: o, g, d. Why doesn't planning a game like Go need exploration? Because the model of how the world works (the rules of the game) is already given.
- Supervised learning: o, g. The experience is already given in the form of a dataset.
- Unsupervised learning: o, g. There is data but no labels.
- RL: o, g, e, d.
- Imitation learning: o, g, d. Learning from someone else's experience; it assumes the input comes from demonstrations of a good policy, and it reduces RL to supervised learning (see the behavioural cloning sketch below).
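A minimal behavioural cloning sketch, assuming we already have hypothetical expert demonstrations as (state, action) pairs for a small discrete problem: treating them as a labelled dataset turns the RL problem into ordinary supervised learning.

```python
from collections import Counter, defaultdict

# Hypothetical expert demonstrations: (state, action) pairs from a good policy.
demos = [(0, "right"), (0, "right"), (1, "right"), (1, "up"), (2, "up"), (2, "up")]

# "Training": for each state, remember which action the expert took most often.
action_counts = defaultdict(Counter)
for state, action in demos:
    action_counts[state][action] += 1

def cloned_policy(state):
    """Imitate the expert: pick the most frequent demonstrated action for this state."""
    return action_counts[state].most_common(1)[0][0]

print(cloned_policy(0))  # "right"
print(cloned_policy(2))  # "up"
```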

RL: explore the world and use that experience to guide future decisions.


Goal by the end of the class:


Sequential decision making under uncertainty

- Interactive closed-loop process: the agent takes an action, receives an observation and a reward, and tries to maximize its expected future reward. A minimal loop is sketched below.
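A minimal sketch of the closed loop, assuming a hypothetical environment with a gym-style reset/step interface and a random agent (both are placeholders, not the lecture's actual example):

```python
import random

class ToyEnv:
    """Hypothetical environment: reward 1 for action 1, 0 otherwise, 10 steps long."""
    def reset(self):
        self.t = 0
        return 0  # initial observation

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.t >= 10
        return self.t, reward, done  # observation, reward, episode-finished flag

env = ToyEnv()
obs, total_reward, done = env.reset(), 0.0, False
while not done:
    action = random.choice([0, 1])        # agent picks an action
    obs, reward, done = env.step(action)  # world returns observation + reward
    total_reward += reward                # the agent's objective is to maximize this in expectation

print("return for this episode:", total_reward)
```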

The process can be stochastic, so the agent needs strategic behaviour to get a high expected reward.

This involves balancing immediate and long-term reward.

The agent may have to make decisions for which it receives no reward for a long time.

If there is an easy option that maximises the reward, the agent is free to take it.

The reward function is an important design choice.

Designing rewards is related to the sub-discipline of machine teaching.

[Sketch from the lecture: positive, zero, and negative rewards placed at points along a line; note: need constant two points.]

Important quantities for sequential decision making: history, state space, world state, discrete time.

The agent's state is usually only a small subset of the real world state.

Markov Assumption:

State: the current observation, s(t).

History: everything observed up to time t,

h(t) = ( s(0), s(1), …, s(t) ),  t = 0, 1, 2, … (possibly t → ∞)

Markov: using the whole history as the state is always Markov (but that state grows without bound).
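In symbols (a standard statement of the Markov assumption, not copied from the slides): the future is independent of the past given the present state,

p( s(t+1) | s(t), a(t) ) = p( s(t+1) | h(t), a(t) )

i.e. conditioning on the full history h(t) gives no more predictive power than conditioning on the current state s(t).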

POMDP (partially observable Markov decision process): the agent's observation is not the full world state.


Bandits: actions have no influence on next observations


MDPs and POMDPs: actions influence future observations.

Types of sequential decision processes:

Deterministic: given the current state and an action, there is a single possible next observation and reward.

Stochastic: given the current state and an action, there are many possible next observations and rewards. A toy contrast is sketched below.
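A toy sketch of the difference, using a hypothetical one-dimensional world (the dynamics here are invented purely for illustration):

```python
import random

def deterministic_step(state, action):
    """Same (state, action) always produces the same next state and reward."""
    next_state = state + action
    reward = 1.0 if next_state == 3 else 0.0
    return next_state, reward

def stochastic_step(state, action):
    """Same (state, action) can produce different outcomes: the move sometimes fails."""
    moved = random.random() < 0.8           # action succeeds only 80% of the time
    next_state = state + action if moved else state
    reward = 1.0 if next_state == 3 else 0.0
    return next_state, reward

print(deterministic_step(2, 1))  # always (3, 1.0)
print(stochastic_step(2, 1))     # usually (3, 1.0), sometimes (2, 0.0)
```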

Components of an RL algorithm:

Model: the agent's representation of how the world changes in response to its actions (transitions and rewards).

Policy: a mapping from states to actions. Policies can be deterministic or stochastic.

Value function (with discount factor γ): the expected discounted sum of future rewards, written out below.

Reward example: the Mars Rover stochastic Markov model used as a running example in the lecture.
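Written out (standard definitions, not copied from the slides): with discount factor γ ∈ [0, 1], the return from time t and the value of a policy π are

G(t) = r(t) + γ·r(t+1) + γ²·r(t+2) + … = Σ_{k=0}^{∞} γ^k r(t+k)

V^π(s) = E_π[ G(t) | s(t) = s ]

so γ close to 0 weights immediate reward heavily, while γ close to 1 cares almost as much about long-term reward.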

Types of RL agents:

Model-based: the agent maintains a model of the world.

Model-free: no explicit model; the agent directly maintains a policy and a value function. A rough sketch of what each kind of agent stores is given below.
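A rough sketch of what the two kinds of agent keep in memory for a small discrete problem (the data structures and example numbers here are my own illustration, not from the lecture):

```python
# Model-based agent: stores an explicit model of the world.
model_based = {
    "transition": {},  # maps (state, action) -> distribution over next states
    "reward": {},      # maps (state, action) -> expected reward
}

# Model-free agent: stores no model, only value estimates and/or a policy.
model_free = {
    "q_values": {},    # maps (state, action) -> estimated expected return
    "policy": {},      # maps state -> action (or a distribution over actions)
}

# Example entries for a tiny 2-state, 2-action problem (made-up numbers).
model_based["transition"][(0, "right")] = {1: 0.9, 0: 0.1}
model_based["reward"][(0, "right")] = 0.0
model_free["q_values"][(0, "right")] = 0.7
model_free["policy"][0] = "right"
```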

Key challenges:

Planning: the model of how the world works is given, and the agent computes a good policy without needing to interact. Learning: the world is initially unknown, so the agent has to interact, gather experience, and improve its policy from that experience.

The horizon refers to the time span of system operation over which you care about the defined performance measures. If you want to control the system so that it meets the performance measures for a finite time, say T, the problem is a finite horizon problem; if you care about optimality over the whole time span, i.e. up to t = ∞, it is an infinite horizon problem.

The problem of deriving a control u(t), t ∈ [0, T], for the system

ẋ(t) = A x(t) + B u(t)

such that the performance index

PM = ∫₀^T [ x′(t) Q x(t) + u′(t) R u(t) ] dt

is minimised is a finite horizon problem.

The problem of deriving a control u(t), t ∈ [0, ∞), for the system

ẋ(t) = A x(t) + B u(t)

such that the performance index

PM = ∫₀^∞ [ x′(t) Q x(t) + u′(t) R u(t) ] dt

is minimised is an infinite horizon problem.
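The discrete-time RL analogue (standard textbook form, not taken from the quoted answer): a finite horizon problem maximises the return over T steps, while an infinite horizon problem maximises a discounted sum so that the total stays finite,

finite horizon:   G = Σ_{t=0}^{T} r(t)

infinite horizon: G = Σ_{t=0}^{∞} γ^t r(t),  0 ≤ γ < 1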

Evaluation and control: evaluation means estimating the expected rewards of a given policy; control means finding the best policy. A tiny policy evaluation sketch is given below.
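A minimal iterative policy evaluation sketch for a made-up 2-state MDP under a fixed policy (the transition probabilities, rewards, and discount here are invented for illustration):

```python
# Made-up 2-state MDP under a fixed policy:
# from each state, the chosen action leads to the listed next states with these probabilities.
P = {0: {0: 0.5, 1: 0.5},   # from state 0: 50/50 stay or move to state 1
     1: {0: 0.1, 1: 0.9}}   # from state 1: mostly stay in state 1
R = {0: 0.0, 1: 1.0}        # reward received in each state under this policy
gamma = 0.9

V = {0: 0.0, 1: 0.0}        # value estimates, initialised to zero
for _ in range(100):
    # Bellman expectation backup: V(s) = R(s) + gamma * sum_s' P(s'|s) V(s')
    V = {s: R[s] + gamma * sum(p * V[s2] for s2, p in P[s].items()) for s in P}

print(V)  # expected discounted return of following the fixed policy from each state
```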

Top comments (1)

Debojyoti Chakraborty • Edited

Links for further understanding:

Infinite horizon problems in optimal control: math.stackexchange.com/questions/2...

A simple example of it: math.stackexchange.com/questions/2...