Online Learning Problem

Online Learning Problem (OLP)

What is Online Learning?

Online learning is a paradigm in which a model learns sequentially while making decisions, without access to the full dataset beforehand. At each step, the learner must act first and learn from the feedback afterward.

Unlike offline (batch) learning, online learning operates in real time and continuously updates its strategy.


Standard Online Learning Framework

At each round (t):

  1. The learner selects an action or prediction (a_t)
  2. The environment reveals the outcome (y_t)
  3. The learner incurs a loss (\ell(a_t, y_t)) or receives a reward
  4. The learner updates its decision rule

Future data is never known in advance.
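
To make the loop concrete, here is a minimal sketch assuming a squared loss and an online gradient update; the outcome stream, learning rate, and loss function are invented for illustration:

```python
import numpy as np

# Minimal sketch of the four-step loop, assuming a squared loss and a
# simple online gradient update; the outcome stream and learning rate
# below are invented for illustration.
rng = np.random.default_rng(0)
outcomes = rng.normal(loc=0.5, scale=0.1, size=1000)  # y_t, revealed one per round

prediction = 0.0       # current decision rule: predict a constant value
learning_rate = 0.05
total_loss = 0.0

for y_t in outcomes:
    a_t = prediction                                 # 1. select a prediction
    loss = (a_t - y_t) ** 2                          # 2-3. outcome revealed, loss incurred
    total_loss += loss
    prediction -= learning_rate * 2 * (a_t - y_t)    # 4. update the decision rule

print(f"average loss: {total_loss / len(outcomes):.4f}")
```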


Performance Measure: Regret

Online learning performance is measured using regret, defined as:

[
\text{Regret}(T) = \sum_{t=1}^{T} \ell(a_t, y_t) - \min_{a} \sum_{t=1}^{T} \ell(a, y_t)
]

Interpretation:

How much worse did the learner perform compared to the best fixed decision chosen in hindsight?

Low regret implies effective learning.
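
As a toy illustration with made-up losses, regret can be computed by comparing the learner's accumulated loss to that of the best single fixed action in hindsight:

```python
import numpy as np

# Toy regret calculation with made-up numbers: compare the learner's
# accumulated loss to the best single fixed action in hindsight.
losses = np.array([          # losses[t, a] = loss of action a at round t
    [0.9, 0.1, 0.5],
    [0.8, 0.2, 0.4],
    [0.7, 0.3, 0.6],
])
learner_actions = [0, 1, 1]  # actions the learner actually played

learner_loss = sum(losses[t, a] for t, a in enumerate(learner_actions))
best_fixed_loss = losses.sum(axis=0).min()   # best fixed action in hindsight
regret = learner_loss - best_fixed_loss
print(f"regret: {regret:.1f}")  # 1.4 - 0.6 = 0.8
```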


Types of Online Learning Problems

1. Online Prediction with Expert Advice

  • The learner combines predictions from multiple experts
  • Goal: perform almost as well as the best expert in hindsight (see the Hedge sketch after this list)

2. Online Convex Optimization

  • Loss functions are convex
  • Common in large-scale machine learning

3. Bandit Feedback

  • Only the loss or reward of the chosen action is observed
  • Limited feedback setting

4. Contextual Online Learning

  • Decisions depend on observed context or features
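
For the expert-advice setting above, a minimal sketch of the multiplicative weights (Hedge) rule could look as follows; the loss matrix, horizon, and step size are assumptions made for illustration:

```python
import numpy as np

# Sketch of the multiplicative weights (Hedge) rule for prediction with
# expert advice; the loss matrix, horizon, and step size are illustrative.
rng = np.random.default_rng(1)
T, K = 200, 4                                    # rounds, number of experts
expert_losses = rng.uniform(0, 1, size=(T, K))   # loss of each expert per round

eta = np.sqrt(np.log(K) / T)   # step size on the order of sqrt(ln K / T)
weights = np.ones(K)
learner_loss = 0.0

for t in range(T):
    probs = weights / weights.sum()              # distribution over experts
    learner_loss += probs @ expert_losses[t]     # expected loss of the mixture
    weights *= np.exp(-eta * expert_losses[t])   # downweight experts that did badly

best_expert_loss = expert_losses.sum(axis=0).min()
print(f"regret vs best expert: {learner_loss - best_expert_loss:.2f}")
```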

Bandit Models

What is a Bandit Model?

A (multi-armed) bandit model is a special case of online learning where the learner repeatedly chooses among several actions (arms), each with an unknown reward distribution.

The learner observes feedback only for the selected arm.


Exploration vs Exploitation Trade-off

  • Exploration: trying uncertain actions to gain information
  • Exploitation: choosing the action that currently appears best

Balancing these two is the core challenge of bandit problems.


Formal Bandit Setting

  • A finite set of arms: (A = \{1, 2, \dots, K\})
  • Each arm has an unknown reward distribution
  • At each round:

    • Select one arm
    • Observe its reward
    • Update beliefs

Performance is measured using regret.
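
A minimal sketch of this setting, assuming Bernoulli arms and a uniformly random policy, shows the select / observe / update loop and how regret accumulates:

```python
import numpy as np

# Sketch of the stochastic bandit loop with Bernoulli arms and a
# uniformly random policy; the arm means and horizon are invented.
rng = np.random.default_rng(2)
true_means = np.array([0.2, 0.5, 0.7])   # hidden from the learner
K, T = len(true_means), 1000

counts = np.zeros(K)       # times each arm has been pulled
estimates = np.zeros(K)    # running mean reward per arm
total_reward = 0.0

for t in range(T):
    arm = rng.integers(K)                       # select one arm (here: at random)
    reward = rng.binomial(1, true_means[arm])   # observe its reward only
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]   # update beliefs
    total_reward += reward

regret = T * true_means.max() - total_reward    # gap to always playing the best arm
print(f"regret of random play: {regret:.1f}")
```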


Types of Bandit Models

Stochastic Bandits

  • Rewards drawn from fixed but unknown distributions

Contextual Bandits

  • Arm choice depends on observed context
  • Used in recommender systems and ads

Adversarial Bandits

  • Rewards may be chosen by an adversary
  • No statistical assumptions

Common Bandit Algorithms

  • (\varepsilon)-greedy
  • Upper Confidence Bound (UCB), sketched after this list
  • Thompson Sampling
  • EXP3 (adversarial setting)
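
As one concrete example from the list above, here is a minimal UCB1 sketch on a Bernoulli bandit; the arm means, horizon, and exploration constant are assumptions made for illustration:

```python
import numpy as np

# Sketch of UCB1 on a Bernoulli bandit; arm means, horizon, and the
# exploration constant are illustrative assumptions.
rng = np.random.default_rng(3)
true_means = np.array([0.2, 0.5, 0.7])
K, T = len(true_means), 5000

counts = np.zeros(K)
estimates = np.zeros(K)
total_reward = 0.0

for t in range(1, T + 1):
    if t <= K:
        arm = t - 1                              # pull each arm once to initialize
    else:
        bonus = np.sqrt(2 * np.log(t) / counts)  # optimism for rarely tried arms
        arm = int(np.argmax(estimates + bonus))
    reward = rng.binomial(1, true_means[arm])
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]
    total_reward += reward

print(f"regret of UCB1: {T * true_means.max() - total_reward:.1f}")
```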

Relationship Between Online Learning and Bandits

| Aspect | Online Learning | Bandit Learning |
| --- | --- | --- |
| Feedback | Full loss information | Partial (chosen action only) |
| Information | More | Less |
| Scope | General framework | Special case |

All bandit problems are online learning problems, but not all online learning problems are bandits.


Applications

  • Adaptive routing and forwarding
  • Caching and resource allocation
  • Recommender systems
  • Online advertising
  • Congestion control
  • Real-time decision systems

Key Takeaway

Online learning focuses on learning while acting, and bandit models address this challenge under limited feedback conditions.
