Online Learning Problem

Online Learning Problem (OLP)

What is Online Learning?

Online learning is a paradigm in which a model learns sequentially while making decisions, without access to the full dataset beforehand. At each step, the learner must act first and learn from the feedback afterward.

Unlike offline (batch) learning, online learning operates in real time and continuously updates its strategy.


Standard Online Learning Framework

At each round (t):

  1. The learner selects an action or prediction (a_t)
  2. The environment reveals the outcome (y_t)
  3. The learner incurs a loss (\ell(a_t, y_t)) or receives a reward
  4. The learner updates its decision rule

Future data is never known in advance.
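
To make the loop concrete, here is a minimal sketch assuming a squared loss and an online gradient update; the outcome stream, learning rate, and loss function are invented for illustration:

```python
import numpy as np

# Minimal sketch of the four-step loop, assuming a squared loss and a
# simple online gradient update; the outcome stream and learning rate
# below are invented for illustration.
rng = np.random.default_rng(0)
outcomes = rng.normal(loc=0.5, scale=0.1, size=1000)  # y_t, revealed one per round

prediction = 0.0       # current decision rule: predict a constant value
learning_rate = 0.05
total_loss = 0.0

for y_t in outcomes:
    a_t = prediction                                 # 1. select a prediction
    loss = (a_t - y_t) ** 2                          # 2-3. outcome revealed, loss incurred
    total_loss += loss
    prediction -= learning_rate * 2 * (a_t - y_t)    # 4. update the decision rule

print(f"average loss: {total_loss / len(outcomes):.4f}")
```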


Performance Measure: Regret

Online learning performance is measured using regret, defined as:

[
\text{Regret}(T) = \sum_{t=1}^{T} \ell(a_t, y_t) - \min_{a} \sum_{t=1}^{T} \ell(a, y_t)
]

Interpretation:

How much worse did the learner perform compared to the best fixed decision chosen in hindsight?

Low regret implies effective learning.
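
As a toy illustration with made-up losses, regret can be computed by comparing the learner's accumulated loss to that of the best single fixed action in hindsight:

```python
import numpy as np

# Toy regret calculation with made-up numbers: compare the learner's
# accumulated loss to the best single fixed action in hindsight.
losses = np.array([          # losses[t, a] = loss of action a at round t
    [0.9, 0.1, 0.5],
    [0.8, 0.2, 0.4],
    [0.7, 0.3, 0.6],
])
learner_actions = [0, 1, 1]  # actions the learner actually played

learner_loss = sum(losses[t, a] for t, a in enumerate(learner_actions))
best_fixed_loss = losses.sum(axis=0).min()   # best fixed action in hindsight
regret = learner_loss - best_fixed_loss
print(f"regret: {regret:.1f}")  # 1.4 - 0.6 = 0.8
```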


Types of Online Learning Problems

1. Online Prediction with Expert Advice

  • The learner combines predictions from multiple experts
  • Goal: perform almost as well as the best expert in hindsight (see the Hedge sketch after this list)

2. Online Convex Optimization

  • Loss functions are convex
  • Common in large-scale machine learning

3. Bandit Feedback

  • Only the loss or reward of the chosen action is observed
  • Limited feedback setting

4. Contextual Online Learning

  • Decisions depend on observed context or features
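
For the expert-advice setting above, a minimal sketch of the multiplicative weights (Hedge) rule could look as follows; the loss matrix, horizon, and step size are assumptions made for illustration:

```python
import numpy as np

# Sketch of the multiplicative weights (Hedge) rule for prediction with
# expert advice; the loss matrix, horizon, and step size are illustrative.
rng = np.random.default_rng(1)
T, K = 200, 4                                    # rounds, number of experts
expert_losses = rng.uniform(0, 1, size=(T, K))   # loss of each expert per round

eta = np.sqrt(np.log(K) / T)   # step size on the order of sqrt(ln K / T)
weights = np.ones(K)
learner_loss = 0.0

for t in range(T):
    probs = weights / weights.sum()              # distribution over experts
    learner_loss += probs @ expert_losses[t]     # expected loss of the mixture
    weights *= np.exp(-eta * expert_losses[t])   # downweight experts that did badly

best_expert_loss = expert_losses.sum(axis=0).min()
print(f"regret vs best expert: {learner_loss - best_expert_loss:.2f}")
```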

Bandit Models

What is a Bandit Model?

A (multi-armed) bandit model is a special case of online learning where the learner repeatedly chooses among several actions (arms), each with an unknown reward distribution.

The learner observes feedback only for the selected arm.


Exploration vs Exploitation Trade-off

  • Exploration: trying uncertain actions to gain information
  • Exploitation: choosing the action that currently appears best

Balancing these two is the core challenge of bandit problems.


Formal Bandit Setting

  • A finite set of arms: (A = \{1, 2, \dots, K\})
  • Each arm has an unknown reward distribution
  • At each round:

    • Select one arm
    • Observe its reward
    • Update beliefs

Performance is measured using regret.
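
A minimal sketch of this setting, assuming Bernoulli arms and a uniformly random policy, shows the select / observe / update loop and how regret accumulates:

```python
import numpy as np

# Sketch of the stochastic bandit loop with Bernoulli arms and a
# uniformly random policy; the arm means and horizon are invented.
rng = np.random.default_rng(2)
true_means = np.array([0.2, 0.5, 0.7])   # hidden from the learner
K, T = len(true_means), 1000

counts = np.zeros(K)       # times each arm has been pulled
estimates = np.zeros(K)    # running mean reward per arm
total_reward = 0.0

for t in range(T):
    arm = rng.integers(K)                       # select one arm (here: at random)
    reward = rng.binomial(1, true_means[arm])   # observe its reward only
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]   # update beliefs
    total_reward += reward

regret = T * true_means.max() - total_reward    # gap to always playing the best arm
print(f"regret of random play: {regret:.1f}")
```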


Types of Bandit Models

Stochastic Bandits

  • Rewards drawn from fixed but unknown distributions

Contextual Bandits

  • Arm choice depends on observed context
  • Used in recommender systems and ads

Adversarial Bandits

  • Rewards may be chosen by an adversary
  • No statistical assumptions

Common Bandit Algorithms

  • (\varepsilon)-greedy
  • Upper Confidence Bound (UCB), sketched after this list
  • Thompson Sampling
  • EXP3 (adversarial setting)
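
As one concrete example from the list above, here is a minimal UCB1 sketch on a Bernoulli bandit; the arm means, horizon, and exploration constant are assumptions made for illustration:

```python
import numpy as np

# Sketch of UCB1 on a Bernoulli bandit; arm means, horizon, and the
# exploration constant are illustrative assumptions.
rng = np.random.default_rng(3)
true_means = np.array([0.2, 0.5, 0.7])
K, T = len(true_means), 5000

counts = np.zeros(K)
estimates = np.zeros(K)
total_reward = 0.0

for t in range(1, T + 1):
    if t <= K:
        arm = t - 1                              # pull each arm once to initialize
    else:
        bonus = np.sqrt(2 * np.log(t) / counts)  # optimism for rarely tried arms
        arm = int(np.argmax(estimates + bonus))
    reward = rng.binomial(1, true_means[arm])
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]
    total_reward += reward

print(f"regret of UCB1: {T * true_means.max() - total_reward:.1f}")
```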

Relationship Between Online Learning and Bandits

| Aspect | Online Learning | Bandit Learning |
| --- | --- | --- |
| Feedback | Full loss information | Partial (chosen action only) |
| Information | More | Less |
| Scope | General framework | Special case |

All bandit problems are online learning problems, but not all online learning problems are bandits.


Applications

  • Adaptive routing and forwarding
  • Caching and resource allocation
  • Recommender systems
  • Online advertising
  • Congestion control
  • Real-time decision systems

Key Takeaway

Online learning focuses on learning while acting, and bandit models address this challenge under limited feedback conditions.
