When you train a neural network or any ML model, performance depends heavily on hyperparameters — learning rate, batch size, number of layers, regularization strength. Finding good values is expensive because each evaluation means training a model end to end.
The standard approaches each have tradeoffs:
Grid search tries every combination on a predefined grid. It works for 2-3 parameters but scales exponentially. A 5-parameter search with 10 values each is 100,000 evaluations. Not practical.
Random search samples uniformly and usually finds a decent region faster than grid search. But it has no memory — it doesn't learn from previous evaluations to focus on promising areas.
Bayesian optimization (what Optuna and Hyperopt use under the hood) builds a surrogate model of the objective and samples where improvement is most likely. Very sample-efficient in low dimensions. But the surrogate model itself becomes expensive to fit in high-dimensional spaces, and it can get stuck when the objective surface has many local optima.
Swarm methods like PSO (Particle Swarm Optimization) maintain a population of candidate solutions that share information about good regions. They scale better to high dimensions than Bayesian methods. The failure mode is premature convergence: every particle gets pulled toward the same global best, the swarm loses diversity, and it can't escape a local optimum once it's trapped.
That last problem — premature convergence in swarm methods — is what I was trying to address.
## The idea behind MSPO
Standard PSO gives every particle the same update rule every iteration: move toward your personal best, move toward the swarm's global best, add some inertia. The weights change over time, but the type of movement is always the same. This makes the swarm predictable, which is exactly the wrong property when you're trying to explore a complex landscape.
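For reference, that canonical update reads like this in NumPy (a generic textbook sketch with illustrative coefficients, not MSPO's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def pso_step(x, v, p_best, g_best, w=0.7, c1=1.5, c2=1.5):
    """One canonical PSO velocity/position update for the whole swarm.

    x, v:    (n_particles, dim) positions and velocities
    p_best:  (n_particles, dim) per-particle best positions
    g_best:  (dim,) swarm-wide best position
    """
    r1 = rng.random(x.shape)
    r2 = rng.random(x.shape)
    v_new = (w * v                       # inertia: keep moving the same way
             + c1 * r1 * (p_best - x)    # pull toward personal best
             + c2 * r2 * (g_best - x))   # pull toward global best
    return x + v_new, v_new
```

Every particle runs exactly this rule every iteration, which is what makes the swarm's collective motion so uniform.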
MSPO (Multi-Strategy Parrot Optimizer) takes a different approach: instead of one update rule, there are four. Each iteration, each agent in the swarm independently and randomly picks one of the four behaviors. Some agents explore aggressively, some exploit locally, some follow the crowd, some deliberately go against it. The swarm maintains diversity because different agents are doing genuinely different things at the same time.
The four behaviors are loosely inspired by how parrots behave in groups:
Foraging. The agent takes a Levy flight — a random step with a heavy-tailed distribution, meaning it usually takes small steps but occasionally jumps far. The step is scaled by the distance to the global best and pulled toward the population mean. This is the exploration behavior. Early in the run the mean-pull is strong (the flock stays together); late in the run it fades and agents explore independently.
Staying. The agent drifts toward the global best with some random noise that shrinks over time. This is pure exploitation — fine-tuning around a known good region.
Communicating. A coin flip. Half the time the agent moves toward the group mean (flocking). The other half it moves in a random direction with a decaying step size (going off alone). This creates a mix of conformist and nonconformist behavior in every iteration.
Fear of strangers. The position update combines attraction toward the best solution with repulsion from the current position, modulated by a chaotic sequence from a Tent map. The chaos prevents the swarm from settling into a fixed pattern during the later stages of optimization.
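The heavy-tailed step behind the foraging behavior can be sketched with Mantegna's algorithm, a standard way to draw Levy-distributed steps (my own illustration of the idea, not the package's internal implementation):

```python
import numpy as np
from math import gamma

def levy_step(dim, beta=1.5, rng=None):
    """Draw a heavy-tailed Levy step via Mantegna's algorithm.

    Most draws are small, but occasional draws are very large,
    which is what lets a foraging agent jump out of a local basin.
    """
    if rng is None:
        rng = np.random.default_rng()
    sigma = (gamma(1 + beta) * np.sin(np.pi * beta / 2)
             / (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma, dim)
    v = rng.normal(0.0, 1.0, dim)
    return u / np.abs(v) ** (1 / beta)
```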
Three additional components support the behaviors:
- Sobol initialization: instead of scattering agents randomly, a low-discrepancy sequence covers the search space more uniformly from the start. Less wasted exploration in the opening iterations.
- Exponentially decaying inertia weight: starts high (broad jumps) and decays toward a low floor (small refinements). The algorithm naturally transitions from exploration to exploitation without manual tuning.
- Parametric Tent map: generates a deterministic but unpredictable chaotic sequence that modulates the "fear of strangers" behavior. Structured chaos is better than pure randomness for escaping local optima.
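Each of these is a standard building block. A rough sketch of all three, with illustrative parameter values rather than the package's defaults:

```python
import numpy as np
from scipy.stats import qmc

# Sobol initialization: low-discrepancy points scaled into the bounds.
bounds = np.array([(-100.0, 100.0)] * 10)
sampler = qmc.Sobol(d=len(bounds), scramble=True, seed=42)
unit = sampler.random(32)  # 32 agents in [0, 1)^10, evenly spread
swarm = qmc.scale(unit, bounds[:, 0], bounds[:, 1])

# Exponentially decaying inertia weight with a low floor.
def inertia(t, max_iter, w_max=0.9, w_min=0.2):
    return w_min + (w_max - w_min) * np.exp(-5.0 * t / max_iter)

# Parametric Tent map: deterministic but non-repeating sequence in [0, 1].
def tent(x, mu=0.7):
    return x / mu if x < mu else (1.0 - x) / (1.0 - mu)
```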
## How to use it
Install:
```shell
pip install mspo
```
Basic usage — minimize any function that takes a numpy array and returns a float:
```python
import numpy as np
from mspo import MSPO

def objective(x):
    return float(np.sum(x ** 2))

opt = MSPO(
    objective=objective,
    bounds=[(-100, 100)] * 10,
    n_parrots=30,
    max_iter=1000,
    seed=42,
)
result = opt.run()
print(result.best_value)   # ~1.6e-12
print(result.best_params)  # ~zeros
```
### Tuning a classifier
Here's a more realistic example — tuning a random forest on your own dataset:
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from mspo import MSPO

# X_train, y_train: your training data, loaded elsewhere

def objective(params):
    n_estimators = int(params[0])
    max_depth = int(params[1])
    min_samples_split = int(params[2])
    clf = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        random_state=0,
    )
    score = cross_val_score(clf, X_train, y_train, cv=3).mean()
    return -score  # MSPO minimizes, so negate accuracy

opt = MSPO(
    objective=objective,
    bounds=[
        (10, 500),  # n_estimators
        (2, 50),    # max_depth
        (2, 20),    # min_samples_split
    ],
    n_parrots=15,
    max_iter=50,
    seed=0,
)
result = opt.run()
print(f"Best CV accuracy: {-result.best_value:.4f}")
```
A few practical notes:
- MSPO minimizes. If you want to maximize accuracy, negate the return value.
- Integer parameters need to be cast with `int()` inside your objective. The optimizer works in continuous space; you handle the rounding.
- The `seed` parameter makes runs fully reproducible: same seed, same result, every time. Use it when comparing different objective functions or parameter bounds.
- `n_parrots` and `max_iter` control the computational budget. Total evaluations = n_parrots × (1 + max_iter). For expensive objectives (each evaluation takes minutes), use fewer parrots and iterations. For cheap objectives (milliseconds per eval), you can afford more.
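That budget formula is worth checking before a run; for example, the classifier setup above works out to 765 model fits:

```python
def total_evals(n_parrots, max_iter):
    """Total objective evaluations for one MSPO run.

    The whole swarm is evaluated once at initialization,
    then once per agent on every iteration.
    """
    return n_parrots * (1 + max_iter)

print(total_evals(15, 50))    # classifier example: 765 CV fits
print(total_evals(30, 1000))  # basic example: 30,030 evaluations
```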
### Tuning a neural network
Same pattern works for PyTorch or TensorFlow — just wrap your training loop:
```python
import torch
from mspo import MSPO

# MyModel and train_and_evaluate are your own model class and training loop

def objective(params):
    lr = 10 ** params[0]            # log-scale: params[0] in [-5, -1]
    weight_decay = 10 ** params[1]  # log-scale: params[1] in [-6, -2]
    batch_size = int(params[2])
    model = MyModel()
    optimizer = torch.optim.Adam(
        model.parameters(),
        lr=lr,
        weight_decay=weight_decay,
    )
    val_loss = train_and_evaluate(model, optimizer, batch_size)
    return val_loss

opt = MSPO(
    objective=objective,
    bounds=[
        (-5, -1),   # log10(learning_rate)
        (-6, -2),   # log10(weight_decay)
        (16, 256),  # batch_size
    ],
    n_parrots=20,
    max_iter=30,
    seed=42,
)
result = opt.run()
```
Note the log-scale trick for learning rate and weight decay. These parameters span several orders of magnitude, so searching in log space gives the optimizer a more uniform landscape to work with.
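To see why this matters: sampling the learning rate uniformly on a linear scale in [1e-5, 1e-1] puts almost every draw within one decade of the upper bound, while sampling the exponent uniformly covers every decade equally (a quick standalone check, not MSPO-specific):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

linear = rng.uniform(1e-5, 1e-1, n)            # uniform on the raw value
log_scale = 10 ** rng.uniform(-5, -1, n)       # uniform on the exponent

# Fraction of samples below 1e-2 (the bottom three decades of the range):
print((linear < 1e-2).mean())     # ~0.10 -- barely explored
print((log_scale < 1e-2).mean())  # ~0.75 -- three of the four decades
```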
## Does it actually work?
I validated against the official CEC 2022 benchmark suite — 12 functions (unimodal, multimodal, hybrid, composition) with the published shift vectors and rotation matrices from the competition organizers. This is the standard test used in the metaheuristics community to compare optimizers on a level playing field.
Setup: 30 agents, 1000 iterations, 10 dimensions, 30 independent runs per function. Compared against canonical PSO (constriction coefficients) and random search (same evaluation budget).
| Function | Type | MSPO | PSO | Random |
|---|---|---|---|---|
| F1 Zakharov | unimodal | 45.06 | 0.00 | 5436 |
| F2 Rosenbrock | unimodal | 10.05 | 62.98 | 256 |
| F3 Schaffer F7 | multimodal | 1.7e-04 | 0.03 | 0.20 |
| F4 Rastrigin | multimodal | 38.00 | 51.00 | 87.90 |
| F5 Levy | multimodal | 0.55 | 0.67 | 2.47 |
| F6 Hybrid 1 | hybrid | 39637 | 58895 | 10.4M |
| F7 Hybrid 2 | hybrid | 60.02 | 29.50 | 214 |
| F8 Hybrid 3 | hybrid | 472 | 298 | 1061 |
| F9 Composition 1 | composition | 398 | 426 | 516 |
| F10 Composition 2 | composition | -1254 | 29.57 | -366 |
| F11 Composition 3 | composition | 2.32 | 79.14 | 117 |
| F12 Composition 4 | composition | 165 | 170 | 242 |
Values are median error (f(x) - f*) over 30 runs. Lower is better.
MSPO wins on 9 of 12 functions. PSO wins on Zakharov (a smooth unimodal function where simple attraction-to-best is enough) and two hybrid functions. MSPO's advantage shows up most clearly on the composition functions (F9-F12), where the landscape has multiple overlapping basins — exactly where behavioral diversity matters.
Where it doesn't win: purely unimodal problems where the shortest path to the minimum is a straight line. PSO's simple "move toward best" is hard to beat when there are no local optima to escape from. If your hyperparameter landscape is likely smooth and unimodal, PSO or Bayesian optimization may be more appropriate.
## Adapting it to your problem
Some guidelines based on what I've found works:
When MSPO is a good fit:
- Hyperparameter spaces with 3+ dimensions where grid search is too expensive
- Objectives with multiple local optima (most neural network training landscapes)
- Situations where you can afford 500-30000 evaluations but need better results than random search
- When you want reproducible tuning (set the seed)
When something else might be better:
- Very expensive objectives where you can only afford <100 evaluations — use Bayesian optimization (Optuna)
- Purely combinatorial/discrete spaces — MSPO works in continuous space, so you'd need to round parameters. Evolutionary methods designed for discrete spaces may be more natural.
- Low-dimensional smooth problems (1-2 parameters) — grid search or scipy.optimize will be simpler and effective
Tuning MSPO itself:
The default parameters (from the paper) work well as a starting point. The main knobs:
- `n_parrots`: more agents means better coverage but more evaluations per iteration. 20-50 is a reasonable range.
- `max_iter`: more iterations means finer convergence. 200-1000 depending on your budget.
- `seed`: always set this for reproducibility. Run with 3-5 different seeds and take the best if you want robustness.
The inertia weight, Tent map parameters, and Levy flight beta all have paper-validated defaults and I haven't found a case where changing them helps significantly. Leave them alone unless you have a specific reason.
## Source code and reproduction
Everything is on GitHub with 73 unit tests:
- Repo: github.com/vijaygovindaraja/mspo
- PyPI: `pip install mspo`
To reproduce the CEC 2022 benchmarks:
```shell
pip install mspo[benchmark]
python benchmarks/run_cec2022.py --quick  # ~2 min smoke test
python benchmarks/run_cec2022.py          # full run, ~2.5 hours
```
The package is 6 source files, each under 200 lines. If you want to understand or modify the algorithm, start with mspo/behaviors.py (the four update rules) and mspo/optimizer.py (the main loop).
Paper: Govindarajan, V. (2025). MSPO: A machine learning hyperparameter optimization method for enhanced breast cancer image classification. Digital Health.