Prompt tuning has stopped at text. Multimodal LLMs (images, video, molecules) need joint prompt search. Read: https://arxiv.org/abs/2510.09201
Core insight: a multimodal prompt is a pair (text + non‑text input). MPO optimizes both jointly, using alignment‑preserving updates (so the frozen decoder's behavior stays stable) and a Bayesian selector that reuses past evaluations as priors to pick which candidate to try next.
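One way to read "alignment‑preserving" is as a trust‑region constraint: the learned non‑text prompt embedding is kept close to its initialization so the frozen decoder still sees inputs in a familiar region. This is my interpretation, not the paper's exact mechanism; the function name and values below are illustrative.

```python
import math

def clip_to_trust_region(vec, init, radius):
    """Project `vec` back into an L2 ball of `radius` around `init`.

    Keeps a prompt-embedding update from drifting so far that the
    frozen decoder's behavior changes (a simple alignment proxy).
    """
    delta = [v - i for v, i in zip(vec, init)]
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= radius:
        return vec
    scale = radius / norm
    return [i + d * scale for i, d in zip(init, delta)]

# A multimodal prompt is (text, embedding); only the embedding is updated.
init_emb = [0.0, 0.0, 0.0]
candidate = [3.0, 4.0, 0.0]              # proposed update drifted far from init
safe = clip_to_trust_region(candidate, init_emb, radius=1.0)
# `safe` keeps the update direction but sits on the unit ball around init
```

The direction of the update is preserved; only its magnitude is capped, which is the usual trade‑off trust regions make.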
How to adopt (practical): 1) parameterize non‑text inputs as prompt embeddings; 2) freeze the decoder and apply alignment‑preserving updates to the prompt vectors only; 3) use a Bayesian acquisition rule that leverages prior evaluations to focus the candidate search.
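Step 3 can be sketched as a conjugate Gaussian update plus a UCB acquisition rule: past evaluation scores set each candidate's prior, new observations tighten its posterior, and the optimism bonus steers the next evaluation toward uncertain prompts. This is a generic Bayesian‑selection sketch, not the paper's actual algorithm; all names and numbers are illustrative.

```python
import math

def posterior(prior_mean, prior_var, obs, obs_var=0.25):
    """Conjugate Gaussian update of (mean, var) given observed scores."""
    mean, var = prior_mean, prior_var
    for y in obs:
        gain = var / (var + obs_var)     # Kalman-style gain
        mean = mean + gain * (y - mean)
        var = (1 - gain) * var
    return mean, var

def ucb_pick(candidates, beta=1.0):
    """candidates: name -> (prior_mean, prior_var, observed_scores).

    Returns the candidate maximizing posterior mean + exploration bonus.
    """
    scores = {}
    for name, (pm, pv, obs) in candidates.items():
        mean, var = posterior(pm, pv, obs)
        scores[name] = mean + beta * math.sqrt(var)
    return max(scores, key=scores.get)

# Prompt A has been tried before (prior informed by past evals); B is new.
cands = {
    "A": (0.6, 0.05, [0.58, 0.62]),  # well-characterized, decent
    "B": (0.5, 0.50, []),            # uncertain -> large exploration bonus
}
pick = ucb_pick(cands)  # → "B": its uncertainty outweighs A's slight edge
```

The point of reusing priors is exactly the "fewer wasted evaluations" claim: a candidate whose prior already marks it as mediocre never gets re‑evaluated unless its uncertainty justifies it.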
Takeaway: Don’t optimize text alone. Joint multimodal prompt optimization (MPO) outperforms text‑only tuning and reduces wasted evaluations across images, video, and molecules. Read: https://arxiv.org/abs/2510.09201