Eyasu Asnake
Detecting Objects in Images from Any Text Prompt (Not Fixed Classes)

Most object detection systems assume a fixed label set: train a model on COCO, Open Images, or a custom dataset, and you’re limited to whatever classes you trained for.

I’ve been exploring a different approach: prompt-based object detection, where the input is:

  • an image
  • a free-form natural language prompt

and the output is a set of localized detections matching the prompt, even when the concept isn't a single predefined object class.
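To make the contract concrete, here is a minimal sketch of what such an image-plus-prompt detector returns. The names (`Detection`, `filter_detections`) and fields are illustrative assumptions, not the tool's actual API: each detection is a box localized in the image, a confidence score, and the prompt phrase it grounds.

```python
# Hypothetical shape of a prompt-based detection result (illustrative only,
# not the tool's real API).
from dataclasses import dataclass

@dataclass
class Detection:
    box: tuple    # (x_min, y_min, x_max, y_max) in pixels
    score: float  # model confidence in [0, 1]
    phrase: str   # the prompt (or prompt fragment) this box grounds

def filter_detections(detections, min_score=0.3):
    """Keep only detections above a confidence threshold."""
    return [d for d in detections if d.score >= min_score]

# Example: two candidate boxes for one free-form prompt; the low-confidence
# one is filtered out.
results = [
    Detection(box=(34, 50, 180, 220), score=0.87,
              phrase="person holding a red umbrella"),
    Detection(box=(200, 10, 260, 90), score=0.12,
              phrase="person holding a red umbrella"),
]
kept = filter_detections(results)
```

Thresholding on the score is the usual knob for trading recall against precision when the prompt is ambiguous.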

I built a small web tool to experiment with this idea.

What it can do

The tool supports complex, compositional prompts, not just object names.

*(Three example images: detections produced from the prompt.)*

These prompts combine attributes, relations, text, and world knowledge: things that don't map cleanly to standard detector classes.

What it’s not good at

This approach is not designed for:

  • very small objects
  • obscure, barely visible objects
  • dense real-time detection out of the box

It performs better on concepts that require reasoning and world knowledge than on tasks demanding pixel-level precision on tiny targets.

Why I built it

The main motivation so far has been:

creating training data for highly specific detectors

Instead of manually labeling or training a new detector for every niche concept, this can be used to:

  • bootstrap datasets
  • explore whether a concept is learnable
  • validate prompts before committing to full training pipelines
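For the dataset-bootstrapping use case, detections would typically be exported into a standard annotation format for downstream training. A small sketch, assuming boxes come back as `(x_min, y_min, x_max, y_max)` pixel tuples with scores (the function name and input shape are assumptions, not part of the tool):

```python
# Sketch: convert prompt-based detections into COCO-style annotations,
# assuming each detection is ((x_min, y_min, x_max, y_max), score).
# COCO bboxes use [x, y, width, height].
def to_coco_annotations(detections, image_id, category_id, start_ann_id=1):
    anns = []
    for i, (box, score) in enumerate(detections):
        x0, y0, x1, y1 = box
        anns.append({
            "id": start_ann_id + i,
            "image_id": image_id,
            "category_id": category_id,
            "bbox": [x0, y0, x1 - x0, y1 - y0],
            "area": (x1 - x0) * (y1 - y0),
            "score": score,   # kept so weak labels can be filtered later
            "iscrowd": 0,
        })
    return anns

anns = to_coco_annotations(
    [((10, 20, 110, 220), 0.9)], image_id=1, category_id=1
)
```

Keeping the model's score alongside each box lets you threshold or hand-review the weakest labels before committing to a full training run.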

Try it

I’ve made the tool publicly available as a demo:

👉 Detect Anything – Free AI Object Detection Online
https://www.useful-ai-tools.com/tools/detect-anything

No login required. Images are processed transiently and not stored.

(Please don’t abuse it. Inference is relatively expensive.)

Looking for feedback

I’m especially interested in:

  • good real-world use cases people see for this
  • stress-testing and failure modes
  • where this approach breaks down compared to task-specific detectors

If you’ve worked with grounding, referring expression comprehension, or prompt-based vision models, I’d love to hear your thoughts.
