DEV Community

jackma
I Built a Small AI Tool That Starts With a Photo

#ai

I have been experimenting with a small AI tool that begins with a very ordinary input: a photo.

The idea came from a simple observation. A lot of information people want help with is not typed neatly into a prompt box. It is on paper, in a notebook, on a worksheet, in a diagram, or spread across a couple of photos.

So instead of starting with text, I wanted to see what happens when the first step is visual.

👉 Download Now from the App Store: https://apps.apple.com/us/app/ai-snapsolve-homework-solver/id6763911277
App Store Search: AI SnapSolve

The Basic Flow

The prototype is simple in concept:

  1. Take a photo of a problem.
  2. Extract the useful text and visual context.
  3. Identify what kind of question it is.
  4. Route it to a solving path.
  5. Return a step-by-step explanation.
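As an illustration only, the five steps above could be sketched as a chain of small stages. Everything here is hypothetical: the `Problem` type, the stage names, and the hard-coded placeholder results are assumptions for the example, not the app's actual code.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Problem:
    image_paths: list[str]
    text: str = ""
    subject: str = "unknown"
    steps: list[str] = field(default_factory=list)


Stage = Callable[[Problem], Problem]


def extract(problem: Problem) -> Problem:
    # Placeholder: a real version would call an OCR or vision model here.
    problem.text = "2x + 3 = 11"
    return problem


def classify(problem: Problem) -> Problem:
    # Placeholder heuristic: treat anything containing "=" as algebra.
    problem.subject = "algebra" if "=" in problem.text else "unknown"
    return problem


def solve(problem: Problem) -> Problem:
    # Placeholder: a real version would route to a subject-specific path.
    if problem.subject == "algebra":
        problem.steps = [
            "Subtract 3 from both sides: 2x = 8",
            "Divide both sides by 2: x = 4",
        ]
    return problem


def run_pipeline(problem: Problem, stages: list[Stage]) -> Problem:
    # Each stage receives the output of the previous one.
    for stage in stages:
        problem = stage(problem)
    return problem


result = run_pipeline(Problem(image_paths=["photo.jpg"]), [extract, classify, solve])
for step in result.steps:
    print(step)
```

Keeping each stage as a separate function makes it easier to test or replace one step without touching the others.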

That sounds straightforward, but each step has its own edge cases.

A photo might contain shadows. Handwriting can be uneven. A diagram label might matter more than a paragraph of text. A small minus sign can change the entire result.

The first lesson was that "just use OCR" is not enough. The image-to-question step has to preserve meaning, not only characters.
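One way to picture "preserving meaning" is a small post-processing pass over the OCR output. The sketch below assumes a hypothetical OCR result shaped as `(token, confidence)` pairs; real OCR libraries expose confidence differently, so treat the format and the 0.8 threshold as assumptions.

```python
# Dash-like characters that may appear where an ASCII minus sign was meant.
DASH_VARIANTS = {
    "\u2212": "-",  # MINUS SIGN
    "\u2013": "-",  # EN DASH
    "\u2014": "-",  # EM DASH
}


def normalize_math_text(text: str) -> str:
    """Map dash look-alikes to an ASCII minus so later steps see one symbol."""
    for variant, ascii_minus in DASH_VARIANTS.items():
        text = text.replace(variant, ascii_minus)
    return text


def flag_uncertain_tokens(tokens, threshold=0.8):
    """Return the tokens whose confidence falls below the threshold."""
    return [token for token, confidence in tokens if confidence < threshold]


# Hypothetical OCR output: the minus sign was recognized with low confidence.
ocr_tokens = [("x", 0.99), ("-", 0.41), ("3", 0.95)]
uncertain = flag_uncertain_tokens(ocr_tokens)
print(normalize_math_text("x \u2212 3"), uncertain)
```

Flagged tokens could then be shown to the user for confirmation instead of being passed on silently.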

Why Start With an Image?

For some tasks, typing is already good enough.

But homework-style problems are often visual. They can include equations, tables, geometry figures, chemical formulas, or prompts that depend on layout.

When the user has to retype all of that by hand, the tool feels slow before the AI even starts helping.

Starting with a photo removes some of that friction.

[Image: Photo-based AI workflow prototype]

The Interesting Part Is the Middle

The part I found most interesting was not the final answer.

It was the middle layer between recognition and response.

After extracting the problem from the image, the system has to decide what kind of reasoning is appropriate. A physics problem should probably care about units. A geometry problem may need theorem-based language. A chemistry problem can break if symbols are handled loosely.

This led me toward a small routing layer rather than a single generic prompt.

The routing does not need to be fancy to be useful. Even a basic subject-aware step can make the output feel less generic.
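A deliberately basic version of such a routing step might look like the following. The keyword sets, the hint strings, and the function names are all made up for illustration; a real router could just as well use a classifier or a model call.

```python
import re

# Hypothetical keyword lists; a real system would need far better coverage.
SUBJECT_KEYWORDS = {
    "physics": {"velocity", "force", "acceleration", "joule", "newton"},
    "geometry": {"triangle", "angle", "circle", "perpendicular", "radius"},
    "chemistry": {"mol", "reaction", "acid", "compound", "oxidation"},
}

# Subject-specific instructions that get prepended to the solving prompt.
SUBJECT_HINTS = {
    "physics": "Track units at every step.",
    "geometry": "Name the theorem used at each step.",
    "chemistry": "Preserve chemical symbols and subscripts exactly.",
    "general": "Explain each step plainly.",
}


def detect_subject(text: str) -> str:
    """Pick the subject with the most keyword hits, or 'general' if none hit."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {
        subject: len(words & keywords)
        for subject, keywords in SUBJECT_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "general"


def build_prompt(text: str) -> str:
    subject = detect_subject(text)
    return f"[{subject}] {SUBJECT_HINTS[subject]}\nProblem: {text}"


print(build_prompt("A force of 10 N acts on a cart. Find the acceleration."))
```

The fallback to "general" matters: a wrong subject label can be worse than no label, so the router should be allowed to abstain.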

Trying Multiple Explanations

Another experiment was generating more than one solution path.

For some problems, one answer is enough. But for learning-oriented tasks, different explanations can reveal different things:

  • one path may be more algebraic
  • one may focus on the concept
  • one may verify the answer
  • one may be closer to the method taught in class

This is useful only if the differences are easy to compare. Otherwise, multiple answers become noise.
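To make the differences easy to compare, one option is to group the generated paths by their final answer before showing them. This is a sketch under assumptions: `SolutionPath` and its fields are hypothetical, and comparing answers as normalized strings is a simplification.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SolutionPath:
    style: str  # e.g. "algebraic", "conceptual", "verification"
    steps: list[str]
    final_answer: str


def group_by_answer(paths):
    """Map each distinct final answer to the styles that reached it."""
    groups = defaultdict(list)
    for path in paths:
        # Trim and lowercase so trivial formatting differences do not split groups.
        groups[path.final_answer.strip().lower()].append(path.style)
    return dict(groups)


def paths_agree(paths) -> bool:
    """True when every path ends at the same final answer."""
    return len(group_by_answer(paths)) <= 1


paths = [
    SolutionPath("algebraic", ["2x = 8", "x = 4"], "x = 4"),
    SolutionPath("conceptual", ["Undo the +3, then halve."], "X = 4 "),
    SolutionPath("verification", ["Try x = 5: 2*5 + 3 = 13"], "x = 5"),
]
print(group_by_answer(paths))
print("All paths agree:", paths_agree(paths))
```

When the paths do not agree, the interface can surface that disagreement directly rather than leaving the user to spot it.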

The interface matters here as much as the model.

Multi-Image Context

I also underestimated how often a single image is not enough.

A problem can span multiple pages. The instructions may be on one page and the diagram on another. A table can be separated from the questions that depend on it.

Supporting multiple images made the tool feel more realistic.

It also made the extraction step harder, because the system has to treat the images as connected context instead of isolated uploads.
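A minimal way to treat several photos as one connected context is to merge their extracted text in capture order, with a marker per page. The `PageExtract` type and its `role` labels are hypothetical; how pages are ordered and labeled in practice is an open design question.

```python
from dataclasses import dataclass


@dataclass
class PageExtract:
    page: int  # order in which the user captured the photo
    role: str  # hypothetical label, e.g. "instructions", "diagram", "table"
    text: str


def merge_pages(pages) -> str:
    """Join per-page extracts into one context string, ordered by page number."""
    ordered = sorted(pages, key=lambda p: p.page)
    sections = [f"--- Page {p.page} ({p.role}) ---\n{p.text}" for p in ordered]
    return "\n\n".join(sections)


merged = merge_pages([
    PageExtract(2, "diagram", "Triangle ABC with AB = 5 and height 4"),
    PageExtract(1, "instructions", "Find the area of the triangle shown."),
])
print(merged)
```

Keeping the page markers in the merged text lets a later step refer back to where each piece came from.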

[Image: AI explanation and comparison interface]

Things That Still Need Care

The tool is useful, but it is not magic.

There are several places where mistakes can enter:

  • OCR can misread symbols
  • visual context can be incomplete
  • the subject classification can be wrong
  • a model can produce a plausible but incorrect step
  • multiple explanations can disagree in confusing ways

Because of that, I think tools like this should be designed around inspection rather than blind trust.

The best version is not "take a photo and accept the answer." It is closer to "take a photo, see the reasoning, compare the path, and check whether it makes sense."
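For simple equations, "check whether it makes sense" can be as direct as substituting the proposed answer back in. The helper below is a sketch with a hypothetical name, and it only covers single-variable equations that can be written as two Python functions.

```python
import math


def check_by_substitution(lhs, rhs, value, tolerance=1e-9):
    """Return True if both sides of the equation match at the given value."""
    return math.isclose(lhs(value), rhs(value), abs_tol=tolerance)


# Example equation: 2x + 3 = 11
def left_side(x):
    return 2 * x + 3


def right_side(x):
    return 11


print(check_by_substitution(left_side, right_side, 4))
print(check_by_substitution(left_side, right_side, 5))
```

A check like this does not prove the reasoning was sound, but it catches a final answer that plainly does not fit the original problem.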

What I Learned

This small project made me think about AI products less as single prompts and more as pipelines.

For this kind of tool, the quality depends on several connected pieces:

  • image capture
  • OCR and visual parsing
  • context preservation
  • subject detection
  • model routing
  • explanation quality
  • user-side verification

If any one layer is weak, the final answer becomes less reliable.
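A rough back-of-the-envelope model shows why. Suppose each layer succeeds independently with some probability; then the chance that every layer succeeds is the product of those probabilities. The numbers below are invented for illustration, not measurements, and real layers are unlikely to be fully independent.

```python
import math

# Invented per-layer success rates, assumed independent for this sketch.
layer_success = {
    "image capture": 0.95,
    "OCR and visual parsing": 0.95,
    "context preservation": 0.95,
    "subject detection": 0.95,
    "model routing": 0.95,
    "explanation quality": 0.95,
    "user-side verification": 0.95,
}


def end_to_end(rates) -> float:
    """Probability that all layers succeed, under the independence assumption."""
    return math.prod(rates.values())


overall = end_to_end(layer_success)

# Weaken a single layer and recompute.
weakened = dict(layer_success, **{"OCR and visual parsing": 0.70})
overall_weak = end_to_end(weakened)

print(f"all layers at 0.95: {overall:.2f}")
print(f"one layer at 0.70:  {overall_weak:.2f}")
```

Under these assumptions, seven layers at 0.95 each give roughly 0.70 end to end, and dropping a single layer to 0.70 pulls the total down to roughly 0.51.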

Final Thought

Starting with a photo feels small, but it changes the shape of the interaction.

Instead of asking users to translate the world into a prompt, the tool tries to meet the input where it already exists.

That is the part I find most promising: not replacing thinking, but reducing the awkward first step between a real-world problem and a useful AI explanation.
