Jimmy Guerrero for Voxel51

Originally published at voxel51.com

Ask Your Images Anything 

Author: Jacob Marks (Machine Learning Engineer at Voxel51)

Run Visual Question Answering Models on Your Images Without Code

Welcome to week two of Ten Weeks of Plugins. During these ten weeks, we will be building a FiftyOne Plugin (or multiple!) each week and sharing the lessons learned!

If you’re new to them, FiftyOne Plugins provide a flexible mechanism for anyone to extend the functionality of their FiftyOne App. You may find the following resources helpful:

Ok, let’s dive into this week’s FiftyOne Plugin!

Visual Question Answering 🖼️❓🗨️

Imagine a world where your dataset could talk to you. A world where you could ask any question — specific or open-ended — and get a meaningful answer. How much more dynamic would your data exploration be? Welcome to Visual Question Answering (VQA), a machine learning task squarely situated at the intersection of computer vision and natural language processing. 

In recent years, transformer models like Salesforce’s BLIPv2 have taken this open-ended data exploration to a new level. This plugin brings the power of VQA models directly to your image dataset. Now you can start asking your images those burning questions — all without a single line of code!

Plugin Overview & Functionality

For the second week of 10 Weeks of Plugins, I built a Visual Question Answering (VQA) Plugin. This plugin allows you to ask open-ended questions to your images — effectively chatting with your data — within the FiftyOne App.

Out of the box, this plugin supports two models (and two types of usage, one local and one API-based):

  1. A Vision-and-Language Transformer (ViLT, fine-tuned on the VQAv2 dataset), which is the default VQA model in the Visual Question Answering pipeline from Hugging Face’s Transformers library. This model is run locally (see the sketch just after this list).
  2. BLIP2 from Salesforce, which is accessed via a Replicate inference endpoint.
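
For reference, here is roughly how the first option looks when called directly through the Transformers library, outside of FiftyOne (a minimal sketch; the image path and question are placeholders, and torch must be installed for the pipeline to run):

from transformers import pipeline

# Loads the default VQA checkpoint (ViLT fine-tuned on VQAv2) from the
# Hugging Face Hub on first use
vqa = pipeline("visual-question-answering")

result = vqa(image="/path/to/image.jpg", question="What color is the car?")
print(result)  # e.g. [{"score": 0.92, "answer": "red"}, ...]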

After you install the plugin, open the operators list (by pressing the “ ` ” key in the FiftyOne App) and click into the answer_visual_question operator; there you can choose which of these models to use.

Enter your question in the question box, and the answer will be displayed in the operator’s output:

No data is added to the underlying dataset.

Installing the Plugin

If you haven’t already done so, install FiftyOne:

pip install fiftyone

Then you can download this plugin from the command line with:

fiftyone plugins download https://github.com/jacobmarks/vqa-plugin

Refresh the FiftyOne App, and you should see the answer_visual_question operator in your operators list when you press the “ ` ” key.
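
If the operator doesn’t appear, you can confirm that the plugin was downloaded by listing your installed plugins from the command line:

fiftyone plugins list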

To use the Vision-and-Language Transformer (ViLT), install Hugging Face’s transformers library:

pip install transformers

To use BLIPv2, set up an account with Replicate and install the Replicate Python library:

pip install replicate

And add your Replicate API Token to your environment variables:

export REPLICATE_API_TOKEN=...

You do not need both to use the plugin: the operator checks your environment variables and only lists a model as an option if the corresponding API is accessible.
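
For illustration, the availability check amounts to something like the following (hypothetical helper names; the actual plugin’s logic may be organized differently):

import os

def replicate_available():
    # BLIPv2 is only offered if a Replicate API token is set
    return "REPLICATE_API_TOKEN" in os.environ

def transformers_available():
    # ViLT is only offered if the transformers library is installed
    try:
        import transformers  # noqa: F401
        return True
    except ImportError:
        return False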

If you want to use a different VQA model (or fine-tune your own version of one of these!), whether it runs locally or via an API, it should be easy to extend this code.

Lessons Learned

The Visual Question Answering plugin is a Python Plugin consisting of four files:

  • __init__.py: defining the operator
  • fiftyone.yml: registering the plugin so it can be downloaded and installed
  • README.md: describing the plugin
  • requirements.txt: listing the requirements. Both transformers and replicate are commented out by default because neither is strictly required.

Using Selected Samples

Visual question answering models like BLIPv2 typically answer questions about one image at a time. As a result, it only makes sense for the answer_visual_question operator to likewise act on a single image. But how does the operator know which image to answer a question about?

Just like the FiftyOne App, whose session has a selected attribute (see Selecting samples), the plugin’s context, ctx, has a selected attribute. In direct analogy with the session, ctx.selected is a list of the sample IDs that are currently selected in the FiftyOne App.
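
If you haven’t used selections before, here is the session-side version of the same idea (standard FiftyOne usage, shown only for context):

import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")
session = fo.launch_app(dataset)

# Click one or more samples in the App, then:
print(session.selected)  # list of the selected sample IDs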

The VQA plugin looks at the number of selected samples in the resolve_input() method:

num_selected = len(ctx.selected)

And only allows the user to enter a question if exactly one sample is selected.
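
Put together, the input-resolution logic looks roughly like this (a stripped-down sketch based on the description above; the real plugin also surfaces a warning when zero or multiple samples are selected):

import fiftyone.operators.types as types

def resolve_input(self, ctx):
    inputs = types.Object()

    # ctx.selected holds the IDs of the samples currently selected in the App
    num_selected = len(ctx.selected)

    if num_selected == 1:
        # Exactly one image is selected, so prompt for a question
        inputs.str("question", label="Question", required=True)

    return types.Property(inputs)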

💡Note: to use ctx.selected when you expect the selected sample(s) to change, you must pass dynamic=True to the operator’s configuration. In this case, the operator config was:

@property
def config(self):
    return foo.OperatorConfig(
        name="answer_visual_question",
        label="VQA: Answer question about selected image",
        dynamic=True,
    )

Returning Informative Outputs

The VQA plugin doesn’t write anything onto the samples themselves, but we still need a way to see the results of the model’s run: the “answer”. In this plugin, I return the model’s answer as output, using the resolve_output() method.

Output in a Python plugin works in much the same way as input. In resolve_input(), we create an inputs object with inputs = types.Object(), add elements to it, e.g. inputs.str("question", label="Question", required=True), and then return these inputs via types.Property(inputs, view=...). In resolve_output(), we create an outputs object with outputs = types.Object(), add elements to it, e.g. outputs.str("question", label="Question"), and return these outputs via types.Property(outputs, view=...).

The main difference is where the values come from: in resolve_input(), the values are supplied by the user, whereas resolve_output() gets its values from execute(). You can return them as a dictionary from execute() and then reference them by key in resolve_output().

In this plugin, I pass the question and answer from execute():

return {"question": question, "answer": answer}
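For context, the execute() method that produces this dictionary looks roughly like the following (a sketch; run_vqa_model is a hypothetical stand-in for dispatching to ViLT or BLIPv2, depending on the model the user chose):

def execute(self, ctx):
    # resolve_input() guarantees exactly one selected sample
    sample = ctx.dataset[ctx.selected[0]]
    question = ctx.params.get("question")

    # Hypothetical helper that runs the chosen VQA model on the image
    answer = run_vqa_model(sample.filepath, question)

    return {"question": question, "answer": answer}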

Then in resolve_output(), I access these values:

outputs.str("question", label="Question")
outputs.str("answer", label="Answer")

This works for a variety of data types, not just strings!
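
For example, a hypothetical operator that also computed a confidence score could surface it alongside the strings (illustrative only; these extra fields are not part of this plugin):

def resolve_output(self, ctx):
    outputs = types.Object()
    outputs.str("question", label="Question")
    outputs.str("answer", label="Answer")

    # Hypothetical extra fields, assuming execute() also returned them
    outputs.float("confidence", label="Model confidence")
    outputs.bool("used_replicate", label="Ran via Replicate")

    return types.Property(outputs)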

Conclusion

If you want to chat with your entire dataset, VoxelGPT is a great option. VoxelGPT is another example of a FiftyOne Plugin, which we launched earlier this year. It translates your natural language prompts into actions that organize and explore your data. On the other hand, if you want to ask open-ended questions about specific images in your dataset — without departing from your existing workflows — then this Visual Question Answering plugin is for you!

Stay tuned over the remaining weeks in the Ten Weeks of FiftyOne Plugins while we continue to pump out a killer lineup of plugins! You can track our journey in our ten-weeks-of-plugins repo — and I encourage you to fork the repo and join me on this journey!

Week 2 Community Plugins

🚀 Check out this awesome line2d plugin 📉 by wayofsamu for visualizing (x,y) points as a line chart!

