1. Introduction
The digital world is increasingly reliant on visual content, which presents a significant barrier for visually impaired users. SmartCaption AI addresses this challenge by leveraging Large Language Models (LLMs) and a multi-agent framework to generate contextually relevant, accurate image captions. The tool does not merely describe images; it integrates them meaningfully into the surrounding web content, enhancing the browsing experience for visually impaired users.
2. The Challenge
Visually impaired users often struggle to access and interpret visual content on websites. Existing solutions typically describe images without context, leading to misunderstandings and inaccuracies. For instance, the image below depicts a person standing on a wooden staircase surrounded by dense greenery, pointing towards the rocky shoreline of an oceanfront property in West Vancouver. An AI tool incorrectly captioned it as "a man standing on a ledge near a river" due to a lack of contextual understanding. Moreover, that caption gives the reader no insight into why the image appears in the news article at all.
Image source: https://www.cbc.ca/news/canada/british-columbia/west-vancouver-public-beach-access-1.7279886
3. The Solution
SmartCaption AI overcomes these limitations by first summarizing the article content to provide context for image analysis. This ensures that the generated captions are not only accurate but also contextually relevant. Implemented as a Chrome extension, the tool seamlessly integrates into the user's browsing experience, generating real-time image captions and enabling text-to-speech functionality.
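Stripped of the agent machinery described below, the core idea is a two-step prompt: summarize the article first, then caption each image with that summary as context. Here is a minimal sketch of that idea using the OpenAI Python SDK (v1.x); the prompt wording and the helper names (`summarize_article`, `caption_image`) are illustrative assumptions, not the extension's actual code.

```python
# Minimal sketch of context-conditioned captioning (illustrative only).
# Assumes the OpenAI Python SDK v1.x and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def summarize_article(article_text: str) -> str:
    """Summarize the article so the summary can ground the caption."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Summarize this news article in 3-4 sentences."},
            {"role": "user", "content": article_text},
        ],
    )
    return response.choices[0].message.content

def caption_image(image_url: str, summary: str) -> str:
    """Caption the image using the article summary as context."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": f"Article summary: {summary}\n"
                             "Describe this image for a visually impaired "
                             "reader, explaining its role in the article."},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    )
    return response.choices[0].message.content
```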
4. Workflow
The picture above illustrates the tool's workflow:
User interaction:
- Opens a news article (Step 1).
- Clicks the Chrome extension icon (Step 2).
- Once the user activates the extension, it performs two tasks simultaneously:
  - Cleans up the article content with the Readability.js library, stripping advertisements and other unnecessary elements (Step 2a).
  - Sends the image URLs and the article URL to the backend server (Step 2b); a sketch of the receiving endpoint follows this list.
- Disables the ‘Speak’ button (Step 3).
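On the server side, the payload from Step 2b lands on a Flask endpoint. The sketch below is a minimal guess at its shape; the route name, the JSON field names, and the `generate_captions` placeholder are assumptions, and the real implementation lives in the GitHub repository.

```python
# Hypothetical Flask endpoint for Step 2b (route and field names are
# assumptions; see the GitHub repository for the real implementation).
from flask import Flask, jsonify, request

app = Flask(__name__)

def generate_captions(article_url: str, image_urls: list[str]) -> dict:
    # Placeholder: the real pipeline engages the agents described in the
    # next section (Steps 5-7).
    return {url: "caption pending" for url in image_urls}

@app.post("/caption")
def caption():
    payload = request.get_json()
    article_url = payload["article_url"]   # assumed field name
    image_urls = payload["image_urls"]     # assumed field name

    # Steps 5-7: summarize the article, then caption each image in context.
    captions = generate_captions(article_url, image_urls)

    # Step 8: respond to the UI with one caption per image URL.
    return jsonify({"captions": captions})

if __name__ == "__main__":
    app.run(port=5000)
```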
To enhance LLM capabilities:
- Uses multiple agents, allowing LLM agents to work together for better results.
- The User Proxy Agent controls the backend workflow.
- Upon receiving the information (Step 5), the User Proxy Agent engages the Web Surfer Agent to summarize the article content (Step 6).
- The User Proxy Agent then forwards this summary and the image URLs to the Image Agent, which generates image captions grounded in the summary (Step 7). A rough sketch of this agent setup follows the list.
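Here is how this agent choreography might look with pyautogen 0.2.x. The agent names, prompts, and configuration values are assumptions rather than the project's actual code; `WebSurferAgent` and `MultimodalConversableAgent` come from pyautogen's contrib modules.

```python
# Rough sketch of the multi-agent backend (Steps 5-7), assuming pyautogen
# 0.2.x with its websurfer and multimodal extras installed. Names, prompts,
# and config values are assumptions; the actual code is in the repository.
import autogen
from autogen.agentchat.contrib.web_surfer import WebSurferAgent
from autogen.agentchat.contrib.multimodal_conversable_agent import (
    MultimodalConversableAgent,
)

llm_config = {"config_list": [{"model": "gpt-4o", "api_key": "..."}]}

# The User Proxy Agent drives the workflow with no human in the loop.
user_proxy = autogen.UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    code_execution_config=False,
)

# Step 6: the Web Surfer Agent visits the article URL and summarizes it.
web_surfer = WebSurferAgent(
    name="web_surfer",
    llm_config=llm_config,
    summarizer_llm_config=llm_config,
    browser_config={"viewport_size": 4096},
)

# Step 7: a multimodal agent captions each image, grounded in the summary.
image_agent = MultimodalConversableAgent(
    name="image_agent",
    llm_config=llm_config,
)

def generate_captions(article_url: str, image_urls: list[str]) -> dict:
    chat = user_proxy.initiate_chat(
        web_surfer,
        message=f"Visit {article_url} and summarize the article.",
        max_turns=2,
    )
    summary = chat.summary  # ChatResult.summary defaults to the last reply

    captions = {}
    for url in image_urls:
        reply = user_proxy.initiate_chat(
            image_agent,
            # MultimodalConversableAgent parses <img ...> tags in messages.
            message=(
                f"Article summary: {summary}\n"
                f"Caption this image in context: <img {url}>"
            ),
            max_turns=1,
        )
        captions[url] = reply.summary
    return captions
```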
Once captions are generated:
- The backend server responds to the User Interface (UI) (Step 8).
- The UI appends a customized prefix and suffix to mark the captions as AI-generated, then displays them beneath each image and in the image's alternative text (Step 4).
- Enables the ‘Speak’ button, which activates the Text-To-Speech (TTS) functionality (Step 9).
- The ‘Speak’ button is initially disabled to ensure all images are captioned before TTS activation.
5. Implementation
The tech stack of this project includes:
- Frontend: HTML, CSS, JavaScript
- Backend: Python, Flask framework, Pyautogen
- LLM: OpenAI 'gpt-4o'
For more details, visit the SmartCaption AI GitHub repository.
6. Demonstration and Result
Here is a demonstration of how the tool works:
In the demonstration, you can see the tool describe the image as:
"The image shows a person standing on a wooden staircase surrounded by dense greenery, pointing towards the rocky shoreline of an oceanfront property in West Vancouver. This staircase is part of a century-old public access path to Altamont Beach, which has recently been sold to a private buyer, sparking local outrage. The individual in the image appears to be discussing or reflecting on the significance of this now-restricted path, symbolizing the community's loss of access to a cherished public space."
The tool accurately describes the image and also gives the reader the article context behind it: the significance of the now-restricted path and the community's loss of access to a cherished public space.
Enjoying the project? Don’t forget to star it ⭐!
7. Potential For Improvement
- Optimizing processing times for initial loads.
- Optimizing cost by switching to open-source LLMs such as Llama 3, Phi-3, and LLaVA (a configuration sketch follows this list).
- Handling a wider range of web content types.
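On the cost point: since pyautogen speaks the OpenAI chat API, switching to an open-source model can be as simple as pointing the config at an OpenAI-compatible local server. A minimal sketch, assuming an Ollama server running locally with Llama 3 already pulled:

```python
# Assumption: an Ollama server is running locally (it exposes an
# OpenAI-compatible API at /v1) and `ollama pull llama3` has been run.
llm_config = {
    "config_list": [
        {
            "model": "llama3",
            "base_url": "http://localhost:11434/v1",
            "api_key": "ollama",  # Ollama ignores the key, but one is required
        }
    ]
}
```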
8. Acknowledgement
This project implements a research paper I co-authored with Dr. Randy Lin (Algoma University) on leveraging LLMs to generate accurate and relevant image captions. The paper has been accepted at the IEEE/ICCA'24 conference (Sixth Edition, BUE), which will be held from December 17 to 19, 2024, at The British University in Egypt.