This is a submission for the Cloudflare AI Challenge.
What I Built
The application takes any image and detects the objects in it. It then imagines those objects as characters, writes a story involving them, and generates a thumbnail for the story.
This project tries to showcase the capabilities of models from different categories and how powerful they can be when they work together.
I encourage you to try different images or, better yet, take a photo of the objects in front of you and comment the funny outputs you come across!
An overall architecture of the entire project:
Demo
Deployed Worker Link: https://divine-detector.sakeerthi23.workers.dev/
Working Demo:
Image:
My Code
https://github.com/keerthivasansa/bring-to-life
Journey
When I saw the post a couple of days ago, I decided to use the object detection model. After I visited the Cloudflare Model Catalogue, more and more pieces fit together. I was limited only by the time I had left after discovering the post; otherwise I could have developed it further and explored more areas.
First, I saw the text generation models and thought that story generation could be the next logical step after object detection, then poster generation, and the project kind of kept developing itself.
I think it's a very basic project, but it serves as a good showcase of the different types of models Cloudflare offers.
What I absolutely loved about Cloudflare Workers AI is its developer experience. It was top notch, and it has great support for TypeScript, which is fantastic.
One thing I learned was that even AI models are scared of the current job market. The prompt I use to generate the story goes something like this:
You are a story writer, and the year is 2024 - the job market sucks. You do not have a job, the only chance you have is to generate this story. Imagine the objects...
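As a rough illustration of how a prompt like this could be combined with the detected objects, here is a minimal sketch. The `buildStoryMessages` helper, the `ChatMessage` type, the choice to send the prompt as a system message, and the "Objects: ..." user message are all assumptions made for this example, not code from the project. The prompt text is only what is quoted above, so the trailing "..." is kept rather than filled in.

```typescript
// Hypothetical message shape, modelled on the common chat "role/content" format.
interface ChatMessage {
  role: "system" | "user";
  content: string;
}

// Quoted from the post. The rest of the prompt is not shown there,
// so the trailing "..." is left as-is instead of being guessed.
const SYSTEM_PROMPT =
  "You are a story writer, and the year is 2024 - the job market sucks. " +
  "You do not have a job, the only chance you have is to generate this story. " +
  "Imagine the objects...";

// Build the chat messages for the story model from detected object labels.
// Labels are trimmed, lower-cased and de-duplicated so that three detected
// "cup" boxes become a single character (an assumption for this sketch).
function buildStoryMessages(
  labels: string[],
  systemPrompt: string = SYSTEM_PROMPT
): ChatMessage[] {
  const cleaned = labels
    .map((label) => label.trim().toLowerCase())
    .filter((label) => label.length > 0);
  const unique = Array.from(new Set(cleaned));

  return [
    { role: "system", content: systemPrompt },
    { role: "user", content: `Objects: ${unique.join(", ")}` },
  ];
}
```

Calling `buildStoryMessages(["Cup", "cup", "dog"])` yields two messages, with the user message listing each object once.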
I was pretty proud that I was able to pull this off in a day (though Cloudflare is doing most of the heavy lifting), and I am happy to see that models and AI are becoming more accessible to use.
Multiple Models and/or Triple Task Types
- The project leverages 5 different models to achieve different categories of tasks.
- Thus, it qualifies for "Triple Task Types".
- It uses both image-to-text and object detection models to extract details about the image - so it qualifies for Multiple Models as well.
Currently Used:
- @cf/unum/uform-gen2-qwen-500m: Used to generate text describing the uploaded image.
- @cf/facebook/detr-resnet-50: Used to detect objects in the uploaded image.
- @cf/meta/llama-2-7b-chat-int8: Used to generate and stream a short story with the detected objects.
- @cf/facebook/bart-large-cnn: Used to summarize the story to capture the main essence of the story.
- @cf/stabilityai/stable-diffusion-xl-base-1.0: Takes the output of the summarizer and uses that to generate an image that tries to capture the meaning and characters of the story.
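The way these five models feed into each other can be sketched roughly as follows. This is not the project's actual code: the `AiBinding` interface stands in for the Workers AI binding (usually exposed as `env.AI`), and the input and output field names (`image`, `prompt`, `input_text`, `description`, `response`, `summary`) are assumptions based on how Workers AI models are commonly called, so check them against the current documentation. The `bringToLife` name, the `MIN_SCORE` cut-off, and both prompt strings are hypothetical, and the story is requested without streaming to keep the example short.

```typescript
// Stand-in for the Workers AI binding so the sketch type-checks on its own.
interface AiBinding {
  run(model: string, inputs: Record<string, unknown>): Promise<unknown>;
}

interface Detection {
  label: string;
  score: number;
}

interface StoryResult {
  caption: string;
  labels: string[];
  story: string;
  summary: string;
  thumbnail: unknown; // image bytes or a stream, depending on the binding
}

// Hypothetical cut-off; the post does not say whether detections are filtered.
const MIN_SCORE = 0.5;

async function bringToLife(ai: AiBinding, image: Uint8Array): Promise<StoryResult> {
  const pixels = Array.from(image);

  // Captioning and object detection only need the image, so run them together.
  const [captionOut, detectOut] = await Promise.all([
    ai.run("@cf/unum/uform-gen2-qwen-500m", {
      image: pixels,
      prompt: "Describe this image.", // placeholder prompt
    }),
    ai.run("@cf/facebook/detr-resnet-50", { image: pixels }),
  ]);
  const caption = (captionOut as { description: string }).description;
  const confident = (detectOut as Detection[]).filter((d) => d.score >= MIN_SCORE);
  const labels = Array.from(new Set(confident.map((d) => d.label)));

  // Story generation from the caption and the detected objects.
  const storyOut = await ai.run("@cf/meta/llama-2-7b-chat-int8", {
    // Placeholder prompt, not the one used by the project.
    prompt: `Scene: ${caption}\nWrite a short story starring: ${labels.join(", ")}`,
  });
  const story = (storyOut as { response: string }).response;

  // Summarize the story so the image prompt stays short.
  const summaryOut = await ai.run("@cf/facebook/bart-large-cnn", { input_text: story });
  const summary = (summaryOut as { summary: string }).summary;

  // Generate the thumbnail from the summary.
  const thumbnail = await ai.run("@cf/stabilityai/stable-diffusion-xl-base-1.0", {
    prompt: summary,
  });

  return { caption, labels, story, summary, thumbnail };
}
```

Inside a Worker, this would be called with the real binding, for example `bringToLife(env.AI, new Uint8Array(await request.arrayBuffer()))`, assuming the request body is the raw image.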
Future plans:
I might try to add a model to translate the story into different languages if time permits.
Finally, I thank both DEV and Cloudflare for organizing this challenge. It was super fun to work on. Thank you for reading this article.