This is a submission for the Google AI Studio Multimodal Challenge
What I Built
I built Social Butler, an AI-powered toolkit that streamlines the creative workflow for social media managers, content creators, and marketers. The applet cuts down the time it takes to create engaging social media assets by offering a suite of specialized tools in one interface.
Social Butler features three core modules:
- YouTube Thumbnail Generator: A highly customizable tool that goes beyond simple text-to-image. It allows users to define a theme, art style, lighting, framing, and specific text overlays to generate eye-catching thumbnails that drive clicks. Users can generate from a detailed prompt or upload their own base image for the AI to edit and enhance.
- Social Media Post Generator: This module crafts platform-specific copy (for LinkedIn and Instagram) based on a user's core idea and desired post type (e.g., promotional, educational). Crucially, it also generates a contextually relevant image to accompany the text, providing a complete, ready-to-publish content package.
- Background Remover: A straightforward yet essential utility that takes any user-uploaded image and intelligently removes the background, providing a clean PNG with a transparent background, perfect for layering in other designs.
Demo
How I Used Google AI Studio
This application is built entirely on the Gemini API. I prototyped and refined it in Google AI Studio, which was essential for testing different multimodal prompting strategies to achieve the desired quality and control across all features.
I leveraged two key models:
- gemini-2.5-flash: This model is the text-generation workhorse. It's used for the "meta-prompting" in the Thumbnail Generator, where it intelligently transforms simple user selections into a detailed, descriptive prompt for the image model. It is also used to generate the nuanced, platform-aware copy for the Social Media Post Generator.
- gemini-2.5-flash-image-preview: This is the core of the app's visual capabilities. As a versatile multimodal model, it handles all image-related tasks:
  - Text-to-Image Generation for creating thumbnails and social media images from scratch.
  - Image-and-Text Editing for enhancing a user's uploaded base image in the Thumbnail Generator.
  - Mask-free Image Editing for the Background Remover, where it understands the instruction to isolate the subject without needing a specific mask.
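To make the two model roles above more concrete, here is a minimal Python sketch. It is not the applet's actual code. `ThumbnailOptions`, `build_meta_prompt`, and `build_edit_parts` are hypothetical names, the prompt wording is invented for illustration, and the part layout is modeled on the inline-data parts of the Gemini REST API, so the field names should be checked against the current API reference before use.

```python
import base64
from dataclasses import dataclass


@dataclass
class ThumbnailOptions:
    """Hypothetical container for the user's thumbnail selections."""
    theme: str
    art_style: str
    lighting: str
    framing: str
    overlay_text: str


def build_meta_prompt(opts: ThumbnailOptions) -> str:
    """Build the request sent to the text model (e.g. gemini-2.5-flash).

    The text model is asked to expand the short selections into one
    detailed image prompt. It does not generate the image itself.
    """
    return (
        "You are writing a prompt for an image generation model.\n"
        "Expand the selections below into one detailed, descriptive prompt "
        "for a YouTube thumbnail. Reply with the prompt only.\n"
        f"- Theme: {opts.theme}\n"
        f"- Art style: {opts.art_style}\n"
        f"- Lighting: {opts.lighting}\n"
        f"- Framing: {opts.framing}\n"
        f'- Text overlay: "{opts.overlay_text}"\n'
    )


def build_edit_parts(image_bytes: bytes, mime_type: str, instruction: str) -> list[dict]:
    """Pair an uploaded image with a text instruction for the image model.

    Assumed layout: one base64 inline-data part followed by one text part.
    No mask is included; the instruction alone describes the edit.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return [
        {"inline_data": {"mime_type": mime_type, "data": encoded}},
        {"text": instruction},
    ]
```

The same `build_edit_parts` helper would cover both thumbnail editing and background removal in this sketch, since the only difference between them is the instruction text.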
Multimodal Features
Social Butler is fundamentally multimodal, integrating text and images as both inputs and outputs to create a seamless user experience.
- Combined Image and Text Input for Thumbnail Editing: The Thumbnail Generator's most powerful feature is its ability to take a user's uploaded image and a complex set of text-based instructions (theme, style, text to add, etc.) to produce a new, edited image. The model doesn't just overlay text; it reinterprets the entire image in the context of the user's request, creating a cohesive and professional final product. This enhances the user experience by giving them creative control far beyond a simple filter, allowing them to bring their precise vision to life.
- Text-to-Multimodal-Output for Social Posts: The Social Post Generator demonstrates a chained multimodal workflow. It starts with a text prompt from the user and first generates a text output (the post copy). It then creates a new text prompt derived from the post's content and context, which is fed to the image model to generate a matching visual. This packages two distinct creative tasks into one click, keeps the text and image thematically aligned, and saves the user significant time.
- Instruction-Based Image Editing: The Background Remover uses multimodal input in its simplest, most practical form: an image combined with a direct text instruction ("remove the background"). The model's ability to understand this command and perform a complex editing task without further user input (like manual masking) makes a tedious task trivial. This direct, instruction-based interaction makes the tool highly intuitive and efficient.
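The chained workflow described for the Social Post Generator can be sketched roughly as follows. This is an illustration under assumptions, not the applet's implementation: `generate_social_post` and `SocialPost` are hypothetical names, the prompt wording is invented, and the two models are passed in as plain callables so the sketch stays independent of any particular SDK.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-ins for the real model calls:
# a text model maps a prompt to text, an image model maps a prompt to image bytes.
TextModel = Callable[[str], str]
ImageModel = Callable[[str], bytes]


@dataclass
class SocialPost:
    copy_text: str      # the generated post copy
    image_prompt: str   # the derived prompt sent to the image model
    image: bytes        # the generated image data


def generate_social_post(
    idea: str,
    platform: str,
    post_type: str,
    text_model: TextModel,
    image_model: ImageModel,
) -> SocialPost:
    # Step 1: text in, text out. Generate the platform-specific copy.
    copy_request = (
        f"Write a {post_type} post for {platform} about the following idea. "
        f"Match the tone and length conventions of {platform}.\n"
        f"Idea: {idea}"
    )
    copy_text = text_model(copy_request).strip()

    # Step 2: derive an image prompt from the copy that was just generated,
    # so the visual follows the final text rather than the raw idea.
    image_prompt_request = (
        "Write a single descriptive prompt for an image that should "
        f"accompany this {platform} post. Reply with the prompt only.\n"
        f"Post: {copy_text}"
    )
    image_prompt = text_model(image_prompt_request).strip()

    # Step 3: text in, image out.
    image = image_model(image_prompt)
    return SocialPost(copy_text=copy_text, image_prompt=image_prompt, image=image)
```

Deriving the image prompt from the generated copy, instead of from the user's original idea, is what keeps the two outputs aligned in this sketch: any angle the text model adds to the post is carried into the visual.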