DEV Community

Evan Lin for Google Developer Experts

Posted on • Originally published at evanlin.com on

[Gemini 3.0][Image Generation] Building a PDF Text Optimization Tool with Gemini 3.0 Pro Image API


Background

I've been using NotebookLM recently to quickly create slides. While the tool is convenient, there's a frustrating issue: the generated Chinese characters often have blurry edges or come out garbled. Even if scrambled characters "don't affect reading," as an engineer I still want the output to look professional.

I saw someone online share an interesting workaround: take screenshots of the NotebookLM slides, upload them to Gemini 3.0 Pro with "thinking" + image generation enabled, and use a carefully designed prompt to repair the images. The results were genuinely good! But manually screenshotting, uploading, and copy-pasting the prompt every time was far too tedious.

So I decided: Why not make it an automated tool directly?

The features I want are simple:

  1. 📄 Upload a PDF file
  2. 🤖 Automatically use the Gemini API to optimize the text clarity of each page
  3. 📥 Download the optimized PDF

Sounds simple, right? But I encountered a lot of pitfalls during the actual development...

The Magical Prompt Used

Before getting into the development, let me share the image-optimization prompt (from that online share):

Role Definition
You are now a senior image repair expert equipped with a "Multi-modal Visual Cognitive Engine." You possess the core capabilities of Context-aware OCR and Generative Image Upscaling.

Mission Objective
Execute "Semantic-Level Image Reconstruction." For the input of low-resolution or blurry images, use logical deduction to repair the text content and output a 4K wide color gamut high-fidelity image.

Execution Protocol (Thought Chain and Execution Protocol)
Please strictly execute the following calculation process in the background and directly output the final image:

1. 【Optical & Logical Inference】
   Perform high-dimensional scanning of the image, locking the blurry text area (ROI).
   Activate "Contextual Semantic Analysis": Not only recognize pixels, but also deduce the original "Traditional Chinese" content of the blurry area based on the context, common vocabulary, and logic (Traditional Chinese).
   Fault tolerance mechanism: If pixel information is lost, prioritize semantic filling with the highest Confidence Score.

2. 【Isomorphic Visual Synthesis】
   Strictly inherit the original image's topological structure: layout, object coordinates, and perspective vanishing points must be completely locked with the original image.
   Style Transfer: Accurately capture the original image's design language (color scheme, material, lighting), and apply it to the new high-resolution canvas.

3. 【Vector-Grade Rendering】
   Perform "Anti-aliasing" and "Sharpening" on the edges of text and lines.
   Text strokes must present "printing-grade" clarity, completely eliminating JPEG compression artifacts and edge bleeding.

Exclusion Criteria (Negative Constraints)
Strictly prohibit the generation of unreadable "Gibberish" or Simplified Chinese.
Strictly prohibit changing the key composition structure of the original image.
Strictly prohibit outputting blurry, low-contrast, or overly smooth oil-painting-like images.

Output
Output the reconstructed image ONLY. No textual explanation required.


The key points of this prompt are:

  • ✅ Use "semantic reasoning" instead of pure OCR (can understand context)
  • ✅ Maintain the original layout
  • ✅ Generate high-resolution images
  • ✅ Force the use of Traditional Chinese

But for automation, I simplified it to a more direct version:

prompt_text = "Please optimize the text in this image to make it clearer and more readable. Maintain the original layout, but improve the quality, contrast, and clarity of the text. Please output the optimized image."


Even with this simplified prompt, Gemini 3.0's image generation proved effective; in fact, after testing, the results were even better than with the long version!

Technical Architecture

I decided to use the following technology stack:

| Technology | Purpose | Reason |
| --- | --- | --- |
| Streamlit | Web UI framework | Quickly build an interface and focus on business logic |
| google-genai | Gemini API SDK | Official SDK with image generation support |
| pdf2image | PDF to image | Stable and reliable |
| img2pdf | Image to PDF | Simple and efficient |
| Pillow | Image processing | De facto standard Python imaging library |

Problems Encountered During Development

Problem 1: Streamlit API Deprecation Warning

When I started developing with Streamlit 1.32.0, I encountered this error:

TypeError: ImageMixin.image() got an unexpected keyword argument 'use_container_width'


It turned out that the Streamlit version was too old, and the use_container_width parameter was introduced in 1.33.0+.

Solution: Upgrade Streamlit

pip install --upgrade streamlit


But after upgrading, a new warning appeared:

Please replace `use_container_width` with `width`.
`use_container_width` will be removed after 2025-12-31.


It turns out that the latest version has deprecated use_container_width and replaced it with the new width parameter!

Final Correction:

# ❌ Old API (deprecated)
st.image(image, use_container_width=True)
st.button("Button", use_container_width=True)

# ✅ New API
st.image(image, width='stretch')
st.button("Button", width='stretch')

| Old Parameter Value | New Parameter Value |
| --- | --- |
| use_container_width=True | width='stretch' |
| use_container_width=False | width='content' |
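Since the old keyword will be removed after 2025-12-31, one way to survive the transition is to pick the right keyword based on the installed version. This is just an illustrative sketch (width_kwargs is a made-up helper, and the (1, 44) cutoff is an assumption for the example; check your own version's changelog for the real boundary):

```python
def width_kwargs(streamlit_version: str, stretch: bool = True) -> dict:
    """Return the full-width kwargs appropriate for a given Streamlit version.

    NOTE: the (1, 44) cutoff is a placeholder assumption for this sketch.
    """
    major, minor = (int(p) for p in streamlit_version.split(".")[:2])
    if (major, minor) >= (1, 44):
        # Newer API: width accepts 'stretch' or 'content'
        return {"width": "stretch" if stretch else "content"}
    # Older API: boolean use_container_width
    return {"use_container_width": stretch}
```

You would then call, e.g., st.image(image, **width_kwargs(st.__version__)) so the same code runs on both sides of the deprecation.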

**Lesson:** API design evolves, and you need to pay attention to the official deprecation warnings.

Problem 2: google-genai Part.from_text Call Error

Then, when integrating the Gemini API, I encountered this error:

TypeError: Part.from_text() takes 1 positional argument but 2 were given


My original code:

# ❌ Incorrect API usage
contents = [
    types.Content(
        role="user",
        parts=[
            types.Part.from_text("Please optimize this image..."), # ❌ Error!
            types.Part.from_bytes(
                data=image_data,
                mime_type="image/png"
            )
        ]
    )
]


After checking the official documentation, I found that the API for google-genai 1.49.0 has changed!

Correct Usage:

# ✅ Correct API usage
contents = [
    types.Content(
        role="user",
        parts=[
            types.Part(text="Please optimize this image..."), # Use the text parameter directly
            types.Part(
                inline_data=types.Blob(
                    mime_type="image/png",
                    data=image_data
                )
            )
        ]
    )
]


API Change Comparison:

| Item | Old API | New API |
| --- | --- | --- |
| Text | Part.from_text(text) | Part(text=text) |
| Image | Part.from_bytes(data=..., mime_type=...) | Part(inline_data=Blob(...)) |

**Lesson:** SDKs are updated frequently, so check the latest official documentation and don't rely solely on Stack Overflow.

Problem 3: ImageConfig Parameter Validation Error

When configuring image generation parameters, I encountered another new problem:

pydantic_core._pydantic_core.ValidationError: 1 validation error for ImageConfig
output_mime_type
  Extra inputs are not permitted [type=extra_forbidden, input_value='image/png', input_type=str]


My original configuration:

# ❌ Error: output_mime_type is not supported
image_config=types.ImageConfig(
    aspect_ratio="1:1",
    image_size="2K",
    output_mime_type="image/png", # ❌ This parameter does not exist!
)


After consulting the official documentation, I found that ImageConfig only supports two parameters:

Correct Configuration:

# ✅ Correct: Only use supported parameters
image_config=types.ImageConfig(
    aspect_ratio="16:9", # Supported ratio
    image_size="2K" # Supported size
)


Supported Parameter Values:

| Parameter | Supported Values |
| --- | --- |
| aspect_ratio | "1:1", "2:3", "3:2", "3:4", "4:3", "4:5", "5:4", "9:16", "16:9", "21:9" |
| image_size | "1K", "2K", "4K" |

**Lesson:** When using an SDK validated by Pydantic, parameters must strictly conform to the schema; you can't add arbitrary ones.
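To make the "extra inputs are not permitted" behavior concrete, here is a plain-Python sketch of the kind of strict validation Pydantic performs. The allowed keys and values come from the tables above; validate_image_config is a hypothetical helper, not part of the SDK:

```python
VALID_RATIOS = {"1:1", "2:3", "3:2", "3:4", "4:3", "4:5", "5:4", "9:16", "16:9", "21:9"}
VALID_SIZES = {"1K", "2K", "4K"}
ALLOWED_KEYS = {"aspect_ratio", "image_size"}

def validate_image_config(**kwargs):
    """Reject unknown keys and out-of-range values, Pydantic-style."""
    extra = set(kwargs) - ALLOWED_KEYS
    if extra:
        # Mirrors Pydantic's extra_forbidden error
        raise ValueError(f"Extra inputs are not permitted: {sorted(extra)}")
    if kwargs.get("aspect_ratio", "1:1") not in VALID_RATIOS:
        raise ValueError(f"Unsupported aspect_ratio: {kwargs['aspect_ratio']}")
    if kwargs.get("image_size", "1K") not in VALID_SIZES:
        raise ValueError(f"Unsupported image_size: {kwargs['image_size']}")
    return kwargs
```

The point is that strict schemas fail loudly on the unknown key instead of silently ignoring it, which is exactly the error I hit.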

Problem 4: Image Aspect Ratio Does Not Match Expectations

During the first test, the generated image was vertical, but the NotebookLM slides were clearly horizontal 16:9!

Reason: I initially set aspect_ratio="3:4" (close to A4 paper ratio), which is suitable for documents but not for slides.

Solution:

# Change to the horizontal slide ratio
image_config=types.ImageConfig(
    aspect_ratio="16:9", # Horizontal slide
    image_size="2K"
)


But for a better user experience, I added a dropdown menu to let the user choose:

# Add options in the Streamlit sidebar
aspect_ratio = st.selectbox(
    "Output Ratio",
    options=["16:9", "4:3", "3:4", "9:16", "1:1"],
    index=0,
    help="Choose the aspect ratio of the output image. 16:9 is suitable for slides, 3:4 is suitable for documents"
)


**Lesson:** Don't assume the user's needs; provide options and let them decide.
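Rather than forcing a choice, you could also auto-suggest a default by matching the PDF page's own dimensions against the ratios the API supports. nearest_supported_ratio is a hypothetical convenience helper, not part of the app:

```python
SUPPORTED_RATIOS = ["1:1", "2:3", "3:2", "3:4", "4:3", "4:5", "5:4", "9:16", "16:9", "21:9"]

def nearest_supported_ratio(width: int, height: int) -> str:
    """Pick the supported aspect ratio closest to the page's own ratio."""
    target = width / height

    def ratio_value(r: str) -> float:
        w, h = r.split(":")
        return int(w) / int(h)

    # Choose the ratio whose numeric value is closest to the page's
    return min(SUPPORTED_RATIOS, key=lambda r: abs(ratio_value(r) - target))
```

This could pre-select the dropdown's index based on the first page, while still letting the user override it.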

Complete Implementation

Core Function: optimize_image_with_gemini

def optimize_image_with_gemini(image, api_key, aspect_ratio="16:9"):
    """Optimize the text in the image using the Gemini API"""
    try:
        # Initialize Vertex AI client
        client = genai.Client(
            vertexai=True,
            api_key=api_key,
        )

        # Convert the image to PNG bytes
        buffered = io.BytesIO()
        image.save(buffered, format="PNG")
        img_bytes = buffered.getvalue()

        # Use the Gemini 3.0 image generation model
        model = "gemini-3-pro-image-preview"

        # Build the request content
        prompt_text = "Please optimize the text in this image to make it clearer and more readable. Maintain the original layout, but improve the quality, contrast, and clarity of the text. Please output the optimized image."

        contents = [
            types.Content(
                role="user",
                parts=[
                    types.Part(text=prompt_text),
                    types.Part(
                        inline_data=types.Blob(
                            mime_type="image/png",
                            data=img_bytes  # raw bytes; no base64 round-trip needed
                        )
                    )
                ]
            )
        ]

        # Configure generation parameters
        generate_content_config = types.GenerateContentConfig(
            temperature=1,
            top_p=0.95,
            max_output_tokens=32768,
            response_modalities=["IMAGE"],
            safety_settings=[
                types.SafetySetting(
                    category="HARM_CATEGORY_HATE_SPEECH",
                    threshold="OFF"
                ),
                types.SafetySetting(
                    category="HARM_CATEGORY_DANGEROUS_CONTENT",
                    threshold="OFF"
                ),
                types.SafetySetting(
                    category="HARM_CATEGORY_SEXUALLY_EXPLICIT",
                    threshold="OFF"
                ),
                types.SafetySetting(
                    category="HARM_CATEGORY_HARASSMENT",
                    threshold="OFF"
                )
            ],
            image_config=types.ImageConfig(
                aspect_ratio=aspect_ratio,
                image_size="2K"
            ),
        )

        # Call the API
        response = client.models.generate_content(
            model=model,
            contents=contents,
            config=generate_content_config,
        )

        # Extract the generated image
        if response.candidates and len(response.candidates) > 0:
            candidate = response.candidates[0]
            if candidate.content.parts:
                for part in candidate.content.parts:
                    if hasattr(part, 'inline_data') and part.inline_data:
                        image_data = part.inline_data.data
                        optimized_image = Image.open(io.BytesIO(image_data))
                        return optimized_image

        # If no image is generated, return the original image
        st.warning("The API did not return an optimized image, using the original image")
        return image

    except Exception as e:
        st.error(f"Optimization failed: {str(e)}")
        return image


Main Process

def main():
    st.title("📄 PDF Text Optimization Tool")
    st.markdown("### Use Gemini AI to Optimize Text in PDFs")

    # Sidebar settings
    with st.sidebar:
        st.header("Settings")
        api_key = st.text_input(
            "Google Cloud API Key",
            type="password",
            value=os.environ.get("GOOGLE_CLOUD_API_KEY", ""),
        )

        dpi = st.slider("Image Resolution (DPI)", 150, 600, 300, 50)
        aspect_ratio = st.selectbox(
            "Output Ratio",
            options=["16:9", "4:3", "3:4", "9:16", "1:1"],
            index=0,
        )

    # Upload file
    uploaded_file = st.file_uploader("Select PDF File", type=['pdf'])

    if uploaded_file and st.button("🚀 Start Processing"):
        with tempfile.TemporaryDirectory() as temp_dir:
            # Save the upload to disk so pdf2image can read it
            pdf_path = os.path.join(temp_dir, uploaded_file.name)
            with open(pdf_path, "wb") as f:
                f.write(uploaded_file.getbuffer())

            # Step 1: PDF → Image
            images = convert_from_path(pdf_path, dpi=dpi)

            # Step 2: Optimize each page
            optimized_images = []
            for idx, img in enumerate(images):
                st.write(f"Processing page {idx + 1}/{len(images)}...")
                optimized_img = optimize_image_with_gemini(
                    img, api_key, aspect_ratio
                )
                optimized_images.append(optimized_img)

            # Step 3: Image → PDF
            output_pdf = images_to_pdf(optimized_images)

            # Step 4: Provide download
            st.download_button(
                label="⬇️ Download Optimized PDF",
                data=output_pdf,
                file_name=f"optimized_{uploaded_file.name}",
                mime="application/pdf",
            )

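main() calls an images_to_pdf helper that isn't shown above. The project's stack lists img2pdf for this step; as a self-contained sketch, Pillow alone can also write a multi-page PDF:

```python
import io
from PIL import Image

def images_to_pdf(images):
    """Merge a list of PIL images into one PDF, returned as bytes.

    Sketch only: the real project uses img2pdf, but Pillow's PDF writer
    is enough to show the shape of the helper main() expects.
    """
    pages = [img.convert("RGB") for img in images]  # PDF pages must be RGB
    buf = io.BytesIO()
    pages[0].save(buf, format="PDF", save_all=True, append_images=pages[1:])
    return buf.getvalue()
```

The returned bytes can be handed straight to st.download_button as the data argument.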

Actual Results


(The left is generated by NotebookLM, and the right is redrawn by Gemini-3.0-pro-image)

Processing Flow

  1. Upload a PDF file
  2. The system automatically converts each page to an image (DPI adjustable)
  3. Each page calls the Gemini API for optimization
  4. Display processing progress and success/failure statistics
  5. Reassemble the optimized images into a PDF
  6. Provide a download button

Before and After Optimization Comparison

The application will display a before-and-after comparison of the first page:

Before and After Optimization (First Page):
┌────────────────┬────────────────┐
│ Original Image │   Optimized    │
│    (Blurry)    │    (Clear)     │
└────────────────┴────────────────┘


Processing Statistics

📄 Processing page 1/10...
  → Initializing Gemini client...
  → Converting image format...
  → Using model: gemini-3-pro-image-preview
  → Calling Gemini API for optimization...
  → Received API response, parsing results...
  → ✅ Successfully generated optimized image
✅ Page 1 optimization successful

...

Successfully optimized: 8 pages | Failed: 2 pages

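The final statistics line is just a fold over the per-page outcomes. A minimal sketch (summarize_results is a made-up name, not a function from the app):

```python
def summarize_results(results):
    """Turn a list of per-page success booleans into the summary line."""
    ok = sum(1 for success in results if success)
    return f"Successfully optimized: {ok} pages | Failed: {len(results) - ok} pages"
```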

Development Experience

1. API Documentation Must Be Up-to-Date

The pitfalls I encountered this time were mostly due to API updates:

  • Part.from_text() → Part(text=...)
  • use_container_width → width='stretch'

**Lesson:** Don't just look at Stack Overflow or old tutorials; be sure to check the latest official documentation.

2. Pydantic Validation is a Double-Edged Sword

google-genai uses Pydantic for parameter validation. The advantage is that errors can be quickly found, but the disadvantage is that a slight typo will cause an error.

**Suggestion:** Use your IDE's auto-completion, or copy and paste directly from the official examples.

3. Limitations of Image Generation APIs

Currently, the Gemini image generation API has some limitations:

  • Must go through Vertex AI (the general Developer API cannot be used)
  • Requires setting up a GCP project and authentication
  • Fixed output ratio (cannot freely specify pixel size)

But the advantages are:

  • ✅ Extremely high generation quality (especially text clarity)
  • ✅ Can understand semantics (not just simple filters)
  • ✅ Supports multiple aspect ratio options

4. User Experience of Batch Processing

When processing multi-page PDFs, user experience is important:

  • ✅ Display real-time progress (page X/Y)
  • ✅ Display the processing status of each page
  • ✅ Statistics of successes/failures
  • ✅ Display detailed error messages

These small details make the tool more professional.

5. Cost Considerations

The Gemini image generation API is paid. Processing a 10-page PDF means:

  • 10 API calls
  • One 2K image generated per call

**Suggestion:** In a production environment, consider cost control:

  • Limit the number of pages processed at a time
  • Provide a preview function (only process the first page)
  • Cache processed results
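The caching idea can be as simple as keying on a hash of each page's PNG bytes, so re-running the same PDF doesn't spend another API call on unchanged pages. cached_optimize is an illustrative wrapper (assumption: identical input bytes always yield an acceptable cached result), not part of the app:

```python
import hashlib

_cache = {}  # sha256 hex digest -> optimized result

def cached_optimize(img_bytes, optimize_fn):
    """Call optimize_fn(img_bytes) only on a cache miss."""
    key = hashlib.sha256(img_bytes).hexdigest()
    if key not in _cache:
        _cache[key] = optimize_fn(img_bytes)
    return _cache[key]
```

In a real deployment the dict would be replaced with something persistent (disk or a key-value store) so the cache survives restarts.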

6. The Value of Going from Manual to Automated

The original process:

1. Generate slides in NotebookLM
2. Screenshot each page
3. Upload to Gemini AI Studio
4. Copy and paste the prompt
5. Download the optimized image
6. Repeat steps 2-5 N times
7. Manually merge into a PDF


After automation:

1. Upload PDF
2. Click "Start Processing"
3. Download the optimized PDF


Time saved: From 30 minutes for 10 pages → 3 minutes (API call time)

Summary

If you also encounter similar image optimization needs:

  • Use the Gemini image generation API - much better than traditional OCR + filters
  • Pay attention to the API version - SDKs are updated quickly, so check the latest documentation
  • Value user experience - progress display, error handling are important
  • Consider costs - commercial applications need to evaluate API call costs

This tool is simple, but it does solve my pain points. From manual processing to one-click completion, this is the value of automation!

Project Structure

nano-nblm-pdf/
├── app.py # Streamlit main program
├── requirements.txt # Dependency packages
├── .env.example # Environment variable example
└── README.md # Usage instructions


Environment Setup

Required Packages

streamlit>=1.40.0
google-genai>=1.0.0
pdf2image==1.17.0
Pillow==10.2.0
img2pdf==0.5.1


Environment Variables

# Google Cloud API Key (required)
export GOOGLE_CLOUD_API_KEY="your-api-key"


Start the Application

# Install dependencies
pip install -r requirements.txt

# Start Streamlit
streamlit run app.py


Usage Instructions

  1. Enter your Google Cloud API Key in the sidebar
  2. Adjust the image resolution (DPI) and output ratio
  3. Upload the PDF file
  4. Click "Start Processing"
  5. Wait for processing to complete
  6. Download the optimized PDF

Known Limitations

  1. Requires Vertex AI - Must use a GCP project and authentication
  2. Processing Time - Approximately 10-15 seconds per page
  3. API Cost - Billed by the number of API calls
  4. Fixed Ratio - The output image ratio is limited by the API

Future Improvements

  • Support batch processing of multiple PDFs
  • Add a preview function (only process the first page)
  • Cache processing results
  • Support more image formats (JPG, PNG, etc.)
  • Add a progress bar and estimated time
  • Error retry mechanism
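Of these, the retry mechanism is the most straightforward to sketch: exponential backoff around the per-page API call (with_retries is a hypothetical helper, not existing code):

```python
import time

def with_retries(fn, attempts=3, base_delay=2.0):
    """Retry fn with exponential backoff; re-raise after the last attempt."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            # Wait 2s, 4s, 8s, ... between attempts (with base_delay=2.0)
            time.sleep(base_delay * (2 ** attempt))
```

Wrapping the optimize_image_with_gemini call in with_retries would turn transient API failures into successes instead of failed pages.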
