DEV Community

Evan Lin for Google Developer Experts

Posted on • Originally published at evanlin.com on

[Gemini 3.0][Image Generation] Building a PDF Text Optimization Tool with Gemini 3.0 Pro Image API


Background

I've been using NotebookLM recently to quickly create slides. While the tool is convenient, there's a frustrating issue: the generated Chinese characters often have blurry edges or come out garbled. Even if scrambled characters "don't affect reading," as an engineer I still want the output to look professional.

I saw someone online share an interesting workaround: take screenshots of the NotebookLM slides, upload them to Gemini 3.0 Pro with "thinking" + image generation enabled, and use a carefully designed prompt to repair the images. The results were genuinely good! But manually screenshotting, uploading, and copy-pasting the prompt every time was far too tedious.

So I decided: Why not make it an automated tool directly?

The features I want are simple:

  1. 📄 Upload a PDF file
  2. 🤖 Automatically use the Gemini API to optimize the text clarity of each page
  3. 📥 Download the optimized PDF

Sounds simple, right? But I encountered a lot of pitfalls during the actual development...

The Magical Prompt Used

Before getting into the development, let me share the image-optimization prompt (from that online share):

Role Definition
You are now a senior image repair expert equipped with a "Multi-modal Visual Cognitive Engine." You possess the core capabilities of Context-aware OCR and Generative Image Upscaling.

Mission Objective
Execute "Semantic-Level Image Reconstruction." For the input of low-resolution or blurry images, use logical deduction to repair the text content and output a 4K wide color gamut high-fidelity image.

Execution Protocol (Thought Chain and Execution Protocol)
Please strictly execute the following calculation process in the background and directly output the final image:

1. 【Optical & Logical Inference】
   Perform high-dimensional scanning of the image, locking the blurry text area (ROI).
   Activate "Contextual Semantic Analysis": Not only recognize pixels, but also deduce the original "Traditional Chinese" content of the blurry area based on the context, common vocabulary, and logic (Traditional Chinese).
   Fault tolerance mechanism: If pixel information is lost, prioritize semantic filling with the highest Confidence Score.

2. 【Isomorphic Visual Synthesis】
   Strictly inherit the original image's topological structure: layout, object coordinates, and perspective vanishing points must be completely locked with the original image.
   Style Transfer: Accurately capture the original image's design language (color scheme, material, lighting), and apply it to the new high-resolution canvas.

3. 【Vector-Grade Rendering】
   Perform "Anti-aliasing" and "Sharpening" on the edges of text and lines.
   Text strokes must present "printing-grade" clarity, completely eliminating JPEG compression artifacts and edge bleeding.

Exclusion Criteria (Negative Constraints)
Strictly prohibit the generation of unreadable "Gibberish" or Simplified Chinese.
Strictly prohibit changing the key composition structure of the original image.
Strictly prohibit outputting blurry, low-contrast, or overly smooth oil-painting-like images.

Output
Output the reconstructed image ONLY. No textual explanation required.


The key points of this prompt are:

  • ✅ Use "semantic reasoning" instead of pure OCR (can understand context)
  • ✅ Maintain the original layout
  • ✅ Generate high-resolution images
  • ✅ Force the use of Traditional Chinese

But for automation, I simplified it to a more direct version:

prompt_text = "Please optimize the text in this image to make it clearer and more readable. Maintain the original layout, but improve the quality, contrast, and clarity of the text. Please output the optimized image."


Even with this simplified prompt, Gemini 3.0's image generation proved effective; in fact, after testing, the results were even better than with the long version!

Technical Architecture

I decided to use the following technology stack:

| Technology | Purpose | Reason |
| --- | --- | --- |
| Streamlit | Web UI framework | Quickly build an interface and focus on business logic |
| google-genai | Gemini API SDK | Official SDK with image generation support |
| pdf2image | PDF to image | Stable and reliable |
| img2pdf | Image to PDF | Simple and efficient |
| Pillow | Image processing | De facto standard Python imaging library |

Problems Encountered During Development

Problem 1: Streamlit API Deprecation Warning

When I started developing with Streamlit 1.32.0, I encountered this error:

TypeError: ImageMixin.image() got an unexpected keyword argument 'use_container_width'


It turned out that the Streamlit version was too old, and the use_container_width parameter was introduced in 1.33.0+.

Solution: Upgrade Streamlit

pip install --upgrade streamlit


But after upgrading, a new warning appeared:

Please replace `use_container_width` with `width`.
`use_container_width` will be removed after 2025-12-31.


It turns out that the latest version has deprecated use_container_width and replaced it with the new width parameter!

Final Correction:

# ❌ Old API (deprecated)
st.image(image, use_container_width=True)
st.button("Button", use_container_width=True)

# ✅ New API
st.image(image, width='stretch')
st.button("Button", width='stretch')

| Old Parameter Value | New Parameter Value |
| --- | --- |
| use_container_width=True | width='stretch' |
| use_container_width=False | width='content' |
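Since the old keyword will be removed after 2025-12-31, one way to survive the transition is to pick the right keyword based on the installed version. This is just an illustrative sketch (width_kwargs is a made-up helper, and the (1, 44) cutoff is an assumption for the example; check your own version's changelog for the real boundary):

```python
def width_kwargs(streamlit_version: str, stretch: bool = True) -> dict:
    """Return the full-width kwargs appropriate for a given Streamlit version.

    NOTE: the (1, 44) cutoff is a placeholder assumption for this sketch.
    """
    major, minor = (int(p) for p in streamlit_version.split(".")[:2])
    if (major, minor) >= (1, 44):
        # Newer API: width accepts 'stretch' or 'content'
        return {"width": "stretch" if stretch else "content"}
    # Older API: boolean use_container_width
    return {"use_container_width": stretch}
```

You would then call, e.g., st.image(image, **width_kwargs(st.__version__)) so the same code runs on both sides of the deprecation.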

**Lesson:** API design evolves, and you need to pay attention to the official deprecation warnings.

Problem 2: google-genai Part.from_text Call Error

Then, when integrating the Gemini API, I encountered this error:

TypeError: Part.from_text() takes 1 positional argument but 2 were given


My original code:

# ❌ Incorrect API usage
contents = [
    types.Content(
        role="user",
        parts=[
            types.Part.from_text("Please optimize this image..."), # ❌ Error!
            types.Part.from_bytes(
                data=image_data,
                mime_type="image/png"
            )
        ]
    )
]


After checking the official documentation, I found that the API for google-genai 1.49.0 has changed!

Correct Usage:

# ✅ Correct API usage
contents = [
    types.Content(
        role="user",
        parts=[
            types.Part(text="Please optimize this image..."), # Use the text parameter directly
            types.Part(
                inline_data=types.Blob(
                    mime_type="image/png",
                    data=image_data
                )
            )
        ]
    )
]


API Change Comparison:

| Item | Old API | New API |
| --- | --- | --- |
| Text | Part.from_text(text) | Part(text=text) |
| Image | Part.from_bytes(data=..., mime_type=...) | Part(inline_data=Blob(...)) |

**Lesson:** SDKs are updated frequently, so check the latest official documentation and don't rely solely on Stack Overflow.

Problem 3: ImageConfig Parameter Validation Error

When configuring image generation parameters, I encountered another new problem:

pydantic_core._pydantic_core.ValidationError: 1 validation error for ImageConfig
output_mime_type
  Extra inputs are not permitted [type=extra_forbidden, input_value='image/png', input_type=str]


My original configuration:

# ❌ Error: output_mime_type is not supported
image_config=types.ImageConfig(
    aspect_ratio="1:1",
    image_size="2K",
    output_mime_type="image/png", # ❌ This parameter does not exist!
)


After consulting the official documentation, I found that ImageConfig only supports two parameters:

Correct Configuration:

# ✅ Correct: Only use supported parameters
image_config=types.ImageConfig(
    aspect_ratio="16:9", # Supported ratio
    image_size="2K" # Supported size
)


Supported Parameter Values:

| Parameter | Supported Values |
| --- | --- |
| aspect_ratio | "1:1", "2:3", "3:2", "3:4", "4:3", "4:5", "5:4", "9:16", "16:9", "21:9" |
| image_size | "1K", "2K", "4K" |

**Lesson:** When using an SDK validated by Pydantic, parameters must strictly conform to the schema; you can't add arbitrary ones.
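To make the "extra inputs are not permitted" behavior concrete, here is a plain-Python sketch of the kind of strict validation Pydantic performs. The allowed keys and values come from the tables above; validate_image_config is a hypothetical helper, not part of the SDK:

```python
VALID_RATIOS = {"1:1", "2:3", "3:2", "3:4", "4:3", "4:5", "5:4", "9:16", "16:9", "21:9"}
VALID_SIZES = {"1K", "2K", "4K"}
ALLOWED_KEYS = {"aspect_ratio", "image_size"}

def validate_image_config(**kwargs):
    """Reject unknown keys and out-of-range values, Pydantic-style."""
    extra = set(kwargs) - ALLOWED_KEYS
    if extra:
        # Mirrors Pydantic's extra_forbidden error
        raise ValueError(f"Extra inputs are not permitted: {sorted(extra)}")
    if kwargs.get("aspect_ratio", "1:1") not in VALID_RATIOS:
        raise ValueError(f"Unsupported aspect_ratio: {kwargs['aspect_ratio']}")
    if kwargs.get("image_size", "1K") not in VALID_SIZES:
        raise ValueError(f"Unsupported image_size: {kwargs['image_size']}")
    return kwargs
```

The point is that strict schemas fail loudly on the unknown key instead of silently ignoring it, which is exactly the error I hit.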

Problem 4: Image Aspect Ratio Does Not Match Expectations

During the first test, the generated image was vertical, but the NotebookLM slides were clearly horizontal 16:9!

Reason: I initially set aspect_ratio="3:4" (close to A4 paper ratio), which is suitable for documents but not for slides.

Solution:

# Change to the horizontal slide ratio
image_config=types.ImageConfig(
    aspect_ratio="16:9", # Horizontal slide
    image_size="2K"
)


But for a better user experience, I added a dropdown menu to let the user choose:

# Add options in the Streamlit sidebar
aspect_ratio = st.selectbox(
    "Output Ratio",
    options=["16:9", "4:3", "3:4", "9:16", "1:1"],
    index=0,
    help="Choose the aspect ratio of the output image. 16:9 is suitable for slides, 3:4 is suitable for documents"
)


**Lesson:** Don't assume the user's needs; provide options and let them decide.
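Rather than forcing a choice, you could also auto-suggest a default by matching the PDF page's own dimensions against the ratios the API supports. nearest_supported_ratio is a hypothetical convenience helper, not part of the app:

```python
SUPPORTED_RATIOS = ["1:1", "2:3", "3:2", "3:4", "4:3", "4:5", "5:4", "9:16", "16:9", "21:9"]

def nearest_supported_ratio(width: int, height: int) -> str:
    """Pick the supported aspect ratio closest to the page's own ratio."""
    target = width / height

    def ratio_value(r: str) -> float:
        w, h = r.split(":")
        return int(w) / int(h)

    # Choose the ratio whose numeric value is closest to the page's
    return min(SUPPORTED_RATIOS, key=lambda r: abs(ratio_value(r) - target))
```

This could pre-select the dropdown's index based on the first page, while still letting the user override it.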

Complete Implementation

Core Function: optimize_image_with_gemini

def optimize_image_with_gemini(image, api_key, aspect_ratio="16:9"):
    """Optimize the text in the image using the Gemini API"""
    try:
        # Initialize Vertex AI client
        client = genai.Client(
            vertexai=True,
            api_key=api_key,
        )

        # Convert the image to PNG bytes
        buffered = io.BytesIO()
        image.save(buffered, format="PNG")
        img_bytes = buffered.getvalue()

        # Use the Gemini 3.0 image generation model
        model = "gemini-3-pro-image-preview"

        # Build the request content
        prompt_text = "Please optimize the text in this image to make it clearer and more readable. Maintain the original layout, but improve the quality, contrast, and clarity of the text. Please output the optimized image."

        contents = [
            types.Content(
                role="user",
                parts=[
                    types.Part(text=prompt_text),
                    types.Part(
                        inline_data=types.Blob(
                            mime_type="image/png",
                            data=img_bytes  # raw bytes; no base64 round-trip needed
                        )
                    )
                ]
            )
        ]

        # Configure generation parameters
        generate_content_config = types.GenerateContentConfig(
            temperature=1,
            top_p=0.95,
            max_output_tokens=32768,
            response_modalities=["IMAGE"],
            safety_settings=[
                types.SafetySetting(
                    category="HARM_CATEGORY_HATE_SPEECH",
                    threshold="OFF"
                ),
                types.SafetySetting(
                    category="HARM_CATEGORY_DANGEROUS_CONTENT",
                    threshold="OFF"
                ),
                types.SafetySetting(
                    category="HARM_CATEGORY_SEXUALLY_EXPLICIT",
                    threshold="OFF"
                ),
                types.SafetySetting(
                    category="HARM_CATEGORY_HARASSMENT",
                    threshold="OFF"
                )
            ],
            image_config=types.ImageConfig(
                aspect_ratio=aspect_ratio,
                image_size="2K"
            ),
        )

        # Call the API
        response = client.models.generate_content(
            model=model,
            contents=contents,
            config=generate_content_config,
        )

        # Extract the generated image
        if response.candidates and len(response.candidates) > 0:
            candidate = response.candidates[0]
            if candidate.content.parts:
                for part in candidate.content.parts:
                    if hasattr(part, 'inline_data') and part.inline_data:
                        image_data = part.inline_data.data
                        optimized_image = Image.open(io.BytesIO(image_data))
                        return optimized_image

        # If no image is generated, return the original image
        st.warning("The API did not return an optimized image, using the original image")
        return image

    except Exception as e:
        st.error(f"Optimization failed: {str(e)}")
        return image


Main Process

def main():
    st.title("📄 PDF Text Optimization Tool")
    st.markdown("### Use Gemini AI to Optimize Text in PDFs")

    # Sidebar settings
    with st.sidebar:
        st.header("Settings")
        api_key = st.text_input(
            "Google Cloud API Key",
            type="password",
            value=os.environ.get("GOOGLE_CLOUD_API_KEY", ""),
        )

        dpi = st.slider("Image Resolution (DPI)", 150, 600, 300, 50)
        aspect_ratio = st.selectbox(
            "Output Ratio",
            options=["16:9", "4:3", "3:4", "9:16", "1:1"],
            index=0,
        )

    # Upload file
    uploaded_file = st.file_uploader("Select PDF File", type=['pdf'])

    if uploaded_file and st.button("🚀 Start Processing"):
        with tempfile.TemporaryDirectory() as temp_dir:
            # Save the upload to disk so pdf2image can read it
            pdf_path = os.path.join(temp_dir, uploaded_file.name)
            with open(pdf_path, "wb") as f:
                f.write(uploaded_file.getbuffer())

            # Step 1: PDF → Image
            images = convert_from_path(pdf_path, dpi=dpi)

            # Step 2: Optimize each page
            optimized_images = []
            for idx, img in enumerate(images):
                st.write(f"Processing page {idx + 1}/{len(images)}...")
                optimized_img = optimize_image_with_gemini(
                    img, api_key, aspect_ratio
                )
                optimized_images.append(optimized_img)

            # Step 3: Image → PDF
            output_pdf = images_to_pdf(optimized_images)

            # Step 4: Provide download
            st.download_button(
                label="⬇️ Download Optimized PDF",
                data=output_pdf,
                file_name=f"optimized_{uploaded_file.name}",
                mime="application/pdf",
            )

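main() calls an images_to_pdf helper that isn't shown above. The project's stack lists img2pdf for this step; as a self-contained sketch, Pillow alone can also write a multi-page PDF:

```python
import io
from PIL import Image

def images_to_pdf(images):
    """Merge a list of PIL images into one PDF, returned as bytes.

    Sketch only: the real project uses img2pdf, but Pillow's PDF writer
    is enough to show the shape of the helper main() expects.
    """
    pages = [img.convert("RGB") for img in images]  # PDF pages must be RGB
    buf = io.BytesIO()
    pages[0].save(buf, format="PDF", save_all=True, append_images=pages[1:])
    return buf.getvalue()
```

The returned bytes can be handed straight to st.download_button as the data argument.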

Actual Results


(The left is generated by NotebookLM, and the right is redrawn by Gemini-3.0-pro-image)

Processing Flow

  1. Upload a PDF file
  2. The system automatically converts each page to an image (DPI adjustable)
  3. Each page calls the Gemini API for optimization
  4. Display processing progress and success/failure statistics
  5. Reassemble the optimized images into a PDF
  6. Provide a download button

Before and After Optimization Comparison

The application will display a before-and-after comparison of the first page:

Before and After Optimization (First Page):
┌────────────────┬────────────────┐
│ Original Image │   Optimized    │
│    (Blurry)    │    (Clear)     │
└────────────────┴────────────────┘


Processing Statistics

📄 Processing page 1/10...
  → Initializing Gemini client...
  → Converting image format...
  → Using model: gemini-3-pro-image-preview
  → Calling Gemini API for optimization...
  → Received API response, parsing results...
  → ✅ Successfully generated optimized image
✅ Page 1 optimization successful

...

Successfully optimized: 8 pages | Failed: 2 pages

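The final statistics line is just a fold over the per-page outcomes. A minimal sketch (summarize_results is a made-up name, not a function from the app):

```python
def summarize_results(results):
    """Turn a list of per-page success booleans into the summary line."""
    ok = sum(1 for success in results if success)
    return f"Successfully optimized: {ok} pages | Failed: {len(results) - ok} pages"
```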

Development Experience

1. API Documentation Must Be Up-to-Date

The pitfalls I encountered this time were mostly due to API updates:

  • Part.from_text() → Part(text=...)
  • use_container_width → width='stretch'

**Lesson:** Don't just look at Stack Overflow or old tutorials; be sure to check the latest official documentation.

2. Pydantic Validation is a Double-Edged Sword

google-genai uses Pydantic for parameter validation. The advantage is that errors can be quickly found, but the disadvantage is that a slight typo will cause an error.

**Suggestion:** Use your IDE's auto-completion, or copy and paste directly from the official examples.

3. Limitations of Image Generation APIs

Currently, the Gemini image generation API has some limitations:

  • Must go through Vertex AI (the general Developer API cannot be used)
  • Requires setting up a GCP project and authentication
  • Fixed output ratio (cannot freely specify pixel size)

But the advantages are:

  • ✅ Extremely high generation quality (especially text clarity)
  • ✅ Can understand semantics (not just simple filters)
  • ✅ Supports multiple aspect ratio options

4. User Experience of Batch Processing

When processing multi-page PDFs, user experience is important:

  • ✅ Display real-time progress (page X/Y)
  • ✅ Display the processing status of each page
  • ✅ Statistics of successes/failures
  • ✅ Display detailed error messages

These small details make the tool more professional.

5. Cost Considerations

The Gemini image generation API is paid. Processing a 10-page PDF means:

  • 10 API calls
  • One 2K image generated per call

**Suggestion:** In a production environment, consider cost control:

  • Limit the number of pages processed at a time
  • Provide a preview function (only process the first page)
  • Cache processed results
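The caching idea can be as simple as keying on a hash of each page's PNG bytes, so re-running the same PDF doesn't spend another API call on unchanged pages. cached_optimize is an illustrative wrapper (assumption: identical input bytes always yield an acceptable cached result), not part of the app:

```python
import hashlib

_cache = {}  # sha256 hex digest -> optimized result

def cached_optimize(img_bytes, optimize_fn):
    """Call optimize_fn(img_bytes) only on a cache miss."""
    key = hashlib.sha256(img_bytes).hexdigest()
    if key not in _cache:
        _cache[key] = optimize_fn(img_bytes)
    return _cache[key]
```

In a real deployment the dict would be replaced with something persistent (disk or a key-value store) so the cache survives restarts.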

6. The Value of Going from Manual to Automated

The original process:

1. Generate slides in NotebookLM
2. Screenshot each page
3. Upload to Gemini AI Studio
4. Copy and paste the prompt
5. Download the optimized image
6. Repeat steps 2-5 N times
7. Manually merge into a PDF


After automation:

1. Upload PDF
2. Click "Start Processing"
3. Download the optimized PDF


Time saved: From 30 minutes for 10 pages → 3 minutes (API call time)

Summary

If you also encounter similar image optimization needs:

  • Use the Gemini image generation API - much better than traditional OCR + filters
  • Pay attention to the API version - SDKs are updated quickly, so check the latest documentation
  • Value user experience - progress display, error handling are important
  • Consider costs - commercial applications need to evaluate API call costs

This tool is simple, but it does solve my pain points. From manual processing to one-click completion, this is the value of automation!

Project Structure

nano-nblm-pdf/
├── app.py # Streamlit main program
├── requirements.txt # Dependency packages
├── .env.example # Environment variable example
└── README.md # Usage instructions


Environment Setup

Required Packages

streamlit>=1.40.0
google-genai>=1.0.0
pdf2image==1.17.0
Pillow==10.2.0
img2pdf==0.5.1


Environment Variables

# Google Cloud API Key (required)
export GOOGLE_CLOUD_API_KEY="your-api-key"


Start the Application

# Install dependencies
pip install -r requirements.txt

# Start Streamlit
streamlit run app.py


Usage Instructions

  1. Enter your Google Cloud API Key in the sidebar
  2. Adjust the image resolution (DPI) and output ratio
  3. Upload the PDF file
  4. Click "Start Processing"
  5. Wait for processing to complete
  6. Download the optimized PDF

Known Limitations

  1. Requires Vertex AI - Must use a GCP project and authentication
  2. Processing Time - Approximately 10-15 seconds per page
  3. API Cost - Billed by the number of API calls
  4. Fixed Ratio - The output image ratio is limited by the API

Future Improvements

  • Support batch processing of multiple PDFs
  • Add a preview function (only process the first page)
  • Cache processing results
  • Support more image formats (JPG, PNG, etc.)
  • Add a progress bar and estimated time
  • Error retry mechanism
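Of these, the retry mechanism is the most straightforward to sketch: exponential backoff around the per-page API call (with_retries is a hypothetical helper, not existing code):

```python
import time

def with_retries(fn, attempts=3, base_delay=2.0):
    """Retry fn with exponential backoff; re-raise after the last attempt."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            # Wait 2s, 4s, 8s, ... between attempts (with base_delay=2.0)
            time.sleep(base_delay * (2 ** attempt))
```

Wrapping the optimize_image_with_gemini call in with_retries would turn transient API failures into successes instead of failed pages.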
