Background
I've been using NotebookLM recently to create slides quickly. While the tool is convenient, there's a frustrating issue: the generated Chinese characters often have blurry edges or come out garbled. Even though "the order does not affect reading," as an engineer I still wanted the output to look professional.
I saw someone online sharing an interesting method: take screenshots of the NotebookLM slides, upload them to Gemini 3.0 Pro's "thinking" + image feature, and use a carefully designed prompt to repair the images. The actual results were really good! But manually screenshotting, uploading, and copy-pasting the prompt every time was too troublesome.
So I decided: Why not make it an automated tool directly?
The features I want are simple:
- 📄 Upload a PDF file
- 🤖 Automatically use the Gemini API to optimize the text clarity of each page
- 📥 Download the optimized PDF
Sounds simple, right? But I encountered a lot of pitfalls during the actual development...
The Magical Prompt Used
Before getting into the development, let me share the image-optimization prompt that started all this (from the online share):
Role Definition
You are now a senior image repair expert equipped with a "Multi-modal Visual Cognitive Engine." You possess the core capabilities of Context-aware OCR and Generative Image Upscaling.
Mission Objective
Execute "Semantic-Level Image Reconstruction." For the input of low-resolution or blurry images, use logical deduction to repair the text content and output a 4K wide color gamut high-fidelity image.
Execution Protocol (Chain of Thought)
Please strictly execute the following calculation process in the background and directly output the final image:
1. 【Optical & Logical Inference】
Perform high-dimensional scanning of the image, locking the blurry text area (ROI).
Activate "Contextual Semantic Analysis": Not only recognize pixels, but also deduce the original Traditional Chinese content of the blurry area based on the context, common vocabulary, and logic.
Fault tolerance mechanism: If pixel information is lost, prioritize semantic filling with the highest Confidence Score.
2. 【Isomorphic Visual Synthesis】
Strictly inherit the original image's topological structure: layout, object coordinates, and perspective vanishing points must be completely locked with the original image.
Style Transfer: Accurately capture the original image's design language (color scheme, material, lighting), and apply it to the new high-resolution canvas.
3. 【Vector-Grade Rendering】
Perform "Anti-aliasing" and "Sharpening" on the edges of text and lines.
Text strokes must present "printing-grade" clarity, completely eliminating JPEG compression artifacts and edge bleeding.
Exclusion Criteria (Negative Constraints)
Strictly prohibit the generation of unreadable "Gibberish" or Simplified Chinese.
Strictly prohibit changing the key composition structure of the original image.
Strictly prohibit outputting blurry, low-contrast, or overly smooth oil-painting-like images.
Output
Output the reconstructed image ONLY. No textual explanation required.
The key points of this prompt are:
- ✅ Use "semantic reasoning" instead of pure OCR (can understand context)
- ✅ Maintain the original layout
- ✅ Generate high-resolution images
- ✅ Force the use of Traditional Chinese
But for automation, I simplified it to a more direct version:
prompt_text = "Please optimize the text in this image to make it clearer and more readable. Maintain the original layout, but improve the quality, contrast, and clarity of the text. Please output the optimized image."
Even in this simplified form, combined with Gemini 3.0's image generation capabilities, the prompt is not only effective; in my testing the results were even better!
Technical Architecture
I decided to use the following technology stack:
| Technology | Purpose | Reason |
|---|---|---|
| Streamlit | Web UI framework | Quickly create an interface, focus on business logic |
| google-genai | Gemini API SDK | Official SDK, supports image generation |
| pdf2image | PDF to image | Stable and reliable |
| img2pdf | Image to PDF | Simple and efficient |
| Pillow | Image processing | De facto standard imaging library for Python |
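Put together, the stack forms a three-step pipeline: rasterize the PDF, optimize each page, rebuild the PDF. The sketch below is illustrative only; the three callables are injected so each step stays swappable (in the real app they wrap `pdf2image.convert_from_path`, the Gemini call, and `img2pdf` respectively):

```python
def process_pdf(pdf_path, pdf_to_images, optimize_page, images_to_pdf, dpi=300):
    """Three-step pipeline: PDF -> page images -> optimized images -> PDF.

    All three steps are passed in as callables so this sketch stays
    library-agnostic and easy to test with stubs.
    """
    pages = pdf_to_images(pdf_path, dpi=dpi)          # Step 1: rasterize
    optimized = [optimize_page(page) for page in pages]  # Step 2: per-page API call
    return images_to_pdf(optimized)                   # Step 3: re-assemble
```

Dependency injection here is just for illustration; the actual app calls the concrete functions directly.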
Problems Encountered During Development
Problem 1: Streamlit API Deprecation Warning
When I started developing with Streamlit 1.32.0, I encountered this error:
TypeError: ImageMixin.image() got an unexpected keyword argument 'use_container_width'
It turned out that the Streamlit version was too old, and the use_container_width parameter was introduced in 1.33.0+.
Solution: Upgrade Streamlit
pip install --upgrade streamlit
But after upgrading, a new warning appeared:
Please replace `use_container_width` with `width`.
`use_container_width` will be removed after 2025-12-31.
It turns out that the latest version has deprecated use_container_width and replaced it with the new width parameter!
Final Correction:
```python
# ❌ Old API (deprecated)
st.image(image, use_container_width=True)
st.button("Button", use_container_width=True)

# ✅ New API
st.image(image, width='stretch')
st.button("Button", width='stretch')
```
| Old Parameter Value | New Parameter Value |
|---|---|
| `use_container_width=True` | `width='stretch'` |
| `use_container_width=False` | `width='content'` |
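To keep one codebase working across Streamlit versions during such a migration, a tiny compatibility helper (hypothetical, not part of Streamlit or this project) can centralize the mapping:

```python
def width_kwargs(use_container_width):
    """Map the deprecated use_container_width flag to the new width= keyword.

    use_container_width=True  -> width='stretch'
    use_container_width=False -> width='content'
    """
    return {"width": "stretch" if use_container_width else "content"}

# Usage (st is the streamlit module):
# st.image(image, **width_kwargs(True))   # equivalent to width='stretch'
```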
**Lesson:** API design evolves, and you need to pay attention to the official deprecation warnings.
Problem 2: google-genai Part.from_text Call Error
Then, when integrating the Gemini API, I encountered this error:
TypeError: Part.from_text() takes 1 positional argument but 2 were given
My original code:
```python
# ❌ Incorrect API usage
contents = [
    types.Content(
        role="user",
        parts=[
            types.Part.from_text("Please optimize this image..."),  # ❌ Error!
            types.Part.from_bytes(
                data=image_data,
                mime_type="image/png"
            )
        ]
    )
]
```
After checking the official documentation, I found that the API for google-genai 1.49.0 has changed!
Correct Usage:
```python
# ✅ Correct API usage
contents = [
    types.Content(
        role="user",
        parts=[
            types.Part(text="Please optimize this image..."),  # Use the text parameter directly
            types.Part(
                inline_data=types.Blob(
                    mime_type="image/png",
                    data=image_data
                )
            )
        ]
    )
]
```
API Change Comparison:
| Item | Old API | New API |
|---|---|---|
| Text | `Part.from_text(text)` | `Part(text=text)` |
| Image | `Part.from_bytes(data=..., mime_type=...)` | `Part(inline_data=Blob(...))` |
**Lesson:** SDKs are updated frequently, so you need to check the latest official documentation and not rely solely on Stack Overflow.
Problem 3: ImageConfig Parameter Validation Error
When configuring image generation parameters, I encountered another new problem:
pydantic_core._pydantic_core.ValidationError: 1 validation error for ImageConfig
output_mime_type
Extra inputs are not permitted [type=extra_forbidden, input_value='image/png', input_type=str]
My original configuration:
```python
# ❌ Error: output_mime_type is not supported
image_config=types.ImageConfig(
    aspect_ratio="1:1",
    image_size="2K",
    output_mime_type="image/png",  # ❌ This parameter does not exist!
)
```
After consulting the official documentation, I found that ImageConfig only supports two parameters:
Correct Configuration:
```python
# ✅ Correct: Only use supported parameters
image_config=types.ImageConfig(
    aspect_ratio="16:9",  # Supported ratio
    image_size="2K"       # Supported size
)
```
Supported Parameter Values:
| Parameter | Supported Values |
|---|---|
| `aspect_ratio` | `"1:1"`, `"2:3"`, `"3:2"`, `"3:4"`, `"4:3"`, `"4:5"`, `"5:4"`, `"9:16"`, `"16:9"`, `"21:9"` |
| `image_size` | `"1K"`, `"2K"`, `"4K"` |
**Lesson:** When using a Pydantic-validated SDK, parameters must strictly conform to the schema; you cannot add arbitrary fields.
Problem 4: Image Aspect Ratio Does Not Match Expectations
During the first test, the generated image was vertical, but the NotebookLM slides were clearly horizontal 16:9!
Reason: I initially set aspect_ratio="3:4" (close to A4 paper ratio), which is suitable for documents but not for slides.
Solution:
```python
# Change to the horizontal slide ratio
image_config=types.ImageConfig(
    aspect_ratio="16:9",  # Horizontal slide
    image_size="2K"
)
```
But for a better user experience, I added a dropdown menu to let the user choose:
```python
# Add options in the Streamlit sidebar
aspect_ratio = st.selectbox(
    "Output Ratio",
    options=["16:9", "4:3", "3:4", "9:16", "1:1"],
    index=0,
    help="Choose the aspect ratio of the output image. 16:9 is suitable for slides, 3:4 is suitable for documents"
)
```
**Lesson:** Don't assume the user's needs; provide options and let them decide.
Complete Implementation
Core Function: optimize_image_with_gemini
```python
def optimize_image_with_gemini(image, api_key, aspect_ratio="16:9"):
    """Optimize the text in the image using the Gemini API"""
    try:
        # Initialize Vertex AI client
        client = genai.Client(
            vertexai=True,
            api_key=api_key,
        )

        # Convert the image to PNG bytes
        buffered = io.BytesIO()
        image.save(buffered, format="PNG")
        img_bytes = buffered.getvalue()

        # Use the Gemini 3.0 image generation model
        model = "gemini-3-pro-image-preview"

        # Build the request content
        prompt_text = "Please optimize the text in this image to make it clearer and more readable. Maintain the original layout, but improve the quality, contrast, and clarity of the text. Please output the optimized image."
        contents = [
            types.Content(
                role="user",
                parts=[
                    types.Part(text=prompt_text),
                    types.Part(
                        inline_data=types.Blob(
                            mime_type="image/png",
                            data=img_bytes
                        )
                    )
                ]
            )
        ]

        # Configure generation parameters
        generate_content_config = types.GenerateContentConfig(
            temperature=1,
            top_p=0.95,
            max_output_tokens=32768,
            response_modalities=["IMAGE"],
            safety_settings=[
                types.SafetySetting(
                    category="HARM_CATEGORY_HATE_SPEECH",
                    threshold="OFF"
                ),
                types.SafetySetting(
                    category="HARM_CATEGORY_DANGEROUS_CONTENT",
                    threshold="OFF"
                ),
                types.SafetySetting(
                    category="HARM_CATEGORY_SEXUALLY_EXPLICIT",
                    threshold="OFF"
                ),
                types.SafetySetting(
                    category="HARM_CATEGORY_HARASSMENT",
                    threshold="OFF"
                )
            ],
            image_config=types.ImageConfig(
                aspect_ratio=aspect_ratio,
                image_size="2K"
            ),
        )

        # Call the API
        response = client.models.generate_content(
            model=model,
            contents=contents,
            config=generate_content_config,
        )

        # Extract the generated image from the response parts
        if response.candidates and len(response.candidates) > 0:
            candidate = response.candidates[0]
            if candidate.content.parts:
                for part in candidate.content.parts:
                    if hasattr(part, 'inline_data') and part.inline_data:
                        image_data = part.inline_data.data
                        optimized_image = Image.open(io.BytesIO(image_data))
                        return optimized_image

        # If no image is generated, return the original image
        st.warning("The API did not return an optimized image, using the original image")
        return image
    except Exception as e:
        st.error(f"Optimization failed: {str(e)}")
        return image
```
Main Process
```python
def main():
    st.title("📄 PDF Text Optimization Tool")
    st.markdown("### Use Gemini AI to Optimize Text in PDFs")

    # Sidebar settings
    with st.sidebar:
        st.header("Settings")
        api_key = st.text_input(
            "Google Cloud API Key",
            type="password",
            value=os.environ.get("GOOGLE_CLOUD_API_KEY", ""),
        )
        dpi = st.slider("Image Resolution (DPI)", 150, 600, 300, 50)
        aspect_ratio = st.selectbox(
            "Output Ratio",
            options=["16:9", "4:3", "3:4", "9:16", "1:1"],
            index=0,
        )

    # Upload file
    uploaded_file = st.file_uploader("Select PDF File", type=['pdf'])

    if uploaded_file and st.button("🚀 Start Processing"):
        with tempfile.TemporaryDirectory() as temp_dir:
            # Save the uploaded PDF to a temporary path for pdf2image
            pdf_path = os.path.join(temp_dir, uploaded_file.name)
            with open(pdf_path, "wb") as f:
                f.write(uploaded_file.getvalue())

            # Step 1: PDF → Image
            images = convert_from_path(pdf_path, dpi=dpi)

            # Step 2: Optimize each page
            optimized_images = []
            for idx, img in enumerate(images):
                st.write(f"Processing page {idx + 1}/{len(images)}...")
                optimized_img = optimize_image_with_gemini(
                    img, api_key, aspect_ratio
                )
                optimized_images.append(optimized_img)

            # Step 3: Image → PDF
            output_pdf = images_to_pdf(optimized_images)

            # Step 4: Provide download
            st.download_button(
                label="⬇️ Download Optimized PDF",
                data=output_pdf,
                file_name=f"optimized_{uploaded_file.name}",
                mime="application/pdf",
            )
```
Actual Results
(The left is generated by NotebookLM, and the right is redrawn by Gemini-3.0-pro-image)
Processing Flow
- Upload a PDF file
- The system automatically converts each page to an image (DPI adjustable)
- Each page calls the Gemini API for optimization
- Display processing progress and success/failure statistics
- Reassemble the optimized images into a PDF
- Provide a download button
Before and After Optimization Comparison
The application will display a before-and-after comparison of the first page:
Before and After Optimization (First Page):
┌────────────┬────────────┐
│ Original Image │ Optimized │
│ (Blurry) │ (Clear) │
└────────────┴────────────┘
Processing Statistics
📄 Processing page 1/10...
→ Initializing Gemini client...
→ Converting image format...
→ Using model: gemini-3-pro-image-preview
→ Calling Gemini API for optimization...
→ Received API response, parsing results...
→ ✅ Successfully generated optimized image
✅ Page 1 optimization successful
...
Successfully optimized: 8 pages | Failed: 2 pages
Development Experience
1. API Documentation Must Be Up-to-Date
The pitfalls I encountered this time were mostly due to API updates:
- `Part.from_text()` → `Part(text=...)`
- `use_container_width` → `width='stretch'`
**Lesson:** Don't just look at Stack Overflow or old tutorials; be sure to check the latest official documentation.
2. Pydantic Validation is a Double-Edged Sword
google-genai uses Pydantic for parameter validation. The advantage is that errors can be quickly found, but the disadvantage is that a slight typo will cause an error.
**Suggestion:** Use the IDE's auto-complete function, or copy and paste directly from the official examples.
3. Limitations of Image Generation APIs
Currently, the Gemini image generation API has some limitations:
- Must be through Vertex AI (cannot use the general Developer API)
- Requires setting up a GCP project and authentication
- Fixed output ratio (cannot freely specify pixel size)
But the advantages are:
- ✅ Extremely high generation quality (especially text clarity)
- ✅ Can understand semantics (not just simple filters)
- ✅ Supports multiple aspect ratio options
4. User Experience of Batch Processing
When processing multi-page PDFs, user experience is important:
- ✅ Display real-time progress (page X/Y)
- ✅ Display the processing status of each page
- ✅ Statistics of successes/failures
- ✅ Display detailed error messages
These small details make the tool more professional.
5. Cost Considerations
The Gemini image generation API is paid. Processing a 10-page PDF means:
- 10 API calls
- One 2K image generated per call

**Suggestion:** In a production environment, you need to consider cost control:
- Limit the number of pages processed at a time
- Provide a preview function (only process the first page)
- Cache processed results
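For the caching idea above, one sketch (hypothetical; the project does not currently implement this) is to key a cache on a hash of each page's PNG bytes, so identical pages never trigger a second API call:

```python
import hashlib

_cache = {}

def optimize_with_cache(image_bytes, optimize_fn):
    """Return a cached result for identical page bytes.

    optimize_fn is whatever performs the actual Gemini call, e.g. a small
    wrapper around optimize_image_with_gemini; it runs at most once per
    unique input.
    """
    key = hashlib.sha256(image_bytes).hexdigest()
    if key not in _cache:
        _cache[key] = optimize_fn(image_bytes)
    return _cache[key]
```

An on-disk cache keyed the same way would also survive app restarts.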
6. The Value of Going from Manual to Automated
The original process:
1. Generate slides in NotebookLM
2. Screenshot each page
3. Upload to Gemini AI Studio
4. Copy and paste the prompt
5. Download the optimized image
6. Repeat steps 2-5 N times
7. Manually merge into a PDF
After automation:
1. Upload PDF
2. Click "Start Processing"
3. Download the optimized PDF
Time saved: From 30 minutes for 10 pages → 3 minutes (API call time)
Summary
If you also encounter similar image optimization needs:
- ✅ Use the Gemini image generation API - much better than traditional OCR + filters
- ✅ Pay attention to the API version - SDKs are updated quickly, so check the latest documentation
- ✅ Value user experience - progress display, error handling are important
- ✅ Consider costs - commercial applications need to evaluate API call costs
This tool is simple, but it does solve my pain points. From manual processing to one-click completion, this is the value of automation!
Project Structure
nano-nblm-pdf/
├── app.py # Streamlit main program
├── requirements.txt # Dependency packages
├── .env.example # Environment variable example
└── README.md # Usage instructions
Environment Setup
Required Packages
streamlit>=1.40.0
google-genai>=1.0.0
pdf2image==1.17.0
Pillow==10.2.0
img2pdf==0.5.1
Environment Variables
# Google Cloud API Key (required)
export GOOGLE_CLOUD_API_KEY="your-api-key"
Start the Application
# Install dependencies
pip install -r requirements.txt
# Start Streamlit
streamlit run app.py
Usage Instructions
- Enter your Google Cloud API Key in the sidebar
- Adjust the image resolution (DPI) and output ratio
- Upload the PDF file
- Click "Start Processing"
- Wait for processing to complete
- Download the optimized PDF
Known Limitations
- Requires Vertex AI - Must use a GCP project and authentication
- Processing Time - Approximately 10-15 seconds per page
- API Cost - Billed by the number of API calls
- Fixed Ratio - The output image ratio is limited by the API
Future Improvements
- Support batch processing of multiple PDFs
- Add a preview function (only process the first page)
- Cache processing results
- Support more image formats (JPG, PNG, etc.)
- Add a progress bar and estimated time
- Error retry mechanism
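The retry item in this list could be implemented as a small exponential-backoff wrapper around the API call (a hypothetical sketch, not part of the current code):

```python
import time

def with_retry(fn, attempts=3, base_delay=1.0):
    """Call fn(), retrying on any exception with exponential backoff.

    Delays are base_delay, 2*base_delay, 4*base_delay, ... between tries;
    the last failure is re-raised.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

Usage would be e.g. `with_retry(lambda: optimize_image_with_gemini(img, api_key, ratio))`.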
References
- Gemini API Image Generation Documentation
- google-genai Python SDK
- Streamlit Official Documentation
- pdf2image Usage Guide
Project Link
- GitHub Repository: https://github.com/kkdai/nano-nblm-pdf

