Tushar Sengar

Building a Smarter Document Scanner with Gemini: A Developer's Guide

This blog post dives deeper into the technical implementation of an AI-powered business card scanner, with code examples and insights for developers who want to build similar applications.

  1. Project Setup

Next.js: I used create-next-app to initialize a new Next.js project:

npx create-next-app my-card-scanner
cd my-card-scanner

Install Dependencies:

npm install @google/generative-ai sharp

Gemini API Key:
Create a project in the Google Cloud Console.
Enable the "Gemini API" service.
Create an API key and store it securely, for example in the GEMINI_API_KEY environment variable that the API route reads.
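
The route code reads the key with `process.env.GEMINI_API_KEY || ''`, which silently passes an empty string to the SDK when the variable is missing. A minimal sketch of a fail-fast lookup is shown here; `requireEnv` and the `Env` type are hypothetical names, not part of the original project. In a Next.js app the variable itself can be defined in a `.env.local` file at the project root.

```typescript
// Hypothetical helper (not from the original post): look up a required
// setting and fail loudly if it is missing or blank.
type Env = Record<string, string | undefined>;

function requireEnv(name: string, env: Env): string {
  const value = env[name]?.trim();
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Assumed usage in the route file:
//   const genAI = new GoogleGenerativeAI(requireEnv('GEMINI_API_KEY', process.env));
```

Failing at startup makes a misconfigured deployment obvious, instead of surfacing later as a generic "Failed to process image" response.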

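API Route: the frontend posts to /api/gemini and the handler uses the App Router signature, so this code presumably lives in app/api/gemini/route.ts:

import { NextResponse } from 'next/server';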
import { GoogleGenerativeAI, Part } from '@google/generative-ai';
import sharp from 'sharp';

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY || '');

export async function POST(req: Request) {
  try {
    const formData = await req.formData();
    const file = formData.get('file') as Blob | null;

    if (!file) {
      return NextResponse.json({ error: 'No file uploaded' }, { status: 400 });
    }

    // Image conversion and Base64 encoding
    const originalBuffer = Buffer.from(await file.arrayBuffer());
    const jpegBuffer = await sharp(originalBuffer).jpeg().toBuffer();
    const base64Image = jpegBuffer.toString('base64');

    // Gemini API call
    const model = genAI.getGenerativeModel({ model: "gemini-1.5-flash" });
    const prompt = "This is a business card image. Please extract the data in JSON format.";
    const imageParts: Part[] = [
      {
        inlineData: {
          mimeType: 'image/jpeg',
          data: base64Image,
        },
      },
      { text: prompt },
    ];

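    const result = await model.generateContent({
      contents: [{ role: "user", parts: imageParts }],
      generationConfig: {
        maxOutputTokens: 300,
        // Ask for raw JSON so the JSON.parse call below is not tripped up by Markdown fences
        responseMimeType: "application/json",
      },
    });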

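    // Response handling and JSON parsing
    const content = result.response.text();
    let parsedData;
    try {
      parsedData = JSON.parse(content);
    } catch (e) {
      console.error('Error parsing JSON:', e);
      // Report the failure instead of returning success with no data.
      // A fallback extraction step could be tried here first.
      return NextResponse.json(
        { success: false, error: 'Could not parse the model response as JSON' },
        { status: 502 }
      );
    }

    return NextResponse.json({ success: true, parsedData });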
  } catch (error) {
    console.error('API Error:', error);
    return NextResponse.json({ success: false, error: 'Failed to process image' }, { status: 500 });
  }
}
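
The catch block in the route leaves fallback extraction as an exercise. One possible approach, sketched here under the assumption that the model sometimes wraps its JSON in a Markdown fence or surrounds it with a sentence of commentary, is to try progressively looser candidates. `parseModelJson` is a hypothetical helper name.

```typescript
// Hypothetical fallback parser: tries the raw text first, then the contents
// of a Markdown fence, then the outermost {...} span. Returns null if no
// candidate parses.
const FENCE = '`'.repeat(3); // three backticks, built this way to keep the listing readable

function parseModelJson(raw: string): unknown {
  const candidates: string[] = [raw.trim()];

  // Candidate 2: text between an opening fence (optionally tagged "json")
  // and the next closing fence.
  const fencedPattern = new RegExp(FENCE + '(?:json)?\\s*([\\s\\S]*?)' + FENCE, 'i');
  const fenced = raw.match(fencedPattern);
  if (fenced && fenced[1] !== undefined) {
    candidates.push(fenced[1].trim());
  }

  // Candidate 3: everything from the first "{" to the last "}".
  const start = raw.indexOf('{');
  const end = raw.lastIndexOf('}');
  if (start !== -1 && end > start) {
    candidates.push(raw.slice(start, end + 1));
  }

  for (const candidate of candidates) {
    try {
      return JSON.parse(candidate);
    } catch {
      // Not valid JSON; try the next candidate.
    }
  }
  return null;
}
```

In the route, the result of `parseModelJson(content)` could replace the bare `JSON.parse(content)` call, with a `null` result treated as a parse failure.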

  2. Frontend Component (pages/index.js)

"use client";
import React, { useState } from 'react';

const Home = () => {
  const [cardData, setCardData] = useState(null);
  const [isLoading, setIsLoading] = useState(false);

  const handleSubmit = async (event) => {
    event.preventDefault();
    setIsLoading(true);

    const formData = new FormData(event.target);

    try {
      const response = await fetch('/api/gemini', {
        method: 'POST',
        body: formData,
      });

      const data = await response.json();
      if (data.success) {
        setCardData(data.parsedData);
      } else {
        console.error('API Error:', data.error);
      }
    } catch (error) {
      console.error('Error:', error);
    } finally {
      setIsLoading(false);
    }
  };

  return (
    <div>
      <h1>Business Card Scanner</h1>
      <form onSubmit={handleSubmit}>
        <input type="file" name="file" accept="image/*" />
        <button type="submit" disabled={isLoading}>
          {isLoading ? 'Scanning...' : 'Scan'}
        </button>
      </form>

      {cardData && (
        <div>
          <h2>Extracted Data:</h2>
          <pre>{JSON.stringify(cardData, null, 2)}</pre>
        </div>
      )}
    </div>
  );
};

export default Home;
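
The submit handler above sends whatever is in the form and only logs failures to the console. A small pre-upload check, sketched here, could catch obvious problems before the request is made and return a message suitable for showing to the user. `checkUpload`, `UploadCandidate`, and the 4 MB cap are assumptions for illustration; the original post does not specify a size limit.

```typescript
// Hypothetical limit; tune it to your deployment and the API's documented quotas.
const MAX_UPLOAD_BYTES = 4 * 1024 * 1024; // assumed 4 MB cap

// The subset of the browser File interface that the check needs.
interface UploadCandidate {
  type: string;
  size: number;
}

// Returns a user-facing error message, or null when the file looks acceptable.
function checkUpload(
  file: UploadCandidate | null,
  maxBytes: number = MAX_UPLOAD_BYTES
): string | null {
  // An empty file input is submitted as a zero-byte File, so check size too.
  if (!file || file.size === 0) {
    return 'Please choose an image to scan.';
  }
  if (!file.type.startsWith('image/')) {
    return 'Only image files can be scanned.';
  }
  if (file.size > maxBytes) {
    const limitMb = (maxBytes / (1024 * 1024)).toFixed(1);
    return `That image is too large (limit: ${limitMb} MB).`;
  }
  return null;
}
```

Inside `handleSubmit`, the value of `formData.get('file')` could be passed through this check, with any returned message stored in component state and rendered next to the form.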
  3. Key Considerations

Image Optimization: Resize large images on the client side before uploading to reduce processing time and avoid exceeding API limits.
Prompt Engineering: Experiment with different prompts to improve the accuracy and structure of the extracted data.
Error Handling: Implement robust error handling in both the API route and the frontend component to provide informative messages to the user.
Data Validation: Validate and sanitize the extracted data before displaying or saving it to prevent potential issues.
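
The prompt engineering and data validation points work well together: if the prompt names the exact keys it expects, the same list can drive validation of the response. The sketch below illustrates the idea. The field list, `buildPrompt`, and `sanitizeCard` are all hypothetical, since the original prompt does not name any fields, and the email check is deliberately loose.

```typescript
// Hypothetical field list shared by the prompt and the validator.
const CARD_FIELDS = ['name', 'title', 'company', 'email', 'phone', 'website', 'address'] as const;
type CardField = (typeof CARD_FIELDS)[number];
type CardData = Partial<Record<CardField, string>>;

// Build a prompt that names the exact keys expected in the response.
function buildPrompt(): string {
  return [
    'This is a business card image.',
    `Return a single JSON object with only these keys: ${CARD_FIELDS.join(', ')}.`,
    'Use a string for each value and omit keys that are not on the card.',
    'Do not add commentary or Markdown formatting.',
  ].join(' ');
}

// Keep only known string fields, normalize whitespace, cap the length,
// and drop an email value that does not look like an address.
function sanitizeCard(input: unknown): CardData {
  const card: CardData = {};
  if (typeof input !== 'object' || input === null || Array.isArray(input)) {
    return card;
  }
  const record = input as Record<string, unknown>;
  for (const field of CARD_FIELDS) {
    const value = record[field];
    if (typeof value !== 'string') continue;
    const cleaned = value.replace(/\s+/g, ' ').trim().slice(0, 200);
    if (cleaned) card[field] = cleaned;
  }
  if (card.email && !/^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(card.email)) {
    delete card.email;
  }
  return card;
}
```

The route could pass `buildPrompt()` as the text part and run `sanitizeCard(parsedData)` before returning, so the frontend only ever receives known string fields.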

This implementation provides a solid foundation for building your own AI-powered business card scanner. With further refinements and customizations, you can create a truly innovative and user-friendly application.

I'm incredibly excited about the potential of Gemini to transform how we interact with information and each other. This business card scanner is just one example of what's possible. I'm eager to explore more applications of Gemini and collaborate with other developers to push the boundaries of AI-powered solutions. If you're passionate about AI and want to discuss ideas, share your projects, or even just geek out about the latest advancements, feel free to reach out to me. You can connect with me via email at tusharsengar26@gmail.com or WhatsApp at +918461806721. Let's build the future together!
