DEV Community: ppaanngggg

YOLOv12: The Next Evolution in Document Layout Analysis

ppaanngggg — Mon, 07 Apr 2025 14:47:33 +0000

Introduction

Building upon the success of YOLOv11, which demonstrated significant improvements over YOLOv8 in document layout analysis, YOLOv12 introduces several architectural innovations and optimization techniques that further advance the field. I retrained the complete YOLOv12 series on the DocLayNet dataset.

The project uses my codebase, yolo-doclaynet. You can find all the free models on Hugging Face, while the largest model is available here (trained using rented GPU resources).

Key Improvements in YOLOv12

Area Attention Mechanism: This feature efficiently processes large receptive fields by dividing the feature map into equal-sized areas (typically 4) and applying attention within each area. This approach maintains effectiveness while significantly reducing computational cost compared to standard self-attention.
Residual Efficient Layer Aggregation Networks (R-ELAN): This improved feature aggregation module introduces block-level residual connections with scaling and a redesigned bottleneck-like structure for better optimization in large-scale attention models.
Optimized Attention Architecture: This streamlined attention mechanism incorporates multiple efficiency improvements, including FlashAttention, removed positional encoding, and adjusted MLP ratios. It also uses a 7x7 separable convolution as a "position perceiver" and leverages strategic convolution operations.
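
To get a feel for why area attention is cheaper, here is a rough back-of-the-envelope sketch. It only counts query-key pairs and ignores everything else in the layer, and the 80x80 feature map size is an assumption I picked for illustration, not a number from the YOLOv12 code.

```python
def attention_pairs(num_tokens: int, num_areas: int = 1) -> int:
    """Count query-key pairs when tokens are split into equal areas
    and attention is computed only inside each area."""
    if num_tokens % num_areas != 0:
        raise ValueError("num_tokens must divide evenly into num_areas")
    tokens_per_area = num_tokens // num_areas
    return num_areas * tokens_per_area * tokens_per_area


# Hypothetical 80x80 feature map, i.e. 6400 tokens (an assumed size).
full = attention_pairs(80 * 80)               # standard self-attention
area = attention_pairs(80 * 80, num_areas=4)  # attention inside 4 areas
print(full, area, full // area)
```

With 4 areas, the number of pairs drops to a quarter of the full self-attention count, whatever the feature map size.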

Experimental Results

Using the DocLayNet dataset, I evaluated all YOLOv12 variants against previous series such as YOLOv11 and YOLOv8. The results show that YOLOv12 outperforms YOLOv8 at every size. The smaller YOLOv12 models (nano, small, and medium) also improve on YOLOv11 by 0.007 to 0.021 mAP, while the larger models (large and extra) perform about the same as their YOLOv11 counterparts.

Performance Metrics

The figure shows the comparative performance metrics (model size and mAP scores) across different YOLO versions, from YOLOv8 to YOLOv12, demonstrating the evolution and improvements in model efficiency.

The table below compares model sizes (in millions of parameters) and mAP scores across YOLO versions from v8 to v12. Each cell lists parameters / mAP.

Size/Model	YOLOv12	YOLOv11	YOLOv10	YOLOv9	YOLOv8
Nano	2.6M/0.756	2.6M/0.735	2.3M/0.730	2.0M/0.737	3.2M/0.718
Small	9.3M/0.782	9.4M/0.767	7.2M/0.762	7.2M/0.766	11.2M/0.752
Medium	20.2M/0.788	20.1M/0.781	15.4M/0.780	20.1M/0.775	25.9M/0.775
Large	26.4M/0.792	25.3M/0.793	24.4M/0.790	25.5M/0.782	43.7M/0.783
Extra	59.1M/0.794	56.9M/0.794	29.5M/0.793	-	68.2M/0.787

Key Findings

Consistent Small Model Improvements: YOLOv12 shows notable performance gains in nano through medium variants, with mAP improvements of up to 0.021 points compared to YOLOv11.
Efficient Area Attention: The new Area Attention Mechanism successfully reduces computational complexity while maintaining high accuracy, particularly evident in the nano and small models.
Parameter Efficiency: YOLOv12 achieves superior or comparable performance with significantly fewer parameters than YOLOv8, with the large model using only 26.4M parameters compared to YOLOv8's 43.7M.
Competitive Large Model Performance: Larger variants (Large and Extra) maintain performance parity with YOLOv11, demonstrating mAP scores of 0.792 and 0.794 respectively.
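
To double-check these findings, here is a small sketch that recomputes the mAP gaps from the table above. The numbers are copied from the table; the helper name gain is just something I made up for this example.

```python
# Overall mAP values copied from the comparison table.
MAP_SCORES = {
    "Nano": {"YOLOv12": 0.756, "YOLOv11": 0.735, "YOLOv8": 0.718},
    "Small": {"YOLOv12": 0.782, "YOLOv11": 0.767, "YOLOv8": 0.752},
    "Medium": {"YOLOv12": 0.788, "YOLOv11": 0.781, "YOLOv8": 0.775},
    "Large": {"YOLOv12": 0.792, "YOLOv11": 0.793, "YOLOv8": 0.783},
    "Extra": {"YOLOv12": 0.794, "YOLOv11": 0.794, "YOLOv8": 0.787},
}


def gain(size: str, newer: str, older: str) -> float:
    """mAP difference between two series at one model size, rounded to 3 places."""
    return round(MAP_SCORES[size][newer] - MAP_SCORES[size][older], 3)


for size in MAP_SCORES:
    print(size, gain(size, "YOLOv12", "YOLOv11"), gain(size, "YOLOv12", "YOLOv8"))
```

The largest gain over YOLOv11 is the nano model, and the large model is the only one that comes out slightly behind.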

Conclusion

YOLOv12 is a solid step forward for document layout analysis. On DocLayNet it improves accuracy over YOLOv11 for the nano, small, and medium models and roughly matches it for the large and extra models, while using fewer parameters than YOLOv8 at every size.

The nano and small models show substantial improvements with minimal computational cost, making them ideal for mobile devices and other edge computing applications requiring document layout analysis.

Design0: Effortless Design for Everyone

ppaanngggg — Wed, 06 Nov 2024 01:36:54 +0000

This is a submission for the Open Source AI Challenge with pgai and Ollama

Cover image is designed by Design0!

What I Built

As a non-professional designer seeking to create attractive posts and cover images for my blogs and projects, I envisioned a tool that would allow me to select a base image, highlight specific areas, and use natural language to instruct AI on desired edits, overcoming the unpredictability of current text-to-image generators.

Enter Design0—an AI-powered design tool I built to simplify image editing with natural language commands. Using Design0, you can search an image database by description (I've included 5,000 free Unsplash images for this demo). Once you've found your image, simply drag and drop to mask areas for editing, then write prompts describing your desired changes. Click "Edit" and watch the magic unfold! 🎉

Demo

Website

https://design0.ai

Source Code

design0webapp / design0

Effortless Design for Everyone

Website

https://design0.app

Screenshots

Search for the base image you want
Select the area you want to edit, then enter your prompt
Click "Edit" and wait for the new image to generate

Tools Used

pgvector: I use pgvector to store image embeddings as vector data types. I also added an HNSW index on the embeddings to accelerate search performance.

CREATE TABLE IF NOT EXISTS images
(
    id          varchar PRIMARY KEY,
    url         varchar,
    category    varchar,
    description varchar,
    embedding   vector(768)
);

CREATE INDEX IF NOT EXISTS images_embedding_idx ON images
    USING hnsw (embedding vector_cosine_ops);

pgai: I use pgai to invoke Ollama's embed API and search for images based on the distance between embeddings.

sql = f"SELECT id,url,category,description,embedding<=>ai.ollama_embed('nomic-embed-text', %s, host=>'{conf.ollama_host}') as distance FROM images"

Ollama: I use Ollama to host the state-of-the-art, compact text embedding model nomic-embed-text. This model embeds both image descriptions and user queries, allowing us to search for images that match users' requirements.
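
The <=> operator in the query above is pgvector's cosine distance operator. To make the ranking idea concrete, here is a minimal pure-Python sketch of the same computation. The 3-dimensional toy vectors and the rank_by_distance helper are assumptions for illustration only; the real embeddings have 768 dimensions and the real search runs inside Postgres.

```python
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance = 1 - cosine similarity (0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_distance(
    query: Sequence[float],
    rows: List[Tuple[str, Sequence[float]]],
    limit: int = 3,
) -> List[Tuple[float, str]]:
    """Return the closest (distance, image_id) pairs, nearest first."""
    scored = [(cosine_distance(query, emb), image_id) for image_id, emb in rows]
    return sorted(scored)[:limit]


# Toy data: hypothetical image ids with made-up 3-d embeddings.
rows = [("beach", [1.0, 0.0, 0.0]), ("forest", [0.0, 1.0, 0.0]), ("coast", [0.9, 0.1, 0.0])]
print(rank_by_distance([1.0, 0.0, 0.0], rows, limit=2))
```

In the database, the HNSW index lets pgvector find approximate nearest neighbors without scanning every row the way this sketch does.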

Final Thoughts

This demo showcases my 2D image design tool. I aim to develop it into a fully functional SaaS platform, empowering non-professional users to create stunning designs on their own.

I use Ollama to host the embedding model, which qualifies me for the 'Open-source Models from Ollama' prize.

YOLOv11: A New Breakthrough in Document Layout Analysis

ppaanngggg — Wed, 30 Oct 2024 07:50:38 +0000

Introduction

As mentioned in the previous blog post, YOLOv8 performs exceptionally well in document layout analysis. I trained all models from the YOLOv8 series on the DocLayNet dataset and found that even the smallest model achieves an overall mAP50-95 of 71.8, while the largest model reaches an impressive 78.7.

Recently, Ultralytics released YOLOv11, the latest iteration in their YOLO series of real-time object detectors. This new version brings significant improvements to both architecture and training methods.

🚀 The results look promising! I decided to train all YOLOv11 models on the DocLayNet dataset again and compare them with the previous YOLOv8 series.

Training Method

For this experiment, I continued to use my repository https://github.com/ppaanngggg/yolo-doclaynet to prepare the data and train the models using my custom scripts. This approach ensures consistency in the data preparation and training process, allowing for a fair comparison between YOLOv8 and YOLOv11 models.

The training and evaluation process for YOLOv11 models is straightforward and can be executed with simple command-line instructions:

# To train the model
python train.py {base-model}

# To evaluate the model
python eval.py {path-to-your-trained-model}

Comparing the Results

Here is the detailed evaluation table comparing YOLOv8 models with YOLOv11:

label	boxes	yolov8n	yolov11n	yolov8s	yolov11s	yolov8m	yolov11m	yolov8l	yolov11l	yolov8x	yolov11x
Params (M)		3.2	2.6	11.2	9.4	25.9	20.1	43.7	25.3	68.2	56.9
Caption	1542	0.682	0.717	0.721	0.744	0.746	0.746	0.75	0.772	0.753	0.765
Footnote	387	0.614	0.634	0.669	0.683	0.696	0.701	0.702	0.715	0.717	0.71
Formula	1966	0.655	0.673	0.695	0.705	0.723	0.729	0.75	0.75	0.747	0.765
List-item	10521	0.789	0.81	0.818	0.836	0.836	0.843	0.841	0.847	0.841	0.845
Page-footer	3987	0.588	0.591	0.61	0.621	0.64	0.653	0.641	0.678	0.655	0.684
Page-header	3365	0.707	0.704	0.754	0.76	0.769	0.778	0.776	0.788	0.784	0.795
Picture	3497	0.723	0.758	0.762	0.783	0.789	0.8	0.796	0.805	0.805	0.802
Section-header	8544	0.709	0.713	0.727	0.745	0.742	0.753	0.75	0.75	0.748	0.751
Table	2394	0.82	0.846	0.854	0.874	0.88	0.88	0.885	0.891	0.886	0.89
Text	29917	0.845	0.851	0.86	0.869	0.876	0.878	0.878	0.88	0.877	0.883
Title	334	0.762	0.793	0.806	0.817	0.83	0.832	0.846	0.844	0.84	0.848
All	66454	0.718	0.735	0.752	0.767	0.775	0.781	0.783	0.793	0.787	0.794

I've also created a plot to illustrate the relationship between model size and score for these two series:

Conclusion

Based on the table and plot above, we can conclude that YOLOv11 models outperform their YOLOv8 counterparts in overall mAP50-95 at every size. The improvements are particularly noticeable in the smaller models, with YOLOv11n gaining 1.7 percentage points of mAP50-95 over YOLOv8n (0.735 vs. 0.718). Furthermore, every YOLOv11 model has fewer parameters than its YOLOv8 equivalent, indicating improved efficiency in addition to better performance.

My favorite model is YOLOv11l. It's only about the same size as YOLOv8m, but it outperforms even YOLOv8x!

However, YOLOv11x shows only a slight improvement over YOLOv11l (0.794 vs. 0.793) despite having more than twice as many parameters.
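
For readers who want to check the arithmetic behind these conclusions, here is a quick sketch using the overall mAP50-95 and parameter counts from the table. It separates a gain in percentage points from a relative gain, since the two are easy to mix up. The helper names are my own for this example.

```python
def point_gain(old: float, new: float) -> float:
    """Absolute gain in percentage points of mAP50-95."""
    return round((new - old) * 100, 1)


def relative_gain(old: float, new: float) -> float:
    """Gain relative to the old score, in percent."""
    return round((new - old) / old * 100, 1)


# Values copied from the "All" and "Params (M)" rows of the table.
print(point_gain(0.718, 0.735))     # YOLOv8n -> YOLOv11n, percentage points
print(relative_gain(0.718, 0.735))  # the same gain, relative to YOLOv8n
print(round(56.9 / 25.3, 2))        # YOLOv11x parameters / YOLOv11l parameters
```

So the nano gain is 1.7 points in absolute terms, which is a bit more than that when expressed relative to the YOLOv8n score.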

What are your thoughts on the YOLOv11 results? Have you had experience using YOLO models for document layout analysis? I'd love to hear your insights and experiences in the comments below!

References

YOLOv11 documentation: https://docs.ultralytics.com/models/yolo11/
DocLayNet GitHub repository: https://github.com/DS4SD/DocLayNet
My YOLO-DocLayNet GitHub project: https://github.com/ppaanngggg/yolo-doclaynet

Can AI really solve math from pictures?

ppaanngggg — Fri, 12 Jul 2024 06:52:33 +0000

Introduction

Recently, I found the MathVista benchmark, which evaluates how well LLMs solve math problems from pictures. Remarkably, Claude 3.5 Sonnet, Gemini 1.5 Pro (May 2024), and GPT-4o have all outperformed the average human.

And the newly released Claude 3.5 Sonnet reached a score of 67, much higher than Gemini 1.5 Pro (May 2024) (63.9) and GPT-4o (63.8).

What is MathVista?

Here is a quoted introduction to MathVista from its homepage.

To bridge this gap, we present MathVista, a benchmark designed to combine challenges from diverse mathematical and visual tasks. It consists of 6,141 examples, derived from 28 existing multimodal datasets involving mathematics and 3 newly created datasets (i.e., IQTest, FunctionQA, and PaperQA). Completing these tasks requires fine-grained, deep visual understanding and compositional reasoning, which all state-of-the-art foundation models find challenging.

This is the leaderboard published:

My Experiments

Data source

I found a website called Math-Exercises that categorizes math problems. It offers pictures of the problems and their answers.

I picked 5 problems from this page and tested them on Claude 3.5 Sonnet, Gemini 1.5 Pro, and GPT-4o. Let me show you each problem and their answers. Let's start.

Round 1 - Set Operations

Problem:

Find the intersection A∩B, union A∪B and differences A-B, B-A of sets A, B if :

Claude 3.5 Sonnet:

Gemini 1.5 Pro:

GPT-4o:

Right Answer:

Result:

🌟 Awesome! All three models are correct! Maybe it’s too simple for AI.

Round 2 - Set Operations Again!

This time, I am using a slightly more abstract set problem.

Problem:

Find the intersection A∩B, union A∪B and differences A-B, B-A of sets A, B if :

Claude 3.5 Sonnet:

Gemini 1.5 Pro:

GPT-4o:

Right Answer:

Result:

🌟 Incredible, they are all right again! Is AI really good at set operations?

Round 3 - Algebraic Expressions

I chose a different type of problem. This time, I want to test AI's algebraic skills.

Problem:

By grouping the terms factor the polynomials and algebraic expressions :

Claude 3.5 Sonnet:

Gemini 1.5 Pro:

GPT-4o:

Right Answer:

Result:

😱 What? They are all wrong this time. I can’t understand how they know how to solve this problem but still make mistakes at the end.

Round 4 - Linear Equations

Problem:

Solve the linear equations and check the solution :

Claude 3.5 Sonnet:

Gemini 1.5 Pro:

GPT-4o:

It prints out too much. I only screenshot the final answer.

Right Answer:

Result:

🌟 They all found the right answer again! However, Claude failed in the answer check, even though the answer is correct. It's so odd.

Round 5 - Inequalities

Maybe equations are too simple for AI, so this time I'll use inequalities!

Problem:

Solve the linear inequalities with absolute value :

Claude 3.5 Sonnet:

Gemini 1.5 Pro:

It prints out too much. I only screenshot the final answer.

GPT-4o:

It prints out too much. I only screenshot the final answer.

Right Answer:

Result:

😢 Maybe it’s too difficult for AI. They all tried hard, but every answer is wrong.

Conclusion

AI can solve many math problems. Although it may fail in some cases, it can still provide useful hints.

And finally, Claude 3.5 Sonnet, Gemini 1.5 Pro, and GPT-4o received the same scores in this math match! In conclusion, state-of-the-art AIs perform similarly at math.

How to Use Ollama for Front-end with Streaming Output

ppaanngggg — Mon, 17 Jun 2024 10:58:52 +0000

Introduction

LLM applications are becoming increasingly popular. However, there are numerous LLMs, each with its own differences. Handling streaming output can be complex, especially for new front-end developers.

Thanks to the AI SDK developed by Vercel, implementing LLM chat in Next.js with streaming output has become incredibly easy. Next, I'll provide a step-by-step tutorial on how to integrate Ollama into your front-end project.

Install Ollama

Ollama is a popular tool for running LLMs locally. It downloads models directly and exposes an API for backend use. If you're seeking lower latency or improved privacy through local LLM deployment, Ollama is an excellent choice. For installation, if you're using Linux, simply run the following command:



curl -fsSL https://ollama.com/install.sh | sh

If you're using a different OS, please follow this link.

Create a New Next.js Project

To create a new Next.js project, enter the command npx create-next-app@latest your-new-project. Make sure you choose the App Router. After that, run npm run dev and open localhost:3000 in your preferred browser to verify that the new project is set up correctly.

Next, you need to install the AI SDK:



npm install ai

The AI SDK uses a provider design that lets you plug in your own LLM provider. For now, you only need to install the third-party Ollama provider.



npm install ollama-ai-provider

Server-Side Code

Now that you've gathered all the prerequisites for your LLM application, create a new file named actions.ts in the app folder:



"use server";

import { ollama } from "ollama-ai-provider";
import { streamText } from "ai";
import { createStreamableValue } from "ai/rsc";

export interface Message {
  role: "user" | "assistant";
  content: string;
}

export async function continueConversation(history: Message[]) {
  "use server";

  const stream = createStreamableValue();
  const model = ollama("llama3:8b");

  (async () => {
    const { textStream } = await streamText({
      model: model,
      messages: history,
    });

    for await (const text of textStream) {
      stream.update(text);
    }

    stream.done();
  })().then(() => {});

  return {
    messages: history,
    newMessage: stream.value,
  };
}

Let me provide some explanation about this code.

interface Message is a shared interface that establishes the structure of a message. It includes two properties: 'role' (which can be either 'user' or 'assistant') and 'content' (the actual text of the message).
The continueConversation function is a Server Action that uses the conversation history to generate the assistant's response. It calls the Ollama model (llama3:8b here, but you can replace it with any model of your choice) and streams the generated text back to the client.
The streamText function is part of the AI SDK and it creates a text stream that will be updated with the assistant's response as it is generated.

Client-Side Code

Next, replace the contents of page.tsx with the new code:



"use client";

import { useState } from "react";
import { continueConversation, Message } from "./actions";
import { readStreamableValue } from "ai/rsc";

export default function Home() {
  const [conversation, setConversation] = useState<Message[]>([]);
  const [input, setInput] = useState<string>("");

  return (
    <div>
      <div>
        {conversation.map((message, index) => (
          <div key={index}>
            {message.role}: {message.content}
          </div>
        ))}
      </div>

      <div>
        <input
          type="text"
          value={input}
          onChange={(event) => {
            setInput(event.target.value);
          }}
        />
        <button
          onClick={async () => {
            const { messages, newMessage } = await continueConversation([
              ...conversation,
              { role: "user", content: input },
            ]);

            let textContent = "";

            for await (const delta of readStreamableValue(newMessage)) {
              textContent = `${textContent}${delta}`;

              setConversation([
                ...messages,
                { role: "assistant", content: textContent },
              ]);
            }
          }}
        >
          Send Message
        </button>
      </div>
    </div>
  );
}

This is a very simple UI, but you can already hold a continuous conversation with the LLM. Here are some important points:

The input field captures the user's input. It is controlled by a React state variable that gets updated every time the input changes.
The button has an onClick event that triggers the continueConversation function. This function takes the current conversation history, appends the user's new message, and waits for the assistant's response.
The conversation array holds the history of the conversation. Each message is displayed on the screen, and new messages are appended at the end. By using readStreamableValue from the AI SDK, we're able to read the streaming value returned by the Server Action and update the conversation in real time.
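
The accumulate-and-re-render loop in the click handler is not specific to React. As a rough, language-agnostic illustration, here is the same pattern sketched in Python with asyncio. The fake_token_stream generator is a stand-in I invented for the model output; it is not part of the AI SDK or Ollama.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_token_stream() -> AsyncIterator[str]:
    """Hypothetical stand-in for the streamed model output."""
    for delta in ["I ", "am ", "an ", "assistant."]:
        await asyncio.sleep(0)  # yield control, like waiting for the next chunk
        yield delta


async def collect_snapshots(stream: AsyncIterator[str]) -> List[str]:
    """Append each delta to the running text and record every intermediate state."""
    text = ""
    snapshots = []
    async for delta in stream:
        text += delta
        snapshots.append(text)  # the UI would re-render with this text
    return snapshots


snapshots = asyncio.run(collect_snapshots(fake_token_stream()))
for snapshot in snapshots:
    print(snapshot)
```

Each snapshot corresponds to one setConversation call in the page: the assistant message is replaced with the text received so far, which is what makes the reply appear to type itself out.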

Let’s Test Now

I type "who are you" into the input field.

Here is the output of llama3:8b served by Ollama. You'll notice that the output is printed in a streaming manner.

References

Documentation for the AI SDK: https://sdk.vercel.ai/docs/introduction
Ollama GitHub: https://github.com/ollama/ollama
Find more models supported by Ollama: https://ollama.com/library

How to analyze document layout by YOLO

ppaanngggg — Thu, 13 Jun 2024 11:49:13 +0000

Why we need document layout analysis

Analyzing document layout is critical because it aids in the proper interpretation and understanding of document content. It becomes more significant with the rise of RAG, which relies heavily on the ability to parse documents accurately.

RAG systems frequently interact with a variety of documents. Scientific papers, for instance, typically have a complex layout that includes figures, tables, references, and structured sections. Proper parsing is important to avoid content disarray. If not done correctly, the LLM could fail due to the 'garbage in, garbage out' principle.

However, because document layouts vary so much, handcrafted rules are impractical as a general solution. A better approach is to train a machine learning model.

My solution

You can find my solution in yolo-doclaynet. After examining several models and datasets, I've chosen YOLO as the base model and DocLayNet as the training data. Let's delve into more details.

YOLO is one of the most widely used families of real-time object detection models. The version I use is maintained by Ultralytics, a leading computer vision team. The model is easy to train, evaluate, and deploy. Plus, it is compact enough to run in a browser or on a smartphone.
DocLayNet is a human-annotated document layout segmentation dataset, containing 80,863 pages from a wide variety of document sources. To the best of my knowledge, it is the highest-quality dataset for document layout analysis. You can download and find more information from this link.

Live Demo

Download the pretrained model from huggingface. Your options include yolov8n-doclaynet, yolov8s-doclaynet, and yolov8m-doclaynet.
Install Ultralytics by executing pip install ultralytics. If you encounter any issues, please refer to https://docs.ultralytics.com/quickstart/#install-ultralytics.

Copy and modify this code snippet to output the detection result:

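import cv2
from ultralytics import YOLO

# Load the page image as a BGR array, then run the pretrained layout model on it
img = cv2.imread(your_image_path, cv2.IMREAD_COLOR)
model = YOLO(your_model_path)
result = model.predict(img)[0]  # predict() returns one result per input image
print(result)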

Debugging the model can be challenging when you can only check the plain-text output. Fortunately, visualizing the results is simple:

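import os

from ultralytics.utils.plotting import Annotator, Colors

# Continues from the snippet above: img, result, and your_image_path are already defined
height, width = img.shape[:2]  # pixel size, used to rescale the normalized boxes
line_width, font_size = 2, 12  # example values (my assumption); tune them for your images

colors = Colors()
annotator = Annotator(img, line_width=line_width, font_size=font_size)
for label, box in zip(result.boxes.cls.tolist(), result.boxes.xyxyn.tolist()):
    label = int(label)
    annotator.box_label(
        [box[0] * width, box[1] * height, box[2] * width, box[3] * height],
        result.names[label],
        color=colors(label, bgr=True),
    )
# Save next to the input image with an "annotated-" prefix
annotator.save(
    os.path.join(os.path.dirname(your_image_path), "annotated-" + os.path.basename(your_image_path))
)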
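
The only slightly tricky part of the snippet above is the coordinate conversion: xyxyn boxes are normalized to the 0..1 range, so they have to be scaled by the image width and height before drawing. Here is that step on its own as a small sketch. The to_pixel_box helper is a name I made up for illustration. (Ultralytics also exposes result.boxes.xyxy, which is already in pixels, so you can skip the scaling if you prefer.)

```python
from typing import List, Sequence


def to_pixel_box(box_xyxyn: Sequence[float], width: int, height: int) -> List[float]:
    """Scale a normalized [x1, y1, x2, y2] box to pixel coordinates."""
    x1, y1, x2, y2 = box_xyxyn
    return [x1 * width, y1 * height, x2 * width, y2 * height]


# A box covering the middle half of the width and the lower half of an 800x600 page.
print(to_pixel_box([0.25, 0.5, 0.75, 1.0], width=800, height=600))
```

x values are multiplied by the width and y values by the height, which is easy to get backwards because img.shape returns the height first.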

Examine the annotated image. Here's an example:

Benchmark

I also evaluate the mAP50-95 performance of the entire yolov8 series on the DocLayNet test set.

label	images	boxes	yolov8n	yolov8s	yolov8m	yolov8l	yolov8x
Caption	4983	1542	0.682	0.721	0.746	0.75	0.753
Footnote	4983	387	0.614	0.669	0.696	0.702	0.717
Formula	4983	1966	0.655	0.695	0.723	0.75	0.747
List-item	4983	10521	0.789	0.818	0.836	0.841	0.841
Page-footer	4983	3987	0.588	0.61	0.64	0.641	0.655
Page-header	4983	3365	0.707	0.754	0.769	0.776	0.784
Picture	4983	3497	0.723	0.762	0.789	0.796	0.805
Section-header	4983	8544	0.709	0.727	0.742	0.75	0.748
Table	4983	2394	0.82	0.854	0.88	0.885	0.886
Text	4983	29917	0.845	0.86	0.876	0.878	0.877
Title	4983	334	0.762	0.806	0.83	0.846	0.84
All	4983	66454	0.718	0.752	0.775	0.783	0.787

Here is an overview of the mAP50-95 performance with different model sizes.

How to Upload Images to Google Gemini for Next.js

ppaanngggg — Mon, 03 Jun 2024 13:33:09 +0000

Introduction

Google Gemini performs strongly on multimodal tasks, particularly the latest Gemini 1.5 Flash and Gemini 1.5 Pro. The table below shows two multimodal benchmarks, one for reasoning and one for math. As it shows, Gemini 1.5 Pro performs on par with the latest GPT-4o on visual math tasks 🎉.

Benchmark	Description	Gemini 1.5 Flash	Gemini 1.5 Pro	GPT-4o
MMMU	Multi-discipline college-level reasoning problems	56.1%	62.2%	69.1%
MathVista	Mathematical reasoning in visual contexts	58.4%	63.9%	63.8%

In this blog, I will guide you on how to unlock the vision capabilities of Google Gemini. Let's get started 🚀.

Prerequisite

In my latest blog, I demonstrated how to use Google Gemini with Next.js for streaming output. While the previous guide focused on text input, this article will show you how to upload images to Google Gemini, using a simple demo. If you're unfamiliar with registering a Google AI API Key or using the Vercel AI SDK, I recommend reading the previous blog first.

Server-Side

Here is the complete server-side function. I made a few modifications, namely removing the custom Message and importing CoreMessage instead.

"use server";

import { google } from "@ai-sdk/google";
import { CoreMessage, streamText } from "ai";
import { createStreamableValue } from "ai/rsc";

export async function continueConversation(history: CoreMessage[]) {
  "use server";

  const stream = createStreamableValue();
  const model = google.chat("models/gemini-1.5-pro-latest");

  (async () => {
    const { textStream } = await streamText({
      model: model,
      messages: history,
    });

    for await (const text of textStream) {
      stream.update(text);
    }

    stream.done();
  })().then(() => {});

  return {
    messages: history,
    newMessage: stream.value,
  };
}

CoreMessage is a union type that covers several kinds of messages. CoreUserMessage is a message sent by a user: it has a fixed role, user, and flexible content. UserContent can be either a plain string or an array of TextPart and ImagePart objects.

type CoreUserMessage = {
    role: 'user';
    content: UserContent;
};

type UserContent = string | Array<TextPart$1 | ImagePart>;

interface TextPart$1 {
    type: 'text';
    text: string;
}

interface ImagePart {
    type: 'image';
    /**
  Image data. Can either be:

  - data: a base64-encoded string, a Uint8Array, an ArrayBuffer, or a Buffer
  - URL: a URL that points to the image
     */
    image: DataContent | URL;
    /**
  Optional mime type of the image.
     */
    mimeType?: string;
}

Take a closer look at ImagePart. You can pass either base64-encoded image data or an image URL in the image field. In this example, to keep things simple, we will pass base64-encoded image data in the message.

Client-Side

This page requires a few key modifications. We need to upload an image, encode it into a base64 message, and preview the image within the message. The complete code for the updated page follows. You can copy and paste it, and I'll explain the key points afterward.

"use client";

import { useState } from "react";
import { continueConversation } from "./actions";
import { readStreamableValue } from "ai/rsc";
import { CoreMessage } from "ai";

export default function Home() {
  const [conversation, setConversation] = useState<CoreMessage[]>([]);
  const [imageInput, setImageInput] = useState<string>("");
  const [textInput, setTextInput] = useState<string>("");

  async function getBase64(file: File): Promise<string> {
    return new Promise((resolve) => {
      const reader = new FileReader();
      reader.readAsDataURL(file);
      reader.onload = () => {
        resolve(reader.result as string);
      };
    });
  }

  return (
    <div>
      <div>
        {conversation.map((message, index) => (
          <div key={index}>
            {message.role}:{" "}
            {
              // if it's string, just show it, else if it is image, preview image, if it is text, show the text
              typeof message.content === "string" ? (
                message.content
              ) : message.content[0].type === "image" ? (
                <img
                  alt=""
                  src={
                    ("data:image;base64," + message.content[0].image) as string
                  }
                  width={640}
                />
              ) : message.content[0].type === "text" ? (
                message.content[0].text
              ) : (
                ""
              )
            }
          </div>
        ))}
      </div>

      <div>
        <input
          type="file"
          onChange={(event) => {
            if (event.target.files) {
              const file = event.target.files[0];
              getBase64(file).then((result) => {
                setImageInput(result);
              });
            } else {
              setImageInput("");
            }
          }}
        />
        <input
          type="text"
          value={textInput}
          onChange={(event) => {
            setTextInput(event.target.value);
          }}
        />
        <button
          onClick={async () => {
            // append user messages
            const userMessages: CoreMessage[] = [];
            if (imageInput.length) {
              // remove data:*/*;base64 from result
              const pureBase64 = imageInput
                .toString()
                .replace(/^data:image\/\w+;base64,/, "");
              userMessages.push({
                role: "user",
                content: [{ type: "image", image: pureBase64 }],
              });
            }
            if (textInput.length) {
              userMessages.push({
                role: "user",
                content: [{ type: "text", text: textInput }],
              });
            }
            const { messages, newMessage } = await continueConversation([
              ...conversation,
              ...userMessages,
            ]);

            // collect assistant message
            let textContent = "";
            for await (const delta of readStreamableValue(newMessage)) {
              textContent = `${textContent}${delta}`;

              setConversation([
                ...messages,
                {
                  role: "assistant",
                  content: [{ type: "text", text: textContent }],
                },
              ]);
            }
          }}
        >
          Send Message
        </button>
      </div>
    </div>
  );
}

Due to the complexity of CoreMessage, I have added some conditional branches to handle message previews. This is particularly the case when using the <img /> tag to display base64-encoded images.
Add another <input> with type="file" to upload an image. When a change occurs, read the image file and convert it into a base64 string.
Finally, when the send button is clicked, we need to convert the image and text inputs into an array of CoreMessage. Please note that the base64 header should be discarded from the image input.
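
The header-stripping step deserves a closer look. The regex in the page expects a subtype made of word characters only, so a type such as image/svg+xml would slip through with its header intact. As a sketch of a more tolerant approach (written in Python for brevity, and not what the page above does), you can split on the ";base64," marker instead. The function name and the placeholder bytes are assumptions for this example.

```python
import base64


def strip_data_url_header(data_url: str) -> str:
    """Return only the base64 payload of a data URL such as
    'data:image/png;base64,AAAA...'."""
    header, marker, payload = data_url.partition(";base64,")
    if not marker or not header.startswith("data:"):
        raise ValueError("not a base64 data URL")
    return payload


raw = b"fake image bytes"  # placeholder content, not a real image
data_url = "data:image/svg+xml;base64," + base64.b64encode(raw).decode("ascii")
payload = strip_data_url_header(data_url)
print(base64.b64decode(payload) == raw)
```

Raising an error on unexpected input is a design choice: it is usually better to fail loudly than to send a string that still contains the header to the model.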

Body Size Config

The default bodySizeLimit for Server Actions in Next.js is 1MB. If you wish to upload files larger than 1MB, adjust your Next.js config as follows.

const nextConfig = {
    experimental: {
        serverActions: {
            bodySizeLimit: '10mb'
        }
    }
};
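
Keep in mind that the image travels as base64 text, and base64 turns every 3 bytes into 4 characters, so the request body is roughly a third larger than the original file. Here is a rough sketch of that arithmetic. It assumes 1 MB means 1024 * 1024 bytes and ignores the rest of the request payload, so treat the results as upper bounds.

```python
def base64_length(num_bytes: int) -> int:
    """Length of the padded base64 encoding of num_bytes raw bytes."""
    return 4 * ((num_bytes + 2) // 3)


def max_raw_bytes(limit_bytes: int) -> int:
    """Largest raw size whose base64 encoding still fits in limit_bytes."""
    return (limit_bytes // 4) * 3


ONE_MB = 1024 * 1024  # assumption: the limit is counted in binary megabytes
print(max_raw_bytes(ONE_MB))       # largest image that fits in a 1 MB body
print(max_raw_bytes(10 * ONE_MB))  # largest image that fits in a 10 MB body
```

In other words, with the default limit an image of about 768 KB is already at the edge, which is why raising the limit is worth it for photos.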

Let’s Test Now

I upload the cover image from the previous blog and ask, "What is this picture about?" Then, I click the send button.

Examine the assistant's output; it's quite impressive 👏👏👏.

References

Documentation for the AI SDK: https://sdk.vercel.ai/docs/introduction
Google AI Studio: https://ai.google.dev/aistudio

Conclusion

In this post, I showed how to upload images to Google Gemini from a Next.js front end using the Vercel AI SDK.

If you're interested in seeing Google Gemini in action, check out these products that have successfully implemented it:

AI Math Solver - A web app that helps users solve math problems. Learn more: AIMathSolver

Have you used Google Gemini in your projects? Share your experiences in the comments below!

How to Use Google Gemini for Next.js with Streaming Output

ppaanngggg — Thu, 30 May 2024 08:03:34 +0000

Introduction

Create a Google AI Studio Account

Head to Google AI Studio and sign up. After you log in, click the “Get API Key” button on the left and create an API key. You will use this key later.

Create a New Next.js Project

Next, you need to install the AI SDK:



pnpm install ai

The AI SDK uses an advanced provider design, allowing you to implement your own LLM provider. Currently, we only need to install the official Google Provider.



pnpm install @ai-sdk/google

Set Your API Key in Your Local Environment

Next.js integrates well with environment variables. Simply create a file named .env.local in the root folder of your project.



GOOGLE_GENERATIVE_AI_API_KEY={your API Key}

Afterwards, the AI SDK will automatically load your key when you use Google AI to generate text.

Server-Side Code

Now that you've gathered all the prerequisites for your LLM application, create a new file named actions.ts in the app folder:



"use server";

import { google } from "@ai-sdk/google";
import { streamText } from "ai";
import { createStreamableValue } from "ai/rsc";

export interface Message {
  role: "user" | "assistant";
  content: string;
}

export async function continueConversation(history: Message[]) {
  "use server";

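  // Streamable value that the client reads with readStreamableValue.
  const stream = createStreamableValue();
  const model = google("models/gemini-1.5-pro-latest");

  // Run generation in the background so the action can return immediately.
  (async () => {
    const { textStream } = await streamText({
      model: model,
      messages: history,
    });

    // Forward each text delta to the client as soon as it arrives.
    for await (const text of textStream) {
      stream.update(text);
    }

    // Signal that no more updates will follow.
    stream.done();
  })().then(() => {});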

  return {
    messages: history,
    newMessage: stream.value,
  };
}

Let me provide some explanation about this code.

interface Message is a shared interface that establishes the structure of a message. It includes two properties: 'role' (which can be either 'user' or 'assistant') and 'content' (the actual text of the message).
The continueConversation function is a server action that uses the conversation history to generate the assistant's response. It calls Google's Gemini model and returns a streaming text output.
The streamText function is part of the AI SDK and it creates a text stream that will be updated with the assistant's response as it is generated.

Client-Side Code

Next, replace the contents of page.tsx with the new code:



"use client";

import { useState } from "react";
import { continueConversation, Message } from "./actions";
import { readStreamableValue } from "ai/rsc";

export default function Home() {
  const [conversation, setConversation] = useState<Message[]>([]);
  const [input, setInput] = useState<string>("");

  return (
    <div>
      <div>
        {conversation.map((message, index) => (
          <div key={index}>
            {message.role}: {message.content}
          </div>
        ))}
      </div>

      <div>
        <input
          type="text"
          value={input}
          onChange={(event) => {
            setInput(event.target.value);
          }}
        />
        <button
          onClick={async () => {
            const { messages, newMessage } = await continueConversation([
              ...conversation,
              { role: "user", content: input },
            ]);

            let textContent = "";

            for await (const delta of readStreamableValue(newMessage)) {
              textContent = `${textContent}${delta}`;

              setConversation([
                ...messages,
                { role: "assistant", content: textContent },
              ]);
            }
          }}
        >
          Send Message
        </button>
      </div>
    </div>
  );
}

This is a very simple UI, but you can now hold a conversation with the LLM. A few parts are worth noting:

The input field captures the user's input. It is controlled by a React state variable that gets updated every time the input changes.
The button has an onClick event that triggers the continueConversation function. This function takes the current conversation history, appends the user's new message, and waits for the assistant's response.
The conversation array holds the history of the conversation. Each message is displayed on the screen, and new messages are appended at the end. By using readStreamableValue from the AI SDK, we can read the streaming output from the server action and update the conversation in real time.
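
The accumulation loop inside the onClick handler can be tried in isolation. The sketch below is only an illustration: fakeDeltas is a hypothetical stand-in for the stream returned by readStreamableValue, and accumulateDeltas and onUpdate are illustrative names that play the role of the loop body and setConversation.

```typescript
// Stand-in for the server stream: yields text deltas one at a time.
async function* fakeDeltas(chunks: string[]): AsyncGenerator<string> {
  for (const chunk of chunks) {
    yield chunk;
  }
}

// Append each delta to the running text and report every intermediate state,
// mirroring how the handler updates the conversation on each iteration.
async function accumulateDeltas(
  deltas: AsyncIterable<string>,
  onUpdate: (textSoFar: string) => void,
): Promise<string> {
  let textContent = "";
  for await (const delta of deltas) {
    textContent += delta;
    onUpdate(textContent);
  }
  return textContent;
}
```

Each call to onUpdate receives the full text so far rather than the latest delta, which is why the assistant message on screen appears to grow in place.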

Let's Test Now

I type "who are you" into the input field.

Here is the output of Google Gemini. You'll notice that the output is printed in a streaming manner.

References

Documentation for the AI SDK: https://sdk.vercel.ai/docs/introduction
Google AI Studio: https://ai.google.dev/aistudio

Conclusion

In this post, I've explored the key features and benefits of using Google Gemini in the front end.

If you're interested in seeing Google Gemini in action, check out these products that have successfully implemented it:

AI Math Solver - A web app that helps users solve math problems. Learn more: AIMathSolver

Have you used Google Gemini in your projects? Share your experiences in the comments below!

Step by Step to deploy Go API on AWS lambda and access by function URL

ppaanngggg — Fri, 24 May 2024 16:22:56 +0000

Introduction

In today's world of cloud computing, AWS Lambda is a serverless, event-driven compute service that lets you run code for virtually any type of application or backend service without provisioning or managing servers. You can trigger Lambda from over 200 AWS services and software as a service (SaaS) applications, and only pay for what you use.

Go is a statically typed, compiled language known for its simplicity, efficiency, and ease of use. It's particularly well-suited for building scalable and efficient cloud services.

In this guide, I will demonstrate how to deploy a Go API server on AWS Lambda step by step.

Creating a Lambda Function

First, you need to create a new Lambda function. Log into your AWS console and navigate to the Lambda service. Click the 'Create function' button. You will see many options, but don't worry, we only need to adjust a few. Leaving the rest as they are will be sufficient to host our Go server.

Enter your function name. For example, I used go-api. You should use a name that is meaningful.
Select the runtime. In this case, we'll choose Amazon Linux 2023.
Optionally, if you're building a Go application for ARM, you should change the Architecture to arm64.

Click to open Advanced settings and enable the function URL. This feature is powerful and easy to use, and we will rely on it later.
Setting the Auth type to NONE simplifies its use. For enhanced security, you could use AWS_IAM, but that's a different topic that we won't discuss here.

Edit the Go Source Code

Next, we need to create a minimal Go function for AWS Lambda. Here's a sample main.go.

We import github.com/aws/aws-lambda-go/lambda. This dependency is necessary for running Go as a Lambda function.
In the main function, we use lambda.Start to start the handler.

package main

import (
    "context"
    "fmt"
    "github.com/aws/aws-lambda-go/lambda"
)

type RequestEvent struct {
    RawPath        string `json:"rawPath"`
    RawQueryString string `json:"rawQueryString"`
    Body           string `json:"body"`
}

func HandleRequest(ctx context.Context, event *RequestEvent) (*string, error) {
    if event == nil {
        return nil, fmt.Errorf("received nil event")
    }
    message := fmt.Sprintf(
        "RawPath: %s, RawQueryString: %s, Body: %s",
        event.RawPath, event.RawQueryString, event.Body,
    )
    return &message, nil
}

func main() {
    lambda.Start(HandleRequest)
}

Next, we need to compile main.go into a binary. Copy the following command, making sure the output filename is 'bootstrap'. If you selected the arm64 architecture earlier, use GOARCH=arm64 instead.

GOOS=linux GOARCH=amd64 go build -tags lambda.norpc -o bootstrap main.go

Next, we need to compress the bootstrap file into a zip format. Without doing this, we cannot upload it to AWS Lambda. Be sure to place the bootstrap file at the root of the zip file.

zip myFunction.zip bootstrap

We've now set up the basic Go application. It's time to upload and conduct a test.

Upload to AWS lambda

Click to enter the lambda function we created previously and scroll down. Find the Code Source block within the code tab. To the right of this block, click the Upload from button and select the .zip file option. Locate your zip file and upload it.

After uploading, you should find information similar to the details below the Code Source block. This indicates that your upload was successful.

Test the function

Scroll up, and on the right of Function overview, you'll find the function URL. This is the advanced option we enabled earlier. The function URL converts incoming HTTP requests into Lambda handler events, then encodes the handler output into an HTTP response. It's particularly useful for building API servers on Lambda.

Click the small copy button to duplicate the URL. You can now use curl to test your lambda function. Input this command into your terminal and press enter. Remember, you need to replace the function URL with your own.

curl -XPOST 'https://{your_lambda_function_URL}/hello?name=world' -H 'Content-Type: application/json' --data '{"age": 16}'

If everything works, you should see output like this:

RawPath: /hello, RawQueryString: name=world, Body: {"age": 16}

Look at the code again and you'll find the RequestEvent struct. It is the second parameter of the handler. The Lambda library decodes the path, query string, and body of the request into this struct. Many other fields are available, but these three are the most important.

type RequestEvent struct {
    RawPath        string `json:"rawPath"`
    RawQueryString string `json:"rawQueryString"`
    Body           string `json:"body"`
}
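
To make the mapping concrete, the sketch below splits the test URL from the curl command into the same three fields. It is written in TypeScript purely for illustration and is not Lambda code; describeRequest is a hypothetical helper, and the host name in the usage note is a placeholder.

```typescript
// The three fields the Go struct reads from the function URL event.
interface RequestEventFields {
  rawPath: string;
  rawQueryString: string;
  body: string;
}

// Split a request URL into path and query string, and carry the body through.
function describeRequest(url: string, body: string): RequestEventFields {
  const parsed = new URL(url);
  return {
    rawPath: parsed.pathname,
    // URL.search includes the leading "?", which the raw query string omits.
    rawQueryString: parsed.search.replace(/^\?/, ""),
    body,
  };
}
```

For a request to https://example.invalid/hello?name=world with the body {"age": 16}, this gives rawPath "/hello" and rawQueryString "name=world", matching the output above.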

Reference

Building Lambda functions with Go
Lambda function URLs

How to count tokens in frontend for Popular LLM Models: GPT, Claude, and Llama

ppaanngggg — Tue, 21 May 2024 15:44:17 +0000

Introduction

Today, apps built on Large Language Models (LLMs) are growing fast. People rely on LLMs to solve hard problems in areas such as education, finance, and health. Seeing this, developers worldwide are building many new LLM-based apps, and these apps are changing how we live, work, and talk to each other.

Counting tokens before sending prompts to a Large Language Model (LLM) is important for two reasons. First, it helps users manage their budget: knowing how many tokens a prompt uses can prevent surprise costs. Second, it keeps the prompt within the model's limit. The total number of tokens in a prompt must not exceed the model's maximum; if it does, the request may be rejected or the input truncated, and the output quality suffers.

Tokenizer in Backend vs Frontend

In text processing, the calculation of prompt tokens is a crucial task and there are essentially two methods to accomplish this.

Backend Implementation

The first, and often most common, solution is to run a tokenizer in the backend system of the application. This approach involves exposing an Application Programming Interface (API) for the frontend to invoke when needed. This method is generally straightforward to implement, especially given the existence of Python libraries like tiktoken and tokenizers that are designed specifically for this purpose and are incredibly user-friendly.

However, there are some drawbacks. Firstly, it's inefficient as it requires sending large volumes of text to the backend to receive a simple number. This can be particularly wasteful when handling exceptionally long text. Secondly, it misuses server CPU resources since the CPUs are constantly calculating tokens, which doesn't significantly contribute to the product's value. Lastly, notable latency occurs when a user is typing and waiting for the token count, leading to a poor user experience.

Frontend Implementation

Thanks to transformers.js, we can run the tokenizer and model locally in the browser. Transformers.js is designed to be functionally equivalent to Hugging Face's transformers python library, meaning you can run the same pretrained models using a very similar API.

Installation

To install via NPM, run:



npm i @xenova/transformers

To run transformers.js on the client side of Next.js, you need to update the next.config.js file:



/** @type {import('next').NextConfig} */
const nextConfig = {
    // (Optional) Export as a static site
    // See https://nextjs.org/docs/pages/building-your-application/deploying/static-exports#configuration
    output: 'export', // Feel free to modify/remove this option

    // Override the default webpack configuration
    webpack: (config) => {
        // See https://webpack.js.org/configuration/resolve/#resolvealias
        config.resolve.alias = {
            ...config.resolve.alias,
            "sharp$": false,
            "onnxruntime-node$": false,
        }
        return config;
    },
}

module.exports = nextConfig

Code Sample

Firstly, you need to import AutoTokenizer from @xenova/transformers:



import { AutoTokenizer } from "@xenova/transformers";

You can create a tokenizer using the AutoTokenizer.from_pretrained function, which requires the pretrained_model_name_or_path parameter. Xenova provides tokenizers for widely used Large Language Models (LLMs) such as GPT-4, Claude-3, and Llama-3. You can find them on Hugging Face at huggingface.co/Xenova. The tokenizer configuration for the latest GPT-4o model is available at Xenova/gpt-4o, so you can create a tokenizer for GPT-4o like this:



const tokenizer = await AutoTokenizer.from_pretrained('Xenova/gpt-4o');

Using the tokenizer is very similar to using the tokenizers library in Python. The tokenizer.encode method converts text into tokens.



const tokens = tokenizer.encode('hello world'); // [24912, 2375]

As you can see, the tokenizer of transformers.js is extremely easy to use. Because it runs locally in the browser, there is no network round trip, so token counts come back quickly.
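
Once you have the token ids, the count is simply the length of the array, and you can compare it with the model's limit before sending the prompt. The sketch below is a minimal illustration under assumptions: TokenEncoder, checkTokenBudget, and wordEncoder are hypothetical names, and wordEncoder is a toy that treats each word as one token so the example runs without downloading a tokenizer.

```typescript
// Any tokenizer works here as long as it exposes encode(text) => token ids.
interface TokenEncoder {
  encode(text: string): number[];
}

interface TokenBudget {
  count: number;
  limit: number;
  remaining: number;
  withinLimit: boolean;
}

// Count the tokens in a prompt and compare the count with a limit.
function checkTokenBudget(
  encoder: TokenEncoder,
  prompt: string,
  limit: number,
): TokenBudget {
  const count = encoder.encode(prompt).length;
  return { count, limit, remaining: limit - count, withinLimit: count <= limit };
}

// Toy encoder for demonstration only: one "token" per whitespace-separated word.
const wordEncoder: TokenEncoder = {
  encode: (text) => text.split(/\s+/).filter(Boolean).map((_, i) => i),
};
```

In a real page you would pass the tokenizer created with AutoTokenizer.from_pretrained in place of wordEncoder, and set limit to the context size documented for the model you are targeting.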

Demo

Using this pure browser technique, I created an all-in-one website to provide token counters for all popular models.

You can test the GPT-4o tokenizer there. Here is a screenshot of the page.

Simplifying PyTorch Installation: Introducing Install.PyTorch

ppaanngggg — Mon, 01 Apr 2024 12:55:20 +0000

Understanding the Complexity of ML Engineering:

Machine learning engineering involves working with diverse hardware devices and software dependencies. ML engineers often find themselves in a situation where they need to identify the ideal PyTorch version that supports their specific combination of devices and Python versions.

Introducing Install.PyTorch:

Visiting the Install.PyTorch website, ML engineers can easily determine the exact PyTorch version they need based on their specific requirements.

Let’s consider an example to better illustrate how Install.PyTorch streamlines the PyTorch installation process. Suppose you need to install PyTorch with CUDA 12.1 and Python 3.8. By visiting Install.PyTorch and selecting CUDA 12.1 and Python 3.8, you can effortlessly find the compatible PyTorch version. As another example, if you want a CPU-only PyTorch build for Python 3.12, you can visit Install.PyTorch and select CPU and Python 3.12.

After selecting the appropriate PyTorch version using Install.PyTorch, you can download the PyTorch package directly through the provided link, or simply copy the pip install command and run it in your preferred environment.

Conclusion:

ML engineering can be a complex endeavor, especially when it comes to finding the right PyTorch version that satisfies specific device and Python requirements. Install.PyTorch emerges as a valuable tool, simplifying the process of identifying and installing the ideal PyTorch version for ML engineers.
